- primary objective is to identify and gather data we intend to use for machine learning
- check data accuracy
- the process of describing, visualizing, and analyzing data in order to better understand it
- allows us to answer questions such as:
- how many rows and columns are in the data
- what type of data do we have
- are there missing, inconsistent or duplicate values in the data
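The exploration questions above can be answered with a few lines of code. The following is a minimal sketch using only the Python standard library; the toy dataset and its column names (`age`, `city`, `income`) are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy dataset: each row is a dict of column name -> value.
rows = [
    {"age": 34, "city": "Paris", "income": 52000.0},
    {"age": 41, "city": "Lagos", "income": None},
    {"age": 34, "city": "Paris", "income": 52000.0},
    {"age": None, "city": "paris", "income": 48000.0},
]
columns = list(rows[0].keys())

# How many rows and columns are in the data?
n_rows, n_cols = len(rows), len(columns)

# What type of data do we have? Collect the Python types seen in each column.
column_types = {
    col: {type(r[col]).__name__ for r in rows if r[col] is not None}
    for col in columns
}

# Are there missing values? Count the None entries per column.
missing = {col: sum(r[col] is None for r in rows) for col in columns}

# Are there duplicate rows? Count the extra copies of any repeated row.
row_counts = Counter(tuple(r[col] for col in columns) for r in rows)
n_duplicates = sum(count - 1 for count in row_counts.values())

# Are there inconsistent values? Here: one city written with different casing.
cities = {r["city"] for r in rows}
has_inconsistent_cities = len(cities) != len({c.lower() for c in cities})

print(n_rows, n_cols, column_types, missing, n_duplicates, has_inconsistent_cities)
```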
- the process of modifying our data so that it is suitable for the machine learning approach that we choose to use
- check for missing data (possible causes: changes in data, human error, bias, lack of reliable input)
- normalizing data: ensures that values share a common property
- involves scaling data to fall within a small or specified range
- often required, reduces complexity, improves interpretability
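One common way to scale values into a specified range is min-max normalization. The sketch below assumes that technique (the notes do not name a specific method); the function name `min_max_normalize` and the sample incomes are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [
        (v - low) / (high - low) * (new_max - new_min) + new_min
        for v in values
    ]


# Hypothetical incomes on very different magnitudes from, say, ages.
incomes = [20000, 35000, 50000, 80000]
normalized = min_max_normalize(incomes)
print(normalized)
```

After scaling, the smallest value maps to the lower bound and the largest to the upper bound, so features measured in different units become comparable.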
- sampling data:
- the process of selecting a subset of the instances in a dataset as a proxy for the whole
- original dataset is referred to as population, subset is sample
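As a rough illustration of population versus sample, the sketch below draws a simple random sample without replacement using the standard library. The population of the integers 1 to 100 and the seed value are arbitrary assumptions chosen to keep the example reproducible.

```python
import random

# Hypothetical population: the integers 1..100.
population = list(range(1, 101))

# A seeded generator keeps the draw reproducible between runs.
rng = random.Random(42)

# Simple random sampling without replacement: every instance is equally
# likely to be picked, and no instance is picked twice.
sample = rng.sample(population, k=10)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean={population_mean}, sample mean={sample_mean}")
```

The sample mean will generally be close to, but not equal to, the population mean; that gap is the price of using a proxy.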
- dimensionality reduction:
- the process of reducing the number of features in a dataset prior to modeling
- helps to reduce time and storage required to process data
- improves data visualization and model interpretability
- reduces complexity and helps avoid the curse of dimensionality
- feature selection:
- the process of identifying the minimal set of features needed to build a good model
- also known as variable subset selection
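One very simple filter for feature selection is to drop features that barely vary, since a constant column cannot help distinguish instances. The sketch below assumes that heuristic; it does not find a provably minimal feature set. The function name `select_by_variance` and the feature names are hypothetical.

```python
from statistics import pvariance


def select_by_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


# Hypothetical features stored column-wise.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "is_human": [1, 1, 1, 1],  # constant, carries no information
    "weight_kg": [50.0, 62.0, 70.0, 85.0],
}
selected = select_by_variance(features)
print(sorted(selected))
```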
- feature extraction:
- the use of mathematical functions to transform high-dimensional data into a lower-dimensional representation
- also known as feature projection
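To make feature projection concrete, the sketch below reduces two features to one by projecting each point onto the direction of greatest variance, which is the idea behind principal component analysis. It is limited to the two-feature case so it can use a closed-form angle; the function name and sample points are hypothetical.

```python
import math


def project_to_principal_axis(points):
    """Project 2-D points onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix of the centered data.
    var_x = sum(cx * cx for cx, _ in centered) / n
    var_y = sum(cy * cy for _, cy in centered) / n
    cov_xy = sum(cx * cy for cx, cy in centered) / n

    # Angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    ux, uy = math.cos(theta), math.sin(theta)

    # Each point is now described by one number instead of two.
    return [cx * ux + cy * uy for cx, cy in centered]


# Hypothetical points lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
projected = project_to_principal_axis(points)
print(projected)
```

Because the sample points lie on a line, the single projected feature preserves all of their spread; with real data some information is lost in exchange for fewer dimensions.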
- involves choosing and applying the appropriate machine learning approach that works well with the data we have and solves the problem that we intend to solve
- in supervised ML: objective is to build a model that maps a given input (which we call the independent variables) to the given output (which we call the dependent variable)
- depending on the nature of the dependent variable, the problem is called either Classification or Regression
- Classification: if dependent variable is a categorical value (e.g.: color, yes or no, the weather)
- Regression: if we intend to predict a continuous value (e.g.: age, income, temperature)
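As an illustration of mapping an independent variable to a continuous dependent variable, the sketch below fits a simple linear regression by ordinary least squares. The data (years of experience versus income in thousands) is made up for the example, and the function name is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Independent variable (input) and dependent variable (output), both made up.
years_experience = [1.0, 2.0, 3.0, 4.0]
income_thousands = [30.0, 35.0, 40.0, 45.0]

slope, intercept = fit_simple_linear_regression(years_experience, income_thousands)

# The fitted model maps a new input to a predicted continuous value.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```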
- ML algorithms that solve only classification problems
- logistic regression
- ML algorithms that solve only regression problems
- simple linear regression, multiple linear regression, Poisson regression, polynomial regression
- ML algorithms that can solve both classification and regression problems
- decision trees, Naive Bayes, neural networks, k-nearest neighbors, support vector machines
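k-nearest neighbors shows how a single algorithm can serve both problem types: the neighbor search is identical, and only the final step differs (majority vote for classification, average for regression). The sketch below is a minimal standard-library version; the function names and both toy datasets are hypothetical.

```python
from collections import Counter


def nearest_targets(train, query, k):
    """Return the targets of the k training points closest to query."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    return [target for _, target in by_distance[:k]]


def knn_classify(train, query, k=3):
    """Classification: the most common label among the k neighbors."""
    return Counter(nearest_targets(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: the mean target value of the k neighbors."""
    neighbors = nearest_targets(train, query, k)
    return sum(neighbors) / len(neighbors)


# Categorical dependent variable: temperature -> weather label.
weather = [((30.0,), "hot"), ((28.0,), "hot"), ((12.0,), "cold"),
           ((10.0,), "cold"), ((27.0,), "hot")]
# Continuous dependent variable: years of experience -> income (thousands).
income = [((1.0,), 30.0), ((2.0,), 35.0), ((3.0,), 40.0), ((10.0,), 90.0)]

label = knn_classify(weather, (29.0,))
estimate = knn_regress(income, (2.0,))
print(label, estimate)
```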
- after training a ML model, important to see how well suited it is to the problem at hand
- to get an unbiased evaluation of the model's performance, we must evaluate it on a different dataset (the test data) from the one we used to train it (the training data)
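The separation of training and test data can be sketched as follows. The split fraction, the seed, the toy temperature dataset, and the threshold "model" are all assumptions made for illustration; the point is only that the data used to score the model is never seen during training.

```python
import random


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_data):
    """Fraction of held-out instances whose label is predicted correctly."""
    correct = sum(predict(x) == y for x, y in test_data)
    return correct / len(test_data)


# Hypothetical labelled data: temperature -> "hot" or "cold".
data = [(t, "hot" if t >= 20 else "cold") for t in range(40)]
train, test = train_test_split(data)

# "Training": place a threshold midway between the class means,
# using the training data only.
cold = [x for x, y in train if y == "cold"]
hot = [x for x, y in train if y == "hot"]
threshold = (sum(cold) / len(cold) + sum(hot) / len(hot)) / 2


def predict(x):
    return "hot" if x >= threshold else "cold"


# Evaluation uses only the held-out test data.
test_accuracy = accuracy(predict, test)
print(len(train), len(test), test_accuracy)
```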