# Data preprocessing

## Lecture 1

### GRA 4160
### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

Data preprocessing is an important step in the machine learning process. It involves cleaning and formatting the data in a way that allows a model to learn from it effectively. This can include tasks such as removing missing or incorrect data, scaling the data to a consistent range, and encoding categorical variables. Here are some topics we will talk about throughout this course related to preprocessing of data:

1. **Data cleaning**: Data cleaning involves identifying and correcting or removing errors or inconsistencies in the data. This can include handling missing values, removing duplicate records, and correcting errors in the data.
2. **Feature selection**: Feature selection involves selecting the most relevant and meaningful features to include in the model. This can help improve the performance of the model and reduce the risk of overfitting.
3. **Feature engineering**: Feature engineering involves creating new features from existing data, or transforming existing features to make them more suitable for the model. This can include techniques such as normalization, standardization, and scaling.
4. **Data splitting**: Data splitting involves dividing the data into separate sets for training, validation, and testing. This is important to ensure that the model is evaluated on unseen data, and to prevent overfitting.
5. **Data transformation**: Data transformation involves applying transformations to the data in order to make it more suitable for the model. This can include techniques such as encoding categorical variables, scaling numerical variables, and applying dimensionality reduction techniques.

## Data cleaning

Data cleaning is an important step in the machine learning process, as it helps to ensure that the data is accurate, consistent, and ready for analysis. You should learn about the different types of data cleaning techniques and when to use them. Some common data cleaning techniques include:

1. **Handling missing values**: Missing values can occur in a dataset for a variety of reasons, such as data entry errors or missing responses to survey questions. You should learn how to identify missing values and how to handle them, such as by imputing missing values using statistical techniques or by dropping rows or columns with too many missing values.
2. **Handling outliers**: Outliers are data points that are significantly different from the rest of the data. You should learn how to identify and handle outliers, as they can have a significant impact on the results of a machine learning model. Outliers can be handled by dropping them, transforming them, or treating them as missing values.
3. **Handling errors**: Data errors can occur due to typos, data entry mistakes, or formatting issues. You should learn how to identify and fix errors in the data.

Overall, data cleaning is a critical step in the machine learning process, and you should be familiar with a variety of techniques for handling missing values, outliers, and errors in the data. By investing time in data cleaning, you can ensure that the machine learning models are based on high-quality data and are more likely to produce accurate and reliable results.

Here are a few examples of data cleaning using pandas:

In [None]:
import pandas as pd

# Load the dataset using pandas

df = pd.read_csv('../data/data_cleaning_example.csv')
df.head()

In [None]:
# Handle missing values

df = df.dropna()  # Drop rows with missing values
df.head()

In [None]:
# Handle outliers

df = df[df['A'] < df['A'].quantile(0.90)]  # Keep only rows whose 'A' value is below the 90th percentile
df.head()

In [None]:
# Handle errors
df = df[pd.to_numeric(df['B'], errors='coerce').notna()]  # Keep only rows whose 'B' value can be parsed as a number
df.head()
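
The cells above handle problems by dropping rows. As a complement, here is a minimal sketch of the other options mentioned earlier: imputing missing values and flagging outliers with a rule of thumb. The toy DataFrame is hypothetical (it only reuses the column names `A` and `B`), and both the median imputation and the 1.5 × IQR rule are illustrative choices rather than the course's prescribed method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names 'A' and 'B' mirror the cells above
toy = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 100.0],
    "B": ["10", "x", "30", None, "50"],
})

# Convert 'B' to numbers; entries that cannot be parsed become NaN
toy["B"] = pd.to_numeric(toy["B"], errors="coerce")

# Impute missing values with each column's median instead of dropping rows
imputed = toy.fillna(toy.median())

# Flag outliers in 'A' with the 1.5 * IQR rule (an assumed rule of thumb)
q1, q3 = imputed["A"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (imputed["A"] < q1 - 1.5 * iqr) | (imputed["A"] > q3 + 1.5 * iqr)

# Keep only the rows that were not flagged
cleaned = imputed[~is_outlier]
print(cleaned)
```

Imputation keeps every row at the cost of introducing estimated values, so whether it is preferable to dropping rows depends on how much data is missing and why.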

## Feature selection

Feature selection can have a significant impact on the performance and interpretability of a model. Learning about the different techniques and considerations involved will help you select and use the most relevant features in a machine learning project. We will also talk about feature selection when we get to the part on regularisation in lecture 3.

When building a classification model, it is important to select features that are able to distinguish between the different classes. For a regression model, it is important to select features that are correlated with the target variable. You should be aware of the potential risks of using too few or too many features, as this can affect the model's performance.

We can look at an example of feature selection using the iris dataset. This dataset is widely used for demonstrating machine learning techniques. It contains measurements of 150 iris flowers from three different species: Setosa, Versicolor, and Virginica. The measurements are the length and width of the petals and of the sepals, the structures that support the flower. The dataset is often used in classification tasks, where the goal is to predict the species of an iris flower based on its measurements. The iris dataset is considered fairly easy, as the three species are relatively well separated. It is often used as a baseline or benchmark for comparing the performance of different machine learning algorithms.

Here we use scikit-learn to select two out of the four features in this dataset:

In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Select the top 2 features using chi-squared test
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)

# Get the selected feature names
selected_feature_names = [iris.feature_names[i] for i in range(len(iris.feature_names)) if selector.get_support()[i]]
print(selected_feature_names)

In this example, we use the `load_iris` function to load the iris dataset, which includes the features and the target variable. We then use the `SelectKBest` class with the chi-squared test to score each feature, and call the `fit_transform` method to fit the selector and keep only the two highest-scoring features. Finally, the `get_support` method returns a boolean mask over the original features, which we use to look up the names of the selected ones.

The chi-squared test is a statistical test used to determine whether there is a significant difference between the observed frequencies of a categorical variable and the expected frequencies of the variable, based on some underlying distribution or hypothesis.

In the context of feature selection, the chi-squared test can be used to select the features that are most strongly associated with the target variable. The test calculates a chi-squared statistic for each feature, which measures how far the observed feature values within each class deviate from what would be expected if the feature and the target were independent. The features with the highest chi-squared statistic are considered the most informative for predicting the target variable, and can be selected for use in a model. Note that scikit-learn's `chi2` requires non-negative feature values, such as counts or frequencies; the iris measurements are all positive, so it can be applied here.

The chi-squared test is often used in machine learning to select features for classification models, as it can identify the features that are most strongly associated with the class labels. It is particularly useful for datasets with a large number of features, where it can help to reduce the dimensionality of the data and improve the performance of the model.
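
To make the ranking concrete, the following sketch prints the chi-squared score that `SelectKBest` computes for each iris feature and then lists the two features with the highest scores. It only uses the fitted selector's `scores_` attribute and `get_support` method; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Fit the selector; this computes one chi-squared score per feature
selector = SelectKBest(chi2, k=2).fit(X, y)

# Print the score of every feature (higher means stronger association with the class)
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.1f}")

# Names of the two features that the selector keeps
top_two = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(top_two)
```

Inspecting the scores, rather than only the selected names, shows how large the gap is between the features that are kept and those that are dropped.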

## Data splitting



Data splitting involves dividing a dataset into separate training and testing sets. Sometimes we also use a validation set. The training set is used to train the model, the validation set is used to tune the hyperparameters, and the testing set is used to evaluate the final performance of the model on unseen data.

Using a validation set helps to prevent overfitting, which is when the model performs well on the training data but poorly on unseen data. Because the hyperparameters are chosen on data the model was not trained on, and the test set is left untouched until the end, the final test score remains an honest estimate of performance. The model is trained on the training set and evaluated on the validation set. The hyperparameters are then adjusted based on that evaluation, and the model is retrained with the updated values. This process is repeated until the model's performance on the validation set is satisfactory.

Data splitting is important because it allows you to evaluate the generalization ability of the model, which is how well the model can make predictions on unseen data. This is important because you want your model to be able to generalize to new data, rather than just memorizing the training data.

There are several ways to split the data, including simple random sampling, stratified sampling, and k-fold cross-validation. Simple random sampling involves randomly selecting a percentage of the data for the training set and the remaining data for the testing set. Stratified sampling involves sampling the data in such a way that the proportion of each class in the training and testing sets is the same as in the original dataset. K-fold cross-validation involves dividing the data into k folds, training the model on k-1 folds, and evaluating the model on the remaining fold. This process is repeated k times, with a different fold being used as the testing set each time.
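
The stratified and k-fold strategies described above can be sketched with scikit-learn as follows. The choice of `LogisticRegression` as the model, five folds, and the seed 42 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Stratified sampling: stratify=y keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
test_class_counts = np.bincount(y_test)  # number of test samples per class
print(test_class_counts)

# K-fold cross-validation: train on k-1 folds, evaluate on the remaining fold, k times
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)  # illustrative model choice
scores = cross_val_score(model, X, y, cv=cv)
print(scores, scores.mean())
```

Averaging the fold scores gives a more stable performance estimate than a single split, at the cost of training the model k times.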



In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this example, we use the `train_test_split` function from scikit-learn's `model_selection` module to split the data into training and testing sets. The `test_size` parameter specifies the proportion of the data that should be used for the test set (in this case, 20%), and the `random_state` parameter sets the random seed used when shuffling the data before splitting it, which makes the split reproducible.

The `train_test_split` function returns four arrays: `X_train`, which contains the training data, `X_test`, which contains the testing data, `y_train`, which contains the training labels, and `y_test`, which contains the testing labels.
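
To obtain the validation set discussed earlier, one option is to call `train_test_split` twice. The sketch below produces a 60/20/20 split and uses the validation set to choose a hyperparameter. The k-nearest-neighbours classifier and the candidate values for `n_neighbors` are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# First split off 20% as the test set, then 25% of the remainder as validation
# (0.25 * 0.8 = 0.2, giving a 60/20/20 split overall)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# Tune the hyperparameter on the validation set only
best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9]:  # illustrative candidate values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Evaluate the chosen configuration once on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, best_score, test_accuracy)
```

The test set is used exactly once, after the hyperparameter has been fixed, so the reported accuracy is not influenced by the tuning loop.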

## Data transformations

Data transformation is the process of transforming raw data into a form that is more suitable for analysis or modeling. There are various techniques that can be used to transform data, including scaling, aggregation, and feature engineering.

**Scaling** is the process of transforming the data so that all the features have the same scale. This is often necessary because some machine learning algorithms are sensitive to the scale of the input features. There are several scaling techniques that can be used, such as min-max scaling and standardization. Min-max scaling rescales the data to a fixed range, usually between 0 and 1. Standardization rescales the data so that it has zero mean and unit variance.

**Aggregation** is the process of combining data from multiple sources or time periods or grouping data by certain attributes. Aggregation can be used to reduce the dimensionality of the data and simplify the analysis process.
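
As a small illustration of grouping data by an attribute, the sketch below computes the mean of each iris measurement per species with pandas. The `species` column is a helper introduced for this example and is not part of the original dataset.

```python
from sklearn.datasets import load_iris

# Load iris as a DataFrame containing the four features plus a 'target' column
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Helper column (introduced for this example): map class indices to species names
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# Aggregate: one row per species, holding the mean of each measurement
summary = df.groupby("species")[iris.feature_names].mean()
print(summary)
```

The 150 individual rows are reduced to three summary rows, which is the kind of simplification aggregation is meant to provide.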

**Feature engineering** is the process of creating new features from the existing data. This can be done by combining existing features, extracting features from text or images, or creating synthetic features using domain knowledge. It can help to improve the performance of the model by providing additional information.
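
A minimal sketch of combining existing features, again using iris: the two new columns below (a petal area and a sepal length-to-width ratio) are hypothetical features invented for illustration, and whether they actually help a model would have to be checked empirically.

```python
from sklearn.datasets import load_iris

# Load the four iris measurements as a DataFrame
df = load_iris(as_frame=True).data.copy()

# New feature: approximate petal area as length times width
df["petal area (cm^2)"] = df["petal length (cm)"] * df["petal width (cm)"]

# New feature: ratio of sepal length to sepal width
df["sepal ratio"] = df["sepal length (cm)"] / df["sepal width (cm)"]

print(df.head())
```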

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Load the iris dataset
iris = load_iris()
X = iris.data

# Scale the data using min-max scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X[0:5])
print(X_scaled[0:5])
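
For comparison with min-max scaling, standardization can be sketched with scikit-learn's `StandardScaler`. The printed column means and standard deviations provide a quick check that each feature ends up with zero mean and unit variance; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the iris dataset
X = load_iris().data

# Standardize: subtract each column's mean and divide by its standard deviation
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Check the result: means should be close to 0 and standard deviations close to 1
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```

Unlike min-max scaling, standardization does not bound the values to a fixed range, so extreme observations remain visible as large positive or negative values.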