# Data Pre-processing Techniques

Data preprocessing applies a series of transformations to the raw data to make it more amenable to learning. It is carried out before the data is used for model training or prediction.

The pre-processing techniques covered here fall into the following groups:

* Data Cleaning
  * Data Imputation
  * Feature scaling

* Feature Transformation
  * Polynomial Features
  * Discretization
  * Handling categorical features
  * Custom Transformers
  * Composite Transformers
    * Apply transformation to diverse features
    * TransformedTargetRegressor
* Feature Selection
  * Filter based feature selection
  * Wrapper based feature selection
* Feature Extraction
  * PCA

The transformations are applied in a specific order, and the order can be specified via Pipeline. We need to apply different transformations based on the feature type. FeatureUnion helps us perform that task and combine outputs from multiple transformations into a single transformed feature matrix. We will also study how to visualize this pipeline.

# Importing basic Libraries

In this colab, we are importing libraries as needed. However, it is good practice to have all imports in one cell, arranged in alphabetical order. This helps us weed out duplicate imports and similar issues.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_theme(style='whitegrid')

# 1. Feature Extraction

## DictVectorizer

Often the data is present as a list of dictionary objects. ML algorithms expect the data in matrix form with shape (n,m), where n is the number of samples and m is the number of features.

DictVectorizer converts a list of dictionary objects to a feature matrix.

Let's create sample data for demo purposes containing the age and height of children.

Each record/sample is a dictionary with two keys, age and height, and their corresponding values.

In [None]:
data = [{'age':4,'height':96.0},
        {'age':1,'height':73.9},
        {'age':3,'height':88.9},
        {'age':2,'height':81.6}]

There are 4 data samples with 2 features each.

Let's make use of DictVectorizer to convert the list of dictionary objects to the feature matrix.

In [None]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
data_transformed = dv.fit_transform(data)
data_transformed

In [None]:
data_transformed.shape

The transformed data is in feature matrix form: 4 samples with 2 features each, i.e. shape (4,2).

# 2. Data Imputation

* Many machine learning algorithms need a full feature matrix and may not work in the presence of missing data.
* Data imputation identifies missing values in each feature of the dataset and replaces them with an appropriate value based on a fixed strategy such as
  * mean or median or mode of that feature.
  * use specified constant value.

Sklearn library provides sklearn.impute.SimpleImputer class for this purpose.

In [None]:
from sklearn.impute import SimpleImputer

Some of its important parameters:

* missing_values: Could be int, float, str, np.nan or None. Default is np.nan
* strategy: string, default is 'mean'. One of following strategies can be used:

  * mean- missing values are replaced using the mean along each column.
  * median- missing values are replaced using the median along each column.
  * most_frequent- missing values are replaced using the most frequent along each column.
  * constant- missing values are replaced using the fill_value argument.
* add_indicator: boolean, default is False. When set to True, missing value indicator columns are appended to the transformed output, and the fitted indicator is stored in the indicator_ member variable.

Note:
* mean and median strategies can only be used with numeric data.
* most_frequent and constant strategies can be used with strings or numeric data.
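
Before moving to a real dataset, here is a minimal sketch of these parameters on a tiny hand-made array (the array values are made up for illustration). It shows the mean and constant strategies and the extra columns produced by add_indicator.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Mean strategy: each NaN is replaced by the mean of its column.
mean_imputer = SimpleImputer(strategy='mean')
X_mean = mean_imputer.fit_transform(X)

# Constant strategy: each NaN is replaced by fill_value.
const_imputer = SimpleImputer(strategy='constant', fill_value=-1)
X_const = const_imputer.fit_transform(X)

# add_indicator=True appends one 0/1 column per feature that had missing values.
ind_imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_ind = ind_imputer.fit_transform(X)

print(mean_imputer.statistics_)  # per-column fill values learned in fit
print(X_mean)
print(X_ind.shape)
```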

# Data imputation on real world dataset

Let's perform data imputation on a real world dataset. We will be using the heart disease dataset from the UCI machine learning repository for this purpose. We will load this dataset from a csv file.

In [None]:
cols = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
heart_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data',header=None, names=cols)

The dataset has the following features:

1. Age (in years)
2. Sex (1=male; 0=female)
3. cp - chest pain type
4. trestbps - resting blood pressure (anything above 130-140 is typically cause for concern)
5. chol - serum cholesterol in mg/dl (above 200 is cause for concern)
6. fbs - fasting blood sugar (>120 mg/dl) (1=true; 0=false)
7. restecg - resting electrocardiographic results
  * 0=normal
  * 1=having ST-T wave abnormality
  * 2=showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina
  * 1=yes
  * 0=no
10. oldpeak - ST depression induced by exercise relative to rest
11. slope - slope of peak exercise ST segment
  * 1=upsloping
  * 2=flat value
  * 3=downsloping
12. ca - number of major vessels (0-3) colored by fluoroscopy
13. thal - (3=normal; 6=fixed defect; 7=reversible defect)
14. num - diagnosis of heart disease (angiographic disease status)
  * 0: <50% diameter narrowing
  * 1: >50% diameter narrowing

**Step 1:** Check if the dataset contains missing values.

* This can be checked via the dataset description or by counting the NaN values in the dataframe (for example with isnull()). However, such a check only works when missing values are actually encoded as NaN, which is typically the case for numerical features.
* For non-numerical features, we can list their unique values and check if there are placeholder values like ?.

In [None]:
heart_data.info()

Let's check if there are any missing values in numerical columns. Here we have checked it for all columns in the dataframe.

In [None]:
(heart_data.isnull().sum())

There are two non-numerical features: ca and thal.

* List their unique values

In [None]:
print(heart_data.ca.unique(),heart_data.thal.unique())

Both of them contain ?, which denotes a missing value. Let's count the number of missing values.

In [None]:
print(heart_data.loc[heart_data.ca=='?','ca'].count(),heart_data.loc[heart_data.thal=='?','thal'].count())

**Step 2:** Replace '?' with nan

In [None]:
heart_data.replace('?',np.nan,inplace=True)
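
After the replacement, the affected columns still hold strings rather than numbers. The sketch below uses a hypothetical toy frame that mimics the ? placeholder, and shows one possible way to make the types explicit with pd.to_numeric. This conversion is an optional extra step that we assume here, not something the steps in this notebook depend on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the '?' placeholder in ca and thal.
df = pd.DataFrame({'ca': ['0.0', '3.0', '?', '1.0'],
                   'thal': ['3.0', '?', '7.0', '6.0']})

# Replace the placeholder with NaN so that pandas recognises it as missing.
df = df.replace('?', np.nan)

# The remaining values are still strings; convert each column to numbers.
df = df.apply(pd.to_numeric)

missing_per_column = df.isnull().sum()
print(missing_per_column)
```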

**Step 3:** Fill the missing values with sklearn missing value imputation utilities

Here we use SimpleImputer with mean strategy.

We will try two variations-

* add_indicator = False: Default choice that only imputes missing values.

In [None]:
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(heart_data)
heart_data_imputed = imputer.transform(heart_data)
print(heart_data_imputed.shape)

* add_indicator = True: Adds an additional column for each column containing missing values. In our case, this adds two columns, one for ca and the other for thal. Each added column indicates whether the sample had a missing value in that feature.

In [None]:
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean',add_indicator = True)
imputer = imputer.fit(heart_data)
heart_data_imputed = imputer.transform(heart_data)
print(heart_data_imputed.shape)

# 3. Feature scaling

Feature scaling transforms feature values such that all the features are on the same scale.

When we use feature matrix with all features on the same scale, it provides us certain advantages as listed below:

* Enables faster convergence in iterative optimization algorithms like gradient descent and its variants.
* The performance of ML algorithms such as SVM, K-NN and K-means, which compute Euclidean distances among input samples, suffers if the features are not scaled.

Tree based ML algorithms are not affected by feature-scaling. In other words, feature scaling is not required for tree based ML algorithms.

Feature scaling can be performed with the following methods:

* Standardization
* Normalization
* MaxAbsScaler

Let's demonstrate feature scaling on a real world dataset. For this purpose we will be using the abalone dataset. We will use different scaling utilities from the sklearn library.

In [None]:
cols = ['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
abalone_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header=None, names=cols)

**Step 1:** Examine the dataset

Feature scaling is performed only on numerical attributes. Let's check which attributes in this dataset are numerical. We can get that via the info() method.

In [None]:
abalone_data.info()

**Step 1a:** [Optional]: Convert non-numerical attributes to numerical ones.

Sex is a non-numeric column in this dataset. Let's examine it and see if we can convert it to a numeric representation.

In [None]:
abalone_data.Sex.unique()

In [None]:
abalone_data = abalone_data.replace({'Sex':{'M':1,'F':2,'I':3}})
abalone_data.info()

**Step 2:** Separate labels from features

In [None]:
y = abalone_data.pop('Rings')
abalone_data.info()

**Step 3:** Examine feature scales

Statistical method

Check the scales of different features with the describe() method of the dataframe.

In [None]:
abalone_data.describe().T

Note that

* There are 4177 examples or rows in this dataset.
* The mean and standard deviation of features are quite different from one another.

We can confirm that with a variety of visualization techniques and plots.

## Visualization of feature distributions

Visualize feature distributions

* Histogram
* Kernel density estimation KDE plot
* Box
* Violin

Feature Histogram

We will have separate and combined histogram plots to check if the features are indeed on different scales.

# to be added

**Step 4:** Scaling

* StandardScaler (standardization)
* MaxAbsScaler
* MinMaxScaler (normalization)

In [None]:
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

x = np.array([4,2,5,-2,-100]).reshape(-1,1)
mas = MaxAbsScaler()
x_mas = mas.fit_transform(x)

x = abalone_data
mm = MinMaxScaler()
x_n = mm.fit_transform(x)

ss = StandardScaler()
x_s = ss.fit_transform(x)

print(x_mas)
print(x_n)
print(x_s)
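
To see what each scaler computes, the following sketch compares the three scalers against hand-written formulas on the small array used above. The manual_* names are our own and only serve the comparison.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

x = np.array([4.0, 2.0, 5.0, -2.0, -100.0]).reshape(-1, 1)

x_std = StandardScaler().fit_transform(x)
x_minmax = MinMaxScaler().fit_transform(x)
x_maxabs = MaxAbsScaler().fit_transform(x)

# Standardization: subtract the mean, divide by the (population) standard deviation.
manual_std = (x - x.mean()) / x.std()
# Min-max normalization: map the minimum to 0 and the maximum to 1.
manual_minmax = (x - x.min()) / (x.max() - x.min())
# Max-abs scaling: divide by the largest absolute value, so values lie in [-1, 1].
manual_maxabs = x / np.abs(x).max()

print(np.allclose(x_std, manual_std),
      np.allclose(x_minmax, manual_minmax),
      np.allclose(x_maxabs, manual_maxabs))
```

Note that MaxAbsScaler does not shift the data, so zeros stay zeros, which is why it is often preferred for sparse data.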

# 4. add_dummy_feature

Augments the dataset with a column vector in which every value is 1. This is useful for adding a parameter for the bias term in the model.

In [None]:
x = np.array([[7,1],[1,8],[2,0],[9,6]])

from sklearn.preprocessing import add_dummy_feature

x_new = add_dummy_feature(x)
x_new
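
As a rough illustration of why the dummy feature helps, the sketch below fits a line by least squares on made-up data. With the column of ones in place, the bias becomes just another weight. The data and the use of np.linalg.lstsq are our own choices for the illustration.

```python
import numpy as np
from sklearn.preprocessing import add_dummy_feature

# Made-up data generated from y = 1 + 2*x, so the bias is 1 and the slope is 2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X[:, 0]

# The column of ones is inserted as the first column.
X_aug = add_dummy_feature(X)

# Least squares on the augmented matrix returns [bias, slope] in a single vector.
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(X_aug)
print(w)
```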

# 5. Custom transformers

Enables conversion of an existing Python function into a transformer to assist in data cleaning or processing.

Useful when:

1. The dataset consists of heterogeneous data type
2. The dataset is stored in a pandas dataframe and different columns require different processing pipelines
3. We need stateless transformations such as taking the log of frequencies, custom scaling etc.

In [None]:
from sklearn.preprocessing import FunctionTransformer

You can implement a transformer from an arbitrary function with FunctionTransformer

In [None]:
wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')

In [None]:
wine_data.describe().T

Let's use np.log1p which returns natural logarithm of 1+feature value

In [None]:
transformer = FunctionTransformer(np.log1p, validate=True)
wine_data_transformed = transformer.fit_transform(np.array(wine_data))
pd.DataFrame(wine_data_transformed, columns=wine_data.columns).describe().T
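
FunctionTransformer can also carry an inverse function, which is handy when the transformed values have to be mapped back to the original scale. Here is a small sketch on a made-up array, pairing np.log1p with np.expm1.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up non-negative data.
X = np.array([[0.0, 1.0],
              [9.0, 99.0]])

# np.expm1 undoes np.log1p, so it is passed as the inverse function.
log_transformer = FunctionTransformer(func=np.log1p,
                                      inverse_func=np.expm1,
                                      validate=True)

X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

print(X_log)
print(np.allclose(X_back, X))
```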

# 6. Polynomial Features

Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

* For example, if an input sample is two dimensional and of the form [a,b], the degree-2 polynomial features are [1,a,b,a^2 ,ab,b^2]



In [None]:
from sklearn.preprocessing import PolynomialFeatures

wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')
wine_data_copy = wine_data.copy()
wine_data = wine_data.drop(['quality'],axis=1)
print(wine_data.shape)

poly = PolynomialFeatures(degree=2)
poly_wine_data = poly.fit_transform(wine_data)
print(poly_wine_data.shape)

After the transformation we have 78 features. Let's list them out.

In [None]:
poly.get_feature_names_out()

# 7. Discretization

Discretization/quantization/binning provides a way to partition continuous features into discrete values.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

Let us demonstrate KBinsDiscretizer using wine quality dataset.

In [None]:
wine_data = wine_data_copy.copy()

enc = KBinsDiscretizer(n_bins=10, encode='onehot')
X=np.array(wine_data['chlorides']).reshape(-1,1)
X_binned = enc.fit_transform(X)
X_binned

In [None]:
X_binned.toarray()[:5]
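
The one-hot output above hides which bin a value falls into. As a complement, here is a sketch with encode='ordinal' and strategy='uniform' on a made-up column, which returns the bin index directly and exposes the learned bin edges. The parameter values are our own choice for the illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up feature with values 0..9.
X = np.arange(10, dtype=float).reshape(-1, 1)

# Three equal-width bins; the output is the bin index instead of a one-hot vector.
kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_bins = kbd.fit_transform(X)

edges = kbd.bin_edges_[0]
print(edges)           # bin boundaries for the single feature
print(X_bins.ravel())  # bin index of each sample
```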

# 8. Handling Categorical Features

We need to convert the categorical features into numerical features. This can be done with one of the following:

1. Ordinal encoding
2. One-Hot encoding
3. Label encoder
4. Using dummy variables

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

cols = ['sepal length','sepal width','petal length','petal width','label']
iris_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None, names=cols)

onehotencoder = OneHotEncoder(categories='auto')
iris_labels = onehotencoder.fit_transform(iris_data.label.values.reshape(-1,1))
iris_labels.toarray()[:5]

Let us observe the difference between one hot encoding and ordinal encoding.

In [None]:
enc = OrdinalEncoder()
iris_labels = np.array(iris_data['label'])

iris_labels_transformed = enc.fit_transform(iris_labels.reshape(-1,1))
print(np.unique(iris_labels_transformed))
print(iris_labels_transformed[:5])
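
Two options are worth knowing when using these encoders. The sketch below uses a hypothetical size feature: the categories parameter of OrdinalEncoder fixes the order of the integer codes, and handle_unknown='ignore' lets OneHotEncoder encode an unseen category as an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature with a natural order.
sizes = np.array([['small'], ['large'], ['medium'], ['small']])

# Explicit category order: small -> 0, medium -> 1, large -> 2.
ord_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes_ord = ord_enc.fit_transform(sizes)

# One-hot columns follow the sorted categories: large, medium, small.
ohe = OneHotEncoder(handle_unknown='ignore')
sizes_ohe = ohe.fit_transform(sizes).toarray()

# A category not seen during fit is encoded as all zeros instead of raising an error.
unknown_ohe = ohe.transform(np.array([['huge']])).toarray()

print(sizes_ord.ravel())
print(sizes_ohe)
print(unknown_ohe)
```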

In [None]:
from sklearn.preprocessing import LabelEncoder
iris_labels = np.array(iris_data['label'])
enc = LabelEncoder()
label_integer = enc.fit_transform(iris_labels)
label_integer

In [None]:
movie_genres = [{'action','comedy'},{'comedy'},{'action','thriller'},{'science-fiction','action','thriller'}]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)

# Using dummy variables

In [None]:
iris_data_onehot = pd.get_dummies(iris_data, columns=['label'],prefix=['one_hot'])
iris_data_onehot

# 9. Composite Transformers

ColumnTransformer applies a set of transformers to columns of an array or pandas.DataFrame and concatenates the transformed outputs from different transformers into a single matrix.

* It is useful for transforming heterogeneous data by applying different transformers to separate subsets of features.
* It combines different feature selection mechanisms and transformation into a single transformer object.

In [None]:
x = [[20.0,'male'],[11.2,'female'],[15.6,'female'],[13.0,'male'],[18.6,'male'],[16.4,'female']]
x = np.array(x)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

ct = ColumnTransformer([('scaler',MaxAbsScaler(),[0]),
                        ('pass','passthrough',[0]),
                        ('encoder',OneHotEncoder(),[1])])
ct.fit_transform(x)

# TransformedTargetRegressor

In [None]:
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

X,y = fetch_california_housing(return_X_y=True)
X,y = X[:2000,:], y[:2000]

transformer = MaxAbsScaler()

regressor = LinearRegression()

regr = TransformedTargetRegressor(regressor=regressor,transformer=transformer)

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
regr.fit(X_train, y_train)
print(regr.score(X_test, y_test))
raw_target_regr = LinearRegression().fit(X_train,y_train)
print(raw_target_regr.score(X_test,y_test))
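
TransformedTargetRegressor transforms the target before fitting and applies the inverse transformation to the predictions, so predictions stay on the original scale. Besides a transformer object, it also accepts a func/inverse_func pair. The sketch below uses made-up data whose target grows exponentially, so a linear model fits well only after taking the log.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Made-up data: log(y) is exactly linear in X.
X = np.linspace(0, 2, 50).reshape(-1, 1)
y = np.exp(1.5 * X[:, 0] + 0.3)

# Fit on log(y); predictions are mapped back with exp automatically.
ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
ttr.fit(X, y)
y_pred = ttr.predict(X)

# Baseline: the same regressor fitted on the raw target.
plain = LinearRegression().fit(X, y)

print(ttr.score(X, y), plain.score(X, y))
```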

# 10. Feature Selection

The sklearn.feature_selection module has useful APIs to select features/reduce dimensionality, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.

## Filter based methods

VarianceThreshold

This transformer helps to keep only high variance features by providing a certain threshold.

Features whose variance is above the threshold value are kept; the rest are removed.

By default it removes any feature that has the same value in all samples, i.e. 0 variance.

In [None]:
data = [{'age':4,'height':96.0},
        {'age':1,'height':73.9},
        {'age':3,'height':88.9},
        {'age':2,'height':81.6}]

dv = DictVectorizer(sparse=False)
data_transformed = dv.fit_transform(data)
np.var(data_transformed, axis=0)

In [None]:
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold(threshold=9)
data_new = vt.fit_transform(data_transformed)
data_new

As you may observe from the output of the above cell, the transformer has removed the age feature because its variance is below the threshold.
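
To see which columns survive, we can inspect the boolean mask returned by get_support(). The sketch below uses a made-up matrix with one constant column and column variances of 0, 5 and 1.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up matrix: column 0 is constant, column 1 has variance 5, column 2 has variance 1.
X = np.array([[0.0, 2.0, 1.0],
              [0.0, 4.0, 3.0],
              [0.0, 6.0, 1.0],
              [0.0, 8.0, 3.0]])

# Default threshold of 0: only the constant column is dropped.
vt_default = VarianceThreshold()
X_default = vt_default.fit_transform(X)
mask_default = vt_default.get_support()

# A higher threshold keeps only the column whose variance exceeds it.
vt_high = VarianceThreshold(threshold=2)
X_high = vt_high.fit_transform(X)
mask_high = vt_high.get_support()

print(mask_default, X_default.shape)
print(mask_high, X_high.shape)
```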

SelectKBest

It selects the k highest scoring features based on a scoring function and removes the rest of the features.
Let's take the example of the California Housing dataset.

In [None]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_regression

X_cal, y_cal = fetch_california_housing(return_X_y=True)

X,y = X_cal[:2000,:], y_cal[:2000]

X.shape

Let's take the 3 most important features. Since this is a regression problem, we can only use regression scoring functions such as mutual_info_regression or f_regression.

In [None]:
skb = SelectKBest(mutual_info_regression, k=3)
X_new = skb.fit_transform(X,y)
X_new.shape

In [None]:
skb.get_feature_names_out()
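
On synthetic data it is easy to verify that the selector picks the right columns. In the sketch below, the target is built (by assumption) from features 0 and 3 only, and f_regression is used as the scoring function because it is deterministic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

skb = SelectKBest(f_regression, k=2)
X_sel = skb.fit_transform(X, y)

# Indices of the selected columns and the score of every feature.
selected = skb.get_support(indices=True)
print(selected)
print(skb.scores_)
```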

SelectPercentile

This is very similar to SelectKBest from previous section

In [None]:
from sklearn.feature_selection import SelectPercentile
sp = SelectPercentile(mutual_info_regression, percentile=30)
X_new = sp.fit_transform(X,y)
X_new.shape

In [None]:
sp.get_feature_names_out()

GenericUnivariateSelect

It applies univariate feature selection with a configurable strategy, which is passed to the API via the mode parameter. mode can take one of the following values: percentile, k_best, fpr, fdr, fwe.

The code below produces results similar to SelectKBest:

In [None]:
from sklearn.feature_selection import GenericUnivariateSelect

gud = GenericUnivariateSelect(mutual_info_regression, mode='k_best', param=3)
X_new = gud.fit_transform(X,y)
X_new.shape

## Wrapper based methods

RFE [recursive feature elimination]

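RFE first fits the estimator on all features, removes the least important feature(s), and repeats the process on the remaining features until the desired number of features is left.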

In [None]:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=3,step=1)
selector = selector.fit(X,y)
print(selector.support_)
print(selector.ranking_)
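
To read support_ and ranking_ with more confidence, here is a sketch on synthetic data where (by construction) only features 0 and 3 matter. Selected features get rank 1; the remaining ranks reflect the order of elimination, with the largest rank removed first.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# Remove one feature per iteration until two are left.
rfe = RFE(LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X, y)

print(rfe.support_)  # True for the selected features
print(rfe.ranking_)  # 1 for selected features, higher values were eliminated earlier
```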

RFECV

RFECV adds a layer of cross-validation to RFE in order to choose the number of features to select.
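
A minimal sketch of RFECV on synthetic data follows. The data, the choice of 5 folds and the linear estimator are assumptions made for the illustration. The number of features is not given by us; it is chosen from the cross-validation scores.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# 5-fold cross-validation decides how many features to keep.
rfecv = RFECV(LinearRegression(), step=1, cv=5)
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # mask of the selected features
```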

SelectFromModel

Selects the desired number of important features whose importance, as obtained from the trained estimator, is above a certain threshold.

In [None]:
from sklearn.feature_selection import SelectFromModel

estimator = LinearRegression()
estimator.fit(X,y)
print(estimator.coef_)
print(np.argsort(estimator.coef_)[-3:])
t=np.argsort(np.abs(estimator.coef_))[-3:]
model = SelectFromModel(estimator, max_features=3,prefit=True)
X_new = model.transform(X)
print(X_new.shape)

SequentialFeatureSelector

It performs feature selection by selecting or deselecting features one by one in a greedy manner.

In [None]:
%%time
from sklearn.feature_selection import SequentialFeatureSelector
estimator = LinearRegression()
sfs = SequentialFeatureSelector(estimator, n_features_to_select=3)
sfs.fit_transform(X,y)
sfs.get_support()

The selected features are those corresponding to True in the output.

In [None]:
%%time
from sklearn.feature_selection import SequentialFeatureSelector
estimator = LinearRegression()
sfs = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='backward')
sfs.fit_transform(X,y)
sfs.get_support()

# 11. PCA

PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that capture the maximum amount of variance.

It helps in reducing the dimensions of a dataset, and thus the computational cost of the next steps, e.g. training a model, cross validation etc.

Let's fit a PCA transformer on this data and compute its two principal components

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
print(pca.fit(X))
print(pca.components_)
print(pca.explained_variance_)
print(pca.mean_)

Reduced dimensions

In [None]:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print(X.shape,X_pca.shape)
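
How many components are enough? One common way to decide is explained_variance_ratio_. The sketch below uses synthetic 2-D data lying almost on a line (our own construction) and also shows that a float n_components asks PCA to keep just enough components to reach that fraction of variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second feature is roughly twice the first, plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t]) + 0.05 * rng.normal(size=(200, 2))

pca = PCA(n_components=2).fit(X)
ratio = pca.explained_variance_ratio_
print(ratio)  # fraction of the total variance captured by each component

# A float in (0, 1) keeps the smallest number of components reaching that fraction.
pca95 = PCA(n_components=0.95).fit(X)
X_reduced = pca95.transform(X)
print(pca95.n_components_, X_reduced.shape)
```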

# 12. Chaining Transformers

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
estimators = [('simpleimputer', SimpleImputer()),
              ('standardscaler', StandardScaler()),]
pipe = Pipeline(steps=estimators)

The same pipeline can be built via make_pipeline, which names the steps automatically.

In [None]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(SimpleImputer(),
                     StandardScaler())

GridSearch with pipeline

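Parameters of pipeline steps can be tuned in a grid search by using the naming convention for nested parameters: step name, two underscores, then the parameter name (for example clf__C).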

In [None]:
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# The step names must match the keys used in param_grid
pipe = Pipeline(steps=[('imputer', SimpleImputer()),
                       ('clf', SVC())])
param_grid = dict(imputer=['passthrough',
                           SimpleImputer(),
                           KNNImputer()],
                  clf = [SVC(), LogisticRegression()],
                  clf__C=[0.1,10,100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

C is the inverse of the regularization strength: the lower its value, the stronger the regularization.

In this example clf__C provides a set of values for grid search.
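
The following is a small end-to-end sketch of a grid search over a pipeline. The synthetic dataset, the injected missing values and the step names imputer and clf are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic classification data with a few missing values injected.
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
X[::10, 0] = np.nan

pipe = Pipeline([('imputer', SimpleImputer()),
                 ('clf', LogisticRegression())])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {'imputer__strategy': ['mean', 'median'],
              'clf__C': [0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_)
print(len(grid.cv_results_['params']))  # 2 strategies x 3 values of C
```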

Caching Transformers

Transforming data is a computationally expensive step.

* In a grid search, transformers need not be applied again for every parameter configuration. With caching, they are applied only once and the transformed data is reused.

In [None]:
import tempfile
# memory expects a directory path (str), so create a temporary directory and keep its path
tempDirPath = tempfile.mkdtemp()

estimators = [('simpleimputer', SimpleImputer()),
              ('pca', PCA()),
              ('regressor',LinearRegression())]
pipe = Pipeline(steps=estimators, memory=tempDirPath)

FeatureUnion

Concatenates the results of multiple transformer objects.
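
As a quick sketch, the example below runs two transformers on the same made-up matrix and stacks their outputs side by side. The step names pca and scaled are our own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Made-up matrix with 4 samples and 3 features.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 5.0],
              [4.0, 8.0, 2.0]])

# Each transformer sees the full input; the outputs are concatenated column-wise.
union = FeatureUnion([('pca', PCA(n_components=1)),
                      ('scaled', StandardScaler())])
X_union = union.fit_transform(X)

print(X_union.shape)  # 1 PCA column + 3 scaled columns
```

Unlike ColumnTransformer, every transformer in a FeatureUnion receives all the input columns.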

# 13. Visualizing Pipelines

In [4]:
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([('selector', ColumnTransformer([('select_first_4',
                                                          'passthrough',
                                                          slice(0,4))])),
                         ('imputer',SimpleImputer(strategy = 'median')),
                         ('std_scaler', StandardScaler()),
                         ])
cat_pipeline = ColumnTransformer([('label_binarizer', LabelBinarizer(),[4]),])
full_pipeline = FeatureUnion(transformer_list = [('num_pipeline',num_pipeline),
                                                 ('cat_pipeline',cat_pipeline),
                                                 ])

In [5]:
from sklearn import set_config
set_config(display='diagram')
full_pipeline

# 14. Handling imbalanced data

Imbalanced datasets are those in which one class is heavily under-represented compared to the other classes.

There are two main approaches to handle imbalanced data:

* Undersampling
* Oversampling
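
To make the two ideas concrete, here is a rough NumPy-only sketch on made-up data. The helper names undersample and oversample are hypothetical, and this is not how imblearn implements its samplers; it only illustrates the principle. Undersampling randomly drops samples until every class matches the smallest one, while oversampling randomly repeats samples until every class matches the largest one.

```python
from collections import Counter

import numpy as np

def undersample(X, y, rng):
    """Keep a random subset of each class, as large as the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
                          for c in classes])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Keep all samples and repeat random ones until each class matches the largest."""
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    extra = [rng.choice(np.flatnonzero(y == c), size=n_target - k, replace=True)
             for c, k in zip(classes, counts)]
    idx = np.concatenate([np.arange(len(y))] + extra)
    return X[idx], y[idx]

# Made-up imbalanced data: 90 samples of class 0 and 10 samples of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_under, y_under = undersample(X, y, rng)
X_over, y_over = oversample(X, y, rng)

print(Counter(y.tolist()), Counter(y_under.tolist()), Counter(y_over.tolist()))
```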

In [None]:
wine_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')
wine_data['quality'].hist(bins=50)
plt.xlabel('Quality')
plt.ylabel('Number of Samples')
plt.show()

Undersampling

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# value_counts() orders by frequency, so sort by quality label before unpacking
class_count_3,class_count_4,class_count_5,class_count_6,class_count_7,class_count_8 = wine_data['quality'].value_counts().sort_index()

class_3 =  wine_data[wine_data['quality']==3]
class_4 =  wine_data[wine_data['quality']==4]
class_5 =  wine_data[wine_data['quality']==5]
class_6 =  wine_data[wine_data['quality']==6]
class_7 =  wine_data[wine_data['quality']==7]
class_8 =  wine_data[wine_data['quality']==8]

print('class 3:', class_3.shape)
print('class 4:', class_4.shape)
print('class 5:', class_5.shape)
print('class 6:', class_6.shape)
print('class 7:', class_7.shape)
print('class 8:', class_8.shape)

from collections import Counter
X=wine_data.drop(['quality'], axis=1)
y=wine_data['quality']
undersample = RandomUnderSampler(random_state=0)
X_runs,y_runs = undersample.fit_resample(X,y)
print(Counter(y),Counter(y_runs))

Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros,y_ros = ros.fit_resample(X,y)
print(Counter(y),Counter(y_ros))

Types of SMOTE:

* Borderline SMOTE
* Borderline-SMOTE SVM
* Adaptive Synthetic Sampling (ADASYN)