# **Data Preprocessing Techniques**
Data preprocessing applies a series of transformations to raw data to make it more amenable to learning. It is carried out before the data is used for model training or prediction.

There are many preprocessing techniques, including:
- Data cleaning
  - Data imputation
  - Feature scaling
- Feature transformation
  - Polynomial features
  - Discretization
  - Handling categorical features
  - Custom Transformers
  - Composite Transformers
    - Apply transformation of diverse features
    - TransformedTargetRegressor
- Feature Selection
  - Filter based feature selection
  - Wrapper based feature selection
- Feature Extraction
  - PCA

The transformations are applied in a specific order, which can be specified via ``Pipeline``. Different feature types often need different transformations: ``ColumnTransformer`` applies transformers to selected columns, while ``FeatureUnion`` applies several transformers to the same input and combines their outputs into a single transformed feature matrix. We will also study how to visualize this pipeline.
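
As a rough illustration of how ordering and combining work, the sketch below chains an imputer with a ``FeatureUnion`` inside a ``Pipeline``. The toy matrix, the step names and the choice of ``StandardScaler`` and ``PCA`` are assumptions made for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values) with one missing entry.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# FeatureUnion: every transformer receives the same input and the
# outputs are stacked column-wise (2 scaled columns + 1 PCA column).
union = FeatureUnion(transformer_list=[
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=1)),
])

# Pipeline: steps run in the listed order, so imputation happens first.
full = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("features", union),
])

X_out = full.fit_transform(X)
print(X_out.shape)
```

Imputation is placed first because neither ``StandardScaler`` output nor ``PCA`` is useful if ``NaN`` values are still present.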

## Importing basic libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme(style="whitegrid")

## **1. Feature Extraction**

### DictVectorizer

Often the data is present as a **list of dictionary objects**. ML algorithms expect the data to be in **matrix form** of shape $(n,m)$, where $n$ is the number of samples and $m$ is the number of features.

``DictVectorizer`` **converts** a *list of dictionary objects to a feature matrix*.

Let's create sample data for demo purposes containing the ``age`` and ``height`` of children.
> Each record/sample is a dictionary with two keys, ``age`` and ``height``, and their corresponding values.

In [2]:
data = [{'age' : 4, 'height' : 96.0},
        {'age' : 1, 'height' : 73.9},
        {'age' : 3, 'height' : 88.9},
        {'age' : 2, 'height' : 81.6}]

> There are 4 data samples with 2 features each

Let's make use of ``DictVectorizer`` to convert the list of dictionary objects to the feature matrix

In [3]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse = False)
data_transformed = dv.fit_transform(data)
data_transformed

array([[ 4. , 96. ],
       [ 1. , 73.9],
       [ 3. , 88.9],
       [ 2. , 81.6]])

In [4]:
data_transformed.shape

(4, 2)

The transformed data is in feature matrix form: 4 examples with 2 features each, i.e. shape $(4,2)$.

## **2. Data Imputation**
- Many machine learning algorithms need a full feature matrix and may not work in the presence of missing data.
- Data imputation identifies **missing values** in each feature of the dataset and **replaces** them with an **appropriate value** based on a **fixed strategy** such as:
  - the **mean**, **median** or **mode** of that feature
  - a **specified constant** value

The sklearn library provides the ``sklearn.impute.SimpleImputer`` class for this purpose.

In [5]:
from sklearn.impute import SimpleImputer

Some of its important parameters:
- *missing_values*: Could be ``int, float, str, np.nan`` or ``None``. By default it is ``np.nan``.
- *strategy*: The default is ``mean``. One of the following strategies can be used:
  - ``mean`` - missing values are replaced using the **mean** along each column
  - ``median`` - missing values are replaced using the **median** along each column
  - ``most_frequent`` - missing values are replaced using the **most frequent** value along each column
  - ``constant`` - missing values are replaced with the value specified in the ``fill_value`` argument
- *add_indicator*: A boolean parameter that, when set to ``True``, appends **missing value indicator** columns to the transformed output. The fitted indicator is available in the ``indicator_`` attribute.

**Note**:
- ``mean`` and ``median`` strategies can only be used with numeric data.
- ``most_frequent`` and ``constant`` strategies can be used with strings or numeric data.
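
The strategies above can be compared on a small toy matrix. The values below are made up for illustration; ``statistics_`` holds the per-column fill value learned during ``fit``.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (hypothetical values) with one missing entry per column.
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 2.0],
                  [7.0, np.nan],
                  [4.0, 8.0]])

# Fit one imputer per strategy and record the learned fill values.
stats = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy).fit(X_toy)
    stats[strategy] = imp.statistics_
    print(strategy, imp.statistics_)

# The 'constant' strategy ignores the data and uses fill_value instead.
const_imp = SimpleImputer(strategy="constant", fill_value=0.0)
X_const = const_imp.fit_transform(X_toy)
print(X_const)
```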

### Data imputation on real world dataset
Let's perform data imputation on a real-world dataset. We will use the [heart-disease dataset from the UCI machine learning repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) for this purpose. We will load this dataset from a csv file.

In [6]:
cols = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
heart_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data',header=None,names=cols)

**STEP 1**: Check if the dataset contains missing values.
- This can be checked via the dataset description or by counting the ``NaN`` (``np.nan``) values in the dataframe. However, such a check only finds values that pandas already recognizes as missing.
- For non-numerical features, we can list their unique values and check if there are placeholder values like ``?``.


In [7]:
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  num       303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


Let's count the missing values in each column of the dataframe. Note that ``isnull()`` only flags values such as ``NaN``, so placeholder strings in the ``object`` columns are not counted here.

In [8]:
(heart_data.isnull().sum())

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64

There are two non-numerical features: ``ca`` and ``thal``.
- List their unique values.

In [9]:
print('Unique values in ca:', heart_data.ca.unique())
print('Unique values in thal:', heart_data.thal.unique())

Unique values in ca: ['0.0' '3.0' '2.0' '1.0' '?']
Unique values in thal: ['6.0' '3.0' '7.0' '?']


Both of them contain ``?``, which denotes a missing value. Let's count the number of missing values.

In [10]:
print('# missing values in ca:', heart_data.loc[heart_data.ca == '?','ca'].count())
print('# missing values in thal:', heart_data.loc[heart_data.thal =="?",'thal'].count())

# missing values in ca: 4
# missing values in thal: 2


**STEP 2**: Replace '?' with ``nan``.

In [11]:
heart_data.replace('?',np.nan, inplace=True)
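
An alternative, sketched below, is to tell pandas about the placeholder at load time through the ``na_values`` argument of ``read_csv``. The three inline rows and the reduced column list are hypothetical stand-ins for the real file, used so the example runs without a download.

```python
import io
import pandas as pd

# Hypothetical rows in a comma-separated layout, with '?' marking missing values.
raw = io.StringIO("63.0,1.0,0.0,6.0\n"
                  "67.0,1.0,?,3.0\n"
                  "52.0,0.0,1.0,?\n")

# na_values='?' makes pandas parse the placeholder as NaN while reading.
demo = pd.read_csv(raw, header=None,
                   names=['age', 'sex', 'ca', 'thal'],
                   na_values='?')

print(demo.dtypes)
print(demo.isnull().sum())
```

With this approach the affected columns are parsed as numeric right away instead of as ``object``.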

**STEP 3**: Fill the missing values with ``sklearn`` missing value imputation utilities.
> Here we use ``SimpleImputer`` with ``mean`` strategy.

We will try two variations:
- ``add_indicator = False``: Default choice that only imputes missing values.

In [12]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(heart_data)
heart_data_imputed = imputer.transform(heart_data)
print(heart_data_imputed.shape)

(303, 14)


- ``add_indicator = True``: Adds an additional column for each column containing missing values. In this case it adds two columns, one for ``ca`` and the other for ``thal``.

In [14]:
imputer = SimpleImputer(missing_values= np.nan, strategy='mean', add_indicator=True)
imputer = imputer.fit(heart_data)
heart_data_imputed_with_indicator = imputer.transform(heart_data)
print(heart_data_imputed_with_indicator.shape)

(303, 16)
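
To make the extra columns easier to inspect, here is a minimal sketch on a toy matrix (hypothetical values). It shows where the indicator columns end up and how to find out which original columns they refer to.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (hypothetical values): columns 1 and 2 each have one missing entry.
X_toy = np.array([[1.0, 10.0, 5.0],
                  [2.0, np.nan, 6.0],
                  [3.0, 30.0, np.nan]])

imp = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp.fit_transform(X_toy)

# 3 imputed columns followed by one 0/1 indicator column per affected feature.
print(X_out.shape)
# Indices of the original columns that had missing values during fit.
print(imp.indicator_.features_)
print(X_out)
```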


## **3. Feature Scaling**

Feature scaling **transforms feature values** such that **all the features are on the same scale**.
Using a feature matrix with all the features on the same scale has the following benefits:
- It **enables faster convergence** in iterative optimization algorithms like gradient descent and its variants.
- It helps ML algorithms such as SVM, K-NN and K-means, which compute Euclidean distances among input samples and whose performance suffers if the features are not scaled.

Tree based ML algorithms are not affected by feature scaling. In other words, feature scaling is not required for tree based ML algorithms.

Feature scaling can be performed with the following methods:
- Standardization
- Normalization
- MaxAbsScaler.
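
The three methods can be compared side by side on a toy matrix (hypothetical values). This sketch assumes that "normalization" refers to min-max scaling with ``MinMaxScaler``; that mapping is an assumption, since sklearn also offers a row-wise ``Normalizer``.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Toy matrix (hypothetical values) whose columns live on very different scales.
X_toy = np.array([[1.0, -100.0],
                  [2.0, 0.0],
                  [3.0, 200.0]])

# Standardization: each column gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X_toy)

# Min-max scaling (assumed meaning of "normalization"): each column mapped to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Max-abs scaling: each column divided by its largest absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X_toy)

print(X_std)
print(X_minmax)
print(X_maxabs)
```

Note that ``MaxAbsScaler`` keeps the sign of each value and does not shift the data, so zeros stay zeros.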

Let's demonstrate feature scaling on a real-world dataset. For this purpose, we will use the [abalone dataset](https://archive.ics.uci.edu/ml/datasets/abalone). We will use different scaling utilities from the ``sklearn`` library.

In [17]:
cols = ['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
abalone_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header=None,names=cols)

**STEP 1**: Examine the dataset

In [18]:
abalone_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


**STEP 1a**: [Optional]: convert non-numerical attributes into numerical ones
> In this dataset, ``Sex`` is the only non-numeric column.

In [19]:
abalone_data.Sex.unique()

array(['M', 'F', 'I'], dtype=object)

In [20]:
#Assign numeric values to sex.
abalone_data = abalone_data.replace({'Sex': {'M':1,'F':2,'I':3}})
abalone_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   int64  
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(2)
memory usage: 293.8 KB
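
Integer codes such as 1, 2, 3 implicitly suggest an ordering among the categories. As a possible alternative, the sketch below one-hot encodes a small hypothetical ``Sex`` column with ``OneHotEncoder``; it is not the approach used in the steps that follow.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the 'Sex' column, kept small for illustration.
sex = pd.DataFrame({'Sex': ['M', 'F', 'I', 'M']})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display.
sex_encoded = encoder.fit_transform(sex).toarray()

# Categories are sorted, so the columns correspond to F, I, M.
print(encoder.categories_)
print(sex_encoded)
```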


**STEP 2**: Separate labels from features.

In [21]:
y = abalone_data.pop('Rings')
print('The dataframe object after deleting the column')
abalone_data.info()

The dataframe object after deleting the column
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   int64  
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
dtypes: float64(7), int64(1)
memory usage: 261.2 KB


**STEP 3**: Examine the feature scales

#### Statistical method
Check the scales of different features with ``describe()`` method of dataframe.

In [23]:
abalone_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Sex | 4177.0 | 1.95547 | 0.827815 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Length | 4177.0 | 0.523992 | 0.120093 | 0.075 | 0.45 | 0.545 | 0.615 | 0.815 |
| Diameter | 4177.0 | 0.407881 | 0.09924 | 0.055 | 0.35 | 0.425 | 0.48 | 0.65 |
| Height | 4177.0 | 0.139516 | 0.041827 | 0.0 | 0.115 | 0.14 | 0.165 | 1.13 |
| Whole weight | 4177.0 | 0.828742 | 0.490389 | 0.002 | 0.4415 | 0.7995 | 1.153 | 2.8255 |
| Shucked weight | 4177.0 | 0.359367 | 0.221963 | 0.001 | 0.186 | 0.336 | 0.502 | 1.488 |
| Viscera weight | 4177.0 | 0.180594 | 0.109614 | 0.0005 | 0.0935 | 0.171 | 0.253 | 0.76 |
| Shell weight | 4177.0 | 0.238831 | 0.139203 | 0.0015 | 0.13 | 0.234 | 0.329 | 1.005 |
