# Data Preprocessing
Real-world training data is rarely clean: it often has **missing values** for certain features, features on **different scales**, **non-numeric attributes**, and so on.  

Often, there is a need to **pre-process** the data to make it amenable for training the model.

> Sklearn provides a rich set of transformers for this job.  

The **same pre-processing** should be applied to both the training and test sets.  

> Sklearn provides **pipeline** for making it easier to chain multiple transforms together and apply them *uniformly across train, eval and test sets*.

Once you get the training data, the first job is to *explore the data* and list the **preprocessing** it needs.  
Typical problems include:
* **Missing values** in features
* Numerical features are *not on the same scale*.
* **Categorical attributes** need to be represented with sensible numerical representation.
* **Too many features**, reduce them.
* **Extract features** from non-numeric data.

Sklearn provides a *library of transformers* for *data preprocessing*.
* Data cleaning (`sklearn.preprocessing`, `sklearn.impute`) such as *standardization*, *missing value imputation*, etc.
* Feature extraction (`sklearn.feature_extraction`)
* Feature reduction (`sklearn.decomposition.PCA`)
* Feature expansion (`sklearn.kernel_approximation`)

**Transformer methods**  
Each transformer has the following methods:
* `fit()` learns the transformation parameters from a training set.
* `transform()` applies the learnt transformation to new data.
* `fit_transform()` performs both `fit()` and `transform()` on the same data; it is more convenient and can be more efficient.
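
The three methods can be illustrated with a minimal sketch. It assumes `StandardScaler` as the transformer, and the toy train/test arrays are made up for illustration; any other transformer follows the same pattern.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature, separate train and test splits.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean_ and scale_ from the training set only
X_train_scaled = scaler.transform(X_train)  # applies the learnt parameters
X_test_scaled = scaler.transform(X_test)    # reuses the training mean/std on new data

# fit_transform() gives the same result as fit() followed by transform().
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
```

Note that the test set is only ever passed to `transform()`, never to `fit()`, so no statistics leak from it into the preprocessing.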

## Part 1: Feature extraction
`sklearn.feature_extraction` has useful APIs to extract features from data:
* `DictVectorizer`
* `FeatureHasher`

### `DictVectorizer`  
Converts a list of mappings (dict-like objects) from feature names to feature values into a matrix.

In [10]:
from sklearn.feature_extraction import DictVectorizer

In [11]:
dv = DictVectorizer(sparse=False)
X = dv.fit_transform([{'height': 1, 'length': 0, 'width': 1}, \
    {'height': 2, 'length': 1, 'width': 0}, \
    {'height': 1, 'length': 3, 'width': 2}])
X

array([[1., 0., 1.],
       [2., 1., 0.],
       [1., 3., 2.]])

In [12]:
dv.inverse_transform(X)

[{'height': 1.0, 'width': 1.0},
 {'height': 2.0, 'length': 1.0},
 {'height': 1.0, 'length': 3.0, 'width': 2.0}]

### `FeatureHasher`
* High-speed, low-memory vectorizer that uses feature hashing technique.
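
A short sketch of `FeatureHasher` on dict inputs follows. The sample dicts and `n_features=8` are arbitrary choices for illustration; real use typically needs far more columns to keep hash collisions rare.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features=8 is an arbitrary small value chosen for illustration.
hasher = FeatureHasher(n_features=8, input_type='dict')
samples = [{'dog': 1, 'cat': 2, 'elephant': 4},
           {'dog': 2, 'run': 5}]

# FeatureHasher is stateless, so transform() can be called without fit().
X_hashed = hasher.transform(samples)   # SciPy sparse matrix with one row per sample
X_dense = X_hashed.toarray()           # dense view, only sensible for tiny examples
```

Unlike `DictVectorizer`, the hasher keeps no vocabulary, so there is no `inverse_transform()` to recover feature names.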

### Feature Extraction from images and text
* `sklearn.feature_extraction.image.*` has useful APIs to extract features from image data.
* `sklearn.feature_extraction.text.*` has useful APIs to extract features from text data.


## Part 2: Data Cleaning
### 2.1 Handling missing values
* Many ML algorithms do not work with missing data and need all features to be present.  
* Discarding records containing missing values would result in loss of valuable training samples.
> `sklearn.impute` API provides functionality to fill missing values in a dataset.
  * `SimpleImputer`
  * `KNNImputer`

> `MissingIndicator` provides indicators for missing values.

#### SimpleImputer
* Fills missing values with one of the following strategies:
  * mean, median, most_frequent and constant.
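
As a minimal sketch, the `mean` strategy can be tried on a small made-up matrix; the other strategies are selected the same way through the `strategy` argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (hypothetical) with one missing entry in the first column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
# imputer.statistics_ holds the per-column fill values learnt during fit();
# here the first column's mean over observed values is (1 + 7) / 2 = 4.
```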

#### KNNImputer
* Uses the k-nearest neighbours approach to fill missing values in a dataset.
  * Each missing value is filled with the **mean** value of the same attribute over the `n_neighbors` closest neighbours.
* The nearest neighbours are determined using a **Euclidean distance** that skips missing coordinates (`nan_euclidean`, the default metric).

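```python
from sklearn.impute import KNNImputer
# X is assumed to be a numeric feature matrix with missing entries marked as np.nan
knnImputer = KNNImputer(n_neighbors=3, weights='uniform')
knnImputer.fit_transform(X)
```
>With `weights='uniform'` (the default), every neighbour contributes equally, so the imputed value is the plain mean of the neighbours' values.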

#### Marking imputed values
* It is important to indicate the *presence of missing values* in the dataset.
* `MissingIndicator` helps us to get those indications.
  * It returns a **binary matrix**,
    * **True values** correspond to missing entries in the original dataset.
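
The following sketch shows one way to obtain the indicator matrix. The toy matrix is made up, and `SimpleImputer(add_indicator=True)` is shown as one possible way of keeping the indicators next to the imputed values.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Toy matrix (hypothetical): columns 0 and 2 contain a missing value.
X = np.array([[np.nan, 1.0, 3.0],
              [4.0, 0.0, np.nan],
              [8.0, 1.0, 0.0]])

indicator = MissingIndicator()      # features='missing-only' by default
mask = indicator.fit_transform(X)   # one boolean column per feature that had missing values
# indicator.features_ lists the indices of those features.

# add_indicator=True appends the same indicator columns to the imputed data.
X_aug = SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X)
```

With the default `features='missing-only'`, columns that never contained a missing value during `fit()` get no indicator column.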

### 2.2 Numeric transformers
1. Feature scaling
2. Polynomial transformation
3. Discretization

#### Feature scaling
Numerical features with different scales lead to slower convergence of iterative optimization procedures.  
It is a good practice to scale numerical features so that all of them are on the same scale.  

Three feature scaling APIs are available in sklearn:
* `StandardScaler`
* `MinMaxScaler`
* `MaxAbsScaler`

`StandardScaler`  
Transforms the original feature vector $x$ into a new feature vector $x'$ using the following formula:  
> $$
x' = \frac{x - \mu}{\sigma}
$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the feature, estimated on the training set.

After standardization, the transformed feature vector $x'$ has mean $0$ and standard deviation $1$.  

`MinMaxScaler`  
Transforms to x' so that all values fall within range [0, 1].
> $$
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
$$

`MaxAbsScaler`  
Transforms to x' so that all values fall within range [-1, 1].  
> $$
x' = \frac{x}{\max(|x|)}
$$
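
The three scalers above can be compared on a single made-up feature column. This is a sketch for illustration only; the comments restate the formulas rather than guarantee any printed output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Toy column (hypothetical): min = -4, max = 6, max(|x|) = 6.
X = np.array([[-4.0], [0.0], [2.0], [6.0]])

X_std = StandardScaler().fit_transform(X)   # (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min), values in [0, 1]
X_ma = MaxAbsScaler().fit_transform(X)      # x / max(|x|), values in [-1, 1]
```

`MaxAbsScaler` leaves zeros at zero, so it does not destroy sparsity, whereas the other two shift the data.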

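`FunctionTransformer`  
Constructs transformed features by applying **a user-defined function** to the original features.
```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

ft = FunctionTransformer(np.log2)  # X is assumed to contain strictly positive values
ft.fit_transform(X)
```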

#### Polynomial transformation
Generates a new feature matrix consisting of **all polynomial combinations** of the features with **degree less than or equal to the specified degree**.  
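```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=2)
X = np.array([[2, 3]])  # one sample [a, b] with a = 2, b = 3
pf.fit_transform(X)
```
$$
X' = [1, a, b, a^2, ab, b^2]
$$

```python
pf = PolynomialFeatures(degree=3)
X = np.array([[2, 3]])  # same sample [a, b]
pf.fit_transform(X)
```
$$
X' = [1, a, b, a^2, ab, b^2, a^3, a^2b, ab^2, b^3]
$$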

#### `KBinsDiscretizer`
* Divides a continuous feature into bins.
* *One hot encoding* or *ordinal encoding* is further applied to the bin labels.

In [32]:
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
kbins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

In [34]:
X = np.arange(9)
X = X*0.125
X = X.reshape(9,1)

In [35]:
kbins.fit_transform(X)

array([[0.],
       [0.],
       [1.],
       [1.],
       [2.],
       [3.],
       [3.],
       [4.],
       [4.]])

## 2.3 Categorical transformers
1. Feature encoding
2. Label encoding

### `OneHotEncoder`
* Encodes categorical feature or label as a *one-hot numeric array*.
* Creates **one binary column** for each of *K unique values*.
* **Exactly one column** has 1 in it and the rest have 0.

In [46]:
from sklearn.preprocessing import OneHotEncoder
x = np.array([1, 2, 3, 1]).reshape(-1, 1)
x

array([[1],
       [2],
       [3],
       [1]])

In [47]:
ohe = OneHotEncoder()
x_tr = ohe.fit_transform(x)
x_tr.toarray()

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

*Number of unique values in [1, 2, 3, 1]:*  
K = 3  

*Number of columns in transformed matrix* = 3

### `LabelEncoder`
Encodes target labels with value between 0 and K-1, where K is number of distinct values.

### `OrdinalEncoder`
Encodes categorical features with value between 0 and K-1, where K is number of distinct values.  

> `OrdinalEncoder` can operate on **multi-dimensional data**, while `LabelEncoder` can transform *only 1D data*.
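
The difference between the two encoders can be sketched on made-up data: a 1D list of labels for `LabelEncoder` and a 2D array of categorical features for `OrdinalEncoder`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: 1D target labels.
y = ['cat', 'dog', 'cat', 'bird']
le = LabelEncoder()
y_enc = le.fit_transform(y)    # classes_ are sorted: bird=0, cat=1, dog=2

# OrdinalEncoder: 2D feature matrix, each column encoded independently.
X = np.array([['low', 'red'],
              ['high', 'blue'],
              ['medium', 'red']])
oe = OrdinalEncoder()
X_enc = oe.fit_transform(X)    # categories are sorted per column unless `categories` is given
```

Because the default order is alphabetical, `high` < `low` < `medium` here; pass an explicit `categories` list when the natural order matters.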

### `LabelBinarizer`
Several regression and binary classification algorithms can be extended to the multi-class setup in a **one-vs-all** fashion.  
>This involves training a single regressor or classifier per class.  
For this, we need to **convert multi-class labels to binary labels**, and `LabelBinarizer` performs this task.  

*If estimator supports multiclass data, LabelBinarizer is not needed.*
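
A minimal sketch of the conversion, using made-up integer labels: each distinct class gets its own binary column.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
Y = lb.fit_transform([1, 2, 6, 4, 2])   # one column per class in lb.classes_
y_back = lb.inverse_transform(Y)        # maps the binary rows back to the original labels
```

Each column of `Y` can serve as the binary target for one of the per-class estimators in the one-vs-all scheme.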

### `MultiLabelBinarizer`
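* Encodes a collection of **label sets** (each sample may carry several labels) as a **binary indicator matrix** with K columns, where K is the number of classes; entry (i, j) is 1 when sample i has label j.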
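
A small sketch of what `MultiLabelBinarizer` produces when each sample carries a set of labels; the genre labels are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Two samples: the first has two labels, the second has one.
Y = mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
# Columns follow mlb.classes_, which is sorted: comedy, sci-fi, thriller.
```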

### `add_dummy_feature`
**Augments** the dataset with **a column vector** in which every value is 1.
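
A minimal sketch on a made-up 2 x 2 matrix; the column of ones is placed in front of the existing features.

```python
from sklearn.preprocessing import add_dummy_feature

X = [[0, 1],
     [1, 0]]
X_aug = add_dummy_feature(X)   # prepends a column of 1s (the value can be changed via `value`)
```

Such a column is typically used to let a linear model learn an intercept as an ordinary weight.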

## Part 3: Feature selection
- Filter based
- Wrapper based

- In real-world datasets, **not all features contribute enough towards fitting a model**.
- Features that do not contribute significantly can be **removed**. This **decreases the size of the dataset** and hence the **computation cost** of fitting a model.
- `sklearn.feature_selection` provides many APIs to accomplish this task.

### Filter
- Variance threshold
- SelectKBest
- SelectPercentile
- GenericUnivariateSelect

### Wrapper
- RFE
- RFECV
- SelectFromModel
- SequentialFeatureSelector

### Filter based feature selection methods
#### Variance threshold
**Removes** all features with **variance below a certain threshold**, as specified by the user, from input feature matrix.
> By default, it removes only features that have the same value in all samples, i.e. zero variance.
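
A minimal sketch of the default behaviour, using a made-up matrix whose first and last columns are constant:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

selector = VarianceThreshold()          # threshold=0.0 by default
X_reduced = selector.fit_transform(X)   # the constant columns 0 and 3 are dropped
kept = selector.get_support()           # boolean mask over the original columns
```

Since the variance depends on the scale of a feature, a non-zero `threshold` is easiest to interpret when features share a comparable scale.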

### Univariate feature selection
Univariate feature selection **selects** features based on univariate statistical tests.  

There are three APIs for univariate feature selection:
* `SelectKBest`: Removes **all but** the **k highest scoring** features
* `SelectPercentile`: Removes **all but** a user-specified **highest scoring percentage** of features
* `GenericUnivariateSelect`: Performs univariate feature selection with a **configurable strategy**, which can be found via **hyper-parameter search**.

sklearn provides one more class of univariate feature selection methods that work on **common univariate statistical tests** for each feature:
- **SelectFpr** selects features based on a false positive rate test.
- **SelectFdr** selects features based on an estimated false discovery rate.
- **SelectFwe** selects features based on family-wise error rate.

### Univariate scoring function
* Each API needs a **scoring function** to score each feature.
* Three classes of scoring functions are proposed:
  - Mutual information (MI)
  - Chi-square
  - F-statistics
* MI and F-statistics can be used in both **classification** and **regression** problems.
  - `mutual_info_regression`
  - `mutual_info_classif`
  - `f_regression`
  - `f_classif`
* Chi-square can be used only in **classification** problems.
  - `chi2`
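
As a sketch of how a selector and a scoring function fit together, the example below assumes the iris dataset bundled with scikit-learn (150 samples, 4 non-negative features, 3 classes); the values of `k` and `percentile` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score.
skb = SelectKBest(score_func=chi2, k=2)
X_kbest = skb.fit_transform(X, y)

# Keep the top 50% of features ranked by the ANOVA F-statistic.
sp = SelectPercentile(score_func=f_classif, percentile=50)
X_pct = sp.fit_transform(X, y)

# skb.scores_ and sp.scores_ hold the per-feature scores used for ranking.
```

Both scoring functions here are classification scorers because `y` holds class labels; a regression target would call for `f_regression` or `mutual_info_regression` instead.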

#### Mutual Information (MI)
- **Measures dependency** between two variables.
- It returns a **non-negative** value.
  - MI = 0 for **independent** variables.
  - Higher MI indicates **higher dependency**.

#### Chi-square
* **Measures dependence** between two variables.
* Computes chi-square stats between **non-negative feature** (boolean or frequencies) and **class label**.
* **Higher chi-square values** indicate that the **feature and labels are likely to be correlated**. Hence we choose to include features with higher chi-square values.

*MI and chi-square feature selection are recommended for **sparse data***.  

#### GenericUnivariateSelect
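```python
from sklearn.feature_selection import GenericUnivariateSelect, chi2
# X is assumed to be non-negative (required by chi2); y holds the class labels
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=20)
X_new = transformer.fit_transform(X, y)
```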
- Selects 20 best features based on chi-square scoring function
- Selects set of features based on a feature selection mode and a scoring function.
- The `mode` could be `percentile` (default), `k_best`, `fpr`, `fdr`, `fwe`.
- The `param` argument takes value corresponding to the `mode`.

> *Do not use regression feature scoring function with classification problem. It will lead to useless results.*

### Wrapper based feature selection
Unlike filter based feature selection, wrapper based methods use **estimator class** rather than a **scoring function**.
#### Recursive Feature Elimination (RFE)
- Uses an **estimator** to **recursively remove features**.
  - Initially fits an estimator on all features.
  - Obtains **feature importances** from the estimator and **removes one or more least important features** (depending upon `step` parameter).
  - Repeats the process **until the specified number of features** is reached.

> - Use `RFECV` if we do not want to specify the desired number of features in `RFE`.
> - It performs `RFE` in a **cross-validation loop** to find the **optimal number of features**.
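
A sketch of both variants follows. The iris dataset and `LogisticRegression` are assumptions made for illustration; any estimator that exposes `coef_` or `feature_importances_` could take its place.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# RFE: drop one feature per iteration (step=1) until 2 remain.
rfe = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)
# rfe.support_ is a boolean mask of kept features; rfe.ranking_ is 1 for kept features.

# RFECV: let 5-fold cross-validation choose how many features to keep.
rfecv = RFECV(estimator, step=1, cv=5).fit(X, y)
# rfecv.n_features_ holds the number of features that was selected.
```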

#### `SelectFromModel`
First trains an estimator on all features, then keeps the features whose importance is above a **certain threshold** (the `threshold` parameter); the **max_features** parameter optionally caps how many features are kept.

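```python
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# X, y are assumed to be a feature matrix and its class labels
clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf = clf.fit(X, y)
clf.coef_

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
```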
- Here we use a linear support vector classifier to get coefficients of the features for `SelectFromModel` transformer.
- The **L1 regularizer** drives the weights of unhelpful features to exactly zero, leaving **non-zero weights only on features that are useful**.
- It ends up selecting features with non-zero weights or coefficients.

#### Sequential feature selection
Performs feature selection by **selecting or deselecting features one by one in a greedy manner**.
- Forward selection:
  - Starting with zero features, it finds the one feature that gives the best cross-validation score for an estimator trained on that feature alone.
  - Repeats the process, each time adding the feature that most improves the score to the set of selected features.
- Backward selection:
  - Starts with all features and removes the least useful feature one at a time, mirroring the idea of forward selection.

Stops when the desired number of features is reached.  

- In general, forward and backward selection **do not yield equivalent results**.
- Select the direction that is **efficient** for the required number of selected features.
- SFS does not require the underlying model to expose a `coef_` or `feature_importances_` attribute unlike in `RFE` and `SelectFromModel`.
- SFS may be slower than `RFE` and `SelectFromModel` as it needs to evaluate more models compared to the other two approaches.

For example in backward selection, the iteration going from **m** features to **m-1** features using k-fold CV requires fitting m x k models, while
  - `RFE` would require only a single fit, and
  - `SelectFromModel` performs a single fit and requires no iterations.
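
A minimal sketch of forward selection, assuming the iris dataset and a k-nearest-neighbours classifier (chosen because it exposes neither `coef_` nor `feature_importances_`); the number of features to keep is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Greedy forward selection scored by 5-fold cross-validation.
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction='forward', cv=5)
X_sfs = sfs.fit_transform(X, y)
mask = sfs.get_support()   # boolean mask over the original 4 features
```

Setting `direction='backward'` starts from all features instead; with 4 features and 2 to keep, both directions need the same number of rounds, but for other targets one direction can be much cheaper.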

### Heterogeneous features transformations
Generally training data contains diverse features such as numerical and categorical.  

Different feature types are processed with different transformers.  

We need a way to combine different feature transformers seamlessly.  
`sklearn.compose` has useful classes and methods to apply transformations on subsets of features and combine them:
- `ColumnTransformer`
  - It applies **a set of transformers** to columns of an array or `pandas.DataFrame` and **concatenates the transformed outputs** from the different transformers into a **single matrix**.
  - It is useful for **transforming heterogeneous data** by applying **different transformers to separate subsets of features**.
  - It combines different feature selection mechanisms and transformations into a single transformer object.
  - Each transformer is specified as a tuple of the form
    - ```(transformerName, transformer(), columnIndices)```

- `TransformedTargetRegressor`
  - Transforms the **target variable** `y` **before fitting** a regression model.
  - The predicted values are mapped back to the original space via an inverse transform.
  - `TransformedTargetRegressor` takes `regressor` and `transformer` to be applied to the target variable as arguments.
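
Both classes can be sketched on made-up data. In this illustration the third column is assumed to be a categorical code, and the target is constructed to be linear only after a log transform. `TransformedTargetRegressor` is given `func`/`inverse_func` here, which is an alternative to passing a `transformer` object.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): columns 0 and 1 are numeric, column 2 is a categorical code.
X = np.array([[20.0, 1.5, 0],
              [30.0, 2.5, 1],
              [40.0, 3.5, 2],
              [50.0, 4.5, 1]])

ct = ColumnTransformer(
    [('scale', StandardScaler(), [0, 1]),   # (name, transformer, column indices)
     ('onehot', OneHotEncoder(), [2])],
    sparse_threshold=0,                     # always return a dense array
)
X_t = ct.fit_transform(X)                   # 2 scaled columns + 3 one-hot columns

# Target that is linear in the first feature only after taking logs.
y = np.exp(0.1 * X[:, 0])
ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
ttr.fit(X[:, [0]], y)                       # the regressor is trained on log(y)
y_pred = ttr.predict(X[:, [0]])             # predictions are mapped back with exp
```

Columns not listed in any tuple are dropped by default; `remainder='passthrough'` keeps them unchanged.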


## Part 4: Dimensionality reduction
Another way to reduce the number of features is through **unsupervised dimensionality reduction** techniques.  
The `sklearn.decomposition` module has a number of APIs for this task.  
> We will focus on how to perform feature reduction with **Principal Component Analysis (PCA)** in sklearn.

### PCA 101
- **PCA** is a linear dimensionality reduction technique.
- It uses **singular value decomposition (SVD)** to project the feature matrix or data to a **lower dimensional space**.
- The first principal component (PC) is in the **direction of maximum variance** in the data.
  - It captures the **largest share of the variance** of any single direction.
- The **subsequent PCs** are **orthogonal** to all earlier PCs and **capture progressively less variance** in the data.
- We can **select first k PCs** such that we are able to capture the **desired variance** in the data.

> `sklearn.decomposition.PCA` API is used for performing PCA based dimensionality reduction.
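
A minimal sketch of both ways of choosing k, assuming the iris dataset; the 0.95 variance target is an arbitrary example value.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 150 samples, 4 features

# Fixed number of components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
ratios = pca.explained_variance_ratio_     # variance share per PC, in decreasing order

# A float in (0, 1) asks for the fewest PCs whose cumulative share reaches that value.
pca95 = PCA(n_components=0.95, svd_solver='full').fit(X)
# pca95.n_components_ is the number of PCs that were kept.
```

PCA centres the data but does not scale it, so features on very different scales are usually standardized first.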

## Part 5: Chaining Transformers
The preprocessing transformations are applied one after another on the input feature matrix.  

> It is important to apply **exactly the same transformations** to the training, evaluation and test sets **in the same order**.

Failing to do so would lead to **incorrect predictions** from model due to **distribution shift** and hence **incorrect performance evaluation**.  

The `sklearn.pipeline` module provides utilities to build a **composite estimator**, as a **chain of transformers and estimators**.  
There are two classes:
1. Pipeline
   - Constructs a chain of multiple transformers to execute a fixed sequence of steps in data preprocessing and modelling.
2. FeatureUnion
   - Combines output from several transformer objects by creating a new transformer from them.
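
`FeatureUnion` can be sketched as follows, assuming the iris dataset and two arbitrary transformers whose outputs are placed side by side.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Each transformer sees the full input; their outputs are concatenated column-wise.
union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('kbest', SelectKBest(k=1))])
X_union = union.fit_transform(X, y)   # 2 PCA columns + 1 selected original column
```

A `FeatureUnion` is itself a transformer, so it can be used as one step inside a `Pipeline`.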

#### Accessing individual steps in a pipeline

In [2]:
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Each step is a (name, estimator) tuple; the last step is the final estimator.
estimator = [
    ('simpleImputer', SimpleImputer()),
    ('pca', PCA()),
    ('regressor', LinearRegression())
]
pipe = Pipeline(estimator)
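
The individual steps of such a pipeline can be reached in several ways. The sketch below rebuilds the same three-step pipeline so that it is self-contained; the variable names are chosen for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('simpleImputer', SimpleImputer()),
    ('pca', PCA()),
    ('regressor', LinearRegression()),
])

first_step = pipe[0]                    # by position: the SimpleImputer
pca_step = pipe['pca']                  # by name: the PCA step
same_pca = pipe.named_steps['pca']      # dict-like access by name
name, last_step = pipe.steps[-1]        # raw (name, estimator) tuple
preprocessing = pipe[:2]                # slicing returns a sub-Pipeline

# Parameters of a step are set with the <step>__<param> convention.
pipe.set_params(pca__n_components=2)
```

The same `<step>__<param>` names are what a hyper-parameter search over the pipeline would use in its parameter grid.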