## Preprocessing Data

- Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
- Scikit-learn: https://scikit-learn.org/stable/modules/preprocessing.html

**Note:** Read the scikit-learn preprocessing documentation and examples before using these functions.

In [None]:
# Library import

import numpy as np
import pandas as pd
from sklearn import preprocessing

### Standardization, or mean removal and variance scaling

In [None]:
# A simple example
# use this array as the training set
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

print("data: \n{} \ntype: {}\n".format(X_train, type(X_train)))

#### Basic standardization

In [None]:
# Standardization
data_standardized = preprocessing.scale(X_train)
print("standardized data: \n{}\n".format(data_standardized))

# scaled data has zero mean and unit variance for each feature (column):
print("Scaled data by standardization:")
print("Mean = {}".format(data_standardized.mean(axis=0)))
print("Std deviation = {}\n".format(data_standardized.std(axis=0)))
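To make the transformation concrete, the following minimal sketch recomputes the standardization by hand and compares it with `preprocessing.scale`. It assumes the population standard deviation (`ddof=0`, NumPy's default); the names `mean`, `std`, and `manual` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-feature (column) mean and population standard deviation (ddof=0)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Subtract the column mean, then divide by the column standard deviation
manual = (X_train - mean) / std

print("matches preprocessing.scale:",
      np.allclose(manual, preprocessing.scale(X_train)))
```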

#### StandardScaler

`StandardScaler` computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the test set.

In [None]:
# fit a scaler on the training data, then apply it to both the training and test data
scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler)

print("scaler mean: ", scaler.mean_)
print("scaler scale: ", scaler.scale_)

train_scaled = scaler.transform(X_train)

print("train data: \n", X_train)
print("scaled train data: \n", train_scaled)

X_test = [[-1., 1., 0.]]
test_scaled = scaler.transform(X_test)

print("test data: \n", X_test)
print("scaled test data: \n", test_scaled)
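As a sanity check, the sketch below reapplies the training statistics to the test sample by hand, using the fitted `mean_` and `scale_` attributes, and then undoes the scaling with `inverse_transform`. The variable names `manual_test` and `restored` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])

# Statistics are learned from the training data only
scaler = StandardScaler().fit(X_train)

# The test sample is scaled with the *training* mean and scale
manual_test = (X_test - scaler.mean_) / scaler.scale_
print("matches scaler.transform:",
      np.allclose(manual_test, scaler.transform(X_test)))

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaler.transform(X_test))
print("round trip recovers X_test:", np.allclose(restored, X_test))
```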

#### Min Max Scaler

In [None]:
# min max scaling
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(X_train)
print("Min max scaled data:\n", data_scaled)
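For the default `feature_range=(0, 1)`, min-max scaling should reduce to `(X - min) / (max - min)` computed per feature. The sketch below checks this by hand; `col_min`, `col_max`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-feature (column) minimum and maximum
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# Map each column linearly onto [0, 1]
manual = (X_train - col_min) / (col_max - col_min)

scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)
print("matches MinMaxScaler:", np.allclose(manual, scaled))
```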

### Data normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.

The `normalize` function provides a quick and easy way to perform this operation on a single array-like dataset, using either the `l1` or the `l2` norm:

In [None]:
# normalization
data_normalized1 = preprocessing.normalize(X_train, norm='l1')
print("L1 normalized data:\n", data_normalized1)

In [None]:
data_normalized2 = preprocessing.normalize(X_train, norm='l2')
print("L2 normalized data:\n", data_normalized2)
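The sketch below illustrates two properties of normalized data: every row (sample) has unit norm, and the dot product of two L2-normalized rows equals the cosine similarity of the original rows. The variable names are illustrative, and the direct cosine computation is included only for comparison.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

X_l1 = preprocessing.normalize(X_train, norm='l1')
X_l2 = preprocessing.normalize(X_train, norm='l2')

# Each row now has unit norm under the chosen norm
l1_norms = np.abs(X_l1).sum(axis=1)
l2_norms = np.linalg.norm(X_l2, axis=1)
print("L1 norms of rows:", l1_norms)
print("L2 norms of rows:", l2_norms)

# Dot products of L2-normalized rows are cosine similarities
cosine_from_normalized = X_l2 @ X_l2.T
raw_norms = np.linalg.norm(X_train, axis=1)
cosine_direct = (X_train @ X_train.T) / np.outer(raw_norms, raw_norms)
print("dot product equals cosine similarity:",
      np.allclose(cosine_from_normalized, cosine_direct))
```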

### Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes.

#### Feature binarization

Feature binarization is the process of thresholding numerical features to get boolean values: values greater than the threshold map to 1, and all other values map to 0.

In [None]:
# binarization
binarizer = preprocessing.Binarizer(threshold=1.1)
data_binarized = binarizer.transform(X_train)

print("X_train: \n", X_train)
print("Binarized data:\n", data_binarized)
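Binarization with a threshold can also be written as a plain NumPy comparison, which makes the rule explicit. This is a minimal sketch; `manual` is an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

threshold = 1.1

# Values strictly greater than the threshold become 1.0, the rest 0.0
manual = (X_train > threshold).astype(float)

binarized = preprocessing.Binarizer(threshold=threshold).transform(X_train)
print("matches Binarizer:", np.array_equal(manual, binarized))
```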

#### K-bins discretization

KBinsDiscretizer discretizes features into k bins.

In [None]:
X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
# three bins for the first feature, two bins for each of the other two
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

By default the output is one-hot encoded into a sparse matrix (see "Encoding categorical features" in the scikit-learn documentation); this can be configured with the `encode` parameter, which the example above sets to `'ordinal'`. For each feature, the bin edges are computed during `fit`; together with the number of bins, they define the intervals. For the current example, these intervals are:

![image.png](attachment:image.png)

In [None]:
est.transform(X)
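The fitted bin edges can be inspected through the `bin_edges_` attribute, which holds one array per feature. The sketch below prints them next to the transformed data. It uses the default binning strategy, so the exact edge values may depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])

est = KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

# One array of edges per feature; n bins are delimited by n + 1 edges
for i, edges in enumerate(est.bin_edges_):
    print("feature {}: edges = {}".format(i, edges))

# With encode='ordinal', each value is replaced by the index of its bin
binned = est.transform(X)
print("binned data:\n", binned)
```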