# Data Preprocessing

*Author:* Github - @wklchris

*Feb 2019*

In [4]:
import numpy as np
import pandas as pd
from sklearn import preprocessing as skp  # This is my personal preferred alias

## Imputation of Missing Values

Toy data preparation:

In [68]:
data = pd.DataFrame({"y": ["A", "B", "B", "N/A"], 
                    "x1": [0, 1, np.nan, 5],
                    "x2": [5, np.nan, 1, 9]})
data.head()

|   | y   | x1  | x2  |
|---|-----|-----|-----|
| 0 | A   | 0.0 | 5.0 |
| 1 | B   | 1.0 | NaN |
| 2 | B   | NaN | 1.0 |
| 3 | N/A | 5.0 | 9.0 |


### Example 1: Self-imputation by mean values

For column x1, we have $(0+1+5)/3 = 2$. For column x2, we have $(5+1+9)/3=5$.

In [69]:
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# Filled with mean of the same column data
imp_mean.fit_transform(data.loc[:, ["x1", "x2"]].values)

array([[0., 5.],
       [1., 5.],
       [2., 1.],
       [5., 9.]])
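As a cross-check, the mean imputation above can be reproduced by hand with NumPy. This is only a minimal sketch of the idea, not how `SimpleImputer` is implemented; the names `X`, `col_means` and `X_filled` are illustrative.

```python
import numpy as np

# Same numeric columns (x1, x2) as the toy data above
X = np.array([[0, 5], [1, np.nan], [np.nan, 1], [5, 9]], dtype=float)

# Column means computed while ignoring NaN entries
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column (broadcast over rows)
X_filled = np.where(np.isnan(X), col_means, X)
print(col_means)
print(X_filled)
```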

### Example 2: External imputation by mode values

If the imputer is applied to a single column (feature), reshape the input with `reshape(-1, 1)`. If it is applied to a single sample, use `reshape(1, -1)`.

In [70]:
imp_freq = SimpleImputer(missing_values="N/A", strategy='most_frequent')
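# Fill with the mode (works for strings and numbers) learned from separate data.
# "D" and "E" tie in fit_data; SimpleImputer then returns the smallest value, "D".
## reshape(-1, 1) is needed for a single feature, or (1, -1) for a single sample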
fit_data = pd.DataFrame(["E", "D", "D", "E"], dtype="category")
imp_freq.fit(fit_data)
imp_freq.transform(data.y.values.reshape(-1, 1))

array([['A'],
       ['B'],
       ['B'],
       ['D']], dtype=object)
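To make the two `reshape` calls concrete, here is a small NumPy sketch. The array `values` is made up for illustration; scikit-learn estimators expect 2-D input of shape `(n_samples, n_features)`.

```python
import numpy as np

values = np.array([10, 20, 30])

# One feature observed on three samples: 3 rows, 1 column
as_feature = values.reshape(-1, 1)

# One sample with three features: 1 row, 3 columns
as_sample = values.reshape(1, -1)

print(as_feature.shape, as_sample.shape)
```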

### Example 3: Imputation by a given constant

For simplicity, `fit_transform()` fits the imputer and transforms the data in a single call.

In [71]:
imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
imp_constant.fit_transform(data.x1.values.reshape(-1, 1))

array([[ 0.],
       [ 1.],
       [-1.],
       [ 5.]])

## Encoding Categorical Features

Categorical features usually need extra processing compared with numeric features. For example, gender is a typical categorical feature, whose values normally come from a small fixed set such as `[male, female]`.

Toy data preparation:

In [57]:
data = pd.DataFrame({"x1": list("ABBCACAB"), 
                   "x2": [5,6,7,8] * 2})
data

|   | x1 | x2 |
|---|----|----|
| 0 | A  | 5  |
| 1 | B  | 6  |
| 2 | B  | 7  |
| 3 | C  | 8  |
| 4 | A  | 5  |
| 5 | C  | 6  |
| 6 | A  | 7  |
| 7 | B  | 8  |


### Example 1: Ordinal encoding

A common encoder for this scenario is `sklearn.preprocessing.OrdinalEncoder`, which maps the categories of each feature to integers.

In [58]:
enc_ordinal = skp.OrdinalEncoder()
enc_ordinal.fit([["A", 5], ["C", 6], ["B", 7], ["A", 8]])
enc_ordinal.transform(data.values)

array([[0., 0.],
       [1., 1.],
       [1., 2.],
       [2., 3.],
       [0., 0.],
       [2., 1.],
       [0., 2.],
       [1., 3.]])

In [65]:
enc_ordinal = skp.OrdinalEncoder()
enc_ordinal.fit_transform(data.values)

array([[0., 0.],
       [1., 1.],
       [1., 2.],
       [2., 3.],
       [0., 0.],
       [2., 1.],
       [0., 2.],
       [1., 3.]])
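The outputs above suggest that the integer codes follow the sorted order of the categories in each column. A rough NumPy equivalent for a single column, assuming sorted unique values define the codes (the variable names are illustrative):

```python
import numpy as np

x1 = np.array(list("ABBCACAB"))

# np.unique returns the sorted categories and, with return_inverse=True,
# the index of each element within that sorted array
categories, codes = np.unique(x1, return_inverse=True)
print(categories)
print(codes)

# Decoding is a simple lookup of the codes in the category array
decoded = categories[codes]
```

Indexing `categories` with the codes recovers the original strings, which is the same idea as `inverse_transform()`.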

Use `inverse_transform()` to transform encoded values back to categorical strings:

In [62]:
enc_ordinal.inverse_transform([[0, 0], [1, 1]])

array([['A', 5],
       ['B', 6]], dtype=object)

### Example 2: One-hot encoding

Each category of a feature becomes its own binary column: the column matching a sample's category is set to 1, while the other columns of that feature are set to 0.

Though not shown in the example below, `inverse_transform()` is also supported.

In [86]:
data = pd.DataFrame([["A", "X"], ["A", "Y"], ["B", "Y"]], columns=["x", "y"])
data

|   | x | y |
|---|---|---|
| 0 | A | X |
| 1 | A | Y |
| 2 | B | Y |


In [87]:
enc_onehot = skp.OneHotEncoder()
enc_data = enc_onehot.fit_transform(data).toarray()
# Note: in scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()
enc_data_index = enc_onehot.get_feature_names(input_features=data.columns)
data_encoded = pd.DataFrame(enc_data, columns=enc_data_index)
data_encoded

|   | x_A | x_B | y_X | y_Y |
|---|-----|-----|-----|-----|
| 0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 |
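For intuition, one-hot encoding of a single feature can be sketched with plain NumPy broadcasting. This is an illustrative hand-rolled version for the column `x` only, not the `OneHotEncoder` implementation; decoding via `argmax` plays the role of `inverse_transform()`.

```python
import numpy as np

x = np.array(["A", "A", "B"])
categories = np.unique(x)  # sorted categories of this feature

# Compare every sample against every category: shape (n_samples, n_categories)
onehot = (x[:, None] == categories[None, :]).astype(float)
print(onehot)

# Decode: the position of the 1 in each row picks the category
decoded = categories[onehot.argmax(axis=1)]
print(decoded)
```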


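Passing `handle_unknown='ignore'` when creating a one-hot encoder lets it handle categories that were not seen during fitting. At transform time, all one-hot columns of that feature are set to 0 for such a value. (With the default `handle_unknown='error'`, an unknown category raises an error instead.)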

For example, the following code fits the encoder with only "C" for the first feature, so the values "A" and "B" in `data` are unknown and are encoded as 0. Furthermore, there are no columns `x_A` and `x_B` in the transformed data.

In [88]:
enc_onehot = skp.OneHotEncoder(handle_unknown='ignore')
enc_onehot.fit([["C", "X"], ["C", "Y"]])
# Note: in scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()
pd.DataFrame(enc_onehot.transform(data).toarray(),
            columns=enc_onehot.get_feature_names(input_features=data.columns))

|   | x_C | y_X | y_Y |
|---|-----|-----|-----|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 |
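The contrast between the two `handle_unknown` settings can be checked directly. A minimal sketch, assuming a scikit-learn version in which the default is `handle_unknown='error'`; the variables `train` and `new` are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["C", "X"], ["C", "Y"]]
new = [["A", "X"]]  # "A" was never seen during fit

# Default behaviour: an unknown category raises a ValueError
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': the unknown category becomes an all-zero block
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
row = lenient.transform(new).toarray()
print(raised, row)
```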


## Training & test sets split

In [117]:
# Example from scikit-learn 0.20.0 official user guide
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)).astype(np.float64), range(5)
X

array([[0., 1.],
       [2., 3.],
       [4., 5.],
       [6., 7.],
       [8., 9.]])

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train, X_test

(array([[4., 5.],
        [0., 1.],
        [6., 7.]]), array([[2., 3.],
        [8., 9.]]))

In [119]:
y_train, y_test

([2, 0, 3], [1, 4])
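With 5 samples and `test_size=0.33`, the test set gets 2 samples and the training set 3. The sketch below shows the general idea of a shuffled split in NumPy. The helper `simple_split` is hypothetical and is not scikit-learn's implementation, so the selected rows will generally differ from the output above; rounding the test size up is an assumption chosen to match the 3/2 split seen here.

```python
import numpy as np

def simple_split(X, y, test_size, seed):
    """Shuffle the row indices, then cut them into test and train parts."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    n_test = int(np.ceil(test_size * n_samples))  # assumed rounding rule
    idx = rng.permutation(n_samples)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(10).reshape((5, 2)).astype(np.float64)
y_demo = np.arange(5)
X_tr, X_te, y_tr, y_te = simple_split(X_demo, y_demo, test_size=0.33, seed=42)
print(X_tr.shape, X_te.shape)
```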

## Scaler

`scale` standardizes each feature (column) so that the result has mean 0 and standard deviation 1.

In [120]:
data = pd.DataFrame({"x1": [0.0, 1, 2], "y": [-2, 0.0, 1]})
data_scaled = skp.scale(data)
data_scaled

array([[-1.22474487, -1.33630621],
       [ 0.        ,  0.26726124],
       [ 1.22474487,  1.06904497]])

In [121]:
data_scaled.mean(), data_scaled.std()

(0.0, 1.0)
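The standardization can be written out per column as $(x - \mu)/\sigma$. A minimal NumPy sketch on the same toy values, using the population standard deviation (NumPy's default `ddof=0`); the variable names are illustrative.

```python
import numpy as np

X = np.array([[0.0, -2.0], [1.0, 0.0], [2.0, 1.0]])

# Subtract the column mean and divide by the column standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)

# Checking per column (axis=0) is stricter than checking the whole array
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```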

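Another option is `StandardScaler`, which stores the mean and standard deviation learned from the training set so that the same transformation can be applied to the test set. The scaled test set therefore need not have mean 0 and standard deviation 1: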

In [122]:
st_scaler = skp.StandardScaler().fit(X_train)
X_train_stscaled = st_scaler.transform(X_train)
X_train_stscaled.mean(), X_train_stscaled.std()

(3.700743415417188e-17, 1.0)

In [123]:
X_test_stscaled = st_scaler.transform(X_test)
X_test_stscaled.mean(), X_test_stscaled.std()

(0.6681531047810609, 1.2026755886059097)
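The test-set statistics are not 0 and 1 because the test data is scaled with the mean and standard deviation of the training data. A hand-written NumPy sketch of the same steps, using the train/test arrays from the split above (the names `mu` and `sigma` are illustrative):

```python
import numpy as np

X_train = np.array([[4.0, 5.0], [0.0, 1.0], [6.0, 7.0]])
X_test = np.array([[2.0, 3.0], [8.0, 9.0]])

# Statistics are learned from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# The same mu and sigma are applied to both sets
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
print(X_test_scaled.mean(), X_test_scaled.std())
```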

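`MinMaxScaler` rescales each feature to a given range (`feature_range`), and `MaxAbsScaler` divides each feature by its maximum absolute value. The function `minmax_scale` is the functional counterpart of `MinMaxScaler`: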

In [104]:
skp.MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)

array([[-1.        , -1.        ],
       [ 0.        ,  0.33333333],
       [ 1.        ,  1.        ]])

In [136]:
skp.minmax_scale(data, feature_range=(-1, 1))

array([[-1.        , -1.        ],
       [ 0.        ,  0.33333333],
       [ 1.        ,  1.        ]])

In [105]:
skp.MaxAbsScaler().fit_transform(data)

array([[ 0. , -1. ],
       [ 0.5,  0. ],
       [ 1. ,  0.5]])
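Both transformations are simple enough to write by hand. The sketch below reproduces the outputs above with NumPy; `lo` and `hi` stand for the two ends of `feature_range` and are illustrative names.

```python
import numpy as np

X = np.array([[0.0, -2.0], [1.0, 0.0], [2.0, 1.0]])
lo, hi = -1.0, 1.0  # target range, as in feature_range=(-1, 1)

# Min-max: map each column to [0, 1], then stretch and shift to [lo, hi]
X_unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_minmax = X_unit * (hi - lo) + lo

# Max-abs: divide each column by its largest absolute value
X_maxabs = X / np.abs(X).max(axis=0)
print(X_minmax)
print(X_maxabs)
```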

`RobustScaler` centers each feature by its median and scales it by the interquartile range (IQR), which makes it less sensitive to outliers:

In [138]:
data = np.array([[0, 1], [1, -1], [2, 3], [3, -9], [4, 10]])
data

array([[ 0,  1],
       [ 1, -1],
       [ 2,  3],
       [ 3, -9],
       [ 4, 10]])

In [114]:
rb_scaler = skp.RobustScaler()
rb_scaler.fit_transform(data)

array([[-1.  ,  0.  ],
       [-0.5 , -0.5 ],
       [ 0.  ,  0.5 ],
       [ 0.5 , -2.5 ],
       [ 1.  ,  2.25]])

In [116]:
rb_scaler.scale_, rb_scaler.center_

(array([2., 4.]), array([2., 1.]))

In [139]:
skp.robust_scale(data, quantile_range=(25.0, 75.0))

array([[-1.  ,  0.  ],
       [-0.5 , -0.5 ],
       [ 0.  ,  0.5 ],
       [ 0.5 , -2.5 ],
       [ 1.  ,  2.25]])
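The values of `center_` and `scale_` above are the column medians and interquartile ranges. A minimal NumPy sketch of the same computation, assuming the default linear interpolation of `np.percentile` (variable names are illustrative):

```python
import numpy as np

X = np.array([[0, 1], [1, -1], [2, 3], [3, -9], [4, 10]], dtype=float)

center = np.median(X, axis=0)                  # per-column median
q25, q75 = np.percentile(X, [25, 75], axis=0)  # per-column quartiles
scale = q75 - q25                              # interquartile range

X_robust = (X - center) / scale
print(center, scale)
print(X_robust)
```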

`normalize` or `Normalizer` scales each sample (i.e. row by row) to unit length:

$$
x' = \frac{x}{\|x\|}
$$

where the definition of norm $\|\cdot\|$ is determined by parameter `norm`. The default is `l2`, i.e. $\sqrt{x_1^2+x_2^2+\cdots}$. 

In [135]:
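# `data` was reassigned in the RobustScaler example; restore the 3x2 toy data first
data = pd.DataFrame({"x1": [0.0, 1, 2], "y": [-2, 0.0, 1]})
skp.normalize(data)  # Normalized by row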

array([[ 0.        , -1.        ],
       [ 1.        ,  0.        ],
       [ 0.89442719,  0.4472136 ]])
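The formula can be applied directly with NumPy. This sketch divides each row by its own $l_2$ norm and assumes no row is all zeros; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0.0, -2.0], [1.0, 0.0], [2.0, 1.0]])

# l2 norm of every row, kept as a column so it broadcasts across the row
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms
print(X_unit)

# Every row now has length 1
print(np.linalg.norm(X_unit, axis=1))
```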