In [1]:
from sklearn import preprocessing
import numpy as np

# Definitions
### Rescaling
"Rescaling" a vector means to add or subtract a
constant and then multiply or divide by a constant, as you would do to
change the units of measurement of the data, for example, to convert a
temperature from Celsius to Fahrenheit.
### Normalizing
"Normalizing" a vector most often means dividing by a norm of the vector,
for example, to make the Euclidean length of the vector equal to one. In the
neural network (NN) literature, "normalizing" also often refers to rescaling
by the minimum and range of the vector, to make all the elements lie between
0 and 1.
### Standardizing
"Standardizing" a vector most often means subtracting a measure of location
and dividing by a measure of scale. For example, if the vector contains
random values with a Gaussian distribution, you might subtract the mean and
divide by the standard deviation, thereby obtaining a "standard normal"
random variable with mean 0 and standard deviation 1. 
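
The three definitions can be illustrated on a single small vector. This is only a minimal NumPy sketch; the vector `v` and all variable names are made up for illustration.

```python
import numpy as np

# Hypothetical example vector (temperatures in Celsius)
v = np.array([0.0, 25.0, 50.0, 100.0])

# Rescaling: multiply by a constant and add a constant (Celsius -> Fahrenheit)
v_fahrenheit = v * 9.0 / 5.0 + 32.0

# Normalizing (norm sense): divide by the Euclidean norm to get unit length
v_unit = v / np.linalg.norm(v)

# Normalizing (min/range sense): map all elements into [0, 1]
v_minmax = (v - v.min()) / (v.max() - v.min())

# Standardizing: subtract the mean and divide by the standard deviation
v_standard = (v - v.mean()) / v.std()

print(v_fahrenheit)
print(np.linalg.norm(v_unit))
print(v_minmax)
print(v_standard.mean(), v_standard.std())
```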

# Preprocessing data
## 1. Standardization, or mean removal and variance scaling
**Standardization** of datasets is **a common requirement for many machine learning estimators** implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with **zero mean and unit variance**.

In [2]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

The function ```scale``` provides a quick and easy way to perform this operation on a single array-like dataset:

In [3]:
X_scaled = preprocessing.scale(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

Scaled data has zero mean and unit variance:

In [4]:
print("Mean: ",X_scaled.mean(axis=0))
print("Std: ",X_scaled.std(axis=0))

Mean:  [0. 0. 0.]
Std:  [1. 1. 1.]


The preprocessing module further provides a utility class **StandardScaler** that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.

In [5]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [6]:
print("Column mean:       ", scaler.mean_ )
print("Column deviation:  ", scaler.scale_ )

Column mean:        [1.         0.         0.33333333]
Column deviation:   [0.81649658 0.81649658 1.24721913]


In [7]:
scaler.transform(X_train) 

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
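
The point of fitting a `StandardScaler` is that the statistics learned on the training set can be reused on data the scaler has never seen. The sketch below assumes a hypothetical new sample `X_new` (not part of the original example) and checks the result against a manual computation from `mean_` and `scale_`.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Hypothetical unseen sample, used only for illustration
X_new = np.array([[-1., 1., 0.]])

scaler = preprocessing.StandardScaler().fit(X_train)

# transform() reuses the training mean_ and scale_; nothing is re-estimated
X_new_scaled = scaler.transform(X_new)

# Same result computed by hand from the fitted attributes
X_new_manual = (X_new - scaler.mean_) / scaler.scale_

print(X_new_scaled)
print(np.allclose(X_new_scaled, X_new_manual))
```

Because the new sample is scaled with the training statistics, its values need not have zero mean or unit variance.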

### 1.1 Scaling features to a range
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between **zero** and **one**, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using ```MinMaxScaler``` or ```MaxAbsScaler```, respectively.

The **motivations** for using this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.

### 1.1.1 MinMaxScaler

In [8]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [9]:
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

When the fitted scaler is applied to new test data, the same scaling and shifting operations are used, so the result is consistent with the transformation performed on the training data:

In [10]:
X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
print(X_test_minmax)

[[-1.5         0.          1.66666667]]


It is possible to inspect the scaler attributes to find out the exact nature of the transformation learned on the training data:

In [11]:
print("Scale (Per feature relative scaling of the data):", min_max_scaler.scale_)
print("Min   (Per feature adjustment for minimum):      ", min_max_scaler.min_ )

Scale (Per feature relative scaling of the data): [0.5        0.5        0.33333333]
Min   (Per feature adjustment for minimum):       [0.         0.5        0.33333333]


```X_scaled = X*scale + min```

In [12]:
print(-3.*min_max_scaler.scale_[0] + min_max_scaler.min_[0])
print(-1.*min_max_scaler.scale_[1] + min_max_scaler.min_[1])
print(4.*min_max_scaler.scale_[2] + min_max_scaler.min_[2])

-1.5
0.0
1.6666666666666665


#### Step by step
If ```MinMaxScaler``` is given an explicit feature_range=(min, max) the full formula is:    
```Python
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
```

In [13]:
X = X_train
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_std)

[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
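
To see the second line of the formula above in action, the sketch below compares a manual computation with `MinMaxScaler(feature_range=...)`. The target range `(-1, 1)` and the variable names `range_min` and `range_max` are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Target range chosen for illustration only
range_min, range_max = -1.0, 1.0

# Manual computation following the two-step formula
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_manual = X_std * (range_max - range_min) + range_min

# Same computation done by MinMaxScaler
scaler = preprocessing.MinMaxScaler(feature_range=(range_min, range_max))
X_sklearn = scaler.fit_transform(X)

print(X_sklearn)
print(np.allclose(X_manual, X_sklearn))
```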


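#### Back to original data
The scaling can be inverted using the per-feature minimum and maximum of the original data:

In [14]:
print("Max: ", X.max(axis=0))
print("Min: ", X.min(axis=0))

Max:  [2. 1. 2.]
Min:  [ 0. -1. -1.]


In [15]:
X_scaled = X_std * (X.max(axis=0) - X.min(axis=0)) + X.min(axis=0)
print(X_scaled)

[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]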


### 1.1.2 MaxAbsScaler
```MaxAbsScaler``` works in a very similar fashion, but scales in a way that the training data lies within the range ```[-1, 1]``` by dividing by the largest absolute value in each feature. 
It is meant for data that is already centered at zero or sparse data.

In [16]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [17]:
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs 

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [18]:
X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs   

array([[-1.5, -1. ,  2. ]])

In [19]:
print("Scale (Per feature relative scaling of the data):", max_abs_scaler.scale_)

Scale (Per feature relative scaling of the data): [2. 1. 2.]


### 1.1.3 Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.      
```MaxAbsScaler``` and ```maxabs_scale``` were specifically designed for scaling sparse data, and are the recommended way to go about this. 
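
As a rough illustration of why this preserves sparsity, the sketch below scales a small SciPy sparse matrix. The matrix values are made up for the example.

```python
import numpy as np
import scipy.sparse as sp
from sklearn import preprocessing

# Small sparse matrix made up for illustration
X_sparse = sp.csr_matrix(np.array([[1., 0., 2.],
                                   [0., 0., -4.],
                                   [0., 5., 0.]]))

# MaxAbsScaler only divides each column by its largest absolute value,
# so zero entries stay zero and the result is still a sparse matrix
X_sparse_scaled = preprocessing.MaxAbsScaler().fit_transform(X_sparse)

print(sp.issparse(X_sparse_scaled))
print(X_sparse_scaled.toarray())
```

No centering is involved, so the number of stored non-zero entries is unchanged.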

### 1.1.4 Scaling data with outliers
If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well. In these cases, you can use ```robust_scale``` and ```RobustScaler``` as drop-in replacements instead. They use more robust estimates for the center and range of your data.
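
The difference can be sketched on a single feature with one extreme value. The data below is made up; with its default settings, `RobustScaler` centers on the median and scales by the interquartile range.

```python
import numpy as np
from sklearn import preprocessing

# One feature with a single extreme outlier (made-up data)
X_outlier = np.array([[1.], [2.], [3.], [4.], [1000.]])

# Mean/std scaling: the outlier inflates the standard deviation,
# squeezing the four ordinary points close together
X_standard = preprocessing.StandardScaler().fit_transform(X_outlier)

# RobustScaler (defaults): subtract the median, divide by the interquartile range
X_robust = preprocessing.RobustScaler().fit_transform(X_outlier)

print(X_standard.ravel())
print(X_robust.ravel())
```

With the robust scaler, the four ordinary points keep a usable spread, while the outlier remains far from them.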

## 2. Non-linear transformation
Like scalers, **QuantileTransformer** puts each feature into the same range or distribution. However, by performing a rank transformation, it smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.

QuantileTransformer and quantile_transform provide a non-parametric transformation based on the quantile function to map the data to a uniform distribution with values between 0 and 1:

In [20]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [21]:
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]) 

array([4.3, 5.1, 5.8, 6.5, 7.9])

It is also possible to map the transformed data to a normal distribution by setting ```output_distribution='normal'```:

In [22]:
quantile_transformer = preprocessing.QuantileTransformer(
    output_distribution='normal', random_state=0)
X_trans = quantile_transformer.fit_transform(X)
quantile_transformer.quantiles_ 

array([[4.3       , 2.        , 1.        , 0.1       ],
       [4.31491491, 2.02982983, 1.01491491, 0.1       ],
       [4.32982983, 2.05965966, 1.02982983, 0.1       ],
       ...,
       [7.84034034, 4.34034034, 6.84034034, 2.5       ],
       [7.87017017, 4.37017017, 6.87017017, 2.5       ],
       [7.9       , 4.4       , 6.9       , 2.5       ]])

Thus the **median** of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input’s minimum and maximum (corresponding to the 1e-7 and 1 - 1e-7 quantiles, respectively) do not become infinite under the transformation.
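
To check the uniform mapping described earlier in this section, one can look at the percentiles of a feature after the transformation. This is a minimal sketch; `n_quantiles=100` is an illustrative choice that stays below the number of training samples.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_quantiles=100 is an assumption made for this sketch
qt = preprocessing.QuantileTransformer(n_quantiles=100, random_state=0)
X_train_trans = qt.fit_transform(X_train)

# Percentiles of the first feature after the transformation:
# they should be spread roughly evenly between 0 and 1
print(np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100]))
```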

## 3. Normalization
**Normalization** is the process of **scaling individual samples to have unit norm**. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the basis of the **Vector Space Model** often used in text classification and clustering contexts.

The function ```normalize``` provides a quick and easy way to perform this operation on a single array-like dataset, either using the ```l1``` or ```l2``` norms:

In [23]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')

X_normalized 

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
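
A quick way to confirm what "unit norm" means here is to compute the norm of each row after the transformation. The sketch below also reproduces the ```l2``` result by hand; the variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

X_l2 = preprocessing.normalize(X, norm='l2')
X_l1 = preprocessing.normalize(X, norm='l1')

# Each row (sample) has unit length under the chosen norm
print(np.linalg.norm(X_l2, ord=2, axis=1))
print(np.abs(X_l1).sum(axis=1))

# Manual L2 normalization: divide each row by its Euclidean norm
X_l2_manual = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(X_l2, X_l2_manual))
```

Note that the operation works row by row (per sample), unlike the scalers above, which work column by column (per feature).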

## 4. Binarization
**Feature binarization** is the process of **thresholding numerical features to get boolean values**. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.

It is also common among the text processing community to use binary feature values (probably to simplify the probabilistic reasoning) even if normalized counts (a.k.a. term frequencies) or TF-IDF valued features often perform slightly better in practice.

In [24]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

Binarizer(copy=True, threshold=0.0)

In [25]:
binarizer.transform(X)

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

It is possible to adjust the threshold of the binarizer:

In [26]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 0.]])
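
The thresholding rule itself is simple: values strictly greater than the threshold map to 1, everything else to 0. A minimal NumPy sketch of the same operation, for comparison:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

X_bin = preprocessing.Binarizer(threshold=threshold).fit_transform(X)

# Values strictly greater than the threshold become 1, all others 0
X_bin_manual = (X > threshold).astype(float)

print(np.array_equal(X_bin, X_bin_manual))
```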

## 5. Encoding categorical features
Often features are not given as continuous values but as categorical ones. For example, a person could have features ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently coded as integers, for instance ["male", "from US", "uses Internet Explorer"] could be expressed as [0, 1, 3] while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1].

Such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

In [27]:
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  


enc.transform([[0, 1, 3]]).toarray()

array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])

By default, the number of values each feature can take is inferred automatically from the dataset. In the example above, there are two genders, three possible continents and four web browsers: in the result, the first two numbers encode the gender, the next three the continent and the last four the web browser. The number of values can also be specified explicitly using the parameter ```n_values``` (deprecated in scikit-learn 0.20 in favor of the ```categories``` parameter), which is useful when the training data does not contain every possible value:

In [28]:
enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
# Note that there are missing categorical values for the 2nd and 3rd
# features
enc.fit([[1, 2, 3], [0, 2, 0]])  


enc.transform([[1, 0, 0]]).toarray()

array([[0., 1., 1., 0., 0., 1., 0., 0., 0.]])
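
In scikit-learn releases where `n_values` is no longer available, the same encoding can be expressed by listing the categories of each feature explicitly. A minimal sketch, assuming scikit-learn 0.20 or later:

```python
from sklearn import preprocessing

# Explicit category lists: 2 genders, 3 continents, 4 browsers
enc = preprocessing.OneHotEncoder(
    categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])

# The training data does not contain every category, which is fine
# because the categories were declared up front
enc.fit([[1, 2, 3], [0, 2, 0]])

encoded = enc.transform([[1, 0, 0]]).toarray()
print(encoded)
```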

## 6. Imputation of missing values

In [29]:
import numpy as np
from sklearn.preprocessing import Imputer

In [30]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X)) 

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
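
`Imputer` was deprecated in scikit-learn 0.20 and later removed; in newer releases the corresponding class is `sklearn.impute.SimpleImputer`, which always imputes column by column. A minimal sketch of the same dense example, assuming scikit-learn 0.20 or later:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mean imputation, computed separately for each column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```

The missing entries are replaced by the column means learned during `fit` (4 for the first column, about 3.67 for the second).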


The Imputer class also supports sparse matrices:

In [31]:
import scipy.sparse as sp

In [32]:
X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(X)

X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
print(imp.transform(X_test))  

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]


## 7. Generating polynomial features

In [33]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

In [34]:
X = np.arange(6).reshape(3, 2)
X                                                                           

array([[0, 1],
       [2, 3],
       [4, 5]])

In [35]:
poly = PolynomialFeatures(2)
poly.fit_transform(X) 

array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

The features of X have been transformed from ($X_1, X_2$) to ($1, X_1, X_2, X_1^2, X_1X_2, X_2^2$).
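
The column order can be verified by building the degree-2 features by hand. This is only a sketch with illustrative variable names:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
x1, x2 = X[:, 0], X[:, 1]

# Build the columns by hand in the order 1, X1, X2, X1^2, X1*X2, X2^2
X_manual = np.column_stack(
    [np.ones(len(X)), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

X_poly = PolynomialFeatures(2).fit_transform(X)
print(np.array_equal(X_manual, X_poly))
```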

In some cases, only interaction terms among features are required; these can be obtained by setting ```interaction_only=True```:

In [36]:
X = np.arange(9).reshape(3, 3)
X                                                  

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [37]:
poly = PolynomialFeatures(degree=3, interaction_only=True)
poly.fit_transform(X)   

array([[  1.,   0.,   1.,   2.,   0.,   0.,   2.,   0.],
       [  1.,   3.,   4.,   5.,  12.,  15.,  20.,  60.],
       [  1.,   6.,   7.,   8.,  42.,  48.,  56., 336.]])

The features of X have been transformed from ($X_1, X_2, X_3$) to ($1, X_1, X_2, X_3, X_1X_2, X_1X_3, X_2X_3, X_1X_2X_3$).

## 8. Custom transformers

In [38]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

In [39]:
transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])
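
`FunctionTransformer` also accepts an `inverse_func`, which makes the transformation reversible. The sketch below pairs `np.log1p` with `np.expm1`; the round-trip check is added here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.expm1 is the inverse of np.log1p
transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X = np.array([[0, 1], [2, 3]])

X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)

print(X_log)
print(np.allclose(X_back, X))
```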

### Ref:
http://scikit-learn.org/stable/modules/preprocessing.html