# Preprocessing data
In general, learning algorithms benefit from standardization of the data set. If outliers are present in the set, robust scalers or transformers are more appropriate.

http://scikit-learn.org/stable/modules/preprocessing.html

# 4.3.1. Standardization, or mean removal and variance scaling
Standardization rescales a dataset so that each individual feature has zero mean and unit variance.

In [1]:
import numpy as np
from sklearn import preprocessing

Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale

In [2]:
?preprocessing.scale

In [3]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [4]:
X_scaled = preprocessing.scale(X=X_train, axis=0, with_mean=True, with_std=True, copy=True)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [5]:
X_scaled.mean(axis=0)

array([0., 0., 0.])

In [6]:
X_scaled.mean(axis=1)

array([ 0.03718711,  0.31916121, -0.35634832])

In [7]:
X_scaled.std(axis=0)

array([1., 1., 1.])

In [8]:
X_scaled.std(axis=1)

array([1.04587533, 0.64957343, 1.11980724])

- `axis=0` means standardize each feature/column
- `axis=1` means standardize each sample/row

In [9]:
X_scaled_row = preprocessing.scale(X=X_train, axis=1, with_mean=True, with_std=True, copy=True)
X_scaled_row

array([[ 0.26726124, -1.33630621,  1.06904497],
       [ 1.41421356, -0.70710678, -0.70710678],
       [ 0.        ,  1.22474487, -1.22474487]])
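The two `axis` settings can be reproduced by hand with NumPy, which makes explicit what `scale` computes. This is a minimal sketch, assuming the population standard deviation (`ddof=0`, NumPy's default); the variable names `X_manual_cols` and `X_manual_rows` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Column-wise (axis=0): subtract each feature's mean and divide by its
# population standard deviation (ddof=0).
X_manual_cols = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# Row-wise (axis=1): keepdims=True keeps the statistics as column vectors
# so that they broadcast across each row.
row_mean = X_train.mean(axis=1, keepdims=True)
row_std = X_train.std(axis=1, keepdims=True)
X_manual_rows = (X_train - row_mean) / row_std

# Both comparisons should print True.
print(np.allclose(X_manual_cols, preprocessing.scale(X_train, axis=0)))
print(np.allclose(X_manual_rows, preprocessing.scale(X_train, axis=1)))
```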

## StandardScaler
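`StandardScaler` computes the mean and standard deviation on a training set so that the same transformation can later be re-applied to a test set. This makes it suitable for use in the early steps of a `sklearn.pipeline.Pipeline`.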

In [10]:
?preprocessing.StandardScaler

In [11]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

In [12]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [13]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

In [14]:
scaler.transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

The fitted scaler instance can then be applied to new data, transforming it with the mean and scale learned from the training set.

In [15]:
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)

array([[-2.44948974,  1.22474487, -0.26726124]])
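Because the scaler stores statistics learned from the training data, it slots naturally into a pipeline. The sketch below is one possible arrangement, not the only one: the choice of `LogisticRegression` as the final estimator, the iris data, and the step names `"scale"` and `"clf"` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The scaler is fitted on the training split only; the pipeline then
# re-applies the same mean and scale when scoring the test split.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print(accuracy)
```

Fitting the scaler inside the pipeline keeps test-set statistics from leaking into the preprocessing step.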

## 4.3.1.1. Scaling features to a range
The motivations for using this scaling include robustness to very small standard deviations of features and the preservation of zero entries in sparse data.

In [16]:
# Scale a toy data matrix to the [0, 1] range
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])

The fitted transformer can then be applied to unseen test data. The same scaling and shifting operations are used, so the result is consistent with the transformation performed on the training data.

In [17]:
X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

array([[-1.5       ,  0.        ,  1.66666667]])

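Examine the scaler attributes. `scale_` is the per-feature multiplier and `min_` the per-feature offset, so that `X_scaled = X * scale_ + min_`: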

In [18]:
min_max_scaler.scale_

array([0.5       , 0.5       , 0.33333333])

In [19]:
min_max_scaler.min_

array([0.        , 0.5       , 0.33333333])

If `MinMaxScaler` is given an explicit `feature_range=(min, max)` the full formula is:

$ X_{std} = \frac{X - X.min(axis=0)}{X.max(axis=0) - X.min(axis=0)} $

$ X_{scaled} = X_{std} * (max - min) + min $
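The two formulas above can be checked numerically. The following is a small sketch that assumes a target range of `(-1, 1)`; the names `lo`, `hi` and `X_manual` are illustrative and not part of the scikit-learn API.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

lo, hi = -1.0, 1.0  # assumed feature_range for this illustration

# First formula: rescale each column to [0, 1].
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
# Second formula: stretch and shift to [lo, hi].
X_manual = X_std * (hi - lo) + lo

X_sklearn = preprocessing.MinMaxScaler(feature_range=(lo, hi)).fit_transform(X_train)
print(np.allclose(X_manual, X_sklearn))
```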

`MaxAbsScaler` works in a very similar fashion, but scales the training data so that it lies within the range `[-1, 1]` by dividing each feature by its largest absolute value. It is meant for data that is already centered at zero, or for sparse data.

In [20]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [21]:
X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs

array([[-1.5, -1. ,  2. ]])

In [22]:
max_abs_scaler.scale_

array([2., 1., 2.])
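The `scale_` attribute is simply the largest absolute value in each column, so the transformation can be sketched by hand as follows (the names `max_abs` and `X_manual` are illustrative):

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Largest absolute value per feature (column).
max_abs = np.abs(X_train).max(axis=0)
# Division only rescales; it does not shift, so zeros stay zeros.
X_manual = X_train / max_abs

X_sklearn = preprocessing.MaxAbsScaler().fit_transform(X_train)
print(np.allclose(X_manual, X_sklearn))
```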

## 4.3.1.2. Scaling sparse data

- http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data
- https://en.wikipedia.org/wiki/Sparse_matrix
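Centering sparse data would turn its zeros into non-zero values and destroy the sparse structure, so only scaling without a shift is appropriate. A minimal sketch, assuming a small hand-made `scipy.sparse` CSR matrix as input: `MaxAbsScaler` and `StandardScaler(with_mean=False)` both accept it and return a sparse result.

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

# Hypothetical toy matrix with mostly zero entries.
X_sparse = sparse.csr_matrix(np.array([[1., 0.,  2.],
                                       [0., 0.,  0.],
                                       [0., 4., -1.]]))

# MaxAbsScaler divides each column by its largest absolute value.
X_maxabs = preprocessing.MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler must skip centering (with_mean=False) on sparse input.
X_unit_var = preprocessing.StandardScaler(with_mean=False).fit_transform(X_sparse)

# The output is still sparse and no zero entry has been filled in.
print(sparse.issparse(X_maxabs), X_maxabs.nnz == X_sparse.nnz)
print(sparse.issparse(X_unit_var), X_unit_var.nnz == X_sparse.nnz)
```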

## 4.3.1.3. Scaling data with outliers

http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers
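As a rough illustration of why robust scalers help, the sketch below compares `RobustScaler` (which centers on the median and scales by the interquartile range by default) with `StandardScaler` on a hypothetical single feature containing one extreme value. The data is made up for the example.

```python
import numpy as np
from sklearn import preprocessing

# Four ordinary values and one outlier (100).
X = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = preprocessing.RobustScaler().fit(X)
X_robust = robust.transform(X)
X_standard = preprocessing.StandardScaler().fit_transform(X)

# Median and interquartile range are barely affected by the outlier.
print(robust.center_, robust.scale_)

# Spread of the four ordinary values after each kind of scaling:
# the outlier inflates the standard deviation, which squeezes the
# ordinary values together under StandardScaler.
print(np.ptp(X_robust[:4]), np.ptp(X_standard[:4]))
```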

## 4.3.1.4. Centering kernel matrices

- http://scikit-learn.org/stable/modules/preprocessing.html#centering-kernel-matrices
- https://en.wikipedia.org/wiki/Kernel_(linear_algebra)#Illustration

# 4.3.2. Non-linear transformation

Like scalers, `QuantileTransformer` puts each feature into the same range or distribution. However, by performing a rank transformation, it smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.

`QuantileTransformer` and `quantile_transform` provide a non-parametric transformation, based on the quantile function, that maps the data to a uniform distribution with values between 0 and 1.

In [23]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
quantile_transformer

QuantileTransformer(copy=True, ignore_implicit_zeros=False, n_quantiles=1000,
          output_distribution='uniform', random_state=0, subsample=100000)

In [24]:
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])

array([4.3, 5.1, 5.8, 6.5, 7.9])

This feature corresponds to the sepal length in cm. Once the quantile transformation has been applied, the landmarks closely approach the percentiles defined previously:

In [25]:
np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])

array([9.99999998e-08, 2.38738739e-01, 5.09009009e-01, 7.43243243e-01,
       9.99999900e-01])

In [26]:
np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])

array([4.4  , 5.125, 5.75 , 6.175, 7.3  ])

In [27]:
np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])

array([0.01351351, 0.25012513, 0.47972973, 0.6021021 , 0.94144144])

It is also possible to map the transformed data to a normal distribution by setting `output_distribution='normal'`:

In [28]:
quantile_transformer = preprocessing.QuantileTransformer(
    output_distribution='normal', random_state=0)

X_trans = quantile_transformer.fit_transform(X)
quantile_transformer.quantiles_

array([[4.3       , 2.        , 1.        , 0.1       ],
       [4.31491491, 2.02982983, 1.01491491, 0.1       ],
       [4.32982983, 2.05965966, 1.02982983, 0.1       ],
       ...,
       [7.84034034, 4.34034034, 6.84034034, 2.5       ],
       [7.87017017, 4.37017017, 6.87017017, 2.5       ],
       [7.9       , 4.4       , 6.9       , 2.5       ]])

Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input's minimum and maximum (corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively) do not become infinite under the transformation.
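Both claims can be checked on a small example. This is a sketch under stated assumptions: it uses a synthetic, skewed (log-normal) feature rather than the iris data, and sets `n_quantiles` equal to the number of samples; the name `X_skewed` is illustrative.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
X_skewed = rng.lognormal(size=(1000, 1))  # hypothetical skewed feature

qt = preprocessing.QuantileTransformer(
    n_quantiles=1000, output_distribution='normal', random_state=0)
X_trans = qt.fit_transform(X_skewed)

# The input median is mapped (up to floating-point error) to 0.
median_in = np.median(X_skewed)
median_out = qt.transform([[median_in]])[0, 0]
print(median_out)

# Thanks to clipping, the extremes are large in magnitude but finite.
print(X_trans.min(), X_trans.max())
```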

# 4.3.3. Normalization

Normalization is the process of scaling individual samples to have unit norm. It is useful if you want to quantify the similarity of any pair of samples.

`normalize` and `Normalizer` accept both dense array-like and sparse matrices from scipy.sparse as input.

For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix).

In [29]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized   

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [30]:
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
normalizer

Normalizer(copy=True, norm='l2')

In [31]:
normalizer.transform(X)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [32]:
normalizer.transform([[-1.,  1., 0.]])

array([[-0.70710678,  0.70710678,  0.        ]])
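For the `l2` norm, normalization amounts to dividing each row by its Euclidean length. The sketch below reproduces that by hand and shows one reason it helps with similarity: after normalization, the dot product of two samples equals their cosine similarity. The names `norms`, `X_manual` and `cosine` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Euclidean length of each row, kept as a column vector for broadcasting.
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / norms

X_normalized = preprocessing.normalize(X, norm='l2')
print(np.allclose(X_manual, X_normalized))

# Pairwise cosine similarities between all samples.
cosine = X_normalized @ X_normalized.T
print(cosine)
```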

# 4.3.4. Binarization

Feature binarization is the process of thresholding numerical features to get boolean values.

`binarize` and `Binarizer` accept both dense array-like and sparse matrices from scipy.sparse as input.

For sparse input the data is converted to the Compressed Sparse Rows representation.

In [33]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer

Binarizer(copy=True, threshold=0.0)

In [34]:
binarizer.transform(X)

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

It is possible to adjust the threshold of the binarizer. Values strictly greater than the threshold map to 1, all others to 0:

In [35]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 0.]])
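Thresholding of this kind is a one-line comparison in NumPy. A minimal sketch, with `X_manual` as an illustrative name:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
threshold = 1.1

# Strictly-greater-than comparison, cast from booleans to floats.
X_manual = (X > threshold).astype(float)

X_sklearn = preprocessing.Binarizer(threshold=threshold).fit_transform(X)
print(np.array_equal(X_manual, X_sklearn))
```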

# 4.3.5. Encoding categorical features

One way to represent categorical features for scikit-learn estimators is one-of-K or one-hot encoding, which is implemented in `OneHotEncoder`. This estimator transforms each categorical feature with `m` possible values into `m` binary features, with only one active.

For example a person could have features `["male", "female"]`, `["from Europe", "from US", "from Asia"]`, `["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]`. Such features can be efficiently coded as integers, for instance `["male", "from US", "uses Internet Explorer"]` could be expressed as `[0, 1, 3]` while `["female", "from Asia", "uses Chrome"]` would be `[1, 2, 1]`.

In [36]:
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

By default, how many values each feature can take is inferred automatically from the dataset. It is possible to specify this explicitly using the parameter `n_values`.

There are two genders, three possible continents and four web browsers in our dataset. Having fitted the estimator, we transform a data point. In the result, the first two numbers encode the gender, the next three the continent and the last four the web browser.

In [37]:
enc.transform([[0, 1, 3]]).toarray()

array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])

Note that if some categorical values might never appear in the training data, one has to set `n_values` explicitly.

In [38]:
enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
# Note that there are missing categorical values for the 2nd and 3rd features
enc.fit([[1, 2, 3], [0, 2, 0]]) 

OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values=[2, 3, 4], sparse=True)

In [39]:
enc.transform([[1, 0, 0]]).toarray()

array([[0., 1., 1., 0., 0., 1., 0., 0., 0.]])
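The cells above use the `n_values` parameter of the scikit-learn version this notebook was run with. In scikit-learn 0.20 and later the same intent is expressed with the `categories` parameter, which lists the allowed values of each feature. The sketch below assumes such a version is installed.

```python
from sklearn import preprocessing

# One list of allowed values per feature: 2 genders, 3 continents, 4 browsers.
enc = preprocessing.OneHotEncoder(
    categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])

# The training data does not contain every category of the 2nd and 3rd
# features, but the encoder still reserves a column for each of them.
enc.fit([[1, 2, 3], [0, 2, 0]])

encoded = enc.transform([[1, 0, 0]]).toarray()
print(encoded)
```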

# 4.3.6. Imputation of missing values

Replace missing values, encoded as `np.nan`, using the mean value of the columns (axis 0) that contain the missing values.

In [40]:
import numpy as np
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [41]:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
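`Imputer` belongs to the scikit-learn version this notebook was run with. From scikit-learn 0.20 onward, column-wise mean imputation is provided by `SimpleImputer` in `sklearn.impute`. The sketch below assumes such a version and repeats the example above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer always works column-wise, so there is no axis argument.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# Column means learned from the training data, ignoring NaN entries.
print(imp.statistics_)

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```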
