# `Scikit-learn` Preprocessing data

In [2]:
from sklearn import preprocessing
import numpy as np

## Standardization

Standardization centers the data by removing the mean value of each feature, then scales it by dividing non-constant features by their standard deviation.

If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

**All of the methods below standardize column-wise, that is, per feature.**

### `scale()`
It provides a quick and easy way to perform the standardization on a single array-like dataset.

In [2]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

X_scaled = preprocessing.scale(X_train)
X_scaled

In [3]:
# Scaled data has zero mean and unit variance
X_scaled.mean(axis=0)
X_scaled.std(axis=0)

### `StandardScaler`
It stores the mean and standard deviation computed on a training set, so the same transformation can be reapplied to the testing set or to any new data. Setting `with_mean=False` or `with_std=False` disables centering or scaling, respectively.

Internal variables:  
- `mean_`  
- `scale_`  

Methods available:  
- `fit()`  
- `transform()`
- `fit_transform()`

In [5]:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler

In [9]:
scaler.mean_

In [12]:
scaler.scale_

In [14]:
scaler.transform(X_train)

In [15]:
# Reapply it to a new set
X_test = [[-1., 1., 0.]]
scaler.transform(X_test)
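
The effect of `with_mean` and `with_std` can be checked by hand. The sketch below is only an illustration: the names `X_demo`, `center_only` and `scale_only` are made up for this example, and the expected values are computed with plain NumPy.

```python
import numpy as np
from sklearn import preprocessing

# Same toy data as X_train above (X_demo is an illustrative name)
X_demo = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

# Centering only: subtract the column means, keep the original spread
center_only = preprocessing.StandardScaler(with_std=False).fit_transform(X_demo)

# Scaling only: divide by the column standard deviations, keep the original means
scale_only = preprocessing.StandardScaler(with_mean=False).fit_transform(X_demo)

print(np.allclose(center_only, X_demo - X_demo.mean(axis=0)))
print(np.allclose(scale_only, X_demo / X_demo.std(axis=0)))
```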

### `MinMaxScaler`
Scale features to lie between a given minimum and maximum value, often between 0 and 1 (the default range).

Parameters:  
- `feature_range=(min, max)`



The function `sklearn.preprocessing.minmax_scale()` can achieve the same goal.


In [17]:
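# As with StandardScaler, the fitted statistics (here the per-feature min and max) are stored and can be applied to other arrays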
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
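
As a rough check of what `MinMaxScaler` computes, the sketch below reproduces the default result by hand with `(X - min) / (max - min)` per column and then tries a different `feature_range`. The variable names are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

# Per-column min-max scaling computed by hand
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# Same result from the estimator with the default feature_range=(0, 1)
scaled_01 = preprocessing.MinMaxScaler().fit_transform(X_demo)

# A different target range, here [-1, 1]
scaled_pm1 = preprocessing.MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_demo)

print(np.allclose(manual, scaled_01))
print(scaled_pm1.min(axis=0), scaled_pm1.max(axis=0))
```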

### `MaxAbsScaler`

`MaxAbsScaler` scales each feature so that the training data lies within `[-1, 1]`, by dividing by the largest absolute value in that feature. It is meant for data that is **already centered at zero** or **sparse** data.

The function `sklearn.preprocessing.maxabs_scale()` can achieve the same goal.

In [21]:
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs
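
The following sketch checks the division by the per-column maximum absolute value by hand, and shows that new data is not guaranteed to stay inside `[-1, 1]`. The array `X_new` is a made-up example.

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

maxabs = preprocessing.MaxAbsScaler().fit(X_demo)
scaled = maxabs.transform(X_demo)

# Manual equivalent: divide each column by its largest absolute value
manual = X_demo / np.abs(X_demo).max(axis=0)

# Values larger than anything seen during fit land outside [-1, 1]
X_new = np.array([[-3., -1., 4.]])
scaled_new = maxabs.transform(X_new)

print(maxabs.max_abs_)   # [2. 1. 2.]
print(scaled_new)        # [[-1.5 -1.   2. ]]
```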

### Scaling sparse data
Centering sparse data would destroy the sparseness structure in the data, and is therefore rarely a sensible thing to do. `MaxAbsScaler`, which does not center, is specifically designed for scaling sparse data.

However, `scale()` and `StandardScaler` can also accept `scipy.sparse` matrices as input, as long as `with_mean=False` is explicitly passed (as a function argument for `scale()`, or to the constructor for `StandardScaler`).

`RobustScaler` cannot be fitted to sparse inputs, but you can use `transform()` on sparse inputs.
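
A minimal sketch of scaling a sparse matrix, assuming SciPy is available. The matrix `X_sparse` is invented for illustration, and the `try` block only demonstrates that centering a sparse input is rejected.

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., 0.],
                                       [0., 4., -1.]]))

# Scaling without centering keeps zeros as zeros, so the output stays sparse
X_std = preprocessing.StandardScaler(with_mean=False).fit_transform(X_sparse)
X_maxabs = preprocessing.MaxAbsScaler().fit_transform(X_sparse)

# Centering would densify the data, so the default with_mean=True is refused
try:
    preprocessing.StandardScaler().fit(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

print(sparse.issparse(X_std), sparse.issparse(X_maxabs), centering_rejected)
```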

#### Scaling data with outliers

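If the data contains many outliers, scaling with the mean and variance is not likely to work very well, because both statistics are themselves distorted by the outliers. In these cases, `robust_scale` and `RobustScaler` can be used as drop-in replacements. They center and scale using statistics that are more robust to outliers (the median and the interquartile range).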
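
To make the difference concrete, here is a small sketch with one extreme value in an otherwise tiny, invented column `X_out`. With `StandardScaler` the four ordinary points are squeezed into a very narrow interval, while `RobustScaler` (median and interquartile range by default) keeps them spread out.

```python
import numpy as np
from sklearn import preprocessing

# Four ordinary values and one extreme outlier
X_out = np.array([[1.], [2.], [3.], [4.], [1000.]])

std_scaled = preprocessing.StandardScaler().fit_transform(X_out)

robust = preprocessing.RobustScaler().fit(X_out)
robust_scaled = robust.transform(X_out)

# Spread of the four ordinary points after each scaling
std_spread = std_scaled[:4].max() - std_scaled[:4].min()
robust_spread = robust_scaled[:4].max() - robust_scaled[:4].min()

print(robust.center_, robust.scale_)   # median 3, IQR 2
print(std_spread, robust_spread)
```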

#### Scaling vs. Whitening
Centering and scaling each feature independently is sometimes not enough, because a downstream model may further assume that the features are linearly independent.

To address this issue `sklearn.decomposition.PCA` with `whiten=True` can be used to further remove the linear correlation across features.
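
A short sketch of whitening, using synthetic data generated here purely for illustration: two strongly correlated columns go in, and the whitened output has an (approximately) identity covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second column is almost a multiple of the first
rng = np.random.RandomState(0)
base = rng.normal(size=(200, 1))
X_corr = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(200, 1))])

corr_before = np.corrcoef(X_corr, rowvar=False)[0, 1]

# whiten=True rescales the principal components to unit variance
X_white = PCA(whiten=True).fit_transform(X_corr)
cov_after = np.cov(X_white, rowvar=False)

print(corr_before)
print(np.round(cov_after, 6))
```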

#### Kernel Matrices Centering

`KernelCenterer` can transform the kernel matrix so that it contains inner products in the feature space defined by $\phi$ followed by removal of the mean in that space.
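
For a linear kernel, $\phi$ is the identity, so centering the kernel matrix should match the Gram matrix of the mean-centered data. The sketch below checks this on a small invented array `X_k`.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer
from sklearn.metrics.pairwise import pairwise_kernels

X_k = np.array([[1., -2., 2.],
                [-2., 1., 3.],
                [4., 1., -2.]])

# Linear kernel: K[i, j] is the dot product of rows i and j
K = pairwise_kernels(X_k, metric='linear')
K_centered = KernelCenterer().fit_transform(K)

# Reference: center the data first, then take dot products
X_c = X_k - X_k.mean(axis=0)
K_expected = X_c @ X_c.T

print(np.allclose(K_centered, K_expected))
```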


## Non-linear transformation

Available transformations:
- Quantile transform
- Power transform

Both are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.

### Quantile transform

See the [Quantile transform](https://scikit-learn.org/stable/modules/preprocessing.html#non-linear-transformation) section of the scikit-learn user guide.

#### Mapping to a **Uniform distribution**

`QuantileTransformer` and `quantile_transform()` provide a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1.








In [None]:
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)

It is also possible to map data to a normal distribution using `QuantileTransformer` by setting `output_distribution='normal'`.

In [None]:
quantile_transformer = preprocessing.QuantileTransformer(output_distribution='normal', random_state=0)
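
The two cells above only construct the transformers. A possible end-to-end sketch, on synthetic log-normal data generated here for illustration (with `n_quantiles=100` chosen arbitrarily), looks like this:

```python
import numpy as np
from sklearn import preprocessing

# Synthetic, heavily right-skewed data
rng = np.random.RandomState(0)
X_skewed = rng.lognormal(size=(1000, 1))

# Map to a uniform distribution on [0, 1]
qt_uniform = preprocessing.QuantileTransformer(n_quantiles=100, random_state=0)
X_uniform = qt_uniform.fit_transform(X_skewed)

# Map to a standard normal distribution
qt_normal = preprocessing.QuantileTransformer(
    n_quantiles=100, output_distribution='normal', random_state=0)
X_normal = qt_normal.fit_transform(X_skewed)

print(X_uniform.min(), X_uniform.max(), np.median(X_uniform))
print(np.median(X_normal))
```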

### Power transform

Power transforms are a family of parametric transformations that aim to map data from any distribution to one as close to a **Gaussian distribution** as possible, in order to stabilize variance and minimize skewness.

#### Mapping to a **Gaussian distribution**

`PowerTransformer` currently provides two such power transformations, the **Yeo-Johnson** transform and the **Box-Cox** transform. See [here](https://scikit-learn.org/stable/modules/preprocessing.html#mapping-to-a-gaussian-distribution).

`Box-Cox` can only be applied to strictly positive data. In both methods, the transformation is parameterized by $\lambda$, which is determined through maximum likelihood estimation.


In [None]:
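# standardize=True (the default) additionally rescales the output to zero mean and unit variance;
# with standardize=False only the Box-Cox transform itself is applied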
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
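
As an illustration, the sketch below applies Box-Cox to synthetic log-normal data (strictly positive, as Box-Cox requires) and compares the skewness before and after. It uses `scipy.stats.skew` for the comparison, and all variable names are made up for this example.

```python
import numpy as np
from scipy import stats
from sklearn import preprocessing

# Synthetic strictly positive, right-skewed data
rng = np.random.RandomState(0)
X_lognormal = rng.lognormal(size=(500, 1))

# standardize=True also rescales the result to zero mean and unit variance
pt_demo = preprocessing.PowerTransformer(method='box-cox', standardize=True)
X_gauss = pt_demo.fit_transform(X_lognormal)

skew_before = stats.skew(X_lognormal.ravel())
skew_after = stats.skew(X_gauss.ravel())

print(pt_demo.lambdas_)         # fitted lambda, one value per feature
print(skew_before, skew_after)  # skewness should shrink toward 0
```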

## Normalization

It's the process of scaling individual samples to have **unit norm**. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

### `normalize()`
It provides a quick and easy way to perform this operation on a single array-like dataset, either using the `l1` or `l2` norms:

In [3]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

### `Normalizer`
`Normalizer` implements the same operation through the transformer API, so it can be used as a step in a `Pipeline`.

In [None]:
normalizer = preprocessing.Normalizer()
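
A brief sketch of `Normalizer` in use with both norms. The names `X_rows`, `X_l2` and `X_l1` are illustrative; note that normalization works on rows (samples), not on columns.

```python
import numpy as np
from sklearn import preprocessing

X_rows = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

# L2: each row is divided by its Euclidean length
X_l2 = preprocessing.Normalizer(norm='l2').fit_transform(X_rows)

# L1: each row is divided by the sum of its absolute values
X_l1 = preprocessing.Normalizer(norm='l1').fit_transform(X_rows)

print(np.linalg.norm(X_l2, axis=1))   # every row has unit L2 norm
print(X_l1[0])                        # [ 0.25 -0.25  0.5 ]
```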

## Encoding categorical features

### `OrdinalEncoder`
Categorical features are often given as strings rather than numbers. To convert them to integer codes, we can use the `OrdinalEncoder`. This estimator transforms each categorical feature into one new feature of integers (0 to n_categories - 1).


In [4]:
enc = preprocessing.OrdinalEncoder()

X_cat = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X_cat)  
enc.transform([['female', 'from US', 'uses Safari']])

array([[0., 1., 1.]])

### `OneHotEncoder`
Such an integer representation cannot be used directly with all scikit-learn estimators, because many of them expect continuous input and would interpret the categories as being ordered, which is often not desired.

Another possibility is to use a one-of-K, also known as **one-hot** or **dummy** encoding, using `OneHotEncoder`, which transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, with one of them 1, and all others 0.

In [6]:
enc = preprocessing.OneHotEncoder()
X_onehot = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X_onehot)  
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()

array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

It's possible to specify the categories explicitly using the parameter `categories`.

In [None]:
genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
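
Passing `categories` does not replace fitting; the encoder still has to be fitted before it can transform. A possible continuation, with the variable names `cat_enc` and `X_seen` chosen here for illustration:

```python
from sklearn import preprocessing

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

cat_enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])

# The training data only contains some of the listed categories
X_seen = [['male', 'from US', 'uses Safari'],
          ['female', 'from Europe', 'uses Firefox']]
cat_enc.fit(X_seen)

# 'from Asia' and 'uses Chrome' were never seen in X_seen, but they are
# valid because they appear in the explicit category lists
encoded_explicit = cat_enc.transform(
    [['female', 'from Asia', 'uses Chrome']]).toarray()
print(encoded_explicit)   # [[1. 0. 0. 1. 0. 0. 1. 0. 0. 0.]]
```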

If the data seen at transform time might contain categories that were not present during fitting, you can specify `handle_unknown='ignore'` instead of setting the categories manually as above. No error is raised, and the one-hot encoded columns for that feature are all zeros. Note that `handle_unknown='ignore'` is only supported for one-hot encoding.

In [None]:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
X_onehot_ignore = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X_onehot_ignore) 
enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

It's possible to encode each column into `n_categories - 1` columns instead of `n_categories` columns by using the `drop` parameter. This parameter allows the user to specify a category for each feature to be dropped.

This is useful to avoid collinearity in the input matrix for some estimators, for example non-regularized regression (`LinearRegression`), where collinearity would make the covariance matrix non-invertible. In the scikit-learn version these notes were written against, `handle_unknown` must be set to `'error'` when this parameter is not `None`; newer releases may relax this restriction.
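
A minimal sketch of `drop='first'`, which drops the first category (in sorted order) of every feature. The data and variable names are illustrative.

```python
from sklearn import preprocessing

X_drop = [['male', 'from US', 'uses Safari'],
          ['female', 'from Europe', 'uses Firefox']]

# Each feature has 2 categories, so each one is encoded in 2 - 1 = 1 column
drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X_drop)
encoded_drop = drop_enc.transform(X_drop).toarray()

# The dropped categories ('female', 'from Europe', 'uses Firefox')
# are represented by all zeros
print(encoded_drop)   # [[1. 1. 1.]
                      #  [0. 0. 0.]]
```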

## Discretization

Discretization (otherwise known as quantization or binning) partitions continuous features into discrete values, turning a dataset of continuous attributes into one with only nominal attributes.

One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, pre-processing with a discretizer can introduce nonlinearity to linear models.

### K-bins Discretization
`KBinsDiscretizer` discretizes features into `k` bins. It implements different binning strategies, selected with the `strategy` parameter:
- `uniform` uses constant-width bins
- `quantile` uses the quantile values to have equally populated bins in each feature
- `kmeans` defines bins based on a k-means clustering procedure performed on each feature independently.


In [None]:
est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal')
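
To see the binning in action, the sketch below fits a discretizer on a small invented array. It uses `strategy='uniform'` (rather than the default) because constant-width bin edges are easy to verify by hand.

```python
import numpy as np
from sklearn import preprocessing

X_bins = np.array([[-3., 5., 15.],
                   [0., 6., 14.],
                   [6., 3., 11.]])

# 3 bins for the first feature, 2 bins for each of the other two
kbins = preprocessing.KBinsDiscretizer(
    n_bins=[3, 2, 2], encode='ordinal', strategy='uniform').fit(X_bins)
binned = kbins.transform(X_bins)

# First feature: range [-3, 6] split into [-3, 0), [0, 3), [3, 6]
print(kbins.bin_edges_[0])   # [-3.  0.  3.  6.]
print(binned)                # [[0. 1. 1.]
                             #  [1. 1. 1.]
                             #  [2. 0. 0.]]
```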

### Feature binarization
Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.

`Binarizer` is used in this case. The `threshold` parameter sets the threshold $t$ (0 by default): values less than or equal to $t$ are replaced by 0, and values greater than $t$ are replaced by 1.

As with `StandardScaler` and `Normalizer`, the preprocessing module provides a companion function, `binarize()`, that achieves the same goal when the transformer API is not needed.

In [None]:
X_bi = [[ 1., -1.,  2.],
        [ 2.,  0.,  0.],
        [ 0.,  1., -1.]]
binarizer = preprocessing.Binarizer().fit(X_bi)  # fit does nothing
binarizer.transform(X_bi)
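
A short sketch with a non-default threshold, also showing the `binarize()` function form. The threshold value `1.1` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

# Only values strictly greater than 1.1 become 1
bin_estimator = preprocessing.Binarizer(threshold=1.1)
X_bin = bin_estimator.fit_transform(X_demo)

# Function form, without the transformer API
X_bin_func = preprocessing.binarize(X_demo, threshold=1.1)

print(X_bin)   # [[0. 0. 1.]
               #  [1. 0. 0.]
               #  [0. 0. 0.]]
```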

### Generating polynomial features

Given an input $(X_1, X_2, X_3, ..., X_n)$, we can transform it to polynomial features.

For example, from $(X_1, X_2)$ to $(1, X_1, X_2, X_1^2, X_1 X_2, X_2^2)$.

In [None]:
X_poly = np.arange(6).reshape(3, 2)
poly = preprocessing.PolynomialFeatures(2)  # degree=2; only `preprocessing` is imported above
poly.fit_transform(X_poly)

Alternatively, setting `interaction_only=True` keeps only the interaction terms and drops pure powers such as $X_i^2$ or $X_i^3$. For example, with `degree=3`, $(X_1, X_2, X_3)$ becomes $(1, X_1, X_2, X_3, X_1 X_2, X_1 X_3, X_2 X_3, X_1 X_2 X_3)$.

In [None]:
X_poly2 = np.arange(9).reshape(3, 3)
poly2 = preprocessing.PolynomialFeatures(degree=3, interaction_only=True)
poly2.fit_transform(X_poly2)

### Custom transformer

Often, you will want to convert an existing Python function into a transformer to assist in data cleaning or processing. You can implement a transformer from an arbitrary function with [`FunctionTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer).
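
As one possible example, the sketch below wraps `np.log1p` in a `FunctionTransformer` and supplies `np.expm1` as its inverse. The choice of function and the array `X_ft` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap an ordinary NumPy function so it can be used like any other transformer
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X_ft = np.array([[0., 1.],
                 [2., 3.]])
X_log = log_transformer.fit_transform(X_ft)

# inverse_func allows mapping the values back
X_back = log_transformer.inverse_transform(X_log)

print(X_log)
print(np.allclose(X_back, X_ft))
```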