# Preprocessing data
* http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> * 1 Standardization, or mean removal and variance scaling
> * 2 Normalization
> * 3 Binarization
> * 4 Encoding categorical features
> * 5 Imputation of missing values
> * 6 Generating polynomial features
> * 7 Custom transformers

In [23]:
import sklearn.gaussian_process 
dir(sklearn.gaussian_process )

['GaussianProcess',
 'GaussianProcessClassifier',
 'GaussianProcessRegressor',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'correlation_models',
 'gaussian_process',
 'gpc',
 'gpr',
 'kernels',
 'regression_models']

In [2]:
import sklearn.datasets
dir(sklearn.datasets)  # the import binds only the name `sklearn`, so the bare call dir(datasets) raised NameError

## 1 Standardization, or mean removal and variance scaling
* In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
* Standardization brings the data of each feature (each column) onto the same scale.
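
As a minimal sketch of what this transformation computes, the column-wise mean can be subtracted and the result divided by the column-wise standard deviation with plain NumPy. The array `X_demo` and the variable names are illustrative, and the population standard deviation (`ddof=0`, NumPy's default) is assumed.

```python
import numpy as np

# Illustrative data: rows are samples, columns are features.
X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

col_mean = X_demo.mean(axis=0)   # per-feature mean
col_std = X_demo.std(axis=0)     # per-feature population standard deviation

# Center each feature, then scale it to unit variance.
# Note: a constant feature would have col_std == 0 and needs special handling.
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled)
print(X_demo_scaled.mean(axis=0))
print(X_demo_scaled.std(axis=0))
```

After the transformation every column has zero mean and unit standard deviation, up to floating-point error.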

### 1.1 Standardizing with a common scale

In [5]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X)

print(X_scaled)    
print()
print(X_scaled.mean(axis=0))
X_scaled.std(axis=0)

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]

[ 0.  0.  0.]


array([ 1.,  1.,  1.])

* Use the mean and standard deviation of the training data to standardize the test data.

In [10]:
scaler = preprocessing.StandardScaler().fit(X)  # pass with_mean=False or with_std=False to disable centering or scaling

print(scaler)
print()
print(scaler.mean_) 
print()
print(scaler.scale_)                                       
scaler.transform(X) 

StandardScaler(copy=True, with_mean=True, with_std=True)

[ 1.          0.          0.33333333]

[ 0.81649658  0.81649658  1.24721913]


array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
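
The point of fitting a scaler object is that the statistics learned from the training data can be reused on unseen data. The sketch below assumes a hypothetical new sample `X_new` (not part of the original example) and checks that `transform` applies the stored `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1.,  2.],
                    [2.,  0.,  0.],
                    [0.,  1., -1.]])
X_new = np.array([[-1., 1., 0.]])  # hypothetical unseen sample

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the training statistics to the new sample.
X_new_scaled = scaler.transform(X_new)

# The same result, written out with the fitted attributes.
manual = (X_new - scaler.mean_) / scaler.scale_

print(X_new_scaled)
print(manual)
```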

### 1.2 Scaling features to a range, e.g. [0, 1]
> * The motivations for this kind of scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.

In [14]:
# MinMaxScaler
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()  # default range is [0, 1]; change it with feature_range=(min, max)
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)

X_test = np.array([[ -3., -1.,  4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

[[ 0.5         0.          1.        ]
 [ 1.          0.5         0.33333333]
 [ 0.          1.          0.        ]]


array([[-1.5       ,  0.        ,  1.66666667]])
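
The numbers above can be reproduced by hand. The following is a minimal NumPy sketch of the min-max formula, assuming the default target range [0, 1]; the names `lo`, `hi` and `min_max` are illustrative and not part of scikit-learn.

```python
import numpy as np

X_fit = np.array([[1., -1.,  2.],
                  [2.,  0.,  0.],
                  [0.,  1., -1.]])
X_new = np.array([[-3., -1., 4.]])

lo, hi = 0.0, 1.0             # target range, assumed to be the default [0, 1]
col_min = X_fit.min(axis=0)   # per-feature minimum of the training data
col_max = X_fit.max(axis=0)   # per-feature maximum of the training data

def min_max(X):
    """Rescale X with the minima and maxima learned from X_fit."""
    X_std = (X - col_min) / (col_max - col_min)
    return X_std * (hi - lo) + lo

print(min_max(X_fit))
print(min_max(X_new))
```

Because the minima and maxima come from the training data, test values outside the training range land outside [0, 1], as the -1.5 and 1.67 in the output show.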

In [17]:
# MaxAbsScaler: It is meant for data that is already centered at zero or sparse data.
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)
print()

X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
print(X_test_maxabs)                 
max_abs_scaler.scale_    

[[ 0.5 -1.   1. ]
 [ 1.   0.   0. ]
 [ 0.   1.  -0.5]]

[[-1.5 -1.   2. ]]


array([ 2.,  1.,  2.])

### 1.3 Scaling sparse data

* MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this. 
* However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor. Centering would turn the zero entries into non-zero values and destroy the sparsity.
* Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns format (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). 
* Any other sparse input will be converted to the Compressed Sparse Rows representation. 
* To avoid unnecessary memory copies, it is recommended to choose the CSR or CSC representation upstream.
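
A minimal sketch of both options on a small CSR matrix follows. The matrix values are made up for illustration; the only assumption beyond the text above is that the result stays sparse, which the code checks with `scipy.sparse.issparse`.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

dense = np.array([[1., 0.,  2.],
                  [0., 5., -4.],
                  [3., 0.,  0.]])
X_sparse = sp.csr_matrix(dense)  # CSR chosen upstream to avoid a conversion copy

# Option 1: divide each column by its maximum absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# Option 2: scale to unit variance without centering (with_mean=False is required).
X_unit_var = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(sp.issparse(X_maxabs), sp.issparse(X_unit_var))
print(X_maxabs.toarray())
print(X_unit_var.toarray())
```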

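### 1.4 Scaling data with outliers

* If the data contains many outliers, scaling with the mean and variance is unlikely to work well. In that case, use robust_scale and RobustScaler as drop-in replacements instead.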
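
A minimal sketch of `RobustScaler`, assuming its default settings (centering on the median and scaling by the interquartile range between the 25th and 75th percentiles). The single-feature data with one extreme value is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single extreme outlier.
X_out = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = RobustScaler().fit(X_out)
X_robust = robust.transform(X_out)

print(robust.center_)    # median of the feature
print(robust.scale_)     # interquartile range of the feature
print(X_robust.ravel())
```

The outlier is still extreme after scaling, but it no longer influences the center and scale that are applied to the remaining samples.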

* Scaling vs Whitening
> * It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features.
> * To address this issue you can use sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True to further remove the linear correlation across features.
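
A minimal sketch of whitening with `sklearn.decomposition.PCA`, assuming synthetic correlated data generated with a fixed seed (the mixing matrix is arbitrary). The covariance matrix of the whitened data should be close to the identity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
# Mix the two independent columns so that the features become correlated.
X_corr = latent @ np.array([[2.0, 1.0],
                            [0.0, 0.5]])

X_white = PCA(whiten=True).fit_transform(X_corr)

cov_before = np.cov(X_corr, rowvar=False)
cov_after = np.cov(X_white, rowvar=False)

print(np.round(cov_before, 3))  # clearly non-zero off-diagonal entries
print(np.round(cov_after, 3))   # approximately the identity matrix
```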

### 1.5 Centering kernel matrices
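
A minimal sketch of what this heading refers to, assuming a linear kernel so the result is easy to verify: `KernelCenterer` centers a precomputed kernel matrix, which for a linear kernel is equivalent to computing the kernel on mean-centered features. The data and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

K = X_demo @ X_demo.T                      # linear kernel (Gram matrix)
K_centered = KernelCenterer().fit_transform(K)

# Reference: center the features first, then compute the same kernel.
X_centered = X_demo - X_demo.mean(axis=0)
K_expected = X_centered @ X_centered.T

print(np.round(K_centered, 4))
print(np.allclose(K_centered, K_expected))
```

For non-linear kernels the feature map is usually not available explicitly, which is why the centering is done on the kernel matrix itself.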

## 2 Normalization
* Normalization is the process of scaling individual samples to have unit norm.
* Normalization brings the data of each sample (each row) onto the same scale.
* This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
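
As a minimal sketch of what L2 normalization computes, each row can be divided by its Euclidean norm with plain NumPy. The names are illustrative. After normalization, the dot product of two rows equals their cosine similarity.

```python
import numpy as np

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

# Euclidean norm of every row, kept as a column so it broadcasts over features.
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
X_demo_normalized = X_demo / row_norms

# Pairwise dot products of unit vectors are cosine similarities.
cosine = X_demo_normalized @ X_demo_normalized.T

print(X_demo_normalized)
print(np.round(cosine, 4))
```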

In [18]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')  # norm='l1' is also supported
X_normalized     

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

In [21]:
normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
print(normalizer)
print()
print(normalizer.transform(X))
normalizer.transform([[-1.,  1., 0.]]) 

Normalizer(copy=True, norm='l2')

[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]


array([[-0.70710678,  0.70710678,  0.        ]])

* Sparse input
> * normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
> * For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to efficient Cython routines. 
> * To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.

## 3 Binarization
* Feature binarization is the process of thresholding numerical features to get boolean values.
* This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution.
* For instance, this is the case for the sklearn.neural_network.BernoulliRBM.

In [23]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
print(binarizer)
binarizer.transform(X)

Binarizer(copy=True, threshold=0.0)


array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

In [24]:
binarizer = preprocessing.Binarizer(threshold=1.1)  # adjust the threshold
binarizer.transform(X)

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
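
Both outputs above follow one rule: values strictly greater than the threshold become 1, everything else becomes 0. A minimal NumPy sketch of that rule, with an illustrative helper named `binarize_demo`:

```python
import numpy as np

X_demo = np.array([[1., -1.,  2.],
                   [2.,  0.,  0.],
                   [0.,  1., -1.]])

def binarize_demo(X, threshold=0.0):
    """Return 1.0 where X > threshold and 0.0 elsewhere."""
    return (X > threshold).astype(float)

print(binarize_demo(X_demo))                 # threshold 0.0
print(binarize_demo(X_demo, threshold=1.1))  # threshold 1.1
```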

## 4 Encoding categorical features
* One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. 
* This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

In [26]:
enc = preprocessing.OneHotEncoder()
print(enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]))  
enc.transform([[0, 1, 3]]).toarray()

OneHotEncoder(categorical_features='all', dtype=<class 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)


array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
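
To see where the nine output columns come from, the encoding can be rebuilt by hand. This is a minimal NumPy sketch, not the library's implementation. The helper `one_hot_demo` is hypothetical, and the categories of each feature are assumed to be the sorted unique values seen in the fitted data.

```python
import numpy as np

X_fit = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of observed categories per feature (column):
# feature 0 -> [0, 1], feature 1 -> [0, 1, 2], feature 2 -> [0, 1, 2, 3]
categories = [np.unique(col) for col in X_fit.T]

def one_hot_demo(row, categories):
    """Concatenate one indicator block per feature."""
    blocks = [(cats == value).astype(float)
              for value, cats in zip(row, categories)]
    return np.concatenate(blocks)

encoded = one_hot_demo([0, 1, 3], categories)
print(encoded)
```

The first two columns encode the first feature, the next three the second feature, and the last four the third feature, with exactly one active column per block.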

* By default, how many values each feature can take is inferred automatically from the dataset. 
* It is possible to specify this explicitly using the parameter n_values. 
> * In the example below, the three features are taken to encode gender (two possible values), continent (three possible values) and web browser (four possible values).
> * We fit the estimator and then transform a data point.
> * In the result, the first two numbers encode the gender, the next three numbers the continent and the last four the web browser.

In [27]:
enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
# Note that there are missing categorical values for the 2nd and 3rd features
print(enc.fit([[1, 2, 3], [0, 2, 0]]))
enc.transform([[1, 0, 0]]).toarray()

OneHotEncoder(categorical_features='all', dtype=<class 'float'>,
       handle_unknown='error', n_values=[2, 3, 4], sparse=True)


array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])
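
The cell above relies on the `n_values` parameter of the scikit-learn release it was run on. In later releases that parameter was replaced by `categories`. The sketch below assumes such a release is installed and lists the allowed values of each feature explicitly.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Explicit categories per feature: 2 genders, 3 continents, 4 web browsers.
enc_demo = OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])

# Some categories of the 2nd and 3rd features never occur in the fitted data.
enc_demo.fit([[1, 2, 3], [0, 2, 0]])

encoded_row = enc_demo.transform([[1, 0, 0]]).toarray()
print(encoded_row)
```

Declaring the categories up front keeps the output width fixed at nine columns even though the fitted data does not contain every value.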

## 5 Imputation of missing values
* A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. 
* However, this comes at the price of losing data which may be valuable (even though incomplete). 
* A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.
> * The Imputer class provides basic strategies for imputing missing values, either using the mean, the median or the most frequent value of the row or column in which the missing values are located. 
> * This class also allows for different missing values encodings.

In [32]:
import numpy as np
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imp.fit([[1, 2], [np.nan, 3], [7, 6]]))
print()
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(X)
print()
print(imp.transform(X))  

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

[[nan, 2], [6, nan], [7, 6]]

[[ 4.          2.        ]
 [ 6.          3.66666667]
 [ 7.          6.        ]]
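
The mean strategy with `axis=0` can be reproduced by hand. This is a minimal NumPy sketch with illustrative names: column means are computed from the fitted data while ignoring NaN, and every NaN in the new data is replaced by the mean of its column.

```python
import numpy as np

X_fit = np.array([[1, 2], [np.nan, 3], [7, 6]], dtype=float)
X_new = np.array([[np.nan, 2], [6, np.nan], [7, 6]], dtype=float)

# Per-column mean of the fitted data, skipping missing entries.
col_means = np.nanmean(X_fit, axis=0)

# Replace each NaN in X_new with the mean of the corresponding column.
X_imputed = np.where(np.isnan(X_new), col_means, X_new)

print(col_means)
print(X_imputed)
```

As with the scalers, the statistics come from the fitted data, not from the data being transformed.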


In [33]:
import scipy.sparse as sp

X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
imp = Imputer(missing_values=0, strategy='mean', axis=0)
print(imp.fit(X))
print()
X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])  # sparse input: missing values are encoded by 0 and are thus implicitly stored in the matrix
print(X_test)
print()
print(imp.transform(X_test))   

Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)

  (1, 0)	6
  (2, 0)	7
  (0, 1)	2
  (2, 1)	6

[[ 4.          2.        ]
 [ 6.          3.66666675]
 [ 7.          6.        ]]


## 6 Generating polynomial features

In [35]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
print(X)
poly = PolynomialFeatures(2)
poly.fit_transform(X)

[[0 1]
 [2 3]
 [4 5]]


array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])

* The features of X have been transformed from (X_1, X_2) to (1, X_1, X_2, X_1^2, X_1X_2, X_2^2)
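
The degree-2 expansion can be written out column by column to confirm the ordering stated above. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

X_demo = np.arange(6).reshape(3, 2)
x1, x2 = X_demo[:, 0], X_demo[:, 1]

# Columns in the order (1, X_1, X_2, X_1^2, X_1 X_2, X_2^2).
X_poly = np.column_stack([
    np.ones(len(X_demo)),  # bias term
    x1,
    x2,
    x1 ** 2,
    x1 * x2,               # interaction term
    x2 ** 2,
])

print(X_poly)
```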

In [37]:
X = np.arange(6).reshape(3, 2)
print(X)                                           
poly = PolynomialFeatures(degree=3, interaction_only=True) # In some cases, only interaction terms among features are required
poly.fit_transform(X)   

[[0 1]
 [2 3]
 [4 5]]


array([[  1.,   0.,   1.,   0.],
       [  1.,   2.,   3.,   6.],
       [  1.,   4.,   5.,  20.]])

* The features of X have been transformed from (X_1, X_2) to (1, X_1, X_2, X_1X_2)

## 7 Custom transformers

In [39]:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(np.log1p) # log transformation
X = np.array([[0, 1], [2, 3]])
print(X)
transformer.transform(X)

[[0 1]
 [2 3]]


array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
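
A `FunctionTransformer` can also carry the inverse of its function. The sketch below assumes `np.expm1` as the inverse of `np.log1p` and checks that a round trip restores the input. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Pair the forward function with its mathematical inverse.
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_demo = np.array([[0., 1.], [2., 3.]])

X_log = log_tf.fit_transform(X_demo)       # log(1 + x), element-wise
X_back = log_tf.inverse_transform(X_log)   # exp(x) - 1, element-wise

print(X_log)
print(X_back)
```

Providing `inverse_func` is useful when predictions made in the transformed space have to be mapped back to the original units.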