Main Documentation


http://scikit-learn.org/stable/modules/preprocessing.html


The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.


In [2]:
"The function scale provides a quick and easy way to perform this operation on a single array-like dataset:"

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

X_scaled = preprocessing.scale(X_train)

X_scaled


array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

In [3]:
"Scaled data has zero mean and unit variance:"
"Note: standardized values are not confined to 0 <= x' <= 1; that bound applies to min-max scaling."

X_scaled.mean(axis=0)

array([0., 0., 0.])

In [4]:
X_scaled.std(axis=0)

array([1., 1., 1.])
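
As a rough cross-check, the standardization above can be reproduced by hand: subtract each column's mean and divide by its standard deviation. The sketch below assumes the population standard deviation (NumPy's default, `ddof=0`); the names `mean`, `std` and `X_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Column-wise mean and population standard deviation (ddof=0).
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Standardize each feature: x' = (x - mean) / std.
X_manual = (X_train - mean) / std

# Compare against scikit-learn's helper.
matches_sklearn = np.allclose(X_manual, preprocessing.scale(X_train))
print(matches_sklearn)
```

Some values fall outside [-1, 1], which is expected: standardization fixes the mean and variance of each feature, not its range.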

 Scaling features to a range
 ======================
 
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using <font color = 'green'>MinMaxScaler</font> or <font color='green'>MaxAbsScaler</font>, respectively.
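
The notebook only demonstrates `MinMaxScaler`, so here is a minimal sketch of `MaxAbsScaler` for comparison. It reuses the small `X_train` array from earlier; the variable names `max_abs_scaler` and `X_maxabs` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Divide each column by its largest absolute value,
# so every feature ends up in the range [-1, 1].
max_abs_scaler = MaxAbsScaler()
X_maxabs = max_abs_scaler.fit_transform(X_train)

print(X_maxabs)
```

Unlike `MinMaxScaler`, this does not shift the data, so zero entries stay zero.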



In [5]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

weights = np.array([[115.0], [140.0], [175.0]])

scaler = MinMaxScaler()

scaled_weights = scaler.fit_transform(weights)

scaled_weights



array([[0.        ],
       [0.41666667],
       [1.        ]])

The result above matches the min-max scaling formula from Lesson 10 on feature scaling.
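
For reference, the usual min-max formula is x' = (x - x_min) / (x_max - x_min). The sketch below applies it directly to the same weights, assuming this is the formula the lesson refers to; `manual_scaled` is an illustrative name.

```python
import numpy as np

weights = np.array([[115.0], [140.0], [175.0]])

# Per-feature minimum and maximum.
w_min = weights.min(axis=0)
w_max = weights.max(axis=0)

# Min-max scaling: x' = (x - x_min) / (x_max - x_min).
manual_scaled = (weights - w_min) / (w_max - w_min)

print(manual_scaled)
```

The middle value works out to (140 - 115) / (175 - 115) = 25 / 60, which is the 0.41666667 reported by `MinMaxScaler`.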

## Which of these algorithms are affected by feature scaling?

1> Decision Tree

2> SVM with RBF kernel

3> Linear regression

4> K-Means clustering

SVM with an RBF kernel and K-Means clustering are affected by feature rescaling.

<font color = 'red'> Why: </font>
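In general, algorithms that exploit distances or similarities (e.g. in the form of a scalar product) between data samples, such as k-NN, K-Means clustering and SVM, are sensitive to feature transformations. Decision trees split on one feature at a time, and ordinary (unregularized) linear regression only rescales its coefficients, so rescaling does not change their predictions.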
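
To make the distance argument concrete, here is a small sketch with hypothetical height (m) and weight (kg) data, invented purely for illustration. It finds the nearest neighbour of a query point by Euclidean distance, first on the raw features and then after min-max scaling; `nearest_index` is a helper defined here, not a scikit-learn function.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: columns are height (m) and weight (kg).
points = np.array([[1.20, 71.0],
                   [1.71, 75.0],
                   [1.90, 110.0]])
query = np.array([[1.70, 70.0]])


def nearest_index(candidates, q):
    """Return the row index of the candidate closest to q (Euclidean)."""
    distances = np.linalg.norm(candidates - q, axis=1)
    return int(np.argmin(distances))


# On raw features the weight column dominates the distance.
raw_nearest = nearest_index(points, query)

# After min-max scaling both features contribute on a comparable scale.
scaler = MinMaxScaler().fit(np.vstack([points, query]))
scaled_nearest = nearest_index(scaler.transform(points),
                               scaler.transform(query))

print(raw_nearest, scaled_nearest)
```

On the raw data the nearest point is the one with a similar weight but a very different height; after scaling, the nearest point is the one that is close in both features. Any method built on such distances can therefore change its output when features are rescaled.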

<font color = 'green'> More Detail: </font>
https://stats.stackexchange.com/questions/244507/what-algorithms-need-feature-scaling-beside-from-svm

![title](resources/algo_compare.png)
