# Standardization

Certain Machine Learning algorithms assume that all features have a roughly normal distribution, centered around zero and with a similar variance. However, that is rarely the case in real-world datasets. A feature with a significantly larger variance can overpower the others and prevent the model from learning from them.

Standardization transforms each feature by subtracting its mean value (u) and dividing the result by its standard deviation (s). As such, each feature ends up centered at zero with unit variance.

$\huge z = \frac{(x - u)}{s}$
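To make the formula concrete, here is a minimal NumPy sketch that applies it by hand to a single feature. The values of `x` are illustrative, and the standard deviation is the population one (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# One feature with three illustrative samples.
x = np.array([0.0, 1.0, 2.0])

u = x.mean()      # mean of the feature
s = x.std()       # population standard deviation (ddof=0)
z = (x - u) / s   # z = (x - u) / s, applied element-wise

print(z)
```

The resulting `z` has a mean of zero and a standard deviation of one.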

Standardization can be done using Sklearn's `StandardScaler` (doc [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)). In the cell below, standardize the data.



In [None]:
data = [[0,0],
        [1,1],
        [2,2]]
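One possible way to complete this cell, as a sketch rather than the only valid answer: fit a `StandardScaler` on the data and transform it in a single call. The variable names `scaler` and `scaled` are arbitrary choices.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[0, 0],
        [1, 1],
        [2, 2]]

# fit_transform learns the per-feature mean and standard deviation,
# then applies z = (x - u) / s to every column.
scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaler.mean_)   # per-feature means learned during fit
print(scaled)
```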

The fitted scaler is now stored in memory and can apply the same transformation to new data. Transform the new data to verify that it is scaled with the mean and standard deviation learned earlier.

In [1]:
new_data = [[1,1]]
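A sketch of the check, assuming the scaler was fitted on the earlier data (it is refitted here so the snippet runs on its own). Since `[1, 1]` equals the per-feature mean of the training data, it should map to zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[0, 0],
        [1, 1],
        [2, 2]]
new_data = [[1, 1]]

# Fit on the original data only; the learned statistics are reused below.
scaler = StandardScaler().fit(data)

# transform (not fit_transform) keeps the mean and std learned from `data`.
new_scaled = scaler.transform(new_data)
print(new_scaled)
```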

# Scaling to range

Another transformation option is to scale to a range. The advantage of this method is its robustness to very small standard deviations. Scaling by the maximum absolute value also preserves zero entries in sparse datasets. There are two ways to scale to a range in Sklearn:

- `MinMaxScaler` transforms to a chosen range - (doc [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html))

- `MaxAbsScaler` transforms to a range [-1,1] - (doc [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html))

Below, use `MinMaxScaler` to transform the data to a (-2,2) range.

In [2]:
data = [[0,0],
        [1,1],
        [2,2]]
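A possible solution sketch: `MinMaxScaler` takes the target range through its `feature_range` parameter. The variable names are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = [[0, 0],
        [1, 1],
        [2, 2]]

# Map each feature's minimum to -2 and its maximum to 2.
scaler = MinMaxScaler(feature_range=(-2, 2))
scaled = scaler.fit_transform(data)

print(scaled)
```

Each column's smallest value lands on -2, its largest on 2, and the midpoint on 0.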


`MaxAbsScaler` works in a similar fashion but transforms the data to the range [-1,1] by dividing each feature by its maximum absolute value. That transformation is better suited to data already centered at zero (standardized).

Below, standardize the data before scaling it with `MaxAbsScaler`.

In [3]:
from sklearn.preprocessing import MaxAbsScaler

data = [[-1,-1],
        [1,1],
        [3,3]]
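One way to chain the two steps, offered as a sketch: standardize first, then divide by the largest absolute value of the standardized feature. The intermediate names `standardized` and `scaled` are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

data = [[-1, -1],
        [1, 1],
        [3, 3]]

# Step 1: center each feature at zero with unit variance.
standardized = StandardScaler().fit_transform(data)

# Step 2: divide each feature by its maximum absolute value,
# which maps the centered data into [-1, 1].
scaled = MaxAbsScaler().fit_transform(standardized)

print(scaled)
```

Because the standardized data is symmetric around zero here, the extremes land exactly on -1 and 1.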

# Dealing with outliers

A basic approach to spotting outliers in a dataset is to graph a `boxplot` of your data, easily done with `pandas`, which draws the plot through `matplotlib`.

Check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) and graph a boxplot of the data.

In [4]:
data = [[1,1,3],
        [2,2,1],
        [3,10,2]]
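A minimal sketch of the plot, assuming `pandas` and `matplotlib` are installed. The column names `f0`, `f1`, `f2` are hypothetical labels added only so the boxes are easy to tell apart.

```python
import pandas as pd

data = [[1, 1, 3],
        [2, 2, 1],
        [3, 10, 2]]

# Column names are hypothetical labels, not part of the original data.
df = pd.DataFrame(data, columns=["f0", "f1", "f2"])

# DataFrame.boxplot draws one box per column and returns a matplotlib Axes.
ax = df.boxplot()
ax.set_ylabel("value")
```

The box for the second column is stretched upward by the value 10, which is the kind of distortion to look for.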


In the presence of outliers, the mean value of a feature can be extremely distorted and does not offer a robust base for standardizing.

In such a case, a more stable solution is to use the median value and to scale according to the Interquartile Range (IQR). If the sorted data were split into 4 quarters, the IQR would span the 2nd and 3rd quarters, i.e. the distance between the 25th and 75th percentiles. By excluding the outermost quarters (1st and 4th), the algorithm limits the influence of the outliers.

Sklearn's `RobustScaler` does just that! - (doc [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html))

Go ahead, scale the data according to the quantile range.
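A possible solution sketch using the default settings, which center on the median and scale by the IQR (25th to 75th percentile). It reuses the data from the boxplot cell; the variable names are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

data = [[1, 1, 3],
        [2, 2, 1],
        [3, 10, 2]]

# Defaults: subtract the median, divide by the IQR of each feature.
scaler = RobustScaler()
scaled = scaler.fit_transform(data)

print(scaler.center_)  # per-feature medians
print(scaler.scale_)   # per-feature IQRs
print(scaled)
```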

Do the same thing, but instead of using the default IQR, set the `quantile_range` parameter manually to exclude the extreme fifths of the dataset.
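A sketch of the manual range, reading "excluding the extreme fifths" as keeping the 20th to 80th percentiles. `quantile_range` is expressed in percent.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

data = [[1, 1, 3],
        [2, 2, 1],
        [3, 10, 2]]

# Scale by the distance between the 20th and 80th percentiles
# instead of the default 25th to 75th.
scaler = RobustScaler(quantile_range=(20.0, 80.0))
scaled = scaler.fit_transform(data)

print(scaler.scale_)   # per-feature 20th-to-80th percentile ranges
print(scaled)
```

The median is still used for centering; only the scaling range changes.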