## Handy Notes on Feature Scaling

Machine learning models are often sensitive to the type and range of their input values. With the exception of tree-based models such as decision trees, most ML models expect the input features to be scaled. What is feature scaling?

Feature scaling simply means transforming the numerical features into a small, comparable range of values. In this notebook, we will look at the following scaling techniques:

1. Normalization
2. Standardization
3. Robust Scaling

The terms __Normalization__ and __Standardization__ are often used interchangeably, but they are quite different techniques suited to different purposes.

## 1. Normalization

Normalization is a scaling technique that transforms a numerical feature to the range between 0 and 1.

Normalization follows the formula below, where $X_{min}$ is the minimum value of feature $X$ and $X_{max}$ is its maximum value.

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
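
As a quick check of the formula above, here is a minimal sketch that normalizes a small array by hand with NumPy. The values are made up for illustration and do not come from the dataset used later in this notebook.

```python
import numpy as np

# Hypothetical feature values, chosen only to illustrate the formula.
X = np.array([10.0, 20.0, 25.0, 50.0])

# Min-max normalization: (X - Xmin) / (Xmax - Xmin)
X_norm = (X - X.min()) / (X.max() - X.min())

print(X_norm)  # values: 0.0, 0.25, 0.375, 1.0
```

The smallest value maps to 0, the largest maps to 1, and everything else falls in between.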

### When Should you Normalize the Features?

When you have features that have different ranges of values, normalizing these features can be a good practice.

For example, if you have two features with different ranges (say one varies from 1 to 100 and the other from 5 to 300), you will want to scale them so that they share the same range of values.

More specifically, normalization is the preferable scaling technique when the data does not follow a normal (Gaussian) distribution. If the data is Gaussian, standardization is preferable. If you don't know the distribution of the data, normalization is still a reasonable first choice.

With that said, when the ML algorithm of choice is a neural network or K-Nearest Neighbors (KNN), normalization is a good choice because these algorithms make no assumption about the distribution of the input data.

Most popular ML frameworks provide functions to normalize the numerical data.

For illustration purposes, I will use the tips dataset available in Seaborn.

In [1]:
import seaborn as snb

tips = snb.load_dataset('tips')

tips.head()

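| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |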


In [6]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [15]:
# Select all numerical features (float and integer columns) from the data above.

numerical_features_list = tips.select_dtypes(include='number').columns.tolist()

numerical_features = tips[numerical_features_list]

In [16]:
numerical_features

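| | total_bill | tip | size |
|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 |
| 1 | 10.34 | 1.66 | 3 |
| 2 | 21.01 | 3.50 | 3 |
| 3 | 23.68 | 3.31 | 2 |
| 4 | 24.59 | 3.61 | 4 |
| ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | 3 |
| 240 | 27.18 | 2.00 | 2 |
| 241 | 22.67 | 2.00 | 2 |
| 242 | 17.82 | 1.75 | 2 |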


For now, let's scale those numerical features with Scikit-Learn preprocessing functions. We will use `MinMaxScaler`, which scales the data to the range between 0 and 1 by default. If you want a different range, you can set it with the `feature_range` parameter.

In [17]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

numerical_scaled = scaler.fit_transform(numerical_features)

In [19]:
numerical_scaled[:5]

array([[0.29157939, 0.00111111, 0.2       ],
       [0.1522832 , 0.07333333, 0.4       ],
       [0.3757855 , 0.27777778, 0.4       ],
       [0.43171345, 0.25666667, 0.2       ],
       [0.45077503, 0.29      , 0.6       ]])

In [20]:
import pandas as pd

num_scaled_features = pd.DataFrame(numerical_scaled, columns=numerical_features_list)

In [21]:
num_scaled_features.head()

| | total_bill | tip | size |
|---|---|---|---|
| 0 | 0.291579 | 0.001111 | 0.2 |
| 1 | 0.152283 | 0.073333 | 0.4 |
| 2 | 0.375786 | 0.277778 | 0.4 |
| 3 | 0.431713 | 0.256667 | 0.2 |
| 4 | 0.450775 | 0.290000 | 0.6 |


As you can see above, all the values are now scaled to lie between 0 and 1.
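
If 0 to 1 is not the range you need, `MinMaxScaler` accepts a `feature_range` argument. The sketch below scales a tiny made-up feature to the range -1 to 1. The data and the name `custom_scaler` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, used only for illustration.
X = np.array([[1.0], [3.0], [5.0]])

# feature_range sets the lower and upper bounds of the scaled output.
custom_scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = custom_scaler.fit_transform(X)

print(X_scaled.ravel())  # values: -1.0, 0.0, 1.0
```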

## 2. Standardization

In standardization, the numerical features are rescaled to have zero mean ($\mu$) and unit standard deviation ($\sigma$).

Here is the formula for standardization, where $X_{std}$ is the standardized feature, $X$ is the original feature, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.

$$X_{std} = \frac{X - \mu}{\sigma}$$
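
To make the definition concrete, here is a minimal sketch that standardizes a small made-up array by hand (subtract the mean, divide by the standard deviation) and compares the result with Scikit-Learn's `StandardScaler`. The input values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values, chosen only to illustrate the formula.
X = np.array([[2.0], [4.0], [6.0], [8.0]])

# Manual standardization: subtract the mean, divide by the standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is the estimator StandardScaler uses as well.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
```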

### When Should you Standardize the Features?

When you know that the training data has a normal (Gaussian) distribution, standardization is the appropriate choice.

Some ML algorithms, such as Support Vector Machines (with an RBF kernel) and regularized linear models, assume that the input features are centered around zero with variances of the same order, which is what standardization provides.

In most cases, choosing normalization or standardization won't make much difference, but sometimes it does. So it makes sense to try both, especially if you are not sure about the distribution of the data.
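
One possible way to try both is to put each scaler in a pipeline and compare cross-validated scores. The sketch below is only an illustration: the synthetic dataset, the choice of `KNeighborsClassifier`, and the 5-fold setting are assumptions, not part of this notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic classification data, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scores = {}
for candidate in (MinMaxScaler(), StandardScaler()):
    # The pipeline refits the scaler on the training part of each fold.
    model = make_pipeline(candidate, KNeighborsClassifier())
    name = type(candidate).__name__
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping the scaler inside the pipeline means it is never fitted on the held-out fold, so the comparison stays fair.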

Here is how Standardization is implemented in Scikit-Learn.

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

num_std = scaler.fit_transform(numerical_features)

In [25]:
num_std[:5]

array([[-0.31471131, -1.43994695, -0.60019263],
       [-1.06323531, -0.96920534,  0.45338292],
       [ 0.1377799 ,  0.36335554,  0.45338292],
       [ 0.4383151 ,  0.22575414, -0.60019263],
       [ 0.5407447 ,  0.4430195 ,  1.50695847]])

In [29]:
# The mean of each feature in the original data, as learned by the scaler

scaler.mean_

array([19.78594262,  2.99827869,  2.56967213])

In [31]:
# The variance of each feature in the original data, as learned by the scaler

scaler.var_

array([78.92813149,  1.90660851,  0.9008835 ])

The scaled data has zero mean and unit variance.

In [35]:
import numpy as np

print(f"The mean of the scaled features is: {np.round(num_std.mean(axis=0))}")
print(f"The standard deviation of scaled features is: {np.round(num_std.std(axis=0))}")

The mean of the scaled features is: [-0.  0. -0.]
The standard deviation of scaled features is: [1. 1. 1.]


In [26]:
num_std_df = pd.DataFrame(num_std, columns=numerical_features_list)

In [27]:
num_std_df.head()

Unnamed: 0,total_bill,tip,size
0,-0.314711,-1.439947,-0.600193
1,-1.063235,-0.969205,0.453383
2,0.13778,0.363356,0.453383
3,0.438315,0.225754,-0.600193
4,0.540745,0.44302,1.506958


## 3. Robust Scaling

[Robust scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) is similar to standardization but is used when the data contains many outliers.

Instead of removing the mean, the robust scaler removes the median, and it scales the data by the Interquartile Range (IQR) instead of the standard deviation. The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
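
The description above can be sketched by hand with NumPy and checked against `RobustScaler` with its default settings. The small array below, including the extreme value 1000.0, is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme outlier (1000.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)

# Manual robust scaling: subtract the median, divide by the IQR.
X_manual = (X - median) / (q3 - q1)

X_sklearn = RobustScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_manual.ravel())  # values: -1.0, -0.5, 0.0, 0.5, 498.5
```

The median (3) and the IQR (2) are unaffected by how large the outlier is, so the four ordinary values keep a sensible spread while the outlier remains clearly separated.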

Like normalization and standardization, the robust scaler is easy to use in Scikit-Learn.

In [36]:
from sklearn.preprocessing import RobustScaler

robust_scaler = RobustScaler()

num_robust_scaled = robust_scaler.fit_transform(numerical_features)

In [37]:
num_robust_scaled[:5]

array([[-0.07467532, -1.2096    ,  0.        ],
       [-0.69155844, -0.7936    ,  1.        ],
       [ 0.29823748,  0.384     ,  1.        ],
       [ 0.54591837,  0.2624    ,  0.        ],
       [ 0.63033395,  0.4544    ,  2.        ]])

After scaling the data with the robust scaler, each feature has a median of zero.

In [38]:
print(f"The median of the scaled numerical features is: {np.round(np.median(num_robust_scaled,axis=0))}")

The median of the scaled numerical features is: [-0.  0.  0.]


__Scaling__ the input data before feeding it to a machine learning model is generally good practice.

Here are the key takeaways:

* __Scaling__ the features helps the model to converge faster.
* __Normalization__ is scaling the data to be between 0 and 1. It is preferred when the data does not have a normal distribution.
* __Standardization__ is scaling the data to have zero mean and unit standard deviation. It is preferred when the data has a normal (Gaussian) distribution.
* __Robust scaling__ is used if the data has many outliers.
* In most cases, the choice of scaling technique won't make much difference, but sometimes it does. __Try all of them__ and see what works best with your data.
* Only the features are scaled. The labels should not be scaled.
* Make sure not to fit the scaler on the test data. Only transform it.

        Don't: scaler.fit_transform(X_test)
        Do: scaler.transform(X_test)
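
The last point can be sketched end to end as follows. The data is synthetic and the name `split_scaler` is only illustrative. The idea is that the scaler learns its statistics from the training set and reuses them on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

split_scaler = StandardScaler()

# Learn the mean and standard deviation from the training set only...
X_train_scaled = split_scaler.fit_transform(X_train)

# ...and reuse those same statistics on the test set.
X_test_scaled = split_scaler.transform(X_test)
```

Because the test set is scaled with the training statistics, its mean and standard deviation will usually be close to, but not exactly, 0 and 1.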

## Further Learning

[https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)