# **Feature Scaling Exercise Solution**

Data scientists often describe data processing as the toughest step in building a Machine Learning system,
because it requires both domain knowledge and mathematical transformations.

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).


# Feature Scaling

Using raw values as input features can bias models toward features with very high magnitudes.
Models such as linear or logistic regression are typically sensitive to the magnitude or
scale of features. Other models, such as tree-based methods, can still work without
feature scaling. However, it is still recommended to normalize and scale down the features with feature scaling,
especially if you want to try out multiple Machine Learning algorithms on the same input features.

In [2]:
#Import necessary dependencies and settings
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)

## Load sample RNASeq data

We have six genes with their RNASeq FPKMs (Fragments Per Kilobase of exon per Million reads). It is quite evident that some genes are
expressed far more than others, giving rise to values that differ widely in scale and magnitude.

In [3]:
fpkms = pd.DataFrame([1295., 25., 19000., 5., 1., 300.], columns=['fpkms'])
fpkms

index,fpkms
0,1295.0
1,25.0
2,19000.0
3,5.0
4,1.0
5,300.0


## Standardized Scaling

The standard scaler standardizes each value in a feature column by removing the mean and scaling
the variance to 1. This is also known as centering and scaling and can be denoted
mathematically as $SS(X_i) = \frac{X_i - \mu_X}{\sigma_X}$, where the mean $\mu_X$ of feature $X$ is subtracted from each value $X_i$ and the result is divided by the standard deviation $\sigma_X$. This is also known as $Z$-score scaling. We can also divide the result by the variance instead of the standard deviation if needed, although the scaled values then no longer have unit variance.

In [5]:
# Create a StandardScaler
ss = StandardScaler()
# Add a 'zscore' column
fpkms['zscore'] = ss.fit_transform(fpkms[['fpkms']])
fpkms

index,fpkms,zscore
0,1295.0,-0.307214
1,25.0,-0.489306
2,19000.0,2.231317
3,5.0,-0.492173
4,1.0,-0.492747
5,300.0,-0.449877


In [6]:
# We can manually use the formula to compute the same result
fw = np.array(fpkms['fpkms'])
(fw[0] - np.mean(fw)) / np.std(fw)

-0.30721413311687235
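
The manual check above covers a single value. As a minimal sketch (the variable names `values`, `manual`, `sample_based`, and `scaled` are illustrative and not part of the original exercise), the same formula can be applied to the whole column and compared with `StandardScaler`. The comparison relies on `StandardScaler` using the population standard deviation (`ddof=0`), which is also NumPy's default for `np.std`, whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`) and would give slightly different scores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same sample FPKM values as in the exercise
values = np.array([1295., 25., 19000., 5., 1., 300.])

# Manual z-score using the population standard deviation (ddof=0)
manual = (values - values.mean()) / values.std(ddof=0)

# For comparison only: z-score using the sample standard deviation (ddof=1)
sample_based = (values - values.mean()) / values.std(ddof=1)

# StandardScaler expects a 2D array of shape (n_samples, n_features)
scaled = StandardScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(np.allclose(manual, scaled))        # manual ddof=0 formula vs. scaler
print(np.allclose(sample_based, scaled))  # ddof=1 formula vs. scaler
print(np.isclose(manual.mean(), 0.0), np.isclose(manual.std(), 1.0))
```

After scaling, the column should have a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point error.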

## Min-Max Scaling
With min-max scaling, we can transform and scale our feature values such that each value is within the
range of [0, 1]. The min-max scaler can be represented as $MMS(X_i)=\frac{X_i - \min(X)}{\max(X) - \min(X)}$, where we scale each value in the feature $X$ by subtracting the minimum value in the feature, $\min(X)$, from it and dividing the result by the difference between the maximum and minimum values in the feature, $\max(X)-\min(X)$.

In [7]:
# Create a MinMaxScaler
mms = MinMaxScaler()
fpkms['minmax'] = mms.fit_transform(fpkms[['fpkms']])
fpkms

index,fpkms,zscore,minmax
0,1295.0,-0.307214,0.068109
1,25.0,-0.489306,0.001263
2,19000.0,2.231317,1.0
3,5.0,-0.492173,0.000211
4,1.0,-0.492747,0.0
5,300.0,-0.449877,0.015738


In [8]:
# We can manually use the formula to compute the same result
(fw[0] - np.min(fw)) / (np.max(fw) - np.min(fw))

0.06810884783409653
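
As a rough sketch, the same formula can be vectorized over the whole column and, if a range other than [0, 1] is wanted, stretched to an arbitrary interval $[a, b]$ via $a + MMS(X_i)\,(b - a)$. The names `values`, `manual`, `a`, and `b` are illustrative assumptions, and the target range [-1, 1] is chosen only as an example. `MinMaxScaler` exposes the same idea through its `feature_range` parameter.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same sample FPKM values as in the exercise
values = np.array([1295., 25., 19000., 5., 1., 300.])

# Manual min-max scaling of the whole column to [0, 1]
manual = (values - values.min()) / (values.max() - values.min())
scaled = MinMaxScaler().fit_transform(values.reshape(-1, 1)).ravel()
print(np.allclose(manual, scaled))

# Rescale to an arbitrary range [a, b]; [-1, 1] is only an illustration
a, b = -1.0, 1.0
manual_ab = a + manual * (b - a)
scaled_ab = (
    MinMaxScaler(feature_range=(a, b))
    .fit_transform(values.reshape(-1, 1))
    .ravel()
)
print(np.allclose(manual_ab, scaled_ab))
```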

## Robust Scaling
A disadvantage of min-max scaling is that outliers often distort the scaled values of a feature: a single
extreme value stretches the range and compresses all other values toward 0. Robust scaling instead uses statistics that are not dominated by
outliers. Mathematically this scaler can be represented as $RS(X_i) = \frac{X_i - median(X)}{IQR_{(1,3)}(X)}$, where we scale each value of feature $X$ by subtracting the median of $X$ and dividing the result by the IQR (Inter-Quartile Range) of $X$, which is the difference between the third quartile (75th percentile) and the first quartile (25th percentile).

In [9]:
# Create a RobustScaler
rs = RobustScaler()
fpkms['robust'] = rs.fit_transform(fpkms[['fpkms']])
fpkms

index,fpkms,zscore,minmax,robust
0,1295.0,-0.307214,0.068109,1.092883
1,25.0,-0.489306,0.001263,-0.13269
2,19000.0,2.231317,1.0,18.178528
3,5.0,-0.492173,0.000211,-0.15199
4,1.0,-0.492747,0.0,-0.15585
5,300.0,-0.449877,0.015738,0.13269
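
As with the other two scalers, the result can be reproduced by hand. The following is a minimal sketch, assuming `RobustScaler` is used with its default quantile range of the 25th to 75th percentile and that quartiles are computed with NumPy's default linear interpolation. The names `values`, `median`, `q1`, `q3`, and `iqr` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same sample FPKM values as in the exercise
values = np.array([1295., 25., 19000., 5., 1., 300.])

# Median and inter-quartile range (75th minus 25th percentile)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Manual robust scaling of the whole column
manual = (values - median) / iqr

# RobustScaler with default settings, for comparison
scaled = RobustScaler().fit_transform(values.reshape(-1, 1)).ravel()

print(median, iqr)
print(np.allclose(manual, scaled))
```

Note that the outlier itself (19000.0) still maps to a large scaled value. What robust scaling protects is the scaling statistics, so the remaining values are not compressed toward a single point as they are with min-max scaling.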
