# Robust_Scaler

Outliers are one of the worst nightmares haunting every data scientist. Outliers are data points that do not fit the regular structure of the dataset, and they often originate from errors in recording or interpreting the data. As the name suggests, they lie outside the bulk of the data, and in doing so they misrepresent how the data actually behaves.

The good news is that the preprocessing module of the sklearn library provides a predefined function for scaling data that contains outliers.

Normally, we try to get rid of outliers using the mean and standard deviation, by removing all data more than 2 or 3 standard deviations from the mean. But this approach has some serious problems that cannot be neglected.

* Firstly, when we use the mean this way, we assume that the distribution is normal: keeping only the data within 3 standard deviations of the mean retains about 99.73% of the values under a normal distribution.

* Secondly, the mean and standard deviation are themselves greatly affected by outliers.

* Thirdly, this method loses accuracy when applied to a small dataset.
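The second point can be checked with a small sketch. The sample values below are made up for illustration, and summarize is a hypothetical helper, not part of sklearn. It compares the mean and standard deviation with the median and IQR before and after one extreme value is added.

```python
import numpy as np

# Hypothetical sample: nine ordinary values, then the same values plus one extreme value.
values = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 16.0])
with_outlier = np.append(values, 500.0)

def summarize(x):
    """Return mean, standard deviation, median and IQR of a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return {"mean": x.mean(), "std": x.std(), "median": np.median(x), "iqr": q3 - q1}

clean = summarize(values)
dirty = summarize(with_outlier)

# Print both summaries side by side.
for key in clean:
    print(f"{key:>6}: clean={clean[key]:8.2f}  with outlier={dirty[key]:8.2f}")
```

In this sample the mean and standard deviation change drastically once the extreme value is added, while the median and IQR barely move.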

In [1]:
from sklearn.preprocessing import robust_scale

Robust_Scaler uses more robust statistics to scale your data.

The most important feature of Robust_Scaler is that it is based on percentiles, and because of that, it is not strongly affected by a few very large marginal values, also called outliers.
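To make the percentile idea concrete, here is a minimal sketch on a made-up single column. It assumes the default settings of robust_scale (centering on the median and scaling by the range between the 25th and 75th percentiles) and compares the result with the same computation done by hand.

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# Hypothetical single feature with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual computation: subtract the median, divide by the interquartile range.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# Library computation with default arguments.
scaled = robust_scale(X)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The extreme value is still present after scaling; it simply no longer determines the scale applied to the other values.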

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv(r"Iris.csv")
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [13]:
df_copy = df.copy() #copying because we should not modify the original DataFrame
Species = df_copy.pop("Species") #removed because scaling does not work on non numerical data
Id = df_copy.pop("Id") #Id is an identifier and should not be scaled

scaled = robust_scale(df_copy, with_centering=False)
#with_centering=False skips subtracting the median, so each column is only divided by its IQR
#(skipping the centering step is what keeps sparse data sparse)

df_copy = pd.DataFrame(scaled, columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])
#the column names could also be stored earlier in the analysis instead of being typed out here
df_copy.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,3.923077,7.0,0.4,0.133333
1,3.769231,6.0,0.4,0.133333
2,3.615385,6.4,0.371429,0.133333
3,3.538462,6.2,0.428571,0.133333
4,3.846154,7.2,0.4,0.133333


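Note that the robust_scale function does not remove outliers. It only rescales each feature using the median and the interquartile range, so that outliers have less influence on the scaling. A common definition of outliers, based on the same quartiles, is:

Outliers are values below Q1 - 1.5(Q3 - Q1) or above Q3 + 1.5(Q3 - Q1), or equivalently, values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.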
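The 1.5 IQR rule above can be sketched as a boolean mask. This is a standalone illustration: iqr_outlier_mask is a hypothetical helper name, the sample values are made up, and the code is separate from what robust_scale itself computes.

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Return True for values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Hypothetical measurements; 12.0 is planted as the extreme value.
sample = np.array([4.6, 4.7, 4.9, 5.0, 5.1, 5.4, 12.0])
mask = iqr_outlier_mask(sample)

print(sample[mask])   # values flagged by the rule
print(sample[~mask])  # values kept
```

Whether flagged values should be dropped, capped, or kept is a separate decision that depends on the dataset.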

Interquartile range (IQR) normalization forces the distributions to have the same values for the 25th and 75th percentiles (Geller et al., 2003)

When we use with_centering=True, which is the default, the data is also centered by subtracting the median of each feature.
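A short sketch of both statements, using randomly generated data as a stand-in (the two columns and their scales are assumptions for illustration only). After scaling with centering, each column should have a median of 0 and a distance of 1 between its 25th and 75th percentiles.

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 50.0], scale=[1.0, 10.0], size=(200, 2))

centered = robust_scale(X, with_centering=True)
uncentered = robust_scale(X, with_centering=False)

# With centering, every column has median 0.
print(np.median(centered, axis=0))

# In both cases, the 25th and 75th percentiles are exactly 1 apart.
q1, q3 = np.percentile(centered, [25, 75], axis=0)
print(q3 - q1)

# The two results differ only by a constant shift per column.
print(np.median(uncentered, axis=0))
```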

# References

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.robust_scale.html#sklearn.preprocessing.robust_scale