# Normalisation

Normalisation means different things in different contexts, but in essence it means adjusting values measured on different scales to a notionally common scale.

There are several ways to normalise a dataset:
* Transforming data using a z-score or t-score.
* Rescaling data to have values between 0 and 1.
* Standardising residuals: ratios used in regression analysis can force residuals into the shape of a normal distribution.

People often confuse normalisation with standardisation, and the two terms are frequently used interchangeably.

The basic difference between standardisation and normalisation is:

* Normalisation rescales a variable so that its values lie between 0 and 1.
* Standardisation transforms the data to have a mean of 0 and a standard deviation of 1.
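The two definitions above can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper names `min_max_scale` and `z_score` and the sample values are assumptions made for this example, not part of the notebook.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on 0 and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

# Illustrative sepal lengths in cm (hypothetical sample values)
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]

scaled = min_max_scale(sepal_lengths)   # values now span exactly 0 to 1
standardised = z_score(sepal_lengths)   # mean 0, standard deviation 1

print(scaled)
print(standardised)
```

Both transformations are applied to one variable (one column) at a time, which is what makes different attributes comparable afterwards.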

The next question you might ask is: why do we need normalisation at all?

Many models do not account for the units of their inputs, so an attribute recorded in large numbers can dominate one recorded in small numbers. Normalisation puts the inputs on the same scale.

This raises a further question: does the variance of an attribute change with its units? After all, 1 kg and 2 kg differ by little numerically, whereas 1000 g and 2000 g differ by a lot.

The answer is yes: variance depends on the unit of measurement. One use of normalisation is therefore to bring attributes whose variance is very small onto a footing comparable with the other attributes.

If the variance of an attribute is much smaller than that of the other attributes, its influence on the prediction model shrinks, which can reduce accuracy.
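The kilogram and gram example can be checked directly. The sketch below is a small standard-library illustration; the variable names and the `min_max_scale` helper are assumptions made for this example.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The same two weights, recorded in kilograms and in grams
weights_kg = [1.0, 2.0]
weights_g = [w * 1000 for w in weights_kg]

# Variance scales with the square of the unit conversion factor (1000 ** 2)
var_kg = statistics.variance(weights_kg)
var_g = statistics.variance(weights_g)
print(var_kg, var_g)

# After rescaling, the choice of unit no longer matters
print(min_max_scale(weights_kg), min_max_scale(weights_g))
```

The variance in grams is a million times the variance in kilograms, yet both versions rescale to the same values, so the unit no longer affects how much weight the attribute carries.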


In [21]:
import numpy as np
import pandas as pd

# Load the Iris dataset and summarise each numeric column
data = pd.read_csv("Iris.csv")
data.describe()

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000


In [22]:
# Stack the four measurement columns into one array with a row per flower
Sepal_L = np.array(data["SepalLengthCm"])
Petal_L = np.array(data["PetalLengthCm"])
Sepal_W = np.array(data["SepalWidthCm"])
Petal_W = np.array(data["PetalWidthCm"])
to_be_normalized = np.stack((Sepal_L, Sepal_W, Petal_L, Petal_W), axis=1)

In [23]:
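from sklearn.preprocessing import normalize
# With its defaults (norm="l2", axis=1), normalize rescales each row (sample) to unit Euclidean length
normalized = normalize(to_be_normalized)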

In [24]:
# Wrap the result in a DataFrame so the columns keep their names
normalized_data = pd.DataFrame(normalized, columns=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"])
normalized_data

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0       0.803773      0.551609       0.220644      0.031521
1       0.828133      0.507020       0.236609      0.033801
2       0.805333      0.548312       0.222752      0.034269
3       0.800030      0.539151       0.260879      0.034784
4       0.790965      0.569495       0.221470      0.031639
5       0.784175      0.566349       0.246870      0.058087
6       0.780109      0.576603       0.237425      0.050877
7       0.802185      0.545486       0.240655      0.032087
8       0.806424      0.531507       0.256589      0.036656
9       0.818031      0.517530       0.250418      0.016695
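To see what `normalize` computes with its default settings, the sketch below reproduces one row by hand using only the standard library. The helper name `l2_normalize_row` is hypothetical, and the input assumes the first row of the standard Iris CSV (5.1, 3.5, 1.4, 0.2).

```python
import math

def l2_normalize_row(row):
    """Divide every value in a row by the row's Euclidean (L2) length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

# Assumed first Iris sample: sepal length, sepal width, petal length, petal width (cm)
first_sample = [5.1, 3.5, 1.4, 0.2]
unit_row = l2_normalize_row(first_sample)

print([round(v, 6) for v in unit_row])

# The squares of the rescaled values sum to 1, i.e. the row has unit length
print(sum(v * v for v in unit_row))
```

The result agrees with row 0 of the output above. It also shows that the rescaling works across the four measurements of a single flower, not down a column of 150 values.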


# Statistical information of real data

In [25]:
print(data.describe())

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000


# Statistical information of normalized data

In [26]:
print(normalized_data.describe())

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count     150.000000    150.000000     150.000000    150.000000
mean        0.751621      0.404780       0.454958      0.140965
std         0.044619      0.105087       0.159747      0.078136
min         0.653877      0.238392       0.167836      0.014727
25%         0.715261      0.326738       0.250925      0.048734
50%         0.754883      0.354371       0.536367      0.164148
75%         0.788419      0.525237       0.580025      0.197532
max         0.860939      0.607125       0.636981      0.280419


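As the summaries above show, the standard deviations of the attributes range from about 0.43 to 1.76 in the original dataset, but only from about 0.04 to 0.16 in the normalized dataset, so the absolute gap between them is much smaller. Note that `normalize` achieves this by rescaling each row (sample) to unit length rather than by rescaling each column, so the ratio between the largest and smallest standard deviation stays at roughly 4 to 1 in both cases. To give every attribute the same range or the same standard deviation, a column-wise method such as rescaling to between 0 and 1 or standardisation is needed.

Every value now also lies between 0 and 1, because each row has unit length and all the measurements are positive.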

# References

* http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization