# Normalization

Normalizing data is crucial when using Machine Learning algorithms. Most of them are sensitive to un-normalized data, which can lead to unexpected results. For example, neighbors-based algorithms and KMeans rely on the p-distance during their learning phase, so a feature with a large numerical range dominates the others. Normalization is also a common first step before fitting a Linear Regression, since it puts the coefficients on a comparable scale and helps regularized or iterative solvers.

Un-normalized data can also make it harder for some ML algorithms to converge. Normalization is also a way to encode the data while preserving the shape of its distribution. When we know the estimators used to normalize the data (for example, the average and the standard deviation), we can easily un-normalize it and recover the original values.
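
As a minimal sketch of that round trip, the plain-Python example below normalizes a few values with a Z-Score and then restores them from the stored estimators. The helper names are hypothetical and are not part of Vertica ML Python, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def zscore_fit(values):
    """Compute the estimators needed for a Z-Score: average and sample std."""
    return mean(values), stdev(values)

def zscore_apply(values, avg, std):
    """Normalize: (x - avg) / std."""
    return [(x - avg) / std for x in values]

def zscore_invert(normalized, avg, std):
    """Un-normalize: z * std + avg, the inverse of zscore_apply."""
    return [z * std + avg for z in normalized]

# The first five ages shown in the Titanic preview of this lesson.
ages = [2.0, 30.0, 25.0, 39.0, 71.0]

avg, std = zscore_fit(ages)
normalized = zscore_apply(ages, avg, std)
restored = zscore_invert(normalized, avg, std)

print(normalized)
print(restored)
```

The restored values match the original ones up to floating-point rounding, which is only possible because `avg` and `std` were kept.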

There are 3 main normalization techniques:
 - <b>Z-Score</b> : We center and scale the feature values using the average and the standard deviation. This normalization is sensitive to outliers.
 - <b>Robust Z-Score</b> : We center and scale the feature values using the median and the median absolute deviation. This normalization is robust to outliers.
 - <b>Min-Max</b> : We rescale the feature values using a bijection onto [0,1]. The max maps to 1 and the min to 0. This normalization is sensitive to outliers, since the min and the max are themselves extreme values.
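
The three formulas can be sketched in plain Python to see how a single extreme value affects each of them. This is an illustration only, not the Vertica ML Python implementation: the function names and the sample fares are hypothetical, the sample standard deviation is an assumption, and the 1.4826 factor is the one listed in the `normalize` docstring shown later in this lesson.

```python
from statistics import mean, median, stdev

def zscore(values):
    """(x - avg) / std"""
    avg, std = mean(values), stdev(values)
    return [(x - avg) / std for x in values]

def robust_zscore(values):
    """(x - median) / (1.4826 * mad); assumes mad is not zero."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    return [(x - med) / (1.4826 * mad) for x in values]

def minmax(values):
    """(x - min) / (max - min); assumes max > min."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical fares: four ordinary values and one extreme value.
fares = [7.25, 8.05, 13.0, 26.0, 512.33]

print("zscore       :", zscore(fares))
print("robust_zscore:", robust_zscore(fares))
print("minmax       :", minmax(fares))
```

With these values, the extreme fare pushes the four ordinary fares below 0.04 in the Min-Max output, while the Robust Z-Score keeps them spread around 0.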

To see how to normalize data in Vertica ML Python, we will use the well-known 'Titanic' dataset.

In [1]:
from vertica_ml_python import *
vdf = vDataFrame("titanic")
print(vdf)

| | survived | boat | ticket | embarked | home.dest | sibsp | fare | sex | body | pclass | age | name | cabin | parch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | | 113781 | S | Montreal, PQ / Chesterville, ON | 1 | 151.55 | female | | 1 | 2.0 | Allison, Miss. Helen Loraine | C22 C26 | 2 |
| 1 | 0 | | 113781 | S | Montreal, PQ / Chesterville, ON | 1 | 151.55 | male | 135 | 1 | 30.0 | Allison, Mr. Hudson Joshua Creighton | C22 C26 | 2 |
| 2 | 0 | | 113781 | S | Montreal, PQ / Chesterville, ON | 1 | 151.55 | female | | 1 | 25.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | C22 C26 | 2 |
| 3 | 0 | | 112050 | S | Belfast, NI | 0 | 0.0 | male | | 1 | 39.0 | Andrews, Mr. Thomas Jr | A36 | 0 |
| 4 | 0 | | PC 17609 | C | Montevideo, Uruguay | 0 | 49.5042 | male | 22 | 1 | 71.0 | Artagaveytia, Mr. Ramon | | 0 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


<object>  Name: titanic, Number of rows: 1234, Number of columns: 14


Let's look at the 'fare' and 'age' of the passengers. 

In [2]:
vdf.select(["age", "fare"])

| | age | fare |
|---|---|---|
| 0 | 2.0 | 151.55 |
| 1 | 30.0 | 151.55 |
| 2 | 25.0 | 151.55 |
| 3 | 39.0 | 0.0 |
| 4 | 71.0 | 49.5042 |
| | ... | ... |


<object>  Name: titanic, Number of rows: 1234, Number of columns: 2

The two columns span very different numerical ranges, so it makes sense to normalize them. To normalize data in Vertica ML Python, we can use the 'normalize' method.

In [3]:
help(vdf["age"].normalize)

Help on method normalize in module vertica_ml_python.vcolumn:

normalize(method:str='zscore', by:list=[], return_trans:bool=False) method of vertica_ml_python.vcolumn.vColumn instance
    ---------------------------------------------------------------------------
    Normalizes the input vcolumns using the input method.
    
    Parameters
    ----------
    method: str, optional
            Method used to normalize.
                    zscore        : Normalization using the Z-Score (avg and std).
                            (x - avg) / std
                    robust_zscore : Normalization using the Robust Z-Score (median and mad).
                            (x - median) / (1.4826 * mad)
                    minmax        : Normalization using the MinMax (min and max).
                            (x - min) / (max - min)
    by: list, optional
            vcolumns used in the partition.
    return_trans: bool, optimal
            If set to True, the method will return the transformatio
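
The `by` parameter listed in the help output above normalizes within partitions, so the estimators are computed per group rather than over the whole column. Based on that signature, the call would presumably look like `vdf["fare"].normalize(method = "minmax", by = ["pclass"])`. The sketch below shows the idea in plain Python; the helper `minmax_by` is hypothetical, and the two third-class fares are made-up values.

```python
from collections import defaultdict

def minmax_by(rows, value_key, by_key):
    """Min-Max normalize rows[value_key] separately within each by_key group."""
    # Collect the values of each partition.
    groups = defaultdict(list)
    for row in rows:
        groups[row[by_key]].append(row[value_key])
    # Compute the estimators (min and max) once per partition.
    bounds = {g: (min(v), max(v)) for g, v in groups.items()}
    scaled = []
    for row in rows:
        lo, hi = bounds[row[by_key]]
        # A constant partition has no range, so map it to 0.0 by convention.
        scaled.append((row[value_key] - lo) / (hi - lo) if hi > lo else 0.0)
    return scaled

passengers = [
    {"pclass": 1, "fare": 151.55},
    {"pclass": 1, "fare": 49.5042},
    {"pclass": 1, "fare": 0.0},
    {"pclass": 3, "fare": 7.25},   # hypothetical value
    {"pclass": 3, "fare": 8.05},   # hypothetical value
]

print(minmax_by(passengers, "fare", "pclass"))
```

Each class is rescaled to [0,1] on its own, so the most expensive third-class fare reaches 1 even though it is far below the first-class fares.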

The 3 main normalization techniques are available. Let's normalize the 'fare' and the 'age' using the 'MinMax' method.

In [4]:
vdf["age"].normalize(method = "minmax")
vdf["fare"].normalize(method = "minmax")
vdf.select(["age", "fare"])

| | age | fare |
|---|---|---|
| 0 | 0.020961466047446 | 0.295805899800363 |
| 1 | 0.37241119618426 | 0.295805899800363 |
| 2 | 0.309652315802686 | 0.295805899800363 |
| 3 | 0.485377180871093 | 0.0 |
| 4 | 0.887034015313167 | 0.096625763278767 |
| | ... | ... |


<object>  Name: titanic, Number of rows: 1234, Number of columns: 2

Both features now lie in [0,1]. It is also possible to normalize within a specific partition by using the 'by' parameter. Another technique is just as important as Normalization: Machine Learning algorithms are sensitive to the number of features and, sometimes, to correlated predictors. Decomposition is the main topic of our next lesson.