## Zero Mean and Unit Variance
<br>Scaling data to zero mean and unit variance is an example of normalization. There are several advantages of normalization, many of which are interrelated:</br>
<br>Makes training less sensitive to the scale of features: Consider a regression problem where you are given features of an apartment and must predict its price. Say there are two features: the number of bedrooms and the area of the apartment. The number of bedrooms will typically be in the range 1–4, while the area will be in the range 100–200 m². If you model the task as linear regression, you want to solve for coefficients 𝑤₁ and 𝑤₂ corresponding to the number of bedrooms and the area. Because of the scale of the features, a small change in 𝑤₂ changes the prediction far more than the same change in 𝑤₁, to the point that setting 𝑤₂ correctly might dominate the optimization process.</br>
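
A tiny numeric sketch of this effect. The apartment values, weights, and the helper `predict` are made up for illustration and are not part of the original text:

```python
# Hypothetical apartment: 3 bedrooms, 150 m^2 (illustrative values only).
bedrooms, area = 3.0, 150.0

def predict(w1, w2):
    """Linear model without intercept: price = w1 * bedrooms + w2 * area."""
    return w1 * bedrooms + w2 * area

w1, w2 = 10.0, 2.0   # arbitrary starting weights
delta = 0.1          # the same small nudge, applied to each weight in turn

base = predict(w1, w2)
change_from_w1 = predict(w1 + delta, w2) - base   # equals delta * bedrooms
change_from_w2 = predict(w1, w2 + delta) - base   # equals delta * area

print("prediction change from nudging w1:", change_from_w1)
print("prediction change from nudging w2:", change_from_w2)
```

The same nudge moves the prediction by about 0.3 through 𝑤₁ but by about 15 through 𝑤₂, simply because the area values are roughly fifty times larger than the bedroom counts.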
<br>Regularization behaves differently for different scaling: Suppose you add ℓ₂ regularization to the problem above. ℓ₂ regularization pushes larger weights towards zero more strongly than smaller weights. Suppose you obtain some optimal values of 𝑤₁ and 𝑤₂ using your given unnormalized data matrix 𝑋. Now, if I change the unit of area from m² to ft², the corresponding column of 𝑋 gets multiplied by a factor of ~10. You would therefore expect the corresponding optimal coefficient 𝑤₂ to go down by a factor of 10 to maintain the value of 𝑦. But, as stated before, the ℓ₂ regularization now has a smaller effect because the coefficient is smaller, so you end up with a larger value of 𝑤₂ than you would have expected. This does not make sense: you did not change the information content of the data, and therefore your optimal coefficients should not have changed beyond the change of units.</br>
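
The unit-change argument can be checked numerically. The sketch below is deliberately simplified to a single feature (area only, no intercept) so that the ridge solution has a one-line closed form. The data, the penalty `lam`, and the helper `ridge_1d` are assumptions chosen for illustration:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for y ~ w * x (one feature, no intercept):
    w = sum(x * y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

areas_m2 = [100.0, 120.0, 150.0, 180.0, 200.0]
prices = [2.0 * a for a in areas_m2]        # hypothetical prices: 2 per m^2
areas_ft2 = [10.0 * a for a in areas_m2]    # ~10 ft^2 per m^2, as in the text
lam = 1000.0                                # arbitrary regularization strength

w_m2 = ridge_1d(areas_m2, prices, lam)
w_ft2 = ridge_1d(areas_ft2, prices, lam)

print("w (m^2):            ", w_m2)
print("w (m^2) / 10:       ", w_m2 / 10)    # what a pure unit change would give
print("w (ft^2), with ridge:", w_ft2)        # larger than w_m2 / 10
```

With `lam = 0.0` the two fits agree exactly up to the factor of 10; with a nonzero penalty the ft² coefficient is shrunk less, so the two models make different predictions from the same information.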
<br>Consistency for comparing results across models: As covered in the regularization point above, scaling of features affects performance. So if scientists developing a new method compare it against previous state-of-the-art methods while using more carefully chosen scaling for their own method, the results will not be reliable.</br>
<br>Makes optimization well-conditioned: Most machine learning optimization problems are solved using gradient descent or a variant thereof, and the speed of convergence depends on the scaling of features (or, more precisely, on the eigenvalues of 𝑋ᵀ𝑋, which explains why zero mean helps). Normalization makes the problem better conditioned, improving the convergence rate of gradient descent. I give an intuition of this using a simple example below.</br>
<br>Consider the simplest case where 𝐴 is a 2 × 2 diagonal matrix, say 𝐴 = diag([𝑎₁, 𝑎₂]). Then the contours of the objective function ‖𝐴𝑥 − 𝑏‖² will be axis-aligned ellipses as shown in the figure below:


Suppose you start at the point marked in red. Observe that to reach the optimal point, you need to take a very large step in the horizontal direction but a small step in the vertical direction. The descent direction is given by the green arrow. If you go along this direction, you will move a larger distance in the vertical direction and a smaller distance in the horizontal direction, which is the opposite of what you want to do!

If you take a small step along the gradient, covering the large horizontal distance to the optimal is going to take a large number of steps. If you take a large step along the gradient, you will overshoot the optimal in the vertical direction.

This behavior is due to the shape of the contours. The more circular the contours are, the faster you will converge to the optimal. The elongation of the ellipses is given by the ratio of the largest and the smallest eigenvalues of the matrix 𝐴. In general, the convergence of an optimization problem is measured by its condition number, which in this case is the ratio of the two extreme eigenvalues.

(Prasoon Goyal's answer to Why is the Speed Of Convergence of gradient descent depends on the maximal and minimal eigenvalues of A in solving AX=b through least squares.)
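
The diagonal example above can be tried numerically. The following is a minimal sketch, not taken from the cited answer. The function name `gd_steps`, the starting point (the origin), the target `b = (1, 1)`, the tolerance, and the fixed step size of one over the largest curvature are all assumptions made for illustration:

```python
def gd_steps(a1, a2, b=(1.0, 1.0), tol=1e-8, max_steps=100_000):
    """Count the gradient-descent steps needed to bring ||Ax - b||^2 below
    tol for A = diag(a1, a2), starting from the origin."""
    a = (a1, a2)
    x = [0.0, 0.0]
    # Fixed step size 1/L, where L = 2 * max(a_i^2) is the largest curvature.
    step = 1.0 / (2.0 * max(a1 * a1, a2 * a2))
    for k in range(max_steps):
        residual = [a[i] * x[i] - b[i] for i in range(2)]
        if residual[0] ** 2 + residual[1] ** 2 < tol:
            return k
        # Gradient of ||Ax - b||^2 for diagonal A: 2 * a_i * (a_i * x_i - b_i)
        grad = [2.0 * a[i] * residual[i] for i in range(2)]
        x = [x[i] - step * grad[i] for i in range(2)]
    return max_steps

steps_circular = gd_steps(1.0, 1.0)     # a1 = a2: circular contours
steps_elongated = gd_steps(10.0, 1.0)   # a1 >> a2: elongated contours

print("steps with circular contours: ", steps_circular)
print("steps with elongated contours:", steps_elongated)
```

With equal diagonal entries the step size suits both directions at once. When the entries differ, the step size has to stay small enough for the steep direction, so progress along the flat direction becomes very slow and many more steps are needed.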

Finally, I should mention that normalization does not always help as far as accuracy is concerned. Here's a simple example: consider a problem with only one feature, with variance 1. Now suppose I add a dummy feature with variance 0.01. If you regularize your model correctly, the solution will not change much because of this dummy dimension. But if you now normalize it to have unit variance, it might hurt the performance.</br>

### Notes on standard deviation and variance: 
![normal_distribution](https://upload.wikimedia.org/wikipedia/commons/8/8c/Standard_deviation_diagram.svg)
<br>In statistics, the standard deviation (SD, also represented by the lower case Greek letter sigma σ) is a measure that is used to quantify the amount of variation or spread of a set of data values.</br>
<br>A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.</br>
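
For reference, the quantities used in this section can be written in symbols. For values x₁, …, x_N, the mean, variance, and standard deviation, together with the standardized value z that results from scaling to zero mean and unit variance, are shown below. This assumes the population convention of dividing by N, which is also NumPy's default (`ddof=0`) for `np.var` and `np.std`:

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}, \qquad
z_i = \frac{x_i - \mu}{\sigma}
$$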

In [64]:
import numpy as np

data = np.random.rand(10,2)*10

print(data)
print("mean:",np.mean(data, axis = 0))
print("variance:",np.var(data, axis = 0))

[[9.13035522 2.95174588]
 [0.74890874 3.66603453]
 [1.93508603 8.93285109]
 [4.4791619  7.18034272]
 [1.57212201 1.84023272]
 [5.8401751  4.65987914]
 [0.21427841 7.80071103]
 [8.00363362 6.49762062]
 [7.26773536 6.87702282]
 [7.47464145 7.23799235]]
mean: [4.66660978 5.76444329]
variance: [9.93327524 4.90707711]


### One way of doing it: sklearn package

In [65]:
from sklearn import preprocessing

data_scaled = preprocessing.scale(data)

print(data_scaled)
print("scaled mean:",data_scaled.mean(axis = 0))
print("variance:",data_scaled.var(axis = 0))

[[ 1.41629325 -1.26973056]
 [-1.24303987 -0.94728061]
 [-0.86667994  1.43030821]
 [-0.05947498  0.63917674]
 [-0.9818441  -1.7714988 ]
 [ 0.37235829 -0.49863126]
 [-1.4126717   0.91922841]
 [ 1.05879791  0.33097683]
 [ 0.82530616  0.50224963]
 [ 0.89095499  0.66520141]]
scaled mean: [-3.44169138e-16 -1.88737914e-16]
variance: [1. 1.]


### Another way: Implement it!
Note: standard deviation equals the square root of variance

In [66]:
scaled_data = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled_data)
print("scaled mean:",scaled_data.mean(axis = 0))
print("variance:",scaled_data.var(axis = 0))

[[ 1.41629325 -1.26973056]
 [-1.24303987 -0.94728061]
 [-0.86667994  1.43030821]
 [-0.05947498  0.63917674]
 [-0.9818441  -1.7714988 ]
 [ 0.37235829 -0.49863126]
 [-1.4126717   0.91922841]
 [ 1.05879791  0.33097683]
 [ 0.82530616  0.50224963]
 [ 0.89095499  0.66520141]]
scaled mean: [-3.44169138e-16 -1.88737914e-16]
variance: [1. 1.]
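
One caveat with the one-line version: a column with zero variance leads to a division by zero. A minimal sketch of a guarded variant is shown below. The helper name `standardize`, the `eps` threshold, and the choice to leave constant columns centred but unscaled are assumptions for illustration, not part of the original notebook:

```python
import numpy as np

def standardize(data, eps=1e-12):
    """Return (scaled, mean, std) with per-column zero mean and unit variance.
    Columns whose std is below eps are centred but left unscaled."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)                      # population std (ddof=0)
    safe_std = np.where(std < eps, 1.0, std)    # avoid dividing by zero
    return (data - mean) / safe_std, mean, safe_std

rng = np.random.default_rng(0)                  # seeded for reproducibility
data = rng.random((10, 2)) * 10
data = np.column_stack([data, np.full(10, 3.0)])  # add a constant column

scaled, mean, std = standardize(data)
print("scaled mean:", scaled.mean(axis=0))
print("variance:", scaled.var(axis=0))
```

Returning `mean` and `std` also makes it possible to apply the same shift and scale to new data later, for example `(new_data - mean) / std`.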
