# Regularization
 A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.

Regularization may therefore increase the training error while reducing the generalization error, so more epochs may be needed to reach the desired result. In short, regularization helps to reduce overfitting of the model.

Many regularization techniques are in use. Some put an extra term in the objective function and some put an extra constraint on the model.

 1. L1/L2 regularizers
 2. DropOut
 3. Data Augmentation
 4. Label Smoothing

## L1/L2 Regularizers
L1 and L2 regularizers are sometimes known as weight decay.

L1 regularization works by adding the L1 norm of the weights to the cost function.

L2 regularization works by adding the squared L2 norm of the weights to the cost function.

The idea behind both norms is that smaller weights tend to make the model generalize better, so both of them perform some kind of weight decay.



### L2 regularization 
$$
C = C_0 + \frac{\lambda}{2n}\sum_w w^2
$$

Here C₀ is the original, unregularized cost (any loss function), λ is the regularization parameter, n is the size of the training set, and w ranges over the weights. We add the sum of the squares of all the weights to the cost function, scaled by λ/2n, where λ > 0.

The intuition behind L2 regularization is that large weights increase the cost function, so the network prefers to learn small weights. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of λ: when λ is small we prefer to minimize the original cost function, but when λ is large we prefer small weights.
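The trade-off controlled by λ can be sketched in a few lines of plain Python. This is only an illustration: the function name `l2_regularized_cost` and the numbers for the weights, the base loss, and n are made up for the example, not taken from any particular model.

```python
def l2_regularized_cost(base_loss, weights, lam, n):
    """Return base_loss + (lam / (2 * n)) * sum of squared weights."""
    penalty = (lam / (2 * n)) * sum(w * w for w in weights)
    return base_loss + penalty


# Hypothetical values, chosen only for illustration.
weights = [0.5, -1.0, 2.0]
base_loss = 0.30   # stands in for the unregularized loss
n = 100            # stands in for the training-set size

cost_small_lambda = l2_regularized_cost(base_loss, weights, lam=0.1, n=n)
cost_large_lambda = l2_regularized_cost(base_loss, weights, lam=10.0, n=n)

# With a small lambda the penalty barely changes the cost; with a large
# lambda the same weights become much more expensive.
print(cost_small_lambda, cost_large_lambda)
```

With the same weights, the larger λ makes the penalty a much bigger share of the total cost, which is what pushes the optimizer toward smaller weights.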

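The weight update rule during backpropagation becomes
$$
w = w - \eta \frac{\partial C_0}{\partial w} - \frac {\eta \lambda} {n} w
$$

which can be rearranged as
$$
w = \left( 1 - \frac{\eta \lambda } {n} \right) w - \eta \frac{\partial C_0}{\partial w} 
$$

where C₀ is the unregularized loss, i.e. the first term of C, and η is the learning rate.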

Here 
$$
\left( 1 - \frac{\eta \lambda } {n} \right)
$$
is the rescaling factor for the weights, also called the weight decay factor. When λ is very small this factor is close to 1 and large weights are tolerated. When λ is large, the weights are shrunk more strongly at every step.
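As a quick check, the two forms of the update rule above can be written side by side. This is a minimal sketch for a single scalar weight; the function names and the numeric values are assumptions made for the example, and `grad` stands for the gradient of the unregularized loss.

```python
def l2_step(w, grad, eta, lam, n):
    """Update written as: w - eta * grad - (eta * lam / n) * w."""
    return w - eta * grad - (eta * lam / n) * w


def l2_step_decay_form(w, grad, eta, lam, n):
    """Same update written with an explicit weight decay factor."""
    decay = 1 - eta * lam / n
    return decay * w - eta * grad


# Hypothetical values, chosen only for illustration.
w, grad, eta, lam, n = 2.0, 0.5, 0.1, 5.0, 100

a = l2_step(w, grad, eta, lam, n)
b = l2_step_decay_form(w, grad, eta, lam, n)
print(a, b)  # the two forms agree up to floating-point rounding

# With a zero gradient, the weight simply shrinks by the decay factor
# at every step.
w_decayed = w
for _ in range(3):
    w_decayed = l2_step_decay_form(w_decayed, 0.0, eta, lam, n)
print(w_decayed)
```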

Why is this going on? Heuristically, if the cost function is unregularized, then the length of the weight vector is likely to grow, all other things being equal. Over time this can lead to the weight vector being very large indeed. This can cause the weight vector to get stuck pointing in more or less the same direction, since changes due to gradient descent only make tiny changes to the direction, when the length is long. I believe this phenomenon is making it hard for our learning algorithm to properly explore the weight space, and consequently harder to find good minima of the cost function.



### L1 regularization 
$$
C = C_0 + \frac{\lambda}{n}\sum_w |w|
$$

Here C₀ is again the original, unregularized cost (any loss function). L1 regularization is similar to L2; only the penalty changes, from the sum of the squares of the weights to the sum of their absolute values.

The weight update rule during backpropagation becomes
$$
w = w - \eta \frac{\partial C_0}{\partial w} - \frac {\eta \lambda} {n} \operatorname{sgn}(w)
$$

where C₀ is the unregularized loss, i.e. the first term of C, and sgn(w) is just the sign of the weight: +1 for positive weights and -1 for negative weights.
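The L1 update can be sketched for a single scalar weight as follows. The helper names are made up for this example, and treating the sign of a zero weight as 0 is a common convention assumed here rather than something stated above.

```python
def sign(w):
    """Return +1 for positive w, -1 for negative w, and 0 for w == 0."""
    return (w > 0) - (w < 0)


def l1_step(w, grad, eta, lam, n):
    """One update: w - eta * grad - (eta * lam / n) * sign(w)."""
    return w - eta * grad - (eta * lam / n) * sign(w)


# Hypothetical values, chosen only for illustration. The gradient of the
# unregularized loss is set to zero to isolate the effect of the penalty.
eta, lam, n = 0.1, 5.0, 100

print(l1_step(2.0, 0.0, eta, lam, n))    # moves toward 0 from above
print(l1_step(-2.0, 0.0, eta, lam, n))   # moves toward 0 from below
```

Both weights move toward zero by the same fixed amount, ηλ/n, regardless of their magnitude.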

#### Comparing L1 and L2 
In both expressions the effect of regularization is to shrink the weights. This accords with our intuition that both kinds of regularization penalize large weights. But the way the weights shrink is different. In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.
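The difference in how the two penalties shrink a weight can be made concrete by comparing the size of the shrinkage term for small and large weights. This is a sketch with assumed helper names and arbitrary values for η, λ and n.

```python
def l1_shrink(w, eta, lam, n):
    """Amount the L1 term removes from |w| in one step (constant)."""
    return eta * lam / n if w != 0 else 0.0


def l2_shrink(w, eta, lam, n):
    """Amount the L2 term removes from |w| in one step (proportional to |w|)."""
    return (eta * lam / n) * abs(w)


# Hypothetical values, chosen only for illustration.
eta, lam, n = 0.1, 5.0, 100

for w in (0.01, 1.0, 10.0):
    print(w, l1_shrink(w, eta, lam, n), l2_shrink(w, eta, lam, n))
```

For |w| below 1 the L1 term shrinks the weight more than the L2 term, for |w| above 1 it shrinks it less, and at |w| = 1 the two coincide.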

## Dropout 
Dropout is another regularization technique, and it is very simple to understand.
![](dropout.gif)
It takes a probability p and, during training, randomly disables approximately that fraction of the neurons in a layer.

For example, if the dropout value on a layer is 0.3, roughly 30% of the neurons in that layer are disabled, i.e. their outputs are set to zero.

During training, a different, randomly chosen set of neurons is disabled for every batch.
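The masking step can be sketched in plain Python. This is a minimal illustration that assumes each neuron is dropped independently with probability p; the function name `dropout` and the layer values are made up for the example, and real frameworks work on tensors rather than lists.

```python
import random


def dropout(activations, p, rng):
    """Zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


rng = random.Random(0)   # seeded only to make the example repeatable
layer = [1.0] * 10       # hypothetical activations of one layer

# Each call draws a fresh random mask, as would happen for each batch.
batch_1 = dropout(layer, 0.3, rng)
batch_2 = dropout(layer, 0.3, rng)
print(batch_1)
print(batch_2)
```

On average about 30% of the entries are zeroed, but the exact number and positions vary from call to call.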

So why does dropout increase the robustness of the model?
Heuristically, when we drop out different sets of neurons, it's rather like we're training different neural networks. And so the dropout procedure is like averaging the effects of a very large number of different networks. The different networks will overfit in different ways, and so, hopefully, the net effect of dropout will be to reduce overfitting.

For example, suppose a CNN is trained to distinguish dogs from cats and a few particular neurons end up with large weights: every time the model sees whiskers in an image, those neurons activate and the prediction is cat. If the whiskers are not there, the model can fail badly. Dropout forces the model to learn more attributes of the training data instead of relying on a few of them.

Consider the case p = 0.5:

By repeating dropout over and over, our network will learn a set of weights and biases. Of course, those weights and biases will have been learnt under conditions in which half the hidden neurons were dropped out. When we actually run the full network that means that twice as many hidden neurons will be active. To compensate for that, we halve the weights outgoing from the hidden neurons.
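The halving argument can be checked exactly on a tiny example. The sketch below assumes a single output that is a weighted sum of four hidden activations and enumerates all 16 dropout masks, which are equally likely at p = 0.5. The weights and activations are arbitrary illustrative numbers.

```python
from itertools import product


def output(weights, activations, mask):
    """Weighted sum of the activations that the mask keeps (1) or drops (0)."""
    return sum(w * a * m for w, a, m in zip(weights, activations, mask))


# Hypothetical values, chosen only for illustration.
weights = [0.4, -0.2, 0.6, 0.8]
activations = [1.0, 2.0, 0.5, 1.5]

# At p = 0.5 every keep/drop pattern has the same probability, so the
# expected training-time output is the plain average over all masks.
masks = list(product([0, 1], repeat=len(weights)))
train_mean = sum(output(weights, activations, m) for m in masks) / len(masks)

# At test time all neurons are active and the outgoing weights are halved.
halved = [0.5 * w for w in weights]
test_output = output(halved, activations, [1] * len(weights))

print(train_mean, test_output)
```

The two numbers agree, which is the sense in which halving the weights compensates for having twice as many active neurons. Many libraries instead rescale the kept activations by 1/(1 - p) during training, so that nothing has to change at test time.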

## Data Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Of course, in practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set.
This is where data augmentation comes to the rescue. It acts as a kind of regularization technique: because the generated images change every time, the model has to learn different attributes of the data, which leads to better generalization.
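A minimal sketch of image augmentation in plain Python is shown below. It assumes an image is a list of rows of pixel values and uses two simple transformations, a horizontal flip and a small shift. The function names and the choice of transformations are illustrative assumptions, and whether a given transformation preserves the label depends on the task (a flipped cat is still a cat, but a flipped digit may not be the same digit).

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift_right(image, k, fill=0):
    """Shift every row right by k pixels, padding on the left with fill."""
    width = len(image[0])
    k = max(0, min(k, width))
    return [[fill] * k + row[:width - k] for row in image]


def augment(image, rng):
    """Return a randomly flipped and slightly shifted copy of the image."""
    out = horizontal_flip(image) if rng.random() < 0.5 else [row[:] for row in image]
    return shift_right(out, rng.randint(0, 1))


rng = random.Random(0)        # seeded only to make the example repeatable
image = [[1, 2, 3],
         [4, 5, 6]]           # hypothetical 2x3 grayscale image

# Each pass over the data can see a different variant of the same image.
for _ in range(3):
    print(augment(image, rng))
```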

## Label Smoothing

# Normalization
1. Eip session2
2. Batch Norm
3. Weight Norm
4. Layer Norm
https://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/
