# Demystifying Neural Network

### Improving neural network's learning : overfitting and regularization
- Intro : bias and variance
- Regularization techniques
    1. L2 regularization
    2. L1 regularization
    3. Dropout
    4. Data augmentation (artificially increasing the training set)

#### Intro : bias and variance
Before getting into the regularization techniques that are used to prevent neural networks from overfitting, I want to briefly talk about the bias-variance tradeoff in machine learning. The idea is well illustrated by comparing a linear model (a simple model) with a polynomial model (a complex one).

<img src="img/blog5_figure1.png" width="700" height="400" />

Let's say we want to predict the price of a house from its number of rooms. Blue points represent training data and green points represent testing data. The model in the middle shows the best fit. Both the leftmost linear model and the rightmost 7th-order polynomial model have a large **generalization error**. The generalization error of a model is its expected error on data points that are not in the training data. Although both models suffer from a large generalization error, they have different problems: one suffers from a bias problem and the other from a variance problem.

**Bias** is the inability of a machine learning algorithm to capture the underlying structure of the data. The best-fit model in the middle of the figure shows that the relationship between house price and the number of rooms is not linear, so even with a large training set we would not be able to fit the training data well with a linear model. Therefore, the linear model suffers from a bias problem (we also say that the model is 'underfit').

On the other hand, the rightmost polynomial model suffers from the **variance** problem. As the term suggests, a model suffers from the variance problem when its fit, and therefore its error on a new testing set, varies a lot depending on the particular training data it has seen. Since we fit every single point in the training data with the high-order polynomial model, we have 0 training error. However, given the testing data set, we see that our polynomial model does horribly at predicting the testing data points and incurs a huge error. This is a problem because we are fitting every single pattern, including noise, of a small and finite training set. Thus, it would be very hard for the 7th-order polynomial model to generalize to much larger data sets. Our 7th-order polynomial has a problem of overfitting.

This is the tradeoff between bias and variance. If our model is too simple, we suffer from large bias but small variance. If our model is too complex, we suffer from large variance but small bias. It's really hard to find the right balance to represent the underlying phenomenon given a training dataset. To mitigate generalization error, a number of techniques exist, such as regularization and ensemble methods like boosting and bagging.
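The tradeoff can be reproduced with a few lines of NumPy. The following is only a minimal sketch on synthetic data: the `true_price` function, the noise level, and the standardized "number of rooms" axis in $[-1, 1]$ are assumptions made up for illustration, not the data behind the figure above. It fits polynomials of degree 1, 2, and 7 to 8 training points and reports training and testing error for each.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_price(x):
    # Hypothetical underlying phenomenon: a gently curved relationship.
    return 1.0 + 0.8 * x - 0.6 * x**2

# Small training set (8 points) and a separate testing set, both noisy.
x_train = np.linspace(-1.0, 1.0, 8)
y_train = true_price(x_train) + rng.normal(scale=0.1, size=x_train.shape)
x_test = rng.uniform(-1.0, 1.0, size=50)
y_test = true_price(x_test) + rng.normal(scale=0.1, size=x_test.shape)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 7):
    # Least-squares polynomial fit on the training data only.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(y_train, np.polyval(coeffs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

With 8 training points, the degree-7 polynomial passes through every one of them, so its training error is essentially 0. Comparing its testing error with that of the lower-degree fits is the interesting part.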

Neural networks can also suffer from generalization error, and they tend to suffer from large variance. I haven't heard much about neural networks suffering from large bias, probably because they have so many free parameters. Increasing the number of hidden layers and/or neurons may lead a neural network to overfit. A neural network with a large number of free parameters tends to explain the given training data perfectly but fail to generalize to unseen data. The commonly used techniques to counter this are L1 and L2 regularization, dropout, and data augmentation (there are many more, but I will focus on these four). You probably have some experience using L1 and L2 regularization in linear regression (Lasso and Ridge regression, respectively, or elastic net, which combines the two). The same regularization techniques are also used when constructing neural networks.

### L2 Regularization
Regularization can be viewed as a way of compromising between finding small weights and minimizing the given cost function. L1 and L2 regularization help mitigate the variance problem of a neural network by shrinking its parameters and making its predictions less sensitive to them.

L2 regularization can be implemented directly in the cost function by adding the sum of the squares of all the weights in the neural network. That is, for every weight $w$ in the network, $\frac{\lambda}{2n} w^2$ is added to the cost function. By doing so we introduce a small amount of bias (the regularization term) into how the hypothesis function generated by the neural network fits the data, but in return for this small amount of bias, we get a drop in variance. That is, by starting with a slightly worse fit to the training data, we can provide better predictions with less generalization error. Here, $\lambda$ is the regularization strength, which determines how severely we want to penalize our parameters. Let's look closely. If we are using the mean squared error cost function:
$$ C = \frac{1}{2n}\sum_x||y-\hat y||^2 + \frac{\lambda}{2n}\sum_w w^2$$
The first term is the usual mean squared error cost function that we are familiar with, and the second term is the added L2 regularization term. $n$ is the size of the training data. The bias parameters $b$ are not penalized, and thus are not included in the regularization term. The regularization term penalizes big weight vectors and makes the network prefer to learn small weights. The value of $\lambda$, as a hyperparameter, decides the strength of regularization. $\lambda$ can be any value from 0 to $+\infty$. If $\lambda$ is small, minimizing the original cost function is preferred over finding small weights. If $\lambda$ is large, finding small weights is preferred over minimizing the original cost function.
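As a rough sketch, the regularized cost above can be written in NumPy as follows. The function name `l2_regularized_cost` and the convention that `weights` is a list of weight arrays (one per layer, with biases left out) are assumptions made for this example.

```python
import numpy as np

def l2_regularized_cost(y, y_hat, weights, lam):
    """C = 1/(2n) * sum_x ||y - y_hat||^2 + lam/(2n) * sum_w w^2."""
    n = y.shape[0]  # number of training examples (one per row)
    data_term = np.sum((y - y_hat) ** 2) / (2 * n)
    # Sum of squared weights over every weight array; biases are not included.
    penalty = (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty

# Tiny example: one training example, two layers of weights.
y = np.array([[1.0, 2.0]])
y_hat = np.array([[1.0, 0.0]])
weights = [np.array([1.0, 2.0]), np.array([[3.0]])]

unregularized = l2_regularized_cost(y, y_hat, weights, lam=0.0)
regularized = l2_regularized_cost(y, y_hat, weights, lam=0.1)
```

Setting `lam=0.0` recovers the plain mean squared error cost, and any positive `lam` adds a penalty that grows with the squared size of the weights.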

If we let $C_0$ denote the original (unregularized) cost function, the partial derivative of the regularized cost $C$ with respect to $w$ becomes:

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w $$  

During gradient descent, we update the weights with the following learning rule, where $\eta$ is the step size:

$$w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n}w $$  
$$ = \left(1-\frac{\eta \lambda}{n}\right)w - \eta \frac{\partial C_0}{\partial w}$$
This shows that with L2 regularization, every weight is rescaled towards 0 at each step by the factor $(1-\frac{\eta \lambda}{n})$ in front of $w$, which is why L2 regularization is also known as weight decay. For stochastic gradient descent, the only change is that the term $\eta \frac{\partial C_0}{\partial w}$ is replaced by
$\frac {\eta}{m} \sum_x\frac{\partial C_x}{\partial w}$, as we estimate $\frac{\partial C_0}{\partial w}$ by averaging over a mini-batch of $m$ training examples.
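A single stochastic gradient descent step with this rule might look like the sketch below. The function name `l2_sgd_step` and the layout of `minibatch_grads` (one per-example gradient $\frac{\partial C_x}{\partial w}$ stacked along the first axis) are assumptions for illustration; in practice the gradients would come from backpropagation.

```python
import numpy as np

def l2_sgd_step(w, minibatch_grads, eta, lam, n):
    """One SGD step with L2 regularization (weight decay).

    minibatch_grads has shape (m, *w.shape): the gradient of the
    unregularized cost C_x for each of the m examples in the mini-batch.
    """
    grad_c0 = np.mean(minibatch_grads, axis=0)  # (1/m) * sum_x dC_x/dw
    decay = 1.0 - eta * lam / n                 # rescaling factor on w
    return decay * w - eta * grad_c0

w = np.array([1.0, -2.0])

# With zero data gradient, only the decay factor acts on the weights.
only_decay = l2_sgd_step(w, np.zeros((4, 2)), eta=0.5, lam=0.1, n=10)

# With a non-zero mini-batch gradient, both terms contribute.
grads = np.array([[1.0, 1.0], [3.0, 3.0]])
full_step = l2_sgd_step(w, grads, eta=0.5, lam=0.1, n=10)
```

Here `only_decay` is simply `0.995 * w`, which makes the decay factor visible in isolation.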

### L1 Regularization 
While we added the term $\frac{\lambda}{2n} w^2$ for L2 regularization, for L1 regularization we add $\frac{\lambda}{n} |w|$ to the cost function for every weight $w$. The partial derivative of the cost function then becomes:
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\operatorname{sign}(w) $$  
Then, the weight update rule is:
$$w \rightarrow w - \frac{\eta \lambda}{n}\operatorname{sign}(w) - \eta \frac{\partial C_0}{\partial w}$$   

Unlike L2 regularization, which drives small weights asymptotically close to 0 (as weights shrink by an amount proportional to $w$), L1 regularization drives unimportant weights all the way to 0. L2 regularization is like a sinking boat; everything goes down together. On the other hand, L1 regularization shrinks the weights by a constant amount, $\frac{\eta \lambda}{n}$, regardless of their size. Therefore, L1 regularization will quickly drive some of the insignificant weights to 0, making the weights sparse during optimization. That is, the network will end up using only a subset of its most significant inputs and become less sensitive to unimportant and noisy ones. If we are not concerned with feature selection, L2 regularization is usually preferred, as it tends to give better performance than L1 regularization.
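To see the difference in isolation, the sketch below applies only the shrinkage part of each update (the data gradient $\frac{\partial C_0}{\partial w}$ is taken to be 0). The helper names are hypothetical, and clamping the L1 step at 0 is an assumption added here so that a small weight stops at 0 instead of overshooting and flipping sign; it is not part of the update rule written above.

```python
import numpy as np

def l2_shrink(w, eta, lam, n):
    # L2: multiply by (1 - eta*lam/n), so the shrinkage is proportional to w.
    return (1.0 - eta * lam / n) * w

def l1_shrink(w, eta, lam, n):
    # L1: move each weight toward 0 by the constant eta*lam/n.
    # Clamping at 0 (an assumption) prevents overshooting past 0.
    step = eta * lam / n
    return np.sign(w) * np.maximum(np.abs(w) - step, 0.0)

w_l1 = np.array([0.05, 1.0])  # one small weight, one large weight
w_l2 = w_l1.copy()
for _ in range(5):
    w_l1 = l1_shrink(w_l1, eta=0.1, lam=1.0, n=1)
    w_l2 = l2_shrink(w_l2, eta=0.1, lam=1.0, n=1)

print("L1:", w_l1)
print("L2:", w_l2)
```

After five steps the small weight is exactly 0 under L1, while under L2 both weights have been scaled by the same factor and neither has reached 0.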

### Dropout

Unlike L1 and L2 regularization, which modify the cost function by adding a regularization term, dropout does not modify the cost function; it modifies the network itself. Therefore, dropout can be combined with L1 or L2 regularization. Basically, dropout makes the neural network temporarily drop a fraction $p$ of the neurons of a selected hidden layer (where $p$ is a hyperparameter that has to be tuned).

<img src="img/blog5_figure2.png" width="600" height="400" /> 

While training, the user-chosen $p$ determines what fraction of the neurons of a given hidden layer will go on a vacation. If $p=0.5$, half of the hidden layer neurons are temporarily and randomly deleted. The remaining ones have to work harder, and the weights and biases are updated without the neurons that went on a vacation. Then, the dropout process is repeated with a new random subset of hidden layer neurons for the next mini-batch. Dropout allows the neural network to learn robust features with different subsets of hidden layer neurons: each randomly chosen subset has to work in the absence of the other pieces of evidence, so the network cannot rely too heavily on any single neuron. In this sense, it is similar to L1 and L2 regularization.
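A minimal NumPy sketch of the masking step is shown below, treating $p$ as the fraction of neurons that go on a vacation. The function name `dropout_forward` is hypothetical, and the rescaling of the surviving activations by $\frac{1}{1-p}$ (often called inverted dropout) is an assumption added here so that the layer can be used unchanged at test time; the description above does not cover it.

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    """Drop each hidden neuron with probability p (0 <= p < 1) during training."""
    if not training or p == 0.0:
        return activations
    # True = neuron stays, False = neuron goes on a vacation.
    keep_mask = rng.random(activations.shape) >= p
    # Rescale survivors so the expected activation is unchanged
    # (inverted dropout, an assumption not described in the text above).
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = np.ones((4, 6))  # activations of a hidden layer for 4 examples

dropped = dropout_forward(hidden, p=0.5, rng=rng)
at_test_time = dropout_forward(hidden, p=0.5, rng=rng, training=False)
```

Calling `dropout_forward` again for the next mini-batch draws a fresh mask, which corresponds to picking a new random subset of neurons.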

### Data Augmentation

Since a small training set means that a neural network might suffer from a variance problem when it is introduced to new test examples that it has never seen before, data augmentation can be a powerful tool to make a neural network robust, especially when getting new data is difficult and/or costly. Let's think of the MNIST training data of handwritten digits. There are about 7 billion people in this world, each with a different style of handwriting. Let's say we only have 1000 training examples. It would be hard for our neural network to generalize well to 7 billion people's handwritten digits. To help our neural network, we can expand our training data by making small rotations of the images that we have, like below.

<img src="img/blog5_figure3.png" width="500" height="400" /> 

By adding more variations of the handwritten images, the neural network can learn to classify digits better and improve its performance. This can be done in other settings, not only in image classification. Let's think about speech recognition. We can add 'distortion' to the recordings by adding background noise or by slowing down or speeding up the speech, to make the neural network more robust at recognizing speech.
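The rotation idea for images can be sketched in plain NumPy as below. The helper names, the nearest-neighbor sampling, and the choice of $\pm 10$ degrees are all assumptions for illustration; the random arrays only stand in for real 28x28 digit images, and an image library would normally provide a smoother rotation.

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2D image about its center using nearest-neighbor sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.indices((h, w))
    # Inverse mapping: for each output pixel, find the source pixel.
    x_src = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    y_src = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi = np.rint(x_src).astype(int)
    yi = np.rint(y_src).astype(int)
    # Pixels whose source falls outside the image stay 0 (background).
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out

def augment_with_rotations(images, labels, angles=(-10, 10)):
    """Return the original images plus one rotated copy per angle."""
    aug_images, aug_labels = [images], [labels]
    for angle in angles:
        rotated = np.stack([rotate_image(img, angle) for img in images])
        aug_images.append(rotated)
        aug_labels.append(labels)  # a small rotation keeps the digit's label
    return np.concatenate(aug_images), np.concatenate(aug_labels)

rng = np.random.default_rng(0)
images = rng.random((5, 28, 28))  # stand-in for 5 digit images
labels = np.arange(5)

big_images, big_labels = augment_with_rotations(images, labels)
```

With two angles, the 5 original images become 15 training examples, each rotated copy sharing the label of the image it came from.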