&copy;Copyright 2017 Shuang Wu<br>
Cited from the book Neural Networks and Deep Learning by Michael Nielsen, http://neuralnetworksanddeeplearning.com <br>
Learning notes

## CH 3
## Improving the way neural networks learn

### The cross-entropy cost function
Consider a simple one-layer network (a single sigmoid neuron) with learning rate $\eta = 0.15$, initial weight $0.6$ and initial bias $0.9$. The cost is the quadratic cost function, $C$. For input $1$, we want the output to be $0$. After 300 epochs, the output reaches about $0.09$.<br>
![cec1](/notebooks/imgs/cec1.jpg)<br>
The neuron learns quickly in this case. Now change both the initial weight and the initial bias to $2$:<br>
![cec2](/notebooks/imgs/cec2.jpg)<br>
Here learning starts out slowly: for roughly the first 150 epochs, the weight and bias don't change much.<br>
The artificial neuron has a lot of difficulty learning when it is badly wrong.<br>
Saying that learning is slow is the same as saying the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ are small. This is because of the shape of the sigmoid function: when the neuron's output is close to 1, the curve gets very flat, so $\sigma'(z)$ gets very small, which makes both derivatives very small.<br>
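
The slowdown described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a single sigmoid neuron trained by plain gradient descent on $C=(y-a)^2/2$ for the one training input $x=1$, $y=0$; the function names are illustrative and not taken from the book's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_quadratic(w, b, eta=0.15, epochs=300, x=1.0, y=0.0):
    """Gradient descent on C = (y - a)^2 / 2 for one sigmoid neuron."""
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        # dC/dz = (a - y) * sigma'(z), with sigma'(z) = a * (1 - a)
        delta = (a - y) * a * (1.0 - a)
        w -= eta * delta * x
        b -= eta * delta
    return sigmoid(w * x + b)

out_good_start = train_quadratic(0.6, 0.9)  # mildly wrong starting point
out_bad_start = train_quadratic(2.0, 2.0)   # saturated starting point
print(out_good_start, out_bad_start)
```

After the same 300 epochs, the run that starts at weight and bias $2$ should still be further from the target than the run that starts at $0.6$ and $0.9$, because it spends its early epochs on the flat part of the sigmoid.
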

### Introducing the cross-entropy cost function
We can solve the learning slowdown problem by replacing the quadratic cost with a cross-entropy cost function.<br>
Suppose the simple model now has three inputs:<br>
![cec3](/notebooks/imgs/cec3.jpg)<br>
Define the cross-entropy cost function as:<br>
$$C=-\frac{1}{n}\sum_x[y\ln a+(1-y)\ln(1-a)]$$<br>
Here $n$ is the total number of training examples, the sum is over all training inputs $x$, $y$ is the corresponding desired output, and $a$ is the neuron's output.<br>
Two properties make it reasonable to interpret the cross-entropy as a cost function:<br>
1. $C>0$<br>
    (1) all terms in the sum are negative, because both logarithms are of numbers in the range 0 to 1.<br>
    (2) there is a minus sign out front of the sum.<br>
2. If the neuron's actual output is close to the desired output for all training inputs $x$, then the cross-entropy is close to 0. This can be seen in 2 situations:<br>
    (1) $y=0$ and $a \approx 0$<br>
    (2) $y=1$ and $a \approx 1$<br>
In both cases $C \approx 0$.<br>
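
Both properties can be checked numerically. The sketch below assumes a tiny hypothetical data set of `(y, a)` pairs; `cross_entropy` is an illustrative helper, not a function from the book.

```python
import math

def cross_entropy(pairs):
    """pairs: list of (y, a), desired output y in {0, 1} and output 0 < a < 1."""
    n = len(pairs)
    total = sum(y * math.log(a) + (1 - y) * math.log(1 - a) for y, a in pairs)
    return -total / n

good = cross_entropy([(0, 0.01), (1, 0.99)])  # outputs close to the targets
bad = cross_entropy([(0, 0.99), (1, 0.01)])   # outputs badly wrong
print(good, bad)
```

The cost is positive in both cases, small when the outputs match the targets and large when they do not.
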

To see why the cross-entropy solves the slowdown problem, compute the partial derivative with respect to the weights:<br>
$$\frac{\partial C}{\partial w_j}=\frac{1}{n}\sum_x x_j(\sigma(z)-y)$$<br>
This means the rate at which the weight learns is controlled by $\sigma(z)-y$, the error in the output: the larger the error, the faster the neuron learns. With the quadratic cost, learning was slowed down by the $\sigma'(z)$ factor, which does not appear here. The partial derivative for the bias is:<br>
$$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x(\sigma(z)-y)$$<br>
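
To make the difference concrete, the sketch below compares the two weight gradients for a single training input at the saturated starting point used earlier (weight and bias both $2$, input $1$, desired output $0$). It assumes one sigmoid neuron; the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 0.0
w, b = 2.0, 2.0          # saturated starting point from the earlier example
a = sigmoid(w * x + b)   # close to 1, badly wrong for y = 0

grad_quadratic = (a - y) * a * (1 - a) * x   # contains sigma'(z) = a * (1 - a)
grad_cross_entropy = (a - y) * x             # the sigma'(z) factor has cancelled
print(grad_quadratic, grad_cross_entropy)
```

The cross-entropy gradient is larger by a factor of $1/\sigma'(z)$, which is why the badly wrong neuron learns quickly with it.
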
Returning to the previous example, now with the cross-entropy cost, learning is fast:<br>
![cec4](/notebooks/imgs/cec4.jpg)<br>
![cec5](/notebooks/imgs/cec5.jpg)<br>

Extending the cross-entropy to multi-layer networks with many output neurons, where $\sum_j$ runs over the output neurons:<br>
$$C=-\frac{1}{n}\sum_x\sum_j[y_j\ln a^L_j+(1-y_j)\ln(1-a^L_j)]$$

<strong>Choose between two cost function</strong><br>
Cross-entropy is nearly always the better choice, provided the output neurons are sigmoid neurons. When setting up the network we usually initialize the weights and biases randomly. It may happen that those initial choices leave the network decisively wrong for some training input, so that an output neuron saturates near 1 when it should be 0, or vice versa. With the quadratic cost this slows learning down. It won't stop learning completely, since the weights will continue learning from other training inputs, but it is undesirable.

### Using the cross-entropy to classify MNIST digits<br>
The overall accuracy increases, especially with 100 hidden neurons. The cross-entropy cost gives similar or better results than the quadratic cost, though the improvement is small. Cross-entropy is a widely used cost function, and it is a good laboratory for beginning to understand neuron saturation and how it may be addressed.<br>

### What does the cross-entropy mean? 
Our neuron is trying to compute the function $x\rightarrow y = y(x)$, but instead it computes the function $x\rightarrow a =a(x)$. Suppose we think of $a$ as the neuron's estimated probability that $y$ is $1$, and $1-a$ as the estimated probability that the right value for $y$ is $0$. Then the cross-entropy measures how "surprised" we are, on average, when we learn the true value for $y$. We get low surprise when the output is what we expect, and high surprise when it is unexpected.

### Softmax<br>
A softmax layer is a new type of output layer for our neural networks. It begins in the same way as a sigmoid layer, by forming the weighted inputs $z_j^L=\sum_kw_{jk}^La_k^{L-1}+b^L_j$. However, we don't apply the sigmoid function to get the output: instead of $a^L_j=\sigma(z^L_j)$, a softmax layer applies the <i>softmax function</i> to the $z^L_j$. The activation $a^L_j$ of the $j^{th}$ output neuron is:<br>
$$a^L_j = \frac{e^{z^L_j}}{\sum_ke^{z^L_k}}$$<br>
The sum in the denominator is over all the output neurons, so the output activations sum to 1:<br>
$$\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_ke^{z^L_k}}=1$$<br>
As a result, if $a^L_4$ increases, then the other output activations must decrease by the same total amount, to ensure that the sum over all activations remains $1$. The same holds for all the other activations.<br>
The output activations are also all positive, because the exponential function is positive.<br>
So the output of a softmax layer is a set of positive numbers which sum to 1; in other words, it can be thought of as a probability distribution. It is convenient to be able to interpret the output activation $a^L_j$ as the network's estimate of the probability that the correct output is $j$.<br>
<strong>The activations from a sigmoid layer won't in general form a probability distribution. With a sigmoid output layer we don't have such a simple interpretation of the output activations.<br></strong>
Softmax can be seen as rescaling the $z^L_j$ and then squishing them together to form a probability distribution.<br>
Softmax has two notable properties: monotonicity ($\partial a^L_j/\partial z^L_k$ is positive if $j=k$ and negative if $j\neq k$) and non-locality (each output activation depends on all the weighted inputs).<br>
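
A small sketch of the softmax function illustrates these properties. The weighted inputs below are hypothetical, and subtracting `max(z)` before exponentiating is a common numerical-stability trick that does not change the result; it is not part of the notes above.

```python
import math

def softmax(z):
    """Softmax of a list of weighted inputs; subtracting max(z) avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.5, -1.0, 3.2, 0.5]          # hypothetical weighted inputs
a = softmax(z)
z_bumped = [2.5, -1.0, 3.2, 1.5]   # increase only the fourth weighted input
a_bumped = softmax(z_bumped)
print(a, sum(a))
print(a_bumped, sum(a_bumped))
```

All activations are positive and sum to 1. Raising the fourth weighted input raises the fourth activation and lowers every other one, which shows both the monotonicity and the non-locality.
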

<strong>The learning slowdown problem:</strong>
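To see how a softmax layer addresses the learning slowdown problem, define the log-likelihood cost function for a training input $x$, where $y$ denotes the index of the correct output class:<br>
$$C\equiv -\ln a^L_y$$<br>
The corresponding partial derivatives, with $y_j$ equal to $1$ if $j$ is the correct class and $0$ otherwise, are:<br>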
$$\frac{\partial C}{\partial b^L_j} = a^L_j - y_j$$<br>
$$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k(a^L_j - y_j)$$
It's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.<br>
In many situations both approaches work well. As a more general point of principle, softmax plus log-likelihood is worth using whenever you want to interpret the output activations as probabilities. This is not always a concern, but can be useful with <strong>classification problems involving disjoint classes</strong>.<br>
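
The bias derivative $a^L_j - y_j$ equals the derivative of the cost with respect to the weighted input $z^L_j$, so it can be checked against a finite-difference estimate. The sketch below assumes hypothetical weighted inputs and a hypothetical correct class; the helper names are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_cost(z, label):
    """C = -ln a_label for a softmax output layer with weighted inputs z."""
    return -math.log(softmax(z)[label])

z = [0.3, -1.2, 2.0]   # hypothetical weighted inputs z^L_j
label = 1              # hypothetical index of the correct class
a = softmax(z)
one_hot = [1.0 if j == label else 0.0 for j in range(len(z))]
analytic = [a[j] - one_hot[j] for j in range(len(z))]

# central finite differences of C with respect to each z_j
eps = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = list(z)
    z_plus[j] += eps
    z_minus = list(z)
    z_minus[j] -= eps
    diff = log_likelihood_cost(z_plus, label) - log_likelihood_cost(z_minus, label)
    numeric.append(diff / (2 * eps))

print(analytic)
print(numeric)
```

The two lists should agree to several decimal places. As with cross-entropy, no $\sigma'(z)$-like factor appears, so a badly wrong output neuron still gets a large gradient.
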

### Overfitting and regularization
The test setup: use the network with 30 hidden neurons and its 23,860 parameters. Instead of training on all 50,000 images, use only the first 1,000 training images, with $\eta=0.5$ and a mini-batch size of 10. Train for 400 epochs, more than before. The cost on the training data changes as the network learns:<br>
![of1](/notebooks/imgs/of1.jpg)<br>
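
As a quick check of the 23,860 parameter count quoted above, the sketch below assumes the usual MNIST architecture of 784 input neurons, 30 hidden neurons and 10 output neurons; the input and output sizes are an assumption, since they are not stated in these notes.

```python
sizes = [784, 30, 10]   # assumed layer sizes: input, hidden, output

# one weight per connection between adjacent layers,
# and one bias per neuron outside the input layer
n_weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))
n_biases = sum(sizes[1:])
print(n_weights + n_biases)  # 23860
```
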
The classification accuracy on the test data changes over time:<br>
![of2](/notebooks/imgs/of2.jpg)<br>
The cost looks good, but the test accuracy shows the improvement is an illusion. What the network learns after epoch 280 no longer generalizes to the test data, so it is not useful learning. We say the network is <i>overfitting</i> beyond epoch 280.<br>
The cost on the test data:<br>
![of3](/notebooks/imgs/of3.jpg)<br>
The cost on the test data improves until around epoch 15, after which it gets worse, even though the cost on the training data continues to get better. This is another sign of overfitting.<br>
The classification accuracy on the training data is a further sign:<br>
![of4](/notebooks/imgs/of4.jpg)<br>
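The accuracy rises all the way up to 100%, i.e. the network correctly classifies all 1,000 training images, while the test accuracy is only 82%. So the network is overfitting.<br>
The obvious way to detect overfitting is to keep track of the accuracy on the test data as the network trains, and to stop training once that accuracy no longer improves. Strictly speaking, a plateau is not necessarily a sign of overfitting, since training and test accuracy may stop improving at the same time.<br>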

We can also use the validation data to detect overfitting. Compute the classification accuracy on the validation data at the end of each epoch; once that accuracy has saturated, stop training. This strategy is called <i>early stopping</i>.<br>
We use validation data rather than the test data because choosing hyperparameters (including when to stop) based on the test data risks overfitting them to the test data, and we would then lose a true measure of how well our NN generalizes.<br>
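
One way to turn "the accuracy has saturated" into a concrete rule is sketched below. It assumes a hypothetical patience-based criterion (stop after a fixed number of epochs with no new best validation accuracy); the function name, the `patience` parameter and the accuracy values are illustrative, not taken from the notes.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the 0-based epoch at which training would stop.

    Stops once the validation accuracy has failed to beat its best value
    for `patience` consecutive epochs; otherwise returns the last epoch.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracies) - 1

# hypothetical per-epoch validation accuracies
history = [0.80, 0.85, 0.88, 0.90, 0.90, 0.89, 0.90, 0.91, 0.91]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)
```

A small `patience` stops early but can be fooled by a temporary plateau (here the accuracy would have improved again at a later epoch), so the value is a trade-off.
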
Now back to the example, this time training on all 50,000 images and comparing accuracy on the training and test data:<br>
![of5](/notebooks/imgs/of5.jpg)<br>
Though there is still a little overfitting, it is much reduced compared with the previous example.<br>
<strong>One of the best ways of reducing overfitting is to increase the size of the training data. With enough training data it is difficult for even a very large network to overfit. Unfortunately, training data can be expensive or difficult to acquire, so not always a practical option.</strong>

### Regularization<br>
With a fixed network and fixed training data, we can reduce overfitting with <i>regularization</i> techniques. One of the most commonly used is known as weight decay or L2 regularization.<br>
The idea is to add an extra term to the cost function, the regularization term. E.g. with the cross-entropy:<br>
$$C=-\frac{1}{n}\sum_x\sum_j[y_j\ln a_j^L+(1-y_j)\ln(1-a^L_j)]+\frac{\lambda}{2n}\sum_w w^2$$<br>
The second term is the sum of the squares of all the weights in the network, scaled by a factor $\lambda/2n$, where $\lambda>0$ is known as the regularization parameter and $n$ is the size of the training set. This term does not include the biases.<br>
Regularizing the quadratic cost function:<br>
$$C=\frac{1}{2n}\sum_x\|y-a^L\|^2+\frac{\lambda}{2n}\sum_w w^2$$ <br>
Or, in general:<br>
$$C=C_0+\frac{\lambda}{2n}\sum_w w^2$$<br>

First, consider stochastic gradient descent (SGD) in a regularized neural network. The partial derivatives now become:<br>
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w$$<br>
$$\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$<br>
The terms $\partial C_0/\partial w$ and $\partial C_0/\partial b$ can be computed using backpropagation. The learning rule for the biases stays the same:<br>
$$b\rightarrow b-\eta\frac{\partial C_0}{\partial b}$$<br>
For the weights it becomes:<br>
$$w\rightarrow w-\eta\frac{\partial C_0}{\partial w}-\frac{\eta\lambda}{n}w = \left(1-\frac{\eta\lambda}{n}\right)w-\eta\frac{\partial C_0}{\partial w}$$<br>
This is almost the same as the usual rule, except that the weight is first rescaled by $(1-\frac{\eta\lambda}{n})$; this rescaling is referred to as weight decay. That is how it works for gradient descent. For SGD with a mini-batch of $m$ training examples, the rule becomes:<br>
$$w\rightarrow (1-\frac{\eta\lambda}{n})w-\frac{\eta}{m}\sum_x\frac{\partial C_x}{\partial w}$$<br>
The sum is over the training examples $x$ in the mini-batch, and $C_x$ is the unregularized cost for each training example $x$. Except for the rescaling, everything remains the same as before. The regularized learning rule for the biases is the same as in the unregularized case:<br>
$$b\rightarrow b-\frac{\eta}{m}\sum_x\frac{\partial C_x}{\partial b}$$<br>
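
The two mini-batch rules can be written as one small update function. This is a sketch for a single scalar weight and bias, assuming the per-example gradients of the unregularized cost have already been computed (for instance by backpropagation); the function and argument names are hypothetical.

```python
def sgd_step_l2(w, b, grads_w, grads_b, eta, lmbda, n):
    """One L2-regularized mini-batch update for a single weight and bias.

    grads_w, grads_b: per-example gradients of the unregularized cost C_x
    over the mini-batch; n is the size of the full training set.
    """
    m = len(grads_w)
    w_new = (1 - eta * lmbda / n) * w - (eta / m) * sum(grads_w)
    b_new = b - (eta / m) * sum(grads_b)   # biases are not decayed
    return w_new, b_new

# with zero gradients, only the weight decay factor acts on the weight
w, b = sgd_step_l2(w=2.0, b=1.0, grads_w=[0.0, 0.0], grads_b=[0.0, 0.0],
                   eta=0.5, lmbda=5.0, n=50000)
print(w, b)
```

With zero gradients the weight shrinks slightly towards zero while the bias is left unchanged, which is the "decay" in weight decay.
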

Apply this to the example: 30 hidden neurons, mini-batch size of 10, learning rate of 0.5, cross-entropy cost function, with $\lambda=0.1$.<br>
The cost on the training data decreases over the whole run, as it did without regularization:<br>
![rg1](/notebooks/imgs/rg1.jpg)<br>
The accuracy on the test data now continues to increase:<br>
![rg2](/notebooks/imgs/rg2.jpg)<br>
This reduces the overfitting problem and increases the peak accuracy.<br>
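Does regularization also help when the training set grows to the full 50,000 images? Keep the hyperparameters as before, but change $\lambda$ to $5$: because $n$ grows from 1,000 to 50,000, the weight decay factor $1-\frac{\eta\lambda}{n}$ would otherwise move much closer to 1, so $\lambda$ is scaled up to compensate.<br>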
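
As a worked check of the weight decay factor, assuming the first run used the 1,000-image training set with $\lambda=0.1$ and the second uses all 50,000 images with $\lambda=5$, both with $\eta=0.5$:

$$1-\frac{\eta\lambda}{n}=1-\frac{0.5\times 0.1}{1000}=0.99995 \qquad (n=1000,\ \lambda=0.1)$$

$$1-\frac{\eta\lambda}{n}=1-\frac{0.5\times 5}{50000}=0.99995 \qquad (n=50000,\ \lambda=5)$$

Under these assumptions the two settings apply the same amount of weight decay per update.
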
![rg3](/notebooks/imgs/rg3.jpg)<br>
Good news:<br>
1. Classification accuracy on the test data goes up.<br>
2. The gap between test and training accuracy is much narrower than before, so overfitting is reduced.<br>

Now, using 100 hidden neurons gives an accuracy of $97.92\%$ on the validation data. Training for 60 epochs with $\eta=0.1$ and $\lambda=5$ gives an accuracy of $98.04\%$.<br>

### Why does regularization help reduce overfitting?<br>
Example:<br>
![rg4](/notebooks/imgs/rg4.jpg)<br>
Fitting a 9th-order polynomial:<br>
![rg5](/notebooks/imgs/rg5.jpg)<br>
That is an exact fit. Alternatively, using a linear fit:<br>
![rg6](/notebooks/imgs/rg6.jpg)<br>

### Other techniques for regularization
Three other techniques: L1 regularization, dropout, and artificially increasing the training set size.<br>

<strong>L1 regularization</strong>:<br>
Add the sum of the absolute values of the weights to the unregularized cost:<br>
$$C=C_0+\frac{\lambda}{n}\sum_w|w|$$<br>
Update rule for L1:<br>
$$w\rightarrow w' = w-\frac{\eta\lambda}{n}\mathrm{sgn}(w)-\eta\frac{\partial C_0}{\partial w}$$<br>
$\mathrm{sgn}(w)$ is the sign of $w$: $+1$ when $w$ is positive and $-1$ when $w$ is negative (with the convention $\mathrm{sgn}(0)=0$).<br>
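
The L1 rule can be compared with the L2 rule side by side. The sketch below assumes a single scalar weight and uses the convention $\mathrm{sgn}(0)=0$; the function names and the hyperparameter values are illustrative.

```python
def sgn(w):
    """Sign of w, using the convention sgn(0) = 0."""
    return (w > 0) - (w < 0)

def l1_update(w, grad_c0, eta, lmbda, n):
    return w - (eta * lmbda / n) * sgn(w) - eta * grad_c0

def l2_update(w, grad_c0, eta, lmbda, n):
    return (1 - eta * lmbda / n) * w - eta * grad_c0

# with a zero unregularized gradient, only the regularization term acts
for w in (4.0, 0.5, -0.5):
    print(w, l1_update(w, 0.0, 0.5, 5.0, 1000), l2_update(w, 0.0, 0.5, 5.0, 1000))
```

L1 shrinks every nonzero weight towards zero by the same constant amount $\eta\lambda/n$, whereas L2 shrinks each weight by an amount proportional to $w$.
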

<strong>Dropout</strong>:<br>
In dropout, we modify the network itself. If we have the network:<br>
![do1](/notebooks/imgs/do1.jpg)<br>
Start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched:<br>
![do2](/notebooks/imgs/do2.jpg)<br>
Forward-propagate the input through the modified network, then backpropagate the result, and update the weights and biases. Then restore the dropped-out neurons, choose a new random subset of hidden neurons to delete, and repeat the procedure to update the parameters again.<br>
Dropout has been especially useful in training large, deep networks, where the problem of overfitting is often acute.<br>
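
The "delete half the hidden neurons, then pick a new subset" step can be sketched with a 0/1 mask. This is only an illustration of the selection step under assumed names (`dropout_mask`, `apply_mask`) and hypothetical activations; it does not implement the forward or backward pass.

```python
import random

def dropout_mask(n_hidden, rng):
    """Return a 0/1 mask that keeps a random half of the hidden neurons."""
    dropped = set(rng.sample(range(n_hidden), n_hidden // 2))
    return [0 if i in dropped else 1 for i in range(n_hidden)]

def apply_mask(activations, mask):
    """Zero the activations of the dropped neurons for this update."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)                    # seeded only for reproducibility
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]   # hypothetical hidden activations
for step in range(3):                     # a fresh mask for every update
    mask = dropout_mask(len(hidden), rng)
    print(mask, apply_mask(hidden, mask))
```

Because a new mask is drawn for each update, every update trains a different thinned network while the input and output layers stay intact.
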

<strong>Artificially expanding the training data</strong>:<br>
For images, rotating an original image by a small angle produces a new training image. The same idea applies in other areas; it is worth looking for opportunities to use it.<br>
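
A minimal sketch of the rotation idea, assuming an image stored as a 2D list of pixel values and nearest-neighbour sampling; `rotate_image` and the toy 5x5 image are hypothetical, and a real pipeline would more likely use an image library.

```python
import math

def rotate_image(img, degrees):
    """Rotate a 2D list of pixel values about its centre (nearest neighbour).

    Pixels that would come from outside the original image are set to 0.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # map each output pixel back to its source position
            dy, dx = r - cy, c - cx
            src_r = int(round(cy + cos_t * dy - sin_t * dx))
            src_c = int(round(cx + sin_t * dy + cos_t * dx))
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = img[src_r][src_c]
    return out

image = [[0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 1, 0, 0]]   # toy 5x5 image: a vertical stroke
augmented = [rotate_image(image, d) for d in (-15, 0, 15)]
```

Each rotated copy keeps the original label, so a handful of small angles multiplies the amount of training data.
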

<strong>An aside on big data and what it means to compare classification accuracies</strong>:<br>
What we want is both better algorithms and better training data. It's fine to look for better algorithms, but don't focus on better algorithms to the exclusion of easy wins from getting more or better training data.<br>

Overfitting is a major problem in neural networks, especially as computers get more powerful and we are able to train larger networks. We therefore need powerful regularization techniques to reduce overfitting, and this is an extremely active area of current work.<br>

### Weight initialization <br>
