# Improving the way neural networks learn
http://neuralnetworksanddeeplearning.com/chap3.html

## The cross-entropy cost function
Consider a single neuron with one input, trained on a single example with the quadratic cost:

$C = \frac{(y-a)^2}{2}, \ (54)$

$\frac{\partial C}{\partial w} = (a-y)\sigma'(z) x = a \sigma'(z), \ (55)$

$\frac{\partial C}{\partial b} = (a-y)\sigma'(z) = a \sigma'(z),\ (56)$

where I have substituted x=1 and y=0. 

![](http://ou8qjsj0m.bkt.clouddn.com//17-11-8/690390.jpg)

We can see from this graph that when the neuron's output is close to 1, the curve gets very flat, and so σ′(z) gets very small. Equations (55) and (56) then tell us that ∂C/∂w and ∂C/∂b get very small. This is the origin of the learning slowdown.
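To make the slowdown concrete, here is a minimal sketch that evaluates Equations (55) and (56) for x=1, y=0. The helper names and the two starting points (w, b) are illustrative assumptions, chosen so that one neuron is mildly wrong and the other is badly wrong and saturated.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def quadratic_grads(w, b, x=1.0, y=0.0):
    """Return (dC/dw, dC/db) for C = (y - a)^2 / 2 on one example."""
    z = w * x + b
    a = sigmoid(z)
    delta = (a - y) * sigmoid_prime(z)
    return delta * x, delta

# Illustrative starting points (assumed, not prescribed by the notes).
mild = quadratic_grads(0.6, 0.9)       # output around 0.82
saturated = quadratic_grads(2.0, 2.0)  # output around 0.98
print(mild, saturated)
```

Although the saturated neuron has the larger error, its gradient is several times smaller, because σ′(z) is tiny where the sigmoid curve is flat.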

## Introducing the cross-entropy cost function
We'll suppose instead that we're trying to train a neuron with several input variables, $x_1, x_2, \ldots$, corresponding weights $w_1, w_2, \ldots$, and a bias, $b$:

![](http://neuralnetworksanddeeplearning.com/images/tikz29.png)

$a = \sigma(z)$

$z = \sum_j w_j x_j+b$

The cross-entropy cost function:

$C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right], \ (57)$

- n is the total number of items of training data
- the sum is over all training inputs, x
- y is the corresponding desired output

Summing up, the cross-entropy is positive, and tends toward zero as the neuron gets better at computing the desired output, y, for all training inputs, x.
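As a rough illustration of both properties, the sketch below evaluates Equation (57) on two small sets of (a, y) pairs. The function name and the sample values are made up for the example, and the activations are assumed to lie strictly between 0 and 1 so the logarithms are defined.

```python
import math

def cross_entropy(pairs):
    """Average cross-entropy over (a, y) pairs, as in Equation (57)."""
    n = len(pairs)
    total = 0.0
    for a, y in pairs:
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / n

# Made-up outputs: a hesitant neuron and a nearly correct one.
poor = cross_entropy([(0.6, 1.0), (0.4, 0.0)])
good = cross_entropy([(0.99, 1.0), (0.01, 0.0)])
print(poor, good)
```

Both values are positive, and the cost of the nearly correct neuron is much closer to zero.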

We substitute a=σ(z) into (57), and apply the chain rule twice, obtaining:

$\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left(\frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\frac{\partial \sigma}{\partial w_j} \ (58)$

$= -\frac{1}{n} \sum_x \left(\frac{y}{\sigma(z)} -\frac{(1-y)}{1-\sigma(z)} \right)\sigma'(z) x_j.\ (59)$

Putting everything over a common denominator and simplifying this becomes:

$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))}(\sigma(z)-y). \ (60)$
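The common-denominator step can be spelled out as follows, writing σ for σ(z) to keep the algebra short:

$\frac{y}{\sigma} - \frac{1-y}{1-\sigma} = \frac{y(1-\sigma) - (1-y)\sigma}{\sigma(1-\sigma)} = \frac{y-\sigma}{\sigma(1-\sigma)}$

Substituting this into (59), the leading minus sign turns $y-\sigma$ into $\sigma-y$, which gives (60).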

Using the definition of the sigmoid function:

$\sigma(z) = 1/(1+e^{-z})$

$\sigma'(z) = \sigma(z)(1-\sigma(z))$

It simplifies to become:

$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \ (61)$

It tells us that the rate at which the weight learns is controlled by σ(z)−y, i.e., by the error in the output. The larger the error, the faster the neuron will learn.

In particular, it avoids the learning slowdown caused by the σ′(z) term in the analogous equation for the quadratic cost, Equation (55). When we use the cross-entropy, the σ′(z) term gets canceled out, and we no longer need worry about it being small. 

In a similar way, we can compute the partial derivative for the bias. 

$\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y). \ (62)$
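One way to gain confidence in Equations (61) and (62) is to compare them with finite differences of the cost (57). The sketch below does this for a single-input neuron. The data set, the starting values of w and b, and all function names are assumptions made for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, data):
    """Cross-entropy cost (57) for a single-input neuron."""
    total = 0.0
    for x, y in data:
        a = sigmoid(w * x + b)
        total += y * math.log(a) + (1.0 - y) * math.log(1.0 - a)
    return -total / len(data)

def analytic_grads(w, b, data):
    """Gradients from Equations (61) and (62)."""
    n = len(data)
    dw = sum(x * (sigmoid(w * x + b) - y) for x, y in data) / n
    db = sum(sigmoid(w * x + b) - y for x, y in data) / n
    return dw, db

def numeric_grads(w, b, data, eps=1e-6):
    """Central finite differences of the cost with respect to w and b."""
    dw = (cost(w + eps, b, data) - cost(w - eps, b, data)) / (2 * eps)
    db = (cost(w, b + eps, data) - cost(w, b - eps, data)) / (2 * eps)
    return dw, db

data = [(1.0, 0.0), (0.5, 1.0), (-1.5, 1.0)]  # made-up (x, y) pairs
w, b = 2.0, 2.0
print(analytic_grads(w, b, data))
print(numeric_grads(w, b, data))
```

The two pairs of numbers should agree to several decimal places, and no σ′(z) factor appears anywhere in `analytic_grads`.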

The cross-entropy generalizes to many-neuron, multi-layer networks. Suppose $y = y_1, y_2, \ldots$ are the desired values at the output neurons, i.e., the neurons in the final layer, while $a^L_1, a^L_2, \ldots$ are the actual output values. Then we define the cross-entropy by

$C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]. \ (63)$

## Softmax
The activation $a^L_j$ of the $j^{th}$ output neuron is

$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}, \ (78)$

where in the denominator we sum over all the output neurons.

Using Equation (78) and a little algebra, we can show that the output activations always sum to 1:

$\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1. \ (79)$
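A minimal sketch of Equation (78), using only the standard library. The input values are made up, and subtracting the maximum before exponentiating is an implementation choice for numerical safety rather than part of the definition (it cancels between numerator and denominator).

```python
import math

def softmax(zs):
    """Softmax activations for a list of weighted inputs z^L_j."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

activations = softmax([2.5, -1.0, 3.2, 0.5])  # illustrative z values
print(activations, sum(activations))
```

Every activation is positive and the total is 1 up to rounding, so the outputs can be read as a probability distribution.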

The learning slowdown problem:

We'll use x to denote a training input to the network, and y to denote the corresponding desired output. Then the log-likelihood cost associated to this training input is

$C \equiv -\ln a^L_y. \ (80)$

$\frac{\partial C}{\partial b^L_j} = a^L_j-y_j  \ (81)$

$\frac{\partial C}{\partial w^L_{jk}} = a^{L-1}_k (a^L_j-y_j) \ (82)$

In fact, it's useful to think of a softmax output layer with log-likelihood cost as being quite similar to a sigmoid output layer with cross-entropy cost.
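Equation (81) can be checked numerically. The sketch below assumes that $y_j$ is 1 for the correct class and 0 otherwise, and uses the fact that shifting the bias $b^L_j$ shifts $z^L_j$ by the same amount. The function names and the sample values are illustrative.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_cost(zs, label):
    """C = -ln a^L_y for a single training example (Equation (80))."""
    return -math.log(softmax(zs)[label])

def bias_grad(zs, label):
    """Equation (81): a^L_j - y_j, with y one-hot at `label`."""
    a = softmax(zs)
    return [a_j - (1.0 if j == label else 0.0) for j, a_j in enumerate(a)]

def numeric_bias_grad(zs, label, eps=1e-6):
    """Central finite differences of the cost with respect to each z^L_j."""
    grads = []
    for j in range(len(zs)):
        up = list(zs)
        down = list(zs)
        up[j] += eps
        down[j] -= eps
        diff = log_likelihood_cost(up, label) - log_likelihood_cost(down, label)
        grads.append(diff / (2 * eps))
    return grads

zs, label = [1.0, 2.0, 0.5], 1  # made-up weighted inputs and correct class
print(bias_grad(zs, label))
print(numeric_bias_grad(zs, label))
```

As with the sigmoid and cross-entropy pairing, the gradient is just the output error, with no derivative factor that could become small.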

## Overfitting and regularization
Let's now look at how the classification accuracy on the test data changes over time:

![](http://neuralnetworksanddeeplearning.com/images/overfitting2.png)

In the first 200 epochs (not shown) the accuracy rises to just under 82 percent. The learning then gradually slows down. Finally, at around epoch 280 the classification accuracy pretty much stops improving. Later epochs merely see small stochastic fluctuations near the value of the accuracy at epoch 280. We say the network is `overfitting` or overtraining beyond epoch 280.

For another sign of overfitting, let's look at the cost on the test data:

![](http://neuralnetworksanddeeplearning.com/images/overfitting3.png)

We can see that the cost on the test data improves until around epoch 15, but after that it actually starts to get worse, even though the cost on the training data is continuing to get better. 

Another sign of overfitting may be seen in the classification accuracy on the training data:

![](http://neuralnetworksanddeeplearning.com/images/overfitting4.png)

The accuracy rises all the way up to 100 percent.

We'll compute the classification accuracy on the `validation_data` at the end of each epoch. Once the classification accuracy on the validation_data has saturated, we stop training. This strategy is called **early stopping**. 
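The notes do not say how to decide that the accuracy has "saturated". One possible rule, assumed here purely for illustration, is to stop once the validation accuracy has not improved for a fixed number of epochs. The function name, the `patience` parameter and the sample accuracies are all hypothetical.

```python
def early_stopping_epoch(accuracies, patience):
    """Return (stop_epoch, best_epoch) under a no-improvement rule.

    Training stops at the first epoch that is `patience` epochs past the
    best validation accuracy seen so far. stop_epoch is None if the rule
    never triggers.
    """
    best_acc = float("-inf")
    best_epoch = None
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch

# Made-up validation accuracies (percent), one per epoch.
history = [90, 92, 93, 93, 92, 93, 91]
print(early_stopping_epoch(history, patience=3))
```

With this history the best accuracy is first reached at epoch 2, and the rule triggers at epoch 5. A larger `patience` trades extra training time for robustness against temporary plateaus.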

![](http://neuralnetworksanddeeplearning.com/images/overfitting_full.png)

As you can see, the accuracies on the test and training data remain much closer together than when we were using 1,000 training examples. In particular, the best classification accuracy of 97.86 percent on the training data is only 2.53 percent higher than the 95.33 percent on the test data. That's compared to the 17.73 percent gap we had earlier! **Overfitting is still going on, but it's been greatly reduced.**

## Regularization
Here's the regularized cross-entropy:

$C = -\frac{1}{n} \sum_{xj} \left[y_j \ln a^L_j+(1-y_j) \ln(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2. \ (85)$

The first term is just the usual expression for the cross-entropy.

The second term is the sum of the squares of all the weights in the network. It is scaled by a factor λ/2n, where λ>0 is known as the `regularization parameter`, and n is, as usual, the size of our training set.

The quadratic cost can be regularized in a similar way:

$C = \frac{1}{2n} \sum_x \|y-a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2. \ (86)$

In both cases we can write the regularized cost function as

$C = C_0 + \frac{\lambda}{2n}\sum_w w^2, \ (87)$

where $C_0$ is the original, unregularized cost function.
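A small sketch of Equation (87). It assumes the weights are stored as a list of per-layer lists, and the numbers are made up. Note that only weights enter the penalty, not biases.

```python
def l2_penalty(weights, lmbda, n):
    """(lmbda / 2n) * sum of squared weights; biases are not included."""
    return (lmbda / (2.0 * n)) * sum(w * w for layer in weights for w in layer)

def regularized_cost(c0, weights, lmbda, n):
    """Equation (87): unregularized cost plus the L2 penalty."""
    return c0 + l2_penalty(weights, lmbda, n)

weights = [[0.5, -1.0], [2.0]]  # made-up weights, grouped by layer
total = regularized_cost(c0=0.4, weights=weights, lmbda=5.0, n=1000)
print(total)
```

Here the squared weights sum to 5.25, so the penalty is 5.0 / 2000 × 5.25 = 0.013125 on top of the unregularized cost of 0.4.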

Taking the partial derivatives of Equation (87) gives

$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \ (88)$

$\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}.\ (89)$

The gradient descent learning rule for the biases doesn't change from the usual rule:

$b \rightarrow b -\eta \frac{\partial C_0}{\partial b}.\ (90)$

The learning rule for the weights becomes:

$w \rightarrow w-\eta \frac{\partial C_0}{\partial w}-\frac{\eta \lambda}{n} w \ (91)$

$= \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial C_0}{\partial w}. \ (92)$

This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight w by a factor $1-\frac{\eta\lambda}{n}$. This rescaling is sometimes referred to as `weight decay`, since it makes the weights smaller.

The regularized learning rule for stochastic gradient descent becomes:

$w \rightarrow \left(1-\frac{\eta \lambda}{n}\right) w -\frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}, \ (93)$

$b \rightarrow b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b},\ (94)$

where the sum is over training examples x in the mini-batch, and $C_x$ is the (unregularized) cost for each training example.
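Equations (93) and (94) translate almost directly into code. The sketch below treats a single scalar weight and bias, with per-example gradients supplied as plain lists. The function name and the gradient values are assumptions made for the illustration.

```python
def regularized_sgd_step(w, b, grad_ws, grad_bs, eta, lmbda, n):
    """One mini-batch update following Equations (93) and (94).

    grad_ws, grad_bs: per-example gradients dC_x/dw and dC_x/db.
    n: size of the full training set (not of the mini-batch).
    """
    m = len(grad_ws)
    w_new = (1.0 - eta * lmbda / n) * w - (eta / m) * sum(grad_ws)
    b_new = b - (eta / m) * sum(grad_bs)  # biases are not decayed
    return w_new, b_new

# Made-up gradients for a mini-batch of two examples.
w, b = regularized_sgd_step(w=1.0, b=0.5,
                            grad_ws=[0.2, -0.1], grad_bs=[0.1, 0.3],
                            eta=0.5, lmbda=5.0, n=50000)
print(w, b)
```

The weight is first scaled by 1 − 0.5 × 5.0 / 50000 = 0.99995 and then moved by the averaged gradient, giving roughly 0.97495. The bias only receives the gradient step and ends at about 0.4.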

## Why does regularization help reduce overfitting?
A standard story people tell to explain what's going on is along the following lines: smaller weights are, in some sense, lower complexity, and so **provide a simpler and more powerful explanation for the data, and should thus be preferred.** That's a pretty terse story, though, and contains several elements that perhaps seem dubious or mystifying.

Note also that L2 regularization doesn't constrain the biases.

## Other techniques for regularization
I briefly describe three other approaches to reducing overfitting: L1 regularization, dropout, and artificially increasing the training set size. 

**L1 regularization**:

$C = C_0 + \frac{\lambda}{n} \sum_w |w|.\ (95)$

We'll look at the partial derivatives of the cost function. Differentiating (95) we obtain:

$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \mathrm{sgn}(w),\ (96)$

- $\mathrm{sgn}(w)$ is the sign of w

The resulting update rule for an L1 regularized network is

$w \rightarrow w' = w-\frac{\eta \lambda}{n} \mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w}, \ (97)$

Compare that to the update rule for L2 regularization (cf. Equation (93)),

$w \rightarrow w' = w\left(1 - \frac{\eta \lambda}{n} \right) - \eta \frac{\partial C_0}{\partial w}. \ (98)$

In both expressions the effect of regularization is to shrink the weights. This accords with our intuition that both kinds of regularization penalize large weights. But the way the weights shrink is different. In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to w. And so when a particular weight has a large magnitude, |w|, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when |w| is small, L1 regularization shrinks the weight much more than L2 regularization. The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero.
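The difference is easy to see numerically. The sketch below isolates the regularization part of Equations (97) and (98), ignoring the gradient of $C_0$, and applies it to one large and one small weight. The hyper-parameter values are illustrative, and sgn(0) is taken to be 0.

```python
def sgn(w):
    """Sign of w, with the convention sgn(0) = 0."""
    return (w > 0) - (w < 0)

def l1_shrink(w, eta, lmbda, n):
    """Regularization part of Equation (97): move a constant amount toward 0."""
    return w - (eta * lmbda / n) * sgn(w)

def l2_shrink(w, eta, lmbda, n):
    """Regularization part of Equation (98): rescale by 1 - eta * lmbda / n."""
    return w * (1.0 - eta * lmbda / n)

eta, lmbda, n = 0.5, 5.0, 1000  # illustrative hyper-parameters
for w in (10.0, 0.1):
    l1_amount = w - l1_shrink(w, eta, lmbda, n)
    l2_amount = w - l2_shrink(w, eta, lmbda, n)
    print(w, l1_amount, l2_amount)
```

With these values the L1 step is about 0.0025 for both weights, while the L2 step is about 0.025 for w = 10 and about 0.00025 for w = 0.1.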

**Dropout**:

Suppose we're trying to train a network:

![](http://neuralnetworksanddeeplearning.com/images/tikz30.png)

We start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched. After doing this, we'll end up with a network along the following lines. Note that the dropout neurons, i.e., the neurons which have been temporarily deleted, are still ghosted in:

![](http://neuralnetworksanddeeplearning.com/images/tikz31.png)

We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. After doing this over a mini-batch of examples, we update the appropriate weights and biases. We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network.

By repeating this process over and over, our network will learn a set of weights and biases. Of course, those weights and biases will have been learnt under conditions in which half the hidden neurons were dropped out. When we actually run the full network, twice as many hidden neurons will be active, so to compensate we halve the weights outgoing from the hidden neurons.
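The masking step can be sketched in a few lines. This is not taken from `network2`: the function names, the list-based representation and the sample activations are all assumptions. The test-time halving of outgoing weights is included as the usual compensation for having twice as many active hidden neurons.

```python
import random

def dropout_mask(num_hidden, rng):
    """Return a 0/1 mask that keeps a random half of the hidden neurons."""
    kept = set(rng.sample(range(num_hidden), num_hidden // 2))
    return [1 if j in kept else 0 for j in range(num_hidden)]

def apply_mask(hidden_activations, mask):
    """Zero the activations of the temporarily deleted neurons."""
    return [a * m for a, m in zip(hidden_activations, mask)]

def halve_outgoing(weights):
    """Assumed test-time compensation: scale outgoing weights by 1/2."""
    return [[0.5 * w for w in row] for row in weights]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
hidden = [0.2, 0.9, 0.5, 0.7]          # made-up hidden-layer activations
mask = dropout_mask(len(hidden), rng)  # a fresh mask is drawn per mini-batch
dropped = apply_mask(hidden, mask)
print(mask, dropped)
```

In a real training loop the same mask would also be applied during backpropagation, so that deleted neurons receive no gradient for that mini-batch.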

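Dropping out different sets of neurons is rather like training different neural networks, so dropout acts like averaging the effects of a very large number of networks. **The reason this helps is that the different networks may overfit in different ways, and averaging may help eliminate that kind of overfitting.**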

## Handwriting recognition revisited: the code

```python
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
# Positional arguments: 30 epochs, mini-batch size 10, learning rate eta = 0.5.
net.SGD(training_data, 30, 10, 0.5,
        lmbda=5.0,
        evaluation_data=validation_data,
        monitor_evaluation_accuracy=True,
        monitor_evaluation_cost=True,
        monitor_training_accuracy=True,
        monitor_training_cost=True)
```

Epoch 0 training complete
Cost on training data: 0.478850757921
Accuracy on training data: 47053 / 50000
Cost on evaluation data: 0.780147817261
Accuracy on evaluation data: 9425 / 10000

Epoch 1 training complete
Cost on training data: 0.443697058681
Accuracy on training data: 47465 / 50000
Cost on evaluation data: 0.84408265512
Accuracy on evaluation data: 9494 / 10000

Epoch 2 training complete
Cost on training data: 0.436786090824
Accuracy on training data: 47677 / 50000
Cost on evaluation data: 0.890229406699
Accuracy on evaluation data: 9519 / 10000

Epoch 3 training complete
Cost on training data: 0.407652260249
Accuracy on training data: 48001 / 50000
Cost on evaluation data: 0.892164701353
Accuracy on evaluation data: 9566 / 10000

Epoch 4 training complete
Cost on training data: 0.399290825757
Accuracy on training data: 47998 / 50000
Cost on evaluation data: 0.902849979457
Accuracy on evaluation data: 9577 / 10000

Epoch 5 training complete
Cost on training data: 0.3945055283

([0.78014781726142135,
  0.84408265511979197,
  0.89022940669924955,
  0.89216470135278947,
  0.90284997945730083,
  0.90405773553029234,
  0.98214172688143564,
  0.92401680288569232,
  0.93584563114802966,
  0.92713279440004093,
  0.95271079188215801,
  0.9419838195623853,
  0.95433245102496667,
  0.94235134253021169,
  1.0012166045938655,
  0.95440628431329322,
  0.93520360509257461,
  0.9333497511445531,
  0.93306141544043952,
  0.9625131922963579,
  0.94581322639251653,
  0.94133942830841644,
  0.94570732200907814,
  0.98827400271308619,
  0.95700885402158575,
  0.93628486006556444,
  0.9543149692086571,
  0.9342760162199828,
  0.9472039946502453,
  0.93841539893931469],
 [9425,
  9494,
  9519,
  9566,
  9577,
  9597,
  9526,
  9615,
  9589,
  9625,
  9574,
  9602,
  9582,
  9619,
  9549,
  9601,
  9637,
  9636,
  9635,
  9595,
  9622,
  9629,
  9625,
  9566,
  9578,
  9637,
  9609,
  9634,
  9614,
  9620],
 [0.47885075792071663,
  0.44369705868077208,
  0.43678609082354852,
  0.40