# Regularization

Reference <http://neuralnetworksanddeeplearning.com/chap3.html#regularization>

Regularization is a technique that can reduce overfitting.

Background on what overfitting is and why regularization helps: <http://neuralnetworksanddeeplearning.com/chap3.html#why_does_regularization_help_reduce_overfitting>

Suppose our network mostly has small weights, as will tend to happen in a regularized network. The smallness of the weights means that the behavior of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. 

A network with large weights may change its behaviour quite a bit in response to small changes in the input. An unregularized network can use large weights to learn a complex model that carries a lot of information about the noise in the training data.

Regularized networks are constrained to build relatively simple models based on patterns seen often in the training data, and are resistant to learning peculiarities of the noise in the training data. This forces our networks to do real learning about the phenomenon at hand, and to generalize better from what they learn.

No-one has yet developed an entirely convincing theoretical explanation for why regularization helps networks generalize.

Our networks already generalize better than one might a priori expect. A network with 100 hidden neurons has nearly 80,000 parameters. We have only 50,000 images in our training data. It's like trying to fit an 80,000th degree polynomial to 50,000 data points. By all rights, our network should overfit terribly. And yet, as we saw earlier, such a network actually does a pretty good job generalizing. Why is that the case? It's not well understood. It has been conjectured in "Gradient-Based Learning Applied to Document Recognition", by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998), that "the dynamics of gradient descent learning in multilayer nets has a self-regularization effect".
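The "nearly 80,000 parameters" figure can be checked with a few lines of arithmetic. The sketch below assumes a fully connected `[784, 100, 10]` network (the architecture used in the experiments in these notes) with one bias per non-input neuron; `count_parameters` is a hypothetical helper, not part of `network2.py`.

```python
def count_parameters(sizes):
    """Count weights and biases in a fully connected network.

    `sizes` lists the neurons per layer, e.g. [784, 100, 10].
    The input layer has no biases.
    """
    # One weight for every connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    # One bias for every neuron outside the input layer.
    biases = sum(sizes[1:])
    return weights + biases


print(count_parameters([784, 100, 10]))  # 79510
```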


## L2 regularization

The idea of _weight decay_ or _L2 regularization_ is to add a _regularization term_ to the cost function. Intuitively, the effect of regularization is to make the network prefer to learn small weights, all other things being equal.

$\begin{eqnarray}  C = C_0 + \frac{\lambda}{2n}
\sum_w w^2,
\tag{87}\end{eqnarray}$
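As a rough illustration of the formula above, the following sketch adds the L2 penalty to an already computed unregularized cost. The function name and the flat list of weights are assumptions made for the example; in a real network the sum runs over every weight matrix, and the biases are left out.

```python
def l2_regularized_cost(c0, weights, lmbda, n):
    """Return C = C_0 + (lambda / 2n) * sum of squared weights.

    c0      -- unregularized cost C_0
    weights -- flat iterable of all weights (biases are not included)
    lmbda   -- regularization parameter lambda (> 0)
    n       -- size of the training set
    """
    penalty = (lmbda / (2.0 * n)) * sum(w * w for w in weights)
    return c0 + penalty


# Squared weights sum to 14, so the penalty is (5 / 20) * 14 = 3.5.
print(l2_regularized_cost(0.5, [1.0, -2.0, 3.0], lmbda=5.0, n=10))  # 4.0
```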

$\begin{eqnarray} 
  \frac{\partial C}{\partial w} & = & \frac{\partial C_0}{\partial w} + 
  \frac{\lambda}{n} w \tag{88}\\ 
  \frac{\partial C}{\partial b} & = & \frac{\partial C_0}{\partial b}.
\tag{89}\end{eqnarray}$

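Backpropagation computes the partial derivatives $\partial C_0/\partial w$ and $\partial C_0/\partial b$ as usual, which gives the following gradient descent update rules: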

$\begin{eqnarray}
b & \rightarrow & b -\eta \frac{\partial C_0}{\partial b}.
\tag{90}\end{eqnarray}$

$\begin{eqnarray} 
  w & \rightarrow & w-\eta \frac{\partial C_0}{\partial
    w}-\frac{\eta \lambda}{n} w \tag{91}\\ 
  & = & \left(1-\frac{\eta \lambda}{n}\right) w -\eta \frac{\partial
    C_0}{\partial w}. 
\tag{92}\end{eqnarray}$

This is exactly the same as the usual gradient descent learning rule, except we first rescale the weight $w$ by the factor $1-\frac{\eta\lambda}{n}$. This rescaling is what is meant by _weight decay_, since it makes the weights smaller.

Stochastic gradient descent works in the same way. 
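A minimal sketch of one regularized stochastic gradient descent step for a single weight and bias, assuming the gradients of $C_0$ have already been computed for each example in the mini-batch. The function and argument names are illustrative and do not come from `network2.py`.

```python
def sgd_l2_step(w, b, grads_w, grads_b, eta, lmbda, n):
    """One mini-batch update with L2 regularization.

    grads_w, grads_b -- per-example gradients of the unregularized cost
    eta              -- learning rate
    lmbda            -- regularization parameter
    n                -- size of the full training set (not the mini-batch)
    """
    m = len(grads_w)                      # mini-batch size
    decay = 1.0 - eta * lmbda / n         # weight decay factor
    new_w = decay * w - (eta / m) * sum(grads_w)
    new_b = b - (eta / m) * sum(grads_b)  # biases are not decayed
    return new_w, new_b


# With zero gradients only the decay factor acts on the weight.
print(sgd_l2_step(2.0, 1.0, [0.0, 0.0], [0.0, 0.0], eta=0.5, lmbda=5.0, n=50000))
```

Note that the decay factor depends on the full training set size `n`, while the gradient term is averaged over the mini-batch size `m`.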


### Cross-entropy regularized:

$\begin{eqnarray} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln
(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2.
\tag{85}\end{eqnarray}$

### Quadratic cost regularized:

$C = \frac{1}{2n} \sum_x \|y-a^L\|^2 +
  \frac{\lambda}{2n} \sum_w w^2\tag{86}$


## L1 regularization

<http://neuralnetworksanddeeplearning.com/chap3.html#other_techniques_for_regularization>

In this approach we modify the unregularized cost function by adding the sum of the absolute values of the weights. This is similar to L2 regularization, penalizing large weights, and tending to make the network prefer small weights. In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount which is proportional to $w$.

$\begin{eqnarray}  C = C_0 + \frac{\lambda}{n} \sum_w |w|.
\tag{95}\end{eqnarray}$
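The difference between the two kinds of shrinkage can be seen by applying only the regularization part of each update, ignoring the gradient of $C_0$. This is a sketch under that simplifying assumption; the helper names are made up for the example, and `sign(0)` is taken to be 0.

```python
def sign(w):
    """Return -1, 0 or 1 according to the sign of w."""
    return (w > 0) - (w < 0)


def l1_shrink(w, eta, lmbda, n):
    """L1: move w toward 0 by the constant amount eta * lmbda / n."""
    return w - (eta * lmbda / n) * sign(w)


def l2_shrink(w, eta, lmbda, n):
    """L2: rescale w by the factor 1 - eta * lmbda / n."""
    return (1.0 - eta * lmbda / n) * w


# Compare one large and one small weight with eta * lmbda / n = 0.1.
for w in (10.0, 0.5):
    print(w, l1_shrink(w, 0.1, 1.0, 1), l2_shrink(w, 0.1, 1.0, 1))
```

For a large weight the L1 step removes much less than the L2 step, while for a small weight it removes more, which is why L1 tends to drive unimportant weights toward zero.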

## Dropout

With dropout, we start by randomly (and temporarily) deleting half the hidden neurons in the network, while leaving the input and output neurons untouched. We forward-propagate the input x through the modified network, and then backpropagate the result, also through the modified network. After doing this over a mini-batch of examples, we update the appropriate weights and biases. We then repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete, estimating the gradient for a different mini-batch, and updating the weights and biases in the network. "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." (from <https://arxiv.org/pdf/1207.0580.pdf>)

<http://neuralnetworksanddeeplearning.com/chap3.html#other_techniques_for_regularization>
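The "randomly and temporarily delete half the hidden neurons" step can be sketched as a 0/1 mask that is redrawn for every mini-batch. This is only an illustration of the masking idea with made-up helper names and toy activations; it is not how `network2.py` works (that file does not implement dropout).

```python
import random


def dropout_mask(num_hidden, rng, drop_fraction=0.5):
    """Return a list of 0/1 flags; 0 marks a temporarily deleted neuron."""
    num_dropped = int(num_hidden * drop_fraction)
    dropped = set(rng.sample(range(num_hidden), num_dropped))
    return [0 if j in dropped else 1 for j in range(num_hidden)]


def apply_mask(activations, mask):
    """Zero the activations of the dropped hidden neurons."""
    return [a * m for a, m in zip(activations, mask)]


rng = random.Random(0)                    # seeded for reproducibility
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.3]   # toy hidden-layer activations
for batch in range(2):
    # A fresh random subset is chosen for each mini-batch.
    mask = dropout_mask(len(hidden), rng)
    print(mask, apply_mask(hidden, mask))
```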

## Artificially expanding the training data

The classification accuracies improve considerably as we use more training data (see the original book).

Obtaining more training data is a great idea. Unfortunately, it can be expensive, and so is not always possible in practice.

Suppose, for example, that we take an MNIST training image of a five, and rotate it by a small amount, let's say 15 degrees. It's still recognizably the same digit. And yet at the pixel level it's quite different to any image currently in the MNIST training data. 

We can expand our training data by making many small rotations of all the MNIST training images, and then using the expanded training data to improve our network's performance.
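A small sketch of expanding a data set with rotated copies, written in plain Python so it has no dependencies. It assumes each image is a 2-D list of pixel values (the MNIST loader used in these notes stores images as flattened vectors, so they would need reshaping first), uses nearest-neighbour sampling, and fills pixels that fall outside the original image with 0. Both function names are hypothetical.

```python
import math


def rotate_image(image, degrees):
    """Rotate a 2-D list of pixels about its centre (nearest neighbour)."""
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Inverse mapping: find the source pixel for each output pixel.
            dx, dy = x - cx, y - cy
            src_x = int(round(cx + cos_t * dx + sin_t * dy))
            src_y = int(round(cy - sin_t * dx + cos_t * dy))
            if 0 <= src_x < cols and 0 <= src_y < rows:
                rotated[y][x] = image[src_y][src_x]
    return rotated


def expand_with_rotations(dataset, angles=(-15, 15)):
    """Return the (image, label) pairs plus one rotated copy per angle."""
    expanded = list(dataset)
    for image, label in dataset:
        for angle in angles:
            expanded.append((rotate_image(image, angle), label))
    return expanded


tiny = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]   # toy 3x3 "image"
print(len(expand_with_rotations([(tiny, 1)])))  # 3
```

The labels are copied unchanged, which is only valid for transformations small enough that the digit stays recognizable.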

This idea is very powerful and has been widely used. Let's look at some of the results from the paper "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis", by Patrice Simard, Dave Steinkraus, and John Platt (2003), which applied several variations of the idea to MNIST. <https://ieeexplore.ieee.org/document/1227801/>

Variations on this idea can be used to improve performance on many learning tasks, not just handwriting recognition. The general principle is to expand the training data by applying operations that reflect real-world variation. 




```python
# Reset and load python functions
%reset
%run python/download_data.py
%run python/mnist_loader.py
%run python/network2.py
%run python/overfitting.py

# Experiment 3

training_data, validation_data, test_data = load_data_wrapper()
net = Network([784, 100, 10], cost=CrossEntropyCost)
net.large_weight_initializer()
# Hyper-parameters:
num_epochs_3 = 30
batch_size_3 = 10
learning_3 = 0.5
lmbda_3 = 5.0
net.SGD(
    training_data, num_epochs_3, batch_size_3, learning_3,
    evaluation_data=validation_data, lmbda=lmbda_3,
    monitor_evaluation_accuracy=True)
```

```text
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Epoch 0 training complete
Cost on training data: 122856.39730641841
Accuracy on training data: 47245 / 50000
Cost on evaluation data: 122856.41263689839
Accuracy on evaluation data: 9428 / 10000
Epoch 1 training complete
Cost on training data: 76528.48556196361
Accuracy on training data: 47879 / 50000
Cost on evaluation data: 76528.51504255563
Accuracy on evaluation data: 9531 / 10000
Epoch 2 training complete
Cost on training data: 48659.80088618216
Accuracy on training data: 48413 / 50000
Cost on evaluation data: 48659.8423165974
Accuracy on evaluation data: 9623 / 10000
Epoch 3 training complete
Cost on training data: 31955.063780332912
Accuracy on training data: 48750 / 50000
Cost on evaluation data: 31955.110152876627
Accuracy on evaluation data: 9665 / 10000
Epoch 4 training complete
Cost on training data: 21956.777308635923
Accuracy on training data: 48734 / 50000
Cost on evaluation data: 21956.83116751666
Accuracy 
```

```python
# Experiment 4 - Some tuning - 60 epochs at η=0.1 and λ=5.0

training_data, validation_data, test_data = load_data_wrapper()
net = Network([784, 100, 10], cost=CrossEntropyCost)
net.large_weight_initializer()
# Hyper-parameters:
num_epochs_4 = 60
batch_size_4 = 10
learning_4 = 0.1
lmbda_4 = 5.0
net.SGD(
    training_data, num_epochs_4, batch_size_4, learning_4,
    evaluation_data=validation_data, lmbda=lmbda_4,
    monitor_evaluation_accuracy=True)
```

```text
Epoch 0 training complete
Accuracy on evaluation data: 8945 / 10000
Epoch 1 training complete
Accuracy on evaluation data: 9171 / 10000
Epoch 2 training complete
Accuracy on evaluation data: 9294 / 10000
Epoch 3 training complete
Accuracy on evaluation data: 9376 / 10000
Epoch 4 training complete
Accuracy on evaluation data: 9425 / 10000
Epoch 5 training complete
Accuracy on evaluation data: 9470 / 10000
Epoch 6 training complete
Accuracy on evaluation data: 9507 / 10000
Epoch 7 training complete
Accuracy on evaluation data: 9513 / 10000
Epoch 8 training complete
Accuracy on evaluation data: 9553 / 10000
Epoch 9 training complete
Accuracy on evaluation data: 9583 / 10000
Epoch 10 training complete
Accuracy on evaluation data: 9593 / 10000
Epoch 11 training complete
Accuracy on evaluation data: 9614 / 10000
Epoch 12 training complete
Accuracy on evaluation data: 9638 / 10000
Epoch 13 training complete
Accuracy on evaluation data: 9639 / 10000
Epoch 14 training complete
Accuracy on evalu
```

The value returned by `net.SGD` follows. Its only non-empty list holds the evaluation accuracy (correct classifications out of 10,000) after each of the 60 epochs:

```text
([],
 [8945, 9171, 9294, 9376, 9425, 9470, 9507, 9513, 9553, 9583,
  9593, 9614, 9638, 9639, 9652, 9670, 9675, 9670, 9693, 9700,
  9709, 9713, 9711, 9731, 9731, 9734, 9739, 9740, 9739, 9744,
  9736, 9749, 9750, 9748, 9763, 9760, 9759, 9765, 9764, 9756,
  9758, 9757, 9769, 9779, 9776, 9772, 9786, 9779, 9786, 9775,
  9780, 9781, 9760, 9790, 9786, 9800, 9789, 9792, 9786, 9795],
 [],
 [])
```