# Weight initialization

<http://neuralnetworksanddeeplearning.com/chap3.html#weight_initialization>

Math:

-  https://en.wikipedia.org/wiki/Normal_distribution 

## Large weight 

- Initialize the weights as independent Gaussian random variables, normalized to have mean 0 and standard deviation 1.
- $z = \sum_j w_j x_j + b$ is then a sum of many such Gaussian random variables, so its distribution is very broad when the neuron has many active inputs.
- $|z|$ will likely be large ($z \gg 1$ or $z \ll -1$), so $\sigma(z)$ will be very close to either 1 or 0.
- That means the hidden neuron will have saturated. When that happens, small changes in the weights produce only minuscule changes in the activation of the neuron.
- As a result, those weights learn very slowly under gradient descent.
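
A worked example of how broad $z$ becomes. The numbers are assumptions chosen for illustration (they follow the example in the linked chapter): a neuron with 1000 inputs, of which 500 are 1 and 500 are 0, and a bias that is also drawn from a Gaussian with mean 0 and standard deviation 1.

$$
\operatorname{Var}(z) = \sum_j x_j^2 \operatorname{Var}(w_j) + \operatorname{Var}(b) = 500 \cdot 1 + 1 = 501,
\qquad
\sqrt{501} \approx 22.4
$$

With a standard deviation of about 22.4, most samples of $z$ fall far outside the narrow range where $\sigma(z)$ is not close to 0 or 1.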

## Better weight

- Initialize the weights as independent Gaussian random variables, normalized to have mean 0 and standard deviation $1/\sqrt{n_{\rm in}}$.
- $n_{\rm in}$ = number of input weights to the neuron.
- $z = \sum_j w_j x_j + b$ will again be a Gaussian random variable with mean 0, but it will be much more sharply peaked than before.
- This reduces the cases where the neuron saturates, and so reduces the learning slowdown.
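
The difference between the two schemes can be checked with a small simulation. This is only a sketch under stated assumptions: 1000 inputs with half of them set to 1, a bias drawn from a Gaussian with standard deviation 1, and "saturated" taken to mean an activation below 0.01 or above 0.99. The function names are made up for this note.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_z(n_in, weight_std, rng):
    """Draw one weighted input z for a neuron with n_in inputs.

    Assumption: half of the inputs are 1 and the rest are 0, so only
    n_in // 2 weights contribute. The bias is drawn from N(0, 1).
    """
    active = n_in // 2
    z = sum(rng.gauss(0.0, weight_std) for _ in range(active))
    return z + rng.gauss(0.0, 1.0)


def saturated_fraction(n_in, weight_std, trials=1000, seed=0):
    """Fraction of sampled neurons whose activation is below 0.01 or above 0.99."""
    rng = random.Random(seed)
    saturated = 0
    for _ in range(trials):
        a = sigmoid(sample_z(n_in, weight_std, rng))
        if a < 0.01 or a > 0.99:
            saturated += 1
    return saturated / trials


n_in = 1000
large = saturated_fraction(n_in, weight_std=1.0)
better = saturated_fraction(n_in, weight_std=1.0 / math.sqrt(n_in))
print(f"std 1:            {large:.3f} of neurons saturated")
print(f"std 1/sqrt(n_in): {better:.3f} of neurons saturated")
```

Under these assumptions the first fraction should come out large and the second close to zero, which matches the argument in the two lists above.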

## Example

Check original book ...

# Network hyper-parameters

<http://neuralnetworksanddeeplearning.com/chap3.html#how_to_choose_a_neural_network's_hyper-parameters>

In practice, when you're using neural nets to attack a problem, it can be difficult to find good hyper-parameters.

The hyper-parameters also interact with each other, so tuning one can upset a value you had already settled on for another.
 
You should be on the lookout for signs that things aren't working, and be willing to experiment. In particular, this means carefully monitoring your network's behaviour, especially the validation accuracy.

Hyper-parameter optimization is not a problem that is ever completely solved.


## Broad strategy


- Strip the problem down (e.g. try to distinguish only images of 0 and 1, and remove all other images from the dataset).
- Reduce the training set.
- Increase the frequency of monitoring.

Example

- Reduce the dataset to 1000 images.
- Change $\lambda$ in proportion to the dataset size to keep the weight decay the same (e.g. $\lambda = 1000$ for the entire dataset, $\lambda = 20$ for the reduced dataset of 1000 images).
- Try different values of $\eta$.
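
The rescaling of $\lambda$ can be written down directly. This sketch assumes L2 regularization, where each update shrinks the weights by a factor $1 - \eta\lambda/n$, so keeping $\lambda/n$ constant keeps the decay constant. It also assumes the full training set has 50,000 images, which is what makes the two values above consistent. The helper name is hypothetical.

```python
def rescale_lambda(lmbda, n_old, n_new):
    """Return the lambda that keeps lambda / n, and so the weight decay, unchanged."""
    return lmbda * n_new / n_old


full_size = 50_000   # assumed size of the full training set
small_size = 1_000   # reduced training set from the example
print(rescale_lambda(1000.0, full_size, small_size))  # 20.0
```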

## Learning rate

If the cost oscillates during training, $\eta$ is probably too large. Stochastic gradient descent is supposed to step us gradually down into a valley of the cost function. If $\eta$ is too large, the steps may overshoot the minimum, causing the algorithm to climb up out of the valley instead.

We can set $\eta$ as follows:

- First, estimate the threshold value of $\eta$ at which the cost on the training data immediately begins decreasing, instead of oscillating or increasing.
- Then set $\eta$ below the threshold, say by a factor of two. Such a choice will typically allow you to train for many epochs without causing too much of a slowdown in learning.
- The primary purpose of $\eta$ is to control the step size in gradient descent, and monitoring the training cost is the best way to detect whether the step size is too big (although monitoring the validation data works much the same in practice).
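
The procedure above can be sketched on a toy problem. Everything here is an assumption made for illustration: the cost is the one-dimensional quadratic $C(w) = a w^2 / 2$ rather than a real network, the candidate values of $\eta$ are spaced by factors of ten, and the "threshold" is simply the largest candidate for which the cost falls at every step.

```python
def training_costs(eta, steps=10, a=4.0, w0=1.0):
    """Run plain gradient descent on the toy cost C(w) = a * w**2 / 2."""
    w = w0
    costs = [0.5 * a * w * w]
    for _ in range(steps):
        w -= eta * a * w  # the gradient of C is a * w
        costs.append(0.5 * a * w * w)
    return costs


def decreases_immediately(costs):
    """True if the cost falls at every recorded step, starting with the first."""
    return all(later < earlier for earlier, later in zip(costs, costs[1:]))


def pick_eta(candidates):
    """Return (estimated threshold, chosen eta = threshold / 2)."""
    good = [eta for eta in candidates
            if decreases_immediately(training_costs(eta))]
    threshold = max(good)
    return threshold, threshold / 2


threshold, eta = pick_eta([0.001, 0.01, 0.1, 1.0, 10.0])
print("estimated threshold:", threshold)
print("chosen eta:", eta)
```

For a real network, `training_costs` would be replaced by a few epochs of training that record the training cost, and the coarse estimate could then be refined with more closely spaced candidates.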

## Early stopping

Early stopping means terminating training if the best classification accuracy on the validation data doesn't improve for quite some time.

I suggest using the no-improvement-in-ten rule (stop if the accuracy has not improved in the last ten epochs) for initial experimentation, and gradually adopting more lenient rules.
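
A minimal sketch of a no-improvement-in-n check, assuming the validation accuracy is recorded once per epoch in a list. The function names and the `patience` parameter are made up for this note.

```python
def epochs_since_best(accuracies):
    """Number of epochs elapsed since the best accuracy was first reached."""
    best_epoch = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return len(accuracies) - 1 - best_epoch


def should_stop(accuracies, patience=10):
    """No-improvement-in-`patience` rule: stop if the best accuracy is
    at least `patience` epochs old. Ties do not count as improvement."""
    return len(accuracies) > patience and epochs_since_best(accuracies) >= patience


history = [0.90, 0.95] + [0.94] * 10   # made-up accuracies for illustration
print(should_stop(history))            # True: best value is 10 epochs old
print(should_stop(history[:-1]))       # False: only 9 epochs without improvement
```

A more lenient rule is obtained by raising `patience`, for example to 20 or 50.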

## Learning rate schedule

For first experiments my suggestion is to use a single, constant value for the learning rate. That'll get you a good first approximation. Later, if you want to obtain the best performance from your network, it's worth experimenting with a learning rate schedule.

The idea is to hold the learning rate constant until the validation accuracy starts to get worse. Then decrease the learning rate by some amount, say a factor of two or ten.
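
One possible reading of this idea, as a sketch: treat "starts to get worse" as "falls below the best validation accuracy seen so far", and divide $\eta$ by a fixed factor each time that happens. Both choices, and the function name, are assumptions. A practical version would more likely wait for several epochs without improvement before decreasing the rate.

```python
def learning_rates(accuracies, eta0, factor=2.0):
    """Return the learning rate in effect after each epoch.

    accuracies: validation accuracy per epoch.
    The rate is divided by `factor` whenever the accuracy drops
    below the best value seen so far.
    """
    eta = eta0
    best = float("-inf")
    etas = []
    for acc in accuracies:
        if acc < best:
            eta /= factor
        best = max(best, acc)
        etas.append(eta)
    return etas


# Made-up accuracies for illustration.
print(learning_rates([0.90, 0.93, 0.92, 0.94, 0.93], eta0=0.5))
# [0.5, 0.5, 0.25, 0.25, 0.125]
```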

## Mini-batch size

Too small, and you don't get to take full advantage of the benefits of good matrix libraries optimized for fast hardware. Too large and you're simply not updating your weights often enough. What you need is to choose a compromise value which maximizes the speed of learning.

Fortunately, the choice of mini-batch size at which the speed is maximized is relatively independent of the other hyper-parameters.

The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then try a number of different mini-batch sizes, scaling $\eta$ as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance.
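
The comparison can also be done numerically instead of by eye. This sketch assumes each run has been logged as a list of `(elapsed_seconds, validation_accuracy)` pairs and picks the mini-batch size with the best accuracy inside a fixed time budget. The log format, function names, and numbers are placeholders, not measurements.

```python
def accuracy_at(history, budget_seconds):
    """Best validation accuracy reached within the time budget.

    history: list of (elapsed_seconds, validation_accuracy) pairs.
    """
    reached = [acc for elapsed, acc in history if elapsed <= budget_seconds]
    return max(reached) if reached else 0.0


def fastest_batch_size(histories, budget_seconds):
    """Mini-batch size whose run reached the best accuracy within the budget."""
    return max(histories,
               key=lambda size: accuracy_at(histories[size], budget_seconds))


# Placeholder logs, invented for illustration only.
histories = {
    1:   [(30, 0.80), (60, 0.85)],
    10:  [(30, 0.90), (60, 0.93)],
    100: [(30, 0.70), (60, 0.88)],
}
print(fastest_batch_size(histories, budget_seconds=60))
```

The elapsed times would come from a wall-clock timer such as `time.perf_counter()` wrapped around the training loop.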

## Automated techniques

Hand-optimization is a good way to build up a feel for how neural networks behave. However, and unsurprisingly, a great deal of work has been done on automating the process. A common technique is grid search, which systematically searches through a grid in hyper-parameter space. 
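
A minimal grid search sketch. The `evaluate` callback stands in for "train the network with these hyper-parameters and return the validation accuracy". The `toy_score` function below is a made-up placeholder so that the example runs on its own, and the grid values are arbitrary.

```python
import itertools
import math


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_params, best_score).

    grid: dict mapping a hyper-parameter name to a list of candidate values.
    evaluate: function taking the hyper-parameters as keyword arguments
              and returning a score where higher is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(eta, lmbda):
    """Placeholder objective, peaked at eta = 0.1 and lmbda = 1.0."""
    return -(math.log10(eta) + 1) ** 2 - math.log10(lmbda) ** 2


grid = {"eta": [0.01, 0.1, 1.0], "lmbda": [0.1, 1.0, 10.0]}
params, score = grid_search(toy_score, grid)
print(params)
```

The number of combinations grows multiplicatively with each added hyper-parameter, which is why grids are usually kept coarse.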

See the book for more details and for pointers to books and papers on this topic.

## Type of neuron



See <http://neuralnetworksanddeeplearning.com/chap3.html#other_models_of_artificial_neuron>

## Datasets

Use the validation accuracy (measured on the validation data) to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on.