# Problems

* vanishing/exploding gradients
* training would be extremely slow
* too many parameters would risk overfitting

# Vanishing/Exploding gradients

Backpropagation works from the output layer back to the input layer, computing the gradient of the cost function with regard to each parameter in the network along the way.
Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the connection weights of the lower layers are left virtually unchanged, and training never converges to a good solution. This is the vanishing gradients problem.
In some cases the opposite happens: the gradients grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is the exploding gradients problem.


## reason
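With the older combination of logistic (sigmoid) activation and the initialization scheme used at the time, the variance of each layer's outputs is much greater than the variance of its inputs. Going forward through the network, the variance keeps increasing layer after layer until the activation function saturates in the top layers. For example, when the inputs to the logistic function become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. When backpropagation kicks in, there is virtually no gradient left to propagate back through the network.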
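
The saturation effect can be checked numerically. The following is a minimal sketch using only the standard library; the function names and the depth of 10 layers are assumptions chosen for illustration, and the weights are ignored to keep the arithmetic visible.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses as |z| grows (saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.6f}  derivative={sigmoid_derivative(z):.6f}")

# Backpropagation multiplies one such factor per layer (weights ignored here),
# so even the best case of 0.25 per layer shrinks quickly with depth.
n_layers = 10  # hypothetical depth, for illustration only
best_case_factor = sigmoid_derivative(0.0) ** n_layers
print(f"best-case gradient factor after {n_layers} layers: {best_case_factor:.2e}")
```

Even when every unit sits at the point of maximum slope, the gradient reaching the lowest layer is scaled by less than one millionth after 10 layers; saturated units make it far smaller.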

## solution

### Initialization 
<img src="Initialization.png"/>

### Nonsaturating activation functions
* Leaky ReLU: $LReLU_{\alpha}(z) = \max(\alpha z, z)$, where $\alpha$ is typically set to 0.01 (a small leak).

* Randomized leaky ReLU (RReLU): $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing. It also acts as a regularizer, which reduces the risk of overfitting.

* Parametric leaky ReLU (PReLU): $\alpha$ is learned during training (instead of being a hyperparameter, it becomes a parameter that backpropagation can modify). <font color='red'>It outperforms ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.</font>

* Exponential linear unit (ELU):
$$ ELU_{\alpha}(z)=\left\{
\begin{aligned}
&\alpha(\exp(z)-1)  & \text{if } z<0 \\
&z & \text{if } z \geq 0\\
\end{aligned}
\right.
$$
$\alpha$ is usually set to 1.

<img src='ELU.png' />
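
The activation functions listed above can be sketched in plain Python as follows. This is an illustrative scalar version, not a library implementation: the function names are made up for this example, and the RReLU range `low=0.1, high=0.3` is a hypothetical choice, since the notes only say "a given range".

```python
import math
import random

def leaky_relu(z, alpha=0.01):
    """LReLU(z) = max(alpha * z, z); valid for 0 <= alpha < 1."""
    return max(alpha * z, z)

def rrelu(z, low=0.1, high=0.3, training=True, rng=random):
    """Randomized leaky ReLU: random alpha while training, average alpha at test time.

    The range [low, high] is an assumption for illustration.
    """
    alpha = rng.uniform(low, high) if training else (low + high) / 2.0
    return z if z >= 0 else alpha * z

def elu(z, alpha=1.0):
    """ELU(z) = alpha * (exp(z) - 1) if z < 0, else z."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-3.0, -0.5, 0.0, 2.0):
    print(f"z={z:5.1f}  leaky_relu={leaky_relu(z):8.4f}  "
          f"rrelu(test)={rrelu(z, training=False):8.4f}  elu={elu(z):8.4f}")
```

PReLU is omitted because it has the same forward computation as leaky ReLU; the difference is only that $\alpha$ is updated by backpropagation.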

### which activation?
In general: ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.
If you care more about runtime performance, you may prefer leaky ReLUs over ELUs, since the exponential makes ELU slower to compute.

### Batch Normalization
Batch Normalization addresses the internal covariate shift problem: the distribution of each layer's inputs changes during training as the parameters of the previous layers change.
#### Solution: 
Add an operation to the model just before the activation function of each layer: zero-center and normalize the inputs, then scale and shift the result using two new parameters per layer (one for scaling, one for shifting).
At test time there is no mini-batch to compute statistics from, so use the whole training set's mean and standard deviation instead.
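
The operation described above can be sketched for a single unit as follows. This is a minimal illustration under stated assumptions, not a framework implementation: the names `gamma` (scale), `beta` (shift), and `eps` (a small constant that avoids division by zero) are introduced here, and in the usage lines the mini-batch statistics stand in for training-set statistics.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's inputs over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n                              # 1. mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / n      # 2. mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps)         # 3. zero-center and normalize
             for x in batch]
    out = [gamma * xh + beta for xh in x_hat]          # 4. scale and shift
    return out, mean, var

def batch_norm_test(x, train_mean, train_var, gamma=1.0, beta=0.0, eps=1e-5):
    """At test time, reuse fixed statistics instead of mini-batch statistics."""
    x_hat = (x - train_mean) / math.sqrt(train_var + eps)
    return gamma * x_hat + beta

batch = [1.0, 2.0, 3.0, 4.0]
out, mean, var = batch_norm_train(batch, gamma=2.0, beta=1.0)
print("mean:", mean, "variance:", var)
print("normalized, scaled and shifted:", out)

# Stand-in for test time: here the mini-batch statistics play the role of
# the training-set statistics (an assumption made to keep the example short).
print("test-time output for x=2.5:", batch_norm_test(2.5, mean, var, gamma=2.0, beta=1.0))
```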

With Batch Normalization, the vanishing gradients problem is strongly reduced, to the point that saturating activation functions such as logistic and tanh become usable. Networks are also less sensitive to the weight initialization.

Batch Normalization adds some complexity to the model, and predictions are slower because of the extra computations at each layer. If you need predictions to be lightning-fast, check how well plain ELU + He initialization performs before turning to Batch Normalization.

### Gradient Clipping
A technique to lessen the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold.

In TensorFlow, the optimizer's <code>minimize()</code> function takes care of both computing the gradients and applying them. To clip them in between, call the optimizer's <code>compute_gradients()</code> method first, then create an operation to clip the gradients using <code>clip_by_value()</code>, and finally create an operation to apply the clipped gradients using the optimizer's <code>apply_gradients()</code> method.
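```
import tensorflow as tf  # TensorFlow 1.x API (tf.train optimizers)
threshold = 1.0
# `loss` and `learning_rate` are assumed to be defined earlier in the graph code.
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Returns a list of (gradient, variable) pairs instead of applying them right away.
grads_and_vars = optimizer.compute_gradients(loss)
# Clip every gradient element-wise to [-threshold, threshold].
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var) for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
```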
The gradients will be clipped between -1.0 and 1.0 and applied. The threshold is a hyperparameter you can tune.
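
To make the effect of clipping by value concrete without TensorFlow, here is a small framework-free sketch. The helper name `clip_gradients` and the sample gradient values are made up for illustration.

```python
def clip_gradients(gradients, threshold):
    """Clip each gradient component to the interval [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]

grads = [0.3, -2.5, 7.0, -0.9]   # hypothetical raw gradients
clipped = clip_gradients(grads, threshold=1.0)
print(clipped)  # [0.3, -1.0, 1.0, -0.9]
```

Components already inside the interval pass through unchanged, while the outliers are capped. Because each component is clipped independently, the direction of the gradient vector can change.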

# Reusing Pretrained Layers
It is generally not a good idea to train a very large DNN from scratch. Instead, try to find an existing neural network that accomplishes a similar task and reuse its lower layers. This not only speeds up training considerably, but also requires much less training data.
<img src = 'Reuse_layers.png'/>
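
The idea can be sketched without any framework by treating a network as a dictionary of weight matrices. Everything in this example is hypothetical (layer names, shapes, the `init_layer` and `reuse_lower_layers` helpers, and the random "pretrained" weights standing in for a real trained model); it only illustrates copying the lower layers and initializing the upper ones from scratch.

```python
import random

def init_layer(n_in, n_out, rng):
    """Randomly initialized weight matrix (n_in x n_out) as nested lists."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def reuse_lower_layers(pretrained, layers_to_reuse, new_layer_shapes, rng):
    """Copy the named layers from `pretrained`; create the other layers fresh."""
    model = {}
    for name in layers_to_reuse:
        # Copy row by row so the new model does not share lists with the old one.
        model[name] = [row[:] for row in pretrained[name]]
    for name, (n_in, n_out) in new_layer_shapes.items():
        model[name] = init_layer(n_in, n_out, rng)
    return model

rng = random.Random(42)

# Stand-in for a network already trained on a similar task.
pretrained = {
    "hidden1": init_layer(10, 8, rng),
    "hidden2": init_layer(8, 6, rng),
    "hidden3": init_layer(6, 4, rng),
    "output": init_layer(4, 3, rng),
}

# Reuse the two lowest layers; rebuild the upper layers for a new 2-class task.
new_model = reuse_lower_layers(
    pretrained,
    layers_to_reuse=["hidden1", "hidden2"],
    new_layer_shapes={"hidden3": (6, 4), "output": (4, 2)},
    rng=rng,
)
print(sorted(new_model))
```

Only the new upper layers start from random weights, which is why far less training data is needed than when training the whole network from scratch.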

```
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

x = tf.placeholder(tf.float32, shape=(None, 10), name='x')
hidden1 = fully_connected(x, 10, scope='hidden1')
```