# Welcome to Deep Learning!
# Main content
- Vanishing/Exploding gradients
- Training is slow with a large network
- Overfitting

# Vanishing/Exploding gradients
The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.

#### vanishing gradients problem
Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is the vanishing gradients problem.

#### exploding gradients problem
In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks.

#### A Paper “Understanding the Difficulty of Training Deep Feedforward Neural Networks” 
Although this unfortunate behavior has been empirically observed for quite a while (it was one of the reasons why deep neural networks were mostly abandoned for a long time), it was only around 2010 that significant progress was made in understanding it. A paper titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks” by Xavier Glorot and Yoshua Bengio found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1. In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks).

**Looking at the logistic activation function (see Figure 11-1), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus, when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.**

![11-1](images/11-1.png)
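
The saturation described above can be checked numerically. The following is a minimal NumPy sketch (the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration and are not from the original notes). It evaluates the derivative of the logistic function at a few points and computes a best-case bound on how much signal survives ten sigmoid layers when the weights are ignored.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at z = 0) and collapses once |z| is large.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Ignoring the weights, each sigmoid layer scales the backpropagated
# gradient by at most 0.25, so ten layers scale it by at most 0.25 ** 10.
best_case_factor = 0.25 ** 10
print("best-case factor after 10 layers:", best_case_factor)
```

Even in this best case the factor is below one in a million, which is one way to see why the lower layers receive almost no gradient.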



# Xavier and He Initialization
### Xavier initialization
For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

They proposed a good compromise that has proven to work very well in practice: the connection weights must be initialized randomly as described in Equation 11-1, where $n_\text{inputs}$ and $n_\text{outputs}$ are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called **Xavier initialization** (after the author’s first name), or sometimes **Glorot initialization**.

![11](images/e11-1.png)
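
Since Equation 11-1 is only available as an image here, the sketch below assumes the commonly stated form of Glorot initialization: a normal distribution with standard deviation $\sqrt{2 / (n_\text{inputs} + n_\text{outputs})}$, or a uniform distribution on $[-r, r]$ with $r = \sqrt{6 / (n_\text{inputs} + n_\text{outputs})}$. The function names and layer sizes are illustrative. It compares the variance of a linear layer's outputs under this scheme with the older standard-normal initialization.

```python
import numpy as np

def xavier_normal(n_inputs, n_outputs, rng):
    """Normal init with sigma = sqrt(2 / (n_inputs + n_outputs))."""
    sigma = np.sqrt(2.0 / (n_inputs + n_outputs))
    return rng.normal(0.0, sigma, size=(n_inputs, n_outputs))

def xavier_uniform(n_inputs, n_outputs, rng):
    """Uniform init on [-r, r] with r = sqrt(6 / (n_inputs + n_outputs))."""
    r = np.sqrt(6.0 / (n_inputs + n_outputs))
    return rng.uniform(-r, r, size=(n_inputs, n_outputs))

rng = np.random.default_rng(42)
n = 256                                  # illustrative layer width
X = rng.normal(size=(2000, n))           # inputs with variance close to 1

naive_W = rng.normal(0.0, 1.0, size=(n, n))   # mean 0, standard deviation 1
xavier_W = xavier_normal(n, n, rng)

var_naive = (X @ naive_W).var()          # grows roughly with the fan-in
var_xavier = (X @ xavier_W).var()        # stays close to the input variance
print("input variance:        ", X.var())
print("output variance, N(0,1):", var_naive)
print("output variance, Xavier:", var_xavier)
```

With equal fan-in and fan-out, the Xavier-initialized layer keeps the output variance near the input variance, while the standard-normal weights inflate it by roughly the number of inputs.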

**Using the Xavier initialization strategy can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning.**

### He initialization
Some recent papers have provided similar strategies for different activation functions, as shown in Table 11-1. The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called **He initialization** (after the last name of its author).

![11](images/t11-1.png)

By default, the `fully_connected()` function (introduced in Chapter 10) uses Xavier initialization (with a uniform distribution). You can change this to He initialization by using the `variance_scaling_initializer()` function.
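
To illustrate why ReLU layers call for a larger scale, here is a NumPy sketch that does not use the TensorFlow functions above. It assumes the fan-in form of He initialization from the He et al. paper, $\sigma = \sqrt{2 / n_\text{inputs}}$; Table 11-1 may state a variant based on both fan-in and fan-out. The helper `mean_square_after` and the layer sizes are hypothetical.

```python
import numpy as np

def he_normal(n_inputs, n_outputs, rng):
    """Fan-in form of He initialization: sigma = sqrt(2 / n_inputs)."""
    sigma = np.sqrt(2.0 / n_inputs)
    return rng.normal(0.0, sigma, size=(n_inputs, n_outputs))

def xavier_normal(n_inputs, n_outputs, rng):
    """Xavier/Glorot normal initialization, used here for comparison."""
    sigma = np.sqrt(2.0 / (n_inputs + n_outputs))
    return rng.normal(0.0, sigma, size=(n_inputs, n_outputs))

def mean_square_after(n_layers, init_fn, rng, n_units=512, batch=1000):
    """Mean squared activation after a stack of ReLU layers (no biases)."""
    a = rng.normal(size=(batch, n_units))
    for _ in range(n_layers):
        a = np.maximum(0.0, a @ init_fn(n_units, n_units, rng))
    return float(np.mean(a ** 2))

rng = np.random.default_rng(0)
he_ms = mean_square_after(10, he_normal, rng)
xavier_ms = mean_square_after(10, xavier_normal, rng)

# ReLU zeroes about half of each layer's pre-activations. The factor of 2
# in He initialization compensates for that, so the signal keeps a similar
# scale, while Xavier-scaled weights let it shrink layer after layer.
print("mean square after 10 ReLU layers, He:    ", he_ms)
print("mean square after 10 ReLU layers, Xavier:", xavier_ms)
```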

# Nonsaturating Activation Functions
#### ReLU activation function is better than sigmoid
- It does not saturate for positive values.
- It is quite fast to compute.

#### the ReLU activation function is not perfect.
It suffers from a problem known as **dying ReLUs**: during training, some neurons effectively die, meaning they stop outputting anything other than 0.
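
A small NumPy sketch of a dead neuron, under assumed values: the inputs, weights, and the large negative bias below are hypothetical. Once the weighted sum is negative for every training instance, both the output and the gradient are zero, so Gradient Descent has nothing to update the weights with.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for z > 0, otherwise 0 (taking 0 at z = 0).
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # hypothetical inputs in [0, 1)
w = np.array([0.5, -0.2, 0.1])             # hypothetical weights
b = -5.0                                   # hypothetical large negative bias

z = X @ w + b                              # negative for every instance
outputs = relu(z)

# Gradient of the summed outputs w.r.t. w (chain rule: relu'(z) * x).
grad_w = (relu_grad(z)[:, None] * X).sum(axis=0)

print("max output:", outputs.max())
print("gradient w.r.t. weights:", grad_w)
```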

## Leaky ReLU
This function is defined as $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$ (see Figure 11-2). The hyperparameter $\alpha$ defines how much the function “leaks”: it is the slope of the function for $z < 0$, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.

- Setting $\alpha$ = 0.2 (huge leak) seemed to result in better performance than $\alpha$ = 0.01 (small leak).
- **Randomized leaky ReLU (RReLU)**: $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer.
- **Parametric leaky ReLU (PReLU)**: $\alpha$ is learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). It strongly outperformed ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

![11](images/11-2.png)
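
The definition $\max(\alpha z, z)$ and its slope can be sketched in plain NumPy as follows. The function names are illustrative, and the formula assumes $0 \le \alpha < 1$. The point to notice is that the derivative is $\alpha$ rather than 0 for negative inputs, so some gradient always flows.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z), valid as a leaky ReLU for 0 <= alpha < 1."""
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope is 1 for z > 0 and alpha otherwise."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))               # small leak, alpha = 0.01
print(leaky_relu(z, alpha=0.2))    # larger leak
print(leaky_relu_grad(z))          # never exactly zero
```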

### How to use leaky ReLU in TensorFlow?
```
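import tensorflow as tf
from tensorflow.contrib.layers import fully_connected  # TensorFlow 1.x

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)  # alpha = 0.01

hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)  # X, n_hidden1 defined elsewhere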
```

## Exponential linear unit (ELU)
### Better than all ReLU variants
- Training time was reduced.
- The neural network performed better on the test set.

![11](images/e11-2.png)

### Differences compared with ReLU
- **First, it takes on negative values when z < 0, which allows the unit to have an average output closer to 0.** This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
- **Second, it has a nonzero gradient for z < 0, which avoids the dying units issue.**
- **Third, if α is equal to 1, the function is smooth everywhere**, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
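
The three properties above can be checked with a short NumPy sketch. It assumes the standard ELU definition, $\alpha(\exp(z) - 1)$ for $z < 0$ and $z$ otherwise, since the equation is only shown as an image here. The function names are illustrative.

```python
import numpy as np

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing for large positive z.
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)

def elu_grad(z, alpha=1.0):
    """alpha * exp(z) for z < 0, 1 otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, alpha * np.exp(np.minimum(z, 0.0)), 1.0)

z = np.array([-100.0, -1.0, -1e-9, 0.0, 2.0])
print(elu(z))        # negative outputs approach -alpha for very negative z
print(elu_grad(z))   # nonzero for z < 0; close to 1 just left of 0 when alpha = 1
```

With α different from 1, the slope just left of z = 0 is α while the slope on the right is 1, so the function has a kink there.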

### Drawbacks
The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

## So which activation function should you use for the hidden layers of your deep neural networks? 
Although your mileage will vary, **in general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic**. 
- If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs. 
- If you don’t want to tweak yet another hyperparameter, you may just use the default α values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). 
- If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular **RReLU** if your network is overfitting, or PReLU if you have a huge training set.

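### Parametric ReLU
In Leaky ReLU, α is usually assigned by hand based on prior knowledge. However, the derivative of the loss function with respect to α can be computed, so could α be treated as a parameter and trained?
Kaiming He's paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” shows that it can not only be trained, but that doing so also gives better results.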
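
As a toy illustration of training α by gradient descent, the sketch below fits a single PReLU unit to targets generated with α = 0.25. All values (inputs, targets, learning rate, starting α) are hypothetical, and the loss is a plain mean squared error; this is not the training setup from the paper.

```python
import numpy as np

def prelu(z, alpha):
    """z for z > 0, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def prelu_grad_alpha(z, upstream_grad):
    """d(loss)/d(alpha): the local derivative is z where z <= 0, else 0."""
    return float(np.sum(upstream_grad * np.where(z > 0, 0.0, z)))

z = np.array([-2.0, -1.0, 0.5, 3.0])          # hypothetical pre-activations
target = np.array([-0.5, -0.25, 0.5, 3.0])    # produced by alpha = 0.25

alpha = 0.01           # start from the usual leaky ReLU default
learning_rate = 0.1
for _ in range(200):
    out = prelu(z, alpha)
    upstream = 2.0 * (out - target) / z.size  # d(MSE)/d(out)
    alpha -= learning_rate * prelu_grad_alpha(z, upstream)

print("learned alpha:", alpha)
```

Only the negative inputs contribute to the gradient with respect to α, which is why a unit that rarely sees negative inputs learns its slope slowly.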

### RReLU: Randomized ReLU
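The core idea is that, during training, α is sampled randomly from a distribution over a given range (the RReLU paper uses a uniform distribution rather than a Gaussian), and is then corrected at test time (somewhat like the way dropout is used).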
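
A minimal NumPy sketch of this train/test behaviour, assuming α is drawn uniformly from a range during training and replaced by the midpoint of that range at test time. The bounds `lower` and `upper` are illustrative values, not taken from these notes.

```python
import numpy as np

def rrelu(z, lower=0.125, upper=1.0 / 3.0, training=True, rng=None):
    """Randomized leaky ReLU with a uniformly sampled negative slope."""
    z = np.asarray(z, dtype=float)
    if training:
        rng = np.random.default_rng() if rng is None else rng
        # One random slope per element, redrawn on every call.
        alpha = rng.uniform(lower, upper, size=z.shape)
    else:
        # Fixed slope at test time: the average of the sampling range.
        alpha = (lower + upper) / 2.0
    return np.where(z >= 0, z, alpha * z)

z = np.array([-4.0, -1.0, 0.0, 2.0])
train_out = rrelu(z, training=True, rng=np.random.default_rng(0))
test_out = rrelu(z, training=False)
print("training:", train_out)   # negative entries vary from call to call
print("testing: ", test_out)    # deterministic
```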