# Welcome to Deep Learning!
# Main content
- Vanishing/Exploding gradients
- Training is slow with a large network.
- Overfitting.

# Vanishing/Exploding gradients
The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.

#### vanishing gradients problem
Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. 
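A rough sketch of why the gradient shrinks on its way down. It assumes a toy chain with one logistic unit per layer, the same pre-activation `z` everywhere, and all weights equal to 1 (these are illustrative assumptions, not a setup from the text). The logistic derivative is at most 0.25, so the backpropagated factor shrinks geometrically with depth:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(n_layers, z=0.0, weight=1.0):
    """Scale of the gradient reaching the first layer of a toy chain of
    n_layers logistic units (hypothetical setup: one unit per layer,
    same pre-activation z and same weight in every layer)."""
    factor = 1.0
    for _ in range(n_layers):
        # chain rule: each layer multiplies by weight * activation derivative
        factor *= weight * sigmoid_grad(z)
    return factor

for n in (1, 5, 10, 20):
    print(n, backprop_factor(n))
```

With weights much larger than 1 the same product can grow instead of shrink, which is the exploding case described next.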

#### Exploding gradients problem
In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks.

#### A Paper “Understanding the Difficulty of Training Deep Feedforward Neural Networks” 
Although this unfortunate behavior has been empirically observed for quite a while (it was one of the reasons why deep neural networks were mostly abandoned for a long time), it is only around 2010 that significant progress was made in understanding it. A paper titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks” by Xavier Glorot and Yoshua Bengio found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1. In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks).

**Looking at the logistic activation function (see Figure 11-1), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.**

![11-1](images/11-1.png)
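The saturation described above can be checked numerically. This is a minimal sketch using only the standard definition of the logistic function and its derivative; the function names are chosen here for illustration:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_derivative(z):
    # derivative of the logistic function: s(z) * (1 - s(z))
    s = logistic(z)
    return s * (1.0 - s)

for z in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(f"z={z:6.1f}  logistic={logistic(z):.5f}  "
          f"derivative={logistic_derivative(z):.5f}")
```

The derivative peaks at 0.25 for z = 0 and is already tiny for |z| = 10, which is why a saturated layer passes almost no gradient back.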



# Xavier and He Initialization
### Xavier initialization
For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction.

They proposed a good compromise that has proven to work very well in practice: the connection weights must be initialized randomly as described in Equation 11-1, where $n_\text{inputs}$ and $n_\text{outputs}$ are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called **Xavier initialization** (after the author’s first name), or sometimes **Glorot initialization**.

![11](images/e11-1.png)
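Since Equation 11-1 is only available as an image here, the sketch below assumes the commonly quoted Glorot formulas for the logistic activation: a zero-mean normal distribution with standard deviation sqrt(2 / (n_inputs + n_outputs)), or a uniform distribution over [-r, r] with r = sqrt(6 / (n_inputs + n_outputs)). The function names and the seed are illustrative choices:

```python
import math
import random

def xavier_std(n_inputs, n_outputs):
    """Standard deviation for a zero-mean normal Xavier initialization."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def xavier_uniform_limit(n_inputs, n_outputs):
    """Limit r for a uniform Xavier initialization over [-r, r]."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))

def xavier_uniform_weights(n_inputs, n_outputs, seed=42):
    """Weight matrix (list of rows) drawn uniformly from [-r, r]."""
    rng = random.Random(seed)
    r = xavier_uniform_limit(n_inputs, n_outputs)
    return [[rng.uniform(-r, r) for _ in range(n_outputs)]
            for _ in range(n_inputs)]

weights = xavier_uniform_weights(300, 100)
print(xavier_std(300, 100), xavier_uniform_limit(300, 100))
```

Both variants target the same variance: a uniform distribution over [-r, r] has variance r²/3, which equals 2 / (n_inputs + n_outputs).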

**Using the Xavier initialization strategy can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning.**

### He initialization
Some recent papers have provided similar strategies for different activation functions, as shown in Table 11-1. The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called **He initialization** (after the last name of its author).

![11](images/t11-1.png)

By default, the `fully_connected()` function (introduced in Chapter 10) uses Xavier initialization (with a uniform distribution). You can change this to He initialization by using the `variance_scaling_initializer()` function.

# Nonsaturating Activation Functions
#### ReLU activation function is better than sigmoid
- It does not saturate for positive values.
- It is quite fast to compute.

#### The ReLU activation function is not perfect
It suffers from a problem known as **dying ReLUs**: during training, some neurons effectively die, meaning they stop outputting anything other than 0.

## Leaky ReLU
This function is defined as $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$ (see Figure 11-2). The hyperparameter $\alpha$ defines how much the function “leaks”: it is the slope of the function for $z < 0$, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.

- Setting $\alpha$ = 0.2 (huge leak) seemed to result in better performance than $\alpha$ = 0.01 (small leak).
- **Randomized leaky ReLU (RReLU)**: α is picked randomly in a given range during training and fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer.
- **Parametric leaky ReLU (PReLU)**: α is learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). It strongly outperformed ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

![11](images/11-2.png)
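The variants above can be sketched in plain Python for a single scalar input. The leaky ReLU follows the definition given earlier; the RReLU sampling range (`lower`, `upper`) and the use of the midpoint at test time are illustrative assumptions, since the notes only say that α is picked in a given range and fixed to an average value for testing:

```python
import random

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): slope alpha for z < 0, slope 1 otherwise."""
    return max(alpha * z, z)

def rrelu(z, lower=0.1, upper=0.3, training=True, rng=random):
    """Randomized leaky ReLU (sketch). The range [lower, upper] is a
    hypothetical choice; at test time alpha is fixed to the midpoint."""
    if training:
        alpha = rng.uniform(lower, upper)
    else:
        alpha = (lower + upper) / 2.0
    return max(alpha * z, z)

print(leaky_relu(-2.0), leaky_relu(3.0))
print(rrelu(-1.0, training=False))
```

PReLU uses the same formula as `leaky_relu`, except that `alpha` would be a trainable variable updated by backpropagation rather than a fixed argument.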

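### How to use Leaky ReLU in TensorFlow?
```
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

def leaky_relu(z, name=None):
    # element-wise max(0.01 * z, z), i.e. alpha = 0.01
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)
```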

## Exponential linear unit (ELU)
### Better than all ReLU variants
- Training time was reduced.
- The neural network performed better on the test set.
![11](images/e11-2.png)

### Differences compared with ReLU
- **First, it takes on negative values when z < 0, which allows the unit to have an average output closer to 0.** This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.
- **Second, it has a nonzero gradient for z < 0, which avoids the dying units issue.**
- **Third, the function is smooth everywhere**, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.
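The three properties above can be checked on a scalar sketch. It assumes the usual ELU definition, α(exp(z) − 1) for z < 0 and z otherwise, which is what the equation image is taken to show; the function names are illustrative:

```python
import math

def elu(z, alpha=1.0):
    """ELU for a scalar input (assumed standard definition)."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU with respect to z."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-50.0, -1.0, -1e-6, 0.0, 2.0):
    print(z, elu(z), elu_grad(z))
```

For large negative z the output approaches −α, the gradient stays nonzero for z < 0, and with α = 1 the gradient just left of 0 is close to 1, matching the slope on the right.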

### Drawbacks
The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

### Code
```
hidden1 = fully_connected(X, n_hidden1, activation_fn=tf.nn.elu)
```

## So which activation function should you use for the hidden layers of your deep neural networks? 
Although your mileage will vary, **in general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic**. 
- If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs. 
- If you don’t want to tweak yet another hyperparameter, you may just use the default α values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). 
- If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular **RReLU** if your network is overfitting, or PReLU if you have a huge training set.

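### Parametric ReLU
In Leaky ReLU, α is usually set by hand based on prior knowledge. However, the derivative of the loss function with respect to α can be computed, so could α be trained as a parameter instead?
Kaiming He's paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" shows that it can not only be trained, but that doing so also gives better results.

### RReLU: Randomized ReLU
The core idea is that during training, α is drawn at random from a uniform distribution over a given range, and it is then fixed to a corrected value during testing (somewhat like the way dropout is used).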

## Batch Normalization
**Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they won’t come back during training.**

In a 2015 paper, Sergey Ioffe and Christian Szegedy proposed a technique called **Batch Normalization (BN)**:
- to address the vanishing/exploding gradients problems, 
- and more generally the problem that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change (which they call the **Internal Covariate Shift problem**).

### Main Idea
The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). **In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.**

### Zero-centering and normalizing the inputs
In order to zero-center and normalize the inputs, the algorithm needs to estimate the inputs’ mean and standard deviation. It does so by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name “Batch Normalization”).

![11](images/e11-3.png)
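A minimal sketch of the training-time operation for a single input feature, assuming the usual Batch Normalization steps: mini-batch mean, mini-batch variance, normalization with a small smoothing term `eps`, then scaling by γ and shifting by β. The function name and the value of `eps` are illustrative choices:

```python
import math

def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature over a mini-batch (list of floats).
    Returns the outputs plus the mini-batch mean and variance."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    # zero-center and normalize; eps avoids division by zero
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # scale and shift with the two learned parameters
    out = [gamma * xh + beta for xh in x_hat]
    return out, mean, var

out, mean, var = batch_norm_1d([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(mean, var, out)
```

After the operation, the outputs have a mean of β and a standard deviation of roughly γ, whatever the scale of the original inputs.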

**At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set’s mean and standard deviation. These are typically efficiently computed during training using a moving average. So, in total, four parameters are kept for each batch-normalized layer: γ (scale) and β (offset) are learned by backpropagation, while μ (mean) and σ (standard deviation) are estimated with the moving averages.**
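The moving average can be sketched as an exponential update applied once per training step. The update rule below (`momentum * old + (1 - momentum) * new`) is the form assumed here, and the starting value of 0.0 is an arbitrary illustrative choice:

```python
def update_moving_average(moving, batch_value, momentum=0.9):
    """One exponential-moving-average step for a batch statistic."""
    return momentum * moving + (1.0 - momentum) * batch_value

moving_mean = 0.0
for step in range(100):
    batch_mean = 2.0  # pretend every mini-batch has mean 2.0
    moving_mean = update_moving_average(moving_mean, batch_mean)

print(moving_mean)  # drifts toward the batch mean as steps accumulate
```

A momentum closer to 1 gives a smoother but slower-moving estimate.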

### Advantages of BN
- The authors demonstrated that this technique considerably improved all the deep neural networks they experimented with. 
- The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function. 
- The networks were also much less sensitive to the weight initialization. 
- They were able to use much larger learning rates, significantly speeding up the learning process. Specifically, they note that “Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. [...] Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.”
- Finally, like a gift that keeps on giving, Batch Normalization also acts like **a regularizer**, reducing the need for other regularization techniques (such as dropout, described later in the chapter).

### Disadvantages of BN
Batch Normalization does, however, **add some complexity to the model** (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized). Moreover, **there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer**. 

**So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization perform before playing with Batch Normalization.**

### Implementing Batch Normalization with TensorFlow

#### Solution 1
TensorFlow provides a `batch_normalization()` function that simply centers and normalizes the inputs, but you must compute the mean and standard deviation yourself (based on the mini-batch data during training or on the full dataset during testing, as just discussed) and pass them as parameters to this function, and you must also handle the creation of the scaling and offset parameters (and pass them to this function).

#### Solution 2
Use the `batch_norm()` function, which handles all this for you. You can either call it directly or tell the `fully_connected()` function to use it.


### In newer versions of TensorFlow
Note: the book uses `tensorflow.contrib.layers.batch_norm()` rather than `tf.layers.batch_normalization()` (which did not exist when this chapter was written). It is now preferable to use `tf.layers.batch_normalization()`, because anything in the contrib module may change or be deleted without notice. Instead of passing the `batch_norm()` function as the `normalizer_fn` argument of the `fully_connected()` function, we now use `batch_normalization()` and we explicitly create a distinct layer. The parameters are a bit different, in particular:
- decay is renamed to momentum,
- is_training is renamed to training,
- updates_collections is removed: the update operations needed by batch normalization are added to the UPDATE_OPS collection and you need to explicitly run these operations during training (see the execution phase below),
- we don't need to specify scale=True, as that is the default.

Also note that in order to run batch norm just before each hidden layer's activation function, we apply the ELU activation function manually, right after the batch norm layer.

Note: since the `tf.layers.dense()` function is incompatible with `tf.contrib.layers.arg_scope()` (which is used in the book), we now use Python's `functools.partial()` function instead. It makes it easy to create a `my_dense_layer()` function that just calls `tf.layers.dense()` with the desired parameters automatically set (unless they are overridden when calling `my_dense_layer()`). As you can see, the code remains very similar.

```
import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

training = tf.placeholder_with_default(False, shape=(), name='training')

# dense(...): fully connected layer
hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)
```

To avoid repeating the same parameters over and over again, we can use Python's `partial()` function:

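`partial()` takes the function to wrap as its first argument, followed by the arguments to pre-fill, in order. Arguments that the original function takes by keyword must be passed by keyword; arguments given without a keyword are filled in following the original parameter order.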

```
from functools import partial

tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)
```
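To see what `partial()` does in isolation, here is a small sketch that needs no TensorFlow. The `dense_layer()` function is a hypothetical stand-in that only records its arguments:

```python
from functools import partial

def dense_layer(inputs, n_units, activation=None, name=None):
    """Hypothetical stand-in for a layer function: records its arguments."""
    return {"inputs": inputs, "n_units": n_units,
            "activation": activation, "name": name}

# pre-fill the keyword argument once
my_layer = partial(dense_layer, activation="elu")

h1 = my_layer("X", 300, name="hidden1")
# a pre-filled keyword argument can still be overridden at call time
h2 = my_layer("h1", 100, name="hidden2", activation="relu")

print(h1["activation"], h2["activation"])
```

Positional arguments passed to the wrapped function are appended after any positional arguments pre-filled in `partial()`.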

Let's build a neural net for MNIST, using the ELU activation function and Batch Normalization at each layer:

```
tf.reset_default_graph()

batch_norm_momentum = 0.9
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.variance_scaling_initializer()

    my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=batch_norm_momentum)

    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()
```


Note: since we are using `tf.layers.batch_normalization()` rather than `tf.contrib.layers.batch_norm()` (as in the book), we need to explicitly run the extra update operations needed by batch normalization (`sess.run([training_op, extra_update_ops], ...)`).

**When training, the `moving_mean` and `moving_variance` need to be updated. By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they need to be added as a dependency to the `train_op`.**

```
import numpy as np

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]
```

```
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
```

```
n_epochs = 20
batch_size = 200

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")
```

Output:

```
0 Validation accuracy: 0.9022
1 Validation accuracy: 0.9246
2 Validation accuracy: 0.9358
3 Validation accuracy: 0.9454
4 Validation accuracy: 0.9526
5 Validation accuracy: 0.9556
6 Validation accuracy: 0.9584
7 Validation accuracy: 0.9618
8 Validation accuracy: 0.9646
9 Validation accuracy: 0.9672
10 Validation accuracy: 0.9682
11 Validation accuracy: 0.9684
12 Validation accuracy: 0.9704
13 Validation accuracy: 0.97
14 Validation accuracy: 0.973
15 Validation accuracy: 0.9724
16 Validation accuracy: 0.9756
17 Validation accuracy: 0.9738
18 Validation accuracy: 0.9746
19 Validation accuracy: 0.976
```


What!? That's not a great accuracy for MNIST. Of course, if you train for longer it will get much better accuracy, but with such a shallow network, **Batch Norm and ELU are unlikely to have a very positive impact: they shine mostly for much deeper nets.**

**Note that you could also make the training operation depend on the update operations:**
```
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(extra_update_ops):
        training_op = optimizer.minimize(loss)
```

This way, you would just have to evaluate the `training_op` during training; TensorFlow would automatically run the update operations as well:
```
sess.run(training_op, feed_dict={training: True, X: X_batch, y: y_batch})
```


**One more thing**: notice that the list of trainable variables is shorter than the list of all global variables. This is because the moving averages are non-trainable variables. **If you want to reuse a pretrained neural network (see below), you must not forget these non-trainable variables.**

```
[v.name for v in tf.trainable_variables()]
```

```
['hidden1/kernel:0',
 'hidden1/bias:0',
 'batch_normalization/gamma:0',
 'batch_normalization/beta:0',
 'hidden2/kernel:0',
 'hidden2/bias:0',
 'batch_normalization_1/gamma:0',
 'batch_normalization_1/beta:0',
 'outputs/kernel:0',
 'outputs/bias:0',
 'batch_normalization_2/gamma:0',
 'batch_normalization_2/beta:0']
```

```
[v.name for v in tf.global_variables()]
```

```
['hidden1/kernel:0',
 'hidden1/bias:0',
 'batch_normalization/gamma:0',
 'batch_normalization/beta:0',
 'batch_normalization/moving_mean:0',
 'batch_normalization/moving_variance:0',
 'hidden2/kernel:0',
 'hidden2/bias:0',
 'batch_normalization_1/gamma:0',
 'batch_normalization_1/beta:0',
 'batch_normalization_1/moving_mean:0',
 'batch_normalization_1/moving_variance:0',
 'outputs/kernel:0',
 'outputs/bias:0',
 'batch_normalization_2/gamma:0',
 'batch_normalization_2/beta:0',
 'batch_normalization_2/moving_mean:0',
 'batch_normalization_2/moving_variance:0']
```

## Gradient Clipping
A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (**this is mostly useful for recurrent neural networks**).
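Clipping by value can be sketched on a plain list of gradient components. The threshold of 1.0 is an illustrative assumption, and the function name is chosen here for the example:

```python
def clip_gradients(grads, threshold=1.0):
    """Clip each gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

grads = [-3.0, 0.5, 2.0]
print(clip_gradients(grads, threshold=1.0))
```

Components already inside the range pass through unchanged; only the outliers are cut back to the threshold.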

# Reusing Pretrained Layers
If the original model was trained using TensorFlow, you can simply restore it and train it on the new task:

```
[...] # construct the original model
with tf.Session() as sess:
    saver.restore(sess, "./my_original_model.ckpt") 
    [...] # train it on your new task (inside the session)
```

However, in general you will want to **reuse only part of the original model. A simple solution is to configure the Saver to restore only a subset of the variables from the original model.** For example, the following code restores only hidden layers 1, 2, and 3:
```
[...] # build new model with the same definition as before for hidden layers 1-3
init = tf.global_variables_initializer()

# trainable variables whose scope matches hidden layers 1 to 3
reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="hidden[123]")
# map each variable's name in the original checkpoint to the new variable
reuse_vars_dict = {var.op.name: var for var in reuse_vars}
original_saver = tf.train.Saver(reuse_vars_dict) # saver to restore the original model

new_saver = tf.train.Saver() # saver to save the new model

with tf.Session() as sess:
    sess.run(init)
    original_saver.restore(sess, "./my_original_model.ckpt") # restore layers 1 to 3
    [...] # train the new model
    new_saver.save(sess, "./my_new_model.ckpt") # save the whole model
```

- First we build the new model, making sure to copy the original model’s hidden layers 1 to 3. We also create a node to initialize all variables. 
- Then we get the list of all variables that were just created with `trainable=True` (which is the default), and we keep only the ones whose scope matches the regular expression "hidden[123]" (i.e., we get all trainable variables in hidden layers 1 to 3). 
- Next we create a dictionary mapping the name of each variable in the original model to the corresponding variable in the new model (generally you want to keep the exact same names).
- Then we create a Saver that will restore only these variables, and we create another Saver to save the entire new model, not just layers 1 to 3. 
- We then start a session and initialize all variables in the model, then restore the variable values from the original model’s layers 1 to 3. 
- Finally, we train the model on the new task and save it.
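The scope filter in the second step can be illustrated without TensorFlow. The sketch assumes the filter behaves like `re.match` applied to each variable name (so the pattern must match from the start of the name); the variable names below are hypothetical:

```python
import re

def filter_by_scope(names, scope_pattern):
    """Keep the names whose beginning matches the regular expression."""
    regex = re.compile(scope_pattern)
    return [name for name in names if regex.match(name)]

names = ["hidden1/kernel:0", "hidden1/bias:0", "hidden2/kernel:0",
         "hidden3/kernel:0", "hidden4/kernel:0", "outputs/kernel:0"]

reused = filter_by_scope(names, "hidden[123]")
print(reused)
```

Here `hidden4` and `outputs` are left out, so those layers would keep their freshly initialized values.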