## 4.3 Neural Network Optimization Algorithms
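The **gradient descent algorithm** optimizes the value of a single parameter, while the **backpropagation algorithm** provides an efficient way to apply gradient descent to all parameters, so that the loss of the neural network on the training data becomes as small as possible. Given a parameter's gradient and the learning rate, the update rule is: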

$$ \theta_{n+1} = \theta_{n} - \eta\frac{\partial}{\partial\theta_{n}}J(\theta_{n})$$

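Here $\theta$ is the parameter, the subscript denotes the iteration number, $\eta$ is the learning rate, and $J(\theta)$ is the loss function. Suppose we want to minimize $y=x^2$, starting from $x_0=5$ with a learning rate of 0.3. The optimization process is summarized in the table below. After only 5 iterations, $x$ is already fairly close to the optimum 0: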

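| Iteration | Current parameter value | Gradient × learning rate | Updated parameter value |
| --------- | ----------------------- | ------------------------ | ----------------------- |
| 1 | 5     | 2 × 5 × 0.3 = 3          | 5 - 3 = 2               |
| 2 | 2     | 2 × 2 × 0.3 = 1.2        | 2 - 1.2 = 0.8           |
| 3 | 0.8   | 2 × 0.8 × 0.3 = 0.48     | 0.8 - 0.48 = 0.32       |
| 4 | 0.32  | 2 × 0.32 × 0.3 = 0.192   | 0.32 - 0.192 = 0.128    |
| 5 | 0.128 | 2 × 0.128 × 0.3 = 0.0768 | 0.128 - 0.0768 = 0.0512 |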
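
The table can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper name `gradient_descent_1d` and its parameters are illustrative and do not come from the original text. It relies on the fact that the gradient of $y=x^2$ is $2x$.

```python
def gradient_descent_1d(grad, x0, learning_rate, steps):
    """Run plain gradient descent on a scalar and return every visited value."""
    history = [x0]
    x = x0
    for _ in range(steps):
        # Update rule: theta_{n+1} = theta_n - eta * dJ/dtheta
        x = x - learning_rate * grad(x)
        history.append(x)
    return history


# y = x^2 has gradient 2x; start at x0 = 5 with learning rate 0.3, as in the table.
trajectory = gradient_descent_1d(lambda x: 2 * x, x0=5.0, learning_rate=0.3, steps=5)
for n, value in enumerate(trajectory[1:], start=1):
    print("iteration %d: x = %.4f" % (n, value))
```

Each printed value should correspond to the last column of the matching table row.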

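**Optimizing a neural network can be split into two phases:**
- In the first phase, a forward pass computes the predictions, which are compared with the true values to measure the gap between them.
- In the second phase, backpropagation computes the gradient of the loss function with respect to every parameter, and gradient descent then updates each parameter using its gradient and the learning rate.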
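
The two phases can be sketched with NumPy on a deliberately tiny model. Everything here is an assumption made for illustration: the toy data, the single linear layer, the mean squared error loss, and the variable names. With only one layer, the backward pass is a single application of the chain rule; a deeper network would propagate the gradient through every layer in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative assumption): y = 2*x1 - 3*x2 + 1, without noise.
X = rng.normal(size=(64, 2))
y_true = X @ np.array([2.0, -3.0]) + 1.0

w = np.zeros(2)
b = 0.0
learning_rate = 0.1

losses = []
for step in range(200):
    # Phase 1: forward pass, then compare the predictions with the true values.
    y_pred = X @ w + b
    error = y_pred - y_true
    losses.append(np.mean(error ** 2))

    # Phase 2: backward pass. Compute the gradient of the mean squared error
    # with respect to each parameter, then take one gradient descent step.
    grad_w = 2.0 * X.T @ error / len(X)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("learned w:", w, "learned b:", b, "final loss:", losses[-1])
```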

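Caveats of gradient descent:
- Gradient descent **does not guarantee that the function being optimized reaches its global optimum**. It may only reach a local optimum, which is why the initial parameter values strongly affect the final result. **Only when the loss function is convex can gradient descent be guaranteed to reach the global optimum.**

- Another problem with gradient descent is that it **takes too long to compute**, because in every iteration $J(\theta)$ is the sum of the losses over all training data. To speed things up, one can use stochastic gradient descent (SGD), which in each iteration optimizes the loss on a single randomly chosen training example. A single example, however, represents the whole dataset poorly. In practice a compromise is used: **each iteration computes the loss on a small subset of the training data, called a batch. Thanks to matrix operations, processing a batch is not much slower than processing a single example, while the number of iterations needed to converge drops considerably and the result ends up closer to what full gradient descent would reach.**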
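
The trade-off between full-batch, mini-batch, and single-example updates can be sketched as follows. This is a hedged illustration rather than a benchmark: the toy regression data, the helper name `minibatch_sgd`, and all hyperparameters are assumptions chosen for the example. The same function covers all three variants, because only `batch_size` changes.

```python
import numpy as np

data_rng = np.random.default_rng(0)

# Toy regression data (illustrative assumption): y = 3*x + small noise.
X = data_rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * data_rng.normal(size=1000)


def minibatch_sgd(X, y, batch_size, learning_rate=0.1, epochs=5, seed=0):
    """Fit y ~ w*x by gradient descent on batches of the given size."""
    rng = np.random.default_rng(seed)
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data on every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx, 0] - y[idx]
            grad = 2.0 * np.mean(error * X[idx, 0])  # gradient of the batch MSE
            w -= learning_rate * grad
    return w


w_full = minibatch_sgd(X, y, batch_size=len(X))  # one update per pass over the data
w_batch = minibatch_sgd(X, y, batch_size=32)     # about 32 updates per pass
w_single = minibatch_sgd(X, y, batch_size=1)     # 1000 noisy updates per pass
print("full batch:", w_full, "mini-batch:", w_batch, "single example:", w_single)
```

With the same five passes over the data, the full-batch run performs only five updates, whereas the mini-batch run performs many more updates, each computed from 32 examples at once.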

## 4.4 Further Optimization of Neural Networks
### 4.4.1 Setting the Learning Rate
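The learning rate controls how large each parameter update is.
- If the step is too large, the parameter may bounce back and forth on either side of the optimum.
- If the step is too small, convergence is assured, but optimization slows down considerably and many more iterations are needed to reach a reasonably good result.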

Continuing the earlier example: we want to minimize $y=x^2$, starting from $x_0=5$. The two cases are as follows:

**a. Learning rate too large (learning rate 1; x oscillates between 5 and -5)**

In [1]:
import tensorflow as tf

TRAINING_STEPS = 10
LEARNING_RATE = 1

x = tf.Variable(tf.constant(5, dtype=tf.float32), name="x")
y = tf.square(x)

train_op = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(y)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(TRAINING_STEPS):
        sess.run(train_op)
        x_value = sess.run(x)
        print("After %s iteration(s): x%s is %f."% (i+1, i+1, x_value) )

After 1 iteration(s): x1 is -5.000000.
After 2 iteration(s): x2 is 5.000000.
After 3 iteration(s): x3 is -5.000000.
After 4 iteration(s): x4 is 5.000000.
After 5 iteration(s): x5 is -5.000000.
After 6 iteration(s): x6 is 5.000000.
After 7 iteration(s): x7 is -5.000000.
After 8 iteration(s): x8 is 5.000000.
After 9 iteration(s): x9 is -5.000000.
After 10 iteration(s): x10 is 5.000000.


**b. Learning rate too small (learning rate 0.001; descent is too slow, and after 901 iterations x has only dropped to 0.823355)**

In [2]:
TRAINING_STEPS = 1000
LEARNING_RATE = 0.001

x = tf.Variable(tf.constant(5, dtype=tf.float32), name="x")
y = tf.square(x)

train_op = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(y)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(TRAINING_STEPS):
        sess.run(train_op)
        if i % 100 == 0: 
            x_value = sess.run(x)
            print("After %s iteration(s): x%s is %f."% (i+1, i+1, x_value))

After 1 iteration(s): x1 is 4.990000.
After 101 iteration(s): x101 is 4.084646.
After 201 iteration(s): x201 is 3.343555.
After 301 iteration(s): x301 is 2.736923.
After 401 iteration(s): x401 is 2.240355.
After 501 iteration(s): x501 is 1.833880.
After 601 iteration(s): x601 is 1.501153.
After 701 iteration(s): x701 is 1.228794.
After 801 iteration(s): x801 is 1.005850.
After 901 iteration(s): x901 is 0.823355.
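
Both behaviours follow from the update rule. For $y=x^2$ the gradient is $2x$, so each step gives $x_{n+1} = x_n - \eta \cdot 2x_n = (1-2\eta)x_n$, and therefore $x_n = (1-2\eta)^n x_0$. With $\eta=1$ the factor is $-1$, and with $\eta=0.001$ it is $0.998$. The short sketch below evaluates this closed form in plain Python; the helper name `x_after` is illustrative.

```python
def x_after(n, learning_rate, x0=5.0):
    """Closed-form value of x after n gradient descent steps on y = x^2."""
    return (1.0 - 2.0 * learning_rate) ** n * x0


# Learning rate 1: the factor is -1, so x flips sign on every step.
print([x_after(n, 1.0) for n in range(1, 5)])

# Learning rate 0.001: the factor is 0.998, so x shrinks by 0.2% per step.
print(x_after(901, 0.001))
```

The second value should agree with the printed TensorFlow result up to float32 rounding.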


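TensorFlow provides a more flexible way to set the learning rate, **exponential decay**, through `tf.train.exponential_decay`. With this function, one can **start with a larger learning rate to reach a reasonably good solution quickly, then gradually reduce the learning rate as the iterations continue, so that the model is more stable in the later stages of training.**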
<p align='center'>
    <img src=images/图4.13.JPG>
</p>

**c. The effect of exponential decay is shown below:**

In [3]:
TRAINING_STEPS = 100
global_step = tf.Variable(0)
LEARNING_RATE = tf.train.exponential_decay(0.1, global_step, 1, 0.96, staircase=True)

x = tf.Variable(tf.constant(5, dtype=tf.float32), name="x")
y = tf.square(x)
# Passing global_step to minimize() increments it automatically on every step, which in turn updates the learning rate
train_op = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(y, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(TRAINING_STEPS):
        sess.run(train_op)
        if i % 10 == 0:
            LEARNING_RATE_value = sess.run(LEARNING_RATE)
            x_value = sess.run(x)
            print("After %s iteration(s): x%s is %f, learning rate is %f."% (i+1, i+1, x_value, LEARNING_RATE_value))

After 1 iteration(s): x1 is 4.000000, learning rate is 0.096000.
After 11 iteration(s): x11 is 0.690561, learning rate is 0.063824.
After 21 iteration(s): x21 is 0.222583, learning rate is 0.042432.
After 31 iteration(s): x31 is 0.106405, learning rate is 0.028210.
After 41 iteration(s): x41 is 0.065548, learning rate is 0.018755.
After 51 iteration(s): x51 is 0.047625, learning rate is 0.012469.
After 61 iteration(s): x61 is 0.038558, learning rate is 0.008290.
After 71 iteration(s): x71 is 0.033523, learning rate is 0.005511.
After 81 iteration(s): x81 is 0.030553, learning rate is 0.003664.
After 91 iteration(s): x91 is 0.028727, learning rate is 0.002436.
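
According to the TensorFlow 1.x documentation, `tf.train.exponential_decay` computes `learning_rate * decay_rate ^ (global_step / decay_steps)`, and with `staircase=True` the division is an integer division. The sketch below re-implements that formula in plain Python so the printed learning rates can be checked by hand. The function here is a stand-in written for illustration, not the TensorFlow implementation.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False):
    """Plain Python version of the exponential decay formula."""
    if staircase:
        exponent = global_step // decay_steps  # decay in discrete steps
    else:
        exponent = global_step / decay_steps   # decay continuously
    return learning_rate * decay_rate ** exponent


# Same settings as cell c: initial rate 0.1, decay_steps 1, decay_rate 0.96.
for step in (1, 11, 21, 91):
    print(step, exponential_decay(0.1, step, 1, 0.96, staircase=True))
```

The learning rate printed after the first iteration is 0.096 rather than 0.1 because `global_step` has already been incremented to 1 by the time the rate is read.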
