# Optimizer
## 1.梯度下降法（Gradient Descent）
* 标准的梯度下降法
    - 先计算所有样本的总误差，然后根据总误差更新权值
* 随机梯度下降法
    - 随机抽取一个样本计算误差，然后更新权值
* 批量梯度下降法
    - 随机抽取一个batch的样本计算误差，然后更新权值
* 公式：$ W = W - \eta \bullet \nabla J(W;x^{(i)},y^{(i)})$
   
## 2.动力下降法（Momentum）
* 当前权值的改变会受到上一次改变的影响。比如小球下坡的时候会带有惯性，因此在上一次的梯度较大时，下一次会加速下降。
* $\gamma$:动力，一般取值为0.9
* $V_t = \gamma V_{t-1} + \eta \bullet \nabla J(W)$
* $W = W - V_t$

## 3.NAG（Nesterov accelerated gradient）
* 内斯特罗夫加速梯度
* 这个是对“动力下降法”的改良，因为动力下降，在接近谷底的时候，会因为惯性的原因，导致跳出最优解，因此NAG在此基础之上，添加了对小球下一次位置的预测，然后应用到本次的下降中。
* $V_t = \gamma V_{t-1} + \eta \bullet \nabla J(W - \gamma V_{t-1})$
* $W = W - V_t$

## 4.Adagrad
* $i$:表示第i个分类
* $t$:表示某个类别出现的次数
* $\epsilon$:表示一个非常小的值，防止除0
* $\eta$:学习率
* $g_{t,i} =  \nabla _w J(W)$: i类别在第t次的梯度
* $W_{t + 1} = W_t - \frac {\eta}{\sqrt{\sum_{t^"=1}^{t}{(g_{t^",i})}^2 + \epsilon }} \bullet g_t$
* 它是基于SGD思想的一种优化算法，对于比较常见的数据（分类）会有比较小的学习率，对于比较罕见的数据（分类）会有比较大的学习率，使用与比较稀疏的数据集。因此他的优点也恰恰是他的缺点，优点是可以自动的调整学习率；缺点是随着迭代次数的增大，学习率会越来越小，最终趋于0.


## 5.RMSprop
* 和Adagrad的思想非常相似，唯一的不同是，不在是记录所有的t了，而是取前t次的累加。

## 6.Adadelta

## 7. Adam
* $\beta _1$:一般取值为0.9
* $\beta _2$:一般取值为0.99
* $\epsilon$:表示一个非常小的值，防止除0
* $m_t = \beta _1 m_{t-1} + (1-\beta _1)g_t$
* $v_t = \beta _2 v_{t-1} + (1- \beta _2)g_t^2$
* $\hat m = \frac{m_t}{1-\beta _1 ^t}$
* $\hat v = \frac{v_t}{1-\beta _2 ^t}$
* $W_{t+1} = W_t - \frac{\eta}{\sqrt{\hat v_t} + \epsilon} \hat m_t $
* 这是非常著名的Adam算法，应用的非常之多。就想Adagrad 和RMSprop会存储之前衰减的平方梯度，同时也会保存衰减的梯度，更新方式和前面的类似。


In [4]:
# 手写体识别数据集
# http://yann.lecun.com/exdb/mnist/
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data',one_hot=True)

# batch
batch_size = 100
# 批次数
n_batch = mnist.train.num_examples // batch_size

# learning_rate
learning_rate = 0.2

# 占位符
x = tf.placeholder(tf.float32,[None,784])
y = tf.placeholder(tf.float32,[None,10])
keep_probility = tf.placeholder(tf.float32)
lr = tf.Variable(0.001,dtype=tf.float32)

# 创建一个单层的神经网络
W1 = tf.Variable(tf.truncated_normal([784,500],stddev=0.1))
b1 = tf.Variable(tf.zeros([500]) + 0.1)
L1 = tf.tanh(tf.matmul(x,W1) + b1)
L1_drop = tf.nn.dropout(L1,keep_prob=keep_probility)

W2 = tf.Variable(tf.truncated_normal([500,300],stddev=0.1))
b2 = tf.Variable(tf.zeros([300]) + 0.1)
L2 = tf.tanh(tf.matmul(L1_drop,W2) + b2)
L2_drop = tf.nn.dropout(L2,keep_prob=keep_probility)

W3 = tf.Variable(tf.truncated_normal([300,100],stddev=0.1))
b3 = tf.Variable(tf.zeros([100]) + 0.1)
L3 = tf.tanh(tf.matmul(L2_drop,W3) + b3)
L3_drop = tf.nn.dropout(L3,keep_prob=keep_probility)

W4 = tf.Variable(tf.truncated_normal([100,10],stddev=0.1))
b4 = tf.Variable(tf.zeros([10]) + 0.1)
prediction = tf.nn.softmax(tf.matmul(L3_drop,W4) + b4)

# loss
# loss = tf.reduce_mean(tf.square(y - prediction))

# 使用交叉熵代价函数
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=prediction))

# train
# train_step = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)
train_step = tf.train.AdamOptimizer(lr).minimize(loss)
# init
init = tf.global_variables_initializer()
# accuracy
correct_predictions = tf.equal(tf.arg_max(y,1),tf.arg_max(prediction,1))
accuracy = tf.reduce_mean(tf.cast(correct_predictions,tf.float32))

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(21):
        sess.run(tf.assign(lr, 0.001 * (0.95 ** epoch)))
        for batch in range(n_batch):
            batch_xs,batch_ys = mnist.train.next_batch(batch_size)
            sess.run(train_step,feed_dict={x:batch_xs,y:batch_ys,keep_probility:0.9})
        lr_rate = sess.run(lr)
        test_acc = sess.run(accuracy,feed_dict={x:mnist.test.images,y:mnist.test.labels,keep_probility:0.9})
        train_acc = sess.run(accuracy,feed_dict={x:mnist.train.images,y:mnist.train.labels,keep_probility:0.9})
        print('epoch:'+ str(epoch) + ", Testing Accuracy:" + str(test_acc) + ", Training Accuracy:" + str(train_acc) + ", Learning Rate:" + str(lr_rate)) 

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
epoch:0, Testing Accuracy:0.9422, Training Accuracy:0.9448364, Learning Rate:0.001
epoch:1, Testing Accuracy:0.9552, Training Accuracy:0.9606, Learning Rate:0.00095
epoch:2, Testing Accuracy:0.9606, Training Accuracy:0.9680727, Learning Rate:0.0009025
epoch:3, Testing Accuracy:0.9622, Training Accuracy:0.9717636, Learning Rate:0.000857375
epoch:4, Testing Accuracy:0.9694, Training Accuracy:0.9782364, Learning Rate:0.00081450626
epoch:5, Testing Accuracy:0.9707, Training Accuracy:0.9794545, Learning Rate:0.0007737809
epoch:6, Testing Accuracy:0.97, Training Accuracy:0.9811636, Learning Rate:0.0007350919
epoch:7, Testing Accuracy:0.9684, Training Accuracy:0.98081815, Learning Rate:0.0006983373
epoch:8, Testing Accuracy:0.9715, Training Accuracy:0.98507273, Learning Rate:0.0006634204
epoch:9, Testi