## AdaGrad (Adaptive Gradient)

Equations (vectorized form):

![](https://photos-5.dropbox.com/t/2/AABph95VK3YO2qoY7ZyNwoaWtirMX8XX6kCqxxxQTIkmoQ/12/63047491/png/32x32/1/_/1/2/Screenshot%20from%202017-10-18%2010-41-12.png/EITw7jAYtz4gBygH/ccbTm0wuHxin8D72tfhzTig_ZqMZcYd4SzQD3tBcfus?size=2048x1536&size_mode=3)

Non-vectorized form:

![](https://photos-3.dropbox.com/t/2/AACw0eC5SS9-EacX4XIVaxb9et0IabIaeg_zinPE52hgGA/12/63047491/png/32x32/1/_/1/2/Screenshot%20from%202017-10-18%2010-56-12.png/EITw7jAYuT4gBygH/NmwLTZ2ohr6RmKdMwStQQiaOyZaUmmO93VFpbb3EsMk?size=2048x1536&size_mode=3)

![](https://photos-1.dropbox.com/t/2/AADlAkyE6NuIH8rRT0leYfrM75L-s6y_LotctG1AKmIuuw/12/63047491/png/32x32/1/_/1/2/Screenshot%20from%202017-10-18%2010-56-39.png/EITw7jAYuT4gBygH/bgr0-OWax20p_Ylnu9-obfBtCLF5aYiDdCVk0NfZv-g?size=2048x1536&size_mode=3)

The **AdaGrad** algorithm works by scaling down the gradient vector along the steepest dimensions.

- The algorithm maintains a vector s that accumulates the squared gradients (computed by element-wise multiplication of the gradient with itself). In other words, each element s<sub>i</sub> accumulates the squares of the partial derivative of the cost function with regard to parameter θ<sub>i</sub>. If the cost function is steep along the i<sup>th</sup> dimension (larger derivative), then s<sub>i</sub> grows faster with each iteration.

- The second step of the algorithm is almost identical to vanilla Gradient Descent, with one big difference: the gradient vector is scaled down by a factor of `sqrt(s + ε)` (using element-wise division). ε is a smoothing term that prevents division by zero (typically set to 10<sup>-10</sup>). The equivalent non-vectorized form is shown in the figures above.
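The two steps described above can also be written out explicitly. This is a sketch of the update rule, assuming η denotes the learning rate and J(θ) the cost function (neither symbol is defined in the text above), with ⊗ and ⊘ standing for element-wise multiplication and division:

$$
\begin{aligned}
\mathbf{s} &\leftarrow \mathbf{s} + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
\theta &\leftarrow \theta - \eta \, \nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \varepsilon}
\end{aligned}
$$

Under the same assumptions, the per-parameter form would read:

$$
\begin{aligned}
s_i &\leftarrow s_i + \left( \frac{\partial J(\theta)}{\partial \theta_i} \right)^2 \\
\theta_i &\leftarrow \theta_i - \eta \, \frac{\partial J(\theta) / \partial \theta_i}{\sqrt{s_i + \varepsilon}}
\end{aligned}
$$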

The general idea of this algorithm is that it decays the learning rate faster for steep dimensions and more slowly for dimensions with gentler slopes. This is called an **Adaptive Learning Rate**. One benefit of this approach is that it requires much less tuning of the learning rate hyperparameter.
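To make the adaptive scaling concrete, here is a minimal NumPy sketch. It assumes a hypothetical elongated quadratic cost J(θ) = 0.5 · (10 · θ<sub>0</sub><sup>2</sup> + θ<sub>1</sub><sup>2</sup>); the function name `adagrad_step`, the curvature values, the learning rate of 0.1, and the starting point are illustrative choices and are not taken from the notebook cells that follow.

```python
import numpy as np

def adagrad_step(theta, grad, s, learning_rate=0.1, eps=1e-10):
    """One AdaGrad update; returns the new parameters and the new accumulator."""
    s = s + grad * grad                                      # accumulate squared gradients element-wise
    theta = theta - learning_rate * grad / np.sqrt(s + eps)  # scale each dimension's step separately
    return theta, s

# Hypothetical elongated bowl: steep along dimension 0, gentle along dimension 1.
curvature = np.array([10.0, 1.0])

def gradient(theta):
    # Gradient of J(theta) = 0.5 * sum(curvature * theta**2)
    return curvature * theta

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)
for _ in range(50):
    theta, s = adagrad_step(theta, gradient(theta), s)

# Learning rate actually applied to each dimension after 50 steps.
effective_lr = 0.1 / np.sqrt(s + 1e-10)
print("accumulator s:", s)
print("effective learning rate per dimension:", effective_lr)
print("theta:", theta)
```

In this toy setup the accumulator for the steep dimension grows about 100 times faster, so its effective learning rate ends up about 10 times smaller, and both coordinates move toward the minimum at roughly the same pace despite the difference in slope.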

*Note that this algorithm often performs poorly when training neural networks because it tends to stop too early: the learning rate gets scaled down so much that the algorithm stops before reaching the global minimum. It should therefore generally be avoided for training neural networks (it may be suitable for simpler tasks such as Linear Regression).*
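The shrinking learning rate can be illustrated with a toy calculation. The sketch below assumes a constant gradient magnitude of 1.0, a deliberately simplified and hypothetical setting that is not a reproduction of neural network training; it only shows how the accumulator drives the effective learning rate down over time.

```python
import numpy as np

learning_rate = 0.01
eps = 1e-10
grad = 1.0  # hypothetical constant gradient magnitude, for illustration only

s = 0.0
effective_lr = {}
for step in range(1, 10_001):
    s += grad ** 2                     # the accumulator only ever grows
    if step in (1, 100, 10_000):
        effective_lr[step] = learning_rate / np.sqrt(s + eps)

for step, lr in effective_lr.items():
    print(f"step {step:>6}: effective learning rate = {lr:.6f}")
```

With a constant gradient the accumulator equals the step count, so the effective learning rate falls in proportion to 1 / sqrt(step): it drops by a factor of 10 after 100 steps and by a factor of 100 after 10,000 steps.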

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

In [2]:
z = np.linspace(-5, 5, 200)
tf.reset_default_graph()
n_inputs = 784
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
learning_rate = 0.01

In [3]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

In [4]:
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.elu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [5]:
with tf.name_scope("loss"):
    xen = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xen, name="loss")

In [6]:
with tf.name_scope("train"):
    optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)

In [7]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [8]:
init = tf.global_variables_initializer()

In [9]:
n_epochs = 100
batch_size = 100
mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [10]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
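        # Note: the number of steps per epoch is derived from the test-set size here, so an
        # "epoch" is shorter than a full pass over the training data
        # (mnist.train.num_examples // batch_size would give a full pass).
        for iteration in range(len(mnist.test.labels) // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Training accuracy is measured on the last mini-batch only, so it is noisy.
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})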
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

0 Train accuracy: 0.78 Test accuracy: 0.8544
1 Train accuracy: 0.86 Test accuracy: 0.8842
2 Train accuracy: 0.96 Test accuracy: 0.8984
3 Train accuracy: 0.91 Test accuracy: 0.9005
4 Train accuracy: 0.89 Test accuracy: 0.9082
5 Train accuracy: 0.93 Test accuracy: 0.9125
6 Train accuracy: 0.91 Test accuracy: 0.9151
7 Train accuracy: 0.92 Test accuracy: 0.9168
8 Train accuracy: 0.96 Test accuracy: 0.9183
9 Train accuracy: 0.93 Test accuracy: 0.919
10 Train accuracy: 0.91 Test accuracy: 0.9233
11 Train accuracy: 0.92 Test accuracy: 0.9235
12 Train accuracy: 0.94 Test accuracy: 0.9228
13 Train accuracy: 0.93 Test accuracy: 0.9236
14 Train accuracy: 0.92 Test accuracy: 0.93
15 Train accuracy: 0.99 Test accuracy: 0.9269
16 Train accuracy: 0.93 Test accuracy: 0.9277
17 Train accuracy: 0.95 Test accuracy: 0.9297
18 Train accuracy: 0.95 Test accuracy: 0.9309
19 Train accuracy: 0.91 Test accuracy: 0.931
20 Train accuracy: 0.96 Test accuracy: 0.9315
21 Train accuracy: 0.94 Test accuracy: 0.9328
22