# Aim: To compare the different available `Optimizers`

In [1]:
# TensorFlow and tf.keras
import tensorflow as tf

# Helper libraries
import numpy as np
from tensorflow.keras import initializers
from tensorflow.keras import activations  # public API path instead of the private tensorflow.python module

print(tf.__version__)

# downloading fashion_mnist data
fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

train_images = train_images / 255.0

test_images = test_images / 255.0       



2.5.0


In [2]:
activation = tf.keras.activations.tanh

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=activation),
    tf.keras.layers.Dense(10)
])

optim = tf.keras.optimizers.SGD()
model.compile(
    optimizer=optim,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               100480    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
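The parameter counts in the summary can be reproduced by hand: a `Dense` layer has one weight per input/unit pair plus one bias per unit. The helper name `dense_params` below is only an illustrative assumption, not part of Keras.

```python
# Parameters of a Dense layer = inputs * units (weights) + units (biases).
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

hidden = dense_params(28 * 28, 128)   # Flatten turns a 28x28 image into 784 inputs
output = dense_params(128, 10)

print(hidden, output, hidden + output)  # 100480 1290 101770
```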


<img src="images\model.jpeg" height="50%" width="50%" alt="Case 1 Gradient Descent">

**Forward Pass:**

$$\overbrace{\begin{bmatrix}
    i_{1} & i_{2} & \dots & i_{784}
\end{bmatrix}\begin{bmatrix}
    w_{1,1} & w_{1,2} & w_{1,3} & \dots  & w_{1,128} \\
    w_{2,1} & w_{2,2} & w_{2,3} & \dots  & w_{2,128} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    w_{784,1} & w_{784,2} & w_{784,3} & \dots  & w_{784,128}
\end{bmatrix}+\begin{bmatrix}
    b_{1} & b_{2} & \dots & b_{128}
\end{bmatrix}}^{\text{Input of First Activation Function}} =\begin{bmatrix}
    s_{1} & s_{2} & \dots & s_{128}
\end{bmatrix}$$

$$\text{Activation1}\left(\begin{bmatrix}
    s_{1} & s_{2} & \dots & s_{128}
\end{bmatrix}\right)=\begin{bmatrix}
    i'_{1} & i'_{2} & \dots & i'_{128}
\end{bmatrix}$$

$$\overbrace{\begin{bmatrix}
    i'_{1} & i'_{2} & \dots & i'_{128}
\end{bmatrix}\begin{bmatrix}
    w'_{1,1} & w'_{1,2} & w'_{1,3} & \dots  & w'_{1,10} \\
    w'_{2,1} & w'_{2,2} & w'_{2,3} & \dots  & w'_{2,10} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    w'_{128,1} & w'_{128,2} & w'_{128,3} & \dots  & w'_{128,10}
\end{bmatrix}+\begin{bmatrix}
    b'_{1} & b'_{2} & \dots & b'_{10}
\end{bmatrix}}^{\text{Input of Second Activation Function}}=\begin{bmatrix}
    s'_{1} & s'_{2} & \dots & s'_{10}
\end{bmatrix}$$


$$\text{Activation2}\left(\begin{bmatrix}
    s'_{1} & s'_{2} & \dots & s'_{10}
\end{bmatrix}\right)=\begin{bmatrix}
    y'_{1} & y'_{2} & \dots & y'_{10}
\end{bmatrix}$$

**Whole Forward Pass:**

$$\text{Activation2}\left(\text{Activation1}\left(\overbrace{\begin{bmatrix}
    i_{1} & i_{2} & \dots & i_{784}
\end{bmatrix}\begin{bmatrix}
    w_{1,1} & w_{1,2} & w_{1,3} & \dots  & w_{1,128} \\
    w_{2,1} & w_{2,2} & w_{2,3} & \dots  & w_{2,128} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    w_{784,1} & w_{784,2} & w_{784,3} & \dots  & w_{784,128}
\end{bmatrix}+\begin{bmatrix}
    b_{1} & b_{2} & \dots & b_{128}
\end{bmatrix}}^{\text{Input of Hidden Layer}}\right)\begin{bmatrix}
    w'_{1,1} & w'_{1,2} & w'_{1,3} & \dots  & w'_{1,10} \\
    w'_{2,1} & w'_{2,2} & w'_{2,3} & \dots  & w'_{2,10} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    w'_{128,1} & w'_{128,2} & w'_{128,3} & \dots  & w'_{128,10}
\end{bmatrix}+\begin{bmatrix}
    b'_{1} & b'_{2} & \dots & b'_{10}
\end{bmatrix}\right)$$

**In Simple Form**

$$\text{Activation2}\left(\text{Activation1}\left(I_{1\times784}\,W_{784\times128}+B_{1\times128}\right)W'_{128\times10}+B'_{1\times10}\right) = y'$$
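The forward pass can be traced with a few lines of NumPy to confirm that the shapes line up. This is only a sketch: the input and the weights are random stand-ins (an assumption), Activation1 is `tanh` as in the model above, and Activation2 is taken to be the identity because `Dense(10)` is declared without an activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One flattened 28x28 image as a row vector (random stand-in for real pixels).
I = rng.random((1, 784))

# Randomly initialised parameters with the shapes used in the text.
W = rng.normal(scale=0.05, size=(784, 128))    # first-layer weights
B = np.zeros((1, 128))                         # first-layer biases
W2 = rng.normal(scale=0.05, size=(128, 10))    # second-layer weights (W')
B2 = np.zeros((1, 10))                         # second-layer biases (B')

s = I @ W + B                  # input of the first activation, shape (1, 128)
i_prime = np.tanh(s)           # Activation1 (tanh), shape (1, 128)
s_prime = i_prime @ W2 + B2    # input of the second activation, shape (1, 10)
y_prime = s_prime              # Activation2 assumed to be the identity

print(s.shape, i_prime.shape, y_prime.shape)
```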

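**Cost Function (Error):**
$$C = (y-y')^2$$

The model above is compiled with `SparseCategoricalCrossentropy`; this walkthrough uses the squared error instead because it is simpler to differentiate.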

**Backward Pass:**

**We have four sets of parameters to update before the second forward pass:**

$$\vec W,\ \vec B,\ \vec W' \text{ and } \vec B'$$

## Gradient Descent

**Gradient Descent Formula**

$$\boxed{\vec{W_{new}} = \vec{W_{old}} - \eta \frac{\partial \vec{C}}{\partial \vec{W_{old}}}}$$

**Where**

\begin{equation}
 \left.\begin{aligned}
        \vec{W_{new}} &= \text{new weight}\\
        \vec{W_{old}} &= \text{old weight}\\
        \eta &= \text{learning rate}
       \end{aligned}
 \right\}
\end{equation}
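As a minimal illustration of the update rule, the sketch below applies it repeatedly to a one-dimensional toy cost $C(w) = (w - 3)^2$. The cost, the starting point and the learning rate are assumptions chosen for the example and have nothing to do with the Fashion-MNIST model.

```python
def grad(w):
    # Derivative of the toy cost C(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta = 0.1   # learning rate
w = 0.0     # initial weight
for step in range(100):
    w = w - eta * grad(w)   # W_new = W_old - eta * dC/dW_old

print(w)   # ends up very close to the minimiser w = 3
```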


**According to Chain Rule**

Here $s$ and $s'$ are the inputs of Activation1 and Activation2, with $i' = \text{Activation1}(s)$ and $y' = \text{Activation2}(s')$, as defined in the forward pass.

\begin{equation}
\begin{aligned}
\frac{\partial C}{\partial \vec W} &= \frac{\partial C}{\partial y'}\times\frac{\partial y'}{\partial s'}\times\frac{\partial s'}{\partial i'}\times\frac{\partial i'}{\partial s}\times\frac{\partial s}{\partial \vec W}\\
\frac{\partial C}{\partial \vec B} &= \frac{\partial C}{\partial y'}\times\frac{\partial y'}{\partial s'}\times\frac{\partial s'}{\partial i'}\times\frac{\partial i'}{\partial s}\times\frac{\partial s}{\partial \vec B}\\
\frac{\partial C}{\partial \vec W'} &= \frac{\partial C}{\partial y'}\times\frac{\partial y'}{\partial s'}\times\frac{\partial s'}{\partial \vec W'}\\
\frac{\partial C}{\partial \vec B'} &= \frac{\partial C}{\partial y'}\times\frac{\partial y'}{\partial s'}\times\frac{\partial s'}{\partial \vec B'}
\end{aligned}
\end{equation}
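The chain-rule products can be checked numerically on a tiny network. The sketch below assumes a 3-4-2 network, `tanh` as Activation1, the identity as Activation2 and the squared-error cost summed over the outputs; all sizes and values are illustrative. It computes the gradients by multiplying the local derivatives backwards and compares $\partial C / \partial W$ against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 3))          # input row vector
y = rng.random((1, 2))          # target
W, B = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
W2, B2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 2))

def cost(W, B, W2, B2):
    y_prime = np.tanh(x @ W + B) @ W2 + B2   # Activation2 = identity
    return np.sum((y - y_prime) ** 2)

# Backward pass: multiply the local derivatives from the output back to W.
s = x @ W + B
i_prime = np.tanh(s)
y_prime = i_prime @ W2 + B2
dC_ds2 = 2.0 * (y_prime - y)                      # dC/dy' * dy'/ds' (dy'/ds' = 1)
dC_dW2 = i_prime.T @ dC_ds2                       # ... * ds'/dW'
dC_dB2 = dC_ds2                                   # ... * ds'/dB'
dC_ds = (dC_ds2 @ W2.T) * (1.0 - i_prime ** 2)    # ... * ds'/di' * di'/ds
dC_dW = x.T @ dC_ds                               # ... * ds/dW
dC_dB = dC_ds                                     # ... * ds/dB

# Central finite differences for dC/dW as an independent check.
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(W.shape[0]):
    for c in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[r, c] += eps
        Wm[r, c] -= eps
        numeric[r, c] = (cost(Wp, B, W2, B2) - cost(Wm, B, W2, B2)) / (2 * eps)

print(np.max(np.abs(numeric - dC_dW)))   # should be tiny
```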

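**Observations**:
- The size of the weight update, $\vec{W_{new}} - \vec{W_{old}}$, is directly proportional to the gradient $\frac{\partial \vec{C}}{\partial \vec{W_{old}}}$, with the learning rate $\eta$ as the scaling factor.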

In [8]:
# model.fit(train_images, train_labels, epochs=10, batch_size=32)

# test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

# print('test_loss', test_loss)
# print('test_accuracy', test_acc)

## Momentum Optimizer

**With momentum the weight update formula becomes**

$$\vec{W_{new}} = \vec{W_{old}} - m_{new}$$
$$m_{new} = \beta \ m_{old}+\eta \frac{\partial \vec{C}}{\partial \vec{W_{old}}}$$

Where $m$ is called the momentum and

$$\beta : \text{coefficient of momentum}$$
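The two formulas translate directly into code. The function name `momentum_step` and the toy cost $C(w) = (w - 3)^2$ are assumptions made for this sketch; it is not how Keras implements its optimizer internally.

```python
def momentum_step(w, m, grad, eta=0.01, beta=0.9):
    """One momentum update: returns the new weight and the new momentum."""
    m_new = beta * m + eta * grad   # m_new = beta * m_old + eta * dC/dW_old
    w_new = w - m_new               # W_new = W_old - m_new
    return w_new, m_new

# Toy cost C(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, m = 0.0, 0.0
for _ in range(300):
    w, m = momentum_step(w, m, 2.0 * (w - 3.0))

print(w)   # approaches the minimiser w = 3
```

With `beta=0.0` the function reduces to the plain gradient descent update.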

**Step:1**
$$When: \ \vec W_{old} = W_{0} \ and \ m = m_{0} = 0$$

$$\therefore m_{1} = \beta \ m_{0}+\eta \frac{\partial \vec{C}}{\partial \vec{W_{0}}}$$

$$\therefore m_{1} = \eta \frac{\partial \vec{C}}{\partial \vec{W_{0}}} \ \left (\because m_{0}=0\right )\tag{i}$$

$$\therefore \boxed{\vec W_{1} = \vec W_{0}-m_{1} = \vec W_{0} - \eta \frac{\partial \vec{C}}{\partial \vec{W_{0}}}}\tag{Same as Gradient Descent Formula}$$

**Observations**
- The first weight update is identical to plain `Gradient Descent`, because the initial momentum is `m_0 = 0`.

**Step:2**
$$When: \ \vec W_{old} = W_{1} \ and \ m =  m_{1} = \eta \frac{\partial \vec{C}}{\partial \vec{W_{0}}}\tag{from eq(i)}$$

$$\therefore \vec m_{2} = \beta \ m_{1} + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{1}}}\tag{ii}$$
$$\therefore \vec W_{2} = \vec W_{1} - m_{2}$$
$$\therefore \vec W_{2} = \vec W_{1} - \left( \beta \ m_{1} + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{1}}} \right)$$
$$\therefore \vec W_{2} = \vec W_{1} - \left( \beta \left( \eta \frac{\partial \vec{C}}{\partial \vec{W_{0}}} \right)+ \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{1}}} \right)$$
$$\therefore \boxed{\vec W_{2} = \vec W_{1} - \eta \left( \beta \ \frac{\partial \vec{C}}{\partial \vec{W_{0}}}+\frac{\partial \vec{C}}{\partial \vec{W_{1}}} \right)}$$
$$For \ \beta = 0.9$$
$$\therefore \boxed{\vec W_{2} = \vec W_{1} - \eta \left( 0.9 \ \frac{\partial \vec{C}}{\partial \vec{W_{0}}}+\frac{\partial \vec{C}}{\partial \vec{W_{1}}} \right)}$$

**Observations**
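- The second update combines 90% of the first gradient, $\frac{\partial \vec{C}}{\partial \vec{W_{0}}}$, with 100% of the most recent gradient, $\frac{\partial \vec{C}}{\partial \vec{W_{1}}}$.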

**Step:3**
$$When: \ \vec W_{old} = W_{2} \ and \ m =  \vec m_{2} = \beta \ m_{1} + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{1}}}\tag{from eq(ii)}$$

$$\therefore \vec m_{3} = \beta \ m_{2} + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{2}}}\tag{iii}$$
$$\therefore \vec m_{3} = \beta \left( \beta \ m_{1} + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{1}}}\right) + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{2}}}$$
$$\therefore \vec m_{3} = \beta \left( \beta \ \eta \frac{\partial \vec{C}}{\partial \vec{W_{0}}} + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{1}}}\right) + \eta \ \frac{\partial \vec{C}}{\partial \vec{W_{2}}}\tag{from eq(i)}$$

$$\therefore \boxed{ \vec m_{3} = \eta \left (\beta^{2} \frac{\partial \vec{C}}{\partial \vec{W_{0}}}+ \beta \frac{\partial \vec{C}}{\partial \vec{W_{1}}}+\frac{\partial \vec{C}}{\partial \vec{W_{2}}} \right)}\tag{iv}$$
$$\therefore \vec W_{3} = \vec W_{2}-m_{3}$$
$$\therefore \boxed{\vec W_{3} = \vec W_{2}-\eta \left (\beta^{2} \frac{\partial \vec{C}}{\partial \vec{W_{0}}}+ \beta \frac{\partial \vec{C}}{\partial \vec{W_{1}}}+\frac{\partial \vec{C}}{\partial \vec{W_{2}}} \right)}$$
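A quick numerical check of equation (iv): running the recursive momentum formula for three steps should give the same value as the unrolled sum. The three gradient values below are hypothetical numbers chosen only for the check.

```python
eta, beta = 0.1, 0.9
g = [4.0, -2.0, 1.0]   # hypothetical gradients dC/dW_0, dC/dW_1, dC/dW_2

# Recursive form: m_new = beta * m_old + eta * gradient, starting from m_0 = 0.
m = 0.0
for grad in g:
    m = beta * m + eta * grad

# Unrolled form from equation (iv).
m3_unrolled = eta * (beta**2 * g[0] + beta * g[1] + g[2])

print(m, m3_unrolled)   # both are approximately 0.244
```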

**Case:1**
$$For \ \beta = 0$$
$$\vec W_{3} = \vec W_{2} - \eta \frac{\partial \vec{C}}{\partial \vec{W_{2}}}\tag{Same as Gradient Descent}$$

**Case:2**
$$For \ \beta = 0.9$$
$$\therefore \boxed{\vec W_{3} = \vec W_{2}-\eta \left (0.81 \frac{\partial \vec{C}}{\partial \vec{W_{0}}}+ 0.9 \frac{\partial \vec{C}}{\partial \vec{W_{1}}}+\frac{\partial \vec{C}}{\partial \vec{W_{2}}} \right)}$$

**Observations**
- As you can see in Case:1, for beta = 0 the weight update is the same as in Gradient Descent.
- For beta = 0.9 (a value that works well in practice), the update combines 81% of the oldest gradient, 90% of the previous gradient and 100% of the most recent gradient.
- Therefore the `momentum optimizer` has an advantage over `Gradient Descent`: because the update carries the `impact of previous gradients`, it can keep moving through a `saddle point`, where the current gradient alone is close to zero.
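The saddle-point observation can be illustrated on a toy surface. The sketch below assumes the cost $C(x, y) = x^2 - y^2$, which has a saddle at the origin, and starts both optimizers almost on the ridge; the surface, the starting point and the step count are assumptions for illustration only. After the same number of steps, the run with momentum has moved further along the escape direction $y$.

```python
def grad(x, y):
    # Gradient of the toy saddle C(x, y) = x^2 - y^2
    return 2.0 * x, -2.0 * y

def run(beta, steps=50, eta=0.01):
    x, y = 1.0, 1e-3          # start almost on the x-axis, close to the ridge
    mx, my = 0.0, 0.0         # initial momentum is zero
    for _ in range(steps):
        gx, gy = grad(x, y)
        mx = beta * mx + eta * gx
        my = beta * my + eta * gy
        x, y = x - mx, y - my
    return x, y

_, y_gd = run(beta=0.0)        # beta = 0 reduces to plain gradient descent
_, y_momentum = run(beta=0.9)

print(y_gd, y_momentum)        # the momentum run has the larger y
```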

In [None]:
# activation = tf.keras.activations.tanh

# model = tf.keras.Sequential([
# tf.keras.layers.Flatten(input_shape=(28, 28)),
# tf.keras.layers.Dense(128, activation=activation),
# tf.keras.layers.Dense(10)
# ])

# optim = tf.keras.optimizers.SGD(momentum=0.9)
# model.compile(optimizer=optim,loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),metrics=['accuracy'])

# # model summary
# model.summary()

# model.fit(train_images, train_labels, epochs=10, batch_size=32)

# test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)

# print('test_loss', test_loss)
# print('test_accuracy', test_acc)

## 3. Nesterov Accelerated Gradient (NAG)