## Explanation of the Adam Optimizer

A step in Adam is defined by this equation:

$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
$

- $\theta$ are the parameters (weights) of the model, and $t$ is the current training step number. $\theta_{t+1}$ are therefore the parameters *after* the training step has been applied to the weights, i.e. this equation represents running `optimizer.step()` in PyTorch after a forward and backward pass through the model.

- $\hat{m}_t$ is the bias-corrected first moment (an exponentially decaying average of past gradients, with decay rate beta1).

- $\hat{v}_t$ is the bias-corrected second moment (an exponentially decaying average of past squared gradients, with decay rate beta2).

- η is the learning rate. 

- ϵ (epsilon) is a small number that prevents division by zero, and numerical problems such as infinite values or NaNs, when $\hat{v}_t$ is zero or too small for the floating-point precision in use.

TLDR: An exponential moving average of the gradients is divided by the square root of an exponential moving average of the squared gradients, then multiplied by the learning rate to get the actual update for each weight. That update is then subtracted from the current weight to get the new weight. This equation is run on every weight that is "unfrozen" in the model.
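
To make the equation concrete, here is a minimal sketch of a few Adam steps on a single scalar weight in plain Python. The function name `adam_step`, the learning rate of 0.1, and the constant gradient of 0.5 are illustrative assumptions, not part of any library. The bias correction (dividing by $1 - \beta^t$) mirrors the bias-correction step in this notebook's `plot` function.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical starting weight and a made-up constant gradient
theta, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([0.5, 0.5, 0.5], start=1):
    theta, m, v = adam_step(theta, grad, m, v, t)
    print(f"step {t}: theta = {theta:.4f}")
```

With a constant gradient, the bias-corrected ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ stays close to 1, so each step moves the weight by roughly the learning rate regardless of the gradient's magnitude.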

In [None]:
# Install dependencies for this notebook (shouldn't be required on Google Colab)
%pip install matplotlib
%pip install numpy

In [None]:
# Import numpy and set gradients_over_timesteps to random values (for now)
import numpy as np
gradients_over_timesteps = np.random.rand(200)
print(','.join(f"{x:0.2f}" for x in gradients_over_timesteps))

### Run cell once to define the plot function
You only need to run this once unless you change the code.

In [None]:
# Define the plot function used by the cells below
import numpy as np
import matplotlib.pyplot as plt

def plot(beta1:float, beta2:float, epsilon:float, nbeta:float=None, npow:int=None):
    m = np.zeros_like(gradients_over_timesteps) # first moment 
    v = np.zeros_like(gradients_over_timesteps) # second moment
    vp = np.zeros_like(gradients_over_timesteps) # nth moment (toy variant)

    graph_len = len(gradients_over_timesteps)+1

    for i in range(1, graph_len):
        g = gradients_over_timesteps[i-1]
        prev = i - 2  # index of the previous step; step 1 starts from zero moments
        m[i-1] = beta1 * (m[prev] if i > 1 else 0.0) + (1 - beta1) * g
        v[i-1] = beta2 * (v[prev] if i > 1 else 0.0) + (1 - beta2) * g**2
        if nbeta and npow:
            vp[i-1] = nbeta * (vp[prev] if i > 1 else 0.0) + (1 - nbeta) * g**npow

    # bias-correction
    m_hat = m / (1 - beta1**np.arange(1, graph_len))
    v_hat = v / (1 - beta2**np.arange(1, graph_len))
    if nbeta and npow:
        vp_hat = vp / (1 - nbeta**np.arange(1, graph_len))

    # scale moment by n-root
    v_hat = np.sqrt(v_hat)

    if nbeta and npow: # n-root
        vp_hat = np.power(vp_hat,1/npow)

    adam_step = m_hat / (v_hat + epsilon)
    if nbeta and npow:
        adamNpow_step = m_hat / (vp_hat + epsilon)

    m_hat_first_poly = np.poly1d(np.polyfit(np.arange(1, graph_len), m_hat, 1))
    v_hat_first_poly = np.poly1d(np.polyfit(np.arange(1, graph_len), v_hat, 1))

    # plot
    fig = plt.figure(figsize=(10, 6))
    plt.plot(gradients_over_timesteps, label="step gradients")
    plt.plot(m_hat, label="First Moment with Beta1")
    plt.plot(m_hat_first_poly(np.arange(1, graph_len)), label="First Moment trend")
    plt.plot(v_hat, label="Second Moment with Beta2")
    plt.plot(v_hat_first_poly(np.arange(1, graph_len)), label="Second Moment trend")
    plt.plot(adam_step, label="Adam learning step")

    if nbeta and npow:
        plt.plot(vp_hat, label="nth moment with nBeta")
        plt.plot(adamNpow_step, label="Adam with nth moment denominator, learning step")
    plt.legend()
    plt.xlabel("training step")
    plt.title("Adam simulator with given gradients")
    plt.show()
    # return the figure object so the caller can save it with savefig
    return fig


### Run **one** of the following cells to set the gradients_over_timesteps list

The fake gradients are set to random in a cell above, but you can try one of these patterns out if you want to see how the optimizer behaves with different gradients.

Feel free to play with different values here.  This list represents a gradient value for each training step.  We are only "simulating" a model with a single weight and many timesteps. 

In a real model, the gradients would be calculated by the model during each training step and you would have different gradients for every unfrozen weight in the model.
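
The single-weight simulation generalizes by keeping one first moment and one second moment per weight. The sketch below, in plain Python, applies the same update to a small dictionary of weights and skips a frozen one. All names here (`weights`, `frozen`, `grads_per_step`) are hypothetical and the gradient values are made up for illustration; a real framework would compute the gradients in the backward pass.

```python
import math

lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

weights = {"w1": 0.5, "w2": -1.0, "w3": 2.0}
frozen = {"w3"}                       # frozen weights are never updated
m = {name: 0.0 for name in weights}   # first moment, one per weight
v = {name: 0.0 for name in weights}   # second moment, one per weight

# Made-up gradients for two training steps
grads_per_step = [
    {"w1": 0.2, "w2": -0.4, "w3": 0.1},
    {"w1": 0.1, "w2": -0.3, "w3": 0.1},
]

for t, grads in enumerate(grads_per_step, start=1):
    for name, g in grads.items():
        if name in frozen:
            continue  # no optimizer state is touched for frozen weights
        m[name] = beta1 * m[name] + (1 - beta1) * g
        v[name] = beta2 * v[name] + (1 - beta2) * g ** 2
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v[name] / (1 - beta2 ** t)
        weights[name] -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(weights)
```

Each unfrozen weight moves against the sign of its own gradient, while `w3` keeps its initial value.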

In [None]:
# OPTIONAL a specific pattern
gradients_over_timesteps = [
    # high frequency pattern
    1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,
    1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,
    1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,
    # lower frequency pattern
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.7,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.5,0.5,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,0.3,
    0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.15,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,
    # high frequency, low amplitude pattern centered on zero
    -0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,
    -0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,
    -0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,-0.001,0.001,
    ]

In [None]:
# OPTIONAL a sine wave
steps = 300
gradients_over_timesteps = np.sin(np.linspace(0, 20*np.pi, steps))

In [None]:
# OPTIONAL yet another pattern to try for fun
gradients_over_timesteps = [
    0.4,0.0,0.3,0.0,0.3,0.0,0.4,0.0,0.3,0.0,0.3,0.0,0.4,0.0,0.3,0.0,0.3,0.0,0.4,0.0,
    0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.08,0.0,0.08,0.0,0.11,0.0,0.1,0.0,0.1,0.0,
    0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,
    0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,0.1,0.0,
    0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,0.08,0.0,
    0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,0.05,0.0,
]

### Tweak parameters to see the effect on the learning steps
You may also want to go back up and try different values for the gradients_over_timesteps list.
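
A rough rule of thumb when choosing values: an exponential moving average with decay $\beta$ averages over approximately $1/(1-\beta)$ recent steps. The sketch below is an approximation for building intuition, not an exact property of Adam, and the helper name `effective_window` is made up for this example.

```python
def effective_window(beta: float) -> float:
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)

for name, beta in [("beta1", 0.9), ("beta2", 0.999)]:
    window = effective_window(beta)
    # Relative weight of a gradient that is `window` steps old (close to 1/e)
    old_weight = beta ** window
    print(f"{name}={beta}: ~{window:.0f} steps, relative weight at that age: {old_weight:.2f}")
```

With beta1 = 0.9 the first moment mostly reflects the last 10 or so steps, while with beta2 = 0.999 the second moment reflects roughly the last 1000, which is one reason the second-moment curve in the plot changes so slowly.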


In [None]:
epsilon = 1e-8 # probably don't need to change
beta1 = 0.9
beta2 = 0.999
my_plot = plot(beta1, beta2, epsilon)
# optionally save
#my_plot.savefig("plot.png")

In [None]:
# TOY MODEL JUST FOR FUN: Use nth root denominator in place of "2,sqrt" moment of normal Adam
epsilon = 1e-8 # probably don't need to change
beta1 = 0.9
beta2 = 0.999
# if nbeta = beta2 and npower=2, this is equivalent to Adam
# you can set npower=2 and just change nbeta to see what effect changing beta2 would have
nbeta = 0.99
npower = 4 # INT: use nth power and nth root in place of "2,sqrt" moment of normal Adam (an odd value can produce NaNs when gradients are negative)
my_plot = plot(beta1, beta2, epsilon, nbeta, npower)
# optionally save
#my_plot.savefig("plot.png")
