The Adam optimization algorithm is a popular extension of gradient descent that combines ideas from two other enhancements to the basic algorithm: **momentum** and **adaptive learning rate** techniques. Adam stands for **"Adaptive Moment Estimation"**, and it's particularly well-suited for problems with large datasets or many parameters.

### Key Features of Adam

1. **Momentum**: Adam maintains a decaying average of past gradients (similar to momentum), which serves as the "velocity" in classical momentum.
   
2. **Adaptive Learning Rates**: It keeps an exponentially decaying average of past squared gradients. This term adapts the learning rate for each parameter, making it smaller for parameters associated with frequently occurring features.

3. **Bias Correction**: Adam includes bias corrections to the first and second moment estimates to counteract their initialization at the origin.

### Adam Update Rules

Given:
- $\theta$: Parameters to be optimized.
- $\alpha$: Step size (sometimes referred to as the learning rate).
- $\beta_1, \beta_2$: Exponential decay rates for the moment estimates; typical values are 0.9 and 0.999, respectively.
- $\epsilon$: A small constant (e.g., $10^{-8}$) to prevent any division by zero in the implementation.

The update rules at time step $ t $ are as follows:

1. **Gradient Calculation**: $ g_t = \nabla_{\theta} J(\theta_t) $
2. **Update biased first moment estimate**: $ m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t $
3. **Update biased second raw moment estimate**: $ v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 $
4. **Compute bias-corrected first moment estimate**: $ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $
5. **Compute bias-corrected second raw moment estimate**: $ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $
6. **Update parameters**: $ \theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $

### Example in Julia

Below is a simple Julia implementation of the Adam optimizer for a quadratic cost function $ J(\theta) = (\theta - 3)^2 $.


This script demonstrates using the Adam optimizer to minimize a simple quadratic function, illustrating how it adjusts its updates based on both the magnitude and frequency of the gradients. Adam's effectiveness across various types of neural network training tasks has made it a popular choice in deep learning frameworks and applications.

In [1]:
# Initialization
α = 0.01
β₁ = 0.9
β₂ = 0.999
ε = 1e-8
θ = 0.0  # Initial guess
m = 0.0  # Initialize first moment vector
v = 0.0  # Initialize second moment vector
t = 0    # Time step

# Cost function and its derivative
J(θ) = (θ - 3)^2
Jꜝ(θ) = 2 * (θ - 3)

# Adam optimization
for i in 1:100
    t += 1
    g = Jꜝ(θ)  # Get gradients w.r.t. stochastic objective at timestep t

    # Update biased first moment estimate
    m = β₁ * m + (1 - β₁) * g
    # Update biased second raw moment estimate
    v = β₁ * v + (1 - β₂) * g^2

    # Compute bias-corrected first moment estimate
    m̂ = m / (1 - β₁^t)
    # Compute bias-corrected second raw moment estimate
    v̂ = v / (1 - β₂^t)

    # Update parameters
    θ -= α * m̂ / (sqrt(v̂) + ε)

    println("Iteration $i: θ = $θ, Cost = $(J(θ))")
end

Iteration 1: θ = 0.009999999983333334, Cost = 8.940100000099667
Iteration 2: θ = 0.020257203963417146, Cost = 8.878867130531912
Iteration 3: θ = 0.030773362949946256, Cost = 8.816306822167572
Iteration 4: θ = 0.04155004259986488, Cost = 8.752426150440861
Iteration 5: θ = 0.05258862287237344, Cost = 8.687233826021373
Iteration 6: θ = 0.06389029846547596, Cost = 8.620740179445152
Iteration 7: θ = 0.075456080015039, Cost = 8.552957139921002
Iteration 8: θ = 0.08728679603027181, Cost = 8.4838982085796
Iteration 9: θ = 0.09938309553300367, Cost = 8.413578426479699
Iteration 10: θ = 0.11174545136222767, Cost = 8.342014337726782
Iteration 11: θ = 0.12437416410017509, Cost = 8.269223948094567
Iteration 12: θ = 0.13726936657175093, Cost = 8.195226679568503
Iteration 13: θ = 0.15043102886554005, Cost = 8.120043321252306
Iteration 14: θ = 0.16385896382180473, Cost = 8.043695977093927
Iteration 15: θ = 0.17755283293094964, Cost = 7.966208010896109
Iteration 16: θ = 0.19151215258480972, Cost = 7.88