# What are Optimizers?

Optimizers are algorithms used in machine learning and deep learning to adjust the parameters (weights and biases) of a model during training. Their main goal is to minimize the loss function, which measures how far the model's predictions are from the actual values. By updating the parameters in the right direction, optimizers help the model learn patterns from data and improve its performance.

## Detailed Explanation of Optimizers

### 1). First Order Optimizers
First order optimizers are optimization algorithms that use only the first derivative (gradient) of the loss function with respect to the model parameters to update those parameters. These optimizers are called "first order" because they rely solely on gradient information, not on higher-order derivatives (like the Hessian matrix).

- **Loss Function:** A loss function (also called cost function or objective function) measures how well the model's predictions match the actual data. The goal of training is to minimize this value. Common types include:
    - **Mean Squared Error (MSE):** Used for regression tasks, measures the average squared difference between predicted and actual values.
    - **Cross-Entropy Loss:** Used for classification tasks, measures the difference between the predicted probability distribution and the actual distribution.
    - **Hinge Loss:** Used for support vector machines, penalizes predictions that are on the wrong side of the margin.

- **Gradient:** The gradient is a vector of partial derivatives of the loss function with respect to each model parameter. It points in the direction of the steepest increase of the loss. By moving in the opposite direction of the gradient, we can minimize the loss. Types of gradients:
    - **Batch Gradient:** Computed using the entire dataset.
    - **Stochastic Gradient:** Computed using a single data point.
    - **Mini-batch Gradient:** Computed using a small subset of the data.

- **Parameter Update:** The optimizer updates the model parameters by subtracting a fraction (learning rate) of the gradient from the current parameters.
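
The three ingredients above (loss, gradient, parameter update) can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the one-variable linear model `y = w*x + b`, and the helper names `mse_loss` and `mse_gradient` are all made up for this example.

```python
# Toy data generated by y = 2x + 1 (an assumption made for this sketch).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def mse_loss(w, b):
    """Mean squared error of the line y = w*x + b on the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

w, b = 0.0, 0.0
learning_rate = 0.05

loss_before = mse_loss(w, b)
dw, db = mse_gradient(w, b)
# Parameter update: step against the gradient, scaled by the learning rate.
w, b = w - learning_rate * dw, b - learning_rate * db
loss_after = mse_loss(w, b)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

A single step against the gradient already lowers the loss on this toy problem; repeating the step is what the optimizers below do in different ways.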

**Examples of First Order Optimizers:**
- **Gradient Descent:** Updates parameters using the gradient of the loss over the whole dataset.
- **Stochastic Gradient Descent (SGD):** Updates parameters using the gradient from a single data point at a time.
- **Mini-batch Gradient Descent:** Uses a small batch of data for each update, balancing speed and accuracy.
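- **Momentum:** Adds a fraction of the previous update to the current update, which helps accelerate convergence and dampen oscillations. The accumulated velocity can also carry the parameters through flat regions and shallow local minima.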
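
As a rough illustration of the momentum idea, the sketch below compares plain gradient descent with the classical momentum update on a one-parameter toy loss. The loss `J(theta) = theta**2`, the helper names `grad` and `run`, and all hyperparameter values are assumptions chosen for this example.

```python
def grad(theta):
    """Gradient of the toy loss J(theta) = theta**2."""
    return 2.0 * theta

def run(momentum, steps=20, learning_rate=0.01, theta=5.0):
    """Run a few updates and return the final parameter value."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous update and add the new gradient step.
        velocity = momentum * velocity + learning_rate * grad(theta)
        theta = theta - velocity
    return theta

theta_plain = run(momentum=0.0)      # momentum 0 reduces to plain gradient descent
theta_momentum = run(momentum=0.9)
print(theta_plain, theta_momentum)
```

With the same learning rate and number of steps, the momentum run ends closer to the minimum at 0 in this setup, although it overshoots slightly, which is the usual trade-off of carrying velocity.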

### 2). Adaptive Optimizers
Adaptive optimizers are advanced optimization algorithms that adjust the learning rate for each parameter individually based on the history of gradients. This allows the optimizer to adapt to the geometry of the loss surface, often leading to faster and more stable convergence.

- **Learning Rate:** The learning rate is a hyperparameter that determines the size of the steps taken during optimization. Adaptive optimizers automatically adjust this for each parameter.
- **Gradient Accumulation:** Adaptive optimizers keep track of past gradients (and sometimes squared gradients) to inform future updates.

**Examples of Adaptive Optimizers:**
- **AdaGrad:** Adapts the learning rate for each parameter based on the sum of the squares of all previous gradients. Works well for sparse data.
- **RMSProp:** Modifies AdaGrad by using a moving average of squared gradients, preventing the learning rate from shrinking too much.
- **Adam (Adaptive Moment Estimation):** Combines the ideas of Momentum and RMSProp, using moving averages of both gradients and squared gradients.
- **AdaDelta:** An extension of AdaGrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
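
RMSProp does not get its own section in these notes, so here is a minimal sketch of one common formulation: a moving average of squared gradients divides the step for each parameter. The function name `rmsprop_step`, the toy loss, and the hyperparameter values are assumptions for illustration, not a reference implementation.

```python
import math

def rmsprop_step(theta, grad, avg_sq, learning_rate=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update for a list of parameters; returns new (theta, avg_sq)."""
    new_theta, new_avg_sq = [], []
    for p, g, s in zip(theta, grad, avg_sq):
        s = rho * s + (1.0 - rho) * g * g            # moving average of squared gradients
        p = p - learning_rate * g / (math.sqrt(s) + eps)
        new_theta.append(p)
        new_avg_sq.append(s)
    return new_theta, new_avg_sq

# Toy loss J = 50*a**2 + 0.5*b**2, so the two gradients differ by a factor of 100.
theta = [1.0, 1.0]
avg_sq = [0.0, 0.0]
for _ in range(50):
    grad = [100.0 * theta[0], 1.0 * theta[1]]
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq)

print(theta)
```

Although one gradient is 100 times larger than the other, both parameters move at nearly the same pace, because each step is normalized by the recent gradient magnitude of that parameter.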

**Key Terms Explained:**
- **Hyperparameter:** A parameter whose value is set before the learning process begins (e.g., learning rate, batch size).
- **Convergence:** The process of approaching a minimum value of the loss function during training.
- **Local Minimum:** A point where the loss is lower than at neighboring points, but not necessarily the lowest possible (global minimum).

By understanding these optimizers and their components, you can choose the right optimization strategy for your machine learning models.

# Understanding Optimizers in Machine Learning

## What is an Optimizer?
An optimizer is an algorithm or method used in machine learning and deep learning to adjust the parameters (weights and biases) of a model in order to minimize the loss function. The optimizer updates the model parameters based on the gradients computed during backpropagation, helping the model learn from data and improve its predictions.

## Types of Optimizers

### 1. First Order Optimizers
First order optimizers use only the first derivative (gradient) of the loss function to update the parameters. They are generally simple and efficient.
**Examples:**
- Batch Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini-batch Gradient Descent


### 2. Adaptive Optimizers
Adaptive optimizers adjust the learning rate for each parameter individually based on past gradients. They often lead to faster convergence and better performance in practice.
**Examples:**
- AdaGrad
- RMSProp
- Adam
- AdaDelta

# 1). First Order Optimizers
## Batch Gradient Descent

Batch Gradient Descent is an optimization algorithm used to train machine learning models. In this method, the entire training dataset is used to compute the gradient of the loss function with respect to the model parameters. The parameters are then updated based on this gradient.

#### How it Works:
1. Calculate the predictions for all training examples using the current model parameters.
2. Compute the loss (error) for all predictions compared to the actual values.
3. Calculate the gradient (partial derivatives) of the loss function with respect to each parameter, using the entire dataset.
4. Update the parameters by moving them in the direction that reduces the loss (opposite to the gradient).

#### Characteristics:
- Uses the whole dataset for each update, which can be slow for large datasets.
- Provides a stable and accurate estimate of the gradient.
- Can get stuck in local minima or saddle points.

#### Formula:
For parameter $\theta$ and learning rate $\alpha$:
$$\theta = \theta - \alpha \cdot \nabla J(\theta)$$
where $\nabla J(\theta)$ is the gradient of the loss function $J$ with respect to $\theta$, computed over the entire dataset.

#### Pros:
- Stable convergence
- Accurate gradient estimation

#### Cons:
- Slow for large datasets
- High memory usage

Batch Gradient Descent is best suited for smaller datasets where computation time and memory are not major concerns.

### Example: Batch Gradient Descent with Epochs

#### What is an Epoch?
An **epoch** is one complete pass through the entire training dataset during the training process. In other words, when every sample in the dataset has been used once to update the model parameters, one epoch is completed. Training a model usually involves multiple epochs to allow the model to learn better from the data.

#### Example Scenario
Suppose you have a dataset with 1000 samples. If you use batch gradient descent, the model will:
- Use all 1000 samples to compute the gradient and update the parameters once. This is one epoch.
- Repeat this process for several epochs (e.g., 10 epochs), so the model sees the entire dataset 10 times, updating the parameters after each pass.

**Training Process:**
1. **Epoch 1:**
    - Compute predictions for all 1000 samples.
    - Calculate the loss and gradients using all samples.
    - Update the model parameters once.
2. **Epoch 2:**
    - Repeat the process with the updated parameters.
3. **Continue for the desired number of epochs.**

#### Why Use Multiple Epochs?
- The model may not learn enough from just one pass through the data.
- Multiple epochs allow the model to gradually minimize the loss and improve accuracy.

#### Visualization:

| Epoch | Loss (Example) |
|-------|---------------|
|   1   |     0.95      |
|   2   |     0.80      |
|   3   |     0.65      |
|  ...  |     ...       |
|  10   |     0.20      |

As the number of epochs increases, the loss typically decreases, showing that the model is learning from the data.

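```python
# Sketch: evaluate_gradient and the variables used here are assumed to be defined elsewhere.
for i in range(nb_epochs):
    # Gradient of the loss over the entire dataset; one update per epoch.
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```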
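
The loop above leaves `evaluate_gradient`, `loss_function`, `data`, and `params` undefined. One possible way to fill them in, purely as an assumption for illustration, is a tiny linear regression where the gradient is approximated with finite differences so that the same call signature can be kept.

```python
def loss_function(data, params):
    """Mean squared error of y = w*x + b over a list of (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def evaluate_gradient(loss_function, data, params, h=1e-6):
    """Approximate the gradient of the loss with central finite differences."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grad.append((loss_function(data, up) - loss_function(data, down)) / (2 * h))
    return grad

data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]  # toy data, y = 2x + 1
params = [0.0, 0.0]
learning_rate = 0.05
nb_epochs = 500

for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    # One update per epoch, using the gradient over the full dataset.
    params = [p - learning_rate * g for p, g in zip(params, params_grad)]

print(params)  # approaches [2.0, 1.0], the slope and intercept of the toy data
```

In practice the gradient would come from backpropagation or an analytic formula; finite differences are used here only to keep the sketch short and independent of any library.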

## Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used to train machine learning models. Unlike batch gradient descent, which uses the entire dataset to compute the gradient, SGD updates the model parameters using only a single randomly selected data point at each step.

#### How it Works:
1. Shuffle the training dataset.
2. For each training example:
    - Compute the prediction using the current model parameters.
    - Calculate the loss for this single example.
    - Compute the gradient of the loss with respect to the model parameters (using only this example).
    - Update the parameters immediately based on this gradient.
3. Repeat the process for multiple epochs (passes through the dataset).

#### Characteristics:
- Updates parameters more frequently (after each example), which can lead to faster initial learning.
- Introduces more noise in the updates, which can help escape local minima but may cause the loss to fluctuate.
- Well-suited for large datasets and online learning.

#### Formula:
For parameter $\theta$ and learning rate $\alpha$, using a single example $(x_i, y_i)$:
$$\theta = \theta - \alpha \cdot \nabla J(\theta; x_i, y_i)$$
where $\nabla J(\theta; x_i, y_i)$ is the gradient of the loss function $J$ with respect to $\theta$, computed for the $i$-th example.

#### Example Scenario
Suppose you have a dataset with 1000 samples. In SGD:
- The model picks one sample at a time, computes the gradient, and updates the parameters immediately.
- After all 1000 samples have been used once, one epoch is completed.
- This process is repeated for several epochs (e.g., 10 epochs).

**Training Process:**
1. **Epoch 1:**
    - For each sample (from 1 to 1000):
        - Compute prediction, loss, gradient, and update parameters.
2. **Epoch 2:**
    - Repeat the process with the updated parameters.
3. **Continue for the desired number of epochs.**

#### Visualization:

| Epoch | Loss (Example) |
|-------|---------------|
|   1   |     1.10      |
|   2   |     0.85      |
|   3   |     0.70      |
|  ...  |     ...       |
|  10   |     0.25      |

The loss may fluctuate more from step to step, but generally decreases over epochs as the model learns.

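```python
import numpy as np

# Sketch: evaluate_gradient and the variables used here are assumed to be defined elsewhere.
for i in range(nb_epochs):
    np.random.shuffle(data)  # new random order each epoch
    for example in data:
        # Gradient from a single example, followed by an immediate update.
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```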
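
A self-contained variant of the SGD loop above might look as follows. It is a sketch under assumptions: the toy linear-regression data, the analytic per-example gradient, and the simplified `evaluate_gradient(example, params)` signature (without a `loss_function` argument) are choices made for this example, and the standard-library `random` module stands in for NumPy.

```python
import random

def evaluate_gradient(example, params):
    """Gradient of the squared error (w*x + b - y)**2 for a single (x, y) pair."""
    w, b = params
    x, y = example
    error = w * x + b - y
    return [2.0 * error * x, 2.0 * error]

random.seed(0)  # fixed seed so the shuffling is reproducible
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(20)]  # toy data, y = 2x + 1
params = [0.0, 0.0]
learning_rate = 0.05
nb_epochs = 200

for epoch in range(nb_epochs):
    random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(example, params)
        # Update immediately after every single example.
        params = [p - learning_rate * g for p, g in zip(params, params_grad)]

print(params)  # approaches [2.0, 1.0] on this noise-free toy data
```

Each epoch here performs 20 updates, one per example, which is what distinguishes SGD from the single update per epoch of batch gradient descent.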

## Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is an optimization algorithm that combines the advantages of both batch and stochastic gradient descent. Instead of using the entire dataset (batch) or a single data point (stochastic) to compute the gradient, it uses a small random subset of the data called a "mini-batch."

#### How it Works:
1. Shuffle the training dataset.
2. Divide the dataset into small groups called mini-batches (e.g., 32, 64, or 128 samples per mini-batch).
3. For each mini-batch:
    - Compute predictions for all samples in the mini-batch.
    - Calculate the loss and gradients using only the mini-batch.
    - Update the model parameters based on the mini-batch gradient.
4. Repeat the process for all mini-batches in the dataset (one epoch).
5. Repeat for multiple epochs.


#### Characteristics:
- Balances the efficiency and stability of batch and stochastic methods.
- Faster convergence than batch gradient descent and less noisy than stochastic gradient descent.
- Well-suited for large datasets and can take advantage of parallel hardware (like GPUs).

#### Formula:
For parameter $\theta$ and learning rate $\alpha$, using a mini-batch $B$ of $m$ samples:
$$\theta = \theta - \alpha \cdot \frac{1}{m} \sum_{i \in B} \nabla J(\theta; x_i, y_i)$$
where $\nabla J(\theta; x_i, y_i)$ is the gradient of the loss function for the $i$-th sample in the mini-batch.

#### Example Scenario
Suppose you have a dataset with 1000 samples and choose a mini-batch size of 100:
- The dataset is divided into 10 mini-batches (each with 100 samples).
- For each mini-batch, compute the gradient and update the parameters.
- After all 10 mini-batches are processed, one epoch is completed.
- Repeat for several epochs (e.g., 10 epochs).
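
The arithmetic behind this scenario can be checked with a short helper. The function name `updates_per_epoch` is hypothetical, and the sketch assumes that a final, smaller batch is still processed when the dataset size is not a multiple of the batch size.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

num_samples = 1000
print(updates_per_epoch(num_samples, batch_size=num_samples))  # batch gradient descent: 1
print(updates_per_epoch(num_samples, batch_size=1))            # SGD: 1000
print(updates_per_epoch(num_samples, batch_size=100))          # mini-batch: 10
```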

**Training Process:**
1. **Epoch 1:**
    - For each mini-batch (1 to 10):
        - Compute predictions, loss, gradient, and update parameters.
2. **Epoch 2:**
    - Repeat the process with updated parameters.
3. **Continue for the desired number of epochs.**

#### Visualization:

| Epoch | Loss (Example) |
|-------|---------------|
|   1   |     0.90      |
|   2   |     0.75      |
|   3   |     0.60      |
|  ...  |     ...       |
|  10   |     0.18      |

Mini-batch gradient descent is the most commonly used method in deep learning due to its efficiency and effectiveness.

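```python
import numpy as np

# Sketch: get_batches, evaluate_gradient, and the variables used here are assumed to be defined elsewhere.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):  # each mini-batch holds 50 samples
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```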
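
The loop above relies on a `get_batches` helper that is not defined in these notes. The sketch below shows one plausible implementation together with a runnable training loop; the slicing strategy, the toy data, and the simplified `evaluate_gradient(batch, params)` signature are assumptions made for this example.

```python
import random

def get_batches(data, batch_size):
    """Yield consecutive slices of `data`; the last batch may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def evaluate_gradient(batch, params):
    """Average gradient of the squared error over one mini-batch of (x, y) pairs."""
    w, b = params
    dw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
    db = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
    return [dw, db]

random.seed(0)  # fixed seed so the shuffling is reproducible
data = [(x / 50.0, 2.0 * (x / 50.0) + 1.0) for x in range(100)]  # toy data, y = 2x + 1
params = [0.0, 0.0]
learning_rate = 0.1
nb_epochs = 300

for epoch in range(nb_epochs):
    random.shuffle(data)
    for batch in get_batches(data, batch_size=10):
        params_grad = evaluate_gradient(batch, params)
        params = [p - learning_rate * g for p, g in zip(params, params_grad)]

print(params)  # approaches [2.0, 1.0] on this noise-free toy data
```

With 100 samples and a batch size of 10, each epoch performs 10 updates, and the averaged gradient matches the $\frac{1}{m}\sum$ term in the formula above.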

# Adagrad Optimizer: Detailed Explanation

## Introduction
Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm designed to adapt the learning rate for each parameter individually, improving performance on sparse data and features. It is widely used in machine learning and deep learning for training models.

---

## The Core Idea
Adagrad adapts the learning rate for each parameter based on the historical gradients. Parameters with infrequent updates get larger learning rates, while those with frequent updates get smaller learning rates.
In a neural network, this means that every individual weight, in every neuron and every layer, has its own effective learning rate, and that rate is recomputed at every iteration.
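
To make the frequent versus infrequent distinction concrete, the sketch below tracks the accumulated squared gradients of two parameters. The gradient pattern (one parameter updated every step, the other every tenth step) and the value of `eta` are assumptions chosen for illustration.

```python
import math

eta, eps = 0.1, 1e-8
G = {"frequent": 0.0, "rare": 0.0}  # accumulated squared gradients per parameter

for t in range(1, 101):
    # Assumed gradient pattern: "frequent" gets a gradient of 1.0 every step,
    # "rare" gets one only every 10th step (as with a rarely active feature).
    grads = {"frequent": 1.0, "rare": 1.0 if t % 10 == 0 else 0.0}
    for name, g in grads.items():
        G[name] += g * g

effective_lr = {name: eta / (math.sqrt(acc) + eps) for name, acc in G.items()}
print(effective_lr)
```

After 100 steps the rarely updated parameter has accumulated far less squared gradient, so its effective learning rate is roughly three times larger than that of the frequently updated one.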

---

## Mathematical Formulation
Let:
- $\theta$ be the parameter vector.
- $g_t$ be the gradient at time step $t$.
- $G_t$ be the sum of the squares of the gradients up to time $t$.

### 1. Accumulated Squared Gradients
For each parameter $i$:

$$
G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2
$$

Where:
- $g_{\tau,i}$ is the gradient of the loss with respect to parameter $i$ at time $\tau$.

### 2. Parameter Update Rule
The update for each parameter $i$ at time $t$:

$$
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \cdot g_{t,i}
$$

Where:
- $\eta$ is the initial learning rate (a hyperparameter).
- $\epsilon$ is a small constant (e.g., $10^{-8}$) to prevent division by zero.

---

## Step-by-Step Breakdown
1. **Compute the gradient** $g_t$ for the current mini-batch.
2. **Accumulate squared gradients** for each parameter:
   $$
   G_{t,i} = G_{t-1,i} + g_{t,i}^2
   $$
3. **Update each parameter** using the adapted learning rate:
   $$
   \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \cdot g_{t,i}
   $$
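
The three steps above translate fairly directly into code. The following is a minimal sketch, not a reference implementation: the function name `adagrad_step`, the toy quadratic loss, and the choice `eta=0.5` are assumptions for this example.

```python
import math

def adagrad_step(theta, grad, G, eta=0.5, eps=1e-8):
    """One AdaGrad update on lists of parameters; returns new (theta, G)."""
    new_theta, new_G = [], []
    for p, g, acc in zip(theta, grad, G):
        acc = acc + g * g                           # step 2: accumulate g^2
        p = p - eta / (math.sqrt(acc) + eps) * g    # step 3: scaled update
        new_theta.append(p)
        new_G.append(acc)
    return new_theta, new_G

# Toy loss J(theta) = theta_0**2 + 10 * theta_1**2 (an assumption for this sketch).
theta = [3.0, -2.0]
G = [0.0, 0.0]
for t in range(200):
    grad = [2.0 * theta[0], 20.0 * theta[1]]        # step 1: compute the gradient
    theta, G = adagrad_step(theta, grad, G)

print(theta)  # both parameters end up close to the minimum at 0
```

Note that `G` only ever grows, which is the source of the learning-rate decay discussed in the disadvantages below.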

---

## Subformulas
- **Element-wise operations:** All operations are performed element-wise for each parameter.
- **Learning rate adaptation:** The effective learning rate for parameter $i$ at time $t$ is:
  $$
  \text{Effective Learning Rate}_{t,i} = \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}
  $$

---

## Graphical Representation

### 1. Accumulated Gradient Growth
A plot of $G_{t,i}$ (y-axis) vs. time step $t$ (x-axis) shows a monotonically non-decreasing curve, since squared gradients are never negative.

### 2. Effective Learning Rate Decay
A plot of the effective learning rate $\frac{\eta}{\sqrt{G_{t,i}} + \epsilon}$ (y-axis) vs. time step $t$ (x-axis) shows a decaying curve, since $G_{t,i}$ increases over time.

```
Graph 1: Accumulated Squared Gradients (G_t)
|
|         /
|        /
|       /
|      /
|_____/____________ t

Graph 2: Effective Learning Rate
|
|\
| \
|  \
|   \
|____\___________ t
```
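
The decay sketched in the second graph can also be reproduced numerically. The example below assumes a constant gradient of 1.0 at every step, which is an idealization chosen only to make the pattern easy to see; in that case $G_t = t$ and the effective learning rate behaves like $\eta / \sqrt{t}$.

```python
import math

eta, eps = 0.1, 1e-8
G = 0.0
effective_lrs = []
for t in range(1, 1001):
    g = 1.0                      # assumed constant gradient, for illustration only
    G += g * g                   # with g = 1, G_t grows linearly: G_t = t
    effective_lrs.append(eta / (math.sqrt(G) + eps))

for t in (1, 10, 100, 1000):
    print(t, round(effective_lrs[t - 1], 5))
```

Every hundredfold increase in the number of steps shrinks the effective learning rate by a factor of ten in this setting.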

---

## Advantages
- **Automatic learning rate adaptation** for each parameter.
- **Works well with sparse data** (e.g., NLP, text data).

## Disadvantages
- **Aggressive, monotonically decreasing learning rate** can make the learning rate too small, causing the optimizer to stop learning before reaching the optimum.

---

## Summary Table

| Step                | Formula                                                      |
|---------------------|--------------------------------------------------------------|
| Accumulate Gradients| $G_{t,i} = G_{t-1,i} + g_{t,i}^2$                        |
| Update Parameters   | $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \cdot g_{t,i}$ |
| Effective LR        | $\frac{\eta}{\sqrt{G_{t,i}} + \epsilon}$                 |

---


# AdaDelta Optimizer: Complete Detailed Explanation

## Introduction
AdaDelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. AdaDelta adapts learning rates based on a moving window of gradient updates, rather than accumulating all past squared gradients. This allows AdaDelta to continue learning even after many updates.

---

## Core Idea
AdaDelta addresses Adagrad's main limitation: the continual decay of learning rates. Instead of accumulating all past squared gradients, AdaDelta restricts the window of accumulated past gradients to a fixed size (using an exponentially decaying average). It also eliminates the need to set a default learning rate.

---

## Mathematical Formulation
Let:
- $\theta$ be the parameter vector.
- $g_t$ be the gradient at time step $t$.
- $E[g^2]_t$ be the exponentially decaying average of past squared gradients.
- $E[\Delta\theta^2]_t$ be the exponentially decaying average of past squared parameter updates.
- $\rho$ be the decay rate (typically $0.9$ or $0.95$).
- $\epsilon$ be a small constant for numerical stability (e.g., $10^{-6}$).

### 1. Accumulate Squared Gradients (Running Average)

$$
E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1 - \rho) \cdot g_t^2
$$

- $E[g^2]_t$: Running average of squared gradients at time $t$.
- $\rho$: Decay rate, controls how much history is considered.
- $g_t^2$: Element-wise square of the current gradient.

### 2. Compute Update Step

The update step $\Delta\theta_t$ is computed as:

$$
\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t
$$

- $E[\Delta\theta^2]_{t-1}$: Running average of squared parameter updates from the previous step.
- $g_t$: Current gradient.
- $\epsilon$: Small constant for stability.

### 3. Accumulate Squared Updates (Running Average)

After computing $\Delta\theta_t$, update its running average:

$$
E[\Delta\theta^2]_t = \rho \cdot E[\Delta\theta^2]_{t-1} + (1 - \rho) \cdot (\Delta\theta_t)^2
$$

- $E[\Delta\theta^2]_t$: Running average of squared parameter updates at time $t$.

### 4. Parameter Update

Update the parameter:

$$
\theta_{t+1} = \theta_t + \Delta\theta_t
$$

---

## Step-by-Step Breakdown
1. **Compute the gradient** $g_t$ for the current mini-batch.
2. **Update running average of squared gradients:**
   $$
   E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2
   $$
3. **Compute update step:**
   $$
   \Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t
   $$
4. **Update running average of squared updates:**
   $$
   E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1-\rho) (\Delta\theta_t)^2
   $$
5. **Update parameters:**
   $$
   \theta_{t+1} = \theta_t + \Delta\theta_t
   $$
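
The five steps above can be sketched for a single scalar parameter as follows. The function name `adadelta_step`, the toy loss, and the number of iterations are assumptions for illustration; note that no learning rate appears anywhere in the update.

```python
import math

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    """One AdaDelta update for a single scalar parameter."""
    # Step 2: running average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Step 3: update scaled by the ratio of the two RMS values.
    delta = -math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * grad
    # Step 4: running average of squared updates.
    avg_sq_update = rho * avg_sq_update + (1.0 - rho) * delta ** 2
    # Step 5: apply the update.
    theta = theta + delta
    return theta, avg_sq_grad, avg_sq_update

# Toy loss J(theta) = theta**2 with gradient 2*theta (an assumption for this sketch).
theta, avg_sq_grad, avg_sq_update = 1.0, 0.0, 0.0
start = theta
for t in range(500):
    grad = 2.0 * theta                              # step 1: compute the gradient
    theta, avg_sq_grad, avg_sq_update = adadelta_step(
        theta, grad, avg_sq_grad, avg_sq_update
    )

print(start, theta)
```

Because both running averages start at zero, the first updates are on the order of $\sqrt{\epsilon}$, so AdaDelta tends to start slowly; this is one reason $\epsilon$ still matters as a hyperparameter.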

---

## Explanation of Terms
- **$\rho$ (Decay Rate):** Controls the memory of the running averages. Higher $\rho$ means longer memory.
- **$E[g^2]_t$:** Exponentially decaying average of squared gradients (tracks how large recent gradients have been).
- **$E[\Delta\theta^2]_t$:** Exponentially decaying average of squared parameter updates (tracks how large recent updates have been).
- **$\epsilon$:** Prevents division by zero and stabilizes the update.
- **$g_t$:** Gradient of the loss with respect to parameters at time $t$.
- **$\Delta\theta_t$:** The actual parameter update step at time $t$.
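
To get a feel for how $\rho$ controls the memory of a running average, the sketch below feeds a constant value into the averaging rule for several decay rates. The helper name `running_average` and the constant input signal are assumptions chosen for this example.

```python
def running_average(values, rho):
    """Exponentially decaying average E_t = rho*E_{t-1} + (1-rho)*x_t, with E_0 = 0."""
    avg, history = 0.0, []
    for x in values:
        avg = rho * avg + (1.0 - rho) * x
        history.append(avg)
    return history

# Assumed signal: a constant squared gradient of 1.0, starting from E_0 = 0.
signal = [1.0] * 50

for rho in (0.5, 0.9, 0.99):
    history = running_average(signal, rho)
    # After t steps of a constant 1.0 the average equals 1 - rho**t.
    print(rho, round(history[9], 3), round(history[49], 3))
```

A small $\rho$ tracks the new value within a few steps, while a $\rho$ close to 1 reacts slowly because it gives more weight to older history.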

---

## Graphical Representation

### 1. Running Average of Squared Gradients ($E[g^2]_t$)
A plot of $E[g^2]_t$ (y-axis) vs. time step $t$ (x-axis) shows a smooth curve that adapts to the recent magnitude of gradients.

### 2. Adaptive Update Step
A plot of $\Delta\theta_t$ (y-axis) vs. time step $t$ (x-axis) shows how the update step size adapts over time, depending on the ratio of running averages.

```
Graph 1: Running Average of Squared Gradients (E[g^2]_t)
|
|      __
|     /  \
|    /    \
|___/______\______ t

Graph 2: Adaptive Update Step (Δθ_t)
|
|  /\    /\
| /  \__/  \
|/         \____ t
```

---

## Advantages
- **No need to set a default learning rate.**
- **Adapts learning rates based on recent history.**
- **Works well for non-stationary objectives and deep networks.**

## Disadvantages
- **Still requires tuning of $\rho$ and $\epsilon$.**
- **Can be sensitive to initialization of running averages.**

---

## Summary Table
| Step | Formula |
|------|---------|
| Running Avg. Gradients | $E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2$ |
| Update Step | $\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t$ |
| Running Avg. Updates | $E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1-\rho) (\Delta\theta_t)^2$ |
| Parameter Update | $\theta_{t+1} = \theta_t + \Delta\theta_t$ |

---


# Adam Optimizer: Complete Detailed Explanation

## Introduction
Adam (Adaptive Moment Estimation) is a popular optimization algorithm that combines the advantages of two other extensions of stochastic gradient descent: AdaGrad and RMSProp. Adam computes adaptive learning rates for each parameter by estimating the first (mean) and second (uncentered variance) moments of the gradients.

---

## Core Idea
Adam maintains two running averages for each parameter:
- The exponentially decaying average of past gradients (first moment, like momentum)
- The exponentially decaying average of past squared gradients (second moment, like RMSProp)

Adam also includes bias correction terms to account for the initialization of these averages at zero.

---

## Mathematical Formulation
Let:
- $\theta$ be the parameter vector.
- $g_t$ be the gradient at time step $t$.
- $m_t$ be the first moment estimate (mean of gradients).
- $v_t$ be the second moment estimate (uncentered variance of gradients).
- $\beta_1$ be the decay rate for the first moment (typically $0.9$).
- $\beta_2$ be the decay rate for the second moment (typically $0.999$).
- $\epsilon$ be a small constant for numerical stability (e.g., $10^{-8}$).
- $\eta$ be the learning rate (default $0.001$).

### 1. Initialize
- $m_0 = 0$ (vector of zeros)
- $v_0 = 0$ (vector of zeros)

### 2. Update Biased First and Second Moment Estimates

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

- $m_t$: Exponentially decaying average of past gradients (first moment).
- $v_t$: Exponentially decaying average of past squared gradients (second moment).

### 3. Compute Bias-Corrected Estimates

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
$$

$$
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

- $\hat{m}_t$: Bias-corrected first moment estimate.
- $\hat{v}_t$: Bias-corrected second moment estimate.
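
The effect of bias correction is easiest to see on the first few steps. The sketch below assumes a constant gradient `g = 2.0`, an idealization used only for illustration: the raw estimate $m_t$ starts near zero because $m_0 = 0$, while the corrected estimate $\hat{m}_t$ recovers the true mean immediately.

```python
beta1 = 0.9
g = 2.0            # assumed constant gradient, for illustration only
m = 0.0            # m_0 = 0

for t in range(1, 6):
    m = beta1 * m + (1.0 - beta1) * g      # biased first moment
    m_hat = m / (1.0 - beta1 ** t)         # bias-corrected first moment
    print(t, round(m, 4), round(m_hat, 4))
```

For a constant gradient, $m_t = (1 - \beta_1^t)\, g$, so dividing by $1 - \beta_1^t$ gives exactly $g$ at every step. The same reasoning applies to $v_t$ with $\beta_2$.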

### 4. Parameter Update

$$
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
$$

---

## Step-by-Step Breakdown
1. **Compute the gradient** $g_t$ for the current mini-batch.
2. **Update first moment estimate:**
   $$
   m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
   $$
3. **Update second moment estimate:**
   $$
   v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
   $$
4. **Compute bias-corrected estimates:**
   $$
   \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
   $$
   $$
   \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
   $$
5. **Update parameters:**
   $$
   \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
   $$
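
Putting the five steps together for a single scalar parameter gives the sketch below. The function name `adam_step`, the toy loss, and the choice `eta=0.05` for the run are assumptions for illustration, not a reference implementation.

```python
import math

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1.0 - beta1) * grad                      # step 2
    v = beta2 * v + (1.0 - beta2) * grad ** 2                 # step 3
    m_hat = m / (1.0 - beta1 ** t)                            # step 4
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta / (math.sqrt(v_hat) + eps) * m_hat    # step 5
    return theta, m, v

# Toy loss J(theta) = (theta - 3)**2 (an assumption for this sketch).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)                                # step 1
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)

print(theta)  # close to the minimizer 3.0
```

On the very first step the bias-corrected ratio $\hat{m}_1 / \sqrt{\hat{v}_1}$ has magnitude 1, so the parameter moves by roughly $\eta$ regardless of the gradient scale; this is why $\eta$ acts as an approximate bound on the step size early in training.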

---

## Explanation of Terms
- **$\beta_1$ (First Moment Decay Rate):** Controls the memory of the mean of gradients. Typical value: $0.9$.
- **$\beta_2$ (Second Moment Decay Rate):** Controls the memory of the uncentered variance of gradients. Typical value: $0.999$.
- **$m_t$:** Exponentially decaying average of past gradients (momentum).
- **$v_t$:** Exponentially decaying average of past squared gradients (RMSProp-like behavior).
- **$\hat{m}_t$, $\hat{v}_t$:** Bias-corrected versions of $m_t$ and $v_t$.
- **$\epsilon$:** Prevents division by zero and stabilizes the update.
- **$\eta$:** Learning rate.

---

## Graphical Representation

### 1. First and Second Moment Estimates
A plot of $m_t$ and $v_t$ (y-axis) vs. time step $t$ (x-axis) shows how the running averages adapt to the gradient's magnitude and variance.

### 2. Adaptive Update Step
A plot of the effective update $\frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$ (y-axis) vs. time step $t$ (x-axis) shows how Adam adapts the step size for each parameter.

```
Graph 1: First and Second Moment Estimates (m_t, v_t)
|
|   /\
|  /  \
|_/____\______ t

Graph 2: Adaptive Update Step
|
|  /\    /\
| /  \__/  \
|/         \____ t
```

---

## Advantages
- **Combines benefits of AdaGrad and RMSProp.**
- **Works well in practice for most deep learning problems.**
- **Bias correction improves performance, especially in early training.**

## Disadvantages
- **Requires tuning of $\eta$, $\beta_1$, $\beta_2$, and $\epsilon$.**
- **Can sometimes lead to non-convergent or unstable training if not tuned properly.**

---

## Summary Table
| Step | Formula |
|------|---------|
| First Moment | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ |
| Second Moment | $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ |
| Bias Correction | $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$ |
| Parameter Update | $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$ |

---
