# Introduction

We have all seen the gradient descent (GD) update expression, and we know how backprop works: the forward pass, the backward pass, and the update rule. We also know how to picture the loss landscape, with each weight plotted on its own axis and the loss on the remaining axis, where the goal is to find the weights that give the lowest loss.

As we trudge along this landscape of loss, we realize that we are doing the most fundamental part of deep learning. If we think of LLMs as a giant mathematical expression, then the main objective of training is to tune the constants in that expression (the weights) so that the resulting expression is close enough to the data manifold used during training. In that sense, deep learning is essentially an optimization problem.

And then, if moving across this fractured landscape to find that elusive global minimum is the goal, the key becomes exactly *how* we move through it.

In this notebook, that is what we will cover. And for me, that is also the *hook* for optimizers. Because while it can get deeply mathematical and the explanations can get quite abstract, the truth is that at the end of the day, all we're doing is writing an algorithm to walk down a hill.

# Optimizers Cheat Sheet

Below is a comprehensive optimizer reference table. All optimizers follow the standard update form: $w_{t+1} = w_t + v_t$, where each optimizer defines $v_t$ (the step) plus internal state updates.

## Notation

- $w_t$: parameters at step $t$
- $g_t = \nabla L_t(w_t)$: stochastic gradient (mini-batch)
- $\eta$: learning rate
- $\epsilon$: small constant for numerical stability
- $\odot$: elementwise multiply, $\oslash$: elementwise divide
- Bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{s}_t = \frac{s_t}{1-\beta_2^t}$

| Family | Name | Formulae | Hook | Paper | Notes | Downsides |
|--------|------|----------|------|-------|-------|-----------|
| **A. Baselines** | **SGD** | $v_t = -\eta \, g_t$ | "Step = −LR × gradient" | Classical; see Nesterov's *Introductory Lectures on Convex Optimization* (2004) | Baseline: simplest unbiased stochastic descent. Often generalizes well, but can be slow and LR-sensitive. | Slow convergence; very LR-sensitive; struggles with ill-conditioned problems (ravines/elongated valleys); easily stuck at saddle points; noisy trajectory with small batches; no per-parameter adaptation. |
| | **SGD + Momentum** | $v_t = -\eta \, m_t$<br>$m_t = \beta_1 m_{t-1} + g_t$ | "Remember where you were going" | [Polyak (1964)](https://doi.org/10.1016/0041-5553(64)90137-5) "Some methods of speeding up the convergence of iteration methods" | Smooths noisy gradients; accelerates along consistent directions. Dampens oscillations in ravines. | Overshoots near minima causing oscillations/"ringing"; requires careful $\beta$ tuning; high $\beta$ causes persistent oscillations; still no per-parameter LR; can amplify noise if $\beta$ too high. |
| | **Nesterov (NAG)** | $v_t = \beta_1 v_{t-1} - \eta \, \nabla L_t(w_t + \beta_1 v_{t-1})$ | "Momentum, but peek ahead" | [Nesterov (1983)](https://doi.org/10.1007/BF01032144) "A method of solving a convex programming problem with convergence rate O(1/k²)" | Gradient evaluated at lookahead point. Theoretically optimal rates in some convex settings; often steadier in practice. | Less exploratory than momentum (may get stuck in local minima more easily); theoretical guarantees mainly for convex; lookahead adds implementation complexity; still oscillates (just less); benefits diminish in highly stochastic settings. |
| **B. Adaptive (diagonal)** | **AdaGrad** | $v_t = -\eta \, \frac{g_t}{\sqrt{s_t} + \epsilon}$<br>$s_t = s_{t-1} + g_t \odot g_t$ | "Remembers *all* squared grads" | [Duchi, Hazan, Singer (2011)](https://jmlr.org/papers/v12/duchi11a.html) "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization" | Per-parameter LR; great for sparse features. Downside: accumulator grows forever → effective LR can decay too much. | **Accumulator grows unbounded** → effective LR decays to ~0, training stalls; unsuitable for deep/non-convex problems; cannot recover from early large gradients; poor for long training runs; essentially unusable for modern neural nets. |
| | **RMSProp** | $v_t = -\eta \, \frac{g_t}{\sqrt{s_t} + \epsilon}$<br>$s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t \odot g_t)$ | "AdaGrad but forget the distant past" | [Hinton, Tieleman (2012)](https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) Coursera Lecture 6, Slide 29 (unpublished) | Fixes AdaGrad's vanishing step sizes via EMA of squared gradients. Works well on non-stationary problems. | No bias correction (biased early estimates); less theoretical grounding than Adam; still sensitive to LR and $\beta_2$ choices; no momentum component (slower than Adam in practice); unpublished origin limits formal analysis. |
| | **AdaDelta** | $v_t = -\frac{\sqrt{u_{t-1}+\epsilon}}{\sqrt{s_t+\epsilon}} \odot g_t$<br>$s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t \odot g_t)$<br>$u_t = \beta_2 u_{t-1} + (1-\beta_2)(v_t \odot v_t)$ | "Scales updates by past update RMS" | [Zeiler (2012)](https://arxiv.org/abs/1212.5701) "ADADELTA: An Adaptive Learning Rate Method" | Reduces sensitivity to learning rate by tracking RMS of updates as well as gradients. | Slow convergence in practice; largely superseded by Adam; limited modern adoption/support; extra state (update accumulator) for marginal benefit; no clear advantage over RMSProp/Adam in most benchmarks. |
| **C. Adam Family** | **Adam** | $v_t = -\eta \, \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}$<br>$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$<br>$s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t \odot g_t)$ | "(mean grad) / (RMS grad)" | [Kingma & Ba (2014)](https://arxiv.org/abs/1412.6980) "Adam: A Method for Stochastic Optimization" | Combines momentum (1st moment) + RMS scaling (2nd moment). Extremely common baseline; stable across many settings. | **Proven non-convergence** on some convex problems ([Reddi et al. 2018](https://arxiv.org/abs/1904.09237)); worse generalization than SGD in some vision tasks; can exhibit limit cycles; zero-initialized second moment ($s_0=0$) makes the earliest steps behave like sign descent, which can be unstable; tends to find sharper minima; can overfit more than SGD. |
| | **AdamW** | $v_t = -\eta \, \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon} - \eta \lambda w_t$<br>$m_t, s_t$ as Adam | "Adam step + separate shrink" | [Loshchilov & Hutter (2017)](https://arxiv.org/abs/1711.05101) "Decoupled Weight Decay Regularization" | Decouples weight decay from adaptive scaling. De-facto standard in most LLM pretraining recipes. | Inherits Adam's non-convergence issues; sharp minima problem persists; generalization gap vs SGD remains in some settings; $\mu P$ scaling of optimal LR [breaks down at scale](https://arxiv.org/html/2405.13698v1); weight decay doesn't address all Adam pathologies with LR schedules/normalization. |
| | **AMSGrad** | $v_t = -\eta \, \frac{\hat{m}_t}{\sqrt{\bar{s}_t} + \epsilon}$<br>$m_t, s_t$ as Adam<br>$\bar{s}_t = \max(\bar{s}_{t-1}, s_t)$ | "Adam, but $s_t$ can only go up" | [Reddi, Kale, Kumar (2018)](https://arxiv.org/abs/1904.09237) "On the Convergence of Adam and Beyond" | Non-decreasing second moment fixes some convergence pathologies of Adam. | Extra memory (stores running max); often **no practical improvement** over Adam in real experiments; slower adaptation (max never decreases); rarely adopted in practice; theoretical fix doesn't always translate to better results. |
| | **AdaMax** | $v_t = -\eta \, \frac{\hat{m}_t}{u_t + \epsilon}$<br>$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$<br>$u_t = \max(\beta_2 u_{t-1}, \|g_t\|)$ | "Adam, but scale by max-grad" | [Kingma & Ba (2014)](https://arxiv.org/abs/1412.6980) (Section 7 of Adam paper) | $\ell_\infty$-norm variant of Adam; sometimes more stable. | Less studied than Adam; not widely adopted; can be sensitive to outlier gradients (max-based); limited empirical validation at scale; marginal benefits over Adam in most cases. |
| | **Nadam** | $v_t = -\eta \, \frac{\tilde{m}_t}{\sqrt{\hat{s}_t} + \epsilon}$<br>$m_t, s_t$ as Adam<br>$\tilde{m}_t = \beta_1 \hat{m}_t + \frac{(1-\beta_1) g_t}{1-\beta_1^t}$ | "Adam + Nesterov peek" | [Dozat (2016)](https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ) "Incorporating Nesterov Momentum into Adam" | Nesterov-style acceleration into Adam's momentum. Often faster/better than Adam. | Inherits Adam's convergence issues; added complexity from Nesterov formulation; not always better than Adam (task-dependent); less widely supported in frameworks; more hyperparameters to reason about. |
| | **RAdam** | $v_t = -\eta \cdot r_t \cdot \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}$ if $\rho_t>4$<br>$v_t = -\eta \cdot \hat{m}_t$ otherwise<br>$m_t, s_t$ as Adam; $\rho_t, r_t$ = variance-rectification | "Adam with early-step correction" | [Liu et al. (2019)](https://arxiv.org/abs/1908.03265) "On the Variance of the Adaptive Learning Rate and Beyond" | Rectifies variance of Adam's adaptive LR early in training. Reduces need for hand-designed warmup. | Still has Adam's fundamental convergence issues; added complexity of variance rectification logic; marginal improvement in many practical settings; warmup still sometimes needed; less widely adopted than AdamW. |
| | **Yogi** | $v_t = -\eta \, \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}$<br>$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$<br>$s_t = s_{t-1} - (1-\beta_2) \cdot \text{sign}(s_{t-1} - g_t \odot g_t) \odot (g_t \odot g_t)$ | "$s_t$ grows cautiously (direction of gap)" | [Zaheer et al. (2018)](https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization) "Adaptive Methods for Nonconvex Optimization" (NeurIPS) | Controls second-moment growth more conservatively than Adam; helps in some nonconvex settings. | Not widely tested at scale; limited adoption; convergence can be slower than Adam; sign-based update adds complexity; limited framework support; few large-scale validations. |
| | **AdaBelief** | $v_t = -\eta \, \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}$<br>$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$<br>$s_t = \beta_2 s_{t-1} + (1-\beta_2)(g_t - m_t) \odot (g_t - m_t)$ | "Variance of (grad − momentum)" | [Zhuang et al. (2020)](https://arxiv.org/abs/2010.07468) "AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients" | Uses surprise $(g_t - m_t)$ to set step sizes; can behave more SGD-like in generalization. | Sensitive to hyperparameters ($\epsilon$ especially); limited large-scale validation (only ResNet-18 on ImageNet in paper); can be unstable; behaves like SGD in low-surprise regions (slower); not widely adopted in production. |
| | **AdaFactor** | $v_t = -\eta \, \frac{g_t}{\sqrt{R_t} + \epsilon}$<br>$r_t = \beta_2 r_{t-1} + (1-\beta_2) \text{mean}_\text{cols}(g_t \odot g_t)$<br>$c_t = \beta_2 c_{t-1} + (1-\beta_2) \text{mean}_\text{rows}(g_t \odot g_t)$<br>$R_t \approx \frac{r_t \otimes c_t}{\text{mean}(r_t)}$ | "Second moment, but low-rank (row×col)" | [Shazeer & Stern (2018)](https://arxiv.org/abs/1804.04235) "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost" | Adam-like with sublinear memory by factorizing second-moment accumulators. Used in T5 and large NLP models. | **Slower convergence** than Adam; training instability without warmup; removes momentum (unstable); factorization approximation can hurt; requires update clipping and careful $\beta_2$ scheduling; [warmup strongly recommended](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules). |
| **D. Layer-wise Scaling** | **LARS** | $v_t = -\eta \cdot q_t \cdot d_t$<br>$d_t = g_t + \lambda w_t$<br>$q_t = \frac{\|w_t\|}{\|d_t\| + \epsilon}$ | "Match step norm to weight norm" | [You et al. (2017)](https://arxiv.org/abs/1708.03888) "Large Batch Training of Convolutional Networks" | Stabilizes large-batch SGD by scaling each layer's update to match its weight norm. | **Fails on attention models** (BERT); requires warmup (unstable without); not for normal batch sizes (can harm); [traps in sharp minima early](https://arxiv.org/abs/2309.14053); limited theoretical foundation; standard optimizers can match LARS at large batch with proper tuning. |
| | **LAMB** | $v_t = -\eta \cdot q_t \cdot u_t$<br>$u_t = \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon} + \lambda w_t$<br>$q_t = \frac{\|w_t\|}{\|u_t\| + \epsilon}$ | "Adam direction, LARS scaling" | [You et al. (2019)](https://arxiv.org/abs/1904.00962) "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" | Makes Adam/AdamW behave better at very large batch sizes via layer-wise trust ratio. | **Fails on RNNs**; warmup dependency (lacks theoretical foundation); [contested benefits](https://openreview.net/forum?id=E9e18Ms5TeV), standard Adam/Nesterov can match at large batch; inherits Adam's issues; adds layer-wise complexity; limited benefit at normal batch sizes. |
| **E. Sign-based** | **Lion** | $v_t = -\eta \cdot \text{sign}(\beta_1 m_{t-1} + (1-\beta_1) g_t)$<br>$m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t$ | "Sign step, momentum memory" | [Chen et al. (2023)](https://arxiv.org/abs/2302.06675) "Symbolic Discovery of Optimization Algorithms" | Drops second-moment state (memory savings). Uses signed updates; often competitive in large-scale training. | **Sign discreteness** can cause non-convergence in some models; requires smaller LR + larger weight decay vs Adam; [limited benefit at small batch (<64)](https://arxiv.org/abs/2302.06675); theoretically uncertain (doesn't fit existing optimizer categories); leads to larger weight norms; [fails on some tasks](https://github.com/lucidrains/lion-pytorch/discussions/1) (speech, dense prediction). |
| **F. Modern Variants** | **Adan** | $v_t = -\eta \, \frac{p_t}{\sqrt{n_t} + \epsilon} - \eta \lambda w_t$<br>$d_t = g_t - g_{t-1}$<br>$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$<br>$u_t = \beta_2 u_{t-1} + (1-\beta_2) d_t$<br>$n_t = \beta_3 n_{t-1} + (1-\beta_3)(g_t + (1-\beta_2)d_t)^2$<br>$p_t = m_t + (1-\beta_2) u_t$ | "Adam + (grad change) + Nesterov" | [Xie et al. (2022)](https://arxiv.org/abs/2208.06677) "Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models" | Nesterov-style momentum + gradient-difference tracking. Faster convergence; strong on transformers. | **Higher memory** (3 moment buffers + prev gradient); [sensitive to $\beta_3$ tuning](https://github.com/sail-sg/Adan); limited large-scale studies; requires ZeRO for memory parity with Adam; stores previous gradient (extra memory); Adam initially faster (Adan catches up later). |
| | **Sophia** | $v_t = -\eta \cdot \text{clip}\left(\frac{m_t}{h_t + \epsilon}, \rho\right) - \eta \lambda w_t$<br>$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$<br>$h_t = \beta_2 h_{t-1} + (1-\beta_2) \hat{h}_t$<br>($\hat{h}_t$ estimates $\text{diag}(\text{Hessian})$) | "Adam-like, Hessian-aware step size" | [Liu et al. (2023)](https://arxiv.org/abs/2305.14342) "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training" | Uses diagonal Hessian estimate to scale updates. Designed for LLM pretraining scalability. | **Biased Hessian estimate** (Gauss-Newton approximation); [slow to adapt when variance changes](https://arxiv.org/abs/2305.14342); Hessian computation overhead (~5% wall-clock); constrained to light-weight Hessian estimators (no K-FAC); clipping hyperparameter $\rho$ adds tuning burden; per-example gradient approximation. |
| **G. Matrix Preconditioning** | **Shampoo** | $V_t = -\eta \, L_t^{-1/4} \cdot G_t \cdot R_t^{-1/4}$<br>$L_t = L_{t-1} + G_t G_t^\top$<br>$R_t = R_{t-1} + G_t^\top G_t$ | "Washes gradients with matrix inverse powers" | [Gupta et al. (2018)](https://arxiv.org/abs/1802.09568) "Shampoo: Preconditioned Stochastic Tensor Optimization" | Full-matrix preconditioner (Kronecker-factor). Captures cross-parameter structure; heavier compute/memory. | **High memory** $O(m^2 + n^2)$; expensive eigendecomposition/inverse-root; many hyperparameters; [compute overhead grows over time](https://openreview.net/forum?id=ASqdVeifn7); requires distributed implementation at scale; eigendecomp must be done periodically (costly); preconditioner update frequency tuning. |
| **H. Muon Family** | **Muon** | $V_t = -\eta \, O_t$<br>$M_t = \beta_1 M_{t-1} + \nabla L(W_{t-1})$<br>$O_t = \text{NS}_5(M_t)$<br>($\text{NS}_5$ = Newton-Schulz orthogonalization) | "Momentum, then orthogonalize" | [Jordan et al. (2024)](https://kellerjordan.github.io/posts/muon/) "Muon: An optimizer for hidden layers in neural networks" (blog/code) | Orthogonalizes momentum matrix via Newton-Schulz. Applied to 2D weight matrices in transformers. | **Only for 2D params** (needs fallback optimizer for 1D/embeddings); [high communication cost with tensor parallelism](https://huggingface.co/blog/onekq/muon-optimizer); NS iteration introduces noisy singular values; coupled Newton requires float32 (slow on modern GPUs); hyperparameter transfer not fully explored; limited to matrix weights. |
| | **Moonlight** | $V_t = -\eta(0.2 \cdot O_t \cdot \sqrt{\max(m,n)} + \lambda X_t)$<br>$M_t = \beta_1 U_t + \nabla f(X_t, \xi_t)$<br>$O_t = \text{NewtonSchulz}(M_t)$ | "Muon + right RMS scale + decay" | [MoonLight Team (2025)](https://arxiv.org/abs/2501.09483) "Muon is Scalable for LLM Training" | Practical scaling (weight decay + update-scale matching) to make Muon work at larger LLM scale. | Inherits Muon's 2D-only limitation; additional scaling complexity; very recent (limited validation); theoretical understanding still developing; requires careful update-scale tuning; tensor parallelism communication overhead remains. |
| | **NorMuon** | $V_t = -\eta \lambda W_t - \hat{\eta} \cdot \bar{O}_t$<br>$M_t = \beta_1 M_{t-1} + (1-\beta_1) G_t$<br>$O_t = \text{NS}_5(M_t)$<br>$r_t = \beta_2 r_{t-1} + (1-\beta_2) \text{mean}_\text{cols}(O_t \odot O_t)$<br>$\bar{O}_t = \frac{O_t}{\sqrt{\text{ExpandRows}(r_t)} + \epsilon}$ | "Muon + per-row RMS equalizer" | [Li et al. (2025)](https://arxiv.org/abs/2503.07067) "NorMuon: Making Muon More Efficient And Scalable" | Row-wise adaptive normalization after orthogonalization. Fixes non-uniform per-neuron update norms. | Inherits Muon's 2D-only limitation; adds row-wise normalization overhead (extra state + compute); very recent (limited adoption); requires additional $\beta_2$ tuning; limited theoretical analysis. |
| | **REG** | $V_k = -\alpha \cdot (\hat{M}_{k+1} + \lambda W_k)$<br>$M_{k+1} = \beta_1 M_k + (1-\beta_1) \nabla f(W_k)$<br>$\tilde{M}_{k+1} = \text{RACS}(M_{k+1}; p)$<br>$\hat{M}_{k+1} = \tilde{M}_{k+1} \cdot \frac{\rho_\text{target}}{\text{RMS}(\tilde{M}_{k+1})}$ | "Equilibrate matrix (rows+cols), then step" | [Liu et al. (2025)](https://arxiv.org/abs/2502.06615) "REG: A Simple and Effective Optimizer for Large Model Training" | Row-and-Column-Scaling (RACS) regularizes update matrix gently. Compatible with AdamW-style training. | Very new (limited validation at scale); adds RACS overhead; limited theoretical analysis; $\rho_\text{target}$ and RACS iterations add hyperparameters; not yet widely adopted or benchmarked. |
| | **MARS-M** | $V_t = -\eta(0.2 \cdot O_t \cdot \sqrt{\max(m,n)} + \lambda X_t)$<br>$C_t = \nabla f(X_t) + \gamma_t \frac{\beta_1}{1-\beta_1}[\nabla f(X_t) - \nabla f(X_{t-1})]$<br>$M_t = \beta_1 M_{t-1} + (1-\beta_1) \text{Clip}(C_t, 1)$<br>$O_t = \text{NewtonSchulz}(M_t)$ | "Corrected gradient + Muon orthogonal step" | [Liu, Yuan, Gu (2025)](https://arxiv.org/abs/2503.03699) "MARS-M: Training Large Language Models via Matrix Smoothing" | Combines Muon/Moonlight with MARS/STORM-like variance reduction for faster convergence. | Very new (limited adoption); combines multiple methods (high complexity); requires storing previous gradient (extra memory); inherits Muon's 2D-only limitation; clipping and $\gamma_t$ schedule add tuning burden; limited benchmarks outside original paper. |
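
To make the shared form $w_{t+1} = w_t + v_t$ concrete, here is a minimal sketch of three rows of the table (SGD, SGD + Momentum, Adam) written as functions that return the step $v_t$. It is scalar-only and runs on a toy loss $L(w) = \frac{1}{2} w^2$. The function names, the `state` dict, and the hyperparameter values are my own choices for illustration, not any library's API.

```python
import math

def sgd_step(g, state, lr=0.1):
    """SGD row: v_t = -lr * g_t (no internal state)."""
    return -lr * g

def momentum_step(g, state, lr=0.1, beta1=0.9):
    """SGD + Momentum row: m_t = beta1 * m_{t-1} + g_t, v_t = -lr * m_t."""
    state["m"] = beta1 * state.get("m", 0.0) + g
    return -lr * state["m"]

def adam_step(g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam row: EWMA of g and g^2, bias-corrected, then v_t = -lr * m_hat / (sqrt(s_hat) + eps)."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["s"] = beta2 * state.get("s", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)
    s_hat = state["s"] / (1 - beta2 ** t)
    return -lr * m_hat / (math.sqrt(s_hat) + eps)

def minimize(step_fn, w0=5.0, steps=200):
    """Run w_{t+1} = w_t + v_t on the toy loss L(w) = 0.5 * w**2, whose gradient is w."""
    w, state = w0, {}
    for _ in range(steps):
        g = w                      # exact gradient of the toy loss (no mini-batch noise here)
        w = w + step_fn(g, state)  # every optimizer only differs in how it builds v_t
    return w

for fn in (sgd_step, momentum_step, adam_step):
    print(fn.__name__, minimize(fn))
```

The point of the structure is that the outer loop never changes; each optimizer is just a different recipe for $v_t$ plus whatever state it carries between steps.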

## Weight Decay Notes

- **L2 regularization (coupled):** add $\lambda w_t$ into gradient: $g_t \leftarrow g_t + \lambda w_t$
- **Decoupled weight decay (LLM standard):** apply shrinkage separately: $v_t \leftarrow v_t - \eta \lambda w_t$ (AdamW uses this)
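
A small sketch of why the distinction matters for adaptive optimizers. Assume a scalar weight, an Adam-style direction, and a loss gradient of exactly zero, so the only thing acting on the weight is the decay term. The helper names (`adam_direction`, `step_coupled_l2`, `step_decoupled`) are hypothetical and exist only for this illustration.

```python
import math

def adam_direction(g, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(s_hat) + eps) for a scalar gradient."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    state["s"] = beta2 * state.get("s", 0.0) + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** t)
    s_hat = state["s"] / (1 - beta2 ** t)
    return m_hat / (math.sqrt(s_hat) + eps)

def step_coupled_l2(w, g, state, lr, lam):
    """Coupled L2: fold lam * w into the gradient, so it gets rescaled by the second moment."""
    g = g + lam * w
    return w - lr * adam_direction(g, state)

def step_decoupled(w, g, state, lr, lam):
    """Decoupled (AdamW-style): adaptive step on the raw gradient, then a separate shrink."""
    return w - lr * adam_direction(g, state) - lr * lam * w

w0, lr = 2.0, 0.01
for lam in (0.01, 0.1):
    shrink_coupled = w0 - step_coupled_l2(w0, 0.0, {}, lr, lam)
    shrink_decoupled = w0 - step_decoupled(w0, 0.0, {}, lr, lam)
    print(f"lam={lam}: coupled shrink={shrink_coupled:.6f}, decoupled shrink={shrink_decoupled:.6f}")
```

In this toy setup the coupled shrink on the first step is roughly `lr` for both values of $\lambda$, because the adaptive normalization divides the decay term by its own magnitude. The decoupled shrink is $\eta \lambda w$ and scales with $\lambda$ as intended.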

## Implementation Variants (same math, different storage)

- **Fused optimizers**: same updates, kernel-fused for speed
- **8-bit / quantized states**: same formulas, compressed state tensors for memory savings
- **Paged optimizers**: same math, paging optimizer state CPU↔GPU for fine-tuning

# Things to know

## Approaches

In a textbook (deterministic, smooth) setting, the gold-standard local step is Newton's method: fit a quadratic around $w_t$ and jump to its minimizer:

$$\Delta w_t = -H_t^{-1} g_t$$

For a true quadratic, this lands at the optimum in one step. More generally, you can read it as a *curvature-aware preconditioner*: it shrinks steps along high-curvature directions and stretches them along flat directions.
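
Here is a tiny sketch of that claim on a deliberately ill-conditioned 2D quadratic $L(w) = \frac{1}{2} w^\top A w - b^\top w$, whose gradient is $Aw - b$ and whose Hessian is $A$. The matrix, the starting point, and the helper `solve_2x2` are assumptions chosen for illustration.

```python
def solve_2x2(A, r):
    """Solve A x = r for a 2x2 matrix A using the explicit inverse formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]

A = [[10.0, 0.0], [0.0, 0.1]]   # curvature 10 along one axis, 0.1 along the other
b = [1.0, 1.0]                  # the optimum is A^{-1} b = [0.1, 10.0]

def grad(w):
    """Gradient of L(w) = 0.5 * w^T A w - b^T w, i.e. A w - b."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

w = [3.0, -4.0]
g = grad(w)

# Newton step: delta = H^{-1} g with H = A, so a single step lands on the optimum.
delta = solve_2x2(A, g)
w_newton = [w[0] - delta[0], w[1] - delta[1]]

# Plain gradient step with lr = 1 / (largest curvature): fixes the steep axis, barely moves the flat one.
lr = 0.1
w_gd = [w[0] - lr * g[0], w[1] - lr * g[1]]

print("newton:", w_newton)
print("gd:    ", w_gd)
```

The gradient step is already as large as the steep direction allows, yet it barely moves along the flat direction. Newton's step rescales each direction by its own curvature, which is the "preconditioner" reading of $H_t^{-1}$.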

Why we don't just do this for modern nets: the Hessian is $N \times N$ for $N$ parameters (roughly $O(N^2)$ storage and $O(N^3)$ factorization), it's often indefinite in non-convex problems (so you need damping/trust regions), and with mini-batches both $g_t$ and any curvature estimate are noisy. There's also a more fundamental issue: deep learning objectives are inherently stochastic. Beyond mini-batch sampling, techniques like dropout introduce additional noise. Meanwhile, backprop computes the full gradient for a small constant multiple of the cost of evaluating the function itself, which makes first-order steps cheap in a way that exact second-order steps can't match. In high-dimensional parameter spaces with noisy objectives, full second-order methods are a poor fit. So in practice we keep the cheap first-order gradient signal and either **(1) improve first-order dynamics** or **(2) approximate curvature with lightweight preconditioners**.

The cheatsheet families map nicely onto these approaches:

**A. Baselines (pure first-order):** SGD, Momentum, Nesterov. Direction comes from $g_t$; momentum is an EWMA that reduces noise and accelerates consistent directions.

**B. Adaptive (diagonal):** AdaGrad / RMSProp / AdaDelta. Running second moments of gradients act like a diagonal preconditioner (per-parameter step normalization).

**C. Adam family (diagonal preconditioning + momentum):** Adam / AdamW / RAdam / AdaBelief / AdaFactor, etc. Still first-order, but with (i) momentum for direction and (ii) RMS scaling for per-parameter step sizing. The second-moment term is best thought of as a curvature proxy (closer to a diagonal Fisher/Gauss-Newton statistic than the true Hessian), but the practical effect is Newton-like: damp consistently large/noisy coordinates and let small ones move.

**D. Layer-wise scaling:** LARS / LAMB. Adds a trust ratio per layer so update norms track weight norms. Useful in large-batch regimes where inter-layer scale mismatch becomes the main pathology.

**E. Sign-based:** Lion. Keeps the direction (sign) of a momentum-like term but discards magnitudes, trading some fidelity for simplicity and reduced state.

**F. Modern variants:** Adan mostly improves first-order dynamics (it uses gradient differences as an acceleration/variation signal), while Sophia explicitly estimates a diagonal Hessian term (via a Hutchinson-style or Gauss-Newton-Bartlett estimator), making it more second-order than Adam-like methods.

**G. Matrix preconditioning:** Shampoo uses structured (Kronecker-factored) second-moment matrices to capture cross-parameter correlations inside a weight matrix. More expensive than diagonal methods but closer in spirit to Newton-style preconditioning.

**H. Muon family (matrix geometry constraints):** Muon / Moonlight / NorMuon / REG / MARS-M orthogonalize or equilibrate momentum-like updates for 2D tensors. This is less about estimating curvature directly and more about keeping the update geometry well-conditioned for transformer weight matrices.

Finally, two orthogonal axes show up in practice: **weight decay** (coupled L2 vs decoupled AdamW-style) and **implementation variants** (fused kernels, 8-bit optimizer state, paged state) that keep the math the same but change speed/memory.

## Weight Update Formula

Let's review the most basic weight update formula, the one used by vanilla SGD. This discussion is a bit off topic, but I went down this rabbit hole for a while, so I'm noting it here. I'm only writing down the high-level points; a deeper analysis leads to the question of what it means to stabilize the weight update equation, and by corollary training itself, which is a larger question that won't be analyzed here. The idea is just to build intuition.

Here is the equation:

$$W_{\text{new}} = W_{\text{old}} - lr \cdot L'(W_{\text{old}})$$

There is something about this equation that feels a bit wrong. 

For example, at the most fundamental level, are the units of all the terms the same? Thinking from a physics perspective, if we strictly assign physical units, $W_{\text{old}}$ is a position (e.g., "meters") and the gradient $L'(W_{\text{old}})$ is a slope ("energy per meter"). Subtracting a slope from a position is just physically impossible. This reveals that the gradient is merely a directional signal (a force) indicating urgency, not a spatial displacement. The equation requires a conversion factor to translate "steepness of the hill" into "distance to step."

Perhaps the resolution to this paradox lies in the Taylor Series expansion of the loss function. The "perfect" update rule, which accounts for the landscape's geometry, is:

$$W_{\text{new}} = W_{\text{old}} - lr \cdot \frac{L'(W_{\text{old}})}{L''(W_{\text{old}})}$$

Here, the numerator (gradient, $L'$) provides the push, while the denominator (curvature/Hessian, $L''$) provides the braking mechanism. Dimensionally, this balances: dividing the units of the gradient ($\text{energy}/\text{meters}$) by the units of curvature ($\text{energy}/\text{meters}^2$) cancels the "energy" and leaves purely "meters." So it can be argued that a valid update step requires two distinct derivatives: one to determine direction ($L'$) and one to determine scale ($L''$).

Of course, calculating the Hessian ($L''$) is computationally intractable for networks with millions of parameters. Therefore, the learning rate $lr$ acts as a constant scalar proxy for the inverse curvature ($1/L''$). When we set a learning rate, we are effectively guessing the geometry of the error surface. A high $lr$ assumes low curvature (a wide, flat valley where big steps are safe), while a low $lr$ assumes high curvature (a sharp, narrow ravine where precision is required). Thus, $lr$, in addition to being a speed setting, is also a unit-restoring term that bridges the gap between the "force" of the gradient and the "displacement" of the weight update.
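
A quick way to feel this is a 1D quadratic $L(w) = \frac{1}{2} a w^2$, where the curvature is exactly $L'' = a$ and the gradient is $a w$. The sketch below (the function name `gd` and the specific numbers are assumptions for illustration) runs plain gradient descent with three learning rates relative to $1/a$.

```python
def gd(a, lr, w0=1.0, steps=20):
    """Plain gradient descent on L(w) = 0.5 * a * w**2, whose gradient is a * w."""
    w = w0
    for _ in range(steps):
        w = w - lr * a * w   # each step multiplies w by (1 - lr * a)
    return w

a = 4.0                               # true curvature, so the "ideal" lr is 1 / a = 0.25
print(gd(a, lr=0.25, steps=1))        # lr = 1/a: lands on the minimum in one step
print(gd(a, lr=0.10))                 # lr < 1/a: guessed too much curvature, converges slowly
print(gd(a, lr=0.60))                 # lr > 2/a: guessed too little curvature, diverges
```

Each step scales $w$ by $(1 - lr \cdot a)$, so the iteration is stable only when $lr < 2/a$ and is exact when $lr = 1/a$. In other words, a good learning rate is a guess at the inverse curvature.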

Modern optimization strategies seem to implicitly acknowledge this relationship. Learning rate schedulers (decaying $lr$ over time) mimic the assumption that the loss landscape transitions from a broad basin (low curvature) to a narrow minimum (high curvature) as training progresses. More advanced optimizers like Adam take this a step further by dividing the gradient by the square root of a rolling average of squared gradients. This term ($\sqrt{s_t}$ in the cheat sheet's notation) serves as a computationally cheap proxy for the local curvature, attempting to replicate the unit-correcting behavior of Newton's method by normalizing the step size for each parameter individually.

## Exponentially Weighted Moving Average

EWMA (exponentially weighted moving average) is the "smooth a noisy stream without forgetting the present" trick. You take a time series (daily temperature, stock price, whatever), and instead of a dumb uniform mean that treats last week and last year the same, you keep a running average that leans harder on recent values while older values fade out exponentially. On a plot, it's the clean curve that hugs the data enough to track the trend but not enough to chase every random zig-zag. This exact "trend extraction" idea shows up all over: time series / financial forecasting, signal processing (it's basically a first-order low-pass filter), and in deep learning it's the core primitive behind momentum-style optimizers.

The entire method is one recurrence:

$$v_t = \beta v_{t-1} + (1 - \beta) x_t$$

Here $x_t$ is the value at time $t$ (temperature, gradient, whatever), and $v_t$ is the EWMA at time $t$. $\beta \in [0, 1)$ controls memory. If $\beta$ is large, you heavily trust the past state $v_{t-1}$, so the curve is stable and smooth. If $\beta$ is small, you aggressively trust the current observation $x_t$, so the curve becomes twitchy and tracks the data closely. Two practical notes. First, you need some $v_0$ to start the recurrence; people often set $v_0 = 0$, or $v_0 = x_0$ (or some reasonable constant). Second, initialization matters early on because the filter "warms up" from that starting point: with $v_0 = 0$ the first few values are biased toward zero, which is what the bias correction in the notation section compensates for.

A tiny numerical walk-through makes it real. Say $\beta = 0.9$, $v_0 = 0$, and your first observation is $x_1 = 30$. Then $v_1 = 0.9 \cdot 0 + 0.1 \cdot 30 = 3$. Next if $x_2 = 17$, then $v_2 = 0.9 \cdot 3 + 0.1 \cdot 17 = 4.4$. And so on. Every step is "keep most of the previous smoothed value, mix in a little bit of the new measurement." Plotting $v_t$ against time gives you a trend line that reacts gradually instead of instantly.
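
The same walk-through as a loop, as a minimal sketch. The function names are my own; `bias_corrected` applies the $1/(1-\beta^t)$ correction from the notation section and is included only to show how much the $v_0 = 0$ start drags the early values down.

```python
def ewma(xs, beta, v0=0.0):
    """Return the EWMA sequence v_1..v_t for v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v, out = v0, []
    for x in xs:
        v = beta * v + (1 - beta) * x   # the one-line state update
        out.append(v)
    return out

def bias_corrected(vs, beta):
    """Divide v_t by (1 - beta**t) to undo the pull toward a zero initialization."""
    return [v / (1 - beta ** t) for t, v in enumerate(vs, start=1)]

xs = [30, 17]
vs = ewma(xs, beta=0.9)
print(vs)                        # approximately [3.0, 4.4], matching the hand calculation
print(bias_corrected(vs, 0.9))   # approximately [30.0, 23.16], much closer to the data scale
```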

There's a really useful intuition for what $\beta$ means: EWMA behaves kind of like an average over the last

$$\frac{1}{1 - \beta}$$

points (an "effective window length"). If $\beta = 0.9$, that's about $1/0.1 = 10$ steps of memory. If $\beta = 0.5$, that's $1/0.5 = 2$ steps, meaning you're basically averaging only the last couple points and your estimate will whip around. Same data, different $\beta$, wildly different behavior: high $\beta$ gives you a slow, stable, low-variance curve; low $\beta$ gives you a moody curve that updates its "belief" based almost entirely on what just happened.

If you want the "why does it weight recent points more?" proof, just expand the recurrence a few steps. Substitute $v_{t-1}$ into $v_t$, then substitute again, and you'll see the pattern:

$$v_t = (1 - \beta)(x_t + \beta x_{t-1} + \beta^2 x_{t-2} + \cdots + \beta^{t-1} x_1) + \beta^t v_0$$

So the contribution of $x_{t-k}$ is $(1 - \beta)\beta^k$. Since $\beta < 1$, those weights decay exponentially as you go back in time. That is literally the mechanism behind the two key properties: (1) newer points matter more, (2) any fixed old point's influence shrinks as time goes on.
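
If you'd rather check the expansion numerically than trust the algebra, here is a small sketch. It compares the recurrence against the closed form and also measures how much of the total weight sits on the most recent $1/(1-\beta)$ points. The function names and the sample data are assumptions for illustration.

```python
def ewma_last(xs, beta, v0=0.0):
    """Final EWMA value computed with the recurrence."""
    v = v0
    for x in xs:
        v = beta * v + (1 - beta) * x
    return v

def ewma_closed_form(xs, beta, v0=0.0):
    """Same value from the expansion: sum_k (1 - beta) * beta**k * x_{t-k} + beta**t * v_0."""
    t = len(xs)
    weighted = sum((1 - beta) * beta ** k * xs[t - 1 - k] for k in range(t))  # xs[t-1-k] is x_{t-k}
    return weighted + beta ** t * v0

def recent_mass(beta, n):
    """Total weight on the n most recent points; algebraically equal to 1 - beta**n."""
    return sum((1 - beta) * beta ** k for k in range(n))

xs = [30, 17, 22, 25, 19]
print(ewma_last(xs, 0.9), ewma_closed_form(xs, 0.9))   # the two agree up to float rounding
print(recent_mass(0.9, 10))                            # about 0.65 of the weight is on the last 10 points
```

So the "effective window" of $1/(1-\beta)$ points is a rule of thumb: for $\beta = 0.9$ the last 10 points carry roughly 65% of the weight, and the rest is a long, thin tail.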

Now connect this to deep learning optimizers: replace "temperature" with "gradient". Gradients are noisy minibatch estimates, so you don't want your update direction to flip around just because one batch was weird. Momentum and friends maintain an EWMA of gradients (or squared gradients), which is just this same filter applied to a different signal. High $\beta$ means "trust the long-term direction, smooth aggressively," low $\beta$ means "react quickly to new gradient information." The sweet spot is task-dependent, but in practice you see $\beta \approx 0.9$ all the time because it gives a nice stability/response tradeoff.

Implementation-wise, Python makes this boring (in a good way). In pandas you can use `ewm` to compute the exponential moving average over a column. Pandas often parameterizes the recurrence with alpha instead of $\beta$, where

$$\alpha = 1 - \beta$$

Same thing, just a different knob. So if someone says "alpha = 0.1", that corresponds to $\beta = 0.9$ (slow/stable). If they say "alpha = 0.9", that corresponds to $\beta = 0.1$ (fast/moody). One catch: by default `ewm` uses `adjust=True`, which divides by the sum of the weights (in effect, a bias-corrected EWMA with $v_0 = 0$); pass `adjust=False` to get the plain recurrence above, seeded with $v_0 = x_0$. Typical workflow: load the time series (date, value), compute `ema = df['value'].ewm(alpha=alpha, adjust=False).mean()`, attach it as a new column, and plot the raw series plus the EMA with matplotlib. The best exercise is to vary $\alpha$ / $\beta$ and actually feel how the curve transitions from smooth trend extractor to near-copy of the raw data, and then write the recurrence yourself in a loop once, so you internalize that it's just one line of state update repeated forever.

# SGD with momentum

SGD with momentum lives inside one big picture: you're not "optimizing weights," you're navigating a loss landscape. You feed input → get prediction $\hat{y}$ → compare to target $y$ with some loss like mean squared error

$$L(y, \hat{y}) = (y - \hat{y})^2$$

and because $\hat{y}$ depends on parameters $\theta$ (weights and biases), your loss is ultimately $L(\theta)$. In toy land, if $\theta$ is a single weight $w$, you can plot $L(w)$ as a 2D curve. If $\theta = (w, b)$, you can plot $L(w, b)$ as a 3D surface. In real nets $\theta \in \mathbb{R}^N$ with $N$ in the millions, so that surface exists but your human brain can't "look at it," so we constantly fall back to 1D/2D slices and their shadows.

That's where contour plots are secretly doing a ton of work. Take a 3D surface $L(w, b)$, look at it from the top, and draw curves where the loss is constant, those are level sets. That's the contour plot: a 2D projection of a 3D surface. You lost one dimension (height), so you encode it with color. The geometry is the key: where contour lines are close together, the slope magnitude is large (steep region). Where the lines are far apart, the surface is flat-ish (small gradient over a big region). Saddle points show up as that weird "up in one direction, down in the other" structure, often with large flat-ish neighborhoods where gradients are tiny and progress crawls. Local minima show up as "basins" that look like closed loops around a dip. You can mentally reverse-engineer the 3D surface from the 2D contours: tight rings = steep walls, spaced rings = gentle bowl, twisted pattern = saddle.

Now convex vs non-convex: convex is the friendly universe where there's one basin and every downhill path leads to the same global minimum. Non-convex is the deep learning universe: multiple basins, flats, ravines, saddles, weird curvature. The "why is optimization hard?" list in practice is brutally simple:

*Local minima:* you can fall into a small dip and stop improving, even though there's a deeper basin elsewhere (global minimum).

*Saddle points / plateaus:* gradients are tiny across a large region, so you move at a glacial pace (even if you're not "stuck" in a strict minimum).

*High curvature ravines:* one direction is steep (large curvature), another direction is shallow (small curvature), so vanilla updates bounce side-to-side while making slow forward progress.

Plus the "SGD reality tax": noisy gradients (because minibatches are stochastic estimates), and inconsistent gradients (directions can vary across steps due to noise + curvature + minibatch sampling).

Before momentum, the baseline update is: "step opposite the gradient." For parameters $\theta$,

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

where $\alpha$ is the learning rate. If you compute $\nabla L$ over the full dataset, that's batch gradient descent: smooth, stable, often slow per step. If you compute it using one example at a time, that's stochastic gradient descent (SGD): cheap per step, but the path is jagged because the gradient estimate is noisy. If you compute it over a small batch, that's mini-batch GD (what people usually mean in deep learning): a practical middle ground, faster iteration than batch, less chaos than pure SGD.
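The three flavors differ only in which examples feed the gradient. A small sketch on a one-weight model $\hat{y} = wx$ makes that visible. The data and the `grad_mse` helper are hypothetical, not taken from any library.

```python
import random


def grad_mse(w, batch):
    """Gradient of mean((y - w*x)**2) with respect to w over the given examples."""
    return sum(-2.0 * (y - w * x) * x for x, y in batch) / len(batch)


random.seed(0)
# Hypothetical data: y = 3x plus a little noise.
data = [(x, 3.0 * x + random.gauss(0.0, 0.5)) for x in [i / 10 for i in range(1, 101)]]

w = 0.0
full_grad = grad_mse(w, data)                     # batch GD: every example
single_grad = grad_mse(w, [random.choice(data)])  # SGD: one example
mini_grad = grad_mse(w, random.sample(data, 16))  # mini-batch: a small random subset
```

Averaging the single-example gradients over the whole dataset recovers `full_grad` exactly, which is why the cheap estimates are noisy but not biased.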

So where does momentum enter? Momentum is basically: "don't treat each gradient like it's a brand-new opinion; treat gradients like noisy measurements of a trend." If the last several gradients have been pointing roughly the same way, you should build confidence and move faster in that direction. If gradients disagree (noise, oscillations across a ravine), you should damp the indecision.

There are two mental models that land well:

*Crowd directions model:* you're trying to reach point B in a city. You ask one person and they point east, maybe, maybe not. You ask four people and they all point east, now you commit and walk faster. If the crowd is split (two say east, two say west), you still move, but cautiously. Momentum is the "ask multiple past gradients, form a consensus direction."

*Physics model:* you're a ball rolling down a landscape. A ball doesn't teleport to "the steepest direction" and forget its previous motion every millisecond. It has velocity; it carries inertia. Momentum optimization explicitly introduces a velocity-like state. In Newtonian terms, momentum is $p = mv$. We don't really have a meaningful mass here, so you can pretend $m = 1$ and focus on "velocity."

Mathematically, SGD with momentum keeps a velocity vector $v_t$ (same shape as $\theta$). Replace the raw gradient step with an exponentially weighted moving average of past gradients (EMA), and use that to update parameters:

$$v_t = \beta v_{t-1} + \alpha g_t$$
$$\theta_{t+1} = \theta_t - v_t$$

where $g_t = \nabla_\theta L(\theta_t)$ (usually computed on a mini-batch), $\alpha$ is learning rate, and $\beta \in [0, 1)$ is the momentum coefficient (commonly $\beta = 0.9$).
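As a rough sketch, the two equations translate almost line for line into code. The toy loss $L(\theta) = \sum_i \theta_i^2$ and the names `momentum_step` and `grad_bowl` are assumptions made for this example.

```python
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum update on plain Python lists.

    v_t         = beta * v_{t-1} + lr * g_t
    theta_{t+1} = theta_t - v_t
    """
    v_new = [beta * vi + lr * gi for vi, gi in zip(v, grad)]
    theta_new = [ti - vi for ti, vi in zip(theta, v_new)]
    return theta_new, v_new


def grad_bowl(theta):
    """Gradient of the toy loss L(theta) = sum(theta_i ** 2)."""
    return [2.0 * ti for ti in theta]


theta = [1.0, -2.0]
v = [0.0, 0.0]  # velocity starts at rest
history = [theta]
for _ in range(2):
    theta, v = momentum_step(theta, v, grad_bowl(theta))
    history.append(theta)
```

The first step is identical to vanilla SGD because the velocity starts at zero. From the second step on, the leftover velocity adds to the fresh gradient push.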

Two important notes:

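This is the exponential moving average idea applied to gradients (or "update directions"), up to scaling: there is no $(1 - \beta)$ factor on $g_t$, so $v_t$ is an exponentially weighted *sum* rather than a normalized average. Under a steady gradient $g$ it settles at about $\alpha g / (1 - \beta)$, not $\alpha g$.

If you unroll the recurrence (starting from $v_0 = 0$), you see the exponential weighting explicitly: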

$$v_t = \alpha (g_t + \beta g_{t-1} + \beta^2 g_{t-2} + \cdots)$$

So the current velocity is a weighted sum of recent gradients, with weights decaying exponentially into the past. Recent gradients matter more; ancient gradients fade out.

The parameter $\beta$ is the "memory / decay factor." It sets how long the optimizer remembers the past. A useful rule of thumb: the effective averaging window length is about

$$\text{window} \approx \frac{1}{1 - \beta}$$

So $\beta = 0.9$ remembers roughly $\sim 10$ steps, $\beta = 0.99$ remembers $\sim 100$ steps, etc. Bigger $\beta$ = smoother, more inertia, more "commitment."
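A quick numeric sketch of both claims, assuming $v_0 = 0$ and a made-up gradient sequence: the recurrence equals the unrolled weighted sum, and the most recent $1/(1-\beta)$ gradients carry about two thirds of the total weight.

```python
def velocity(grads, lr, beta):
    """Run v_t = beta * v_{t-1} + lr * g_t from v_0 = 0 and return the final v."""
    v = 0.0
    for g in grads:
        v = beta * v + lr * g
    return v


def velocity_unrolled(grads, lr, beta):
    """Same quantity as an explicit sum: lr * sum_k beta**k * g_{t-k}."""
    t = len(grads)
    return lr * sum(beta**k * grads[t - 1 - k] for k in range(t))


lr, beta = 0.1, 0.9
grads = [0.5, -1.0, 2.0, 1.5, -0.3]  # hypothetical 1-D gradients

# With a constant gradient of 1, the velocity settles at lr / (1 - beta).
steady = velocity([1.0] * 500, lr, beta)

# Share of the total weight held by the most recent 1 / (1 - beta) gradients.
window = round(1 / (1 - beta))
recent_share = sum(beta**k for k in range(window)) * (1 - beta)
```

So "remembers about 10 steps" is a rule of thumb, not a hard cutoff: for $\beta = 0.9$ the last 10 gradients hold $1 - 0.9^{10} \approx 65\%$ of the weight, and the rest is a long, thin tail.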

Edge cases explain the whole design:

*If $\beta = 0$:*

$$v_t = \alpha g_t, \quad \theta_{t+1} = \theta_t - \alpha g_t$$

That's just vanilla SGD. Momentum collapses to "no momentum."

*If $\beta \to 1$:* you stop forgetting. Velocity becomes a long-running accumulator. This can create persistent oscillations / a kind of dynamic equilibrium where you don't settle nicely, because the system carries too much inertia and not enough damping.

Now the core behavioral payoff. Picture the classic deep learning ravine: steep curvature in one direction (say vertical), shallow slope in the other (say horizontal). Vanilla SGD tends to bounce: it overshoots across the steep direction, flips gradient sign, overshoots back, repeat. You get a zig-zag trajectory that wastes steps oscillating "up and down" while inching forward along the shallow direction. Momentum acts like a low-pass filter: the oscillatory component cancels out over time (because it keeps switching sign), while the consistent component along the shallow direction accumulates. Net effect: less vertical oscillation, more horizontal progress, faster convergence.
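To see the ravine effect in numbers, here is a small sketch on a hypothetical quadratic $L(x, y) = \frac{1}{2}(x^2 + 100y^2)$, where $y$ is the steep direction and $x$ the shallow one. The curvature, learning rate, and starting point are all assumptions chosen to make plain GD zig-zag. Other settings will give different numbers.

```python
def loss(x, y, steep=100.0):
    """Elongated bowl: shallow along x, steep along y."""
    return 0.5 * (x * x + steep * y * y)


def grad(x, y, steep=100.0):
    return x, steep * y


def run(beta, lr=0.019, steps=200, start=(10.0, 1.0)):
    """Momentum updates on the bowl; beta=0.0 is plain gradient descent."""
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + lr * gx
        vy = beta * vy + lr * gy
        x, y = x - vx, y - vy
    return x, y


x_gd, y_gd = run(beta=0.0)
x_mom, y_mom = run(beta=0.9)
loss_gd = loss(x_gd, y_gd)
loss_mom = loss(x_mom, y_mom)
```

With these settings plain GD flips sign in $y$ on every step while $x$ shrinks by only about 2% per step, so after 200 steps it is still well short of the minimum along the shallow axis. The momentum run ends much closer.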

That same mechanism also helps with:

*Noisy gradients (especially small batch sizes):* the noise is high-frequency randomness; EMA smooths it out.

*Inconsistent gradients:* if the direction is unstable from step to step, momentum refuses to fully commit; it averages them and produces a more stable update direction.

*Saddle points / flat regions:* gradients can be tiny for a long time. Momentum can "carry" you through these regions because velocity doesn't instantly drop to zero when the instantaneous gradient is small. You keep moving due to accumulated velocity.

*Local minima (the small annoying ones):* inertia can help you roll out of a shallow basin. If the dip is small and your velocity is high enough, you don't get trapped, you pass through and continue toward a better basin.

And here's the funny twist: momentum's superpower is also its most common failure mode. Because it builds velocity, it often overshoots the optimum and then has to correct. Near the minimum, the true gradient points back toward the basin center, but your velocity may still be blasting forward from earlier steps. So you fly past the bottom, climb the other side, turn around, fly past again… and you get oscillations around the optimum. The exponential decay ($\beta < 1$) acts as damping, so those oscillations usually shrink and you eventually settle, but you can waste time "ringing" around the minimum.
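The ringing is easy to reproduce on a 1-D toy loss $L(\theta) = \frac{1}{2}\theta^2$. This is only a sketch with assumed values (`lr=0.1`, start at $\theta = 1$): it counts how often each path crosses the minimum at zero.

```python
def run_1d(beta, lr=0.1, steps=100, start=1.0):
    """Momentum on L = 0.5 * theta**2; beta=0.0 is plain gradient descent."""
    theta, v = start, 0.0
    path = [theta]
    for _ in range(steps):
        g = theta  # gradient of 0.5 * theta**2
        v = beta * v + lr * g
        theta -= v
        path.append(theta)
    return path


def sign_changes(path):
    """Number of times consecutive points sit on opposite sides of the minimum."""
    return sum(1 for a, b in zip(path, path[1:]) if a * b < 0)


gd_path = run_1d(beta=0.0)
mom_path = run_1d(beta=0.9)
gd_crossings = sign_changes(gd_path)
mom_crossings = sign_changes(mom_path)
```

Plain GD creeps toward zero from one side and never crosses it. The momentum path swings through the minimum repeatedly, with each swing smaller than the last.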

On a contour plot, this looks exactly like what your intuition expects: momentum trajectories cut through the landscape aggressively, often overshooting the basin center and spiraling or oscillating before stabilizing. Plain SGD looks more like a jittery random walk that eventually drifts into the basin, often slower but with less dramatic overshoot. In interactive visualizers (contour plot + click-to-start-point), this contrast is almost comically visible: SGD is the anxious squirrel; momentum is the overconfident skateboarder.

So the headline claims (when someone asks "why use momentum?") are basically three:

*Speed:* it often reaches a good region faster than plain SGD, especially in ravines / high curvature terrain.

*Escaping shallow traps:* it can roll through small local minima or tiny bumps because it has inertia.

*Stability under noise:* it smooths noisy, stochastic gradients by averaging history.

And the main caution label:

*Overshoot + oscillation near optima:* momentum can waste steps bouncing around the minimum before damping out. It's still typically faster than vanilla SGD overall, but this is exactly why later methods try to keep the "fast" while fixing the "ringing."

The cleanest way to remember what's happening is this: SGD reacts to the present; momentum reacts to the recent past plus the present. It's a memory-equipped optimizer. That memory is an exponentially decayed history of gradients. Set $\beta$ too low and you're basically back to SGD. Set it too high and you get a stubborn optimizer that refuses to slow down and can oscillate for longer. In the sweet spot (often $\beta \approx 0.9$), you get the practical win: faster traversal across ugly non-convex terrain where gradients are noisy, curvature is weird, and the loss surface is doing its best impression of a crumpled bedsheet.

If you want one sentence that isn't lying: SGD with momentum replaces "take a step downhill" with "maintain a velocity that's an EMA of downhill directions, then step according to that velocity." That's it. Everything else (faster convergence in ravines, smoothing noise, escaping shallow minima, overshooting and oscillations) is just that sentence playing out in geometry.

# Nesterov Accelerated Gradient

You're training a network. You have parameters $w$ (weights, biases, whatever), a loss $L(w)$, and the entire game is: find $w^*$ that makes $L(w)$ small. Vanilla gradient descent just says: "look at the slope here, step downhill."

$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

with learning rate $\eta$. In deep learning you see the usual trio: batch GD (full dataset gradient), SGD (one sample), mini-batch (the practical default). They all share the same core weakness: they can be slow and jittery, especially in landscapes that look like long ravines / narrow valleys (common with ill-conditioned curvature). Even in a toy convex case like linear regression with MSE, the loss surface is a smooth bowl in $(m, b)$ space: you start at some random $(m, b)$, and you "walk" toward the optimum. Batch GD will get there, but it often takes a bunch of small, cautious steps (think 25–30-ish iterations) because the gradient keeps changing and you're always reacting locally.

Momentum is the first big hack that feels like physics: stop being a goldfish. Keep a running "velocity" that remembers where you've been going.

A common momentum form is:

$$v_t = \beta v_{t-1} + \eta \nabla L(w_t)$$
$$w_{t+1} = w_t - v_t$$

Here $v_t$ is the velocity, $\beta \in [0, 1)$ is the momentum coefficient (decay factor), and $\eta$ is the learning rate. Substitute $v_t$ into the weight update and you see the key thing immediately:

$$w_{t+1} = w_t - \beta v_{t-1} - \eta \nabla L(w_t)$$

So the step is a sum of two pushes:

$-\eta \nabla L(w_t)$: the "fresh" downhill push from the current gradient

$-\beta v_{t-1}$: the "inertia" push from accumulated past gradients

That inertia is why momentum often rockets through the early part of optimization: the optimizer zips toward the vicinity of the minimum in just a few epochs, while plain GD trudges there in many more. But you also see the dark side: overshoot and oscillation. If $\beta$ is high (e.g. 0.9), you're giving a lot of weight to history, so once you build speed you tend to blast past the minimum, then correct, then blast past again, with oscillations that gradually decay. Tuning $\beta$ down (e.g. 0.8) reduces how hard the past keeps shoving you, so the oscillations damp faster. That's the core trade: bigger $\beta$ = faster "ball rolling downhill" behavior, but also more ringing.

This oscillation problem is not just cosmetic. In non-convex deep nets (and even in convex-but-ill-conditioned bowls), that ringing can waste steps. The optimizer is spending compute doing a little interpretive dance around the minimum instead of just settling.

Nesterov Accelerated Gradient (NAG) is a deceptively small tweak on momentum that attacks exactly that: it tries to reduce the oscillation by being less surprised by where momentum is about to take you. The mental model: momentum is driving the car while staring at the road under the front bumper. NAG lets you peek a bit ahead, then steer.

Momentum computes the gradient at the current position $w_t$, then mixes it with velocity. NAG instead evaluates the gradient at a lookahead point: where momentum alone would take you.

Define the lookahead weight:

$$w_{\text{lookahead}} = w_t - \beta v_{t-1}$$

Now compute the gradient there, and update velocity using that gradient:

$$v_t = \beta v_{t-1} + \eta \nabla L(w_{\text{lookahead}}) = \beta v_{t-1} + \eta \nabla L(w_t - \beta v_{t-1})$$

Then the actual parameter update is still:

$$w_{t+1} = w_t - v_t$$

If you substitute, you get the compact "single-line" view that shows the difference vs momentum:

Momentum:

$$w_{t+1} = w_t - \beta v_{t-1} - \eta \nabla L(w_t)$$

NAG:

$$w_{t+1} = w_t - \beta v_{t-1} - \eta \nabla L(w_t - \beta v_{t-1})$$

That's it. Same ingredients. Same hyperparameters. One change: the gradient is measured after peeking in the direction momentum is about to move.
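Side by side in code, the difference really is one line. This is a scalar sketch following the equations above. `momentum_update`, `nag_update`, and the toy gradient are hypothetical names for this example, not any library's API.

```python
def momentum_update(w, v, grad_fn, lr=0.1, beta=0.9):
    """Classical momentum: gradient measured at the current point w."""
    v_new = beta * v + lr * grad_fn(w)
    return w - v_new, v_new


def nag_update(w, v, grad_fn, lr=0.1, beta=0.9):
    """NAG: gradient measured at the lookahead point w - beta * v."""
    lookahead = w - beta * v
    v_new = beta * v + lr * grad_fn(lookahead)
    return w - v_new, v_new


def grad_fn(w):
    """Gradient of the toy loss L(w) = 0.5 * w**2."""
    return w


# Same state, same hyperparameters, different place to measure the gradient.
w_mom, v_mom = momentum_update(1.0, 0.5, grad_fn)
w_nag, v_nag = nag_update(1.0, 0.5, grad_fn)
```

Starting from $w = 1$ with velocity $0.5$, momentum lands at $0.45$ and NAG at $0.495$. The lookahead gradient is smaller because the lookahead point is already closer to the minimum, so NAG takes a slightly shorter step.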

Why does this reduce oscillations? Picture the classic overshoot near a minimum along one direction. With momentum, you arrive near the bottom with a big $v$ pointing "forward." Even if the true gradient at $w_t$ is starting to change sign (meaning you've crossed the minimum), the update still includes this big inertial shove, so you keep going too far, then you correct back, then too far again. The optimizer is always a step late because it's using a gradient that doesn't account for the fact that momentum is about to move you.

NAG fixes the timing mismatch. It says: "Before you commit to the full step, pretend you already applied momentum, and from there ask the loss: which way is downhill?" Near the minimum, that lookahead point often lands on the "other side" of the bowl where the gradient flips direction. So the gradient term in NAG becomes a kind of early braking signal: it partially cancels the momentum shove before you overshoot as badly. Geometrically, instead of doing a big U-turn after you fly past the minimum, you do a smaller correction sooner. The result is the same fast approach but noticeably less ringing.
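A sketch of that braking effect on the same kind of 1-D bowl, $L(w) = \frac{1}{2}w^2$, with assumed settings `lr=0.1`, `beta=0.9`, start at $w = 1$. It compares how far each method swings past the minimum and how much it is still moving near the end.

```python
def run(nesterov, lr=0.1, beta=0.9, steps=100, start=1.0):
    """Momentum (nesterov=False) or NAG (nesterov=True) on L = 0.5 * w**2."""
    w, v = start, 0.0
    path = [w]
    for _ in range(steps):
        point = (w - beta * v) if nesterov else w  # where the gradient is measured
        g = point                                  # gradient of 0.5 * w**2
        v = beta * v + lr * g
        w -= v
        path.append(w)
    return path


mom_path = run(nesterov=False)
nag_path = run(nesterov=True)

# Deepest excursion past the minimum (the start is at w = 1, the minimum at 0).
mom_overshoot = -min(mom_path)
nag_overshoot = -min(nag_path)

# Largest distance from the minimum over the last 20 steps.
mom_tail = max(abs(w) for w in mom_path[-20:])
nag_tail = max(abs(w) for w in nag_path[-20:])
```

On this toy problem the first overshoot is roughly 0.60 for momentum and 0.35 for NAG, and NAG's oscillations also die out faster. Real loss surfaces are not clean quadratics, so treat the numbers as an illustration of the mechanism, not a benchmark.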

So the slogan version (that's actually accurate) is:

Momentum: "push me based on where I am."

NAG: "push me based on where momentum is about to put me."

There is a real tradeoff lurking here: damping oscillations can sometimes mean damping your ability to escape. If the loss surface has a little local basin separated by a ridge, momentum can occasionally carry enough inertia to roll up and out of that basin. NAG, by being more conservative near turning points (because it's constantly applying this lookahead correction), might fail to accumulate the same "slam through the barrier" behavior and can get stuck oscillating inside a local minimum region. That's not a universal law (deep net landscapes are weirder than the 2D cartoon) but it's a legit failure mode intuition: NAG is better behaved, and "better behaved" can sometimes mean "less exploratory."

Finally, the practical implementation detail from the Keras angle: NAG isn't a separate optimizer class there. It's a switch on SGD-with-momentum: the `SGD` optimizer takes a `momentum` value and a `nesterov` flag. Conceptually:

Vanilla SGD = momentum off, Nesterov off

Momentum SGD = momentum on, Nesterov off

NAG = momentum on, Nesterov on

So you're really choosing: do I want the velocity term, and if yes, do I want the gradient evaluated at the current position (momentum) or at the lookahead position (NAG)?

Net-net: NAG is "momentum, but slightly more clairvoyant." Same core mechanism (accumulate velocity to speed up in consistent directions) but it uses that lookahead gradient to reduce overshoot and tame oscillations, which often makes it converge faster and cleaner in the kinds of curved valleys you see all over deep learning optimization.

# Adaptive Gradient

Adagrad = Adaptive Gradient. The whole vibe is: stop pretending one global learning rate makes sense for every parameter forever. Instead, give each parameter its own effective step size that adapts based on how big its gradients have been historically. Some parameters get hammered with large gradients (steep directions) → shrink their step size. Others barely see signal (flat or sparse directions) → keep their step size relatively large so they can actually move. That's the entire trick, and it's one of those "how did we ever not do this?" moments.

Where does this matter? Two classic situations show up immediately.

First: features on wildly different scales. Think CGPA in $[0, 10]$ vs salary in $[0, 10^6]$. If you don't normalize, the gradient wrt the "salary weight" tends to be huge compared to the "CGPA weight", so vanilla gradient descent takes giant steps in one direction and tiny steps in another. In practice you usually normalize and move on, so this isn't the most compelling reason to use Adagrad, but it's a clean intuition pump: different gradient magnitudes should imply different step sizes.

Second (much more interesting): sparse features. Sparse means most entries are zero. A very typical setup: you've got normal dense signals like IQ and CGPA, plus a binary indicator like "IIT-JEE" that is mostly 0 and occasionally 1. This kind of feature can be extremely predictive but rarely "active." People also mention L1 regularization around sparsity because it tends to push weights toward exactly zero; different topic, but it lives in the same neighborhood of "sparsity changes geometry."

Now: why does sparsity create optimization drama? Because it changes the loss landscape geometry into a nasty stretched ravine / elongated bowl. If your features are "well-behaved" and similarly scaled/dense, the loss contours in parameter space (say $w$ and $b$ for a linear model) look more like concentric-ish circles/ellipses. But with sparse structure, you get contours that are extremely elongated: steep in one direction, flat in another. In 3D you can literally see a valley where along one axis the loss changes rapidly, and along the other axis it barely changes.

That geometry is exactly where vanilla GD and momentum start doing dumb stuff that looks like it's working (loss goes down), but wastes tons of steps. In an elongated valley, the gradient points mostly across the valley walls (the steep direction), so you take a step, bounce to the other side, bounce back, etc. Progress along the valley floor (the shallow direction) is painfully slow because the gradient component there is small. So your path becomes this zig-zag slide: you rush down the steep dimension, then inch forward along the flat one. Momentum adds inertia, which can make you overshoot harder across the steep walls before it settles, still not the direct "cut to the minimum" route your brain wants.

You can understand the "why" super concretely with the simplest possible network: a single neuron doing linear regression.

$$\hat{y} = wx + b \cdot 1$$

Take mean squared error (ignoring averaging constants for the intuition):

$$L = (y - \hat{y})^2$$

Compute gradients for one example:

$$\frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y})$$

$$\frac{\partial \hat{y}}{\partial w} = x, \quad \frac{\partial \hat{y}}{\partial b} = 1$$

So:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = -2(y - \hat{y})x$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b} = -2(y - \hat{y})$$

Now look at what sparsity does. If $x$ is sparse, then for most training examples $x = 0$. For those examples:

$$\frac{\partial L}{\partial w} = -2(y - \hat{y}) \cdot 0 = 0$$

So the gradient signal for $w$ is literally zero most of the time. Meanwhile the bias gradient doesn't get multiplied by $x$; it always exists:

$$\frac{\partial L}{\partial b} = -2(y - \hat{y})$$

In batch gradient descent (sum over many examples), $\sum(y - \hat{y})x$ is small because most terms are zero, but $\sum(y - \hat{y})$ is not. So the bias update dominates early. In parameter-space terms: your update vector has a big component in the $b$-direction and a tiny component in the $w$-direction, so you move mostly along one axis. That's the "why am I sliding down the valley wall forever?" behavior.
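A tiny made-up dataset shows the imbalance directly. Here $x$ is a binary feature that is active in 5 of 100 rows, the true relationship is assumed to be $y = 3x + 2$, and both parameters start at zero.

```python
# Hypothetical data: sparse binary feature, active in rows 0, 20, 40, 60, 80.
xs = [1.0 if i % 20 == 0 else 0.0 for i in range(100)]
ys = [3.0 * x + 2.0 for x in xs]  # assumed ground truth: w = 3, b = 2

w, b = 0.0, 0.0

# Batch gradients (summed over examples), using the formulas derived above.
grad_w = sum(-2.0 * (y - (w * x + b)) * x for x, y in zip(xs, ys))
grad_b = sum(-2.0 * (y - (w * x + b)) for x, y in zip(xs, ys))

# How many examples contribute anything at all to grad_w.
active_rows = sum(1 for x in xs if x != 0.0)
```

Only 5 rows feed `grad_w`, which comes out to $-50$. All 100 rows feed `grad_b`, which comes out to $-430$. The first update therefore moves almost entirely along the $b$ axis.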

Here's a good exercise: why does the path eventually start moving along the other direction after initially rushing down one axis? Mechanically: as you descend the steep direction, the gradient component in that direction shrinks (you're reaching the valley floor), so the other component, previously tiny, becomes relatively more important. Also, once the model has adjusted the bias/dense parts enough, the remaining error can become better explained by the sparse feature when it's active, so the sparse-direction gradients that do appear start steering you. Either way, the key is: the geometry forces a two-phase dynamic in vanilla methods, first "fix the steep thing," then "slowly crawl along the shallow thing."

So what's the fix? If the problem is "gradient magnitudes are incomparable across parameters," the only knob left in the update rule is the learning rate. Standard GD uses the same $\eta$ for every parameter:

$$\theta_{t+1} = \theta_t - \eta \, g_t$$

where $g_t = \nabla_\theta L(\theta_t)$.

But in these ravines you want something like:

- if a parameter sees consistently large gradients, shrink its effective learning rate so you don't bounce around

- if a parameter sees tiny or infrequent gradients (hello sparsity), increase its effective learning rate so it can move meaningfully when it finally gets signal

That idea, per-parameter learning rates, is the heart of Adagrad (and later RMSProp/Adam, which refine it).

Adagrad does it by keeping an accumulator of squared past gradients. For each parameter (think elementwise), define:

$$v_t = v_{t-1} + g_t^2$$

where $g_t^2$ is elementwise square. Then update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

- $\eta$ is the base learning rate (scalar)

- $v_t$ is the running sum of squared gradients (same shape as $\theta$)

- $\epsilon$ is a tiny constant for numerical stability (prevents division by zero and exploding steps when $v_t \approx 0$)

- $\odot$ is elementwise multiplication

If you want to see it as "effective learning rate per parameter $i$":

$$\eta_{i,t} = \frac{\eta}{\sqrt{v_{i,t}} + \epsilon}$$

$$\theta_{i,t+1} = \theta_{i,t} - \eta_{i,t} \, g_{i,t}$$

Now the intuition clicks hard:

- In steep directions, $g_{i,t}$ tends to be large repeatedly → $v_{i,t}$ grows fast → $\sqrt{v_{i,t}}$ gets large → $\eta_{i,t}$ shrinks → steps get damped.

- In sparse/flat directions, $g_{i,t}$ is often zero and occasionally nonzero → $v_{i,t}$ grows slowly → $\eta_{i,t}$ stays relatively large → when the gradient finally shows up, you take a meaningful step instead of a microscopic one.
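The two cases above can be sketched with a two-parameter example. The gradient stream is invented for illustration: parameter 0 sees a gradient of 1 on every step, while parameter 1 sees a gradient of 1 only on every tenth step. `adagrad_step` is a hypothetical helper that follows the update rule in this section.

```python
import math


def adagrad_step(theta, acc, grad, lr=0.1, eps=1e-8):
    """One Adagrad update, elementwise over Python lists."""
    acc = [a + g * g for a, g in zip(acc, grad)]  # v_t = v_{t-1} + g_t^2
    theta = [
        t - lr / (math.sqrt(a) + eps) * g         # per-parameter step size
        for t, a, g in zip(theta, acc, grad)
    ]
    return theta, acc


theta = [0.0, 0.0]
acc = [0.0, 0.0]
for step in range(1, 101):
    dense_g = 1.0                               # "loud" coordinate
    sparse_g = 1.0 if step % 10 == 0 else 0.0   # "quiet" coordinate
    theta, acc = adagrad_step(theta, acc, [dense_g, sparse_g])

# Effective learning rate per parameter after 100 steps.
eff_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in acc]
```

After 100 steps the accumulators are 100 and 10, so the dense coordinate's effective learning rate has dropped to about 0.01 while the sparse one is still about 0.032.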

Geometrically, Adagrad is trying to "circularize" the problem: it rescales each coordinate direction based on observed gradient energy so the ravine doesn't dominate your trajectory. Instead of the long wasteful slide and zig-zag, the path is more direct and reaches the minimum faster (at least early on). It's not magic, it's just fixing a mismatch between the landscape's anisotropy (different curvature in different directions) and your optimizer's assumption (one step size fits all).

Now the big gotcha, and it's a killer: Adagrad's learning rates only ever go down. Because $v_t$ is a cumulative sum, it grows monotonically:

$$v_t = \sum_{\tau=1}^{t} g_\tau^2$$

So $\sqrt{v_t}$ tends to increase without bound (or at least keeps creeping up), which means the effective learning rates $\eta_{i,t}$ keep shrinking. Eventually they get so small that updates become basically zero:

$$\Delta\theta_{i,t} \approx 0$$

And then you're stuck, your optimizer stops making progress before you properly converge to the best solution, especially in deep nets where you need sustained learning for a long time and the gradient statistics change across training. This is why you almost never see raw Adagrad as the default choice for modern neural networks. It can work fine in simpler convex-ish problems (classic examples: some linear models, some sparse-feature settings), but for deep learning it tends to "die" by annealing itself into paralysis.

And that's the historical punchline: Adagrad is less the final boss and more the ancestor. The core idea, normalize steps by a running measure of gradient magnitude per parameter, is exactly what later optimizers keep, while fixing the "learning rate decays forever" bug. RMSProp swaps the cumulative sum for a moving average (so the denominator doesn't grow without bound). Adam combines that RMSProp-style second-moment tracking with momentum-like first-moment tracking. Different beasts, same genetic material.

So Adagrad is the clean first principles version: detect which coordinates are loud vs quiet by accumulating $g^2$, divide by $\sqrt{\text{that}}$, and suddenly sparse features and ravines stop bullying your optimizer. Then it overcorrects by never forgetting the past, and eventually it forgets how to learn at all.

# AdaDelta

AdaDelta is the "let's actually think about what we're doing" version of AdaGrad. It fixes AdaGrad's slow death problem (the accumulator that grows forever), but it also does something more interesting: it makes the update equation dimensionally correct without requiring a learning rate at all. That second part is the clever bit that most people gloss over.

Start from AdaGrad's fatal flaw. It divides the learning rate by the sum of all past squared gradients:

$$v_t = v_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

That denominator only grows. Step 1: $g_1^2$. Step 2: $g_1^2 + g_2^2$. Step 1000: $g_1^2 + \cdots + g_{1000}^2$. It never shrinks. Eventually your effective learning rate $\rightarrow 0$, and learning stops dead.

The first fix is obvious once you ask: why should gradients from 1000 steps ago matter equally to gradients from 5 steps ago? They shouldn't. The naive approach would be to store the last $w$ gradients and average their squares, but that's memory-expensive. The clever fix: exponential moving average (EMA). Instead of storing history, use a running average that naturally "forgets" old values:

$$E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho) \cdot g_t^2$$

Think of it as a weighted blend: $\rho$ times the old estimate plus $(1-\rho)$ times the new observation. With $\rho = 0.9$, the current gradient contributes 10%, the previous average contributes 90%, a gradient from 10 steps ago has decayed to $0.9^{10} \approx 35\%$ of its original weight, and a gradient from 50 steps ago is at $0.9^{50} \approx 0.5\%$, essentially forgotten. Result: the denominator stays bounded, old gradients fade away, learning can continue indefinitely. The update becomes $\Delta\theta_t = -\eta \cdot g_t / \text{RMS}[g]_t$ where $\text{RMS}[g]_t = \sqrt{E[g^2]_t + \epsilon}$. This is basically RMSprop (Hinton beat Zeiler to the punch on this part by a few months). But AdaDelta goes further.
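A short sketch contrasting the two accumulators under a constant gradient of 1. The values are assumptions for illustration, with $\rho = 0.9$ and a base rate of 0.01.

```python
import math

rho, lr, eps = 0.9, 0.01, 1e-8

sum_sq = 0.0  # AdaGrad: running sum of squared gradients
ema_sq = 0.0  # EMA version: E[g^2]_t
for _ in range(1000):
    g = 1.0   # hypothetical constant gradient
    sum_sq += g * g
    ema_sq = rho * ema_sq + (1.0 - rho) * g * g

adagrad_lr = lr / (math.sqrt(sum_sq) + eps)  # keeps shrinking like lr / sqrt(t)
ema_lr = lr / math.sqrt(ema_sq + eps)        # levels off near lr
```

After 1000 steps the sum has reached 1000 and is still climbing, while the EMA has flattened out at 1. The effective step size is about $3 \times 10^{-4}$ in the first case and about $10^{-2}$ in the second.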

The second fix is more subtle: dimensional analysis. Consider what units things have. If your parameter $\theta$ has units (say, "meters") and the loss $f$ is treated as unitless, then the gradient $g = \partial f/\partial \theta$ has units of $1/\text{meters}$ (inverse of $\theta$). Now look at SGD: $\Delta\theta = -\eta \cdot g$. If $\eta$ is unitless, then $\Delta\theta$ has units of $1/\text{meters}$. But you're adding $\Delta\theta$ to $\theta$ (which has units meters). The units don't match. The update is dimensionally inconsistent.

AdaGrad doesn't fix this either. It divides the gradient by $\sqrt{\sum g^2}$, which has units $1/\theta$, so the ratio is unitless and the update still doesn't have the units of $\theta$. Newton's method gets this right: $\Delta\theta = -H^{-1} \cdot g$ where $H = \partial^2 f/\partial\theta^2$. Check the units: $g$ has units $1/\theta$, $H$ has units $1/\theta^2$, so $H^{-1}$ has units $\theta^2$, and $H^{-1} \cdot g$ has units $\theta^2 \cdot (1/\theta) = \theta$. Newton's method produces updates with the same units as the parameters. Dimensionally correct.

The AdaDelta insight: rearrange Newton's method, looking only at magnitudes (the sign comes from the gradient term), as $\Delta\theta = g/H$, which implies $1/H = \Delta\theta/g$. This says the inverse Hessian is the ratio of update size to gradient size. We don't have the Hessian, but we do have $\text{RMS}[g]$ (a measure of gradient magnitudes) and $\text{RMS}[\Delta\theta]$ (a measure of update magnitudes from previous steps). So approximate: $1/H \approx \text{RMS}[\Delta\theta]/\text{RMS}[g]$. Plug this in:

$$\Delta\theta_t = -\frac{\text{RMS}[\Delta\theta]_{t-1}}{\text{RMS}[g]_t} \cdot g_t$$

Check the units: $\text{RMS}[\Delta\theta]$ has units of $\theta$, $\text{RMS}[g]$ has units of $1/\theta$, $g$ has units of $1/\theta$. So $\Delta\theta = (\theta)/(1/\theta) \cdot (1/\theta) = \theta^2 \cdot (1/\theta) = \theta$. The units now match. No arbitrary learning rate $\eta$ needed.

The full algorithm maintains two running averages: $E[g^2]_t$ (EMA of squared gradients, like RMSprop) and $E[\Delta\theta^2]_t$ (EMA of squared updates, the new part). Each step: compute $E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1-\rho) \cdot g_t^2$, then $\Delta\theta_t = -(\text{RMS}[\Delta\theta]_{t-1}/\text{RMS}[g]_t) \cdot g_t$, then update $E[\Delta\theta^2]_t = \rho \cdot E[\Delta\theta^2]_{t-1} + (1-\rho) \cdot \Delta\theta_t^2$, then $\theta_{t+1} = \theta_t + \Delta\theta_t$. Typical hyperparameters: $\rho = 0.95$, $\epsilon = 10^{-6}$.
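The four steps can be sketched for a single scalar parameter as follows. `adadelta_step` is a hypothetical helper written from the equations in this section, with $\text{RMS}[x] = \sqrt{E[x^2] + \epsilon}$, and the toy loss $L(\theta) = \frac{1}{2}\theta^2$ is an assumption for the demo.

```python
import math


def adadelta_step(theta, eg2, edx2, grad, rho=0.95, eps=1e-6):
    """One AdaDelta update for a scalar parameter.

    eg2  is E[g^2], the EMA of squared gradients.
    edx2 is E[dtheta^2], the EMA of squared updates.
    """
    eg2 = rho * eg2 + (1 - rho) * grad * grad                   # 1. accumulate gradient
    delta = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * grad  # 2. lagged numerator
    edx2 = rho * edx2 + (1 - rho) * delta * delta               # 3. accumulate update
    return theta + delta, eg2, edx2                             # 4. apply update


theta, eg2, edx2 = 1.0, 0.0, 0.0
first_theta = None
for step in range(50):
    grad = theta  # gradient of 0.5 * theta**2
    theta, eg2, edx2 = adadelta_step(theta, eg2, edx2, grad)
    if step == 0:
        first_theta = theta
```

Note how small the first step is. With both accumulators at zero, the numerator is just $\sqrt{\epsilon} = 10^{-3}$, so the first update is about $-0.0045$. The step size has to bootstrap itself from $\epsilon$, which is one reason $\epsilon$ is not an innocent constant here.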

One neat detail: notice $\text{RMS}[\Delta\theta]_{t-1}$ uses time $t-1$, while $\text{RMS}[g]_t$ uses time $t$. The lag is a feature. If a sudden huge gradient spike hits, the denominator $\text{RMS}[g]_t$ immediately increases (it includes the new gradient), but the numerator $\text{RMS}[\Delta\theta]_{t-1}$ doesn't react yet (it's lagged). Result: effective learning rate drops instantly, damping the spike. This behaves a bit like automatic gradient clipping built into the algorithm.
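A sketch of the spike case with assumed state: gradients have been hovering around magnitude 1 (so $E[g^2] \approx 1$), the lagged $\text{RMS}[\Delta\theta]_{t-1}$ is taken to be 0.01, and then a gradient of 100 arrives.

```python
import math

rho, eps = 0.95, 1e-6
eg2_prev = 1.0  # hypothetical E[g^2]_{t-1}: gradients of magnitude ~1 so far
rms_dx = 0.01   # hypothetical RMS[dtheta]_{t-1}; lagged, so it ignores the new g


def step_size(g):
    """|dtheta_t| for an incoming gradient g, given the state above."""
    eg2_t = rho * eg2_prev + (1 - rho) * g * g  # denominator sees g immediately
    return rms_dx / math.sqrt(eg2_t + eps) * abs(g)


normal = step_size(1.0)   # ordinary gradient
spike = step_size(100.0)  # gradient 100x larger

# However large g gets, the step cannot exceed this value.
cap = rms_dx / math.sqrt(1 - rho)
```

The gradient grew by a factor of 100, but the step grows by less than a factor of 5. In this formulation a single step is bounded by $\text{RMS}[\Delta\theta]_{t-1}/\sqrt{1-\rho}$ no matter how large the spike is.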

So AdaDelta combines ideas from several places: from SGD it takes the negative gradient direction, from momentum the numerator accumulates past update information, from AdaGrad the per-dimension scaling via squared gradients, and from Newton the correct units via Hessian approximation. All with $O(1)$ memory per parameter and no learning rate to tune.

The original paper tracked the "effective learning rate", the ratio $\text{RMS}[\Delta\theta]_{t-1}/\text{RMS}[g]_t$, across different layers during training, and found some interesting dynamics. First, lower layers (the ones closest to the input) get larger step sizes early in training. This makes sense: backpropagation multiplies gradients through each layer on the way back from the output, so if weights are typically $< 1$, gradients shrink roughly exponentially as they travel toward the input. Lower layers see tiny $g$, which means small $\text{RMS}[g]$, which means large effective learning rate. AdaDelta helps compensate for the vanishing gradient problem without manual per-layer tuning.

Second, late in training the effective learning rate converges to 1 across all layers. Why? Near a minimum, both gradients and updates become tiny. When $E[g^2]$ and $E[\Delta\theta^2]$ both approach zero, the $\epsilon$ terms dominate: $\text{RMS}[\Delta\theta]/\text{RMS}[g] \approx \sqrt{\epsilon}/\sqrt{\epsilon} = 1$. A step size of 1 sounds dangerously large, but it doesn't cause divergence because the actual update is $\Delta\theta \approx -1 \times g$, and $g$ itself is tiny near the minimum. So updates naturally shrink toward zero, not because the step size decays, but because the gradient does. This is "natural annealing": the system automatically reduces updates as you approach convergence, no learning rate schedule needed.

There's a catch, though. Near a minimum you often oscillate: step left, overshoot, step right, overshoot. Momentum handles this elegantly because it averages velocity: left and right oscillations partially cancel, producing net movement toward the center. AdaDelta doesn't get this benefit. It accumulates *squared* updates in the numerator: $E[\Delta\theta^2]_t = \rho \cdot E[\Delta\theta^2]_{t-1} + (1-\rho) \cdot \Delta\theta_t^2$. Squaring makes everything positive. Left and right oscillations both contribute positively, inflating the numerator rather than canceling. This can sustain oscillations rather than damping them. The suggested fix is to add an explicit annealing schedule, which defeats the "no learning rate" promise and is something momentum-based methods get for free via velocity averaging.
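
The cancellation argument can be checked with a few lines. This sketch feeds the same alternating sequence of updates (a made-up $\pm 0.1$ oscillation) into a signed running average and into a squared one; the helper `ema` is an illustrative name.

```python
def ema(values, rho=0.95, square=False):
    """Exponential moving average of values (or of their squares)."""
    avg = 0.0
    for v in values:
        avg = rho * avg + (1 - rho) * (v * v if square else v)
    return avg

# Hypothetical oscillation around a minimum: +0.1, -0.1, +0.1, ...
updates = [0.1 if k % 2 == 0 else -0.1 for k in range(200)]

signed_avg = ema(updates)                  # momentum-style: signs mostly cancel
squared_avg = ema(updates, square=True)    # AdaDelta-style numerator: nothing cancels
rms_of_updates = squared_avg ** 0.5
```

The signed average ends up a small fraction of the oscillation amplitude, while the RMS stays at roughly the full amplitude of 0.1, which is the quantity AdaDelta puts in its numerator.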

Sounds great on paper. In practice, AdaDelta is mostly a historical footnote. RMSprop and Adam dominate for a few reasons. First, slow convergence: the unit-correction trick sounds elegant, but the "effective learning rate" that emerges from $\text{RMS}[\Delta\theta]/\text{RMS}[g]$ tends to be conservative, and Adam with a tuned $\eta$ usually gets there faster. Second, the "no learning rate" promise is oversold: you still have $\rho$ (the decay rate) and $\epsilon$ (the stability constant), and in some problems $\epsilon$ matters a lot for convergence behavior. You didn't eliminate hyperparameters; you just renamed them. Third, Adam won: by the time practitioners were choosing between RMSprop and AdaDelta, Adam showed up with momentum + adaptive scaling + bias correction, and that combination was good enough that most people just use Adam and move on. Fourth, extra state: AdaDelta tracks both $E[g^2]$ and $E[\Delta\theta^2]$, which is 2x the state of RMSprop per parameter. Adam also tracks two things, but Adam's first moment is momentum which actually helps convergence. AdaDelta's update accumulator is more of an accounting trick.

AdaDelta's historical role is this: it proved you could eliminate the learning rate if you tracked updates, and it showed the dimensional analysis perspective on optimizer design. Both insights are interesting. But the optimizer itself isn't the one you'll reach for. If you want one sentence: AdaDelta is RMSprop with the learning rate replaced by a ratio of accumulated update magnitudes, which makes the units work out but doesn't actually make it faster than Adam in practice.

# RMSprop

RMSprop (Root Mean Square Propagation) is basically Adagrad with amnesia. And that "amnesia" is the whole point.

Start from the geometry that motivates all these adaptive methods. In a lot of deep learning (and also in toy sparse-feature setups), the loss surface in parameter space looks like a long, skinny ravine: steep walls in one direction, shallow slope in another. If you run vanilla gradient descent (even with momentum), you tend to "ping-pong" across the ravine: big gradients push you hard sideways into the wall, then you correct, then you bounce again. You eventually creep down, but it's a very indirect trajectory compared to what you'd want, which is basically "slide along the valley floor toward the minimum."

Adagrad shows up as an early fix, especially when the data is sparse (many zeros in some features). The key move: keep a per-parameter accumulator of squared gradients, so each parameter gets its own effective learning rate. In coordinates, with parameters $\theta$, gradient $g_t = \nabla_\theta L(\theta_t)$, and elementwise square $(g_t)^2 = g_t \odot g_t$:

$$v_t = v_{t-1} + g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

($\epsilon$ is the usual tiny constant for numerical stability; many writeups include it implicitly.)

This does something very intuitive: if a parameter has been seeing large gradients repeatedly, its $v_t$ grows, so $\sqrt{v_t}$ grows, and the step on that coordinate shrinks. If another parameter rarely gets gradients (because the feature is often zero), its accumulator grows slowly, so it keeps a relatively larger step size when it does get a gradient. That's why Adagrad can be nice for sparse data: it automatically dampens "frequent" directions and lets "rare" directions still move.

Now the catch, and it's a big one: Adagrad never forgets. That sum $v_t = \sum_{i=1}^{t} g_i^2$ only increases. So the effective learning rate on each coordinate is

$$\eta_{\text{eff},t} = \frac{\eta}{\sqrt{v_t} + \epsilon}$$

and as $t$ grows, $v_t$ tends to grow, so $\eta_{\text{eff},t}$ tends to shrink toward zero. After enough steps, you're taking microscopic updates. That's Adagrad's failure mode: you make progress early, then later you're basically crawling, and you can stall before you actually reach the minimum (or you approach it so slowly that in practice it's "stuck").
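
A quick way to see the decay is to feed Adagrad a constant gradient and record the effective learning rate. This is only a sketch for one scalar coordinate; the function name and the constant-gradient stream are assumptions for illustration.

```python
import math

def adagrad_effective_lrs(grads, eta=0.1, eps=1e-8):
    """Effective learning rate eta / (sqrt(v_t) + eps) after each gradient."""
    v, lrs = 0.0, []
    for g in grads:
        v += g * g                     # the accumulator only ever grows
        lrs.append(eta / (math.sqrt(v) + eps))
    return lrs

lrs = adagrad_effective_lrs([1.0] * 10_000)
first_lr, last_lr = lrs[0], lrs[-1]
```

With a constant gradient, $v_t = t \cdot g^2$, so the effective learning rate falls like $1/\sqrt{t}$: after 10,000 steps it is about 100 times smaller than at the start, even though nothing about the problem has changed.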

To make this concrete with a simple example: consider linear regression where one column is all ones (the bias feature) and another column holds the $x$ values (sparse), with output $y$. Parameters are $b$ for the bias and $m$ for the slope. In sparse settings, $b$ tends to get nontrivial gradients almost every step (the "ones" feature is always on), while $m$'s gradient can be intermittent or smaller depending on the sparsity pattern. Under Adagrad, the bias coordinate's accumulator $v_{t,b}$ can balloon because you keep adding $g_{t,b}^2$ at every step since the beginning of training. That makes the denominator huge, so the effective learning rate for $b$ becomes tiny, and you can end up in a situation like: "I still need to move in the $b$ direction to reach the minimum, but my step size in $b$ has been annihilated by history." The algorithm doesn't mathematically forbid further movement; it just makes it so slow that it looks like it stopped.

So RMSprop: same idea as Adagrad (normalize by a running scale of gradient magnitudes), but it refuses to accumulate the entire history. Instead of a plain sum, it keeps an exponentially weighted moving average (EWMA) of squared gradients:

$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2$$

and the parameter update keeps the same "divide by root of second moment" form:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

Here $\beta$ is typically something like $0.9$, $0.95$, or $0.99$. Interpret $\beta$ as a memory knob: closer to 1 means longer memory; smaller means you track more recent gradients more aggressively.
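
The two equations above translate almost line for line into code. The following is a minimal sketch, not a reference implementation: the function name `rmsprop_step`, the hyperparameter defaults, and the toy ravine $f(x, y) = \tfrac{1}{2}(100x^2 + y^2)$ are all assumptions for illustration.

```python
import math

def rmsprop_step(theta, grad, v, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; theta, grad, and v are lists of equal length."""
    new_v = [beta * vi + (1 - beta) * gi * gi for vi, gi in zip(v, grad)]
    new_theta = [
        ti - eta * gi / (math.sqrt(vi) + eps)
        for ti, gi, vi in zip(theta, grad, new_v)
    ]
    return new_theta, new_v

# Toy ravine: curvature 100 along x, curvature 1 along y.
theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    grad = [100.0 * theta[0], theta[1]]
    theta, v = rmsprop_step(theta, grad, v)
```

Because each coordinate is divided by its own gradient scale, the factor of 100 between the two curvatures largely drops out, and both coordinates end up within a few multiples of $\eta$ of the minimum. With a fixed $\eta$ they keep jittering at about that scale rather than settling exactly.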

This one-line change is deceptively powerful. Unroll the recurrence (assume $v_0 = 0$):

$$v_1 = (1 - \beta) g_1^2$$

$$v_2 = \beta(1 - \beta) g_1^2 + (1 - \beta) g_2^2$$

$$v_3 = \beta^2(1 - \beta) g_1^2 + \beta(1 - \beta) g_2^2 + (1 - \beta) g_3^2$$

and in general:

$$v_t = (1 - \beta) \sum_{i=1}^{t} \beta^{t-i} g_i^2$$

So the contribution from step 1 gets multiplied by $\beta^{t-1}$, which decays exponentially. Old gradients don't dominate forever; they fade out. You can see it algebraically: the optimizer starts forgetting older steps.
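
If you want to convince yourself that the recurrence and the unrolled sum are the same thing, a short check does it. The gradient values below are arbitrary, and both function names are illustrative.

```python
def ewma_recursive(grads, beta):
    """v_t = beta * v_{t-1} + (1 - beta) * g_t^2, starting from v_0 = 0."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g * g
    return v

def ewma_closed_form(grads, beta):
    """v_t = (1 - beta) * sum_i beta^(t - i) * g_i^2."""
    t = len(grads)
    return (1 - beta) * sum(
        beta ** (t - i) * g * g for i, g in enumerate(grads, start=1)
    )

grads = [0.5, -1.0, 2.0, 0.1, -0.3]   # arbitrary example gradients
v_rec = ewma_recursive(grads, beta=0.9)
v_sum = ewma_closed_form(grads, beta=0.9)
```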

Why does this fix Adagrad's stall? Because $\sqrt{v_t}$ now tracks something like the recent RMS (root-mean-square) magnitude of the gradient on each parameter. It can go up and down depending on what the optimizer is currently experiencing; it doesn't just monotonically explode. So $\eta_{\text{eff},t} \approx \eta / \sqrt{v_t}$ doesn't inevitably collapse to ~0 just because time passed. Updates keep having nontrivial size, and you don't get the "I used up my learning rate budget in the first 20% of training" pathology.
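
Here is a sketch of the "up and down" behaviour on one scalar coordinate. The gradient stream (200 large gradients followed by 200 small ones) is made up for illustration, as is the function name.

```python
import math

def rmsprop_effective_lrs(grads, eta=0.01, beta=0.9, eps=1e-8):
    """Effective learning rate eta / (sqrt(v_t) + eps) after each gradient."""
    v, lrs = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g * g
        lrs.append(eta / (math.sqrt(v) + eps))
    return lrs

# Hypothetical stream: a steep phase followed by a flat phase.
grads = [10.0] * 200 + [0.1] * 200
lrs = rmsprop_effective_lrs(grads)

lr_during_large = lrs[199]   # end of the steep phase
lr_after_small = lrs[-1]     # end of the flat phase
```

Once the large gradients have faded from the moving average, the effective learning rate recovers by roughly the ratio of the two gradient scales. With Adagrad's cumulative sum it could only have kept shrinking.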

Geometrically, RMSprop is trying to equalize progress across coordinates. In the ravine picture, one direction consistently has large gradients (steep walls) and the other has smaller gradients (along the valley floor). Dividing by $\sqrt{v_t}$ per coordinate damps the steep direction more and damps the shallow direction less, so your steps become less zig-zag and more "down the valley." In sparse data terms, parameters tied to frequently-active features get their effective step reduced, while infrequent features don't get permanently penalized just because training has been running a long time.

One thing worth noting: on a convex problem like linear regression, you may not visually see a dramatic difference between standard gradient descent and RMSprop because convex problems are forgiving; plain GD converges fine if you pick a reasonable learning rate. Deep nets are non-convex, ill-conditioned, and full of plateaus/ridges/saddles, so the "adaptive scaling" trick tends to matter more there. RMSprop's practical selling point is that it often behaves like a stabilizer: it keeps learning moving when raw gradients would either explode you or force you into tiny step sizes.

RMSprop and Adam look similar, and that's not a coincidence. Adam basically takes RMSprop's second-moment idea and adds a first-moment (momentum-like) running average:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

(with bias corrections usually applied), and then it updates using $m_t$ normalized by $\sqrt{v_t}$. So if you compare trajectories, RMSprop can resemble "Adam without the momentum term (and without Adam's bias correction machinery)." On simple convex problems, they can look very similar.

One place I need to be blunt: claims like "Adam won't reach the minimum whereas RMSprop does" or "RMSprop has no disadvantages" are too absolute. In non-convex optimization there is no general guarantee that either method reaches a global minimum; the landscape can have many minima and many saddle points, and what you get depends on initialization, noise, learning rates, batch size, architecture, etc. Empirically, sometimes RMSprop beats Adam on a task, sometimes Adam beats RMSprop, sometimes SGD-with-momentum beats both (often in final generalization). RMSprop does have downsides in the practical sense: it still has hyperparameters ($\eta$, $\beta$, $\epsilon$), can be sensitive to the learning rate, can converge to different solutions, and like all adaptive methods it can sometimes trade off "fast training loss descent" against "generalization" depending on the setup. So: it's excellent, but it's not a magic spell with zero tradeoffs.

Historically, before Adam became the default-ish choice in many deep learning workflows, RMSprop was a go-to optimizer for neural nets, and it's still a serious contender. If Adam is acting weird on a particular problem (instability, poor validation, annoying sensitivity), trying RMSprop is a totally reasonable move because it keeps the core adaptive normalization idea but with slightly different dynamics.

If you want one sentence to tattoo on your brain: Adagrad divides by $\sqrt{\sum g^2}$ and eventually suffocates itself; RMSprop divides by $\sqrt{\text{EWMA}(g^2)}$ so it keeps breathing.

# Adam

Adam = Adaptive Moment Estimation. It's popular because it's basically the "best-of" mashup of two older ideas that solved two different pain points in gradient descent: momentum (use history to move faster and smoother) and adaptive per-parameter step sizes (don't use one global learning rate when the geometry and sparsity are wildly uneven). Put differently: Adam is Momentum + RMSprop, plus a small but important fix called bias correction so the early steps aren't systematically wrong.

Start from the boring baseline: gradient descent is just "walk downhill." With parameters $w$ and loss $L(w)$, you compute the gradient $g_t = \nabla_w L(w_t)$ and update

$$w_{t+1} = w_t - \eta \, g_t$$

where $\eta$ is the learning rate. The distinction between Batch GD / SGD / Mini-batch GD is mostly how noisy $g_t$ is: batch uses all data (low noise, slow per step), SGD uses one example (high noise, cheap), mini-batch is the practical middle (moderate noise, GPU-friendly). The "valley" picture matters because neural nets often induce nasty geometry: steep curvature in some directions, flat in others. Plain GD tends to zig-zag in narrow ravines: it bounces left-right across the valley walls while slowly making progress along the valley floor.

Momentum is the first historical "aha": don't just use the current gradient, maintain a velocity that's an exponential moving average of past gradients. In its common form:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t$$
$$w_{t+1} = w_t - \eta \, m_t$$

Here $m_t$ is like inertia: if gradients keep pointing roughly the same way, the velocity builds up in that direction, while components that flip sign from step to step partly cancel. This is why momentum typically gets to the bottom in far fewer iterations than vanilla GD: it's not wasting as much effort canceling yesterday's direction with today's direction. But momentum introduces/keeps a common annoyance: oscillations. In a narrow ravine, the gradient direction can flip sign in the high-curvature direction step-to-step, and momentum can cause you to overshoot, swing back, overshoot again, and only gradually settle.
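
A small sketch of the inertia effect, using the exponential-moving-average form of momentum written above. The gradient stream is invented to mimic a ravine (coordinate 0 flips sign every step, coordinate 1 is small but consistent), and `momentum_step` is an illustrative name.

```python
def momentum_step(w, grad, m, eta=0.1, beta1=0.9):
    """One momentum update; w, grad, and m are lists of equal length."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    w = [wi - eta * mi for wi, mi in zip(w, m)]
    return w, m

w, m = [0.0, 0.0], [0.0, 0.0]
for t in range(100):
    # Steep direction ping-pongs between +1 and -1; valley direction is a steady 0.1.
    grad = [1.0 if t % 2 == 0 else -1.0, 0.1]
    w, m = momentum_step(w, grad, m)
```

Even though the raw gradient is ten times larger in the oscillating coordinate, its velocity ends up smaller than the velocity along the consistent coordinate, so most of the net movement goes along the valley.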

That sets up Nesterov Accelerated Gradient (NAG): same general idea as momentum, but with a "lookahead" so you correct sooner and damp oscillations. Intuitively: instead of computing the gradient at your current position, you compute it at the position you'd end up after applying momentum, then adjust. A typical formulation:

$$\tilde{w}_t = w_t - \eta \beta_1 m_{t-1}$$
$$g_t = \nabla_w L(\tilde{w}_t)$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$w_{t+1} = w_t - \eta m_t$$

Net effect: your orbit tightens. You still get the speed benefits of momentum, but you reduce that big loopy wandering.
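
The four NAG equations above can be sketched for a scalar parameter as follows. The function name `nag_step`, the hyperparameter values, and the toy objective $f(w) = \tfrac{1}{2}w^2$ are assumptions for illustration.

```python
def nag_step(w, m, grad_fn, eta=0.1, beta1=0.9):
    """One Nesterov-style update: evaluate the gradient at the lookahead point."""
    w_look = w - eta * beta1 * m      # where momentum alone would carry us
    g = grad_fn(w_look)               # gradient at the lookahead position
    m = beta1 * m + (1 - beta1) * g
    return w - eta * m, m

# Toy objective f(w) = 0.5 * w^2, whose gradient is simply w.
w, m = 1.0, 0.0
for _ in range(500):
    w, m = nag_step(w, m, grad_fn=lambda x: x)
```

The only difference from plain momentum is the `w_look` line: the gradient is sampled where the velocity is about to take you, not where you currently are.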

Now the second historical "aha" is about learning rate scaling, especially for sparse data / sparse features. Suppose some parameters almost never get gradients (rare words, rare categorical features, some embedding rows, etc.). With a single global $\eta$, the dense parameters get hammered with updates constantly, while sparse ones inch along because they rarely receive signal. The optimizer spends a long time racing down "the dense side" and only later drifting toward the sparse side, which is slow and kind of dumb.

AdaGrad fixes this by giving every parameter its own effective learning rate that shrinks with the accumulated squared gradients:

$$v_t = \sum_{i=1}^{t} g_i^2 \quad \text{(elementwise)}$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

Key intuition: if a parameter gets large or frequent gradients, its $v_t$ grows, so $\eta / \sqrt{v_t}$ shrinks and you take smaller steps there. If a parameter is sparse and rarely updated, its $v_t$ stays small, so it gets relatively larger steps when it finally does get a gradient. This is why AdaGrad "descends directly" in sparse-feature scenarios: it equalizes progress across coordinates.

But AdaGrad has a very specific failure mode: $v_t$ is a cumulative sum, so it never decreases. Learning rates decay monotonically forever. Eventually the effective step sizes become so tiny that training can stall: you stop making meaningful updates and can "miss" or fail to reach the minimum in practice because you've annealed yourself into immobility.

RMSprop is basically "AdaGrad, but don't let the accumulator grow without bound." Replace the raw cumulative sum with an exponential moving average of squared gradients:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

Now $v_t$ tracks recent gradient magnitudes instead of the entire history, so you get adaptive per-parameter scaling without the "learning rate goes to zero forever" pathology. RMSprop tends to actually reach the minimum cleanly where AdaGrad can bog down.

At this point the pattern is screaming at you: one family of tricks is about directional smoothing / inertia (Momentum, NAG). Another family is about adaptive step sizes / learning rate normalization (AdaGrad, RMSprop). Adam's entire thesis is: why not do both at once?

So Adam maintains two exponential moving averages:

$m_t$ is the first moment (mean) of gradients, giving momentum-like behavior.

$v_t$ is the second moment (uncentered variance / mean square) of gradients, giving RMSprop-like scaling.

The core recurrences are:

$$g_t = \nabla_w L(w_t)$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$

If you unroll these recurrences (assuming $m_0 = 0$ and $v_0 = 0$), you see the exponentially weighted sums explicitly:

$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$$

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} g_i^2$$

So $m_t$ is an exponentially weighted sum of past gradients (recent gradients weighted more heavily), and $v_t$ is the same but for squared gradients. The contribution from step $i$ decays as $\beta^{t-i}$: older terms fade out exponentially.

Then the update is (elementwise division):

$$w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

That hat notation matters. This is the bias correction. At initialization you set $m_0 = 0$ and $v_0 = 0$. Early on, exponential moving averages that start at zero are biased toward zero (because they haven't "warmed up" yet). Without correction, the first few steps systematically underestimate the moments, especially when $\beta_1, \beta_2$ are close to 1. The fix is:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

This makes the early estimates behave like what you intended: moment estimates that aren't artificially shrunk just because you started at zero.
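
Putting the recurrences, the bias correction, and the update together for a single scalar parameter gives something like the sketch below. The function name `adam_step` and the example gradient are assumptions; the default hyperparameters follow the values discussed in this section.

```python
import math

def adam_step(w, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)                 # undo the zero-initialisation bias
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# A single step from w = 1 with an (arbitrary) large gradient.
w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=250.0, m=m, v=v, t=1)
```

At $t = 1$ the corrections give $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the first step has magnitude almost exactly $\eta$ whatever the size of the gradient.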

A couple implementation-level notes that are actually conceptual:

The $\sqrt{\hat{v}_t}$ is there because $v_t$ lives in "gradient-squared units." Taking a square root puts it back in gradient units so the scaling is dimensionally sensible.

$\epsilon$ is just a small constant (often $10^{-8}$) to avoid division by zero and to stabilize numerics.

Hyperparameters: $\beta_1$ is commonly 0.9 (momentum timescale ~10 steps). $\beta_2$ is commonly 0.999 (a much longer timescale for the squared-gradient statistics). Note that $\beta_2$ must be in $[0, 1)$ for an exponential moving average. In practice you'll see $\beta_2 = 0.99$ or $0.999$ depending on taste/problem. $\eta$ is often set to something like $10^{-3}$ by default, and one of Adam's selling points is that you often don't have to micromanage $\eta$ as much as with vanilla SGD because the denominator is doing a lot of automatic normalization. But "often" isn't "always": it's still a hyperparameter, just less fragile.

The clean intuition for why Adam's trajectory looks "more centered" and converges in fewer iterations: $m_t$ filters out the high-frequency noise and builds a stable direction of travel (the momentum story), while $v_t$ shrinks steps in directions/coordinates with consistently large gradients (the ravine wall bouncing story) and expands steps for coordinates that rarely get action (the sparse-feature story). So you get faster progress along the valley floor, less chaotic bouncing across it, and better handling of uneven curvature and sparsity. On simple convex toy problems you might not feel the magic. On real neural nets (high-dimensional, noisy mini-batch gradients, nonconvex loss surfaces) the combination is often very forgiving and fast.

There are other relatives floating around (AdaDelta, Adamax, etc.) but the pragmatic takeaway is basically this as a workflow heuristic: start with Adam as a strong baseline for ANN/CNN/RNN training, look at learning curves and validation performance, and if it's not behaving, try RMSprop or even plain Momentum/SGD. There's no single optimizer that dominates on every dataset/model/hyperparameter regime; optimization is geometry + noise + inductive bias colliding in a dark alley. Adam just tends to be a very good first flashlight.

## Adam vs RMSProp with Momentum

We just described Adam as "momentum + RMSprop-style scaling." RMSprop itself came from Hinton's Coursera lectures in 2012. Adam arrived in 2014. That's a two-year gap. So here's the natural question: did anyone just try adding momentum to RMSprop during that window?

Yes. [Graves (2013)](https://arxiv.org/abs/1308.0850) used exactly that combination, RMSProp with momentum bolted on, for his influential work on sequence generation with RNNs. It wasn't a formal paper about the optimizer itself, but it was a real recipe used in practice. So "RMSProp + momentum" predates Adam.

Which raises the obvious follow-up: if RMSProp + momentum already existed, what exactly did Adam contribute? The answer is two things: a different way of combining the components, and bias correction. These turn out to matter more than you'd expect.

The key difference is *order of operations*. Let's write them out side by side.

**RMSProp with momentum:**

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\text{scaled\_grad}_t = \frac{g_t}{\sqrt{v_t} + \epsilon}$$
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \cdot \text{scaled\_grad}_t$$
$$w_{t+1} = w_t - \eta \, m_t$$

**Adam:**

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

See the difference? In RMSProp+momentum, you first divide each gradient by $\sqrt{v_t}$, then apply momentum to those rescaled gradients. In Adam, you apply momentum to the raw gradients, then rescale at the end.

Symbolically:

- **RMSProp + momentum:** $\text{momentum}(g / \sqrt{v})$
- **Adam:** $\text{momentum}(g) / \sqrt{v}$

These are not the same. Momentum and division don't commute.
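
A numerical sketch makes the non-commutativity tangible. Both orderings below share the same $v_t$; bias correction is deliberately left out so that the only difference is where the division happens. The function name and the gradient values are illustrative assumptions.

```python
import math

def compare_orders(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (momentum of scaled gradients, scaled momentum of raw gradients)."""
    m_scaled = m_raw = v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
        # RMSProp + momentum: divide first, then average.
        m_scaled = beta1 * m_scaled + (1 - beta1) * g / (math.sqrt(v) + eps)
        # Adam-style: average the raw gradient first.
        m_raw = beta1 * m_raw + (1 - beta1) * g
    adam_direction = m_raw / (math.sqrt(v) + eps)   # divide once, at the end
    return m_scaled, adam_direction

one_a, one_b = compare_orders([2.0])
many_a, many_b = compare_orders([1.0, 0.1, -2.0, 0.5, 3.0])
```

After a single gradient the two directions coincide, because there is no history for the ordering to act on. As soon as $v_t$ changes between steps, the two quantities drift apart.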

Why does this matter? Think about what each approach is actually tracking.

In Adam, $m_t$ is a running average of raw gradients. It's estimating the expected gradient direction, preserving the correlation structure of the gradient signal across time. Then you divide by $\sqrt{v_t}$ to normalize the step size. The first moment and second moment are estimated independently, then combined at the end.

In RMSProp+momentum, you're applying momentum to already-normalized updates. Each gradient gets rescaled before it enters the momentum buffer. This means momentum is operating on signals that have already been "whitened" by the second-moment scaling.

A concrete example helps. Suppose you have sparse gradients: most steps have $g = 0$, but occasionally a large gradient appears. In Adam, the $m_t$ buffer slowly accumulates these sparse signals in their raw form, and when you divide by $\sqrt{v_t}$, the ratio captures the signal-to-noise structure. In RMSProp+momentum, each gradient gets rescaled before momentum sees it. When $g = 0$, the rescaled gradient is $0/\sqrt{v_t} = 0$, fine. But when a large $g$ suddenly appears after many zeros, the rescaled value depends heavily on what $v_t$ looks like at that moment. If $v_{t-1}$ has decayed to nearly zero (no gradients in a while), then $v_t \approx (1-\beta_2) g^2$ and the rescaled gradient has magnitude close to $1/\sqrt{1-\beta_2}$, about 31.6 for $\beta_2 = 0.999$. The momentum buffer then stores that rescaled spike and keeps replaying it over the following steps.

The second difference is bias correction. Adam explicitly corrects for the initialization bias:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

RMSProp doesn't do this. Early in training, when you initialize $v_0 = 0$, the exponential moving average is biased toward zero. Without correction, $v_t$ underestimates the true second moment, so $\sqrt{v_t}$ is too small, so you divide by something tiny, so your steps are too big.

This matters most when $\beta_2$ is close to 1. With sparse gradients, you often want $\beta_2 = 0.999$ or higher so the optimizer doesn't forget the few nonzero gradients it sees. But with $\beta_2 = 0.999$, the first step has $v_1 = (1 - 0.999) g_1^2 = 0.001 \cdot g_1^2$. Without bias correction, you're dividing by $\sqrt{0.001 \cdot g_1^2} \approx 0.032 |g_1|$ instead of something closer to $|g_1|$. That's a roughly 30× larger step than intended. Early steps can explode, leading to divergence.

Adam's bias correction fixes this. In the first step, $\hat{v}_1 = v_1 / (1 - \beta_2^1) = v_1 / 0.001 = g_1^2$. The estimate is corrected to what it should be if you'd been running forever. RMSProp with momentum doesn't have this machinery, so it's fragile when $\beta_2$ is high.
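
The arithmetic for the first step can be reproduced directly. This sketch isolates the second-moment effect (no momentum, raw gradient in the numerator); the function name and the example gradient are assumptions.

```python
import math

def first_step_sizes(g1, eta=1e-3, beta2=0.999, eps=1e-8):
    """Size of the first update with and without second-moment bias correction."""
    v1 = (1 - beta2) * g1 * g1                     # v_0 = 0, so only the new term survives
    uncorrected = eta * abs(g1) / (math.sqrt(v1) + eps)
    v1_hat = v1 / (1 - beta2 ** 1)                 # bias-corrected estimate, equals g1^2
    corrected = eta * abs(g1) / (math.sqrt(v1_hat) + eps)
    return uncorrected, corrected

uncorrected, corrected = first_step_sizes(g1=2.0)
ratio = uncorrected / corrected
```

The ratio comes out at $1/\sqrt{1-\beta_2} \approx 31.6$ regardless of the gradient's size, and it grows without bound as $\beta_2 \to 1$.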

The practical upshot: Adam's "momentum on raw gradients, then normalize" design plus bias correction makes it more stable, especially for sparse gradients or when you need $\beta_2$ close to 1. RMSProp+momentum can work, but it's easier to accidentally blow up, and the momentum buffer is tracking a different quantity (rescaled gradients vs raw gradients). Adam isn't just "RMSProp plus momentum bolted on." The combination is tighter than that.

## Adam vs AdaGrad

We compared Adam to RMSProp+momentum, its "sibling" that appeared around the same time. But what about AdaGrad, the original adaptive method that started this whole family? AdaGrad came first ([Duchi et al., 2011](https://jmlr.org/papers/v12/duchi11a.html)), and it's the one that introduced the idea of per-parameter learning rates based on gradient history. How does Adam relate to its ancestor?

Let's put the formulae side by side first.

**AdaGrad:**

$$v_t = v_{t-1} + g_t^2 = \sum_{i=1}^{t} g_i^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t$$

**Adam:**

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

The structural difference is clear: AdaGrad has no first moment (no momentum), and its second moment is a cumulative sum rather than an EMA. Adam has both moments as EMAs, plus bias correction.

But here's what's interesting: AdaGrad is actually a *limiting case* of Adam. You can recover AdaGrad exactly from Adam by choosing specific hyperparameters. This isn't just a curiosity, it reveals why Adam's bias correction exists in the first place.

**Setting up the limit:**

Start with Adam and set $\beta_1 = 0$. No momentum. The first moment collapses to just the current gradient:

$$m_t = (1 - 0) \cdot g_t = g_t$$

Now we only have the second moment to think about. What happens as $\beta_2 \to 1$?

Adam's second moment, when you unroll it, is:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} g_i^2$$

As $\beta_2 \to 1$, two things happen: the weights $\beta_2^{t-i}$ all approach 1 (every past gradient gets equal weight, infinitely long memory), and the prefactor $(1 - \beta_2) \to 0$. These cancel in a specific way. Taking the limit carefully:

$$\lim_{\beta_2 \to 1} \hat{v}_t = \frac{1}{t} \sum_{i=1}^{t} g_i^2$$

So Adam's bias-corrected $\hat{v}_t$ becomes the *average* of squared gradients. But AdaGrad uses the *sum*, not the average. To bridge this gap, you need one more adjustment: anneal the learning rate as $\eta_t = \eta \cdot t^{-1/2}$.

Then Adam's update becomes:

$$w_{t+1} = w_t - \eta \cdot t^{-1/2} \cdot \frac{g_t}{\sqrt{\frac{1}{t}\sum_{i=1}^{t} g_i^2} + \epsilon}$$

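The $t^{-1/2}$ from the learning rate and the $\sqrt{1/t}$ inside the denominator cancel (multiply numerator and denominator by $\sqrt{t}$):

$$= w_t - \frac{\eta \cdot g_t}{\sqrt{\sum_{i=1}^{t} g_i^2} + \epsilon'}$$

with $\epsilon' = \epsilon \sqrt{t}$, which is exactly AdaGrad up to that rescaled stability constant.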

**The punchline:** AdaGrad = Adam with no momentum ($\beta_1 = 0$), infinitely long second-moment memory ($\beta_2 \to 1$), and a specific learning rate schedule ($\eta_t \propto t^{-1/2}$).
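
The correspondence can be checked numerically by pushing $\beta_2$ very close to 1. This is a sketch under simplifying assumptions: scalar parameter, $\epsilon = 0$, $\beta_1 = 0$ (so $\hat{m}_t = g_t$), and an arbitrary short gradient sequence. Function names are illustrative.

```python
import math

def adagrad_updates(grads, eta=0.1):
    """AdaGrad update magnitudes eta * g_t / sqrt(sum of squared gradients)."""
    v, out = 0.0, []
    for g in grads:
        v += g * g
        out.append(eta * g / math.sqrt(v))
    return out

def adam_limit_updates(grads, eta=0.1, beta2=1 - 1e-6):
    """Adam with beta1 = 0, beta2 near 1, and learning rate eta / sqrt(t)."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)      # bias correction turns v into an average
        eta_t = eta / math.sqrt(t)        # the t^(-1/2) schedule
        out.append(eta_t * g / math.sqrt(v_hat))
    return out

grads = [1.0, -0.5, 2.0, 0.3, -1.5]       # arbitrary example gradients
ada = adagrad_updates(grads)
adam = adam_limit_updates(grads)
max_gap = max(abs(a - b) for a, b in zip(ada, adam))
```

With $\beta_2 = 1 - 10^{-6}$ the two update sequences agree to several decimal places, and the gap shrinks further as $\beta_2$ moves closer to 1.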

**Why bias correction is essential for this limit:**

This correspondence only works *with* bias correction. Here's why.

Without bias correction, Adam's raw $v_t$ with $\beta_2$ close to 1 looks like:

$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} g_i^2 \approx (1 - \beta_2) \cdot (\text{sum of } t \text{ terms})$$

With $\beta_2 = 0.9999$, that prefactor is $0.0001$. Early in training you're dividing by $\sqrt{0.0001 \cdot (\text{something})}$, which at the first step is 100× smaller than it should be. Your steps are 100× larger than intended. As $\beta_2 \to 1$, this bias becomes infinite, and your parameter updates explode.

Adam's bias correction rescales $v_t$ by $1/(1 - \beta_2^t)$, which exactly compensates for the shrinking prefactor. The corrected $\hat{v}_t$ behaves like a proper average, and the limit to AdaGrad works out cleanly.

RMSprop doesn't have bias correction. This is why you can't push $\beta_2$ too close to 1 in RMSprop without things blowing up. Adam's bias correction was designed precisely to allow this regime: to let you smoothly interpolate between "short memory" (small $\beta_2$, RMSprop-like) and "infinite memory" (large $\beta_2$, AdaGrad-like) without numerical disasters.

**The family tree:**

| Optimizer | First moment | Second moment | Bias correction |
|-----------|--------------|---------------|-----------------|
| SGD | None | None | N/A |
| AdaGrad | None | Cumulative sum (infinite memory) | N/A (sum is unbiased) |
| RMSprop | None | EMA (finite memory) | No |
| Adam | EMA (momentum) | EMA (finite memory) | Yes |
| Adam $\to$ AdaGrad | $\beta_1 = 0$ | $\beta_2 \to 1$ + LR schedule | Required for limit to work |

Adam is the general-purpose framework. AdaGrad and RMSprop are special cases or close relatives, and the bias correction is what makes Adam flexible enough to cover the whole spectrum. You can think of $\beta_2$ as a "memory dial": turn it down and you get RMSprop-like behavior (adapt quickly, forget the past), turn it up and you approach AdaGrad-like behavior (never forget). Adam, with bias correction, lets you turn that dial smoothly without blowing up at either extreme.

## Why Adam Doesn't Explode

There's a property of Adam that often gets overlooked: its step sizes are naturally bounded. No matter how steep the gradient, Adam won't take a crazy huge step that blows up your optimization. This is the "speed limit" baked into the algorithm.

Look at the effective step for a single parameter (ignoring $\epsilon$ and, for the moment, bias correction):

$$\Delta_t = \eta \cdot \frac{m_t}{\sqrt{v_t}}$$

The ratio $m_t / \sqrt{v_t}$ is asking: "what's the average gradient relative to its typical magnitude?" And here's the key observation: for any number $g$,

$$\frac{|g|}{\sqrt{g^2}} = \frac{|g|}{|g|} = 1$$

When you divide a signal by its own root-mean-square, you get something normalized. The numerator $m_t$ is roughly an average of recent gradients. The denominator $\sqrt{v_t}$ is roughly the RMS of recent gradients. So in typical conditions the ratio has magnitude at most about 1, and it sits close to $\pm 1$ when recent gradients agree in sign. This means the actual step size is at most approximately $\eta$, regardless of the raw gradient magnitude.

Think about what this implies. In standard gradient descent, step size = $\eta \times$ gradient. If gradient is huge, you take a huge step. You might overshoot and fly off the cliff. In Adam, step size $\approx \eta \times (\pm 1)$. The gradient gets normalized by its own scale before you move.

You can think of this as a trust region. Adam is saying: "No matter how steep the gradient looks, I won't step further than approximately $\eta$." Large gradients don't cause explosions. Small gradients don't get drowned out, they still take steps of size $\sim \eta$. The optimizer becomes somewhat scale-invariant.
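
A sketch of the scale invariance: run Adam's moment updates on a constant gradient at three very different scales and record the step sizes. The constant-gradient setup and the function name are assumptions chosen to make the effect easy to see.

```python
import math

def adam_step_sizes(grad_scale, steps=50, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of each Adam step when every gradient equals grad_scale."""
    m = v = 0.0
    sizes = []
    for t in range(1, steps + 1):
        g = grad_scale
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        sizes.append(abs(eta * m_hat / (math.sqrt(v_hat) + eps)))
    return sizes

sizes_small = adam_step_sizes(1e-3)
sizes_unit = adam_step_sizes(1.0)
sizes_large = adam_step_sizes(1e3)
```

Plain gradient descent with the same $\eta$ would take steps of $10^{-6}$, $10^{-3}$, and $1$ on these three streams. Adam's steps all stay at about $\eta = 10^{-3}$.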

The math behind the bound uses Cauchy-Schwarz. Since $m_t$ and $v_t$ are exponentially weighted averages, we have $m_t \approx \mathbb{E}[g]$ and $v_t \approx \mathbb{E}[g^2]$. A standard inequality says:

$$|\mathbb{E}[g]| \leq \sqrt{\mathbb{E}[g^2]}$$

Rearranging:

$$\frac{|\mathbb{E}[g]|}{\sqrt{\mathbb{E}[g^2]}} \leq 1$$

So under typical conditions, $|m_t / \sqrt{v_t}| \leq 1$, meaning $|\Delta_t| \leq \eta$. The step is bounded by the learning rate.

There's one edge case: extreme sparsity. Suppose gradients were zero forever, then suddenly a non-zero gradient $g$ appears at step $t$. With no history to smooth against:

$$m_t = (1 - \beta_1) \cdot g$$
$$v_t = (1 - \beta_2) \cdot g^2$$

The ratio becomes:

$$\frac{m_t}{\sqrt{v_t}} = \frac{(1-\beta_1) \cdot g}{\sqrt{(1-\beta_2) \cdot g^2}} = \frac{(1-\beta_1)}{\sqrt{1-\beta_2}} \cdot \text{sign}(g)$$

The $g$ cancels. With defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$:

$$(1 - \beta_1) = 0.1$$
$$\sqrt{1 - \beta_2} = \sqrt{0.001} \approx 0.0316$$
$$\text{ratio} = 0.1 / 0.0316 \approx 3.16$$

So even in the worst case, a single spike after complete silence, the step is bounded by $\sim 3.16 \eta$. Not infinity. Not catastrophe. Just a somewhat larger step, and one whose size doesn't depend on how big the spike was.

| Scenario | Why | Bound on $\lvert\Delta_t\rvert$ |
|----------|-----|----------------------|
| Typical (gradients vary over time) | $\lvert m_t/\sqrt{v_t}\rvert \lesssim 1$ by Cauchy-Schwarz | $\lesssim \eta$ |
| Extreme sparsity (all zeros then spike) | No history to smooth $v_t$ | $\eta \cdot (1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16\eta$ with defaults |
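
A minimal sketch to sanity-check both rows of the table in plain Python. The helper name `adam_ratio` and the tiny `eps` guard are assumptions of this sketch; it tracks the raw (not bias-corrected) $m_t$ and $v_t$, matching the formulas used in this section.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-12):
    """Return m_t / sqrt(v_t) after feeding `grads` through Adam's two EMAs.

    Uses the raw (not bias-corrected) moments; `eps` only guards against
    0/0 while the history is still all zeros.
    """
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    return m / (math.sqrt(v) + eps)

# Steady gradient with a long history: the ratio settles at 1.
steady = adam_ratio([2.0] * 20_000)

# Extreme sparsity: a long run of zeros, then one spike (small or enormous).
spike_small = adam_ratio([0.0] * 1_000 + [5.0])
spike_large = adam_ratio([0.0] * 1_000 + [5.0e6])

worst_case = (1 - 0.9) / math.sqrt(1 - 0.999)
print(steady, spike_small, spike_large, worst_case)
```

Both spikes land on the same ratio as `worst_case`, which is the point: the size of the spike cancels, only $\beta_1$ and $\beta_2$ matter.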

### Why Setting $\eta$ Becomes Easy

This bounded-step property has a practical payoff: you can reason about $\eta$ using prior knowledge instead of trial-and-error. Suppose you initialize parameters at $\theta_0 = 0$ and you believe the optimum lies within distance $D \approx 100$ from there. You plan to train for $T \approx 10{,}000$ steps. Back-of-envelope:

- Need to cover distance $\sim 100$
- Have $\sim 10{,}000$ steps
- Each step travels at most $\sim \eta$

So $\eta \approx 100 / 10{,}000 = 0.01$ is a reasonable starting point. You're not guessing blindly, you're using geometry.

This reasoning is much harder with vanilla SGD, where step size = $\eta \times$ gradient, and the gradient magnitude is hard to predict in advance. With Adam, $\eta$ directly controls "how far can I travel per step," which you can estimate from your prior beliefs about the parameter space.

### The Signal-to-Noise Interpretation

There's another way to read the ratio $m_t / \sqrt{v_t}$: as a signal-to-noise ratio (SNR).

$$\text{SNR} = \frac{m_t}{\sqrt{v_t}} \approx \frac{\mathbb{E}[g]}{\sqrt{\mathbb{E}[g^2]}}$$

The numerator $m_t \approx \mathbb{E}[g]$ is the "signal", the average gradient direction. The denominator $\sqrt{v_t} \approx \sqrt{\mathbb{E}[g^2]}$ is the "noise", the typical gradient magnitude, which includes variance from inconsistent directions.

**High SNR:** Gradients consistently point the same way. The average is large, the RMS isn't much bigger. $\text{SNR} \approx 1$, so you take a full step of size $\sim \eta$.

**Low SNR:** Gradients are noisy or inconsistent, some point left, some point right. The average cancels out (small $m_t$), but the RMS stays large (individual gradients are still big). $\text{SNR} \approx 0$, so you take a tiny step.

This is exactly what you want. When the gradient signal is clear, commit. When it's ambiguous, hesitate. Adam does this automatically.
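
Here is a small sketch of the two regimes, assuming a toy stream of scalar gradients. The function name `bias_corrected_ratio` and the two gradient streams are made up for illustration; the moments are bias-corrected so the consistent case comes out at exactly 1.

```python
import math

def bias_corrected_ratio(grads, beta1=0.9, beta2=0.999):
    """Return m_hat / sqrt(v_hat) after the last gradient in `grads`."""
    m, v = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / math.sqrt(v_hat)

consistent = [1.0] * 200                       # always points the same way
flipping = [(-1.0) ** k for k in range(200)]   # +1, -1, +1, -1, ...

snr_high = bias_corrected_ratio(consistent)
snr_low = bias_corrected_ratio(flipping)
print(f"consistent: {snr_high:.3f}, flipping: {snr_low:.3f}")
```

Both streams have the same RMS (every gradient has magnitude 1), so the only thing that changes the step is how well the signs agree.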

Near an optimum, gradients become tiny and noisy (you're in a flat region where noise dominates). Tiny alone doesn't shrink Adam's step, though: if you scale all gradients by a constant $c > 0$, you get $(c \cdot m_t) / (c \cdot \sqrt{v_t}) = m_t / \sqrt{v_t}$. The $c$ cancels completely. What shrinks the step near an optimum is the noise, since the average cancels out while the RMS doesn't. Adam's step size depends on the *shape* of the gradient distribution, not its scale.

### Scale Invariance Across Layers

This cancellation property solves a nasty problem in deep networks: different layers have vastly different gradient scales. In a deep net, gradients in early layers might be $\sim 0.001$ while gradients in later layers are $\sim 1000$ (a million-fold difference). With SGD:

| Layer | Gradient scale | SGD step |
|-------|----------------|----------|
| Layer 1 (early) | $\sim 0.001$ | $\eta \times 0.001$ = tiny |
| Layer 7 (late) | $\sim 1000$ | $\eta \times 1000$ = huge |

You'd need careful per-layer learning rates to avoid the early layers crawling while the late layers explode. With Adam:

| Layer | Gradient scale | Adam step |
|-------|----------------|-----------|
| Layer 1 (early) | $\sim 0.001$ | $\eta \times (\pm 1) \approx \eta$ |
| Layer 7 (late) | $\sim 1000$ | $\eta \times (\pm 1) \approx \eta$ |

One $\eta$ works for all layers. The normalization by $\sqrt{v_t}$ automatically compensates for scale differences. This is why Adam "just works" across wildly different architectures without per-layer tuning.
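
The two tables can be reproduced with a short sketch. It assumes two scalar "layers" whose gradients share the same pattern but differ in scale by a factor of $10^6$; the helper `adam_steps`, the random pattern, and the omission of $\epsilon$ are all choices made for this illustration.

```python
import math
import random

def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999):
    """Return Adam's step sizes for one scalar parameter (eps omitted)."""
    m, v, steps = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / math.sqrt(v_hat))
    return steps

random.seed(0)
pattern = [random.gauss(1.0, 0.5) for _ in range(100)]  # shared "shape"

early_layer = [1e-3 * g for g in pattern]   # gradients around 0.001
late_layer = [1e3 * g for g in pattern]     # gradients around 1000

adam_early = adam_steps(early_layer)
adam_late = adam_steps(late_layer)
sgd_early = [1e-3 * g for g in early_layer]  # SGD step = lr * gradient
sgd_late = [1e-3 * g for g in late_layer]

print(adam_early[-1], adam_late[-1])   # essentially identical
print(sgd_early[-1], sgd_late[-1])     # differ by a factor of about 1e6
```

With a nonzero $\epsilon$ in the denominator the match is no longer exact for very small gradients, which is one reason $\epsilon$ is itself a hyperparameter worth knowing about.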

### The Three Properties

| Property | What It Means | Why It Helps |
|----------|---------------|--------------|
| Bounded steps | $\lvert\Delta_t\rvert \lesssim \eta$ regardless of gradient | Easy to choose $\eta$ from prior knowledge |
| SNR $\to 0$ near optimum | Steps shrink when gradients are noisy relative to their mean | A form of automatic annealing, even before any schedule |
| Scale invariance | Rescaling gradients doesn't change step | One $\eta$ works across all parameters |

These three properties together explain why Adam "just works" with $\eta = 0.001$ across wildly different architectures and scales. You're not fighting the optimizer; the optimizer is normalizing itself.

### What Adam's Scale Invariance Is *Not*

There's a subtle but important distinction that often gets lost in the "Adam just works" narrative: Adam normalizes by *gradient magnitude*, not *curvature*. These are different things.

$\sqrt{v_t}$ is the RMS of recent gradients, purely first-order information. Curvature means second derivatives, the Hessian. They're not the same, and conflating them leads to wrong intuitions about what Adam is actually doing.

Here's the mental trap: you might read "one $\eta$ works for all layers" and think Adam is doing something curvature-aware, like taking larger steps in flat regions and smaller steps in steep ravines. That would be ideal, it's exactly what Newton's method does by scaling by $H^{-1}$. But Adam doesn't do this.

What Adam actually does is compensate for *gradient scale differences*. If layer A has gradients 100× larger than layer B (due to initialization, activation patterns, architecture quirks), the $\sqrt{v_t}$ normalization prevents layer A from dominating the updates. You don't need to hand-tune per-layer learning rates. That's the convenience.

But gradient magnitude and curvature are only loosely correlated:

| Scenario | Gradient | Curvature | What you want | What Adam does |
|----------|----------|-----------|---------------|----------------|
| Near a saddle point | Small | High | Small step (dangerous region) | Large-ish step (low $v_t$) |
| On a gentle slope far from optimum | Large | Low | Large step (safe to move fast) | Small-ish step (high $v_t$) |

A parameter can have small gradients but high curvature (you're balanced on a ridge, moving is risky). A parameter can have large gradients but low curvature (you're on a gentle slope far from the optimum, big steps are fine). Adam treats these cases backwards relative to what curvature-aware optimization would do.

So the honest framing is: Adam's "one $\eta$ works everywhere" is a *convenience feature* about gradient scale normalization, not a claim about optimal curvature adaptation. It's practically useful, less tuning, more robust across architectures, but it's not theoretically ideal.

If you actually want curvature adaptation, you need optimizers that explicitly estimate second-order information: K-FAC, Shampoo, natural gradient methods, or things like Sophia that use diagonal Hessian estimates. These come with significant computational cost, which is exactly why Adam's "good enough" approximation remains the workhorse. The gradient-magnitude normalization isn't solving the curvature problem; it's solving a different problem (scale mismatch across parameters) well enough that you can often ignore the curvature problem in practice.
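
To make the gradient-versus-curvature distinction concrete, here is a toy sketch with two 1-D quadratics $f(x) = \tfrac{1}{2} h x^2$ that have very different curvature $h$ but the same gradient at the current point. The function names and the specific numbers are assumptions chosen for illustration, and only Adam's very first step is modeled.

```python
def first_adam_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first step for a scalar gradient g."""
    m_hat = ((1 - beta1) * g) / (1 - beta1)        # bias-corrected: equals g
    v_hat = ((1 - beta2) * g * g) / (1 - beta2)    # bias-corrected: equals g^2
    return lr * m_hat / (v_hat ** 0.5 + eps)

def newton_step(g, h):
    """Newton step for a 1-D quadratic with curvature h: g / h."""
    return g / h

# Both quadratics are evaluated at a point where the gradient equals 1.
flat = {"h": 0.01, "x": 100.0}    # gentle bowl, far from the minimum
sharp = {"h": 100.0, "x": 0.01}   # steep bowl, very close to the minimum

for name, q in (("flat", flat), ("sharp", sharp)):
    g = q["h"] * q["x"]  # gradient of 0.5 * h * x^2
    print(name, first_adam_step(g), newton_step(g, q["h"]))
```

Adam takes the same step in both bowls because it only sees the gradient. The Newton step differs by four orders of magnitude because it also sees $h$.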

## Convergence

Now for the part that makes optimization theorists happy: does Adam actually converge? And what does "converge" even mean when you're doing stochastic optimization on a non-convex loss surface?

### The Regret Framework

Here's the setup. Imagine optimization as a game against nature:

- Step 1: You pick $\theta_1$, then nature reveals $f_1$, you pay $f_1(\theta_1)$
- Step 2: You pick $\theta_2$, then nature reveals $f_2$, you pay $f_2(\theta_2)$
- ...
- Step T: You pick $\theta_T$, then nature reveals $f_T$, you pay $f_T(\theta_T)$

You never see the cost function until after you've committed. This models real training: each $f_t$ is the loss on mini-batch $t$, you update parameters before seeing the next batch, and the data might even be non-stationary.

After $T$ rounds, you look back and ask: "What if I had known all the functions $f_1 \ldots f_T$ upfront and just picked the single best fixed $\theta^*$?"

$$\text{Regret}(T) = \sum_{t=1}^{T} f_t(\theta_t) - \sum_{t=1}^{T} f_t(\theta^*)$$

That's regret: how much extra you paid for not being omniscient. It's a fair benchmark because it doesn't require predicting the future, it works even if data is adversarial, and it makes no assumptions about the individual $f_t$.

### Adam's Guarantee: $O(\sqrt{T})$ Regret

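The Adam paper gives a regret bound: under certain assumptions, regret grows as $O(\sqrt{T})$. (A later analysis by Reddi et al., 2018, found a flaw in the original proof and constructed convex problems where Adam fails to converge, which motivated AMSGrad. Treat the result as a statement of design intent rather than an airtight theorem.) Why is this good?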

$$\text{Average regret per step} = \frac{O(\sqrt{T})}{T} = O\left(\frac{1}{\sqrt{T}}\right) \to 0$$

As $T \to \infty$, average regret vanishes. You're getting closer to optimal over time:

| Steps $T$ | Average regret |
|-----------|----------------|
| 100 | $\sim 0.1$ |
| 10,000 | $\sim 0.01$ |
| 1,000,000 | $\sim 0.001$ |

For context: $O(T)$ regret would be bad (constant gap forever, never improves). $O(\sqrt{T})$ is actually optimal for general convex online learning without further assumptions. You can do better ($O(\log T)$) with stronger assumptions, but $O(\sqrt{T})$ is the baseline "good" result.
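
A toy sketch of the regret bookkeeping. It uses plain online gradient descent with a $1/\sqrt{t}$ step size as a stand-in (not Adam), on made-up losses $f_t(\theta) = (\theta - c_t)^2$ whose targets $c_t$ alternate between $+1$ and $-1$; the function name and the loss family are assumptions for illustration.

```python
import math

def online_gd_regret(T, lr=0.1):
    """Regret of online gradient descent on f_t(theta) = (theta - c_t)^2.

    The targets alternate c_t = +1, -1, +1, ... so no single theta fits
    every round; step sizes decay as lr / sqrt(t).
    """
    targets = [1.0 if t % 2 == 1 else -1.0 for t in range(1, T + 1)]
    theta, paid = 0.0, 0.0
    for t, c in enumerate(targets, start=1):
        paid += (theta - c) ** 2                 # commit to theta, then see f_t
        grad = 2.0 * (theta - c)
        theta -= (lr / math.sqrt(t)) * grad
    best_fixed = sum(targets) / T                # minimizer of the summed quadratics
    best_cost = sum((best_fixed - c) ** 2 for c in targets)
    return paid - best_cost

for T in (100, 1_000, 10_000, 100_000):
    regret = online_gd_regret(T)
    print(f"T={T:>7,}  regret={regret:8.2f}  regret/T={regret / T:.5f}")
```

Total regret keeps growing with $T$, but regret per step keeps shrinking, which is the sublinear behavior the table describes.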

### The Assumptions (And What They Mean)

The theorem doesn't come for free. Here's what you need:

**1. Bounded gradients.**

$$\|\nabla f_t(\theta)\|_2 \leq G, \quad \|\nabla f_t(\theta)\|_\infty \leq G_\infty$$

No gradient can be arbitrarily large. The $L_2$ bound caps total gradient energy; the $L_\infty$ bound caps any single coordinate. This is the "no cliffs" assumption: if there's a direction where the loss drops infinitely steeply, all bets are off. In practice, you enforce this with gradient clipping. Raw cross-entropy with extreme predictions can violate this; clipping fixes it.

**2. Bounded parameter distance.**

$$\|\theta_n - \theta_m\|_2 \leq D, \quad \|\theta_n - \theta_m\|_\infty \leq D_\infty$$

Your parameter trajectory stays in a finite region. If $\theta_t$ can wander infinitely far from the optimum $\theta^*$, regret could be infinite even for good algorithms. Think of it as: you're searching for treasure on a finite island, not the infinite ocean. Adam's bounded step sizes (discussed earlier) help ensure this informally: if $|\Delta_t| \lesssim \eta$, total distance traveled is roughly $\sum \eta / \sqrt{t} \approx 2\eta\sqrt{T}$, which grows but sublinearly.

**3. The $\beta$ condition.**

$$\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$$

This ensures momentum doesn't overpower the variance normalization. With defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$:

$$\frac{(0.9)^2}{\sqrt{0.999}} = \frac{0.81}{0.9995} \approx 0.81 < 1 \quad \checkmark$$

If $\beta_1$ is too large relative to $\beta_2$, the momentum term $m_t$ builds up faster than $\sqrt{v_t}$ can track, and the ratio $m_t / \sqrt{v_t}$ becomes unbounded. The condition keeps the two in balance.

**4. Learning rate decay.**

$$\eta_t = \frac{\eta}{\sqrt{t}}$$

This is the Goldilocks rate. Decay too fast (like $1/t$) and the total distance you can travel grows only like $\sum 1/t \approx \ln T$, so reaching a far-away $\theta^*$ takes impractically many steps. Decay too slow (constant $\eta$) and you never settle down, oscillating around the optimum forever with $O(T)$ regret. The $1/\sqrt{t}$ rate is just right: total distance you can travel is $\sum 1/\sqrt{t} \approx 2\sqrt{T} \to \infty$, so you can reach anywhere, but steps shrink so you eventually converge.
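
A quick numeric check of how far each schedule lets you travel, per unit of $\eta$, over $T = 10{,}000$ steps. The helper name `total_travel` is an assumption of this sketch.

```python
import math

def total_travel(schedule, T):
    """Sum of step-size multipliers over T steps (distance budget per unit eta)."""
    return sum(schedule(t) for t in range(1, T + 1))

T = 10_000
constant = total_travel(lambda t: 1.0, T)                  # grows like T
inv_sqrt = total_travel(lambda t: 1.0 / math.sqrt(t), T)   # grows like 2*sqrt(T)
inv_t = total_travel(lambda t: 1.0 / t, T)                 # grows like ln(T)

print(f"constant:  {constant:.1f}")
print(f"1/sqrt(t): {inv_sqrt:.1f}  (2*sqrt(T) = {2 * math.sqrt(T):.1f})")
print(f"1/t:       {inv_t:.1f}  (ln(T) = {math.log(T):.1f})")
```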

**5. Momentum decay.**

$$\beta_{1,t} = \beta_1 \cdot \lambda^{t-1}, \quad \lambda \in (0, 1)$$

Early in training, high momentum ($\beta_1 \approx 0.9$) smooths noise and accelerates through flat regions. Late in training, near the optimum, gradients are tiny and noisy. If momentum stays high, $m_t$ remembers stale gradients and you overshoot. Decaying $\beta_1 \to 0$ makes $m_t \approx g_t$, giving fine-grained control at the end.

### What This Actually Means in Practice

Here's the uncomfortable truth: the theorem is for *convex* online learning. Deep learning is *non-convex*. The assumptions (especially bounded gradients, decaying learning rate, decaying momentum) aren't how people actually run Adam. In practice:

- Learning rate is often constant or follows a cosine schedule, not $1/\sqrt{t}$
- Momentum $\beta_1$ is constant at 0.9, not decayed
- Gradient clipping is common but not universal
- The loss surface has saddle points, local minima, and weird topology

So what's the theorem good for? It tells you that Adam's design isn't arbitrary. The specific combination of momentum EMA, squared-gradient EMA, bias correction, and the ratio $m_t / \sqrt{v_t}$ has provable properties in a well-understood setting. That's not nothing. It's evidence that the algorithm is doing something sensible, even if the guarantees don't directly transfer to your 70B parameter transformer.

The practical takeaway: Adam's convergence story is more about *not diverging* than about guaranteeing you reach a global optimum. The bounded steps, the SNR interpretation, the scale invariance—these properties make Adam robust. It won't blow up. It won't stall completely. It makes steady progress in reasonable directions. That's why it's the workhorse: not because it's theoretically optimal for deep learning (it isn't), but because it's reliably okay across a huge range of problems where "reliably okay" beats "occasionally great but often explodes."

# AdaMax

Adam normalizes gradients by tracking squared gradients and dividing by their root mean square. That $\sqrt{v_t}$ is basically an $L_2$ norm over recent gradient history. AdaMax asks: what if you used $L_\infty$ instead?

Mathematicians have a family of ways to measure vector size, called $L_p$ norms:

$$\|x\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p}$$

$p = 2$ gives you the Euclidean norm. $p = 1$ gives you the Manhattan distance. As $p$ gets larger, the biggest element dominates more and more. At $p = \infty$, only the max matters. Take $[3, 7, 2]$:

| $p$ | Result |
|-----|--------|
| 1 | 12 |
| 2 | $\approx 7.87$ |
| 10 | $\approx 7.0001$ |
| 100 | $\approx 7.0000$ (off by $\sim 10^{-38}$) |
| $\infty$ | 7 exactly |

See what's happening? The 7 eats everything else as $p$ grows.
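
The table is easy to recompute. This is a minimal sketch; `lp_norm` is a hypothetical helper written for this example rather than a library call.

```python
def lp_norm(values, p):
    """L_p norm for finite p; for p = inf, return the largest magnitude."""
    mags = [abs(x) for x in values]
    if p == float("inf"):
        return max(mags)
    return sum(m ** p for m in mags) ** (1.0 / p)

x = [3, 7, 2]
for p in (1, 2, 10, 100, float("inf")):
    print(f"p={p}: {lp_norm(x, p):.6f}")
```

Note how quickly the norm collapses onto the max: by $p = 10$ the 3 and the 2 contribute only in the fourth decimal place.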

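So try Adam with different $p$ values? The authors report that large $p$ becomes numerically unstable. If $p = 50$ and $g_t = 5$, you're computing $5^{50} \approx 10^{35}$, which is already close to the float32 limit of about $3.4 \times 10^{38}$; a gradient of 10 overflows it outright. And even when nothing overflows, small gradients raised to the 50th power underflow to zero and vanish from the sum. Large finite $p$ is a dead end.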

But you can take the limit $p \to \infty$ analytically, and the result is clean. In the $L_p$ version of Adam:

$$v_t = \beta_2^p \cdot v_{t-1} + (1 - \beta_2^p) \cdot |g_t|^p$$

Unroll it:

$$v_t = (1 - \beta_2^p) \cdot [\beta_2^{p(t-1)} \cdot |g_1|^p + \beta_2^{p(t-2)} \cdot |g_2|^p + \cdots + |g_t|^p]$$

Define $u_t = \lim_{p \to \infty} v_t^{1/p}$. Since $\beta_2 < 1$, we have $\beta_2^p \to 0$ as $p \to \infty$, so $(1 - \beta_2^p)^{1/p} \to 1$. That prefactor drops out. What's left:

$$u_t = \lim_{p \to \infty} [(\beta_2^{t-1} |g_1|)^p + (\beta_2^{t-2} |g_2|)^p + \cdots + |g_t|^p]^{1/p}$$

That's an $L_p$ norm of the vector $[\beta_2^{t-1}|g_1|, \beta_2^{t-2}|g_2|, \ldots, |g_t|]$. As $p \to \infty$, the $L_p$ norm becomes the max:

$$u_t = \max(\beta_2^{t-1}|g_1|, \beta_2^{t-2}|g_2|, \ldots, \beta_2|g_{t-1}|, |g_t|)$$

Now the trick. What's $\beta_2 \cdot u_{t-1}$?

$$u_{t-1} = \max(\beta_2^{t-2}|g_1|, \ldots, |g_{t-1}|)$$

Multiply by $\beta_2$:

$$\beta_2 \cdot u_{t-1} = \max(\beta_2^{t-1}|g_1|, \ldots, \beta_2|g_{t-1}|)$$

Compare to $u_t$. The only difference is $u_t$ also has $|g_t|$ in the max. So:

$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$

No squares, no square roots, no sums. Just a max of two numbers.
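
A short check that the recursion really matches the explicit max over the decayed history. Both function names and the sample history are invented for this sketch.

```python
def decayed_max_explicit(grads, beta2):
    """u_t as the max over the whole decayed history."""
    t = len(grads)
    return max(beta2 ** (t - i) * abs(g) for i, g in enumerate(grads, start=1))

def decayed_max_recursive(grads, beta2):
    """u_t via the one-line recursion u_t = max(beta2 * u_{t-1}, |g_t|)."""
    u = 0.0
    for g in grads:
        u = max(beta2 * u, abs(g))
    return u

history = [0.5, 4.0, -2.0, 0.1]
print(decayed_max_explicit(history, 0.9), decayed_max_recursive(history, 0.9))
```

The explicit version needs the full gradient history; the recursive one needs a single number of state per parameter, which is what makes AdaMax practical.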

Adam needs bias correction because $v_0 = 0$ and the EMA starts biased toward zero. With $\beta_2 = 0.999$: $v_1 = 0.001 \cdot g_1^2$, which is 1000x smaller than it should be. AdaMax doesn't have this problem:

$$u_1 = \max(\beta_2 \cdot 0, |g_1|) = |g_1|$$

The max with zero just gives you $|g_1|$. No shrinkage, no bias, no correction needed for $u_t$. (The first moment $m_t$ still starts at zero, so it keeps its usual bias correction.)

Adam's step size is bounded by $\eta$ in typical cases, but in extreme sparsity (all zeros then a spike) you get $|\Delta_t| \leq \eta \cdot (1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16\eta$. AdaMax is simpler: $|\Delta_t| \leq \eta$, with no special case for sparsity. The step is $\eta \cdot \hat{m}_t / u_t$ (the first moment keeps its bias correction), and $u_t \geq |g_t|$ by construction, with older gradients similarly covered by the decayed max, so the ratio stays bounded.
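
Putting the pieces together, here is a sketch of the AdaMax update for a single scalar parameter, following the recursion above and keeping the bias correction on $m_t$ (the first moment still starts at zero). The function name, the toy objective $f(w) = w^2$, and the zero-gradient guard are assumptions of this sketch; `lr=0.002` follows the default suggested for AdaMax in the Adam paper.

```python
def adamax_minimize(grad_fn, w, steps, lr=0.002, beta1=0.9, beta2=0.999):
    """Run AdaMax on a scalar parameter; return the final w and every step taken."""
    m, u, deltas = 0.0, 0.0, []
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # first moment, same EMA as Adam
        u = max(beta2 * u, abs(g))           # decayed max replaces sqrt(v_t)
        if u == 0.0:                         # no nonzero gradient seen yet
            deltas.append(0.0)
            continue
        delta = -(lr / (1 - beta1 ** t)) * m / u   # only m_t is bias-corrected
        w += delta
        deltas.append(delta)
    return w, deltas

# Toy objective f(w) = w^2 with gradient 2w, started at w = 1.
w_final, deltas = adamax_minimize(lambda w: 2.0 * w, w=1.0, steps=200)
print(w_final, max(abs(d) for d in deltas))
```

The largest step in the run stays at about `lr`, and the very first step is exactly `lr` in size because $\hat{m}_1 = g_1$ and $u_1 = |g_1|$.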

The real difference is how they aggregate history. Adam blends squared gradients smoothly via EMA. If your last five gradient magnitudes were $[1, 3, 2, 8, 1]$, Adam mixes them all into one RMS estimate. AdaMax takes the max. With $\beta_2 = 0.999$:

$$u_t = \max(0.999^4 \cdot 1, 0.999^3 \cdot 3, 0.999^2 \cdot 2, 0.999 \cdot 8, 1) \approx 7.99$$

That 8 dominates. It keeps dominating until it decays away.

**When AdaMax shines.** AdaMax tends to work well with sparse or spiky gradients. Consider training word embeddings with a vocabulary of 50,000 words. In any given minibatch, only a few hundred words appear. For most words, the gradient is zero. Then, when a rare word finally appears, it might have a large gradient because its embedding hasn't been updated in a long time.

In Adam, a rarely-updated word has a tiny $v_t$ (it's been decaying by a factor of $\beta_2$ every step with no new gradient signal). When a gradient finally arrives, $v_t$ is small, so the step can be several times $\eta$ (the $\approx 3.16\eta$ spike case). This can cause instability. In AdaMax, $u_t$ is the max of all past gradient magnitudes (decayed). Even if a word hasn't appeared recently, the $u_t$ from its last appearance is still there, decayed but non-zero. The step is bounded by $\eta$ regardless.

**When Adam is better.** For most "normal" training scenarios—dense gradients, no extreme sparsity, well-behaved loss landscape—Adam's $L_2$ averaging tends to work better. The smooth blending gives a more nuanced estimate of typical gradient scale, and the RMS naturally handles the fact that gradients vary in magnitude from step to step.

The max operation in AdaMax can be "too harsh" in these settings. A single large gradient will dominate $u_t$ for many steps, even if subsequent gradients are consistently small. Adam would adapt faster to the new, smaller gradient regime. This is probably why Adam became the default optimizer while AdaMax remained a niche alternative. Adam's behavior is just more appropriate for the common case.

**The takeaway.** AdaMax is what you get when you ask: "What if we pushed Adam's $L_2$ norm all the way to the $L_\infty$ norm?" The answer is a simpler algorithm: just track the decayed max of gradient magnitudes. It has cleaner theoretical properties (steps are bounded by $\eta$) but different practical behavior (winner-take-all instead of smooth blending). It's a good example of how taking a mathematical limit can simplify rather than complicate, and how different norms encode different assumptions about what matters in your data.

# NAdam

NAdam = Nesterov-accelerated Adam. It's what you get when you apply the Nesterov "lookahead" trick to Adam's momentum term. The idea: Adam already tracks a running average of gradients (the first moment $m_t$), but it uses *yesterday's* momentum when computing the update. NAdam asks: what if we used *today's* momentum instead?

Recall Adam's update. At step $t$, you compute:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Then bias-correct and update:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Now let's look inside $\hat{m}_t$. Expand it by substituting the recurrence:

$$\hat{m}_t = \frac{\beta_1 m_{t-1} + (1 - \beta_1) g_t}{1 - \beta_1^t}$$

Split this into two terms:

$$\hat{m}_t = \frac{\beta_1 \cdot m_{t-1}}{1 - \beta_1^t} + \frac{(1 - \beta_1) \cdot g_t}{1 - \beta_1^t}$$

The first term is the contribution from historical momentum. The second term is the contribution from the current gradient. Adam's parameter update is driven by both.

Here's the Nesterov insight. Standard momentum says: blend old momentum with current gradient, then step. Nesterov says: use a more *recent* version of the momentum. Concretely: replace $m_{t-1}$ with $m_t$ in that first term.

Why? Because $m_t$ already incorporates $g_t$. Using $m_t$ instead of $m_{t-1}$ gives you a preview of where momentum is heading. You're peeking one step into the future.

So NAdam takes Adam's split form and makes the substitution $m_{t-1} \to m_t$:

$$\bar{m}_t = \frac{\beta_1 \cdot m_t}{1 - \beta_1^{t+1}} + \frac{(1 - \beta_1) \cdot g_t}{1 - \beta_1^t}$$

Wait, why did the denominator change for the first term? This is where the math gets careful.

The bias correction factor $(1 - \beta_1^k)$ is calibrated to the age of the information. In Adam, $m_{t-1}$ has been accumulating for $t-1$ steps, so at step $t$ the correction factor is $(1 - \beta_1^t)$. But $m_t$ has been accumulating for $t$ steps. If we're using $m_t$ the way Adam *would* use it at step $t+1$, the correct bias correction is $(1 - \beta_1^{t+1})$.

Each term gets the bias correction matching its effective timestep. The gradient term $g_t$ is still "current," so it keeps $(1 - \beta_1^t)$. The momentum term $m_t$ is being used as if we're one step ahead, so it gets $(1 - \beta_1^{t+1})$.

The full NAdam update is:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\bar{m}_t = \frac{\beta_1 \cdot m_t}{1 - \beta_1^{t+1}} + \frac{(1 - \beta_1) \cdot g_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \eta \frac{\bar{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
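
The five equations translate almost line for line into code. Below is a minimal sketch for one scalar parameter that can run either Adam or NAdam so the two can be compared on the same gradients; the function name `adam_like_run`, the `nesterov` flag, and the toy gradient sequence are assumptions made for illustration.

```python
import math

def adam_like_run(grads, nesterov, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step updates for one scalar parameter.

    nesterov=False uses Adam's m_hat; nesterov=True uses NAdam's m_bar.
    """
    m, v, updates = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = v / (1 - beta2 ** t)
        if nesterov:
            # Momentum term uses m_t with the (t+1) correction; gradient term keeps t.
            m_used = (beta1 * m / (1 - beta1 ** (t + 1))
                      + (1 - beta1) * g / (1 - beta1 ** t))
        else:
            m_used = m / (1 - beta1 ** t)
        updates.append(-lr * m_used / (math.sqrt(v_hat) + eps))
    return updates

grads = [2.0, 2.0, 2.0, -1.0]   # steady descent, then the gradient flips sign
adam = adam_like_run(grads, nesterov=False)
nadam = adam_like_run(grads, nesterov=True)
for t, (a, n) in enumerate(zip(adam, nadam), start=1):
    print(f"t={t}  adam={a:+.6f}  nadam={n:+.6f}")
```

On the last step, where the gradient flips sign, NAdam's update is noticeably smaller in magnitude than Adam's: the extra weight on $g_t$ pulls the momentum back sooner.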

The second moment is unchanged from Adam. Only the first moment gets the Nesterov treatment.

**A common point of confusion:** "But wait, we're at step $t$. Don't we need $m_t$ to exist before we can use it?" Yes, and it does exist. Both Adam and NAdam compute $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ as their first step. The difference is what happens *after* that.

In Adam, the standard bias correction $\hat{m}_t = m_t / (1 - \beta_1^t)$ is algebraically equivalent to the split form with $m_{t-1}$ and $g_t$. When you expand it, you see $m_{t-1}$ sitting there. That's just algebra—substituting the definition of $m_t$ back in.

In NAdam, we don't use that expanded form. We directly construct $\bar{m}_t$ using $m_t$ (which we just computed) in place of where Adam's expansion had $m_{t-1}$. Same computation of $m_t$, different assembly of the final bias-corrected quantity.

So: both optimizers compute $m_t$ the same way. The difference is how they form the momentum estimate used in the parameter update. Adam's bias correction is equivalent to "$m_{t-1}$ + $g_t$". NAdam explicitly uses "$m_t$ + $g_t$", which double-counts $g_t$ (since $g_t$ is already inside $m_t$). That double-counting is the lookahead.

## What the substitution actually means

You might be wondering: we're at timestep $t$, we have $m_t$, what's strange about using $m_t$?

Think about what information each term contains.

**Adam at step $t$:**
- Historical term uses $m_{t-1}$ → contains gradients $g_1, g_2, \ldots, g_{t-1}$
- Fresh term uses $g_t$
- Together: old gradients (weighted by momentum) + current gradient (fresh)

**NAdam at step $t$:**
- Historical term uses $m_t$ → contains gradients $g_1, g_2, \ldots, g_{t-1}, g_t$
- Fresh term uses $g_t$
- Together: all gradients including $g_t$ (weighted by momentum) + current gradient (fresh again)

See it? $g_t$ appears twice in NAdam's formula. Once inside $m_t$, and once as the fresh gradient term. This double-counting is intentional. It gives the current gradient extra weight, which is exactly the Nesterov lookahead effect. You're saying: the current gradient is so important that I want to incorporate it into my momentum estimate *before* I use that momentum to step.
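
A short worked version of that double-counting, under the simplifying assumption that $t$ is large enough for both bias-correction denominators to be close to 1. Substituting $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ into NAdam's combination:

$$\bar{m}_t \approx \beta_1 m_t + (1 - \beta_1) g_t = \beta_1^2 m_{t-1} + (1 - \beta_1^2) g_t$$

Under the same assumption Adam uses $\hat{m}_t \approx \beta_1 m_{t-1} + (1 - \beta_1) g_t$. With $\beta_1 = 0.9$, that is a weight of $0.1$ on the current gradient in Adam versus $0.19$ in NAdam, with the history weight dropping from $0.9$ to $0.81$.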

A cleaner way to think about it: how would Adam behave at step $t+1$? It would use $m_t$ as its historical momentum. NAdam uses $m_t$ at step $t$, borrowing the future version of the historical momentum while still standing at time $t$. That's the lookahead.

## Why bother?

The motivation is the same as regular Nesterov momentum: reduce oscillations near the optimum.

When you're far from the minimum, all optimizers with momentum benefit from that inertia. The accumulated velocity helps you barrel through flat regions. But when you're near the minimum, things get tricky. The gradient direction starts flipping. With standard momentum, you've built up velocity pointing one way, but the loss surface now wants to push you back. The result: overshoot, then correction, then overshoot again. Oscillation.

Nesterov's trick helps because the lookahead gradient sees the upcoming terrain. By using $m_t$ instead of $m_{t-1}$, NAdam's momentum term already knows about $g_t$. If $g_t$ is pointing backward (because you've overshot), that information is baked into the momentum *before* you take the step, not after. The correction happens sooner.

In practice, NAdam often converges slightly faster than Adam, especially on problems where oscillation near the optimum is a bottleneck. The improvement isn't dramatic. Both optimizers are doing roughly the same thing. But NAdam's momentum is slightly more anticipatory, which can shave off some iterations.

## The comparison table

| | Adam | NAdam |
|---|------|-------|
| Momentum term | $m_{t-1}$ | $m_t$ |
| Momentum bias correction | $1 - \beta_1^t$ | $1 - \beta_1^{t+1}$ |
| Gradient term | $g_t$ | $g_t$ |
| Gradient bias correction | $1 - \beta_1^t$ | $1 - \beta_1^t$ |
| Second moment | Same | Same |

NAdam is Adam with one change: it uses today's momentum instead of yesterday's, and adjusts the bias correction accordingly.

## Dozat's trick: avoiding the lookahead gradient

Traditional Nesterov momentum computes the gradient at a lookahead point: $g_t = \nabla L(w_t - \beta v_{t-1})$, where $v_{t-1}$ here is the momentum velocity, not Adam's second moment. That's awkward. The gradient has to be evaluated somewhere other than your current position, so you either shift the weights before every forward/backward pass or keep a second copy of them.

Dozat's insight (the NAdam paper) was to reformulate Nesterov so you don't need that lookahead evaluation. Instead of applying momentum twice (once to compute the lookahead gradient, once to update parameters), you apply the lookahead momentum *directly* in the update step. The gradient is still computed at $w_t$, the current position. But you use $m_t$ instead of $m_{t-1}$ when combining with that gradient.

The result: NAdam has the same computational cost as Adam. One forward pass, one backward pass, same memory footprint. The Nesterov benefit comes for free.

## When to use NAdam

NAdam is a reasonable drop-in replacement for Adam. Same hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, etc.), same tuning intuitions. If you're starting a new project and debating Adam vs NAdam, NAdam is a safe default to try. The Nesterov modification rarely hurts, and sometimes helps.

**Where NAdam tends to help:**

- *Late-stage convergence.* If your loss curve flattens but keeps oscillating instead of settling, NAdam's anticipatory momentum can smooth that out.
- *Recurrent networks.* The original NAdam paper showed improvements on LSTM language modeling. RNNs often have tricky loss landscapes with sharp valleys—exactly where Nesterov shines.
- *Problems with momentum-dominated dynamics.* If you're using high $\beta_1$ (0.95 or 0.99), the lookahead correction matters more.

**Where it probably doesn't matter:**

- *Well-tuned Adam baselines.* If Adam is already converging smoothly, NAdam won't dramatically change things. The difference is often within noise.
- *Early training.* Far from the optimum, both optimizers behave similarly. The Nesterov benefit shows up near convergence.
- *Very noisy gradients.* With small batch sizes, gradient noise dominates. The difference between $m_{t-1}$ and $m_t$ gets washed out.

**The honest assessment:** The difference between Adam and NAdam is usually smaller than the difference between "tuned hyperparameters" and "untuned hyperparameters." If you're debugging why training isn't working, switching from Adam to NAdam is unlikely to fix it. But if Adam is already working and you want to squeeze out a bit more, NAdam is worth trying.

The takeaway: NAdam is the Nesterov treatment applied to Adam. It uses the updated momentum $m_t$ instead of the stale momentum $m_{t-1}$, with the bias correction adjusted to match. Dozat's reformulation makes this computationally free. The result is a slightly more anticipatory optimizer that reduces oscillation near the optimum—a small modification with a small benefit, which is exactly what you'd expect from applying a refinement (Nesterov) to an already-refined method (Adam).