# Introduction

There is a GD expression that we have all seen. And we know how backprop works with the forward, backward, and update rule. We also know how to imagine the landscape of loss where each weight is plotted on an axis with loss on the other axis and our goal is to get to the lowest loss using that weight. 

As we trudge along this landscape of loss, we realize that we are doing the most fundamental part of deep learning. If we think of LLMs as a giant mathematical expression, then the main objective of training is to tune the constants in that expression (the weights) so that the resulting expression is close enough to the data manifold used during training. In that sense, deep learning is essentially an optimization problem.

And if the goal is to move across this fractured landscape and find that elusive global minimum, then the key becomes exactly *how* we move through it.

In this notebook, that is what we will cover. And for me, that is also the *hook* for optimizers. Because while it can get deeply mathematical and the explanations can get quite abstract, the truth is that at the end of the day, all we're doing is writing an algorithm to walk down a hill.

# Things to know

## Weight Update Formula

Let's review the most basic weight update formula, as done in vanilla SGD. This discussion is a bit off topic, but I went down this rabbit hole for a while, so I'm noting it here. I'm only writing down the high-level points; a deeper analysis leads to the question of what it means to stabilize the weight update equation (and, by corollary, training itself), which is a larger question and won't be analyzed here. The idea is just to build intuition.

Here is the equation:

W_new = W_old - lr * L'(W_old)

There is something about this equation that feels a bit wrong. 

For example, at the most fundamental level, are the units in all the terms the same? Thinking from a physics perspective, if we strictly assign physical units, W_old is a position (e.g., "meters") and the gradient L'(W_old) is a slope ("energy per meter"). Subtracting a slope from a position is physically meaningless. This reveals that the gradient is merely a directional signal (a force) indicating urgency, not a spatial displacement. The equation requires a conversion factor to translate "steepness of the hill" into "distance to step."

Perhaps the resolution to this paradox lies in the Taylor Series expansion of the loss function. The "perfect" update rule, which accounts for the landscape's geometry, is:

W_new = W_old - L'(W_old) / L''(W_old)

Here, the numerator (gradient, $L'$) provides the push, while the denominator (curvature/Hessian, $L''$) provides the braking mechanism. Dimensionally, it can be argued that this balances because the units of curvature ($\text{energy}/\text{meters}^2$) divide the units of gradient ($\text{energy}/\text{meters}$), canceling out the "energy" and leaving purely "meters." So it can be argued that this confirms that a valid update step requires two distinct derivatives: one to determine direction ($L'$) and one to determine scale ($L''$).
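For reference, here is a short worked derivation of that step. It assumes a single weight $W$, a loss that is well approximated by its second-order Taylor expansion, and $L''(W) > 0$; the step symbol $\Delta$ is introduced here only for illustration.

$$L(W + \Delta) \approx L(W) + L'(W)\,\Delta + \tfrac{1}{2} L''(W)\,\Delta^2$$

Minimizing the right-hand side over $\Delta$ (set its derivative with respect to $\Delta$ to zero):

$$L'(W) + L''(W)\,\Delta = 0 \quad\Rightarrow\quad \Delta = -\frac{L'(W)}{L''(W)}$$

The unit check then reads:

$$\frac{\text{energy}/\text{meters}}{\text{energy}/\text{meters}^2} = \text{meters}$$

Any dimensionless scale factor placed in front of this step leaves the units unchanged.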

Of course, calculating the Hessian ($L''$) is computationally intractable for millions of parameters. Therefore, the learning rate $lr$ acts as a constant scalar proxy for the inverse curvature ($1/L''$). When we set a learning rate, we are effectively guessing the geometry of the error surface. A high $lr$ assumes low curvature (a wide, flat valley where big steps are safe), while a low $lr$ assumes high curvature (a sharp, narrow ravine where precision is required). Thus, $lr$, in addition to being a speed setting, is also a unit-restoring term that bridges the gap between the "force" of the gradient and the "displacement" of the weight update.
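To make the "learning rate as a guess at inverse curvature" idea concrete, here is a minimal sketch on a toy quadratic $L(w) = \tfrac{1}{2} a w^2$, where the curvature is the constant $a$. The function names and the specific numbers are assumptions chosen for illustration, not part of any library.

```python
def gd_step(w, lr, a):
    """One gradient step on L(w) = 0.5 * a * w**2, whose gradient is a * w."""
    return w - lr * (a * w)


def newton_step(w, a):
    """One Newton step on the same loss: gradient (a * w) divided by curvature (a)."""
    return w - (a * w) / a


a = 4.0   # curvature L''(w), constant for a quadratic
w0 = 3.0  # starting weight

w_newton = newton_step(w0, a)         # lands exactly on the minimum at 0
w_matched = gd_step(w0, 1.0 / a, a)   # lr equal to 1 / curvature: same result
w_small = gd_step(w0, 0.05, a)        # lr guesses "sharper than it is": small safe step
w_large = gd_step(w0, 0.6, a)         # lr guesses "flatter than it is": overshoots

print(w_newton, w_matched, w_small, w_large)
```

When the learning rate equals $1/a$ the plain gradient step coincides with the Newton step; a much larger value jumps past the minimum and ends up farther away than where it started.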

Modern optimization strategies seem to implicitly acknowledge this relationship. Learning rate schedulers (decaying $lr$ over time) mimic the assumption that the loss landscape transitions from a broad basin (low curvature) to a narrow minimum (high curvature) as training progresses. More advanced optimizers like Adam take this a step further by dividing the gradient by a rolling average of squared gradients. This term ($\sqrt{v_t}$) serves as a computationally cheap estimation of the local curvature, attempting to replicate the unit-correcting behavior of Newton’s method by normalizing the step size for each parameter individually.

## Exponentially Weighted Moving Average

EWMA (exponentially weighted moving average) is the "smooth a noisy stream without forgetting the present" trick. You take a time series (daily temperature, stock price, whatever), and instead of a dumb uniform mean that treats last week and last year the same, you keep a running average that leans harder on recent values while older values fade out exponentially. Visually it's that clean black curve that hugs the data enough to track the trend but not enough to chase every random zig-zag. This exact "trend extraction" idea shows up all over: time series / financial forecasting, signal processing (it's basically a first-order low-pass filter), and in deep learning it's the core primitive behind momentum-style optimizers.

The entire method is one recurrence:

$$v_t = \beta v_{t-1} + (1 - \beta) x_t$$

Here $x_t$ is the value at time $t$ (temperature, gradient, whatever), and $v_t$ is the EWMA at time $t$. $\beta \in [0, 1)$ controls memory. If $\beta$ is large, you heavily trust the past state $v_{t-1}$, so the curve is stable and smooth. If $\beta$ is small, you aggressively trust the current observation $x_t$, so the curve becomes twitchy and tracks the data closely. Two practical notes: you need some $v_0$ to start the recurrence. People often set $v_0 = 0$, or set $v_0 = x_0$ (or some reasonable constant). Early on, initialization matters because the filter "warms up" from that starting point.

A tiny numerical walk-through makes it real. Say $\beta = 0.9$, $v_0 = 0$, and your first observation is $x_1 = 30$. Then $v_1 = 0.9 \cdot 0 + 0.1 \cdot 30 = 3$. Next if $x_2 = 17$, then $v_2 = 0.9 \cdot 3 + 0.1 \cdot 17 = 4.4$. And so on. Every step is "keep most of the previous smoothed value, mix in a little bit of the new measurement." Plotting $v_t$ against time gives you a trend line that reacts gradually instead of instantly.
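The walk-through above can be reproduced with a few lines of plain Python. This is a minimal sketch; the helper name `ewma` and its `v0` argument are made up here for illustration.

```python
def ewma(xs, beta, v0=0.0):
    """Apply v_t = beta * v_{t-1} + (1 - beta) * x_t and return every v_t."""
    v = v0
    smoothed = []
    for x in xs:
        v = beta * v + (1.0 - beta) * x
        smoothed.append(v)
    return smoothed


smoothed = ewma([30, 17], beta=0.9)
print([round(v, 4) for v in smoothed])  # [3.0, 4.4]
```

Starting from $v_0 = 0$ is why the first few values sit far below the data: the filter is still warming up.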

There's a really useful intuition for what $\beta$ means: EWMA behaves kind of like an average over the last

$$\frac{1}{1 - \beta}$$

points (an "effective window length"). If $\beta = 0.9$, that's about $1/0.1 = 10$ steps of memory. If $\beta = 0.5$, that's $1/0.5 = 2$ steps, meaning you're basically averaging only the last couple points and your estimate will whip around. Same data, different $\beta$, wildly different behavior: high $\beta$ gives you a slow, stable, low-variance curve; low $\beta$ gives you a moody curve that updates its "belief" based almost entirely on what just happened.

If you want the "why does it weight recent points more?" proof, just expand the recurrence a few steps. Substitute $v_{t-1}$ into $v_t$, then substitute again, and you'll see the pattern:

$$v_t = (1 - \beta)(x_t + \beta x_{t-1} + \beta^2 x_{t-2} + \cdots + \beta^{t-1} x_1) + \beta^t v_0$$

So the contribution of $x_{t-k}$ is $(1 - \beta)\beta^k$. Since $\beta < 1$, those weights decay exponentially as you go back in time. That is literally the mechanism behind the two key properties: (1) newer points matter more, (2) any fixed old point's influence shrinks as time goes on.
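A quick numerical check of the expansion, as a sketch: the input values below are arbitrary numbers picked for illustration, and $v_0 = 0$ is assumed so the $\beta^t v_0$ term vanishes.

```python
beta = 0.9
t = 50

# weights[k] is the weight placed on x_{t-k}
weights = [(1 - beta) * beta**k for k in range(t)]

total = sum(weights)                  # equals 1 - beta**t, close to 1 for large t
window = round(1 / (1 - beta))        # effective window length, 10 for beta = 0.9
recent_mass = sum(weights[:window])   # share of the weight on the last `window` points

# Compare the recurrence against the expanded sum on arbitrary data x_1..x_t.
xs = [float((7 * i) % 11) for i in range(1, t + 1)]
v = 0.0
for x in xs:
    v = beta * v + (1 - beta) * x
expanded = sum(w * x for w, x in zip(weights, reversed(xs)))

print(total, recent_mass, v, expanded)
```

With $\beta = 0.9$ the last 10 points carry $1 - 0.9^{10}$, roughly 65% of the total weight, which is the sense in which $1/(1-\beta)$ is an "effective" window rather than a hard cutoff.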

Now connect this to deep learning optimizers: replace "temperature" with "gradient". Gradients are noisy minibatch estimates, so you don't want your update direction to flip around just because one batch was weird. Momentum and friends maintain an EWMA of gradients (or squared gradients), which is just this same filter applied to a different signal. High $\beta$ means "trust the long-term direction, smooth aggressively," low $\beta$ means "react quickly to new gradient information." The sweet spot is task-dependent, but in practice you see $\beta \approx 0.9$ all the time because it gives a nice stability/response tradeoff.

Implementation-wise, Python makes this boring (in a good way). In pandas you can use `ewm` to compute the exponential moving average over a column. Pandas often parameterizes the recurrence with alpha instead of $\beta$, where

$$\alpha = 1 - \beta$$

Same thing, just a different knob. So if someone says "alpha = 0.1", that corresponds to $\beta = 0.9$ (slow/stable). If they say "alpha = 0.9", that corresponds to $\beta = 0.1$ (fast/moody). Typical workflow: load a time series (date, value), compute `ema = df['value'].ewm(alpha=alpha).mean()`, attach it as a new column, and plot the raw series plus the EMA with matplotlib. One caveat: by default `ewm` uses `adjust=True`, which renormalizes the weights over the points seen so far; pass `adjust=False` to get exactly the recurrence above, with the first output equal to the first observation. The best exercise is to vary $\alpha$ / $\beta$ and actually feel how the curve transitions from smooth trend extractor to near-copy of the raw data, and then write the recurrence yourself in a loop once, so you internalize that it's just one line of state update repeated forever.

# SGD with momentum

SGD with momentum lives inside one big picture: you're not "optimizing weights," you're navigating a loss landscape. You feed input → get prediction $\hat{y}$ → compare to target $y$ with some loss like mean squared error

$$L(y, \hat{y}) = (y - \hat{y})^2$$

and because $\hat{y}$ depends on parameters $\theta$ (weights and biases), your loss is ultimately $L(\theta)$. In toy land, if $\theta$ is a single weight $w$, you can plot $L(w)$ as a 2D curve. If $\theta = (w, b)$, you can plot $L(w, b)$ as a 3D surface. In real nets $\theta \in \mathbb{R}^N$ with $N$ in the millions, so that surface exists but your human brain can't "look at it," so we constantly fall back to 1D/2D slices and their shadows.

That's where contour plots are secretly doing a ton of work. Take a 3D surface $L(w, b)$, look at it from the top, and draw curves where the loss is constant—those are level sets. That's the contour plot: a 2D projection of a 3D surface. You lost one dimension (height), so you encode it with color. The geometry is the key: where contour lines are close together, the slope magnitude is large (steep region). Where the lines are far apart, the surface is flat-ish (small gradient over a big region). Saddle points show up as that weird "up in one direction, down in the other" structure—often with large flat-ish neighborhoods where gradients are tiny and progress crawls. Local minima show up as "basins" that look like closed loops around a dip. You can mentally reverse-engineer the 3D surface from the 2D contours: tight rings = steep walls, spaced rings = gentle bowl, twisted pattern = saddle.

Now convex vs non-convex: convex is the friendly universe where there's one basin and every downhill path leads to the same global minimum. Non-convex is the deep learning universe: multiple basins, flats, ravines, saddles, weird curvature. The "why is optimization hard?" list in practice is brutally simple:

**Local minima:** you can fall into a small dip and stop improving, even though there's a deeper basin elsewhere (global minimum).

**Saddle points / plateaus:** gradients are tiny across a large region, so you move at a glacial pace (even if you're not "stuck" in a strict minimum).

**High curvature ravines:** one direction is steep (large curvature), another direction is shallow (small curvature), so vanilla updates bounce side-to-side while making slow forward progress.

Plus the "SGD reality tax": noisy gradients (because minibatches are stochastic estimates), and inconsistent gradients (directions can vary across steps due to noise + curvature + minibatch sampling).

Before momentum, the baseline update is: "step opposite the gradient." For parameters $\theta$,

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

where $\alpha$ is the learning rate. If you compute $\nabla L$ over the full dataset, that's batch gradient descent: smooth, stable, often slow per step. If you compute it using one example at a time, that's stochastic gradient descent (SGD): cheap per step, but the path is jagged because the gradient estimate is noisy. If you compute it over a small batch, that's mini-batch GD (what people usually mean in deep learning): a practical middle ground—faster iteration than batch, less chaos than pure SGD.
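The three flavors differ only in which examples feed the gradient. Here is a minimal sketch, assuming a one-weight model $\hat{y} = w x$ with mean squared error and synthetic data; the helper `grad_mse`, the data, and the batch size of 16 are all made up for illustration.

```python
import random


def grad_mse(w, xs, ys):
    """Gradient of the mean of (w * x - y)**2 with respect to w over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
xs = [i / 10 for i in range(1, 101)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]  # true slope 3 plus noise

w = 0.0

# Batch gradient descent: every example contributes.
full = grad_mse(w, xs, ys)

# Stochastic gradient descent: a single randomly chosen example.
i = random.randrange(len(xs))
single = grad_mse(w, xs[i:i + 1], ys[i:i + 1])

# Mini-batch: a small random subset.
idx = random.sample(range(len(xs)), 16)
mini = grad_mse(w, [xs[j] for j in idx], [ys[j] for j in idx])

print(full, single, mini)
```

The full gradient is the average of the per-example gradients, so the single-example and mini-batch values are noisy estimates of it; larger batches tend to land closer to the full value.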

So where does momentum enter? Momentum is basically: "don't treat each gradient like it's a brand-new opinion; treat gradients like noisy measurements of a trend." If the last several gradients have been pointing roughly the same way, you should build confidence and move faster in that direction. If gradients disagree (noise, oscillations across a ravine), you should damp the indecision.

There are two mental models that land well:

**Crowd directions model:** you're trying to reach point B in a city. You ask one person and they point east—maybe, maybe not. You ask four people and they all point east—now you commit and walk faster. If the crowd is split (two say east, two say west), you still move, but cautiously. Momentum is the "ask multiple past gradients, form a consensus direction."

**Physics model:** you're a ball rolling down a landscape. A ball doesn't teleport to "the steepest direction" and forget its previous motion every millisecond. It has velocity; it carries inertia. Momentum optimization explicitly introduces a velocity-like state. In Newtonian terms, momentum is $p = mv$. We don't really have a meaningful mass here, so you can pretend $m = 1$ and focus on "velocity."

Mathematically, SGD with momentum keeps a velocity vector $v_t$ (same shape as $\theta$). Replace the raw gradient step with an exponentially weighted moving average of past gradients (EMA), and use that to update parameters:

$$v_t = \beta v_{t-1} + \alpha g_t$$
$$\theta_{t+1} = \theta_t - v_t$$

where $g_t = \nabla_\theta L(\theta_t)$ (usually computed on a mini-batch), $\alpha$ is learning rate, and $\beta \in [0, 1)$ is the momentum coefficient (commonly $\beta = 0.9$).

Two important notes:

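This is the exponential moving average idea applied to gradients (or "update directions"), up to a constant: the EWMA recurrence weights the new term by $(1 - \beta)$, while this form weights it by $\alpha$, so with $v_0 = 0$ the velocity $v_t$ is an EWMA of the gradients scaled by $\alpha / (1 - \beta)$.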

If you unroll the recurrence, you see the EMA explicitly:

$$v_t = \alpha (g_t + \beta g_{t-1} + \beta^2 g_{t-2} + \cdots)$$

So the current velocity is a weighted sum of recent gradients, with weights decaying exponentially into the past. Recent gradients matter more; ancient gradients fade out.
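As a sanity check on the unrolled form, the sketch below runs the velocity recurrence on a made-up gradient sequence and compares it with the explicit weighted sum. It assumes $v_0 = 0$; the function name and the numbers are illustrative only.

```python
def momentum_velocities(grads, lr, beta):
    """Run v_t = beta * v_{t-1} + lr * g_t from v_0 = 0 and return every v_t."""
    v = 0.0
    velocities = []
    for g in grads:
        v = beta * v + lr * g
        velocities.append(v)
    return velocities


grads = [1.0, -0.5, 2.0, 0.25, 1.5]  # g_1 .. g_5, arbitrary values
lr, beta = 0.1, 0.9

v_recurrence = momentum_velocities(grads, lr, beta)[-1]

# Unrolled: lr * (g_t + beta * g_{t-1} + beta**2 * g_{t-2} + ...)
v_unrolled = lr * sum(beta**k * g for k, g in enumerate(reversed(grads)))

print(v_recurrence, v_unrolled)
```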

The parameter $\beta$ is the "memory / decay factor." It sets how long the optimizer remembers the past. A useful rule of thumb: the effective averaging window length is about

$$\text{window} \approx \frac{1}{1 - \beta}$$

So $\beta = 0.9$ remembers roughly $\sim 10$ steps, $\beta = 0.99$ remembers $\sim 100$ steps, etc. Bigger $\beta$ = smoother, more inertia, more "commitment."
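One way to see the window length in action is to feed the recurrence a constant gradient and watch the velocity build up. This is a sketch under that simplifying assumption (real gradients are not constant), with illustrative numbers.

```python
lr, beta, g = 0.1, 0.9, 1.0

v = 0.0
history = []
for _ in range(200):
    v = beta * v + lr * g  # same velocity recurrence, constant gradient g
    history.append(v)

limit = lr * g / (1 - beta)          # value the velocity settles at
after_window = history[9] / limit    # fraction reached after 1 / (1 - beta) = 10 steps

print(limit, after_window)
```

Under a steady gradient the velocity settles at $\alpha g / (1 - \beta)$, so $\beta = 0.9$ behaves like a 10x larger step once it is up to speed, and it gets about 65% of the way there within the first 10 steps.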

Edge cases explain the whole design:

**If $\beta = 0$:**

$$v_t = \alpha g_t, \quad \theta_{t+1} = \theta_t - \alpha g_t$$

That's just vanilla SGD. Momentum collapses to "no momentum."

**If $\beta \to 1$:** you stop forgetting. Velocity becomes a long-running accumulator. This can create persistent oscillations / a kind of dynamic equilibrium where you don't settle nicely, because the system carries too much inertia and not enough damping.

Now the core behavioral payoff. Picture the classic deep learning ravine: steep curvature in one direction (say vertical), shallow slope in the other (say horizontal). Vanilla SGD tends to bounce: it overshoots across the steep direction, flips gradient sign, overshoots back, repeat. You get a zig-zag trajectory that wastes steps oscillating "up and down" while inching forward along the shallow direction. Momentum acts like a low-pass filter: the oscillatory component cancels out over time (because it keeps switching sign), while the consistent component along the shallow direction accumulates. Net effect: less vertical oscillation, more horizontal progress, faster convergence.
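The ravine picture can be sketched on a toy quadratic $L(x, y) = \tfrac{1}{2}(x^2 + 100 y^2)$: shallow along $x$, steep along $y$. Everything here is an assumption made for illustration: the loss, the learning rate, the start point, and the use of exact (noise-free) gradients.

```python
def run(lr, beta, steps, steep=100.0, start=(1.0, 1.0)):
    """Minimize 0.5 * (x**2 + steep * y**2) with the momentum update.

    beta = 0 reduces to vanilla gradient descent.
    Returns the final x, y and loss.
    """
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = x, steep * y      # exact gradient of the quadratic
        vx = beta * vx + lr * gx
        vy = beta * vy + lr * gy
        x -= vx
        y -= vy
    return x, y, 0.5 * (x * x + steep * y * y)


x_gd, y_gd, loss_gd = run(lr=0.015, beta=0.0, steps=100)
x_mom, y_mom, loss_mom = run(lr=0.015, beta=0.9, steps=100)

print("vanilla :", x_gd, y_gd, loss_gd)
print("momentum:", x_mom, y_mom, loss_mom)
```

In the vanilla run, $y$ flips sign every step (each update multiplies it by $1 - 0.015 \cdot 100 = -0.5$) while $x$ shrinks by only 1.5% per step, so the shallow direction is the bottleneck. With $\beta = 0.9$ the same learning rate covers far more ground along $x$ in the same 100 steps and ends at a lower loss.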

That same mechanism also helps with:

**Noisy gradients (especially small batch sizes):** the noise is high-frequency randomness; EMA smooths it out.

**Inconsistent gradients:** if the direction is unstable from step to step, momentum refuses to fully commit; it averages them and produces a more stable update direction.

**Saddle points / flat regions:** gradients can be tiny for a long time. Momentum can "carry" you through these regions because velocity doesn't instantly drop to zero when the instantaneous gradient is small. You keep moving due to accumulated velocity.

**Local minima (the small annoying ones):** inertia can help you roll out of a shallow basin. If the dip is small and your velocity is high enough, you don't get trapped—you pass through and continue toward a better basin.

And here's the funny twist: momentum's superpower is also its most common failure mode. Because it builds velocity, it often overshoots the optimum and then has to correct. Near the minimum, the true gradient points back toward the basin center, but your velocity may still be blasting forward from earlier steps. So you fly past the bottom, climb the other side, turn around, fly past again… and you get oscillations around the optimum. The exponential decay ($\beta < 1$) acts as damping, so those oscillations usually shrink and you eventually settle, but you can waste time "ringing" around the minimum.
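The "ringing" is easy to reproduce on a one-dimensional bowl $L(w) = \tfrac{1}{2} w^2$ with the minimum at $w = 0$. A minimal sketch with illustrative hyperparameters:

```python
def trajectory(lr, beta, steps, w0=1.0):
    """Momentum update on L(w) = 0.5 * w**2 (gradient = w); returns every visited w."""
    w, v = w0, 0.0
    path = [w]
    for _ in range(steps):
        v = beta * v + lr * w
        w -= v
        path.append(w)
    return path


gd_path = trajectory(lr=0.1, beta=0.0, steps=60)   # vanilla gradient descent
mom_path = trajectory(lr=0.1, beta=0.9, steps=60)  # with momentum

# Count how many times the momentum run crosses the minimum at w = 0.
crossings = sum(1 for a, b in zip(mom_path, mom_path[1:]) if a * b < 0)

print(min(gd_path), min(mom_path), crossings)
```

Vanilla gradient descent approaches 0 from one side and never crosses it, while the momentum run flies past the minimum and swings back several times, with each swing smaller than the last.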

On a contour plot, this looks exactly like what your intuition expects: momentum trajectories cut through the landscape aggressively, often overshooting the basin center and spiraling or oscillating before stabilizing. Plain SGD looks more like a jittery random walk that eventually drifts into the basin, often slower but with less dramatic overshoot. In interactive visualizers (contour plot + click-to-start-point), this contrast is almost comically visible: SGD is the anxious squirrel; momentum is the overconfident skateboarder.

So the headline claims—when someone asks "why use momentum?"—are basically three:

**Speed:** it often reaches a good region faster than plain SGD, especially in ravines / high curvature terrain.

**Escaping shallow traps:** it can roll through small local minima or tiny bumps because it has inertia.

**Stability under noise:** it smooths noisy, stochastic gradients by averaging history.

And the main caution label:

**Overshoot + oscillation near optima:** momentum can waste steps bouncing around the minimum before damping out. It's still typically faster than vanilla SGD overall, but this is exactly why later methods try to keep the "fast" while fixing the "ringing."

The cleanest way to remember what's happening is this: SGD reacts to the present; momentum reacts to the recent past plus the present. It's a memory-equipped optimizer. That memory is an exponentially decayed history of gradients. Set $\beta$ too low and you're basically back to SGD. Set it too high and you get a stubborn optimizer that refuses to slow down and can oscillate for longer. In the sweet spot (often $\beta \approx 0.9$), you get the practical win: faster traversal across ugly non-convex terrain where gradients are noisy, curvature is weird, and the loss surface is doing its best impression of a crumpled bedsheet.

If you want one sentence that isn't lying: SGD with momentum replaces "take a step downhill" with "maintain a velocity that's an EMA of downhill directions, then step according to that velocity." That's it. Everything else—faster convergence in ravines, smoothing noise, escaping shallow minima, overshooting and oscillations—is just that sentence playing out in geometry.

# Nesterov Accelerated Gradient

You're training a network. You have parameters $w$ (weights, biases, whatever), a loss $L(w)$, and the entire game is: find $w^*$ that makes $L(w)$ small. Vanilla gradient descent just says: "look at the slope here, step downhill."

$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

with learning rate $\eta$. In deep learning you see the usual trio: batch GD (full dataset gradient), SGD (one sample), mini-batch (the practical default). They all share the same core weakness: they can be slow and jittery, especially in landscapes that look like long ravines / narrow valleys (common with ill-conditioned curvature). Even in a toy convex case like linear regression with MSE, the loss surface is a smooth bowl in $(m, b)$ space (slope $m$, intercept $b$): you start at some random $(m, b)$ and "walk" toward the optimum. Batch GD will get there, but it often takes a bunch of small, cautious steps (think 25–30-ish iterations in the linear regression animation) because the gradient keeps changing and you're always reacting locally.

Momentum is the first big hack that feels like physics: stop being a goldfish. Keep a running "velocity" that remembers where you've been going.

A common momentum form is:

$$v_t = \beta v_{t-1} + \eta \nabla L(w_t)$$
$$w_{t+1} = w_t - v_t$$

Here $v_t$ is the velocity, $\beta \in [0, 1)$ is the momentum coefficient (decay factor), and $\eta$ is the learning rate. Substitute $v_t$ into the weight update and you see the key thing immediately:

$$w_{t+1} = w_t - \beta v_{t-1} - \eta \nabla L(w_t)$$

So the step is a sum of two pushes:

$-\eta \nabla L(w_t)$: the "fresh" downhill push from the current gradient

$-\beta v_{t-1}$: the "inertia" push from accumulated past gradients

That inertia is why momentum often rockets through the early part of optimization: in the linear regression animation you basically see the point zip toward the vicinity of the minimum in just a few epochs, while plain GD trudges there in many more. But you also see the dark side: overshoot and oscillation. If $\beta$ is high (e.g. 0.9), you're giving a lot of weight to history, so once you build speed you tend to blast past the minimum, then correct, then blast past again, with oscillations that gradually decay. Tuning $\beta$ down (e.g. 0.8) reduces how hard the past keeps shoving you, so the oscillations damp faster. That's the core trade: bigger $\beta$ = faster "ball rolling downhill" behavior, but also more ringing.

This oscillation problem is not just cosmetic. In non-convex deep nets (and even in convex-but-ill-conditioned bowls), that ringing can waste steps. The optimizer is spending compute doing a little interpretive dance around the minimum instead of just settling.

Nesterov Accelerated Gradient (NAG) is a deceptively small tweak on momentum that attacks exactly that: it tries to reduce the oscillation by being less surprised by where momentum is about to take you. The mental model: momentum is driving the car while staring at the road under the front bumper. NAG lets you peek a bit ahead, then steer.

Momentum computes the gradient at the current position $w_t$, then mixes it with velocity. NAG instead evaluates the gradient at a lookahead point: where momentum alone would take you.

Define the lookahead weight:

$$w_{\text{lookahead}} = w_t - \beta v_{t-1}$$

Now compute the gradient there, and update velocity using that gradient:

$$v_t = \beta v_{t-1} + \eta \nabla L(w_{\text{lookahead}}) = \beta v_{t-1} + \eta \nabla L(w_t - \beta v_{t-1})$$

Then the actual parameter update is still:

$$w_{t+1} = w_t - v_t$$

If you substitute, you get the compact "single-line" view that shows the difference vs momentum:

**Momentum**

$$w_{t+1} = w_t - \beta v_{t-1} - \eta \nabla L(w_t)$$

**NAG**

$$w_{t+1} = w_t - \beta v_{t-1} - \eta \nabla L(w_t - \beta v_{t-1})$$

That's it. Same ingredients. Same hyperparameters. One change: the gradient is measured after peeking in the direction momentum is about to move.

Why does this reduce oscillations? Picture the classic overshoot near a minimum along one direction. With momentum, you arrive near the bottom with a big $v$ pointing "forward." Even if the true gradient at $w_t$ is starting to change sign (meaning you've crossed the minimum), the update still includes this big inertial shove, so you keep going too far, then you correct back, then too far again. The optimizer is always a step late because it's using a gradient that doesn't account for the fact that momentum is about to move you.

NAG fixes the timing mismatch. It says: "Before you commit to the full step, pretend you already applied momentum, and from there ask the loss: which way is downhill?" Near the minimum, that lookahead point often lands on the "other side" of the bowl where the gradient flips direction. So the gradient term in NAG becomes a kind of early braking signal: it partially cancels the momentum shove before you overshoot as badly. Geometrically, instead of doing a big U-turn after you fly past the minimum, you do a smaller correction sooner. In those animations, that shows up as the same fast approach but noticeably less ringing.
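The early-braking effect can be sketched on the same kind of toy bowl, $L(w) = \tfrac{1}{2} w^2$, by running both update rules with identical hyperparameters and measuring how far each one travels past the minimum. The function and its arguments are hypothetical, written to mirror the equations above rather than any library's implementation.

```python
def run(lr, beta, steps, nesterov, w0=1.0):
    """Momentum or NAG on L(w) = 0.5 * w**2 (gradient at a point p is p)."""
    w, v = w0, 0.0
    path = [w]
    for _ in range(steps):
        # NAG evaluates the gradient at the lookahead point w - beta * v.
        point = w - beta * v if nesterov else w
        grad = point
        v = beta * v + lr * grad
        w -= v
        path.append(w)
    return path


momentum_path = run(lr=0.1, beta=0.9, steps=60, nesterov=False)
nag_path = run(lr=0.1, beta=0.9, steps=60, nesterov=True)

# Starting at w = 1, the most negative w is the farthest overshoot past the minimum at 0.
overshoot_momentum = -min(momentum_path)
overshoot_nag = -min(nag_path)

print(overshoot_momentum, overshoot_nag)
```

Both runs cross the minimum, but on this toy problem the NAG run turns around noticeably earlier, which is the "smaller correction sooner" behavior described above.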

So the slogan version (that's actually accurate) is:

**Momentum:** "push me based on where I am."

**NAG:** "push me based on where momentum is about to put me."

There is a real tradeoff lurking here: damping oscillations can sometimes mean damping your ability to escape. If the loss surface has a little local basin separated by a ridge, momentum can occasionally carry enough inertia to roll up and out of that basin. NAG, by being more conservative near turning points (because it's constantly applying this lookahead correction), might fail to accumulate the same "slam through the barrier" behavior and can get stuck oscillating inside a local minimum region. That's not a universal law (deep net landscapes are weirder than the 2D cartoon), but it's a legit failure mode intuition: NAG is better behaved, and "better behaved" can sometimes mean "less exploratory."

Finally, the practical implementation detail from the Keras angle: NAG isn't a separate optimizer class in the basic API; it's a switch on SGD-with-momentum (the `nesterov` argument of the `SGD` optimizer, next to `momentum`). Conceptually:

- **Vanilla SGD:** momentum off, Nesterov off
- **Momentum SGD:** momentum on, Nesterov off
- **NAG:** momentum on, Nesterov on
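The three configurations above can be sketched as one update function with two switches. This is a hypothetical pure-Python helper that follows the lookahead equations from this section; it is not the Keras implementation, and framework code may use an algebraically rearranged form. The toy loss $(w - 3)^2$ and the hyperparameters are also assumptions.

```python
def sgd_step(w, v, grad_fn, lr, momentum=0.0, nesterov=False):
    """One update; returns (new_w, new_v).

    momentum=0.0                 -> vanilla SGD
    momentum>0, nesterov=False   -> SGD with momentum
    momentum>0, nesterov=True    -> NAG (gradient taken at the lookahead point)
    """
    eval_point = w - momentum * v if nesterov else w
    v_new = momentum * v + lr * grad_fn(eval_point)
    return w - v_new, v_new


def grad(w):
    """Gradient of the toy loss (w - 3)**2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


configs = {
    "vanilla": dict(momentum=0.0, nesterov=False),
    "momentum": dict(momentum=0.9, nesterov=False),
    "nag": dict(momentum=0.9, nesterov=True),
}

results = {}
for name, cfg in configs.items():
    w, v = 0.0, 0.0
    for _ in range(200):
        w, v = sgd_step(w, v, grad, lr=0.05, **cfg)
    results[name] = w

print(results)
```

All three settings end up near $w = 3$ on this problem; the switches change the path taken, not the destination.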

So you're really choosing: do I want the velocity term, and if yes, do I want the gradient evaluated at the current position (momentum) or at the lookahead position (NAG)?

Net-net: NAG is "momentum, but slightly more clairvoyant." Same core mechanism—accumulate velocity to speed up in consistent directions—but it uses that lookahead gradient to reduce overshoot and tame oscillations, which often makes it converge faster and cleaner in the kinds of curved valleys you see all over deep learning optimization.

# Adaptive Gradient