# Introduction

We have all seen the gradient descent (GD) update expression, and we know how backprop works: the forward pass, the backward pass, and the update rule. We also know how to picture the loss landscape, with each weight plotted on its own axis and the loss on the remaining axis, where the goal is to find the weights that give the lowest loss.

As we trudge along this landscape of loss, we realize that we are doing the most fundamental part of deep learning. If we think of an LLM as a giant mathematical expression, then the main objective of training is to tune the constants in that expression (the weights) so that the resulting expression is close enough to the data manifold used during training. In that sense, deep learning is essentially an optimization problem.

And if the goal is to move across this fractured landscape and find that elusive global minimum, then the key becomes exactly *how* we move through it.

In this notebook, that is what we will cover. And for me, that is also the *hook* for optimizers. Because while it can get deeply mathematical and the explanations can get quite abstract, the truth is that at the end of the day, all we're doing is writing an algorithm to walk down a hill.

# Things to know

## Weight Update Formula

Let's review the most basic weight update formula, the one used by vanilla SGD. This discussion is a bit off topic, but I went down this rabbit hole for a while, so I am noting it here. I am only writing down the high-level points; a deeper analysis leads to the question of what it means to stabilize the weight update equation and, by corollary, training itself, which is a larger question that won't be analyzed here. The idea is just to build intuition.

Here is the equation:

W_new = W_old - lr * L'(W_old)

There is something about this equation that feels a bit wrong. 

For example, at the most fundamental level, are the units in all the terms the same? Thinking from a physics perspective, if we strictly assign physical units, W_old is a position (e.g., "meters") and the gradient L'(W_old) is a slope ("energy per meter"). Subtracting a slope from a position is physically impossible. This reveals that the gradient is merely a directional signal (a force) indicating urgency, not a spatial displacement. The equation requires a conversion factor to translate "steepness of the hill" into "distance to step."

Perhaps the resolution to this paradox lies in the Taylor series expansion of the loss function. The "perfect" update rule, which accounts for the landscape's geometry, is the Newton's-method step (scaled here by lr):

W_new = W_old - lr * ( L'(W_old) / L''(W_old) )

Here, the numerator (gradient, $L'$) provides the push, while the denominator (curvature/Hessian, $L''$) provides the braking mechanism. Dimensionally, it can be argued that this balances: dividing the units of the gradient ($\text{energy}/\text{meters}$) by the units of the curvature ($\text{energy}/\text{meters}^2$) cancels the "energy" and leaves purely "meters." So it can be argued that a valid update step requires two distinct derivatives: one to determine direction ($L'$) and one to determine scale ($L''$).
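The Taylor-series argument can be made explicit. A sketch in one dimension, assuming $L''(W_{old}) > 0$ and writing $\Delta$ for the step (a symbol introduced here only for illustration):

$$L(W_{old} + \Delta) \approx L(W_{old}) + L'(W_{old})\,\Delta + \tfrac{1}{2} L''(W_{old})\,\Delta^2$$

Setting the derivative of the right-hand side with respect to $\Delta$ to zero gives

$$L'(W_{old}) + L''(W_{old})\,\Delta = 0 \quad\Longrightarrow\quad \Delta = -\frac{L'(W_{old})}{L''(W_{old})}$$

which is the update above with $lr = 1$. Under this quadratic approximation, the step lands exactly on the minimum of the local parabola.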

Of course, calculating the Hessian ($L''$) is computationally intractable for models with millions of parameters, since it has one entry for every pair of parameters. Therefore, the learning rate $lr$ acts as a constant scalar proxy for the inverse curvature ($1/L''$). When we set a learning rate, we are effectively guessing the geometry of the error surface. A high $lr$ assumes low curvature (a wide, flat valley where big steps are safe), while a low $lr$ assumes high curvature (a sharp, narrow ravine where precision is required). Thus, $lr$, in addition to being a speed setting, is also a unit-restoring term that bridges the gap between the "force" of the gradient and the "displacement" of the weight update.
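The idea that $lr$ stands in for inverse curvature can be checked on a toy problem. The sketch below assumes a one-dimensional quadratic loss $L(w) = \tfrac{1}{2} a w^2$ with a made-up curvature $a = 4$; the function name and starting point are illustrative only.

```python
def gradient_descent(curvature, lr, w0=1.0, steps=20):
    """Run plain gradient descent on L(w) = 0.5 * curvature * w**2."""
    w = w0
    for _ in range(steps):
        grad = curvature * w      # L'(w) for this quadratic
        w = w - lr * grad         # W_new = W_old - lr * L'(W_old)
    return w

curvature = 4.0                   # L''(w), constant for a quadratic

# lr equal to 1 / curvature reproduces the Newton step: one step to the minimum.
w_exact = gradient_descent(curvature, lr=1.0 / curvature, steps=1)
# A smaller lr still converges, just more slowly.
w_small = gradient_descent(curvature, lr=0.05)
# lr above 2 / curvature (0.5 here) overshoots further on every step.
w_large = gradient_descent(curvature, lr=0.6)

print(w_exact, w_small, w_large)
```

On this loss each step multiplies $w$ by $(1 - lr \cdot a)$, so the guess encoded in $lr$ is only safe while that factor stays between $-1$ and $1$.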

Modern optimization strategies seem to implicitly acknowledge this relationship. Learning rate schedulers (decaying $lr$ over time) mimic the assumption that the loss landscape transitions from a broad basin (low curvature) to a narrow minimum (high curvature) as training progresses. More advanced optimizers like Adam take this a step further by dividing the gradient by a rolling average of squared gradients. This term ($\sqrt{v_t}$) serves as a computationally cheap estimation of the local curvature, attempting to replicate the unit-correcting behavior of Newton’s method by normalizing the step size for each parameter individually.
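A minimal sketch of that per-parameter normalization, for a single scalar parameter and following the standard Adam update with bias correction. The hyperparameter values are the commonly used defaults, and the two named parameters with their constant gradients are made up for illustration.

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # EWMA of gradients
    v = beta2 * v + (1 - beta2) * g * g      # EWMA of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Two hypothetical parameters whose gradients differ by four orders of magnitude.
gradients = {"steep": 100.0, "flat": 0.01}
moved = {}
for name, g in gradients.items():
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, 11):
        w, m, v = adam_step(w, g, m, v, t)   # constant gradient for simplicity
    moved[name] = abs(w)

print(moved)
```

With a constant gradient, both parameters end up moving by roughly `lr` per step even though their gradients differ so much in scale, which is the rescaling described above.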

## Exponentially Weighted Moving Average

EWMA (exponentially weighted moving average) is the "smooth a noisy stream without forgetting the present" trick. You take a time series (daily temperature, stock price, whatever), and instead of a dumb uniform mean that treats last week and last year the same, you keep a running average that leans harder on recent values while older values fade out exponentially. Visually it's that clean black curve that hugs the data enough to track the trend but not enough to chase every random zig-zag. This exact "trend extraction" idea shows up all over: time series / financial forecasting, signal processing (it's basically a first-order low-pass filter), and in deep learning it's the core primitive behind momentum-style optimizers.

The entire method is one recurrence:

$$v_t = \beta v_{t-1} + (1 - \beta) x_t$$

Here $x_t$ is the value at time $t$ (temperature, gradient, whatever), and $v_t$ is the EWMA at time $t$. $\beta \in [0, 1)$ controls memory. If $\beta$ is large, you heavily trust the past state $v_{t-1}$, so the curve is stable and smooth. If $\beta$ is small, you aggressively trust the current observation $x_t$, so the curve becomes twitchy and tracks the data closely. A practical note on initialization: you need some $v_0$ to start the recurrence. People often set $v_0 = 0$ or $v_0 = x_0$ (or some reasonable constant), and early on the choice matters because the filter "warms up" from that starting point.

A tiny numerical walk-through makes it real. Say $\beta = 0.9$, $v_0 = 0$, and your first observation is $x_1 = 30$. Then $v_1 = 0.9 \cdot 0 + 0.1 \cdot 30 = 3$. Next if $x_2 = 17$, then $v_2 = 0.9 \cdot 3 + 0.1 \cdot 17 = 4.4$. And so on. Every step is "keep most of the previous smoothed value, mix in a little bit of the new measurement." Plotting $v_t$ against time gives you a trend line that reacts gradually instead of instantly.
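The same walk-through as a loop, as a minimal sketch. The function name `ewma` is just a label chosen here; it starts from $v_0 = 0$ as in the example.

```python
def ewma(values, beta, v0=0.0):
    """Return the list of EWMA states v_1..v_t for the given observations."""
    v = v0
    history = []
    for x in values:
        v = beta * v + (1 - beta) * x   # v_t = beta * v_{t-1} + (1 - beta) * x_t
        history.append(v)
    return history

smoothed = ewma([30, 17], beta=0.9)
print([round(v, 4) for v in smoothed])   # [3.0, 4.4]
```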

There's a really useful intuition for what $\beta$ means: EWMA behaves kind of like an average over the last

$$\frac{1}{1 - \beta}$$

points (an "effective window length"). If $\beta = 0.9$, that's about $1/0.1 = 10$ steps of memory. If $\beta = 0.5$, that's $1/0.5 = 2$ steps, meaning you're basically averaging only the last couple points and your estimate will whip around. Same data, different $\beta$, wildly different behavior: high $\beta$ gives you a slow, stable, low-variance curve; low $\beta$ gives you a moody curve that updates its "belief" based almost entirely on what just happened.
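One way to see what "effective window" means is to add up how much of the total weight falls on the most recent $1/(1 - \beta)$ observations. The sketch below uses the fact that the recurrence puts weight $(1 - \beta)\beta^k$ on the observation $k$ steps back; the helper name is made up.

```python
def window_summary(beta):
    """Return (window length, share of total weight inside that window)."""
    window = round(1 / (1 - beta))
    # Weight on the observation k steps back is (1 - beta) * beta**k.
    share = sum((1 - beta) * beta ** k for k in range(window))
    return window, share

for beta in (0.5, 0.9, 0.99):
    window, share = window_summary(beta)
    print(f"beta={beta}: window ~ {window} steps, holds {share:.0%} of the weight")
```

The window holds most of the weight but not all of it, so $1/(1 - \beta)$ is best read as a rule of thumb rather than a hard cutoff.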

If you want the "why does it weight recent points more?" proof, just expand the recurrence a few steps. Substitute $v_{t-1}$ into $v_t$, then substitute again, and you'll see the pattern:

$$v_t = (1 - \beta)(x_t + \beta x_{t-1} + \beta^2 x_{t-2} + \cdots + \beta^{t-1} x_1) + \beta^t v_0$$

So the contribution of $x_{t-k}$ is $(1 - \beta)\beta^k$. Since $\beta < 1$, those weights decay exponentially as you go back in time. That is literally the mechanism behind the two key properties: (1) newer points matter more, (2) any fixed old point's influence shrinks as time goes on.
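The expansion can be checked numerically against the recurrence. This is a small sketch with made-up observations; both function names are illustrative.

```python
def ewma_recursive(xs, beta, v0=0.0):
    """Final EWMA state computed step by step with the recurrence."""
    v = v0
    for x in xs:
        v = beta * v + (1 - beta) * x
    return v

def ewma_expanded(xs, beta, v0=0.0):
    """Final EWMA state computed from the expanded (unrolled) sum."""
    t = len(xs)
    # x_{t-k} sits at index t - 1 - k and gets weight (1 - beta) * beta**k.
    weighted = sum((1 - beta) * beta ** k * xs[t - 1 - k] for k in range(t))
    return weighted + beta ** t * v0

xs = [30, 17, 22, 25, 19]   # made-up observations
print(ewma_recursive(xs, 0.9), ewma_expanded(xs, 0.9))
```

The two values agree up to floating-point rounding, including when $v_0$ is nonzero.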

Now connect this to deep learning optimizers: replace "temperature" with "gradient". Gradients are noisy minibatch estimates, so you don't want your update direction to flip around just because one batch was weird. Momentum and friends maintain an EWMA of gradients (or squared gradients), which is just this same filter applied to a different signal. High $\beta$ means "trust the long-term direction, smooth aggressively," low $\beta$ means "react quickly to new gradient information." The sweet spot is task-dependent, but in practice you see $\beta \approx 0.9$ all the time because it gives a nice stability/response tradeoff.
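A rough illustration of that smoothing effect, assuming a made-up "true" gradient of 1.0 corrupted by unit-variance Gaussian noise. It uses the $(1 - \beta)$-scaled form from this section; some library implementations of momentum drop that factor, which changes the scale but not the smoothing.

```python
import random
import statistics

random.seed(0)

true_gradient = 1.0                 # hypothetical "clean" gradient
beta = 0.9

raw, smoothed = [], []
v = 0.0
for _ in range(1000):
    g = true_gradient + random.gauss(0.0, 1.0)   # noisy minibatch estimate
    v = beta * v + (1 - beta) * g                # EWMA of gradients
    raw.append(g)
    smoothed.append(v)

# How often does each signal point in the wrong direction (negative sign)?
raw_wrong = sum(g < 0 for g in raw)
smoothed_wrong = sum(s < 0 for s in smoothed)

raw_std = statistics.pstdev(raw)
smoothed_std = statistics.pstdev(smoothed)

print(raw_wrong, smoothed_wrong)
print(raw_std, smoothed_std)
```

The raw gradient regularly has the wrong sign, while the smoothed one rarely does once it has warmed up, and its spread is much smaller.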

Implementation-wise, Python makes this boring (in a good way). In pandas you can use `ewm` to compute the exponential moving average over a column. Pandas often parameterizes the recurrence with alpha instead of $\beta$, where

$$\alpha = 1 - \beta$$

Same thing, just a different knob. So if someone says "alpha = 0.1", that corresponds to $\beta = 0.9$ (slow/stable). If they say "alpha = 0.9", that corresponds to $\beta = 0.1$ (fast/moody). Typical workflow: load time series (date, value), compute `ema = df['value'].ewm(alpha=alpha).mean()`, attach it as a new column, plot raw series plus EMA with matplotlib. One caveat: `ewm` defaults to `adjust=True`, which renormalizes the weights and so does not match the recurrence above step for step in the early part of the series; pass `adjust=False` to get the plain recurrence, seeded with $v_0 = x_0$. The best exercise is to vary $\alpha$ / $\beta$ and actually feel how the curve transitions from smooth trend extractor to near-copy of the raw data, and then write the recurrence yourself in a loop once, so you internalize that it's just one line of state update repeated forever.
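A short sketch of that exercise on a made-up series, comparing a hand-written loop with pandas. It assumes `adjust=False`, so that pandas uses the plain recurrence seeded with the first observation; the column names are illustrative.

```python
import pandas as pd

values = [30.0, 17.0, 22.0, 25.0, 19.0]   # made-up series
beta = 0.9
alpha = 1 - beta

# Manual recurrence, seeded with v_0 = x_0 to match pandas' adjust=False convention.
manual = [values[0]]
for x in values[1:]:
    manual.append(beta * manual[-1] + (1 - beta) * x)

df = pd.DataFrame({"value": values})
df["ema"] = df["value"].ewm(alpha=alpha, adjust=False).mean()
df["manual"] = manual

print(df)
```

The `ema` and `manual` columns should agree up to floating-point rounding. Dropping `adjust=False` changes the early values, because the default mode divides by the sum of the weights seen so far.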