# Week 2 — Time Series Properties & Stationarity

> *"Your model's biggest assumption is that the future looks like the past. This week, we learn when that's true, when it's a lie, and what to do about it."*

In 2018, a quant at a mid-tier hedge fund showed me his LSTM for predicting S&P 500 returns. Beautiful architecture — three layers, attention mechanism, trained on five years of minute-by-minute data. Out-of-sample R-squared: 0.001. Essentially random. He asked me what went wrong, and I asked him one question: "Did you check if your input series was stationary?" He hadn't. He'd fed raw prices into the LSTM and asked it to learn that $250 and $400 are the same stock at different times — the model was spending all its capacity learning the trend, the easy and meaningless part, and had nothing left for the signal. He re-ran it on returns, and the R-squared went up to 0.03. Not impressive by ML standards, but in a universe of 500 stocks rebalanced monthly, an R-squared of 0.03 is a career. The difference between 0.001 and 0.03 was a single line of code: `.pct_change()`. This week is about understanding why that line matters so much — and whether we can do even better.

Last week, we discovered that raw stock prices are non-stationary — the distribution shifts over time, so a model trained on 2015 data is extrapolating on 2024 data. The obvious fix is returns: take the first difference, and you get a roughly stationary series. Problem solved, right? Not quite. Here's the dilemma that will define this entire week: when you take returns (first-differencing, $d=1$), you get stationarity but you throw away all the memory in the price series. The autocorrelation drops to near zero. Your model can't learn that Apple has been trending upward for six months, because returns are memoryless — each one is just a percentage change from yesterday. On the other hand, if you keep raw prices ($d=0$), you preserve all the memory but the series is non-stationary and your model is extrapolating. This is the **integer differentiation problem**, and it's been hiding in plain sight since the invention of ARIMA in the 1970s.

Marcos Lopez de Prado proposed an elegant solution in Chapter 5 of *Advances in Financial Machine Learning*: instead of taking the 0th derivative (prices) or the 1st derivative (returns), take the 0.4th derivative. Find the minimum amount of differencing needed to make the series stationary, and stop there — preserving as much memory as possible while eliminating non-stationarity. But fractional differentiation is only part of this week's story. We also need to understand the classical time series toolkit — ARIMA and especially GARCH. Not because you'll use ARIMA as your final model (you won't), but because these models are the baselines your neural networks are competing against. And in the case of GARCH(1,1), a model from 1986 with exactly three parameters that remains competitive with LSTMs for volatility forecasting, the baseline is genuinely hard to beat.

In [None]:
import subprocess, sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q',
    'yfinance', 'matplotlib', 'statsmodels', 'arch', 'pmdarima', 'scipy'])

import warnings; warnings.filterwarnings('ignore')
import numpy as np, pandas as pd, matplotlib.pyplot as plt, yfinance as yf
from scipy import stats
from statsmodels.tsa.stattools import adfuller, kpss, acf
from statsmodels.graphics.tsaplots import plot_acf
from IPython.display import display, Markdown

plt.rcParams.update({'figure.figsize': (12, 5), 'font.size': 11,
                      'axes.grid': True, 'grid.alpha': 0.3, 'figure.dpi': 100})

We've loaded everything we need. The `arch` library handles GARCH models, `pmdarima` automates ARIMA selection, and `statsmodels` gives us the stationarity tests (ADF and KPSS) that form the backbone of this week's analysis. Now let's get our hands on real data. We'll use SPY — the S&P 500 ETF and the most widely traded security in the world — as our running example throughout the lecture. Every concept gets demonstrated on SPY first, then extended to other assets.

In [None]:
spy = yf.download('SPY', start='2010-01-01', end='2024-12-31', auto_adjust=True)
spy_close = spy['Close'].squeeze()
spy_log_prices = np.log(spy_close)
spy_log_returns = np.log(spy_close / spy_close.shift(1)).dropna()

display(Markdown(
    f"**SPY data:** {len(spy_close):,} trading days, "
    f"{spy_close.index[0].date()} to {spy_close.index[-1].date()}. "
    f"Price range: ${spy_close.min():.2f} – ${spy_close.max():.2f}"
))

That's roughly 15 years of daily data — enough to include the 2011 debt ceiling crisis, the 2015 China scare, the 2018 Volmageddon, the March 2020 COVID crash, and the 2022 bear market. We've pre-computed log prices and log returns so we can focus on the concepts rather than the plumbing. Now let's see why stationarity is the first thing you should check before feeding any time series into an ML model.

---
## 1. Why Stationarity Matters for ML

Here's a question that seems too simple to be interesting: *"Is this time series stationary?"* In ML, you'd barely think about this — your training set and test set come from the same distribution, and that's the whole point of the train/test split. In finance, that assumption is violated every day. Apple's mean return in 2020 was about 0.35% per day (it doubled during COVID). In 2022, it was about −0.10% per day (the tech crash). Same stock, same model, fundamentally different data-generating process. If you trained on 2020 data, your model learned that Apple goes up. If you tested on 2022 data, your model was confidently wrong.

A time series $\{y_t\}$ is **weakly stationary** if three conditions hold: constant mean ($E[y_t] = \mu$ for all $t$), constant variance ($\text{Var}(y_t) = \sigma^2$ for all $t$), and autocovariance that depends only on lag ($\text{Cov}(y_t, y_{t-k}) = \gamma_k$ for all $t$). In plain English: the statistics of the series don't change over time. The distribution in January looks the same as in July. Raw stock prices violate all three conditions — they trend, their variance changes, and their correlation structure shifts across regimes.
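To make the constant-variance condition concrete, here is a small simulation sketch on synthetic data (not SPY). The seed, path count, and dates are arbitrary assumptions. It measures the variance across many simulated paths at an early and a late date: for white noise the ratio stays near 1, while for a random walk the variance grows in proportion to time.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
n_paths, n_steps = 2000, 500            # illustrative sizes, not from the lecture

shocks = rng.standard_normal((n_paths, n_steps))
white_noise = shocks                     # stationary: y_t = eps_t
random_walk = shocks.cumsum(axis=1)      # non-stationary: y_t = y_{t-1} + eps_t

def var_ratio(paths, early=99, late=399):
    """Variance across paths at a late date divided by variance at an early date."""
    return paths[:, late].var() / paths[:, early].var()

wn_ratio = var_ratio(white_noise)        # near 1: Var(y_t) does not depend on t
rw_ratio = var_ratio(random_walk)        # near 400 / 100 = 4: Var(y_t) grows like t

print(f"white noise variance ratio: {wn_ratio:.2f}")
print(f"random walk variance ratio: {rw_ratio:.2f}")
```

A single realized price path only ever shows you one row of that matrix, which is why we need formal tests rather than eyeballing.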

Let's prove this with the most widely used stationarity test: the **Augmented Dickey-Fuller (ADF) test**. It checks whether the coefficient $\phi$ in this regression is zero:

$$\Delta y_t = \alpha + \underbrace{\phi \cdot y_{t-1}}_{\text{key coefficient}} + \sum_{i=1}^{p} \beta_i \Delta y_{t-i} + \epsilon_t$$

If $\phi = 0$, the series is a random walk (non-stationary). If $\phi < 0$, the series is mean-reverting (stationary). The null hypothesis is non-stationarity, so a small p-value means we **reject** it — the series IS stationary.
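If you want to see the regression behind the test, here is a stripped-down sketch using only NumPy. It is the plain Dickey-Fuller case ($p = 0$, no lagged differences), so it will not match `adfuller` with `autolag='AIC'` exactly; the function name and the simulated inputs are assumptions made for illustration.

```python
import numpy as np

def dickey_fuller_tstat(y):
    """OLS of dy_t on [1, y_{t-1}] (no lagged differences); returns (phi_hat, t_stat)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                   # dy_t = y_t - y_{t-1}
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # intercept and lagged level
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance of the coefficients
    return coef[1], coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)                        # arbitrary seed
eps = rng.standard_normal(2000)
phi_wn, t_wn = dickey_fuller_tstat(eps)               # white noise: phi near -1
phi_rw, t_rw = dickey_fuller_tstat(eps.cumsum())      # random walk: phi near 0

# The t-statistic follows the Dickey-Fuller distribution, not Student's t;
# the asymptotic 5% critical value with an intercept is about -2.86.
print(f"white noise: phi={phi_wn:.3f}, t={t_wn:.1f}")
print(f"random walk: phi={phi_rw:.4f}, t={t_rw:.2f}")
```

The point to take away is that the test statistic is an ordinary regression t-ratio; only the table of critical values is unusual.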

We're about to run the ADF test on raw SPY prices. If you've been following the logic, you already know what the answer will be — prices trend upward, so they can't be stationary. But it's one thing to believe it and another to see the test confirm it with a p-value. Watch how dramatically the result changes when we switch from prices to returns.

In [None]:
adf_prices = adfuller(spy_close, autolag='AIC')
adf_returns = adfuller(spy_log_returns, autolag='AIC')

results = pd.DataFrame({
    'Series': ['SPY Prices (d=0)', 'SPY Log Returns (d=1)'],
    'ADF Statistic': [adf_prices[0], adf_returns[0]],
    'p-value': [adf_prices[1], adf_returns[1]],
    'Verdict': ['Non-stationary' if adf_prices[1] > 0.05 else 'Stationary',
                'Non-stationary' if adf_returns[1] > 0.05 else 'Stationary']
}).set_index('Series')
display(results.style.format({'ADF Statistic': '{:.3f}', 'p-value': '{:.4f}'}))

The contrast is stark. Raw prices have an ADF p-value close to 1.0: the test can't reject non-stationarity at any reasonable significance level. Log returns have a p-value near zero, so the unit-root null is rejected decisively. A single line of code (`.pct_change()`, or taking log differences, which is nearly identical at daily frequency) transforms the series from ML-hostile to ML-ready. But there's a cost we haven't discussed yet: the returns series has almost no memory. Each day's return is essentially uncorrelated with the last. We've solved the stationarity problem by destroying all the information about trends and momentum.
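Simple returns and log returns are close but not identical, and it helps to see how close. The sketch below uses four made-up prices (an assumption, chosen to include one large move) and compares the two definitions.

```python
import numpy as np

prices = np.array([100.0, 101.0, 99.0, 110.0])     # made-up prices for illustration

simple = prices[1:] / prices[:-1] - 1               # what pandas .pct_change() computes
log_ret = np.diff(np.log(prices))                   # log differences

gap = simple - log_ret                              # roughly simple**2 / 2 for small moves
for s, l, g in zip(simple, log_ret, gap):
    print(f"simple={s:+.4%}  log={l:+.4%}  gap={g:.6f}")

# Log returns add up across time; simple returns do not.
total_log = log_ret.sum()                           # equals log(110 / 100)
```

For a typical 1% daily move the gap is a few thousandths of a percent, which is why the two are often treated as interchangeable at daily frequency; for large moves they diverge.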

Let's visualize this tradeoff. Three panels: raw prices (non-stationary, full memory), log prices (non-stationary, full memory), and log returns (stationary, no memory). If stationarity is satisfied, the series should look like it oscillates around a constant level with constant spread.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
axes[0].plot(spy_close, lw=0.8, color='steelblue')
axes[0].set_title('Raw Prices (d=0) — Non-stationary'); axes[0].set_ylabel('Price ($)')
axes[1].plot(spy_log_prices, lw=0.8, color='darkorange')
axes[1].set_title('Log Prices — Non-stationary'); axes[1].set_ylabel('Log(Price)')
axes[2].plot(spy_log_returns, lw=0.5, color='green', alpha=0.7)
axes[2].set_title('Log Returns (d=1) — Stationary'); axes[2].set_ylabel('Log Return')
axes[2].axhline(0, color='black', lw=0.5, ls='--')
for ax in axes: ax.tick_params(axis='x', rotation=30)
plt.suptitle('SPY: The Stationarity Spectrum', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout(); plt.show()

The left and center panels drift relentlessly upward, and your model would spend its entire capacity learning that trend. The right panel oscillates around a constant level near zero: that's what stationarity looks like. (The spread is only roughly constant; the visible bursts of volatility are the subject of Section 5.) But look at the returns panel carefully: there's no visible *pattern* in direction. No momentum, no trend, no memory. Each return is essentially uncorrelated with the last. This is the fundamental tension of financial feature engineering, and we'll resolve it in Section 6.

**The bottom line:** every supervised ML model implicitly assumes that the patterns it learned in training still hold in testing. Non-stationary features violate this assumption. The model isn't wrong — it's answering a different question than you think it is.

---
## 2. Testing for Stationarity: ADF & KPSS

The ADF test isn't the only tool in the box. The **KPSS test** flips the null hypothesis: it *assumes* stationarity and tests against a unit root. This matters because no single test tells the whole story. Using both together creates a 2×2 decision matrix:

| | ADF says stationary | ADF says non-stationary |
|---|---|---|
| **KPSS says stationary** | Both agree: stationary | Gray zone: needs investigation |
| **KPSS says non-stationary** | Trend-stationary | Both agree: non-stationary |

When both tests agree, you can be confident. When they disagree, you've learned something interesting about the structure of your series — it might be trend-stationary (deterministic trend + stationary noise), which requires a different treatment than a unit root process. This distinction matters in practice: detrending is the fix for trend-stationarity, while differencing is the fix for unit roots. Applying the wrong fix introduces artifacts.
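The matrix translates directly into a small helper. This is a sketch: the function name, the labels, and the 5% cutoff are assumptions that simply mirror the table above. Keep in mind that `statsmodels` reports KPSS p-values only within the range 0.01 to 0.10, so a value at either end means "at or beyond" that bound.

```python
def stationarity_verdict(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into one of the four cells of the matrix."""
    adf_stationary = adf_pvalue < alpha       # ADF rejects its unit-root null
    kpss_stationary = kpss_pvalue >= alpha    # KPSS fails to reject its stationarity null
    if adf_stationary and kpss_stationary:
        return "stationary"
    if not adf_stationary and not kpss_stationary:
        return "non-stationary"
    if adf_stationary and not kpss_stationary:
        return "trend-stationary"
    return "gray zone"

print(stationarity_verdict(0.001, 0.10))   # both tests agree the series is stationary
print(stationarity_verdict(0.90, 0.01))    # both tests agree it is non-stationary
```

Treat the two off-diagonal labels as prompts for further investigation rather than final answers.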

Let's run both tests on SPY log returns and see whether they agree.

In [None]:
kpss_stat, kpss_pval, _, _ = kpss(spy_log_returns, regression='c', nlags='auto')
adf_stat, adf_pval = adf_returns[0], adf_returns[1]

both = pd.DataFrame({
    'Test': ['ADF (H0: non-stationary)', 'KPSS (H0: stationary)'],
    'Statistic': [adf_stat, kpss_stat],
    'p-value': [adf_pval, kpss_pval],
    'Reject H0?': ['Yes' if adf_pval < 0.05 else 'No',
                   'Yes' if kpss_pval < 0.05 else 'No'],
    'Implication': ['Series IS stationary' if adf_pval < 0.05 else 'Series is NOT stationary',
                    'Series IS stationary' if kpss_pval > 0.05 else 'Series is NOT stationary']
}).set_index('Test')
display(both.style.format({'Statistic': '{:.3f}', 'p-value': '{:.4f}'}))

Both tests agree: SPY log returns are stationary. ADF rejects its null (non-stationarity) and KPSS fails to reject its null (stationarity). When you get this clean agreement, you can proceed with confidence. But this won't always happen. Rolling volatility, for instance, often produces a split verdict where ADF says stationary but KPSS says non-stationary, the trend-stationary cell of the matrix above. The seminar's Exercise 1 will push you into those off-diagonal cells with five different assets and six different transformations, and you'll discover that the "is it stationary?" question is surprisingly nuanced.

Now that we've confirmed returns are stationary, the natural follow-up question is: are returns *predictable*? If they're stationary *and* autocorrelated, there's exploitable structure. If they're stationary but uncorrelated, your model is searching for signal in what's essentially white noise. Let's find out.

---
## 3. Autocorrelation & What It Tells Us

The efficient market hypothesis, in its weak form, says you can't predict returns from past returns. Here's the weird thing: it's approximately true. Plot the autocorrelation function of daily S&P 500 returns and you'll see nothing. The correlations at every lag are statistically indistinguishable from zero. The market has already priced in whatever information was in yesterday's return.

But now plot the autocorrelation of *absolute* returns — which measure volatility, not direction. Suddenly, you see strong positive autocorrelation out to 60+ trading days. High-volatility days follow high-volatility days. Low-volatility days follow low-volatility days. The market remembers its shocks. Returns are unpredictable, but the *size* of returns is highly predictable. That asymmetry is the foundation of the entire volatility forecasting industry.

The **autocorrelation function (ACF)** at lag $k$ measures the correlation between $y_t$ and $y_{t-k}$:

$$\rho_k = \frac{\text{Cov}(y_t, y_{t-k})}{\text{Var}(y_t)} = \frac{\gamma_k}{\gamma_0}$$

For returns, $\rho_k \approx 0$ for all $k > 0$. For absolute returns (a proxy for volatility), $\rho_k > 0$ and decays slowly, often remaining significant for $k > 50$ trading days. This is **volatility clustering** — large moves tend to follow large moves, regardless of direction.
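The formula is short enough to implement directly. The sketch below computes the sample ACF with NumPy and applies it to a toy series whose volatility alternates between calm and turbulent blocks. The regime levels, block length, and seed are assumptions chosen for illustration; nothing here is estimated from market data.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_k = gamma_k / gamma_0 for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    gamma0 = x @ x / len(x)
    return np.array([(x[k:] @ x[:len(x) - k]) / len(x) / gamma0
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(2)                       # arbitrary seed
n, block = 20_000, 50
# Toy volatility regimes: alternate calm (0.5) and turbulent (2.0) blocks of 50 days.
sigma = np.where((np.arange(n) // block) % 2 == 0, 0.5, 2.0)
returns = sigma * rng.standard_normal(n)             # zero mean, sign is pure noise

acf_ret = sample_acf(returns, 5)
acf_abs = sample_acf(np.abs(returns), 5)
print("returns  :", np.round(acf_ret[1:], 3))        # all close to zero
print("|returns|:", np.round(acf_abs[1:], 3))        # clearly positive
```

Even in this crude toy, the direction of each move is unpredictable while its size is not, which is the same asymmetry the SPY plots show.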

We're about to plot two ACF charts side by side. On the left, the ACF of daily SPY returns. On the right, the ACF of *absolute* daily SPY returns. If markets were perfectly efficient and volatility were constant, both would show bars within the confidence bands at every lag. The left one will look that way. The right one absolutely will not — and the gap between them is where the money is.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
plot_acf(spy_log_returns, lags=60, ax=axes[0], alpha=0.05,
         title='ACF of Daily SPY Returns')
axes[0].set_xlabel('Lag (trading days)'); axes[0].set_ylabel('Autocorrelation')
plot_acf(spy_log_returns.abs(), lags=60, ax=axes[1], alpha=0.05,
         title='ACF of |Daily SPY Returns| (Volatility Proxy)')
axes[1].set_xlabel('Lag (trading days)'); axes[1].set_ylabel('Autocorrelation')
plt.tight_layout(); plt.show()

Look at the contrast. On the left, almost every bar falls inside the shaded confidence band — returns have essentially zero autocorrelation. The market is efficient enough that yesterday's return tells you almost nothing about today's direction.

> **Did You Know?** Eugene Fama won the Nobel Prize in 2013 for formalizing the efficient market hypothesis. Robert Shiller won the *same* Nobel Prize for showing that markets are *not* efficient — they exhibit predictable bubbles and crashes. The Nobel committee gave the prize to both, essentially saying: "We don't know who's right." The practical truth: markets are efficient enough that daily return autocorrelation is near zero, but inefficient enough that a 0.03 information coefficient can sustain a career. Your models live in that narrow gap.

On the right, the absolute returns tell a completely different story. Strong positive autocorrelation persists for months. A bad day on Wall Street makes tomorrow more volatile too, even if tomorrow happens to be an up day. This is why we'll spend Week 8 forecasting volatility, not returns — volatility is genuinely predictable. Returns are barely predictable. The entire options market is built on volatility forecasts. And the baseline for volatility forecasting — GARCH(1,1) — is just a model that formalizes what the right-hand ACF plot is showing you.

---
## 4. Classical Time Series Models — ARIMA

ARIMA was published by Box and Jenkins in 1970. It's older than the personal computer. And yet, in the M4 forecasting competition (2018), simple statistical models like ARIMA and exponential smoothing were competitive with neural networks across 100,000 time series. Not because ARIMA is brilliant, but because neural networks are bad at learning from short, noisy, non-stationary sequences — which is exactly what financial time series are. Know your enemy.

An AR(1) model is just a linear regression of the series on its own past:

$$y_t = c + \phi_1 y_{t-1} + \epsilon_t$$

If $|\phi_1| < 1$, the series is stationary and mean-reverts. If $\phi_1 = 1$, you have a random walk. The full ARIMA(p,d,q) model applies $d$ integer differences first, then fits an ARMA(p,q). The crucial limitation: $d$ is always an integer — 0 or 1 (sometimes 2). This is the integer differentiation problem: $d=0$ leaves non-stationarity, $d=1$ throws away all memory. What if $d$ could be 0.4? Hold that thought for Section 6.
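To see that an AR(1) really is "just a linear regression on its own past," here is a sketch that simulates one and recovers the coefficients by ordinary least squares. The parameter values ($c = 1$, $\phi_1 = 0.6$), the seed, and the helper names are assumptions for illustration.

```python
import numpy as np

def simulate_ar1(c, phi, n, rng):
    """Generate y_t = c + phi * y_{t-1} + eps_t with standard normal shocks."""
    y = np.empty(n)
    y[0] = c / (1 - phi) if abs(phi) < 1 else 0.0   # start at the long-run mean if it exists
    eps = rng.standard_normal(n)
    for t in range(1, n):
        y[t] = c + phi * y[t - 1] + eps[t]
    return y

def fit_ar1(y):
    """Estimate (c, phi) by regressing y_t on [1, y_{t-1}]."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return coef[0], coef[1]

rng = np.random.default_rng(3)                      # arbitrary seed
y = simulate_ar1(c=1.0, phi=0.6, n=10_000, rng=rng)
c_hat, phi_hat = fit_ar1(y)
long_run_mean = c_hat / (1 - phi_hat)               # theoretical value: 1 / (1 - 0.6) = 2.5
print(f"phi_hat={phi_hat:.3f}, implied long-run mean={long_run_mean:.2f}")
```

With $|\phi_1| < 1$ the series keeps getting pulled back toward $c / (1 - \phi_1)$; set `phi=1.0` and that anchor disappears.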

Let's see what ARIMA makes of SPY returns. We'll use `auto_arima` from `pmdarima` to find the best (p,d,q) automatically via AIC. The result is usually humbling.

In [None]:
from pmdarima import auto_arima

auto_model = auto_arima(
    spy_log_returns.values, start_p=0, max_p=5, start_q=0, max_q=5,
    d=0, seasonal=False, information_criterion='aic',
    suppress_warnings=True, stepwise=True
)
display(Markdown(
    f"**Best model:** ARIMA{auto_model.order}  \n"
    f"**AIC:** {auto_model.aic():.2f}"
))

The best model is likely something very simple — perhaps ARIMA(1,0,1) or even ARIMA(0,0,0), which literally means "the best prediction is the historical mean." This is not a failure of the methodology; it's an honest reflection of how little structure daily returns contain. The ACF already told us this: returns are barely autocorrelated. ARIMA is just confirming it formally.

But let's look at the *residuals* from this model. If ARIMA captured all the structure, the residuals should be white noise — no autocorrelation in either the residuals or their squares. The first condition will hold. The second won't, and that failure is exactly where GARCH enters the picture.

In [None]:
from statsmodels.tsa.arima.model import ARIMA

# Fixed ARIMA(1,0,1) for illustration; pass auto_model.order to use the selected order instead
fitted = ARIMA(spy_log_returns.values, order=(1, 0, 1)).fit()
residuals = fitted.resid

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
plot_acf(residuals, lags=30, ax=axes[0], title='ACF of ARIMA Residuals')
plot_acf(residuals**2, lags=30, ax=axes[1], title='ACF of Squared Residuals')
plt.tight_layout(); plt.show()

Look at the right panel. The residuals themselves have no autocorrelation (left — good), but their *squares* do (right — bad). ARIMA captured the mean dynamics (trivial, since returns have near-zero mean) but completely missed the variance dynamics. The squared residuals show strong autocorrelation, meaning the variance of residuals is not constant over time. This is the **ARCH effect** — heteroskedasticity that changes with past shocks. ARIMA says "I've extracted all the signal." The squared residuals say "You missed the most interesting part."
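You can put a number on "the squares are autocorrelated" with a Ljung-Box statistic computed on the squared series. The sketch below is a hand-rolled version on simulated data: the GARCH parameters used to generate clustered residuals are illustrative assumptions, and the chi-square critical value is hard-coded to avoid extra dependencies. When the test is applied to residuals of a fitted ARMA model, the degrees of freedom are usually adjusted; that refinement is skipped here.

```python
import numpy as np

def ljung_box_q(x, lags=10):
    """Ljung-Box Q statistic: n(n+2) * sum_k rho_k^2 / (n-k) over k = 1..lags."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    gamma0 = x @ x
    rho = np.array([(x[k:] @ x[:-k]) / gamma0 for k in range(1, lags + 1)])
    return n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, lags + 1)))

rng = np.random.default_rng(4)                  # arbitrary seed
n = 5000
z = rng.standard_normal(n)

# Simulated residuals with ARCH effects (GARCH(1,1) with illustrative parameters).
omega, alpha, beta = 0.05, 0.10, 0.85
resid = np.empty(n)
var = omega / (1 - alpha - beta)                # start at the long-run variance
for t in range(n):
    resid[t] = np.sqrt(var) * z[t]
    var = omega + alpha * resid[t]**2 + beta * var

CHI2_95_DF10 = 18.307                           # 5% critical value of chi-square(10)
q_iid = ljung_box_q(z**2)                       # squares of i.i.d. noise
q_arch = ljung_box_q(resid**2)                  # squares of clustered residuals
print(f"Q(iid^2)={q_iid:.1f}, Q(arch^2)={q_arch:.1f}, critical={CHI2_95_DF10}")
```

A Q statistic far above the critical value on the squared series is the formal version of what the right-hand ACF panel shows.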

> **Did You Know?** In the M4 forecasting competition (2018), which tested 60 methods on 100,000 time series, the winner was a hybrid of exponential smoothing and a neural network. Pure ML methods — standalone LSTMs, CNNs — generally underperformed simple statistical methods. This was a shock to the ML community and led to significant soul-searching about whether neural networks are actually good at time series forecasting. The lesson: don't underestimate simple baselines.

If ARIMA(0,0,0) — literally "the best prediction is the mean" — is approximately the best model for daily returns, then your LSTM needs to find patterns that this trivial model can't. An R-squared of 0.03 isn't a failure — it's excellent. Calibrate your expectations accordingly.

---
## 5. Volatility Clustering & GARCH(1,1)

On February 5, 2018 — a date that traders call *Volmageddon* — the VIX doubled from about 17 to 37 in a single day. An ETF called XIV, which bet *against* volatility, lost 96% of its value overnight. It had $1.9 billion in assets that morning. Credit Suisse terminated the note within the week. Thousands of retail investors lost their life savings. The product had returned roughly 400% over five years before it blew up.

The people who bought XIV thought volatility was low and would stay low. GARCH(1,1), a model you can fit in three lines of Python, would have told them that volatility persistence ($\alpha + \beta$) was around 0.98. That means about 98% of any gap between today's variance and its long-run level is still there tomorrow. Low vol tends to stay low, yes. But when it spikes, the spike persists too. The XIV investors learned this lesson at a cost of $1.9 billion.

GARCH(1,1) says: tomorrow's variance is a weighted average of three things:

$$\sigma_t^2 = \underbrace{\omega}_{\text{long-run baseline}} + \underbrace{\alpha \cdot r_{t-1}^2}_{\text{yesterday's shock}} + \underbrace{\beta \cdot \sigma_{t-1}^2}_{\text{yesterday's variance}}$$

Read that left to right: $\omega$ anchors the variance to its long-run average, which works out to $\omega / (1 - \alpha - \beta)$. $\alpha$ reacts to yesterday's surprise: a big return means a big increase in variance. $\beta$ creates persistence: yesterday's variance carries over. For the S&P 500, typical values are $\alpha \approx 0.05$, $\beta \approx 0.93$, so $\alpha + \beta \approx 0.98$. Volatility is incredibly sticky.
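Before fitting anything, it helps to run the recursion by hand. The sketch below plugs in the typical values quoted above, plus an assumed $\omega = 0.02$ (in squared-percent units) chosen so that the long-run daily volatility comes out at 1%. It then traces how the variance forecast decays after a single large shock. The function names are hypothetical.

```python
import numpy as np

omega, alpha, beta = 0.02, 0.05, 0.93          # illustrative values in %^2 units, not fitted

long_run_var = omega / (1 - alpha - beta)      # unconditional variance: 0.02 / 0.02 = 1.0

def next_variance(r_prev, var_prev):
    """One step of the GARCH(1,1) recursion."""
    return omega + alpha * r_prev**2 + beta * var_prev

def forecast_variance(var_next, horizon):
    """Expected variance h steps ahead, given the one-step-ahead variance."""
    h = np.arange(horizon)
    return long_run_var + (alpha + beta) ** h * (var_next - long_run_var)

# A calm day (variance at its long-run level) is hit by a -4% return.
var_after_shock = next_variance(r_prev=-4.0, var_prev=long_run_var)
path = forecast_variance(var_after_shock, horizon=250)
print(f"long-run daily vol: {np.sqrt(long_run_var):.2f}%")
print(f"vol the day after the shock: {np.sqrt(var_after_shock):.2f}%")
print(f"forecast vol 250 days out: {np.sqrt(path[-1]):.2f}%")
```

The gap between the forecast and the long-run level shrinks by a factor of $\alpha + \beta$ each day, which is exactly what "persistence" means.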

Let's fit GARCH(1,1) to SPY and extract those three parameters. We pass returns in percentage terms (multiplied by 100) because the `arch` optimizer is better behaved on data of that scale, and the resulting volatilities read directly as percent per day. We'll fit the model and then inspect what it learned about the structure of volatility.

In [None]:
from arch import arch_model

garch = arch_model(spy_log_returns * 100, vol='GARCH', p=1, q=1, mean='Constant')
garch_result = garch.fit(disp='off')

omega = garch_result.params['omega']
alpha = garch_result.params['alpha[1]']
beta = garch_result.params['beta[1]']
persistence = alpha + beta
half_life = np.log(0.5) / np.log(persistence) if persistence < 1 else np.inf

display(Markdown(
    f"| Parameter | Value |\n|---|---|\n"
    f"| $\\omega$ (baseline) | {omega:.6f} |\n"
    f"| $\\alpha$ (shock reaction) | {alpha:.4f} |\n"
    f"| $\\beta$ (persistence) | {beta:.4f} |\n"
    f"| $\\alpha + \\beta$ | {persistence:.4f} |\n"
    f"| Half-life of vol shock | {half_life:.0f} days |\n"
))

That half-life number is worth staring at. If persistence is around 0.98, the half-life of a volatility shock is roughly 34 days. If the market panics on Monday, you'll still feel half of that panic over a month later. If persistence were 0.90, the half-life would be about 7 days. The difference between 0.98 and 0.90 is the difference between a market that holds grudges and one that forgives quickly.
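The arithmetic behind those numbers is a one-liner: a shock's effect on variance decays like $(\alpha + \beta)^h$, so the half-life solves $(\alpha + \beta)^h = 0.5$. A minimal sketch, with a hypothetical helper name:

```python
import math

def vol_half_life(persistence):
    """Days for a variance shock to decay to half its size: (alpha+beta)^h = 0.5."""
    if not 0 < persistence < 1:
        return math.inf                      # shocks never die out (or the model is degenerate)
    return math.log(0.5) / math.log(persistence)

for p in (0.90, 0.95, 0.98, 0.99):
    print(f"persistence={p:.2f} -> half-life ~ {vol_half_life(p):.1f} trading days")
```

Notice how nonlinear the relationship is: moving persistence from 0.98 to 0.99 roughly doubles the half-life.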

> **Did You Know?** Robert Engle won the Nobel Prize in Economics in 2003 for developing ARCH in 1982. His student Tim Bollerslev extended it to GARCH in 1986. Between them, they provided the foundational tools for understanding financial time series volatility. Three parameters, four decades of dominance — GARCH(1,1) is still used daily at every major bank and hedge fund.

Now let's see GARCH in action. We'll plot its conditional volatility estimate against a simple rolling realized volatility. If GARCH is doing its job, the two should track each other — especially during crisis periods.

In [None]:
cond_vol = garch_result.conditional_volatility
realized_vol = spy_log_returns.rolling(20).std() * 100

fig, ax = plt.subplots(figsize=(16, 6))
ax.plot(realized_vol.index, realized_vol, lw=0.7, alpha=0.5,
        label='Realized Vol (20-day rolling)', color='steelblue')
ax.plot(cond_vol.index, cond_vol, lw=0.8, alpha=0.9,
        label='GARCH(1,1) Conditional Vol', color='red')
ax.set_title('GARCH(1,1) vs. Realized Volatility (SPY)', fontsize=13)
ax.set_ylabel('Daily Volatility (%)'); ax.legend(fontsize=11)
ax.annotate('COVID Crash', xy=(pd.Timestamp('2020-03-16'), 6),
            fontsize=9, color='darkred')
plt.tight_layout(); plt.show()

Three parameters. One equation. And it captures every major volatility episode of the last 15 years: the 2011 debt ceiling crisis, the 2015 China scare, the 2018 Volmageddon, the 2020 COVID crash, the 2022 bear market. GARCH isn't predicting *what* will happen — it's predicting *how big* the moves will be. And it does this remarkably well. Notice how the red GARCH line actually *leads* the blue realized volatility line during crises — GARCH reacts immediately to a shock (through the $\alpha$ term), while the rolling window takes 20 days to fully incorporate it.

GARCH(1,1) is the only GARCH variant you need. GARCH(2,1), EGARCH, GJR-GARCH — they exist, people publish papers about them, and in practice they barely beat (1,1). We'll test this claim ourselves in the seminar (Exercise 4). GARCH is also the baseline your LSTM will compete against in Week 8. Spoiler: beating it is harder than you think.

---
## 6. The Integer Differentiation Problem & FFD

Here's the core dilemma of financial time series, and Lopez de Prado calls it the single most important concept in his book. When you take returns ($d=1$), you get stationarity but you throw away all memory. The autocorrelation drops to zero — your model doesn't know that Apple has been trending up for six months, because returns are memoryless. When you keep prices ($d=0$), you preserve all memory but the series is non-stationary and your model is extrapolating. This is a lose-lose tradeoff... unless you realize that $d$ doesn't have to be an integer.

If fractional differentiation feels weird right now, good — it *is* weird. The idea that you can take the 0.4th derivative of a time series sounds like something a mathematician made up to win an argument. But it works, and we're about to prove it with code.

Fractional differentiation extends the binomial series to non-integer $d$:

$$(1 - B)^d = \sum_{k=0}^{\infty} \binom{d}{k} (-B)^k$$

where $B$ is the backshift operator ($B^k x_t = x_{t-k}$). The weights for each lag are computed recursively:

$$w_0 = 1, \quad w_k = -w_{k-1} \cdot \frac{d - k + 1}{k}$$

For $d=1$, this gives exactly two nonzero weights: $[1, -1]$, the standard first difference $x_t - x_{t-1}$. For $d=0.5$, you get a long, slowly decaying tail of weights that reaches back hundreds of periods before it becomes negligible. That tail is the memory being preserved. The **Fixed-Width Window Fracdiff (FFD)** method from Lopez de Prado truncates the infinite sum at the point where $|w_k| < \tau$ (a small threshold like $10^{-5}$), making it computationally practical.
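As a quick sanity check on the recursion, here are the first few weights worked by hand for $d = 0.5$ and $d = 1$ (plain arithmetic from the formula above, nothing beyond it):

$$\begin{aligned}
d = 0.5:\quad & w_0 = 1, \quad w_1 = -1 \cdot \tfrac{0.5}{1} = -0.5, \quad w_2 = 0.5 \cdot \tfrac{-0.5}{2} = -0.125, \quad w_3 = 0.125 \cdot \tfrac{-1.5}{3} = -0.0625 \\
d = 1:\quad & w_0 = 1, \quad w_1 = -1 \cdot \tfrac{1}{1} = -1, \quad w_2 = 1 \cdot \tfrac{0}{2} = 0, \quad w_k = 0 \text{ for all } k \ge 2
\end{aligned}$$

For $d = 0.5$ every weight after $w_0$ is negative and shrinks slowly; for $d = 1$ the recursion hits zero at $k = 2$ and stays there.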

Let's build the weight computation from scratch. This is the core of fractional differentiation — everything else is just applying these weights to price data. We'll start by computing the weights and verifying that $d=1$ gives us the familiar first difference.

In [None]:
def get_weights_ffd(d, threshold=1e-5):
    """Compute FFD weights for a given d (oldest-first for convolution)."""
    w = [1.0]
    k = 1
    while True:
        w_next = -w[-1] * (d - k + 1) / k
        if abs(w_next) < threshold:  # stop before keeping a negligible weight
            break
        w.append(w_next)
        k += 1
    return np.array(w[::-1])

display(Markdown(
    f"**d=1.0 weights:** `{get_weights_ffd(1.0)}`  \n"
    f"Just $x_t - x_{{t-1}}$ — the first difference. No memory at all.\n\n"
    f"**d=0.5:** {len(get_weights_ffd(0.5))} weights — "
    f"memory extends {len(get_weights_ffd(0.5))} periods back."
))

Now let's visualize the weights for four different values of $d$. At $d=1.0$, you'll see just two weights — no tail, no memory. As $d$ decreases toward zero, the tail grows longer and the weights decay more slowly. Those long tails are the past values the model gets to "remember" when making predictions — the information that standard returns throw away.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 8))
for ax, d in zip(axes.flat, [0.3, 0.5, 0.7, 1.0]):
    w = get_weights_ffd(d)
    w_plot = w[::-1][:80]  # newest-first, capped at 80 lags
    ax.bar(range(len(w_plot)), w_plot, color='steelblue', alpha=0.7)
    ax.set_title(f'd = {d} ({len(w)} weights)', fontsize=12)
    ax.set_xlabel('Lag'); ax.set_ylabel('Weight')
    ax.axhline(0, color='black', lw=0.5)
plt.suptitle('FFD Weights: How Memory Varies with d', fontsize=14, fontweight='bold')
plt.tight_layout(); plt.show()

See the pattern? At $d=1.0$, there's no tail at all, just "today minus yesterday." All memory is gone. At $d=0.3$, the weights extend far beyond the 80 lags plotted here (the panel title shows the full count), each one contributing a small piece of the past. The lower the $d$, the longer the memory, but the less stationary the result. That tradeoff is the entire story of fractional differentiation.

Now let's apply these weights to actual SPY log prices and see what the fractionally differentiated series looks like at different values of $d$.

In [None]:
def frac_diff_ffd(series, d, threshold=1e-5):
    """Apply FFD fractional differentiation to a pandas Series."""
    w = get_weights_ffd(d, threshold)
    width = len(w)
    out = pd.Series(index=series.index, dtype=float)
    for i in range(width - 1, len(series)):
        out.iloc[i] = np.dot(w, series.iloc[i - width + 1: i + 1].values)
    return out.dropna()

That's the core implementation — compact enough to fit on a napkin, but powerful enough to transform how we prepare features for ML. It applies the weight vector to a rolling window of past prices, producing a new series that's a weighted combination of all the prices it can "see." Let's apply it to SPY log prices at $d = 0.0, 0.3, 0.5, 0.7, 1.0$ and watch how the series transforms from a trending line (prices) to noise around zero (returns), with the sweet spot somewhere in between.
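The explicit loop is easy to read but slow on long series. One possible speed-up, sketched below on synthetic data, is to express the same weighted sum as a convolution. The names `ffd_weights` and `frac_diff_fast` are hypothetical, the block works on plain NumPy arrays rather than a pandas Series, and the looser threshold in the $d = 0.5$ call is an assumption made only to keep the window short.

```python
import numpy as np

def ffd_weights(d, threshold=1e-5):
    """Weights w_0, w_1, ... (newest lag first), dropping those below the threshold."""
    w, k = [1.0], 1
    while True:
        nxt = -w[-1] * (d - k + 1) / k
        if abs(nxt) < threshold:
            break
        w.append(nxt)
        k += 1
    return np.array(w)

def frac_diff_fast(x, d, threshold=1e-5):
    """FFD of a 1-D array; output[i] corresponds to x[i + len(w) - 1]."""
    w = ffd_weights(d, threshold)
    # 'valid' convolution computes sum_k w_k * x_{t-k} wherever the full window exists.
    return np.convolve(np.asarray(x, dtype=float), w, mode="valid")

rng = np.random.default_rng(5)                           # arbitrary seed
log_price = np.cumsum(0.01 * rng.standard_normal(1000))  # synthetic random-walk log prices

as_returns = frac_diff_fast(log_price, d=1.0)            # should match the first difference
half_diff = frac_diff_fast(log_price, d=0.5, threshold=1e-3)
print(len(ffd_weights(1.0)), len(ffd_weights(0.5, 1e-3)), len(half_diff))
```

Checking that $d = 1$ reproduces `np.diff` and $d = 0$ returns the input unchanged is a cheap way to catch off-by-one errors in the window alignment.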

In [None]:
fig, axes = plt.subplots(5, 1, figsize=(14, 14), sharex=True)
colors = ['steelblue', 'darkorange', 'green', 'red', 'purple']

for ax, d, c in zip(axes, [0.0, 0.3, 0.5, 0.7, 1.0], colors):
    s = spy_log_prices if d == 0 else (spy_log_returns if d == 1 else frac_diff_ffd(spy_log_prices, d))
    adf_p = adfuller(s.dropna(), autolag='AIC')[1]
    corr = s.dropna().corr(spy_log_prices.reindex(s.dropna().index))
    label = 'STATIONARY' if adf_p < 0.05 else 'NON-STATIONARY'
    ax.plot(s, lw=0.6, color=c, alpha=0.8)
    ax.set_ylabel(f'd={d}', fontsize=12, fontweight='bold')
    ax.set_title(f'd={d} | ADF p={adf_p:.4f} ({label}) | Corr with prices: {corr:.3f}',
                 fontsize=10, loc='left')
plt.suptitle('The Differentiation Spectrum', fontsize=14, fontweight='bold')
plt.tight_layout(); plt.show()

There it is — the entire story in five panels. As $d$ increases, the ADF p-value drops (the series becomes more stationary) and the correlation with original prices drops (memory is lost). Somewhere around $d = 0.3$–$0.5$, the ADF test starts rejecting non-stationarity, but the correlation with prices is still 0.8 or higher. That's the sweet spot — stationary, but with most of the memory intact.

At $d = 1.0$ (standard returns), the ADF p-value is near zero (very stationary), but the correlation with prices has collapsed to near zero too. You've solved the stationarity problem by destroying all the information. Fractional differentiation finds a less destructive path.

This is the closest thing to a free lunch in financial feature engineering. Instead of feeding your model returns (which have no memory) or prices (which are non-stationary), you feed it fractionally differentiated prices, which are stationary *and* retain memory. In the homework, you'll test whether this actually improves out-of-sample prediction.

---
## 7. Finding $d^*$ — The Minimum Stationary Exponent

Finding $d^*$ is like finding the minimum effective dose of a medication. Too little and the disease (non-stationarity) persists. Too much and you kill the patient (memory). The right dose varies by patient — a boring utility stock needs barely any differencing, while a meme stock might need $d$ close to 1.0.

Formally, the optimization is:

$$d^* = \min \{ d \in [0, 1] : \text{ADF p-value}(\tilde{x}^{(d)}) < 0.05 \}$$

We grid-search over $d$ from 0 to 1 in steps of 0.05, tracking both the ADF p-value and the correlation with the original series $\rho(d) = \text{Corr}(\tilde{x}^{(d)}, x)$. The first $d$ where ADF rejects is our $d^*$.

In [None]:
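def find_d_star(log_prices, d_grid=None):
    """Grid-search for the minimum d where ADF rejects a unit root at the 5% level.

    Returns (d_star, grid_df). Falls back to d* = 1.0 if no grid point rejects.
    """
    if d_grid is None:
        # linspace + round avoids float drift such as 0.15000000000000002
        d_grid = np.round(np.linspace(0.0, 1.0, 21), 2)
    rows = []
    for d in d_grid:
        if d == 0:
            fd = log_prices                    # raw log prices
        elif d >= 1:
            fd = log_prices.diff()             # standard log returns
        else:
            fd = frac_diff_ffd(log_prices, d)  # fractional differencing
        fd = fd.dropna()
        if len(fd) < 100:                      # too few points for a reliable ADF test
            continue
        adf_p = adfuller(fd, autolag='AIC')[1]
        corr = fd.corr(log_prices.reindex(fd.index))
        rows.append({'d': d, 'adf_pval': adf_p, 'corr': corr})
    df = pd.DataFrame(rows)
    stat = df[df['adf_pval'] < 0.05]
    return (stat['d'].min() if len(stat) > 0 else 1.0), df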

Now let's run this grid search on SPY and visualize the tradeoff. We'll plot two curves on the same axes: the ADF p-value (should drop below the 5% line) and the correlation with original prices (should stay as high as possible). The vertical line marking $d^*$ is the sweet spot — the minimum amount of differencing needed to achieve stationarity.

In [None]:
d_star_spy, spy_grid = find_d_star(spy_log_prices)

fig, ax1 = plt.subplots(figsize=(12, 6))
ax1.plot(spy_grid['d'], spy_grid['adf_pval'], 'o-', color='steelblue', ms=5, label='ADF p-value')
ax1.axhline(0.05, color='steelblue', ls='--', alpha=0.5, label='5% threshold')
ax1.set_xlabel('Differentiation order d', fontsize=12)
ax1.set_ylabel('ADF p-value', color='steelblue', fontsize=12)
ax1.set_ylim(-0.02, 1.0)

ax2 = ax1.twinx()
ax2.plot(spy_grid['d'], spy_grid['corr'], 's-', color='darkorange', ms=5, label='Correlation')
ax2.set_ylabel('Corr with original prices', color='darkorange', fontsize=12)
ax1.axvline(d_star_spy, color='red', lw=2, alpha=0.7, label=f'd* = {d_star_spy:.2f}')
lines1, lab1 = ax1.get_legend_handles_labels()
lines2, lab2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, lab1 + lab2, loc='center right')
plt.title(f'Finding d* for SPY: Stationarity vs. Memory', fontsize=14, fontweight='bold')
plt.tight_layout(); plt.show()

The blue curve plunges below the dashed 5% line somewhere around $d = 0.3$–$0.4$. At that point, the orange correlation curve is still well above 0.8 — we've achieved stationarity while preserving most of the memory. If we kept going to $d=1.0$ (standard returns), we'd gain nothing in stationarity (it's already stationary) but we'd lose most of the correlation with prices. Fractional differentiation found the minimum effective dose.

But does $d^*$ vary across stocks? If every stock has the same $d^*$, this is an interesting curiosity. If $d^*$ varies systematically — if boring utilities need barely any differencing while volatile tech stocks need much more — then this becomes a genuine tool for feature engineering. Let's find out.

In [None]:
tickers = ['AAPL', 'JNJ', 'XOM', 'JPM', 'NVDA', 'KO', 'TSLA', 'PG', 'META', 'GS']
multi_data = yf.download(tickers, start='2015-01-01', end='2024-12-31', auto_adjust=True)
close_all = multi_data['Close']

We've downloaded 10 stocks spanning defensive names (JNJ, KO, PG), financials (JPM, GS), energy (XOM), and high-growth tech (AAPL, NVDA, TSLA, META). If the "minimum effective dose" metaphor holds, the defensive names should need less differencing and the volatile tech names should need more. Let's run the grid search on all 10 and compare.

In [None]:
d_star_results = []
for t in tickers:
    lp = np.log(close_all[t].dropna())
    ds, _ = find_d_star(lp)
    fd = lp if ds == 0 else (lp.diff().dropna() if ds >= 1 else frac_diff_ffd(lp, ds))
    corr = fd.dropna().corr(lp.reindex(fd.dropna().index))
    d_star_results.append({'Ticker': t, 'd*': ds, 'Corr at d*': round(corr, 3)})

df_dstar = pd.DataFrame(d_star_results).sort_values('d*')
display(df_dstar.set_index('Ticker'))

The variation is real and economically meaningful. Stable, defensive stocks (JNJ, KO, PG) tend to have lower $d^*$ values: their prices are closer to mean-reverting, so they need less differencing to achieve stationarity. Volatile, high-growth stocks (TSLA, NVDA, META) tend to need more aggressive differencing because their prices trend harder and wander further from any mean. Ten tickers over one sample period is a small sample, so treat the exact ranking as suggestive, but the broad pattern reflects genuine differences in how these assets carry information in their price histories.

Let's visualize this.

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
colors_bar = ['#2196F3' if d < 0.4 else '#FF9800' if d < 0.6 else '#F44336'
              for d in df_dstar['d*']]
bars = ax.bar(df_dstar['Ticker'], df_dstar['d*'], color=colors_bar,
              alpha=0.8, edgecolor='black', lw=0.5)
ax.set_ylabel('Optimal d*', fontsize=12)
ax.set_title('Optimal Fractional Differentiation Order by Stock', fontsize=14, fontweight='bold')
ax.axhline(1.0, color='gray', ls='--', alpha=0.5, label='d=1 (standard returns)')
for bar, d in zip(bars, df_dstar['d*']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{d:.2f}', ha='center', fontsize=10, fontweight='bold')
ax.legend(); plt.tight_layout(); plt.show()

> **Did You Know?** Marcos Lopez de Prado managed up to $13 billion in assets at Guggenheim Partners before becoming head of machine learning at AQR Capital Management. When he published *Advances in Financial Machine Learning* in 2018, the fractional differentiation chapter was among the most debated: some academics argued it was theoretically unsound, but practitioners found it useful. The pragmatic defense is that a methodology is tested not by whether it is theoretically pure, but by whether it makes money.

The optimal $d^*$ is stock-specific. A model that applies $d=1$ (standard returns) to all stocks is leaving information on the table: for a stock where $d^* = 0.25$, full differencing uses four times the differencing order that stationarity requires, and the excess buys nothing except lost memory. In the homework, you'll build a `FractionalDifferentiator` class that finds $d^*$ per stock and uses it as a feature transformation, compatible with sklearn's pipeline framework. Let's sketch what that class looks like.

### Sketching the `FractionalDifferentiator` Class

Here's the interface you'll build in the homework. The class inherits from sklearn's `BaseEstimator` and `TransformerMixin`, which means it plugs directly into `Pipeline` objects and supports `fit`/`transform`/`fit_transform` out of the box. The key design decisions are: `fit()` finds $d^*$ for each column in your feature matrix and stores the results in the fitted attribute `d_stars_`, and `transform()` applies FFD at the fitted $d^*$. Note that `get_params()` returns only the constructor hyperparameters (`threshold`, `significance`, `d_step`); to inspect what the class learned, read `d_stars_` directly.

We're showing the skeleton here — the blueprint, not the implementation. Building the full class with proper edge-case handling, caching, and test coverage is your homework deliverable.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class FractionalDifferentiator(BaseEstimator, TransformerMixin):
    """Sklearn-compatible fractional differentiator.
    
    fit(X):       find d* per column via ADF grid search
    transform(X): apply FFD at fitted d* to each column
    """
    def __init__(self, threshold=1e-5, significance=0.05, d_step=0.05):
        self.threshold = threshold
        self.significance = significance
        self.d_step = d_step
    
    def fit(self, X, y=None):
        self.d_stars_ = {}  # {col_name: d*}
        # TODO (homework): grid search for each column
        return self

That's the blueprint: a constructor, `fit()`, and (once you add it) `transform()`, plus one fitted attribute (`d_stars_`) and a clean interface that any sklearn user will recognize immediately. In the homework, you'll flesh out `fit()` with the grid search logic we built in `find_d_star()`, add `transform()` to apply FFD at each column's fitted $d^*$, and handle edge cases like columns that are already stationary at $d=0$.
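If the sklearn conventions are new to you, here is the same pattern on a deliberately trivial problem, so it doesn't spoil the homework. `ColumnDemeaner` is a hypothetical toy class invented for this illustration. It shows the three things your `FractionalDifferentiator` needs to get right: hyperparameters stored untouched in `__init__`, fitted state created in `fit()` with a trailing underscore, and a `transform()` that refuses to run before fitting.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class ColumnDemeaner(BaseEstimator, TransformerMixin):
    """Toy transformer (illustration only): subtracts each column's training mean."""

    def __init__(self, center=True):
        self.center = center              # hyperparameter: stored as-is, no logic here

    def fit(self, X, y=None):
        # Fitted state is created only in fit() and carries a trailing underscore
        self.means_ = {col: float(X[col].mean()) for col in X.columns}
        return self

    def transform(self, X):
        check_is_fitted(self, 'means_')   # raises NotFittedError if fit() was never called
        if not self.center:
            return X.copy()
        return X - pd.Series(self.means_)  # Series index aligns with DataFrame columns

X_train = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
demeaner = ColumnDemeaner()
X_out = demeaner.fit_transform(X_train)   # fit_transform comes from TransformerMixin

print(demeaner.get_params())   # constructor arguments only
print(demeaner.means_)         # what fit() learned
print(X_out)
```

The analogy to the homework is direct: `means_` plays the role of `d_stars_`, and the per-column subtraction plays the role of applying FFD at each column's fitted $d^*$.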

For now, let's see what the full pipeline would look like in practice using a simple helper function to do the actual computation.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

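# Apply fracdiff at d*, create lagged features, train Ridge
spy_fd = frac_diff_ffd(spy_log_prices, d_star_spy)
df_feat = pd.DataFrame({f'lag_{i}': spy_fd.shift(i) for i in range(1, 6)})
# Target: next-day log return, the same target as the returns baseline,
# so the two out-of-sample R² values measure the same prediction task
df_feat['target'] = spy_log_returns.shift(-1)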
df_feat = df_feat.dropna()

X, y = df_feat.drop('target', axis=1), df_feat['target']
split = int(len(X) * 0.7)

pipe = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=1.0))])
pipe.fit(X.iloc[:split], y.iloc[:split])
y_pred = pipe.predict(X.iloc[split:])
y_test = y.iloc[split:]

We've built a minimal pipeline: fractionally differentiate SPY at $d^*$, create five lagged features, standardize, and predict with Ridge regression. The pipeline is deliberately simple — this isn't about building a great model (that's Weeks 4–5), it's about showing that fractional differentiation slots cleanly into the sklearn workflow you already know. Let's compare its out-of-sample performance against the same pipeline using standard returns ($d=1$).

In [None]:
# Same pipeline on standard returns for comparison
df_ret = pd.DataFrame({f'lag_{i}': spy_log_returns.shift(i) for i in range(1, 6)})
df_ret['target'] = spy_log_returns.shift(-1)
df_ret = df_ret.dropna()
Xr, yr = df_ret.drop('target', axis=1), df_ret['target']
sr = int(len(Xr) * 0.7)

pipe_r = Pipeline([('scaler', StandardScaler()), ('model', Ridge(alpha=1.0))])
pipe_r.fit(Xr.iloc[:sr], yr.iloc[:sr])
ypr = pipe_r.predict(Xr.iloc[sr:])

r2_fd = 1 - np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2)
r2_ret = 1 - np.sum((yr.iloc[sr:] - ypr)**2) / np.sum((yr.iloc[sr:] - yr.iloc[sr:].mean())**2)

display(Markdown(
    f"| Features | Out-of-sample R² |\n|---|---|\n"
    f"| FracDiff (d*={d_star_spy:.2f}) | {r2_fd:.5f} |\n"
    f"| Standard Returns (d=1) | {r2_ret:.5f} |\n"
))

The numbers are small either way — this is daily return prediction, where R-squared values above 0.03 are genuinely impressive. But the comparison is what matters. Fractional differentiation preserves information that standard returns throw away. For some stocks, that extra information translates into measurably better predictions. For others where $d^*$ is close to 1.0, there's essentially no difference — because $d^* \approx 1$ means fractional differentiation is basically just taking returns. The homework will test this systematically across 50 stocks, and you'll discover that the improvement is most pronounced for stocks with low $d^*$ — the ones where standard returns throw away the most memory.
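One detail worth knowing when you read R² values this small: the answer depends on the benchmark in the denominator. The cells above compare each forecast against the test-set mean, which the model could not have known in advance. A common alternative is to benchmark against the training mean or against zero. The helper below is a minimal sketch with a hypothetical name (`oos_r2`) and toy numbers chosen to make the arithmetic easy to follow.

```python
import numpy as np

def oos_r2(y_true, y_pred, benchmark=None):
    """Out-of-sample R² against a constant benchmark forecast.

    benchmark=None uses the test-set mean; pass the training mean or 0.0
    to avoid using information from the test period.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if benchmark is None:
        benchmark = y_true.mean()
    ss_res = np.sum((y_true - y_pred) ** 2)        # model's squared error
    ss_bench = np.sum((y_true - benchmark) ** 2)   # benchmark's squared error
    return 1.0 - ss_res / ss_bench

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_flat = np.full(4, 2.5)                           # a forecast that always predicts 2.5

print(oos_r2(y_true, y_true))                      # perfect forecast
print(oos_r2(y_true, y_flat))                      # no better than the test mean
print(oos_r2(y_true, y_flat, benchmark=0.0))       # flattering, only because the benchmark is poor
```

In the notebook you could call it as `oos_r2(y_test, y_pred, benchmark=y.iloc[:split].mean())` to score the fractional-differentiation pipeline against the training-period mean.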

This pipeline will be the starting point for Week 4, where we add 20+ features and train cross-sectional prediction models. The `FractionalDifferentiator` you build in the homework is a genuine contribution to your ML toolkit — it's not a toy.

---
## Summary

Let's step back and see the full picture.

Last week, you learned that raw prices can't be fed into ML models. This week, you learned that returns — the obvious fix — throw away too much information. The real answer is **fractional differentiation**, which preserves as much memory as possible while achieving stationarity. You also learned that volatility is more predictable than returns (GARCH captures this with three parameters), and that classical time series models, while limited, set the bar your neural networks must clear.

| Concept | Key Insight | Where It Goes Next |
|---|---|---|
| **Stationarity (ADF/KPSS)** | Non-stationary features force your model to extrapolate | Used throughout the course for every new feature |
| **Autocorrelation** | Returns ≈ no autocorrelation; volatility ≈ strong autocorrelation | Motivates GARCH (today) and LSTM vol forecasting (Week 8) |
| **ARIMA** | The best prediction for daily returns is approximately "the mean" | Sets the baseline your neural nets must beat |
| **GARCH(1,1)** | 3 parameters capture volatility clustering remarkably well | Baseline for Week 8; seminar tests GARCH variants |
| **Fractional Differentiation** | $d^*$ preserves memory while achieving stationarity | Homework builds the class; used as features in Weeks 4–5 |
| **Finding $d^*$** | $d^*$ is stock-specific: defensive names tend to need less differencing than high-growth tech | Homework scales to 50 stocks |

Here's the key thread to carry forward: **simple models are hard to beat.** GARCH(1,1) vs. GARCH variants is the first instance of a pattern we'll see repeatedly: Week 5 (trees vs. linear models), Week 8 (GARCH vs. LSTM), Week 9 (XGBoost vs. foundation models). Each time, the simple baseline refuses to die gracefully. Andrew Lo at MIT proposed the *Adaptive Markets Hypothesis*, in which market efficiency is not a fixed state but rises and falls as participants adapt. For our purposes, that means markets are efficient enough that daily return autocorrelation is near zero, but inefficient enough that a 0.03 information coefficient can sustain a career. Your models live in that narrow gap.

### Bridge to Next Week

Next week, we shift from individual stocks to portfolios. We'll learn what "alpha" actually means, why the Sharpe ratio is the metric everyone in finance cares about, and why the "optimal" portfolio from a textbook blows up the moment you use it with real data. We'll also meet the **Fundamental Law of Active Management** — the equation that tells you whether your ML model has any hope of making money, before you ever run a backtest.

---
### Suggested Reading

- **Lopez de Prado, *Advances in Financial Machine Learning*, Chapter 5 (Fractional Differentiation)** — The primary source for the FFD method. Lopez de Prado's prose is dense and academic, but the ideas are genuinely novel. Focus on the intuition and the implementation, not the proofs. This is the chapter that makes the rest of the book possible.

- **Bollerslev, T. (1986), "Generalized Autoregressive Conditional Heteroskedasticity"** — The original GARCH paper. Surprisingly readable for a 1986 econometrics paper. Worth skimming to see how three parameters became the industry standard for four decades.

- **Makridakis, Spiliotis & Assimakopoulos (2018), "The M4 Competition"** — The paper that made the ML community confront the fact that statistical methods beat neural networks on many forecasting tasks. Essential reading for anyone who thinks "deep learning always wins."

- **Engle, R. (1982), "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation"** — The paper that launched volatility modeling and eventually won Engle the Nobel Prize. The title alone tells you what academic writing looked like in 1982.