# Week 8 --- Sequence Models: LSTM/GRU for Volatility

**Quantitative Finance ML Course**

---

## Roadmap

1. Why sequence models for finance
2. RNN to LSTM to GRU: vanishing gradients and gating
3. Volatility forecasting as supervised learning
4. Realized volatility estimators
5. Classical baselines: GARCH(1,1) and HAR
6. LSTM architecture choices
7. Loss functions: MSE and QLIKE
8. Demo: LSTM vs GARCH for SPY volatility

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

np.random.seed(42)
torch.manual_seed(42)

print(f'PyTorch version: {torch.__version__}')

---

## 1. Why Sequence Models for Finance?

Financial time series have **temporal dependencies**:

- **Volatility clustering**: high-vol days follow high-vol days (ARCH effects)
- **Regime persistence**: bull/bear markets persist for months
- **Autocorrelation in volatility**: RV is strongly autocorrelated at multiple horizons
- **Asymmetric response**: negative returns increase vol more than positive returns (leverage effect)
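
A quick way to see volatility clustering is to compare the lag-1 autocorrelation of returns with that of squared returns. The sketch below simulates a GARCH(1,1) series with illustrative parameters; the helper `lag1_autocorr` and the sample size are assumptions made for this example.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D array."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float((x[1:] * x[:-1]).sum() / (x * x).sum())

# Simulate zero-mean GARCH(1,1) returns (illustrative parameters).
rng = np.random.default_rng(0)
omega, alpha, beta = 1e-6, 0.08, 0.90
n = 20_000
r = np.zeros(n)
s2 = np.full(n, omega / (1 - alpha - beta))
for t in range(1, n):
    s2[t] = omega + alpha * r[t - 1] ** 2 + beta * s2[t - 1]
    r[t] = np.sqrt(s2[t]) * rng.standard_normal()

acf_raw = lag1_autocorr(r)       # returns themselves: close to zero
acf_sq = lag1_autocorr(r ** 2)   # squared returns: clearly positive
print(f'lag-1 autocorr of returns:         {acf_raw:.3f}')
print(f'lag-1 autocorr of squared returns: {acf_sq:.3f}')
```

Returns are nearly unpredictable in sign, while their magnitude is not. That gap is what a volatility model exploits.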

Feedforward nets (Week 7) treat each observation independently. Sequence models (RNN/LSTM/GRU) explicitly model the **history** of each time series.

### Key Application: Volatility Forecasting

Volatility forecasting is a natural fit for sequence models because:
1. The target (realized vol) is strongly autocorrelated
2. The features are sequential (lagged returns, lagged vol)
3. Classical models (GARCH) already exploit this temporal structure

---

## 2. RNN to LSTM to GRU

### Vanilla RNN

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b)$$

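**Problem**: vanishing gradients. Backpropagation through time multiplies one Jacobian per step, so the gradient with respect to early inputs typically shrinks exponentially with the number of steps. In practice the signal is often negligible after a few tens of steps, and the network "forgets" early inputs.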
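
The decay can be checked numerically by accumulating the Jacobian $\partial h_t / \partial h_0$ of a small tanh RNN. This is a minimal sketch: the hidden size, the random weights, and the choice to rescale `W_hh` to spectral norm 0.9 are assumptions made so that the decay is visible (with large recurrent weights the gradient can explode instead).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16

# Recurrent weights rescaled to spectral norm 0.9 (assumption for this demo).
W_hh = rng.standard_normal((hidden, hidden))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)
W_xh = rng.standard_normal((hidden, 1)) * 0.5

h = np.zeros(hidden)
J = np.eye(hidden)  # accumulates d h_t / d h_0
norms = []
for t in range(50):
    x_t = rng.standard_normal(1)
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    # Jacobian of one step: diag(1 - h_t^2) @ W_hh
    J = np.diag(1.0 - h ** 2) @ W_hh @ J
    norms.append(np.linalg.norm(J, 2))

for step in (1, 10, 20, 50):
    print(f'step {step:2d}: ||dh_t/dh_0|| = {norms[step - 1]:.2e}')
```

Because the tanh derivative is at most 1, the norm after $t$ steps is bounded by $0.9^t$ here, which is the exponential decay described above.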

### LSTM (Long Short-Term Memory)

Adds a **cell state** $c_t$ and three **gates**:

| Gate | Purpose |
|------|----------|
| Forget gate $f_t$ | How much of the previous cell state to keep |
| Input gate $i_t$ | How much of the new candidate to add |
| Output gate $o_t$ | How much of the cell state to expose as output |

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
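
The six equations map almost line by line onto code. The following is a minimal NumPy sketch of a single LSTM step, not the PyTorch implementation; the function name `lstm_step`, the `params` dictionary layout, and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params[name] = (W, b) with W of shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f = sigmoid(params['f'][0] @ z + params['f'][1])     # forget gate
    i = sigmoid(params['i'][0] @ z + params['i'][1])     # input gate
    c_tilde = np.tanh(params['c'][0] @ z + params['c'][1])  # candidate cell state
    c = f * c_prev + i * c_tilde                         # new cell state
    o = sigmoid(params['o'][0] @ z + params['o'][1])     # output gate
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

# Toy dimensions: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = {k: (rng.standard_normal((n_hid, n_hid + n_in)) * 0.1, np.zeros(n_hid))
          for k in ('f', 'i', 'c', 'o')}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # run over a 5-step sequence
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

The cell state is updated additively (`f * c_prev + i * c_tilde`), which is what lets gradients flow across many steps when the forget gate stays near 1.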

### GRU (Gated Recurrent Unit)

A simplified LSTM with two gates (update $z_t$ and reset $r_t$) and no separate cell state. It has fewer parameters and often performs similarly.

$$z_t = \sigma(W_z [h_{t-1}, x_t])$$
$$r_t = \sigma(W_r [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

**For volatility forecasting, GRU often works as well as LSTM with less computation.**
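
For comparison, here is a NumPy sketch of one GRU step in the same bias-free form as the equations above. The name `gru_step` and the toy dimensions are assumptions; library implementations add biases and may swap the roles of $z_t$ and $1 - z_t$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; each W has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                   # update gate
    r = sigmoid(W_r @ hx)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                 # blend old and new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_z, W_r, W_h = (rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for _ in range(3))

h = np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):
    h = gru_step(x_t, h, W_z, W_r, W_h)

# Weight matrices per cell: 3 for GRU versus 4 for LSTM (f, i, c, o).
gru_weights = 3 * n_hid * (n_hid + n_in)
lstm_weights = 4 * n_hid * (n_hid + n_in)
print(h.shape, gru_weights, lstm_weights)
```

With the same hidden size the GRU carries three quarters of the LSTM's recurrent weights, which is where the computational saving comes from.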

---

## 3. Volatility Forecasting as Supervised Learning

### Problem Setup

Given a history of daily data up to day $t$, predict realized volatility for day $t+1$ (or the next 5 days, etc.).

**Features** (sequence of length $L$):
- Lagged realized volatility: $RV_{t}, RV_{t-1}, \ldots, RV_{t-L+1}$
- Daily returns: $r_t, r_{t-1}, \ldots$
- Volume (optional)

**Target**: $RV_{t+1}$ (next-day realized volatility)

This maps to a standard sequence-to-one regression problem.
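
The windowing step is easy to get wrong by one row, so it helps to write it out on a toy series. This sketch assumes `target[t]` already holds the value to predict from information up to and including row `t`; the helper name `make_windows` is hypothetical.

```python
import numpy as np

def make_windows(features, target, seq_len):
    """Build sequence-to-one samples.

    features: (T, n_features); target: (T,), where target[t] is the label
    for a window whose last row is t.
    """
    last_rows = range(seq_len - 1, len(features))
    X = np.stack([features[t - seq_len + 1:t + 1] for t in last_rows])
    y = target[seq_len - 1:]
    return X, y

# Toy RV series 0, 1, ..., 9 so the alignment is easy to read.
rv = np.arange(10, dtype=float)
features = rv[:-1, None]   # rows 0..8, one feature per row
target = rv[1:]            # target[t] = rv[t + 1], the next-day value

X, y = make_windows(features, target, seq_len=3)
print(X.shape, y.shape)    # (n_samples, seq_len, n_features), (n_samples,)
print(X[0].ravel(), y[0])  # first window and its label
```

The first window holds days 0 to 2 and its label is day 3: every window ends on the most recent observation and the label is the day after.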

---

## 4. Realized Volatility Estimators

### Close-to-Close (Standard)

$$\hat{\sigma}^2_{CC} = \sum_{i=1}^{n} r_i^2$$

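where $r_i$ are the $n$ returns in the estimation window: intraday returns within a day, or daily returns over a multi-day window. With a single daily return this reduces to $r^2$.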

### Parkinson (Range-Based)

$$\hat{\sigma}^2_P = \frac{1}{4 \ln 2} (\ln H - \ln L)^2$$

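Uses only the high and low prices. More statistically efficient than close-to-close because the daily range reflects the whole intraday path, though it assumes zero drift and continuous trading.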

### Garman-Klass

$$\hat{\sigma}^2_{GK} = 0.5 (\ln H - \ln L)^2 - (2\ln 2 - 1)(\ln C - \ln O)^2$$

Uses open, high, low, and close. Under the same assumptions (no drift, no overnight jumps) it is more efficient than Parkinson.
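
As a worked example, the two range-based formulas can be evaluated for one hypothetical trading day. The OHLC values below are made up for illustration and the function names are assumptions.

```python
import math

def parkinson_var(high, low):
    """Single-day Parkinson variance estimate."""
    return math.log(high / low) ** 2 / (4.0 * math.log(2.0))

def garman_klass_var(open_p, high, low, close):
    """Single-day Garman-Klass variance estimate."""
    hl = math.log(high / low) ** 2
    co = math.log(close / open_p) ** 2
    return 0.5 * hl - (2.0 * math.log(2.0) - 1.0) * co

# Hypothetical day: open 100, high 102, low 99, close 101.
o, h, l, c = 100.0, 102.0, 99.0, 101.0
park_vol = math.sqrt(parkinson_var(h, l))
gk_vol = math.sqrt(garman_klass_var(o, h, l, c))

print(f'Parkinson daily vol:    {park_vol:.4f}  (annualized {park_vol * math.sqrt(252):.3f})')
print(f'Garman-Klass daily vol: {gk_vol:.4f}  (annualized {gk_vol * math.sqrt(252):.3f})')
```

Both estimates are daily variances, so they are annualized by multiplying the volatility by $\sqrt{252}$.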

In [None]:
# --- Realized volatility estimators ---

def rv_close_to_close(returns, window=5):
    """Close-to-close realized volatility: sqrt of the rolling sum of squared returns."""
    return np.sqrt(returns.pow(2).rolling(window).sum())


def rv_parkinson(high, low, window=5):
    """Parkinson range-based volatility: sqrt of the rolling mean of daily variance estimates."""
    log_hl = (np.log(high) - np.log(low)) ** 2
    return np.sqrt((log_hl / (4 * np.log(2))).rolling(window).mean())


def rv_garman_klass(open_p, high, low, close, window=5):
    """Garman-Klass volatility: sqrt of the rolling mean of daily variance estimates."""
    log_hl = (np.log(high) - np.log(low)) ** 2
    log_co = (np.log(close) - np.log(open_p)) ** 2
    gk = 0.5 * log_hl - (2 * np.log(2) - 1) * log_co
    return np.sqrt(gk.rolling(window).mean())


print('RV estimators defined.')

In [None]:
# --- Generate synthetic daily data mimicking SPY ---

np.random.seed(42)
n_days = 2520  # ~10 years of trading days

# Simulate GARCH(1,1) process for returns
mu = 0.0003  # daily drift
omega = 1e-6
alpha = 0.08
beta = 0.90

returns = np.zeros(n_days)
sigma2 = np.zeros(n_days)
sigma2[0] = omega / (1 - alpha - beta)  # unconditional variance

for t in range(1, n_days):
    # The GARCH shock is the demeaned return (return minus drift)
    sigma2[t] = omega + alpha * (returns[t-1] - mu)**2 + beta * sigma2[t-1]
    returns[t] = mu + np.sqrt(sigma2[t]) * np.random.randn()

# Generate OHLC from returns
close = 100 * np.exp(np.cumsum(returns))
# Simulate intraday range
daily_vol = np.sqrt(sigma2)
high = close * np.exp(np.abs(np.random.randn(n_days)) * daily_vol * 0.5)
low = close * np.exp(-np.abs(np.random.randn(n_days)) * daily_vol * 0.5)
open_p = close * np.exp(np.random.randn(n_days) * daily_vol * 0.2)

# Ensure high >= close >= low and high >= open >= low
high = np.maximum(high, np.maximum(close, open_p)) * 1.001
low = np.minimum(low, np.minimum(close, open_p)) * 0.999

# Create DataFrame
dates = pd.bdate_range('2015-01-02', periods=n_days)
spy = pd.DataFrame({
    'date': dates,
    'open': open_p, 'high': high, 'low': low, 'close': close,
    'return': returns,
    'volume': np.random.lognormal(18, 0.5, n_days)  # synthetic volume
}).set_index('date')

# Compute realized volatility (5-day, annualized)
spy['rv_cc'] = rv_close_to_close(spy['return'], window=5) * np.sqrt(252/5)
spy['rv_park'] = rv_parkinson(spy['high'], spy['low'], window=5) * np.sqrt(252)
spy['rv_gk'] = rv_garman_klass(spy['open'], spy['high'], spy['low'], spy['close'], window=5) * np.sqrt(252)

spy = spy.dropna()
print(f'SPY data: {len(spy)} days')
spy.head()

In [None]:
# --- Plot RV estimators ---
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

axes[0].plot(spy.index, spy['return'], alpha=0.5, linewidth=0.5)
axes[0].set_ylabel('Daily Return')
axes[0].set_title('SPY Daily Returns (Simulated)')

axes[1].plot(spy.index, spy['rv_cc'], label='Close-to-Close', alpha=0.8)
axes[1].plot(spy.index, spy['rv_park'], label='Parkinson', alpha=0.8)
axes[1].plot(spy.index, spy['rv_gk'], label='Garman-Klass', alpha=0.8)
axes[1].set_ylabel('Annualized RV')
axes[1].set_title('Realized Volatility Estimators (5-day window)')
axes[1].legend()

plt.tight_layout()
plt.show()

---

## 5. Classical Baselines: GARCH(1,1) and HAR

### GARCH(1,1)

$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$

The workhorse of volatility modeling. Captures volatility clustering with 3 parameters.
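
The recursion is simple enough to evaluate by hand. The sketch below uses illustrative parameter values (the same ones used for the simulated data in this notebook) and a hypothetical helper `garch_next_variance` to show how a shock feeds into the next day's variance.

```python
import math

omega, alpha, beta = 1e-6, 0.08, 0.90  # illustrative values

def garch_next_variance(eps_prev, sigma2_prev):
    """One-step GARCH(1,1) update: sigma2_t from yesterday's shock and variance."""
    return omega + alpha * eps_prev ** 2 + beta * sigma2_prev

# Long-run (unconditional) variance, defined when alpha + beta < 1.
uncond_var = omega / (1 - alpha - beta)

calm = garch_next_variance(eps_prev=0.0, sigma2_prev=uncond_var)     # no shock
shock = garch_next_variance(eps_prev=-0.03, sigma2_prev=uncond_var)  # a -3% day

for name, v in [('unconditional', uncond_var), ('after calm day', calm), ('after -3% day', shock)]:
    print(f'{name:15s}: daily vol {math.sqrt(v):.4f}, annualized {math.sqrt(252 * v):.3f}')
```

Only the squared shock enters the update, so a +3% day would raise the forecast by the same amount. Plain GARCH(1,1) therefore cannot reproduce the leverage effect from Section 1.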

### HAR (Heterogeneous Autoregressive) Model

$$RV_{t+1} = \beta_0 + \beta_D RV_t^{(D)} + \beta_W RV_t^{(W)} + \beta_M RV_t^{(M)} + \epsilon_{t+1}$$

where:
- $RV^{(D)}$ = daily RV (1 day)
- $RV^{(W)}$ = weekly RV (average of last 5 days)
- $RV^{(M)}$ = monthly RV (average of last 22 days)

Simple, linear, and surprisingly hard to beat.
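
HAR is ordinary least squares on three trailing averages. This is a self-contained sketch on a synthetic, persistent RV series; the AR(1) data generator, the helper `rolling_mean`, and the 60/40 split are assumptions for illustration only.

```python
import numpy as np

def rolling_mean(x, window):
    """Trailing mean; entries before the first full window are NaN."""
    out = np.full(len(x), np.nan)
    c = np.cumsum(np.insert(x, 0, 0.0))
    out[window - 1:] = (c[window:] - c[:-window]) / window
    return out

# Synthetic persistent RV series around 15% (assumption, not market data).
rng = np.random.default_rng(0)
n = 1500
rv = np.empty(n)
rv[0] = 0.15
for t in range(1, n):
    rv[t] = 0.15 + 0.9 * (rv[t - 1] - 0.15) + 0.01 * rng.standard_normal()

# Regressors at row t: intercept, daily, weekly (5-day), monthly (22-day) RV.
X = np.column_stack([np.ones(n), rv, rolling_mean(rv, 5), rolling_mean(rv, 22)])[21:-1]
y = rv[22:]  # label for row t is rv[t + 1]

split = int(0.6 * len(y))  # chronological split, no shuffling
beta, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)

pred = X[split:] @ beta
mse_har = float(np.mean((pred - y[split:]) ** 2))
mse_mean = float(np.mean((y[split:] - y[:split].mean()) ** 2))  # constant forecast
print('coefficients (const, D, W, M):', np.round(beta, 3))
print(f'out-of-sample MSE: HAR {mse_har:.2e} vs constant {mse_mean:.2e}')
```

The first 21 rows are dropped because the monthly average is undefined there, and the last row is dropped because it has no next-day label.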

In [None]:
# --- GARCH(1,1) baseline ---
# Simple implementation: fit on training data, forecast one-step-ahead

from arch import arch_model

# Use close-to-close RV as our target
# For GARCH, we fit on returns directly
rv_col = 'rv_cc'

# Temporal split
train_end = int(len(spy) * 0.6)
val_end = int(len(spy) * 0.8)

train = spy.iloc[:train_end]
val = spy.iloc[train_end:val_end]
test = spy.iloc[val_end:]

print(f'Train: {len(train)} days, Val: {len(val)} days, Test: {len(test)} days')

# Fit GARCH on training returns
garch = arch_model(train['return'] * 100, vol='GARCH', p=1, q=1, mean='Constant')
garch_result = garch.fit(disp='off')
print(f'\nGARCH(1,1) parameters:')
print(garch_result.params)

In [None]:
# --- GARCH one-step-ahead forecasts on test set ---
# Re-fit on train+val, forecast on test
all_returns = spy['return'] * 100

garch_full = arch_model(all_returns.iloc[:val_end], vol='GARCH', p=1, q=1, mean='Constant')
garch_full_result = garch_full.fit(disp='off')

# Rolling one-step forecast
garch_forecasts = []
for t in range(val_end, len(spy)):
    # Fit on returns up to and including row t-1, then forecast the conditional variance of row t
    am = arch_model(all_returns.iloc[:t], vol='GARCH', p=1, q=1, mean='Constant')
    res = am.fit(disp='off')
    fc = res.forecast(horizon=1)
    # Undo the x100 return scaling, then annualize the one-day volatility
    garch_vol = np.sqrt(fc.variance.values[-1, 0]) / 100 * np.sqrt(252)
    garch_forecasts.append(garch_vol)

test_copy = test.copy()
test_copy['garch_pred'] = garch_forecasts
print(f'GARCH forecasts: {len(garch_forecasts)} days')

In [None]:
# --- HAR baseline ---
from sklearn.linear_model import LinearRegression

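# Compute HAR features
# Note: rv_col is already a 5-day rolling estimate, so consecutive values share 4 of 5 days
spy['rv_d'] = spy[rv_col]  # most recent RV estimate ("daily" component)
spy['rv_w'] = spy[rv_col].rolling(5).mean()  # weekly average
spy['rv_m'] = spy[rv_col].rolling(22).mean()  # monthly average
spy['rv_target'] = spy[rv_col].shift(-1)  # next-day value of the rolling RV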

har_features = ['rv_d', 'rv_w', 'rv_m']
spy_har = spy.dropna(subset=har_features + ['rv_target'])

# Split
train_end_har = int(len(spy_har) * 0.6)
val_end_har = int(len(spy_har) * 0.8)

har_train = spy_har.iloc[:train_end_har]
har_test = spy_har.iloc[val_end_har:]

# Fit HAR
har_model = LinearRegression()
har_model.fit(har_train[har_features], har_train['rv_target'])

# Predict
har_pred = har_model.predict(har_test[har_features])
print(f'HAR coefficients: D={har_model.coef_[0]:.3f}, W={har_model.coef_[1]:.3f}, M={har_model.coef_[2]:.3f}')
print(f'HAR intercept: {har_model.intercept_:.4f}')

---

## 6. LSTM Architecture for Volatility Forecasting

### Design Choices

| Choice | Recommendation | Reason |
|--------|----------------|--------|
| RNN type | LSTM or GRU | Both work; GRU is faster |
| Layers | 1-2 | More layers rarely help for vol |
| Hidden size | 32-64 | Vol is a "simple" signal |
| Sequence length | 20-60 days | Captures weekly + monthly patterns |
| Dropout | 0.2-0.3 | Less than cross-sectional models |
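
To judge whether a configuration is small enough for roughly ten years of daily data, it helps to count parameters before training. The sketch below assumes the PyTorch `nn.LSTM` layout, where each layer has four gate blocks and two bias vectors; the function name is hypothetical.

```python
def lstm_param_count(input_dim, hidden_dim, n_layers):
    """Trainable parameters in a stacked, unidirectional LSTM (PyTorch layout)."""
    total = 0
    for layer in range(n_layers):
        in_dim = input_dim if layer == 0 else hidden_dim
        # 4 gate blocks, each with input weights, recurrent weights, and 2 biases
        total += 4 * (hidden_dim * in_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return total

for hidden, layers in [(32, 1), (32, 2), (64, 1), (64, 2)]:
    count = lstm_param_count(input_dim=3, hidden_dim=hidden, n_layers=layers)
    print(f'hidden={hidden:3d}, layers={layers}: {count:6d} LSTM parameters')
```

The count grows roughly with the square of the hidden size, which is one reason to stay at the low end of the ranges in the table when only a few thousand observations are available.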

In [None]:
# --- LSTM for volatility forecasting ---

class VolSequenceDataset(Dataset):
    """
    Dataset that creates sequences of (features, target) for vol forecasting.
    Features: [lagged_rv, returns, log_volume] over a window.
    Target: next-day RV.
    """
    def __init__(self, data, feature_cols, target_col, seq_len=20):
        self.seq_len = seq_len
        self.features = data[feature_cols].values.astype(np.float32)
        self.targets = data[target_col].values.astype(np.float32)

        # Z-score with this dataset's own statistics; for val/test, overwrite with training stats afterwards
        self.feat_mean = self.features.mean(axis=0)
        self.feat_std = self.features.std(axis=0) + 1e-8
        self.features = (self.features - self.feat_mean) / self.feat_std

    def __len__(self):
        return len(self.features) - self.seq_len

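    def __getitem__(self, idx):
        # targets[t] already holds the next-day RV for row t, so the window must end
        # at row t itself; ending one row earlier would skip the latest observation.
        end = idx + self.seq_len
        X = self.features[idx + 1:end + 1]  # (seq_len, n_features), rows idx+1..end
        y = self.targets[end]               # scalar label for the window's last row
        return torch.from_numpy(X), torch.tensor(y)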


print('VolSequenceDataset ready.')

In [None]:
class LSTMVolForecaster(nn.Module):
    """
    LSTM for volatility forecasting.
    Takes a sequence of features and outputs a single vol prediction.
    """
    def __init__(self, input_dim, hidden_dim=32, n_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            dropout=dropout if n_layers > 1 else 0.0
        )
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        lstm_out, (h_n, c_n) = self.lstm(x)
        # Use the last hidden state
        last_hidden = lstm_out[:, -1, :]  # (batch, hidden_dim)
        out = self.head(last_hidden).squeeze(-1)  # (batch,)
        return out


# Test
model = LSTMVolForecaster(input_dim=3, hidden_dim=32, n_layers=2)
print(model)
x_test = torch.randn(16, 20, 3)  # batch=16, seq_len=20, features=3
print(f'Output shape: {model(x_test).shape}')  # (16,)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Parameters: {n_params}')

---

## 7. Loss Functions for Volatility

### MSE

$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{t=1}^{N} (\sigma_t - \hat{\sigma}_t)^2$$

Weights every squared error equally in absolute terms. Problem: forecast errors scale with the level of volatility, so a few high-vol days dominate the loss.

### QLIKE

$$\mathcal{L}_{QLIKE} = \frac{1}{N} \sum_{t=1}^{N} \left( \frac{\sigma_t^2}{\hat{\sigma}_t^2} - \ln \frac{\sigma_t^2}{\hat{\sigma}_t^2} - 1 \right)$$

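The quasi-likelihood loss. Often preferred for volatility because:
- It penalizes underestimation more than overestimation
- It depends only on the ratio $\sigma_t^2 / \hat{\sigma}_t^2$, so it is invariant to the scale of volatility and high-vol periods do not dominate
- Up to terms that do not depend on the forecast, it equals the Gaussian negative log-likelihood under heteroskedasticity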
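
The scale property can be verified directly: multiplying both the realized and forecast volatility by a constant leaves QLIKE unchanged, while MSE grows with the square of that constant. The sketch uses NumPy and made-up volatility values; the function names are local to this example.

```python
import numpy as np

def qlike(vol_true, vol_pred):
    """QLIKE on volatilities (squared internally to variances)."""
    ratio = vol_true ** 2 / vol_pred ** 2
    return float(np.mean(ratio - np.log(ratio) - 1.0))

def mse(vol_true, vol_pred):
    return float(np.mean((vol_true - vol_pred) ** 2))

vol_true = np.array([0.10, 0.20, 0.40])   # illustrative realized vols
vol_pred = np.array([0.12, 0.18, 0.30])   # illustrative forecasts
k = 3.0                                   # e.g. a calm regime vs a crisis regime

print(f'QLIKE: {qlike(vol_true, vol_pred):.4f} -> {qlike(k * vol_true, k * vol_pred):.4f}')
print(f'MSE:   {mse(vol_true, vol_pred):.6f} -> {mse(k * vol_true, k * vol_pred):.6f}')
print(f'QLIKE at a perfect forecast: {qlike(vol_true, vol_true):.4f}')
```

The same relative forecast error therefore costs the same under QLIKE in calm and turbulent markets, whereas MSE is dominated by the turbulent ones.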

In [None]:
def mse_loss(y_pred, y_true):
    return ((y_pred - y_true) ** 2).mean()


def qlike_loss(y_pred, y_true):
    """
    QLIKE loss for volatility forecasting.
    y_pred, y_true are volatilities (not variances).
    """
    # Convert to variances
    var_pred = y_pred ** 2 + 1e-8  # avoid division by zero
    var_true = y_true ** 2 + 1e-8
    ratio = var_true / var_pred
    return (ratio - torch.log(ratio) - 1).mean()


# Demo: QLIKE penalizes underestimation more
true_vol = torch.tensor([0.20])
over_pred = torch.tensor([0.30])   # overestimate by 10pp
under_pred = torch.tensor([0.10])  # underestimate by 10pp

print(f'QLIKE(overestimate):  {qlike_loss(over_pred, true_vol):.4f}')
print(f'QLIKE(underestimate): {qlike_loss(under_pred, true_vol):.4f}')
print('(Underestimation penalized more --- good for risk management!)')

---

## 8. Demo: LSTM vs GARCH for SPY Volatility

In [None]:
# --- Prepare data for LSTM ---
spy['log_volume'] = np.log(spy['volume'])
lstm_features = ['rv_cc', 'return', 'log_volume']
target_col = 'rv_target'

# Filter to valid rows
spy_lstm = spy.dropna(subset=lstm_features + [target_col]).copy()

SEQ_LEN = 20

# Temporal split
n = len(spy_lstm)
train_end_idx = int(n * 0.6)
val_end_idx = int(n * 0.8)

train_data = spy_lstm.iloc[:train_end_idx]
val_data = spy_lstm.iloc[train_end_idx - SEQ_LEN:val_end_idx]  # overlap for sequence
test_data = spy_lstm.iloc[val_end_idx - SEQ_LEN:]  # overlap for sequence

# Create datasets
train_ds = VolSequenceDataset(train_data, lstm_features, target_col, seq_len=SEQ_LEN)
val_ds = VolSequenceDataset(val_data, lstm_features, target_col, seq_len=SEQ_LEN)
test_ds = VolSequenceDataset(test_data, lstm_features, target_col, seq_len=SEQ_LEN)

# Use training stats to normalize val/test
val_ds.feat_mean = train_ds.feat_mean
val_ds.feat_std = train_ds.feat_std
val_ds.features = (spy_lstm.iloc[train_end_idx - SEQ_LEN:val_end_idx][lstm_features].values.astype(np.float32) - train_ds.feat_mean) / train_ds.feat_std

test_ds.feat_mean = train_ds.feat_mean
test_ds.feat_std = train_ds.feat_std
test_ds.features = (spy_lstm.iloc[val_end_idx - SEQ_LEN:][lstm_features].values.astype(np.float32) - train_ds.feat_mean) / train_ds.feat_std

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=256, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=256, shuffle=False)

print(f'Train sequences: {len(train_ds)}')
print(f'Val sequences: {len(val_ds)}')
print(f'Test sequences: {len(test_ds)}')

In [None]:
# --- Train LSTM ---

class EarlyStopping:
    def __init__(self, patience=15):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0
        self.best_state = None

    def step(self, val_loss, model):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            self.best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
            return False
        self.counter += 1
        return self.counter >= self.patience

    def restore_best(self, model):
        if self.best_state:
            model.load_state_dict(self.best_state)


lstm_model = LSTMVolForecaster(input_dim=len(lstm_features), hidden_dim=32, n_layers=2, dropout=0.2)
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=1e-3, weight_decay=1e-5)
stopper = EarlyStopping(patience=15)

train_losses = []
val_losses = []

for epoch in range(200):
    # Train
    lstm_model.train()
    epoch_loss = []
    for X_b, y_b in train_loader:
        optimizer.zero_grad()
        pred = lstm_model(X_b)
        loss = mse_loss(pred, y_b)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(lstm_model.parameters(), 1.0)  # gradient clipping
        optimizer.step()
        epoch_loss.append(loss.item())
    train_losses.append(np.mean(epoch_loss))

    # Validate
    lstm_model.eval()
    v_losses = []
    with torch.no_grad():
        for X_b, y_b in val_loader:
            v_losses.append(mse_loss(lstm_model(X_b), y_b).item())
    val_losses.append(np.mean(v_losses))

    if epoch % 20 == 0:
        print(f'Epoch {epoch:3d}: train_loss={train_losses[-1]:.6f}, val_loss={val_losses[-1]:.6f}')

    if stopper.step(val_losses[-1], lstm_model):
        print(f'Early stopping at epoch {epoch}')
        break

stopper.restore_best(lstm_model)
print(f'Best val loss: {stopper.best_loss:.6f}')

In [None]:
# --- Training curves ---
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(train_losses, label='Train', alpha=0.8)
ax.plot(val_losses, label='Validation', alpha=0.8)
ax.set_xlabel('Epoch')
ax.set_ylabel('MSE Loss')
ax.set_title('LSTM Training Curves')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# --- Evaluate on test set ---
lstm_model.eval()
all_preds = []
all_targets = []

with torch.no_grad():
    for X_b, y_b in test_loader:
        preds = lstm_model(X_b)
        all_preds.append(preds.numpy())
        all_targets.append(y_b.numpy())

lstm_preds = np.concatenate(all_preds)
actual_rv = np.concatenate(all_targets)

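# Align the three forecasts by date rather than by position: each model's test
# window starts on a different row, so truncating to a common length would
# compare forecasts made for different days.
label_dates = test_data.index[SEQ_LEN:]  # row t carries rv_target = RV of day t+1
lstm_s = pd.Series(lstm_preds, index=label_dates)
actual_s = pd.Series(actual_rv, index=label_dates)
har_s = pd.Series(har_pred, index=har_test.index)
# The GARCH forecast stored at row t is for day t itself; shift it back one row
# so it is keyed like rv_target (made at t-1 for day t).
garch_s = pd.Series(garch_forecasts, index=test.index).shift(-1)

aligned = pd.concat({'actual': actual_s, 'lstm': lstm_s, 'har': har_s, 'garch': garch_s}, axis=1).dropna()
n_test = len(aligned)
lstm_preds_aligned = aligned['lstm'].values
har_pred_aligned = aligned['har'].values
garch_pred_aligned = aligned['garch'].values
actual_aligned = aligned['actual'].values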

# MSE comparison
mse_lstm = np.mean((lstm_preds_aligned - actual_aligned) ** 2)
mse_har = np.mean((har_pred_aligned - actual_aligned) ** 2)
mse_garch = np.mean((garch_pred_aligned - actual_aligned) ** 2)

# Correlation
corr_lstm = np.corrcoef(lstm_preds_aligned, actual_aligned)[0, 1]
corr_har = np.corrcoef(har_pred_aligned, actual_aligned)[0, 1]
corr_garch = np.corrcoef(garch_pred_aligned, actual_aligned)[0, 1]

results = pd.DataFrame({
    'Model': ['GARCH(1,1)', 'HAR', 'LSTM'],
    'MSE': [mse_garch, mse_har, mse_lstm],
    'Correlation': [corr_garch, corr_har, corr_lstm]
}).set_index('Model')

print('Volatility Forecasting Results (Test Set):')
print(results.round(6).to_string())

In [None]:
# --- Plot predictions vs actual ---
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

t = np.arange(n_test)

# Actual vs predicted
axes[0].plot(t, actual_aligned, label='Actual RV', color='black', alpha=0.6, linewidth=0.8)
axes[0].plot(t, lstm_preds_aligned, label='LSTM', alpha=0.8, linewidth=1.2)
axes[0].plot(t, har_pred_aligned, label='HAR', alpha=0.8, linewidth=1.2)
axes[0].plot(t, garch_pred_aligned, label='GARCH', alpha=0.8, linewidth=1.2)
axes[0].set_ylabel('Annualized Volatility')
axes[0].set_title('Volatility Forecasts vs Actual (Test Set)')
axes[0].legend()

# Forecast errors
axes[1].plot(t, (lstm_preds_aligned - actual_aligned), label='LSTM error', alpha=0.6)
axes[1].plot(t, (har_pred_aligned - actual_aligned), label='HAR error', alpha=0.6)
axes[1].plot(t, (garch_pred_aligned - actual_aligned), label='GARCH error', alpha=0.6)
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].set_ylabel('Forecast Error')
axes[1].set_xlabel('Test Day')
axes[1].set_title('Forecast Errors')
axes[1].legend()

plt.tight_layout()
plt.show()

---

## Key Takeaways

1. **Volatility clustering** makes forecasting possible --- vol is highly persistent.
2. **RV estimators**: Parkinson and Garman-Klass use OHLC data and are more efficient than close-to-close.
3. **GARCH(1,1)**: The classic baseline. Hard to beat for one-step-ahead forecasting.
4. **HAR model**: Simple linear model using daily/weekly/monthly RV. Surprisingly competitive.
5. **LSTM**: Can capture nonlinear patterns, but needs enough data and careful tuning.
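6. **QLIKE loss**: Often preferred to MSE for volatility because it penalizes underestimation more and depends only on the ratio of realized to forecast variance. The demo above trains with MSE for simplicity.
7. **Gradient clipping**: A cheap safeguard against exploding gradients when training LSTMs on noisy financial data.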

### When Does LSTM Beat Classical?

- When there are **regime changes** that GARCH's exponential smoothing handles poorly
- When you have **many assets** (cross-learning)
- When **additional features** (volume, order flow) carry information
- For **multi-step** forecasts where GARCH reverts to the unconditional mean too fast
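
The mean reversion in the last point follows from the GARCH(1,1) multi-step formula $E[\sigma^2_{t+h}] = \bar{\sigma}^2 + (\alpha + \beta)^{h-1} (\sigma^2_{t+1} - \bar{\sigma}^2)$, where $\bar{\sigma}^2 = \omega / (1 - \alpha - \beta)$. A minimal sketch with illustrative parameters and a hypothetical helper `garch_multi_step`:

```python
import math

def garch_multi_step(sigma2_next, omega, alpha, beta, horizon):
    """Expected conditional variance for h = 1..horizon days ahead."""
    persistence = alpha + beta
    uncond = omega / (1 - persistence)
    return [uncond + persistence ** (h - 1) * (sigma2_next - uncond)
            for h in range(1, horizon + 1)]

omega, alpha, beta = 1e-6, 0.08, 0.90   # illustrative values
uncond = omega / (1 - alpha - beta)

# Start from an elevated one-step-ahead variance (four times the long-run level).
path = garch_multi_step(sigma2_next=2e-4, omega=omega, alpha=alpha, beta=beta, horizon=250)

# Days until half of the gap to the long-run variance has closed.
half_life = math.log(0.5) / math.log(alpha + beta)

for h in (1, 5, 22, 66, 250):
    print(f'h={h:3d}: annualized vol forecast {math.sqrt(252 * path[h - 1]):.3f}')
print(f'half-life of a variance shock: {half_life:.1f} days')
```

Every shock decays at the single rate $\alpha + \beta$, regardless of what caused it. A sequence model is not tied to one decay rate, which is where it may have an edge at longer horizons.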

### Next Week

Week 9: Transformers and attention for financial time series.