# Week 9 -- Foundation Models for Financial Time Series

**Key question:** Can we take a model pre-trained on millions of generic time series and use it to forecast stock prices?

**Spoiler:** It is not that simple. This lecture covers the three-layer reality of foundation models (FMs) in finance.

---

## Outline

1. The hype vs. reality of foundation models for finance
2. Generic TSFMs: Chronos (Amazon), TimesFM (Google)
3. Why generic TSFMs underperform in finance
4. Finance-native FMs: Kronos, FinCast
5. The hybrid approach: FM embeddings + XGBoost
6. Bank AI investment context
7. Demo: Chronos zero-shot on SPY vs. ARIMA
8. Key papers and references

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

print('Imports ready.')

---
## 1. The Hype vs. Reality

**The promise:** Train a single model on millions of time series, then zero-shot forecast *anything* -- energy, weather, retail, finance.

**The reality in finance:**

| Layer | Approach | Finance Performance |
|-------|----------|--------------------|
| Layer 1 | Generic TSFMs (Chronos, TimesFM) zero-shot | Poor -- often worse than ARIMA |
| Layer 2 | Finance-native FMs (Kronos, FinCast) | Strong -- trained on financial data |
| Layer 3 | Hybrid: FM embeddings --> XGBoost | Potentially best of both worlds |

**Why?** Financial time series have fundamentally different statistical properties than the data these models were trained on:
- Near-zero autocorrelation in returns
- Heavy tails / non-Gaussian distributions
- Regime changes
- Signal-to-noise ratio is extremely low
- Non-stationarity

---
## 2. Generic TSFMs: Chronos and TimesFM

### Chronos (Amazon, 2024)

- **Paper:** Ansari et al., "Chronos: Learning the Language of Time Series" (2024)
- **Idea:** Tokenize time series values (scale, then quantize into a fixed vocabulary), then train a T5-style language model on the resulting token sequences
- **Architecture:** Encoder-decoder transformer (T5)
- **Training data:** 27 publicly available datasets (energy, traffic, weather, economics) + synthetic data via Gaussian processes
- **Sizes:** 8M (tiny), 20M (mini), 46M (small), 200M (base), 710M (large) parameters
- **Key innovation:** Treats forecasting as a language modeling problem -- each real-valued time step becomes a token

```
Raw values: [100.5, 101.2, 99.8, 102.1, ...]
     --> Mean scaling (divide by the mean absolute value of the context)
     --> Uniform binning into a fixed vocabulary of 4096 tokens
     --> Token IDs: [2048, 2103, 1995, 2150, ...]
     --> Feed into T5 transformer
     --> Output: probability distribution over next tokens
     --> Decode back to real values
```
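
The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the bin range of [-15, 15], the 4094 value bins (4096 minus two special tokens), and the helper names `encode` and `decode` are all assumptions made for this example.

```python
import numpy as np

# Assumed settings (not taken from this lecture): uniform bin centers on
# [-15, 15] and 4094 value tokens (4096 minus 2 special tokens).
LOW, HIGH, N_BINS = -15.0, 15.0, 4094
CENTERS = np.linspace(LOW, HIGH, N_BINS)
EDGES = (CENTERS[:-1] + CENTERS[1:]) / 2  # boundaries between neighbouring bins


def encode(context):
    """Mean-scale a 1-D series and map each value to a bin index."""
    context = np.asarray(context, dtype=float)
    scale = np.mean(np.abs(context))
    scale = scale if scale > 0 else 1.0  # guard against an all-zero context
    scaled = np.clip(context / scale, LOW, HIGH)
    token_ids = np.digitize(scaled, EDGES)  # integers in [0, N_BINS - 1]
    return token_ids, scale


def decode(token_ids, scale):
    """Map bin indices back to real values via the bin centers."""
    return CENTERS[np.asarray(token_ids)] * scale


prices = np.array([100.5, 101.2, 99.8, 102.1])
tokens, scale = encode(prices)
reconstructed = decode(tokens, scale)
print(tokens)
print(np.round(reconstructed, 2))
```

Under these assumed settings, one bin is roughly 0.7 price units wide for a series trading near 100, so the round trip cannot distinguish moves smaller than that.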

### TimesFM (Google, 2024)

- **Paper:** Das et al., "A decoder-only foundation model for time-series forecasting" (2024)
- **Idea:** Decoder-only transformer (like GPT) for time series
- **Architecture:** Patched decoder-only transformer with input/output projection layers
- **Training data:** ~100B time points from Google Trends, Wiki pageviews, synthetic data
- **Size:** 200M parameters
- **Key innovation:** Patch-based input (groups of consecutive time steps), handles variable context/horizon lengths
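
To illustrate what "patch-based input" means, the sketch below splits a context window into fixed-length patches, each of which would become one input token for the transformer. The patch length of 32, the left zero-padding, and the helper name `make_patches` are assumptions for illustration; the actual TimesFM preprocessing may differ.

```python
import numpy as np


def make_patches(series, patch_len=32):
    """Split a 1-D series into non-overlapping patches, left-padding with zeros.

    Returns (patches, mask), where mask is True for real observations.
    """
    series = np.asarray(series, dtype=float)
    pad = (-len(series)) % patch_len  # zeros needed to reach a multiple of patch_len
    padded = np.concatenate([np.zeros(pad), series])
    mask = np.concatenate([np.zeros(pad, dtype=bool), np.ones(len(series), dtype=bool)])
    return padded.reshape(-1, patch_len), mask.reshape(-1, patch_len)


context = np.arange(100, dtype=float)  # 100 time steps
patches, mask = make_patches(context, patch_len=32)
print(patches.shape)  # (4, 32): 100 steps padded to 128, then 4 patches
```

Compared with one token per time step, patching shortens the sequence the transformer has to attend over by a factor of `patch_len`.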

In [None]:
# Visualization: How Chronos tokenization works

np.random.seed(42)
raw_values = np.cumsum(np.random.randn(100) * 0.02) + 100  # simulated price

# Step 2: Normalize (z-score here for readability; Chronos itself uses mean scaling)
mean_val, std_val = raw_values.mean(), raw_values.std()
normalized = (raw_values - mean_val) / std_val

# Step 3: Bin into equal-width intervals (simplified -- 16 bins for illustration)
n_bins = 16
bin_edges = np.linspace(normalized.min() - 0.1, normalized.max() + 0.1, n_bins + 1)
token_ids = np.digitize(normalized, bin_edges) - 1

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(raw_values, color='steelblue')
axes[0].set_title('Step 1: Raw Time Series')
axes[0].set_ylabel('Price')

axes[1].plot(normalized, color='darkorange')
axes[1].set_title('Step 2: Normalized')
axes[1].set_ylabel('z-score')

axes[2].step(range(len(token_ids)), token_ids, color='green', where='mid')
axes[2].set_title(f'Step 3: Tokenized ({n_bins} bins)')
axes[2].set_ylabel('Token ID')

for ax in axes:
    ax.set_xlabel('Time step')

plt.suptitle('Chronos Tokenization Pipeline (Simplified)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

---
## 3. Why Generic TSFMs Underperform in Finance

Generic TSFMs were trained on datasets where:
- **Trends are persistent:** electricity demand follows daily/seasonal cycles
- **Signal-to-noise is high:** temperature tomorrow is predictable from temperature today
- **Distributions are well-behaved:** approximately Gaussian

Financial returns are the *opposite*:

| Property | Generic Time Series | Financial Returns |
|----------|-------------------|-----------------|
| Autocorrelation | High (seasonal patterns) | Near zero |
| Distribution | Near-Gaussian | Heavy tails (kurtosis ~10+) |
| Stationarity | Often stationary after differencing | Regime changes, structural breaks |
| Predictability | High (R-squared ~0.9+) | Very low (R-squared ~0.01 is good) |
| Noise level | Low to moderate | Extremely high |
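
The autocorrelation row of the table can be checked on simulated data. The sketch below is only illustrative: the sine wave stands in for a seasonal "generic" series, the Student's t draws stand in for daily returns, and `lag1_autocorr` is a helper written for this example.

```python
import numpy as np


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D array."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rng = np.random.default_rng(0)
t = np.arange(2000)

# Stand-in for a "generic" series: a daily cycle sampled hourly plus mild noise
seasonal = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

# Stand-in for daily returns: i.i.d. heavy-tailed draws (Student's t, df=3)
returns = 0.01 * rng.standard_t(df=3, size=t.size)

print(f"seasonal series:   {lag1_autocorr(seasonal):.2f}")
print(f"simulated returns: {lag1_autocorr(returns):.2f}")
```

The seasonal series carries most of its information in its own recent past, while the simulated returns carry almost none, which is the property a pattern-continuation model depends on.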

**The tokenization problem:** Chronos quantizes scaled values onto a fixed grid of bins shared by every series, regardless of how that series is distributed. Financial return distributions have much fatter tails than most of the training data, so extreme moves can fall outside the grid and get clipped, while the bulk of observations near zero is squeezed into a few central bins.
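
To make the clipping concern concrete, the sketch below counts how many points would be clipped for a Gaussian sample and for a heavy-tailed one. It assumes the tokenizer divides each series by its mean absolute value and clips anything beyond 15 in scaled units; both the scaling rule and the limit are assumptions about the tokenizer settings, and `clipped_fraction` is a hypothetical helper.

```python
import numpy as np


def clipped_fraction(x, limit=15.0):
    """Share of points whose mean-scaled magnitude exceeds `limit`."""
    x = np.asarray(x, dtype=float)
    scaled = x / np.mean(np.abs(x))
    return float(np.mean(np.abs(scaled) > limit))


rng = np.random.default_rng(0)
n = 200_000
gaussian = rng.standard_normal(n)
heavy_tailed = rng.standard_t(df=3, size=n)

print(f"Gaussian:           {clipped_fraction(gaussian):.5f}")
print(f"Student's t (df=3): {clipped_fraction(heavy_tailed):.5f}")
```

The clipped share is small in absolute terms, but in returns data those are exactly the crash and rally days that dominate risk.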

In [None]:
# Illustrate the distributional mismatch
from scipy import stats

np.random.seed(42)

# Generic time series: near-Gaussian noise around a trend
generic_residuals = np.random.randn(5000)

# Financial returns: heavy-tailed (Student's t with df=3)
financial_returns = np.random.standard_t(df=3, size=5000) * 0.01

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(generic_residuals, bins=80, density=True, alpha=0.7, color='steelblue', edgecolor='white')
axes[0].set_title('Generic Time Series Residuals\n(Gaussian, kurtosis ~ 3)')
axes[0].set_xlim(-6, 6)
# fisher=False reports raw (not excess) kurtosis, so a Gaussian scores ~3
axes[0].annotate(f'Kurtosis: {stats.kurtosis(generic_residuals, fisher=False):.1f}', xy=(0.7, 0.9),
                 xycoords='axes fraction', fontsize=11)

axes[1].hist(financial_returns, bins=80, density=True, alpha=0.7, color='indianred', edgecolor='white')
axes[1].set_title('Financial Returns\n(Heavy-tailed, kurtosis >> 3)')
axes[1].set_xlim(-0.06, 0.06)
axes[1].annotate(f'Kurtosis: {stats.kurtosis(financial_returns, fisher=False):.1f}', xy=(0.7, 0.9),
                 xycoords='axes fraction', fontsize=11)

plt.suptitle('The Distributional Mismatch Problem', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

---
## 4. Finance-Native Foundation Models

### Kronos (AAAI 2026)

- **Paper:** Shi et al., "Kronos: A Foundation Model for the Language of Financial Markets" (AAAI 2026)
- **Key innovation -- K-line tokenization:** Instead of tokenizing a single price series, Kronos tokenizes *candlestick (K-line) records* -- (open, high, low, close, volume) tuples
- This preserves the intra-period price dynamics that matter for trading
- **Model sizes:** 4.1M, 24.7M, 102M parameters (small enough to run on an M4 MacBook)
- **License:** MIT -- fully open source
- **Results:** Reported 93% improvement in RankIC over the leading generic TSFM baseline, competitive with task-specific models

```
Traditional tokenization:  Close price --> single token
K-line tokenization:       (O, H, L, C, V) --> structured token
                           Preserves price dynamics within each bar
```
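
The contrast can be made concrete with a toy tokenizer. This is explicitly *not* Kronos's method (its tokenizer is learned from data); the feature definitions, the hand-picked bin edges, and the helper names `kline_features` and `toy_kline_token` are all hypothetical and serve only to show how one bar can map to a structured token.

```python
import numpy as np


def kline_features(open_, high, low, close, volume, prev_close, avg_volume):
    """Describe one bar by scale-free quantities instead of raw prices."""
    return np.array([
        (close - open_) / open_,              # body: signed open-to-close move
        (high - max(open_, close)) / open_,   # upper wick
        (min(open_, close) - low) / open_,    # lower wick
        (open_ - prev_close) / prev_close,    # gap from the previous close
        volume / avg_volume,                  # relative volume
    ])


def toy_kline_token(features, edges):
    """Discretize each feature separately; the token is the tuple of bin indices."""
    return tuple(int(np.digitize(f, e)) for f, e in zip(features, edges))


# Hypothetical bin edges, hand-picked for illustration only
price_edges = np.array([-0.01, -0.002, 0.002, 0.01])
wick_edges = np.array([0.001, 0.005])
volume_edges = np.array([0.5, 1.0, 2.0])
edges = [price_edges, wick_edges, wick_edges, price_edges, volume_edges]

bar = kline_features(open_=100.0, high=101.5, low=99.8, close=101.2,
                     volume=3.0e6, prev_close=99.9, avg_volume=2.0e6)
print(toy_kline_token(bar, edges))
```

Two bars with the same close but different wicks or volume receive different tokens here, whereas a close-only tokenizer would treat them as identical.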

### FinCast (CIKM 2025)

- **Paper:** "FinCast: A Large-scale Financial Forecasting Foundation Model" (CIKM 2025)
- **Scale:** 1B parameters, trained on diverse financial data
- **Approach:** Multi-task pre-training on returns, volatility, and cross-sectional rankings
- **Key insight:** Financial forecasting requires understanding *relative* movements (cross-sectional), not just individual time series

### Why finance-native FMs work better

1. **Domain-specific tokenization:** K-line tokens capture OHLCV structure
2. **Financial training data:** Pre-trained on actual market data
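3. **Appropriate objectives and metrics:** Models are built and judged on ranking quality (RankIC) and portfolio-aware measures rather than point-forecast error alone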
4. **Cross-sectional awareness:** Understanding relative stock movements
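
Since RankIC comes up repeatedly, here is a minimal sketch of how it is commonly computed: the Spearman rank correlation between predicted and realized returns across assets, evaluated date by date. The data below is simulated, and the function name `rank_ic` and the frame layout (dates as rows, assets as columns) are choices made for this example.

```python
import numpy as np
import pandas as pd


def rank_ic(predictions: pd.DataFrame, realized: pd.DataFrame) -> pd.Series:
    """Per-date Spearman correlation between predicted and realized returns.

    Both frames are indexed by date with one column per asset.
    """
    pred_ranks = predictions.rank(axis=1)
    real_ranks = realized.rank(axis=1)
    # Pearson correlation of ranks equals Spearman correlation of the raw values
    return pred_ranks.corrwith(real_ranks, axis=1)


rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=250)
assets = [f"A{i:02d}" for i in range(50)]
signal = pd.DataFrame(rng.standard_normal((250, 50)), index=dates, columns=assets)
# Realized returns contain a weak trace of the signal buried in noise
realized = 0.05 * signal + rng.standard_normal((250, 50))

ic = rank_ic(signal, realized)
print(f"mean RankIC: {ic.mean():.3f}, IC information ratio: {ic.mean() / ic.std():.2f}")
```

Because the metric only looks at the ordering of assets within each date, it rewards a model for ranking stocks correctly even when its return forecasts are badly scaled.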

In [None]:
# Visualize K-line tokenization vs. traditional tokenization

np.random.seed(123)
n_bars = 20

# Generate synthetic OHLCV data
closes = np.cumsum(np.random.randn(n_bars) * 0.5) + 100
opens = closes + np.random.randn(n_bars) * 0.3
highs = np.maximum(opens, closes) + np.abs(np.random.randn(n_bars) * 0.2)
lows = np.minimum(opens, closes) - np.abs(np.random.randn(n_bars) * 0.2)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Traditional: just close prices
axes[0].plot(closes, 'o-', color='steelblue', markersize=5)
axes[0].set_title('Traditional Tokenization\nOnly close prices --> one token per step')
axes[0].set_ylabel('Price')
axes[0].set_xlabel('Time step')

# K-line: candlestick chart
for i in range(n_bars):
    color = 'green' if closes[i] >= opens[i] else 'red'
    axes[1].plot([i, i], [lows[i], highs[i]], color='black', linewidth=0.8)
    axes[1].plot([i, i], [opens[i], closes[i]], color=color, linewidth=4, solid_capstyle='butt')

axes[1].set_title('K-line Tokenization (Kronos)\n(O, H, L, C, V) --> structured token per step')
axes[1].set_ylabel('Price')
axes[1].set_xlabel('Time step')

plt.suptitle('Why K-line Tokenization Preserves More Information', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

---
## 5. The Hybrid Approach: FM Embeddings + XGBoost

**The idea:** Use a foundation model as a *feature extractor*, not as a direct forecaster.

```
Raw price data --> Foundation Model --> Internal embeddings (e.g., 256-dim vector)
                                               |
                                               v
                                    Concatenate with hand-crafted features
                                               |
                                               v
                                          XGBoost / LightGBM
                                               |
                                               v
                                        Return prediction
```

**Why this can work better than either approach alone:**

1. **FM embeddings** capture complex temporal patterns the model learned from pre-training
2. **XGBoost** handles the tabular-data aspects that trees excel at (feature interactions, non-linearities)
3. **Hand-crafted features** encode domain knowledge (momentum, volatility, etc.)
4. **Regularization** is easier to control with XGBoost than with fine-tuning a large FM

**Practical implementation:**
- Run a forward pass through the FM (e.g., Chronos encoder)
- Extract the last hidden state or pool across time steps
- Concatenate with your standard alpha features
- Train XGBoost on the combined feature set
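
The pooling and concatenation steps can be sketched as follows. Random arrays stand in for the FM's hidden states and for the alpha features, and the helper names `pool_hidden_states` and `build_hybrid_features` are hypothetical. With a real model, `hidden` would come from a forward pass through the encoder (for Chronos, the pipeline's embedding call is the likely entry point, though that is an assumption here).

```python
import numpy as np


def pool_hidden_states(hidden, how="mean"):
    """Collapse (n_samples, n_steps, d_model) hidden states to (n_samples, d_model)."""
    if how == "mean":
        return hidden.mean(axis=1)
    if how == "last":
        return hidden[:, -1, :]
    raise ValueError(f"unknown pooling: {how}")


def build_hybrid_features(hidden, alpha_features, how="mean"):
    """Concatenate pooled FM embeddings with hand-crafted features, row by row."""
    pooled = pool_hidden_states(hidden, how)
    if pooled.shape[0] != alpha_features.shape[0]:
        raise ValueError("embeddings and alpha features must have the same number of rows")
    return np.hstack([pooled, alpha_features])


rng = np.random.default_rng(0)
# Stand-in for encoder output: 200 samples, 32 time steps, 256-dim hidden states
hidden = rng.standard_normal((200, 32, 256))
# Stand-in for hand-crafted features, e.g. momentum and volatility columns
alpha = rng.standard_normal((200, 10))

X = build_hybrid_features(hidden, alpha, how="mean")
print(X.shape)  # (200, 266)

# The combined matrix can then go to any tabular learner, for example:
#   import xgboost as xgb
#   model = xgb.XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
#   model.fit(X_train, y_train)
```

Keeping the FM frozen and training only the tree model means the embeddings can be computed once and cached, which keeps experimentation cheap.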

In [None]:
# Schematic of the three approaches

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Simulated performance comparison
approaches = ['Generic TSFM\n(Chronos zero-shot)', 'Finance-Native FM\n(Kronos)', 'Hybrid\n(FM emb + XGBoost)']
rank_ic = [0.02, 0.065, 0.072]
colors = ['#e74c3c', '#27ae60', '#2980b9']

axes[0].bar(approaches, rank_ic, color=colors, edgecolor='white', width=0.6)
axes[0].set_ylabel('Rank IC')
axes[0].set_title('Forecasting Quality (Rank IC)')
axes[0].axhline(y=0.03, color='gray', linestyle='--', label='ARIMA baseline')
axes[0].legend()

r_squared = [0.001, 0.008, 0.010]
axes[1].bar(approaches, r_squared, color=colors, edgecolor='white', width=0.6)
axes[1].set_ylabel('OOS R-squared')
axes[1].set_title('Out-of-Sample R-squared')

sharpe = [0.3, 1.1, 1.3]
axes[2].bar(approaches, sharpe, color=colors, edgecolor='white', width=0.6)
axes[2].set_ylabel('Sharpe Ratio')
axes[2].set_title('Portfolio Sharpe (long-short top/bottom quintile)')

plt.suptitle('The Three-Layer Reality: Stylized Performance Comparison', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Note: These are stylized, illustrative numbers loosely based on results reported in the literature;')
print('they are not measurements from this notebook.')
print('Actual performance varies significantly by universe, frequency, and implementation.')

---
## 6. Bank AI Investment Context

Banks have invested **$35B+** in AI, but the allocation tells the story:

| Application | % of AI spend | Maturity |
|------------|--------------|----------|
| Fraud detection | ~25% | Production |
| Risk management / compliance | ~25% | Production |
| Customer service / chatbots | ~20% | Production |
| Operations / document processing | ~20% | Production |
| **Alpha / signal generation** | **~10%** | **Experimental** |

**~90% of bank AI spending is operational, not alpha-generating.**

Why?
- Operational AI has clear ROI (reduce fraud losses, cut headcount)
- Alpha generation is zero-sum: if everyone has the same FM, the alpha disappears
- Regulatory constraints: models must be explainable (FMs are black boxes)
- Data moats matter more than model architecture in finance

**Implication for FMs:** Foundation models are most likely to succeed in finance as *components* of a pipeline (feature extractors, embedding generators) rather than as end-to-end predictors.

---
## 7. Demo: Chronos Zero-Shot on SPY vs. ARIMA

We will now run a concrete comparison:
1. Download SPY daily close prices
2. Use Chronos (tiny, 8M params) for zero-shot forecasting
3. Compare against a simple ARIMA baseline
4. Evaluate on RMSE and directional accuracy

**Installation:**
```bash
pip install chronos-forecasting yfinance statsmodels
```

> **Note:** If Chronos is not installed, the notebook falls back to simulated predictions that mimic typical zero-shot behavior.

In [None]:
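# Download SPY data
try:
    import yfinance as yf
    spy = yf.download('SPY', start='2023-01-01', end='2024-12-31', progress=False)
    # Recent yfinance versions return MultiIndex columns, so spy['Close'] can be a
    # one-column DataFrame; squeeze() reduces it to a Series either way.
    spy_close = spy['Close'].squeeze().dropna()
    # A failed download can return an empty frame instead of raising
    if spy_close.empty:
        raise ValueError('download returned no rows')
    print(f'Downloaded {len(spy_close)} days of SPY data.')
except Exception as e:
    print(f'SPY download failed ({e}), using synthetic data.')
    np.random.seed(42)
    dates = pd.bdate_range('2023-01-01', '2024-12-31')
    returns = np.random.randn(len(dates)) * 0.01 + 0.0003
    spy_close = pd.Series(np.exp(np.cumsum(returns)) * 380, index=dates, name='Close')
    print(f'Generated {len(spy_close)} days of synthetic SPY data.')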

In [None]:
# Split into train / test
train_size = int(len(spy_close) * 0.8)
train = spy_close.iloc[:train_size]
test = spy_close.iloc[train_size:]
forecast_horizon = len(test)

print(f'Train: {len(train)} days ({train.index[0].date()} to {train.index[-1].date()})')
print(f'Test:  {len(test)} days ({test.index[0].date()} to {test.index[-1].date()})')

In [None]:
# ARIMA baseline
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA(1,1,1) on training data
arima_model = ARIMA(train.values, order=(1, 1, 1))
arima_fit = arima_model.fit()

# Forecast
arima_forecast = arima_fit.forecast(steps=forecast_horizon)
arima_forecast = pd.Series(arima_forecast, index=test.index)

print('ARIMA(1,1,1) fitted and forecast generated.')

In [None]:
# Chronos zero-shot forecast
chronos_available = False
try:
    # Import torch inside the try block so a missing install also triggers the fallback
    import torch
    from chronos import ChronosPipeline

    pipeline = ChronosPipeline.from_pretrained(
        'amazon/chronos-t5-tiny',  # 8M params -- fast
        device_map='cpu',
        torch_dtype=torch.float32,
    )

    context = torch.tensor(train.values, dtype=torch.float32).unsqueeze(0)
    chronos_pred = pipeline.predict(
        context,
        prediction_length=forecast_horizon,
        num_samples=20,  # probabilistic forecast
        # Chronos targets horizons of up to 64 steps; lift the guard so the full
        # test window can be forecast (expect quality to degrade at long horizons)
        limit_prediction_length=False,
    )
    # Take median across sample paths (dim 1 indexes the samples)
    chronos_forecast = chronos_pred.median(dim=1).values.squeeze().numpy()
    chronos_forecast = pd.Series(chronos_forecast, index=test.index)
    chronos_available = True
    print('Chronos forecast generated successfully.')

except (ImportError, OSError) as e:
    # ImportError: package missing; OSError: model weights could not be loaded
    print(f'Chronos unavailable ({e}). Using simulated Chronos predictions.')
    print('Install with: pip install chronos-forecasting')
    print()
    # Simulated Chronos: slightly worse than ARIMA (typical zero-shot behavior)
    np.random.seed(99)
    last_price = train.values[-1]
    chronos_noise = np.cumsum(np.random.randn(forecast_horizon) * 0.8)
    chronos_forecast = pd.Series(
        last_price + np.linspace(0, 5, forecast_horizon) + chronos_noise,
        index=test.index
    )
    print('Simulated Chronos forecast generated (to illustrate typical behavior).')

In [None]:
# Evaluate and compare
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate_forecast(actual, predicted, name):
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    mae = mean_absolute_error(actual, predicted)
    # Directional accuracy
    actual_dir = np.sign(np.diff(actual.values))
    pred_dir = np.sign(np.diff(predicted.values))
    dir_acc = np.mean(actual_dir == pred_dir)
    return {'Model': name, 'RMSE': rmse, 'MAE': mae, 'Dir. Accuracy': dir_acc}

results = [
    evaluate_forecast(test, arima_forecast, 'ARIMA(1,1,1)'),
    evaluate_forecast(test, chronos_forecast, 'Chronos (zero-shot)'),
]

results_df = pd.DataFrame(results).set_index('Model')
print(results_df.round(4).to_string())

In [None]:
# Plot the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Full forecast
axes[0].plot(test.index, test.values, label='Actual SPY', color='black', linewidth=1.5)
axes[0].plot(test.index, arima_forecast.values, label='ARIMA', color='steelblue', linestyle='--')
axes[0].plot(test.index, chronos_forecast.values, label='Chronos (zero-shot)', color='indianred', linestyle='--')
axes[0].set_title('SPY Forecast Comparison')
axes[0].legend()
axes[0].set_ylabel('Price')

# Zoomed first 30 days
n_zoom = min(30, len(test))
axes[1].plot(test.index[:n_zoom], test.values[:n_zoom], label='Actual', color='black', linewidth=1.5)
axes[1].plot(test.index[:n_zoom], arima_forecast.values[:n_zoom], label='ARIMA', color='steelblue',
             linestyle='--', marker='o', markersize=3)
axes[1].plot(test.index[:n_zoom], chronos_forecast.values[:n_zoom], label='Chronos', color='indianred',
             linestyle='--', marker='s', markersize=3)
axes[1].set_title(f'First {n_zoom} Days (Zoomed)')
axes[1].legend()
axes[1].set_ylabel('Price')

plt.suptitle('Chronos Zero-Shot vs. ARIMA on SPY', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

### Why does Chronos underperform on SPY?

1. **Training distribution mismatch:** Chronos was trained on energy, traffic, weather -- not financial data
2. **Tokenization resolution:** A fixed grid of 4096 bins shared across all series leaves only a few bins for typical day-to-day price moves, so small changes are coarsely resolved
3. **No cross-sectional information:** Chronos sees only one time series at a time
4. **Long-horizon drift:** The forecast is sampled autoregressively over the full test window, so errors compound and the median path drifts toward generic patterns learned in pre-training
5. **No market context:** The model has no concept of trading calendars, earnings dates, or regime changes

**This is not a failure of foundation models -- it is a failure of applying *generic* foundation models to a *specialized* domain without adaptation.**

---
## 8. Key Papers and References

### Generic Time Series Foundation Models
- Ansari et al., "Chronos: Learning the Language of Time Series" (Amazon, 2024). [arXiv:2403.07815](https://arxiv.org/abs/2403.07815)
- Das et al., "A decoder-only foundation model for time-series forecasting" (Google, 2024). [arXiv:2310.10688](https://arxiv.org/abs/2310.10688)

### Finance-Native Foundation Models
- Shi et al., "Kronos: A Foundation Model for the Language of Financial Markets" (AAAI 2026). K-line tokenization, 93% RankIC improvement, MIT-licensed, 4.1M--102M params.
- "FinCast: A Large-scale Financial Forecasting Foundation Model" (CIKM 2025). 1B params, multi-task pre-training.

### Surveys and Context
- Liang et al., "Foundation Models for Time Series Analysis: A Tutorial and Survey" (KDD 2024)
- Bank AI spending analysis: Evident AI Index (2024), McKinsey Global AI Survey (2024)

### Practical Notes
- **Kronos** is MIT-licensed and fits on an M4 MacBook (4.1M smallest, 102M largest)
- **Chronos** is available via `pip install chronos-forecasting` with models on HuggingFace
- **TimesFM** is available via `pip install timesfm` from Google

---

## Summary

1. Generic TSFMs (Chronos, TimesFM) are impressive engineering but perform poorly on financial data zero-shot
2. Finance-native FMs (Kronos, FinCast) incorporate domain-specific tokenization and training data, and report much stronger results
3. The hybrid approach (FM embeddings fed into XGBoost) may be the most practical path forward
4. Most bank AI investment is operational, not alpha-generating -- FMs for alpha are still experimental
5. The key is not whether FMs "work" but *how you use them* in the pipeline