# Week 9 Homework -- Foundation Model Evaluation

**Due:** One week from the seminar date

**Total:** 100 points

| Part | Topic | Points |
|------|-------|--------|
| 1 | Benchmark Chronos zero-shot on 10 stocks | 20 |
| 2 | Fine-tune Chronos or small FM on financial data | 25 |
| 3 | Hybrid: FM embeddings --> XGBoost | 25 |
| 4 | Compare all approaches (IC, R-squared, Sharpe) | 15 |
| 5 | Write-up: when do FMs add value? | 15 |

---

**Setup:**
```bash
pip install chronos-forecasting yfinance xgboost statsmodels torch scikit-learn pandas matplotlib
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

print('Imports ready.')

---
## Part 1: Benchmark Chronos Zero-Shot on 10 Stocks (20 pts)

### Requirements

1. **Download daily close prices** for 10 stocks from at least 2 sectors (e.g., tech, financials, healthcare). Use 2020-2024 data.

2. **Run Chronos-tiny zero-shot** on each stock:
   - Context: last 252 trading days of the training period
   - Forecast: next 21 trading days
   - Use `amazon/chronos-t5-tiny` model

3. **Compute baselines** for each stock:
   - Naive (last value)
   - ARIMA(1,1,1)
   - Simple moving average (state the window length you choose)

4. **Report** for each stock:
   - RMSE
   - MAE
   - Directional accuracy (share of forecast steps where the predicted direction of change matches the realized direction)

5. **Create a summary table** comparing all models across all stocks.
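
The two simplest baselines and the three metrics above can be sketched as follows. This is a minimal starting point, not a required implementation: the function names, the 21-day default SMA window, and the choice to measure direction relative to the last observed training price are all assumptions made for illustration.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value for every step of the horizon."""
    return np.full(horizon, history[-1], dtype=float)

def sma_forecast(history, horizon, window=21):
    """Repeat the mean of the last `window` observations (window is an assumed default)."""
    return np.full(horizon, np.mean(history[-window:]), dtype=float)

def rmse(actual, forecast):
    """Root mean squared error over the forecast horizon."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2)))

def mae(actual, forecast):
    """Mean absolute error over the forecast horizon."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))))

def directional_accuracy(actual, forecast, last_value):
    """Share of steps where forecast and actual lie on the same side of last_value."""
    actual_dir = np.sign(np.asarray(actual) - last_value)
    forecast_dir = np.sign(np.asarray(forecast) - last_value)
    return float(np.mean(actual_dir == forecast_dir))

# Toy numbers, not market data.
history = np.array([100.0, 102.0, 101.0, 103.0])
actual = np.array([104.0, 102.0, 105.0])
forecast = np.array([104.0, 104.0, 104.0])

print(rmse(actual, forecast), mae(actual, forecast))
print(directional_accuracy(actual, forecast, last_value=history[-1]))
```

Under this definition the naive forecast always predicts "no change", so its directional accuracy is not informative. Say in your table how you handle that case.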

### Grading
- 10 stocks, correct data loading: 5 pts
- Chronos correctly run or sensible fallback: 5 pts
- Baseline models implemented: 5 pts
- Clean evaluation table: 5 pts

In [None]:
# YOUR CODE HERE -- Part 1

# Suggested tickers (mix of sectors):
tickers = ['AAPL', 'MSFT', 'GOOGL', 'NVDA', 'JPM', 'BAC', 'JNJ', 'PFE', 'XOM', 'CVX']

# Step 1: Download data
# ...

# Step 2: For each ticker, run Chronos zero-shot
# ...

# Step 3: Compute baselines (naive, ARIMA, SMA)
# ...

# Step 4: Evaluate all models
# ...

# Step 5: Summary table
# ...

---
## Part 2: Fine-Tune Chronos or Small FM (25 pts)

### Requirements

1. **Prepare financial training data:**
   - Use at least 20 stock price series as training data
   - Hold out the last 6 months as the test period

2. **Fine-tune Chronos-tiny** on the financial data:
   - If compute allows: run actual fine-tuning (5-10 epochs)
   - If compute is limited: describe the setup in detail, show the configuration, and explain what you would expect

3. **Evaluate fine-tuned model** on the same 10 stocks from Part 1.

4. **Compare** zero-shot vs. fine-tuned performance.
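
One way to implement the 6-month holdout is to split on calendar dates rather than on row counts. The sketch below assumes a date-indexed DataFrame with one column per ticker. The helper `split_holdout`, the placeholder tickers, and the synthetic prices are hypothetical and stand in for your downloaded data.

```python
import numpy as np
import pandas as pd

def split_holdout(prices, holdout_months=6):
    """Split a date-indexed price frame so the last `holdout_months` form the test set."""
    cutoff = prices.index.max() - pd.DateOffset(months=holdout_months)
    train = prices.loc[prices.index <= cutoff]
    test = prices.loc[prices.index > cutoff]
    return train, test

# Synthetic business-day prices stand in for downloaded data.
dates = pd.bdate_range("2023-01-02", "2024-12-31")
rng = np.random.default_rng(0)
log_returns = rng.normal(0.0, 0.01, size=(len(dates), 3))
prices = pd.DataFrame(
    100.0 * np.exp(np.cumsum(log_returns, axis=0)),
    index=dates,
    columns=["AAA", "BBB", "CCC"],  # placeholder tickers
)

train, test = split_holdout(prices)
print(train.index.max().date(), test.index.min().date())
```

Splitting on a single cutoff date keeps every series aligned, so no test-period observation from one stock can leak into the training set of another.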

### Grading
- Data preparation: 5 pts
- Fine-tuning setup (code + config): 10 pts
- Evaluation on test set: 5 pts
- Analysis of improvement: 5 pts

> **Note:** If you cannot run fine-tuning due to compute constraints, you can receive up to 20/25 pts by providing a detailed and correct setup with thoughtful analysis of expected results.

In [None]:
# YOUR CODE HERE -- Part 2

# Step 1: Prepare training data (20+ stock series)
# ...

# Step 2: Fine-tune Chronos-tiny
# ...

# Step 3: Evaluate fine-tuned model
# ...

# Step 4: Compare zero-shot vs. fine-tuned
# ...

---
## Part 3: Hybrid -- FM Embeddings + XGBoost (25 pts)

### Requirements

1. **Extract FM embeddings** from Chronos (or another FM) for each stock:
   - Use rolling 60-day windows
   - Extract encoder hidden states (or simulate them with a documented approach)
   - Result: one embedding vector per stock per day

2. **Create hand-crafted features:**
   - At minimum: momentum (1d, 5d, 20d), volatility (20d, 60d), SMA ratio, distance from the 52-week high

3. **Train three XGBoost models:**
   - (a) Hand-crafted features only
   - (b) FM embeddings only
   - (c) Combined (hybrid)

4. **Use proper time-series cross-validation** (expanding or rolling window).

5. **Report** IC, OOS R-squared, and a feature importance analysis.
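
The minimum hand-crafted feature set from requirement 2 could look like the sketch below for a single stock. The column names, the 20-day SMA window, the 252-day proxy for 52 weeks, and the next-day-return target are assumptions. Adapt them to your own design.

```python
import numpy as np
import pandas as pd

def handcrafted_features(close):
    """Build the minimum feature set from one stock's daily close series."""
    ret = close.pct_change()
    feats = pd.DataFrame(index=close.index)
    for n in (1, 5, 20):
        feats[f"mom_{n}d"] = close.pct_change(n)          # n-day momentum
    for n in (20, 60):
        feats[f"vol_{n}d"] = ret.rolling(n).std()         # n-day realized volatility
    feats["sma_ratio_20"] = close / close.rolling(20).mean()
    feats["dist_52w_high"] = close / close.rolling(252).max() - 1.0
    # Target: next-day return. Features at day t use data up to t only.
    feats["target"] = ret.shift(-1)
    return feats.dropna()

# Synthetic random-walk prices stand in for real data.
dates = pd.bdate_range("2022-01-03", periods=400)
rng = np.random.default_rng(0)
close = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 400))), index=dates)

feats = handcrafted_features(close)
print(feats.shape)
```

The 252-day rolling maximum is the binding warm-up period here, so the first 251 rows are dropped along with the final row, which has no next-day target.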

### Grading
- Embedding extraction (or well-documented simulation): 8 pts
- Hand-crafted features: 5 pts
- Three models trained with proper CV: 7 pts
- Feature importance analysis: 5 pts
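
For the time-series cross-validation requirement, an expanding-window splitter can be written in a few lines. The sketch below is one possible version. The function name and the default sizes are assumptions, and scikit-learn's `TimeSeriesSplit` is a ready-made alternative.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits=5, min_train=None):
    """Yield (train_idx, test_idx) pairs where training always precedes testing."""
    if min_train is None:
        min_train = n_samples // (n_splits + 1)
    test_size = (n_samples - min_train) // n_splits
    for k in range(n_splits):
        train_end = min_train + k * test_size
        test_end = train_end + test_size
        # Training set grows each fold; test block is the next unseen chunk.
        yield np.arange(0, train_end), np.arange(train_end, test_end)

splits = list(expanding_window_splits(120, n_splits=5))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx[0], test_idx[-1])
```

With a panel of several stocks, apply the splitter to the sorted unique dates rather than to raw rows, so that all stocks on a given day fall in the same fold.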

In [None]:
# YOUR CODE HERE -- Part 3

# Step 1: Extract FM embeddings
# ...

# Step 2: Hand-crafted features
# ...

# Step 3: Train three XGBoost models
# ...

# Step 4: Time-series cross-validation
# ...

# Step 5: Feature importance analysis
# ...

---
## Part 4: Compare All Approaches (15 pts)

### Requirements

1. **Create a comprehensive comparison table** with columns:
   - Model name
   - IC (information coefficient)
   - OOS R-squared
   - RMSE
   - Directional accuracy
   - Portfolio Sharpe (long-short top/bottom quintile, if applicable)

2. **Include all models:**
   - Naive baseline
   - ARIMA
   - Chronos zero-shot
   - Chronos fine-tuned (or expected results)
   - XGBoost (hand-crafted only)
   - XGBoost (embeddings only)
   - XGBoost (hybrid)

3. **Visualize** the comparison with appropriate charts.

4. **Compute portfolio Sharpe** for the predictive models:
   - Each day, go long stocks in the top quintile of predicted returns
   - Go short stocks in the bottom quintile
   - Report annualized Sharpe ratio
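
The IC, OOS R-squared, and long-short Sharpe columns can be computed from two aligned DataFrames of predicted and realized returns (rows are days, columns are stocks). The sketch below is illustrative only: the function names are hypothetical, IC is taken as a daily cross-sectional rank correlation, OOS R-squared uses a zero-return benchmark (an assumed convention; use the training mean if you prefer), and transaction costs are ignored.

```python
import numpy as np
import pandas as pd

def daily_rank_ic(pred, realized):
    """Cross-sectional Spearman correlation per day, computed as Pearson on ranks."""
    return pred.rank(axis=1).corrwith(realized.rank(axis=1), axis=1)

def oos_r2(pred, realized):
    """1 - SSE/SST, with SST measured against a zero-return benchmark (assumption)."""
    p = pred.to_numpy().ravel()
    r = realized.to_numpy().ravel()
    return float(1.0 - np.sum((r - p) ** 2) / np.sum(r ** 2))

def long_short_returns(pred, realized, quantile=0.2):
    """Each day: equal-weight long the top quantile, short the bottom quantile."""
    n_assets = pred.shape[1]
    k = max(1, int(round(n_assets * quantile)))
    ranks = pred.rank(axis=1, method="first")
    long_leg = realized.where(ranks > n_assets - k).mean(axis=1)
    short_leg = realized.where(ranks <= k).mean(axis=1)
    return long_leg - short_leg

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over standard deviation of daily returns, scaled by sqrt(252)."""
    return float(np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std())

# Synthetic returns and noisy predictions stand in for model output.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-02", periods=250)
cols = [f"S{i}" for i in range(10)]  # placeholder tickers
realized = pd.DataFrame(rng.normal(0.0, 0.01, (250, 10)), index=dates, columns=cols)
pred = realized * 0.1 + rng.normal(0.0, 0.01, (250, 10))

ls = long_short_returns(pred, realized)
print(daily_rank_ic(pred, realized).mean(), oos_r2(pred, realized), annualized_sharpe(ls))
```

With 10 stocks, each quintile leg holds only two names, so expect the resulting Sharpe estimate to be noisy.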

### Grading
- Complete comparison table: 5 pts
- Visualization: 5 pts
- Portfolio Sharpe calculation: 5 pts

In [None]:
# YOUR CODE HERE -- Part 4

# Step 1: Aggregate all results into a single DataFrame
# ...

# Step 2: Visualization
# ...

# Step 3: Portfolio Sharpe calculation
# ...

---
## Part 5: Write-Up -- When Do FMs Add Value? (15 pts)

### Requirements

Write a 500-800 word analysis addressing:

1. **When do foundation models add value** in financial forecasting? Under what conditions (data regime, forecast horizon, asset class) did you see improvement?

2. **When do they fail?** What explains the cases where Chronos underperformed simple baselines?

3. **The hybrid approach:** Is the incremental value of FM embeddings worth the computational cost? Compare the marginal IC improvement vs. the infrastructure complexity.

4. **Future outlook:** Based on Kronos (AAAI 2026) and FinCast (CIKM 2025), where is this field heading? Will finance-native FMs become standard tools?

5. **Practical recommendation:** If you were advising a quantitative fund, would you invest in FM infrastructure today? Why or why not?

### Grading
- Depth of analysis: 5 pts
- Evidence from your experiments: 5 pts
- Practical insight: 5 pts

*YOUR WRITE-UP HERE*

...