
# Week 2 — Baseline Models & Time-Series Evaluation

## Learning goals
This week is about **building baseline models** and learning **evaluation discipline**.

By the end of Week 2, you should be able to:
- Build meaningful **baseline predictors** for returns
- Evaluate models using **walk-forward (time-series) validation**
- Understand why **prediction accuracy ≠ trading performance**
- Avoid common sources of **look-ahead bias**




## What you will build

You will implement and compare:

### Naive baselines
1. Zero-return predictor (no model)
2. Rolling historical mean predictor

### Linear model
1. Ordinary Least Squares (OLS)  

### Tree models (optional)
1. Decision Trees
2. Random Forests

You will evaluate them using **strictly forward-looking splits**.


## Why start with baselines?

Financial return prediction is an extremely noisy problem.  
Before using complex models, it is essential to understand how **simple strategies behave**.

Baselines help us:
- Set a realistic performance reference
- Detect data leakage and evaluation bugs
- Understand whether a model is learning signal or just noise

In finance, it is common for simple baselines to be **_surprisingly hard to beat_**.


## Naive baselines

### 1. Zero-return predictor
The zero predictor always predicts a return of zero.

This corresponds to the assumption that tomorrow's price equals today's price, i.e. the expected price change is zero.

Why this matters:
- Daily asset returns have a mean close to zero
- This predictor often achieves a reasonable RMSE
- It provides a strong sanity check for all models

If a model cannot beat this baseline **out-of-sample**, it is likely useless.
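As a minimal sketch of this baseline, the snippet below scores the zero predictor on synthetic returns. The synthetic series and its parameters are assumptions standing in for the Week 1 data. It also illustrates why the baseline is hard to beat: when the mean return is near zero, the RMSE of the zero predictor is roughly the volatility of the returns.

```python
import numpy as np

# Synthetic daily returns (assumption: a stand-in for the Week 1 data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0005, scale=0.01, size=500)

# Zero predictor: forecast 0 for every day
zero_pred = np.zeros_like(returns)

# RMSE of the zero predictor is sqrt(mean(r^2))
zero_rmse = float(np.sqrt(np.mean((zero_pred - returns) ** 2)))

# mean(r^2) = variance + mean^2, so with a near-zero mean
# the RMSE is close to the volatility of the returns
volatility = float(returns.std())
print(f"zero-predictor RMSE: {zero_rmse:.5f}, volatility: {volatility:.5f}")
```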


### 2. Rolling historical mean predictor
This predictor estimates the expected return as the average of recent past returns:

$$
\hat r_{t+1} = \frac{1}{W}\sum_{i=t-W+1}^t r_i
$$

Interpretation:
- Assumes short-term return persistence
- Equivalent to a very simple momentum model
- Sensitive to window length

This baseline introduces **time dependence** without using any machine learning.
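The formula above can be sketched with pandas as follows. The function name `rolling_mean_forecast` and the toy series are hypothetical, chosen only for illustration. The key detail is the `shift(1)`: the rolling mean computed at day $t$ is the forecast for day $t+1$, so it must be moved one row forward before it is compared with realized returns.

```python
import pandas as pd

def rolling_mean_forecast(returns: pd.Series, window: int) -> pd.Series:
    """Forecast r_{t+1} as the mean of r_{t-W+1..t}, indexed by the target day t+1."""
    # rolling(window).mean() at day t uses returns up to and including day t;
    # shift(1) moves that value to day t+1 so it lines up with the realized return.
    return returns.rolling(window).mean().shift(1)

returns = pd.Series([0.01, -0.02, 0.03, 0.00, 0.01, -0.01])
forecast = rolling_mean_forecast(returns, window=3)

# The first `window` rows have no forecast because there is not enough history
print(pd.DataFrame({"realized": returns, "forecast": forecast}))
```

Forgetting the shift would let the forecast for a day include that day's own return, which is a common source of look-ahead bias.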


### 3. OLS Predictor

For each day $t$, you construct a feature vector using only information
available up to that day:

$$
x_t =
\begin{bmatrix}
r_t \\
r_{t-1} \\
\text{rolling mean}_t \\
\text{rolling volatility}_t \\
\vdots
\end{bmatrix}
$$

**Important:**
- All features use data from time $\le t$
- No future returns appear in $x_t$

This is what keeps the model free of **look-ahead bias**: the features only look backward, and only the target looks forward.

---

### Training data construction

From historical data, you build a dataset of input–output pairs:

$$
\{(x_1, y_1), (x_2, y_2), \dots, (x_T, y_T)\}
$$

where the target is defined as the **next-day return**:

$$
y_t = r_{t+1}
$$

Each example answers the question:
> “Using information available at day $t$, can we predict the return at day $t+1$?”
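A small sketch of the target alignment, assuming returns are stored in a pandas Series indexed by date (the dates and values here are made up). `shift(-1)` pulls the next day's return back onto the current row, and the final row is dropped because its target is not yet known.

```python
import pandas as pd

returns = pd.Series(
    [0.01, -0.02, 0.03, 0.00],
    index=pd.date_range("2024-01-01", periods=4, freq="B"),
    name="r",
)

# y_t = r_{t+1}: shift(-1) pulls tomorrow's return back onto today's row
pairs = pd.DataFrame({"r_t": returns, "y_t": returns.shift(-1)})

# The last day has no next-day return yet, so it cannot be a training example
pairs = pairs.dropna()
print(pairs)
```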

---

### OLS objective

Ordinary Least Squares (OLS) finds weights $w$ by minimizing the sum of squared errors (equivalent to minimizing the mean squared error):

$$
\min_w \sum_{t=1}^{T} \left( y_t - w^\top x_t \right)^2
$$

**Interpretation:**
- Learn a linear mapping from today’s features to tomorrow’s return
- This is standard regression, but applied to time-ordered financial data

---

### Making a prediction

At the **end of day $T$**:
- You observe the feature vector $x_T$
- You compute the prediction:

$$
\hat r_{T+1} = w^\top x_T
$$

This value is the model’s **predicted return for the next trading day**.
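The objective and the prediction step can be sketched with NumPy alone. The data here is synthetic and the "true" weights are invented for the example. Note that the objective as written has no intercept; appending a constant 1 to each $x_t$ would add one, which is a modelling choice this sketch does not make.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 3

# Rows of X are the feature vectors x_t (synthetic stand-ins)
X = rng.normal(size=(T, d))
true_w = np.array([0.5, -0.2, 0.0])          # invented for illustration
y = X @ true_w + rng.normal(scale=0.1, size=T)  # plays the role of y_t = r_{t+1}

# Solve min_w sum_t (y_t - w^T x_t)^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction for the next day from the most recent feature vector
x_T = X[-1]
r_hat = float(w @ x_T)
print("weights:", w, "prediction:", r_hat)
```

In the notebook, `sklearn.linear_model.LinearRegression` solves the same least-squares problem and fits an intercept by default.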

In general, choosing which features to include in $x_t$ is hard and requires experimentation and a lot of intuition. At this stage, you do not need to work out which features give the best returns.

Instead, the feature vector $x_t$ should use this fixed set:

- $r_t$
- $r_{t-1}$
- 20-day rolling mean of returns
- 20-day rolling volatility of returns
- 5-day momentum (cumulative return)
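One possible way to build this feature set with pandas is sketched below. The function and column names are hypothetical, and two details are assumptions: volatility is taken as the rolling sample standard deviation, and 5-day momentum is taken as the compounded simple return over the last 5 days (with log returns it would be a rolling sum instead).

```python
import numpy as np
import pandas as pd

def build_features(returns: pd.Series) -> pd.DataFrame:
    """Return one row of features per day t, using only returns up to day t."""
    features = pd.DataFrame(index=returns.index)
    features["r_t"] = returns
    features["r_t_minus_1"] = returns.shift(1)
    features["roll_mean_20"] = returns.rolling(20).mean()
    # pandas uses the sample standard deviation (ddof=1) by default
    features["roll_vol_20"] = returns.rolling(20).std()
    # 5-day cumulative simple return: prod(1 + r) - 1 over the last 5 days
    features["mom_5"] = (1.0 + returns).rolling(5).apply(np.prod, raw=True) - 1.0
    return features

# Synthetic returns as a stand-in for the Week 1 data
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, size=60))

# Rows without a full 20-day history contain NaN and are dropped
X = build_features(returns).dropna()
print(X.head())
```

A quick leakage check is to change returns after some day and confirm that features on earlier days do not move.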



### 4. Random Forest Predictor (optional)

Random Forests are **tree-based ensemble models** that combine predictions from
many decision trees to reduce variance and improve stability.

As with OLS, the goal is to predict the **next-day return** using information
available up to the current day.

---

### Feature construction

For each day $t$, you construct a feature vector using only past and present
information:

$$
x_t =
\begin{bmatrix}
r_t \\
r_{t-1} \\
\text{rolling mean}_t \\
\text{rolling volatility}_t \\
\text{momentum}_t \\
\vdots
\end{bmatrix}
$$

**Important:**
- All features use data from time $\le t$
- No future returns appear in $x_t$

The same feature set used for linear models is reused here to ensure a fair
comparison across models.

---

### Training data construction

From historical data, you build a dataset of input–output pairs:

$$
\{(x_1, y_1), (x_2, y_2), \dots, (x_T, y_T)\}
$$

where the target is defined as the **next-day return**:

$$
y_t = r_{t+1}
$$

Each example answers the question:
> “Using information available at day $t$, can we predict the return at day $t+1$?”

---

### Random Forest objective (intuition)

A Random Forest consists of many decision trees, each trained on a bootstrap
sample of the data and a random subset of features.

Each tree produces a prediction $\hat r_{t+1}^{(k)}$, and the final forecast is
the average across trees:

$$
\hat r_{t+1}
=
\frac{1}{K}
\sum_{k=1}^{K} \hat r_{t+1}^{(k)}
$$

where $K$ is the number of trees in the forest.
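The averaging step can be checked directly with scikit-learn's `RandomForestRegressor`. The data is synthetic and every hyperparameter below is an arbitrary choice for the sketch, not a recommendation. `max_features="sqrt"` is set explicitly so that each split considers a random subset of features, matching the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and targets with a weak nonlinear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 0.1 * X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=300)

forest = RandomForestRegressor(
    n_estimators=50,        # K trees
    max_depth=3,            # shallow trees to limit overfitting
    min_samples_leaf=20,
    max_features="sqrt",    # random subset of features at each split
    random_state=0,
)
# Train on everything except the last row, then predict from the last row
forest.fit(X[:-1], y[:-1])

x_T = X[-1:].copy()
rf_pred = forest.predict(x_T)[0]

# The forest prediction is the mean of the individual tree predictions
tree_preds = np.array([tree.predict(x_T)[0] for tree in forest.estimators_])
print(rf_pred, tree_preds.mean())
```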

---

### Interpretation

- Random Forests can capture **nonlinear relationships** and feature interactions
- They perform implicit feature selection
- They are far more flexible than linear models

However, in financial time series:
- Signal-to-noise ratios are extremely low
- Relationships are unstable over time
- High-capacity models often **overfit noise**

For this reason, Random Forests are included as a **learning tool**, not as a
recommended production model.

---

### Making a prediction

At the **end of day $T$**:
- You observe the feature vector $x_T$
- Each tree produces a prediction
- The Random Forest averages these predictions:

$$
\hat r_{T+1} = \text{RF}(x_T)
$$

This value is the model’s **predicted return for the next trading day**.

---

### Feature choice for this assignment

In general, designing good features for tree-based models is difficult and
requires extensive experimentation.

For this stage, you should **not** search for optimal features.

Use the following fixed feature set:

- $r_t$
- $r_{t-1}$
- 20-day rolling mean of returns
- 20-day rolling volatility of returns
- 5-day momentum (cumulative return)

This allows us to focus on **evaluation discipline** rather than feature engineering.


## Evaluating Prediction Accuracy

The goal of prediction evaluation is to measure how close the model’s predicted
returns are to the realized future returns.

Let:
- $y_t = r_{t+1}$ be the true next-day return
- $\hat y_t$ be the predicted return made using information available at time $t$

Evaluation must always be performed **out-of-sample**, on data that comes strictly after the data used for training.


### Mean Squared Error (MSE)

The Mean Squared Error measures the average squared difference between predicted
and realized returns:

$$
\text{MSE}
=
\frac{1}{N}
\sum_{t=1}^{N}
(\hat y_t - y_t)^2
$$

Properties:
- Penalizes large errors heavily
- Sensitive to outliers
- Commonly used for regression models

Lower MSE indicates more accurate predictions.


### Root Mean Squared Error (RMSE)

RMSE is the square root of MSE:

$$
\text{RMSE} = \sqrt{\text{MSE}}
$$

Why RMSE is preferred:
- Same units as returns
- Easier to interpret than MSE
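Both metrics take only a few lines of NumPy. The helper names and the toy numbers below are made up for illustration; in the notebook, the inputs would be the out-of-sample targets and forecasts.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between realized and predicted returns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def rmse(y_true, y_pred):
    """Square root of MSE, in the same units as the returns."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Toy out-of-sample targets and two sets of forecasts (made-up numbers)
y_true = np.array([0.01, -0.02, 0.00, 0.03])
pred_zero = np.zeros(4)
pred_model = np.array([0.02, -0.01, 0.01, 0.01])

for name, pred in [("zero", pred_zero), ("model", pred_model)]:
    print(f"{name}: MSE={mse(y_true, pred):.6f}, RMSE={rmse(y_true, pred):.4f}")
```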


Create plots to compare the prediction accuracy of the models.

## References & Further Reading

### Python & Data Handling
- **pandas documentation**  
  https://pandas.pydata.org/docs/  
  (Used for time-series handling, rolling windows, and feature construction)

---

### Linear Models (OLS & Ridge)
- **scikit-learn — Linear Regression (OLS)**  
  https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

- **scikit-learn — Ridge Regression**  
  https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression

---

### Tree-Based Models (Optional)
- **scikit-learn — Decision Trees**  
  https://scikit-learn.org/stable/modules/tree.html

- **scikit-learn — Random Forest Regressor**  
  https://scikit-learn.org/stable/modules/ensemble.html#random-forests

---

### Statistics & Intuition (Highly Recommended)
- **StatQuest with Josh Starmer (YouTube)**  
  https://www.youtube.com/@statquest  
  (Clear intuition for regression, bias–variance tradeoff, and tree-based models)

Recommended StatQuest videos:
- Linear Regression  
  https://www.youtube.com/watch?v=nk2CQITm_eo
- Ridge Regression  
  https://www.youtube.com/watch?v=Q81RR3yKn30
- Decision Trees  
  https://www.youtube.com/watch?v=7VeUPuFGJHk
- Random Forests  
  https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

---

### MIT OpenCourseWare (Optional, Deeper Theory)
- **MIT OCW — 18.065 Matrix Methods in Data Analysis**  
  https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/

- **MIT OCW — 6.036 Introduction to Machine Learning**  
  https://ocw.mit.edu/courses/6-036-introduction-to-machine-learning-fall-2020/

These resources provide deeper mathematical intuition but are **not required**
to complete this assignment.



## Files
- `task2.ipynb` — complete all coding and questions here

Reuse the data and features created in **Week 1**.



## Note

- ❌ Random train/test splits are **not allowed**
- ✅ Only time-series / walk-forward evaluation is permitted
- ✅ Training data must come strictly **before** test data
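One way to satisfy these rules is an expanding-window walk-forward split, sketched below. The generator name and its parameters (`initial_train`, `test_size`) are hypothetical, and an expanding window is only one option; a fixed-length rolling window is equally valid.

```python
from typing import Iterator, Tuple

import numpy as np

def walk_forward_splits(
    n_samples: int, initial_train: int, test_size: int
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window."""
    start = initial_train
    while start + test_size <= n_samples:
        train_idx = np.arange(0, start)                  # everything before the test block
        test_idx = np.arange(start, start + test_size)   # the next block in time
        yield train_idx, test_idx
        start += test_size

splits = list(walk_forward_splits(n_samples=10, initial_train=4, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

For each split, the model is refit on the training indices and scored only on the test indices. scikit-learn's `TimeSeriesSplit` produces similar forward-only folds if you prefer a library implementation.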




## Submission instructions

1. Complete `task2.ipynb`
2. Create a branch:
```bash
git checkout -b submission-week2
git push -u origin submission-week2
```
3. Submit the link to the *most recent commit* on this branch (before the submission deadline).

That commit will be treated as a **fixed snapshot**.
