# OPTIONAL: Time Series ML - Feature Windows and Splits

**Module**: ML700 Advanced Topics (Optional)  
**Notebook**: 02 - Time Series ML: Feature Windows and Splits  
**Status**: OPTIONAL - This notebook covers advanced material beyond the core curriculum.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain why time series data violates the i.i.d. assumption and why random splits cause data leakage
2. Use `TimeSeriesSplit` for walk-forward validation
3. Engineer lag features, rolling window features, and date/time features
4. Properly fit scalers on training data only within each fold
5. Demonstrate the leakage effect of random KFold vs. TimeSeriesSplit

## Prerequisites

- Understanding of train/test splitting and cross-validation (Module ML100)
- Familiarity with regression models (Module ML200)
- Basic pandas and numpy operations

## Table of Contents

1. [Why Time Series Is Different](#1.-Why-Time-Series-Is-Different)
2. [TimeSeriesSplit: Walk-Forward Validation](#2.-TimeSeriesSplit:-Walk-Forward-Validation)
3. [Feature Engineering for Time Series ML](#3.-Feature-Engineering-for-Time-Series-ML)
4. [Hands-On: Synthetic Time Series Modeling](#4.-Hands-On:-Synthetic-Time-Series-Modeling)
5. [Visualizing Train/Test Splits Over Time](#5.-Visualizing-Train/Test-Splits-Over-Time)
6. [Leakage Demo: TimeSeriesSplit vs Random KFold](#6.-Leakage-Demo:-TimeSeriesSplit-vs-Random-KFold)
7. [Common Mistakes](#7.-Common-Mistakes)
8. [Summary](#8.-Summary)

---

## 1. Why Time Series Is Different

Most ML algorithms assume that data points are **independent and identically distributed (i.i.d.)**.
Time series data violates this assumption in two key ways:

1. **Temporal dependence**: Nearby observations are correlated (autocorrelation)
2. **Temporal ordering**: The order of observations matters; future data must not leak into training

**Consequence**: Random train/test splits and standard KFold cross-validation are not valid for time series.
Both let the model train on observations that come after the test points ("seeing the future"), which leads to overly optimistic performance estimates.
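
One way to see the temporal dependence is to compare the lag-1 autocorrelation of a series before and after shuffling. The sketch below builds a series in the same spirit as the one used later in this notebook (trend plus seasonality plus noise); the constants, the seed, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative series: linear trend + yearly seasonality + Gaussian noise
t = np.arange(500)
series = pd.Series(0.02 * t + 5 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 1, t.size))

# Correlation between each value and the value one step earlier
ordered_autocorr = series.autocorr(lag=1)

# Shuffling destroys the temporal ordering, so the correlation should largely vanish
shuffled = series.sample(frac=1.0, random_state=0).reset_index(drop=True)
shuffled_autocorr = shuffled.autocorr(lag=1)

print(f"Lag-1 autocorrelation (original order): {ordered_autocorr:.3f}")
print(f"Lag-1 autocorrelation (shuffled):       {shuffled_autocorr:.3f}")
```

A value close to 1 in the original order means that a neighboring point reveals a lot about the current one, which is exactly the information a random split hands to the model.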

## 2. TimeSeriesSplit: Walk-Forward Validation

Scikit-learn provides `TimeSeriesSplit`, which implements **walk-forward validation**:

```
Fold 1: [TRAIN] [TEST]
Fold 2: [TRAIN     ] [TEST]
Fold 3: [TRAIN          ] [TEST]
Fold 4: [TRAIN               ] [TEST]
```

- Training set grows with each fold
- Test set always comes **after** the training set in time
- No future information leaks into training
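
The diagram above can be reproduced on a toy array. This is a minimal sketch using twelve placeholder samples and `n_splits=3`; the names `X_toy`, `tscv_toy`, and `splits` are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered samples; the values are placeholders, only the order matters
X_toy = np.arange(12).reshape(-1, 1)

tscv_toy = TimeSeriesSplit(n_splits=3)
splits = [(train_idx.tolist(), test_idx.tolist()) for train_idx, test_idx in tscv_toy.split(X_toy)]

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx} test={test_idx}")
```

With the default settings, each test block here holds 12 // (3 + 1) = 3 samples, and every training index is smaller than every test index in the same fold.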

## 3. Feature Engineering for Time Series ML

When using standard ML models (not specialized time series models) on temporal data,
you must create features that capture temporal patterns:

### Lag Features
Shifted values of the target or other variables:  
`lag_1 = value(t-1)`, `lag_2 = value(t-2)`, etc.

### Rolling Window Features
Statistics computed over a sliding window:  
`rolling_mean_7 = mean(value[t-7:t])`, `rolling_std_7 = std(value[t-7:t])`

### Date/Time Features
Extracted from timestamps:  
`hour`, `day_of_week`, `month`, `is_weekend`, `is_holiday` (placeholder)
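
The three feature types can be checked by hand on a tiny series. The sketch below uses five made-up daily values (the frame `toy` and its column names are illustrative) and also includes a deliberately leaky rolling mean for comparison.

```python
import pandas as pd

# Five daily observations starting on Saturday, 2022-01-01
toy = pd.DataFrame(
    {"value": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

# Lag feature: the previous day's value
toy["lag_1"] = toy["value"].shift(1)

# Rolling mean over the three values strictly before the current row
toy["rolling_mean_3"] = toy["value"].shift(1).rolling(3).mean()

# Leaky variant for comparison: the window includes the current row's value
toy["rolling_mean_3_leaky"] = toy["value"].rolling(3).mean()

# Date features derived from the index
toy["day_of_week"] = toy.index.dayofweek
toy["is_weekend"] = (toy.index.dayofweek >= 5).astype(int)

print(toy)
```

On the last row, `rolling_mean_3` averages 20, 30, and 40, while the leaky variant averages 30, 40, and 50, so it already contains the value being predicted.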

## 4. Hands-On: Synthetic Time Series Modeling

Let us create a synthetic time series, engineer features, and evaluate with `TimeSeriesSplit`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit, KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)

In [None]:
# Create synthetic time series: sine wave + linear trend + noise
n_points = 500
time_index = pd.date_range(start='2022-01-01', periods=n_points, freq='D')

t = np.arange(n_points)
trend = 0.02 * t
seasonal = 5 * np.sin(2 * np.pi * t / 365.25)  # yearly cycle
noise = np.random.normal(0, 1, n_points)
values = trend + seasonal + noise

df = pd.DataFrame({'date': time_index, 'value': values})
df.set_index('date', inplace=True)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df.index, df['value'], linewidth=0.8)
ax.set_title('Synthetic Time Series (Trend + Seasonality + Noise)')
ax.set_xlabel('Date')
ax.set_ylabel('Value')
plt.tight_layout()
plt.show()

In [None]:
# Feature engineering
def create_time_series_features(df, target_col='value', lags=(1, 2, 3, 7, 14), windows=(7, 14, 30)):
    """Create lag, rolling, and date features for time series ML."""
    result = df.copy()
    
    # Lag features
    for lag in lags:
        result[f'lag_{lag}'] = result[target_col].shift(lag)
    
    # Rolling window features
    # shift(1) comes first so each window ends at t-1 and never includes the current value
    for window in windows:
        result[f'rolling_mean_{window}'] = result[target_col].shift(1).rolling(window).mean()
        result[f'rolling_std_{window}'] = result[target_col].shift(1).rolling(window).std()
    
    # Date/time features
    result['day_of_week'] = result.index.dayofweek
    result['month'] = result.index.month
    result['day_of_year'] = result.index.dayofyear
    result['is_weekend'] = (result.index.dayofweek >= 5).astype(int)
    
    return result

df_features = create_time_series_features(df)

# Drop rows with NaN from lag/rolling features
df_features.dropna(inplace=True)
print(f"Features created. Shape: {df_features.shape}")
print(f"Columns: {list(df_features.columns)}")

In [None]:
# Prepare X and y
feature_cols = [c for c in df_features.columns if c != 'value']
X = df_features[feature_cols].values
y = df_features['value'].values

print(f"X shape: {X.shape}, y shape: {y.shape}")

In [None]:
# Evaluate with TimeSeriesSplit (proper way)
tscv = TimeSeriesSplit(n_splits=5)

# Use a pipeline so scaler is fit on train portion only in each fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

ts_scores = cross_val_score(pipeline, X, y, cv=tscv, scoring='neg_mean_squared_error')
ts_rmse = np.sqrt(-ts_scores)

print("TimeSeriesSplit Results (RMSE per fold):")
for i, score in enumerate(ts_rmse):
    print(f"  Fold {i+1}: {score:.4f}")
print(f"  Mean RMSE: {ts_rmse.mean():.4f} (+/- {ts_rmse.std():.4f})")
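
To make explicit what the pipeline does inside each fold, the following sketch writes the loop out by hand and compares it against `cross_val_score`. It runs on hypothetical stand-in data (`X_demo`, `y_demo`) rather than the notebook's features, so that it is self-contained.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 time-ordered rows, 3 drifting features
X_demo = rng.normal(size=(200, 3)).cumsum(axis=0)
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.5, 200)

cv = TimeSeriesSplit(n_splits=5)

manual_rmse = []
for train_idx, test_idx in cv.split(X_demo):
    # Scaling statistics come from the training rows only
    scaler = StandardScaler().fit(X_demo[train_idx])
    model = Ridge(alpha=1.0).fit(scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # Test rows are transformed with the training statistics, never refit
    preds = model.predict(scaler.transform(X_demo[test_idx]))
    manual_rmse.append(np.sqrt(np.mean((y_demo[test_idx] - preds) ** 2)))

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipe_rmse = np.sqrt(-cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"))

print("Manual loop RMSE:", np.round(manual_rmse, 4))
print("Pipeline RMSE:   ", np.round(pipe_rmse, 4))
```

The two sets of scores should agree, which is the point: the pipeline is a shorter way to write the same fold-wise fit, and it removes the chance of accidentally fitting the scaler on the full dataset.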

## 5. Visualizing Train/Test Splits Over Time

Let us visualize how TimeSeriesSplit divides the data in each fold.

In [None]:
# Visualize TimeSeriesSplit folds
fig, axes = plt.subplots(tscv.get_n_splits(), 1, figsize=(10, 8), sharex=True)
dates = df_features.index

for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    ax = axes[i]
    ax.plot(dates[train_idx], y[train_idx], 'b-', linewidth=0.7, label='Train')
    ax.plot(dates[test_idx], y[test_idx], 'r-', linewidth=0.7, label='Test')
    ax.set_ylabel(f'Fold {i+1}')
    if i == 0:
        ax.legend(loc='upper left', fontsize=8)

axes[0].set_title('TimeSeriesSplit: Walk-Forward Validation')
axes[-1].set_xlabel('Date')
plt.tight_layout()
plt.show()

print("Each fold uses only past data for training and future data for testing.")
print("The training set grows with each fold.")

## 6. Leakage Demo: TimeSeriesSplit vs Random KFold

Let us compare the performance estimates from TimeSeriesSplit (correct) vs random KFold (data leakage).

In [None]:
# Correct: TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(pipeline, X, y, cv=tscv, scoring='neg_mean_squared_error')
ts_rmse_mean = np.sqrt(-ts_scores).mean()

# WRONG: Random KFold (causes data leakage for time series)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(pipeline, X, y, cv=kf, scoring='neg_mean_squared_error')
kf_rmse_mean = np.sqrt(-kf_scores).mean()

print("Comparison of Cross-Validation Strategies:")
print(f"  TimeSeriesSplit Mean RMSE: {ts_rmse_mean:.4f}  (correct, no leakage)")
print(f"  Random KFold Mean RMSE:   {kf_rmse_mean:.4f}  (WRONG, data leakage!)")
print()
if kf_rmse_mean < ts_rmse_mean:
    pct = (1 - kf_rmse_mean / ts_rmse_mean) * 100
    print(f"Random KFold appears {pct:.1f}% better -- this is an ILLUSION caused by data leakage.")
    print("The model 'sees the future' when random splits mix past and future data.")
else:
    print("In this run random KFold does not look better, but it is still methodologically wrong.")
    print("With stronger temporal patterns, the leakage effect would be more dramatic.")

### Key Takeaway

Random KFold often produces **overly optimistic** error estimates because the model trains on
future data points that are correlated with test points. The "real-world" performance
(predicting genuinely future values) is almost always worse than what random KFold suggests.
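
The effect can be much larger when the model is able to interpolate between neighboring points. The sketch below is a hypothetical setup, not part of the notebook's pipeline: the target is a random walk, the only feature is the time step, and `KNeighborsRegressor` is swapped in because it predicts from the nearest training points in time.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical setup: a random walk, with the time step as the only feature
n = 500
X_walk = np.arange(n, dtype=float).reshape(-1, 1)
y_walk = rng.normal(0, 1, n).cumsum()

knn = KNeighborsRegressor(n_neighbors=5)

def mean_rmse(cv):
    """Mean RMSE across folds for the given cross-validation splitter."""
    scores = cross_val_score(knn, X_walk, y_walk, cv=cv, scoring="neg_mean_squared_error")
    return np.sqrt(-scores).mean()

walk_ts_rmse = mean_rmse(TimeSeriesSplit(n_splits=5))
walk_kf_rmse = mean_rmse(KFold(n_splits=5, shuffle=True, random_state=42))

print(f"TimeSeriesSplit mean RMSE: {walk_ts_rmse:.3f}")
print(f"Shuffled KFold mean RMSE:  {walk_kf_rmse:.3f}")
```

Under shuffled KFold, each test point has training neighbors on both sides in time, so the model simply averages them. Under `TimeSeriesSplit`, the nearest neighbors all lie before the test block, which is the situation a real forecast faces.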

## 7. Common Mistakes

1. **Random splitting time series data**: Always respect temporal order. Use `TimeSeriesSplit` or manual temporal splits.
2. **Using future data as features**: Lag features must use a positive shift (`shift(1)` or more). Rolling windows must not include the current value, so apply `shift(1)` before `rolling(...)`.
3. **Not respecting temporal order in scaling**: Fit the scaler on the training portion only. Use `Pipeline` so this happens automatically in cross-validation.
4. **Forgetting to drop NaN rows**: Lag and rolling features create NaN values at the beginning of the series. Drop them before modeling.
5. **Ignoring the growing training set**: In walk-forward validation, earlier folds have less training data and may perform worse. This is expected behavior.
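
For the manual temporal split mentioned in item 1, a minimal sketch is a single chronological holdout. It assumes the frame is already sorted by time; the frame `frame` and the 80/20 ratio are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical daily frame standing in for a feature table, sorted by date
frame = pd.DataFrame(
    {"value": np.arange(100, dtype=float)},
    index=pd.date_range("2022-01-01", periods=100, freq="D"),
)

# Hold out the final 20% of rows; no shuffling, so temporal order is preserved
split_point = int(len(frame) * 0.8)
train, test = frame.iloc[:split_point], frame.iloc[split_point:]

print(f"Train: {train.index.min().date()} to {train.index.max().date()} ({len(train)} rows)")
print(f"Test:  {test.index.min().date()} to {test.index.max().date()} ({len(test)} rows)")
```

A single holdout gives only one performance estimate, so it is best suited to a final check after model selection with `TimeSeriesSplit`.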

## 8. Summary

- Time series data is **not i.i.d.** -- temporal order and autocorrelation must be respected
- Use `TimeSeriesSplit` for walk-forward cross-validation (never random KFold)
- Key feature types: **lag features**, **rolling window statistics**, **date/time features**
- Always fit scalers and transformers on the **training portion only** (use `Pipeline`)
- Random KFold on time series causes **data leakage**, producing overly optimistic estimates
- The gap between TimeSeriesSplit and random KFold scores gives a rough indication of leakage, although part of it also comes from the smaller training sets in early walk-forward folds