__Forecasting__

- Build sequences (input windows).
- Train baselines (naive, EWMA).
- Train LSTM/GRU (direct multi‑output); tune window length, depth, dropout.
- Evaluate per horizon and by regime; rolling origin backtest.
- Refine with regime‑aware heads or Seq2Seq if needed.
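
The two baselines in the plan can be sketched as below. This is a minimal illustration on a made-up series, not the notebook's implementation: the helper names `naive_forecast` and `ewma_forecast` and the smoothing value `alpha=0.3` are assumptions. Both baselines are flat, meaning they repeat a single value across all H steps.

```python
import numpy as np
import pandas as pd

def naive_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the last observed value for every step of the horizon."""
    return np.full(horizon, history[-1], dtype=float)

def ewma_forecast(history: np.ndarray, horizon: int, alpha: float = 0.3) -> np.ndarray:
    """Repeat the last exponentially weighted moving average for every step."""
    # adjust=False gives the recursive form: level_t = (1 - alpha) * level_{t-1} + alpha * x_t
    level = pd.Series(history).ewm(alpha=alpha, adjust=False).mean().iloc[-1]
    return np.full(horizon, level, dtype=float)

# Made-up stand-in for one ticker's ATM_IV history (illustrative values only)
history = np.array([0.28, 0.27, 0.28, 0.26, 0.27])
naive_pred = naive_forecast(history, horizon=3)
ewma_pred = ewma_forecast(history, horizon=3)
```

Any LSTM/GRU model should beat both of these per horizon before it is worth refining further.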

In [11]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

__Load parquet with aggregated features and regimes__

In [17]:
regimes_df = pd.read_parquet("daily_features_with_regimes.parquet") 
regimes_df.head()

Unnamed: 0,date,ATM_IV,Skew,Curvature,dgs2,ticker,year,Regime_K
0,2023-01-03,0.282223,0.047618,-0.000384,4.4,QQQ,2023,1
1,2023-01-04,0.274456,0.061536,-0.001318,4.36,QQQ,2023,1
2,2023-01-05,0.28238,0.038932,-0.0062,4.45,QQQ,2023,1
3,2023-01-06,0.258931,0.052242,0.000721,4.24,QQQ,2023,1
4,2023-01-09,0.267719,0.045165,0.00132,4.19,QQQ,2023,1


In [None]:
Before building sequence windows, normalize the features (e.g., z‑score scaling) so that ATM_IV, Skew, Curvature, and dgs2 are on comparable scales. In the sample above dgs2 is around 4 while Curvature is around 0.001, so without scaling the large‑magnitude variables would dominate the updates. Scaling also stabilizes training.

In [None]:
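scale_cols = ['ATM_IV','Skew','Curvature','dgs2']
scaler = StandardScaler()
# Note: fitting on the full frame lets test-period statistics leak into training;
# for the backtest, fit on the training span only and reuse the scaler via transform().
# ATM_IV is also the target, so the y values built later are in z-score units.
regimes_df[scale_cols] = scaler.fit_transform(regimes_df[scale_cols])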
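
The scaling cell above fits on the whole frame. For a rolling-origin backtest, one option is to compute the statistics on the training span only and apply them everywhere. The sketch below does this with plain pandas on a made-up frame; the name `demo`, the boundary `split_date`, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in frame; column names follow the notebook, values are synthetic
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-02", periods=10, freq="B"),
    "ATM_IV": rng.normal(0.27, 0.01, 10),
    "dgs2": rng.normal(4.3, 0.1, 10),
})
cols = ["ATM_IV", "dgs2"]
split_date = pd.Timestamp("2023-01-11")   # hypothetical train/test boundary

# Statistics come from the training rows only
train_mask = demo["date"] < split_date
mu = demo.loc[train_mask, cols].mean()
sigma = demo.loc[train_mask, cols].std(ddof=0)   # population std, as StandardScaler uses

# Apply the training statistics to every row, train and test alike
scaled = demo.copy()
scaled[cols] = (demo[cols] - mu) / sigma
```

Keeping `mu` and `sigma` around also makes it possible to convert scaled ATM_IV predictions back to volatility units.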

In [None]:
Regime_K is a categorical state label with no natural ordering, so one‑hot encode it before feeding it to the model. Passed as a raw integer, the network would treat it as an ordered magnitude.
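
A minimal sketch of that encoding with `pd.get_dummies`, on a made-up frame. The regime values and the `Regime` column prefix are assumptions, and the new columns would still have to be added to `feature_cols`.

```python
import pandas as pd

# Made-up regime labels; the real values come from the parquet file
demo = pd.DataFrame({"Regime_K": [1, 1, 0, 2, 1]})

# One indicator column per regime, as floats so they can sit next to scaled features
dummies = pd.get_dummies(demo["Regime_K"], prefix="Regime", dtype=float)

# Swap the raw integer column for the indicator columns
encoded = pd.concat([demo.drop(columns="Regime_K"), dummies], axis=1)
```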

In [22]:
W = 20   # input window length in trading days; increase with more data
H = 3    # forecast horizon in trading days (3 for now; the eventual target is 30)

Xs, ys, meta = [], [], []

for ticker, g in regimes_df.groupby('ticker'):
    g = g.sort_values('date').reset_index(drop=True)
    
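    # Features (feature_cols must be defined in an earlier cell; the output below shows F = 5)
    X_mat = g[feature_cols].values
    # Targets: next H trading days of ATM_IV (already z-scored by the scaling cell above)
    y_mat = g['ATM_IV'].values
    dates = g['date'].values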
    
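    # i indexes the last observation inside the input window
    for i in range(W-1, len(g)-H):
        x_win = X_mat[i-W+1:i+1]           # shape (W, F)
        y_target = y_mat[i+1:i+H+1]        # shape (H,)

        # Skip samples with missing values in either the window or the target
        if np.isnan(y_target).any() or np.isnan(x_win).any():
            continue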
        
        Xs.append(x_win)
        ys.append(y_target)
        meta.append({'ticker': ticker, 'end_date': dates[i]})

X = np.array(Xs)   # shape: (N, W, F)
y = np.array(ys)   # shape: (N, H) where H=3
meta = pd.DataFrame(meta)

print("X shape:", X.shape)  # (samples, timesteps, features)
print("y shape:", y.shape)  # (samples, H)

X shape: (80, 20, 5)
y shape: (80, 3)


In [23]:
meta.head()

Unnamed: 0,ticker,end_date
0,QQQ,2023-01-31
1,QQQ,2023-02-01
2,QQQ,2023-02-02
3,QQQ,2023-02-03
4,QQQ,2023-02-06
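
The rolling-origin backtest and per-horizon evaluation named in the plan could look roughly like the sketch below. It assumes samples are in chronological order for a single ticker, one sample per trading day, and that feature column 0 holds ATM_IV. The helper `rolling_origin_splits` and its parameters are hypothetical, and the arrays are random stand-ins with the shapes printed earlier.

```python
import numpy as np

def rolling_origin_splits(n_samples, initial_train, test_size, gap=0):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Samples must already be in chronological order.
    """
    start = initial_train
    while start + gap + test_size <= n_samples:
        train_idx = np.arange(0, start)
        test_idx = np.arange(start + gap, start + gap + test_size)
        yield train_idx, test_idx
        start += test_size

# Random stand-ins with the notebook's shapes: X (80, 20, 5), y (80, 3)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 20, 5))
y_demo = rng.normal(size=(80, 3))

fold_mae = []
for train_idx, test_idx in rolling_origin_splits(
        len(X_demo), initial_train=40, test_size=10, gap=3):
    # A real model would be fit on X_demo[train_idx] here.
    # Naive baseline: repeat the last ATM_IV in each window (assumed to be column 0)
    last_iv = X_demo[test_idx, -1, 0][:, None]
    pred = np.repeat(last_iv, y_demo.shape[1], axis=1)
    # Mean absolute error for each horizon step in this fold
    fold_mae.append(np.abs(pred - y_demo[test_idx]).mean(axis=0))

mae_per_horizon = np.mean(fold_mae, axis=0)   # shape (H,)
```

The `gap` drops samples between train and test so that training targets do not reach past the first test origin; under the assumptions above, a gap of at least H - 1 is enough. Grouping the per-sample errors by regime label instead of by fold would give the by-regime view.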
