# 8) Technical Indicators ML — Weekly Return Prediction (ENet, RF, XGBoost)

This notebook uses **simulated daily price & volume** data to engineer **10 technical/liquidity features** and predict **next-week (5-day) returns**.

**Models**: Elastic Net, Random Forest, **XGBoost**, and a **stacked ensemble**.  
**CV**: `TimeSeriesSplit` + **RandomizedSearchCV** for hyperparameter tuning.

> Notes
- No external data; completely simulated.
- Keep it fast & deterministic (set random seeds).
- Treat CV folds as expanding windows to avoid lookahead.
- If `xgboost` isn't installed in your environment, install it or run the GradientBoosting fallback in the solution notebook (we include a safe import).


## Setup & Data (simulated)

In [1]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from scipy.stats import spearmanr

# Try to import XGBoost; provide a fallback variable if missing
try:
    from xgboost import XGBRegressor
    HAS_XGB = True
except Exception as e:
    HAS_XGB = False
    XGBRegressor = None
    print("WARNING: xgboost not available in this environment. The XGBoost parts will need xgboost installed.\n", e)

np.random.seed(123)

# --- Simulate panel of daily prices & volumes for N stocks ---
n_stocks = 300
days = pd.date_range("2016-01-01", "2020-12-31", freq="B")
tickers = [f"U{i:04d}" for i in range(n_stocks)]

# latent drift + volatility per name
drift = np.random.normal(0.0002, 0.0004, n_stocks)
vol = np.random.uniform(0.01, 0.03, n_stocks)
level0 = np.random.uniform(10, 100, n_stocks)
vol_level = np.random.uniform(1e5, 2e6, n_stocks)

prices = pd.DataFrame(index=days, columns=tickers, dtype=float)
volumes = pd.DataFrame(index=days, columns=tickers, dtype=float)

for j, t in enumerate(tickers):
    eps = np.random.normal(drift[j], vol[j], len(days))
    p = np.empty(len(days))
    p[0] = level0[j]
    for i in range(1, len(days)):
        p[i] = max(0.5, p[i-1] * (1 + eps[i]))
    prices[t] = p
    # volume with autocorrelation + relation to volatility
    v = np.abs(np.random.normal(vol_level[j], vol_level[j]*0.2, len(days)))
    v = pd.Series(v, index=days).ewm(alpha=0.2).mean().values
    volumes[t] = v

# --- Build features & target (weekly fwd return) ---
px = prices.copy()
ret1 = px.pct_change()
ret5 = px.pct_change(5)
fwd5 = px.shift(-5).pct_change(5)  # target: next-week return
dollar_vol = (px * volumes)

def roll_z(df, w):
    return (df - df.rolling(w).mean()) / df.rolling(w).std(ddof=0)

features = {}
# 1: 5d momentum (past week return)
features["mom_5"] = ret5
# 2: 20d momentum
features["mom_20"] = px.pct_change(20)
# 3: volatility (20d std of daily returns)
features["vol_20"] = ret1.rolling(20).std()
# 4: RSI-like (ratio of up vs down over 14d)
up = ret1.clip(lower=0).rolling(14).mean()
dn = (-ret1.clip(upper=0)).rolling(14).mean()
features["rsi_14"] = up / (up + dn)
# 5: price above/below 20d SMA
features["px_sma20_gap"] = (px / px.rolling(20).mean()) - 1
# 6: price above/below 50d SMA
features["px_sma50_gap"] = (px / px.rolling(50).mean()) - 1
# 7: volume z-score (20d)
features["vol_z20"] = roll_z(volumes, 20)
# 8: dollar volume z-score (20d)
features["dvol_z20"] = roll_z(dollar_vol, 20)
# 9: intrawindow min-max oscillator (20d)
features["osc_20"] = (px - px.rolling(20).min()) / (px.rolling(20).max() - px.rolling(20).min())
# 10: rolling beta to market proxy (equal-weight index) over 60d
mkt = px.mean(axis=1).pct_change()
def rolling_beta_to_mkt(r, m, w=60):
    cov = r.rolling(w).cov(m)
    var = m.rolling(w).var()
    return cov / var
features["beta_mkt_60"] = ret1.apply(lambda s: rolling_beta_to_mkt(s, mkt), axis=0)

# Combine feature panel
feat_panel = pd.concat(features, axis=1)
target = fwd5

# Collapse to weekly (Fri) observations by taking last available day of week
feat_weekly = feat_panel.resample("W-FRI").last()
tgt_weekly = target.resample("W-FRI").last()

# Long-to-wide to long format
X = feat_weekly.stack().dropna()
y = tgt_weekly.stack().reindex(X.index)

# Drop any remaining NA pairs
df = pd.concat([X, y.rename("y")], axis=1).dropna()
df.head()


 No module named 'xgboost'


  fwd5 = px.shift(-5).pct_change(5)  # target: next-week return
  X = feat_weekly.stack().dropna()


Unnamed: 0,Unnamed: 1,mom_5,mom_20,vol_20,rsi_14,px_sma20_gap,px_sma50_gap,vol_z20,dvol_z20,osc_20,beta_mkt_60,y
2016-03-25,U0000,-0.010319,0.041101,0.011672,0.586379,0.014533,0.008185,0.768401,1.011646,0.726798,2.017661,-0.048802
2016-03-25,U0001,0.05106,0.361122,0.025821,0.794991,0.155985,0.278758,-0.594963,1.337492,1.0,1.164209,-0.023535
2016-03-25,U0002,-0.049858,-0.041284,0.025415,0.476618,-0.043886,-0.039853,0.365881,-0.457461,0.0,1.456139,0.076441
2016-03-25,U0003,-0.045784,0.081439,0.015745,0.61208,0.003044,0.09674,-1.059869,-1.01101,0.512123,-0.463705,-0.015999
2016-03-25,U0004,-0.039011,0.124597,0.024937,0.654415,0.040415,0.063415,-1.190204,-0.095073,0.731161,6.518904,0.098654


In [4]:
df.index = df.index.rename(['date', 'ticker'])

In [13]:
df.describe().round(2)

Unnamed: 0,mom_5,mom_20,vol_20,rsi_14,px_sma20_gap,px_sma50_gap,vol_z20,dvol_z20,osc_20,beta_mkt_60,y
count,75000.0,75000.0,75000.0,75000.0,75000.0,75000.0,75000.0,75000.0,75000.0,75000.0,75000.0
mean,0.0,0.0,0.02,0.51,0.0,0.0,0.0,0.01,0.5,0.63,0.0
std,0.05,0.1,0.01,0.17,0.05,0.09,1.15,1.21,0.36,1.95,0.05
min,-0.22,-0.4,0.0,0.0,-0.25,-0.38,-3.55,-3.66,0.0,-8.69,-0.22
25%,-0.03,-0.06,0.01,0.39,-0.03,-0.05,-0.85,-0.93,0.14,-0.57,-0.03
50%,0.0,0.0,0.02,0.51,0.0,-0.0,0.0,-0.02,0.49,0.52,0.0
75%,0.03,0.06,0.02,0.62,0.03,0.05,0.86,0.93,0.85,1.7,0.03
max,0.25,0.7,0.05,1.0,0.28,0.53,3.7,3.68,1.0,11.79,0.25


## TODOs & Scaffold

## TODOs
1. **Train/Validation Split (time-based):** Use `TimeSeriesSplit(n_splits=5)` for CV.
2. **Pipelines:**  
   - `ElasticNet` with `StandardScaler`  
   - `RandomForestRegressor`  
   - **`XGBRegressor`** (from `xgboost`)
3. **RandomizedSearchCV:** 30 iterations each with sensible hyperparameter ranges; scoring by **neg_mean_squared_error**.
4. **Stacked Ensemble:** Use the three tuned models as base estimators with a `Ridge` meta-learner.
5. **Evaluation:** Report **MSE**, **R^2**, and **Spearman IC** on an **out-of-sample test** set (final 20% of dates).
6. **Feature Importance:**  
   - RF/XGB: native importances or permutation  
   - ElasticNet: coefficients (after standardization)
7. **Stability Check:** Refit your **best single model** on rolling windows and plot IC drift.


In [21]:
df_trans = df.copy()
df_zs = df.groupby('date')[df.columns.tolist()].transform(lambda x : np.clip((x - x.mean())/x.std(ddof=0), -4, 4))

In [23]:
df_trans[df.columns.tolist()] = df_zs

In [30]:
# === TODO 1: Build TimeSeriesSplit and train/validation scheme ===
# tscv = TimeSeriesSplit(n_splits=5)
tscv = TimeSeriesSplit(n_splits=5)

# Build train/test cut by time for final evaluation (last 20% dates)
mi = df_trans.index
dates = mi.get_level_values(0).unique().sort_values()
cut = dates[int(len(dates)*0.8)]
train_idx = mi.get_level_values(0) <= cut
test_idx  = mi.get_level_values(0) >  cut

X_all = df_trans.drop(columns=["y"])
y_all = df_trans["y"]
X_train, y_train = X_all[train_idx], y_all[train_idx]
X_test,  y_test  = X_all[test_idx],  y_all[test_idx]


In [49]:
p = len(df_trans.columns) - 1
p

10

In [107]:
# === TODO 2–3: Build pipelines & RandomizedSearchCV grids ===

en = ElasticNet()
param_en = {'l1_ratio' : np.linspace(0.0, 1.0, 30), 'alpha': np.logspace(-4, 0, 50)}
rs_en = RandomizedSearchCV(en, param_en, cv= tscv, n_iter=30, scoring='neg_mean_squared_error', random_state=123, n_jobs=-1, verbose=0)

rf = RandomForestRegressor(n_jobs=-1) # parallel over cpus
param_rf = {'n_estimators': np.arange(100, 501, 50), 'max_depth': [None] + list(np.arange(3, 13, 2)), 'max_features': ["auto", "sqrt", 0.5]}
rs_rf = RandomizedSearchCV(rf, param_rf, cv= tscv, n_iter=30, scoring='neg_mean_squared_error', random_state=123, n_jobs=1, verbose=0)

xg = XGBRegressor(n_jobs=-1) # parallel over cpus
param_xg = {'n_estimators': np.arange(200, 801, 100), 'max_depth': np.arange(2, 9), 'learning_rate': np.linspace(0.02, 0.3, 15), 'gamma': np.logspace(-4, 2, 10), 'reg_lambda': np.linspace(0.5, 2.0, 7)}
rs_xg = RandomizedSearchCV(xg, param_xg, cv= tscv, n_iter=30, scoring='neg_mean_squared_error', random_state=123, n_jobs=1, verbose=0)

In [None]:
# === TODO 4: Fit searches and capture best estimators ===

In [108]:
rs_en.fit(X_train, y_train)
print("Fitted ElasticNet")

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


Fitted ElasticNet


  model = cd_fast.enet_coordinate_descent(


In [109]:
rs_rf.fit(X_train, y_train)
print("Fitted RandomForest")

55 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
55 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/mikesong/miniconda3/envs/test/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/mikesong/miniconda3/envs/test/lib/python3.10/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/Users/mikesong/miniconda3/envs/test/lib/python3.10/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/Users/mikesong/miniconda3/envs/test/lib/python3.10/site-packages/sklearn/utils

Fitted RandomForest


In [110]:
rs_xg.fit(X_train, y_train)
print("Fitted XGBoost")

Fitted XGBoost


In [111]:
best_en = rs_en.best_estimator_
best_rf = rs_rf.best_estimator_
best_xg = rs_xg.best_estimator_

In [112]:
# === TODO 4: Stacked ensemble ===
from sklearn.ensemble import StackingRegressor
stack = StackingRegressor(
    estimators=[("en", best_en), ("rf", best_rf), ("xgb", best_xg)],
    final_estimator=Ridge(alpha=1.0, random_state=123),
    passthrough=False
)
stack.fit(X_train, y_train)


In [113]:
stack.final_estimator_.coef_

array([-0.01530617,  0.61584759, -0.01530586])

In [114]:
# === TODO 5: Evaluation (MSE, R^2, Spearman) on test set ===
# def eval_model(name, mdl):
#     ...
def eval_model(name, mdl):
    
    pred = mdl.predict(X_test.values)
    ground_truth = y_test.values
    na = np.isnan(ground_truth)

    pred = pred[~na]
    ground_truth = ground_truth[~na]

    mse = mean_squared_error(ground_truth, pred)
    r2 = r2_score(ground_truth, pred)
    ic = spearmanr(ground_truth, pred)[0]

    print(f"{name:12s} | MSE: {mse:.6f}  R2: {r2:.4f}  Spearman IC: {ic:.4f}")

for name, mdl in [("ElasticNet", best_en), ("RandomForest", best_rf), ("XGBoost", best_xg), ("Stack", stack)]:
    eval_model(name, mdl)


ElasticNet   | MSE: 0.996148  R2: -0.0000  Spearman IC: nan
RandomForest | MSE: 0.996546  R2: -0.0004  Spearman IC: -0.0056
XGBoost      | MSE: 0.996148  R2: -0.0000  Spearman IC: nan
Stack        | MSE: 0.996176  R2: -0.0000  Spearman IC: 0.0025


  ic = spearmanr(ground_truth, pred)[0]
  ic = spearmanr(ground_truth, pred)[0]


In [116]:

# === TODO 6: Feature importance / coefficients ===
# Access mdl[-1] inside pipeline if needed, or use permutation importance.

# Feature importance / coefficients
cols = X_train.columns

def elasticnet_coefs(en, cols):
    return pd.Series(en.coef_, index=cols).sort_values(key=lambda s: s.abs(), ascending=False)

en_imp = elasticnet_coefs(best_en, cols)
print("\nTop ElasticNet coefficients:")
print(en_imp.head(10))

# Tree importances

rf_imp = pd.Series(best_rf.feature_importances_, index=cols).sort_values(ascending=False)
print("\nTop RF importances:"); print(rf_imp.head(10))

xgb_model = best_xg
xgb_imp = pd.Series(xgb_model.feature_importances_, index=cols).sort_values(ascending=False)
print("\nTop XGB importances:"); print(xgb_imp.head(10))



Top ElasticNet coefficients:
mom_5           0.0
mom_20          0.0
vol_20         -0.0
rsi_14          0.0
px_sma20_gap    0.0
px_sma50_gap    0.0
vol_z20         0.0
dvol_z20        0.0
osc_20          0.0
beta_mkt_60    -0.0
dtype: float64

Top RF importances:
px_sma50_gap    0.161139
beta_mkt_60     0.158129
vol_20          0.124547
vol_z20         0.124286
mom_5           0.098222
px_sma20_gap    0.083034
dvol_z20        0.082934
mom_20          0.066862
rsi_14          0.064560
osc_20          0.036287
dtype: float64

Top XGB importances:
mom_5           0.0
mom_20          0.0
vol_20          0.0
rsi_14          0.0
px_sma20_gap    0.0
px_sma50_gap    0.0
vol_z20         0.0
dvol_z20        0.0
osc_20          0.0
beta_mkt_60     0.0
dtype: float32


In [None]:

# === TODO 7: Stability — rolling-window refit for top model ===
# Iterate over time blocks, refit, record test metrics, and plot over time.
