# 02 - Modeling + Meta-Labeling

Implements Part 2 and Phase II of the Plan:
- Model A: high-recall logic sieve (baseline)
- Model B: LightGBM meta-model to filter false positives
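
To make the division of labor between the two models concrete, here is a minimal sketch of how Model B's probabilities could be used to filter Model A's candidates. The helper `precision_before_after`, the toy data, and the 0.5 threshold are illustrative assumptions and are not part of the `at` package; `meta_y` and `meta_prob` follow the column names used in this notebook.

```python
import pandas as pd


def precision_before_after(cand, threshold=0.5, prob_col='meta_prob', label_col='meta_y'):
    """Compare the hit rate of all Model A candidates with those Model B keeps.

    Hypothetical helper: `threshold` is an assumed cutoff, not a tuned value.
    """
    kept = cand[cand[prob_col] >= threshold]
    return {
        'n_all': len(cand),
        'precision_all': float(cand[label_col].mean()),
        'n_kept': len(kept),
        # Avoid a misleading number when the filter rejects every candidate.
        'precision_kept': float(kept[label_col].mean()) if len(kept) else float('nan'),
    }


# Toy candidates: every row is a Model A signal, meta_y marks the true positives.
toy = pd.DataFrame({
    'meta_y':    [1,   0,   1,   0,   1,   0],
    'meta_prob': [0.9, 0.2, 0.7, 0.6, 0.8, 0.1],
})
stats = precision_before_after(toy, threshold=0.5)
print(stats)
```

On real data the comparison is only meaningful when `meta_prob` is out-of-sample, which is why the evaluation has to be time-aware.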

Key requirement: evaluation must be time-aware. The simple chronological split used further down is a placeholder and should be replaced with walk-forward (rolling-window) validation.

In [None]:
import numpy as np
import pandas as pd

from at.models.signals import logic_sieve_signals
from at.models.meta_label import fit_meta_label_model, predict_meta_probs
from at.utils.paths import get_paths

In [None]:
paths = get_paths()
df = pd.read_parquet(paths.data_processed / 'features.parquet')
df = df.sort_values(['date','ticker']).reset_index(drop=True)
df.head()

In [None]:
df['signal_a'] = logic_sieve_signals(df)
df['meta_y'] = ((df['signal_a'] == 1) & (df['fwd_ret_1d'] > 0)).astype(int)
cand = df[df['signal_a'] == 1].copy()
cand[['signal_a','meta_y']].mean()

## Walk-forward template

Suggested: train on the trailing 12 months, test on the following month, then slide the window forward by one month.
Store the out-of-sample `meta_prob` for the entire backtest period.
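
The template above could be sketched roughly as follows. This is one possible shape, not the Plan's definitive implementation: `walk_forward_windows` and `walk_forward_probs` are hypothetical helpers, the fit and predict callables are assumed to have the same call shape as `fit_meta_label_model(X, y)` and `predict_meta_probs(model, X)` as they are used in this notebook, and the demo uses synthetic data with a trivial base-rate stand-in instead of LightGBM.

```python
import numpy as np
import pandas as pd


def walk_forward_windows(dates, train_months=12, test_months=1):
    """Yield (train_start, test_start, test_end) for a rolling window.

    Train covers [train_start, test_start); test covers [test_start, test_end).
    """
    dates = pd.to_datetime(pd.Series(dates))
    first, last = dates.min(), dates.max()
    test_start = first + pd.DateOffset(months=train_months)
    while test_start <= last:
        train_start = test_start - pd.DateOffset(months=train_months)
        test_end = test_start + pd.DateOffset(months=test_months)
        yield train_start, test_start, test_end
        test_start = test_end  # slide forward by one test window


def walk_forward_probs(cand, feature_cols, fit_fn, predict_fn,
                       label_col='meta_y', **window_kwargs):
    """Refit on each trailing window and collect out-of-sample probabilities."""
    parts = []
    for train_start, test_start, test_end in walk_forward_windows(cand['date'], **window_kwargs):
        train = cand[(cand['date'] >= train_start) & (cand['date'] < test_start)]
        test = cand[(cand['date'] >= test_start) & (cand['date'] < test_end)].copy()
        if train.empty or test.empty:
            continue  # skip windows with no candidates on either side
        model = fit_fn(train[feature_cols], train[label_col])
        test['meta_prob'] = predict_fn(model, test[feature_cols])
        parts.append(test[['date', 'ticker', 'meta_prob']])
    if not parts:
        return pd.DataFrame(columns=['date', 'ticker', 'meta_prob'])
    return pd.concat(parts, ignore_index=True)


# Synthetic demo: two tickers, two years of business days, random labels.
rng = np.random.default_rng(0)
days = pd.bdate_range('2020-01-01', '2021-12-31')
demo = pd.DataFrame({
    'date': days.repeat(2),
    'ticker': np.tile(['AAA', 'BBB'], len(days)),
})
demo['x'] = rng.normal(size=len(demo))
demo['meta_y'] = (rng.random(len(demo)) < 0.5).astype(int)

# Stand-in "model": predicts the training base rate for every test row.
base_rate_fit = lambda X, y: float(y.mean())
base_rate_predict = lambda model, X: np.full(len(X), model)

oos_demo = walk_forward_probs(demo, ['x'], base_rate_fit, base_rate_predict)
print(oos_demo.head())
```

In this notebook the equivalent call would presumably be `walk_forward_probs(cand, feature_cols, fit_meta_label_model, predict_meta_probs)`. Note that the first 12 months of data never receive a `meta_prob`, since they serve only as the initial training window.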

In [None]:
feature_cols = [
    'vol_20d','atr_14','vol_spike_20','close_to_vwap_20','rsi_14','macd_hist_12_26_9','vol_x_mom'
]
feature_cols = [c for c in feature_cols if c in cand.columns]
cand = cand.dropna(subset=feature_cols + ['meta_y','date'])
cand = cand.sort_values(['date','ticker'])  # deterministic order within each date
feature_cols

In [None]:
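# TODO: implement rolling window here.
# For now, do a simple time split as a placeholder.
# Cut on a date boundary so that no single day lands in both train and test.
cut_date = cand['date'].iloc[int(len(cand) * 0.7)]
train = cand[cand['date'] < cut_date]
test = cand[cand['date'] >= cut_date].copy()  # copy: a column is added below

model = fit_meta_label_model(train[feature_cols], train['meta_y'])
test['meta_prob'] = predict_meta_probs(model, test[feature_cols])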

oos = test[['date','ticker','meta_prob']].copy()
oos.head()

In [None]:
out_path = paths.data_processed / 'meta_probs.parquet'
oos.to_parquet(out_path, index=False)
out_path