# 02 — Feature Analysis (Derived Features + t1 Deep Dive)

Goals:
- Analyze the contribution of the 10 current derived features (`src/data/preprocessing.py`)
- Analyze `t1` predictability in depth, including cross-feature interactions
- Evaluate candidate new derived signals (rolling stats, rate of change (ROC), volatility)

Artifacts are saved in `notebooks/artifacts/02_feature_analysis/`.

In [None]:
from pathlib import Path
import json
import pandas as pd
import matplotlib.pyplot as plt

ROOT = Path('..') if Path.cwd().name == 'notebooks' else Path('.')
ART = ROOT / 'notebooks' / 'artifacts' / '02_feature_analysis'
ART.mkdir(parents=True, exist_ok=True)

## 1) Load datasets (required context check)

In [None]:
train = pd.read_parquet(ROOT / 'datasets' / 'train.parquet')
valid = pd.read_parquet(ROOT / 'datasets' / 'valid.parquet')

print('Train shape:', train.shape)
print('Valid shape:', valid.shape)
print('Train sequences:', train['seq_ix'].nunique())
print('Valid sequences:', valid['seq_ix'].nunique())
print('Scored rows train:', int(train['need_prediction'].sum()))
print('Scored rows valid:', int(valid['need_prediction'].sum()))

## 2) Run full analysis pipeline

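The script `notebooks/run_02_feature_analysis.py` computes the derived-feature rankings, the t1 analysis, the interaction scans, the lag analysis, and the candidate feature tests. The sections below only load its saved artifacts, so rerun it only when those artifacts need to be regenerated.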

In [None]:
# Uncomment to recompute from scratch
# import subprocess, sys
# subprocess.run([sys.executable, str(ROOT / 'notebooks' / 'run_02_feature_analysis.py')], check=True)

## 3) Derived feature contribution results

In [None]:
derived_rank = pd.read_csv(ART / 'derived_feature_contribution_rank.csv')
perm = pd.read_csv(ART / 'derived_feature_permutation_importance.csv')
# Read via Path.read_text so the file handle is closed deterministically
proxy = json.loads((ART / 'feature_set_proxy_scores.json').read_text(encoding='utf-8'))

display(derived_rank[['feature', 'valid_pearson_t0', 'valid_pearson_t1', 'valid_weightedcorr_t0', 'valid_weightedcorr_t1']])
display(perm[['feature', 'delta_avg', 'delta_t0', 'delta_t1']])
proxy

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].imshow(plt.imread(ART / 'derived_feature_corr_bars.png'))
axes[0].axis('off')
axes[0].set_title('Derived corr bars')
axes[1].imshow(plt.imread(ART / 'derived_feature_permutation_importance.png'))
axes[1].axis('off')
axes[1].set_title('Permutation importance')
plt.tight_layout()
plt.show()
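
For readers unfamiliar with how `delta_*` style columns are typically produced, here is a minimal sketch of permutation importance: shuffle one feature column, re-score the predictions, and record the drop relative to the unshuffled score. This is not necessarily what `run_02_feature_analysis.py` does. The helper names (`pearson`, `permutation_deltas`), the use of plain Pearson correlation as the score, and the synthetic linear model are all assumptions made for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def permutation_deltas(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in Pearson score when each column of X is shuffled.

    `predict` is any callable mapping a feature matrix to predictions
    (hypothetical stand-in for a fitted model).
    """
    rng = np.random.default_rng(seed)
    base = pearson(predict(X), y)
    deltas = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature/target link
            drops.append(base - pearson(predict(Xp), y))
        deltas[j] = np.mean(drops)
    return base, deltas

# Synthetic demo: only column 0 carries signal
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=2000)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
base, deltas = permutation_deltas(lambda A: A @ coef, X, y)
```

In the demo, shuffling column 0 removes almost all of the score, while shuffling the noise columns leaves it essentially unchanged. A large positive delta therefore marks a feature the model relies on.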

## 4) t1 predictability deep dive

In [None]:
t1_corr = pd.read_csv(ART / 't1_feature_correlations_42.csv')
t1_int = pd.read_csv(ART / 't1_interaction_scan_top10.csv')
t1_lag = pd.read_csv(ART / 't1_lag_feature_correlations.csv')
t1_mi = pd.read_csv(ART / 't1_mutual_information.csv')

print('Top t1 features by |valid pearson|:')
display(t1_corr.head(12))
print('Top interaction terms for t1:')
display(t1_int.head(12))
print('Top MI features for t1:')
display(t1_mi.head(12))
print('Lag correlations (feature[t-lag] vs t1[t]):')
display(t1_lag.head(25))

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].imshow(plt.imread(ART / 't1_top_feature_corr.png'))
axes[0].axis('off')
axes[0].set_title('Top t1 feature correlations')
axes[1].imshow(plt.imread(ART / 't1_top_interactions.png'))
axes[1].axis('off')
axes[1].set_title('Top t1 interactions')
plt.tight_layout()
plt.show()
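
The lag table above pairs `feature[t-lag]` with `t1[t]`. A minimal sketch of how such a correlation can be computed without leaking values across sequence boundaries is shown below. The helper `lag_correlation`, the lag set, and the assumption that the target column is literally named `t1` are illustrative only and may differ from the analysis script.

```python
import numpy as np
import pandas as pd

def lag_correlation(df, feature, target="t1", lags=(1, 2, 5), seq_col="seq_ix"):
    """Pearson correlation of feature[t-lag] with target[t].

    The shift is applied per sequence, so the first `lag` rows of each
    sequence become NaN and are dropped by Series.corr.
    """
    out = {}
    for lag in lags:
        lagged = df.groupby(seq_col)[feature].shift(lag)
        out[lag] = lagged.corr(df[target])
    return pd.Series(out, name=feature)

# Synthetic demo: the target is the feature delayed by exactly 2 steps
rng = np.random.default_rng(0)
frames = []
for seq in range(20):
    x = rng.normal(size=200)
    t1 = np.roll(x, 2)
    t1[:2] = 0.0  # no history available for the first two steps
    frames.append(pd.DataFrame({"seq_ix": seq, "feat": x, "t1": t1}))
demo = pd.concat(frames, ignore_index=True)
lag_corr = lag_correlation(demo, "feat")
```

Grouping by `seq_ix` before shifting matters: a plain `df[feature].shift(lag)` would pair the last rows of one sequence with the first rows of the next.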

## 5) Candidate new derived features

In [None]:
cand = pd.read_csv(ART / 'candidate_new_features_t1_corr.csv')

display(cand)

img = plt.imread(ART / 'candidate_new_features_t1_corr.png')
plt.figure(figsize=(10, 5))
plt.imshow(img)
plt.axis('off')
plt.title('Candidate new derived features vs t1')
plt.show()
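
To make the candidate names concrete, the sketch below builds rate-of-change, rolling-mean, and rolling-volatility channels per sequence. The definitions are assumptions: ROC is taken here as a simple k-step difference of `spread_0`, and the window lengths (5 and 10) and the output names `trade_intensity_rm5` and `spread0_vol10` are hypothetical. The actual definitions live in the analysis script and may differ.

```python
import pandas as pd

def add_candidate_features(df, seq_col="seq_ix"):
    """Append per-sequence temporal-derivative and rolling channels."""
    out = df.copy()
    spread = out.groupby(seq_col)["spread_0"]
    intensity = out.groupby(seq_col)["trade_intensity"]

    # Rate of change as a k-step difference (assumed definition)
    out["spread0_roc1"] = spread.diff(1)
    out["spread0_roc5"] = spread.diff(5)

    # Short rolling mean of trade intensity (window of 5 is an assumption)
    out["trade_intensity_rm5"] = intensity.transform(
        lambda s: s.rolling(5, min_periods=1).mean()
    )

    # Rolling standard deviation as a volatility-style channel
    out["spread0_vol10"] = spread.transform(
        lambda s: s.rolling(10, min_periods=2).std()
    )
    return out

# Tiny demo with two sequences
demo = pd.DataFrame({
    "seq_ix": [0] * 6 + [1] * 6,
    "spread_0": [1.0, 2.0, 4.0, 7.0, 11.0, 16.0] + [5.0] * 6,
    "trade_intensity": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0] + [1.0] * 6,
})
feats = add_candidate_features(demo)
```

All operations are grouped by `seq_ix`, so the first row of each sequence has a NaN difference rather than a value computed against the previous sequence.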

## 6) Key conclusions

- The current derived set helps mainly through spreads and trade intensity (`spread_2`, `spread_0`, `trade_intensity`).
- `t1` carries little linear signal: the best standalone correlations are low (roughly 0.02-0.04), although interaction terms add incremental signal.
- The strongest new candidate for `t1` in this pass is `spread0_roc1`, followed by `spread0_roc5` and then a short rolling mean of trade intensity.
- The next feature-engineering round should prioritize temporal derivatives and rolling-volatility-style channels.
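
The point about interactions can be illustrated with a minimal pairwise scan: multiply every pair of features and rank the products by absolute Pearson correlation with the target. This is only a sketch of the idea. The helper `interaction_scan`, the product form of the interaction, and the column name `t1` are assumptions, and the scan in the analysis script may use a different construction.

```python
import itertools
import numpy as np
import pandas as pd

def interaction_scan(df, features, target="t1", top_k=10):
    """Rank pairwise product terms by |Pearson correlation| with the target."""
    y = df[target]
    rows = []
    for a, b in itertools.combinations(features, 2):
        rows.append({"a": a, "b": b, "pearson": (df[a] * df[b]).corr(y)})
    res = pd.DataFrame(rows)
    order = res["pearson"].abs().sort_values(ascending=False).index
    return res.loc[order].head(top_k).reset_index(drop=True)

# Synthetic demo: the target depends only on the product x0 * x1
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(5000, 3)), columns=["x0", "x1", "x2"])
demo["t1"] = demo["x0"] * demo["x1"] + 0.1 * rng.normal(size=5000)

standalone = demo[["x0", "x1", "x2"]].corrwith(demo["t1"])
top = interaction_scan(demo, ["x0", "x1", "x2"])
```

In this synthetic case every standalone correlation is near zero while the `x0 * x1` product correlates strongly with the target. That is the pattern described above, where interactions add signal that single features do not show.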