# Feature Engineering for Tree Models

**Objective**: Create stationary, informative features optimized for LightGBM/XGBoost.

**Strategy**:
1. Ensure stationarity (ratios, diffs, ranks, percentages)
2. Layer 1: Momentum features (lags, rolling stats)
3. Layer 3: Volume ratio features
4. Rolling time-series ranking (historical percentiles)
5. MI-based interaction features
6. Feature selection with LightGBM importance (deferred to a separate notebook)

**Datasets**:
- Stable (2007-2025): Long history, stable features only
- Recent (2018-2025): Recent history, all features

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

print('Libraries loaded successfully')

Libraries loaded successfully


## Section 0: Load Cleaned Data

In [2]:
# Load the two cleaned datasets
df_stable = pd.read_csv('../data/hull-tactical-market-prediction/train_no_missing_2007_2025_cleaned.csv')
df_recent = pd.read_csv('../data/hull-tactical-market-prediction/train_2018_2025_cleaned.csv')

print('='*80)
print('DATA LOADED')
print('='*80)
print(f'\nStable (2007-2025): {df_stable.shape}')
print(f'Recent (2018-2025): {df_recent.shape}')
print(f'\nMissing values:')
print(f'  Stable: {df_stable.isnull().sum().sum()}')
print(f'  Recent: {df_recent.isnull().sum().sum()}')
print('='*80)

DATA LOADED

Stable (2007-2025): (4625, 90)
Recent (2018-2025): (1872, 98)

Missing values:
  Stable: 0
  Recent: 0


## Section 1: Stationarity Check

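Verify that existing features are stationary using the Augmented Dickey-Fuller (ADF) test, where a p-value below 0.05 rejects the null hypothesis of a unit root. Most feature groups (D, E, I, M, P, S, V) are likely already indicators or ratios.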

In [3]:
from statsmodels.tsa.stattools import adfuller

def check_stationarity(series, name):
    """Perform Augmented Dickey-Fuller test for stationarity"""
    result = adfuller(series.dropna(), autolag='AIC')
    return {
        'Feature': name,
        'ADF_Statistic': result[0],
        'p_value': result[1],
        'Is_Stationary': result[1] < 0.05  # Reject null hypothesis of unit root
    }

# Test key features for stationarity
key_features = ['risk_free_rate', 'market_forward_excess_returns', 'forward_returns']
# Add some V and I features
V_cols = [col for col in df_stable.columns if col.startswith('V')]
I_cols = [col for col in df_stable.columns if col.startswith('I')]
key_features.extend(V_cols[:3])  # First 3 V features
key_features.extend(I_cols[:3])  # First 3 I features

stationarity_results = []
for feat in key_features:
    if feat in df_stable.columns:
        result = check_stationarity(df_stable[feat], feat)
        stationarity_results.append(result)

stationarity_df = pd.DataFrame(stationarity_results)
print('\nStationarity Test Results (ADF Test):')
print(stationarity_df.to_string(index=False))
print(f"\nStationary features: {stationarity_df['Is_Stationary'].sum()} / {len(stationarity_df)}")


Stationarity Test Results (ADF Test):
                      Feature  ADF_Statistic      p_value  Is_Stationary
               risk_free_rate      -1.489207 5.388434e-01          False
market_forward_excess_returns     -17.690184 3.575168e-30           True
              forward_returns     -17.729218 3.440865e-30           True
                           V1      -3.814168 2.765642e-03           True
                          V11      -2.490484 1.178069e-01          False
                          V12      -5.826093 4.076133e-07           True
                           I1      -4.203085 6.510212e-04           True
                           I2      -2.313791 1.675587e-01          False
                           I3      -3.284714 1.557572e-02           True

Stationary features: 6 / 9
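
Three of the nine series tested above (`risk_free_rate`, `V11`, `I2`) fail to reject the unit-root null at the 5% level. One common remedy, in line with the "diffs" item in the strategy, is first differencing. The sketch below uses a synthetic random walk rather than the competition data, and the helper name `lag1_autocorr` is hypothetical. It only illustrates how differencing removes the persistence that the ADF test flags.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic random walk: a stand-in for a persistent level series such as a rate.
level = pd.Series(rng.normal(size=2000).cumsum(), name='level')

# First difference: change from the previous row (the first value becomes NaN).
diff = level.diff()

def lag1_autocorr(series: pd.Series) -> float:
    """Lag-1 autocorrelation, ignoring NaN rows."""
    return float(series.dropna().autocorr(lag=1))

level_ac = lag1_autocorr(level)
diff_ac = lag1_autocorr(diff)
print(f'lag-1 autocorrelation, level: {level_ac:.3f}')
print(f'lag-1 autocorrelation, diff:  {diff_ac:.3f}')
```

On the real data, a differenced column such as `df_stable['risk_free_rate'].diff()` could be passed back through `check_stationarity` to confirm the effect.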


## Section 2: Layer 1 - Momentum Features

Create lag and rolling statistics for key features.

In [4]:
def create_momentum_features(df, feature_cols):
    """Create lag, rolling-statistic, and z-score features for each column in feature_cols"""
    df_new = df.copy()
    
    for col in feature_cols:
        if col not in df.columns:
            continue
            
        # Lag features (1, 5, 20 days)
        df_new[f'{col}_lag1'] = df[col].shift(1)
        df_new[f'{col}_lag5'] = df[col].shift(5)
        df_new[f'{col}_lag20'] = df[col].shift(20)
        
        # Rolling statistics (5, 20, 60 day windows)
        for window in [5, 20, 60]:
            df_new[f'{col}_mean_{window}d'] = df[col].rolling(window).mean()
            df_new[f'{col}_std_{window}d'] = df[col].rolling(window).std()
        
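        # Momentum indicators
        # z-score of the current value against its own 20-day window
        df_new[f'{col}_zscore_20d'] = (df[col] - df[col].rolling(20).mean()) / df[col].rolling(20).std()
        # Relative deviation from the 5-day mean (despite the name, not a 5-day percent change)
        df_new[f'{col}_pct_chg_5d'] = df[col] / df[col].rolling(5).mean() - 1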
    
    return df_new

# Apply to key features
momentum_features = ['market_forward_excess_returns', 'risk_free_rate']
# Add V features (volatility)
V_features = [col for col in df_stable.columns if col.startswith('V')]
momentum_features.extend(V_features[:5])  # First 5 V features in column order

print('Creating momentum features...')
df_stable_eng = create_momentum_features(df_stable, momentum_features)
df_recent_eng = create_momentum_features(df_recent, momentum_features)

print(f'\nStable dataset: {df_stable.shape} → {df_stable_eng.shape}')
print(f'Recent dataset: {df_recent.shape} → {df_recent_eng.shape}')
print(f'New features added: {df_stable_eng.shape[1] - df_stable.shape[1]}')

Creating momentum features...

Stable dataset: (4625, 90) → (4625, 167)
Recent dataset: (1872, 98) → (1872, 175)
New features added: 77
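
Because `rolling(window)` includes the current row, a rolling statistic of a forward-looking column such as `market_forward_excess_returns` contains that row's own value. If that column is also the prediction target, as its name suggests (an assumption; the notebook does not say so), shifting by one row before rolling keeps the feature strictly historical. A minimal sketch on toy data, with a hypothetical helper `lagged_rolling_mean`:

```python
import pandas as pd

def lagged_rolling_mean(series: pd.Series, window: int) -> pd.Series:
    """Rolling mean over the `window` rows strictly before the current row."""
    return series.shift(1).rolling(window).mean()

target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

naive = target.rolling(3).mean()       # row t uses rows t-2..t (includes t)
safe = lagged_rolling_mean(target, 3)  # row t uses rows t-3..t-1

print(pd.DataFrame({'target': target, 'naive': naive, 'safe': safe}))
```

The same `shift(1)` could be applied before the `std`, z-score, and rank calculations for any column whose current-row value is not known at prediction time.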


## Section 3: Layer 3 - Volume Ratio Features

Use the M-features as a proxy for volume and market breadth.

In [5]:
def create_volume_ratio_features(df):
    """Create volume ratio features using M-features"""
    df_new = df.copy()
    
    # Get M features
    M_cols = [col for col in df.columns if col.startswith('M') and col[1:].isdigit()]
    
    # Relative to recent average
    for col in M_cols:
        df_new[f'{col}_rel_20d'] = df[col] / df[col].rolling(20).mean()
        df_new[f'{col}_rel_60d'] = df[col] / df[col].rolling(60).mean()
    
    # Cross-M ratios (only if M features exist)
    if len(M_cols) >= 2:
        # Avoid division by zero
        if 'M1' in M_cols and 'M6' in M_cols:
            df_new['M1_M6_ratio'] = df['M1'] / (df['M6'] + 1e-8)
        if 'M13' in M_cols and 'M14' in M_cols:
            df_new['M13_M14_ratio'] = df['M13'] / (df['M14'] + 1e-8)
    
    return df_new

print('Creating volume ratio features...')
df_stable_eng = create_volume_ratio_features(df_stable_eng)
df_recent_eng = create_volume_ratio_features(df_recent_eng)

print(f'\nStable dataset: {df_stable_eng.shape}')
print(f'Recent dataset: {df_recent_eng.shape}')

Creating volume ratio features...

Stable dataset: (4625, 195)
Recent dataset: (1872, 213)
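
The `_rel_20d` and `_rel_60d` ratios divide by a rolling mean with no guard, so a zero mean yields `inf`. Adding `1e-8`, as the cross-M ratios do, also does not help when the denominator is negative and close to zero. One option, sketched here on toy data with a hypothetical helper `safe_ratio`, is to mask near-zero denominators so the result is NaN instead:

```python
import numpy as np
import pandas as pd

def safe_ratio(num: pd.Series, den: pd.Series, eps: float = 1e-8) -> pd.Series:
    """num / den, with NaN wherever |den| is below eps."""
    den_masked = den.where(den.abs() >= eps)  # near-zero denominators -> NaN
    return num / den_masked

num = pd.Series([1.0, 2.0, 3.0, 4.0])
den = pd.Series([2.0, 0.0, -1e-12, -4.0])

ratio = safe_ratio(num, den)
print(ratio.tolist())  # [0.5, nan, nan, -1.0]
```

The resulting NaNs can then go through whatever missing-value policy the pipeline uses for the other engineered features.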


## Section 4: Rolling Time-Series Ranking

**CRITICAL**: Calculate each value's percentile rank within a trailing window only, so the rank never uses data from later dates.

In [6]:
def create_rolling_rank_features(df, feature_cols, windows=(60, 200)):
    """Create rolling percentile rank features"""
    df_new = df.copy()
    
    for col in feature_cols:
        if col not in df.columns:
            continue
            
        for window in windows:
            # Calculate percentile rank within rolling window
            df_new[f'{col}_rank_{window}d'] = df[col].rolling(window).rank(pct=True)
    
    return df_new

# Apply to key features
rank_features = ['market_forward_excess_returns', 'risk_free_rate']
# Add all V features (volatility)
V_cols = [col for col in df_stable_eng.columns if col.startswith('V') and col[1:].isdigit()]
rank_features.extend(V_cols)
# Add key I features (rates)
I_cols = [col for col in df_stable_eng.columns if col.startswith('I') and col[1:].isdigit()]
rank_features.extend(I_cols[:5])

print('Creating rolling rank features...')
df_stable_eng = create_rolling_rank_features(df_stable_eng, rank_features, windows=[60, 200])
df_recent_eng = create_rolling_rank_features(df_recent_eng, rank_features, windows=[60, 200])

print(f'\nStable dataset: {df_stable_eng.shape}')
print(f'Recent dataset: {df_recent_eng.shape}')

# Show example
print(f"\nExample: market_forward_excess_returns_rank_200d")
print(f"Min: {df_stable_eng['market_forward_excess_returns_rank_200d'].min():.3f}")
print(f"Max: {df_stable_eng['market_forward_excess_returns_rank_200d'].max():.3f}")
print(f"Mean: {df_stable_eng['market_forward_excess_returns_rank_200d'].mean():.3f}")

Creating rolling rank features...

Stable dataset: (4625, 231)
Recent dataset: (1872, 249)

Example: market_forward_excess_returns_rank_200d
Min: 0.005
Max: 1.000
Mean: 0.502
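
As a small worked example of what `rolling(window).rank(pct=True)` returns: each row gets the rank of its own value within the trailing window, divided by the window length. This is why the minimum above is 0.005, i.e. 1/200. The toy series below is illustrative only. `Rolling.rank` requires pandas 1.4 or later.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# Percentile rank of the current value within the trailing 3 rows.
rank_3 = s.rolling(3).rank(pct=True)

# Row 2: window [1, 3, 2], current value 2 is 2nd of 3 -> 2/3
# Row 3: window [3, 2, 5], current value 5 is 3rd of 3 -> 1.0
# Row 4: window [2, 5, 4], current value 4 is 2nd of 3 -> 2/3
print(rank_3.round(3).tolist())  # [nan, nan, 0.667, 1.0, 0.667]
```

The first `window - 1` rows are NaN because the window is not yet full.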


## Section 5: MI-Based Interaction Features

Create interactions between high mutual information features.

In [7]:
def create_interaction_features(df):
    """Create interaction features based on high-MI features"""
    df_new = df.copy()
    
    # High-MI features (from previous analysis): E19, S8, M1, V10, V9
    # Create interactions with V-features (volatility)
    # Each interaction is skipped silently when a source column is missing
    
    # 1. E19 / V10 (macroeconomic indicator relative to volatility) - if both exist
    if 'E19' in df.columns and 'V10' in df.columns:
        df_new['E19_V10_ratio'] = df['E19'] / (df['V10'] + 1e-8)
    
    # 2. S8 * V9 (sentiment × volatility interaction)
    if 'S8' in df.columns and 'V9' in df.columns:
        df_new['S8_V9_product'] = df['S8'] * df['V9']
    
    # 3. M1 / V10 (market breadth/volatility)
    if 'M1' in df.columns and 'V10' in df.columns:
        df_new['M1_V10_ratio'] = df['M1'] / (df['V10'] + 1e-8)
    
    # 4. E19 * risk_free_rate (macroeconomic indicator in different rate regimes)
    if 'E19' in df.columns and 'risk_free_rate' in df.columns:
        df_new['E19_rate_product'] = df['E19'] * df['risk_free_rate']
    
    # 5. (E19 - rolling_mean) / V10 (deviation of E19 from its 20-day mean, adjusted for vol)
    if 'E19' in df.columns and 'V10' in df.columns:
        E19_dev = df['E19'] - df['E19'].rolling(20).mean()
        df_new['E19_dev_V10_ratio'] = E19_dev / (df['V10'] + 1e-8)
    
    # 6. V1 / V10 (short-term vs long-term volatility)
    if 'V1' in df.columns and 'V10' in df.columns:
        df_new['V1_V10_ratio'] = df['V1'] / (df['V10'] + 1e-8)
    
    return df_new

print('Creating MI-based interaction features...')
df_stable_eng = create_interaction_features(df_stable_eng)
df_recent_eng = create_interaction_features(df_recent_eng)

print(f'\nStable dataset: {df_stable_eng.shape}')
print(f'Recent dataset: {df_recent_eng.shape}')

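# Count columns matching the interaction naming pattern
# (stable dataset only; the pattern also matches any *_ratio columns from Section 3)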
interaction_cols = [col for col in df_stable_eng.columns if '_ratio' in col or '_product' in col]
print(f'\nInteraction features created: {len(interaction_cols)}')

Creating MI-based interaction features...

Stable dataset: (4625, 232)
Recent dataset: (1872, 255)

Interaction features created: 1
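
The shapes above show that the stable dataset gained one column (231 → 232) and the recent dataset six (249 → 255), so most of the source columns are missing from the stable data. The substring filter cannot report this per dataset, and it would also count any `_ratio` columns created in Section 3. Comparing column sets before and after a step is one way to get an exact list. A sketch with a hypothetical helper `added_columns` and a toy frame:

```python
import pandas as pd

def added_columns(before: pd.DataFrame, after: pd.DataFrame) -> list:
    """Columns present in `after` but not in `before`, in `after`'s order."""
    existing = set(before.columns)
    return [col for col in after.columns if col not in existing]

# Toy frame: only S8 and V9 exist, so only one interaction can be built.
toy = pd.DataFrame({'S8': [0.1, 0.2], 'V9': [1.0, 2.0], 'M1_M6_ratio': [0.5, 0.6]})
toy_eng = toy.copy()
if 'S8' in toy.columns and 'V9' in toy.columns:
    toy_eng['S8_V9_product'] = toy['S8'] * toy['V9']
if 'E19' in toy.columns and 'V10' in toy.columns:
    toy_eng['E19_V10_ratio'] = toy['E19'] / toy['V10']

new_cols = added_columns(toy, toy_eng)
print(new_cols)  # ['S8_V9_product']
```

Applied to the notebook, this would mean keeping a reference to each frame before calling `create_interaction_features` and passing both to the helper.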


## Section 6: Handle Missing Values from Feature Engineering

Lag and rolling features produce NaN values in their warm-up rows. Forward-fill them, then fill any remaining leading NaNs with the column median.

In [8]:
print('Handling missing values from feature engineering...')
print(f'\nBefore:')
print(f'  Stable NaN: {df_stable_eng.isnull().sum().sum()}')
print(f'  Recent NaN: {df_recent_eng.isnull().sum().sum()}')

# Forward fill, then fill any remaining (leading) NaNs with the column median.
# Note: the median is computed over the full column, so it includes later dates.
# .ffill() replaces fillna(method='ffill'), which is deprecated in pandas 2.x.
for col in df_stable_eng.columns:
    if df_stable_eng[col].isnull().any():
        df_stable_eng[col] = df_stable_eng[col].ffill()
        if df_stable_eng[col].isnull().any():
            df_stable_eng[col] = df_stable_eng[col].fillna(df_stable_eng[col].median())

for col in df_recent_eng.columns:
    if df_recent_eng[col].isnull().any():
        df_recent_eng[col] = df_recent_eng[col].ffill()
        if df_recent_eng[col].isnull().any():
            df_recent_eng[col] = df_recent_eng[col].fillna(df_recent_eng[col].median())

print(f'\nAfter:')
print(f'  Stable NaN: {df_stable_eng.isnull().sum().sum()}')
print(f'  Recent NaN: {df_recent_eng.isnull().sum().sum()}')
print('\n✓ All NaN values handled')

Handling missing values from feature engineering...

Before:
  Stable NaN: 8341
  Recent NaN: 7558

After:
  Stable NaN: 0
  Recent NaN: 0

✓ All NaN values handled
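
Two caveats apply to the fill above. The NaNs sit in the warm-up rows at the start of each series, so forward fill cannot reach them and they receive a median computed partly from later dates. The NaN counts also do not cover `inf` values that the ratio features can produce. An alternative, sketched below on toy data, is to convert `inf` to NaN and drop the warm-up rows. LightGBM and XGBoost also accept NaN inputs directly, so leaving the gaps unfilled is another option. The helper name `clean_engineered` and its `warmup` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_engineered(df: pd.DataFrame, warmup: int) -> pd.DataFrame:
    """Replace +/-inf with NaN and drop the first `warmup` rows."""
    out = df.replace([np.inf, -np.inf], np.nan)
    return out.iloc[warmup:].reset_index(drop=True)

raw = pd.DataFrame({'x': [1.0, 2.0, 4.0, 8.0, 16.0]})
raw['x_mean_3d'] = raw['x'].rolling(3).mean()                     # first 2 rows are NaN
raw['x_ratio'] = raw['x'] / pd.Series([0.0, 1.0, 2.0, 4.0, 8.0])  # row 0 is inf

cleaned = clean_engineered(raw, warmup=2)
print(cleaned)
```

In this notebook the longest window is 200 rows, so a `warmup` of 199 would remove every row with an incomplete window, at the cost of a shorter training history.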


## Section 7: Save Engineered Datasets

Save datasets before feature selection (we'll do selection in a separate notebook with LightGBM).

In [9]:
# Save engineered datasets
path_stable = '../data/hull-tactical-market-prediction/train_no_missing_2007_2025_engineered.csv'
path_recent = '../data/hull-tactical-market-prediction/train_2018_2025_engineered.csv'

df_stable_eng.to_csv(path_stable, index=False)
df_recent_eng.to_csv(path_recent, index=False)

print('='*80)
print('ENGINEERED DATASETS SAVED')
print('='*80)
print(f'\nStable (2007-2025):')
print(f'  Path: {path_stable}')
print(f'  Shape: {df_stable_eng.shape}')
print(f'  Features: {df_stable.shape[1]} → {df_stable_eng.shape[1]} (+{df_stable_eng.shape[1] - df_stable.shape[1]})')

print(f'\nRecent (2018-2025):')
print(f'  Path: {path_recent}')
print(f'  Shape: {df_recent_eng.shape}')
print(f'  Features: {df_recent.shape[1]} → {df_recent_eng.shape[1]} (+{df_recent_eng.shape[1] - df_recent.shape[1]})')

print('\n' + '='*80)
print('✓ Feature Engineering Complete!')
print('='*80)
print('\nNext Steps:')
print('1. Train LightGBM baseline model')
print('2. Extract feature importance')
print('3. Prune low-importance features')
print('4. Build final ensemble model')

ENGINEERED DATASETS SAVED

Stable (2007-2025):
  Path: ../data/hull-tactical-market-prediction/train_no_missing_2007_2025_engineered.csv
  Shape: (4625, 232)
  Features: 90 → 232 (+142)

Recent (2018-2025):
  Path: ../data/hull-tactical-market-prediction/train_2018_2025_engineered.csv
  Shape: (1872, 255)
  Features: 98 → 255 (+157)

✓ Feature Engineering Complete!

Next Steps:
1. Train LightGBM baseline model
2. Extract feature importance
3. Prune low-importance features
4. Build final ensemble model
