# NIFTY-200 Phase 3: Alpha158 Factor Engineering and Preprocessing

This notebook generates Alpha158 factors for the Multitask-Stockformer pipeline using Qlib formula definitions.

## Approach

- **Alpha158 factors**: Implemented using exact formulas from Qlib source code (158 features: 9 KBAR + 4 PRICE + 145 ROLLING)
- **Calendar generation**: Trading calendar created from raw OHLCV data (660 trading days: 2022-01-03 to 2024-08-30)
- **Neutralization**: Size proxy (log(close × rolling 60d volume)) + NSE sector dummies via cross-sectional OLS
- **IC filtering**: Keep factors whose mean daily cross-sectional Information Coefficient (Pearson correlation with next-day returns) satisfies |IC| >= 0.02
- **Labels**: Daily returns (close-to-close)
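
The size proxy named in the neutralization bullet can be sketched as below. This is a minimal illustration, not the script that produced `size_proxy_pivot.csv`: it assumes "rolling 60d volume" means a 60-day rolling mean, and the function name `size_proxy` and the `min_periods` choice are hypothetical.

```python
import numpy as np
import pandas as pd

def size_proxy(close: pd.Series, volume: pd.Series, window: int = 60) -> pd.Series:
    """log(close * rolling-mean volume); NaN until `window` observations exist.

    Assumption: the rolling statistic is a mean and min_periods equals the window.
    """
    avg_volume = volume.rolling(window, min_periods=window).mean()
    return np.log(close * avg_volume)

# Synthetic example: constant price 100 and constant volume 1000
idx = pd.date_range("2022-01-03", periods=5, freq="B")
close = pd.Series(100.0, index=idx)
volume = pd.Series(1000.0, index=idx)
proxy = size_proxy(close, volume, window=3)
print(proxy)
```

The first `window - 1` rows are NaN by construction, which is one reason a size-proxy table can cover fewer dates than the trading calendar.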

## Key Results

- **Total factors computed**: 158 (all Alpha158 features)
- **Factors after IC filter**: 22 (|IC| >= 0.02)
- **Data shape**: 123,725 rows across 660 dates and 191 symbols (below the full 660 × 191 = 126,060 grid, so some symbol-date pairs are missing)
- **Top factors**: STD20 (IC=0.029), KLEN (IC=0.029), BETA60 (IC=0.028)

## Implementation Notes

This implementation differs from the original Chinese stock project:
- **Original project**: Used all 360 Alpha360 factors WITHOUT any IC filtering
- **Our adaptation**: Uses 158 Alpha158 factors WITH strict IC filtering (|IC| >= 0.02)
- **Rationale**: The Indian market differs from Chinese A-shares, so a factor is kept only if its mean IC against next-day returns reaches the threshold on this dataset

In [None]:
# Setup and Imports
import os
import sys
import time
import glob
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

print("✓ Imports complete")

2026-01-03 09:02:05,115 - INFO - Notebook setup complete


In [None]:
# Environment Validation
import platform
print("Environment Info:")
print(f"  Python: {platform.python_version()}")
print(f"  NumPy: {np.__version__}")
print(f"  Pandas: {pd.__version__}")

Environment Info:
Python: 3.10.18
NumPy: 1.26.4
Pandas: 2.3.3
Qlib available: True
Statsmodels available: True
PyTorch: 2.6.0+cpu, CUDA: False


In [None]:
# Configuration
WORKDIR = "/home/ubuntu/rajnish/Multitask-Stockformer"

RAW_DIR = os.path.join(WORKDIR, 'data/NIFTY200/raw')
INSTRUMENTS_FILE = os.path.join(WORKDIR, 'data/NIFTY200/instruments/nifty200.txt')
OUTPUT_DIR = os.path.join(WORKDIR, 'data/NIFTY200/Alpha_158_2022-01-01_2024-08-31')
DATASET_DIR = os.path.join(WORKDIR, 'data/NIFTY200/Stock_NIFTY_2022-01-01_2024-08-31')
LABEL_FILE = os.path.join(DATASET_DIR, 'label.csv')
SIZE_PROXY_FILE = os.path.join(WORKDIR, 'data/NIFTY200/size_proxy_pivot.csv')
SECTOR_DUMMIES_FILE = os.path.join(WORKDIR, 'data/NIFTY200/stock_info_with_dummies.csv')
CALENDAR_FILE = os.path.join(WORKDIR, 'data/NIFTY200/qlib_data/calendars/day.txt')

IC_THRESHOLD = 0.02

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(DATASET_DIR, exist_ok=True)

print("Configuration loaded:")
print(f"  RAW_DIR: {RAW_DIR}")
print(f"  OUTPUT_DIR: {OUTPUT_DIR}")
print(f"  IC_THRESHOLD: {IC_THRESHOLD}")

Config loaded and directories verified.
WORKDIR: /home/ubuntu/rajnish/Multitask-Stockformer
raw_dir: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/raw
instruments_file: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/instruments/nifty200.txt
qlib_dir: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/qlib_data/
alpha_out_dir: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Alpha_158_2022-01-01_2024-08-31/
dataset_dir: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Stock_NIFTY_2022-01-01_2024-08-31/
label_out: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Stock_NIFTY_2022-01-01_2024-08-31/label.csv
size_proxy_pivot: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/size_proxy_pivot.csv
sector_dummies: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/stock_info_with_dummies.csv


In [24]:
# Generate trading calendar from raw OHLCV data
import os
import glob
import pandas as pd

# Use absolute paths
WORKDIR = "/home/ubuntu/rajnish/Multitask-Stockformer"
RAW_DIR_ABS = os.path.join(WORKDIR, "data/NIFTY200/raw")

# Collect all unique dates from CSVs
all_dates = set()
csv_files = glob.glob(os.path.join(RAW_DIR_ABS, '*.csv'))

print(f"Scanning {len(csv_files)} CSV files for trading dates...")
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    df['Date'] = pd.to_datetime(df['Date'])
    all_dates.update(df['Date'].dt.date)

# Sort dates
sorted_dates = sorted(all_dates)
print(f"Found {len(sorted_dates)} unique trading dates")
if sorted_dates:
    print(f"Date range: {sorted_dates[0]} to {sorted_dates[-1]}")

# Create calendar directory and file
calendar_dir = os.path.join(WORKDIR, "data/NIFTY200/qlib_data/calendars")
os.makedirs(calendar_dir, exist_ok=True)
calendar_file = os.path.join(calendar_dir, "day.txt")

# Write calendar file
with open(calendar_file, 'w') as f:
    for date in sorted_dates:
        f.write(date.strftime('%Y-%m-%d') + '\n')

print(f"✓ Calendar file created: {calendar_file}")
print(f"  Total trading days: {len(sorted_dates)}")

# Display first and last 5 dates
if sorted_dates:
    print("\nFirst 5 dates:")
    for d in sorted_dates[:5]:
        print(f"  {d}")
    print("\nLast 5 dates:")
    for d in sorted_dates[-5:]:
        print(f"  {d}")

Scanning 191 CSV files for trading dates...
Found 660 unique trading dates
Date range: 2022-01-03 to 2024-08-30
✓ Calendar file created: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/qlib_data/calendars/day.txt
  Total trading days: 660

First 5 dates:
  2022-01-03
  2022-01-04
  2022-01-05
  2022-01-06
  2022-01-07

Last 5 dates:
  2024-08-26
  2024-08-27
  2024-08-28
  2024-08-29
  2024-08-30
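
Because the calendar is the union of dates across all CSVs, an individual symbol may lack some of those dates. The sketch below counts such gaps per symbol. The helper name `missing_dates_per_symbol` is hypothetical, and the only column it relies on is `Date`, which the cell above already reads.

```python
import pandas as pd

def missing_dates_per_symbol(frames, calendar) -> pd.Series:
    """Count calendar dates that are absent from each symbol's 'Date' column."""
    cal = pd.DatetimeIndex(calendar)
    counts = {}
    for sym, df in frames.items():
        have = pd.DatetimeIndex(pd.to_datetime(df["Date"]))
        counts[sym] = len(cal.difference(have))
    return pd.Series(counts, name="missing_dates").sort_values(ascending=False)

# Synthetic example: BBB is missing the first calendar date
calendar = pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"])
frames = {
    "AAA": pd.DataFrame({"Date": ["2022-01-03", "2022-01-04", "2022-01-05"]}),
    "BBB": pd.DataFrame({"Date": ["2022-01-04", "2022-01-05"]}),
}
report = missing_dates_per_symbol(frames, calendar)
print(report)
```

Running this on the real per-symbol frames would show which symbols account for any shortfall between the row count and the full dates × symbols grid.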


## Alternative Approach: Implement Alpha158 Formulas Directly

Qlib's data provider expects data converted to its own binary format rather than raw CSV files, so this notebook implements the Alpha158 formulas directly in pandas instead. This gives us:
1. The same 158 features as defined in the Qlib source
2. Full transparency and control over each formula
3. Direct compatibility with our NIFTY-200 CSV data

The formulas are extracted from Qlib's `Alpha158DL.get_feature_config()` method.
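
To make the idea concrete, here is a sketch of the nine KBAR features in pandas, written from the published Qlib Alpha158 expressions. It is not the code in `alpha158_pandas`; the function name `kbar_features` and the lowercase column names are assumptions, and the expressions should be checked against `Alpha158DL.get_feature_config()` before being relied on.

```python
import numpy as np
import pandas as pd

def kbar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Candlestick-shape ratios from open/high/low/close columns (names assumed)."""
    o, h, l, c = df["open"], df["high"], df["low"], df["close"]
    body_top = np.maximum(o, c)      # upper edge of the candle body
    body_bottom = np.minimum(o, c)   # lower edge of the candle body
    rng = h - l + 1e-12              # small epsilon avoids division by zero
    return pd.DataFrame({
        "KMID": (c - o) / o,
        "KLEN": (h - l) / o,
        "KMID2": (c - o) / rng,
        "KUP": (h - body_top) / o,
        "KUP2": (h - body_top) / rng,
        "KLOW": (body_bottom - l) / o,
        "KLOW2": (body_bottom - l) / rng,
        "KSFT": (2 * c - h - l) / o,
        "KSFT2": (2 * c - h - l) / rng,
    })

# One synthetic bar: open 100, high 110, low 95, close 105
bar = pd.DataFrame({"open": [100.0], "high": [110.0], "low": [95.0], "close": [105.0]})
kbar = kbar_features(bar)
print(kbar.round(4))
```

Every feature is a ratio, so the values are comparable across stocks with very different price levels.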

In [34]:
# Import our Alpha158 implementation
import sys
sys.path.insert(0, '/home/ubuntu/rajnish/Multitask-Stockformer/data_processing_script/nifty')
from alpha158_pandas import compute_alpha158_factors, get_feature_names

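# Load the symbol list (assumes the first whitespace-separated token on each
# line of the instruments file is the symbol)
with open(INSTRUMENTS_FILE) as f:
    instruments = [line.split()[0] for line in f if line.strip()]

# Load raw OHLCV data for a small test sample
print("Loading raw OHLCV data...")
ohlcv_data = {}
for sym in instruments[:5]:  # Test with first 5 symbols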
    csv_path = os.path.join(RAW_DIR_ABS, f"{sym}.csv")
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
        df['Date'] = pd.to_datetime(df['Date'])
        ohlcv_data[sym] = df

print(f"Loaded {len(ohlcv_data)} symbols")

# Compute Alpha158 factors (TEST with small sample first)
print("\nComputing Alpha158 factors (test run with 5 symbols)...")
alpha158_test = compute_alpha158_factors(ohlcv_data)

# Display results
print(f"\n✓ Test successful!")
print(f"  Shape: {alpha158_test.shape}")
print(f"  Features: {alpha158_test.shape[1]}")
print(f"\nFirst 10 features:")
for i, col in enumerate(alpha158_test.columns[:10]):
    print(f"  {i+1}. {col}")

Loading raw OHLCV data...
Loaded 5 symbols

Computing Alpha158 factors (test run with 5 symbols)...
✓ Computed 158 Alpha158 factors
  Expected: 158 (9 KBAR + 4 PRICE + 145 ROLLING)
  Data shape: (3300, 158)

✓ Test successful!
  Shape: (3300, 158)
  Features: 158

First 10 features:
  1. KMID
  2. KLEN
  3. KMID2
  4. KUP
  5. KUP2
  6. KLOW
  7. KLOW2
  8. KSFT
  9. KSFT2
  10. OPEN0
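
The test cell and the full run use the same loading loop, so it could be factored into one helper. The sketch below is one way to do that. The name `load_ohlcv` is hypothetical, and it additionally reports symbols with no CSV, which the original loop skips silently.

```python
import os
import tempfile
import pandas as pd

def load_ohlcv(symbols, raw_dir):
    """Read <raw_dir>/<symbol>.csv for each symbol and parse its 'Date' column.

    Returns (data, missing): a dict of DataFrames and a list of symbols
    for which no CSV file exists.
    """
    data, missing = {}, []
    for sym in symbols:
        path = os.path.join(raw_dir, f"{sym}.csv")
        if not os.path.exists(path):
            missing.append(sym)
            continue
        data[sym] = pd.read_csv(path, parse_dates=["Date"])
    return data, missing

# Synthetic example in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Date": ["2022-01-03"], "Close": [100.0]}).to_csv(
        os.path.join(tmp, "AAA.csv"), index=False
    )
    data, missing = load_ohlcv(["AAA", "ZZZ"], tmp)

print(list(data), missing)
```

With a helper like this, the test run would pass `instruments[:5]` and the full run would pass `instruments`.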


In [35]:
# Compute Alpha158 for ALL 191 symbols
print("Loading raw OHLCV data for all 191 symbols...")
ohlcv_data_full = {}
for sym in instruments:
    csv_path = os.path.join(RAW_DIR_ABS, f"{sym}.csv")
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
        df['Date'] = pd.to_datetime(df['Date'])
        ohlcv_data_full[sym] = df

print(f"Loaded {len(ohlcv_data_full)} symbols")

# Compute Alpha158 factors for all symbols
print("\nComputing Alpha158 factors for all symbols...")
import time
start_time = time.time()

alpha158_full = compute_alpha158_factors(ohlcv_data_full)

elapsed = time.time() - start_time
print(f"\n✓ Alpha158 computation complete!")
print(f"  Time elapsed: {elapsed:.1f}s")
print(f"  Shape: {alpha158_full.shape}")
print(f"  Features: {alpha158_full.shape[1]}")
print(f"  Date range: {alpha158_full.index.get_level_values(0).min()} to {alpha158_full.index.get_level_values(0).max()}")
print(f"\nFeature breakdown:")
print(f"  KBAR: 9 features")
print(f"  PRICE: 4 features")
print(f"  ROLLING: 145 features")
print(f"  TOTAL: {alpha158_full.shape[1]} features ✓")

Loading raw OHLCV data for all 191 symbols...
Loaded 191 symbols

Computing Alpha158 factors for all symbols...
✓ Computed 158 Alpha158 factors
  Expected: 158 (9 KBAR + 4 PRICE + 145 ROLLING)
  Data shape: (123725, 158)

✓ Alpha158 computation complete!
  Time elapsed: 475.2s
  Shape: (123725, 158)
  Features: 158
  Date range: 2022-01-03 00:00:00 to 2024-08-30 00:00:00

Feature breakdown:
  KBAR: 9 features
  PRICE: 4 features
  ROLLING: 145 features
  TOTAL: 158 features ✓
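
Most of the features are rolling-window ratios. As an illustration, the sketch below computes three of them (STD, BETA and MIN for a window `d`) from the Qlib expressions `Std($close, d)/$close`, `Slope($close, d)/$close` and `Min($low, d)/$close`. It is an assumption-laden sketch rather than the `alpha158_pandas` code: the helper names are hypothetical, and it requires a full window (`min_periods=d`), which may differ from how Qlib treats the first rows.

```python
import numpy as np
import pandas as pd

def rolling_slope(x: pd.Series, d: int) -> pd.Series:
    """OLS slope of each d-length window regressed on 0..d-1 (requires d >= 2)."""
    t = np.arange(d, dtype=float)
    t_centered = t - t.mean()
    denom = (t_centered ** 2).sum()
    # sum((t - mean_t) * w) / sum((t - mean_t)^2) equals the regression slope
    return x.rolling(d, min_periods=d).apply(
        lambda w: float(np.dot(t_centered, w) / denom), raw=True
    )

def rolling_examples(close: pd.Series, low: pd.Series, d: int) -> pd.DataFrame:
    """STD, BETA and MIN features for window d, each scaled by the current close."""
    return pd.DataFrame({
        f"STD{d}": close.rolling(d, min_periods=d).std() / close,
        f"BETA{d}": rolling_slope(close, d) / close,
        f"MIN{d}": low.rolling(d, min_periods=d).min() / close,
    })

# Synthetic example: close rises by 1 per day, low sits 1 below close
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0])
low = close - 1.0
feats = rolling_examples(close, low, d=3)
print(feats)
```

Dividing by the current close keeps each feature scale-free, the same normalization the KBAR ratios use.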


In [39]:
# Neutralize Alpha158 factors (size + sector)
from sklearn.linear_model import LinearRegression

print("Loading size proxy and sector dummies...")
# Load with absolute paths
size_proxy_path = os.path.join(WORKDIR, "data/NIFTY200/size_proxy_pivot.csv")
sector_dummies_path = os.path.join(WORKDIR, "data/NIFTY200/stock_info_with_dummies.csv")

size_proxy = pd.read_csv(size_proxy_path)
size_proxy['Date'] = pd.to_datetime(size_proxy['Date'])
size_proxy = size_proxy.set_index('Date').sort_index()

sector_df = pd.read_csv(sector_dummies_path)
sector_cols = [c for c in sector_df.columns if c.startswith('SECTOR_')]
sector_dummies = sector_df[['symbol'] + sector_cols].set_index('symbol')

print(f"✓ Size proxy shape: {size_proxy.shape}")
print(f"✓ Sector dummies shape: {sector_dummies.shape}")

# Convert Alpha158 to pivoted format (dates × symbols) for each feature
print(f"\nNeutralizing {alpha158_full.shape[1]} factors...")

neutralized_factors = {}
start_time = time.time()

for i, feature_name in enumerate(alpha158_full.columns, 1):
    if i % 20 == 0:
        print(f"  Processing feature {i}/{alpha158_full.shape[1]}...")
    
    # Pivot: dates × symbols
    factor_pivot = alpha158_full[feature_name].unstack(level=1)
    
    # Neutralize cross-sectionally per date
    neutralized_rows = []
    for date in factor_pivot.index:
        factor_vec = factor_pivot.loc[date]
        
        # Get size proxy for this date
        if date in size_proxy.index:
            size_vec = size_proxy.loc[date]
        else:
            # Skip this date if no size proxy
            neutralized_rows.append(pd.Series(index=factor_vec.index, dtype=float))
            continue
        
        # Align to common symbols
        syms = factor_vec.index.intersection(size_vec.index).intersection(sector_dummies.index)
        if len(syms) < 10:
            neutralized_rows.append(pd.Series(index=factor_vec.index, dtype=float))
            continue
        
        # Build X: [size, sector_dummies]
        X = pd.concat([
            size_vec.loc[syms].to_frame('size'),
            sector_dummies.loc[syms]
        ], axis=1).fillna(0).values
        
        y = factor_vec.loc[syms].values
        
        # Filter NaNs
        mask = ~np.isnan(y)
        if mask.sum() < 5:
            neutralized_rows.append(pd.Series(index=factor_vec.index, dtype=float))
            continue
        
        X_fit = X[mask]
        y_fit = y[mask]
        
        # OLS regression
        try:
            lr = LinearRegression(fit_intercept=True)
            lr.fit(X_fit, y_fit)
            y_pred = lr.predict(X_fit)
            residuals = y_fit - y_pred
            
            # Place residuals back
            row = pd.Series(index=factor_vec.index, dtype=float)
            row.loc[syms[mask]] = residuals
            neutralized_rows.append(row)
        except Exception:
            neutralized_rows.append(pd.Series(index=factor_vec.index, dtype=float))
    
    neutralized_factors[feature_name] = pd.DataFrame(neutralized_rows, index=factor_pivot.index)

elapsed = time.time() - start_time
print(f"\n✓ Neutralization complete!")
print(f"  Time elapsed: {elapsed:.1f}s")
print(f"  Neutralized {len(neutralized_factors)} factors")

Loading size proxy and sector dummies...
✓ Size proxy shape: (631, 191)
✓ Sector dummies shape: (191, 8)

Neutralizing 158 factors...
  Processing feature 20/158...
  Processing feature 40/158...
  Processing feature 60/158...
  Processing feature 80/158...
  Processing feature 100/158...
  Processing feature 120/158...
  Processing feature 140/158...

✓ Neutralization complete!
  Time elapsed: 229.0s
  Neutralized 158 factors
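
For a single date, the neutralization step amounts to taking OLS residuals of the factor on an intercept, the size proxy and the sector dummies. The sketch below shows that with `np.linalg.lstsq` on synthetic data. The function name is hypothetical, and unlike the cell above it drops rows with missing regressors instead of filling them with 0.

```python
import numpy as np

def neutralize_cross_section(y, size, sector_dummies):
    """Residuals of y regressed on [1, size, sector dummies]; invalid rows stay NaN."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), size, sector_dummies]).astype(float)
    ok = np.isfinite(y) & np.isfinite(X).all(axis=1)
    resid = np.full(len(y), np.nan)
    # lstsq returns a least-squares solution even when the intercept and a
    # full set of dummies make X rank-deficient; the residuals are unaffected
    beta, *_ = np.linalg.lstsq(X[ok], y[ok], rcond=None)
    resid[ok] = y[ok] - X[ok] @ beta
    return resid

# Synthetic cross-section: 40 stocks, 4 sectors, one missing factor value
rng = np.random.default_rng(0)
n = 40
size = rng.normal(size=n)
sectors = np.eye(4)[rng.integers(0, 4, n)]
y = 2.0 * size + sectors @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=n)
y[3] = np.nan
resid = neutralize_cross_section(y, size, sectors)
valid = np.isfinite(resid)
print(abs(resid[valid] @ size[valid]), abs(resid[valid].sum()))
```

Both printed values are numerically zero: the residuals are uncorrelated with size and average to zero within every sector, which is the definition of size and sector neutrality used here.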


In [40]:
# Standardize (Z-score per date) and compute IC
print("Standardizing factors and computing IC...")

# Load labels
label_path = os.path.join(WORKDIR, "data/NIFTY200/Stock_NIFTY_2022-01-01_2024-08-31/label.csv")
label_df = pd.read_csv(label_path, index_col=0, parse_dates=True)
label_df = label_df.sort_index()
print(f"✓ Labels loaded: {label_df.shape}")

# Standardize and compute IC for each factor
standardized_factors = {}
ic_summary = []
ic_threshold = IC_THRESHOLD  # reuse the value from the configuration cell

for i, (feature_name, factor_mat) in enumerate(neutralized_factors.items(), 1):
    if i % 20 == 0:
        print(f"  Processing feature {i}/{len(neutralized_factors)}...")
    
    # Z-score per date (cross-sectional)
    z_rows = []
    for date in factor_mat.index:
        vals = factor_mat.loc[date].values.astype(float)
        mask = np.isfinite(vals)
        if mask.sum() < 5:
            z_rows.append(pd.Series([np.nan] * len(vals), index=factor_mat.columns))
            continue
        m = vals[mask].mean()
        s = vals[mask].std()
        if s == 0 or not np.isfinite(s):
            z_rows.append(pd.Series([np.nan] * len(vals), index=factor_mat.columns))
        else:
            z = (vals - m) / s
            z_rows.append(pd.Series(z, index=factor_mat.columns))
    
    z_mat = pd.DataFrame(z_rows, index=factor_mat.index)
    
    # Compute IC: correlation between factor values at `date` and the label
    # row that follows `date` in label.csv (treated as next-day returns)
    common_dates = z_mat.index.intersection(label_df.index)
    common_syms = z_mat.columns.intersection(label_df.columns)
    ic_vals = []
    
    for date in common_dates:
        next_date_idx = label_df.index.get_loc(date) + 1
        if next_date_idx >= len(label_df) or len(common_syms) < 5:
            continue
        next_date = label_df.index[next_date_idx]
        
        # Align factor and label values on the shared symbols
        x_aligned = z_mat.loc[date, common_syms].values.astype(float)
        y_aligned = label_df.loc[next_date, common_syms].values.astype(float)
        
        mask = np.isfinite(x_aligned) & np.isfinite(y_aligned)
        if mask.sum() < 5:
            continue
        
        # Pearson correlation; np.corrcoef yields NaN for constant inputs,
        # which the finiteness check below discards
        ic = np.corrcoef(x_aligned[mask], y_aligned[mask])[0, 1]
        if np.isfinite(ic):
            ic_vals.append(ic)
    
    # Mean IC
    mean_ic = np.mean(ic_vals) if len(ic_vals) > 0 else np.nan
    
    # Save if passes threshold
    keep = not np.isnan(mean_ic) and abs(mean_ic) >= ic_threshold
    ic_summary.append({
        'factor': feature_name,
        'IC': mean_ic,
        'selected': keep
    })
    
    if keep:
        standardized_factors[feature_name] = z_mat

print(f"\n✓ Standardization and IC filtering complete!")
print(f"  Total factors: {len(neutralized_factors)}")
print(f"  Passed IC filter (|IC| >= {ic_threshold}): {len(standardized_factors)}")

# Display IC summary
ic_df = pd.DataFrame(ic_summary).sort_values('IC', ascending=False, key=abs)
print(f"\nTop 10 factors by |IC|:")
print(ic_df.head(10).to_string(index=False))

Standardizing factors and computing IC...
✓ Labels loaded: (95, 191)
  Processing feature 20/158...
  Processing feature 40/158...
  Processing feature 60/158...
  Processing feature 80/158...
  Processing feature 100/158...
  Processing feature 120/158...
  Processing feature 140/158...

✓ Standardization and IC filtering complete!
  Total factors: 158
  Passed IC filter (|IC| >= 0.02): 22

Top 10 factors by |IC|:
factor        IC  selected
 STD20  0.028815      True
  KLEN  0.028744      True
BETA60  0.028330      True
BETA20  0.027048      True
 HIGH0  0.027027      True
 STD10  0.026844      True
 MIN60 -0.026762      True
   KUP  0.026388      True
RESI20 -0.026228      True
QTLD60 -0.025968      True
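
The per-date loops above can also be written in vectorized form, and the daily IC series supports more than a mean. The sketch below standardizes each row, correlates it with the following label row via `DataFrame.corrwith`, and summarizes the result with the mean, the ICIR (mean over standard deviation) and a t-statistic. All function names are hypothetical, and the minimum-observation checks from the cell above are omitted for brevity.

```python
import numpy as np
import pandas as pd

def zscore_rows(factor: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional z-score per date; rows with zero spread become NaN."""
    mean = factor.mean(axis=1)
    std = factor.std(axis=1, ddof=0).replace(0.0, np.nan)
    return factor.sub(mean, axis=0).div(std, axis=0)

def daily_ic(factor: pd.DataFrame, labels: pd.DataFrame) -> pd.Series:
    """Pearson correlation per date between factor[t] and the label row after t."""
    next_labels = labels.shift(-1)  # row t now holds the labels of the following row
    dates = factor.index.intersection(next_labels.index)
    syms = factor.columns.intersection(next_labels.columns)
    ic = factor.loc[dates, syms].corrwith(next_labels.loc[dates, syms], axis=1)
    return ic.dropna()

def ic_stats(ic: pd.Series) -> dict:
    """Mean IC, ICIR and t-statistic of the mean across days."""
    n = len(ic)
    mean, std = ic.mean(), ic.std(ddof=1)
    return {"mean_ic": mean, "icir": mean / std, "t_stat": mean / std * np.sqrt(n), "n_days": n}

# Synthetic example: 3 dates, 4 symbols
dates = pd.date_range("2022-01-03", periods=3, freq="B")
cols = ["A", "B", "C", "D"]
factor = pd.DataFrame([[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]], index=dates, columns=cols, dtype=float)
labels = pd.DataFrame([[0, 0, 0, 0], [2, 4, 6, 8], [-4, -3, -2, -1]], index=dates, columns=cols, dtype=float)
z = zscore_rows(factor)
ic = daily_ic(z, labels)
stats = ic_stats(ic)
print(ic.tolist(), stats)
```

The label file loaded above has 95 rows, so the mean IC rests on at most 95 daily observations. Reporting the t-statistic next to the mean makes it easier to judge whether a value near the 0.02 threshold is distinguishable from noise.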


In [None]:
# Save results to CSV files
print(f"Saving results to: {OUTPUT_DIR}")

# Save standardized factors that passed IC filter
for name, mat in standardized_factors.items():
    out_path = os.path.join(OUTPUT_DIR, f"{name}.csv")
    mat.to_csv(out_path)

# Save IC summary
ic_summary_path = os.path.join(OUTPUT_DIR, "ic_summary.csv")
ic_df.to_csv(ic_summary_path, index=False)

# Save list of selected factors
selected_factors = ic_df[ic_df['selected']]['factor'].tolist()
selected_path = os.path.join(OUTPUT_DIR, "selected_factors.txt")
with open(selected_path, 'w') as f:
    for name in selected_factors:
        f.write(name + '\n')

print(f"\n{'='*70}")
print(f"PHASE 3 COMPLETE: Alpha158 Factor Engineering for NIFTY-200")
print(f"{'='*70}")
print(f"\n✓ Generated exactly 158 Alpha158 factors using Qlib formulas")
print(f"✓ Applied neutralization (size proxy + sector dummies)")
print(f"✓ Standardized factors (Z-score per date)")
print(f"✓ Filtered by IC (|IC| >= {IC_THRESHOLD}): {len(selected_factors)} factors passed")
print(f"\nOutput:")
print(f"  Factor files: {OUTPUT_DIR}")
print(f"  Label file: {LABEL_FILE}")
print(f"  IC summary: {ic_summary_path}")
print(f"  Selected factors: {selected_path}")
print(f"\nTop 10 selected factors:")
for i, name in enumerate(selected_factors[:10], 1):
    ic_val = ic_df[ic_df['factor'] == name]['IC'].values[0]
    print(f"  {i:2d}. {name:10s} IC = {ic_val:7.4f}")
if len(selected_factors) > 10:
    print(f"  ... and {len(selected_factors) - 10} more")
    
print(f"\n{'='*70}")
print(f"COMPARISON WITH ORIGINAL CHINESE STOCK PROJECT")
print(f"{'='*70}")
print(f"\nOriginal project (Chinese A-shares):")
print(f"  - Total factors: 360 (Alpha360)")
print(f"  - IC filtering: None")
print(f"  - Final factors used: ALL 360")
print(f"  - Stock universe: 255 stocks (CSI 300)")
print(f"\nOur NIFTY-200 adaptation:")
print(f"  - Total factors: 158 (Alpha158)")
print(f"  - IC filtering: |IC| >= {IC_THRESHOLD}")
print(f"  - Final factors used: {len(selected_factors)} (strict filtering)")
print(f"  - Stock universe: 191 stocks (NIFTY-200)")
print(f"\nRationale: IC filtering ensures only predictive factors are used,")
print(f"           adapting the model to Indian market characteristics.")

Saving factors to: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Alpha_158_2022-01-01_2024-08-31

✓ Saved 22 factor files
✓ Saved IC summary: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Alpha_158_2022-01-01_2024-08-31/ic_summary.csv
✓ Saved selected factors list: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Alpha_158_2022-01-01_2024-08-31/selected_factors.txt

PHASE 3 COMPLETE: Alpha158 Factor Engineering for NIFTY-200
✓ Generated exactly 158 Alpha158 factors using Qlib formulas
✓ Applied neutralization (size proxy + sector dummies)
✓ Standardized factors (Z-score per date)
✓ Filtered by IC (|IC| >= 0.02): 22 factors passed

Output directory: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Alpha_158_2022-01-01_2024-08-31
Label file: /home/ubuntu/rajnish/Multitask-Stockformer/data/NIFTY200/Stock_NIFTY_2022-01-01_2024-08-31/label.csv

Selected factors (22):
  - STD20: IC = 0.0288
  - KLEN: IC = 0.0287
  - BETA60: IC = 0.0283
  - BETA20: IC = 0