# Feature Forecasting for Predictive Modeling

## Forecasting Framework

This notebook forecasts future values of time-varying features with gradient-boosted tree models (XGBoost). Predicting a stock price over a future horizon requires feature values for that horizon as well as historical ones, and those future values are not yet observed, so they must be either forecast or carried forward from the last known observation.

## Feature Classification and Forecasting Strategy

Our methodology categorizes features based on their temporal characteristics and predictability:

**Predictable Features**: These include technical indicators, market indices, interest rates, and other variables that vary daily and can be modeled from recent history. Each predictable feature is forecast by its own XGBoost model trained on lagged feature values.

**Forward-Filled Features**: Economic indicators such as CPI, PPI, and employment data are released monthly with reporting lags, so most trading days bring no new observation. These features are carried forward from their last known values rather than forecast, on the premise that the underlying economic conditions change gradually.
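
As a minimal sketch of forward-filling, assuming the monthly releases are held in a pandas Series indexed by publication date (the dates and values below are invented for illustration and are not taken from the dataset), each business day takes the most recently published value:

```python
import pandas as pd

# Hypothetical monthly CPI releases, indexed by publication date (illustrative values).
cpi_releases = pd.Series(
    [310.3, 311.1],
    index=pd.to_datetime(["2024-03-12", "2024-04-10"]),
    name="CPI",
)

# Business-day index of the kind a daily price table would use.
daily_index = pd.bdate_range("2024-03-12", "2024-04-12")

# Carry each release forward until the next one is published.
cpi_daily = cpi_releases.reindex(daily_index, method="ffill")

print(cpi_daily.loc[pd.Timestamp("2024-04-09")])  # still the March release
print(cpi_daily.loc[pd.Timestamp("2024-04-10")])  # switches on the April release date
```

Indexing by publication date rather than by the reference month keeps a value from appearing in the daily table before it was actually available.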

**Temporal Structure**: Each forecasting model takes several lagged values as inputs (lags of 1, 2, and 3 periods in this notebook), so a prediction reflects both the most recent move and the short-term trend in how the features evolve.
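
A lag structure of this kind can be sketched with `DataFrame.shift`. The helper name `add_lags` and the `<column>_lag<k>` naming are hypothetical; the project's own `create_lagged_features` may name and handle columns differently.

```python
import pandas as pd

def add_lags(frame: pd.DataFrame, columns: list[str], lags: list[int]) -> pd.DataFrame:
    """Return a copy of `frame` with `<col>_lag<k>` columns added.

    Rows that lack a full lag history are dropped.
    """
    out = frame.copy()
    for col in columns:
        for k in lags:
            # shift(k) moves values down k rows, so row t holds the value from t - k.
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out.dropna().reset_index(drop=True)

toy = pd.DataFrame({"VIX": [14.0, 15.0, 13.5, 16.0, 15.5]})
lagged = add_lags(toy, ["VIX"], [1, 2, 3])
print(lagged)
```

With a maximum lag of 3, the first three rows have no complete history and are dropped, which is why the toy frame shrinks from five rows to two.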

The system generates two critical datasets: historical validation data for model verification and future prediction data for forward-looking analysis. This dual approach enables robust validation of forecasting accuracy while providing the forward-looking features necessary for stock price prediction.


In [1]:
# Initialize feature forecasting pipeline
import sys
sys.path.append('../src')
from feature_forecasting import (
    load_and_prepare_data, create_lagged_features,
    train_feature_models, create_historical_validation_data, 
    create_future_prediction_data, diagnostic_analysis
)

# Define forecasting parameters
ticker = input("Enter the ticker symbol to analyze (or press Enter to use AAPL): ").strip().upper() or "AAPL"
prediction_horizon = 5
lag_periods = [1, 2, 3]

print(f"Feature forecasting analysis for {ticker}")
print(f"Prediction horizon: {prediction_horizon} days")
print(f"Lag structure: {lag_periods} periods")

# Load and categorize features for forecasting
df, forward_filled_features, predictable_features, all_features = load_and_prepare_data(ticker)
print(f"Dataset loaded: {df.shape}")
print(f"Predictable features: {len(predictable_features)} (require forecasting)")
print(f"Forward-filled features: {len(forward_filled_features)} (structural/economic)")

# Create temporal lag structure for predictive modeling
df_lagged = create_lagged_features(df, predictable_features, lag_periods)
print(f"Lagged feature matrix: {df_lagged.shape}")
print("Temporal dependencies captured through multi-period lag structure")



Analyzing data for MSFT
🚀 Feature Forecasting Step-by-Step for MSFT
📊 Step 1: Loading and preparing data...
Data shape: (451, 34)
Date range: 2023-09-01 00:00:00 to 2025-06-20 00:00:00

Columns: ['Date', 'CPI', 'Close', 'Consumer_Sentiment', 'Dow_Jones', 'Fed_Funds_Rate', 'High', 'Housing_Starts', 'Industrial_Production', 'Low', 'NASDAQ', 'Open', 'PPI', 'S&P_500', 'Sector_Technology', 'Treasury_10Y', 'Treasury_3M', 'Unemployment_Rate', 'VIX', 'Volume', 'SMA_20', 'SMA_50', 'EMA_20', 'ATR', 'Bollinger_Upper', 'Bollinger_Lower', 'RSI', 'OBV', 'CPI_YoY_Change', 'PPI_YoY_Change', 'Yield_Curve_Spread', 'Fed_10Y_Spread', 'Treasury_10Y_DoD', 'Treasury_3M_DoD']
Forward-filled features (13): ['Fed_Funds_Rate', 'CPI', 'PPI', 'Unemployment_Rate', 'Industrial_Production', 'Consumer_Sentiment', 'Housing_Starts', 'CPI_YoY_Change', 'PPI_YoY_Change', 'Yield_Curve_Spread', 'Fed_10Y_Spread', 'Treasury_10Y_DoD', 'Treasury_3M_DoD']

Predictable features (19): ['Dow_Jones', 'High', 'Low', 'NASDAQ', 'Open', 
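
The training cell that follows passes `n_iter = 50` together with a `param_grid`. The presence of `n_iter` suggests a randomized search over the grid rather than an exhaustive one; that is an assumption, since the source of `train_feature_models` is not shown here. Under that assumption, the sketch below counts how many combinations the grid contains and what share of them 50 draws would cover.

```python
from math import prod

# Same grid as in the training cell.
param_grid = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [5, 10, 15],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'reg_alpha': [0.1, 0.5, 1, 2, 3],
    'reg_lambda': [0.1, 0.5, 1, 2, 5],
    'gamma': [0.4, 0.5],
    'n_estimators': [50, 100, 150, 200],
}
n_iter = 50

# Total number of distinct hyperparameter combinations in the grid.
n_combinations = prod(len(values) for values in param_grid.values())
coverage = n_iter / n_combinations

print(f"{n_combinations} combinations, {coverage:.2%} sampled per feature model")
```

A sparse sample like this keeps training time manageable when one model is tuned per feature, at the cost of leaving most of the grid unexplored.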

In [None]:
# Configure model training parameters
n_iter = 50
cv_folds = 20
param_grid = {
    'max_depth': [2, 3, 4],  
    'min_child_weight': [5, 10, 15],  
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'reg_alpha': [0.1, 0.5, 1, 2, 3],  
    'reg_lambda': [0.1, 0.5, 1, 2, 5], 
    'gamma': [0.4, 0.5],
    'n_estimators': [50, 100, 150, 200]
}

# Train individual forecasting models for each predictable feature
models, scalers, performance, lagged_feature_names = train_feature_models(
    df_lagged, predictable_features, lag_periods,
    n_iter=n_iter, cv_folds=cv_folds, param_grid=param_grid
)

# Evaluate overall forecasting performance (sMAPE is an error metric: lower is better)
avg_smape = sum(perf['test_smape'] for perf in performance.values()) / len(performance)
print(f"Feature forecasting models trained: {len(models)}")
print(f"Average test sMAPE across features: {avg_smape:.4f}")
print("Individual XGBoost models optimized for each predictable feature")



Step 2: Training feature models...
Training on 358 samples
Testing on 90 samples
Using 57 lagged features

Training model for Dow_Jones (1/19)...
  Train RMSE: 144.058576, Test RMSE: 867.034357
  Train sMAPE: 0.00%, Test sMAPE: 0.01% 🟢
  Performance: EXCELLENT

Training model for High (2/19)...
  Train RMSE: 2.617526, Test RMSE: 12.617349
  Train sMAPE: 0.00%, Test sMAPE: 0.01% 🟢
  Performance: VERY GOOD

Training model for Low (3/19)...
  Train RMSE: 1.803694, Test RMSE: 15.414154
  Train sMAPE: 0.00%, Test sMAPE: 0.02% 🟢
  Performance: VERY GOOD

Training model for NASDAQ (4/19)...
  Train RMSE: 21.902998, Test RMSE: 123.456737
  Train sMAPE: 0.00%, Test sMAPE: 0.01% 🟢
  Performance: EXCELLENT

Training model for Open (5/19)...
  Train RMSE: 1.506580, Test RMSE: 12.092204
  Train sMAPE: 0.00%, Test sMAPE: 0.01% 🟢
  Performance: VERY GOOD

Training model for S&P_500 (6/19)...
  Train RMSE: 1.510113, Test RMSE: 5.712762
  Train sMAPE: 0.00%, Test sMAPE: 0.01% 🟢
  Performance: VERY GOOD

In [None]:
# Generate historical validation dataset with forecasted features
validation_data = create_historical_validation_data(
    df, df_lagged, models, scalers,
    predictable_features, forward_filled_features, lag_periods
)
print(f"Historical validation dataset created: {len(validation_data)} observations")
print("Contains forecasted feature values for recent historical periods")
print("Enables validation of forecasting accuracy against known outcomes")

Step 4: Creating historical validation data...
Test set: indices 358 to 448 (size: 90)

GENERATING MULTIPLE VALIDATION DATASETS

1. LAST 5 DAYS VALIDATION:
   Historical start index: 440 (out of 448 total)
   Predicting features for last 5 historical days
Predicting day 1/5...
Predicting day 2/5...
Predicting day 3/5...
Predicting day 4/5...
Predicting day 5/5...
   Shape: (5, 22)
   Date range: 2025-06-13 00:00:00 to 2025-06-20 00:00:00

2. FIRST 5 DAYS OF TEST SET:
   Start index: 355 (predicting from test index 358)
Predicting day 1/5...
Predicting day 2/5...
Predicting day 3/5...
Predicting day 4/5...
Predicting day 5/5...
   Shape: (5, 22)
   Date range: 2025-02-11 00:00:00 to 2025-02-18 00:00:00
Predicting day 1/5...
Predicting day 2/5...
Predicting day 3/5...
Predicting day 4/5...
Predicting day 5/5...
Generated Test_Days_15-19: (5, 22)
Predicting day 1/5...
Predicting day 2/5...
Predicting day 3/5...
Predicting day 4/5...
Predicting day 5/5...
Generated Test_Days_30-34: (5, 22)
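
The window labels in this log are consistent with 5-day validation windows starting every 15 days in the 90-sample test set, plus one window over the final 5 days. The sketch below reproduces that schedule; the stride of 15 is inferred from the labels, and `validation_windows` is a hypothetical helper, not a function from `feature_forecasting`.

```python
def validation_windows(test_size: int, horizon: int = 5, stride: int = 15):
    """Return (label, start, stop) offsets into the test set; `stop` is exclusive."""
    windows = []
    for start in range(0, test_size - horizon + 1, stride):
        if start == 0:
            label = f"First_{horizon}_Test_Days"
        else:
            label = f"Test_Days_{start}-{start + horizon - 1}"
        windows.append((label, start, start + horizon))
    # One extra window covering the final `horizon` observations.
    windows.append((f"Last_{horizon}_Days", test_size - horizon, test_size))
    return windows

for label, start, stop in validation_windows(test_size=90):
    print(f"{label:<20} offsets {start}..{stop - 1}")
```

For a test set of 90 samples this yields 7 windows and 35 validation rows in total, matching the counts reported by the diagnostic step.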

In [None]:
# Generate future prediction dataset for forward-looking analysis
future_data = create_future_prediction_data(
    df, df_lagged, models, scalers,
    predictable_features, forward_filled_features, lag_periods, prediction_horizon
)
print(f"Future prediction dataset created: {len(future_data)} forecast periods")
print("Contains forecasted feature values for forward-looking price prediction")


In [4]:
# Step 5: Diagnostic Analysis
print("Step 5: Diagnostic Analysis")
diagnostic_analysis(validation_data, df, predictable_features)
print("=" * 80)
print("Diagnostic analysis complete")
print("=" * 80)

Step 5: Diagnostic Analysis
📊 COMPARISON DATASET:
   • Total comparison samples: 35
   • Validation periods: 7
   • Date range: 2025-02-11 00:00:00 to 2025-06-20 00:00:00
   • Features to compare: 19
🔍 DIAGNOSTIC ANALYSIS: MODEL PREDICTION ISSUES
🎯 ANALYZING TEMPORAL PREDICTION PATTERNS:
------------------------------------------------------------

📊 HIGH ANALYSIS:
Period-wise Analysis:
  First_5_Test_Days        : Pred= 417.2 | Actual= 409.2 | Diff=  -8.0 | Error= -2.0% 🟢 GOOD
  Last_5_Days              : Pred= 453.0 | Actual= 480.6 | Diff= +27.6 | Error= +5.8% 🟡 MODERATE
  Test_Days_15-19          : Pred= 402.4 | Actual= 393.5 | Diff=  -9.0 | Error= -2.3% 🟢 GOOD
  Test_Days_30-34          : Pred= 413.0 | Actual= 386.6 | Diff= -26.4 | Error= -6.8% 🟡 MODERATE
  Test_Days_45-49          : Pred= 408.4 | Actual= 373.0 | Diff= -35.4 | Error= -9.5% 🟡 MODERATE
  Test_Days_60-64          : Pred= 436.5 | Actual= 446.8 | Diff= +10.4 | Error= +2.3% 🟢 GOOD
  Test_Days_75-79          : Pred= 453.7
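
The sign convention in these rows can be read off the logged numbers: `Diff` is actual minus predicted, and `Error` is that difference as a percentage of the actual value. The sketch below is inferred from the log and is not the code of `diagnostic_analysis`; it leaves out the GOOD/MODERATE labels because their thresholds are not shown.

```python
def period_error(pred_mean: float, actual_mean: float):
    """Return (diff, pct_error), with diff = actual - predicted and pct relative to actual."""
    diff = actual_mean - pred_mean
    return diff, 100 * diff / actual_mean

# Values taken from the First_5_Test_Days row of the log.
diff, pct = period_error(417.2, 409.2)
print(f"Diff={diff:+.1f} | Error={pct:+.1f}%")
```

Under this convention a negative error means the model over-predicted the period and a positive error means it under-predicted.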

In [5]:
# Step 5: Create Future Prediction Data
print("Step 5: Creating future prediction data...")
future_data = create_future_prediction_data(
    df, df_lagged, models, scalers,
    predictable_features, forward_filled_features, lag_periods, prediction_horizon
)
print(f"   • Future predictions: {len(future_data)}")

Step 5: Creating future prediction data...
Generating future prediction database...
Future start index: 445 (out of 448 total)
Predicting features for next 5 days beyond the dataset
Predicting day 1/5...
Predicting day 2/5...
Predicting day 3/5...
Predicting day 4/5...
Predicting day 5/5...

Adding forward-filled features from last available day...
Forward-filled features added: ['Fed_Funds_Rate', 'CPI', 'PPI', 'Unemployment_Rate', 'Industrial_Production', 'Consumer_Sentiment', 'Housing_Starts', 'CPI_YoY_Change', 'PPI_YoY_Change', 'Yield_Curve_Spread', 'Fed_10Y_Spread', 'Treasury_10Y_DoD', 'Treasury_3M_DoD']
Updated shape: (5, 33)

Future prediction database shape: (5, 34)

Sample of future predictions:
      Dow_Jones        High         Low       NASDAQ        Open     S&P_500  \
0  42386.365221  454.116652  439.958096  5921.096241  452.205844  233.421314   
1  42336.744060  450.609565  440.636304  5870.751604  448.351474  230.255856   
2  42417.733837  447.490311  434.796700  5875.7
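
The "Predicting day 1/5..." lines suggest that forecasts are produced one day at a time. A common way to do this with lag-based models is recursive forecasting, where each prediction is fed back in as a lag for the next step. Whether `create_future_prediction_data` works exactly this way is an assumption. The sketch below is univariate and uses a toy stand-in for a trained model, whereas the notebook forecasts 19 features whose lags feed one another.

```python
from collections import deque

def recursive_forecast(history, predict_next, lags=(1, 2, 3), horizon=5):
    """Roll a one-step model forward `horizon` steps, feeding predictions back in as lags."""
    max_lag = max(lags)
    if len(history) < max_lag:
        raise ValueError("history must contain at least max(lags) observations")
    # Keep only the most recent values needed to build the lag inputs.
    window = deque(history[-max_lag:], maxlen=max_lag)
    forecasts = []
    for _ in range(horizon):
        # window[-k] is the value k steps before the one being predicted.
        lag_values = [window[-k] for k in lags]
        next_value = predict_next(lag_values)
        forecasts.append(next_value)
        window.append(next_value)  # the prediction becomes lag 1 for the next step
    return forecasts

def mean_model(lag_values):
    """Toy stand-in for a trained model: the mean of the lagged inputs."""
    return sum(lag_values) / len(lag_values)

print(recursive_forecast([10.0, 11.0, 12.0], mean_model))
```

Because later steps depend on earlier predictions rather than on observed values, errors can compound across the horizon, which is one reason to check multi-day windows against known outcomes as the validation step does.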

In [6]:
# Step 6: Save Results
print("Step 6: Saving results...")
import os
output_dir = f'../data/predicted_features/{ticker}'
os.makedirs(output_dir, exist_ok=True)

validation_data.to_csv(f'{output_dir}/historical_validation_features_combined.csv', index=False)
future_data.to_csv(f'{output_dir}/future_prediction_features.csv', index=False)

print("✅ Feature Forecasting Completed Successfully!")
print(f"📂 Results saved to: {output_dir}/")

Step 6: Saving results...
✅ Feature Forecasting Completed Successfully!
📂 Results saved to: ../data/predicted_features/MSFT/
