# HULL TACTICAL MARKET PREDICTION - SUBMISSION NOTEBOOK

This notebook trains a Linear Regression model for tactical market prediction and sets up an inference server following the official competition format.

**Workflow:**
1. Load and prepare training data
2. Train Linear Regression model
3. Save model artifacts
4. Define prediction function
5. Start inference server

In [18]:
import os
import pandas as pd
import numpy as np
import joblib
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import polars as pl

try:
    import kaggle_evaluation.default_inference_server as ke
except ImportError:
    ke = None
    print("⚠ kaggle_evaluation not installed (not needed for local testing)")

warnings.filterwarnings('ignore')

print("✓ All libraries imported successfully")

⚠ kaggle_evaluation not installed (not needed for local testing)
✓ All libraries imported successfully


IF RUNNING ON KAGGLE, DON'T FORGET TO SWITCH TO THE KAGGLE CONFIG BELOW

In [19]:
# Configuration Kaggle Notebook Version
# CONFIG = {
#     'train_data_path': '/kaggle/input/hull-tactical-market-prediction/train.csv',
#     'input_dir': '/kaggle/input/hull-tactical-market-prediction/',
#     'model_save_path': '/tmp/model.joblib',
#     'scaler_save_path': '/tmp/scaler.joblib',
#     'features_save_path': '/tmp/feature_cols.joblib',
#     'target_column': 'forward_returns',
#     'prediction_bounds': (0.0, 2.0),
# }

from pathlib import Path

# Path.cwd() is the current working directory; this assumes the notebook
# is launched from its own folder, one level below the project root.
NOTEBOOK_DIR = Path.cwd()
PROJECT_DIR = NOTEBOOK_DIR.parent

# Configuration for local runs
CONFIG = {
    'train_data_path': PROJECT_DIR / 'data' / 'train.csv',
    'test_data_path': PROJECT_DIR / 'data' / 'test.csv',
    'input_dir': PROJECT_DIR / 'data',
    
    # Save models in artifacts folder
    'model_save_path': PROJECT_DIR / 'artifacts' / 'model.joblib',
    'scaler_save_path': PROJECT_DIR / 'artifacts' / 'scaler.joblib',
    'features_save_path': PROJECT_DIR / 'artifacts' / 'feature_cols.joblib',
    
    'target_column': 'forward_returns',
    'prediction_bounds': (0.0, 2.0),
}

print("✓ Configuration loaded")

✓ Configuration loaded
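Switching between the two configurations by hand is easy to forget. As a minimal sketch, the choice could be made automatically by checking whether the Kaggle input folder exists. The helper `build_config` and the name `CONFIG_SKETCH` are hypothetical and not part of the notebook; the paths are copied from the two configurations above.

```python
from pathlib import Path

# Kaggle input folder, taken from the commented-out configuration above
KAGGLE_INPUT = Path('/kaggle/input/hull-tactical-market-prediction')


def build_config(on_kaggle: bool, project_dir: Path) -> dict:
    """Hypothetical helper: return Kaggle paths or local project paths."""
    if on_kaggle:
        data_dir, artifact_dir = KAGGLE_INPUT, Path('/tmp')
    else:
        data_dir, artifact_dir = project_dir / 'data', project_dir / 'artifacts'
    return {
        'train_data_path': data_dir / 'train.csv',
        'input_dir': data_dir,
        'model_save_path': artifact_dir / 'model.joblib',
        'scaler_save_path': artifact_dir / 'scaler.joblib',
        'features_save_path': artifact_dir / 'feature_cols.joblib',
        'target_column': 'forward_returns',
        'prediction_bounds': (0.0, 2.0),
    }


# Assumption: the Kaggle input folder only exists inside a Kaggle session
CONFIG_SKETCH = build_config(KAGGLE_INPUT.exists(), Path.cwd().parent)
```

With this approach the same notebook could run in both environments without editing the configuration cell.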


In [20]:
print("\n" + "="*70)
print("TRAINING PHASE")
print("="*70)

# Load training data
train_df = pd.read_csv(CONFIG['train_data_path'])
print(f"✓ Loaded training data: {train_df.shape}")
print(f"  - Rows: {train_df.shape[0]:,}")
print(f"  - Columns: {train_df.shape[1]}")


TRAINING PHASE
✓ Loaded training data: (9021, 98)
  - Rows: 9,021
  - Columns: 98


In [21]:
# Drop columns
# Candidate columns: everything except the target and identifier columns
all_feature_cols = [c for c in train_df.columns if c not in ['date', 'forward_returns', 'ID']]

# Keep only columns with less than 30% missing values
threshold = 0.30
FEATURE_COLS = train_df.columns[train_df.isnull().mean() < threshold].tolist()

# Columns to exclude (not in the test set, or targets/IDs)
exclude_cols = [
    'date',
    'date_id',
    'forward_returns',
    'ID',
    'market_forward_excess_returns',
    'risk_free_rate',  # leaving this column in previously caused an error
    'is_scored',
    'lagged_forward_returns',
    'lagged_risk_free_rate',
    'lagged_market_forward_excess_returns'
]

FEATURE_COLS = [c for c in FEATURE_COLS if c not in exclude_cols]
print(f"Columns dropped: {set(all_feature_cols) - set(FEATURE_COLS)}")

Columns dropped: {'E7', 'M6', 'S8', 'market_forward_excess_returns', 'M5', 'date_id', 'risk_free_rate', 'V9', 'S3', 'V10', 'M1', 'M2', 'M13', 'M14', 'S12'}
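The selection rule in the cell above (keep columns whose share of missing values is below the threshold, then remove targets and identifiers) can be illustrated on a toy frame. The column names `a`, `b`, and `c` are made up for this sketch; only `forward_returns` and the 0.30 threshold come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is 50% missing, "a" and "c" are complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [0.1, 0.2, 0.3, 0.4],
    "forward_returns": [0.01, -0.02, 0.00, 0.03],
})

threshold = 0.30
missing_share = toy.isnull().mean()  # fraction of NaN per column

# Keep columns below the threshold, then remove the target column
kept = missing_share[missing_share < threshold].index.tolist()
kept = [c for c in kept if c != "forward_returns"]

# Everything that is neither kept nor the target was dropped
dropped = sorted(set(toy.columns) - set(kept) - {"forward_returns"})
print(f"kept={kept}, dropped={dropped}")
```

Here `b` is dropped because half of its values are missing, which exceeds the 30% threshold.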


In [22]:
# Extract features and target
X_train = train_df[FEATURE_COLS].copy()
y_train = train_df[CONFIG['target_column']].copy()

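# Handle missing values: forward fill, then backward fill any leading gaps.
# Note: bfill copies later observations into earlier rows (look-ahead).
X_train = X_train.ffill().bfill()

# Drop rows where any feature or the target is still NaN
valid_idx = ~(X_train.isnull().any(axis=1) | y_train.isnull())
X_train = X_train[valid_idx]
y_train = y_train[valid_idx]

print(f"\nTraining samples after cleaning: {len(X_train)}")

# Standardize features (zero mean, unit variance) using training statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)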



Training samples after cleaning: 9021
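Training fills gaps with `ffill().bfill()`, while `predict` receives one row at a time, where only the final `fillna(0)` can have any effect. A single shared helper would keep the two paths consistent and avoid back-filling from later rows. The sketch below is an assumption about how that could look, not the notebook's method; `fill_missing` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: forward fill, then use 0.0 for leading gaps."""
    return df.ffill().fillna(0.0)


toy = pd.DataFrame({"x": [np.nan, np.nan, 3.0, np.nan, 5.0]})

# What the training cell does: leading NaNs take the value of a LATER row
with_bfill = toy.ffill().bfill()

# Alternative: never look ahead, same rule usable on a single test row
no_lookahead = fill_missing(toy)

print(with_bfill["x"].tolist())
print(no_lookahead["x"].tolist())
```

Whether 0.0 is a sensible fallback depends on the feature; a training-set median is another common choice.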


In [23]:
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# In-sample R²: measured on the same rows the model was fitted on
train_r2 = model.score(X_train_scaled, y_train)

print(f"\n✓ Training R²: {train_r2:.6f}")
print(f"✓ Model trained with {len(FEATURE_COLS)} features")

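# Save model artifacts
print("\n✓ Saving model artifacts...")

# Create the artifacts folder if it does not exist yet
os.makedirs(os.path.dirname(str(CONFIG['model_save_path'])), exist_ok=True)

joblib.dump(model, str(CONFIG['model_save_path']))
joblib.dump(scaler, str(CONFIG['scaler_save_path']))
joblib.dump(FEATURE_COLS, str(CONFIG['features_save_path']))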

print(f"  - Model saved to: {CONFIG['model_save_path']}")
print(f"  - Scaler saved to: {CONFIG['scaler_save_path']}")
print(f"  - Features saved to: {CONFIG['features_save_path']}")
print("✓ Model saved")


✓ Training R²: 0.022070
✓ Model trained with 82 features

✓ Saving model artifacts...
  - Model saved to: c:\Users\Zehra Marziya Cengiz\Desktop\HullTactical\artifacts\model.joblib
  - Scaler saved to: c:\Users\Zehra Marziya Cengiz\Desktop\HullTactical\artifacts\scaler.joblib
  - Features saved to: c:\Users\Zehra Marziya Cengiz\Desktop\HullTactical\artifacts\feature_cols.joblib
✓ Model saved
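The training R² reported above is computed on the rows the model was fitted on, so it says little about out-of-sample performance. A chronological hold-out (fit on the earliest rows, score on the latest, no shuffling) gives a more honest estimate. The sketch below uses synthetic noise and NumPy least squares as a stand-in for `LinearRegression`; the 80/20 split and all function names are assumptions for illustration.

```python
import numpy as np


def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def fit_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares fit with an intercept column prepended."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef


def predict_ols(coef: np.ndarray, X: np.ndarray) -> np.ndarray:
    return np.column_stack([np.ones(len(X)), X]) @ coef


# Synthetic data: the target is pure noise, unrelated to the features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.normal(scale=0.01, size=500)

# Chronological split: first 80% for fitting, last 20% for validation
split = int(len(X) * 0.8)
coef = fit_ols(X[:split], y[:split])

r2_train = r2(y[:split], predict_ols(coef, X[:split]))
r2_valid = r2(y[split:], predict_ols(coef, X[split:]))
print(f"train R2: {r2_train:.4f}, validation R2: {r2_valid:.4f}")
```

Even though the target here is noise, the in-sample R² of a least-squares fit with an intercept cannot be negative, which is why a small positive training R² is not evidence of predictive power by itself.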


In [24]:
# Reload the saved artifacts so inference uses exactly what was written to disk
model = joblib.load(CONFIG['model_save_path'])
scaler = joblib.load(CONFIG['scaler_save_path'])
FEATURE_COLS = joblib.load(CONFIG['features_save_path'])

THE FOLLOWING CODE USES THE KAGGLE COMPETITION DEMO SUBMISSION FORMAT

In [25]:
def predict(test: pl.DataFrame) -> float:
    """
    Make a prediction for market allocation.
    
    Called once per day by Kaggle server.
    
    Parameters
    ----------
    test : pl.DataFrame
        Input data from Kaggle (Polars format)
        Contains features for ONE trading day
    
    Returns
    -------
    float
        Predicted market allocation weight (clipped to [0, 2])
    """
    # Convert to pandas
    test_df = test.to_pandas()
    
    # Extract features
    X_test = test_df[FEATURE_COLS].copy()
    
    # Handle missing values (forward fill, backward fill, then zeros).
    # With a single row, ffill/bfill have nothing to copy from, so any
    # remaining NaN becomes 0.
    X_test = X_test.ffill().bfill().fillna(0)
    
    # Scale features using training scaler
    X_test_scaled = scaler.transform(X_test)
    
    # Make prediction
    prediction = model.predict(X_test_scaled)[0]
    
    # Clip to valid range [0, 2]
    final_allocation = float(np.clip(
        prediction,
        CONFIG['prediction_bounds'][0],
        CONFIG['prediction_bounds'][1]
    ))
    
    return final_allocation

print("✓ Prediction function defined")

✓ Prediction function defined
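Indexing with `test_df[FEATURE_COLS]` raises a `KeyError` if the test frame lacks any training column. One defensive option, sketched here with pandas only, is `reindex`, which returns the columns in training order and inserts NaN for absent ones. The feature names and the helper `select_features` are hypothetical.

```python
import numpy as np
import pandas as pd

FEATURES = ["E1", "M3", "V2"]  # hypothetical feature names in training order


def select_features(row_df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Return feature_cols in order; absent or NaN entries become 0.0."""
    return row_df.reindex(columns=feature_cols).fillna(0.0)


# One trading day: "M3" is absent, "E1" is NaN, "date_id" is not a feature
one_day = pd.DataFrame({"date_id": [8957], "V2": [0.4], "E1": [np.nan]})

X = select_features(one_day, FEATURES)
print(X)
```

Silently filling a missing column hides schema problems, so logging the missing names before filling may be worthwhile.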


REQUIRED FOR KAGGLE COMPETITION SUBMISSION (UNCOMMENT WHEN RUNNING ON KAGGLE)

In [26]:
# inference_server = ke.DefaultInferenceServer(predict)

# if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
#     print("✓ Running on hidden test set...\n")
#     inference_server.serve()
# else:
#     print("✓ Running locally on test.csv...\n")
#     inference_server.run_local_gateway((CONFIG['input_dir'],))

SUBMISSION ANALYSIS

WHAT HAPPENED:
- Trained Linear Regression on 20 features
- Model learned to predict 0.0 (safest bet, but useless)
- Score: 0.288 (low because model has no predictive power) 
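
One possible contributor to the all-zero output: the model is trained on `forward_returns`, but `predict` clips the raw prediction to [0, 2] as if it were already an allocation weight. Any negative predicted return becomes exactly 0.0 and a small positive one stays close to 0. The sketch below contrasts direct clipping with a simple linear mapping from predicted return to allocation. The mapping, its `scale` and `base` parameters, and the sample values are illustrative assumptions, not the competition's prescribed method, and would need tuning.

```python
import numpy as np


def returns_to_allocation(pred_returns, scale=100.0, base=1.0, bounds=(0.0, 2.0)):
    """Hypothetical mapping: base + scale * predicted return, clipped to bounds."""
    pred = np.asarray(pred_returns, dtype=float)
    return np.clip(base + scale * pred, bounds[0], bounds[1])


# Made-up predicted daily returns
preds = np.array([-0.004, 0.0, 0.003, 0.02])

# What predict() does now: clip the return itself
direct_clip = np.clip(preds, 0.0, 2.0)

# Alternative: start from a neutral weight of 1.0 and tilt by the signal
mapped = returns_to_allocation(preds)

print("direct clip:", direct_clip)
print("mapped:     ", mapped)
```

With direct clipping the allocations stay at or near zero, while the mapped version spreads across the allowed range.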

WHY SCORE IS LOW:
1. Linear Regression is too simple for market prediction
2. Only 20 features selected (missing important signals)
3. No feature engineering (no momentum, volatility composites, etc.)
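
As a rough illustration of the third point, momentum and volatility features can be derived from a return-like series with rolling windows. The column name `ret`, the window length, and the helper `add_rolling_features` are hypothetical. The series is shifted by one row first so that each row only uses earlier observations.

```python
import numpy as np
import pandas as pd


def add_rolling_features(df: pd.DataFrame, col: str, window: int = 5) -> pd.DataFrame:
    """Add rolling mean (momentum) and rolling std (volatility) of past values."""
    out = df.copy()
    past = out[col].shift(1)  # exclude the current row from its own window
    out[f"{col}_mom_{window}"] = past.rolling(window).mean()
    out[f"{col}_vol_{window}"] = past.rolling(window).std()
    return out


# Made-up return series
toy = pd.DataFrame({"ret": [0.01, -0.02, 0.015, 0.0, 0.005, -0.01, 0.02]})
feats = add_rolling_features(toy, "ret", window=3)
print(feats)
```

Because `predict` receives one trading day at a time, using such features at inference would also require keeping a buffer of recent rows.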

THE KAGGLE COMPETITION REQUIRES THE FOLLOWING STRUCTURE FOR THE OUTPUT FILE
Example:
   date_id  prediction
0     8957         0.0
1     8958         0.0
2     8959         0.0
3     8960         0.0
4     8961         0.0
5     8962         0.0
6     8963         0.0
7     8964         0.0
8     8965         0.0
9     8966         0.0

