# Competition-Aware S&P 500 Position Optimization

## Overview
This notebook builds models whose outputs are portfolio positions tuned for the competition metric: a volatility-adjusted Sharpe ratio that penalizes excess volatility and returns that lag the market.

**Key Changes from Baseline:**
- **Target**: Positions in the range 0-2, derived from return predictions rather than raw return forecasts
- **Objective**: The prediction-to-position mapping is grid-searched against the competition metric (adjusted Sharpe ratio); the base estimators themselves are still fit on forward returns
- **Features**: Full feature set from imputed data
- **Evaluation**: Risk-adjusted performance metrics
- **Strategy**: Portfolio optimization, not prediction accuracy

**Competition Metric:**
```
adjusted_sharpe = sharpe_ratio / (volatility_penalty * return_penalty)
```

Where:
- `volatility_penalty = 1 + max(0, strategy_vol/market_vol - 1.2)`
- `return_penalty = 1 + (max(0, market_return - strategy_return) * 100 * 252)²/100`
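
As a rough illustration of how the two penalties combine, the sketch below plugs made-up numbers into the formulas above. The helper name `penalty_factors` and every input value are hypothetical; volatilities are taken as annualized percentages and returns as mean daily excess returns, which is how the implementation in Section 1 treats them.

```python
def penalty_factors(strategy_vol, market_vol, strategy_ret, market_ret,
                    trading_days=252):
    """Return (volatility_penalty, return_penalty) for the formulas above.

    Hypothetical helper for illustration only. Volatilities are annualized
    percentages; returns are mean daily excess returns (as fractions).
    """
    # No penalty until strategy volatility exceeds 1.2x market volatility.
    vol_penalty = 1 + max(0.0, strategy_vol / market_vol - 1.2)
    # Shortfall versus the market, in annualized percentage points.
    return_gap = max(0.0, (market_ret - strategy_ret) * 100 * trading_days)
    return_penalty = 1 + return_gap ** 2 / 100
    return vol_penalty, return_penalty


# Made-up inputs: the strategy runs at 1.5x market volatility and trails the
# market by 0.0001 per day (about 2.52 percentage points per year).
vol_pen, ret_pen = penalty_factors(
    strategy_vol=24.0, market_vol=16.0,
    strategy_ret=0.0003, market_ret=0.0004,
)
print(f"volatility penalty: {vol_pen:.4f}")
print(f"return penalty:     {ret_pen:.4f}")
print(f"Sharpe divisor:     {vol_pen * ret_pen:.4f}")
```

Because the return gap is squared, a small shortfall costs little while a large one dominates the divisor; the volatility penalty grows only linearly once the 1.2x threshold is crossed.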

## 1. Setup and Competition Metric Implementation

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array

# Advanced models
try:
    import xgboost as xgb
    import lightgbm as lgb
    ADVANCED_MODELS = True
    print("✅ Advanced models (XGBoost, LightGBM) available")
except ImportError:
    ADVANCED_MODELS = False
    print("⚠️ Advanced models not available")

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully")

✅ Advanced models (XGBoost, LightGBM) available
✅ Libraries imported successfully


In [2]:
# Implement the competition metric exactly as provided
def calculate_competition_metric(positions, forward_returns, risk_free_rates, verbose=False):
    """
    Calculate the competition's volatility-adjusted Sharpe ratio.
    
    Args:
        positions: Array of position weights (0-2)
        forward_returns: Array of market forward returns
        risk_free_rates: Array of risk-free rates
        verbose: Print detailed calculations
    
    Returns:
        float: Adjusted Sharpe ratio (competition metric)
    """
    
    # Ensure arrays
    positions = np.array(positions)
    forward_returns = np.array(forward_returns)
    risk_free_rates = np.array(risk_free_rates)
    
    # Validate position constraints
    MIN_INVESTMENT = 0
    MAX_INVESTMENT = 2
    
    if positions.max() > MAX_INVESTMENT or positions.min() < MIN_INVESTMENT:
        if verbose:
            print(f"⚠️ Position constraint violation: [{positions.min():.4f}, {positions.max():.4f}]")
        return -1000  # Heavy penalty for constraint violation
    
    # Calculate strategy returns
    strategy_returns = risk_free_rates * (1 - positions) + positions * forward_returns
    
    # Calculate strategy's Sharpe ratio
    strategy_excess_returns = strategy_returns - risk_free_rates
    strategy_excess_cumulative = (1 + strategy_excess_returns).prod()
    strategy_mean_excess_return = strategy_excess_cumulative ** (1 / len(strategy_returns)) - 1
    strategy_std = strategy_returns.std()
    
    trading_days_per_yr = 252
    
    if strategy_std == 0:
        return -1000  # Penalty for zero volatility
    
    sharpe = strategy_mean_excess_return / strategy_std * np.sqrt(trading_days_per_yr)
    strategy_volatility = float(strategy_std * np.sqrt(trading_days_per_yr) * 100)
    
    # Calculate market return and volatility
    market_excess_returns = forward_returns - risk_free_rates
    market_excess_cumulative = (1 + market_excess_returns).prod()
    market_mean_excess_return = market_excess_cumulative ** (1 / len(forward_returns)) - 1
    market_std = forward_returns.std()
    market_volatility = float(market_std * np.sqrt(trading_days_per_yr) * 100)
    
    if market_volatility == 0:
        return -1000  # Penalty for zero market volatility
    
    # Calculate the volatility penalty
    excess_vol = max(0, strategy_volatility / market_volatility - 1.2) if market_volatility > 0 else 0
    vol_penalty = 1 + excess_vol
    
    # Calculate the return penalty
    return_gap = max(
        0,
        (market_mean_excess_return - strategy_mean_excess_return) * 100 * trading_days_per_yr,
    )
    return_penalty = 1 + (return_gap**2) / 100
    
    # Adjust the Sharpe ratio by the volatility and return penalty
    adjusted_sharpe = sharpe / (vol_penalty * return_penalty)
    
    if verbose:
        print(f"📊 Competition Metric Breakdown:")
        print(f"   • Strategy Sharpe: {sharpe:.4f}")
        print(f"   • Strategy Volatility: {strategy_volatility:.2f}%")
        print(f"   • Market Volatility: {market_volatility:.2f}%")
        print(f"   • Volatility Penalty: {vol_penalty:.4f}")
        print(f"   • Return Penalty: {return_penalty:.4f}")
        print(f"   • Adjusted Sharpe: {adjusted_sharpe:.4f}")
    
    return min(float(adjusted_sharpe), 1_000_000)

# Additional helper functions
def calculate_hit_rate(y_true, y_pred):
    """Calculate directional accuracy"""
    return np.mean(np.sign(y_true) == np.sign(y_pred))

def calculate_strategy_volatility(positions, forward_returns, risk_free_rates):
    """Calculate annualized strategy volatility"""
    strategy_returns = risk_free_rates * (1 - positions) + positions * forward_returns
    return strategy_returns.std() * np.sqrt(252) * 100

def calculate_strategy_return(positions, forward_returns, risk_free_rates):
    """Calculate annualized strategy return"""
    strategy_returns = risk_free_rates * (1 - positions) + positions * forward_returns
    cumulative_return = (1 + strategy_returns).prod()
    return (cumulative_return ** (252 / len(strategy_returns)) - 1) * 100

print("✅ Competition metric functions implemented")

✅ Competition metric functions implemented
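
A quick way to build intuition for the position range is to look at constant positions. The sketch below uses synthetic returns (not competition data) and a constant risk-free rate, both assumptions made purely for illustration, and applies the same blend of risk-free and market returns used in `calculate_competition_metric`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 1000
forward_returns = rng.normal(0.0004, 0.01, size=n_days)  # synthetic market returns
risk_free_rates = np.full(n_days, 0.0001)                # constant rate, for simplicity

market_std = forward_returns.std()
vol_ratios = {}
for position in (0.0, 1.0, 2.0):
    # Same blend of cash and market exposure as in the metric function
    strategy_returns = risk_free_rates * (1 - position) + position * forward_returns
    vol_ratios[position] = strategy_returns.std() / market_std
    vol_penalty = 1 + max(0.0, vol_ratios[position] - 1.2)
    print(f"position={position:.1f}  vol ratio={vol_ratios[position]:.2f}  "
          f"vol penalty={vol_penalty:.2f}")
```

With a constant risk-free rate, a position of 0 earns only that rate, so strategy volatility collapses toward zero (the metric function returns -1000 when the standard deviation is exactly zero). A position of 1 reproduces the market, and a constant position of 2 doubles volatility, which puts the volatility penalty at about 1.8.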


## 2. Data Loading and Preparation

In [3]:
# Load datasets
print("📊 LOADING DATASETS")
print("=" * 50)

# Load clean imputed training data (our best shot at good features)
try:
    df_train_imputed = pd.read_csv('../data/cleaned/train_imputed.csv')
    print(f"✅ Clean training data loaded: {df_train_imputed.shape}")
    print(f"   • Missing values: {df_train_imputed.isnull().sum().sum():,}")
    USE_IMPUTED = True
except FileNotFoundError:
    print("⚠️ Clean imputed data not found, using original training data")
    df_train_imputed = pd.read_csv('../data/raw/train.csv')
    USE_IMPUTED = False

# Load test data
df_test = pd.read_csv('../data/raw/test.csv')
print(f"✅ Test data loaded: {df_test.shape}")
print(f"   • Missing values: {df_test.isnull().sum().sum():,}")

# Check for required columns
required_cols = ['forward_returns', 'risk_free_rate']
missing_cols = [col for col in required_cols if col not in df_train_imputed.columns]

if missing_cols:
    print(f"❌ Missing required columns: {missing_cols}")
    # Try alternative names
    if 'market_forward_excess_returns' in df_train_imputed.columns:
        print("   Using market_forward_excess_returns as proxy for forward_returns")
        if 'forward_returns' not in df_train_imputed.columns:
            df_train_imputed['forward_returns'] = df_train_imputed['market_forward_excess_returns']
else:
    print(f"✅ All required columns present")

print(f"\n📋 Available targets: {[col for col in df_train_imputed.columns if 'return' in col.lower() or 'rate' in col.lower()]}")

📊 LOADING DATASETS


✅ Clean training data loaded: (8990, 100)
   • Missing values: 110,204
✅ Test data loaded: (10, 99)
   • Missing values: 0
✅ All required columns present

📋 Available targets: ['forward_returns', 'risk_free_rate', 'market_forward_excess_returns']


In [4]:
# Feature engineering and preparation
print("🔧 FEATURE ENGINEERING")
print("=" * 50)

# Define feature categories based on test data structure
exclude_cols = ['date_id', 'forward_returns', 'market_forward_excess_returns', 'risk_free_rate', 
                'lagged_forward_returns', 'lagged_risk_free_rate', 'lagged_market_forward_excess_returns',
                'is_scored']

# Get feature columns that exist in both train and test
train_features = [col for col in df_train_imputed.columns if col not in exclude_cols]
test_features = [col for col in df_test.columns if col not in exclude_cols]
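# Preserve training-column order so that slices such as top_50 are reproducible
test_feature_set = set(test_features)
common_features = [col for col in train_features if col in test_feature_set]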

print(f"📊 Feature Analysis:")
print(f"   • Training features: {len(train_features)}")
print(f"   • Test features: {len(test_features)}")
print(f"   • Common features: {len(common_features)}")

# Categorize features by type
feature_categories = {
    'D_features': [f for f in common_features if f.startswith('D')],
    'E_features': [f for f in common_features if f.startswith('E')],
    'I_features': [f for f in common_features if f.startswith('I')],
    'M_features': [f for f in common_features if f.startswith('M')],
    'P_features': [f for f in common_features if f.startswith('P')],
    'S_features': [f for f in common_features if f.startswith('S')],
    'V_features': [f for f in common_features if f.startswith('V')]
}

print(f"\n📈 Feature Categories:")
for category, features in feature_categories.items():
    print(f"   • {category}: {len(features)} features")
    if len(features) <= 5:
        print(f"     {features}")

# Create strategic feature sets for position optimization
feature_sets = {
    'all_features': common_features,
    'volatility_focused': feature_categories['V_features'] + feature_categories['M_features'],
    'price_focused': feature_categories['P_features'] + feature_categories['E_features'],
    'binary_signals': feature_categories['D_features'],
    'top_50': common_features[:50] if len(common_features) >= 50 else common_features
}

# Filter out empty feature sets
feature_sets = {name: features for name, features in feature_sets.items() if len(features) > 0}

print(f"\n🎯 Strategic Feature Sets:")
for name, features in feature_sets.items():
    print(f"   • {name}: {len(features)} features")

print(f"\n✅ Feature engineering complete!")

🔧 FEATURE ENGINEERING
📊 Feature Analysis:
   • Training features: 96
   • Test features: 94
   • Common features: 94

📈 Feature Categories:
   • D_features: 9 features
   • E_features: 20 features
   • I_features: 9 features
   • M_features: 18 features
   • P_features: 13 features
   • S_features: 12 features
   • V_features: 13 features

🎯 Strategic Feature Sets:
   • all_features: 94 features
   • volatility_focused: 31 features
   • price_focused: 33 features
   • binary_signals: 9 features
   • top_50: 50 features

✅ Feature engineering complete!
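
Since `top_50` is a positional slice of `common_features`, which 50 features it contains depends entirely on how that list is ordered. Python does not guarantee a stable iteration order for a set of strings across interpreter runs (string hashing is randomized by default), so a list built from a set intersection can differ from one session to the next. The sketch below shows an order-preserving alternative; the column names are hypothetical stand-ins for the real ones.

```python
# Hypothetical column lists standing in for the train and test DataFrames
train_cols = ["date_id", "D1", "E1", "M1", "P1", "V1", "forward_returns"]
test_cols = ["date_id", "D1", "E1", "M1", "P1", "V1", "is_scored"]
exclude = {"date_id", "forward_returns", "is_scored"}

# Membership tests use a set, but the result keeps the training-column order
test_set = set(test_cols)
common = [col for col in train_cols if col in test_set and col not in exclude]

print(common)       # same order as train_cols on every run
print(common[:3])   # a positional slice is now reproducible
```

Sorting the intersection would also make the order deterministic, but keeping the original column order has the advantage that the slice follows the layout of the data file.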


## 3. Competition-Aware Model Development

In [5]:
# Create a custom regressor that optimizes for the competition metric
class CompetitionAwareRegressor(BaseEstimator, RegressorMixin):
    """
    A wrapper that trains any regressor to optimize portfolio positions
    for the competition metric instead of prediction accuracy.
    """
    
    def __init__(self, base_estimator, position_bounds=(0, 2), alpha=0.1):
        self.base_estimator = base_estimator
        self.position_bounds = position_bounds
        self.alpha = alpha  # Regularization for extreme positions
        
    def fit(self, X, y, forward_returns=None, risk_free_rates=None):
        """
        Fit the model. If forward_returns and risk_free_rates are provided,
        we can try to optimize positions directly.
        """
        X, y = check_X_y(X, y)
        
        # For simplicity, we'll train the base estimator on returns first
        # Then convert predictions to optimal positions
        self.base_estimator.fit(X, y)
        
        # Store training statistics for position scaling
        predictions = self.base_estimator.predict(X)
        self.prediction_std_ = np.std(predictions)
        self.prediction_mean_ = np.mean(predictions)
        
        # If we have the required data, optimize position scaling
        if forward_returns is not None and risk_free_rates is not None:
            self._optimize_position_scaling(predictions, forward_returns, risk_free_rates)
        else:
            # Default scaling: center around 1.0 (100% market exposure)
            self.position_scale_ = 1.0
            self.position_offset_ = 1.0
        
        return self
    
    def _optimize_position_scaling(self, predictions, forward_returns, risk_free_rates):
        """
        Optimize the scaling from predictions to positions using the competition metric.
        """
        best_score = -np.inf
        best_scale = 1.0
        best_offset = 1.0
        
        # Grid search over scaling parameters
        scales = np.linspace(0.1, 3.0, 20)
        offsets = np.linspace(0.5, 1.5, 15)
        
        for scale in scales:
            for offset in offsets:
                # Convert predictions to positions
                positions = self._predictions_to_positions(predictions, scale, offset)
                
                # Calculate competition metric
                try:
                    score = calculate_competition_metric(positions, forward_returns, risk_free_rates)
                    if score > best_score:
                        best_score = score
                        best_scale = scale
                        best_offset = offset
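                except Exception:  # skip combinations that fail to score; a bare except would also swallow KeyboardInterrupt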
                    continue
        
        self.position_scale_ = best_scale
        self.position_offset_ = best_offset
        self.best_training_score_ = best_score
    
    def _predictions_to_positions(self, predictions, scale=None, offset=None):
        """
        Convert model predictions to valid portfolio positions.
        """
        if scale is None:
            scale = self.position_scale_
        if offset is None:
            offset = self.position_offset_
        
        # Normalize predictions
        if self.prediction_std_ > 0:
            normalized = (predictions - self.prediction_mean_) / self.prediction_std_
        else:
            normalized = predictions - self.prediction_mean_
        
        # Scale and shift to position space
        positions = offset + scale * normalized
        
        # Clip to valid range
        positions = np.clip(positions, self.position_bounds[0], self.position_bounds[1])
        
        return positions
    
    def predict(self, X):
        """
        Predict optimal portfolio positions.
        """
        X = check_array(X)
        
        # Get base predictions
        base_predictions = self.base_estimator.predict(X)
        
        # Convert to positions
        positions = self._predictions_to_positions(base_predictions)
        
        return positions

print("✅ Competition-aware regressor implemented")

# Create model configurations optimized for position prediction
def create_competition_models():
    """Create models optimized for the competition metric."""
    
    base_models = {
        'Ridge': Ridge(alpha=1.0, random_state=42),
        'Lasso': Lasso(alpha=0.1, random_state=42, max_iter=2000),
        'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42, max_iter=2000),
        'RandomForest': RandomForestRegressor(
            n_estimators=100, max_depth=8, random_state=42, n_jobs=-1
        ),
        'GradientBoosting': GradientBoostingRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42
        )
    }
    
    if ADVANCED_MODELS:
        base_models['XGBoost'] = xgb.XGBRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=6, 
            random_state=42, verbosity=0
        )
        base_models['LightGBM'] = lgb.LGBMRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=6,
            random_state=42, verbosity=-1
        )
    
    # Wrap each base model with competition-aware wrapper
    competition_models = {}
    for name, model in base_models.items():
        competition_models[f"Comp_{name}"] = CompetitionAwareRegressor(model)
    
    return competition_models

models = create_competition_models()
print(f"✅ Created {len(models)} competition-aware models")
for name in models.keys():
    print(f"   • {name}")

✅ Competition-aware regressor implemented
✅ Created 7 competition-aware models
   • Comp_Ridge
   • Comp_Lasso
   • Comp_ElasticNet
   • Comp_RandomForest
   • Comp_GradientBoosting
   • Comp_XGBoost
   • Comp_LightGBM
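
To make the prediction-to-position mapping concrete, here is a minimal standalone sketch of the same three steps (standardize, scale and shift, clip). The function `to_positions` is a hypothetical simplification: it standardizes with the statistics of the array it is given, whereas `CompetitionAwareRegressor` stores the mean and standard deviation of the training predictions at fit time and reuses them. The prediction values are made up.

```python
import numpy as np

def to_positions(predictions, scale, offset, bounds=(0.0, 2.0)):
    """Map raw return predictions to positions: z-score, scale/shift, clip."""
    predictions = np.asarray(predictions, dtype=float)
    centered = predictions - predictions.mean()
    std = predictions.std()
    normalized = centered / std if std > 0 else centered
    return np.clip(offset + scale * normalized, bounds[0], bounds[1])

preds = np.array([-0.002, -0.001, 0.0, 0.001, 0.002])  # made-up predictions

gentle = to_positions(preds, scale=0.5, offset=1.0)
aggressive = to_positions(preds, scale=3.0, offset=1.0)

print("scale=0.5:", np.round(gentle, 3))      # stays inside the bounds
print("scale=3.0:", np.round(aggressive, 3))  # saturates at 0 and 2
```

A large `scale` pushes most positions onto the bounds, which turns the strategy into an all-or-double bet and raises its volatility; the grid search over `scale` and `offset` is what trades this off against the metric's penalties.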


## 4. Training and Evaluation Framework

In [6]:
# Prepare training data
print("📊 TRAINING DATA PREPARATION")
print("=" * 50)

# Ensure we have the required target variables
if 'forward_returns' not in df_train_imputed.columns:
    print("❌ forward_returns not found in training data!")
    available_targets = [col for col in df_train_imputed.columns if 'return' in col.lower()]
    print(f"Available targets: {available_targets}")
    if available_targets:
        primary_target = available_targets[0]
        print(f"Using {primary_target} as proxy for forward_returns")
        df_train_imputed['forward_returns'] = df_train_imputed[primary_target]
    else:
        raise ValueError("No suitable target variable found!")

if 'risk_free_rate' not in df_train_imputed.columns:
    print("⚠️ risk_free_rate not found, creating synthetic risk-free rate")
    # Use a small constant risk-free rate as approximation
    df_train_imputed['risk_free_rate'] = 0.0001  # ~2.5% annualized

# Extract target variables
y_forward_returns = df_train_imputed['forward_returns'].copy()
y_risk_free_rate = df_train_imputed['risk_free_rate'].copy()

# Remove rows with missing targets
valid_idx = (~y_forward_returns.isnull()) & (~y_risk_free_rate.isnull())
print(f"📈 Target Variable Analysis:")
print(f"   • Total samples: {len(df_train_imputed):,}")
print(f"   • Valid samples: {valid_idx.sum():,}")
print(f"   • Missing targets: {(~valid_idx).sum():,}")

if valid_idx.sum() < len(df_train_imputed):
    print(f"   • Removing {(~valid_idx).sum()} samples with missing targets")
    df_train_clean = df_train_imputed[valid_idx].copy()
    y_forward_returns = y_forward_returns[valid_idx]
    y_risk_free_rate = y_risk_free_rate[valid_idx]
else:
    df_train_clean = df_train_imputed.copy()

print(f"\n📊 Final Training Data:")
print(f"   • Samples: {len(df_train_clean):,}")
print(f"   • Forward returns range: [{y_forward_returns.min():.6f}, {y_forward_returns.max():.6f}]")
print(f"   • Risk-free rate range: [{y_risk_free_rate.min():.6f}, {y_risk_free_rate.max():.6f}]")
print(f"   • Forward returns std: {y_forward_returns.std():.6f}")

print(f"\n✅ Training data prepared successfully!")

📊 TRAINING DATA PREPARATION
📈 Target Variable Analysis:
   • Total samples: 8,990
   • Valid samples: 8,990
   • Missing targets: 0

📊 Final Training Data:
   • Samples: 8,990
   • Forward returns range: [-0.039754, 0.040661]
   • Risk-free rate range: [-0.000004, 0.000317]
   • Forward returns std: 0.010551

✅ Training data prepared successfully!


In [7]:
# Competition-aware evaluation function
def evaluate_competition_model(model, X_train, y_train, X_val, y_val, 
                             forward_returns_train, risk_free_train,
                             forward_returns_val, risk_free_val,
                             model_name="Model"):
    """
    Evaluate a model using competition-specific metrics.
    """
    try:
        # Train the model with competition context
        if hasattr(model, 'fit') and 'forward_returns' in model.fit.__code__.co_varnames:
            # Competition-aware model
            model.fit(X_train, y_train, 
                     forward_returns=forward_returns_train, 
                     risk_free_rates=risk_free_train)
        else:
            # Standard model
            model.fit(X_train, y_train)
        
        # Get position predictions
        positions_train = model.predict(X_train)
        positions_val = model.predict(X_val)
        
        # Calculate competition metrics
        train_score = calculate_competition_metric(
            positions_train, forward_returns_train, risk_free_train
        )
        val_score = calculate_competition_metric(
            positions_val, forward_returns_val, risk_free_val
        )
        
        # Calculate additional metrics
        train_volatility = calculate_strategy_volatility(
            positions_train, forward_returns_train, risk_free_train
        )
        val_volatility = calculate_strategy_volatility(
            positions_val, forward_returns_val, risk_free_val
        )
        
        train_return = calculate_strategy_return(
            positions_train, forward_returns_train, risk_free_train
        )
        val_return = calculate_strategy_return(
            positions_val, forward_returns_val, risk_free_val
        )
        
        # Position statistics
        avg_position = np.mean(positions_val)
        position_std = np.std(positions_val)
        
        results = {
            'model': model_name,
            'train_competition_score': train_score,
            'val_competition_score': val_score,
            'train_volatility': train_volatility,
            'val_volatility': val_volatility,
            'train_return': train_return,
            'val_return': val_return,
            'avg_position': avg_position,
            'position_std': position_std,
            'overfitting': abs(train_score - val_score),
            'status': 'success'
        }
        
        return results
        
    except Exception as e:
        return {
            'model': model_name,
            'error': str(e),
            'train_competition_score': np.nan,
            'val_competition_score': np.nan,
            'status': 'failed'
        }

print("✅ Competition evaluation function defined")

✅ Competition evaluation function defined


## 5. Model Training and Selection

In [8]:
# Main training loop for competition-aware models
print("🚀 COMPETITION-AWARE MODEL TRAINING")
print("=" * 60)

all_results = []
best_model_info = {'score': -np.inf, 'model': None, 'features': None}

# Test each feature set
for set_name, feature_list in feature_sets.items():
    print(f"\n📊 TESTING FEATURE SET: {set_name.upper()}")
    print("-" * 50)
    
    try:
        # Get available features
        available_features = [f for f in feature_list if f in df_train_clean.columns]
        
        if len(available_features) == 0:
            print(f"   ❌ No features available for {set_name}")
            continue
        
        print(f"   • Features: {len(available_features)} available")
        
        # Prepare feature data
        X_full = df_train_clean[available_features].copy()
        
        # Handle any remaining missing values
        missing_count = X_full.isnull().sum().sum()
        if missing_count > 0:
            print(f"   ⚠️ Handling {missing_count} missing values")
            X_full = X_full.fillna(X_full.median()).fillna(0)
        
        # Scale features
        scaler = RobustScaler()
        X_scaled = scaler.fit_transform(X_full)
        
        # Time-series split (preserving order)
        split_idx = int(len(X_scaled) * 0.8)
        
        X_train = X_scaled[:split_idx]
        X_val = X_scaled[split_idx:]
        y_train = y_forward_returns.iloc[:split_idx]
        y_val = y_forward_returns.iloc[split_idx:]
        
        # Target data for competition metric
        forward_returns_train = y_forward_returns.iloc[:split_idx]
        forward_returns_val = y_forward_returns.iloc[split_idx:]
        risk_free_train = y_risk_free_rate.iloc[:split_idx]
        risk_free_val = y_risk_free_rate.iloc[split_idx:]
        
        print(f"   • Train: {len(X_train):,} samples")
        print(f"   • Validation: {len(X_val):,} samples")
        
        # Test each model
        set_results = []
        for model_name, model in models.items():
            print(f"   🔄 {model_name}...", end=" ")
            
            result = evaluate_competition_model(
                model, X_train, y_train, X_val, y_val,
                forward_returns_train, risk_free_train,
                forward_returns_val, risk_free_val,
                model_name
            )
            
            result['feature_set'] = set_name
            result['n_features'] = len(available_features)
            
            if result['status'] == 'failed':
                print(f"❌ Failed: {result.get('error', 'Unknown error')[:50]}")
            else:
                score = result['val_competition_score']
                volatility = result['val_volatility']
                avg_pos = result['avg_position']
                
                print(f"✅ Score={score:.4f}, Vol={volatility:.1f}%, Pos={avg_pos:.2f}")
                
                # Track best model
                if score > best_model_info['score']:
                    best_model_info.update({
                        'score': score,
                        'model': model,
                        'model_name': model_name,
                        'features': available_features,
                        'scaler': scaler,
                        'feature_set': set_name
                    })
            
            set_results.append(result)
            all_results.append(result)
        
        # Best in this feature set
        valid_results = [r for r in set_results if r['status'] == 'success']
        if valid_results:
            best_in_set = max(valid_results, key=lambda x: x['val_competition_score'])
            print(f"   🏆 Best in set: {best_in_set['model']} (Score={best_in_set['val_competition_score']:.4f})")
        
    except Exception as e:
        print(f"   ❌ Feature set {set_name} failed: {str(e)[:100]}")

print(f"\n✅ Competition-aware training complete! Tested {len(all_results)} configurations.")

🚀 COMPETITION-AWARE MODEL TRAINING

📊 TESTING FEATURE SET: ALL_FEATURES
--------------------------------------------------
   • Features: 94 available
   ⚠️ Handling 110204 missing values


   • Train: 7,192 samples
   • Validation: 1,798 samples
   🔄 Comp_Ridge... ✅ Score=0.2905, Vol=34.7%, Pos=1.87
   🔄 Comp_Lasso... ✅ Score=0.6225, Vol=13.1%, Pos=0.74
   🔄 Comp_ElasticNet... ✅ Score=0.6225, Vol=13.1%, Pos=0.74
   🔄 Comp_RandomForest... ✅ Score=0.3306, Vol=20.8%, Pos=0.47
   🔄 Comp_GradientBoosting... ✅ Score=0.5046, Vol=23.1%, Pos=0.57
   🔄 Comp_XGBoost... ✅ Score=0.3585, Vol=26.2%, Pos=0.77
   🔄 Comp_LightGBM... ✅ Score=0.3892, Vol=27.4%, Pos=0.76
   🏆 Best in set: Comp_Lasso (Score=0.6225)

📊 TESTING FEATURE SET: VOLATILITY_FOCUSED
--------------------------------------------------
   • Features: 31
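
One caveat when reading these validation scores: the training loop fills missing values with full-sample medians and calls `scaler.fit_transform(X_full)` before the 80/20 chronological split, so rows from the validation period influence the statistics applied to the training rows. A leakage-free variant would estimate those statistics on the training rows only. The sketch below illustrates the idea with synthetic data and a NumPy-only median/IQR scaling (the same statistics `RobustScaler` uses by default); it is an assumption-laden illustration, not the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # synthetic features, rows in time order
X[80:] += 5.0                   # later rows drift, as market regimes can

# Chronological split first, statistics second
split_idx = int(len(X) * 0.8)
X_train_raw, X_val_raw = X[:split_idx], X[split_idx:]

# Median and interquartile range estimated from the training rows only
center = np.median(X_train_raw, axis=0)
q25, q75 = np.percentile(X_train_raw, [25, 75], axis=0)
iqr = q75 - q25

X_train = (X_train_raw - center) / iqr
X_val = (X_val_raw - center) / iqr   # reuses the training statistics

print("train medians:", np.round(np.median(X_train, axis=0), 3))
print("val medians:  ", np.round(np.median(X_val, axis=0), 3))
```

The validation medians land far from zero here because the drift is visible only when the scaler has not already seen those rows, which is the situation a model faces on genuinely unseen data.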

## 6. Results Analysis and Model Selection

In [9]:
# Analyze competition-aware results
print("📊 COMPETITION RESULTS ANALYSIS")
print("=" * 60)

if len(all_results) == 0:
    print("❌ No successful model runs to analyze!")
else:
    # Create results DataFrame
    successful_results = [r for r in all_results if r['status'] == 'success']
    
    if len(successful_results) == 0:
        print("❌ All models failed!")
    else:
        results_df = pd.DataFrame(successful_results)
        
        print(f"✅ Successfully trained {len(results_df)} models")
        
        # Overall performance summary
        print(f"\n📈 COMPETITION PERFORMANCE SUMMARY:")
        print(f"   • Average Competition Score: {results_df['val_competition_score'].mean():.4f}")
        print(f"   • Best Competition Score: {results_df['val_competition_score'].max():.4f}")
        print(f"   • Average Volatility: {results_df['val_volatility'].mean():.1f}%")
        print(f"   • Average Return: {results_df['val_return'].mean():.1f}%")
        print(f"   • Average Position: {results_df['avg_position'].mean():.2f}")
        
        # Top 5 models by competition score
        print(f"\n🏆 TOP 5 COMPETITION MODELS:")
        print("-" * 50)
        top_models = results_df.nlargest(5, 'val_competition_score')
        
        for i, (_, row) in enumerate(top_models.iterrows(), 1):
            print(f"   {i}. {row['model']} ({row['feature_set']})")
            print(f"      Competition Score: {row['val_competition_score']:.4f}")
            print(f"      Strategy Volatility: {row['val_volatility']:.1f}%")
            print(f"      Strategy Return: {row['val_return']:.1f}%")
            print(f"      Average Position: {row['avg_position']:.2f}")
            print(f"      Features: {row['n_features']}")
            print()
        
        # Model type comparison
        print(f"📊 MODEL TYPE COMPARISON (by Competition Score):")
        print("-" * 40)
        model_comparison = results_df.groupby('model').agg({
            'val_competition_score': ['mean', 'max', 'count'],
            'val_volatility': 'mean',
            'val_return': 'mean',
            'avg_position': 'mean'
        }).round(4)
        
        model_comparison.columns = ['avg_score', 'best_score', 'count', 'avg_vol', 'avg_return', 'avg_position']
        model_comparison = model_comparison.sort_values('avg_score', ascending=False)
        
        for model_name, stats in model_comparison.iterrows():
            print(f"   {model_name}:")
            print(f"      Avg Score: {stats['avg_score']:.4f} (Best: {stats['best_score']:.4f})")
            print(f"      Avg Volatility: {stats['avg_vol']:.1f}%")
            print(f"      Avg Return: {stats['avg_return']:.1f}%")
            print(f"      Avg Position: {stats['avg_position']:.2f}")
            print(f"      Configurations: {int(stats['count'])}")
            print()
        
        # Feature set comparison
        print(f"🔧 FEATURE SET COMPARISON:")
        print("-" * 30)
        feature_comparison = results_df.groupby('feature_set').agg({
            'val_competition_score': ['mean', 'max'],
            'val_volatility': 'mean',
            'n_features': 'first'
        }).round(4)
        
        feature_comparison.columns = ['avg_score', 'best_score', 'avg_vol', 'n_features']
        feature_comparison = feature_comparison.sort_values('avg_score', ascending=False)
        
        for set_name, stats in feature_comparison.iterrows():
            print(f"   {set_name} ({int(stats['n_features'])} features):")
            print(f"      Avg Score: {stats['avg_score']:.4f} (Best: {stats['best_score']:.4f})")
            print(f"      Avg Volatility: {stats['avg_vol']:.1f}%")
            print()
        
        # Best model details
        if best_model_info['model'] is not None:
            print(f"🎯 SELECTED BEST MODEL:")
            print(f"   • Model: {best_model_info['model_name']}")
            print(f"   • Feature Set: {best_model_info['feature_set']}")
            print(f"   • Features: {len(best_model_info['features'])}")
            print(f"   • Competition Score: {best_model_info['score']:.4f}")
            
            # Show detailed breakdown
            best_result = results_df[results_df['val_competition_score'] == best_model_info['score']].iloc[0]
            print(f"   • Strategy Volatility: {best_result['val_volatility']:.2f}%")
            print(f"   • Strategy Return: {best_result['val_return']:.2f}%")
            print(f"   • Average Position: {best_result['avg_position']:.3f}")
            print(f"   • Position Std: {best_result['position_std']:.3f}")
            
print(f"\n✅ Competition analysis complete!")

📊 COMPETITION RESULTS ANALYSIS
✅ Successfully trained 35 models

📈 COMPETITION PERFORMANCE SUMMARY:
   • Average Competition Score: 0.4756
   • Best Competition Score: 0.6765
   • Average Volatility: 19.9%
   • Average Return: 14.4%
   • Average Position: 0.76

🏆 TOP 5 COMPETITION MODELS:
--------------------------------------------------
   1. Comp_RandomForest (price_focused)
      Competition Score: 0.6765
      Strategy Volatility: 18.8%
      Strategy Return: 16.5%
      Average Position: 0.59
      Features: 33

   2. Comp_XGBoost (price_focused)
      Competition Score: 0.6231
      Strategy Volatility: 23.8%
      Strategy Return: 21.6%
      Average Position: 0.71
      Features: 33

   3. Comp_Lasso (all_features)
      Competition Score: 0.6225
      Strategy Volatility: 13.1%
      Strategy Return: 11.9%
      Average Position: 0.74
      Features: 94

   4. Comp_ElasticNet (all_features)
      Competition Score: 0.6225
      Strategy Volatility: 13.1%
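
Two rows of the leaderboard above share the printed score 0.6225, so looking up a result by its score value can match more than one row. A minimal sketch of selecting the winner by index label instead, and of listing tied rows. The `results` frame is an illustrative stand-in for `results_df`, built only from the rounded scores printed above:

```python
import pandas as pd

# Illustrative stand-in for results_df, using the rounded scores printed above.
results = pd.DataFrame({
    "model_name": ["Comp_RandomForest", "Comp_XGBoost", "Comp_Lasso", "Comp_ElasticNet"],
    "feature_set": ["price_focused", "price_focused", "all_features", "all_features"],
    "val_competition_score": [0.6765, 0.6231, 0.6225, 0.6225],
})

# idxmax returns the index label of the first maximum, so exactly one row is selected.
best_row = results.loc[results["val_competition_score"].idxmax()]

# Rows whose score appears more than once; these are ambiguous under a value-based lookup.
ties = results[results["val_competition_score"].duplicated(keep=False)]

print(best_row["model_name"], best_row["feature_set"])
print(ties[["model_name", "val_competition_score"]].to_string(index=False))
```

Whether the unrounded scores of the two linear models are truly identical cannot be read from the printed output; the sketch only shows how to detect and sidestep the ambiguity.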


## 7. Final Model Training and Test Predictions
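
The cell below imputes and scales the training features, then applies the fitted scaler to the test features. A minimal sketch of that pattern with every preprocessing statistic learned from the training data only. It assumes scikit-learn's `SimpleImputer` and `make_pipeline`, which the notebook itself does not use, and the toy frames `train` and `test` are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data standing in for the notebook's train and test features.
train = pd.DataFrame({
    "f1": [1.0, 2.0, np.nan, 4.0, 100.0],
    "f2": [0.5, np.nan, 1.5, 2.5, 3.5],
})
test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [3.0, np.nan]})

# Median imputation followed by robust scaling; both steps are fitted on train only.
preprocess = make_pipeline(SimpleImputer(strategy="median"), RobustScaler())
train_scaled = preprocess.fit_transform(train)

# transform() reuses the training medians and IQRs; nothing is re-estimated from test.
test_scaled = preprocess.transform(test)
print(test_scaled)
```

With only 10 test rows, medians estimated from the test set would be noisy, which is one reason to reuse the training statistics.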

In [10]:
# Train final model and generate competition submissions
print("🎯 FINAL MODEL TRAINING FOR COMPETITION")
print("=" * 60)

if best_model_info['model'] is None:
    print("❌ No best model found!")
else:
    print(f"🏆 Training Final Model: {best_model_info['model_name']}")
    print(f"   • Feature Set: {best_model_info['feature_set']}")
    print(f"   • Features: {len(best_model_info['features'])}")
    print(f"   • Validation Score: {best_model_info['score']:.4f}")
    
    try:
        # Prepare final training data
        X_final = df_train_clean[best_model_info['features']].copy()
        
        # Handle missing values
        missing_final = X_final.isnull().sum().sum()
        if missing_final > 0:
            print(f"   • Handling {missing_final} missing values")
            X_final = X_final.fillna(X_final.median()).fillna(0)
        
        # Scale features
        scaler_final = RobustScaler()
        X_final_scaled = scaler_final.fit_transform(X_final)
        
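        # Prepare test data; reindex keeps every training feature (absent ones become NaN
        # and are imputed below) so the column layout matches what the scaler was fitted on
        test_features_available = [f for f in best_model_info['features'] if f in df_test.columns]
        X_test_final = df_test.reindex(columns=best_model_info['features']).copy()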
        
        print(f"\n📊 Final Data Preparation:")
        print(f"   • Training samples: {len(X_final_scaled):,}")
        print(f"   • Test samples: {len(X_test_final):,}")
        print(f"   • Features used: {len(test_features_available)}")
        print(f"   • Features missing in test: {len(best_model_info['features']) - len(test_features_available)}")
        
        # Handle missing values in test data using training medians, so test rows are
        # imputed with the same values the model saw during training
        test_missing = X_test_final.isnull().sum().sum()
        if test_missing > 0:
            print(f"   • Handling {test_missing} missing values in test data")
            X_test_final = X_test_final.fillna(X_final.median()).fillna(0)
        
        # Scale test features
        X_test_final_scaled = scaler_final.transform(X_test_final)
        
        # Train final model
        final_model = best_model_info['model']
        print(f"\n🚀 Training final model on full dataset...")
        
        # Train with competition context when fit() declares the extra parameters
        # (inspect.signature checks parameters only, unlike co_varnames, which also lists locals)
        import inspect
        if 'forward_returns' in inspect.signature(final_model.fit).parameters:
            final_model.fit(X_final_scaled, y_forward_returns, 
                           forward_returns=y_forward_returns, 
                           risk_free_rates=y_risk_free_rate)
        else:
            final_model.fit(X_final_scaled, y_forward_returns)
        
        # Generate predictions
        final_positions = final_model.predict(X_test_final_scaled)
        
        print(f"\n✅ FINAL COMPETITION PREDICTIONS:")
        print(f"=" * 50)
        print(f"📊 Position Statistics:")
        print(f"   • Number of predictions: {len(final_positions):,}")
        print(f"   • Position range: [{final_positions.min():.4f}, {final_positions.max():.4f}]")
        print(f"   • Mean position: {final_positions.mean():.4f}")
        print(f"   • Position std: {final_positions.std():.4f}")
        print(f"   • Positions > 1.0: {(final_positions > 1.0).sum()} ({(final_positions > 1.0).mean()*100:.1f}%)")
        print(f"   • Positions < 1.0: {(final_positions < 1.0).sum()} ({(final_positions < 1.0).mean()*100:.1f}%)")
        
        # Constraint validation
        constraint_violations = (final_positions < 0) | (final_positions > 2)
        if constraint_violations.any():
            print(f"   ⚠️ Constraint violations: {constraint_violations.sum()}")
            final_positions = np.clip(final_positions, 0, 2)
            print(f"   ✅ Positions clipped to valid range [0, 2]")
        else:
            print(f"   ✅ All positions within valid range [0, 2]")
        
        # Create submission DataFrame
        submission_df = pd.DataFrame({
            'date_id': df_test['date_id'],
            'prediction': final_positions
        })
        
        # Save competition submission
        submission_path = '../data/predictions/competition_submission.csv'
        import os
        os.makedirs('../data/predictions', exist_ok=True)
        submission_df.to_csv(submission_path, index=False)
        
        print(f"\n💾 Competition submission saved to: {submission_path}")
        print(f"   • Format: date_id, prediction")
        print(f"   • Ready for Kaggle submission")
        
        # Show sample predictions
        print(f"\n📋 Sample Predictions:")
        print(submission_df.head(10).to_string(index=False, float_format='%.6f'))
        
        # Training performance on full dataset
        train_positions = final_model.predict(X_final_scaled)
        final_train_score = calculate_competition_metric(
            train_positions, y_forward_returns, y_risk_free_rate, verbose=True
        )
        
        print(f"\n🏆 FINAL TRAINING PERFORMANCE:")
        print(f"   • Competition Score: {final_train_score:.4f}")
        
        # Strategy analysis
        strategy_vol = calculate_strategy_volatility(train_positions, y_forward_returns, y_risk_free_rate)
        strategy_ret = calculate_strategy_return(train_positions, y_forward_returns, y_risk_free_rate)
        market_vol = y_forward_returns.std() * np.sqrt(252) * 100
        
        print(f"   • Strategy Volatility: {strategy_vol:.2f}%")
        print(f"   • Market Volatility: {market_vol:.2f}%")
        print(f"   • Strategy Return: {strategy_ret:.2f}%")
        print(f"   • Volatility Ratio: {strategy_vol/market_vol:.2f}")
        
    except Exception as e:
        print(f"❌ Final training failed: {str(e)}")
        import traceback
        traceback.print_exc()

print(f"\n" + "=" * 60)
print(f"✅ COMPETITION-AWARE MODELING COMPLETE")
print(f"🎯 OPTIMIZED FOR VOLATILITY-ADJUSTED SHARPE RATIO")
print(f"📈 READY FOR KAGGLE SUBMISSION")
print(f"=" * 60)

🎯 FINAL MODEL TRAINING FOR COMPETITION
🏆 Training Final Model: Comp_RandomForest
   • Feature Set: price_focused
   • Features: 33
   • Validation Score: 0.6765
   • Handling 14888 missing values

📊 Final Data Preparation:
   • Training samples: 8,990
   • Test samples: 10
   • Features used: 33
   • Features missing in test: 0

🚀 Training final model on full dataset...

✅ FINAL COMPETITION PREDICTIONS:
📊 Position Statistics:
   • Number of predictions: 10
   • Position range: [0.4784, 0.8218]
   • Mean position: 0.5128
   • Position std: 0.1030
   • Positions > 1.0: 0 (0.0%)
   • Positions < 1.0: 10 (100.0%)
   ✅ All positions within valid range [0, 2]

💾 Competition submission saved to: ../data/predictions/competition_submission.csv
   • Format: date_id, prediction
   • Ready for Kaggle submission

📋 Sample Predictions:
 date_id  prediction
    8980    0.478435
    8981    0.478435
    8982    0.821777
    8983    0.478435
    