# Lab 10: Production ML Pipelines & Rigorous Backtesting

From research prototype to production-ready factor models

> **Expected Time**
>
> -   FIN510: Exercises 1-2 ≈ 75 min
> -   FIN720: All exercises ≈ 100 min
> -   Directed learning extensions ≈ 60 min
>
> ### Prerequisites
>
> This lab extends concepts from the JKP factor replication lab.
> Familiarity with factor models and portfolio construction is helpful
> but not required.

<figure>
<a
href="https://colab.research.google.com/github/quinfer/fin510-colab-notebooks/blob/main/labs/lab10_pipelines.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<figcaption>Open in Colab</figcaption>
</figure>

## Before You Code: The Big Picture

**Research code ≠ Production code.** That Jupyter notebook with a
spectacular backtest Sharpe ratio? It probably won’t survive first
contact with live markets. This lab shows how to build
**production-grade** ML pipelines that hold up outside the notebook.

> **The Research-to-Production Gap**
>
> **Why Models Fail in Production:**
>
> 1.  **Look-ahead bias**: Used future data to make past predictions
> 2.  **Overfitting**: Optimized on test data, no true holdout set
> 3.  **Data drift**: Training distribution ≠ production distribution
> 4.  **Multiple testing**: Tried 100 features, reported the 5 that
>     worked
> 5.  **Leakage**: Features contain information not available at
>     prediction time
>
> **The Evidence:**
>
> -   **70% of ML projects fail to deploy** (Gartner 2021)
> -   **90% of deployed models underperform expectations** (Algorithmia
>     2020)
> -   Average Sharpe ratio decline: 0.5 → 0.1 from backtest to live
>     (industry estimates)
>
> **What Separates Winners from Losers:**
>
> -   **Rigorous validation**: CPCV, PBO, embargo periods, multiple
>     testing corrections
> -   **Temporal correctness**: Strict point-in-time data, no future
>     information
> -   **Production monitoring**: Drift detection, model versioning,
>     automated rollback
> -   **Documentation**: Reproducible, auditable, explainable

### What You’ll Build Today

By the end of this lab, you will have:

-   ✅ End-to-end ML pipeline (ingestion → features → training →
    monitoring)
-   ✅ Temporal correctness (no look-ahead bias, point-in-time features)
-   ✅ Multiple testing corrections (Bonferroni, FDR)
-   ✅ Combinatorial Purged Cross-Validation (gold standard for finance)
-   ✅ Production monitoring (drift detection, performance tracking)

**Time estimate:** 75 minutes (FIN510) \| 100 minutes (FIN720 with all
exercises)

> **Why This Matters**
>
> **These are the best practices for Coursework 2.** If you implement
> these patterns (CPCV, PBO, embargo periods, multiple testing
> corrections), you’ll stand out. Most students submit naive backtests.
> You’ll submit **production-grade** work that could actually be
> deployed.

## Learning Objectives

By the end of this lab, you will be able to:

-   Design and implement end-to-end ML pipelines for financial
    applications
-   Engineer features with temporal correctness preventing look-ahead
    bias
-   Apply multiple testing corrections (Bonferroni, FDR) to prevent
    false discoveries
-   Implement combinatorial purged cross-validation for robust
    backtesting
-   Calculate the probability of backtest overfitting (PBO) to quantify
    overfitting risk
-   Monitor model performance and detect data drift in production
-   Track model versions and implement rollback capabilities
-   Evaluate production readiness using comprehensive validation

## Setup and Dependencies

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from scipy import stats
from itertools import combinations
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# For statistical tests
try:
    from statsmodels.stats.multitest import multipletests
except ImportError:
    print("Installing statsmodels...")
    !pip install -q statsmodels
    from statsmodels.stats.multitest import multipletests

print("✓ Setup complete - ready for production ML pipeline development")

## Exercise 1: ML Pipeline Implementation

### Understanding Pipeline Architecture

Production ML systems aren’t standalone scripts—they’re pipelines with
clearly defined components, interfaces, and orchestration. We’ll
implement a simple but realistic pipeline for factor-based investing,
demonstrating principles applicable to any ML application.

### Data Ingestion with Versioning

In [2]:
class DataIngestionPipeline:
    """
    Data ingestion component with versioning and validation
    """
    
    def __init__(self, data_source="synthetic"):
        self.data_source = data_source
        self.version = None
        self.ingestion_timestamp = None
        
    def ingest_factor_data(self, n_periods=120, n_assets=50):
        """
        Ingest or generate factor data with metadata
        
        In production, this would:
        - Query databases or APIs
        - Handle retries and failures
        - Validate schemas
        - Version the extracted data
        """
        np.random.seed(42)
        
        # Generate synthetic factor data (simulating market data)
        dates = pd.date_range(end=datetime.now(), periods=n_periods, freq='M')
        
        # Factors: Value, Momentum, Quality, Size, Low Vol
        factor_names = ['value', 'momentum', 'quality', 'size', 'low_vol']
        
        data = []
        for asset in range(n_assets):
            asset_id = f"asset_{asset:03d}"
            prev_row = None  # Previous period's row for this asset
            
            for date in dates:
                # Generate factor exposures with some persistence
                row = {'date': date, 'asset_id': asset_id}
                
                for factor in factor_names:
                    # Factors have autocorrelation (realistic)
                    if prev_row is not None:
                        row[factor] = 0.7 * prev_row[factor] + 0.3 * np.random.randn()
                    else:
                        row[factor] = np.random.randn()
                
                # Generate forward returns (target variable)
                # Returns correlated with factors (but not perfectly)
                factor_vals = [row[f] for f in factor_names]
                true_factor_loadings = [0.05, 0.03, 0.04, -0.02, -0.01]  # True relationships
                
                expected_return = sum(f * l for f, l in zip(factor_vals, true_factor_loadings))
                row['return_1m'] = expected_return + 0.10 * np.random.randn()  # Add noise
                
                data.append(row)
                prev_row = row  # Avoids rescanning all rows to find the last value
        
        df = pd.DataFrame(data)
        
        # Add metadata
        self.version = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.ingestion_timestamp = datetime.now()
        
        # Validation
        self._validate_data(df)
        
        print(f"✓ Data ingested: {len(df)} records")
        print(f"  Version: {self.version}")
        print(f"  Date range: {df['date'].min()} to {df['date'].max()}")
        print(f"  Assets: {df['asset_id'].nunique()}")
        print(f"  Factors: {', '.join(factor_names)}")
        
        return df
    
    def _validate_data(self, df):
        """Validate data quality"""
        # Check for required columns
        required_cols = ['date', 'asset_id', 'return_1m']
        missing = [c for c in required_cols if c not in df.columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        
        # Check for nulls
        null_counts = df.isnull().sum()
        if null_counts.sum() > 0:
            print(f"  ⚠️  Warning: Found {null_counts.sum()} null values")
        
        # Check for duplicates
        dupes = df.duplicated(['date', 'asset_id']).sum()
        if dupes > 0:
            raise ValueError(f"Found {dupes} duplicate (date, asset_id) pairs")
        
        print(f"  ✓ Data validation passed")


# Demonstrate data ingestion
print("="*70)
print("DATA INGESTION PIPELINE")
print("="*70)

ingestion_pipeline = DataIngestionPipeline()
factor_data = ingestion_pipeline.ingest_factor_data(n_periods=120, n_assets=50)

print("\nSample data:")
print(factor_data.head(10))

print("\nData statistics:")
print(factor_data[['value', 'momentum', 'quality', 'size', 'low_vol', 'return_1m']].describe())
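
The `version` set above is a timestamp, so it records when data was ingested but not what the data contained. One common complement is a content hash, so that identical data always maps to the same identifier and any revision produces a new one. The sketch below is a minimal illustration; `content_version` is a hypothetical helper, not part of the lab's `DataIngestionPipeline`.

```python
import hashlib

import pandas as pd


def content_version(df: pd.DataFrame, n_chars: int = 12) -> str:
    """Return a short, deterministic fingerprint of a DataFrame's contents.

    Hypothetical helper for illustration. Columns and rows are sorted first,
    so reordering the same data does not change the version.
    """
    ordered = df.sort_index(axis=1)
    ordered = ordered.sort_values(list(ordered.columns)).reset_index(drop=True)
    row_hashes = pd.util.hash_pandas_object(ordered, index=False)

    digest = hashlib.sha256()
    digest.update(",".join(ordered.columns).encode("utf-8"))  # schema
    digest.update(row_hashes.to_numpy().tobytes())            # row contents
    return digest.hexdigest()[:n_chars]


demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-31", "2024-02-29"]),
    "asset_id": ["asset_000", "asset_000"],
    "value": [0.10, 0.20],
})
v_original = content_version(demo)
v_reordered = content_version(demo.iloc[::-1])  # same rows, different order

revised = demo.copy()
revised.loc[0, "value"] = 0.11                  # one restated data point
v_revised = content_version(revised)

print("reordered matches original:", v_original == v_reordered)
print("revised matches original:  ", v_original == v_revised)
```

Storing both the timestamp and the content hash in the model metadata makes it possible to tell whether two training runs actually saw the same data.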

### Feature Engineering with Temporal Correctness

In [3]:
class FeatureEngineeringPipeline:
    """
    Feature engineering ensuring temporal correctness (no look-ahead bias)
    """
    
    def __init__(self):
        self.feature_definitions = {}
        self.version = "1.0.0"
        
    def engineer_features(self, df, lookback_periods=(3, 6, 12)):
        """
        Engineer temporal features with strict point-in-time correctness
        
        Key principle: Only use data available at prediction time
        """
        df = df.copy().sort_values(['asset_id', 'date'])
        
        # Original factors (already point-in-time correct)
        features = ['value', 'momentum', 'quality', 'size', 'low_vol']
        
        # Engineer lagged aggregations (moving averages, volatilities)
        for asset_id, asset_df in df.groupby('asset_id'):
            for period in lookback_periods:
                for factor in ['value', 'momentum', 'quality']:
                    # Moving average (using only past data)
                    col_name = f'{factor}_ma{period}'
                    df.loc[df['asset_id'] == asset_id, col_name] = (
                        asset_df[factor].rolling(window=period, min_periods=1).mean()
                    )
                    
                    # Volatility (using only past data)
                    col_name = f'{factor}_vol{period}'
                    df.loc[df['asset_id'] == asset_id, col_name] = (
                        asset_df[factor].rolling(window=period, min_periods=2).std()
                    )
        
        # Fill NaN from rolling windows (first periods)
        engineered_features = [c for c in df.columns if '_ma' in c or '_vol' in c]
        df[engineered_features] = df[engineered_features].fillna(0)
        
        print(f"✓ Features engineered: {len(engineered_features)} new features")
        print(f"  Lookback periods: {lookback_periods}")
        print(f"  Total features: {len(features) + len(engineered_features)}")
        
        # Verify no look-ahead bias
        self._verify_temporal_correctness(df)
        
        return df
    
    def _verify_temporal_correctness(self, df):
        """
        Heuristic leakage check (not a proof of temporal correctness)
        
        Goal: for prediction at time t, all features use data <= t.
        This check only flags features that are suspiciously predictive.
        """
        # A feature that almost perfectly predicts returns usually
        # indicates leakage
        
        feature_cols = [c for c in df.columns if c not in ['date', 'asset_id', 'return_1m']]
        
        # Calculate each feature's absolute correlation with the target
        max_corr = 0
        max_corr_feature = None
        
        for col in feature_cols:
            corr = abs(df[col].corr(df['return_1m']))
            if corr > max_corr:
                max_corr = corr
                max_corr_feature = col
        
        if max_corr > 0.9:  # Suspiciously high correlation
            print(f"  ⚠️  Warning: Feature {max_corr_feature} has correlation {max_corr:.3f} with returns")
            print(f"     This might indicate look-ahead bias!")
        else:
            print(f"  ✓ No obvious leakage detected (max |correlation|: {max_corr:.3f})")


# Demonstrate feature engineering
print("\n" + "="*70)
print("FEATURE ENGINEERING PIPELINE")
print("="*70)

feature_pipeline = FeatureEngineeringPipeline()
factor_data_with_features = feature_pipeline.engineer_features(factor_data, lookback_periods=[3, 6, 12])

print("\nEngineered features sample:")
feature_cols = [c for c in factor_data_with_features.columns if '_ma' in c or '_vol' in c]
print(factor_data_with_features[['date', 'asset_id'] + feature_cols[:6]].head(10))
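
The correlation check above can only flag features that are suspiciously predictive. A more direct check is a truncation test: recompute a feature with all later rows removed and confirm that its value at the cutoff does not change. The sketch below is a minimal illustration on a single series; `trailing_ma`, `centered_ma`, and `passes_truncation_test` are hypothetical names, and the centered window is included only as a deliberately leaky counterexample.

```python
import numpy as np
import pandas as pd


def trailing_ma(series: pd.Series, window: int = 3) -> pd.Series:
    """Point-in-time moving average: uses only the current and earlier rows."""
    return series.rolling(window=window, min_periods=1).mean()


def centered_ma(series: pd.Series, window: int = 3) -> pd.Series:
    """Leaky moving average: a centered window peeks at later rows."""
    return series.rolling(window=window, min_periods=1, center=True).mean()


def passes_truncation_test(feature_fn, series: pd.Series, n_checks: int = 10) -> bool:
    """Return True if feature values never change when future rows are removed."""
    full = feature_fn(series)
    cutoffs = np.linspace(1, len(series), num=n_checks, dtype=int)
    for cutoff in cutoffs:
        truncated = feature_fn(series.iloc[:cutoff])
        # The value at the last visible row must match the full-history value
        if not np.isclose(truncated.iloc[-1], full.iloc[cutoff - 1]):
            return False
    return True


rng = np.random.default_rng(0)
factor = pd.Series(rng.standard_normal(60))

trailing_ok = passes_truncation_test(trailing_ma, factor)
centered_ok = passes_truncation_test(centered_ma, factor)
print("trailing window passes:", trailing_ok)
print("centered window passes:", centered_ok)
```

The same idea extends to the per-asset features in this lab by running the test within each `asset_id` group.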

### Model Training with Versioning

In [4]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

class ModelTrainingPipeline:
    """
    Model training with versioning and metadata tracking
    """
    
    def __init__(self):
        self.models = {}
        self.scalers = {}
        self.metadata = {}
        self.version_counter = 0
        
    def train_factor_model(self, df, train_end_date, features=None):
        """
        Train factor model with proper train/test split
        
        Args:
            df: Full dataset
            train_end_date: Last date for training (anything after is test)
            features: List of feature columns (if None, use all numeric except target)
        """
        # Split data temporally (no random split - that would leak!)
        train_df = df[df['date'] <= train_end_date].copy()
        test_df = df[df['date'] > train_end_date].copy()
        
        # Select features
        if features is None:
            exclude_cols = ['date', 'asset_id', 'return_1m']
            features = [c for c in df.columns if c not in exclude_cols]
        
        X_train = train_df[features]
        y_train = train_df['return_1m']
        X_test = test_df[features]
        y_test = test_df['return_1m']
        
        # Scale features (fit on train only!)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Train model
        model = Ridge(alpha=1.0)
        model.fit(X_train_scaled, y_train)
        
        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        
        train_pred = model.predict(X_train_scaled)
        test_pred = model.predict(X_test_scaled)
        
        train_mse = np.mean((y_train - train_pred) ** 2)
        test_mse = np.mean((y_test - test_pred) ** 2)
        
        # Version and store
        version_id = f"v{self.version_counter}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        self.version_counter += 1
        
        self.models[version_id] = model
        self.scalers[version_id] = scaler
        self.metadata[version_id] = {
            'train_end_date': train_end_date,
            'features': features,
            'n_train': len(train_df),
            'n_test': len(test_df),
            'train_r2': train_score,
            'test_r2': test_score,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'model_type': 'Ridge',
            'hyperparameters': {'alpha': 1.0}
        }
        
        print(f"\n✓ Model trained: {version_id}")
        print(f"  Training period: {train_df['date'].min()} to {train_end_date}")
        print(f"  Test period: {test_df['date'].min()} to {test_df['date'].max()}")
        print(f"  Training samples: {len(train_df):,}")
        print(f"  Test samples: {len(test_df):,}")
        print(f"  Features: {len(features)}")
        print(f"\n  Performance:")
        print(f"    Train R²: {train_score:.4f}")
        print(f"    Test R²:  {test_score:.4f}")
        print(f"    Train MSE: {train_mse:.6f}")
        print(f"    Test MSE:  {test_mse:.6f}")
        
        return version_id, model, scaler
    
    def get_model_metadata(self, version_id):
        """Retrieve model metadata for versioning and auditing"""
        return self.metadata.get(version_id, {})


# Demonstrate model training
print("\n" + "="*70)
print("MODEL TRAINING PIPELINE")
print("="*70)

training_pipeline = ModelTrainingPipeline()

# Train model with 80-20 temporal split
all_dates = sorted(factor_data_with_features['date'].unique())
split_idx = int(len(all_dates) * 0.8)
# train_end_date is inclusive, so take the last date inside the first 80%
train_end_date = all_dates[split_idx - 1]

version_id, model, scaler = training_pipeline.train_factor_model(
    factor_data_with_features,
    train_end_date
)

# Show model coefficients (feature importance)
feature_cols = [c for c in factor_data_with_features.columns 
                if c not in ['date', 'asset_id', 'return_1m']]
coefficients = model.coef_

print(f"\nTop 10 features by absolute coefficient:")
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'coefficient': coefficients
}).sort_values('coefficient', key=abs, ascending=False)

print(feature_importance.head(10).to_string(index=False))
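
`ModelTrainingPipeline` stores every trained version but does not record which one is live, which is the missing piece for rollback. The sketch below shows one way this could look, assuming a hypothetical in-memory `ModelRegistry` that keeps a promotion log. A production system would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ModelRegistry:
    """Hypothetical registry: stores versions and tracks which one is live."""
    artifacts: Dict[str, Any] = field(default_factory=dict)
    promotion_log: List[str] = field(default_factory=list)

    def register(self, version_id: str, artifact: Any) -> None:
        """Store an artifact; versions are immutable once registered."""
        if version_id in self.artifacts:
            raise ValueError(f"Version {version_id} is already registered")
        self.artifacts[version_id] = artifact

    def promote(self, version_id: str) -> None:
        """Make a registered version the live one."""
        if version_id not in self.artifacts:
            raise KeyError(f"Unknown version: {version_id}")
        self.promotion_log.append(version_id)

    @property
    def live_version(self) -> Optional[str]:
        return self.promotion_log[-1] if self.promotion_log else None

    def rollback(self) -> str:
        """Revert to the previously promoted version and return its id."""
        if len(self.promotion_log) < 2:
            raise RuntimeError("No earlier version to roll back to")
        self.promotion_log.pop()
        return self.promotion_log[-1]


registry = ModelRegistry()
registry.register("v0", {"alpha": 1.0})   # placeholder artifacts for the demo
registry.register("v1", {"alpha": 0.1})
registry.promote("v0")
registry.promote("v1")

restored = registry.rollback()            # v1 misbehaves, so revert
print("live version after rollback:", registry.live_version)
```

In this lab the artifact for each version would be the `(model, scaler)` pair together with its metadata dictionary, since a model is only usable with the scaler it was trained with.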

### Reflection Questions (Exercise 1)

Write 200-250 words addressing:

1.  **Pipeline Benefits**: How does organizing ML code into pipeline
    components (ingestion, features, training) improve maintainability
    compared to monolithic scripts? What challenges does this introduce?

2.  **Temporal Correctness**: Why is temporal correctness critical for
    financial ML? What would happen if we accidentally used future
    information in features? How can we systematically verify temporal
    correctness?

3.  **Model Versioning**: What information should model versions track
    for production systems? How does versioning enable debugging,
    auditing, and rollback capabilities?

## Exercise 2: Rigorous Backtesting with Multiple Testing Correction

### The Multiple Testing Problem

In [5]:
def demonstrate_multiple_testing_problem(n_random_features=100, n_samples=1000, alpha=0.05):
    """
    Demonstrate how testing many random features produces false discoveries
    """
    np.random.seed(42)
    
    # Generate random features (no predictive power)
    X_random = np.random.randn(n_samples, n_random_features)
    y = np.random.randn(n_samples)  # Random returns
    
    # Test each feature for significance
    p_values = []
    correlations = []
    
    for i in range(n_random_features):
        corr = np.corrcoef(X_random[:, i], y)[0, 1]
        # Two-sided t-test for correlation significance
        # (the survival function is more precise than 1 - cdf for tiny p-values)
        t_stat = corr * np.sqrt(n_samples - 2) / np.sqrt(1 - corr**2)
        p_val = 2 * stats.t.sf(abs(t_stat), n_samples - 2)
        
        p_values.append(p_val)
        correlations.append(abs(corr))
    
    # Count "significant" features (without correction)
    significant_uncorrected = sum(p < alpha for p in p_values)
    
    # Apply Bonferroni correction
    alpha_bonferroni = alpha / n_random_features
    significant_bonferroni = sum(p < alpha_bonferroni for p in p_values)
    
    # Apply Benjamini-Hochberg FDR correction
    rejected, p_corrected, _, _ = multipletests(p_values, alpha=alpha, method='fdr_bh')
    significant_fdr = sum(rejected)
    
    print("="*70)
    print("MULTIPLE TESTING DEMONSTRATION")
    print("="*70)
    print(f"\nTesting {n_random_features} random features (no true predictive power)")
    print(f"Sample size: {n_samples}")
    print(f"Significance level: α = {alpha}")
    
    print(f"\nResults WITHOUT correction:")
    print(f"  'Significant' features: {significant_uncorrected}")
    print(f"  Expected false positives: {n_random_features * alpha:.1f}")
    print(f"  → {significant_uncorrected/n_random_features*100:.1f}% of features appear significant!")
    
    print(f"\nResults WITH Bonferroni correction:")
    print(f"  Significant features: {significant_bonferroni}")
    print(f"  → Properly controls false positives")
    
    print(f"\nResults WITH FDR correction:")
    print(f"  Significant features: {significant_fdr}")
    print(f"  → Balances power and error control")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # P-value distribution
    axes[0].hist(p_values, bins=20, edgecolor='black', alpha=0.7)
    axes[0].axvline(alpha, color='red', linestyle='--', linewidth=2, label=f'α = {alpha}')
    axes[0].axvline(alpha_bonferroni, color='green', linestyle='--', linewidth=2, 
                    label=f'Bonferroni = {alpha_bonferroni:.4f}')
    axes[0].set_xlabel('P-value')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('P-value Distribution (Random Features)', fontweight='bold')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # Sorted p-values with FDR threshold
    sorted_p = sorted(p_values)
    ranks = np.arange(1, len(sorted_p) + 1)
    fdr_threshold = alpha * ranks / n_random_features
    
    axes[1].plot(ranks, sorted_p, 'o', markersize=4, alpha=0.6, label='P-values')
    axes[1].plot(ranks, fdr_threshold, 'r--', linewidth=2, label='FDR threshold')
    axes[1].set_xlabel('Rank')
    axes[1].set_ylabel('P-value')
    axes[1].set_title('Benjamini-Hochberg FDR Procedure', fontweight='bold')
    axes[1].legend()
    axes[1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return p_values, significant_uncorrected, significant_bonferroni, significant_fdr


# Run demonstration
p_vals, n_uncorr, n_bonf, n_fdr = demonstrate_multiple_testing_problem()

print("\n💡 Key Insight: Testing many features makes false discoveries almost inevitable")
print("   Multiple testing correction is ESSENTIAL for valid research!")
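
To see what the two corrections do, it can help to write them out by hand. The sketch below is a plain NumPy version of the Bonferroni rule and the Benjamini-Hochberg step-up procedure, applied to ten made-up p-values. It also computes the chance of at least one false positive across 100 independent tests at α = 0.05. The function names and the p-values are illustrative assumptions; in practice `multipletests` from statsmodels, as used above, is the safer choice.

```python
import numpy as np


def bonferroni_reject(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up BH: find the largest rank k with p_(k) <= alpha * k / m,
    then reject the k smallest p-values (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # zero-based index of the largest passing rank
        reject[order[: k + 1]] = True
    return reject


# Made-up p-values for illustration (deliberately unsorted)
p_values = [0.041, 0.001, 0.216, 0.012, 0.060, 0.008, 0.205, 0.042, 0.074, 0.212]

n_raw = int((np.asarray(p_values) < 0.05).sum())
n_bonf = int(bonferroni_reject(p_values).sum())
n_bh = int(benjamini_hochberg_reject(p_values).sum())

# Probability of at least one false positive in 100 independent null tests
fwer_100 = 1 - (1 - 0.05) ** 100

print(f"uncorrected: {n_raw}, Bonferroni: {n_bonf}, BH: {n_bh}")
print(f"P(at least one false positive, 100 tests): {fwer_100:.3f}")
```

Bonferroni compares every p-value with the same strict threshold, while BH lets the threshold grow with rank, which is why it typically rejects more hypotheses at the same α.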

### Implementing Combinatorial Purged Cross-Validation

In [6]:
def combinatorial_purged_cv(df, features, n_splits=5, embargo_pct=0.05):
    """
    Simplified take on Bailey & López de Prado's Combinatorial Purged
    Cross-Validation (a teaching sketch, not a full implementation)
    
    Steps:
    1. Create multiple train/test splits
    2. Purge training samples near test samples (prevent leakage)
    3. Add embargo period (account for label lag)
    4. Train model on each split
    5. Compute performance metrics
    6. Calculate a simplified Probability of Backtest Overfitting (PBO) proxy
    
    Args:
        df: Dataset with features and returns
        features: List of feature column names
        n_splits: Number of contiguous date groups used to build CV splits
        embargo_pct: Fraction of one split's length used as the gap between train/test
    """
    df = df.sort_values('date').reset_index(drop=True)
    dates = sorted(df['date'].unique())
    n_dates = len(dates)
    
    # Create split indices
    split_size = n_dates // n_splits
    embargo_periods = int(split_size * embargo_pct)
    
    # Enumerate combinations of splits to hold out as test sets
    # Use subset of all possible combinations (computational limit)
    split_indices = list(range(n_splits))
    n_test_splits = max(1, n_splits // 3)
    test_combinations = list(combinations(split_indices, n_test_splits))
    n_combinations = min(10, len(test_combinations))  # Limit combinations
    
    # Sample without replacement so no train/test combination is evaluated twice
    rng = np.random.default_rng(42)
    chosen = rng.choice(len(test_combinations), size=n_combinations, replace=False)
    
    results = []
    
    print("\n" + "="*70)
    print("COMBINATORIAL PURGED CROSS-VALIDATION")
    print("="*70)
    print(f"\nConfiguration:")
    print(f"  Splits: {n_splits}")
    print(f"  Embargo: {embargo_pct*100:.0f}% ({embargo_periods} periods)")
    print(f"  Testing {n_combinations} train/test combinations")
    
    for combo_idx, choice in enumerate(chosen):
        # Each selected combination is a distinct set of test splits
        test_splits = test_combinations[choice]
        train_splits = [s for s in split_indices if s not in test_splits]
        
        # Convert splits to date-index ranges
        test_idx = set()
        for split_idx in test_splits:
            start_idx = split_idx * split_size
            end_idx = min((split_idx + 1) * split_size, n_dates)
            test_idx.update(range(start_idx, end_idx))
        test_dates = {dates[i] for i in test_idx}
        
        train_dates = set()
        for split_idx in train_splits:
            start_idx = split_idx * split_size
            end_idx = min((split_idx + 1) * split_size, n_dates)
            
            # Purge/embargo: drop training dates within `embargo_periods`
            # periods of any test date. Measuring the gap in periods (not
            # calendar days) keeps the rule exact for month-end dates,
            # which are 28 to 31 days apart.
            for i in range(start_idx, end_idx):
                if all(abs(i - j) > embargo_periods for j in test_idx):
                    train_dates.add(dates[i])
        
        # Create train/test sets
        train_df = df[df['date'].isin(train_dates)]
        test_df = df[df['date'].isin(test_dates)]
        
        if len(train_df) < 100 or len(test_df) < 50:
            continue  # Skip if insufficient data
        
        # Train model
        X_train = train_df[features]
        y_train = train_df['return_1m']
        X_test = test_df[features]
        y_test = test_df['return_1m']
        
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        model = Ridge(alpha=1.0)
        model.fit(X_train_scaled, y_train)
        
        train_pred = model.predict(X_train_scaled)
        test_pred = model.predict(X_test_scaled)
        
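        # Annualized Sharpe-style ratio of the *predicted* returns (monthly data).
        # NOTE: simplified proxy; a realized Sharpe ratio would use the returns
        # of a portfolio traded on these predictions.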
        train_sharpe = np.mean(train_pred) / (np.std(train_pred) + 1e-6) * np.sqrt(12)
        test_sharpe = np.mean(test_pred) / (np.std(test_pred) + 1e-6) * np.sqrt(12)
        
        results.append({
            'combo': combo_idx,
            'train_sharpe': train_sharpe,
            'test_sharpe': test_sharpe,
            'train_size': len(train_df),
            'test_size': len(test_df)
        })
    
    results_df = pd.DataFrame(results)
    
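    # Simplified PBO proxy: share of test Sharpe values below the median train Sharpe.
    # (The original PBO definition instead ranks competing strategy
    # configurations in-sample and out-of-sample.)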
    median_train_sharpe = results_df['train_sharpe'].median()
    pbo = (results_df['test_sharpe'] < median_train_sharpe).mean()
    
    print(f"\n" + "-"*70)
    print("RESULTS")
    print("-"*70)
    print(f"\nCombinations tested: {len(results_df)}")
    print(f"Median train Sharpe: {median_train_sharpe:.4f}")
    print(f"Mean test Sharpe: {results_df['test_sharpe'].mean():.4f}")
    print(f"\nProbability of Backtest Overfitting (PBO): {pbo:.4f}")
    
    if pbo > 0.5:
        print(f"  ⚠️  HIGH OVERFITTING RISK! (PBO > 0.5)")
        print(f"     Strategy likely captured noise, not signal")
    elif pbo > 0.3:
        print(f"  ⚠️  MODERATE OVERFITTING RISK (PBO > 0.3)")
        print(f"     Proceed with caution, validate further")
    else:
        print(f"  ✓ LOW OVERFITTING RISK (PBO < 0.3)")
        print(f"     Strategy appears robust")
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Train vs Test Sharpe
    axes[0].scatter(results_df['train_sharpe'], results_df['test_sharpe'], alpha=0.6, s=50)
    axes[0].plot([results_df['train_sharpe'].min(), results_df['train_sharpe'].max()],
                 [results_df['train_sharpe'].min(), results_df['train_sharpe'].max()],
                 'r--', label='45° line')
    axes[0].axvline(median_train_sharpe, color='green', linestyle='--', label='Median train')
    axes[0].set_xlabel('Train Sharpe Ratio')
    axes[0].set_ylabel('Test Sharpe Ratio')
    axes[0].set_title('Train vs Test Performance', fontweight='bold')
    axes[0].legend()
    axes[0].grid(alpha=0.3)
    
    # Distribution of test Sharpe
    axes[1].hist(results_df['test_sharpe'], bins=15, edgecolor='black', alpha=0.7)
    axes[1].axvline(median_train_sharpe, color='green', linestyle='--', linewidth=2,
                    label=f'Median train = {median_train_sharpe:.3f}')
    axes[1].axvline(results_df['test_sharpe'].mean(), color='red', linestyle='--', linewidth=2,
                    label=f'Mean test = {results_df["test_sharpe"].mean():.3f}')
    axes[1].set_xlabel('Test Sharpe Ratio')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title(f'Test Performance Distribution (PBO={pbo:.3f})', fontweight='bold')
    axes[1].legend()
    axes[1].grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return results_df, pbo


# Run combinatorial purged CV
feature_cols = [c for c in factor_data_with_features.columns 
                if c not in ['date', 'asset_id', 'return_1m']]

cv_results, pbo = combinatorial_purged_cv(
    factor_data_with_features,
    feature_cols,
    n_splits=5,
    embargo_pct=0.05
)
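
The Sharpe figures above are computed from the predictions themselves, so they never touch realized returns. One way to measure performance that does depend on outcomes is to trade on the predictions: each month, go long the assets with the highest predictions and short those with the lowest, then compute the Sharpe ratio of the realized return spread. The sketch below assumes equal weights, no transaction costs, and a hypothetical helper `long_short_sharpe`. The demo data is synthetic and unrelated to the lab's dataset.

```python
import numpy as np
import pandas as pd


def long_short_sharpe(df, pred_col="prediction", ret_col="return_1m",
                      quantile=0.2, periods_per_year=12):
    """Annualized Sharpe ratio of a top-minus-bottom quantile portfolio."""
    spreads = []
    for _, group in df.groupby("date"):
        low_cut = group[pred_col].quantile(quantile)
        high_cut = group[pred_col].quantile(1 - quantile)
        long_leg = group.loc[group[pred_col] >= high_cut, ret_col].mean()
        short_leg = group.loc[group[pred_col] <= low_cut, ret_col].mean()
        spreads.append(long_leg - short_leg)  # realized return for this period
    spreads = np.asarray(spreads)
    return float(spreads.mean() / spreads.std(ddof=1) * np.sqrt(periods_per_year))


# Synthetic demo: one informative signal and one pure-noise signal
rng = np.random.default_rng(0)
n_dates, n_assets = 60, 50
n_rows = n_dates * n_assets
signal = rng.standard_normal(n_rows)

demo = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n_dates, freq="MS").repeat(n_assets),
    "return_1m": 0.05 * signal + 0.10 * rng.standard_normal(n_rows),
    "informed": signal,
    "noise": rng.standard_normal(n_rows),
})

sharpe_informed = long_short_sharpe(demo, pred_col="informed")
sharpe_noise = long_short_sharpe(demo, pred_col="noise")
print(f"informed signal Sharpe: {sharpe_informed:.2f}")
print(f"noise signal Sharpe:    {sharpe_noise:.2f}")
```

Inside the CV loop, this would be applied to `test_df` after adding `test_pred` as a column. The synthetic signal here is far stronger than anything found in real markets, so only the contrast between the two signals is meaningful.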

### Reflection Questions (Exercise 2)

Write 250-300 words addressing:

1.  **Multiple Testing Impact**: In the demonstration, ~5 features
    appeared significant by chance. How would this affect a researcher
    who tests 100 features and publishes the “significant” ones? What
    are the consequences for financial markets if overfit strategies are
    widely adopted?

2.  **PBO Interpretation**: As implemented in this lab, the Probability
    of Backtest Overfitting is the fraction of train/test combinations
    whose test Sharpe ratio falls below the median training Sharpe
    ratio. Why is this a useful metric for detecting overfitting? What
    PBO threshold should trigger concern?

3.  **Purging and Embargo**: Why do we purge training samples near test
    periods and add embargo gaps? What would happen if we skipped these
    steps? How do these relate to the temporal structure of financial
    data?
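
For comparison with the lab's simplified metric, PBO is usually defined over a set of competing strategy configurations: pick the configuration with the best in-sample Sharpe ratio, then ask how often it ranks at or below the median out-of-sample. The sketch below follows that idea using combinatorially symmetric splits of a returns matrix. The function names, block count, and synthetic returns are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def column_sharpe(returns: np.ndarray) -> np.ndarray:
    """Per-strategy Sharpe ratio (not annualized; only the ranking matters)."""
    return returns.mean(axis=0) / (returns.std(axis=0, ddof=1) + 1e-12)


def pbo_cscv(returns: np.ndarray, n_blocks: int = 8) -> float:
    """Estimate PBO from a (periods x strategies) matrix of returns."""
    n_periods, n_strategies = returns.shape
    blocks = np.array_split(np.arange(n_periods), n_blocks)
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = np.concatenate([blocks[b] for b in is_ids])
        oos_rows = np.concatenate(
            [blocks[b] for b in range(n_blocks) if b not in is_ids]
        )
        best = int(np.argmax(column_sharpe(returns[is_rows])))  # in-sample winner
        oos = column_sharpe(returns[oos_rows])
        # Relative out-of-sample rank of the winner, strictly inside (0, 1)
        rank = int((oos < oos[best]).sum()) + 1
        omega = rank / (n_strategies + 1)
        logits.append(np.log(omega / (1 - omega)))
    # PBO: share of splits where the winner lands at or below the OOS median
    return float(np.mean(np.asarray(logits) <= 0))


rng = np.random.default_rng(0)
noise_returns = rng.normal(0.0, 0.02, size=(240, 20))  # 20 strategies, no skill
edge_returns = noise_returns.copy()
edge_returns[:, 0] += 0.02                             # strategy 0 has a real edge

pbo_noise = pbo_cscv(noise_returns)
pbo_edge = pbo_cscv(edge_returns)
print(f"PBO, pure noise strategies:   {pbo_noise:.2f}")
print(f"PBO, one strategy with edge:  {pbo_edge:.2f}")
```

When one strategy has a genuine edge, it wins in-sample and stays on top out-of-sample, so PBO falls toward zero. When every strategy is noise, the in-sample winner has no reason to repeat.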

## Exercise 3: Production Monitoring and Drift Detection

### Simulating Production Deployment
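
The monitor below approximates PSI with a scaled z-score of the mean shift and notes that production code should use binned distributions. As a point of reference, a binned PSI can be sketched as follows. The function name, the choice of ten quantile bins, and the `eps` floor for empty bins are assumptions made for this illustration.

```python
import numpy as np


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """Binned PSI: sum((cur% - base%) * ln(cur% / base%)) over quantile bins."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges from baseline quantiles, so each bin holds
    # roughly 1 / n_bins of the baseline sample
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1)[1:-1])

    # searchsorted maps every value to a bin in 0..n_bins-1, including
    # values outside the baseline range (first or last bin)
    base_counts = np.bincount(np.searchsorted(edges, baseline), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=n_bins)

    # Floor the proportions to avoid log(0) for empty bins
    base_pct = np.clip(base_counts / baseline.size, eps, None)
    cur_pct = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_pct - base_pct) * np.log(cur_pct / base_pct)))


rng = np.random.default_rng(0)
baseline_sample = rng.standard_normal(5000)
stable_sample = rng.standard_normal(5000)         # same distribution
shifted_sample = rng.standard_normal(5000) + 1.0  # mean shifted by one std

psi_stable = population_stability_index(baseline_sample, stable_sample)
psi_shifted = population_stability_index(baseline_sample, shifted_sample)
print(f"PSI, stable feature:  {psi_stable:.4f}")
print(f"PSI, shifted feature: {psi_shifted:.4f}")
```

Unlike the mean-shift proxy, a binned PSI also reacts to changes in spread or shape that leave the mean unchanged.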

In [7]:
class ProductionMonitor:
    """
    Monitor model performance and detect drift in production
    """
    
    def __init__(self, model, scaler, features, baseline_data):
        self.model = model
        self.scaler = scaler
        self.features = features
        
        # Calculate baseline statistics for drift detection
        self.baseline_mean = baseline_data[features].mean()
        self.baseline_std = baseline_data[features].std()
        
        # Performance tracking
        self.performance_history = []
        
    def predict_and_monitor(self, df):
        """
        Make predictions and monitor for drift/degradation
        """
        X = df[self.features]
        X_scaled = self.scaler.transform(X)
        predictions = self.model.predict(X_scaled)
        
        # Calculate performance (when ground truth available)
        if 'return_1m' in df.columns:
            mse = np.mean((df['return_1m'] - predictions) ** 2)
            mae = np.mean(np.abs(df['return_1m'] - predictions))
            corr = np.corrcoef(df['return_1m'], predictions)[0, 1]
            
            # Sharpe-style ratio of the predictions (proxy, not realized strategy returns)
            sharpe = np.mean(predictions) / (np.std(predictions) + 1e-6) * np.sqrt(12)
            
            self.performance_history.append({
                'timestamp': datetime.now(),
                'n_samples': len(df),
                'mse': mse,
                'mae': mae,
                'correlation': corr,
                'sharpe': sharpe
            })
        
        # Detect data drift (Population Stability Index)
        psi_scores = self._calculate_psi(df[self.features])
        
        # Alert if drift detected
        alerts = []
        for feature, psi in psi_scores.items():
            if psi > 0.25:  # Significant drift threshold
                alerts.append(f"⚠️  DRIFT ALERT: {feature} (PSI={psi:.3f})")
        
        return predictions, psi_scores, alerts
    
    def _calculate_psi(self, current_data):
        """
        Approximate a PSI-style drift score for each feature
        
        PSI (Population Stability Index) measures how much the distribution has shifted
        PSI < 0.1: No significant change
        0.1 < PSI < 0.25: Moderate change
        PSI > 0.25: Significant change
        
        NOTE: this method uses a scaled mean-shift z-score as a rough
        stand-in, not a true binned PSI.
        """
        psi_scores = {}
        
        for feature in self.features:
            # Compare the current mean against the baseline mean and spread
            baseline_mean = self.baseline_mean[feature]
            baseline_std = self.baseline_std[feature]
            current_mean = current_data[feature].mean()
            
            # Simplified drift score
            # In production, use proper binned distributions
            if baseline_std > 0:
                z_score = abs(current_mean - baseline_mean) / baseline_std
                psi = z_score / 10  # Rough proxy, not a true PSI
            else:
                psi = 0
            
            psi_scores[feature] = psi
        
        return psi_scores
    
    def generate_monitoring_report(self):
        """Generate monitoring dashboard"""
        if not self.performance_history:
            print("No performance data available yet")
            return
        
        perf_df = pd.DataFrame(self.performance_history)
        
        print("\n" + "="*70)
        print("PRODUCTION MONITORING REPORT")
        print("="*70)
        
        print(f"\nPerformance Summary (last {len(perf_df)} evaluations):")
        print(f"  MSE:  mean={perf_df['mse'].mean():.6f}, std={perf_df['mse'].std():.6f}")
        print(f"  MAE:  mean={perf_df['mae'].mean():.6f}, std={perf_df['mae'].std():.6f}")
        print(f"  Corr: mean={perf_df['correlation'].mean():.4f}, std={perf_df['correlation'].std():.4f}")
        print(f"  Sharpe: mean={perf_df['sharpe'].mean():.4f}, std={perf_df['sharpe'].std():.4f}")
        
        # Check for degradation
        if len(perf_df) >= 5:
            recent_sharpe = perf_df['sharpe'].iloc[-3:].mean()
            historical_sharpe = perf_df['sharpe'].iloc[:-3].mean()
            
            if recent_sharpe < historical_sharpe - 0.5:
                print(f"\n⚠️  PERFORMANCE DEGRADATION DETECTED!")
                print(f"   Recent Sharpe: {recent_sharpe:.4f}")
                print(f"   Historical Sharpe: {historical_sharpe:.4f}")
                print(f"   Consider retraining or rolling back model")
        
        # Visualize performance over time
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        # MSE over time
        axes[0, 0].plot(range(len(perf_df)), perf_df['mse'], marker='o', linewidth=2)
        axes[0, 0].set_xlabel('Evaluation Period')
        axes[0, 0].set_ylabel('MSE')
        axes[0, 0].set_title('Mean Squared Error Over Time', fontweight='bold')
        axes[0, 0].grid(alpha=0.3)
        
        # Correlation over time
        axes[0, 1].plot(range(len(perf_df)), perf_df['correlation'], marker='o', linewidth=2, color='green')
        axes[0, 1].axhline(0, color='red', linestyle='--', alpha=0.3)
        axes[0, 1].set_xlabel('Evaluation Period')
        axes[0, 1].set_ylabel('Correlation')
        axes[0, 1].set_title('Prediction-Actual Correlation Over Time', fontweight='bold')
        axes[0, 1].grid(alpha=0.3)
        
        # Sharpe ratio over time
        axes[1, 0].plot(range(len(perf_df)), perf_df['sharpe'], marker='o', linewidth=2, color='purple')
        axes[1, 0].axhline(0, color='red', linestyle='--', alpha=0.3)
        axes[1, 0].set_xlabel('Evaluation Period')
        axes[1, 0].set_ylabel('Sharpe Ratio')
        axes[1, 0].set_title('Sharpe Ratio Over Time', fontweight='bold')
        axes[1, 0].grid(alpha=0.3)
        
        # Performance distribution
        axes[1, 1].hist(perf_df['sharpe'], bins=10, edgecolor='black', alpha=0.7, color='orange')
        axes[1, 1].axvline(perf_df['sharpe'].mean(), color='red', linestyle='--', linewidth=2,
                          label=f'Mean = {perf_df["sharpe"].mean():.3f}')
        axes[1, 1].set_xlabel('Sharpe Ratio')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].set_title('Sharpe Ratio Distribution', fontweight='bold')
        axes[1, 1].legend()
        axes[1, 1].grid(alpha=0.3)
        
        plt.tight_layout()
        plt.show()


# Demonstrate production monitoring
print("\n" + "="*70)
print("PRODUCTION DEPLOYMENT SIMULATION")
print("="*70)

# Use trained model from Exercise 1
all_dates = sorted(factor_data_with_features['date'].unique())
test_start_idx = int(len(all_dates) * 0.8)
test_dates = all_dates[test_start_idx:]

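# Create baseline from training data (strictly before the first test date,
# so the baseline never overlaps the simulated production period)
train_data = factor_data_with_features[
    factor_data_with_features['date'] < all_dates[test_start_idx]
]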

# Initialize monitor
monitor = ProductionMonitor(model, scaler, feature_cols, train_data)

# Simulate production: process data in monthly batches
print("\nSimulating production deployment with monthly monitoring:")
print("-"*70)

for i, date in enumerate(test_dates):
    batch_data = factor_data_with_features[factor_data_with_features['date'] == date]
    
    predictions, psi_scores, alerts = monitor.predict_and_monitor(batch_data)
    
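    # pd.Timestamp() makes strftime work whether `date` is a Timestamp or numpy.datetime64
    print(f"\nPeriod {i+1} ({pd.Timestamp(date).strftime('%Y-%m')}): {len(batch_data)} predictions")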
    
    # Show drift warnings
    max_psi = max(psi_scores.values())
    if max_psi > 0.25:
        print(f"  ⚠️  High drift detected (max PSI={max_psi:.3f})")
        for alert in alerts[:3]:  # Show top 3
            print(f"     {alert}")
    else:
        print(f"  ✓ No significant drift (max PSI={max_psi:.3f})")

# Generate comprehensive monitoring report
monitor.generate_monitoring_report()
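
The `_calculate_psi` method above uses a z-score shortcut and notes that production systems should use binned distributions. A minimal sketch of that binned version follows, using PSI = Σ (p_current − p_baseline) · ln(p_current / p_baseline) over bins. The function name `population_stability_index`, the choice of ten quantile bins, and the `eps` floor are assumptions for illustration. So is the requirement that the monitor keep a sample of baseline values per feature, rather than only the mean and standard deviation used in the code above.

```python
import numpy as np

def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """Binned PSI between a baseline sample and a current sample of one feature."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)

    # Bin edges from baseline quantiles, so each bin holds roughly 1/n_bins
    # of the baseline observations; duplicate edges (tied values) are merged.
    edges = np.unique(np.quantile(baseline, np.linspace(0, 1, n_bins + 1)))

    # Assign by interior edges only, so values outside the baseline range
    # fall into the first or last bin instead of being dropped.
    interior = edges[1:-1]
    n = len(interior) + 1
    base_counts = np.bincount(np.searchsorted(interior, baseline, side="right"), minlength=n)
    curr_counts = np.bincount(np.searchsorted(interior, current, side="right"), minlength=n)

    # Floor the proportions at eps to avoid log(0) for empty bins
    base_pct = np.clip(base_counts / base_counts.sum(), eps, None)
    curr_pct = np.clip(curr_counts / curr_counts.sum(), eps, None)

    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))


# Illustrative data only: one unshifted batch and one with a shifted mean
rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
psi_same = population_stability_index(baseline, rng.normal(0, 1, 5000))
psi_shift = population_stability_index(baseline, rng.normal(1, 1, 5000))
print(f"PSI, same distribution: {psi_same:.3f}")
print(f"PSI, mean shifted by one standard deviation: {psi_shift:.3f}")
```

Unlike the z-score proxy, this version also reacts to changes in spread or shape that leave the mean unchanged. Monthly batches with few observations will give noisy PSI values, so the 0.1 and 0.25 thresholds should be read with the batch size in mind.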

### Reflection Questions (Exercise 3)

Write 200-250 words addressing:

1.  **Monitoring Strategy**: What metrics should production ML systems
    monitor? Why track both model performance (MSE, correlation) and
    data characteristics (PSI, drift)? How quickly should monitoring
    systems detect issues?

2.  **Drift Response**: When data drift is detected, what actions should
    be taken? When is retraining necessary versus other interventions
    (feature engineering, model rollback, investigation)? What are the
    risks of over-reacting to drift versus under-reacting?

3.  **Production vs Research**: This lab demonstrated many differences
    between research ML and production ML. What are the three most
    important differences? How do these differences affect how
    organizations should structure their ML teams and processes?

## Summary and Integration

### What We’ve Learned

Through these exercises, you’ve:

1.  **Implemented an end-to-end ML pipeline** with data ingestion,
    feature engineering, training, and versioning

2.  **Ensured temporal correctness**, preventing the look-ahead bias
    that produces unrealistic backtest performance

3.  **Applied multiple testing corrections** to guard against false
    discoveries when testing many features

4.  **Implemented combinatorial purged CV** with proper train/test
    separation for time series

5.  **Calculated PBO** to quantify the probability that backtest
    performance resulted from overfitting

6.  **Monitored a production deployment**, detecting drift and
    performance degradation

7.  **Seen how pervasive overfitting is** in financial ML and why it
    demands rigorous validation

### Connections to Course Themes

-   **Week 9 (Smart Contracts)**: Both smart contracts and ML models
    make consequential decisions, so both require rigorous testing
    before deployment

-   **Week 8 (Fraud Detection)**: Production fraud detection exemplifies
    real-time ML pipelines with monitoring

-   **Week 4 (Robo-Advisors)**: Portfolio optimization at scale requires
    production ML infrastructure

-   **Throughout course**: Evidence-based evaluation—don’t trust
    impressive backtests without rigorous validation

### Critical Evaluation Framework

When evaluating ML systems or research:

1.  **Validation rigor**: Are multiple testing corrections applied? Is
    out-of-sample validation proper?
2.  **Temporal correctness**: Are features point-in-time correct? Any
    look-ahead bias?
3.  **Production readiness**: Is there monitoring? Drift detection?
    Rollback capability?
4.  **Reproducibility**: Can results be reproduced? Is data/code
    versioned?
5.  **Business value**: Do model improvements translate to business
    impact?

### Assessment Preparation

**FIN510 Coursework 2**: Apply rigorous backtesting to factor
replication—use multiple testing corrections, implement proper temporal
splits, calculate PBO, validate out-of-sample.

**FIN720**: Critically evaluate ML applications in finance—assess
validation rigor, production readiness, and gap between research claims
and deployment reality.

### Further Exploration

If interested in extending your analysis:

-   **Advanced feature stores**: Implement Feast or Tecton for feature
    consistency
-   **Continuous training**: Automate retraining pipelines triggered by
    drift detection
-   **Fairness monitoring**: Track model performance across demographic
    segments
-   **Explainability**: Implement SHAP or LIME for model
    interpretability
-   **Regulatory compliance**: Document models following SR 11-7
    framework

------------------------------------------------------------------------

## Advanced Extension: Asset Embeddings as Learned Features (FIN720)

> **Optional Advanced Module**
>
> This extension is designed for FIN720 students seeking to demonstrate
> technical sophistication. It connects to cutting-edge research (Gabaix
> et al. 2025) and provides an optional enhancement for Coursework 2.
>
> **Expected time**: 45-60 minutes

### Motivation: Beyond Hand-Crafted Characteristics

Traditional factor models use characteristics we believe matter—size,
value, momentum. But what if we learned representations directly from
portfolio holdings? This exercise demonstrates how embedding techniques
from NLP translate to finance, bridging Week 4’s discussion of
algorithmic investment with production ML concepts.

The key insight from Gabaix et al. (2025): portfolio holdings encode
rich information about asset relationships. Assets appearing in similar
portfolios likely share investment characteristics—just as words
appearing in similar contexts have related meanings. We can learn these
relationships without hand-crafting features.

### Simplified Holdings-Based Embeddings

We’ll implement a lightweight version using synthetic mutual fund
holdings:

In [8]:
# === Asset Embeddings Implementation ===
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

print("\n" + "="*70)
print("ASSET EMBEDDINGS: LEARNING FROM PORTFOLIO HOLDINGS")
print("="*70)

# Generate synthetic portfolio holdings
# In practice, use 13F filings or mutual fund holdings data
np.random.seed(42)

n_funds = 100  # Number of institutional portfolios
n_assets = 50  # Number of assets in universe

print(f"\nGenerating synthetic holdings data:")
print(f"  Portfolios: {n_funds}")
print(f"  Assets: {n_assets}")

# Create holdings matrix using latent factor structure
# This simulates reality: portfolios share common themes (value, growth, sector)
n_latent_factors = 5
fund_loadings = np.random.randn(n_funds, n_latent_factors)
asset_loadings = np.random.randn(n_assets, n_latent_factors)

# Holdings as exp(fund_factor • asset_factor) for log-normality
holdings_raw = np.exp(fund_loadings @ asset_loadings.T)

# Normalize to portfolio weights (each row sums to 1)
holdings = holdings_raw / holdings_raw.sum(axis=1, keepdims=True)

# Create DataFrame
asset_names = [f"ASSET_{i:02d}" for i in range(n_assets)]
fund_names = [f"FUND_{i:03d}" for i in range(n_funds)]
holdings_df = pd.DataFrame(holdings, index=fund_names, columns=asset_names)

print("\nSample Holdings (first 5 funds × first 5 assets):")
print(holdings_df.iloc[:5, :5].round(4))
print(f"\nPortfolio weight verification (each row should sum to ≈1.0):")
print(f"  Min: {holdings_df.sum(axis=1).min():.4f}")
print(f"  Max: {holdings_df.sum(axis=1).max():.4f}")

### Method 1: PCA-Based Embeddings (Recommender System Approach)

In [9]:
class PortfolioEmbeddings:
    """
    Learn asset embeddings from portfolio holdings via PCA.
    
    This is analogous to matrix factorization in recommender systems
    (e.g., Netflix learning movie embeddings from user ratings).
    """
    
    def __init__(self, n_components=10):
        self.n_components = n_components
        self.scaler = StandardScaler()
        self.pca = PCA(n_components=n_components)
        self.asset_embeddings_ = None
        self.explained_variance_ = None
        
    def fit(self, holdings_df):
        """
        Learn embeddings from holdings matrix.
        
        Args:
            holdings_df: DataFrame (funds × assets) of portfolio weights
        
        Returns:
            self
        """
        # Transpose: we want to represent assets in terms of fund holdings
        # Each asset is characterized by which funds hold it
        X = holdings_df.T  # Now assets × funds
        
        # Standardize (mean-center and scale)
        X_scaled = self.scaler.fit_transform(X)
        
        # Fit PCA: find principal components
        self.pca.fit(X_scaled)
        
        # Asset embeddings are projections onto principal components
        self.asset_embeddings_ = pd.DataFrame(
            self.pca.transform(X_scaled),
            index=holdings_df.columns,  # Asset names
            columns=[f"embed_{i}" for i in range(self.n_components)]
        )
        
        self.explained_variance_ = self.pca.explained_variance_ratio_
        
        return self
    
    def similarity(self, asset1, asset2):
        """Compute cosine similarity between two assets."""
        emb1 = self.asset_embeddings_.loc[asset1].values
        emb2 = self.asset_embeddings_.loc[asset2].values
        return 1 - cosine(emb1, emb2)
    
    def most_similar(self, asset, top_n=5):
        """Find most similar assets to a given asset."""
        target_emb = self.asset_embeddings_.loc[asset].values
        
        similarities = {}
        for other_asset in self.asset_embeddings_.index:
            if other_asset != asset:
                other_emb = self.asset_embeddings_.loc[other_asset].values
                similarities[other_asset] = 1 - cosine(target_emb, other_emb)
        
        return pd.Series(similarities).nlargest(top_n)

# Fit embeddings
print("\n" + "="*70)
print("LEARNING EMBEDDINGS")
print("="*70)

embedder = PortfolioEmbeddings(n_components=10)
embedder.fit(holdings_df)

print(f"\nLearned {embedder.n_components}-dimensional embeddings for {len(embedder.asset_embeddings_)} assets")
print(f"\nVariance explained by components:")
for i, var in enumerate(embedder.explained_variance_[:5]):
    print(f"  Component {i+1}: {var:.1%}")
print(f"  Total (first 5): {embedder.explained_variance_[:5].sum():.1%}")

print("\nExample: Most similar assets to ASSET_00:")
print(embedder.most_similar("ASSET_00", top_n=5).round(4))

### Validation: Do Embeddings Predict Co-movement?

The critical test: do assets with similar embeddings have correlated
returns? If embeddings only compress noise, similarity won’t predict
co-movement. If they capture real structure, similar assets should move
together.

In [10]:
print("\n" + "="*70)
print("VALIDATION: EMBEDDINGS vs RETURN COMOVEMENT")
print("="*70)

# Generate synthetic returns using same factor structure
# This creates returns with known correlation structure
returns = asset_loadings @ np.random.randn(n_latent_factors, 120)  # 120 months
returns_df = pd.DataFrame(returns.T, columns=asset_names)

# Add idiosyncratic noise
returns_df = returns_df + np.random.randn(*returns_df.shape) * 0.5

print(f"\nGenerated {len(returns_df)} months of returns for {len(asset_names)} assets")

# Compute actual return correlations
return_corr = returns_df.corr()

# Compute embedding-based similarities for all pairs
embedding_sim = pd.DataFrame(
    index=asset_names,
    columns=asset_names,
    dtype=float
)

for i, asset1 in enumerate(asset_names):
    for j, asset2 in enumerate(asset_names):
        if i == j:
            embedding_sim.loc[asset1, asset2] = 1.0
        else:
            embedding_sim.loc[asset1, asset2] = embedder.similarity(asset1, asset2)

# Extract upper triangle (avoid double-counting pairs)
upper_tri_idx = np.triu_indices_from(return_corr.values, k=1)
actual_corrs = return_corr.values[upper_tri_idx]
embedding_sims = embedding_sim.astype(float).values[upper_tri_idx]

# Compute validation metric
correlation = np.corrcoef(actual_corrs, embedding_sims)[0, 1]
r_squared = correlation ** 2

print(f"\nValidation Results:")
print(f"  Correlation between embedding similarity and return correlation: {correlation:.3f}")
print(f"  R² (squared correlation across asset pairs): {r_squared:.1%}")
print(f"\n  Interpretation: embedding similarity explains {r_squared:.1%} of the variation in pairwise return correlations")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot: embeddings vs. correlations
axes[0].scatter(actual_corrs, embedding_sims, alpha=0.3, s=20)
axes[0].set_xlabel("Actual Return Correlation", fontsize=11)
axes[0].set_ylabel("Embedding Similarity", fontsize=11)
axes[0].set_title("Embeddings vs. Return Comovement", fontsize=12, fontweight='bold')
axes[0].plot([-1, 1], [-1, 1], 'r--', linewidth=2, label='Perfect Prediction')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].text(0.05, 0.95, f"Correlation: {correlation:.3f}\nR²: {r_squared:.1%}",
             transform=axes[0].transAxes, verticalalignment='top', fontsize=10,
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

# 2D visualization of embedding space
embeddings_2d = TSNE(n_components=2, random_state=42, perplexity=min(30, n_assets-1)).fit_transform(
    embedder.asset_embeddings_.values
)

axes[1].scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.6, s=50)
axes[1].set_xlabel("Embedding Dimension 1 (t-SNE)", fontsize=11)
axes[1].set_ylabel("Embedding Dimension 2 (t-SNE)", fontsize=11)
axes[1].set_title("Asset Embedding Space", fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Annotate a few assets
for i in [0, 10, 20, 30, 40]:
    if i < len(asset_names):
        axes[1].annotate(asset_names[i], 
                         xy=(embeddings_2d[i, 0], embeddings_2d[i, 1]),
                         xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

print("\n✓ Validation complete")
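
The nested loop above calls `embedder.similarity` once per ordered pair, which gets slow as the asset universe grows. A vectorized alternative is sketched below with scikit-learn's `cosine_similarity`. The small random `emb` frame is only a stand-in for `embedder.asset_embeddings_`, so the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for embedder.asset_embeddings_ (assets x embedding dimensions);
# the random values are illustrative only.
rng = np.random.default_rng(0)
names = [f"ASSET_{i:02d}" for i in range(6)]
emb = pd.DataFrame(rng.normal(size=(6, 3)), index=names)

# One call returns the full assets x assets cosine-similarity matrix
sim = pd.DataFrame(cosine_similarity(emb.values), index=names, columns=names)

# Spot-check one pair against the scipy-based formula used in the lab
pair_check = 1 - cosine(emb.loc["ASSET_00"].values, emb.loc["ASSET_01"].values)
print(np.isclose(sim.loc["ASSET_00", "ASSET_01"], pair_check))
```

In the lab itself, the same call on `embedder.asset_embeddings_.values` should reproduce `embedding_sim` up to floating-point error, including the ones on the diagonal.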

### Discussion Questions (Advanced Extension)

Write 250-300 words addressing:

1.  **Representation Learning**: How do embeddings differ from
    traditional factor exposures (e.g., Fama-French loadings)? What
    information might embeddings capture that hand-crafted
    characteristics miss? What are the trade-offs?

2.  **Temporal Correctness**: 13F holdings data update quarterly and
    may be filed up to 45 days after quarter end. How does this lag
    affect using embeddings for:

    -   Daily portfolio rebalancing?
    -   Monthly factor construction?
    -   Long-term strategic allocation?

    How would you construct point-in-time embeddings avoiding look-ahead
    bias?

3.  **Production Considerations**:

    -   How often should embeddings be retrained as new holdings data
        arrives?
    -   What drift metrics would detect when embeddings become stale?
    -   How computationally expensive is embedding training compared to
        computing characteristics?

4.  **Governance & Explainability**: Imagine presenting to a risk
    committee:

    -   “We recommend buying TSLA because it has high embedding
        similarity to AAPL.”
    -   How would you make this interpretable and defensible?
    -   What documentation would satisfy regulators?
    -   When should opaque-but-accurate embeddings be preferred over
        transparent-but-limited characteristics?

5.  **Connection to Coursework 2**: Could embeddings improve your factor
    replication?

    -   Use first few PCA components as additional factors in alpha
        tests
    -   Compare characteristic-based vs. embedding-based portfolio
        construction
    -   Evaluate out-of-sample: do embeddings generalize or overfit?
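
As a starting point for the point-in-time part of question 2, the sketch below filters holdings by when they became public rather than by the quarter they describe. The table layout, the column names (`report_date`, `available_date`), and the helper `holdings_as_of` are hypothetical. The fixed 45-day delay is a simplifying assumption; with real data, the actual filing date of each report would be the safer availability stamp.

```python
import pandas as pd

# Hypothetical long-format holdings: one row per (fund, asset, quarter-end report)
filings = pd.DataFrame({
    "fund": ["FUND_A", "FUND_A", "FUND_A", "FUND_A"],
    "asset": ["X", "Y", "X", "Y"],
    "weight": [0.6, 0.4, 0.3, 0.7],
    "report_date": pd.to_datetime(
        ["2023-12-31", "2023-12-31", "2024-03-31", "2024-03-31"]
    ),
})

# Assumption: every report becomes public exactly 45 days after quarter end
filings["available_date"] = filings["report_date"] + pd.Timedelta(days=45)


def holdings_as_of(filings, as_of):
    """Latest holdings per fund that were already public on `as_of`."""
    visible = filings[filings["available_date"] <= pd.Timestamp(as_of)]
    # Keep only each fund's most recent visible report
    latest = visible.groupby("fund")["report_date"].transform("max")
    snapshot = visible[visible["report_date"] == latest]
    return snapshot.pivot(index="fund", columns="asset", values="weight")


# On 30 April the Q1 report (public from 15 May) must not be used yet
snap_april = holdings_as_of(filings, "2024-04-30")
snap_june = holdings_as_of(filings, "2024-06-30")
print(snap_april)
print(snap_june)
```

Embeddings fitted on `holdings_as_of(filings, t)` use only information available at date `t`. How stale that information is at each rebalancing frequency is the judgment the question asks for.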

### Extension for Ambitious Students

If you want to push further, implement a **masked-asset prediction
task**:

1.  For each portfolio, randomly mask (hide) 10% of holdings
2.  Train model to predict masked assets from observed holdings
3.  Evaluate: Do top-10 predictions include actual masked assets?
4.  Compare: PCA baseline vs. simple neural network

This mirrors the core methodology of Gabaix et al. (2025) and provides
hands-on experience with the masked prediction objective used in
transformer models (BERT-style).
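
Steps 1 to 3 can be sketched with a low-rank (PCA-style) reconstruction as the baseline model. This is a hedged illustration, not the implementation from Gabaix et al. (2025). It assumes binary synthetic holdings (a fund either holds an asset or not), a rank-5 reconstruction, and a hypothetical helper `top_n_hit_rate`. The neural-network comparison in step 4 is left as the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)
n_funds, n_assets, n_latent = 200, 50, 5

# Synthetic binary holdings with low-rank structure (illustrative only):
# each fund holds the 30% of assets with the highest latent affinity.
affinity = rng.normal(size=(n_funds, n_latent)) @ rng.normal(size=(n_assets, n_latent)).T
cutoff = np.quantile(affinity, 0.7, axis=1, keepdims=True)
held = (affinity > cutoff).astype(float)

# Step 1: hide roughly 10% of each fund's holdings
observed = held.copy()
masked = []  # masked asset indices, one array per fund
for f in range(n_funds):
    held_idx = np.flatnonzero(held[f])
    n_mask = max(1, int(round(0.1 * len(held_idx))))
    hide = rng.choice(held_idx, size=n_mask, replace=False)
    observed[f, hide] = 0.0
    masked.append(hide)

# Step 2: fit the baseline on observed holdings only; the rank-k SVD of the
# column-centered matrix is the PCA reconstruction of each fund's holdings.
col_mean = observed.mean(axis=0)
U, s, Vt = np.linalg.svd(observed - col_mean, full_matrices=False)
k = n_latent
scores = (U[:, :k] * s[:k]) @ Vt[:k] + col_mean


def top_n_hit_rate(scores, observed, masked, top_n=10):
    """Share of masked assets that appear among each fund's top-n predictions."""
    hits, total = 0, 0
    for f, hide in enumerate(masked):
        # Rank only assets the fund is not already observed to hold
        candidate = np.where(observed[f] > 0, -np.inf, scores[f])
        top = np.argsort(candidate)[::-1][:top_n]
        hits += int(np.isin(hide, top).sum())
        total += len(hide)
    return hits / total


# Step 3: compare against random scores as a sanity floor
pca_hit = top_n_hit_rate(scores, observed, masked)
random_hit = top_n_hit_rate(rng.normal(size=scores.shape), observed, masked)
print(f"Top-10 hit rate, PCA baseline:  {pca_hit:.2f}")
print(f"Top-10 hit rate, random scores: {random_hit:.2f}")
```

The random-score floor matters here: with about 37 candidate assets per fund, picking ten at random already recovers a sizeable share of the masked holdings, so a model's hit rate is only meaningful relative to that floor.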

### Key Takeaways

**Connections to Production ML:**

-   Embeddings are learned features, automatically extracted from data
-   Must retrain as holdings data arrives (temporal pipeline
    consideration)
-   Require validation (do they predict out-of-sample outcomes?)
-   Trade interpretability for potentially richer representations

**Connections to Week 4 (Robo-Advisors):**

-   Embeddings enable recommendation systems for portfolio construction
-   Learning from professional investors’ revealed preferences
-   Potential to democratize sophisticated strategies at lower cost

**Connections to Coursework:**

-   Optional advanced component for FIN720 factor replication
-   Demonstrates technical sophistication and engagement with frontier
    research
-   Shows understanding of representation learning in financial
    contexts

------------------------------------------------------------------------

**Excellent work! You’ve built production-ready ML pipelines with
rigorous validation, understanding the engineering discipline and
statistical rigor required for reliable financial ML systems.**

Gabaix, Xavier, Ralph S. J. Koijen, Robert Richmond, and Motohiro Yogo.
2025. “Asset Embeddings.” Working Paper. SSRN Electronic Journal.
<https://doi.org/10.2139/ssrn.4507511>.