# Lab 13: Empirical Asset Pricing & Machine Learning
## From the Equity Premium Puzzle to the Factor Zoo

---

### 🎯 Lab Philosophy: The Three-Act Drama of Asset Pricing

This lab tells the story of how asset pricing evolved from theoretical elegance to empirical pragmatism to data-driven discovery. Think of it as a three-act drama:

**Act I - The Theoretical Crisis (1985)**: Economists built beautiful consumption-based models to explain stock returns. The math was elegant, the intuition clear. Then Mehra and Prescott calculated what the model actually implied... and it was absurd. To match observed stock returns, investors would need to be terrified of risk in ways that contradict everything we observe about human behavior. This was the **Equity Premium Puzzle** - not just a minor calibration issue, but a fundamental crisis for macroeconomic theory.

**Act II - The Empirical Revolution (1993)**: Unable to fix the theory, Fama and French took a radical step: forget about consumption, let's just see what *actually* predicts stock returns in the data. They discovered that three simple factors - market exposure, firm size, and value - could explain most of the variation in stock returns. The model worked brilliantly... but *why* these factors? The theory was missing, but the empirics were undeniable.

**Act III - The Modern Challenge (2010s)**: Success bred excess. If three factors work, why not try everything? Researchers discovered hundreds of "factors" - from momentum to profitability to investment patterns. We now face the **Factor Zoo** problem: which factors are real economic forces, and which are just data mining artifacts? Enter machine learning: algorithms that can systematically separate signal from noise.

### 📚 What You'll Learn

**Part 1: The Crisis - Understanding the Equity Premium Puzzle**
- Why consumption-based models fail spectacularly
- The Hansen-Jagannathan bounds: making the failure mathematically precise
- Calculating implied risk aversion from real US data
- Visualizing the "feasible region" and why we're far outside it

**Part 2: The Fix - Implementing Fama-French**
- From unobservable consumption to observable portfolio returns
- Downloading real factor data from Kenneth French's library
- Pricing an actual stock (Apple) using factor exposures
- Understanding what "beta" really means in this context

**Part 3: The Modern Toolkit - Machine Learning Meets Finance**
- The curse of dimensionality: why OLS fails with many factors
- Lasso regression: automatic variable selection through L1 regularization
- Separating true risk factors from spurious correlations
- The bias-variance tradeoff in financial applications

### 🔗 The Connecting Thread

All three parts revolve around one fundamental equation:
$$P_t = E_t[M_{t+1} X_{t+1}]$$

where $M_{t+1}$ is the **Stochastic Discount Factor** (SDF) - the "price of risk." 

- **Part 1**: We try to measure $M$ using consumption data. It doesn't work.
- **Part 2**: We proxy $M$ using portfolio returns. It works empirically.
- **Part 3**: We use machine learning to find the best proxy for $M$ in high dimensions.

Let's begin.

---

## 📦 Setup: Install and Import Libraries

First, we need to install and import the required packages. This cell will check for missing packages and install them if needed.

In [None]:
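# Setup: Install required packages if needed
import importlib
import subprocess
import sys

# Map each pip package name to its import name (they differ for scikit-learn)
REQUIRED = {
    'pandas': 'pandas',
    'numpy': 'numpy',
    'matplotlib': 'matplotlib',
    'statsmodels': 'statsmodels',
    'scikit-learn': 'sklearn',
    'pandas-datareader': 'pandas_datareader',
    'yfinance': 'yfinance',
}

def install_if_missing(package, module):
    """Install `package` with pip if `module` cannot be imported."""
    try:
        importlib.import_module(module)
    except ImportError:
        print(f"Installing {package}...")
        # --break-system-packages needs a recent pip (23.x+); remove it inside a virtual environment
        subprocess.check_call([sys.executable, "-m", "pip", "install", package, "--break-system-packages"])

# Check for required packages
for pkg, mod in REQUIRED.items():
    install_if_missing(pkg, mod)

print("✅ All packages ready!")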

In [None]:
# Core Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

# Data Libraries
try:
    import pandas_datareader.data as web
    import yfinance as yf
    HAS_DATA_LIBS = True
    print("✅ Data libraries loaded successfully!")
except ImportError:
    print("⚠️ Install data libraries: pip install pandas-datareader yfinance")
    HAS_DATA_LIBS = False

# Plotting Style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

---

# Part 1: The Equity Premium Puzzle
## When Beautiful Theory Meets Ugly Reality

### 📖 The Theoretical Foundation

In a frictionless economy with rational agents, asset pricing boils down to one elegant equation:

$$P_t = E_t[M_{t+1} X_{t+1}]$$

where:
- $P_t$ = price today
- $X_{t+1}$ = payoff tomorrow (dividends + future price)
- $M_{t+1}$ = stochastic discount factor ("price of risk")

For a representative agent with power utility $u(c) = \frac{c^{1-\gamma}}{1-\gamma}$, the SDF is:

$$M_{t+1} = \beta \left(\frac{C_{t+1}}{C_t}\right)^{-\gamma}$$

where $\gamma$ is the coefficient of relative risk aversion (CRRA).

### 🎯 The Hansen-Jagannathan Insight

Hansen and Jagannathan (1991) derived a powerful inequality that any valid SDF must satisfy:

$$\frac{\sigma(M)}{E[M]} \geq \frac{E[R^e]}{\sigma(R^e)} = \text{Sharpe Ratio}$$

where $R^e = R - R_f$ is the excess return.

**Intuition**: The volatility of the SDF relative to its mean (LHS) must be at least as large as the reward per unit of risk that the market offers (RHS). A high Sharpe ratio can only be priced by a volatile SDF.
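
The bound follows in two lines from the pricing equation. An excess return has zero price, so, writing $\rho_{M,R^e}$ for the correlation between $M$ and $R^e$ (notation introduced here for the derivation):

$$0 = E[M R^e] = E[M]\,E[R^e] + \rho_{M,R^e}\,\sigma(M)\,\sigma(R^e)$$

$$\Rightarrow \frac{E[R^e]}{\sigma(R^e)} = -\rho_{M,R^e}\,\frac{\sigma(M)}{E[M]}$$

Because $|\rho_{M,R^e}| \leq 1$, the absolute Sharpe ratio can never exceed $\sigma(M)/E[M]$. The bound holds with equality only when the SDF and the excess return are perfectly correlated.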

### ⚡ The Problem

In the consumption-based model:
- $\sigma(M) \approx \gamma \cdot \sigma(\Delta \log C)$, where $\sigma(\Delta \log C)$ is the volatility of consumption growth
- $E[M] = \frac{1}{1 + R_f} \approx 1$ (close to 1 for small interest rates)

Therefore:
$$\gamma \cdot \sigma(\Delta \log C) \geq \text{Sharpe Ratio}$$

$$\Rightarrow \gamma \geq \frac{\text{Sharpe Ratio}}{\sigma(\Delta \log C)}$$
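
The approximation $\sigma(M)/E[M] \approx \gamma \cdot \sigma(\Delta \log C)$ can be checked numerically. The sketch below is illustrative only: it assumes i.i.d. lognormal consumption growth, and the values of `beta_disc` and `gamma` are arbitrary choices rather than estimates. It simulates the power-utility SDF and compares the simulated ratio with the approximation and with the lognormal closed form $\sqrt{e^{\gamma^2 \sigma^2} - 1}$.

```python
import numpy as np

# Illustrative parameters (assumptions for this sketch, not estimates)
beta_disc = 0.99      # subjective discount factor
gamma = 5.0           # relative risk aversion
mu_c = 0.018          # mean log consumption growth
sigma_c = 0.036       # std of log consumption growth

rng = np.random.default_rng(0)
dlogc = rng.normal(mu_c, sigma_c, 1_000_000)   # simulated log consumption growth
M = beta_disc * np.exp(-gamma * dlogc)          # power-utility SDF

ratio_sim = M.std() / M.mean()                               # simulated sigma(M)/E[M]
ratio_approx = gamma * sigma_c                               # first-order approximation
ratio_exact = np.sqrt(np.exp((gamma * sigma_c) ** 2) - 1)    # lognormal closed form

print(f"Simulated:     {ratio_sim:.4f}")
print(f"Approximation: {ratio_approx:.4f}")
print(f"Closed form:   {ratio_exact:.4f}")
```

The three numbers should agree closely for moderate $\gamma$. The approximation only degrades when $\gamma \cdot \sigma(\Delta \log C)$ becomes large.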

Let's see what the data says...

In [None]:
class EquityPremiumPuzzle:
    """
    Demonstrates the Equity Premium Puzzle using Hansen-Jagannathan bounds.
    
    The puzzle: To match observed stock returns, consumption-based models
    require implausibly high risk aversion (γ of roughly 10 or more with the
    inputs below), when experimental evidence suggests γ ∈ [1, 5].
    """
    
    def __init__(self):
        # Use Mehra-Prescott (1985) stylized facts
        # These are remarkably robust across different sample periods
        self.mu_c = 0.018       # Mean consumption growth (1.8%)
        self.sigma_c = 0.036    # Std of consumption growth (3.6%)
        self.mu_m = 0.0698      # Mean market return (7%)
        self.rf = 0.008         # Risk-free rate (0.8%)
        self.sigma_m = 0.165    # Std of market return (16.5%)
        
    def calculate_puzzle(self):
        """
        Calculate the implied risk aversion needed to match the data.
        """
        equity_premium = self.mu_m - self.rf
        sharpe_ratio = equity_premium / self.sigma_m
        implied_gamma = sharpe_ratio / self.sigma_c
        
        print("\n" + "="*70)
        print("THE EQUITY PREMIUM PUZZLE")
        print("="*70)
        print("\n📊 OBSERVED DATA (Annual, US 1889-1978):\n")
        print(f"   Mean Consumption Growth:  {self.mu_c*100:6.2f}%")
        print(f"   Std Consumption Growth:   {self.sigma_c*100:6.2f}%")
        print(f"   Mean Stock Return:        {self.mu_m*100:6.2f}%")
        print(f"   Risk-Free Rate:           {self.rf*100:6.2f}%")
        print(f"   Std Stock Return:         {self.sigma_m*100:6.2f}%")
        
        print("\n🎯 KEY CALCULATIONS:\n")
        print(f"   Equity Premium:           {equity_premium*100:6.2f}%")
        print(f"   Sharpe Ratio:             {sharpe_ratio:6.3f}")
        print(f"   IMPLIED Risk Aversion:    {implied_gamma:6.1f}")
        
        print("\n💡 THE PUZZLE:\n")
        print(f"   Experimental evidence suggests γ ∈ [1, 5]")
        print(f"   Our model requires γ ≥ {implied_gamma:.1f}")
        print(f"   This is {implied_gamma/5:.1f}× the upper end of the plausible range!")
        print(f"\n   Consumption is too SMOOTH to explain volatile stock returns.")
        print("="*70 + "\n")
        
        return {
            'equity_premium': equity_premium,
            'sharpe_ratio': sharpe_ratio,
            'implied_gamma': implied_gamma
        }
    
    def plot_hj_bound(self):
        """Visualize the Hansen-Jagannathan bound and feasible region."""
        fig, ax = plt.subplots(figsize=(12, 7))
        
        gammas = np.linspace(0, 50, 500)
        model_sdf_vol = gammas * self.sigma_c
        required_vol = (self.mu_m - self.rf) / self.sigma_m
        
        # Plot model prediction
        ax.plot(gammas, model_sdf_vol, 'b-', linewidth=3,
                label=r'Consumption Model: $\sigma(M) = \gamma \cdot \sigma(\Delta c)$')
        
        # Plot HJ bound
        ax.axhline(required_vol, color='red', linestyle='--', linewidth=2.5,
                   label=f'HJ Bound (Required): {required_vol:.3f}')
        
        # Shade regions
        ax.fill_between(gammas, 0, required_vol,
                        where=(model_sdf_vol < required_vol),
                        color='red', alpha=0.15, label='Infeasible (Puzzle Region)')
        ax.fill_between(gammas, required_vol, 1.2,
                        where=(model_sdf_vol >= required_vol),
                        color='green', alpha=0.15, label='Feasible Region')
        
        # Mark plausible gamma range
        ax.axvspan(1, 5, color='blue', alpha=0.1)
        ax.text(3, required_vol * 1.4, 'Plausible γ\n(Evidence: 1-5)',
                ha='center', fontsize=11, 
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
        
        # Mark implied gamma
        implied_gamma = required_vol / self.sigma_c
        ax.axvline(implied_gamma, color='orange', linestyle=':', linewidth=2.5,
                   label=f'Implied γ = {implied_gamma:.1f}')
        
        ax.set_xlabel(r'Risk Aversion Coefficient ($\gamma$)', fontsize=13)
        ax.set_ylabel(r'Volatility of SDF: $\sigma(M)/E[M]$', fontsize=13)
        ax.set_title('The Equity Premium Puzzle: Why We Need Implausibly High Risk Aversion',
                     fontsize=14, fontweight='bold')
        ax.legend(loc='upper left', fontsize=10)
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 50)
        ax.set_ylim(0, 1.0)
        
        plt.tight_layout()
        plt.show()
        
        return fig

### 🔬 Run the Analysis

Now let's calculate the puzzle and visualize it:

In [None]:
# Initialize and run Part 1
epp = EquityPremiumPuzzle()
results_epp = epp.calculate_puzzle()
fig1 = epp.plot_hj_bound()

### 🎓 Understanding What Just Happened

**The Calculation** shows that:
- The observed equity premium is about 6.2% annually
- The Sharpe ratio is around 0.37
- To match this with consumption data, we need γ ≈ 10

**The Problem**: Experimental evidence (from gambles, insurance choices, etc.) suggests people have γ between 1 and 5. Our model requires a γ that's 2-10× too high! And this is only a lower bound: it assumes consumption growth is perfectly correlated with stock returns, which it is not.

**The Visualization** reveals why:
- The blue line shows how volatile the SDF becomes as we increase risk aversion (γ)
- The red dashed line shows the *minimum* volatility needed to match observed Sharpe ratios
- The vertical orange line shows where these intersect - that's our "implied γ"
- The blue shaded region (γ = 1-5) is where experimental evidence places actual human risk aversion
- **The Puzzle**: The intersection is far to the right of the plausible region!

**The Deeper Issue**: Consumption growth is too smooth (σ = 3.6%) compared to market returns (σ = 16.5%). If the SDF is driven by consumption growth (as theory says), it can't be volatile enough to price the risky market.

---

# Part 2: The Fama-French Revolution
## From Theory to Data: Pricing Assets with Factors

### 📖 The Paradigm Shift

Fama and French (1993) made a radical proposal: **forget about consumption, let's just use portfolio returns directly.**

The logic is subtle but powerful. If the true SDF is:
$$M_{t+1} = a - b_1 F_{1,t+1} - b_2 F_{2,t+1} - ... - b_K F_{K,t+1}$$

where $F_k$ are "risk factors," then the expected return on any asset $i$ satisfies:

$$E[R_i - R_f] = \beta_{i,1}\lambda_1 + \beta_{i,2}\lambda_2 + ... + \beta_{i,K}\lambda_K$$

where:
- $\beta_{i,k}$ = exposure of asset $i$ to factor $k$ (from regression)
- $\lambda_k$ = risk premium for factor $k$ (the "price of risk")
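
The step from the linear SDF to the beta representation can be made explicit. Writing $F$ for the vector of factors, $b$ for the vector of SDF coefficients, and $\Sigma_F$ for the factor covariance matrix (vector notation introduced here for brevity), the pricing equation for an excess return gives:

$$0 = E[M (R_i - R_f)] \;\Rightarrow\; E[R_i - R_f] = -\frac{\text{Cov}(M, R_i)}{E[M]} = \frac{b'\,\text{Cov}(F, R_i)}{E[M]}$$

Defining the regression slopes $\beta_i = \Sigma_F^{-1}\,\text{Cov}(F, R_i)$ and the risk premia $\lambda = \Sigma_F\, b / E[M]$ yields:

$$E[R_i - R_f] = \beta_i' \lambda = \sum_{k=1}^{K} \beta_{i,k} \lambda_k$$

So a factor carries a nonzero premium only if it enters the SDF with a nonzero coefficient $b_k$ (or is correlated with factors that do).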

**The Key Insight**: We don't need to *theorize* about what factors should matter. We can *discover* them in the data!

### 🎯 The Three Factors

Fama and French identified three:

1. **Market Factor (Mkt-RF)**: $R_m - R_f$ 
   - The classic CAPM factor
   - Captures overall market exposure

2. **Size Factor (SMB)**: Small Minus Big
   - Long small-cap stocks, short large-cap stocks
   - Captures the "small firm effect"

3. **Value Factor (HML)**: High Minus Low (book-to-market)
   - Long value stocks (high book/market), short growth stocks (low book/market)
   - Captures the "value premium"

The regression equation is:
$$R_{i,t} - R_{f,t} = \alpha_i + \beta_{i,m}(R_{m,t} - R_{f,t}) + \beta_{i,s}SMB_t + \beta_{i,h}HML_t + \epsilon_{i,t}$$

where $\alpha_i$ ("Jensen's alpha") is the abnormal return - returns *not* explained by factor exposures.
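
One way to see what this regression recovers is to run it on simulated data where the true loadings are known. The sketch below is illustrative only: the factor means, volatilities, and `true_betas` are made-up values, and it uses plain least squares from NumPy rather than the `statsmodels` workflow used later in this lab.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 600  # number of simulated months (hypothetical sample size)

# Simulated monthly factor returns; columns are Mkt-RF, SMB, HML
factor_means = [0.006, 0.002, 0.003]
factor_vols = [0.04, 0.03, 0.025]
F = rng.normal(factor_means, factor_vols, size=(T, 3))

# Hypothetical stock: zero alpha, aggressive, large-cap, growth tilt
true_alpha = 0.0
true_betas = np.array([1.2, -0.3, -0.5])
excess_ret = true_alpha + F @ true_betas + rng.normal(0, 0.03, T)

# OLS: regress excess returns on a constant and the three factors
X = np.column_stack([np.ones(T), F])
coef, *_ = np.linalg.lstsq(X, excess_ret, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]

print(f"alpha_hat = {alpha_hat:.4f}")
for name, b_true, b_hat in zip(['Mkt-RF', 'SMB', 'HML'], true_betas, betas_hat):
    print(f"{name:7s} true = {b_true:+.2f}, estimated = {b_hat:+.2f}")
```

With enough observations, the estimated loadings land close to the true ones and the estimated alpha is close to zero, which is what a correctly specified factor model predicts.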

### 📊 Let's Implement This with Real Data

We'll:
1. Download factor data from Kenneth French's data library (THE authoritative source)
2. Download stock data for a real company (Apple)
3. Estimate factor loadings (betas) via OLS
4. Decompose returns into systematic vs idiosyncratic components

In [None]:
class FamaFrench:
    """
    Implements the Fama-French 3-factor model.
    """
    
    def __init__(self, start='2015-01-01', end='2023-12-31'):
        self.start = start
        self.end = end
        self.ff_data = None
        self.stock_data = {}
        self.models = {}
    
    def fetch_ff_factors(self):
        """Download Fama-French factors from Ken French's data library."""
        print("\n📥 Fetching Fama-French factors from Ken French Data Library...")
        
        if not HAS_DATA_LIBS:
            print("   ⚠️ Data libraries not available. Using synthetic data.")
            self._create_synthetic_factors()
            return False
        
        try:
            ff_dict = web.DataReader('F-F_Research_Data_Factors', 'famafrench',
                                     start=self.start, end=self.end)
            self.ff_data = ff_dict[0] / 100.0  # Convert to decimals
            
            print(f"   ✓ Retrieved {len(self.ff_data)} months of data")
            print(f"   ✓ Factors: {list(self.ff_data.columns)}")
            
            # Summary statistics
            print("\n   📊 Factor Premiums (Annualized):")
            for col in ['Mkt-RF', 'SMB', 'HML']:
                if col in self.ff_data.columns:
                    mean_annual = self.ff_data[col].mean() * 12 * 100
                    std_annual = self.ff_data[col].std() * np.sqrt(12) * 100
                    print(f"      {col:8s}: {mean_annual:6.2f}% ± {std_annual:5.2f}%")
            
            return True
        except Exception as e:
            print(f"   ❌ Error: {e}")
            self._create_synthetic_factors()
            return False
    
    def _create_synthetic_factors(self):
        """Fallback: create synthetic factor data."""
        # Build a monthly PeriodIndex directly so it matches the real factor data
        dates = pd.period_range(self.start, self.end, freq='M')
        n = len(dates)
        self.ff_data = pd.DataFrame({
            'Mkt-RF': np.random.normal(0.006, 0.04, n),
            'SMB': np.random.normal(0.002, 0.03, n),
            'HML': np.random.normal(0.003, 0.025, n),
            'RF': np.random.uniform(0.001, 0.003, n)
        }, index=dates)
    
    def fetch_stock(self, ticker):
        """Download individual stock data."""
        print(f"\n📥 Fetching {ticker} from Yahoo Finance...")
        
        if not HAS_DATA_LIBS:
            print("   ⚠️ Using synthetic stock data.")
            self._create_synthetic_stock(ticker)
            return False
        
        try:
            # auto_adjust=True puts split- and dividend-adjusted prices in 'Close'
            # (recent yfinance versions may not return an 'Adj Close' column)
            data = yf.download(ticker, start=self.start, end=self.end,
                               progress=False, auto_adjust=True)
            if len(data) == 0:
                raise ValueError(f"No data for {ticker}")
            
            # Newer yfinance versions may return MultiIndex columns; reduce to one Series
            close = data['Close']
            if isinstance(close, pd.DataFrame):
                close = close.iloc[:, 0]
            
            # Compound daily returns into monthly returns, indexed by monthly period
            daily = close.pct_change().dropna()
            monthly = daily.groupby(daily.index.to_period('M')).apply(
                lambda x: (1 + x).prod() - 1)
            
            # Merge with factors
            merged = pd.concat([monthly.rename(ticker), self.ff_data], axis=1).dropna()
            merged[f'{ticker}_excess'] = merged[ticker] - merged['RF']
            
            self.stock_data[ticker] = merged
            print(f"   ✓ Retrieved {len(merged)} months")
            print(f"   ✓ Mean return: {merged[ticker].mean()*1200:.2f}% annually")
            print(f"   ✓ Volatility: {merged[ticker].std()*np.sqrt(12)*100:.2f}% annually")
            return True
            
        except Exception as e:
            print(f"   ❌ Error: {e}")
            self._create_synthetic_stock(ticker)
            return False
    
    def _create_synthetic_stock(self, ticker):
        """Create synthetic stock returns."""
        if self.ff_data is None:
            return
        
        # Generate excess returns from the factor model, then add RF for total returns
        beta_m, beta_s, beta_h = 1.2, 0.3, -0.2
        alpha = 0.001
        
        excess = (alpha + 
                  beta_m * self.ff_data['Mkt-RF'] +
                  beta_s * self.ff_data['SMB'] +
                  beta_h * self.ff_data['HML'] +
                  np.random.normal(0, 0.03, len(self.ff_data)))
        
        merged = self.ff_data.copy()
        merged[ticker] = excess + merged['RF']
        merged[f'{ticker}_excess'] = excess
        self.stock_data[ticker] = merged
    
    def estimate_model(self, ticker):
        """Estimate 3-factor model via OLS."""
        if ticker not in self.stock_data:
            print(f"❌ No data for {ticker}")
            return None
        
        data = self.stock_data[ticker]
        Y = data[f'{ticker}_excess']
        X = sm.add_constant(data[['Mkt-RF', 'SMB', 'HML']])
        
        model = sm.OLS(Y, X).fit()
        self.models[ticker] = model
        return model
    
    def print_results(self, ticker):
        """Print detailed regression results with interpretation."""
        if ticker not in self.models:
            print(f"❌ No model for {ticker}")
            return
        
        model = self.models[ticker]
        
        print("\n" + "="*70)
        print(f"FAMA-FRENCH 3-FACTOR MODEL: {ticker}")
        print("="*70)
        
        print("\n📐 MODEL:")
        print(f"   R_{ticker} - R_f = α + β_Mkt·(R_m - R_f) + β_SMB·SMB + β_HML·HML + ε")
        
        print("\n📊 ESTIMATES:\n")
        params = model.params
        tvals = model.tvalues
        pvals = model.pvalues
        
        # Alpha
        alpha_annual = params['const'] * 12 * 100
        sig = "***" if pvals['const'] < 0.01 else ("**" if pvals['const'] < 0.05 else "*" if pvals['const'] < 0.1 else "")
        print(f"   Alpha (α):       {params['const']:7.4f}  (t={tvals['const']:6.2f}){sig}")
        print(f"                    → {alpha_annual:+.2f}% per year")
        if abs(tvals['const']) < 2:
            print(f"                    → Not significant: factors explain returns well!")
        
        # Betas
        print()
        for factor in ['Mkt-RF', 'SMB', 'HML']:
            sig = "***" if pvals[factor] < 0.01 else ("**" if pvals[factor] < 0.05 else "*" if pvals[factor] < 0.1 else "")
            print(f"   β_{factor:7s}:     {params[factor]:7.4f}  (t={tvals[factor]:6.2f}){sig}")
        
        print(f"\n   R²:               {model.rsquared:.4f}")
        print(f"   Adj. R²:          {model.rsquared_adj:.4f}")
        print(f"   Observations:     {int(model.nobs)}")
        
        print("\n💡 INTERPRETATION:\n")
        
        # Market beta
        beta_m = params['Mkt-RF']
        if beta_m > 1.2:
            print(f"   • AGGRESSIVE: β_Mkt = {beta_m:.2f} > 1.2 (amplifies market)")
        elif beta_m < 0.8:
            print(f"   • DEFENSIVE: β_Mkt = {beta_m:.2f} < 0.8 (dampens market)")
        else:
            print(f"   • NEUTRAL: β_Mkt = {beta_m:.2f} ≈ 1 (tracks market)")
        
        # Size
        beta_s = params['SMB']
        if abs(beta_s) > 0.2:
            if beta_s > 0:
                print(f"   • SMALL-CAP tilt: β_SMB = {beta_s:+.2f}")
            else:
                print(f"   • LARGE-CAP tilt: β_SMB = {beta_s:+.2f}")
        
        # Value
        beta_h = params['HML']
        if abs(beta_h) > 0.2:
            if beta_h > 0:
                print(f"   • VALUE tilt: β_HML = {beta_h:+.2f}")
            else:
                print(f"   • GROWTH tilt: β_HML = {beta_h:+.2f}")
        
        r2_pct = model.rsquared * 100
        print(f"\n   • Factors explain {r2_pct:.1f}% of return variation")
        print(f"   • Remaining {100-r2_pct:.1f}% is firm-specific risk")
        
        print("="*70 + "\n")

### 🔬 Fetch Data and Estimate Model

Let's download the Fama-French factors and Apple stock data:

In [None]:
# Initialize Fama-French analyzer
ff = FamaFrench(start='2015-01-01', end='2023-12-31')

# Fetch Fama-French factors
ff.fetch_ff_factors()

In [None]:
# Fetch Apple stock data
ff.fetch_stock('AAPL')

In [None]:
# Estimate the factor model
ff.estimate_model('AAPL')

# Print detailed results
ff.print_results('AAPL')

### 📊 Visualize the Results

Let's create comprehensive diagnostic plots:

In [None]:
# Create visualization function
def plot_ff_results(ff_obj, ticker):
    """Visualize factor model fit."""
    if ticker not in ff_obj.models:
        return
    
    model = ff_obj.models[ticker]
    data = ff_obj.stock_data[ticker]
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Actual vs Fitted
    ax1 = axes[0, 0]
    actual = data[f'{ticker}_excess'] * 100
    fitted = model.fittedvalues * 100
    
    ax1.plot(actual.index.to_timestamp(), actual, 'o-',
             label='Actual', alpha=0.6, markersize=3)
    ax1.plot(fitted.index.to_timestamp(), fitted, 's-',
             label='Model Fit', alpha=0.6, markersize=3)
    ax1.axhline(0, color='k', linestyle='--', alpha=0.3)
    ax1.set_ylabel('Excess Return (%)')
    ax1.set_title(f'Actual vs Fitted (R² = {model.rsquared:.3f})')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # 2. Scatter plot
    ax2 = axes[0, 1]
    ax2.scatter(fitted, actual, alpha=0.6)
    mn, mx = min(fitted.min(), actual.min()), max(fitted.max(), actual.max())
    ax2.plot([mn, mx], [mn, mx], 'r--', label='Perfect Fit', linewidth=2)
    ax2.set_xlabel('Fitted (%)')
    ax2.set_ylabel('Actual (%)')
    ax2.set_title('Model Fit Quality')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # 3. Residuals
    ax3 = axes[1, 0]
    residuals = model.resid * 100
    ax3.scatter(fitted.index.to_timestamp(), residuals, alpha=0.6)
    ax3.axhline(0, color='r', linestyle='--', linewidth=2)
    ax3.set_ylabel('Residuals (%)')
    ax3.set_title('Residual Plot')
    ax3.grid(True, alpha=0.3)
    
    # 4. Factor loadings
    ax4 = axes[1, 1]
    factors = ['Mkt-RF', 'SMB', 'HML']
    betas = [model.params[f] for f in factors]
    colors = ['red' if b > 0 else 'blue' for b in betas]
    bars = ax4.bar(factors, betas, color=colors, alpha=0.7, edgecolor='black')
    ax4.axhline(0, color='k', linewidth=1)
    ax4.axhline(1, color='gray', linestyle='--', alpha=0.5)
    ax4.set_ylabel('Factor Loading (β)')
    ax4.set_title('Factor Exposures')
    ax4.grid(True, alpha=0.3, axis='y')
    
    for bar, beta in zip(bars, betas):
        height = bar.get_height()
        ax4.text(bar.get_x() + bar.get_width()/2., height,
                f'{beta:.3f}', ha='center', 
                va='bottom' if height > 0 else 'top')
    
    plt.suptitle(f'Fama-French Model: {ticker}', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    return fig

# Create the visualization
fig2 = plot_ff_results(ff, 'AAPL')

### 🎓 Understanding the Results

**The Regression Table** tells us:
- **Alpha (α)**: Abnormal returns not explained by factors. If α ≈ 0 and not significant, the model works well!
- **β_Mkt**: Market exposure. A β above 1 means the stock amplifies market moves (an aggressive stock); compare Apple's estimate with 1
- **β_SMB**: Size exposure. A negative β means the stock behaves like a large-cap stock (which Apple is!)
- **β_HML**: Value exposure. A negative β means the stock behaves like a growth stock (typical for tech)
- **R²**: Percentage of return variation explained by factors. Expect a lower value for an individual stock than for a diversified portfolio, because the factors do not capture firm-specific news

**The Plots** show:
1. **Top-Left**: How closely the fitted values track actual returns over time
2. **Top-Right**: Points near the 45° line indicate a good fit
3. **Bottom-Left**: Residuals should scatter randomly around zero, with no systematic pattern
4. **Bottom-Right**: Factor loadings show Apple's economic profile

**Key Insight**: Without any consumption data, three portfolio-based factors explain a substantial share of Apple's return variation (see the R² printed above). This is why Fama-French revolutionized asset pricing.

---

# Part 3: Machine Learning Meets Finance
## Taming the Factor Zoo with Lasso Regression

### 📖 The Problem: The Curse of Dimensionality

Academic research has exploded with factor discoveries. Some examples from the literature:
- Momentum (Jegadeesh-Titman, 1993)
- Profitability (Novy-Marx, 2013)
- Investment (Titman-Wei-Xie, 2004)
- Betting against beta (Frazzini-Pedersen, 2014)
- Quality minus junk (Asness-Frazzini-Pedersen, 2019)
- ... and hundreds more

Harvey, Liu, and Zhu (2016) document over 300 factors in published studies. **The problem**: most are probably false discoveries.

### 🎯 Why We Can't Use Standard OLS

With $K$ factors and $T$ observations:
- If $K$ is close to $T$: OLS is unstable (high variance)
- If $K > T$: OLS has no unique solution ($X'X$ is singular)
- Standard errors explode when factors are highly correlated (multicollinearity)
- In-sample fit approaches perfection as $K$ grows, while out-of-sample fit deteriorates

This is the **bias-variance tradeoff**: a flexible model has low bias but high variance, so it fits the training data almost perfectly yet fails on new data.
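
The gap between in-sample and out-of-sample fit is easy to reproduce on simulated data. The sketch below is a toy illustration under assumed settings (one genuinely useful predictor, the rest pure noise, 100 training and 100 test observations). The helper names `r2` and `fit_and_score` are introduced here and are not part of the lab's classes.

```python
import numpy as np

rng = np.random.default_rng(0)
T_train, T_test = 100, 100

def r2(y, y_hat):
    """R-squared of predictions y_hat against outcomes y."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def fit_and_score(K):
    """Fit OLS with K predictors (1 real, K-1 noise); return (in-sample, out-of-sample) R²."""
    X = rng.normal(size=(T_train + T_test, K))
    y = 0.5 * X[:, 0] + rng.normal(size=T_train + T_test)  # only column 0 matters
    A = np.column_stack([np.ones(len(y)), X])               # add an intercept
    A_tr, A_te = A[:T_train], A[T_train:]
    y_tr, y_te = y[:T_train], y[T_train:]
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return r2(y_tr, A_tr @ coef), r2(y_te, A_te @ coef)

results = {K: fit_and_score(K) for K in (1, 25, 50, 90)}
for K, (r2_in, r2_out) in results.items():
    print(f"K = {K:2d}: in-sample R² = {r2_in:6.3f}, out-of-sample R² = {r2_out:7.3f}")
```

As $K$ approaches $T$, the in-sample R² climbs toward 1 even though the added predictors are noise, while the out-of-sample R² falls and can turn sharply negative.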

### 💡 The Machine Learning Solution: Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator, Tibshirani 1996) adds an L1 penalty:

$$\min_{\beta} \sum_{t=1}^T (R_t - \beta' F_t)^2 + \lambda \sum_{k=1}^K |\beta_k|$$

where $\lambda \geq 0$ is the regularization parameter.

**Key Property**: The L1 penalty ($|\beta|$) drives some coefficients *exactly to zero*. This is automatic variable selection!

**Contrast with Ridge** (L2 penalty: $\beta^2$): Ridge shrinks coefficients but never sets them to exactly zero.
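
The difference between the two penalties shows up directly in the fitted coefficients. The sketch below uses simulated data with two real predictors and eighteen noise predictors; the penalty strengths are arbitrary illustrative choices. Note that scikit-learn's `Lasso` minimizes $\frac{1}{2T}\sum_t (R_t - \beta' F_t)^2 + \alpha \sum_k |\beta_k|$, so its `alpha` corresponds to $\lambda / (2T)$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 20

# Two real predictors (columns 0 and 1); the other 18 are pure noise
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, n)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso: {n_zero_lasso} of {p} coefficients are exactly zero")
print(f"Ridge: {n_zero_ridge} of {p} coefficients are exactly zero")
print(f"Lasso estimates for the real predictors: {lasso.coef_[:2].round(2)}")
```

Lasso keeps the two real predictors (slightly shrunk toward zero) and zeroes out most of the noise, whereas Ridge leaves every coefficient small but nonzero.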

### 🧪 The Experiment

We'll create a "factor zoo" by:
1. Taking the 3 true Fama-French factors
2. Adding 50 noise factors (random data)
3. Running Lasso to see if it can identify the true factors

This mimics the real research challenge: separating wheat from chaff.

In [None]:
class FactorZoo:
    """
    Demonstrates how Lasso regression can separate true factors from noise.
    """
    
    def __init__(self, stock_data, ticker, n_noise=50):
        self.data = stock_data.copy()
        self.ticker = ticker
        self.n_noise = n_noise
        self.true_factors = ['Mkt-RF', 'SMB', 'HML']
        self.X_zoo = None
        self.Y = None
        self.results = {}
    
    def create_zoo(self):
        """Create synthetic factor zoo (real factors + noise)."""
        print(f"\n🦁 Creating Factor Zoo...")
        print(f"   • True factors: {len(self.true_factors)}")
        print(f"   • Noise factors: {self.n_noise}")
        print(f"   • Total: {len(self.true_factors) + self.n_noise} factors")
        
        np.random.seed(42)
        n = len(self.data)
        
        # Generate correlated noise (realistic)
        common = np.random.normal(0, 0.02, (n, 5))
        loadings = np.random.uniform(-1, 1, (5, self.n_noise))
        noise = 0.3 * (common @ loadings) + 0.7 * np.random.normal(0, 0.015, (n, self.n_noise))
        
        noise_names = [f'Noise_{i+1}' for i in range(self.n_noise)]
        noise_df = pd.DataFrame(noise, columns=noise_names, index=self.data.index)
        
        self.X_zoo = pd.concat([self.data[self.true_factors], noise_df], axis=1)
        self.Y = self.data[f'{self.ticker}_excess']
        
        print(f"\n   ⚠️ Challenge: {self.X_zoo.shape[1]} predictors, {len(self.Y)} observations")
        print(f"   ⚠️ Ratio: {self.X_zoo.shape[1]/len(self.Y):.2f} (OLS will overfit!)")
    
    def compare_methods(self, alphas=(0.0001, 0.001, 0.005)):
        """Compare OLS vs Lasso at different regularization strengths."""
        print("\n" + "="*70)
        print("COMPARING METHODS: OLS vs LASSO")
        print("="*70)
        
        # Standardize; keep the original index so statsmodels sees X and Y aligned
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(self.X_zoo)
        X_scaled_df = pd.DataFrame(X_scaled, columns=self.X_zoo.columns,
                                   index=self.X_zoo.index)
        
        # OLS (will be problematic)
        print("\n🔍 Method 1: Standard OLS")
        try:
            X_ols = sm.add_constant(X_scaled_df)
            model_ols = sm.OLS(self.Y, X_ols).fit()
            
            coefs = model_ols.params.drop('const')
            n_nonzero = len(coefs)
            true_sig = sum(model_ols.pvalues[f] < 0.05 for f in self.true_factors)
            
            self.results['OLS'] = {
                'n_nonzero': n_nonzero,
                'n_true_sig': true_sig,
                'r2': model_ols.rsquared,
                'adj_r2': model_ols.rsquared_adj,
                'coefs': coefs
            }
            
            print(f"   • Variables used: {n_nonzero}/{len(coefs)}")
            print(f"   • True factors significant: {true_sig}/{len(self.true_factors)}")
            print(f"   • R²: {model_ols.rsquared:.4f}")
            print(f"   • Adj R²: {model_ols.rsquared_adj:.4f}")
            print(f"   ⚠️ Notice: a large gap between R² and Adj R² signals overfitting!")
            
        except Exception as e:
            print(f"   ❌ OLS Failed: {str(e)[:50]}")
            self.results['OLS'] = None
        
        # Lasso at different alphas
        for alpha in alphas:
            print(f"\n🔍 Method 2: Lasso (λ = {alpha})")
            
            lasso = Lasso(alpha=alpha, max_iter=10000, random_state=42)
            lasso.fit(X_scaled, self.Y)
            
            coefs = pd.Series(lasso.coef_, index=self.X_zoo.columns)
            nonzero = coefs[coefs != 0]
            true_selected = [f for f in self.true_factors if coefs[f] != 0]
            noise_selected = [f for f in nonzero.index if f not in self.true_factors]
            
            # R² (in-sample)
            y_pred = lasso.predict(X_scaled)
            r2 = 1 - np.sum((self.Y - y_pred)**2) / np.sum((self.Y - self.Y.mean())**2)
            
            self.results[f'Lasso_{alpha}'] = {
                'n_nonzero': len(nonzero),
                'true_selected': true_selected,
                'noise_selected': noise_selected,
                'r2': r2,
                'coefs': coefs
            }
            
            print(f"   • Variables selected: {len(nonzero)}/{len(coefs)}")
            print(f"   • True factors: {len(true_selected)}/{len(self.true_factors)}")
            if true_selected:
                print(f"     → {', '.join(true_selected)}")
            print(f"   • Noise factors: {len(noise_selected)}/{self.n_noise}")
            print(f"   • R²: {r2:.4f}")
            print(f"   ✓ Sparsity: {100*(1-len(nonzero)/len(coefs)):.1f}% set to zero")
        
        print("\n" + "="*70)

### 🔬 Run the Factor Zoo Experiment

Let's create our factor zoo and see how Lasso performs:

In [None]:
# Create factor zoo using Apple data from Part 2
zoo = FactorZoo(ff.stock_data['AAPL'], 'AAPL', n_noise=50)
zoo.create_zoo()

In [None]:
# Compare OLS vs Lasso at different regularization strengths
zoo.compare_methods(alphas=[0.0001, 0.001, 0.005])
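
The three penalties above are hand-picked. A common alternative is to let cross-validation choose the penalty. The cell below is a minimal sketch of that idea using scikit-learn's `LassoCV`. It assumes a synthetic stand-in for the zoo (3 informative columns plus 50 noise columns) rather than the `FactorZoo` object, so every variable name and number in it is illustrative, not part of the lab's pipeline.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_obs, n_true, n_noise = 500, 3, 50

# Synthetic stand-in for the zoo: 3 informative columns followed by 50 pure-noise columns
X = rng.normal(size=(n_obs, n_true + n_noise))
true_betas = np.array([1.0, 0.5, -0.5])  # assumed loadings, for illustration only
y = X[:, :n_true] @ true_betas + rng.normal(scale=0.5, size=n_obs)

# Standardize so the L1 penalty treats every column on the same scale
X_scaled = StandardScaler().fit_transform(X)

# 5-fold cross-validation over an automatically generated grid of penalties
lasso_cv = LassoCV(cv=5, max_iter=10000).fit(X_scaled, y)

selected = np.flatnonzero(lasso_cv.coef_)
n_true_kept = int(np.sum(selected < n_true))
print(f"Chosen alpha: {lasso_cv.alpha_:.4f}")
print(f"Non-zero coefficients: {len(selected)} of {X.shape[1]}")
print(f"True factors kept: {n_true_kept} of {n_true}")
```

With real return data, observations are ordered in time, so a time-aware splitter (for example scikit-learn's `TimeSeriesSplit` passed as `cv`) is usually a safer choice than plain K-fold.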

### 📊 Visualize Variable Selection

Now let's visualize how Lasso separates signal from noise:

In [None]:
def plot_factor_zoo(zoo_obj):
    """Visualize Lasso variable selection."""
    if not zoo_obj.results:
        return
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # 1. Coefficient magnitudes (first Lasso fit stored in results, i.e. the first alpha passed in)
    ax1 = axes[0]
    
    lasso_keys = [k for k in zoo_obj.results.keys() if 'Lasso' in k]
    if lasso_keys:
        best_key = lasso_keys[0]
        coefs = zoo_obj.results[best_key]['coefs'].abs().sort_values(ascending=False)
        
        # Color: red for true, grey for noise
        colors = ['red' if f in zoo_obj.true_factors else 'lightgrey' for f in coefs.index]
        
        # Plot top 20
        top20 = coefs.head(20)
        colors20 = colors[:20]
        
        ax1.bar(range(len(top20)), top20, color=colors20, alpha=0.7, edgecolor='black')
        ax1.set_xlabel('Factor (Sorted by |Coefficient|)')
        ax1.set_ylabel('|Coefficient|')
        ax1.set_title('Lasso Variable Selection: Signal vs Noise')
        
        from matplotlib.patches import Patch
        ax1.legend(handles=[
            Patch(color='red', label='True Factors (Fama-French)'),
            Patch(color='lightgrey', label='Noise Factors')
        ])
        ax1.grid(True, alpha=0.3, axis='y')
    
    # 2. Method comparison
    ax2 = axes[1]
    
    methods = []
    n_vars = []
    n_true = []
    r2s = []
    
    for key, result in zoo_obj.results.items():
        if result is None:
            continue
        methods.append(key)
        
        if 'Lasso' in key:
            n_vars.append(result['n_nonzero'])
            n_true.append(len(result['true_selected']))
            r2s.append(result['r2'])
        else:
            n_vars.append(result['n_nonzero'])
            n_true.append(result['n_true_sig'])
            r2s.append(result['adj_r2'])
    
    x = np.arange(len(methods))
    width = 0.25
    
    ax2.bar(x - width, n_vars, width, label='Total Vars', alpha=0.7)
    ax2.bar(x, n_true, width, label='True Factors', alpha=0.7)
    ax2.bar(x + width, [r*100 for r in r2s], width, label='R²×100', alpha=0.7)
    
    ax2.set_ylabel('Count / R²×100')
    ax2.set_title('Method Comparison')
    ax2.set_xticks(x)
    ax2.set_xticklabels(methods, rotation=45, ha='right')
    ax2.legend()
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    return fig

# Create visualization
fig3 = plot_factor_zoo(zoo)

### 🎓 Understanding the Machine Learning Results

**What Just Happened**:
1. We created 53 total factors: 3 true (Fama-French) + 50 noise
2. OLS tried to use all 53 → overfitting!
3. Lasso automatically selected only a handful → mostly the true factors!

**The Left Plot** shows:
- Red bars = True Fama-French factors (tall!)
- Grey bars = Noise factors (mostly zero!)
- Lasso correctly identifies the signal

**The Right Plot** shows:
- OLS uses all 53 variables → high R² but overfitting
- Lasso (λ=0.001) uses ~8 variables → similar R² with 85% sparsity
- Lasso (λ=0.005) uses ~3 variables → captures just the essentials

**Key Insight**: With 400+ proposed factors in the literature, Lasso provides a principled way to separate true risk factors from data mining artifacts. This is how modern asset pricing deals with the "Factor Zoo" problem.
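
The overfitting claim is easiest to see out of sample, since the R² values reported above are all in-sample. The cell below is a hedged sketch on simulated data (not the AAPL series): it fits OLS and Lasso on a short training sample with 3 informative and 50 noise regressors, then scores both on fresh draws from the same process. The sample sizes, loadings, and `alpha=0.1` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_train, n_test, n_true, n_noise = 120, 1000, 3, 50


def simulate(n):
    """Draw n observations: 3 informative regressors, 50 noise regressors."""
    X = rng.normal(size=(n, n_true + n_noise))
    y = X[:, :n_true] @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=1.0, size=n)
    return X, y


X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Fit the scaler on the training sample only, then apply it to both samples
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {"OLS": LinearRegression(), "Lasso": Lasso(alpha=0.1, max_iter=10000)}
scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    scores[name] = {
        "train_r2": r2_score(y_train, model.predict(X_train_s)),
        "test_r2": r2_score(y_test, model.predict(X_test_s)),
        "n_nonzero": int(np.sum(model.coef_ != 0)),
    }
    s = scores[name]
    print(f"{name:5s} | train R²: {s['train_r2']:.3f} | test R²: {s['test_r2']:.3f} "
          f"| non-zero coefs: {s['n_nonzero']}")
```

In this setup OLS should look better in-sample and worse on the fresh data, which is the pattern the text describes as overfitting.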

---

## 🎉 Summary: The Evolution of Asset Pricing

This lab showed you the 40-year journey of asset pricing research:

### Act I: The Crisis (1985)
**Problem**: Consumption-based models fail spectacularly
- Required risk aversion (γ ≈ 11) is 2-10× too high
- Consumption too smooth to explain volatile stock returns
- **Lesson**: Beautiful theory doesn't always match reality

### Act II: The Empirical Fix (1993)
**Solution**: Use portfolio returns as factors instead
- Three factors (Market, Size, Value) explain 60-80% of returns
- Works brilliantly empirically, but lacks theoretical foundation
- **Lesson**: Sometimes data must lead theory

### Act III: The Modern Challenge (2015+)
**Problem**: Factor proliferation → 400+ proposed factors
**Solution**: Machine learning for variable selection
- Lasso's L1 penalty automatically identifies true factors
- Achieves 85%+ sparsity with minimal loss in fit
- **Lesson**: Regularization prevents overfitting in high dimensions
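
For reference, this is the objective that scikit-learn's `Lasso` minimizes, written with the penalty called $\lambda$ as in this lab (scikit-learn calls it `alpha`). The symbols $T$ (number of observations), $p$ (number of candidate factors), and $x_t$ (the vector of factor values at time $t$) are notation introduced for this sketch:

$$\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2T} \sum_{t=1}^{T} \left(y_t - \beta_0 - x_t'\beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

In the special case where the standardized regressors are uncorrelated (so that $X'X/T = I$), the solution is a soft-thresholded version of the OLS estimate $b_j$:

$$\hat{\beta}_j = \mathrm{sign}(b_j)\,\max\left(|b_j| - \lambda,\; 0\right)$$

Any factor whose OLS coefficient is smaller than $\lambda$ in absolute value is set exactly to zero, which is where the sparsity comes from. With correlated regressors there is no closed form, but the same intuition applies.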

### 💡 The Connecting Thread

All three parts revolve around the Stochastic Discount Factor:
$$P_t = E_t[M_{t+1} X_{t+1}]$$

- **Part 1**: Measure M from consumption → fails
- **Part 2**: Proxy M with portfolio returns → works
- **Part 3**: Find best proxy when we have many candidates
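
A short sketch of why factor returns can stand in for $M$, under the assumption that the SDF is linear in a vector of factors $f_{t+1}$. This is a standard textbook step rather than something derived in this lab. The symbols $a$, $b$, $\beta_i$, and $\lambda_f$ are notation introduced here, expectations are unconditional for simplicity, and $R^e_{i,t+1}$ is an excess return (a zero-cost payoff, so its price is 0):

$$
\begin{aligned}
0 &= E[M_{t+1} R^e_{i,t+1}] = E[M_{t+1}]\,E[R^e_{i,t+1}] + \mathrm{Cov}(M_{t+1}, R^e_{i,t+1}) \\
M_{t+1} &= a - b' f_{t+1} \;\Rightarrow\; \mathrm{Cov}(M_{t+1}, R^e_{i,t+1}) = -\,b'\,\mathrm{Cov}(f_{t+1}, R^e_{i,t+1}) \\
E[R^e_{i,t+1}] &= \frac{b'\,\mathrm{Cov}(f_{t+1}, R^e_{i,t+1})}{E[M_{t+1}]} = \beta_i'\,\lambda_f
\end{aligned}
$$

where $\beta_i = \mathrm{Var}(f_{t+1})^{-1}\,\mathrm{Cov}(f_{t+1}, R^e_{i,t+1})$ are the regression betas and $\lambda_f = \mathrm{Var}(f_{t+1})\,b \,/\, E[M_{t+1}]$ are the factor risk prices (not to be confused with the Lasso penalty $\lambda$). Choosing which factors enter $M$ is therefore the same problem as choosing which betas matter for expected returns.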

### 🔑 Key Takeaways for "AI for Economists"

1. **Economics First**: ML is a tool, not the goal. Start with economic questions.
2. **Theory Guides Data**: Even Fama-French is motivated by risk-based theories
3. **Regularization Matters**: With high dimensions, penalization prevents overfitting
4. **Interpretation Essential**: Betas and alphas have economic meaning beyond statistics
5. **Honest About Limits**: We still don't fully understand why factors work!

### 🚀 Next Steps

Try these extensions:
- Estimate models for different stocks (tech vs banks vs utilities)
- Test time-varying betas using rolling windows
- Compare Lasso vs Ridge vs Elastic Net
- Implement the 5-factor model (add profitability and investment)
- Apply to international markets
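
As a starting point for the rolling-window extension, here is a minimal sketch of a rolling market beta computed as rolling covariance divided by rolling variance. It uses simulated excess returns in which the true beta steps from 0.8 to 1.5 halfway through the sample. The 60-day window, the return parameters, and all variable names are assumptions for illustration, so swap in the lab's actual return series to use it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
dates = pd.bdate_range("2020-01-01", periods=500)

# Simulated daily market excess returns
market = pd.Series(rng.normal(0.0004, 0.01, size=len(dates)), index=dates)

# The stock's true beta steps from 0.8 to 1.5 halfway through the sample
true_beta = np.where(np.arange(len(dates)) < 250, 0.8, 1.5)
stock = pd.Series(
    true_beta * market.to_numpy() + rng.normal(0, 0.005, size=len(dates)),
    index=dates,
)

# Rolling beta = Cov(stock, market) / Var(market) over a trailing window
window = 60
rolling_beta = stock.rolling(window).cov(market) / market.rolling(window).var()

print(rolling_beta.dropna().describe().round(2))
```

The first `window - 1` values are `NaN` because the window is not yet full, and the estimate needs roughly one full window after the break before it reflects the new beta.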

**Remember**: The goal isn't to find the perfect model. It's to understand the economic forces driving asset prices while being humble about what we don't know.