<a href="https://colab.research.google.com/github/rpjena/random_matrix/blob/main/grinold_factor_metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Grinold Factor Model Metrics

Implementation of factor model evaluation metrics from the **Grinold & Kahn** framework,
as described in *A Practitioner's Guide to Factor Models* (CFA Institute, 1994) and
*Active Portfolio Management* (Grinold & Kahn, 1999).

## Factor Model Specification

The cross-sectional factor model for asset returns:

$$r_i = \sum_{k=1}^{K} X_{ik} f_k + u_i$$

where:
- $r_i$ = excess return of asset $i$
- $X_{ik}$ = exposure of asset $i$ to factor $k$ (z-scored characteristic)
- $f_k$ = return to factor $k$ (estimated via cross-sectional regression)
- $u_i$ = specific (idiosyncratic) return of asset $i$

In matrix form: $\mathbf{r} = \mathbf{X} \mathbf{f} + \mathbf{u}$

## Metrics Implemented

1. **Factor Return Estimation** via WLS cross-sectional regression
2. **Factor Return t-statistics** and cumulative returns
3. **Information Coefficient (IC)** — rank correlation of exposures vs. forward returns
4. **IC Information Ratio (ICIR)** — mean IC / std IC
5. **Quantile Analysis** — mean returns by factor exposure quintile
6. **Cross-Sectional R-squared** — goodness of fit per period
7. **Bias Statistic** — realized vs. predicted risk ratio
8. **Factor Covariance Matrix** and correlation structure
9. **Specific Risk Analysis** — residual diagnostics
10. **Portfolio Risk Decomposition** — factor vs. specific risk
11. **Factor Exposure Turnover** — stability of exposures over time
12. **Variance Inflation Factor (VIF)** — multicollinearity diagnostic

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import spearmanr

np.random.seed(42)
sns.set_style('whitegrid')

## 1. Synthetic Data Generation

We generate a realistic panel of asset returns driven by a known factor structure.
This allows us to validate our metrics against ground truth.

In [None]:
def generate_factor_model_data(N=200, T=120, K=5, seed=42):
    """
    Generate synthetic cross-sectional factor model data.

    The data follows r_{i,t} = X_{i,t} f_t + u_{i,t} where factor exposures
    evolve slowly over time (AR(1) with high persistence) and specific returns
    have heterogeneous volatility.

    Parameters:
        N (int): Number of assets.
        T (int): Number of time periods (months).
        K (int): Number of factors.
        seed (int): Random seed for reproducibility.

    Returns:
        returns (np.array): Asset returns, shape (T, N).
        exposures (np.array): Factor exposures, shape (T, N, K).
        true_factor_returns (np.array): True factor returns, shape (T, K).
        specific_returns (np.array): Specific returns, shape (T, N).
        market_cap (np.array): Market capitalizations, shape (N,).
        factor_names (list): Names of the K factors.
    """
    rng = np.random.RandomState(seed)

    factor_names = ['Market', 'Size', 'Value', 'Momentum', 'Volatility'][:K]

    # Market capitalizations (log-normal, for WLS weights)
    log_mcap = rng.normal(loc=8.0, scale=1.5, size=N)  # log market cap
    market_cap = np.exp(log_mcap)

    # Factor exposures: AR(1) process with persistence rho=0.95
    rho = 0.95
    exposures = np.zeros((T, N, K))
    exposures[0] = rng.randn(N, K)
    for t in range(1, T):
        exposures[t] = rho * exposures[t - 1] + np.sqrt(1 - rho**2) * rng.randn(N, K)

    # Z-score exposures cross-sectionally each period
    for t in range(T):
        mu = exposures[t].mean(axis=0)
        sigma = exposures[t].std(axis=0)
        exposures[t] = (exposures[t] - mu) / sigma

    # True factor returns: i.i.d. normal draws with realistic magnitudes
    # Annualized: Market~6%, Size~2%, Value~3%, Momentum~4%, Vol~-1%
    monthly_means = np.array([0.005, 0.0017, 0.0025, 0.0033, -0.0008])[:K]
    monthly_stds = np.array([0.02, 0.012, 0.015, 0.018, 0.01])[:K]
    true_factor_returns = np.zeros((T, K))
    for k in range(K):
        true_factor_returns[:, k] = rng.normal(monthly_means[k], monthly_stds[k], T)

    # Specific returns: heterogeneous volatility (smaller for large-cap)
    specific_vol = 0.08 / np.sqrt(market_cap / np.median(market_cap))  # monthly
    specific_returns = np.zeros((T, N))
    for t in range(T):
        specific_returns[t] = rng.normal(0, specific_vol)

    # Asset returns: r = X f + u
    returns = np.zeros((T, N))
    for t in range(T):
        returns[t] = exposures[t] @ true_factor_returns[t] + specific_returns[t]

    return returns, exposures, true_factor_returns, specific_returns, market_cap, factor_names

In [None]:
N, T, K = 200, 120, 5
returns, exposures, true_fret, specific_ret, market_cap, factor_names = \
    generate_factor_model_data(N, T, K)

print(f'Assets: {N}, Periods: {T}, Factors: {K}')
print(f'Factor names: {factor_names}')
print(f'Returns shape: {returns.shape}')
print(f'Exposures shape: {exposures.shape}')
print(f'Market cap range: [{market_cap.min():.0f}, {market_cap.max():.0f}]')

## 2. Cross-Sectional WLS Regression (Factor Return Estimation)

Factor returns are estimated each period by regressing the cross-section of asset returns
on factor exposures using Weighted Least Squares (WLS):

$$\hat{\mathbf{f}}_t = (\mathbf{X}_t^\top \mathbf{W}_t \mathbf{X}_t)^{-1} \mathbf{X}_t^\top \mathbf{W}_t \mathbf{r}_t$$

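where $\mathbf{W}_t = \text{diag}(\sqrt{\text{mcap}_i})$, following the Barra convention of
weighting by square-root market capitalization, which reflects the observation that
idiosyncratic risk decreases with market capitalization. Market caps are held fixed in this
notebook, so the same weights are used in every period.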
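
As a minimal sketch of the estimator above, the WLS solution can also be obtained without forming the $N \times N$ matrix $\mathbf{W}$: scale each row by $\sqrt{w_i}$ and solve an ordinary least-squares problem. The toy data and the helper name `wls_factor_returns` are illustrative assumptions and are not part of the notebook's code.

```python
import numpy as np

def wls_factor_returns(X, r, w):
    """WLS estimate f = (X'WX)^{-1} X'Wr with W = diag(w), via row scaling."""
    sw = np.sqrt(w)  # square root of the regression weights
    # Minimizing sum_i w_i (r_i - X_i f)^2 is OLS on sqrt(w)-scaled rows
    f, *_ = np.linalg.lstsq(X * sw[:, None], r * sw, rcond=None)
    return f

rng = np.random.default_rng(0)
N, K = 50, 3
X = rng.standard_normal((N, K))           # toy exposures
f_true = np.array([0.01, -0.02, 0.005])   # toy factor returns
mcap = np.exp(rng.normal(8.0, 1.5, N))    # toy market caps
r = X @ f_true + 0.001 * rng.standard_normal(N)

w = np.sqrt(mcap)                         # square-root market-cap weights
f_hat = wls_factor_returns(X, r, w)

# Cross-check against the normal-equations form used in the formula
XtW = X.T * w                             # same as X.T @ np.diag(w)
f_normal = np.linalg.solve(XtW @ X, XtW @ r)
print(f_hat, f_normal)
```

Both routes should agree to numerical precision; the row-scaled form avoids the dense diagonal matrix and tends to be better conditioned than explicitly solving the normal equations.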

In [None]:
def estimate_factor_returns_wls(returns, exposures, market_cap):
    """
    Estimate factor returns via WLS cross-sectional regression each period.

    Parameters:
        returns (np.array): Asset returns, shape (T, N).
        exposures (np.array): Factor exposures, shape (T, N, K).
        market_cap (np.array): Market capitalizations, shape (N,).

    Returns:
        factor_returns (np.array): Estimated factor returns, shape (T, K).
        residuals (np.array): Specific returns (residuals), shape (T, N).
        r_squared (np.array): Cross-sectional R-squared, shape (T,).
    """
    T, N, K = exposures.shape
    w = np.sqrt(market_cap)  # WLS weights
    W = np.diag(w)

    factor_returns = np.zeros((T, K))
    residuals = np.zeros((T, N))
    r_squared = np.zeros(T)

    for t in range(T):
        X_t = exposures[t]  # (N, K)
        r_t = returns[t]    # (N,)

        # WLS: f = (X'WX)^{-1} X'Wr
        XtW = X_t.T @ W  # (K, N)
        factor_returns[t] = np.linalg.solve(XtW @ X_t, XtW @ r_t)

        # Residuals
        r_hat = X_t @ factor_returns[t]
        residuals[t] = r_t - r_hat

        # Weighted R-squared
        r_bar = np.average(r_t, weights=w)
        ss_tot = np.sum(w * (r_t - r_bar)**2)
        ss_res = np.sum(w * residuals[t]**2)
        r_squared[t] = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

    return factor_returns, residuals, r_squared

In [None]:
est_fret, residuals, r_squared = estimate_factor_returns_wls(returns, exposures, market_cap)

print('Estimated factor returns shape:', est_fret.shape)
print('Residuals shape:', residuals.shape)
print(f'Mean cross-sectional R-squared: {r_squared.mean():.4f}')

## 3. Factor Return Analysis

For each factor, we compute:
- **Mean return** (annualized)
- **Volatility** (annualized)
- **t-statistic**: $t_k = \frac{\bar{f}_k}{\text{se}(f_k)} = \frac{\bar{f}_k}{\sigma_k / \sqrt{T}}$
- **Cumulative returns**: $\prod_{t=1}^{T}(1 + f_{k,t}) - 1$
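
The two formulas above can be checked on a handful of numbers. The sketch below uses made-up monthly returns for a single factor and cross-checks the hand-computed t-statistic against `scipy.stats.ttest_1samp`; the values are illustrative only.

```python
import numpy as np
from scipy import stats

# Toy monthly returns for one factor (illustrative values)
f = np.array([0.01, -0.005, 0.02, 0.0, 0.015, -0.01])
T = len(f)

mean = f.mean()
std = f.std(ddof=1)                  # sample standard deviation
t_stat = mean / (std / np.sqrt(T))   # mean divided by its standard error

# One-sample t-test of "mean factor return = 0" gives the same statistic
t_check = stats.ttest_1samp(f, 0.0).statistic

cumulative = np.prod(1 + f) - 1      # compounded return over the T periods
ann_mean = mean * 12                 # arithmetic annualization of monthly data
ann_vol = std * np.sqrt(12)

print(f"t-stat={t_stat:.3f}, cumulative={cumulative:.4f}")
```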

In [None]:
def factor_return_statistics(factor_returns, factor_names, periods_per_year=12):
    """
    Compute summary statistics for estimated factor returns.

    Parameters:
        factor_returns (np.array): Estimated factor returns, shape (T, K).
        factor_names (list): Factor names.
        periods_per_year (int): Periods per year for annualization.

    Returns:
        stats_df (pd.DataFrame): Summary statistics per factor.
    """
    T, K = factor_returns.shape
    means = factor_returns.mean(axis=0)
    stds = factor_returns.std(axis=0, ddof=1)
    t_stats = means / (stds / np.sqrt(T))

    ann_mean = means * periods_per_year
    ann_vol = stds * np.sqrt(periods_per_year)
    sharpe = ann_mean / ann_vol

    pct_positive = (factor_returns > 0).mean(axis=0)

    stats_df = pd.DataFrame({
        'Ann. Mean (%)': ann_mean * 100,
        'Ann. Vol (%)': ann_vol * 100,
        'Sharpe': sharpe,
        't-stat': t_stats,
        '% Positive': pct_positive * 100
    }, index=factor_names)

    return stats_df

In [None]:
fret_stats = factor_return_statistics(est_fret, factor_names)
print('Factor Return Statistics (Estimated):')
print(fret_stats.round(3))
print()

fret_stats_true = factor_return_statistics(true_fret, factor_names)
print('Factor Return Statistics (True):')
print(fret_stats_true.round(3))

In [None]:
# Cumulative factor returns: estimated vs. true
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for k in range(K):
    ax = axes[k]
    cum_est = np.cumprod(1 + est_fret[:, k]) - 1
    cum_true = np.cumprod(1 + true_fret[:, k]) - 1
    ax.plot(cum_est, label='Estimated', linewidth=1.5)
    ax.plot(cum_true, label='True', linewidth=1.5, linestyle='--')
    ax.set_title(f'{factor_names[k]} (t={fret_stats.loc[factor_names[k], "t-stat"]:.2f})')
    ax.set_xlabel('Month')
    ax.set_ylabel('Cumulative Return')
    ax.legend(fontsize=8)
    ax.axhline(0, color='grey', linewidth=0.5)

# Hide unused subplot
axes[-1].set_visible(False)
fig.suptitle('Cumulative Factor Returns: Estimated vs. True', fontsize=14)
plt.tight_layout()
plt.show()

## 4. Information Coefficient (IC)

The **Information Coefficient** measures the predictive power of factor exposures.
For each factor $k$ at time $t$:

$$\text{IC}_{k,t} = \text{RankCorr}(X_{k,t}, r_{t+1})$$

This is the Spearman rank correlation between factor exposures at time $t$
and subsequent asset returns at $t+1$.

The **IC Information Ratio** (ICIR) summarizes IC persistence:

$$\text{ICIR}_k = \frac{\overline{\text{IC}}_k}{\sigma(\text{IC}_k)}$$
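
A minimal sketch of the two definitions above, on simulated data for a single factor. The `signal_strength` parameter and the data-generating process are assumptions made for illustration; the notebook's own data comes from `generate_factor_model_data`.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
T, N = 60, 100
signal_strength = 0.05   # hypothetical link between exposure and next-period return

ic = np.empty(T)
for t in range(T):
    x = rng.standard_normal(N)                              # exposures at time t
    r_next = signal_strength * x + rng.standard_normal(N)   # returns at time t+1
    ic[t], _ = spearmanr(x, r_next)                         # rank IC for period t

mean_ic = ic.mean()
icir = mean_ic / ic.std(ddof=1)
t_stat = icir * np.sqrt(T)   # identical to mean / (std / sqrt(T))

print(f"mean IC={mean_ic:.3f}, ICIR={icir:.2f}, t={t_stat:.2f}")
```

The last line makes the relationship explicit: the t-statistic of the mean IC is the ICIR scaled by $\sqrt{T}$, so a modest ICIR can still be significant over a long sample.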

In [None]:
def compute_information_coefficient(returns, exposures, factor_names):
    """
    Compute the Information Coefficient (IC) for each factor over time.

    IC is the Spearman rank correlation between factor exposures at time t
    and asset returns at time t+1.

    Parameters:
        returns (np.array): Asset returns, shape (T, N).
        exposures (np.array): Factor exposures, shape (T, N, K).
        factor_names (list): Factor names.

    Returns:
        ic_df (pd.DataFrame): IC time series, shape (T-1, K).
        ic_summary (pd.DataFrame): IC summary statistics per factor.
    """
    T, N, K = exposures.shape
    ic_values = np.zeros((T - 1, K))

    for t in range(T - 1):
        for k in range(K):
            corr, _ = spearmanr(exposures[t, :, k], returns[t + 1])
            ic_values[t, k] = corr

    ic_df = pd.DataFrame(ic_values, columns=factor_names)

    # Summary statistics
    ic_mean = ic_df.mean()
    ic_std = ic_df.std()
    icir = ic_mean / ic_std
    ic_t = ic_mean / (ic_std / np.sqrt(len(ic_df)))
    pct_positive = (ic_df > 0).mean()

    ic_summary = pd.DataFrame({
        'Mean IC': ic_mean,
        'Std IC': ic_std,
        'ICIR': icir,
        't-stat': ic_t,
        '% Positive': pct_positive * 100
    })

    return ic_df, ic_summary

In [None]:
ic_df, ic_summary = compute_information_coefficient(returns, exposures, factor_names)

print('IC Summary:')
print(ic_summary.round(4))

In [None]:
# IC time series with rolling mean
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for k in range(K):
    ax = axes[k]
    ax.bar(range(len(ic_df)), ic_df[factor_names[k]], alpha=0.4, width=1.0,
           color='steelblue', label='IC')
    rolling_ic = ic_df[factor_names[k]].rolling(12).mean()
    ax.plot(rolling_ic, color='darkred', linewidth=2, label='12m rolling mean')
    ax.axhline(0, color='black', linewidth=0.5)
    ax.axhline(ic_summary.loc[factor_names[k], 'Mean IC'], color='green',
               linewidth=1, linestyle='--', label=f'Mean={ic_summary.loc[factor_names[k], "Mean IC"]:.3f}')
    ax.set_title(f'{factor_names[k]} IC (ICIR={ic_summary.loc[factor_names[k], "ICIR"]:.2f})')
    ax.set_xlabel('Month')
    ax.legend(fontsize=7)

axes[-1].set_visible(False)
fig.suptitle('Information Coefficient (IC) Time Series', fontsize=14)
plt.tight_layout()
plt.show()

## 5. Quantile Analysis

For each factor, assets are sorted into quintiles based on exposure, and the
mean forward return of each quintile is computed. A monotonic relationship
from Q1 to Q5 indicates predictive power.

The **long-short spread** (Q5 - Q1) is the return from going long the
top quintile and short the bottom quintile.
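
For a single period, the sort-and-average step can be sketched as follows. The simulated exposures and returns are illustrative assumptions; ranking with `method="first"` before calling `pd.qcut` breaks ties, so the bins always have unique edges and equal counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
N, n_quantiles = 100, 5
exposure = rng.standard_normal(N)                               # exposures at time t
fwd_return = 0.01 * exposure + 0.05 * rng.standard_normal(N)    # returns at time t+1

# Rank first so ties cannot produce duplicate bin edges, then cut into equal bins
ranks = pd.Series(exposure).rank(method="first")
labels = [f"Q{i + 1}" for i in range(n_quantiles)]
bucket = pd.qcut(ranks, n_quantiles, labels=labels)

mean_by_bucket = pd.Series(fwd_return).groupby(bucket, observed=True).mean()
spread = mean_by_bucket[f"Q{n_quantiles}"] - mean_by_bucket["Q1"]   # top minus bottom

print(mean_by_bucket.round(4))
print(f"L/S spread: {spread:.4f}")
```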

In [None]:
def quantile_analysis(returns, exposures, factor_names, n_quantiles=5):
    """
    Compute mean forward returns by factor exposure quantile.

    Parameters:
        returns (np.array): Asset returns, shape (T, N).
        exposures (np.array): Factor exposures, shape (T, N, K).
        factor_names (list): Factor names.
        n_quantiles (int): Number of quantile bins.

    Returns:
        quantile_returns (dict): {factor_name: DataFrame of mean returns per quantile per period}.
        quantile_summary (pd.DataFrame): Annualized mean return per quantile per factor.
    """
    T, N, K = exposures.shape
    quantile_returns = {}

    for k in range(K):
        qr = np.zeros((T - 1, n_quantiles))
        for t in range(T - 1):
            # Assign quantile bins based on exposure at time t
            ranks = pd.Series(exposures[t, :, k]).rank(method='first')
            quantile_labels = pd.qcut(ranks, n_quantiles, labels=False)

            for q in range(n_quantiles):
                mask = quantile_labels == q
                qr[t, q] = returns[t + 1, mask].mean()

        quantile_returns[factor_names[k]] = pd.DataFrame(
            qr, columns=[f'Q{i+1}' for i in range(n_quantiles)])

    # Summary: annualized mean returns per quantile
    summary_data = {}
    for k in range(K):
        name = factor_names[k]
        mean_qr = quantile_returns[name].mean() * 12  # annualize
        summary_data[name] = mean_qr

    quantile_summary = pd.DataFrame(summary_data).T
    # Top-minus-bottom spread; uses the last bin so n_quantiles != 5 also works
    quantile_summary['L/S Spread'] = quantile_summary[f'Q{n_quantiles}'] - quantile_summary['Q1']

    return quantile_returns, quantile_summary

In [None]:
quantile_ret, quantile_summary = quantile_analysis(returns, exposures, factor_names)

print('Quantile Returns (Annualized %):')
print((quantile_summary * 100).round(2))

In [None]:
# Quantile return bar charts
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for k in range(K):
    ax = axes[k]
    name = factor_names[k]
    mean_qr = quantile_ret[name].mean() * 12 * 100  # annualized %
    colors = ['#d73027', '#fc8d59', '#fee08b', '#91cf60', '#1a9850']
    ax.bar(mean_qr.index, mean_qr.values, color=colors)
    ax.set_title(f'{name} (Spread={quantile_summary.loc[name, "L/S Spread"]*100:.1f}%)')
    ax.set_ylabel('Ann. Return (%)')
    ax.axhline(0, color='black', linewidth=0.5)

axes[-1].set_visible(False)
fig.suptitle('Mean Quantile Returns by Factor Exposure', fontsize=14)
plt.tight_layout()
plt.show()

## 6. Cross-Sectional R-squared

The cross-sectional $R^2_t$ measures how much of the cross-sectional variation
in returns the factor model explains at each time $t$:

$$R^2_t = 1 - \frac{\sum_i w_i (r_{i,t} - \hat{r}_{i,t})^2}{\sum_i w_i (r_{i,t} - \bar{r}_t)^2}$$

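Here $w_i = \sqrt{\text{mcap}_i}$ are the regression weights and $\bar{r}_t$ is the
weighted mean return. A high-quality factor structure should explain a substantial
fraction of cross-sectional return dispersion.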
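
The formula can be sanity-checked at its two reference points: a perfect fit and a fit that only predicts the weighted mean. The helper `weighted_r_squared` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def weighted_r_squared(r, r_hat, w):
    """Weighted cross-sectional R^2 for one period."""
    r_bar = np.average(r, weights=w)          # weighted mean return
    ss_res = np.sum(w * (r - r_hat) ** 2)
    ss_tot = np.sum(w * (r - r_bar) ** 2)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

r = np.array([0.02, -0.01, 0.03, 0.00])       # toy realized returns
w = np.array([1.0, 2.0, 1.0, 4.0])            # toy regression weights

perfect = weighted_r_squared(r, r, w)         # fitted values equal realized returns
r_bar = np.average(r, weights=w)
mean_only = weighted_r_squared(r, np.full_like(r, r_bar), w)  # fit is the weighted mean

print(perfect, mean_only)
```

Because the regression in this notebook has no intercept, a period in which the fitted values do worse than the weighted mean can produce a negative $R^2_t$.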

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(r_squared, color='steelblue', linewidth=1, alpha=0.7)
rolling_r2 = pd.Series(r_squared).rolling(12).mean()
ax.plot(rolling_r2, color='darkred', linewidth=2, label='12m rolling mean')
ax.axhline(r_squared.mean(), color='green', linestyle='--',
           label=f'Mean={r_squared.mean():.3f}')
ax.set_xlabel('Month')
ax.set_ylabel('Cross-Sectional R-squared')
ax.set_title('Cross-Sectional R-squared Over Time')
ax.legend()
plt.tight_layout()
plt.show()

print(f'R-squared: mean={r_squared.mean():.4f}, '
      f'median={np.median(r_squared):.4f}, '
      f'min={r_squared.min():.4f}, max={r_squared.max():.4f}')

## 7. Bias Statistic

The **bias statistic** tests whether the risk model's forecasts are well-calibrated.
For a portfolio $p$ with predicted volatility $\sigma_{p,t}$ at time $t$, the standardized return is:

$$z_{p,t} = \frac{r_{p,t}}{\sigma_{p,t}}$$

The bias statistic is:

$$B_p = \text{std}(z_{p,t})$$

- $B_p \approx 1.0$: risk forecast is well-calibrated (unbiased)
- $B_p > 1.0$: risk is under-predicted
- $B_p < 1.0$: risk is over-predicted

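We compute bias statistics for each factor portfolio (unit exposure to one factor, zero to others)
and for specific returns, pooled by specific-volatility decile. In both cases the predicted
volatility is a trailing rolling-window estimate.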
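
A minimal simulation of the three cases listed above. The data is synthetic, and `bias_statistic` is an illustrative helper that takes a constant forecast rather than the rolling estimate used in the notebook. The band $1 \pm \sqrt{2/T}$ is the usual approximate 95% interval under normality and a correct forecast.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 2000
true_vol = 0.02
r = rng.normal(0.0, true_vol, T)              # simulated portfolio returns

def bias_statistic(returns, predicted_vol):
    """Standard deviation of returns standardized by the predicted volatility."""
    z = returns / predicted_vol
    return z.std(ddof=1)

b_correct = bias_statistic(r, true_vol)       # forecast equals the true volatility
b_under = bias_statistic(r, 0.5 * true_vol)   # forecast is half the true volatility
b_over = bias_statistic(r, 2.0 * true_vol)    # forecast is double the true volatility

# Approximate 95% band for a well-calibrated forecast (normality assumed)
band = (1 - np.sqrt(2 / T), 1 + np.sqrt(2 / T))

print(f"correct={b_correct:.3f}, under={b_under:.3f}, over={b_over:.3f}, band={band}")
```

With only a few dozen out-of-sample periods the band is wide, so a bias statistic some distance from 1.0 is not necessarily evidence of miscalibration.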

In [None]:
def compute_bias_statistics(factor_returns, residuals, exposures, market_cap,
                            factor_names, window=60):
    """
    Compute bias statistics for factor portfolios and specific returns.

    The bias statistic is the standard deviation of standardized returns
    (realized return / predicted volatility). A value of 1.0 indicates
    unbiased risk forecasts.

    Parameters:
        factor_returns (np.array): Estimated factor returns, shape (T, K).
        residuals (np.array): Specific returns, shape (T, N).
        exposures (np.array): Factor exposures, shape (T, N, K).
        market_cap (np.array): Market capitalizations, shape (N,).
        factor_names (list): Factor names.
        window (int): Rolling window for volatility estimation.

    Returns:
        bias_df (pd.DataFrame): Bias statistics per factor.
        specific_bias (pd.DataFrame): Bias stats by specific risk decile.
    """
    T, K = factor_returns.shape
    N = residuals.shape[1]

    # Factor bias statistics
    bias_data = []
    for k in range(K):
        standardized = []
        for t in range(window, T):
            # Rolling predicted volatility
            pred_vol = factor_returns[t-window:t, k].std(ddof=1)
            if pred_vol > 1e-10:
                standardized.append(factor_returns[t, k] / pred_vol)
        bias_stat = np.std(standardized, ddof=1) if standardized else np.nan
        bias_data.append({
            'Factor': factor_names[k],
            'Bias Statistic': bias_stat,
            'Status': 'OK' if 0.75 < bias_stat < 1.25 else 'Warning'
        })

    bias_df = pd.DataFrame(bias_data).set_index('Factor')

    # Specific return bias by volatility decile
    spec_vol = residuals.std(axis=0, ddof=1)
    decile_labels = pd.qcut(spec_vol, 10, labels=[f'D{i+1}' for i in range(10)])

    decile_bias = []
    for d in range(10):
        label = f'D{d+1}'
        mask = decile_labels == label
        assets_in_decile = np.where(mask)[0]

        standardized_all = []
        for i in assets_in_decile:
            for t in range(window, T):
                pred_v = residuals[t-window:t, i].std(ddof=1)
                if pred_v > 1e-10:
                    standardized_all.append(residuals[t, i] / pred_v)

        b = np.std(standardized_all, ddof=1) if standardized_all else np.nan
        decile_bias.append({'Decile': label, 'Bias Statistic': b})

    specific_bias = pd.DataFrame(decile_bias).set_index('Decile')

    return bias_df, specific_bias

In [None]:
bias_df, specific_bias = compute_bias_statistics(
    est_fret, residuals, exposures, market_cap, factor_names, window=36)

print('Factor Bias Statistics (target = 1.0):')
print(bias_df.round(3))
print()
print('Specific Return Bias by Volatility Decile:')
print(specific_bias.round(3))

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Factor bias statistics
colors = ['green' if s == 'OK' else 'orange' for s in bias_df['Status']]
ax1.barh(bias_df.index, bias_df['Bias Statistic'], color=colors)
ax1.axvline(1.0, color='red', linestyle='--', linewidth=1.5, label='Ideal = 1.0')
ax1.axvline(0.75, color='grey', linestyle=':', linewidth=1)
ax1.axvline(1.25, color='grey', linestyle=':', linewidth=1)
ax1.set_xlabel('Bias Statistic')
ax1.set_title('Factor Bias Statistics')
ax1.legend()

# Specific risk bias by decile
ax2.bar(specific_bias.index, specific_bias['Bias Statistic'], color='steelblue')
ax2.axhline(1.0, color='red', linestyle='--', linewidth=1.5, label='Ideal = 1.0')
ax2.axhline(0.75, color='grey', linestyle=':', linewidth=1)
ax2.axhline(1.25, color='grey', linestyle=':', linewidth=1)
ax2.set_xlabel('Specific Vol Decile (Low to High)')
ax2.set_ylabel('Bias Statistic')
ax2.set_title('Specific Return Bias by Volatility Decile')
ax2.legend()

plt.tight_layout()
plt.show()

## 8. Factor Covariance Matrix

The factor covariance matrix $\Sigma_f$ is estimated from the time series of
factor returns. This is a key input to portfolio risk:

$$\Sigma = \mathbf{X} \Sigma_f \mathbf{X}^\top + \mathbf{D}$$

where $\mathbf{D}$ is the diagonal matrix of specific variances.
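
The sketch below assembles $\Sigma$ from toy inputs and checks two properties that follow from the formula: the result is symmetric, and it is positive definite whenever every specific variance is positive. All inputs are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 6, 2
X = rng.standard_normal((N, K))                 # toy exposures
A = rng.standard_normal((K, K))
sigma_f = A @ A.T / 100                         # toy factor covariance (PSD by construction)
spec_var = rng.uniform(0.001, 0.01, N)          # toy specific variances

sigma = X @ sigma_f @ X.T + np.diag(spec_var)   # Sigma = X Sigma_f X' + D

# Given the exposures, the model estimates K(K+1)/2 + N risk parameters
# instead of the N(N+1)/2 entries of an unstructured covariance matrix
n_params_model = K * (K + 1) // 2 + N
n_params_full = N * (N + 1) // 2

eigvals = np.linalg.eigvalsh(sigma)             # smallest eigenvalue is at least min(spec_var)
print(n_params_model, n_params_full, eigvals.min())
```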

In [None]:
# Estimated factor covariance and correlation
factor_cov = np.cov(est_fret, rowvar=False) * 12  # annualized
factor_corr = np.corrcoef(est_fret, rowvar=False)

# True factor covariance
true_factor_cov = np.cov(true_fret, rowvar=False) * 12

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(pd.DataFrame(factor_corr, index=factor_names, columns=factor_names),
            annot=True, fmt='.2f', cmap='RdBu_r', center=0, vmin=-1, vmax=1,
            ax=ax1)
ax1.set_title('Estimated Factor Correlation')

# Annualized factor volatilities: estimated vs true
est_vol = np.sqrt(np.diag(factor_cov)) * 100
true_vol = np.sqrt(np.diag(true_factor_cov)) * 100
x = np.arange(K)
width = 0.35
ax2.bar(x - width/2, est_vol, width, label='Estimated', color='steelblue')
ax2.bar(x + width/2, true_vol, width, label='True', color='coral')
ax2.set_xticks(x)
ax2.set_xticklabels(factor_names)
ax2.set_ylabel('Annualized Volatility (%)')
ax2.set_title('Factor Volatilities: Estimated vs True')
ax2.legend()

plt.tight_layout()
plt.show()

## 9. Specific Risk Analysis

Specific (idiosyncratic) returns $u_{i,t}$ should be:
- Approximately normally distributed
- Uncorrelated across assets (the factor model captured all common variation)
- Have volatility that decreases with market capitalization

We check these properties with distributional diagnostics and
cross-asset correlation analysis.

In [None]:
def specific_risk_analysis(residuals, market_cap):
    """
    Analyze properties of specific (idiosyncratic) returns.

    Parameters:
        residuals (np.array): Specific returns, shape (T, N).
        market_cap (np.array): Market capitalizations, shape (N,).

    Returns:
        stats (dict): Distributional statistics.
        spec_vol (np.array): Per-asset specific volatility, shape (N,).
    """
    T, N = residuals.shape
    spec_vol = residuals.std(axis=0, ddof=1) * np.sqrt(12)  # annualized

    # Distributional stats of pooled residuals
    pooled = residuals.flatten()
    stats_dict = {
        'Mean': pooled.mean(),
        'Std': pooled.std(),
        'Skewness': float(pd.Series(pooled).skew()),
        'Excess Kurtosis': float(pd.Series(pooled).kurtosis()),  # pandas reports excess kurtosis (normal = 0)
        'Mean Spec Vol (ann %)': spec_vol.mean() * 100,
        'Median Spec Vol (ann %)': np.median(spec_vol) * 100
    }

    # Cross-asset correlation of residuals (should be near zero)
    sample_pairs = min(500, N * (N - 1) // 2)
    rng = np.random.RandomState(0)
    pairwise_corrs = []
    for _ in range(sample_pairs):
        i, j = rng.choice(N, 2, replace=False)
        c = np.corrcoef(residuals[:, i], residuals[:, j])[0, 1]
        pairwise_corrs.append(c)
    stats_dict['Mean Pairwise Corr'] = np.mean(pairwise_corrs)
    stats_dict['Std Pairwise Corr'] = np.std(pairwise_corrs)

    return stats_dict, spec_vol

In [None]:
spec_stats, spec_vol = specific_risk_analysis(residuals, market_cap)

print('Specific Return Statistics:')
for k, v in spec_stats.items():
    print(f'  {k}: {v:.4f}')

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Distribution of pooled residuals
ax = axes[0]
pooled = residuals.flatten()
ax.hist(pooled, bins=80, density=True, alpha=0.7, color='steelblue')
x_grid = np.linspace(pooled.min(), pooled.max(), 200)
ax.plot(x_grid, stats.norm.pdf(x_grid, pooled.mean(), pooled.std()),
        'r-', linewidth=2, label='Normal fit')
ax.set_title('Distribution of Specific Returns')
ax.set_xlabel('Specific Return')
ax.legend()

# Specific vol vs log market cap
ax = axes[1]
ax.scatter(np.log(market_cap), spec_vol * 100, alpha=0.5, s=15, color='steelblue')
z = np.polyfit(np.log(market_cap), spec_vol * 100, 1)
p = np.poly1d(z)
x_fit = np.linspace(np.log(market_cap).min(), np.log(market_cap).max(), 100)
ax.plot(x_fit, p(x_fit), 'r-', linewidth=2)
ax.set_xlabel('Log Market Cap')
ax.set_ylabel('Annualized Specific Vol (%)')
ax.set_title('Specific Risk vs. Market Cap')

# Distribution of specific volatilities
ax = axes[2]
ax.hist(spec_vol * 100, bins=30, alpha=0.7, color='steelblue')
ax.axvline(np.median(spec_vol) * 100, color='red', linestyle='--',
           label=f'Median={np.median(spec_vol)*100:.1f}%')
ax.set_xlabel('Annualized Specific Vol (%)')
ax.set_ylabel('Count')
ax.set_title('Distribution of Specific Volatilities')
ax.legend()

plt.tight_layout()
plt.show()

## 10. Portfolio Risk Decomposition

For a portfolio with weight vector $\mathbf{x}$, total risk decomposes as:

$$\sigma_p^2 = \underbrace{\mathbf{x}^\top \mathbf{X} \Sigma_f \mathbf{X}^\top \mathbf{x}}_{\text{factor risk}} + \underbrace{\mathbf{x}^\top \mathbf{D} \mathbf{x}}_{\text{specific risk}}$$

where $\Sigma_f$ is the factor covariance matrix and $\mathbf{D} = \text{diag}(\sigma^2_{u_i})$.

This decomposition reveals what fraction of portfolio risk comes from
factor exposures vs. idiosyncratic stock-specific risk.
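
A small worked example of the decomposition, using made-up numbers for three assets and two factors. It checks that the factor and specific pieces add up to $\mathbf{x}^\top \Sigma \mathbf{x}$ and that the per-factor terms $b_k (\Sigma_f \mathbf{b})_k$, with $\mathbf{b} = \mathbf{X}^\top \mathbf{x}$, sum to the factor variance.

```python
import numpy as np

X = np.array([[1.0, 0.5],
              [0.2, -1.0],
              [-0.8, 0.3]])                  # toy exposures: 3 assets x 2 factors
sigma_f = np.array([[0.04, 0.01],
                    [0.01, 0.02]])           # toy factor covariance
spec_var = np.array([0.03, 0.05, 0.02])      # toy specific variances
x = np.array([0.5, 0.3, 0.2])                # toy portfolio weights

b = X.T @ x                                  # portfolio factor exposures, shape (K,)
factor_var = b @ sigma_f @ b
specific_var = np.sum(x**2 * spec_var)
total_var = factor_var + specific_var

# Per-factor contributions add up to the factor variance
contrib = b * (sigma_f @ b)

# Same total computed from the full asset covariance matrix
sigma = X @ sigma_f @ X.T + np.diag(spec_var)
total_check = x @ sigma @ x

print(f"factor share: {factor_var / total_var:.1%}, "
      f"specific share: {specific_var / total_var:.1%}")
```

Individual entries of `contrib` can be negative when a factor hedges the others, even though their sum is never negative.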

In [None]:
def portfolio_risk_decomposition(weights, exposures_t, factor_cov, specific_var,
                                  factor_names):
    """
    Decompose portfolio risk into factor and specific components.

    Parameters:
        weights (np.array): Portfolio weights, shape (N,).
        exposures_t (np.array): Factor exposures at time t, shape (N, K).
        factor_cov (np.array): Factor covariance matrix, shape (K, K).
        specific_var (np.array): Specific variances, shape (N,).
        factor_names (list): Factor names.

    Returns:
        decomp (dict): Risk decomposition results.
    """
    # Portfolio factor exposures
    port_exposures = exposures_t.T @ weights  # (K,)

    # Factor risk
    factor_var = port_exposures @ factor_cov @ port_exposures

    # Specific risk
    spec_var = weights @ (specific_var * weights)

    # Total risk
    total_var = factor_var + spec_var
    total_vol = np.sqrt(total_var)

    # Per-factor contribution
    factor_mcr = factor_cov @ port_exposures  # marginal contribution
    factor_contributions = port_exposures * factor_mcr

    decomp = {
        'Total Vol (ann %)': total_vol * 100,
        'Factor Vol (ann %)': np.sqrt(factor_var) * 100,
        'Specific Vol (ann %)': np.sqrt(spec_var) * 100,
        'Factor Risk Share (%)': factor_var / total_var * 100,
        'Specific Risk Share (%)': spec_var / total_var * 100,
        'Portfolio Factor Exposures': pd.Series(port_exposures, index=factor_names),
        'Factor Risk Contributions': pd.Series(
            factor_contributions / total_var * 100, index=factor_names)
    }

    return decomp

In [None]:
# Example portfolios
T_last = T - 1
spec_var = residuals.var(axis=0, ddof=1)  # monthly specific variance (not annualized)

# Equal-weight portfolio
w_eq = np.ones(N) / N

# Cap-weight portfolio
w_cap = market_cap / market_cap.sum()

# Random active portfolio (long-short, sum to 0)
rng = np.random.RandomState(123)
w_active = rng.randn(N)
w_active = w_active - w_active.mean()
w_active = w_active / np.abs(w_active).sum() * 2  # gross exposure = 200%

portfolios = {
    'Equal-Weight': w_eq,
    'Cap-Weight': w_cap,
    'Long-Short Active': w_active
}

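# Use monthly factor cov (not annualized) so it matches the monthly specific variances;
# the '(ann %)' keys returned by portfolio_risk_decomposition therefore hold monthly values here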
factor_cov_monthly = np.cov(est_fret, rowvar=False)

for name, w in portfolios.items():
    decomp = portfolio_risk_decomposition(
        w, exposures[T_last], factor_cov_monthly, spec_var, factor_names)
    print(f'\n--- {name} Portfolio ---')
    print(f'  Total Vol (monthly):   {decomp["Total Vol (ann %)"]:.2f}%')
    print(f'  Factor Vol (monthly):  {decomp["Factor Vol (ann %)"]:.2f}%')
    print(f'  Specific Vol (monthly):{decomp["Specific Vol (ann %)"]:.2f}%')
    print(f'  Factor Risk Share:     {decomp["Factor Risk Share (%)"]:.1f}%')
    print(f'  Specific Risk Share:   {decomp["Specific Risk Share (%)"]:.1f}%')
    print(f'  Portfolio Exposures:')
    print(f'    {decomp["Portfolio Factor Exposures"].round(3).to_dict()}')
    print(f'  Factor Risk Contributions (% of total variance):')
    print(f'    {decomp["Factor Risk Contributions"].round(2).to_dict()}')

## 11. Factor Exposure Turnover

Factor exposure **turnover** measures how rapidly exposures change over time.
High turnover implies a less stable factor definition. We measure stability, the
inverse of turnover, as the cross-sectional rank correlation of exposures between
consecutive periods:

$$\text{Autocorr}_{k,t} = \text{RankCorr}(X_{k,t}, X_{k,t-1})$$

Values near 1.0 indicate stable, slowly evolving exposures (low turnover).
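
To illustrate how the statistic separates stable from unstable exposures, the sketch below applies one AR(1) step with two different persistence levels. The dynamics and the helper `next_exposures` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(5)
N = 500
x_prev = rng.standard_normal(N)               # exposures at time t-1

def next_exposures(x, rho, rng):
    """One AR(1) step that preserves unit variance (illustrative dynamics)."""
    return rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(len(x))

stable = next_exposures(x_prev, 0.95, rng)    # slowly evolving factor
noisy = next_exposures(x_prev, 0.20, rng)     # rapidly changing factor

ac_stable, _ = spearmanr(stable, x_prev)
ac_noisy, _ = spearmanr(noisy, x_prev)

print(f"rank autocorr: stable={ac_stable:.3f}, noisy={ac_noisy:.3f}")
```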

In [None]:
def factor_exposure_turnover(exposures, factor_names):
    """
    Compute factor exposure turnover as rank autocorrelation.

    Parameters:
        exposures (np.array): Factor exposures, shape (T, N, K).
        factor_names (list): Factor names.

    Returns:
        turnover_df (pd.DataFrame): Rank autocorrelation per factor per period.
        turnover_summary (pd.DataFrame): Summary statistics.
    """
    T, N, K = exposures.shape
    autocorr = np.zeros((T - 1, K))

    for t in range(1, T):
        for k in range(K):
            corr, _ = spearmanr(exposures[t, :, k], exposures[t - 1, :, k])
            autocorr[t - 1, k] = corr

    turnover_df = pd.DataFrame(autocorr, columns=factor_names)

    turnover_summary = pd.DataFrame({
        'Mean Rank Autocorr': turnover_df.mean(),
        'Min Rank Autocorr': turnover_df.min(),
        'Max Rank Autocorr': turnover_df.max()
    })

    return turnover_df, turnover_summary

In [None]:
turnover_df, turnover_summary = factor_exposure_turnover(exposures, factor_names)

print('Factor Exposure Turnover (Rank Autocorrelation):')
print(turnover_summary.round(4))

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))
for k in range(K):
    ax.plot(turnover_df[factor_names[k]], label=factor_names[k], linewidth=1, alpha=0.8)
ax.set_xlabel('Month')
ax.set_ylabel('Rank Autocorrelation')
ax.set_title('Factor Exposure Stability Over Time')
ax.legend()
ax.axhline(1.0, color='grey', linewidth=0.5)
plt.tight_layout()
plt.show()

## 12. Variance Inflation Factor (VIF)

The **VIF** measures multicollinearity among factor exposures. For factor $k$:

$$\text{VIF}_k = \frac{1}{1 - R^2_k}$$

where $R^2_k$ is the R-squared from regressing factor $k$'s exposures on all other factors.

- VIF $\leq$ 5: acceptable
- 5 $<$ VIF $\leq$ 10: elevated (flagged `High` in the code)
- VIF $>$ 10: severe multicollinearity
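
An equivalent way to obtain all VIFs at once is to read them off the diagonal of the inverse correlation matrix of the exposures. The sketch below compares that shortcut with the explicit regression definition on toy data in which one column is deliberately built as a near-duplicate of another. Both helper functions are illustrative and are not the notebook's `compute_vif`.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF of each column of X: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def vif_by_regression(X, k):
    """VIF of column k from an explicit OLS regression with intercept."""
    y = X[:, k]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, k, axis=1)])
    resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.standard_normal(200)  # near-duplicate of column 0

vif = vif_from_correlation(X)
vif_reg = np.array([vif_by_regression(X, k) for k in range(X.shape[1])])

print(vif.round(2), vif_reg.round(2))
```

The matrix-inverse route is convenient for a quick check, but it becomes numerically unreliable when the exposures are close to perfectly collinear, which is exactly when the VIF is most informative.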

In [None]:
def compute_vif(exposures, factor_names):
    """
    Compute Variance Inflation Factors for factor exposures.

    Parameters:
        exposures (np.array): Factor exposures, shape (T, N, K).
        factor_names (list): Factor names.

    Returns:
        vif_df (pd.DataFrame): Mean and max VIF per factor across periods, plus a status flag.
    """
    T, N, K = exposures.shape
    vif_per_period = np.zeros((T, K))

    for t in range(T):
        X = exposures[t]  # (N, K)
        for k in range(K):
            y = X[:, k]
            others = np.delete(X, k, axis=1)
            # Add intercept
            others_with_const = np.column_stack([np.ones(N), others])
            # OLS: R-squared
            beta = np.linalg.lstsq(others_with_const, y, rcond=None)[0]
            y_hat = others_with_const @ beta
            ss_res = np.sum((y - y_hat)**2)
            ss_tot = np.sum((y - y.mean())**2)
            r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
            vif_per_period[t, k] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf

    vif_mean = vif_per_period.mean(axis=0)
    vif_max = vif_per_period.max(axis=0)

    vif_df = pd.DataFrame({
        'Mean VIF': vif_mean,
        'Max VIF': vif_max,
        'Status': ['OK' if v <= 5 else 'High' if v <= 10 else 'Severe'
                    for v in vif_mean]
    }, index=factor_names)

    return vif_df

In [None]:
vif_df = compute_vif(exposures, factor_names)

print('Variance Inflation Factors (VIF):')
print(vif_df.round(3))

## 13. Summary Dashboard

Consolidated view of all factor model diagnostics.

In [None]:
def build_summary_dashboard(fret_stats, ic_summary, quantile_summary,
                             bias_df, vif_df, turnover_summary, r_squared):
    """
    Build a consolidated summary of all factor model metrics.

    Parameters:
        fret_stats (pd.DataFrame): Factor return statistics.
        ic_summary (pd.DataFrame): IC summary.
        quantile_summary (pd.DataFrame): Quantile return summary.
        bias_df (pd.DataFrame): Bias statistics.
        vif_df (pd.DataFrame): VIF statistics.
        turnover_summary (pd.DataFrame): Turnover summary.
        r_squared (np.array): Cross-sectional R-squared.

    Returns:
        dashboard (pd.DataFrame): Consolidated metrics.
    """
    dashboard = pd.DataFrame(index=fret_stats.index)

    # Factor returns
    dashboard['Ann. Return (%)'] = fret_stats['Ann. Mean (%)']
    dashboard['Ann. Vol (%)'] = fret_stats['Ann. Vol (%)']
    dashboard['Ret t-stat'] = fret_stats['t-stat']

    # IC
    dashboard['Mean IC'] = ic_summary['Mean IC']
    dashboard['ICIR'] = ic_summary['ICIR']

    # Quantile spread
    dashboard['L/S Spread (%)'] = quantile_summary['L/S Spread'] * 100

    # Bias
    dashboard['Bias Stat'] = bias_df['Bias Statistic']

    # VIF
    dashboard['VIF'] = vif_df['Mean VIF']

    # Turnover
    dashboard['Exp. Autocorr'] = turnover_summary['Mean Rank Autocorr']

    return dashboard

In [None]:
dashboard = build_summary_dashboard(
    fret_stats, ic_summary, quantile_summary, bias_df, vif_df,
    turnover_summary, r_squared)

print('=' * 90)
print('GRINOLD FACTOR MODEL METRICS — SUMMARY DASHBOARD')
print('=' * 90)
print(dashboard.round(3).to_string())
print()
print(f'Cross-Sectional R-squared: mean={r_squared.mean():.4f}, '
      f'median={np.median(r_squared):.4f}')
print('=' * 90)

In [None]:
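# Visual summary: heatmap of key metrics (colors z-scored per column, raw values annotated)
display_cols = ['Ret t-stat', 'ICIR', 'L/S Spread (%)', 'Bias Stat', 'VIF', 'Exp. Autocorr']
display_df = dashboard[display_cols].copy()
# Z-score each column so metrics on different scales share one color scale
color_df = (display_df - display_df.mean()) / display_df.std(ddof=0).replace(0, 1.0)

fig, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(color_df, annot=display_df, fmt='.2f', cmap='RdYlGn', center=0, ax=ax)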
ax.set_title('Factor Model Metrics Summary')
plt.tight_layout()
plt.show()

## References

- Grinold, R. C. & Kahn, R. N. (1994). "Multiple-Factor Models for Portfolio Risk." In *A Practitioner's Guide to Factor Models*, CFA Institute Research Foundation.
- Grinold, R. C. & Kahn, R. N. (1999). *Active Portfolio Management*. McGraw-Hill.
- Menchero, J., Orr, D. J., & Wang, J. (2011). "The Barra US Equity Model (USE4)." MSCI Methodology Notes.
- Fama, E. F. & MacBeth, J. D. (1973). "Risk, Return, and Equilibrium: Empirical Tests." *Journal of Political Economy*.