### Data Processing

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
port_totret = pd.read_excel('data/dfa_analysis_data.xlsx',sheet_name='portfolios (total returns)',index_col='Date')
factors = pd.read_excel('data/dfa_analysis_data.xlsx',sheet_name='factors',index_col='Date')
rf = factors[['RF']]  # risk-free rate, reusing the factors sheet loaded above

# Convert total returns to excess returns by subtracting the risk-free rate
# (the variable keeps the name port_totret, but holds excess returns from here on)
port_totret = port_totret.subtract(rf['RF'], axis=0)
port_totret = port_totret.loc['1981':] # Focus on data from 1981 onwards

# Preview the last few rows of the excess returns
port_totret.tail()

Date,SMALL LoBM,ME1 BM2,ME1 BM3,ME1 BM4,SMALL HiBM,ME2 BM1,ME2 BM2,ME2 BM3,ME2 BM4,ME2 BM5,...,ME4 BM1,ME4 BM2,ME4 BM3,ME4 BM4,ME4 BM5,BIG LoBM,ME5 BM2,ME5 BM3,ME5 BM4,BIG HiBM
2025-04-30,0.004568,0.013403,0.005444,-0.009842,-0.028751,0.002469,-0.012547,-0.028043,-0.034868,-0.067272,...,-0.012266,-0.016199,-0.023646,-0.042776,-0.076168,0.010606,-0.033629,-0.077367,-0.016972,-0.031441
2025-05-31,0.104649,0.053824,0.049631,0.038679,0.053372,0.040805,0.047,0.065733,0.046892,0.054768,...,0.058777,0.046422,0.031553,0.077375,0.062026,0.074277,0.057496,0.014607,0.022356,0.061884
2025-06-30,0.135179,0.055405,0.083003,0.043425,0.051947,0.058885,0.062943,0.070162,0.05276,0.046904,...,0.016951,0.039792,0.020775,0.070415,0.054624,0.051879,0.059051,0.044005,0.033024,0.066709
2025-07-31,0.010927,0.053868,0.002366,0.0195,0.004535,0.014286,0.014293,0.022281,0.005769,0.014453,...,0.030806,0.018366,0.005613,-0.004469,-0.022703,0.029549,0.010668,0.008824,-0.001067,-0.017144
2025-08-31,0.085489,0.085346,0.092586,0.088495,0.095721,0.088509,0.050113,0.081779,0.080105,0.103466,...,0.032724,0.012364,0.019681,0.055048,0.067542,0.00781,0.008127,0.026767,0.050457,0.086999


### 1. Summary Statistics.

For each portfolio:

- Use the risk-free rate column in the factors tab to convert total returns to excess returns.
- Calculate the (annualized) univariate statistics: mean, volatility, and Sharpe ratio.
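
The annualization can be written out explicitly. As a sketch, assuming monthly data (12 periods per year) and writing $\hat{\mu}^i$ and $\hat{\sigma}^i$ for the sample mean and standard deviation of portfolio $i$'s monthly excess return (symbols introduced here for illustration only):

$$
\hat{\mu}^i_{\text{ann}} = 12\,\hat{\mu}^i, \qquad
\hat{\sigma}^i_{\text{ann}} = \sqrt{12}\,\hat{\sigma}^i, \qquad
\mathrm{SR}^i_{\text{ann}} = \sqrt{12}\,\frac{\hat{\mu}^i}{\hat{\sigma}^i}
$$

The square-root scaling of volatility assumes that returns are uncorrelated across months.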

In [None]:
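# Define a function to compute annualized mean, volatility, and Sharpe ratio
def summary_stats(data, portfolio = None, portfolio_name = 'Portfolio', annualize = 12):
    # Note: portfolio and portfolio_name are accepted but not used below

    # Per-period mean and standard deviation of each column
    output = data.agg(['mean','std'])

    # Per-period Sharpe ratio; the inputs are assumed to already be excess returns
    output.loc['sharpe'] = output.loc['mean'] / output.loc['std']

    # Annualize: the mean scales with the number of periods,
    # volatility and Sharpe ratio with its square root
    output.loc['mean'] *= annualize
    output.loc['std'] *= np.sqrt(annualize)
    output.loc['sharpe'] *= np.sqrt(annualize)

    return output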

In [19]:
# Calculate summary statistics for each portfolio
stats_port_totret = summary_stats(port_totret)
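# 5% VaR: empirical 5th percentile of monthly excess returns (not annualized)
stats_port_totret.loc['VaR'] = port_totret.quantile(0.05)
# Note: the percent format is applied to every column, including the Sharpe ratio
stats_port_totret.transpose().style.format("{:.2%}")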

Portfolio,mean,std,sharpe,VaR
SMALL LoBM,1.17%,27.17%,4.31%,-12.49%
ME1 BM2,8.84%,23.54%,37.56%,-9.49%
ME1 BM3,9.02%,20.08%,44.93%,-8.48%
ME1 BM4,11.25%,19.40%,58.00%,-7.76%
SMALL HiBM,12.73%,20.84%,61.10%,-8.82%
ME2 BM1,6.09%,24.47%,24.90%,-10.32%
ME2 BM2,9.84%,20.54%,47.90%,-8.34%
ME2 BM3,10.52%,18.64%,56.40%,-8.03%
ME2 BM4,10.81%,18.19%,59.42%,-7.53%
ME2 BM5,11.32%,21.37%,52.98%,-9.33%


### 2. CAPM

The Capital Asset Pricing Model (CAPM) asserts that the expected excess return of an asset (or portfolio) is entirely a function of its beta to the equity market index (SPY, or in this case, MKT).

Specifically, it asserts that, for any excess return, $\tilde{r}^i$, its mean is proportional to the mean excess return of the market, $\tilde{r}^{\mathrm{mkt}}$, where the proportionality constant is the regression beta of $\tilde{r}^i$ on $\tilde{r}^{\mathrm{mkt}}$.

$$
\mathbb{E}\left[\tilde{r}_t^i\right]=\beta^{i, \mathrm{mkt}} \mathbb{E}\left[\tilde{r}_t^{\mathrm{mkt}}\right]
$$


Let's examine whether that seems plausible.

For each of the $n=25$ test portfolios, run the CAPM time-series regression:

$$
\tilde{r}_t^i=\alpha^i+\beta^{i, \mathrm{mkt}} \tilde{r}_t^{\mathrm{mkt}}+\epsilon_t^i
$$


So you are running 25 separate regressions, each using the $T$-sized sample of time-series data.
- Report the betas and alphas for each test asset.
- Report the mean absolute error of the CAPM: $\mathrm{MAE}=\frac{1}{n} \sum_{i=1}^n\left|\alpha^i\right|$

If the CAPM were true, what would we expect of the MAE?
- Report the estimated $\beta^{i, \mathrm{mkt}}$, Treynor Ratio, $\alpha^i$, and Information Ratio for each of the $n$ regressions.
- If the CAPM model were true, what would be true of the Treynor Ratios, alphas, and Information Ratios?
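
For reference, the two ratios requested above can be written as follows. This is a sketch of the conventions assumed in this notebook, with $\sigma\left(\epsilon^i\right)$ denoting the standard deviation of the time-series regression residual (notation introduced here, not part of the original problem statement):

$$
\text{Treynor Ratio}^i=\frac{\mathbb{E}\left[\tilde{r}_t^i\right]}{\beta^{i, \mathrm{mkt}}}, \qquad
\text{Information Ratio}^i=\frac{\alpha^i}{\sigma\left(\epsilon^i\right)}
$$

In practice the expectation is replaced by the sample mean, and both ratios are reported on an annualized basis.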

In [25]:
import statsmodels.api as sm

# Define a function to compute CAPM regression statistics
def capm_stats(assets, factors, annualize=12, name='asset', treynor=False, mkt_name='Mkt-RF'):
    # Ensure assets is a DataFrame
    if isinstance(assets, pd.Series):
        assets = assets.to_frame(name=name)
    
    X = sm.add_constant(factors)
    
    # Initialize storage DataFrames
    model_output = pd.DataFrame()
    stats_output = pd.DataFrame()
    
    # Loop through each asset column
    for col in assets.columns:
        y = assets[col]
        fit = sm.OLS(y, X).fit()
        
        # Store regression coefficients
        model_output[col] = fit.params
        
        # Compute annualized metrics
        mean_ret = y.mean() * annualize
        alpha_ann = fit.params['const'] * annualize
        resid_std = fit.resid.std() * np.sqrt(annualize)
        info_ratio = alpha_ann / resid_std
        
        # Save to stats_output
        metrics = {'Alpha': alpha_ann, 'Info. Ratio': info_ratio}
        
        if treynor:
            beta_mkt = fit.params[mkt_name]
            treynor_ratio = mean_ret / beta_mkt
            metrics['Treynor Ratio'] = treynor_ratio
    
        stats_output[col] = pd.Series(metrics)
    
    return model_output, stats_output

In [26]:
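# Note: this assignment rebinds capm_stats from the function to the results DataFrame;
# re-run the definition cell before calling the function again.
capm_stats = pd.concat(capm_stats(port_totret, factors['Mkt-RF'].loc['1981':], treynor=True)).T
print(f"MAE: {capm_stats['const'].abs().mean():.4f}")  # based on monthly alphas
print(f"MAE (Annualized): {capm_stats['Alpha'].abs().mean():.4f}")
capm_stats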

MAE: 0.0017
MAE (Annualized): 0.0207


Portfolio,const,Mkt-RF,Alpha,Info. Ratio,Treynor Ratio
SMALL LoBM,-0.008645,1.358486,-0.10374,-0.604694,0.008616
ME1 BM2,-0.000886,1.165759,-0.010629,-0.07051,0.075863
ME1 BM3,8.7e-05,1.049509,0.001039,0.008831,0.08597
ME1 BM4,0.002457,0.977337,0.029481,0.243461,0.115145
SMALL HiBM,0.003571,0.993918,0.042851,0.305768,0.128093
ME2 BM1,-0.004371,1.334065,-0.052447,-0.401824,0.045667
ME2 BM2,0.000131,1.138954,0.001576,0.01505,0.086364
ME2 BM3,0.001429,1.035676,0.017142,0.181248,0.101532
ME2 BM4,0.002091,0.976453,0.025089,0.24925,0.110675
ME2 BM5,0.001566,1.110819,0.018794,0.148788,0.1019


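If the CAPM held perfectly, every alpha would be zero, so every information ratio and the MAE would also be zero. Each portfolio's mean excess return would then equal its beta times the market premium, so the Treynor ratios would be identical across assets and equal to the mean market excess return. The estimates above do not match this: the annualized MAE is 2.07%, and the Treynor ratios among the portfolios shown range from 0.009 to 0.128.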
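
These implications can be checked on synthetic data. The sketch below does not use the notebook's data: the market returns, betas, and noise are hypothetical, and the noise is projected to be orthogonal to the regressors so that the CAPM holds exactly in sample. Under those assumptions, the alphas, information ratios, and MAE come out at zero (up to floating-point error), and every Treynor ratio equals the annualized market premium.

```python
import numpy as np

rng = np.random.default_rng(0)
T, annualize = 240, 12

# Hypothetical monthly market excess returns and betas (illustrative values only)
mkt = rng.normal(0.006, 0.045, T)
betas = np.array([0.8, 1.0, 1.3])

# Regressor matrix [1, mkt]. Projecting the noise off both columns makes it
# mean-zero and uncorrelated with the market in sample, forcing alpha to zero.
X = np.column_stack([np.ones(T), mkt])
noise = rng.normal(0.0, 0.02, (T, betas.size))
noise -= X @ np.linalg.lstsq(X, noise, rcond=None)[0]

# Asset excess returns that satisfy the CAPM exactly in sample
assets = mkt[:, None] * betas + noise

# Time-series regressions for all assets at once: row 0 = alphas, row 1 = betas
coef = np.linalg.lstsq(X, assets, rcond=None)[0]
alpha_ann = coef[0] * annualize
beta_hat = coef[1]

resid = assets - X @ coef
info_ratio = alpha_ann / (resid.std(axis=0, ddof=1) * np.sqrt(annualize))
treynor = assets.mean(axis=0) * annualize / beta_hat
mae = np.abs(alpha_ann).mean()
market_premium = mkt.mean() * annualize

print("annualized alphas:", alpha_ann)
print("information ratios:", info_ratio)
print("MAE:", mae)
print("Treynor ratios:", treynor, "market premium:", market_premium)
```

With real data the residuals are not constructed this way, so nonzero alphas reflect both sampling error and any genuine failure of the model.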

### 3. Cross-sectional Estimation

Let's test the CAPM directly. We already have what we need:
- The dependent variable, $y$: mean excess returns from each of the $n=25$ portfolios.
- The regressor, $x$: the market beta from each of the $n=25$ time-series regressions.

Then we can estimate the following equation:

$$
\underbrace{\mathbb{E}\left[\tilde{r}^{i}\right]}_{n\times 1\text{ data}} 
= \textcolor{ForestGreen}{\underbrace{\eta}_{\text{regression intercept}}} 
+ \underbrace{{\beta}^{i,\text{mkt}}}_{n\times 1\text{ data}}~
\textcolor{ForestGreen}{\underbrace{\lambda_{\text{mkt}}}_{\text{regression estimate}}}
+ \textcolor{ForestGreen}{\underbrace{\upsilon}_{n\times 1\text{ residuals}}}
$$

Note that
- we use sample means as estimates of $\mathbb{E}\left[\tilde{r}^i\right]$.
- this is a weird regression! The regressors are the betas from the time-series regressions we already ran!
- this is a single regression, where we are combining evidence across all $n=25$ series. Thus, it is a cross-sectional regression!
- the notation emphasizes that the intercept $\eta$ is different from the time-series $\alpha$ and that the regression coefficient $\lambda_{\text{mkt}}$ is different from the time-series betas.

Report
- the R-squared of this regression.
- the intercept, $\eta$.
- the regression coefficient, $\lambda_{\text{mkt}}$.

What would these three statistics be if the CAPM were completely accurate?

In [7]:
# Mean excess returns for each portfolio (monthly, not annualized)
port_mean_xret = port_totret.mean()

# Market betas from the CAPM time-series regressions
port_mkt_betas = capm_stats['Mkt-RF']

# Cross-sectional regression
y = port_mean_xret
x = sm.add_constant(port_mkt_betas)
model_cross_sectional = sm.OLS(y,x).fit()

print(model_cross_sectional.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.313
Model:                            OLS   Adj. R-squared:                  0.283
Method:                 Least Squares   F-statistic:                     10.49
Date:                Sun, 26 Oct 2025   Prob (F-statistic):            0.00363
Time:                        00:57:45   Log-Likelihood:                 125.95
No. Observations:                  25   AIC:                            -247.9
Df Residuals:                      23   BIC:                            -245.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0172      0.003      5.891      0.0

If the CAPM were completely accurate, three things would follow. First, the intercept $\eta$ would be zero: a portfolio with no market exposure would earn no mean excess return. Second, the coefficient $\lambda_{\text{mkt}}$ would equal the market risk premium, that is, the mean excess return of the market. Third, the $R^2$ would be 1, because mean excess returns (the dependent variable) would be fully explained by the betas estimated in the time-series regressions. The estimates above fall short on the first and third counts: the $R^2$ is 0.313, and the intercept is 0.0172 with a t-statistic of 5.891.
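
The benchmark case can be illustrated with a small synthetic example. In the sketch below, the betas and the market premium `lam` are hypothetical values chosen for illustration, and mean excess returns are constructed to equal beta times the premium exactly. Under that assumption, the cross-sectional regression recovers a zero intercept, a slope equal to the premium, and an $R^2$ of 1.

```python
import numpy as np

# Hypothetical time-series betas and monthly market premium (illustrative only)
betas = np.array([0.7, 0.9, 1.0, 1.2, 1.4])
lam = 0.007

# Mean excess returns implied by the CAPM: proportional to beta, no intercept
mean_xret = lam * betas

# Cross-sectional OLS of mean excess returns on a constant and the betas
X = np.column_stack([np.ones_like(betas), betas])
(eta_hat, lam_hat), *_ = np.linalg.lstsq(X, mean_xret, rcond=None)

# R-squared computed from residual and total sums of squares
fitted = X @ np.array([eta_hat, lam_hat])
ss_res = np.sum((mean_xret - fitted) ** 2)
ss_tot = np.sum((mean_xret - mean_xret.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("intercept:", eta_hat)
print("slope:", lam_hat)
print("R-squared:", r_squared)
```

Any dispersion in mean returns that is unrelated to beta would show up as a nonzero intercept, a lower $R^2$, or both.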


### 4. Conclusion

Broadly speaking, do these results support DFA's belief that size and value portfolios contain premia unrelated to the market premium?

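Yes, broadly speaking. The CAPM leaves sizable alphas that vary systematically with value: among the smallest-size portfolios, the annualized alpha rises from -10.37% for SMALL LoBM to 4.29% for SMALL HiBM, and market beta explains only about 31% of the cross-sectional variation in mean excess returns. This suggests that the CAPM omits some priced risk factors, with size and value as two plausible candidates.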