# Smart Beta and Factor Investing

This assignment refers to the HBS case: **Smart Beta Exchange-Traded-Funds and Factor Investing**.

* The case is a good introduction to important pricing factors.
* It also gives a useful introduction and context to ETFs, passive vs active investing, and so-called “smart beta” funds.

# 1. READING

1. Describe how each of the factors (other than MKT) is measured (see the footnote below). That is, each factor is a portfolio of stocks: which stocks are included in the factor portfolio?

1. Is the factor portfolio...
    * long-only
    * long-short
    * value-weighted
    * equally-weighted

1. What steps are taken in the factor construction to try to reduce the correlation between the factors?
5. What is the point of figures 1-6?
6. How is a “smart beta” ETF different from a traditional ETF?
7. Is it possible for all investors to have exposure to the “value” factor?
8. How does factor investing differ from traditional diversification?

#### Footnote:

If you need more info on how these factor portfolios are created, see Ken French’s website and the following details:

https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/f-f_5_factors_2x3.html

https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_mom_factor.html

***

# 2. The Factors

### Data
Use the data found in `data/factor_pricing_data.xlsx`. (The code below loads the monthly file, `../data/factor_pricing_data_monthly.xlsx`.)

Factors: Monthly excess return data for the overall equity market, $\tilde{r}^{\text{MKT}}$.

* The column header for the market factor is `MKT` rather than `MKT-RF`, but it is indeed already in excess return form.

* The sheet also contains data on five additional factors.

* All factor data is already provided as excess returns.

In [1]:
import polars as pl
import pandas as pd
import numpy as np

# sheet names: 
# 'factors (excess returns)'
# 'portfolios (excess returns)'
# 'risk-free rate'

# factors
# Name
# MKT	Market
# SMB	Size
# HML	Value
# RMW	Profitability
# CMA	Investment
# UMD	Momentum
# RF	Risk-free rate

rets = pl.read_excel(
    '../data/factor_pricing_data_monthly.xlsx',
    sheet_name='factors (excess returns)',
)
rets_pd = rets.to_pandas()

rets.head()

Date,MKT,SMB,HML,RMW,CMA,UMD
date,f64,f64,f64,f64,f64,f64
1980-01-31,0.055,0.0188,0.0185,-0.0184,0.0189,0.0745
1980-02-29,-0.0123,-0.0162,0.0059,-0.0095,0.0292,0.0789
1980-03-31,-0.1289,-0.0697,-0.0096,0.0182,-0.0105,-0.0958
1980-04-30,0.0396,0.0105,0.0103,-0.0218,0.0034,-0.0048
1980-05-31,0.0526,0.02,0.0038,0.0043,-0.0063,-0.0118


### 1. 
Analyze the factors, similar to how you analyzed the three Fama-French factors in `Homework 4`.

You now have three additional factors, so let’s compare their univariate statistics.

* mean
* volatility    
* Sharpe

In [2]:
import datetime

def performance_metrics(returns, annualization=1, start_date=None, end_date=None):
    # Annualized mean, volatility, and Sharpe ratio for each numeric (factor) column; "Date" is skipped.
    # If start_date or end_date is given (ISO string or date object), filter the sample first.
    filtered_returns = returns

    if start_date is not None:
        if isinstance(start_date, str):
            start_date_dt = datetime.date.fromisoformat(start_date)
        else:
            start_date_dt = start_date
        filtered_returns = filtered_returns.filter(pl.col("Date") >= pl.lit(start_date_dt))

    if end_date is not None:
        if isinstance(end_date, str):
            end_date_dt = datetime.date.fromisoformat(end_date)
        else:
            end_date_dt = end_date
        filtered_returns = filtered_returns.filter(pl.col("Date") <= pl.lit(end_date_dt))

    factor_cols = [col for col in filtered_returns.columns if filtered_returns.schema[col] in [pl.Float64, pl.Float32, pl.Int64, pl.Int32]]
    means = filtered_returns.select([pl.col(factor_cols).mean() * annualization]).to_dict(as_series=False)
    vols = filtered_returns.select([pl.col(factor_cols).std() * np.sqrt(annualization)]).to_dict(as_series=False)
    sharpes = filtered_returns.select([
        (pl.col(col).mean() / pl.col(col).std() * np.sqrt(annualization)).alias(col) for col in factor_cols
    ]).to_dict(as_series=False)

    metrics = pl.DataFrame({
        "Factor": factor_cols,
        "Mean": [means[col][0] for col in factor_cols],
        "Vol": [vols[col][0] for col in factor_cols],
        "Sharpe": [sharpes[col][0] for col in factor_cols],
    })
    return metrics

metrics = performance_metrics(rets, annualization=12, start_date="2015-01-01")
metrics.head(10)

Factor,Mean,Vol,Sharpe
str,f64,f64,f64
"""MKT""",0.117872,0.157356,0.749078
"""SMB""",-0.023775,0.103166,-0.230455
"""HML""",-0.016303,0.129885,-0.12552
"""RMW""",0.0400125,0.072632,0.550896
"""CMA""",-0.009141,0.082072,-0.111373
"""UMD""",0.020119,0.137387,0.146438


### 2. 

Based on the factor statistics above, answer the following.
* Does each factor have a positive risk premium (positive expected excess return)? 
* How have the factors performed since the time of the case (2015-present)?

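For the 2015-present subsample, three factors have negative premia: size (SMB), value (HML), and investment (CMA).

Profitability (RMW) and momentum (UMD) have fared better, especially the former, with a Sharpe ratio of 0.55 against 0.15 for momentum. The market factor still has the highest Sharpe ratio, at 0.75.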
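The first question asks about risk premia in general, while the table above covers only 2015-present. A minimal sketch of a side-by-side comparison follows, assuming a pandas DataFrame of monthly excess returns with a datetime index. The `summary_stats` helper and the synthetic `demo` data are hypothetical stand-ins for illustration, not the assignment data or its results.

```python
import numpy as np
import pandas as pd

def summary_stats(returns: pd.DataFrame, periods: int = 12) -> pd.DataFrame:
    """Annualized mean, volatility, and Sharpe ratio for each column of excess returns."""
    mean = returns.mean() * periods
    vol = returns.std() * np.sqrt(periods)
    return pd.DataFrame({"Mean": mean, "Vol": vol, "Sharpe": mean / vol})

# Synthetic monthly excess returns (assumption: illustration only, NOT the assignment data).
rng = np.random.default_rng(0)
dates = pd.date_range("1980-01-01", periods=540, freq="MS")
demo = pd.DataFrame(
    rng.normal(0.005, 0.04, size=(540, 3)),
    index=dates,
    columns=["MKT", "SMB", "HML"],
)

full = summary_stats(demo)                        # whole sample
recent = summary_stats(demo.loc["2015-01-01":])   # subsample since the case
comparison = full.join(recent, lsuffix=" (full)", rsuffix=" (2015+)")
print(comparison.round(3))
```

With the real data, `demo` would be replaced by `rets_pd.set_index("Date")`, assuming the `Date` column converts to a pandas datetime index.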

### 3. 

Report the correlation matrix across the six factors.
* Does the construction method succeed in keeping correlations small?
* Fama and French say that HML is somewhat redundant in their 5-factor model. Does this seem to be the case?

In [3]:
# get the correlation matrix
corr_matrix = rets[["MKT", "SMB", "HML", "RMW", "CMA", "UMD"]].corr()
corr_matrix

MKT,SMB,HML,RMW,CMA,UMD
f64,f64,f64,f64,f64,f64
1.0,0.226997,-0.207918,-0.250639,-0.346542,-0.179352
0.226997,1.0,-0.021819,-0.411946,-0.051099,-0.06094
-0.207918,-0.021819,1.0,0.219401,0.676727,-0.215523
-0.250639,-0.411946,0.219401,1.0,0.138566,0.076694
-0.346542,-0.051099,0.676727,0.138566,1.0,9.4e-05
-0.179352,-0.06094,-0.215523,0.076694,9.4e-05,1.0


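Correlations are generally small in magnitude. The largest is between HML and CMA (0.68), followed by SMB and RMW (-0.41) and MKT and CMA (-0.35).

On this evidence HML does not look fully redundant: its highest correlation is with CMA, the investment factor, and even that is only 0.68. Pairwise correlations cannot settle the question, though, because redundancy in the Fama-French sense means that HML is spanned by the other factors jointly.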
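One common way to check redundancy more directly is a spanning regression: regress HML on the other factors and look at the intercept. An intercept near zero together with a high r-squared suggests the other factors already capture HML's premium. The sketch below is illustrative only. The `spanning_regression` helper is hypothetical, it uses plain least squares from NumPy, and the synthetic data is built so that HML is spanned by construction.

```python
import numpy as np
import pandas as pd

def spanning_regression(factors: pd.DataFrame, target: str, periods: int = 12) -> dict:
    """Regress one factor on the remaining ones (with intercept) via least squares."""
    y = factors[target].to_numpy()
    others = factors.drop(columns=target)
    X = np.column_stack([np.ones(len(others)), others.to_numpy()])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return {
        "alpha_annual": coef[0] * periods,            # annualized intercept
        "betas": dict(zip(others.columns, coef[1:])),
        "r_squared": 1.0 - resid.var() / y.var(),
    }

# Synthetic data (assumption: NOT the assignment data). HML is a combination of
# CMA and MKT plus noise, with a true intercept of zero.
rng = np.random.default_rng(1)
n_obs = 600
mkt = rng.normal(0.006, 0.045, n_obs)
cma = rng.normal(0.003, 0.020, n_obs)
hml = 0.9 * cma - 0.05 * mkt + rng.normal(0.0, 0.015, n_obs)
demo = pd.DataFrame({"MKT": mkt, "HML": hml, "CMA": cma})

out = spanning_regression(demo, "HML")
print(out)
```

On the real data the call would be `spanning_regression(rets_pd[["MKT", "SMB", "HML", "RMW", "CMA"]], "HML")`. The `statsmodels` OLS used later in this notebook gives the same point estimates and also reports a t-statistic for the intercept.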

### 4. 

Report the tangency weights for a portfolio of these 6 factors.
* Which factors seem most important? And least?
* Are the factors with low mean returns still useful?
* Re-do the tangency portfolio, but this time only include MKT, SMB, HML, and UMD. Which factors get high/low tangency weights now?

What do you conclude about the importance or unimportance of these styles?

In [4]:
def tangency_weights_from_df(returns: pd.DataFrame, periods: int = 12, rf: float = 0.0) -> np.ndarray:
    mu = returns.mean() * periods
    cov_matrix = returns.cov() * periods
    inv_cov_matrix = np.linalg.inv(cov_matrix)
    ones = np.ones(len(mu))

    # z is proportional to the tangency weights; it is not rescaled to sum to 1.
    z = inv_cov_matrix @ (mu - rf*ones)
    return z

weights = tangency_weights_from_df(rets_pd[["MKT", "SMB", "HML", "RMW", "CMA", "UMD"]])
weights

array([ 6.55018707,  2.00262168, -0.63544901,  9.04204062,  9.62925191,
        3.36880503])

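By magnitude, the most important factors are investment (CMA) and profitability (RMW), while value (HML) has the smallest weight. The market factor also gets a large weight. The weights are reported unscaled (they do not sum to 1), so only their relative sizes matter. The small HML weight is not surprising given its 0.68 correlation with CMA in the matrix above.

Investment had a negative mean return in the 2015-present subsample, yet it receives the largest weight here (these weights use the full sample). A factor with a low mean return can still be quite useful, for example through its low or negative correlation with the other factors.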
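The vector printed above is $\Sigma^{-1}\mu$, which is proportional to the tangency weights. Dividing by its sum gives weights that add up to 1. The sketch below applies that rescaling to the vector reported above. The helper names `normalize_weights` and `tangency_weights_normalized` are hypothetical, and the rescaling is only meaningful when the sum is not close to zero.

```python
import numpy as np
import pandas as pd

def normalize_weights(z, names) -> pd.Series:
    """Rescale an unnormalized weight vector so that it sums to 1."""
    z = np.asarray(z, dtype=float)
    total = z.sum()
    if np.isclose(total, 0.0):
        raise ValueError("Weights sum to ~0; normalization is not meaningful.")
    return pd.Series(z / total, index=list(names), name="weight")

def tangency_weights_normalized(excess_returns: pd.DataFrame) -> pd.Series:
    """Tangency weights from excess returns: Sigma^{-1} mu, rescaled to sum to 1.

    The annualization factor cancels in the rescaling, so monthly inputs are fine.
    """
    mu = excess_returns.mean().to_numpy()
    sigma = excess_returns.cov().to_numpy()
    z = np.linalg.solve(sigma, mu)  # avoids forming the explicit inverse
    return normalize_weights(z, excess_returns.columns)

# The unnormalized vector reported above for the six factors.
reported = [6.55018707, 2.00262168, -0.63544901, 9.04204062, 9.62925191, 3.36880503]
w = normalize_weights(reported, ["MKT", "SMB", "HML", "RMW", "CMA", "UMD"])
print(w.round(4))
```

After rescaling, CMA and RMW each take roughly 30% of the portfolio and the market a bit over 20%, which makes the relative importance easier to read than the raw vector.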

In [5]:
w_new = tangency_weights_from_df(rets_pd[["MKT", "SMB", "HML", "UMD"]])
w_new

array([ 5.16973145, -0.70297433,  5.01604229,  4.24772173])

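If we restrict the portfolio to MKT, SMB, HML, and UMD, HML now gets a weight similar to the market's (5.02 vs. 5.17), and SMB has the smallest absolute weight (-0.70). So the importance of a style depends on which other factors are in the set: HML matters little alongside CMA and RMW, but a lot without them.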

***

# 3. Testing Modern LPMs

Consider the following factor models:
* CAPM: MKT
* Fama-French 3F: MKT, SMB, HML
* Fama-French 5F: MKT, SMB, HML, RMW, CMA
* AQR: MKT, HML, RMW, UMD

Our labeling of the last model as the **AQR** is just for concreteness. The firm is well-known for these factors and an unused case study discusses that further.

For instance, the AQR model is...

$$
\mathbb{E}[\tilde{r}^i] 
= \beta^{i,\mathrm{MKT}} \, \mathbb{E}[\tilde{f}^{\mathrm{MKT}}] 
+ \beta^{i,\mathrm{HML}} \, \mathbb{E}[\tilde{f}^{\mathrm{HML}}] 
+ \beta^{i,\mathrm{RMW}} \, \mathbb{E}[\tilde{f}^{\mathrm{RMW}}] 
+ \beta^{i,\mathrm{UMD}} \, \mathbb{E}[\tilde{f}^{\mathrm{UMD}}]
$$

We will test these models with the time-series regressions. Namely, for each asset i, estimate the following regression to test the AQR model:

$$
\tilde{r}^i_t 
= \alpha^i 
+ \beta^{i,\mathrm{MKT}} \tilde{f}^{\mathrm{MKT}}_t 
+ \beta^{i,\mathrm{HML}} \tilde{f}^{\mathrm{HML}}_t 
+ \beta^{i,\mathrm{RMW}} \tilde{f}^{\mathrm{RMW}}_t 
+ \beta^{i,\mathrm{UMD}} \tilde{f}^{\mathrm{UMD}}_t 
+ \varepsilon_t
$$

### Data

* Monthly excess return data on `n=49` equity portfolios sorted by their industry. Denote these as $\tilde{r}^i$, for $i = 1, \ldots, n$.

* You do NOT need the risk-free rate data. It is provided only for completeness. The other two tabs are already in terms of excess returns.

### 1. 

Test the AQR 4-Factor Model using the time-series test. (We are not doing the cross-sectional regression tests.)

For each regression, report the estimated α and r-squared.


In [6]:
portfolios = pd.read_excel(
    '../data/factor_pricing_data_monthly.xlsx',
    sheet_name='portfolios (excess returns)',
    parse_dates=True,
)
portfolios.head()

Unnamed: 0,Date,Agric,Food,Soda,Beer,Smoke,Toys,Fun,Books,Hshld,...,Boxes,Trans,Whlsl,Rtail,Meals,Banks,Insur,RlEst,Fin,Other
0,1980-01-31,-0.0073,0.0285,0.0084,0.1009,-0.0143,0.0995,0.0348,0.0323,0.0048,...,0.0158,0.0851,0.0466,-0.0125,0.043,-0.0284,0.0254,0.077,0.0306,0.0666
1,1980-02-29,0.0125,-0.0609,-0.0967,-0.0323,-0.0575,-0.0316,-0.0492,-0.0803,-0.0556,...,-0.0083,-0.0543,-0.0345,-0.0641,-0.0653,-0.0824,-0.096,-0.0352,-0.0283,-0.0273
2,1980-03-31,-0.222,-0.1119,-0.0158,-0.1535,-0.0188,-0.1272,-0.0827,-0.1238,-0.0567,...,-0.0819,-0.1512,-0.1602,-0.0905,-0.145,-0.0559,-0.0877,-0.2449,-0.1261,-0.1737
3,1980-04-30,0.0449,0.0767,0.0232,0.0289,0.083,-0.0529,0.0785,0.0154,0.0305,...,0.0422,-0.0102,0.0268,0.0355,0.0539,0.0736,0.0528,0.0964,0.0458,0.0784
4,1980-05-31,0.0635,0.0797,0.0458,0.0866,0.0822,0.051,0.0325,0.0888,0.056,...,0.0564,0.1065,0.1142,0.0877,0.1104,0.057,0.056,0.0889,0.0846,0.0663


In [7]:
# Test the AQR 4-Factor Model using the time-series test
import statsmodels.api as sm

alphas = {}
rsquareds = {}

# Remove non-portfolio columns (like 'Date') if present
portfolio_columns = [col for col in portfolios.columns if col not in ['Date', 'date', 'DATE']]

# Align factors and portfolios on the date column
# Make sure both DataFrames use the same date as index or merge on 'Date'
# Assume 'rets_pd' has the same time frame/order as 'portfolios'
for col in portfolio_columns:
    y = portfolios[col]
    if 'Date' in portfolios.columns:
        # If 'Date' is a column, align on 'Date'
        merged = pd.merge(
            portfolios[['Date', col]],
            rets_pd[['Date', 'MKT', 'HML', 'RMW', 'UMD']],
            on='Date',
            how='inner'
        )
        y_aligned = merged[col]
        X_aligned = merged[['MKT', 'HML', 'RMW', 'UMD']]
    else:
        # Assume already aligned on index
        y_aligned = y
        X_aligned = rets_pd[['MKT', 'HML', 'RMW', 'UMD']]
        if len(y_aligned) != len(X_aligned):
            minlen = min(len(y_aligned), len(X_aligned))
            y_aligned = y_aligned.iloc[:minlen]
            X_aligned = X_aligned.iloc[:minlen]

    X_aligned = sm.add_constant(X_aligned)
    model = sm.OLS(y_aligned, X_aligned)
    results = model.fit()
    alphas[col] = results.params['const']
    rsquareds[col] = results.rsquared

# Print the results
for col in portfolio_columns[:5]:
    print(f"Portfolio {col} results:")
    print(f"Alpha: {alphas[col]:.6f}, R-squared: {rsquareds[col]:.4f}")

Portfolio Agric results:
Alpha: 0.000971, R-squared: 0.3421
Portfolio Food  results:
Alpha: 0.000125, R-squared: 0.4551
Portfolio Soda  results:
Alpha: 0.001282, R-squared: 0.3025
Portfolio Beer  results:
Alpha: 0.000821, R-squared: 0.4148
Portfolio Smoke results:
Alpha: 0.003426, R-squared: 0.2654


### 2. 

Calculate the mean-absolute-error of the estimated alphas.

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^n|\tilde{\alpha}^i|$$

* If the pricing model worked, should these alpha estimates be large or small? Why?

* Based on your MAE stat, does this seem to support the pricing model or not?

In [8]:
mae = sum(abs(alpha) for alpha in alphas.values()) / len(alphas)
print(f"MAE: {mae:.6f}")

MAE: 0.002051


If the model worked, the alphas should be small: almost all of the mean excess returns should be explained by the factors in the model.

The MAE does not seem to support the pricing model. The mean absolute alpha comes in at about 20 bps per month (roughly 2.5% annualized), which is a considerable portion of absolute mean returns across the portfolios.

### 3. 

Test the CAPM, FF 3-Factor Model and the FF 5-Factor Model.
   * Report the MAE statistic for each of these models and compare it with the AQR Model MAE.
   * Which model fits best?

In [9]:
# Remove non-portfolio columns (like 'Date') if present
portfolio_columns = [col for col in portfolios.columns if col not in ['Date', 'date', 'DATE']]
model_factors = {
    'CAPM': ['MKT'],
    'FF3': ['MKT', 'SMB', 'HML'],
    'FF5': ['MKT', 'SMB', 'HML', 'RMW', 'CMA']
}
model_alphas = {
    'CAPM': {},
    'FF3': {},
    'FF5': {}
}
# Align factors and portfolios on the date column
# Make sure both DataFrames use the same date as index or merge on 'Date'
# Assume 'rets_pd' has the same time frame/order as 'portfolios'
for model_name in model_factors.keys():
    for col in portfolio_columns:
        y = portfolios[col]
        merged = pd.merge(
            portfolios[['Date', col]],
            rets_pd[['Date', *model_factors[model_name]]],
            on='Date',
            how='inner'
        )
        y_aligned = merged[col]
        X_aligned = merged[model_factors[model_name]]

        X_aligned = sm.add_constant(X_aligned)
        model = sm.OLS(y_aligned, X_aligned)
        results = model.fit()
        model_alphas[model_name][col] = results.params['const']

In [10]:
mae = sum(abs(alpha) for alpha in model_alphas['FF5'].values()) / len(model_alphas['FF5'])
print(f"FF5 MAE: {mae:.6f}")
mae = sum(abs(alpha) for alpha in model_alphas['FF3'].values()) / len(model_alphas['FF3'])
print(f"FF3 MAE: {mae:.6f}")
mae = sum(abs(alpha) for alpha in model_alphas['CAPM'].values()) / len(model_alphas['CAPM'])
print(f"CAPM MAE: {mae:.6f}")

FF5 MAE: 0.002614
FF3 MAE: 0.002030
CAPM MAE: 0.001748


CAPM has the lowest MAE of all the models (0.001748). The Fama-French 3-factor model (0.002030) is very slightly better than AQR (0.002051), but the gap is negligible. The Fama-French 5-factor model has the highest MAE (0.002614).
An ode to simplicity.

### 4. 

Does any particular factor seem especially important or unimportant for pricing? Do you think Fama and French should use the Momentum Factor?

Size doesn't seem that important. The Fama-French 3-factor model includes it and beats AQR, which omits it, only marginally (0.002030 vs. 0.002051), while the 5-factor model includes it and does worse. Its tangency weights were low, and its mean return was negative in the 2015-present subsample.

As for momentum, it seems a better factor than size. It has a higher tangency weight in a portfolio of the factors, a positive mean return over the 2015-present subsample, and AQR, which includes it, outperformed the 5-factor model. Adding it as an extra factor does not seem worth the cost in parsimony, but if we are replacing factors, there is a better argument for it.

### 5. 

This does not matter for pricing, but report the average (across $n$ estimations) of the time-series regression r-squared statistics.
   * Do this for each of the four models you tested.
   * Do these models lead to high time-series r-squared stats? That is, would these factors be good in a Linear Factor Decomposition of the assets?
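A minimal sketch of how the average r-squared could be computed for each model follows. It uses NumPy least squares on synthetic data, so the printed numbers say nothing about the actual portfolios. The helper `ts_regression_stats` and the `demo_*` objects are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def ts_regression_stats(assets: pd.DataFrame, factors: pd.DataFrame) -> pd.DataFrame:
    """Time-series OLS of every asset on the factors (with intercept).

    Returns one row per asset with the estimated alpha and r-squared.
    """
    X = np.column_stack([np.ones(len(factors)), factors.to_numpy()])
    Y = assets.to_numpy()
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape: (k + 1, n_assets)
    resid = Y - X @ coef
    r2 = 1.0 - resid.var(axis=0) / Y.var(axis=0)
    return pd.DataFrame({"alpha": coef[0], "r_squared": r2}, index=assets.columns)

models = {
    "CAPM": ["MKT"],
    "FF3": ["MKT", "SMB", "HML"],
    "FF5": ["MKT", "SMB", "HML", "RMW", "CMA"],
    "AQR": ["MKT", "HML", "RMW", "UMD"],
}

# Synthetic factors and assets (assumption: illustration only, NOT the assignment data).
rng = np.random.default_rng(2)
n_obs, n_assets = 480, 10
names = ["MKT", "SMB", "HML", "RMW", "CMA", "UMD"]
demo_factors = pd.DataFrame(rng.normal(0.004, 0.03, (n_obs, 6)), columns=names)
true_betas = rng.normal(0.5, 0.3, (6, n_assets))
demo_assets = pd.DataFrame(
    demo_factors.to_numpy() @ true_betas + rng.normal(0.0, 0.03, (n_obs, n_assets)),
    columns=[f"P{i}" for i in range(n_assets)],
)

avg_r2 = pd.Series({
    name: ts_regression_stats(demo_assets, demo_factors[cols])["r_squared"].mean()
    for name, cols in models.items()
})
print(avg_r2.round(3))
```

Because CAPM, FF3, and FF5 are nested, the in-sample average r-squared can only rise as factors are added. A higher r-squared therefore says the factors are good for a Linear Factor Decomposition, not that the model prices the assets better.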


### 6. 

We tested four models using the time-series tests (focusing on the time-series alphas). Re-test these models, but this time use the cross-sectional test.

* Report the time-series premia of the factors (just their sample averages), and compare to the cross-sectionally estimated premia of the factors. Do they differ substantially? (See the footnote below.)
* Report the MAE of the cross-sectional regression residuals for each of the four models. How do they compare to the MAE of the time-series alphas?

#### Footnote:

Recall that we found in `Homework 4` that the market premium went from being strongly positive to strongly negative when estimated in the cross-section.
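The cross-sectional test can be sketched in two steps: estimate each asset's betas from a time-series regression, then regress the assets' mean excess returns on those betas. The slopes of the second regression are the cross-sectionally estimated premia, and the residuals play the role of pricing errors. Whether to include an intercept in the second step is a modeling choice; the `intercept` flag below is an assumption of this sketch, as are the helper name `cross_sectional_test` and the synthetic `demo_*` data.

```python
import numpy as np
import pandas as pd

def cross_sectional_test(assets: pd.DataFrame, factors: pd.DataFrame, intercept: bool = False):
    """Two-step cross-sectional test; returns (premia comparison, MAE of residuals)."""
    # Step 1: time-series regressions give one beta vector per asset.
    X = np.column_stack([np.ones(len(factors)), factors.to_numpy()])
    coef, *_ = np.linalg.lstsq(X, assets.to_numpy(), rcond=None)
    betas = coef[1:].T                                  # shape: (n_assets, k)

    # Step 2: regress mean excess returns on the betas across assets.
    mean_ret = assets.mean().to_numpy()
    B = np.column_stack([np.ones(len(betas)), betas]) if intercept else betas
    lam, *_ = np.linalg.lstsq(B, mean_ret, rcond=None)
    resid = mean_ret - B @ lam

    premia = pd.DataFrame(
        {
            "time_series": factors.mean().to_numpy(),   # sample average of each factor
            "cross_section": lam[1:] if intercept else lam,
        },
        index=factors.columns,
    )
    return premia, float(np.abs(resid).mean())

# Synthetic data with zero true alphas (assumption: NOT the assignment data).
rng = np.random.default_rng(3)
n_obs, n_assets = 480, 20
demo_factors = pd.DataFrame(rng.normal(0.005, 0.03, (n_obs, 3)), columns=["MKT", "SMB", "HML"])
true_betas = rng.normal(0.5, 0.4, (3, n_assets))
demo_assets = pd.DataFrame(
    demo_factors.to_numpy() @ true_betas + rng.normal(0.0, 0.03, (n_obs, n_assets)),
    columns=[f"P{i}" for i in range(n_assets)],
)

premia, mae_cs = cross_sectional_test(demo_assets, demo_factors)
print(premia.round(4))
print(f"Cross-sectional MAE: {mae_cs:.6f}")
```

All quantities here are in monthly units, matching the time-series alphas above. Since the cross-sectional step is free to choose the premia that best fit the mean returns, its residuals are typically smaller than the time-series alphas, which are computed with the premia fixed at the factors' sample means.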

***