# Merge Portfolios with Fama-French Factors
Example of merging portfolio returns with Fama-French factors. We'll use WRDS for smooth downloading of the Fama-French factors, so you'll need a WRDS account. But in principle you could download the factors from Kenneth French's website.


In [1]:
# == Setup ==

# load packages
import pandas as pd
import openassetpricing as oap
import numpy as np
import wrds
from datetime import datetime
import statsmodels.formula.api as smf

# connect to WRDS
wrds_conn = wrds.Connection()

# initialize OpenAP
openap = oap.OpenAP()

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


# Download data

Let's examine the Chen-Zimmermann equal-weighted quintiles. One can alternatively examine the original-paper portfolios or other implementations (run `openap.list_port()` to see the other options). For the factors, let's use FF5 + momentum.

In [2]:
# download Chen-Zimmermann equal-weighted quintile returns
port = openap.dl_port('quintiles_ew', 'pandas')

port.head()


Data is downloaded: 9s


Unnamed: 0,signalname,port,date,ret,signallag,Nlong,Nshort
0,AM,1,1951-07-31,7.661648,0.681619,67,0
1,AM,1,1951-08-31,4.273654,0.639816,67,0
2,AM,1,1951-09-28,1.31527,0.617079,67,0
3,AM,1,1951-10-31,-3.942987,0.612266,67,0
4,AM,1,1951-11-30,1.028675,0.63771,67,0


In [3]:
# download Fama-French factors
ff = wrds_conn.raw_sql("select * from ff.fivefactors_monthly", date_cols=['dateff'])

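# convert factor returns from decimal to percent, matching the units of `ret`
# (rf is not in fac_list, so it stays in decimal)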
fac_list = ['mktrf', 'smb', 'hml', 'rmw', 'cma','umd']
for fac in fac_list:
    ff[fac] = ff[fac]*100

ff.head()

Unnamed: 0,date,mktrf,smb,hml,rmw,cma,rf,year,month,umd,dateff
0,1963-07-01,-0.39,-0.41,-0.97,0.68,-1.18,0.0027,1963.0,7.0,0.9,1963-07-31
1,1963-08-01,5.07,-0.8,1.8,0.36,-0.35,0.0025,1963.0,8.0,1.01,1963-08-30
2,1963-09-01,-1.57,-0.52,0.13,-0.71,0.29,0.0027,1963.0,9.0,0.19,1963-09-30
3,1963-10-01,2.53,-1.39,-0.1,2.8,-2.01,0.0029,1963.0,10.0,3.12,1963-10-31
4,1963-11-01,-0.85,-0.88,1.75,-0.51,2.24,0.0027,1963.0,11.0,-0.74,1963-11-29


In [6]:
['dateff'] + fac_list

['dateff', 'mktrf', 'smb', 'hml', 'rmw', 'cma', 'umd']

# Merge
Merging is straightforward. Both tables have date columns (`date` and `dateff`) that mark the last trading day of the month for the corresponding monthly returns.

In [7]:
# inner join on the month-end dates: keeps only months present in both tables
portff = pd.merge(port, ff[['dateff']+fac_list], left_on='date', right_on='dateff')

portff.head()

Unnamed: 0,signalname,port,date,ret,signallag,Nlong,Nshort,dateff,mktrf,smb,hml,rmw,cma,umd
0,AM,1,1963-07-31,0.088593,0.444402,180,0,1963-07-31,-0.39,-0.41,-0.97,0.68,-1.18,0.9
1,AM,1,1963-08-30,6.239402,0.450915,180,0,1963-08-30,5.07,-0.8,1.8,0.36,-0.35,1.01
2,AM,1,1963-09-30,-2.359711,0.427911,180,0,1963-09-30,-1.57,-0.52,0.13,-0.71,0.29,0.19
3,AM,1,1963-10-31,2.818744,0.441678,179,0,1963-10-31,2.53,-1.39,-0.1,2.8,-2.01,3.12
4,AM,1,1963-11-29,-1.327826,0.435277,178,0,1963-11-29,-0.85,-0.88,1.75,-0.51,2.24,-0.74
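
The merge relies on `date` and `dateff` matching to the day. As a minimal sketch on made-up toy data, here are two safeguards one could add: merging on a year-month key (in case two sources disagree on the exact month-end day), and using the `validate` and `indicator` options of `pd.merge` to check the join. The toy frames and the `ym` column are hypothetical and not part of the notebook.

```python
import pandas as pd

# toy stand-ins for `port` and `ff` (values are made up for illustration)
port_toy = pd.DataFrame({
    'signalname': ['AM', 'AM', 'BM', 'BM'],
    'date': pd.to_datetime(['1963-07-31', '1963-08-30', '1963-07-31', '1963-08-30']),
    'ret': [0.1, 6.2, 0.5, 4.0],
})
ff_toy = pd.DataFrame({
    # the second date is a calendar month-end, not the last trading day
    'dateff': pd.to_datetime(['1963-07-31', '1963-08-31']),
    'mktrf': [-0.39, 5.07],
})

# hypothetical year-month key: robust to trading-day vs calendar month-end dates
port_toy['ym'] = port_toy['date'].dt.to_period('M')
ff_toy['ym'] = ff_toy['dateff'].dt.to_period('M')

merged = pd.merge(
    port_toy, ff_toy[['ym', 'mktrf']],
    on='ym', how='left',
    validate='many_to_one',  # raises if a month appears twice in the factor table
    indicator=True,          # adds a `_merge` column that flags unmatched rows
)

n_unmatched = (merged['_merge'] != 'both').sum()
print(merged)
print('unmatched portfolio-months:', n_unmatched)
```

On this toy data, an exact-date merge would silently drop the August rows, while the year-month key keeps them.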


# Check alphas
There's a myth that the predictor zoo boils down to a few factors. Let's see if this holds for equal-weighted quintiles, using the FF5 factors plus momentum.

As a benchmark, first let's check out the mean raw returns (no factor adjustments):


In [8]:
# define regression: with only a constant, the intercept is the mean return
def reg(group):
    fit = smf.ols('ret ~ 1', data=group).fit()
    return pd.Series({
        'meanret': fit.params['Intercept'], 'tstat': fit.tvalues['Intercept'], 'nmonth': len(group),
        'datemin': group['date'].min(), 'datemax': group['date'].max()
    })

# apply regression
portsum = portff.groupby(['signalname','port']).apply(reg, include_groups=False).reset_index()

# average meanret, tstat, and nmonth across signals, by portfolio
portsum_mean = portsum.groupby('port').agg({
    'meanret': 'mean',
    'tstat': 'mean', 
    'nmonth': 'mean',
    'datemin': 'mean',
    'datemax': 'mean',
    'signalname': 'count'
}).rename(
    columns={'signalname': 'nsignals'}
)
portsum_mean['datemin'] = portsum_mean['datemin'].dt.date
portsum_mean['datemax'] = portsum_mean['datemax'].dt.date

portsum_mean


port,meanret,tstat,nmonth,datemin,datemax,nsignals
01,0.824993,3.478146,657.368715,1968-12-29,2023-09-12,179
02,1.056281,4.720015,649.101124,1969-04-25,2023-09-02,178
03,1.134126,5.270244,646.455056,1969-02-02,2023-09-05,178
04,1.190835,5.532314,653.648045,1969-03-02,2023-09-10,179
05,1.294408,5.482095,657.368715,1968-12-29,2023-09-12,179
LS,0.469416,3.823385,657.368715,1968-12-29,2023-09-12,179


So, without any factor adjustments, the mean returns increase monotonically across quintiles, with a long-short spread of 47 bps per month. This is somewhat smaller than the numbers reported in the literature (e.g. Chen and Zimmermann 2021), but the sample here runs through 2023, and returns are much smaller post-2005 (Chen and Velikov 2023).
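
The `ret ~ 1` regression in `reg` is just a convenient way to get a sample mean and its t-statistic in one call. A minimal sketch with simulated returns (the numbers are random draws, not data from the notebook) that checks this equivalence using plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
ret = rng.normal(loc=0.5, scale=3.0, size=600)  # simulated monthly returns, in percent

# OLS of ret on a constant only: the single coefficient is the intercept
X = np.ones((ret.size, 1))
coef, rss, _, _ = np.linalg.lstsq(X, ret, rcond=None)
intercept = coef[0]

# classical OLS standard error: residual variance (n - 1 dof) divided by n
se = np.sqrt(rss[0] / (ret.size - 1) / ret.size)
tstat_ols = intercept / se

# textbook sample mean and its t-statistic
mean = ret.mean()
tstat_mean = mean / (ret.std(ddof=1) / np.sqrt(ret.size))

print('intercept equals mean:', np.isclose(intercept, mean))
print('t-stats agree:', np.isclose(tstat_ols, tstat_mean))
```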

You might notice that (1) the number of signals is smaller overall than in the full Chen-Zimmermann dataset and (2) portfolios two and three each have one fewer signal. This is due to discrete signals or degenerate signal distributions. Let's return to this in a bit.

But first, let's adjust for exposure to the CAPM + Fama-French factors + Momentum:

In [17]:
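# define regression: the intercept is now the six-factor alpha
# (still stored under the name 'meanret' so the summary code below is unchanged)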
def reg(group):
    fit = smf.ols('ret ~ mktrf + smb + hml + rmw + cma + umd', data=group).fit()
    return pd.Series({
        'meanret': fit.params['Intercept'], 'tstat': fit.tvalues['Intercept'], 'nmonth': len(group),
        'datemin': group['date'].min(), 'datemax': group['date'].max()
    })

# apply regression
portsum = portff.groupby(['signalname','port']).apply(reg, include_groups=False).reset_index()

# average alpha (meanret), tstat, and nmonth across signals, by portfolio
portsum_mean = portsum.groupby('port').agg({
    'meanret': 'mean',
    'tstat': 'mean', 
    'nmonth': 'mean',
    'datemin': 'mean',
    'datemax': 'mean',
    'signalname': 'count'
}).rename(
    columns={'signalname': 'nsignals'}
)
portsum_mean['datemin'] = portsum_mean['datemin'].dt.date
portsum_mean['datemax'] = portsum_mean['datemax'].dt.date

portsum_mean

port,meanret,tstat,nmonth,datemin,datemax,nsignals
01,0.229323,3.387908,657.368715,1968-12-29,2023-09-12,179
02,0.356662,6.515554,649.101124,1969-04-25,2023-09-02,178
03,0.445047,7.948112,646.455056,1969-02-02,2023-09-05,178
04,0.498872,8.357514,653.648045,1969-03-02,2023-09-10,179
05,0.611647,7.585132,657.368715,1968-12-29,2023-09-12,179
LS,0.378901,3.703053,657.368715,1968-12-29,2023-09-12,179


After adjusting for exposure to factors, the increasing pattern is still clear, and the long-short spread shrinks only modestly: the 47 bps per month raw spread becomes 38 bps per month after adjusting for these six prominent factors (including a few Nobel-prize winning ones).
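
For reference, the six-factor regression can be written out for portfolio $p$ in month $t$ as follows. The symbols $\alpha_p$, $\beta_p^{(\cdot)}$, and $\varepsilon_{p,t}$ are notation introduced here; the regressors are the column names of the merged table:

$$
\text{ret}_{p,t} = \alpha_p + \beta_p^{\text{mkt}}\,\text{mktrf}_t + \beta_p^{\text{smb}}\,\text{smb}_t + \beta_p^{\text{hml}}\,\text{hml}_t + \beta_p^{\text{rmw}}\,\text{rmw}_t + \beta_p^{\text{cma}}\,\text{cma}_t + \beta_p^{\text{umd}}\,\text{umd}_t + \varepsilon_{p,t}
$$

The `meanret` column of the factor-adjusted table holds the estimate of $\alpha_p$, averaged across signals, and `tstat` is the corresponding average t-statistic. Note that the code regresses `ret` as downloaded, without subtracting `rf`. If `ret` is a raw (not excess) return, the intercepts for the long-only quintiles also absorb the average risk-free rate, while the long-short intercept is unaffected because the risk-free rate cancels in the difference.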

# Discrete and degenerate signals

Let's see which signals are missing the interior portfolios.


In [18]:
# find signalnames in port 01
signallist_01 = portff[portff['port'] == '01']['signalname'].unique()

# find signalnames in port 02
signallist_02 = portff[portff['port'] == '02']['signalname'].unique()

# find signalnames in port 03
signallist_03 = portff[portff['port'] == '03']['signalname'].unique()

print('Signals missing from port 02:', set(signallist_01) - set(signallist_02))
print('Signals missing from port 03:', set(signallist_01) - set(signallist_03))

Signals missing from port 02: {'NumEarnIncrease'}
Signals missing from port 03: {'DelLTI'}
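
The same check can be run for every portfolio at once. A minimal sketch on a made-up toy table (the signal names `A`, `B`, `C` are hypothetical), using `pd.crosstab` to build a signal-by-portfolio presence table:

```python
import pandas as pd

# toy stand-in for `portff`: signal 'B' has no observations in port '02'
toy = pd.DataFrame({
    'signalname': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'port':       ['01', '02', 'LS', '01', 'LS', '01', '02', 'LS'],
})

# one row per signal, one column per portfolio; True if any observation exists
has_port = pd.crosstab(toy['signalname'], toy['port']) > 0

# for each portfolio, collect the signals that never appear in it
missing = {
    p: sorted(has_port.index[(~has_port[p]).to_numpy()])
    for p in has_port.columns
}
# keep only portfolios that are missing at least one signal
missing = {p: s for p, s in missing.items() if s}
print(missing)
```

Replacing `toy` with `portff` should reproduce the two set differences printed above, and would also flag any other portfolio with gaps.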


So the signals with missing portfolios are `NumEarnIncrease` and `DelLTI`. Let's see what these signals are, exactly, by checking the openap signal documentation.

In [19]:
# download signaldoc
signaldoc = openap.dl_signal_doc('pandas')

# show definitions of the signals with missing portfolios
print(signaldoc[signaldoc['Acronym'] == 'NumEarnIncrease']['Detailed Definition'].tolist())
print(signaldoc[signaldoc['Acronym'] == 'DelLTI']['Detailed Definition'].tolist())

['Number if consecutive 4-quarter increases in ibq, up to 8.']
['Difference in investment and advances (ivao) between years t-1 and t, scaled by average total assets (at) in years t-1 and t.']


So, `NumEarnIncrease` is a discrete signal that takes values in {0, 1, 2, ..., 8} (the definition caps it at 8). With so few distinct values, several quintile breakpoints can coincide, so it's natural that some portfolios end up empty. Similarly, `DelLTI` measures changes in a somewhat obscure accounting item, which may bunch up around zero.

The Chen-Zimmermann code handles these degenerate signals by giving extreme portfolios the "benefit of the doubt" in inequalities. See the [portfolio function](https://github.com/OpenSourceAP/CrossSection/blob/master/Portfolios/Code/01_PortfolioFunction.R) in the Chen-Zimmermann GitHub repo.
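
To illustrate the idea, here is a minimal sketch of one way such a rule could work. It is not a transcription of the repo's R code: the function `assign_quintiles`, the inequality conventions, and the toy signal are all assumptions made for illustration. Interior portfolios use a strict lower bound, while the two extreme portfolios use inclusive inequalities and are assigned last, so ties at a breakpoint go to the extremes.

```python
import numpy as np

def assign_quintiles(signal, nport=5):
    """Hypothetical sort: ties at breakpoints go to the extreme portfolios."""
    signal = np.asarray(signal, dtype=float)
    # interior breakpoints at the 20th, 40th, 60th, and 80th percentiles
    breaks = np.quantile(signal, np.arange(1, nport) / nport)
    port = np.zeros(signal.shape, dtype=int)  # 0 means "not yet assigned"

    # interior portfolios: strictly above the lower breakpoint, at or below the upper
    for k in range(2, nport):
        port[(signal > breaks[k - 2]) & (signal <= breaks[k - 1])] = k

    # extreme portfolios get the inclusive inequalities and are assigned last
    port[signal <= breaks[0]] = 1
    port[signal >= breaks[-1]] = nport
    return port

# toy discrete signal bunched at zero, loosely in the spirit of NumEarnIncrease
signal = np.array([0, 0, 0, 0, 0, 0, 1, 1, 8, 8])
port = assign_quintiles(signal)
print(sorted(set(port.tolist())))
```

With this toy signal the first two breakpoints both equal zero, so nothing can fall strictly between them and the sort leaves portfolios 2 and 3 empty. This is the same kind of gap seen in the missing-portfolio check.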