# Machine Learning Portfolio Example

This example illustrates how to use *all* Chen-Zimmermann predictors, together with CRSP data. We'll merge monthly CRSP with the full set of Chen-Zimmermann predictors, fit the CRSP returns to lagged signals, and form portfolios in a super simple out-of-sample test. Specifically, we'll use a "groovy" model (fit on the 1960s and 1970s) to try to predict returns during hair metal (1980s), gangsta rap (1990s), and other more recent samples. Does the groovy model work even in the TSwift era?

Downloading all of the signals takes some time and requires substantial RAM. It also requires a WRDS account, since some predictors require data from WRDS (size, short-term reversal, price). 

In [2]:
# == Setup ==

# load packages
import pandas as pd
import openassetpricing as oap
import numpy as np
import wrds
from datetime import datetime
import statsmodels.formula.api as smf
from scipy.stats import rankdata

# initialize OpenAP
openap = oap.OpenAP()

# Download data

You'll have to enter your WRDS credentials twice: once to download the CRSP returns, and once to download all Chen-Zimmermann predictors (including size, short-term reversal, and price). The downloads take a couple minutes in total.


In [3]:
# download CRSP data
wrds_conn = wrds.Connection()

crsp = wrds_conn.raw_sql("""select permno, date, ret*100 as ret
                        from crsp.msf""", 
                        date_cols=['date'])

crsp.head()

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


Unnamed: 0,permno,date,ret
0,10000,1985-12-31,
1,10000,1986-01-31,
2,10000,1986-02-28,-25.7143
3,10000,1986-03-31,36.5385
4,10000,1986-04-30,-9.8592


In [4]:
# download all Chen-Zimmermann predictors
bigdat = openap.dl_all_signals('pandas')

# get names of all signals
signal_list = [col for col in bigdat.columns if col not in ['permno', 'yyyymm']]

bigdat.head()

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done

Data is downloaded: 2 mins


Unnamed: 0,permno,yyyymm,AM,AOP,AbnormalAccruals,Accruals,AccrualsBM,Activism1,Activism2,AdExp,...,sinAlgo,skew1,std_turn,tang,zerotrade12M,zerotrade1M,zerotrade6M,Price,Size,STreversal
0,10000,198601,,,,,,,,,...,,,,,,,,-1.475907,-2.778819,-0.0
1,10000,198602,,,,,,,,,...,,,,,,4.785175e-08,,-1.178655,-2.481568,0.257143
2,10000,198603,,,,,,,,,...,,,,,,1.023392e-07,,-1.490091,-2.793004,-0.365385
3,10000,198604,,,,,,,,,...,,,,,,7.467463e-08,,-1.386294,-2.719452,0.098592
4,10000,198605,,,,,,,,,...,,,,,,7.649551e-08,,-1.134423,-2.467581,0.222656


# Lag signals and merge

To lag signals, you can just add one month to the `yyyymm` column for the signals. For simplicity, let's fill in the day of the new variable `date` as the 28th (the signals are assumed to be available for trading at the end of the month). You can keep around `yyyymm` as `yyyymm_signals` for sanity checks. 

In [5]:
# rename yyyymm for clarity 
bigdat = bigdat.rename(columns={'yyyymm': 'yyyymm_signals'})

# create date that is one month ahead for merging with returns
bigdat['date'] = pd.to_datetime(bigdat['yyyymm_signals'].astype(str) + '28', format='%Y%m%d') + pd.DateOffset(months=1)

# reorder columns for clarity
bigdat = bigdat[['permno', 'date', 'yyyymm_signals'] + signal_list]

bigdat.head()

Unnamed: 0,permno,date,yyyymm_signals,AM,AOP,AbnormalAccruals,Accruals,AccrualsBM,Activism1,Activism2,...,sinAlgo,skew1,std_turn,tang,zerotrade12M,zerotrade1M,zerotrade6M,Price,Size,STreversal
0,10000,1986-02-28,198601,,,,,,,,...,,,,,,,,-1.475907,-2.778819,-0.0
1,10000,1986-03-28,198602,,,,,,,,...,,,,,,4.785175e-08,,-1.178655,-2.481568,0.257143
2,10000,1986-04-28,198603,,,,,,,,...,,,,,,1.023392e-07,,-1.490091,-2.793004,-0.365385
3,10000,1986-05-28,198604,,,,,,,,...,,,,,,7.467463e-08,,-1.386294,-2.719452,0.098592
4,10000,1986-06-28,198605,,,,,,,,...,,,,,,7.649551e-08,,-1.134423,-2.467581,0.222656


Now merge with CRSP. Convert CRSP dates to the 28th of the month for simplicity. The left join makes the missing values issues transparent.

In [6]:
# convert crsp dates to the 28th of the month
crsp['date'] = pd.to_datetime(crsp['date'].dt.strftime('%Y%m') + '28', format='%Y%m%d')

# left join returns onto signals, in-place (for ram)
bigdat = pd.merge(crsp, bigdat, on=['permno', 'date'], how='left')

bigdat.head()


Unnamed: 0,permno,date,ret,yyyymm_signals,AM,AOP,AbnormalAccruals,Accruals,AccrualsBM,Activism1,...,sinAlgo,skew1,std_turn,tang,zerotrade12M,zerotrade1M,zerotrade6M,Price,Size,STreversal
0,10000,1985-12-28,,,,,,,,,...,,,,,,,,,,
1,10000,1986-01-28,,,,,,,,,...,,,,,,,,,,
2,10000,1986-02-28,-25.7143,198601.0,,,,,,,...,,,,,,,,-1.475907,-2.778819,-0.0
3,10000,1986-03-28,36.5385,198602.0,,,,,,,...,,,,,,4.785175e-08,,-1.178655,-2.481568,0.257143
4,10000,1986-04-28,-9.8592,198603.0,,,,,,,...,,,,,,1.023392e-07,,-1.490091,-2.793004,-0.365385


Congrats, the data is merged! But unfortunately, we'll need to do a bit more work to make it usable.

# Process data
We'll need to deal with the missing signals. This is a notorious issue with big data. Here, we'll just standardize the signals and then fill in missings with zero. This follows [Chen and McCoy (2024)](https://arxiv.org/abs/2207.13071).

In [7]:
# copy over, keep only after 1963 and non-missing returns
cleandat = bigdat[
    (bigdat['date'].dt.year >= 1963) & 
    (bigdat['ret'].notna())
].copy()

# standardize
def standardize_signals(x):
    return (x - x.mean()) / x.std()

cleandat[signal_list] = cleandat[signal_list].apply(standardize_signals)

# replace NaNs with 0
cleandat = cleandat.fillna(0)

# Form ML-style portfolios
Following Lewellen (2014, CFR), let's predict returns using many signals and then sort stocks on the predicted returns. We'll do this in perhaps the simplest way possible: fit returns with OLS using the "groovy" 1963-1979 sample. Then use the fitted coefficients on lagged signals to sort stocks every month from 1980 onward. This can't work, can it?

In [8]:
# user-specified fit period
fit_start = 1963
fit_end = 1979

# user-specified number of portfolios
nport = 5

In [9]:
# fit returns
formula = 'ret ~ ' + ' + '.join(signal_list)

fit = smf.ols(formula, data=cleandat[cleandat['date'].dt.year.isin(range(fit_start, fit_end))]).fit()

# apply fit to all data
cleandat['pred'] = fit.predict(cleandat)

  cleandat['pred'] = fit.predict(cleandat)


In [10]:
# == find portfolio returns ==

# copy data
preddat = cleandat[['permno', 'date', 'pred', 'ret']].copy()

# define port sort function
# follows https://github.com/chenandrewy/flex-mining/blob/70ca658090a13fea8517945280b2de83b9886968/0_Environment.R#L465
def port_sort(x, nport):
    ranks = rankdata(x, method='min')
    return (np.floor(ranks * nport / (len(x) + 1)) + 1).astype(int)

# apply port sort function
preddat['port'] = preddat.groupby('date')['pred'].transform(lambda x: port_sort(x, nport))

# find portfolio returns 
portdat = preddat.groupby(['port', 'date']).agg(
    ret = ('ret', 'mean'),
    nstock = ('permno', 'nunique')
).reset_index()

# Far Out-of-Sample Performance
Let's examine the performance of our groovy model, into the hair metal (1980s), gangsta rap (1990s), emo (2000s), EDM (2010s), and TSwift (2020s) samples.

In [27]:
# find performance by 10-year periods
samplength = 10

portdat['subsamp'] = pd.cut(portdat['date'].dt.year, bins=range(1959, 2030, samplength), labels=range(1959, 2029, samplength))

portsum = portdat.groupby(['port', 'subsamp']).agg(
    meanret = ('ret', 'mean'),
    vol = ('ret', 'std'),
    nmonth = ('date', 'nunique'),
    nstock = ('nstock', 'mean'),
    datemin = ('date', 'min'),
    datemax = ('date', 'max')
).reset_index()
portsum['meanret'] = round(portsum['meanret'], 2)

# pivot meanret to wide format
sumwide = portsum.pivot(index=['subsamp'], columns='port', values='meanret').reset_index()
sumwide.columns = ['subsamp'] + [f'port_{col}' for col in sumwide.columns[1:]]

# add long-short
sumwide['5_minus_1'] = sumwide['port_5'] - sumwide['port_1']

# add date ranges
temp = portsum.groupby('subsamp').agg(
    datemin = ('datemin', 'min'),
    datemax = ('datemax', 'max')
).reset_index()

sumwide = pd.merge(temp, sumwide, on='subsamp', how='left')

# name the subsamples
sumwide['subsamp'] = sumwide['subsamp'].map({
    1959: 'groovy',
    1969: 'groovy (still)', 
    1979: 'hair metal',
    1989: 'gangsta rap',
    1999: 'emo',
    2009: 'EDM',
    2019: 'TSwift'
})

sumwide


  portsum = portdat.groupby(['port', 'subsamp']).agg(
  temp = portsum.groupby('subsamp').agg(


Unnamed: 0,subsamp,datemin,datemax,port_1,port_2,port_3,port_4,port_5,5_minus_1
0,groovy,1963-01-28,1969-12-28,0.51,0.99,1.43,1.89,2.89,2.38
1,groovy (still),1970-01-28,1979-12-28,-0.81,0.35,1.03,1.82,3.24,4.05
2,hair metal,1980-01-28,1989-12-28,0.01,0.92,1.32,1.57,2.27,2.26
3,gangsta rap,1990-01-28,1999-12-28,0.09,0.67,1.08,1.55,3.31,3.22
4,emo,2000-01-28,2009-12-28,-0.18,0.51,0.8,1.07,2.31,2.49
5,EDM,2010-01-28,2019-12-28,0.48,0.72,0.73,0.85,1.24,0.76
6,TSwift,2020-01-28,2023-12-28,0.14,0.6,0.6,0.68,1.11,0.97


The model, fit only using groovy era data, makes it through hair metal, gansta rap, and emo quite well. In the corresponding decades, the groovy model earns long-short returns of 2.0 to 3.0 percent per month. So a model from the [Simon and Garfunkel](https://en.wikipedia.org/wiki/Groovy#/media/File:Soundofsilence.jpg) days continued to predict quite well, even while [Metallica inexplicably started to paint their fingernails black](https://www.reddit.com/r/Metallica/comments/huk18i/never_forget_emotallica/). During EDM and the Tswift eras, the model predicts with the some notable magnitudes, though the returns are much weaker than they were while [Ms. Swift was still into pickup trucks](https://www.youtube.com/watch?v=GkD20ajVxnY).

There are huge caveats about trading costs (Chen and Velikov 2023). But then again, this tutorial doesn't even attempt to deal with trading costs. One can likely do much better by following DeMiguel, Martin-Utrera, Nogales, and Uppal (2020) or Jensen, Kelly, Malamud, and Pedersen (2024).