# Machine Learning Portfolio Example

This example illustrates how to use *all* Chen-Zimmermann predictors, together with CRSP data. We'll merge monthly CRSP with the full set of Chen-Zimmermann predictors, fit the CRSP returns to lagged signals, and form portfolios in a super simple out-of-sample test.

Downloading all of the signals takes some time and requires substantial RAM. It also requires a WRDS account, since some predictors require data from WRDS (size, short-term reversal, price). 

In [1]:
# == Setup ==

# load packages
import pandas as pd
import openassetpricing as oap
import numpy as np
import wrds
from datetime import datetime
import statsmodels.formula.api as smf
from scipy.stats import rankdata

# initialize OpenAP
openap = oap.OpenAP()

# Download data

You'll have to enter your WRDS credentials twice: once to download the CRSP returns, and once to download all Chen-Zimmermann predictors (including size, short-term reversal, and price). The downloads take a couple minutes in total.


In [3]:
# download CRSP data
wrds_conn = wrds.Connection()

crsp = wrds_conn.raw_sql("""select permno, date, ret*100 as ret
                        from crsp.msf""", 
                        date_cols=['date'])

crsp.head()

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


Unnamed: 0,permno,date,ret
0,10000,1985-12-31,
1,10000,1986-01-31,
2,10000,1986-02-28,-25.7143
3,10000,1986-03-31,36.5385
4,10000,1986-04-30,-9.8592


In [2]:
# download all Chen-Zimmermann predictors
bigdat = openap.dl_all_signals('pandas')

# get names of all signals
signal_list = [col for col in bigdat.columns if col not in ['permno', 'yyyymm']]

bigdat.head()

WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done

Data is downloaded: 2 mins


Unnamed: 0,permno,yyyymm,AM,AOP,AbnormalAccruals,Accruals,AccrualsBM,Activism1,Activism2,AdExp,...,sinAlgo,skew1,std_turn,tang,zerotrade12M,zerotrade1M,zerotrade6M,Price,Size,STreversal
0,10000,198601,,,,,,,,,...,,,,,,,,-1.475907,-2.778819,-0.0
1,10000,198602,,,,,,,,,...,,,,,,4.785175e-08,,-1.178655,-2.481568,0.257143
2,10000,198603,,,,,,,,,...,,,,,,1.023392e-07,,-1.490091,-2.793004,-0.365385
3,10000,198604,,,,,,,,,...,,,,,,7.467463e-08,,-1.386294,-2.719452,0.098592
4,10000,198605,,,,,,,,,...,,,,,,7.649551e-08,,-1.134423,-2.467581,0.222656


# Lag signals and merge

To lag signals, you can just add one month to the `yyyymm` column for the signals. For simplicity, let's fill in the day of the new variable `date` as the 28th (the signals are assumed to be available for trading at the end of the month). You can keep around `yyyymm` as `yyyymm_signals` for sanity checks. 

In [4]:
# rename yyyymm for clarity 
bigdat = bigdat.rename(columns={'yyyymm': 'yyyymm_signals'})

# create date that is one month ahead for merging with returns
bigdat['date'] = pd.to_datetime(bigdat['yyyymm_signals'].astype(str) + '28', format='%Y%m%d') + pd.DateOffset(months=1)

# reorder columns for clarity
bigdat = bigdat[['permno', 'date', 'yyyymm_signals'] + signal_list]

bigdat.head()

Unnamed: 0,permno,date,yyyymm_signals,AM,AOP,AbnormalAccruals,Accruals,AccrualsBM,Activism1,Activism2,...,sinAlgo,skew1,std_turn,tang,zerotrade12M,zerotrade1M,zerotrade6M,Price,Size,STreversal
0,10000,1986-02-28,198601,,,,,,,,...,,,,,,,,-1.475907,-2.778819,-0.0
1,10000,1986-03-28,198602,,,,,,,,...,,,,,,4.785175e-08,,-1.178655,-2.481568,0.257143
2,10000,1986-04-28,198603,,,,,,,,...,,,,,,1.023392e-07,,-1.490091,-2.793004,-0.365385
3,10000,1986-05-28,198604,,,,,,,,...,,,,,,7.467463e-08,,-1.386294,-2.719452,0.098592
4,10000,1986-06-28,198605,,,,,,,,...,,,,,,7.649551e-08,,-1.134423,-2.467581,0.222656


Now merge with CRSP. Convert CRSP dates to the 28th of the month for simplicity. The left join makes the missing values issues transparent.

In [5]:
# convert crsp dates to the 28th of the month
crsp['date'] = pd.to_datetime(crsp['date'].dt.strftime('%Y%m') + '28', format='%Y%m%d')

# left join returns onto signals, in-place (for ram)
bigdat = pd.merge(crsp, bigdat, on=['permno', 'date'], how='left')

bigdat.head()


Unnamed: 0,permno,date,ret,yyyymm_signals,AM,AOP,AbnormalAccruals,Accruals,AccrualsBM,Activism1,...,sinAlgo,skew1,std_turn,tang,zerotrade12M,zerotrade1M,zerotrade6M,Price,Size,STreversal
0,10000,1985-12-28,,,,,,,,,...,,,,,,,,,,
1,10000,1986-01-28,,,,,,,,,...,,,,,,,,,,
2,10000,1986-02-28,-25.7143,198601.0,,,,,,,...,,,,,,,,-1.475907,-2.778819,-0.0
3,10000,1986-03-28,36.5385,198602.0,,,,,,,...,,,,,,4.785175e-08,,-1.178655,-2.481568,0.257143
4,10000,1986-04-28,-9.8592,198603.0,,,,,,,...,,,,,,1.023392e-07,,-1.490091,-2.793004,-0.365385


Congrats, the data is merged! But unfortunately, we'll need to do a bit more work to make it usable.

# Process data
We'll need to deal with the missing signals. This is a notorious issue with big data. Here, we'll just standardize the signals and then fill in missings with zero. This follows [Chen and McCoy (2024)](https://arxiv.org/abs/2207.13071).

In [6]:
# copy over, keep only after 1963 and non-missing returns
cleandat = bigdat[
    (bigdat['date'].dt.year >= 1963) & 
    (bigdat['ret'].notna())
].copy()

# standardize
def standardize_signals(x):
    return (x - x.mean()) / x.std()

cleandat[signal_list] = cleandat[signal_list].apply(standardize_signals)

# replace NaNs with 0
cleandat = cleandat.fillna(0)

# Form ML-style portfolios
Following Lewellen (2014, CFR), let's predict returns using many signals and then sort stocks on the predicted returns. We'll do this in perhaps the simplest way possible: fit returns with OLS using the 1963-1979 sample. Then use the fitted coefficients on lagged signals to sort stocks every month from 1980 onward. This can't work, can it?

In [7]:
# user-specified fit period
fit_start = 1963
fit_end = 1979

# user-specified number of portfolios
nport = 5

In [8]:
# fit returns
formula = 'ret ~ ' + ' + '.join(signal_list)

fit = smf.ols(formula, data=cleandat[cleandat['date'].dt.year.isin(range(fit_start, fit_end))]).fit()

# apply fit to all data
cleandat['pred'] = fit.predict(cleandat)

  cleandat['pred'] = fit.predict(cleandat)


In [9]:
# == find portfolio returns ==

# copy data
preddat = cleandat[['permno', 'date', 'pred', 'ret']].copy()

# define port sort function
# follows https://github.com/chenandrewy/flex-mining/blob/70ca658090a13fea8517945280b2de83b9886968/0_Environment.R#L465
def port_sort(x, nport):
    ranks = rankdata(x, method='min')
    return (np.floor(ranks * nport / (len(x) + 1)) + 1).astype(int)

# apply port sort function
preddat['port'] = preddat.groupby('date')['pred'].transform(lambda x: port_sort(x, nport))

# find portfolio returns 
portdat = preddat.groupby(['port', 'date']).agg(
    ret = ('ret', 'mean'),
    nstock = ('permno', 'nunique')
).reset_index()

# Far Out-of-Sample Performance
Let's examine the performance of our groovy 1960s-1970s machine, decade by decade.

In [25]:
# find performance by 10-year periods
portdat['subsamp'] = pd.cut(portdat['date'].dt.year, bins=range(1960, 2035, 10), labels=range(1960, 2030, 10))

portsum = portdat.groupby(['port', 'subsamp']).agg(
    meanret = ('ret', 'mean'),
    vol = ('ret', 'std'),
    nmonth = ('date', 'nunique'),
    nstock = ('nstock', 'mean'),
    datemin = ('date', 'min'),
    datemax = ('date', 'max')
).reset_index()
portsum['meanret'] = round(portsum['meanret'], 2)

# pivot meanret to wide format
sumwide = portsum.pivot(index=['subsamp'], columns='port', values='meanret').reset_index()
sumwide.columns = ['subsamp'] + [f'port_{col}' for col in sumwide.columns[1:]]

# add long-short
sumwide['5_minus_1'] = sumwide['port_5'] - sumwide['port_1']

# add date ranges
temp = portsum.groupby('subsamp').agg(
    datemin = ('datemin', 'min'),
    datemax = ('datemax', 'max')
).reset_index()

sumwide = pd.merge(temp, sumwide, on='subsamp', how='left')

sumwide


  portsum = portdat.groupby(['port', 'subsamp']).agg(
  temp = portsum.groupby('subsamp').agg(


Unnamed: 0,subsamp,datemin,datemax,port_1,port_2,port_3,port_4,port_5,5_minus_1
0,1960,1963-01-28,1970-12-28,0.06,0.72,1.17,1.73,2.66,2.6
1,1970,1971-01-28,1980-12-28,-0.26,0.72,1.38,2.13,3.64,3.9
2,1980,1981-01-28,1990-12-28,-0.48,0.47,0.85,1.0,1.67,2.15
3,1990,1991-01-28,2000-12-28,0.09,0.81,1.28,1.71,3.39,3.3
4,2000,2001-01-28,2010-12-28,0.2,0.72,0.96,1.34,2.68,2.48
5,2010,2011-01-28,2020-12-28,0.58,0.72,0.7,0.81,1.33,0.75
6,2020,2021-01-28,2023-12-28,-0.56,0.28,0.31,0.28,0.01,0.57


This model, fitted by just mindlessly running OLS on signals (imputed with zeros), gets huge long-short returns, of 2.0-3.0 percent per month, into the 2000s. As a reminder, the model was fitted using data from 1963-1979. So the model predicts for 30 years after fitting! Even in the 2010s and 2020s, the model does pretty well. 

There are huge caveats about trading costs (Chen and Velikov 2023). But then again, this tutorial doesn't even attempt to deal with trading costs. One can likely do much better by following DeMiguel, Martin-Utrera, Nogales, and Uppal (2020) or Jensen, Kelly, Malamud, and Pedersen (2024).