**FIN 585R**  
**Diether**  
**Double Sorting**<br><br>

**Overview**

+ Goal: How to create double-sort portfolios<br><br>

+ Tool: Rely on a three-way groupby instead of a two-way groupby.<br><br>

+ Also cover some odds and ends about working with the CRSP data.<br><br>


In [7]:
import numpy as np
import pandas as pd
from finance_byu.summarize import summary



**Raw CRSP Data**

+ This is raw CRSP data in the feather format.<br><br>

+ The feather format can be read by pandas very quickly. It's a great format for large datasets.<br><br>

+ Raw CRSP data contains negative prices. CRSP reports a negative price when the reported price is a quote (the bid/ask average) from market makers rather than an actual transaction price.<br><br>

+ Researchers typically don't care about this distinction, so they just take the absolute value of price.<br><br>


In [8]:
df = pd.read_feather('12-mstk.ftr')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4834887 entries, 0 to 4834886
Data columns (total 12 columns):
 #   Column     Dtype         
---  ------     -----         
 0   permno     int64         
 1   caldt      datetime64[ns]
 2   cusip      object        
 3   ticker     object        
 4   shrcd      int64         
 5   excd       int64         
 6   siccd      int64         
 7   prc        float64       
 8   ret        float64       
 9   vol        float64       
 10  shr        float64       
 11  cumfacshr  float64       
dtypes: datetime64[ns](1), float64(5), int64(4), object(2)
memory usage: 442.6+ MB


In [None]:
df[['prc','ret']].describe().round(3)

In [None]:
df[['prc','ret']].quantile([0.05,0.1,0.15,0.20])

In [None]:
df['prc'] = df['prc'].abs()
df['me']  = df.eval("prc*shr/1000.0").where(df.eval("prc*shr > 1e-6"))

df[['prc','ret','me']].quantile([0.05,0.1,0.15,0.20])

<br>

**Extension: Double Sorting Prep**

+ Sometimes you'll want to form portfolios based on two variables.<br><br>

+ We will form portfolios based on lagged market cap and momentum.<br><br>

+ Need bins for both portfolio formation variables: momentum and market-cap.<br><br>

+ Let's use NYSE breakpoints for market-cap.<br><br>

+ We bin before splitting the sample so that the momentum breakpoints are the same for the small-cap and large-cap groups.<br><br>

+ This is called independent double sorting. $\leftarrow$ Fama and French (1992)<br><br>

+ Independent sorts make comparisons across portfolio groupings more useful because the variation in momentum will be roughly the same across the groupings.<br><br>
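
The distinction can be sketched on simulated data. The `toy` frame and its column names are made up for illustration, and the dependent sort is shown only for contrast; it is not used in these notes.

```python
import numpy as np
import pandas as pd

# Simulated cross-section for a single date (hypothetical data, not CRSP).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "size": rng.lognormal(size=1000),
    "mom": rng.normal(size=1000),
})

# Independent sort: each variable is binned on the full cross-section,
# so the momentum breakpoints are identical for small and large stocks.
toy["size_bin"] = pd.qcut(toy["size"], 2, labels=False)
toy["mom_bin_indep"] = pd.qcut(toy["mom"], 5, labels=False)

# Dependent sort (for contrast): momentum is binned separately within
# each size bin, so the breakpoints differ across size groups.
toy["mom_bin_dep"] = toy.groupby("size_bin")["mom"].transform(
    pd.qcut, 5, labels=False
)

# Number of stocks in each (size, momentum) cell under both schemes.
counts_indep = toy.groupby(["size_bin", "mom_bin_indep"]).size().unstack()
counts_dep = toy.groupby(["size_bin", "mom_bin_dep"]).size().unstack()
print(counts_indep)
print(counts_dep)
```

Under the dependent sort every cell holds the same number of stocks by construction. Under the independent sort the cell counts can differ, which is the price paid for having common momentum breakpoints.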

In [None]:
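# Lag price and market cap by one month within each stock
# (assumes rows are sorted by permno and then caldt).
df['prclag'] = df.groupby('permno')['prc'].shift(1)
df['melag'] = df.groupby('permno')['me'].shift(1)

# Momentum: sum of log returns from month t-12 through t-2.
df['logret'] = df.eval("log(1+ret)")
# Drop only the permno level so the result aligns with df's original index.
df['mom'] = df.groupby('permno')['logret'].rolling(11).sum().reset_index(level=0, drop=True)
df['mom'] = df.groupby('permno')['mom'].shift(2)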
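
To see what the rolling sum and the two-month shift produce, here is a small sketch on two made-up stocks (the `toy` frame, its `month` column, and the simulated returns are assumptions for illustration). With an 11-month window shifted by two, `mom` in month 13 should equal the sum of log returns over months 1 through 11.

```python
import numpy as np
import pandas as pd

# Two hypothetical stocks with 15 monthly returns each,
# sorted by permno and then month.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "permno": np.repeat([1, 2], 15),
    "month": np.tile(np.arange(1, 16), 2),
    "ret": rng.normal(0.01, 0.05, size=30),
})
toy["logret"] = np.log1p(toy["ret"])

# Rolling 11-month sum within each stock. The result is indexed by
# (permno, original row label), so drop the permno level to align.
roll = toy.groupby("permno")["logret"].rolling(11).sum()
toy["mom"] = roll.reset_index(level=0, drop=True)

# Shift by two months within each stock to skip the most recent month.
toy["mom"] = toy.groupby("permno")["mom"].shift(2)

print(toy.loc[toy["permno"] == 1, ["month", "mom"]])
```

The first 12 months of each stock have a missing `mom`, which is why rows with missing momentum have to be dropped before binning.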

**NYSE Breakpoint Function**

In [None]:
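def nyse_qcut(x,bp=[0.3,0.7]):
    # Breakpoints come from NYSE stocks only (excd == 1); searchsorted then
    # assigns every stock the number of breakpoints below its melag (0, 1, or 2).
    bins = x.query("excd == 1")['melag'].quantile(bp).searchsorted(x['melag'])
    return pd.DataFrame(bins,index=x.index)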
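
How this kind of binning behaves can be checked on a tiny made-up cross-section. The values below are hypothetical, and the logic is inlined rather than calling `nyse_qcut` so the sketch runs on its own.

```python
import pandas as pd

# Hypothetical single-date cross-section: five NYSE stocks (excd == 1)
# and three stocks from another exchange (excd == 3).
toy = pd.DataFrame({
    "excd":  [1, 1, 1, 1, 1, 3, 3, 3],
    "melag": [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 25.0, 100.0],
})

# 30th and 70th percentiles computed from NYSE stocks only.
nyse_bp = toy.loc[toy["excd"] == 1, "melag"].quantile([0.3, 0.7])

# For every stock, count how many NYSE breakpoints lie below its melag.
toy["mebins"] = nyse_bp.searchsorted(toy["melag"].to_numpy())

print(nyse_bp)
print(toy)
```

The non-NYSE stocks are placed using the NYSE breakpoints, so the very small stock (5.0) lands in bin 0 and the very large one (100.0) lands in bin 2 without influencing where the breakpoints fall.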

In [None]:
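# Keep rows where both sorting variables are non-missing (NaN != NaN).
df = df.query("mom == mom and melag == melag").reset_index(drop=True)

# Momentum quintiles from the full cross-section each month.
df['bins'] = df.groupby('caldt')['mom'].transform(pd.qcut,5,labels=False)

# Size bins from NYSE breakpoints; group_keys=False keeps the original
# row index so the assignment aligns with df.
df['mebins'] = df.groupby('caldt', group_keys=False)[['excd','melag']].apply(nyse_qcut)
df.head()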

In [None]:
# Drop stocks with a lagged price below $5.
df = df.query("prclag >= 5").reset_index(drop=True)

In [None]:
port = df.groupby(['caldt','mebins','bins'])['ret'].mean()*100
port

In [None]:
port = port.unstack(level='bins')
port

In [None]:
port.query("mebins == 0")

In [None]:
port = df.groupby(['caldt','mebins','bins'])['ret'].mean()*100
port = port.unstack(level=['mebins','bins'])
port
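
The three-way groupby and the two-level unstack can be checked on a small made-up panel. All values below are hypothetical; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical panel: two dates, two size bins, two momentum bins.
# The (mebins=0, bins=0) cell on the first date holds two stocks.
toy = pd.DataFrame({
    "caldt":  ["2020-01"] * 5 + ["2020-02"] * 4,
    "mebins": [0, 0, 0, 1, 1, 0, 0, 1, 1],
    "bins":   [0, 0, 1, 0, 1, 0, 1, 0, 1],
    "ret":    [0.01, 0.03, 0.02, 0.03, 0.04, -0.01, 0.00, 0.02, 0.05],
})

# Equal-weighted return (in percent) for each date / size / momentum cell.
port = toy.groupby(["caldt", "mebins", "bins"])["ret"].mean() * 100

# Move both bin levels to the columns: one row per date,
# one column per (size bin, momentum bin) portfolio.
wide = port.unstack(level=["mebins", "bins"])
print(wide)

# Selecting on the first column level keeps only the small-cap portfolios.
print(wide[0])
```

Unstacking both levels at once gives a time series per portfolio, which is the layout needed for column-wise summary statistics.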

In [None]:
summary(port).loc[['mean','std','tstat']].round(3)
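
For readers without `finance_byu` installed, similar statistics can be sketched with plain pandas. This assumes the t-statistic is the mean divided by its standard error, std / sqrt(T); that is an assumption about what the `tstat` row reports, not something taken from the package documentation. The `toy_port` frame and its column names are simulated for illustration.

```python
import numpy as np
import pandas as pd

# Simulated monthly portfolio returns in percent (hypothetical data).
rng = np.random.default_rng(1)
toy_port = pd.DataFrame(
    rng.normal(1.0, 5.0, size=(120, 3)),
    columns=["low", "mid", "high"],
)

# Time-series mean and standard deviation for each portfolio.
stats = pd.DataFrame({
    "mean": toy_port.mean(),
    "std": toy_port.std(),
}).T

# Assumed definition: t-stat for the null that the mean return is zero.
stats.loc["tstat"] = stats.loc["mean"] / (
    stats.loc["std"] / np.sqrt(toy_port.count())
)
print(stats.round(3))
```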