**Fin 585R**  
**Diether**  
**Course Notes**  
**Application: Merging**
**Application: Sub-Sample Based Merging**

**Overview**

**Monthly or Annual Rebalancing**

+ Previously we formed portfolios where the breakpoints (bins) are recomputed every month. $\leftarrow$ called monthly rebalancing.<br><br>

+ Monthly rebalancing is quite common in empirical finance research but so is *annual rebalancing*. <br><br>

+ For example, *annual rebalancing* is typically preferred when using an accounting variable as a portfolio formation variable. Accounting variables typically come from annual balance sheets or income statements so annual rebalancing corresponds with the frequency at which these data are updated. <br><br>

+ On the other hand, we can implement *annual rebalancing* for any set of portfolios. <br><br>

+ To implement annual rebalancing on monthly data, we need to use **Pandas' merging capabilities.**<br><br>

+ **Main lesson from this application**: often the most difficult part of merging dataframes is preparing your merge keys.<br><br>


**Sub-Sample Breakpoints**

+ Additionally, breakpoints sometimes are not computed using all the stocks you may include in a portfolio. <br><br>

+ One common way this shows up is with NYSE breakpoints. For example, when people create portfolios based on lagged market-cap, the breakpoints are typically computed using only NYSE stocks.<br><br>

+ Why? The distribution of size (market-cap) is shifted to the right for NYSE stocks versus an exchange like Nasdaq. So the idea is using the market-cap distribution of all stocks for breakpoints would lead to tiny (economically unimportant) stocks in the smallest two or three or four deciles. <br><br>

+ It's not hard to compute NYSE breakpoints and then apply those breakpoints to all stocks, but it can be frustrating to figure out on your own. So I want to highlight it as part of this merging application.<br><br>
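To make the motivation concrete, here is a minimal sketch on made-up numbers (the `toy` dataframe and its values are hypothetical, not from the course data). It compares a median breakpoint computed from all stocks with one computed from NYSE stocks only, and counts how many stocks fall below each.

```python
import pandas as pd

# Hypothetical toy cross-section: 4 NYSE stocks (excd == 1) and 6 Nasdaq stocks (excd == 3)
toy = pd.DataFrame({
    'excd': [1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    'me':   [100., 200., 300., 400., 5., 10., 15., 20., 150., 350.],
})

# Median breakpoint from all stocks vs. from NYSE stocks only
all_median = toy['me'].median()
nyse_median = toy.loc[toy['excd'] == 1, 'me'].median()

# Number of stocks that fall below each breakpoint
n_small_all = (toy['me'] < all_median).sum()
n_small_nyse = (toy['me'] < nyse_median).sum()

print(all_median, nyse_median, n_small_all, n_small_nyse)
```

In this toy sample the NYSE-only breakpoint is higher, so more stocks land in the "small" group and fewer of the larger stocks share a bin with tiny ones.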

**Portfolio Formation Framework**

1. Data Prep<br><br>

2. Create portfolio formation variable $\leftarrow$ most of our work is in this step because we need to create an annual variable that repeats in the dataframe for 12 months.<br><br>

3. Bin the data $\leftarrow$ we extend our framework to do NYSE breakpoints and apply those breakpoints to all the stocks.<br><br>

4. Portfolio formation $\leftarrow$ same as always.<br><br>

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('07-mstk_26-22.csv',parse_dates=['caldt'])
df['prclag'] = df.groupby('permno')['prc'].shift()
df['melag'] = df.groupby('permno')['me'].shift()
df

Unnamed: 0,permno,caldt,excd,prc,me,ret,prclag,melag
0,10000,1986-01-31,3,4.37500,16.1000,,,
1,10000,1986-02-28,3,3.25000,11.9600,-0.257143,4.3750,16.100
2,10000,1986-03-31,3,4.43750,16.3300,0.365385,3.2500,11.960
3,10000,1986-04-30,3,4.00000,15.1720,-0.098592,4.4375,16.330
4,10000,1986-05-30,3,3.10938,11.7939,-0.222656,4.0000,15.172
...,...,...,...,...,...,...,...,...
3679107,93436,2022-05-31,3,758.26000,785565.0000,-0.129197,870.7600,902116.000
3679108,93436,2022-06-30,3,673.42000,701030.0000,-0.111888,758.2600,785565.000
3679109,93436,2022-07-29,3,891.45000,931111.0000,0.323765,673.4200,701030.000
3679110,93436,2022-08-31,3,275.61000,863616.0000,-0.072489,891.4500,931111.000


**June dataframe**

+ First, let's create a dataframe of just the June observations.<br><br> 

+ Note, I will make an explicit copy of the dataframe (using the `.copy` method). When you subset a dataframe, pandas does not always make it clear whether the result is an independent copy or a *view* back into the old dataframe. Usually this doesn't cause problems, but occasionally it can lead to unexpected behavior (and it very often leads to annoying warnings) when you later assign new columns. <br><br>

+ This is the first time we use features and methods from `pandas'` `datetime` object. An important feature is the **dt accessor**. I will use *dt accessor* below to get an integer representation of the month. <br><br>

+ The *dt accessor* has a number of convenience attributes and methods for `datetime` objects; for example, the *dt accessor* gives you access to attributes that return the `year`, `month` or `day` as an integer. <br><br>
[pandas docs on the .dt accessor](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dt-accessor)
<br><br>
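A minimal sketch of the *dt accessor* on a few made-up dates (the `dates` series is hypothetical and only for illustration):

```python
import pandas as pd

# Hypothetical dates, just to illustrate the accessor
dates = pd.Series(pd.to_datetime(['1986-01-31', '1986-06-30', '2022-08-31']))

months = dates.dt.month          # integer month of each date
years = dates.dt.year            # integer year of each date
is_june = dates.dt.month == 6    # boolean mask that selects June observations

print(months.tolist(), years.tolist(), is_june.tolist())
```

A boolean mask like `is_june` is what gets passed inside the square brackets to keep only the June rows.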

**Create the June only dataframe**

In [6]:
jstk = df[df['caldt'].dt.month == 6].copy().reset_index(drop=True)
jstk.head(5)

Unnamed: 0,permno,caldt,excd,prc,me,ret,prclag,melag
0,10000,1986-06-30,3,3.09375,11.7346,-0.005025,3.10938,11.7939
1,10001,1986-06-30,3,6.125,6.03313,-0.013069,6.3125,6.21781
2,10001,1987-06-30,3,5.875,5.82212,0.051429,5.6875,5.63631
3,10001,1988-06-30,3,6.25,6.2,-0.012039,6.4375,6.386
4,10001,1989-06-30,3,7.0,7.007,0.017143,7.0,6.986


<br>

**Breakpoint creation for June DataFrame**

+ Normally we use `pd.qcut` to bin the data. `pd.qcut` has two parts: (1) it takes quantiles (.2,.4,.6,.8) and computes the empirical breakpoints of the data, and (2) then it applies those breakpoints to our data to bin the data.<br><br>

+ We can break up those two parts, which allows us to use different data for step 1 and step 2.<br><br>

**Recreating the steps in pd.qcut**

1. Compute the breakpoints based on quantiles and just NYSE data $\leftarrow$ use the pandas **quantile** method.<br><br>

2. Apply those breakpoints to all the data to bin the data $\leftarrow$ use the pandas **searchsorted** method.<br><br>
[pandas searchsorted](https://pandas.pydata.org/docs/reference/api/pandas.Series.searchsorted.html)
<br><br>
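As a sanity check on the two-step idea, the sketch below uses a small made-up series (the values in `me` are hypothetical) and compares `quantile` followed by `searchsorted` with `pd.qcut` when both steps use the same data. It is an illustration under these assumptions, not a claim about how `pd.qcut` is implemented internally.

```python
import pandas as pd

# Hypothetical market-cap values
me = pd.Series([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
q = [.2, .4, .6, .8]

# Step 1: empirical breakpoints at the requested quantiles
breaks = me.quantile(q)

# Step 2: position of each observation relative to the sorted breakpoints
bins = breaks.searchsorted(me)

# Reference: quintile bins from pd.qcut on the same data
qcut_bins = pd.qcut(me, 5, labels=False)

print(list(bins))
print(qcut_bins.tolist())
```

Once the two steps are separated, nothing forces the data used in step 1 to be the same as the data used in step 2, which is exactly the flexibility needed for NYSE breakpoints.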

**How? Small Custom Function and a Groupby/Apply**

+ I'll call the function, `nyse_qcut`.<br><br>

+ `nyse_qcut` is called by the groupby; so the input is all the observations as a DataFrame for a given month.<br><br>

+ It then (1) computes the breakpoints using just NYSE stocks and (2) feeds those breakpoints into `searchsorted`, which computes which bin a given `me` (market-cap observation) belongs in.<br><br>


In [7]:
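def nyse_qcut(x,q=np.arange(.2,1,.2)):
    # Breakpoints: quantiles of `me` computed from NYSE stocks only (excd == 1);
    # searchsorted then assigns every stock in x (all exchanges) to a bin
    bins = x.query("excd == 1")['me'].quantile(q).searchsorted(x['me'])
    return pd.DataFrame(bins,index=x.index)

# print(nyse_qcut(jstk))
# This returns an n x 1 DataFrame (the default q gives quintiles: bins 0-4)

# Group by date so the breakpoints are recomputed each June;
# the q passed here gives deciles (bins 0-9)
jstk['port'] = (jstk.groupby('caldt',group_keys=False)[['excd','me']]
                .apply(nyse_qcut,np.arange(.1,1,.1)))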

jstk

        0
0       0
1       0
2       0
3       0
4       0
...    ..
307201  4
307202  4
307203  4
307204  4
307205  4

[307206 rows x 1 columns]


Unnamed: 0,permno,caldt,excd,prc,me,ret,prclag,melag,port
0,10000,1986-06-30,3,3.09375,11.73460,-0.005025,3.10938,11.79390,0
1,10001,1986-06-30,3,6.12500,6.03313,-0.013069,6.31250,6.21781,0
2,10001,1987-06-30,3,5.87500,5.82212,0.051429,5.68750,5.63631,0
3,10001,1988-06-30,3,6.25000,6.20000,-0.012039,6.43750,6.38600,0
4,10001,1989-06-30,3,7.00000,7.00700,0.017143,7.00000,6.98600,0
...,...,...,...,...,...,...,...,...,...
307201,93436,2018-06-29,3,342.95000,58478.50000,0.204474,284.73000,48345.40000,9
307202,93436,2019-06-28,3,223.46000,40025.70000,0.206848,185.16000,32823.30000,9
307203,93436,2020-06-30,3,1079.81000,200845.00000,0.293186,835.00000,154785.00000,9
307204,93436,2021-06-30,3,679.70000,668827.00000,0.087137,625.22000,602293.00000,9


<br>

**General Pandas Programming Philosophy**

+ Often you'll need to do something you've done before or something pretty standard with a twist. Binning or cutting data is a standard data analysis task.<br><br>
    
+ But computing bins from a sub-sample and applying them to a larger or whole sample doesn't work with `pd.qcut`. The function doesn't have that flexibility built in.<br><br> 

+ So how do I proceed in this situation? I like to think about the conceptual parts of something like `pd.qcut`, and then think about how I can do each part in Pandas. Often coding up the sub-parts opens up more flexibility. It does in this case.<br><br> 


**Merge Back into monthly dataframe**

Merge the June portfolio assignments (the `port` variable, based on the June breakpoints) into the monthly stock dataframe. But, note, we want those assignments to be in effect from the next month (July) through the following June, so the merge back into the monthly dataframe has some complications. Think about it conceptually as a three-step process:

1. Create a date variable (`jdt`) in the `jstk` dataframe by truncating `caldt` to the year-month level. When we merge, we will slice the dataframe so it only includes the columns we need: `permno`, `jdt`, and `port`.<br><br>

2. Create a date variable (`jdt`) in the monthly dataframe (`df`) that reflects the timing we want for the breakpoints. This is a little tricky, but we'll walk through it. This sets up the merge that maps annual data into monthly data.<br><br>

3. Merge the June dataframe (`jstk`) into the full monthly dataframe (`df`).<br><br>

**Step 1:**

In [8]:
jstk['jdt'] = jstk['caldt'].values.astype('datetime64[M]')
jstk

Unnamed: 0,permno,caldt,excd,prc,me,ret,prclag,melag,port,jdt
0,10000,1986-06-30,3,3.09375,11.73460,-0.005025,3.10938,11.79390,0,1986-06-01
1,10001,1986-06-30,3,6.12500,6.03313,-0.013069,6.31250,6.21781,0,1986-06-01
2,10001,1987-06-30,3,5.87500,5.82212,0.051429,5.68750,5.63631,0,1987-06-01
3,10001,1988-06-30,3,6.25000,6.20000,-0.012039,6.43750,6.38600,0,1988-06-01
4,10001,1989-06-30,3,7.00000,7.00700,0.017143,7.00000,6.98600,0,1989-06-01
...,...,...,...,...,...,...,...,...,...,...
307201,93436,2018-06-29,3,342.95000,58478.50000,0.204474,284.73000,48345.40000,9,2018-06-01
307202,93436,2019-06-28,3,223.46000,40025.70000,0.206848,185.16000,32823.30000,9,2019-06-01
307203,93436,2020-06-30,3,1079.81000,200845.00000,0.293186,835.00000,154785.00000,9,2020-06-01
307204,93436,2021-06-30,3,679.70000,668827.00000,0.087137,625.22000,602293.00000,9,2021-06-01
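
The truncation in Step 1 can be checked on a couple of made-up dates. The sketch below (the series `d` is hypothetical) shows the NumPy cast used above next to a pandas-only alternative based on `to_period`; the alternative is offered as an assumption about an equivalent route, not as the method used in these notes.

```python
import pandas as pd

# Hypothetical June month-end dates
d = pd.Series(pd.to_datetime(['2018-06-29', '2020-06-30']))

# NumPy route: casting to month resolution drops the day
trunc_np = pd.Series(d.values.astype('datetime64[M]'))

# pandas route: convert to a monthly period, then back to a month-start timestamp
trunc_pd = d.dt.to_period('M').dt.to_timestamp()

print(trunc_np.dt.strftime('%Y-%m-%d').tolist())
print(trunc_pd.dt.strftime('%Y-%m-%d').tolist())
```

Either way, every June observation ends up with the same key (the first of June) regardless of which trading day the month actually ended on.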


**Step 2**

In [9]:
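def june_timing(date):
    mo = date.dt.month
    # Truncate each date to the start of its calendar year
    yeardt = date.values.astype('datetime64[Y]')
    # January-June observations use the previous year's June
    yeardt[mo <= 6] = (yeardt - np.timedelta64(1,'Y'))[mo <= 6]

    # Move from January 1 to June 1 of the assigned year
    return yeardt.astype('datetime64[M]') + np.timedelta64(5,'M')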

In [10]:
# This is really clever!
df['jdt'] = june_timing(df['caldt'])
df

Unnamed: 0,permno,caldt,excd,prc,me,ret,prclag,melag,jdt
0,10000,1986-01-31,3,4.37500,16.1000,,,,1985-06-01
1,10000,1986-02-28,3,3.25000,11.9600,-0.257143,4.3750,16.100,1985-06-01
2,10000,1986-03-31,3,4.43750,16.3300,0.365385,3.2500,11.960,1985-06-01
3,10000,1986-04-30,3,4.00000,15.1720,-0.098592,4.4375,16.330,1985-06-01
4,10000,1986-05-30,3,3.10938,11.7939,-0.222656,4.0000,15.172,1985-06-01
...,...,...,...,...,...,...,...,...,...
3679107,93436,2022-05-31,3,758.26000,785565.0000,-0.129197,870.7600,902116.000,2021-06-01
3679108,93436,2022-06-30,3,673.42000,701030.0000,-0.111888,758.2600,785565.000,2021-06-01
3679109,93436,2022-07-29,3,891.45000,931111.0000,0.323765,673.4200,701030.000,2022-06-01
3679110,93436,2022-08-31,3,275.61000,863616.0000,-0.072489,891.4500,931111.000,2022-06-01
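
To see the timing rule on its own, here is a small sketch that reproduces the same mapping with plain pandas arithmetic on a handful of made-up dates. The names `caldt`, `june_year`, and `check` are hypothetical, and this is an alternative way to write the rule for checking purposes, not the function used above.

```python
import pandas as pd

# Hypothetical month-end dates around the June cutoff
caldt = pd.Series(pd.to_datetime(['2021-06-30', '2021-07-30', '2021-12-31',
                                  '2022-01-31', '2022-06-30', '2022-07-29']))

# January-June observations belong to the previous year's June;
# July-December observations belong to the current year's June
june_year = caldt.dt.year - (caldt.dt.month <= 6).astype(int)

# Assemble June 1 of the assigned year as the merge key
jdt = pd.to_datetime(pd.DataFrame({'year': june_year, 'month': 6, 'day': 1}))

check = pd.DataFrame({'caldt': caldt, 'jdt': jdt})
print(check)
```

Note that a June observation maps to the *previous* June, so the portfolio assignment formed in June first takes effect in July.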


**Step 3**

In [11]:
df = df.merge(jstk[['permno','jdt','port']],on=['permno','jdt'],
              how='left')
df

Unnamed: 0,permno,caldt,excd,prc,me,ret,prclag,melag,jdt,port
0,10000,1986-01-31,3,4.37500,16.1000,,,,1985-06-01,
1,10000,1986-02-28,3,3.25000,11.9600,-0.257143,4.3750,16.100,1985-06-01,
2,10000,1986-03-31,3,4.43750,16.3300,0.365385,3.2500,11.960,1985-06-01,
3,10000,1986-04-30,3,4.00000,15.1720,-0.098592,4.4375,16.330,1985-06-01,
4,10000,1986-05-30,3,3.10938,11.7939,-0.222656,4.0000,15.172,1985-06-01,
...,...,...,...,...,...,...,...,...,...,...
3679107,93436,2022-05-31,3,758.26000,785565.0000,-0.129197,870.7600,902116.000,2021-06-01,9.0
3679108,93436,2022-06-30,3,673.42000,701030.0000,-0.111888,758.2600,785565.000,2021-06-01,9.0
3679109,93436,2022-07-29,3,891.45000,931111.0000,0.323765,673.4200,701030.000,2022-06-01,9.0
3679110,93436,2022-08-31,3,275.61000,863616.0000,-0.072489,891.4500,931111.000,2022-06-01,9.0
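
When merging annual data into monthly data, it can help to check that the keys behave as expected. The sketch below uses two made-up dataframes (`monthly` and `june` are hypothetical) and the `validate` and `indicator` options of `merge`: `validate='m:1'` raises an error if the right-hand keys are not unique, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Hypothetical monthly data with the June-timing key already built
monthly = pd.DataFrame({
    'permno': [1, 1, 1, 2],
    'jdt': pd.to_datetime(['2021-06-01', '2021-06-01', '2022-06-01', '2021-06-01']),
    'ret': [0.01, -0.02, 0.03, 0.00],
})

# Hypothetical June data: one row per permno and June date
june = pd.DataFrame({
    'permno': [1, 1],
    'jdt': pd.to_datetime(['2021-06-01', '2022-06-01']),
    'port': [3, 4],
})

# Many monthly rows map to one June row; unmatched rows keep NaN in `port`
merged = monthly.merge(june, on=['permno', 'jdt'], how='left',
                       validate='m:1', indicator=True)
n_unmatched = (merged['_merge'] == 'left_only').sum()

print(merged)
print(n_unmatched)
```

Because unmatched rows get `NaN`, the integer `port` column becomes a float column after a left merge, which is why the portfolio labels display as `9.0` rather than `9`.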


<br>

**Create the Portfolios**

Now that we've merged the annual `port` variable into the monthly data, we can create the portfolios exactly as we did before, except now the portfolios reflect annual rebalancing (in June) instead of monthly rebalancing.<br><br>

In [15]:
df = df.query("port == port and ret == ret").reset_index(drop=True)

ew = (df.groupby(['caldt','port'])['ret'].mean().unstack(level='port')
      .rename(lambda x: f'p{x:.0f}',axis='columns')*100) # Equal-weight returns  
ew.tail(10)

caldt,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9
2021-12-31,-4.760885,-2.232212,1.263813,1.454605,2.132335,2.041614,3.219649,4.04599,3.44629,4.936194
2022-01-31,-7.817011,-10.174559,-10.143785,-10.257562,-9.47982,-8.240967,-9.212675,-7.740304,-6.881608,-5.508781
2022-02-28,-0.866369,-2.013568,0.106025,0.384422,1.184038,0.289965,0.160548,1.012536,-1.011504,-2.141857
2022-03-31,4.277792,0.411128,-0.524007,2.345547,-0.128169,2.242386,-0.676826,1.186189,1.982554,2.07604
2022-04-29,-11.500525,-12.57087,-11.743625,-13.08809,-11.61081,-9.444027,-9.89884,-9.06802,-8.805458,-8.513888
2022-05-31,-4.551499,-1.683119,-2.31095,-1.044584,0.754859,-1.688285,-0.272908,0.559415,-1.067643,0.030056
2022-06-30,-6.528663,-6.034139,-4.35676,-7.560135,-8.118164,-7.814318,-8.609463,-9.825639,-9.398569,-8.390065
2022-07-29,4.703196,6.653972,11.077299,11.281194,11.277752,10.106064,10.245549,10.564318,10.321818,7.802988
2022-08-31,2.108549,-0.086811,-0.113494,-0.877827,-2.217359,-2.817142,-3.120407,-2.306207,-2.740208,-2.558875
2022-09-30,-13.042101,-7.448081,-10.604845,-9.365124,-10.230702,-10.101594,-9.111387,-9.766688,-9.429771,-8.633464
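
The filter `port == port and ret == ret` used in the cell above relies on the fact that `NaN` is never equal to itself, so the comparison is `False` exactly for missing values. A minimal sketch on made-up data (the `toy` dataframe is hypothetical) shows that it keeps the same rows as `dropna` on those two columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing portfolio label and a missing return
toy = pd.DataFrame({'port': [0., 1., np.nan, 1.],
                    'ret':  [0.01, np.nan, 0.02, 0.03]})

# NaN != NaN, so each self-comparison is False where the value is missing
a = toy.query("port == port and ret == ret").reset_index(drop=True)

# Equivalent filter written with dropna
b = toy.dropna(subset=['port', 'ret']).reset_index(drop=True)

print(a)
print(a.equals(b))
```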


In [16]:
from finance_byu.summarize import summary
summary(ew).loc[['mean','std','tstat'],].round(3)

port,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9
mean,1.838,1.307,1.259,1.192,1.145,1.15,1.069,1.052,1.002,0.884
std,10.615,9.089,8.267,7.784,7.265,6.992,6.607,6.305,5.925,5.399
tstat,5.883,4.885,5.177,5.202,5.355,5.588,5.498,5.668,5.749,5.564


<br>

**Stocks in Each Portfolio**

+ The smallest deciles will have a lot more stocks in them on average.<br><br>

+ The distribution of size (market-cap) is much different for NYSE versus Nasdaq or Amex.

In [17]:
(df.groupby(['caldt','port'])['ret'].count().unstack(level='port')
      .rename(lambda x: f'p{x:.0f}',axis='columns').mean())

port
p0    1293.287446
p1     370.727273
p2     259.585281
p3     210.797403
p4     181.090043
p5     159.348918
p6     147.071861
p7     140.013853
p8     132.516017
p9     129.027706
dtype: float64

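The output above reports the average number of stocks in each portfolio.

The uneven counts are due to the difference between the size distributions of NYSE and Nasdaq stocks: the breakpoints come from NYSE stocks only, so the many small Nasdaq (and Amex) stocks land in the lowest deciles. Economically, you would rather have breakpoints that represent the NYSE size distribution than ones driven by tiny Nasdaq stocks.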