**Fin 585R**  
**Diether**  
**Problem Set**  
**Analyst Dispersion Portfolios**  

**Purpose/Goal**

The primary purpose of this problem set is to give you a portfolio formation task that makes you go through the first four steps of our portfolio formation framework.

1. Data Preparation.<br><br>

2. Create portfolio formation or criterion variable.<br><br>

3. Bin the data based on the formation variable.<br><br>

4. Portfolio creation using the bins.<br><br>

5. Test the historical performance and/or test a model.<br><br>

A secondary goal is to introduce another interesting portfolio strategy. It produces a large spread in average returns, and we will use it later in testing models like the CAPM.

To accomplish the programming tasks, you should be able to adapt a lot of code we've used before and apply it to this situation.

**Overview**

In this problem set you reproduce another seminal empirical result in academic finance. Specifically, you reproduce the **dispersion effect** (or the analyst disagreement effect) of Diether, Malloy, and Scherbina (2002). This empirical result spawned a large literature in academic finance, and certainly some quant funds have tried to trade on this effect.

Dispersion (or analyst disagreement) portfolios are formed based on the standard deviation of analyst EPS (earnings per share) forecasts over a given period. Here the standard deviation of analyst EPS forecasts is the standard deviation across analysts for a given stock and month (most stocks have between 3 and 13 analysts covering them). Diether, Malloy, and Scherbina (DMS) don't use the raw standard deviation. Instead, they scale the standard deviation of analyst forecasts by the absolute value of the mean forecast. Therefore, for a given month ($t$), dispersion for stock $i$ is defined as follows:

$$
disp_{it} = \frac{stdev_{it}}{|mean_{it}|}
$$

DMS form dispersion portfolios using $disp_{i,t-1}$; in other words, they lag dispersion one month. In this homework you will do the same. Additionally, you will form dispersion portfolios based on dispersion lagged three months.
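As a minimal sketch of the definition and the two lags, the toy example below computes dispersion for one hypothetical stock and then lags it one and three months. The data values and the column name `displag_3` are made up for illustration; the sketch also assumes the rows are already sorted by date within each stock and that there are no missing months.

```python
import numpy as np
import pandas as pd

# Toy IBES-style data for one hypothetical stock (values are made up).
toy = pd.DataFrame({
    "cusip": ["AAA"] * 4,
    "mdt": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01"]),
    "meanest": [2.0, -1.0, 4.0, 5.0],
    "stdev": [0.5, 0.2, 0.4, 1.0],
})

# disp = stdev / |mean forecast|; the absolute value keeps dispersion
# positive even when the mean forecast is negative.
toy["disp"] = toy["stdev"] / toy["meanest"].abs()

# Lag within each stock by one row and by three rows
# (equal to one and three months only if no months are missing).
toy["displag"] = toy.groupby("cusip")["disp"].shift(1)
toy["displag_3"] = toy.groupby("cusip")["disp"].shift(3)

print(toy)
```

The first row of `displag` and the first three rows of `displag_3` are missing because no earlier observation exists for them.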

There are two datasets for this problem set. The first is the CRSP data (security prices and returns) for the period from January 1982 to December 2000. The second is the analyst earnings per share data from IBES. It also covers the period from January 1982 to December 2000. The frequency for both datasets is monthly. The stock-level identifier in the IBES data is called a CUSIP. Consequently, I also included CUSIPs in the CRSP data. The CUSIP and the calendar month uniquely identify the analyst earnings per share observations.

You can download the CRSP data directly using the following link: [the CRSP data](http://diether.org/prephd/08-mstk_82-00.csv). There is also a link on *Learning Suite*. The data contain the following variables:

|Variable | Description                                              |
|---------|----------------------------------------------------------|
|permno   | stock identifier                                         |
|cusip    | stock identifier also in IBES data                       |
|caldt    | calendar date (the day is not truncated to 1)            |
|ret      | monthly return                                           |
|prc      | stock price (not lagged, contemporaneous with returns)   |   


You can download the IBES data directly using the following link: [the IBES data](http://diether.org/prephd/08-ibes_eps_analyst.csv). There is also a link on *Learning Suite*. The data contain the following variables:

|Variable | Description                                          |
|---------|------------------------------------------------------|
|cusip    | stock identifier (also in the CRSP data)             |
|caldt    | calendar date (the day is not truncated to 1)        |
|meanest  | average analyst forecast for that month/stock        |
|stdev    | standard deviation of forecasts for that month/stock |


**Tasks**

1. Form quintile-based equal-weight dispersion portfolios where dispersion is lagged one month. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio). Note, you should exclude low-price stocks from your portfolios (price below $5). <br><br>

2. Add a spread portfolio to your dataframe of dispersion portfolios. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio).<br><br>

3. Compute the average number of stocks that are in each portfolio.<br><br>

4. Form quintile-based equal-weight dispersion portfolios where dispersion is lagged three months instead of one. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio). Note, you should exclude low-price stocks from your portfolios (price below $5).<br><br>

5. Compare the results from (1) and (4). What do either the differences or similarities in the average return pattern tell you about the nature of this dispersion effect?

In [1]:
import pandas as pd
import numpy as np

In [2]:
stk = pd.read_csv('08-mstk_82-00.csv',parse_dates=['caldt'])
stk.head(5)

Unnamed: 0,permno,caldt,cusip,ret,prc,me
0,10000,1986-01-31,68391610,,4.375,16.1
1,10000,1986-02-28,68391610,-0.257143,3.25,11.96
2,10000,1986-03-31,68391610,0.365385,4.4375,16.33
3,10000,1986-04-30,68391610,-0.098592,4.0,15.172
4,10000,1986-05-30,68391610,-0.222656,3.10938,11.7939


In [3]:
ibes = pd.read_csv("08-ibes_eps_analyst.csv",parse_dates=['caldt'])
ibes.head(5)

Unnamed: 0,cusip,caldt,meanest,stdev
0,117,1982-01-14,15.36,0.78
1,117,1982-02-18,15.18,0.78
2,117,1982-03-18,15.07,0.66
3,117,1982-04-15,15.06,0.7
4,117,1982-05-20,14.78,0.71


**Hint About Merging the two Datasets**

In the datasets I've included the full calendar dates of the observations. Even though the frequency for both is monthly, the timing is not the same. The CRSP data are from the last trading day of the month, and the IBES data tend to fall around the middle of the month. Therefore, to merge these dataframes you need to create a new date variable that only preserves uniqueness at the year-month level. Here is a shortcut way to accomplish that:

In [4]:
stk['mdt'] = stk['caldt'].values.astype('datetime64[M]')
stk

Unnamed: 0,permno,caldt,cusip,ret,prc,me,mdt
0,10000,1986-01-31,68391610,,4.375000,16.10000,1986-01-01
1,10000,1986-02-28,68391610,-0.257143,3.250000,11.96000,1986-02-01
2,10000,1986-03-31,68391610,0.365385,4.437500,16.33000,1986-03-01
3,10000,1986-04-30,68391610,-0.098592,4.000000,15.17200,1986-04-01
4,10000,1986-05-30,68391610,-0.222656,3.109380,11.79390,1986-05-01
...,...,...,...,...,...,...,...
1443150,93324,1985-07-31,98984810,-0.125000,0.109375,1.60945,1985-07-01
1443151,93324,1985-08-30,98984810,-0.142857,0.093750,1.37953,1985-08-01
1443152,93324,1985-09-30,98984810,-0.166667,0.078125,1.14961,1985-09-01
1443153,93324,1985-10-31,98984810,0.400000,0.109375,1.60945,1985-10-01


In [5]:
ibes['mdt'] = ibes['caldt'].values.astype('datetime64[M]')
ibes

Unnamed: 0,cusip,caldt,meanest,stdev,mdt
0,00000117,1982-01-14,15.36,0.78,1982-01-01
1,00000117,1982-02-18,15.18,0.78,1982-02-01
2,00000117,1982-03-18,15.07,0.66,1982-03-01
3,00000117,1982-04-15,15.06,0.70,1982-04-01
4,00000117,1982-05-20,14.78,0.71,1982-05-01
...,...,...,...,...,...
714692,Y8564W10,2000-08-17,4.95,0.48,2000-08-01
714693,Y8564W10,2000-09-14,5.44,0.23,2000-09-01
714694,Y8564W10,2000-10-19,5.56,0.15,2000-10-01
714695,Y8564W10,2000-11-16,6.00,0.24,2000-11-01


What is the code above doing? Pandas stores dates with nanosecond precision by default. But numpy (the library pandas uses for its date functionality) also includes date types with coarser levels of precision (including monthly). So the above code converts the original nanosecond datetime type to a monthly datetime type; this discards all information finer than the month, and when pandas automatically converts the dates back to its own datetime type, the day is set to one for all observations.
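To see the truncation in isolation, here is a small sketch on two made-up dates. It shows the numpy cast described above and, as an alternative that is not used in this notebook, a pandas-only route through monthly periods. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# One end-of-month (CRSP-style) and one mid-month (IBES-style) date.
dates = pd.Series(pd.to_datetime(["1986-01-31", "1982-01-14"]))

# Route 1: cast to numpy's month-precision dtype, then back to day precision.
# Everything finer than the month is dropped, so the day comes back as 1.
month_start_np = dates.values.astype("datetime64[M]").astype("datetime64[D]")

# Route 2 (alternative): convert to monthly periods, then to the period start.
month_start_pd = dates.dt.to_period("M").dt.to_timestamp()

print(month_start_np)
print(month_start_pd)
```

Both routes map every date in a month to the first day of that month, which is what makes the year-month merge key line up across the two datasets.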

Now you should be able to merge the two datasets.

In [6]:
mdf = pd.merge(stk, ibes, on = ["cusip","mdt"], how = "left")
mdf1 = mdf.copy()
#mdf = mdf.dropna(how = "any").reset_index(drop = True)
mdf

Unnamed: 0,permno,caldt_x,cusip,ret,prc,me,mdt,caldt_y,meanest,stdev
0,10000,1986-01-31,68391610,,4.375000,16.10000,1986-01-01,NaT,,
1,10000,1986-02-28,68391610,-0.257143,3.250000,11.96000,1986-02-01,NaT,,
2,10000,1986-03-31,68391610,0.365385,4.437500,16.33000,1986-03-01,NaT,,
3,10000,1986-04-30,68391610,-0.098592,4.000000,15.17200,1986-04-01,NaT,,
4,10000,1986-05-30,68391610,-0.222656,3.109380,11.79390,1986-05-01,NaT,,
...,...,...,...,...,...,...,...,...,...,...
1443150,93324,1985-07-31,98984810,-0.125000,0.109375,1.60945,1985-07-01,NaT,,
1443151,93324,1985-08-30,98984810,-0.142857,0.093750,1.37953,1985-08-01,NaT,,
1443152,93324,1985-09-30,98984810,-0.166667,0.078125,1.14961,1985-09-01,NaT,,
1443153,93324,1985-10-31,98984810,0.400000,0.109375,1.60945,1985-10-01,NaT,,
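Because a left merge keeps every CRSP row whether or not it finds an IBES match, it can be worth checking how many rows actually matched. The sketch below uses toy frames with hypothetical values and the `indicator` and `validate` options of `pd.merge`; it is one way to sanity-check a merge, not a required step.

```python
import pandas as pd

# Toy stand-ins for the CRSP and IBES frames (values are made up).
left = pd.DataFrame({
    "cusip": ["AAA", "AAA", "BBB"],
    "mdt": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-01-01"]),
    "ret": [0.01, -0.02, 0.03],
})
right = pd.DataFrame({
    "cusip": ["AAA", "AAA"],
    "mdt": pd.to_datetime(["2000-01-01", "2000-02-01"]),
    "meanest": [1.0, 1.1],
    "stdev": [0.1, 0.2],
})

# indicator=True adds a `_merge` column ("both" or "left_only" here);
# validate="m:1" raises an error if (cusip, mdt) is not unique on the right.
checked = pd.merge(left, right, on=["cusip", "mdt"], how="left",
                   indicator=True, validate="m:1")
match_counts = checked["_merge"].value_counts()

print(match_counts)
```

If almost every row came back as `left_only`, that would suggest a problem with the merge keys (for example, identifiers stored with different types or formats in the two files) rather than genuinely missing analyst coverage.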


In [7]:
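# Preparing the dataset
# Dispersion: standard deviation of forecasts scaled by |mean forecast|
mdf["disp"] = mdf["stdev"] / np.abs(mdf["meanest"])
# Lag dispersion and price one row within each stock
# (assumes rows are sorted by date within cusip)
mdf["displag"] = mdf.groupby('cusip')["disp"].shift()
mdf["prclag"] = mdf.groupby('cusip')["prc"].shift()
# Keep rows with non-missing lagged dispersion (NaN == NaN is False)
# and a lagged price of at least $5
mdf = mdf.query("displag == displag and prclag >= 5").reset_index(drop = True)
mdf = mdf[["cusip", "mdt", "ret", "meanest", "stdev", "disp", "displag", "prclag"]]
mdf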

Unnamed: 0,cusip,mdt,ret,meanest,stdev,disp,displag,prclag
0,39040610,1990-05-01,-0.012658,1.05,0.07,0.066667,0.140000,9.8750
1,39040610,1990-06-01,0.014103,1.10,0.14,0.127273,0.066667,9.7500
2,39040610,1990-07-01,0.025641,1.10,0.14,0.127273,0.127273,9.7500
3,39040610,1990-08-01,-0.050000,1.05,0.08,0.076190,0.127273,10.0000
4,39040610,1990-09-01,0.040789,,,,0.076190,9.5000
...,...,...,...,...,...,...,...,...
585075,98373510,1991-08-01,0.225000,0.40,0.10,0.250000,0.170732,10.0000
585076,98373510,1991-09-01,-0.071429,0.38,0.11,0.289474,0.250000,12.2500
585077,98373510,1991-10-01,-0.065934,0.38,0.11,0.289474,0.289474,11.3750
585078,98927210,1987-09-01,0.053192,0.22,0.04,0.181818,0.045455,5.8750
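One caveat about the lagging step: `shift()` lags by row, not by calendar month, so if a stock is missing a month the "one-month" lag silently reaches back further. The sketch below contrasts the row-based lag with a calendar-based lag built by moving each observation forward one month and merging it back. The toy data and the names `displag_shift` and `displag_cal` are hypothetical, and whether gaps matter in the actual data is something you would need to check.

```python
import pandas as pd

# Toy data for one hypothetical stock; March 2000 is missing on purpose.
toy = pd.DataFrame({
    "cusip": ["AAA"] * 3,
    "mdt": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-04-01"]),
    "disp": [0.10, 0.20, 0.40],
})

# Row-based lag: takes the previous row regardless of the gap.
toy["displag_shift"] = toy.groupby("cusip")["disp"].shift(1)

# Calendar-based lag: relabel each observation with the following month,
# then merge it back on (cusip, mdt).
lagged = toy[["cusip", "mdt", "disp"]].copy()
lagged["mdt"] = lagged["mdt"] + pd.DateOffset(months=1)
lagged = lagged.rename(columns={"disp": "displag_cal"})
toy = toy.merge(lagged, on=["cusip", "mdt"], how="left")

print(toy)
```

For April, the row-based lag returns February's value (0.20), while the calendar-based lag is missing because there is no March observation.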


1. Form quintile-based equal-weight dispersion portfolios where dispersion is lagged one month. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio). Note, you should exclude low-price stocks from your portfolios (price below $5). <br><br>

In [8]:
mdf['bins'] = mdf.groupby("mdt")['displag'].transform(pd.qcut,5,labels=False)
mdf

Unnamed: 0,cusip,mdt,ret,meanest,stdev,disp,displag,prclag,bins
0,39040610,1990-05-01,-0.012658,1.05,0.07,0.066667,0.140000,9.8750,3
1,39040610,1990-06-01,0.014103,1.10,0.14,0.127273,0.066667,9.7500,2
2,39040610,1990-07-01,0.025641,1.10,0.14,0.127273,0.127273,9.7500,3
3,39040610,1990-08-01,-0.050000,1.05,0.08,0.076190,0.127273,10.0000,3
4,39040610,1990-09-01,0.040789,,,,0.076190,9.5000,3
...,...,...,...,...,...,...,...,...,...
585075,98373510,1991-08-01,0.225000,0.40,0.10,0.250000,0.170732,10.0000,3
585076,98373510,1991-09-01,-0.071429,0.38,0.11,0.289474,0.250000,12.2500,4
585077,98373510,1991-10-01,-0.065934,0.38,0.11,0.289474,0.289474,11.3750,4
585078,98927210,1987-09-01,0.053192,0.22,0.04,0.181818,0.045455,5.8750,1
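The binning step sorts stocks into quintiles separately within each month, so the breakpoints change from month to month. The toy sketch below (made-up values, ten stocks per month) shows that `pd.qcut` inside a per-month `transform` assigns bin 0 to the lowest-dispersion stocks and bin 4 to the highest in each month.

```python
import numpy as np
import pandas as pd

# Two months with ten hypothetical stocks each; the second month has
# dispersion values on a different scale and in reverse order.
toy = pd.DataFrame({
    "mdt": pd.to_datetime(["2000-01-01"] * 10 + ["2000-02-01"] * 10),
    "displag": np.r_[np.arange(1, 11) / 100, np.arange(10, 0, -1) / 10],
})

# Quintile breakpoints are computed within each month.
toy["bins"] = toy.groupby("mdt")["displag"].transform(
    lambda x: pd.qcut(x, 5, labels=False)
)

# Number of stocks in each month/bin cell.
sizes = toy.groupby(["mdt", "bins"]).size().unstack()
print(sizes)
```

Each month ends up with two stocks per bin, even though the two months have very different dispersion levels.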


In [9]:
ew = mdf.groupby(['mdt', 'bins'])['ret'].mean().unstack() * 100
ew.columns = ["p"+str(col) for col in ew.columns]
from finance_byu.summarize import summary
summary(ew).round(2)

Unnamed: 0,p0,p1,p2,p3,p4
count,227.0,227.0,227.0,227.0,227.0
mean,1.6,1.44,1.34,1.22,0.82
std,4.78,4.85,5.18,5.62,6.46
tstat,5.03,4.48,3.9,3.27,1.91
pval,0.0,0.0,0.0,0.0,0.06
min,-25.76,-25.09,-26.94,-29.43,-32.22
25%,-1.21,-1.63,-1.92,-1.9,-2.63
50%,1.82,1.63,1.95,1.7,1.36
75%,4.61,4.64,4.85,4.8,4.48
max,13.21,13.17,14.28,13.95,19.53


2. Add a spread portfolio to your dataframe of dispersion portfolios. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio).<br><br>

In [10]:
ew["spread"] = ew["p4"] - ew["p0"]
summary(ew).round(2)

Unnamed: 0,p0,p1,p2,p3,p4,spread
count,227.0,227.0,227.0,227.0,227.0,227.0
mean,1.6,1.44,1.34,1.22,0.82,-0.78
std,4.78,4.85,5.18,5.62,6.46,3.54
tstat,5.03,4.48,3.9,3.27,1.91,-3.31
pval,0.0,0.0,0.0,0.0,0.06,0.0
min,-25.76,-25.09,-26.94,-29.43,-32.22,-14.92
25%,-1.21,-1.63,-1.92,-1.9,-2.63,-2.6
50%,1.82,1.63,1.95,1.7,1.36,-0.85
75%,4.61,4.64,4.85,4.8,4.48,0.7
max,13.21,13.17,14.28,13.95,19.53,19.66
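The `tstat` row can be checked by hand. The sketch below assumes that `summary` reports the standard one-sample t-statistic for a zero mean, that is, the mean divided by its standard error; that is an assumption about the library, and `tstat_zero_mean` is a hypothetical helper, not part of `finance_byu`.

```python
import numpy as np
import pandas as pd

def tstat_zero_mean(x: pd.Series) -> float:
    """One-sample t-statistic for H0: mean = 0 (hypothetical helper)."""
    x = x.dropna()
    return x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

# Check on a tiny made-up series: mean 2, sample std 1, n = 3.
toy_t = tstat_zero_mean(pd.Series([1.0, 2.0, 3.0]))

# Rebuild the spread t-stat from the rounded values in the table above:
# mean = -0.78, std = 3.54, count = 227.
reported_t = -0.78 / (3.54 / np.sqrt(227))

print(round(toy_t, 3), round(reported_t, 2))
```

The rebuilt value is about -3.32, close to the -3.31 in the table; the small gap is what you would expect from using rounded inputs.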


3. Compute the average number of stocks that are in each portfolio.<br><br>

In [11]:
mdf.groupby(['mdt', 'bins'])['ret'].count().unstack(level = 'bins').rename('p{:.0f}'.format, axis = 'columns').mean()

bins
p0    516.933921
p1    515.281938
p2    515.092511
p3    514.814978
p4    515.132159
dtype: float64

4. Form quintile-based equal-weight dispersion portfolios where dispersion is lagged three months instead of one. Report summary statistics (including a t-test of whether the average return is statistically different from zero for each portfolio). Note, you should exclude low-price stocks from your portfolios (price below $5).<br><br>

In [14]:
# Start from the untouched merged copy (mdf1) and rebuild dispersion
mdf1["disp"] = mdf1["stdev"] / np.abs(mdf1["meanest"])
# Lag dispersion three months; the price filter still uses a one-month lag
mdf1["displag_3"] = mdf1.groupby("cusip")["disp"].shift(3)
mdf1["prclag"] = mdf1.groupby("cusip")["prc"].shift(1)
mdf1 = mdf1.query("displag_3 == displag_3 and prclag >= 5").reset_index(drop = True)
mdf1['bins'] = mdf1.groupby('mdt')['displag_3'].transform(pd.qcut,5,labels=False)
# Build the portfolios from mdf1 (the three-month-lag data), not mdf
ew = mdf1.groupby(['mdt', 'bins'])['ret'].mean().unstack() * 100
ew.columns = ["p"+str(col) for col in ew.columns]
ew["spread"] = ew["p4"] - ew["p0"]
from finance_byu.summarize import summary
summary(ew).round(2)

Unnamed: 0,p0,p1,p2,p3,p4,spread
count,224.0,224.0,224.0,224.0,224.0,224.0
mean,1.49,1.45,1.32,1.17,1.02,-0.47
std,4.7,4.81,5.11,5.5,6.42,3.55
tstat,4.76,4.51,3.86,3.18,2.38,-2.0
pval,0.0,0.0,0.0,0.0,0.02,0.05
min,-25.37,-24.75,-26.78,-29.2,-32.3,-12.67
25%,-1.03,-1.67,-1.95,-1.93,-2.28,-2.1
50%,1.68,1.93,1.82,1.62,1.51,-0.59
75%,4.5,4.49,4.41,4.58,4.7,1.22
max,12.29,12.54,14.62,14.25,20.61,20.76


5. Compare the results from (1) and (4). What do either the differences or similarities in the average return pattern tell you about the nature of this dispersion effect?

The higher the dispersion, the lower the average return. With a one-month lag, average monthly returns fall steadily from 1.60% for the low-dispersion quintile (p0), where analysts largely agree, to 0.82% for the high-dispersion quintile (p4). The spread portfolio (p4 minus p0) therefore has a negative average return of -0.78% per month (t-stat of -3.31). With a three-month lag the pattern is the same, but the spread shrinks in magnitude to -0.47% per month (t-stat of -2.0). This implies that the dispersion effect persists for at least a few months after the forecasts are made, although it weakens somewhat as the signal ages. This tells us that dispersion is a legitimate effect worth analyzing.