# CRSP

- Load CRSP data
- Tickle the data
- Create variables
    + ME
    + Prior

In [None]:
library(data.table)    # read csv much faster than standard function
library(dplyr)         # infinitely nicer grouping operations
library(ggplot2)       # sexy plots

## Load Data

In [None]:
crsp.path = 'C:/Data/CRSP/20171123_CRSP_196001_201612.csv'
# colClasses='character' because fread is upset that our columns
#   contain multiple variable types
# CRSP uses 'C' and 'Z' and some other characters to mean stuff
#   whatever they mean, fread is upset
# fread rather smugly reports how quickly it has read our file
#   if we do not set showProgress=FALSE
crsp = fread(crsp.path, colClasses='character', showProgress=FALSE)

Let's see what we have.

In [None]:
colnames(crsp)

There should be no duplicate `PERMNO`-`date` pairs.

In [None]:
pairs = duplicated(crsp[,c("PERMNO", "date")])
sum(pairs)

In [None]:
if (sum(pairs)>0){
    crsp = crsp[!duplicated(crsp[,c("PERMNO", "date")]),]
}
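
To make the duplicate check concrete, here is a minimal sketch on a toy table. The data frame `toy` and its values are made up for illustration; only `duplicated` is used, as above.

```r
# Hypothetical toy data: the second and fourth rows share a PERMNO-date pair
toy = data.frame(
    PERMNO = c("10001", "10001", "10002", "10001"),
    date   = c("19860131", "19860228", "19860131", "19860228"),
    RET    = c("0.01", "0.02", "0.03", "0.02"),
    stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each pair
dup = duplicated(toy[, c("PERMNO", "date")])
n.dup = sum(dup)

# Keep the first occurrence of every pair
toy.clean = toy[!dup, ]
print(n.dup)
print(toy.clean)
```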

We know we only want firms with a `SHRCD` of 10 or 11 and an
`EXCHCD` of 1, 2, or 3.

In [None]:
crsp = crsp[crsp$SHRCD %in% c("10", "11"), ]
crsp = crsp[crsp$EXCHCD %in% c("1", "2", "3"), ]

For now, we are not bothered about the `TICKER`, `COMNAM`, `CUSIP`, `DLSTCD`, `SPREAD`, or `vwretd`.
`SHRCD` has done its job in the filter above, so it goes too.
We will borrow the value-weighted return from Dr. Ken French.

In [None]:
# The funny '%>%' operator comes from magrittr and is re-exported
#   by dplyr. It has all sorts of magical properties but basically,
#   we can perform grouping and operations on groups in a sane and
#   readable way.
#   Unfortunately, dplyr is so magic it begins to get in the way of
#   automating some of the more interesting procedures.
crsp = crsp %>%
    select(-c(SHRCD, TICKER, COMNAM, CUSIP, DLSTCD, SPREAD, vwretd))

### How many firms do we have?

There are roughly 24K unique `PERMNO` codes.
But by 2016 there are fewer than 4K firms.
We should be mindful of firms dying.
We also see two important dates regarding the exchanges used in
the data.
In 1962 we see the addition of AMEX stocks and in 1973 we see the
addition of NASDAQ stocks.
It is also important to note that NASDAQ stocks are the most common.

In [None]:
length(unique(crsp$PERMNO))

In [None]:
options(repr.plot.width=8, repr.plot.height=3)

dt = crsp %>% group_by(date) %>% summarise(N=n())
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=N)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

On average, there are 4.5K firms each month.
This has implications for when we make buckets.
Fine sorts would be great but we see already there are some
problems with extremely fine sorts.
For example, if we want to make 10x10 sorts,
we will be left with, on average, 45 firms in each bucket.

In [None]:
round(mean(dt$N),0)
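
The arithmetic behind the bucket sizes is simple enough to tabulate. This sketch assumes the rounded average of 4,500 firms quoted above and that firms spread evenly over the buckets, which independent sorts do not guarantee; `n.firms` and `grid` are illustrative names.

```r
n.firms = 4500                 # assumed: rounded monthly average from the text
grid = c(2, 3, 5, 10)          # k in a k-by-k double sort

# Average firms per bucket if firms were spread evenly over k*k buckets
per.bucket = n.firms / grid^2
names(per.bucket) = paste0(grid, "x", grid)
print(per.bucket)
```

In practice the smallest buckets will hold fewer firms than this average, so the even split is an upper bound on how comfortable the thinnest bucket is.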

In [None]:
dt %>% arrange(-N) %>% head

In [None]:
dt = crsp %>% group_by(date, EXCHCD) %>% summarise(N=n())
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=N, color=EXCHCD)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

All our columns are characters because we told `fread` to do this
to prevent it whining.
We can change them into numeric values and remove the annoying
letters at the same time: `as.numeric` turns the letters into `NA`
(creating the ugly `Warning` messages).
CRSP reports `SHROUT` in thousands of shares,
so we divide it by 1,000 to get millions.

In [None]:
str(crsp)

In [None]:
crsp$PRC = as.numeric(crsp$PRC)
crsp$RET = as.numeric(crsp$RET)
crsp$DLRET = as.numeric(crsp$DLRET)
crsp$SHROUT = as.numeric(crsp$SHROUT) / 1000
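
A small sketch of what happens to the letter codes. The vector `raw` is invented for illustration; the point is only that `as.numeric` turns anything that is not a number into `NA`.

```r
# Hypothetical raw values, as strings, the way colClasses='character' delivers them
raw = c("0.0125", "C", "-0.0310", "B", "")

# Letters become NA with a coercion warning; the empty string becomes NA too.
# suppressWarnings() only hides the message, the result is the same.
num = suppressWarnings(as.numeric(raw))
print(num)
print(sum(is.na(num)))
```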

Some of our columns only contain numbers but they are discrete
values.

- `EXCHCD` is only 1, 2 or 3
- `SICCD` is an industry code (SIC), a label rather than a quantity
- `PERMNO` is unique to each firm

So we will leave these values as characters.

We need to grab the year and month from the date.

In [None]:
# Rather annoyingly, this method seems to take unnecessarily long
# crsp$date = as.Date(crsp$date, "%Y%m%d")
# crsp$Year = as.numeric(format(crsp$date, "%Y"))
# crsp$Month = as.numeric(format(crsp$date, "%m"))

# This is faster, I guess 'format' and 'as.Date' take a while
crsp$Year = as.numeric(substr(crsp$date, 1, 4))
crsp$Month = as.numeric(substr(crsp$date, 5, 6))

We want to adjust our returns for any **delisting returns**.

Where the return is missing but we have a delisting return we will
use the delisting return in place of the return.
If the delisting return is also missing,
we will have no return for this month.

Where we have both a return and a delisting return we will
compound them: `(1 + RET) * (1 + DLRET) - 1`.

Where we have neither we will just leave it.

In [None]:
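# Work out which case each row falls in before changing RET.
# Otherwise rows filled with DLRET would be compounded with DLRET again.
has.ret = !is.na(crsp$RET)
has.dlret = !is.na(crsp$DLRET)
# return and delisting return: compound them
ix = has.ret & has.dlret
crsp$RET[ix] = (1+crsp$RET[ix]) * (1+crsp$DLRET[ix]) - 1
# delisting return only: use it as the return
ix = !has.ret & has.dlret
crsp$RET[ix] = crsp$DLRET[ix]

# we do not need DLRET anymore
crsp$DLRET = NULL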
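
The four cases can be checked on a toy example. The values below are invented, and `adj` is an illustrative name. The sketch decides which case each row falls in before changing anything, so a row cannot be adjusted twice.

```r
# Hypothetical rows covering all four cases
toy = data.frame(
    RET   = c(0.10,  NA,    0.10, NA),
    DLRET = c(NA,   -0.30, -0.30, NA)
)

has.ret   = !is.na(toy$RET)
has.dlret = !is.na(toy$DLRET)

adj = toy$RET
# Both present: compound the two returns
both = has.ret & has.dlret
adj[both] = (1 + toy$RET[both]) * (1 + toy$DLRET[both]) - 1
# Only the delisting return present: use it as the return
only.dl = !has.ret & has.dlret
adj[only.dl] = toy$DLRET[only.dl]

print(adj)
```

The third row compounds to `1.10 * 0.70 - 1`, and the last row stays `NA`.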

Often, we form portfolios every 12 months.
We hold portfolios from July to June,
so `HP` labels each month with the year in which its holding period began.

In [None]:
crsp$HP = crsp$Year
crsp$HP[crsp$Month<7] = crsp$HP[crsp$Month<7] - 1
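
As a quick check of the holding-period label, here is a sketch on a few made-up dates. It assumes dates are `YYYYMMDD` strings, as in the file above.

```r
# Hypothetical dates around the July starting point
date = c("19990630", "19990730", "19991231", "20000131", "20000630", "20000731")

Year  = as.numeric(substr(date, 1, 4))
Month = as.numeric(substr(date, 5, 6))

# January to June belong to the holding period that started the previous July
HP = ifelse(Month < 7, Year - 1, Year)
print(data.frame(date, Year, Month, HP))
```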

## Tickle Data

Where did we end up?

Our `PERMNO`, `date`, `EXCHCD` and `SICCD` columns
remain characters.
These are discrete variables so we don't mind.

`RET` has been adjusted for delisting returns
and where it is missing a value we see `NA`.
Returns are given in decimals.
We must multiply them by 100 to see percentage
returns.

`PRC` is the share price and some of them are
negative.
This is because CRSP uses a negative sign (a leading dash) to
flag that no closing price was available.
In that case the price is the average of the bid and ask.
This is not important for our purposes.

`SHROUT` is the number of shares outstanding.
It is in millions after we divided it by 1000.

In [None]:
head(crsp, n=10)

We need to create some variables:

- Market equity
- Prior return

**Market equity** is price times shares outstanding.
$$ME = |PRC| \cdot SHROUT$$
We need to use the absolute price because CRSP uses a negative
sign to indicate that the price was not directly available.

In [None]:
crsp$ME = abs(crsp$PRC) * crsp$SHROUT
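
A toy example of the sign convention and the units. The numbers are invented; `SHROUT` is assumed to already be in millions of shares, as after the division above, so `ME` comes out in millions of dollars.

```r
# Hypothetical prices: the negative one carries the CRSP flag
PRC    = c(25.50, -12.25, NA)
SHROUT = c(10, 4, 3)          # millions of shares (assumed)

# abs() removes the flag; a missing price still gives a missing ME
ME = abs(PRC) * SHROUT
print(ME)
```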

Only about 1% of `ME` values are missing in total.
As many as 8.5% are missing in August 1976.
The months with the highest percentages of missing values are all
in the mid-1970s.
This is just after NASDAQ stocks were added.

In [None]:
# percentage of rows with missing ME (also works when nothing is missing)
round(mean(is.na(crsp$ME)) * 100, 2)

In [None]:
dt = crsp %>% group_by(date) %>% summarise(mssg=sum(is.na(ME))/n())
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=mssg*100)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

In [None]:
# NASDAQ is 100% missing for some months
# This dominates the graph and is not useful
# so we plot the number of missing values instead of the share
dt = crsp %>% group_by(date, EXCHCD) %>% summarise(mssg=sum(is.na(ME)))
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=mssg, color=EXCHCD)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

In [None]:
dt %>% arrange(-mssg) %>% head

Given the plots above about the number of firms on each exchange
and the results below,
we see results will be driven by NYSE and NASDAQ stocks.
There are more NASDAQ stocks but they are small.
There are fewer NYSE stocks but they are large.

In [None]:
dt = crsp %>% group_by(date, EXCHCD) %>%
    summarise(ME=mean(ME, na.rm=TRUE))
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=ME, color=EXCHCD)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

**Prior return** is the sum of the monthly returns from
month t-11 to month t-1.
The return for month t is not included.

In [None]:
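# ri is RET with missing months filled by -0.9999;
#   without the fill a single NA would make the whole sum NA
crsp$ri = crsp$RET
crsp$ri[is.na(crsp$ri)] = -.9999

# lag() looks back by rows, so rows must be in date order within each
#   PERMNO and months are assumed to be consecutive
# PR.OK needs a price 12 months back, a return last month and ME last month
crsp = crsp %>% arrange(PERMNO, date) %>% group_by(PERMNO) %>%
    mutate(
        PR=lag(ri,11)+lag(ri,10)+lag(ri, 9)+lag(ri, 8)+
           lag(ri, 7)+lag(ri, 6)+lag(ri, 5)+lag(ri, 4)+
           lag(ri, 3)+lag(ri, 2)+lag(ri, 1),
        PR.OK=!is.na(lag(PRC,12))&!is.na(lag(RET))&!is.na(lag(ME))
    ) %>% as.data.frame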
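
To see exactly which months enter the sum, here is a base-R sketch for a single firm. `prior.sum` is a hypothetical helper, not part of dplyr, and the returns are invented. It assumes one row per month with no gaps, which is also what `lag` relies on.

```r
# Hypothetical helper: sum of the n returns before month t, excluding month t
prior.sum = function(r, n = 11) {
    out = rep(NA_real_, length(r))
    for (t in seq_along(r)) {
        if (t > n) out[t] = sum(r[(t - n):(t - 1)])
    }
    out
}

# Thirteen made-up monthly returns: 1% every month except 5% in month 12
r = c(rep(0.01, 11), 0.05, 0.01)
PR = prior.sum(r)
print(PR)
```

The 5% return in month 12 does not enter `PR` for month 12 but does enter `PR` for month 13. The first 11 months have no `PR` because there is not enough history.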

In [None]:
summary(crsp %>% select(ri, PR, PR.OK))

In [None]:
# ok is the share of observations with a usable prior return
dt = crsp %>% group_by(date) %>% summarise(ok=sum(PR.OK)/n())
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=ok*100)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

In [None]:
# ok is the share of observations with a usable prior return
dt = crsp %>% group_by(date, EXCHCD) %>% summarise(ok=sum(PR.OK)/n())
dt$date = as.Date(dt$date, format='%Y%m%d')
p = ggplot(dt, aes(x=date, y=ok*100, color=EXCHCD)) + geom_line()
p = p + scale_x_date(date_breaks="5 year", date_labels="%Y%m")
p  # + theme(axis.text.x=element_text(angle=90, vjust=0.5))

## Quantiles

We will frequently use the breakpoints from NYSE stocks.

Typically, we need the breakpoints for December and June but some
strategies, like momentum, require monthly quantiles.
This also allows us to compare what we find with
the data on Dr. Ken French's website.

In [None]:
q = c(.05, .1, .15, .2, .25, .3, .35, .4, .45, .5,
      .55, .6, .65, .7, .75, .8, .85, .9, .95, 1.)

dt = crsp %>% filter(EXCHCD=="1", !is.na(ME))

N = dt %>% group_by(date) %>% summarise(N=n()) %>% as.data.frame
rownames(N) = N$date
N$date = NULL

quantiles = do.call("rbind", tapply(dt$ME, dt$date, quantile, q))
quantiles = tibble::rownames_to_column(cbind(quantiles, N), "date")

write.csv(quantiles, "C:/Data/Thesis/20Q_ME.csv")

dt = crsp %>% filter(EXCHCD=="1", PR.OK)

N = dt %>% group_by(date) %>% summarise(N=n()) %>% as.data.frame
rownames(N) = N$date
N$date = NULL

quantiles = do.call("rbind", tapply(dt$PR, dt$date, quantile, q))
quantiles = tibble::rownames_to_column(cbind(quantiles, N), "date")

write.csv(quantiles, "C:/Data/Thesis/20Q_PR.csv")
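
The breakpoint table is easier to picture on a small example. This sketch uses made-up `ME` values for two dates and only three breakpoints; the `tapply` and `do.call` pattern is the same as above.

```r
# Hypothetical NYSE market equity for two month-ends
toy = data.frame(
    date = rep(c("20160531", "20160630"), each = 5),
    ME   = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)
q = c(.25, .5, 1)

# One vector of quantiles per date, stacked into a matrix with dates as row names
bp = do.call("rbind", tapply(toy$ME, toy$date, quantile, q))
print(bp)
```

Each row of `bp` holds the breakpoints for one date, which is the shape that gets written to the CSV files once the firm count `N` is attached.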