# Assignment 2 - read everything carefully

## As you may have learned in Finance...

There's a thing called the CAPM, aka the capital asset pricing model.

The CAPM hypothesizes the following linear relationship:

$$R_a-R_f=\alpha+\beta(R_m-R_f)$$

Here $R_a-R_f$, the asset's risk premium (the return on an asset above the risk-free rate), is equal to some $\alpha$ plus a multiplier, $\beta$, times the market premium (the market return above the risk-free rate), $R_m - R_f$.

$\beta$ is the systematic market risk factor for a stock (it sort of represents growth potential / market risk, i.e. how much the stock moves on average for a given movement in the market), and $\alpha$ is the excess return of a stock, which is what every investor is trying to find (excess returns = beating the market).

We will be analyzing data from the merged CRSP and COMPUSTAT files to compute $\alpha$'s and $\beta$'s for each stock.

For this assignment, you are provided with several files:

1. The merged CRSP and COMPUSTAT monthly dataset (ccm.zip) from 2010-2016.
2. The interest rate dataset (interestrates.zip)
3. A variable description file (codebook.xlsx)
4. An industry classification table (naics codes.xlsx)

Here are some hints on how to work with the zipped files:

* If you don't want the data to take up a bunch of space on your computer, **do not unzip the files** (at least don't unzip the CCM file, it's about 300 MB).

* Note that your computer should have at least 500 MB of RAM to be able to do this. All modern computers should satisfy this requirement. You may want to exit out of other RAM-intensive programs (video streaming, for example).

* In order to read the zipped file, use pandas' `read_csv` function and pass `compression='zip'` as a keyword argument.

* For the monthly return of an asset, use the variable `trt1m` from the CCM file. This is the total holding period return adjusted for dividend payments for each asset. 

* For the risk-free rate, use the variable `TCMNOM_M1` from the interest rates file. This is the 1-month treasury bill rate; we usually assume that the US government is "risk free."

* In the CCM file, use `datadate` as the date variable. In the interest rate file, use `date` as the date variable.
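
As a minimal sketch of the `read_csv` hint above, the snippet below builds a tiny zipped CSV in a temporary folder and reads it back without unzipping it. The file name `toy_ccm.zip` and the values are made up for illustration; only the column names `datadate` and `trt1m` come from the hints.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zipped CSV so the example runs without the course files.
# File name and values are made up; the real file is ccm.zip.
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy_ccm.zip")
    with zipfile.ZipFile(zip_path, mode="w") as archive:
        archive.writestr("toy_ccm.csv", "datadate,trt1m\n20100131,1.5\n20100228,-0.7\n")

    # compression='zip' lets pandas read the single CSV inside the archive directly.
    toy = pd.read_csv(zip_path, compression="zip")

print(toy.dtypes)
```

For the real data the call would presumably look like `pd.read_csv('ccm.zip', compression='zip')`, assuming the archive contains a single CSV file.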

# Read in the CCM dataset

1. The unique company identifier to use is PERMNO, the permanent record number for publicly listed stocks. Even as ticker symbols or names change, the PERMNO remains unchanged. Find the variable that corresponds to PERMNO. How many unique companies are there?

2. What is the data type and format for dates? Convert this to a pandas timestamp. 
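
A minimal sketch of both steps on toy data. The column name `permno_col` is hypothetical (finding the real one is part of the question), and the sketch assumes dates stored as integers of the form YYYYMMDD, which you should verify against the actual file before reusing the format string.

```python
import pandas as pd

# Toy stand-in for the CCM data; 'permno_col' is a hypothetical column name.
toy = pd.DataFrame({
    "permno_col": [10001, 10001, 10002],
    "datadate": [20100131, 20100228, 20100131],
})

# Count distinct identifiers rather than rows.
n_companies = toy["permno_col"].nunique()

# Assumption: integer dates like 20100131. Check the real dtype and format first.
toy["datadate"] = pd.to_datetime(toy["datadate"].astype(str), format="%Y%m%d")

print(n_companies)
print(toy["datadate"].dtype)
```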

# Read in the interest rate data

1. What is the format for dates? Convert this to a pandas timestamp. 

2. Merge this into the CCM dataset.
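
One possible way to do the merge, sketched on toy data. It assumes the two files may record different days within the same month, so it merges on a year-month key (the helper column `ym` is made up for this example) rather than on the exact date. If the dates in the real files already line up, a plain merge on the date columns would do.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are made up.
ccm = pd.DataFrame({
    "datadate": pd.to_datetime(["2010-01-31", "2010-02-28"]),
    "trt1m": [1.5, -0.7],
})
rates = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-02-01"]),
    "TCMNOM_M1": [0.03, 0.06],
})

# Hypothetical helper column: a year-month key shared by both tables.
ccm["ym"] = ccm["datadate"].dt.to_period("M")
rates["ym"] = rates["date"].dt.to_period("M")

# A left merge keeps every CCM row, even if a month has no interest rate.
merged = ccm.merge(rates[["ym", "TCMNOM_M1"]], on="ym", how="left")
print(merged)
```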

# Compute market and asset risk premiums

1. Compute the equally weighted (each stock receives the same weight) market return by year-month (i.e. for each date). Create a new column for these values (hint: you may want to store the market returns by date as a separate dataframe and then merge it back in with the CCM data).

2. Compute the market risk premium (market return - risk-free rate) and store it as a new column in the dataframe.

3. Compute the asset risk premium (asset return - risk-free rate) and store it as a new column.
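
The three steps above can be sketched on toy data as follows. The column names `ym`, `rf`, `mkt_ret`, `mkt_premium`, and `asset_premium` are made up for illustration, and the sketch assumes the return and the risk-free rate are already expressed in the same units; check the codebook before subtracting one from the other.

```python
import pandas as pd

# Toy merged data: two stocks observed in each of two months (values made up).
toy = pd.DataFrame({
    "ym": ["2010-01", "2010-01", "2010-02", "2010-02"],
    "trt1m": [1.0, 3.0, -2.0, 4.0],
    "rf": [0.5, 0.5, 0.5, 0.5],
})

# Equally weighted market return: the plain average across stocks in each month.
market = toy.groupby("ym")["trt1m"].mean().rename("mkt_ret").reset_index()

# Merge the monthly market return back onto every stock-month row.
toy = toy.merge(market, on="ym", how="left")

# Assumption: trt1m and rf are in the same units (e.g. both monthly percent).
toy["mkt_premium"] = toy["mkt_ret"] - toy["rf"]
toy["asset_premium"] = toy["trt1m"] - toy["rf"]
print(toy)
```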

# Run CAPM regressions

1. For each asset (think loop through PERMNOs), run the CAPM regression **if there are at least 12 observations for that stock**.

2. Keep track of your results. For each asset, keep track of its PERMNO (duh!), number of observations, the $\alpha$, the $\beta$, and the $R^2$. (Think of a data type that can let you append new results at each iteration). Convert these saved results into a dataframe that has the 5 columns corresponding to the variables you kept track of, and as many rows as there are assets for which you ran a regression.

3. What is the equally weighted average $\alpha$ across all assets? What is the standard deviation? Is this statistically different from 0 (what theory predicts)?

4. What is the equally weighted average $\beta$ across all assets? What is the standard deviation? Is this statistically different from 1 (what theory predicts)?

5. What is the average $R^2$? What does this tell you about the CAPM model?
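
A rough sketch of the loop on simulated data, not the intended solution. It fits each regression with `numpy.polyfit` (a statistics package such as statsmodels would work just as well), collects one dictionary per asset in a list, and computes one-sample t-statistics by hand. The column names and the simulated numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Simulated premiums for three hypothetical assets; asset 3 has too few rows.
rng = np.random.default_rng(0)
frames = []
for permno, n_obs in [(1, 24), (2, 24), (3, 6)]:
    x = rng.normal(size=n_obs)
    y = 0.1 + 1.2 * x + rng.normal(scale=0.5, size=n_obs)
    frames.append(pd.DataFrame({"permno": permno, "mkt_premium": x, "asset_premium": y}))
toy = pd.concat(frames, ignore_index=True)

results = []  # a list can grow by one record per regression
for permno, grp in toy.groupby("permno"):
    grp = grp.dropna(subset=["mkt_premium", "asset_premium"])
    if len(grp) < 12:
        continue  # skip assets with fewer than 12 observations
    # polyfit returns the highest-degree coefficient first: slope, then intercept.
    beta, alpha = np.polyfit(grp["mkt_premium"], grp["asset_premium"], deg=1)
    # With one regressor and an intercept, R^2 equals the squared correlation.
    r2 = np.corrcoef(grp["mkt_premium"], grp["asset_premium"])[0, 1] ** 2
    results.append({"permno": permno, "n_obs": len(grp),
                    "alpha": alpha, "beta": beta, "r2": r2})

capm = pd.DataFrame(results)

# One-sample t-statistics: is mean alpha different from 0, mean beta from 1?
n_assets = len(capm)
t_alpha = (capm["alpha"].mean() - 0) / (capm["alpha"].std(ddof=1) / np.sqrt(n_assets))
t_beta = (capm["beta"].mean() - 1) / (capm["beta"].std(ddof=1) / np.sqrt(n_assets))
print(capm)
```

Looping over `groupby` groups avoids re-filtering the full dataframe once per PERMNO, which matters when there are thousands of assets.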

# Explore $\alpha$'s

1. Use the CCM dataset to find Google's PERMNO by its ticker (look up Google's ticker if you don't know it and then write code to find the PERMNO). What is Google's $\alpha$? $\beta$?

2. What about Apple, Amazon, and Walmart?

3. Which company in the dataset has the highest $\alpha$? Second highest? How many observations does each have? What was the model fit ($R^2$)? Would you have invested in these companies? Why?

4. Of the assets that CAPM fits fairly well ($R^2>0.1$) and that have the maximum number of observations (84, i.e. they existed throughout the timeframe of our CCM dataset), which 3 companies had the highest $\alpha$? What do these companies do?
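
The lookups and filters in this section could be sketched like this on toy data. The ticker column name `tic`, the ticker `AAA`, and all values are made up for illustration; use the codebook to find the real ticker variable.

```python
import pandas as pd

# Toy CCM rows: 'tic' is an assumed name for the ticker column.
ccm = pd.DataFrame({"tic": ["AAA", "AAA", "BBB"], "permno": [1, 1, 2]})

# All PERMNOs that ever appear under a given ticker.
permnos = ccm.loc[ccm["tic"] == "AAA", "permno"].unique()

# Toy regression results with the five columns described earlier.
capm = pd.DataFrame({
    "permno": [1, 2, 3, 4, 5],
    "n_obs": [84, 84, 30, 84, 84],
    "alpha": [0.5, 2.0, 9.0, 1.0, 3.0],
    "r2": [0.30, 0.05, 0.40, 0.20, 0.15],
})

# Keep well-fitted assets with a full history, then take the 3 largest alphas.
good_fit = capm[(capm["r2"] > 0.1) & (capm["n_obs"] == 84)]
top3 = good_fit.nlargest(3, "alpha")
print(top3)
```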

# Bonus

Use only assets in our dataset with 84 observations.

1. Explore the relationship between company size (as measured by some metric of net asset value) and $\alpha$. Document and explain what you are doing. What can you conclude?

2. (Trickier) Look at the NAICS industry code file. These industries correspond to the first 2 digits of the NAICS code in the CCM file.

    a. What is the CCM NAICS code data type? Note that if it's a number, a leading 0, if it exists, gets dropped.
    
    b. Create a new column with just the first 2 digits of the NAICS code for the coarse classification.
    
    c. Can you find any relationship between industry and $\alpha$? Document your thought process, code, etc. Focus on why you're doing what you're doing.   
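
To illustrate the leading-zero issue from part (a), here is a small sketch that reads the same made-up CSV twice: once with the default type inference and once forcing the code column to be read as text. The column name `naics` and the codes are assumptions for the example, not taken from the CCM file.

```python
import io

import pandas as pd

# Made-up CSV text; the first code deliberately starts with a 0.
csv_text = "permno,naics\n1,011110\n2,522110\n"

# Default inference parses the codes as integers, so the leading 0 is lost.
as_number = pd.read_csv(io.StringIO(csv_text))

# Reading the column as str keeps every character of the code.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"naics": str})

# Coarse classification: the first two characters of the code.
as_text["naics2"] = as_text["naics"].str[:2]
print(as_number["naics"].tolist())
print(as_text["naics2"].tolist())
```

Reading the column as text at load time sidesteps the need to guess how many zeros to pad back afterwards.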