Compustat and CRSP data

**Caution**: Avoid look-ahead bias when merging datasets.




**Data Description**

Important Dataframes

1. "Returns" dataframe: contains monthly returns (RET), shares outstanding (SHROUT), price (PRC), primary exchange code (PRIMEXCH), and unique identifiers (PERMNO). The data are downloaded from CRSP.

Key input data:

* date    : yyyymmdd format
* RET     : return for the month ending yyyymmdd
* EXCHCD  : exchange where listed
* SHROUT  : shares outstanding as of month ending yyyymmdd


2. "Cstat_book" dataframe: contains monthly-level earnings per share (eps) and book value of common equity (ceq) for firms with unique identifiers (PERMNO).



3. "merged_data" dataframe: obtained by merging the "Returns" and "Cstat_book" dataframes on "PERMNO" and "date" with a one-year tolerance. The book-to-market ratio (b2m) is calculated from the ceq and marketcap values.
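To make the one-year tolerance concrete, here is a minimal sketch on made-up data. The frame names `returns`, `book`, and `toy`, the PERMNO value, and all numbers are hypothetical and are not taken from the CRSP or Compustat files. It assumes the book-equity rows are already dated at the point they are taken to be public.

```python
import pandas as pd

# Toy monthly rows for one hypothetical PERMNO (values are made up).
returns = pd.DataFrame({
    "PERMNO": [1, 1, 1],
    "date": pd.to_datetime(["2023-01-31", "2023-06-30", "2024-09-30"]),
})

# Toy book-equity rows, dated when they are assumed to be public.
book = pd.DataFrame({
    "PERMNO": [1, 1],
    "date": pd.to_datetime(["2022-05-31", "2023-05-31"]),
    "ceq": [100.0, 120.0],
})

# merge_asof needs both frames sorted by the merge key. For each left row
# it picks the latest right row with date <= left date (a backward match).
toy = pd.merge_asof(
    returns.sort_values("date"),
    book.sort_values("date"),
    on="date",
    by="PERMNO",
    tolerance=pd.Timedelta(days=365),  # ignore book values older than one year
)
print(toy)
```

The first two rows pick up the most recent book value on or before their date. The last row is more than 365 days past the latest book value, so its `ceq` is left missing rather than filled with stale data.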







In [1]:
# Importing Necessary Python Libraries
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
from pandas import DateOffset

CRSP Data

* date    : Month-end date
* PERMNO  : Permanent identification number assigned by CRSP
* RET     : Monthly return
* PRC     : Month-end price
* SHROUT  : Number of shares outstanding as of month-end


In [4]:
CRSP = pd.read_csv('FCX_CRSP.csv')
CRSP["date"] = pd.to_datetime(CRSP["date"])             # Convert to datetime for date arithmetic
# Market cap calculation
CRSP['marketcap_monthend'] = CRSP.SHROUT * CRSP.PRC.abs()              # Month-end market cap; abs() because CRSP reports a negative PRC for bid/ask averages
CRSP['marketcap'] = CRSP.groupby('PERMNO')['marketcap_monthend'].shift()  # Market cap lagged one month within each PERMNO (rows must be in date order)
CRSP.head()

Unnamed: 0,PERMNO,date,TICKER,COMNAM,RET,PRC,SHROUT,marketcap_monthend,marketcap
0,81774,2022-12-30,FCX,FREEPORT MCMORAN INC,-0.05,38.0,1429327,54314426.0,
1,81774,2023-01-31,FCX,FREEPORT MCMORAN INC,0.18,44.62,1429327,63776570.74,54314426.0
2,81774,2023-02-28,FCX,FREEPORT MCMORAN INC,-0.08,40.97,1430694,58615533.18,63776570.74
3,81774,2023-03-31,FCX,FREEPORT MCMORAN INC,0.0,40.91,1430694,58529691.54,58615533.18
4,81774,2023-04-28,FCX,FREEPORT MCMORAN INC,-0.07,37.91,1430694,54237609.54,58529691.54
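The lag above relies on rows being in date order within each PERMNO, which holds for the single-stock file used here. For a multi-stock panel, a minimal sketch of the same calculation might look like the following. The frame `toy_crsp` and its values are hypothetical, and the negative price is included only to illustrate CRSP's convention of reporting a bid/ask average as a negative PRC.

```python
import pandas as pd

# Two hypothetical PERMNOs with rows deliberately out of date order.
toy_crsp = pd.DataFrame({
    "PERMNO": [1, 2, 1, 2],
    "date": pd.to_datetime(["2023-02-28", "2023-01-31", "2023-01-31", "2023-02-28"]),
    "PRC": [-10.0, 20.0, 9.0, 21.0],   # negative PRC marks a bid/ask average
    "SHROUT": [100, 50, 100, 50],
})

# Sort first so that shift() moves each value to the next month of the same stock.
toy_crsp = toy_crsp.sort_values(["PERMNO", "date"]).reset_index(drop=True)

# Use the absolute price so bid/ask averages do not produce a negative market cap.
toy_crsp["marketcap_monthend"] = toy_crsp["SHROUT"] * toy_crsp["PRC"].abs()

# Lag within each PERMNO; the first month of each stock has no prior value.
toy_crsp["marketcap"] = toy_crsp.groupby("PERMNO")["marketcap_monthend"].shift()
print(toy_crsp)
```

Grouping by PERMNO keeps one stock's market cap from leaking into the first row of the next stock.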


Compustat data items


* GVKEY    : Compustat stock ID
* LPERMNO  : CRSP PERMNO for the stock
* datadate : Fiscal year-end date
* ceq      : Book value of common equity






In [6]:
#Compustat Data

# Importing Compustat Data

Cstat = pd.read_csv('FCX_Cstat.csv')
Cstat.head()

Unnamed: 0,GVKEY,LPERMNO,datadate,fyear,indfmt,consol,popsrc,datafmt,curcd,ceq,costat
0,14590,81774,2020-12-31,2020,INDL,C,D,STD,USD,10174.0,A
1,14590,81774,2021-12-31,2021,INDL,C,D,STD,USD,13980.0,A
2,14590,81774,2022-12-31,2022,INDL,C,D,STD,USD,15555.0,A
3,14590,81774,2023-12-31,2023,INDL,C,D,STD,USD,16693.0,A
4,14590,81774,2024-12-31,2024,INDL,C,D,STD,USD,17581.0,A


In [7]:

Cstat.rename(columns = {'LPERMNO' : 'PERMNO'}, inplace = True) # Rename "LPERMNO" to "PERMNO" so Cstat can be merged with the CRSP data
Cstat = Cstat[["PERMNO","datadate","ceq"]].copy()        # Keep only the relevant columns for clarity


# Datetime Manipulations
Cstat["date"] = pd.to_datetime(Cstat["datadate"])             # Convert to datetime for date arithmetic
Cstat['date'] = Cstat['date'].apply(lambda x: x + DateOffset(months=+5)) # Add five months (pandas DateOffset) so each fiscal year-end is dated when it is treated as public; the stated assumption is that the data take at most 4 months to reach the market
Cstat.head()

Unnamed: 0,PERMNO,datadate,ceq,date
0,81774,2020-12-31,10174.0,2021-05-31
1,81774,2021-12-31,13980.0,2022-05-31
2,81774,2022-12-31,15555.0,2023-05-31
3,81774,2023-12-31,16693.0,2024-05-31
4,81774,2024-12-31,17581.0,2025-05-31
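One detail worth knowing about the five-month shift: `DateOffset(months=5)` keeps the day of the month (clipped to the length of the target month), so the result is a month-end date only when the target month is no longer than the source month. The sketch below uses hypothetical fiscal year-end dates to show this, plus an optional snap to month-end with `pd.offsets.MonthEnd(0)`. The snap is an illustration and is not part of the notebook's method.

```python
import pandas as pd

# Hypothetical fiscal year-ends: December, February, and September.
fiscal_year_ends = pd.Series(pd.to_datetime(["2020-12-31", "2021-02-28", "2020-09-30"]))

# Same five-month shift as in the notebook, applied to the whole Series at once.
shifted = fiscal_year_ends + pd.DateOffset(months=5)

# Optional: roll each date forward to its month-end (dates already at month-end are unchanged).
snapped = shifted + pd.offsets.MonthEnd(0)

print(pd.DataFrame({"datadate": fiscal_year_ends, "shifted": shifted, "snapped": snapped}))
```

For the December year-ends in the FCX file the two columns coincide, which is why the output above shows clean month-end dates.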


Merge CRSP and Compustat data by PERMNO.

Ensure there is no look-ahead bias: are the Compustat data available when the CRSP price is reported?

In [8]:
# Merged Data

CRSP.sort_values(by = 'date', inplace = True)                       # Sort CRSP by date (merge_asof requires both frames to be sorted on the merge key)
Cstat.sort_values(by = 'date', inplace = True)                 # Sort Cstat by date for merge_asof


merged_data = pd.merge_asof(CRSP, Cstat, by = 'PERMNO', on = 'date', tolerance=dt.timedelta(days = 365)) # For each CRSP row, take the latest Cstat row for the same PERMNO dated on or before it, at most 1 year old
#merged_data.dropna(inplace = True)                                # Dropping Missing Values

# Calculating Book to Market Ratio
merged_data['b2m'] = 1000* merged_data.ceq / merged_data.marketcap      # Book-to-market ratio (multiplied by 1000 because 'ceq' is in $ millions and market cap is in $ thousands)


merged_data[['PERMNO', 'date', 'RET','marketcap_monthend','marketcap', 'datadate','ceq', 'b2m' ]]



Unnamed: 0,PERMNO,date,RET,marketcap_monthend,marketcap,datadate,ceq,b2m
0,81774,2022-12-30,-0.05,54314426.0,,2021-12-31,13980.0,
1,81774,2023-01-31,0.18,63776570.74,54314426.0,2021-12-31,13980.0,0.25739
2,81774,2023-02-28,-0.08,58615533.18,63776570.74,2021-12-31,13980.0,0.219203
3,81774,2023-03-31,0.0,58529691.54,58615533.18,2021-12-31,13980.0,0.238503
4,81774,2023-04-28,-0.07,54237609.54,58529691.54,2021-12-31,13980.0,0.238853
5,81774,2023-05-31,-0.09,49219041.24,54237609.54,2022-12-31,15555.0,0.286794
6,81774,2023-06-30,0.16,57331440.0,49219041.24,2022-12-31,15555.0,0.316036
7,81774,2023-07-31,0.12,63996219.9,57331440.0,2022-12-31,15555.0,0.271317
8,81774,2023-08-31,-0.11,57216412.76,63996219.9,2022-12-31,15555.0,0.243061
9,81774,2023-09-29,-0.07,53460286.44,57216412.76,2022-12-31,15555.0,0.271863
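The output can be checked directly for look-ahead bias. The sketch below copies three rows from the merged output above into a small frame (the name `check` and the helper columns `available` and `no_lookahead` are introduced here for illustration). It confirms that each fiscal year-end plus the five-month lag falls on or before the CRSP date it is matched to, and it recomputes b2m.

```python
import pandas as pd

# Three rows copied from the merged output above.
check = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-31", "2023-04-28", "2023-05-31"]),
    "datadate": pd.to_datetime(["2021-12-31", "2021-12-31", "2022-12-31"]),
    "ceq": [13980.0, 13980.0, 15555.0],                   # $ millions
    "marketcap": [54314426.0, 58529691.54, 54237609.54],  # $ thousands, lagged one month
})

# Earliest date on which each fiscal year-end is treated as public (five-month lag).
check["available"] = check["datadate"] + pd.DateOffset(months=5)

# True when the accounting data were already available on the CRSP date.
check["no_lookahead"] = check["available"] <= check["date"]

# Book-to-market: convert ceq from $ millions to $ thousands before dividing.
check["b2m"] = 1000 * check["ceq"] / check["marketcap"]

print(check[["date", "datadate", "available", "no_lookahead", "b2m"]])
```

The April 2023 row still uses fiscal-year 2021 book equity, and the switch to fiscal-year 2022 happens only in May 2023, which is the behavior the five-month lag is meant to produce.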
