# Project 1, Step 4: Process Stock Return Data
In this notebook, we will mainly do the following:
- Read in and clean stock return data and benchmark index return data
- Calculate 4-day buy-and-hold stock return
- Map the return to corresponding 10-K or 10-Q file

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [40]:
import pandas as pd
import numpy as np
import os
import warnings

warnings.filterwarnings("ignore")

from datetime import date
from tqdm import tqdm
from joblib import Parallel, delayed
import multiprocessing

from bs4 import BeautifulSoup
import re
from pathlib import Path
import json

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords

In [3]:
data_path = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/data/"
data_path_10q = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/data/10Q/"
data_path_10k = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/data/10K/"

cik_lookup_filename = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/CIK_lookup_results_cleaned.csv"
sp500_constituents_path = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/sp500_constituents.csv"
sp500_id_path = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/sp500_w_addl_id.csv"

index_return_path = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/Index_Returns.csv"
stock_return_path = "/content/drive/MyDrive/Mini 5/Natural Language Processing/Project 1/Stock_Prices.csv"

#### Read in stock return data

In [4]:
stock_return_df = pd.read_csv(stock_return_path)

In [5]:
stock_return_df

Unnamed: 0,PERMNO,date,TICKER,COMNAM,PRC,RETX
0,10104,20110103,ORCL,ORACLE CORP,31.62000,0.010224
1,10104,20110104,ORCL,ORACLE CORP,31.48000,-0.004428
2,10104,20110105,ORCL,ORACLE CORP,31.04000,-0.013977
3,10104,20110106,ORCL,ORACLE CORP,31.17000,0.004188
4,10104,20110107,ORCL,ORACLE CORP,31.03000,-0.004491
...,...,...,...,...,...,...
1863588,93436,20211227,TSLA,TESLA INC,1093.93994,0.025248
1863589,93436,20211228,TSLA,TESLA INC,1088.46997,-0.005000
1863590,93436,20211229,TSLA,TESLA INC,1086.18994,-0.002095
1863591,93436,20211230,TSLA,TESLA INC,1070.33997,-0.014592


#### Remove unnecessary columns

In [6]:
# .copy() so that later column assignments modify an independent dataframe
stock_return_df = stock_return_df[["date", "TICKER", "RETX"]].copy()
stock_return_df.head()

Unnamed: 0,date,TICKER,RETX
0,20110103,ORCL,0.010224
1,20110104,ORCL,-0.004428
2,20110105,ORCL,-0.013977
3,20110106,ORCL,0.004188
4,20110107,ORCL,-0.004491


#### Stock Return Data Cleaning
When processing the stock return file, we found that the `RETX` column has type object/string rather than float. Besides the numeric returns, it also contains the string codes "C" and "B", so casting the column directly to float fails.
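
One way to find such codes without knowing them in advance is sketched below. It is an illustration on a toy column, not the notebook's own procedure: `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN, and the entries that were present but failed to parse are the non-numeric codes. The variable names and the toy values are assumptions.

```python
import pandas as pd

# Toy column mixing numeric strings with the letter codes described above.
retx = pd.Series(["0.010224", "-0.004428", "C", "B", "B", None])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(retx, errors="coerce")

# Entries that were present but failed to parse are the non-numeric codes.
non_numeric = retx[as_number.isna() & retx.notna()]
code_counts = non_numeric.value_counts()
print(code_counts)
```

On the real data, the same idea would be applied to `stock_return_df['RETX']` before deciding how to treat each code.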

To deal with this issue, we first look at which rows contain "C" and "B".

In [8]:
test_c = stock_return_df.loc[stock_return_df['RETX'] =='C']
test_c

Unnamed: 0,date,TICKER,RETX
98514,20110104,MMI,C
101632,20110126,NLSN,C
104385,20110211,KMI,C
109896,20110310,HCA,C
112619,20110331,HII,C
...,...,...,...
407058,20211209,HCP,C
407074,20211209,NU,C
1309189,20150928,WMIH,C
1353858,20150601,MLSS,C


After inspection, we conclude that "C" appears when a company changes its ticker. To keep the dates consecutive for every ticker and avoid distorting the 4-day buy-and-hold calculation, we replace each "C" with the value 0.0.

In [9]:
stock_return_df['RETX'] = stock_return_df['RETX'].apply(lambda x: 0 if x == "C" else x)

In [10]:
stock_return_df.loc[stock_return_df['RETX'] =='C']

Unnamed: 0,date,TICKER,RETX


Now we have eliminated every occurrence of "C".

Next we inspect the occurrences of "B".

In [11]:
test_b = stock_return_df.loc[stock_return_df['RETX'] == 'B']
test_b

Unnamed: 0,date,TICKER,RETX
372894,20120105,,B
372895,20120106,,B
372896,20120109,,B
372897,20120110,,B
372898,20120111,,B
...,...,...,...
1353853,20150522,,B
1353854,20150526,,B
1353855,20150527,,B
1353856,20150528,,B


In [12]:
test_b[-1109:]

Unnamed: 0,date,TICKER,RETX
1309188,20150925,,B
1352750,20110103,,B
1352751,20110104,,B
1352752,20110105,,B
1352753,20110106,,B
...,...,...,...
1353853,20150522,,B
1353854,20150526,,B
1353855,20150527,,B
1353856,20150528,,B


From above, we can see that the last 1108 entries are consecutive with null ticker values.

In [13]:
test_b[:485]

Unnamed: 0,date,TICKER,RETX
372894,20120105,,B
372895,20120106,,B
372896,20120109,,B
372897,20120110,,B
372898,20120111,,B
...,...,...,...
373374,20131203,,B
373375,20131204,,B
373376,20131205,,B
373377,20131206,,B


From above, we can see that the first 484 entries are consecutive with null ticker values.

In [14]:
test_b[483:1677]

Unnamed: 0,date,TICKER,RETX
373377,20131206,,B
1307998,20110103,,B
1307999,20110104,,B
1308000,20110105,,B
1308001,20110106,,B
...,...,...,...
1309186,20150923,,B
1309187,20150924,,B
1309188,20150925,,B
1352750,20110103,,B


From above, we can see that the 485th through the 1675th entries are consecutive with null ticker values.

Thus, after inspection, we found that "B" in `RETX` always comes with a NaN ticker value, and that these rows form 3 consecutive blocks of entries. It is therefore safe to drop them without affecting the stock return calculations.
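
The block structure described above could also be checked programmatically instead of by slicing. The sketch below is one possible way, using a hypothetical helper `contiguous_blocks` that is not part of the notebook; the toy index values are a few of the row labels shown in the outputs above.

```python
import pandas as pd

def contiguous_blocks(index_values):
    """Summarize runs of consecutive integers as (first, last, count) rows."""
    s = pd.Series(sorted(index_values))
    # A new block starts wherever the gap to the previous value is not 1.
    block_id = (s.diff() != 1).cumsum()
    return s.groupby(block_id).agg(["first", "last", "count"]).reset_index(drop=True)

# Toy example: three separate runs of row labels.
blocks = contiguous_blocks([372894, 372895, 372896, 1307998, 1307999, 1352750])
print(blocks)
```

Calling `contiguous_blocks(test_b.index)` on the notebook's data should return three rows if the conclusion above holds, and `test_b['TICKER'].isna().all()` would confirm that every "B" row has a null ticker.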

In [15]:
stock_return_df = stock_return_df.loc[stock_return_df['RETX'] != "B"]
stock_return_df

Unnamed: 0,date,TICKER,RETX
0,20110103,ORCL,0.010224
1,20110104,ORCL,-0.004428
2,20110105,ORCL,-0.013977
3,20110106,ORCL,0.004188
4,20110107,ORCL,-0.004491
...,...,...,...
1863588,20211227,TSLA,0.025248
1863589,20211228,TSLA,-0.005000
1863590,20211229,TSLA,-0.002095
1863591,20211230,TSLA,-0.014592


After the cleaning above, casting all return values to float now succeeds.

In [16]:
stock_return_df['RETX'] = stock_return_df['RETX'].astype(float)
print(type(stock_return_df['RETX'].iloc[100]))

<class 'numpy.float64'>


Finally, we check for other data issues such as missing values. We found 5 missing tickers and 214 missing return values.

In [17]:
stock_return_df.isna().sum()

date        0
TICKER      5
RETX      214
dtype: int64

In [18]:
stock_return_df = stock_return_df.reset_index(drop=True, inplace=False)

In [19]:
stock_return_df.loc[stock_return_df['TICKER'].isna()]

Unnamed: 0,date,TICKER,RETX
138226,20120207,,
234894,20140724,,
303306,20171212,,
394888,20210901,,
406573,20211207,,


In [20]:
stock_return_df.iloc[138224:138230]

Unnamed: 0,date,TICKER,RETX
138224,20211230,EPAM,-0.006934
138225,20211231,EPAM,-0.006967
138226,20120207,,
138227,20120208,CZR,0.0
138228,20120209,CZR,-0.048083
138229,20120210,CZR,-0.027986


In [21]:
stock_return_df.iloc[234891:234900]

Unnamed: 0,date,TICKER,RETX
234891,20151215,PGN,0.018919
234892,20151216,PGN,-0.071618
234893,20151217,PGN,-0.166857
234894,20140724,,
234895,20140729,SPKE,0.0
234896,20140730,SPKE,-0.031338
234897,20140731,SPKE,-0.010399
234898,20140801,SPKE,0.003503
234899,20140804,SPKE,0.018034


In [22]:
stock_return_df.iloc[303303:303309]

Unnamed: 0,date,TICKER,RETX
303303,20200930,DLPH,0.022644
303304,20201001,DLPH,0.018552
303305,20201002,DLPH,
303306,20171212,,
303307,20171213,ACT,0.0
303308,20171214,ACT,-0.006357


In [23]:
stock_return_df.iloc[394885:394892]

Unnamed: 0,date,TICKER,RETX
394885,20211229,CNP,0.003591
394886,20211230,CNP,-0.004293
394887,20211231,CNP,0.002875
394888,20210901,,
394889,20210916,ACT,0.0
394890,20210917,ACT,-0.023914
394891,20210920,ACT,-0.0075


In [24]:
stock_return_df.iloc[406570:406576]

Unnamed: 0,date,TICKER,RETX
406570,20211229,GLW,0.004284
406571,20211230,GLW,-0.007998
406572,20211231,GLW,0.000538
406573,20211207,,
406574,20211209,HCP,0.0
406575,20211210,HCP,0.005987


From the inspections above, we can see that each of the 5 rows with a missing ticker sits immediately before the first record of a new ticker, so it has neither a ticker nor a return. For these rows we backward fill the ticker and the return from the following row.

In [25]:
stock_return_df.iloc[138224:138230] = stock_return_df.iloc[138224:138230].bfill()
stock_return_df.iloc[234891:234900] = stock_return_df.iloc[234891:234900].bfill()
stock_return_df.iloc[303303:303309] = stock_return_df.iloc[303303:303309].bfill()
stock_return_df.iloc[394885:394892] = stock_return_df.iloc[394885:394892].bfill()
stock_return_df.iloc[406570:406576] = stock_return_df.iloc[406570:406576].bfill()
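
The hard-coded row positions above depend on the exact row order of this file. A position-independent variant is sketched below under the assumption that every row with a missing ticker should take its values from the row directly after it. The helper name `fill_placeholder_rows` is hypothetical, and the toy rows are copied from the output of cell 20. Unlike the slice-based `bfill`, this sketch leaves rows alone when their ticker is present.

```python
import numpy as np
import pandas as pd

def fill_placeholder_rows(df):
    """Copy TICKER and RETX from the next row into rows whose TICKER is missing."""
    out = df.copy()
    mask = out["TICKER"].isna()
    next_ticker = out["TICKER"].shift(-1)
    next_retx = out["RETX"].shift(-1)
    out.loc[mask, "TICKER"] = next_ticker[mask]
    # Only overwrite returns that are themselves missing.
    fill_retx = mask & out["RETX"].isna()
    out.loc[fill_retx, "RETX"] = next_retx[fill_retx]
    return out

toy = pd.DataFrame({
    "date": [20211231, 20120207, 20120208, 20120209],
    "TICKER": ["EPAM", np.nan, "CZR", "CZR"],
    "RETX": [-0.006967, np.nan, 0.0, -0.048083],
})
filled = fill_placeholder_rows(toy)
print(filled)
```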

In [26]:
stock_return_df.isna().sum()

date        0
TICKER      0
RETX      208
dtype: int64

We resolved the missing data issue for `TICKER`. However, more than 200 return values are still missing with no obvious pattern, so instead of inspecting every case we fill them by rule. If the i-th entry is missing its return and the (i-1)-th entry has the same ticker, we copy the previous return (forward fill); otherwise, if the (i+1)-th entry has the same ticker, we copy the next return (backward fill).

In [35]:
stock_return_df.loc[stock_return_df['RETX'].isna()]

Unnamed: 0,date,TICKER,RETX
5576,20110228,AYE,
12544,20160907,EMC,
13381,20140501,BEAM,
14939,20170313,LLTC,
15008,20110411,GENZ,
...,...,...,...
1829529,20180710,DPS,
1834104,20180307,SNI,
1835728,20170615,MJN,
1836785,20150317,CFN,


In [28]:
ind_to_fill = stock_return_df.loc[stock_return_df['RETX'].isna()].index.tolist()

In [36]:
for i in ind_to_fill:
  # use the previous row's return if it belongs to the same ticker (forward fill)
  if stock_return_df.loc[i-1, 'TICKER'] == stock_return_df.loc[i, 'TICKER']:
    stock_return_df.loc[i, 'RETX'] = stock_return_df.loc[i-1, 'RETX']
  # otherwise use the next row's return if it belongs to the same ticker (backward fill)
  elif stock_return_df.loc[i+1, 'TICKER'] == stock_return_df.loc[i, 'TICKER']:
    stock_return_df.loc[i, 'RETX'] = stock_return_df.loc[i+1, 'RETX']
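
A similar rule can be written with grouped fills instead of an explicit loop. This is a sketch on toy data rather than a drop-in replacement: the helper name `fill_returns_within_ticker` and the toy tickers are assumptions. It differs from the loop in two ways. It also handles runs of several consecutive missing values, and it groups every row that shares a ticker, even when the rows are not adjacent in the file (the ticker "ACT", for instance, appears in two separate places in the outputs above).

```python
import numpy as np
import pandas as pd

def fill_returns_within_ticker(df):
    """Forward fill RETX inside each ticker, then backward fill what is left."""
    out = df.copy()
    out["RETX"] = out.groupby("TICKER")["RETX"].ffill()
    out["RETX"] = out.groupby("TICKER")["RETX"].bfill()
    return out

toy = pd.DataFrame({
    "TICKER": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "RETX": [0.01, np.nan, 0.02, np.nan, 0.03],
})
filled = fill_returns_within_ticker(toy)
print(filled)
```

In the toy frame, the missing value for "AAA" is taken from the row before it, while the missing first value for "BBB" has no earlier row and is taken from the row after it.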

In [37]:
stock_return_df.isna().sum()

date      0
TICKER    0
RETX      0
dtype: int64

We successfully filled all missing return values. The next step is to calculate 4-day buy-and-hold returns for each stock on each date.

#### Stock 4-day buy-and-hold return calculation
The stock absolute 4-day buy-and-hold return can be calculated by the following formula:

4-day buy-and-hold return = (r0 + 1) * (r1 + 1) * (r2 + 1) * (r3 + 1) - 1

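where r0 is the daily return on the filing date of a given document, r1 is the daily return 1 trading day after the filing date, r2 is the daily return 2 trading days after the filing date, and r3 is the daily return 3 trading days after the filing date.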

In [49]:
def buy_and_hold_ret(df_ticker):
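  # Assumes df_ticker is the daily return Series of a single ticker.
  # Near the end of the series, ffill pads the missing forward returns with the last available value.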
  shift1 = df_ticker.shift(-1).ffill() + 1
  shift2 = df_ticker.shift(-2).ffill() + 1
  shift3 = df_ticker.shift(-3).ffill() + 1
  return (df_ticker + 1) * shift1 * shift2 * shift3 - 1

In [57]:
# transform returns a result aligned to the original index, so it can be assigned directly as a column
stock_return_df['BuyHoldRet'] = stock_return_df.groupby("TICKER")['RETX'].transform(buy_and_hold_ret)

In [58]:
stock_return_df

Unnamed: 0,date,TICKER,RETX,BuyHoldRet
0,20110103,ORCL,0.010224,-0.004153
1,20110104,ORCL,-0.004428,-0.018659
2,20110105,ORCL,-0.013977,-0.013977
3,20110106,ORCL,0.004188,-0.001611
4,20110107,ORCL,-0.004491,-0.007218
...,...,...,...,...
1860805,20211227,TSLA,0.025248,0.003130
1860806,20211228,TSLA,-0.005000,-0.033969
1860807,20211229,TSLA,-0.002095,-0.041415
1860808,20211230,TSLA,-0.014592,-0.051572


In [59]:
(0.010224 + 1) * (-0.004428 + 1) * (-0.013977 + 1) * (1 + 0.004188) - 1

-0.004153438048403402
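
The manual check above can be generalized into a small reference implementation that compounds returns with an explicit loop. This is only a cross-check sketch: `forward_compound` is a hypothetical helper, and unlike `buy_and_hold_ret` it leaves the last rows as NaN instead of padding them. The input values are the first five ORCL returns shown earlier.

```python
import numpy as np
import pandas as pd

def forward_compound(returns, horizon=4):
    """Compound `horizon` consecutive returns starting at each position."""
    r = np.asarray(returns, dtype=float)
    out = np.full(len(r), np.nan)
    # Only positions with a full forward window receive a value.
    for t in range(len(r) - horizon + 1):
        out[t] = np.prod(1.0 + r[t:t + horizon]) - 1.0
    return pd.Series(out)

orcl = [0.010224, -0.004428, -0.013977, 0.004188, -0.004491]
check = forward_compound(orcl)
print(check)
```

The first value should agree with the hand computation above, and comparing `check` against the `BuyHoldRet` column on rows with a full forward window is a quick way to validate the shifted-series implementation.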

In [42]:
# multiprocessing.cpu_count()

2

In [41]:
# def applyParallel_groupby(dfGrouped, func):
#     retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
#     return pd.concat(retLst)

In [52]:
# stock_return_buy_hold = applyParallel_groupby(stock_return_df.groupby("TICKER"), buy_and_hold_ret)

#### Read in and clean the benchmark index return data

In [60]:
index_return_df = pd.read_csv(index_return_path)
index_return_df

Unnamed: 0,DlyCalDt,vwretx
0,20110103,0.011186
1,20110104,-0.003940
2,20110105,0.005302
3,20110106,-0.002759
4,20110107,-0.001953
...,...,...
2764,20211227,0.011964
2765,20211228,-0.002451
2766,20211229,0.000539
2767,20211230,-0.001132


In [61]:
index_return_df.isna().sum()

DlyCalDt    0
vwretx      0
dtype: int64

The index return data has no missing values, so we can proceed to calculate the 4-day buy-and-hold return of the index for each trading day.

#### Calculate the 4-day buy-and-hold return of index benchmark

In [62]:
index_return_df['BuyHoldRet'] = buy_and_hold_ret(index_return_df['vwretx'])

In [63]:
index_return_df

Unnamed: 0,DlyCalDt,vwretx,BuyHoldRet
0,20110103,0.011186,0.009749
1,20110104,-0.003940,-0.003372
2,20110105,0.005302,-0.000142
3,20110106,-0.002759,-0.000399
4,20110107,-0.001953,0.011295
...,...,...,...
2764,20211227,0.011964,0.008884
2765,20211228,-0.002451,-0.005462
2766,20211229,0.000539,-0.005437
2767,20211230,-0.001132,-0.008384


In [64]:
(0.011186 + 1) * (-0.003940 + 1) * (0.005302 + 1) * (1 + -0.002759) - 1

0.00974850809140726

After testing, we successfully calculated the 4-day buy-and-hold return of the index. Now we need to merge this data back to the stock absolute 4-day buy-and-hold return dataframe according to the date. After that, we subtract the index return from the stock return to obtain the excess stock return for 4-day buy-and-hold.
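
The merge-then-subtract step described above is sketched here on toy frames, with values copied from the tables in this notebook. The `validate` and `indicator` arguments of `DataFrame.merge` are optional additions, not something the notebook uses: `validate="many_to_one"` raises an error if a date appears more than once in the index frame, and `indicator=True` adds a `_merge` column that shows which stock rows found no index match. The frame names are assumptions.

```python
import pandas as pd

stock_df = pd.DataFrame({
    "date": [20110103, 20110104, 20121029],
    "TICKER": ["ORCL", "ORCL", "KMI"],
    "BuyHoldRet": [-0.004153, -0.018659, -0.027096],
})
index_df = pd.DataFrame({
    "DlyCalDt": [20110103, 20110104],
    "BuyHoldRet": [0.009749, -0.003372],
})

merged = stock_df.merge(
    index_df, left_on="date", right_on="DlyCalDt", how="left",
    suffixes=("_stock", "_index"), validate="many_to_one", indicator=True,
)
# Excess return is the stock return minus the index return on the same date.
merged["BuyHoldRet_excess"] = merged["BuyHoldRet_stock"] - merged["BuyHoldRet_index"]

# Stock rows whose date does not exist in the index data.
unmatched = merged.loc[merged["_merge"] == "left_only", ["date", "TICKER"]]
print(unmatched)
```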

#### Merge the two return dataframes and calculate the stock 4-day buy-and-hold excess returns

In [68]:
merged_return_df = stock_return_df.merge(index_return_df, left_on="date", right_on="DlyCalDt", how='left', suffixes=('_stock', '_index'))
merged_return_df

Unnamed: 0,date,TICKER,RETX,BuyHoldRet_stock,DlyCalDt,vwretx,BuyHoldRet_index
0,20110103,ORCL,0.010224,-0.004153,20110103.0,0.011186,0.009749
1,20110104,ORCL,-0.004428,-0.018659,20110104.0,-0.003940,-0.003372
2,20110105,ORCL,-0.013977,-0.013977,20110105.0,0.005302,-0.000142
3,20110106,ORCL,0.004188,-0.001611,20110106.0,-0.002759,-0.000399
4,20110107,ORCL,-0.004491,-0.007218,20110107.0,-0.001953,0.011295
...,...,...,...,...,...,...,...
1860805,20211227,TSLA,0.025248,0.003130,20211227.0,0.011964,0.008884
1860806,20211228,TSLA,-0.005000,-0.033969,20211228.0,-0.002451,-0.005462
1860807,20211229,TSLA,-0.002095,-0.041415,20211229.0,0.000539,-0.005437
1860808,20211230,TSLA,-0.014592,-0.051572,20211230.0,-0.001132,-0.008384


#### Remove columns that are unnecessary in the following steps

In [72]:
merged_return = merged_return_df[["date", "TICKER", "BuyHoldRet_stock", "BuyHoldRet_index"]]
merged_return

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
0,20110103,ORCL,-0.004153,0.009749
1,20110104,ORCL,-0.018659,-0.003372
2,20110105,ORCL,-0.013977,-0.000142
3,20110106,ORCL,-0.001611,-0.000399
4,20110107,ORCL,-0.007218,0.011295
...,...,...,...,...
1860805,20211227,TSLA,0.003130,0.008884
1860806,20211228,TSLA,-0.033969,-0.005462
1860807,20211229,TSLA,-0.041415,-0.005437
1860808,20211230,TSLA,-0.051572,-0.008384


#### Check whether any data are missing after the left join

In [73]:
merged_return.isna().sum()

date                0
TICKER              0
BuyHoldRet_stock    3
BuyHoldRet_index    7
dtype: int64

#### Inspect the rows with missing values

In [74]:
merged_return.loc[merged_return['BuyHoldRet_index'].isna()]

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
104817,20121029,KMI,-0.027096,
266698,20121029,TXN,0.014219,
624182,20121029,NI,-0.012528,
808533,20121029,CAG,-0.008828,
833455,20121029,LNT,-0.00989,
947792,20121029,AOS,0.012154,
988929,20121029,MS,0.039767,


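We can see from above that the missing index values all fall on the same date, 20121029. The last date before it on which the index has a daily return and a 4-day buy-and-hold return is 20121026, so we use that value to fill the missing index 4-day buy-and-hold return for 20121029. Note that the forward fill operates on the merged dataframe, so it takes the value from the preceding row, which is the 20121026 value only if that row is the same ticker's 20121026 record.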
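
A fill that does not depend on row order can be done at merge time with `pd.merge_asof`, which matches each stock date to the most recent index date at or before it. The sketch below uses toy frames whose values are copied from the tables in this notebook; the frame names are assumptions, and this is an alternative to the forward fill rather than what the notebook does.

```python
import pandas as pd

stock_dates = pd.DataFrame({
    "date": [20121026, 20121029],
    "TICKER": ["KMI", "KMI"],
})
index_df = pd.DataFrame({
    "DlyCalDt": [20110103, 20121026],
    "BuyHoldRet_index": [0.009749, 0.001992],
})

# merge_asof requires both frames to be sorted by their key columns.
asof_merged = pd.merge_asof(
    stock_dates.sort_values("date"),
    index_df.sort_values("DlyCalDt"),
    left_on="date", right_on="DlyCalDt", direction="backward",
)
print(asof_merged)
```

Because `merge_asof` needs the left frame sorted by date, a frame ordered by ticker would have to be sorted first and restored to its original order afterwards.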

In [87]:
index_return_df.loc[index_return_df['DlyCalDt'] == 20121026]

Unnamed: 0,DlyCalDt,vwretx,BuyHoldRet
459,20121026,-0.00139,0.001992


In [88]:
merged_return['BuyHoldRet_index'] = merged_return['BuyHoldRet_index'].ffill()

In [89]:
# Check whether there are any data missing due to left join merging
merged_return.isna().sum()

date                0
TICKER              0
BuyHoldRet_stock    3
BuyHoldRet_index    0
dtype: int64

Next we inspect why some stock returns are missing.

In [75]:
merged_return.loc[merged_return['BuyHoldRet_stock'].isna()]

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
417682,20110103,MOT,,0.009749
1708099,20190319,TFCFA,,-0.014344
1708100,20190320,TFCFA,,-0.014038


In [90]:
merged_return.loc[417679:417685]

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
417679,20211229,MRK,-0.008681,-0.005437
417680,20211230,MRK,-0.016899,-0.008384
417681,20211231,MRK,-0.025677,-0.009669
417682,20110103,MOT,,0.009749
417683,20110104,MSI,0.054509,-0.003372
417684,20110105,MSI,-0.031179,-0.000142
417685,20110106,MSI,-0.036405,-0.000399


In [92]:
merged_return.loc[merged_return['TICKER'] == 'MOT']

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
417682,20110103,MOT,,0.009749


In [91]:
merged_return.loc[1708095:1708103]

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
1708095,20190313,FOXA,0.009433,0.013959
1708096,20190314,FOXA,-0.009412,0.006825
1708097,20190315,FOXA,-0.031253,0.00487
1708098,20190318,FOXA,-0.049063,0.010634
1708099,20190319,TFCFA,,-0.014344
1708100,20190320,TFCFA,,-0.014038
1708101,20110103,MKTX,-0.021144,0.009749
1708102,20110104,MKTX,-0.011719,-0.003372
1708103,20110105,MKTX,0.046006,-0.000142


In [93]:
merged_return.loc[merged_return['TICKER'] == 'TFCFA']

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
1708099,20190319,TFCFA,,-0.014344
1708100,20190320,TFCFA,,-0.014038


From the checks above, we can see that the missing buy-and-hold stock returns belong to tickers "MOT" and "TFCFA": "MOT" appears on only 1 day and "TFCFA" on only 2 days. We next look at these tickers in the original dataframe of daily stock returns.

In [94]:
stock_return_df[stock_return_df['TICKER']=="MOT"]

Unnamed: 0,date,TICKER,RETX,BuyHoldRet
417682,20110103,MOT,0.00441,


In [95]:
stock_return_df[stock_return_df['TICKER']=="TFCFA"]

Unnamed: 0,date,TICKER,RETX,BuyHoldRet
1708099,20190319,TFCFA,-0.032516,
1708100,20190320,TFCFA,-0.032516,


"MOT" also only appears 1 time in the daily return dataframe; "TFCFA" also only appears 2 times in the daily return dataframe. This explains why they do not have 4-day buy-and-hold return data. In this case, we can drop them from consideration.

In [96]:
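# keep rows with a valid stock return; .copy() avoids chained-assignment issues when adding columns later
merged_return = merged_return.loc[merged_return['BuyHoldRet_stock'].notna()].copy()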
merged_return

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index
0,20110103,ORCL,-0.004153,0.009749
1,20110104,ORCL,-0.018659,-0.003372
2,20110105,ORCL,-0.013977,-0.000142
3,20110106,ORCL,-0.001611,-0.000399
4,20110107,ORCL,-0.007218,0.011295
...,...,...,...,...
1860805,20211227,TSLA,0.003130,0.008884
1860806,20211228,TSLA,-0.033969,-0.005462
1860807,20211229,TSLA,-0.041415,-0.005437
1860808,20211230,TSLA,-0.051572,-0.008384
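
The missing values for "MOT" and "TFCFA" follow from how `buy_and_hold_ret` works: with fewer than four observations, at least one of the shifted series is entirely NaN, so the product is NaN. An alternative to dropping the NaN rows afterwards is to exclude short histories up front, as sketched below. The helper name `drop_short_histories` and the `min_obs` parameter are assumptions, and the toy values are copied from the outputs above.

```python
import pandas as pd

def drop_short_histories(df, min_obs=4):
    """Keep only tickers with at least `min_obs` non-missing daily returns."""
    n_obs = df.groupby("TICKER")["RETX"].transform("count")
    return df.loc[n_obs >= min_obs].copy()

toy = pd.DataFrame({
    "TICKER": ["MOT", "TFCFA", "TFCFA", "ORCL", "ORCL", "ORCL", "ORCL"],
    "RETX": [0.00441, -0.032516, -0.032516, 0.010224, -0.004428, -0.013977, 0.004188],
})
kept = drop_short_histories(toy)
print(kept)
```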


Now we can calculate the 4-day buy-and-hold stock excess return.

In [97]:
merged_return['BuyHoldRet_excess'] = merged_return['BuyHoldRet_stock'] - merged_return['BuyHoldRet_index']
merged_return

Unnamed: 0,date,TICKER,BuyHoldRet_stock,BuyHoldRet_index,BuyHoldRet_excess
0,20110103,ORCL,-0.004153,0.009749,-0.013902
1,20110104,ORCL,-0.018659,-0.003372,-0.015287
2,20110105,ORCL,-0.013977,-0.000142,-0.013835
3,20110106,ORCL,-0.001611,-0.000399,-0.001212
4,20110107,ORCL,-0.007218,0.011295,-0.018513
...,...,...,...,...,...
1860805,20211227,TSLA,0.003130,0.008884,-0.005754
1860806,20211228,TSLA,-0.033969,-0.005462,-0.028507
1860807,20211229,TSLA,-0.041415,-0.005437,-0.035978
1860808,20211230,TSLA,-0.051572,-0.008384,-0.043188


In [98]:
# Keep only the columns needed in the final output of this notebook
merged_return = merged_return[['date', 'TICKER', 'BuyHoldRet_excess']]
merged_return

Unnamed: 0,date,TICKER,BuyHoldRet_excess
0,20110103,ORCL,-0.013902
1,20110104,ORCL,-0.015287
2,20110105,ORCL,-0.013835
3,20110106,ORCL,-0.001212
4,20110107,ORCL,-0.018513
...,...,...,...
1860805,20211227,TSLA,-0.005754
1860806,20211228,TSLA,-0.028507
1860807,20211229,TSLA,-0.035978
1860808,20211230,TSLA,-0.043188


In [100]:
merged_return = merged_return.reset_index(drop=True, inplace=False)

In [101]:
merged_return

Unnamed: 0,date,TICKER,BuyHoldRet_excess
0,20110103,ORCL,-0.013902
1,20110104,ORCL,-0.015287
2,20110105,ORCL,-0.013835
3,20110106,ORCL,-0.001212
4,20110107,ORCL,-0.018513
...,...,...,...
1860802,20211227,TSLA,-0.005754
1860803,20211228,TSLA,-0.028507
1860804,20211229,TSLA,-0.035978
1860805,20211230,TSLA,-0.043188


In [103]:
merged_return.to_csv(os.path.join(data_path, "Stock_Excess_BuyHoldReturn.csv"), index=False)

In [104]:
# pd.read_csv(os.path.join(data_path, "Stock_Excess_BuyHoldReturn.csv"))

Unnamed: 0,date,TICKER,BuyHoldRet_excess
0,20110103,ORCL,-0.013902
1,20110104,ORCL,-0.015287
2,20110105,ORCL,-0.013835
3,20110106,ORCL,-0.001212
4,20110107,ORCL,-0.018513
...,...,...,...
1860802,20211227,TSLA,-0.005754
1860803,20211228,TSLA,-0.028507
1860804,20211229,TSLA,-0.035978
1860805,20211230,TSLA,-0.043188
