<a href="https://colab.research.google.com/github/thowley1207/capstone_project/blob/02/02_map_sec_cik_to_wrds_permno.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --upgrade wrds
!wget https://raw.githubusercontent.com/thowley0824/capstone/main/colab_initialization/initializer.py

import json
import pandas as pd
import pathlib
import numpy as np
import requests
import zipfile

import initializer
initializer.initialize_colab()
db = initializer.initialize_wrds_connection()

In [None]:
'''
SET DATA SUBDIRECTORIES AND FORM TYPE PREFIX
WHEN APPLICABLE, THIS FORM TYPE PREFIX WILL BE USED MOVING FORWARD
'''
data_subdir = 'data/edgar_wrds_linking/'
file_prefix = '8k_'

'''
FILE NAMES CARRIED DOWN FROM PRIOR WORK
'''
master_index_all_periods_file_name = 'master_index_all_periods.pkl'

'''
NEW FILE NAMES FOR USE BELOW
'''
permno_cik_map_file_name = 'permno_cik_map.pkl'

filtered_index_data_w_permno_file_name = (
    'filtered_index_data_w_permno.pkl')

In [None]:
'''
READ IN COMBINED ALL PERIOD MASTER INDEX DATA
'''

master_index_all_periods = pd.read_pickle((
    data_subdir +
    file_prefix +
    master_index_all_periods_file_name
    ))

### **QUERY HELPER FUNCTION**

In [None]:
def execute_wrds_query(query_content,
                       wrds_session = db,
                       output_directory = 'data/',
                       output_file = None):
    '''
    RUN A SQL QUERY AGAINST WRDS AND RETURN THE RESULT AS A DATAFRAME
    OPTIONALLY WRITE THE RESULT TO output_directory AS .csv OR .pkl
    '''
    # Use the session passed in rather than the global connection
    query_result = wrds_session.raw_sql(query_content)

    if output_file is not None:
        output_path = f"""{output_directory}{output_file}"""
        print(f"""Writing query result to: {output_path}""")

        if output_file.endswith('.csv'):
            query_result.to_csv(output_path)
        elif output_file.endswith('.pkl'):
            query_result.to_pickle(output_path)
        else:
            raise ValueError("Invalid File Format Provided For Output")
        print("Query result successfully written.")
    else:
        print("Warning: output is not saved to Google Drive.")

    return query_result

**Generate a lookup table by linking the following data (for the specified event periods):**

---
*   gvkey, cusip, cik from **comp.funda**
      - cusip = most recent cusip for each company

TO

*   permno, ncusip from **crsp.dsenames**
      - ncusip = every historical cusip recorded for a given permno (restricted to share codes 10 and 11)

ON

* **substring(comp.funda.cusip, 1, 8) = crsp.dsenames.ncusip**

---

AND

---

*   permno from **crsp.dsf**

TO

*   permno, ncusip from **crsp.dsenames**

ON

*  **crsp.dsf.permno = crsp.dsenames.permno**

---

The resultant lookup table is a map containing pairs of the form:

*  **crsp.dsf.permno, comp.funda.cik**

After minor cleaning for irregularities and formatting, this will allow us to map the CIKs from the scraped EDGAR data to their respective daily CRSP time series data.
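The linking and cleaning steps described above can be sketched locally in pandas, without a WRDS connection. This is only an illustration under stated assumptions: the two small frames are made-up stand-ins for `comp.funda` and `crsp.dsenames`, and the `cusip8` column name is hypothetical. It shows the 8-character CUSIP join, the removal of any PERMNO that resolves to more than one CIK, and the conversion of the CIK to an integer.

```python
import pandas as pd

# Toy stand-in for comp.funda (all values are made up for illustration).
comp_funda = pd.DataFrame({
    "gvkey": ["001000", "002000", "003000"],
    "cusip": ["000111222", "000333444", "000555666"],   # 9-character CUSIPs
    "cik":   ["0000001111", "0000002222", "0000003333"],  # zero-padded strings
})

# Toy stand-in for crsp.dsenames: one permno can carry several ncusips.
crsp_dsenames = pd.DataFrame({
    "permno": [10001, 10001, 10002, 10003, 10003],
    "ncusip": ["00011122", "99999999", "00033344", "00011122", "00055566"],
})

# Equivalent of substring(cusip, 1, 8): keep the first 8 characters.
comp_funda["cusip8"] = comp_funda["cusip"].str[:8]

# Join on the 8-character CUSIP and keep distinct (permno, cik) pairs.
pairs = (
    crsp_dsenames
    .merge(comp_funda, left_on="ncusip", right_on="cusip8")[["permno", "cik"]]
    .drop_duplicates()
)

# keep=False removes every permno that maps to more than one cik
# (permno 10003 here), leaving only unambiguous pairs.
lookup = pairs.drop_duplicates(subset=["permno"], keep=False).copy()

# Casting to int strips the leading zeros from the cik strings.
lookup["cik"] = lookup["cik"].astype(int)
lookup = lookup.sort_values("permno").reset_index(drop=True)
print(lookup)
```

In this toy data, PERMNO 10003 matches two different CIKs through its two historical CUSIPs, so it is dropped entirely rather than assigned to one of them arbitrarily.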

In [None]:
q_permno_cik_map = """
with
comp_funda as
(
    select distinct
        gvkey,
        substring(cusip,1,8) as cusip,
        cik
    from comp.funda
    where datadate between '2005-01-01' and '2020-01-01'
    and cik is not null
    and cusip is not null
),

crsp_dse_names as
(
    select distinct
        permno,
        ncusip as cusip
    from crsp.dsenames
    where shrcd in (10, 11)
    and ncusip is not null
)
select distinct
    crsp_dsf.permno,
    comp_funda.cik
from crsp.dsf crsp_dsf
    inner join crsp_dse_names
        on crsp_dsf.permno = crsp_dse_names.permno
    inner join comp_funda
        on crsp_dse_names.cusip = comp_funda.cusip
"""

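# keep=False drops every permno that maps to more than one cik,
# leaving only unambiguous permno -> cik pairs
permno_cik_map = execute_wrds_query(q_permno_cik_map
                                    ).drop_duplicates(subset=['permno'],
                                                      keep=False)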

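# cik arrives as a zero-padded string; casting to int strips the leading zeros
# (slicing off a fixed 3 characters first would truncate any cik longer than 7 digits)
permno_cik_map['cik'] = permno_cik_map['cik'].astype(int)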

In [None]:
permno_cik_map.to_pickle((
    data_subdir +
    file_prefix +
    permno_cik_map_file_name
    ))

**MAP 8-K data to PERMNOs by CIK**
- Only keep data that has an existing PERMNO
- This map will be used to:
  - Facilitate the event study (because it needs the event dates from the SEC data)
  - Facilitate the LLM fine-tuning / training / output analysis
    - The LLM needs to be fine-tuned on both the 8-K data and the event study output
    - The output analysis is predicated on comparing predictions to real performance
    - None of this can be done without a way to link SEC data and CRSP data for the entities
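Before discarding filings without a PERMNO, it can be useful to count how many are affected. The sketch below is one possible check, not part of the original workflow: it uses made-up data, and the `form_type` and `date_filed` column names are assumptions about the index layout. A left merge with `indicator=True` labels each filing as matched or unmatched, and filtering on that label gives the same rows an inner merge would keep.

```python
import pandas as pd

# Hypothetical slice of the EDGAR index; columns other than `cik` are assumed.
index_data = pd.DataFrame({
    "cik": [1111, 2222, 4444],
    "form_type": ["8-K", "8-K", "8-K"],
    "date_filed": ["2010-03-01", "2012-07-15", "2015-11-30"],
})

# Hypothetical permno/cik lookup table.
lookup = pd.DataFrame({"permno": [10001, 10002], "cik": [1111, 2222]})

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left merge.
checked = index_data.merge(lookup, on="cik", how="left", indicator=True)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(index_data)} filings have no PERMNO")

# Keeping only the matched rows gives the same rows as an inner merge.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
# The left merge introduced NaN, so permno became float; restore the int dtype.
matched["permno"] = matched["permno"].astype(int)
print(matched)
```

If one CIK is linked to several PERMNOs in the lookup table, the merge repeats that filing once per PERMNO, so comparing row counts before and after the merge is a cheap additional sanity check.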

In [None]:
# Inner merge (the default): filings whose cik has no PERMNO are dropped
filtered_index_data_w_permno = master_index_all_periods.merge(permno_cik_map,
                                                              on='cik')

In [None]:
# NOTE: unlike the other outputs, this file is written without file_prefix
filtered_index_data_w_permno.to_pickle((
    data_subdir +
    filtered_index_data_w_permno_file_name
    ))