# Scraping the Data
_Author_: https://github.com/raffysantayana

## Goal
Use the US Securities and Exchange Commission's (SEC) electronic filing system to programmatically collect and organize filing data so that it can later be explored, analyzed, and modeled.

## Overview
The SEC archives quarterly reports from various filing entities such as Netflix Inc. (NFLX) and American Express Co. (AXP).

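This notebook covers two ways to collect these filings: the `sec_api` Python package, which requires a paid subscription beyond its free request limit, and the SEC's own EDGAR endpoints, which are scraped directly.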

## Using the SEC API
Note: Making 100+ requests requires a subscription of $55/month. I wrote this code before realizing this, so it stopped after gathering the free limit's worth of documents.
```python
import time
import pandas as pd
from sec_api import QueryApi

# main dataframe we will append each query results to
df = pd.DataFrame()

# paste your api key below (never commit a real key)
sec_api_key: str = 'api_key'
query_api = QueryApi(api_key = sec_api_key)

base_query = {
  "query": "PLACEHOLDER", # this will be set during runtime
  "from": "0",
  "size": "200", # don't change this
  # sort by filedAt
  "sort": [{ "filedAt": { "order": "desc" } }]
}

# open the file we use to store the filing URLs
log_file = open("filing_urls.txt", "a")

# start with filings filed in 2024, then 2023, 2022, ... down to 1997
# uncomment line below to fetch only filings filed in 2021-2010
# for year in range(2021, 2009, -1):
for year in range(2024, 1996, -1):
    print("starting {year}".format(year=year))
    # a single search universe is represented as a month of the given year

    for month in range(1, 13, 1):
        # get 10-Q and 10-Q/A filings filed in year and month
        # resulting query example: "formType:\"10-Q\" AND filedAt:[2021-01-01 TO 2021-01-31]"
        universe_query = \
            "formType:\"10-Q\" AND " + \
            "filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-31]" \
            .format(year=year, month=month)

        print(universe_query)
        # set new query universe for year-month combination
        base_query["query"] = universe_query

        # paginate through results by increasing "from" parameter
        # until we don't find any matches anymore
        # uncomment line below to fetch at most 400 filings per month
        # for from_batch in range(0, 400, 200):
        for from_batch in range(0, 999_800, 200):
            # set new "from" starting position of search
            base_query["from"] = from_batch

            # submit request
            response = query_api.get_filings(base_query)

            # no more filings in search universe
            if len(response["filings"]) == 0:
                break

            # building a temp dataframe of the recent query
            temp_df = pd.DataFrame.from_records(response['filings'])
            # concatenating the temp dataframe to the main dataframe
            df = pd.concat([df, temp_df])
            print(f'df.shape = {df.shape}')

            # for each filing, only save the URL pointing to the filing itself
            # and ignore all other data.
            # the URL is set in the dict key "linkToFilingDetails"
            urls_list = [filing["linkToFilingDetails"] for filing in response["filings"]]

            # transform list of URLs into one string by joining all list elements
            # and add a new-line character between each element.
            urls_string = "\n".join(urls_list) + "\n"

            log_file.write(urls_string)

log_file.close()
```
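The query above uses day 31 as the upper bound for every month. If the API turns out to be strict about calendar dates, the exact last day of each month can be computed with the standard library. This is a minimal sketch that assumes the same query syntax as the block above; `month_query` is a hypothetical helper and not part of `sec_api`.

```python
import calendar


def month_query(year: int, month: int, form_type: str = "10-Q") -> str:
    """Build a query string that covers exactly one calendar month."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    return (
        f'formType:"{form_type}" AND '
        f"filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-{last_day:02d}]"
    )


print(month_query(2021, 2))
# formType:"10-Q" AND filedAt:[2021-02-01 TO 2021-02-28]
```

The returned string could be assigned to `base_query["query"]` in place of the hand-built one.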

## Web Scraping
The SEC has an electronic filing system, Electronic Data Gathering, Analysis, and Retrieval (EDGAR), that started around 1995 to archive reports such as the quarterly 10-Q. This system has a RESTful API, documented at [this URL](https://www.sec.gov/edgar/sec-api-documentation), to retrieve report information. Each entity's current filing history is available at the following URL, where `CIK_number` is the entity's CIK number padded with leading zeroes to 10 digits:
`https://data.sec.gov/submissions/CIK{CIK_number}.json`

The returned JSON contains fields such as `accessionNumber` and `primaryDocument`, stored as parallel lists: the entry at a given index of `accessionNumber` belongs to the same filing as the entry at that index of `primaryDocument`. Combining these two values with the CIK number lets us construct a URL for every filing of that CIK. Our goal is to analyze quarterly reports specifically, so we will keep only the filings whose `form` value is "10-Q". The URL we will construct, with the dashes removed from the accession number, is:
`https://www.sec.gov/Archives/edgar/data/{CIK_number}/{accessionNumber}/{primaryDocument}`

For example, https://www.sec.gov/Archives/edgar/data/0001445815/000149315224015525/form10-qa.htm
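As a minimal sketch of the template above, the hypothetical helper `build_report_url` assembles the URL from the three values. It assumes the accession number arrives in its dashed form (the JSON shows values such as `0001104659-24-090342`), so the dashes are removed first. It also drops the CIK's leading zeroes, which is a choice made for this sketch; the example URL above keeps them.

```python
def build_report_url(cik: str, accession_number: str, primary_document: str) -> str:
    """Assemble the URL of a filing's primary document."""
    # int() discards leading zeroes only, so trailing zeroes survive
    cik_no_zeroes = str(int(cik))
    # the archive path uses the accession number without dashes
    accession_no_dashes = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_no_zeroes}/{accession_no_dashes}/{primary_document}"
    )


print(build_report_url("0001445815", "0001493152-24-015525", "form10-qa.htm"))
# https://www.sec.gov/Archives/edgar/data/1445815/000149315224015525/form10-qa.htm
```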

A list of all CIK numbers to iterate through can be found [here](https://www.sec.gov/Archives/edgar/cik-lookup-data.txt).

```python
import time
import requests
import pandas as pd
```

```python
tickers_url: str = r'https://www.sec.gov/files/company_tickers.json'
# To find header values: navigate to the target URL, inspect the page,
# open the Network tab, and copy the request header values.
# The browser's Cookie header is left out: it is specific to one browser
# session and should not be needed for this request.
headers = {'User-Agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
           'Accept': r'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
           'Accept-Encoding': r'gzip, deflate, br, zstd',
           'Accept-Language': r'en-US,en;q=0.9',
           'Cache-Control': r'max-age=0'}
response = requests.get(tickers_url, headers=headers)
if response.status_code != 200:
    raise Exception(f"[ERROR] Status code {response.status_code} received")
else:
    print(f"Successful response from {tickers_url}.")
raw_tickers = response.json()
print(f"{len(raw_tickers)} retrieved.")
```
Output:
```
Successful response from https://www.sec.gov/files/company_tickers.json.
10197 retrieved.
```


```python
tickers = pd.DataFrame(columns=['cik_str', 'ticker', 'title'])

for i in range(len(raw_tickers)):
    tickers.loc[f"{i}"] = raw_tickers[f"{i}"]
```

```python
tickers.shape
```
Output:
```
(10197, 3)
```

```python
tickers.dtypes
```
Output:
```
cik_str     int64
ticker     object
title      object
dtype: object
```

```python
tickers.head()
```
Output:
```
   cik_str ticker           title
0   320193   AAPL      Apple Inc.
1   789019   MSFT  MICROSOFT CORP
2  1045810   NVDA     NVIDIA CORP
3  1652044  GOOGL   Alphabet Inc.
4  1018724   AMZN  AMAZON COM INC
```

```python
tickers.tail()
```
Output:
```
       cik_str ticker                          title
10192  1849294  FRLAW  Fortune Rise Acquisition Corp
10193   886163  LGNDZ     LIGAND PHARMACEUTICALS INC
10194   886163  LGNXZ     LIGAND PHARMACEUTICALS INC
10195   886163  LGNYZ     LIGAND PHARMACEUTICALS INC
10196   886163  LGNZZ     LIGAND PHARMACEUTICALS INC
```


```python
tickers.to_csv("../data/tickers.csv")
```

```python
with open('../data/all_submissions_w_duplicates.txt', 'w') as the_file:
    for cik in tickers['cik_str']:
        # Write to all_submissions_w_duplicates.txt the url for the given cik
        # number with leading zeroes until 10 digits are reached
        the_file.write(f'https://data.sec.gov/submissions/CIK{cik:010d}.json\n')
```

```python
with open('../data/all_submissions_w_duplicates.txt', 'r') as dup_file:
    lines = dup_file.readlines()
    with open('../data/all_submissions.txt', 'w') as final_file:
        # set() removes duplicate URLs but does not preserve their order
        for unique_line in set(lines):
            final_file.write(unique_line)
```
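
The two file-writing steps above can also be collapsed into one. The following is a sketch, not the code that was run: `submission_urls` is a hypothetical helper that zero-pads each CIK to 10 digits and removes duplicates with `dict.fromkeys`, which keeps first-seen order (a `set` does not), so line numbers in the output file would stay the same between runs. The sample CIKs are taken from the ticker table above.

```python
def submission_urls(ciks):
    """Return one submissions URL per distinct CIK, in first-seen order."""
    urls = (f"https://data.sec.gov/submissions/CIK{cik:010d}.json" for cik in ciks)
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(urls))


sample_ciks = [320193, 886163, 886163, 789019]
for url in submission_urls(sample_ciks):
    print(url)
# https://data.sec.gov/submissions/CIK0000320193.json
# https://data.sec.gov/submissions/CIK0000886163.json
# https://data.sec.gov/submissions/CIK0000789019.json
```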

## Validating URLs
The cell below iterates through each URL in `all_submissions.txt` and checks that each one returns a valid response. This only needs to be done once, assuming that the SEC does not remove any of these submissions. Under that assumption, the code is kept below as a non-executing block together with its result.

```python
with open('../data/all_submissions.txt', 'r') as urls:
    lines = urls.readlines()

failures = 0
for line_number, line in enumerate(lines, start=1):
    url = line.strip()
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        failures += 1
        print(f'{url} might not be a valid URL. Status code {response.status_code} received. This URL is on line {line_number}.')
    # print the progress
    print(f'{line_number:05d}/{len(lines)}', end='\r')
    time.sleep(5)

# only record success if every URL returned status code 200
if failures == 0:
    with open('../data/url_validation.txt', 'w') as validation_file:
        validation_file.write('All URLs in all_submissions.txt have passed validation')
```
Output:
```
10352/10352
```

## Filtering URLs of CIK Submissions for 10Q Filings
Now that we have a URL for each CIK number that details all of the submissions the entity has provided, we can move on to filtering for the specific filing we want to train our model on. For this project, we will focus on 10-Q quarterly reports. To show how we will filter for only 10-Q filings, we first inspect the JSON returned for a single CIK.

```python
with open('../data/all_submissions.txt', 'r') as urls:
    lines = urls.readlines()

# Get the 0th ticker
response = requests.get(lines[0].strip(), headers=headers)
print(f'Status code {response.status_code} received for URL {lines[0].strip()}')
```
Output:
```
Status code 200 received for URL https://data.sec.gov/submissions/CIK0001978867.json
```

```python
raw_json = response.json()
```

```python
raw_json['tickers']
```
Output:
```
['CDLR']
```

```python
raw_json['exchanges']
```
Output:
```
['NYSE']
```

## Validating the number of values in each column
A row in our dataframe will consist of the values below, starting with the accession number. Each value will be a column in the dataframe, and every column should hold one value per submission, whether that value is non-null or a null/empty string. The checks below confirm that every list has the same length.
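
The per-column checks in this section can be condensed into one pass. This is a minimal sketch under the assumption that `raw_json['filings']['recent']` is a dict of equally long lists; `check_equal_lengths`, `filings_as_rows`, and the sample values are all made up for illustration.

```python
def check_equal_lengths(recent: dict) -> int:
    """Return the shared column length, or raise if the columns disagree."""
    lengths = {name: len(values) for name, values in recent.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # an empty dict has no columns and therefore no filings
    return next(iter(lengths.values()), 0)


def filings_as_rows(recent: dict, form: str = "10-Q") -> list:
    """Turn the parallel lists into one dict per filing, keeping only `form`."""
    count = check_equal_lengths(recent)
    return [
        {name: values[i] for name, values in recent.items()}
        for i in range(count)
        if recent["form"][i] == form
    ]


# Made-up sample shaped like raw_json['filings']['recent']
sample_recent = {
    "accessionNumber": ["0000000000-24-000003", "0000000000-24-000002", "0000000000-24-000001"],
    "form": ["10-Q", "6-K", "10-Q"],
    "primaryDocument": ["q2.htm", "notice.htm", "q1.htm"],
}
rows = filings_as_rows(sample_recent)
print(len(rows))                   # 2
print(rows[0]["primaryDocument"])  # q2.htm
```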

### Accession Number

```python
recent_filings = raw_json['filings']['recent']
print(len(recent_filings['accessionNumber']))
print(recent_filings['accessionNumber'][3])
```
Output:
```
91
0001104659-24-090342
```


### Filing Date

```python
print(len(recent_filings['filingDate']))
print(recent_filings['filingDate'][3])
```
Output:
```
91
2024-08-16
```


### Report Date

```python
print(len(recent_filings['reportDate']))
print(recent_filings['reportDate'][3])
```
Output:
```
91
2024-08-16
```


### Acceptance Date Time

```python
print(len(recent_filings['acceptanceDateTime']))
print(recent_filings['acceptanceDateTime'][3])
```
Output:
```
91
2024-08-16T08:03:51.000Z
```


### ACT

```python
print(len(recent_filings['act']))
print(recent_filings['act'][3])
```
Output:
```
91
34
```


### Form

```python
print(len(recent_filings['form']))
print(recent_filings['form'][3])
```
Output:
```
91
6-K
```


### File Number

```python
print(len(recent_filings['fileNumber']))
print(recent_filings['fileNumber'][3])
```
Output:
```
91
001-41889
```


### Film Number

```python
print(len(recent_filings['filmNumber']))
print(recent_filings['filmNumber'][3])
```
Output:
```
91
241214172
```


### Items

```python
print(len(recent_filings['items']))
print(recent_filings['items'][3])
```
Output (the second value is an empty string):
```
91

```


### Size

```python
print(len(recent_filings['size']))
print(recent_filings['size'][3])
```
Output:
```
91
19362
```


### Primary Document

```python
print(len(recent_filings['primaryDocument']))
print(recent_filings['primaryDocument'][3])
```
Output:
```
91
tm2421902d1_6k.htm
```


### Is XBRL

```python
print(len(recent_filings['isXBRL']))
print(recent_filings['isXBRL'][3])
```
Output:
```
91
0
```


### Is Inline XBRL

```python
print(len(recent_filings['isInlineXBRL']))
print(recent_filings['isInlineXBRL'][3])
```
Output:
```
91
0
```


### Primary Doc Description

```python
print(len(recent_filings['primaryDocDescription']))
print(recent_filings['primaryDocDescription'][3])
```
Output:
```
91
FORM 6-K
```


With the dashes removed from the accession number, these fields plug into the report URL template:
`https://www.sec.gov/Archives/edgar/data/{CIK_number}/{accessionNumber}/{primaryDocument}`

### Function to Scrape Submission Metadata
Submissions are grouped by CIK. The metadata for each CIK's submissions is served as JSON at a URL like the one below, where the digits are the CIK number padded with leading zeroes to a total of 10 digits.<br/>
Example: https://data.sec.gov/submissions/CIK0001076682.json

```python
# Loop through response.json()['filings']['recent'] for one CIK and populate
# one list per dataframe column, keeping only the 10-Q filings.
# cik_url_index is the 0-based index of the line in all_submissions.txt to process
def extract_10qs(cik_url_index=0):
    line_counter = cik_url_index
    tickers = []
    exchanges = []
    accession_numbers = []
    filing_dates = []
    report_dates = []
    acceptance_datetimes = []
    acts = []
    forms = []
    file_numbers = []
    film_numbers = []
    items = []
    sizes = []
    primary_documents = []
    is_XBRLs = []
    is_inline_XBRLs = []
    primary_doc_descriptions = []
    sources = []
    has_multi_tickers = []
    has_multi_exchanges = []
    all_submissions_line_numbers = []
    report_urls = []
    
    with open('../data/all_submissions.txt', 'r') as file_reader:
        lines = file_reader.readlines()
        line = lines[cik_url_index]
        print(f'Extracting reports from URL {line.strip()}', end='\r')

        # save cik number to build report url later
        # lstrip removes only the leading zeroes; strip would also drop trailing ones
        cik_number = line.split('/')[-1].split('.')[0][3:].lstrip('0')
        
        response = requests.get(line.strip(), headers=headers)

        # WAIT 1 SECOND TO NOT DDOS THE GOVERNMENT
        time.sleep(1)
        
        if response.status_code != 200:
            print(f'Status code {response.status_code} received for URL {line.strip()}. URL on line {line_counter + 1}')
            # skip this CIK: a failed request has no submissions JSON to parse
            return pd.DataFrame()
        json = response.json()
        curr_ticker = json['tickers']
        curr_exchange = json['exchanges']
        ticker_filings = json['filings']['recent']
        for i_curr_ticker_filings in range(len(ticker_filings['accessionNumber'])):
            if ticker_filings['form'][i_curr_ticker_filings] == '10-Q':
                if len(curr_ticker) == 0:
                    tickers.append(None)
                else:
                    tickers.append(curr_ticker[0])
                if len(curr_exchange) == 0:
                    exchanges.append(None)
                else:
                    exchanges.append(curr_exchange[0])
                accession_numbers.append(ticker_filings['accessionNumber'][i_curr_ticker_filings])
                filing_dates.append(ticker_filings['filingDate'][i_curr_ticker_filings])
                report_dates.append(ticker_filings['reportDate'][i_curr_ticker_filings])
                acceptance_datetimes.append(ticker_filings['acceptanceDateTime'][i_curr_ticker_filings])
                acts.append(ticker_filings['act'][i_curr_ticker_filings])
                forms.append(ticker_filings['form'][i_curr_ticker_filings])
                file_numbers.append(ticker_filings['fileNumber'][i_curr_ticker_filings])
                film_numbers.append(ticker_filings['filmNumber'][i_curr_ticker_filings])
                items.append(ticker_filings['items'][i_curr_ticker_filings])
                sizes.append(ticker_filings['size'][i_curr_ticker_filings])
                primary_documents.append(ticker_filings['primaryDocument'][i_curr_ticker_filings])
                is_XBRLs.append(ticker_filings['isXBRL'][i_curr_ticker_filings])
                is_inline_XBRLs.append(ticker_filings['isInlineXBRL'][i_curr_ticker_filings])
                primary_doc_descriptions.append(ticker_filings['primaryDocDescription'][i_curr_ticker_filings])
                sources.append(line)
                if len(curr_ticker) > 1:
                    has_multi_tickers.append(1)
                else:
                    has_multi_tickers.append(0)
                if len(curr_exchange) > 1:
                    has_multi_exchanges.append(1)
                else:
                    has_multi_exchanges.append(0)
                all_submissions_line_numbers.append(cik_url_index + 1)
                report_urls.append(f'https://www.sec.gov/Archives/edgar/data/{cik_number}/{ticker_filings["accessionNumber"][i_curr_ticker_filings].replace("-", "")}/{ticker_filings["primaryDocument"][i_curr_ticker_filings]}')
    return pd.DataFrame({
        'ticker': tickers,
        'exchange': exchanges,
        'accession_number': accession_numbers,
        'filing_date': filing_dates,
        'report_date': report_dates,
        'acceptance_datetime': acceptance_datetimes,
        'act': acts,
        'form': forms,
        'file_number': file_numbers,
        'film_number': film_numbers,
        'items': items,
        'size': sizes,
        'primary_document': primary_documents,
        'is_XBRL': is_XBRLs,
        'is_inline_XBRL': is_inline_XBRLs,
        'primary_doc_description': primary_doc_descriptions,
        'source': [source.strip() for source in sources],
        'has_multi_ticker': has_multi_tickers,
        'has_multi_exchange': has_multi_exchanges,
        'all_submissions_line_number': all_submissions_line_numbers,
        'report_url': report_urls
    })

functional_test_df = extract_10qs(cik_url_index=0)
functional_test_df.shape
```
Output:
```
Extracting reports from URL https://data.sec.gov/submissions/CIK0001686850.json
(19, 21)
```
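
The function recovers the CIK from the submissions URL before it builds report URLs. That step is sketched in isolation below to show why only the leading zeroes should be removed. `cik_from_submissions_url` is a hypothetical helper, not part of the notebook, and it assumes the file name always has the form `CIK<10 digits>.json`.

```python
def cik_from_submissions_url(url: str) -> str:
    """Extract the CIK, without leading zeroes, from a submissions URL."""
    file_name = url.strip().rsplit("/", 1)[-1]     # e.g. "CIK0001686850.json"
    digits = file_name[len("CIK"):-len(".json")]   # e.g. "0001686850"
    return digits.lstrip("0")


url = "https://data.sec.gov/submissions/CIK0001686850.json\n"
print(cik_from_submissions_url(url))  # 1686850
print("0001686850".strip("0"))        # 168685 (the trailing zero is lost)
```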

### Inspecting the Data Types

```python
functional_test_df.dtypes
```

Output:<br/>
```
ticker                         object
exchange                       object
accession_number               object
filing_date                    object
report_date                    object
acceptance_datetime            object
act                            object
form                           object
file_number                    object
film_number                    object
items                          object
size                            int64
primary_document               object
is_XBRL                         int64
is_inline_XBRL                  int64
primary_doc_description        object
source                         object
has_multi_ticker                int64
has_multi_exchange              int64
all_submissions_line_number     int64
report_url                     object
dtype: object
```

The `extract_10qs()` function extracts the 10-Q metadata for a single CIK, selected by `cik_url_index`, and returns it as a dataframe. Calling it over a range of indices and concatenating each result to a main dataframe builds up the full dataset. Gathering this metadata takes some time, so it is a big help to save the dataframe to a CSV periodically; extraction can then be paused and resumed later from the last saved index.

```python
for i in functional_test_df.head().index:
    print(functional_test_df.iloc[i]["report_url"])
```
Output:
```
https://www.sec.gov/Archives/edgar/data/1686850/000149315224019257/form10-q.htm
https://www.sec.gov/Archives/edgar/data/1686850/000149315223040502/form10-q.htm
https://www.sec.gov/Archives/edgar/data/1686850/000149315223028452/form10-q.htm
https://www.sec.gov/Archives/edgar/data/1686850/000149315223016166/form10-q.htm
https://www.sec.gov/Archives/edgar/data/1686850/000149315222032096/form10-q.htm
```

### Gathering All 10-Q Reports

The block of code below was executed to scrape the JSON returned for each URL in `all_submissions.txt` and build a dataframe of only 10-Q reports. We periodically save the CSV that is being written in case of any interruptions in execution. After an interruption, we can check the `all_submissions_line_number` value of the latest written row and restart the loop from that index. Human error could lead to resuming from the wrong index, so we will need to validate and clean the final CSV as needed.

```python
df = pd.DataFrame(columns=['ticker', 'exchange', 'accession_number', 'filing_date', 'report_date', 'acceptance_datetime', 'act', 'form', 'file_number', 'film_number', 'items', 'size', 'primary_document', 'is_XBRL', 'is_inline_XBRL', 'primary_doc_description', 'source', 'has_multi_ticker', 'has_multi_exchange', 'all_submissions_line_number', 'report_url'])
                              
with open('../data/all_submissions.txt', 'r') as submissions:
    lines = submissions.readlines()
    # Modify start of range as needed
    for i_line in range(0, len(lines)):
        df = pd.concat([df, extract_10qs(i_line)], axis=0)
        df.to_csv('../data/debug/all_10qs.csv')
```
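
The restart index for the loop above can be read back from the saved CSV rather than looked up by hand. This is a sketch under stated assumptions: `resume_index` is a hypothetical helper, the CSV text is passed in as a string, and the last recorded line is deliberately processed again, so duplicate rows still need to be dropped afterwards.

```python
import csv
import io


def resume_index(csv_text: str) -> int:
    """Return the 0-based index in all_submissions.txt to restart from."""
    reader = csv.DictReader(io.StringIO(csv_text))
    line_numbers = [int(row["all_submissions_line_number"]) for row in reader]
    if not line_numbers:
        return 0  # nothing saved yet, start from the first line
    # saved line numbers are 1-based; subtracting 1 re-runs the last saved line
    return max(line_numbers) - 1


# Made-up checkpoint contents with only the columns needed here
sample_csv = "ticker,all_submissions_line_number\nAAA,1\nAAA,1\nBBB,3\n"
print(resume_index(sample_csv))  # 2
```

The result could be used as the start of `range(...)` in the loop above.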

I had two interruptions due to my machine going to sleep while the above block ran. For example, the process ended at line number 7393, but I restarted it at 7390. I didn't want to spend time in that moment working out which 10-Q report URLs from the JSON on line 7393 had already been built, so I backtracked a few line indices to be sure I had all the data I needed, at the cost of a few seconds of execution and some duplicate cleanup. I manually moved the final `all_10qs.csv` out of the debug folder to the data folder. Let's do some of that cleanup in the next notebook.