# Scraping the Data
_Author_: https://github.com/raffysantayana

## Goal
Use the US Securities and Exchange Commission's (SEC) electronic filing system to programmatically retrieve, parse, and organize filing data so that it can later be explored, analyzed, and modeled.

## Overview
The SEC archives quarterly reports from filing entities such as Netflix Inc. (NFLX) and American Express Co. (AXP).

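This document covers two ways to retrieve these filings: the `sec_api` Python package, which requires a paid subscription for more than 100 requests, and the SEC's own EDGAR REST API, which we query directly.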

## Using the SEC API
Note: This requires a subscription of $55/month to make 100+ requests.
```python
import time
import pandas as pd
from sec_api import QueryApi

# main dataframe we will append each query results to
df = pd.DataFrame()

# paste your API key below (never commit a real key to version control)
sec_api_key: str = 'api_key'
query_api = QueryApi(api_key = sec_api_key)

base_query = {
  "query": "PLACEHOLDER", # this will be set during runtime 
  "from": "0",
  "size": "200", # dont change this
  # sort by filedAt
  "sort": [{ "filedAt": { "order": "desc" } }]
}

# open the file we use to store the filing URLs
log_file = open("filing_urls.txt", "a")

# start with filings filed in 2024, then 2023, 2022, ... down to 1997
# uncomment line below to fetch only filings filed in 2021-2010
# for year in range(2021, 2009, -1):
for year in range(2024, 1996, -1):
    print("starting {year}".format(year=year))

    # a single search universe is represented as a month of the given year
    for month in range(1, 13, 1):
        # get 10-Q and 10-Q/A filings filed in year and month
        # resulting query example: "formType:\"10-Q\" AND filedAt:[2021-01-01 TO 2021-01-31]"
        universe_query = \
            "formType:\"10-Q\" AND " + \
            "filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-31]" \
            .format(year=year, month=month)

        print(universe_query)
        # set new query universe for year-month combination
        # (this and the pagination loop must run inside the month loop,
        # otherwise only the last month of each year is queried)
        base_query["query"] = universe_query

        # paginate through results by increasing "from" parameter
        # until we don't find any matches anymore
        # the upper bound is only a cap; the loop exits on the first empty page
        for from_batch in range(0, 999_800, 200):
        # uncomment line below to fetch at most 400 filings per month
        # for from_batch in range(0, 400, 200):
            # set new "from" starting position of search
            base_query["from"] = from_batch

            # submit request
            response = query_api.get_filings(base_query)

            # no more filings in search universe
            if len(response["filings"]) == 0:
                break

            # building a temp dataframe of the recent query
            temp_df = pd.DataFrame.from_records(response['filings'])
            # concatenating the temp dataframe to the main dataframe
            df = pd.concat([df, temp_df])
            print(f'df.shape = {df.shape}')

            # for each filing, only save the URL pointing to the filing itself
            # and ignore all other data.
            # the URL is set in the dict key "linkToFilingDetails"
            urls_list = [filing["linkToFilingDetails"] for filing in response["filings"]]

            # transform list of URLs into one string by joining all list elements
            # and add a new-line character between each element.
            urls_string = "\n".join(urls_list) + "\n"

            log_file.write(urls_string)

log_file.close()
```
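
The query above uses day 31 as the end date for every month. As a minimal sketch that is not part of the original script, exact month boundaries could be computed with the standard library's `calendar.monthrange`. The helper name `month_query` is hypothetical, and whether the Query API treats an out-of-range day such as `02-31` differently has not been verified here.

```python
import calendar


def month_query(year: int, month: int, form_type: str = "10-Q") -> str:
    """Build a query string covering exactly one calendar month (hypothetical helper)."""
    # monthrange returns (weekday of the first day, number of days in the month)
    _, last_day = calendar.monthrange(year, month)
    return (
        f'formType:"{form_type}" AND '
        f"filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-{last_day:02d}]"
    )


print(month_query(2021, 2))
# formType:"10-Q" AND filedAt:[2021-02-01 TO 2021-02-28]
```

The returned string could be assigned to `base_query["query"]` in place of the hand-formatted one.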

## Web Scraping
The SEC has an electronic filing system, Electronic Data Gathering, Analysis, and Retrieval (EDGAR), that started around 1995 and archives reports such as the quarterly 10-Q. EDGAR has a RESTful API, documented at [this URL](https://www.sec.gov/edgar/sec-api-documentation), for retrieving report information. Each entity's current filing history is available at the following URL, where `CIK_number` is the entity's 10-digit CIK number, zero-padded on the left:
`https://data.sec.gov/submissions/CIK{CIK_number}.json`
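
As a small illustration of the zero-padding, the sketch below builds a submissions URL from a CIK. The function name `submissions_url` is hypothetical; the CIK used in the example is the one listed for MSFT in `company_tickers.json`.

```python
def submissions_url(cik: int) -> str:
    """Return the EDGAR submissions URL for a CIK, zero-padded to 10 digits."""
    # int() also accepts a digit string such as "789019"
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


print(submissions_url(789019))
# https://data.sec.gov/submissions/CIK0000789019.json
```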

The returned JSON contains fields such as `accessionNumber` and `primaryDocument`, stored as parallel lists: the entry at a given index of `accessionNumber` belongs to the same filing as the entry at that index of `primaryDocument`. Combining these two values with the CIK number lets us construct a URL for each of that entity's filings. Our goal is to analyze quarterly reports, so we keep only the entries whose `form` value is "10-Q". Note that `accessionNumber` is returned with dashes, which are removed in the URL path. The URL we will construct is:
`https://www.sec.gov/Archives/edgar/data/{CIK_number}/{accessionNumber}/{primaryDocument}`

For example, https://www.sec.gov/Archives/edgar/data/0001445815/000149315224015525/form10-qa.htm
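
The URL pattern can be sketched as a small function. The name `filing_url` is hypothetical, and the dashed accession number in the usage line is the example's accession number written in the dashed form that the submissions JSON uses.

```python
def filing_url(cik: str, accession_number: str, primary_document: str) -> str:
    """Build the archive URL for one filing (hypothetical helper)."""
    # the archive path uses the accession number without dashes
    accession_path = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik}/{accession_path}/{primary_document}"
    )


print(filing_url("0001445815", "0001493152-24-015525", "form10-qa.htm"))
# https://www.sec.gov/Archives/edgar/data/0001445815/000149315224015525/form10-qa.htm
```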

A list of all CIK numbers to iterate through can be found [here](https://www.sec.gov/Archives/edgar/cik-lookup-data.txt).

In [2]:
import time
import requests
import pandas as pd

tickers_url:str = r'https://www.sec.gov/files/company_tickers.json'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(tickers_url, headers=headers)
if response.status_code != 200:
    raise Exception("Failed to get a 200 status code")
else:
    print(f"Successful response from {tickers_url}.")
raw_tickers = response.json()
print(f"{len(raw_tickers)} retrieved.")

Successful response from https://www.sec.gov/files/company_tickers.json.
10352 retrieved.


In [2]:
# raw_tickers maps "0", "1", ... to {cik_str, ticker, title} records,
# so each outer key becomes a row label and each record a row
tickers = pd.DataFrame.from_dict(raw_tickers, orient='index')

In [3]:
tickers.shape

(10352, 3)

In [4]:
tickers.dtypes

cik_str     int64
ticker     object
title      object
dtype: object

In [5]:
tickers.head()

| | cik_str | ticker | title |
|---|---|---|---|
| 0 | 789019 | MSFT | MICROSOFT CORP |
| 1 | 320193 | AAPL | Apple Inc. |
| 2 | 1045810 | NVDA | NVIDIA CORP |
| 3 | 1652044 | GOOGL | Alphabet Inc. |
| 4 | 1018724 | AMZN | AMAZON COM INC |


In [6]:
tickers.tail()

| | cik_str | ticker | title |
|---|---|---|---|
| 10347 | 1748680 | OWSCX | 1WS Credit Income Fund |
| 10348 | 1957489 | ABLVW | Able View Global Inc. |
| 10349 | 1933644 | MDLVY | Medlive Technology Co., Ltd./ADR |
| 10350 | 1062750 | SAAYY | SAIPEM S P A /FI |
| 10351 | 1572957 | BGLAF | BioGaia AB/ADR |


In [7]:
tickers.to_csv("../data/tickers.csv")

In [24]:
with open('../data/all_submissions.txt', 'w') as the_file:
    for cik in tickers['cik_str']:
        the_file.write(f'https://data.sec.gov/submissions/CIK{cik:010d}.json\n')
        # print(f'https://data.sec.gov/submissions/CIK{cik:010d}.json\n')

In [29]:
with open('../data/all_submissions.txt', 'r') as urls:
    lines = urls.readlines()
# count failures separately so the line counter stays accurate
failures = 0
for counter, line in enumerate(lines, start=1):
    response = requests.get(line.strip(), headers=headers)
    if response.status_code != 200:
        print(f'{line.strip()} might not be a valid URL. Status code {response.status_code} received. This URL is on line {counter}.')
        failures += 1
    print(f'{counter:05d}/{len(lines)}', end='\r')
    time.sleep(5)
# only record success when every URL returned a 200 status code
if failures == 0:
    with open('../data/url_validation.txt', 'w') as validation_file:
        validation_file.write('All URLs in all_submissions.txt have passed validation')

10352/10352
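
The validation logic can also be separated from the network call, which makes it possible to exercise it without contacting the SEC servers. The following is only a sketch under that assumption: `find_invalid_urls` and `fake_status` are hypothetical names, and `fake_status` stands in for a real request.

```python
import time
from typing import Callable, Iterable


def find_invalid_urls(
    urls: Iterable[str],
    get_status: Callable[[str], int],
    delay_seconds: float = 0.0,
) -> list:
    """Return (line_number, url, status) for every URL whose status is not 200."""
    invalid = []
    for line_number, url in enumerate(urls, start=1):
        url = url.strip()
        status = get_status(url)
        if status != 200:
            invalid.append((line_number, url, status))
        # pause between requests to stay well under any rate limit
        time.sleep(delay_seconds)
    return invalid


def fake_status(url: str) -> int:
    """Stand-in for a real request: only a URL with a missing CIK fails."""
    return 404 if url.endswith("CIK.json") else 200


sample_urls = [
    "https://data.sec.gov/submissions/CIK0000789019.json\n",
    "https://data.sec.gov/submissions/CIK.json\n",
]
print(find_invalid_urls(sample_urls, fake_status))
# [(2, 'https://data.sec.gov/submissions/CIK.json', 404)]
```

With `requests`, the status function could be `lambda url: requests.get(url, headers=headers).status_code`, reusing the `headers` defined earlier.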

## Filtering URLs of CIK Submissions for 10-Q Filings
Now that we have a submissions URL for every CIK number, each listing the filings that entity has submitted, we can filter for the specific filing type we want to train our model on. For this project, we focus on 10-Q quarterly reports.
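
A minimal sketch of that filter is shown below, assuming the column-oriented layout of the `filings.recent` object in a submissions response, where `form`, `accessionNumber`, and `primaryDocument` are parallel lists. The function name `ten_q_documents` is hypothetical and the sample values are made-up placeholders, not real filings.

```python
def ten_q_documents(recent: dict) -> list:
    """Return (accessionNumber, primaryDocument) pairs for filings whose form is 10-Q."""
    return [
        (accession, document)
        for form, accession, document in zip(
            recent["form"], recent["accessionNumber"], recent["primaryDocument"]
        )
        if form == "10-Q"
    ]


# Placeholder data in the same shape as raw_json['filings']['recent']
sample_recent = {
    "form": ["8-K", "10-Q", "10-K", "10-Q"],
    "accessionNumber": [
        "0000000000-24-000001",
        "0000000000-24-000002",
        "0000000000-24-000003",
        "0000000000-24-000004",
    ],
    "primaryDocument": ["a.htm", "b.htm", "c.htm", "d.htm"],
}

print(ten_q_documents(sample_recent))
# [('0000000000-24-000002', 'b.htm'), ('0000000000-24-000004', 'd.htm')]
```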

In [43]:
with open('../data/all_submissions.txt', 'r') as urls:
    lines = urls.readlines()
# check only the first 10 URLs as a quick test
sample_lines = lines[:10]
failures = 0
for counter, line in enumerate(sample_lines, start=1):
    response = requests.get(line.strip(), headers=headers)
    if response.status_code != 200:
        print(f'{line.strip()} might not be a valid URL. Status code {response.status_code} received. This URL is on line {counter}.')
        failures += 1
    print(f'{counter:02d}/{len(sample_lines)}', end='\r')
    time.sleep(1)
# only record success if every URL in the file was checked and passed
if failures == 0 and len(sample_lines) == len(lines):
    with open('../data/url_validation.txt', 'w') as validation_file:
        validation_file.write('All URLs in all_submissions.txt have been validated')

https://data.sec.gov/submissions/CIK.json might not be a valid URL. Status code 404 received. This URL is on line 3.
09/10

In [4]:
with open('../data/all_submissions.txt', 'r') as urls:
    lines = urls.readlines()
# one row per filing; column names mirror the fields of the submissions JSON
df = pd.DataFrame(columns=['tickers', 'exchanges', 'accession_number', 'filing_date', 'report_date', 'acceptance_datetime', 'act',
                          'form', 'file_number', 'film_number', 'items', 'size', 'primary_document', 'is_XBRL', 'is_inline_XBRL', 
                           'primary_doc_description'])

# fetch the submissions JSON for the first CIK in the file
response = requests.get(lines[0].strip(), headers=headers)
if (response.status_code != 200):
    print(f'Status code {response.status_code} received for URL {lines[0].strip()}')

In [6]:
raw_json = response.json()

In [12]:
raw_json['tickers']

['MSFT']

In [13]:
raw_json['exchanges']

['Nasdaq']

In [30]:
recent_filings = raw_json['filings']['recent']
recent_filings['accessionNumber']
print(len(recent_filings['accessionNumber']))

1001


In [31]:
recent_filings['filingDate']
print(len(recent_filings['filingDate']))

1001


In [32]:
recent_filings['reportDate']
print(len(recent_filings['reportDate']))

1001


In [34]:
recent_filings['acceptanceDateTime']
print(len(recent_filings['acceptanceDateTime']))

1001


In [36]:
recent_filings['act']
print(len(recent_filings['act']))

1001


In [37]:
recent_filings['form']
print(len(recent_filings['form']))

1001


In [38]:
recent_filings['fileNumber']
print(len(recent_filings['fileNumber']))

1001


In [39]:
recent_filings['filmNumber']
print(len(recent_filings['filmNumber']))

1001


In [40]:
recent_filings['items']
print(len(recent_filings['items']))

1001


In [41]:
recent_filings['size']
print(len(recent_filings['size']))

1001


In [42]:
recent_filings['primaryDocument']
print(len(recent_filings['primaryDocument']))

1001


In [43]:
recent_filings['isXBRL']
print(len(recent_filings['isXBRL']))

1001


In [44]:
recent_filings['isInlineXBRL']
print(len(recent_filings['isInlineXBRL']))

1001


In [45]:
recent_filings['primaryDocDescription']
print(len(recent_filings['primaryDocDescription']))

1001


In [None]:
# Make a function that loops through the length of response.json()['filings']['recent'] and populates a list
# whose items are in the order of the dataframe columns
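
One possible shape for the function described in the note above is sketched here. It is an assumption about the intended design, not the author's implementation: the names `COLUMN_MAP` and `filings_to_rows` are hypothetical, and the sample input is synthetic placeholder data built only to show the transformation.

```python
# Maps each key in filings.recent to the matching dataframe column name
COLUMN_MAP = {
    "accessionNumber": "accession_number",
    "filingDate": "filing_date",
    "reportDate": "report_date",
    "acceptanceDateTime": "acceptance_datetime",
    "act": "act",
    "form": "form",
    "fileNumber": "file_number",
    "filmNumber": "film_number",
    "items": "items",
    "size": "size",
    "primaryDocument": "primary_document",
    "isXBRL": "is_XBRL",
    "isInlineXBRL": "is_inline_XBRL",
    "primaryDocDescription": "primary_doc_description",
}


def filings_to_rows(raw_json: dict) -> list:
    """Turn the parallel lists in filings.recent into one dict per filing."""
    recent = raw_json["filings"]["recent"]
    rows = []
    for i in range(len(recent["accessionNumber"])):
        # entity-level fields are repeated on every row
        row = {"tickers": raw_json["tickers"], "exchanges": raw_json["exchanges"]}
        for json_key, column in COLUMN_MAP.items():
            row[column] = recent[json_key][i]
        rows.append(row)
    return rows


# Synthetic input with two placeholder filings
sample_json = {
    "tickers": ["MSFT"],
    "exchanges": ["Nasdaq"],
    "filings": {"recent": {key: [f"{key}-0", f"{key}-1"] for key in COLUMN_MAP}},
}

rows = filings_to_rows(sample_json)
print(len(rows), rows[1]["primary_document"])
# 2 primaryDocument-1
```

Passing the result to `pd.DataFrame(rows)` would give a dataframe with the same column names as the empty `df` defined earlier.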