<a href="https://colab.research.google.com/gist/janlukasschroeder/856848c3666f9688a011b3a77516aefb/download-10-k-filings-from-sec-edgar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to download and scrape 10-K filings from SEC EDGAR

This tutorial shows how to download and scrape 10-K filings from SEC EDGAR to your local disk. We use Python 3 and the SEC-API.io Python package to help us find the links to all 10-K filings on EDGAR and then download them.

Our SEC filings download application will be structured into two components:
1. The first component of our Python application finds the URLs of all 10-K filings on EDGAR filed between 1995 and 2022. We also consider older 10-K variants, that is, 10-KT, 10KSB, 10KT405, 10KSB40 and 10-K405. Our application also includes all amended filings, for example 10-K/A. Once we have generated a complete list of all 10-K filing URLs, we save the list to a file on our hard disk.
2. The second component reads the URLs from the file and downloads all annual reports. We download up to 20 filings in parallel using the Render API of the SEC-API package and use Python's multiprocessing package to speed up the download process.

# Getting Started

Let's start by installing the SEC-API Python package.



In [None]:
!pip install sec-api

Head over to https://sec-api.io to get your free API key so that we can start searching the SEC EDGAR database for 10-K filings.

In [2]:
from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

The Query API is a search interface that lets us find SEC filings across the entire EDGAR database by any filing metadata parameter. For example, we can find all filings filed by Microsoft using a ticker search (`ticker:MSFT`) or build more complex search expressions using Boolean operators and brackets. The Query API returns the metadata of SEC filings matching the search query, such as filer details (e.g. ticker and company name), URLs to the filing and all exhibits, filing date, form type and more.

We're looking for all filings with form type 10-K and its variants: 10-KT, 10KSB, 10KT405, 10KSB40, 10-K405. So, the Query API form type filter comes in handy.

The search query string looks like this:

```txt
formType:("10-K", "10-KT", "10KSB", "10KT405", "10KSB40", "10-K405")
```

The brackets tell the Query API to include a filing in the response if the form type is either 10-K, or 10-KT, or 10KSB, and so on.
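As a small illustration, the filter string above could be assembled from a plain Python list of form types. The helper name `build_form_type_filter` is hypothetical and not part of the SEC-API package; it only formats a string.

```python
# Hypothetical helper (not part of sec-api): builds the form type filter string
FORM_TYPES = ["10-K", "10-KT", "10KSB", "10KT405", "10KSB40", "10-K405"]

def build_form_type_filter(form_types):
  # wrap each form type in double quotes and join them with ", "
  quoted = ", ".join('"{}"'.format(form_type) for form_type in form_types)
  return "formType:({})".format(quoted)

form_type_filter = build_form_type_filter(FORM_TYPES)
print(form_type_filter)
```

Keeping the form types in a list makes it easy to add or remove a variant without editing the query string by hand.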

Let's start by fetching a single 10-K filing filed by Tesla (ticker TSLA).

In [9]:
query = {
  "query": { "query_string": {
      "query": "formType:\"10-K\" AND ticker:TSLA", # only 10-K filings filed by Tesla
  }},
  "from": "0", # start returning matches from position 0, i.e. the first matching filing
  "size": "1"  # return just one filing
}

response = queryApi.get_filings(query)

The response returned by the Query API in Python is a dictionary (short: dict) with two keys: `total` and `filings`.

The value of `total` is a dict itself and tells us, among other things, how many filings in total match our search query. The value of `filings` is a list of dicts, where each dict holds all metadata of a matching filing.
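To make the structure concrete, here is a minimal sketch that walks a hand-written mock response without calling the API. The `value` key inside `total` and the second filing's URL are assumptions made for illustration; print `response["total"]` on a real response to see what it actually contains.

```python
# Mock response mimicking the structure described above (no network access).
# ASSUMPTION: the "value" key inside "total" is illustrative only.
sample_response = {
  "total": {"value": 2},
  "filings": [
    {
      "ticker": "TSLA",
      "formType": "10-K",
      "linkToFilingDetails": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm",
    },
    {
      "ticker": "TSLA",
      "formType": "10-K",
      # made-up placeholder URL, not a real filing
      "linkToFilingDetails": "https://example.com/placeholder-filing.htm",
    },
  ],
}

def summarize(response):
  # collect one (ticker, formType) pair per filing dict
  return [(filing["ticker"], filing["formType"]) for filing in response["filings"]]

summary = summarize(sample_response)
print(summary)
```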

We use the `json` Python package to pretty-print the first filing to the console to explore the structure of a filing dict.

In [10]:
import json 
print(json.dumps(response["filings"][0], indent=2))

{
  "ticker": "TSLA",
  "formType": "10-K",
  "accessionNo": "0001193125-14-069681",
  "cik": "1318605",
  "companyNameLong": "TESLA MOTORS INC (Filer)",
  "companyName": "TESLA MOTORS INC",
  "linkToFilingDetails": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm",
  "description": "Form 10-K - Annual report [Section 13 and 15(d), not S-K Item 405]",
  "linkToTxt": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/0001193125-14-069681.txt",
  "filedAt": "2014-02-26T16:02:51-05:00",
  "documentFormatFiles": [
    {
      "sequence": "1",
      "size": "1589148",
      "documentUrl": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm",
      "description": "10-K",
      "type": "10-K"
    },
    {
      "sequence": "2",
      "size": "71602",
      "documentUrl": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062dex1035a.htm",
      "description": "EX-10.35A",
      "type": "EX

The URL of the 10-K filing is the value of the `linkToFilingDetails` key in each filing dict, for example:
https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm

We see that information such as the filer ticker and CIK, company name, and all links and types of filing attachments (e.g. XBRL) is included as well. If you wanted to download, say, the XBRL attachments of 10-K filings, you could use the same approach we implement here.
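As a rough sketch of that idea, the attachment URLs can be read from the `documentFormatFiles` list of a filing dict. The helper `attachment_urls` is hypothetical, and the sample dict is a shortened copy of the output printed above; the second entry omits `type` because that value is cut off in the printed output.

```python
# Shortened copy of the filing dict printed above (only the keys used here)
sample_filing = {
  "documentFormatFiles": [
    {
      "sequence": "1",
      "documentUrl": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm",
      "description": "10-K",
      "type": "10-K",
    },
    {
      "sequence": "2",
      "documentUrl": "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062dex1035a.htm",
      "description": "EX-10.35A",
    },
  ]
}

# Hypothetical helper: URLs of all attachments whose description starts with a prefix
def attachment_urls(filing, description_prefix):
  return [
    doc["documentUrl"]
    for doc in filing["documentFormatFiles"]
    if doc.get("description", "").startswith(description_prefix)
  ]

exhibit_urls = attachment_urls(sample_filing, "EX-")
print(exhibit_urls)
```

The resulting URLs could then be written to a file in the same way as the `linkToFilingDetails` values.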

To generate a complete list of 10-K URLs, we simply iterate over all filing dicts, read the `linkToFilingDetails` value and write the URL to a local file.

One more thing: the Query API returns a maximum of 200 filings per search request and a maximum of 10,000 filings per search universe. That's why we paginate over the search results, i.e. we request the first "page" of matches with 200 filings, then the second "page", and so on, until we have iterated through all filings filed between 1995 and 2022.
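The pagination idea can be sketched without network access. In the sketch below, `fetch_page` is a hypothetical stand-in for a Query API call, and the fake data source exists only to exercise the loop; the page size of 200 and the cap of 10,000 are taken from the limits described above.

```python
def paginate(fetch_page, page_size=200, max_results=10000):
  # fetch_page(offset, size) stands in for one search request and returns a list
  results = []
  for offset in range(0, max_results, page_size):
    page = fetch_page(offset, page_size)
    if len(page) == 0:
      break  # no more matches in this search universe
    results.extend(page)
  return results

# fake data source with 450 items, used instead of the real API
fake_filings = ["filing-{}".format(i) for i in range(450)]

def fake_fetch(offset, size):
  return fake_filings[offset:offset + size]

all_filings = paginate(fake_fetch)
print(len(all_filings))
```

The loop stops as soon as a page comes back empty, so small search universes only cost a few requests.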

# 1. Generate list of URLs of all 10-K filings

This chapter implements the first of our two components and explains how to generate the list of 10-K URLs and save the list to a file.

The following `base_query` is reused and updated for each request allowing us to page through all results in the next part of the code. 

In [11]:
base_query = {
  "query": { 
      "query_string": { 
          "query": "PLACEHOLDER", # this will be set during runtime 
          "time_zone": "America/New_York"
      } 
  },
  "from": "0",
  "size": "200", # don't change this
  # sort returned filings by the filedAt key/value
  "sort": [{ "filedAt": { "order": "desc" } }]
}

On each search request, the `PLACEHOLDER` in the `base_query` is replaced with our form type filter and with a date range filter. The complete Python code for downloading all URLs of filings filed between 1995 and 2022 is shown and explained below.
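As a minimal sketch, the query string for one year-month universe could be built by a small helper. The function name `build_universe_query` is hypothetical. Unlike the tutorial's code, which uses day 31 as the end of every month, this variation looks up the actual last day with `calendar.monthrange`; whether the API requires that is not something this tutorial establishes.

```python
import calendar

FORM_TYPE_FILTER = 'formType:("10-K", "10-KT", "10KSB", "10KT405", "10KSB40", "10-K405")'

# Hypothetical helper: form type filter plus a date range covering one month
def build_universe_query(year, month):
  # calendar.monthrange returns (weekday of first day, number of days in month)
  last_day = calendar.monthrange(year, month)[1]
  date_filter = "filedAt:[{y}-{m:02d}-01 TO {y}-{m:02d}-{d:02d}]".format(
    y=year, m=month, d=last_day
  )
  return FORM_TYPE_FILTER + " AND " + date_filter

print(build_universe_query(2021, 2))
```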

> Be aware that it takes some time to download and save all URLs. Plan at least 30 minutes for running your application without interruption. 

The URL downloader appends a new URL to the log file `filing_urls.txt` on each processing iteration. In case you accidentally shut down your application, you can start off from the most recently processed year without having to download already processed URLs again.
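Because the file is opened in append mode, rerunning a partially processed year can write the same URL more than once. A possible way to clean this up afterwards is sketched below; the helper `unique_urls` is hypothetical, and the demo uses a temporary file with made-up URLs rather than the real `filing_urls.txt`.

```python
import os
import tempfile

# Hypothetical helper: read URLs from a file and drop duplicates, keeping order
def unique_urls(path):
  with open(path, "r") as f:
    lines = [line.strip() for line in f if line.strip()]
  # dict.fromkeys keeps the first occurrence of each URL in insertion order
  return list(dict.fromkeys(lines))

# demo with a temporary file and placeholder URLs
demo_path = os.path.join(tempfile.mkdtemp(), "filing_urls.txt")
with open(demo_path, "w") as f:
  f.write("https://example.com/a.htm\nhttps://example.com/b.htm\nhttps://example.com/a.htm\n")

deduplicated = unique_urls(demo_path)
print(deduplicated)
```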

> Uncomment the following two lines in your code if you want to generate all URLs at once. I deliberately commented them out to provide a quick-running example of the entire code without having to wait 30+ minutes to see results.
- `for year in range(2022, 1994, -1):` and
- `for from_batch in range(0, 9800, 200):`


In [15]:
# open the file we use to store the filing URLs
log_file = open("filing_urls.txt", "a")

# start with filings filed in 2022, then 2021, 2020, ... down to 1995
# uncomment next line to fetch all filings filed from 2022 back to 1995
# for year in range(2022, 1994, -1):
for year in range(2022, 2020, -1):
  print("Starting download for year {year}".format(year=year))
  
  # a single search universe is represented as a month of the given year
  for month in range(1, 13, 1):
    # get 10-K filings (and variants) filed in year and month
    # resulting query example: "formType:(\"10-K\", ...) AND filedAt:[2021-01-01 TO 2021-01-31]"
    universe_query = \
        "formType:(\"10-K\", \"10-KT\", \"10KSB\", \"10KT405\", \"10KSB40\", \"10-K405\") AND " + \
        "filedAt:[{year}-{month:02d}-01 TO {year}-{month:02d}-31]" \
        .format(year=year, month=month)
  
    # set new query universe for year-month combination
    base_query["query"]["query_string"]["query"] = universe_query

    # paginate through results by increasing "from" parameter 
    # until we don't find any matches anymore
    # uncomment next line to fetch all 10,000 filings
    # for from_batch in range(0, 9800, 200): 
    for from_batch in range(0, 400, 200):
      # set new "from" starting position of search 
      base_query["from"] = from_batch

      response = queryApi.get_filings(base_query)

      # no more filings in search universe
      if len(response["filings"]) == 0:
        break

      # for each filing, only save the URL pointing to the filing itself 
      # and ignore all other data. 
      # the URL is set in the dict key "linkToFilingDetails"
      urls_list = list(map(lambda x: x["linkToFilingDetails"], response["filings"]))

      # transform list of URLs into one string by joining all list elements
      # and add a new-line character between each element.
      urls_string = "\n".join(urls_list) + "\n"
      
      log_file.write(urls_string)

    print("Filing URLs downloaded for {year}-{month:02d}".format(year=year, month=month))

log_file.close()

print("All URLs downloaded")

Starting download for year 2022
Filing URLs downloaded for 2022-01
Filing URLs downloaded for 2022-02
Filing URLs downloaded for 2022-03
Filing URLs downloaded for 2022-04
Filing URLs downloaded for 2022-05
Filing URLs downloaded for 2022-06
Filing URLs downloaded for 2022-07
Filing URLs downloaded for 2022-08
Filing URLs downloaded for 2022-09
Filing URLs downloaded for 2022-10
Filing URLs downloaded for 2022-11
Filing URLs downloaded for 2022-12
Starting download for year 2021
Filing URLs downloaded for 2021-01
Filing URLs downloaded for 2021-02
Filing URLs downloaded for 2021-03
Filing URLs downloaded for 2021-04
Filing URLs downloaded for 2021-05
Filing URLs downloaded for 2021-06
Filing URLs downloaded for 2021-07
Filing URLs downloaded for 2021-08
Filing URLs downloaded for 2021-09
Filing URLs downloaded for 2021-10
Filing URLs downloaded for 2021-11
Filing URLs downloaded for 2021-12


# 2. Download all 10-Ks from SEC EDGAR

The second component of our filing download application loads all 10-K URLs from our log file `filing_urls.txt` into memory, and downloads 20 filings in parallel into the folder `filings`.

We use the Render API interface of the SEC-API Python package to download a filing by providing its URL. The Render API allows us to download up to 40 SEC filings per second in parallel. However, we don't use the full bandwidth of the API; otherwise we would very likely run into out-of-memory errors, considering that some filings are larger than 400 MB.


In [16]:
from sec_api import RenderApi

renderApi = RenderApi(api_key="YOUR_API_KEY")

The `download_filing` function downloads the filing from the URL, generates a file name using the last two parts of the URL and saves the downloaded file to the `filings` folder.
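The file naming step can be tried on its own before downloading anything. The helper name `file_name_from_url` is hypothetical; it applies the same split-and-join logic as described above to the Tesla URL shown earlier.

```python
# Hypothetical helper: derive a local file name from the last two URL parts
def file_name_from_url(url):
  parts = url.split("/")
  # second-to-last part is the accession number folder, last part is the document
  return parts[-2] + "-" + parts[-1]

example_url = "https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm"
file_name = file_name_from_url(example_url)
print(file_name)  # 000119312514069681-d668062d10k.htm
```

Prefixing the document name with the accession number folder keeps file names unique even when two filings use the same document name.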

In [43]:
# download filing and save to "filings" folder
def download_filing(url):
  try:
    filing = renderApi.get_filing(url)
    # file_name example: 000156459019027952-msft-10k_20190630.htm
    file_name = url.split("/")[-2] + "-" + url.split("/")[-1] 
    download_to = "./filings/" + file_name
    with open(download_to, "w") as f:
      f.write(filing)
  except Exception as e:
    print("Problem with {url}".format(url=url))
    print(e)

The `load_urls` function reads the text content from the previously generated `filing_urls.txt` file, and creates a list of URLs by splitting the text content at each new line character ("\n").

In [44]:
# load URLs from log file
def load_urls():
  with open("filing_urls.txt", "r") as log_file:
    # convert long string of URLs into a list and skip empty lines,
    # e.g. the empty string after the final new line character
    urls = [url for url in log_file.read().split("\n") if url]
  return urls

The `download_all_filings` function is the heart and soul of our application. Here, Python's built-in `multiprocessing.Pool` class allows us to apply a function to each value of a list in parallel, using multiple worker processes. This way we can apply the `download_filing` function to the values of the `urls` list in parallel.

For example, setting `number_of_processes` to 4 results in 4 `download_filing` functions running in parallel where each function processes one URL. Once a download is completed, `multiprocessing.Pool` gets the next URL from the URLs list and calls `download_filing` with the new URL.
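The `map` pattern can be tried with a toy function before running real downloads. Note that this sketch swaps in `multiprocessing.pool.ThreadPool`, a thread-based class from the same module that exposes the same `map` method as `multiprocessing.Pool`, so it runs in any environment without spawning processes. It is a stand-in for demonstration, not the approach used in this tutorial, and `fake_download` is a made-up placeholder for `download_filing`.

```python
from multiprocessing.pool import ThreadPool

def fake_download(url):
  # stand-in for download_filing: returns a string instead of fetching anything
  return "downloaded " + url

demo_urls = ["url-{}".format(i) for i in range(8)]

# 4 workers share the 8 URLs; map returns the results in input order
with ThreadPool(4) as pool:
  results = pool.map(fake_download, demo_urls)

print(results[0])
```

Each worker picks up the next URL as soon as it finishes its current one, which mirrors the behavior described above.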

> We use only the first 40 URLs (`urls = load_urls()[0:40]`) to quickly test the code without having to wait hours for the download to complete. Uncomment the line below to process all URLs.
- `urls = load_urls()`



In [49]:
import os
import multiprocessing

def download_all_filings():
  print("Start downloading all filings")

  download_folder = "./filings" 
  if not os.path.isdir(download_folder):
    os.makedirs(download_folder)
    
  # uncomment next line to process all URLs
  # urls = load_urls()
  urls = load_urls()[0:40]
  print("{length} filing URLs loaded".format(length=len(urls)))

  number_of_processes = 20

  with multiprocessing.Pool(number_of_processes) as pool:
    pool.map(download_filing, urls)
  
  print("All filings downloaded")

In [50]:
download_all_filings()

Start downloading all filings
40 filing URLs loaded
All filings downloaded
