<a href="https://colab.research.google.com/github/winterForestStump/thesis/blob/main/data/sec_master_scrap_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook creates a csv file with all 10-K filings from January 1999 until September 2023 with following columns:
* cik_number - unique company id
* company_name
* form_id - is always 10-K (annual filings)
* date - a date of filing of the report
* file_url - url of the 10-K filing

In [1]:
import requests
import time
import pandas as pd

In [2]:
!cd content/drive/MyDrive

/bin/bash: line 1: cd: content/drive/MyDrive: No such file or directory


In [3]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'}

In [4]:
# Function to generate dates from 1999 to 2023
def generate_dates():
    for year in range(1999, 2024):
        for month in range(1, 10):
            for day in range(1, 32):
                # Ensure valid dates are used
                if month in [4, 6, 9, 11] and day == 31:
                    continue
                elif month == 2 and day > 29:
                    continue
                elif month == 2 and day == 29 and not (year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)):
                    continue
                yield f"{year:04d}{month:02d}{day:02d}"

In [5]:
#function for making an url
def make_url(base_url , comp):
    url = base_url
    # add each component to the base url
    for r in comp:
        url = '{}/{}'.format(url, r)
    return url

10-K (Annual report) provides audited annual financial statements, a discussion of material risk factors for the company and its business, and a management’s discussion and analysis of the company’s results of operations for the prior fiscal year.

In [None]:
# Base URL for the master files
base_url = "https://www.sec.gov/Archives/edgar/daily-index/"

# Counter to keep track of downloaded and failed files
total_downloaded_files = 0
total_failed_files = 0

master_datas = []
# Loop through all dates and download the content for each day
for date in generate_dates():
    file_url = f"{base_url}{date[:4]}/QTR{int(date[4:6]) // 4 + 1}/master.{date}.idx"
    response = requests.get(file_url, headers=headers)

    if response.status_code == 200:
        try:
          data = response.content.decode("utf-8").split('  ')
        except UnicodeDecodeError as e:
          print(f"UnicodeDecodeError: {e}")
        # We need to remove the headers, so look for the end of the header and grab it's index
        for index, item in enumerate(data):
          if "ftp://ftp.sec.gov/edgar/" in item:
            start_ind = index

        # define a new dataset with out the header info.
        data_format = data[start_ind:]

        master_data = []

        # now we need to break the data into sections, this way we can move to the final step of getting each row value.
        for index, item in enumerate(data_format):

          # if it's the first index, it won't be even so treat it differently
          if index == 0:
            clean_item_data = item.replace('\n','|').split('|')
            clean_item_data = clean_item_data[8:]
          else:
            clean_item_data = item.replace('\n','|').split('|')

          for index, row in enumerate(clean_item_data):

            # when you find the text file.
            if '.txt' in row:

              # grab the values that belong to that row. It's 4 values before and one after.
              mini_list = clean_item_data[(index - 4): index + 1]

              if len(mini_list) != 0:
                mini_list[4] = "https://www.sec.gov/Archives/" + mini_list[4]
                master_data.append(mini_list)

        # loop through each document in the master list.
        for index, document in enumerate(master_data):

          # create a dictionary for each document in the master list
          document_dict = {}
          document_dict['cik_number'] = document[0]
          document_dict['company_name'] = document[1]
          document_dict['form_id'] = document[2]
          document_dict['date'] = document[3]
          document_dict['file_url'] = document[4]

          master_data[index] = document_dict

        master_data = [item for item in master_data if item.get('form_id') == '10-K']

        master_datas.extend(master_data)
        print(f"File for {date} downloaded and saved successfully.")
        total_downloaded_files += 1

    else:
      print(f"Failed to download the file for {date}.")
      total_failed_files += 1

    # Introduce a delay to comply with the rate limit (10 requests per second)
    time.sleep(0.1)

print(f"All files downloaded. Total downloaded files: {total_downloaded_files}. Total failed to download files: {total_failed_files}")


In [8]:
df = pd.DataFrame(master_datas)
df

Unnamed: 0,cik_number,company_name,form_id,date,file_url
0,1044324,TROPICAL SPORTSWEAR INTERNATIONAL CORP,10-K,19990104,https://www.sec.gov/Archives/edgar/data/104432...
1,320303,METAL ARTS CO INC,10-K,19990104,https://www.sec.gov/Archives/edgar/data/320303...
2,792130,DATAWATCH CORP,10-K,19981229,https://www.sec.gov/Archives/edgar/data/792130...
3,851478,BEI MEDICAL SYSTEMS CO INC /DE/,10-K,19990104,https://www.sec.gov/Archives/edgar/data/851478...
4,884124,GALEY & LORD INC,10-K,19990104,https://www.sec.gov/Archives/edgar/data/884124...
...,...,...,...,...,...
178185,1808898,Benitec Biopharma Inc.,10-K,20230921,https://www.sec.gov/Archives/edgar/data/180889...
178186,1948565,Investcorp US Institutional Private Credit Fund,10-K,20230921,https://www.sec.gov/Archives/edgar/data/194856...
178187,33533,ESPEY MFG & ELECTRONICS CORP,10-K,20230921,https://www.sec.gov/Archives/edgar/data/33533/...
178188,718332,"RAVE RESTAURANT GROUP, INC.",10-K,20230921,https://www.sec.gov/Archives/edgar/data/718332...


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178190 entries, 0 to 178189
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   cik_number    178190 non-null  object
 1   company_name  178190 non-null  object
 2   form_id       178190 non-null  object
 3   date          178190 non-null  object
 4   file_url      178190 non-null  object
dtypes: object(5)
memory usage: 6.8+ MB


In [10]:
# Save the final DataFrame to CSV
desired_path = '/content/drive/MyDrive/10-K_sec_filings_ver2.csv'
df.to_csv(desired_path, index=False)

print("DataFrame saved to CSV:", desired_path)

DataFrame saved to CSV: /content/drive/MyDrive/10-K_sec_filings_ver2.csv


Now we have a csv file with more than 178K of links to the annual reports (10-K)