# P1 - Download Financial Data for S&P500 Companies from SEC

## Overview:

This project aims to download financial data for S&P500 companies from SEC filings using the SEC API. The data gathered will be applied in the subsequent project to compute key financial metrics for factor-based investment analysis.

Techniques used in the project:

1. Utilizing web scraping and API requests to collect company tickers and financial data. This includes interacting with the SEC database via submission and company fact APIs.
2. Processing JSON files and leveraging the requests library for API retrieval.
3. Manipulating time series data by resampling and aligning the datasets over uniform time periods.
4. Cleaning and adjusting the data, particularly dealing with cumulative figures and converting them into quarterly amounts for consistency.
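
As a rough illustration of item 1, the sketch below builds the two SEC endpoint URLs for a given CIK. The helper names `sec_urls` and `fetch_json` are illustrative and do not appear elsewhere in this notebook, and the User-Agent value in the commented-out call is a placeholder. The network request is left commented out so the sketch runs offline.

```python
import requests

SEC_BASE = "https://data.sec.gov"

def sec_urls(cik):
    """Return the submissions and company-facts URLs for a CIK (int or str)."""
    cik10 = str(cik).zfill(10)  # the endpoints use a 10-digit, zero-padded CIK
    return {
        "submissions": f"{SEC_BASE}/submissions/CIK{cik10}.json",
        "companyfacts": f"{SEC_BASE}/api/xbrl/companyfacts/CIK{cik10}.json",
    }

def fetch_json(url, headers, timeout=30):
    """Fetch one endpoint and decode the JSON body (raises on HTTP errors)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.json()

urls = sec_urls(320193)  # Apple's CIK, as shown later in this notebook
print(urls["companyfacts"])
# facts = fetch_json(urls["companyfacts"], {"User-Agent": "name@example.com"})
```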

In [2]:
# import libraries
import yfinance as yf
import pandas as pd
import requests
import json
import numpy as np
import pickle
import copy
import datetime as dt
import os

headers = {"User-Agent": "ian.ye.fu@gmail.com"} 

data_folder_download = './datasets/download/'
data_folder_generate = './datasets/generate/'

%store -r removed_tickers_list

## Step 1: Download S&P500 tickers from Wikipedia

In [10]:
# Wikipedia URL for S&P 500
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Read the HTML tables from the webpage
tables = pd.read_html(url)

# Table 1: List of S&P 500 companies
sp500_table = tables[0]

# Table 2: Recent changes to the S&P 500
recent_changes_table = tables[1]

# Save these tables to CSV files:
sp500_table.to_csv(data_folder_download + 'sp500_companies.csv', index=False)
recent_changes_table.to_csv(data_folder_download + 'sp500_recent_changes.csv', index=False)

In [30]:
# create a list of S&P500 tickers
sp500_df = pd.read_csv(data_folder_download + 'sp500_companies.csv', index_col = 0)
sp500_tickers = sp500_df.index.to_list()

In [59]:
%store sp500_tickers

Stored 'sp500_tickers' (list)


In [31]:
sp500_df.head()

| Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|---|---|---|---|---|---|---|
| MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| ABT | Abbott Laboratories | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| ABBV | AbbVie | Health Care | Biotechnology | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |


## Step 2: Convert the CIK code format to match SEC Filings

In [35]:
# Create the CIK column and get sp500_cik_list
sp500_cik_list = [str(cik).zfill(10) for cik in sp500_df['CIK'].tolist()]  
sp500_cik = pd.Series(sp500_cik_list, index = sp500_df.index)
sp500_df['CIK'] = sp500_cik
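
The padding step is needed because the CSV round trip in Step 1 reads the CIK column back as integers. A minimal sketch on a toy CSV (two rows copied from the table above; `toy_df` is an illustrative name) shows the effect:

```python
import io
import pandas as pd

# Toy stand-in for sp500_companies.csv
csv_text = "Symbol,Security,CIK\nMMM,3M,66740\nABT,Abbott Laboratories,1800\n"

toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
# read_csv infers CIK as an integer column, so there are no leading zeros yet
toy_df["CIK"] = toy_df["CIK"].astype(str).str.zfill(10)

print(toy_df["CIK"].to_dict())
# → {'MMM': '0000066740', 'ABT': '0000001800'}
```

An alternative, if preferred, is to pass `dtype={"CIK": str}` to `read_csv` and pad afterwards; either way the result is a 10-character string.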

In [33]:
sp500_df.drop(columns = 'Headquarters Location', inplace = True)

In [36]:
sp500_df.head()

| Symbol | Security | GICS Sector | GICS Sub-Industry | Date added | CIK | Founded |
|---|---|---|---|---|---|---|
| MMM | 3M | Industrials | Industrial Conglomerates | 1957-03-04 | 66740 | 1902 |
| AOS | A. O. Smith | Industrials | Building Products | 2017-07-26 | 91142 | 1916 |
| ABT | Abbott Laboratories | Health Care | Health Care Equipment | 1957-03-04 | 1800 | 1888 |
| ABBV | AbbVie | Health Care | Biotechnology | 2012-12-31 | 1551152 | 2013 (1888) |
| ACN | Accenture | Information Technology | IT Consulting & Other Services | 2011-07-06 | 1467373 | 1989 |


In [41]:
sp500_df.loc['AAPL']

Security                                             Apple Inc.
GICS Sector                              Information Technology
GICS Sub-Industry    Technology Hardware, Storage & Peripherals
Date added                                           1982-11-30
CIK                                                  0000320193
Founded                                                    1977
Name: AAPL, dtype: object

In [96]:
with open(data_folder_generate + 'sp500_df_v1.0.pkl', 'wb') as f: 
    pickle.dump(sp500_df, f)

## Step 4: Load the company_facts data from the downloaded archive JSON files

In [42]:
def get_facts(ticker):
    """
    Load the company_facts data for a ticker from the local JSON file.
    """
    # Double quotes outside so the inner 'CIK' key also parses on Python < 3.12
    cik = f"CIK{sp500_df.loc[ticker, 'CIK']}"
    directory = data_folder_download + 'companyfacts/'
    file_path = os.path.join(directory, f"{cik}.json")  # e.g. CIK0000320193.json

    # Load the JSON data
    with open(file_path, 'r') as json_file:
        company_facts = json.load(json_file)

    return company_facts

In [43]:
# test get_facts function: success.
company_facts_aapl = get_facts('AAPL')
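
To make the shape of the returned dict concrete, here is a hand-written miniature of a company facts payload with a small helper that counts records per concept and unit. All values are made up, and `mock_facts` and `summarize_facts` are illustrative names rather than part of the notebook.

```python
# Miniature of a company facts payload (values are invented for illustration)
mock_facts = {
    "cik": 1234567,
    "entityName": "Example Corp",
    "facts": {
        "us-gaap": {
            "Assets": {
                "label": "Assets",
                "description": "Sum of the carrying amounts of all assets.",
                "units": {
                    "USD": [
                        {"end": "2022-12-31", "val": 100, "fy": 2022, "fp": "FY", "form": "10-K"},
                        {"end": "2023-03-31", "val": 110, "fy": 2023, "fp": "Q1", "form": "10-Q"},
                    ]
                },
            }
        }
    },
}

def summarize_facts(company_facts, taxonomy="us-gaap"):
    """Map each concept to {unit: number_of_records} for a quick overview."""
    summary = {}
    for concept, details in company_facts["facts"][taxonomy].items():
        summary[concept] = {unit: len(rows) for unit, rows in details["units"].items()}
    return summary

print(list(mock_facts.keys()))
print(summarize_facts(mock_facts))
# → {'Assets': {'USD': 2}}
```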

## Step 5: Create financials dataframes from the company_facts dict. 

In [47]:
def facts_DF(ticker, headers=headers):
    """
    Convert the company_facts dict into a long-format dataframe indexed by 'end'.
    `headers` is unused here (the data is read from disk) and is kept only so
    that existing calls keep working.
    """
    facts = get_facts(ticker)
    us_gaap_data = facts["facts"]["us-gaap"]
    df_data = []
    for fact, details in us_gaap_data.items():
        for unit, items in details["units"].items():
            for item in items:
                row = item.copy()  # keep the original data intact
                row["fact"] = fact
                row["label"] = details["label"]
                df_data.append(row)

    df = pd.DataFrame(df_data)
    df["end"] = pd.to_datetime(df["end"])
    df["start"] = pd.to_datetime(df["start"])
    # Drop records that repeat the same value for the same fact and period end
    df = df.drop_duplicates(subset=["fact", "end", "val"])
    df.set_index("end", inplace=True)

    return df

In [52]:
# test facts_DF function: success.
ice_facts_df = facts_DF('ICE', headers)
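
The `drop_duplicates` call in `facts_DF` only collapses rows that agree on fact, period end, and value. A small sketch on made-up records shows that a restated value for the same period end survives this step, which is why Step 7 later keeps a single row per period end.

```python
import pandas as pd

# Made-up records: the same Assets figure reported twice, then restated once
records = [
    {"fact": "Assets", "end": "2022-12-31", "val": 100, "form": "10-K", "filed": "2023-02-01"},
    {"fact": "Assets", "end": "2022-12-31", "val": 100, "form": "10-Q", "filed": "2023-05-01"},
    {"fact": "Assets", "end": "2022-12-31", "val": 105, "form": "10-K/A", "filed": "2023-08-01"},
    {"fact": "Assets", "end": "2023-03-31", "val": 110, "form": "10-Q", "filed": "2023-05-01"},
]
toy = pd.DataFrame(records)
toy["end"] = pd.to_datetime(toy["end"])

# Same subset as facts_DF: identical (fact, end, val) triples collapse to the first row
deduped = toy.drop_duplicates(subset=["fact", "end", "val"]).set_index("end")
print(deduped[["val", "form"]])
```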

## Step 6: Define the financial data to be downloaded

In [57]:
pd.set_option('display.max_rows', None)

In [1]:
# get the data labels in the SEC database for all the essential financials for the factor calculation. 
data_category = ['Assets', 
                 'Liabilities', 
                 'LiabilitiesCurrent', 
                 'LiabilitiesNoncurrent', 
                 'LiabilitiesAndStockholdersEquity',  
                 'StockholdersEquity', 
                 'StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest',
                 'EarningsPerShareDiluted', 
                 'CommonStockDividendsPerShareDeclared', 
                 'WeightedAverageNumberOfDilutedSharesOutstanding', 
                 'WeightedAverageNumberOfSharesOutstandingBasic',
                 'CommonStockSharesOutstanding',
                 'NetIncomeLoss' 
                 ]
%store data_category

Stored 'data_category' (list)


## Step 7: Download the financial data from SEC 

In [82]:
def download_financial_data_from_SEC(sp500_tickers, data_category):
    """
    Collect every financial metric in data_category from the SEC company facts
    for the S&P 500 companies.
    """
    sp500_financial_data = {}  # outer dict: ticker -> {category: dataframe}

    for ticker in sp500_tickers:

        df = facts_DF(ticker)  # all us-gaap facts for the ticker
        financial_data = {}  # inner dict

        # Loop over all categories in the data_category list
        for category in data_category:

            x = df.query('fact == @category')
            # drop rows whose value is zero or missing
            x = x[(x['val'] != 0) & (x['val'].notna())]
            # remove rows duplicated on the 'end' index, keeping the last record
            cleaned_data = x[~x.index.duplicated(keep='last')].sort_index(ascending=True)
            # keep data from 2013 onwards only
            financial_data[category] = cleaned_data.loc['2013':]

        # Assign the financial data for each ticker
        sp500_financial_data[ticker] = financial_data

    return sp500_financial_data

In [154]:
sp500_financial_data = download_financial_data_from_SEC(sp500_tickers, data_category)
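
The per-category cleaning inside the loop can be traced on a toy frame. The sketch below uses invented values and applies the same three operations in the same order: filter out zero and missing values, keep the last row per period end, then slice from 2013.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "end": pd.to_datetime(
            ["2012-12-31", "2013-12-31", "2013-12-31", "2014-12-31", "2015-12-31"]
        ),
        "fact": ["Assets"] * 5,
        "val": [90.0, 100.0, 105.0, 0.0, None],
    }
).set_index("end")

x = toy.query("fact == 'Assets'")
x = x[(x["val"] != 0) & (x["val"].notna())]           # drop zero and missing values
x = x[~x.index.duplicated(keep="last")].sort_index()  # one row per period end
cleaned = x.loc["2013":]                              # partial-string slice on the DatetimeIndex

print(cleaned)
```

Only the 2013 row with value 105.0 remains: the 2012 row falls before the slice, the first 2013 row loses to the later duplicate, and the 2014 and 2015 rows are filtered out as zero and missing.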

In [157]:
with open(data_folder_generate + 'sp500_financial_data_v1.0.pkl', 'wb') as f: 
    pickle.dump(sp500_financial_data, f)

## Step 8: Clean the financial data from SEC 

In [162]:
# note: only income-statement (I/S) based metrics need to be updated.
updated_data_category = [
                 'EarningsPerShareDiluted',
                 'CommonStockDividendsPerShareDeclared', 
                 'NetIncomeLoss'
                 ]

def convert_annual_to_quarter(sp500_financial_data, updated_data_category):
    """
    Replace each annual value with the fourth-quarter value by subtracting the
    sum of the previous three quarters, for every ticker and for the categories
    listed in updated_data_category.
    Note: the input dict is modified in place and returned.
    """

    for ticker in sp500_financial_data.keys():

        for category in sp500_financial_data[ticker].keys():

            new_df = sp500_financial_data[ticker][category].reset_index().copy()

            if category in updated_data_category:

                # identify the rows whose reporting period exceeds 130 days (treated as annual)
                index_list = new_df[(new_df['end'] - new_df['start']).dt.days > 130].index.tolist()

                # subtract the sum of the previous three quarters from each annual row
                for i in index_list:
                    new_df.loc[i, 'val'] = new_df.loc[i, 'val'] - new_df.loc[i-3: i-1, 'val'].sum()

            sp500_financial_data[ticker][category] = new_df.set_index('end')

    return sp500_financial_data

In [159]:
sp500_financial_data_updated = convert_annual_to_quarter(sp500_financial_data, updated_data_category)
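
The adjustment in `convert_annual_to_quarter` amounts to Q4 = FY - (Q1 + Q2 + Q3). A worked example on invented numbers, using the same 130-day threshold and the same label-based slice, is sketched below.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "start": pd.to_datetime(["2022-01-01", "2022-04-01", "2022-07-01", "2022-01-01"]),
        "end": pd.to_datetime(["2022-03-31", "2022-06-30", "2022-09-30", "2022-12-31"]),
        "val": [10.0, 12.0, 11.0, 48.0],  # last row is the full-year figure
    }
)

span_days = (toy["end"] - toy["start"]).dt.days
long_rows = toy.index[span_days > 130].tolist()  # same threshold as the notebook

for i in long_rows:
    # Full-year value minus the three preceding quarters leaves the fourth quarter
    toy.loc[i, "val"] -= toy.loc[i - 3 : i - 1, "val"].sum()

print(toy["val"].tolist())
# → [10.0, 12.0, 11.0, 15.0]
```

The sketch relies on the three rows before each annual row being that year's first three quarters. If a quarter is missing from the filings, the subtraction silently uses whatever rows precede it, so a check on the row spacing may be worth adding.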

In [160]:
sp500_financial_data_updated['ABNB']['StockholdersEquity']

| end | val | accn | fy | fp | form | filed | fact | label | frame | start |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018-12-31 | -517308000.0 | 0001559720-22-000006 | 2021 | FY | 10-K | 2022-02-25 | StockholdersEquity | Stockholders' Equity Attributable to Parent | CY2018Q4I | NaT |
| 2019-12-31 | -808000000.0 | 0001559720-23-000003 | 2022 | FY | 10-K | 2023-02-17 | StockholdersEquity | Stockholders' Equity Attributable to Parent | CY2019Q4I | NaT |
| 2020-03-31 | -1117431000.0 | 0001628280-21-010389 | 2021 | Q1 | 10-Q | 2021-05-14 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |
| 2020-06-30 | -1646645000.0 | 0001628280-21-016979 | 2021 | Q2 | 10-Q | 2021-08-13 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |
| 2020-09-30 | -1376284000.0 | 0001559720-21-000017 | 2021 | Q3 | 10-Q | 2021-11-05 | StockholdersEquity | Stockholders' Equity Attributable to Parent | CY2020Q3I | NaT |
| 2020-12-31 | 2901000000.0 | 0001559720-23-000003 | 2022 | FY | 10-K | 2023-02-17 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |
| 2021-03-31 | 3159423000.0 | 0001628280-21-010389 | 2021 | Q1 | 10-Q | 2021-05-14 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |
| 2021-06-30 | 3393201000.0 | 0001628280-21-016979 | 2021 | Q2 | 10-Q | 2021-08-13 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |
| 2021-09-30 | 4448934000.0 | 0001559720-21-000017 | 2021 | Q3 | 10-Q | 2021-11-05 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |
| 2021-12-31 | 4775000000.0 | 0001559720-23-000003 | 2022 | FY | 10-K | 2023-02-17 | StockholdersEquity | Stockholders' Equity Attributable to Parent | | NaT |


In [161]:
with open(data_folder_generate +'sp500_financial_data_v1.1.pkl', 'wb') as f:
    pickle.dump(sp500_financial_data_updated, f)

## Step 9: Download S&P500 stock daily price data from Yahoo Finance   

In [7]:
syear = 2013
smonth = 1
sday = 1
eyear = 2024
emonth = 9
eday = 1

stocks_not_downloaded = []

folder = '../datasets/download/companyprice/'

interval = '1mo'
def save_to_csv_from_yahoo(folder, ticker, syear, smonth, sday, eyear, emonth, eday, interval='1d'):
    """
    Download the 'Adj Close' series from Yahoo Finance and save it to folder.
    interval defaults to daily bars ('1d').
    """
    start = dt.datetime(syear, smonth, sday)
    end = dt.datetime(eyear, emonth, eday)
    print('Get Data for:', ticker)
    # auto_adjust=False keeps the 'Adj Close' column on yfinance versions that adjust by default
    df = yf.download(ticker, start=start, end=end, interval=interval, auto_adjust=False)['Adj Close']

    if df.empty:
        print("Couldn't Get Data for: ", ticker)
        stocks_not_downloaded.append(ticker)
        return  # nothing to save
    df.to_csv(folder + ticker + '.csv')

In [None]:
for ticker in sp500_df.index.to_list(): 
    save_to_csv_from_yahoo(folder, ticker, syear, smonth, sday, eyear, emonth, eday)

In [313]:
# convert the failed tickers from the '.' format to Yahoo's '-' format (e.g. BRK.B -> BRK-B).
tickers_not_downloaded = [ticker.replace('.', '-') for ticker in stocks_not_downloaded]

In [315]:
# re-download the failed tickers
for ticker in tickers_not_downloaded: 
    save_to_csv_from_yahoo(folder, ticker, syear, smonth, sday, eyear, emonth, eday)

Get Data for: BRK-B
[*********************100%%**********************]  1 of 1 completed
Get Data for: BF-B
[*********************100%%**********************]  1 of 1 completed
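
The retry above is needed because share-class tickers in the Wikipedia list use a dot while Yahoo Finance uses a hyphen. One possible alternative, sketched here with an illustrative helper name `to_yahoo_symbol`, is to normalize the symbols before the first download pass so that no retry is needed.

```python
def to_yahoo_symbol(ticker):
    """Convert a share-class ticker such as 'BRK.B' to Yahoo's 'BRK-B' form."""
    return ticker.replace(".", "-")

wiki_tickers = ["BRK.B", "BF.B", "AAPL"]
yahoo_tickers = [to_yahoo_symbol(t) for t in wiki_tickers]
print(yahoo_tickers)
# → ['BRK-B', 'BF-B', 'AAPL']
```

Note that the saved CSV file names would then use the hyphenated form, so any later lookup by the Wikipedia symbol would need the same conversion.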


In [8]:
for ticker in removed_tickers_list: 
    save_to_csv_from_yahoo(folder, ticker, syear, smonth, sday, eyear, emonth, eday)

Get Data for: AA
[*********************100%%**********************]  1 of 1 completed
Get Data for: AAP
[*********************100%%**********************]  1 of 1 completed
Get Data for: ADT
[*********************100%%**********************]  1 of 1 completed
Get Data for: AIV
[*********************100%%**********************]  1 of 1 completed
Get Data for: ALK
[*********************100%%**********************]  1 of 1 completed
Get Data for: ALTR
[*********************100%%**********************]  1 of 1 completed
Get Data for: AMG
[*********************100%%**********************]  1 of 1 completed
Get Data for: AN
[*********************100%%**********************]  1 of 1 completed
Get Data for: ANF
[*********************100%%**********************]  1 of 1 completed
Get Data for: ATI
[*********************100%%**********************]  1 of 1 completed
Get Data for: AYI
[*********************100%%**********************]  1 of 1 completed
Get Data for: BEAM
[*********************100

In [72]:
# download cap-weighted sp500 index historical performance. 
save_to_csv_from_yahoo('./', '^GSPC', syear, smonth, sday, eyear, emonth, eday)

Get Data for: ^GSPC
[*********************100%%**********************]  1 of 1 completed


In [73]:
# download equal-weighted sp500 index historical performance. 
save_to_csv_from_yahoo('./', '^SPXEW', syear, smonth, sday, eyear, emonth, eday)

Get Data for: ^SPXEW
[*********************100%%**********************]  1 of 1 completed
