In [1]:
import pandas as pd
import os
from sec_edgar_downloader import Downloader
from tqdm import tqdm # status bar 
import glob
import shutil # needed for the zip/delete/move calls in Step 3

In [2]:
# places to put files - best practice chapter 2!

os.makedirs("inputs", exist_ok=True)
os.makedirs("10k_files", exist_ok=True)

## Step 1: Get the S&P 500 firms

Using a sample of S&P500 firms is sensible. Two major points come up, the first of which we discussed in class a lot.

The obvious limitation: what if the relationships between our risk measurements and returns during a pandemic are different for smaller firms outside the S&P 500? This is a good concern, and worthy of discussion in your results. Do you have an **economic argument** for why your particular risks would be more or less relevant in a pandemic for small firms than for the larger firms in the S&P 500? Depending on your answer, a relationship you find might be too high or too low. Maybe the sign of the relationship even flips.

The second major issue is how we get the list of S&P 500 firms below. This code gets the list of S&P 500 firms **as of today.** So our sample (A) excludes firms that were in the S&P 500 in March 2020 but no longer are and (B) includes firms that weren't in it then but are now. This could bias our results.
- Perhaps the firms we are erroneously missing (which we know had poor returns) had HIGH risk factors (which is why they did poorly). So excluding them makes it harder to find a risk-return relationship. 
- Perhaps the firms that we are erroneously including (with high returns) had low risk factors (which is why they fared better in the pandemic). So including them makes it easier to find a risk-return relationship. 

Putting those arguments together, the very way I constructed this sample might bias the results, but it depends on the specifics of the "leavers and joiners".

Alternatively, I could just fix the list to be the 2022 S&P500 firms. The fix does not involve copy and paste, manual edits, or merges -- a _tiny_ change to the code below will do it! **Bonus to the person(s) that email me a fix.**
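
To see how much the "leavers and joiners" matter, one option is to compare two snapshots of the index with set operations. This is a minimal sketch with made-up tickers: `members_2020` and `members_today` are hypothetical stand-ins for membership lists you would load yourself, not real index data.

```python
# Hypothetical membership snapshots (illustrative tickers, not real index data)
members_2020 = {"AAA", "BBB", "CCC", "DDD"}
members_today = {"BBB", "CCC", "DDD", "EEE"}

leavers = members_2020 - members_today   # in the index then, but not now
joiners = members_today - members_2020   # in the index now, but not then
stayers = members_2020 & members_today   # in both snapshots

print(f"leavers: {sorted(leavers)}")
print(f"joiners: {sorted(joiners)}")
print(f"stayers: {sorted(stayers)}")
```

With real lists in hand, the returns and risk measures of the leavers and joiners could then be compared against the stayers to judge which direction any bias is likely to run.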

In [3]:
# path and place to put it
sp500_file = 'inputs/sp500_2022.csv'

# get it if we haven't 
if not os.path.exists(sp500_file):
    url = 'https://en.wikipedia.org/w/index.php?title=List_of_S%26P_500_companies&oldid=1130173030'
    pd.read_html(url)[0].to_csv(sp500_file,index=False)

# load and look at it
sp500 = pd.read_csv(sp500_file)

In [4]:
sp500

| | Symbol | Security | SEC filings | GICS Sector | GICS Sub-Industry | Headquarters Location | Date first added | CIK | Founded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | reports | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1976-08-09 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | reports | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott | reports | Health Care | Health Care Equipment | North Chicago, Illinois | 1964-03-31 | 1800 | 1888 |
| 3 | ABBV | AbbVie | reports | Health Care | Pharmaceuticals | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | reports | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 498 | YUM | Yum! Brands | reports | Consumer Discretionary | Restaurants | Louisville, Kentucky | 1997-10-06 | 1041061 | 1997 |
| 499 | ZBRA | Zebra Technologies | reports | Information Technology | Electronic Equipment & Instruments | Lincolnshire, Illinois | 2019-12-23 | 877212 | 1969 |
| 500 | ZBH | Zimmer Biomet | reports | Health Care | Health Care Equipment | Warsaw, Indiana | 2001-08-07 | 1136869 | 1927 |
| 501 | ZION | Zions Bancorporation | reports | Financials | Regional Banks | Salt Lake City, Utah | 2001-06-22 | 109380 | 1873 |


## Step 2: Download their last 10-K 

This takes ____ seconds per download. 

In total ____ minutes, and downloaded a 10-K for ___ of the ____ firms.

This code here does not attempt to fix or explore why ___ 10-Ks are missing. Do you know why?



In [5]:
dl = Downloader("Lehigh", 
                "deb219@lehigh.edu",
                "10k_files")

In [6]:
# start with a small subset while we figure things out
ciks = sp500['CIK'].to_list()[:50]

if not os.path.exists('10k_files/10k_files.zip'):
    
    for cik in tqdm(ciks): # tqdm() status bar 
         
        firm_folder = f'10k_files/sec-edgar-filings/{str(cik).zfill(10)}'  # zfill(10) pads the CIK to 10 digits: 1234 becomes 0000001234

        # if I haven't downloaded any HTML for this firm (len=0 files on this pattern), do so
        if len(glob.glob(firm_folder + '/10-K/*/*.html')) == 0:
            
            dl.get("10-K", cik, 
                   limit=1,                  # get the latest filing within window
                   after="2022-01-01",       # does this download filings ON 1/1 or nah? (check)
                   before="2022-12-31",      # does this download filings ON 12/31 or nah? (check)
                   download_details=True     # download the html 
            ) 
    
        # delete the txt files as we go!!!
        # files are of the form: folder/10-K/*/*.txt
        for txt_f in glob.glob(firm_folder + '/10-K/*/*.txt'):
            os.remove(txt_f)    
    
        # pause if there is a problem and the SEC is mad at my spider
        # unneeded! sec-edgar-dl does it for us 


100%|███████████████████████████████████████████| 50/50 [00:58<00:00,  1.16s/it]


In [7]:
files = glob.glob('10k_files/sec-edgar-filings/*/10-K/*/*.html')
f'We have {len(files)} HTML files for {len(ciks)} firms'

'We have 49 HTML files for 50 firms'

Stop! Check. ABCD. Do we have enough files? Why are any missing files missing? Is it ok to move on, or are fixes needed?
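
One way to start that check is to list exactly which CIKs have no HTML file on disk. The helper below is a sketch, not part of the assignment code: `find_missing_ciks` is a hypothetical name, and the demo uses a throwaway folder with a placeholder file name (`filing.html`) instead of real downloads. It assumes the same folder layout as the glob pattern above (10-digit zero-padded CIK, then `10-K`, then one folder per filing).

```python
import glob
import os
import tempfile


def find_missing_ciks(root, ciks):
    """Return the CIKs with no HTML file under root/<10-digit CIK>/10-K/*/."""
    missing = []
    for cik in ciks:
        padded = str(cik).zfill(10)  # e.g. 1800 -> '0000001800'
        pattern = os.path.join(root, padded, "10-K", "*", "*.html")
        if not glob.glob(pattern):
            missing.append(cik)
    return missing


# Demo on a throwaway folder: one firm has a filing, the other does not
with tempfile.TemporaryDirectory() as root:
    filing_dir = os.path.join(root, "0000001800", "10-K", "some-accession")
    os.makedirs(filing_dir)
    open(os.path.join(filing_dir, "filing.html"), "w").close()  # placeholder file

    demo_missing = find_missing_ciks(root, [1800, 66740])

print(demo_missing)
```

On the real data, the call would look like `find_missing_ciks('10k_files/sec-edgar-filings', ciks)`. The CIKs it returns can be matched back to rows of `sp500` to see which firms to investigate.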

## Step 3: Reduce hard drive usage

Don't run this until you are done with downloads. What is below is "one shot" code: use it once. There is an explicit variable controlling it.

In [8]:
# set to True to run the code below. make sure you are done with downloads first!
# see if your folder has ~500ish html files, and take the screenshot from instructions
done_with_downloads = False 

if os.path.exists('10k_files/sec-edgar-filings') and \
    not os.path.exists('10k_files/10k_files.zip') and \
    done_with_downloads:
    
    # zip the folder (15GB --> 3GB)
    shutil.make_archive('10k_files', 'zip', '10k_files')
    
    # delete the folder 
    shutil.rmtree('10k_files/sec-edgar-filings')
    
    # put the zip file in the `10k_files` folder
    shutil.move('10k_files.zip', '10k_files/')
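
Once the folder is zipped, later steps do not need to unzip it: the standard-library `zipfile` module can read files straight out of the archive. The sketch below shows this on a tiny stand-in folder built in a temporary directory. The file name `filing.html` and its contents are placeholders, not real filings.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the filings folder (placeholder file name)
    src = os.path.join(tmp, "src")
    filing_dir = os.path.join(src, "sec-edgar-filings", "0000001800", "10-K", "x")
    os.makedirs(filing_dir)
    with open(os.path.join(filing_dir, "filing.html"), "w", encoding="utf-8") as f:
        f.write("<html>risk factors</html>")

    # Zip it the same way as above: make_archive(base_name, format, root_dir)
    zip_path = shutil.make_archive(os.path.join(tmp, "10k_files"), "zip", src)

    # Read the HTML straight out of the archive, no unzip needed
    with zipfile.ZipFile(zip_path) as zf:
        html_names = [n for n in zf.namelist() if n.endswith(".html")]
        first_html = zf.read(html_names[0]).decode("utf-8")

print(html_names)
print(first_html)
```

For the real archive, the same pattern would open `10k_files/10k_files.zip` and loop over the names ending in `.html`, so the 15GB folder never has to be restored.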