# SEC Disclosures Scraper
Set up web scraping of SEC financial reports (e.g., [AAPL](https://www.sec.gov/ix?doc=/Archives/edgar/data/0000320193/000032019321000105/aapl-20210925.htm)) to find diversity and inclusion statements.

##### Data Sources
* SEC Master index of filings (ex. Q1 2021: https://www.sec.gov/Archives/edgar/full-index/2021/QTR1/master.idx)
* SEC text archives (ex. AAPL 2021 10-K: https://www.sec.gov/Archives/edgar/data/0000320193/0000320193-21-000105.txt)

##### Notes
* If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period. Once the rate of requests has dropped below the threshold for 10 minutes, the user may resume accessing content on SEC.gov. This SEC practice is designed to limit excessive automated searches on SEC.gov and is not intended or expected to impact individuals browsing the SEC.gov website.
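
The request limit described in the note above can be respected with a small client-side throttle. The following is a minimal sketch, not part of the original notebook: `RateLimiter` and its parameters are hypothetical names, and the defaults simply mirror the 10 requests per second quoted above. A lower `max_calls` would leave more headroom.

```python
import time
from collections import deque


class RateLimiter:
    """Hypothetical helper: allow at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls=10, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of the calls inside the current window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # forget calls that have left the sliding window
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                self.calls.append(now)
                return
            # window is full: sleep until the oldest call expires
            self.sleep(self.period - (now - self.calls[0]))


limiter = RateLimiter(max_calls=10, period=1.0)
for _ in range(3):
    limiter.wait()  # call this immediately before each requests.get(...)
```

Calling `limiter.wait()` before every request keeps the pacing logic in one place instead of counting downloads by hand inside the scraping loop.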

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import json
import pprint
import pandas as pd
import time

# fix ssl certificate (needed for MacOS sometimes)
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

## Step 1: Get a List of All 10-K Links for Companies in 2021
* Using the EDGAR master index endpoint, retrieve the list of all 10-Ks filed during 2021
* Organize data in a pandas dataframe
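
The filtering step listed above can be tried offline on a few index lines before downloading anything. This is a sketch under assumptions: `parse_master_index` is a hypothetical helper, the header and separator lines follow the pipe-delimited layout the master index is assumed to use (`CIK|Company Name|Form Type|Date Filed|Filename`), and the `NT 10-K` row is made up to show why the form type is compared exactly.

```python
def parse_master_index(text, form_type="10-K"):
    """Return one dict per index row whose form type equals `form_type` exactly."""
    columns = ["cik", "companyName", "filingType", "date", "link"]
    rows = []
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != len(columns):
            continue  # skip separator and blank lines
        record = dict(zip(columns, fields))
        if record["filingType"] != form_type:
            continue  # also skips the header row and variants such as NT 10-K
        record["cik"] = int(record["cik"])
        rows.append(record)
    return rows


# First two data rows are copied from the notebook output; the last row is hypothetical.
sample = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1000209|MEDALLION FINANCIAL CORP|10-K|2021-03-16|edgar/data/1000209/0001564590-21-013216.txt
1000228|HENRY SCHEIN INC|10-K|2021-02-17|edgar/data/1000228/0001000228-21-000019.txt
9999999|HYPOTHETICAL CO|NT 10-K|2021-03-31|edgar/data/9999999/0000000000-21-000000.txt
"""

rows = parse_master_index(sample)
for row in rows:
    print(row["cik"], row["filingType"], row["link"])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` if a dataframe is preferred.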

In [2]:
# set a User-Agent header so that SEC.gov serves us the content
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

# collect the 10-K rows of the master index for all four quarters of 2021
# (the list keeps the name tenQs because later cells refer to it)
tenQs = []
year = 2021
quarters = [1,2,3,4]
for quarter in quarters:
    r = requests.get(f'https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/master.idx', headers=headers).content
    r = r.decode("utf-8").split('\n')
    # data rows have five '|'-separated fields and the form type is the third one;
    # compare it exactly so that variants such as 10-K/A or NT 10-K are not picked up
    tenQs.extend([row for row in r if row.count('|') == 4 and row.split('|')[2] == '10-K'])

In [3]:
df = pd.DataFrame([row.split('|') for row in tenQs])
df.columns = ['cik', 'companyName', 'filingType', 'date', 'link']
df['cik'] = df['cik'].astype(int)
df.head(10)

Unnamed: 0,cik,companyName,filingType,date,link
0,1000209,MEDALLION FINANCIAL CORP,10-K,2021-03-16,edgar/data/1000209/0001564590-21-013216.txt
1,1000228,HENRY SCHEIN INC,10-K,2021-02-17,edgar/data/1000228/0001000228-21-000019.txt
2,1000229,CORE LABORATORIES N V,10-K,2021-02-08,edgar/data/1000229/0001564590-21-004561.txt
3,1000232,KENTUCKY BANCSHARES INC /KY/,10-K,2021-03-03,edgar/data/1000232/0001558370-21-002326.txt
4,1000298,IMPAC MORTGAGE HOLDINGS INC,10-K,2021-03-12,edgar/data/1000298/0001558370-21-002945.txt
5,1000623,SCHWEITZER MAUDUIT INTERNATIONAL INC,10-K,2021-03-01,edgar/data/1000623/0001000623-21-000047.txt
6,1000683,BLONDER TONGUE LABORATORIES INC,10-K,2021-03-25,edgar/data/1000683/0001213900-21-017745.txt
7,1000694,NOVAVAX INC,10-K,2021-03-01,edgar/data/1000694/0001000694-21-000004.txt
8,1000697,WATERS CORP /DE/,10-K,2021-02-24,edgar/data/1000697/0001193125-21-054385.txt
9,1000753,"INSPERITY, INC.",10-K,2021-02-12,edgar/data/1000753/0001000753-21-000009.txt


## Step 2: Merge with our CIK List
Grab the links for the CIKs that we care about.
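
A CIK that has no 10-K in the index silently disappears in a default (inner) merge. The sketch below, on toy data, shows one way to spot such cases with `how="left"` and `indicator=True`; the frames `filings` and `wanted` and the CIK `1234567` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the master index dataframe
filings = pd.DataFrame({
    "cik": [320193, 1065280],
    "filingType": ["10-K", "10-K"],
})

# CIKs we care about; 1234567 is a hypothetical CIK with no filing in `filings`
wanted = pd.DataFrame({"cik": [320193, 1065280, 1234567]})

# Inner merge (the pandas default): unmatched CIKs are dropped
inner = wanted.merge(filings, on="cik")

# Left merge with an indicator column: unmatched CIKs are kept and flagged
left = wanted.merge(filings, on="cik", how="left", indicator=True)
missing = left.loc[left["_merge"] == "left_only", "cik"].tolist()

print(len(inner), "matched;", "missing CIKs:", missing)
```

Checking `missing` before scraping makes it clear whether a company is absent because it filed no 10-K that year or because its CIK was mistyped.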

In [4]:
ciks = [1326801, 320193, 1321655, 1065280]     # sample list of ciks (FB, AAPL, PLTR, and NFLX)
df_ciks = pd.DataFrame(ciks) # convert to df
df_ciks.columns = ['cik']
df_ciks

Unnamed: 0,cik
0,1326801
1,320193
2,1321655
3,1065280


In [5]:
# inner merge (the pandas default): keep only the index rows whose cik is in our list
df = df_ciks.merge(df, on='cik')
df

Unnamed: 0,cik,companyName,filingType,date,link
0,1326801,Facebook Inc,10-K,2021-01-28,edgar/data/1326801/0001326801-21-000014.txt
1,320193,Apple Inc.,10-K,2021-10-29,edgar/data/320193/0000320193-21-000105.txt
2,1321655,Palantir Technologies Inc.,10-K,2021-02-26,edgar/data/1321655/0001193125-21-060650.txt
3,1065280,NETFLIX INC,10-K,2021-01-28,edgar/data/1065280/0001065280-21-000040.txt


## Step 3: Scrape 10-K and Store
* for each cik, scrape the text of its 10-K, pick out instances of the word 'diversity'
* store scrapes in json file
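
Picking out the text that mentions 'diversity' can be prototyped on plain strings before running it against real filings. The following sketch uses hypothetical helper names (`blocks_mentioning`, `count_mentions`) and matches the whole word with a regular expression, which is slightly stricter than a plain substring test because words such as "biodiversity" no longer match. The first two sample blocks are quoted from the scraped Facebook text; the third is made up.

```python
import re

# whole-word, case-insensitive match for "diversity"
DIVERSITY = re.compile(r"\bdiversity\b", re.IGNORECASE)


def blocks_mentioning(blocks, pattern=DIVERSITY):
    """Return the text blocks that contain at least one match."""
    return [block for block in blocks if pattern.search(block)]


def count_mentions(blocks, pattern=DIVERSITY):
    """Return the total number of matches across all blocks."""
    return sum(len(pattern.findall(block)) for block in blocks)


blocks = [
    "Diversity and Inclusion",
    "We publish our global gender diversity and U.S. ethnic diversity workforce data annually.",
    "Biodiversity risks may affect our supply chain.",  # hypothetical block, not from a filing
]

hits = blocks_mentioning(blocks)
print(len(hits), "blocks,", count_mentions(blocks), "mentions")
```

Keeping the block count and the mention count separate avoids reporting the number of matching blocks as the number of mentions.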

In [6]:
# convert to dict
files = df.to_dict(orient='records')

# loop through each filing and download it from SEC.gov
# pause for 1 second after every 10 downloads to help stay under the SEC limit of 10 requests per second
count = 0
for file in files:
    print(file['companyName'])
    
    # scrape text
    url = f'https://www.sec.gov/Archives/{file["link"]}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    # get all text
    text_all = []
    for div in soup.find_all('div'):
        text_block = ''
        for span in div.find_all('span'):
            text_block += span.text
        text_all.append(text_block)

    # get text mentioning diversity
    text_diversity = [text for text in text_all if 'diversity' in text.lower()]
    
    # store in dict
    file['textAll'] = text_all
    file['textDiversity'] = text_diversity
    
    print(str(len(text_diversity)) + ' mentions of diversity found')
    print()
    
    # update counter
    count += 1
    if count == 10:
        count = 0
        time.sleep(1)

Facebook Inc
4 mentions of diversity found

Apple Inc.
3 mentions of diversity found

Palantir Technologies Inc.
0 mentions of diversity found

NETFLIX INC
2 mentions of diversity found



In [7]:
with open('../data/filings2021.json', 'w') as f:
    json.dump(files, f)
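
Before relying on the saved file, it can be worth confirming that the records survive a write and reload unchanged. This is a minimal sketch on one hand-written record; the file name `filings_sample.json` and the temporary directory are assumptions made so the example does not touch `../data/`.

```python
import json
import os
import tempfile

# One hand-written record in the same shape as the entries of `files`
records = [
    {
        "cik": 1326801,
        "companyName": "Facebook Inc",
        "textDiversity": ["Diversity and Inclusion"],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filings_sample.json")  # hypothetical file name

    # write the records; indent=2 keeps the file readable in a text editor
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # read them back
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == records)
```

Passing `ensure_ascii=False` keeps characters such as non-breaking spaces as they are instead of escaping them, which is why the file is opened with an explicit UTF-8 encoding.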

## View Results of Scraping

In [8]:
for file in files:
    print(file['companyName'])
    pprint.pprint(file['textDiversity'])
    print()

Facebook Inc
['Diversity and Inclusion',
 'Diversity and inclusion are core to our work at Facebook. We seek to build a '
 'diverse and inclusive workplace where we can leverage our collective '
 'cognitive diversity to build the best products and make the best decisions '
 'for the global community we serve. While we have made progress, we still '
 'have more work to do.',
 'We publish our global gender diversity and U.S. ethnic diversity workforce '
 'data annually. In 2020, we announced that as of June\xa030, 2020, our global '
 'employee base was comprised of 37% females and 63% males, and our U.S. '
 'employee base was comprised of the following ethnicities: 44.4% Asian, '
 '41%\xa0White, 6.3% Hispanic, 4% two or more ethnicities, 3.9%\xa0Black, and '
 '0.4% additional groups (including American Indian or Alaska Native and '
 'Native Hawaiian or Other Pacific Islander). We also announced our goals to '
 'have 50% of our workforce made up of underrepresented populations by 2024, '
