# Parsing Company 10Ks From the SEC

In this module, now that we can grab any filing we want from the daily-index filings we are going to move on to the next topic parsing financial documents. The easiest one we can start with is the 10K because the underlying structure provided to us will make grabbing the data accessible and quick. We will only focus on the data tables as this is separated from the document itself. However, in time, we will explore how to parse the different components of the 10K.

***

## Import the libraries
This module will require only three libraries, the first is the `requests` library for making the URL requests, `bs4` to parse the files and content, and finally `pandas` which will be used for taking our cleaned data and giving it structure.

In [34]:
# import scraping packages
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import os
import unicodedata

In [35]:
# import packages for getting SEC files as well as CIK numbers
# FilingType gives the 100 type of files to choose from
# Filing is the class of fetched company items
from secedgar.filings import Filing, FilingType

# CIKLookup is for looking at cik of companies, with either name or ticker, case-insensitive
from secedgar.filings.cik_lookup import CIKLookup
# get dict for all ticker/name -> CIK mapping
from secedgar.utils import get_cik_map

# datetime is needed for start_date or end_date in Filing()
from datetime import datetime


In [36]:
# get filepath 
relative_fp = os.path.realpath(os.getcwd())
fp = relative_fp + 'randomly_selected_firms'

# load data
df = pd.read_csv(fp)
df.drop(df.columns[[0]], axis=1, inplace=True)
df

Unnamed: 0,Information Technology,Consumer Products and Services (B2C),Healthcare,Energy,Financial Services,Materials and Resources,Business Products and Services (B2B)
0,Qumu (NAS: QUMU),XpresSpa Group (NAS: XSPA),Molina Healthcare (NYS: MOH),Natural Resource Partners (NYS: NRP),Envestnet (NYS: ENV),Celanese (NYS: CE),Wrap Technologies (NAS: WRTC)
1,Micronet Enertec Technologies (NAS: MICT),Winnebago Industries (NYS: WGO),Capital Senior Living (NYS: CSU),FTS International Services (ASE: FTSI),Legacy Housing Corp (NAS: LEGH),Compass Minerals (NYS: CMP),Red Violet (NAS: RDVT)
2,Activision Blizzard (NAS: ATVI),Heska (NAS: HSKA),Akebia Therapeutics (NAS: AKBA),ChampionX (NYS: CHX),Permian Basin Royalty Trust (NYS: PBT),Hawkins (NAS: HWKN),Meritage Homes (NYS: MTH)
3,Elasticsearch (NYS: ESTC),Turning Point Brands (NYS: TPB),Select Medical Holdings (NYS: SEM),Hess (NYS: HES),Marine Petroleum Trust (NAS: MARPS),Hycroft Mining (NAS: HYMC),Cavco Industries (NAS: CVCO)
4,Aviat Networks (NAS: AVNW),Century Casinos (NAS: CNTY),Blueprint Medicines (NAS: BPMC),Schlumberger (NYS: SLB),North European Oil Royalty Trust (NYS: NRT),Pacific Ethanol (NAS: PEIX),Lakeland Industries (NAS: LAKE)


In [37]:
#df['Information Technology'].to_numpy()

In [38]:
#get name, ticker and sector
name_ticker_sector = []

for sector in df.columns:
    for company_name in df[sector].to_list():
        
        index_of_colon = company_name.index(':')
        index_of_right_paran = company_name.index(')')

        # get the ticker of the current firm
        ticker = company_name[index_of_colon+1:index_of_right_paran].strip()
        
        name_ticker_sector.append([company_name, ticker, sector]) 

name_ticker_sector = np.array(name_ticker_sector, dtype = str)

In [39]:
# get CIK number for each firm
tickers = name_ticker_sector[:, 1].tolist()
CIK_dict = CIKLookup(tickers).lookup_dict

In [40]:
# save this for viewing purpose
'''
save_data = pd.DataFrame(name_ticker_sector)
save_data.columns = ['Company Name', 'Ticker', 'Sector']
save_data['Company Name'] = [name[:name.index('(')].strip() for name in save_data['Company Name']]
save_data
save_data.to_csv('company basic info')
'''

"\nsave_data = pd.DataFrame(name_ticker_sector)\nsave_data.columns = ['Company Name', 'Ticker', 'Sector']\nsave_data['Company Name'] = [name[:name.index('(')].strip() for name in save_data['Company Name']]\nsave_data\nsave_data.to_csv('company basic info')\n"

In [41]:
#CIK_dict['MICT']

In [42]:
# store the info for each firm in df, before we start web scraping 10Ks
firms_info = []

for company_list in name_ticker_sector:
    company_name = company_list[0]
    company_ticker = company_list[1]
    company_sector = company_list[2]
    company_cik = CIK_dict[company_ticker]
    
    # base url for every search 10-K page on SEC.org
    search_10K_page_base_url = r'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=|&type=10-k'

    # get firm-specific url
    search_10K_page_url = search_10K_page_base_url.replace('|', company_cik)

    # connect to page and search
    search_page = requests.get(search_10K_page_url).content
    soup = BeautifulSoup(search_page, 'html')

    # store report numbers for each firm
    ###can do later

    # find the section denoting report number
    for td_instance in soup.find_all('td',{'class':'small'}):
        text = td_instance.text
        
        # skip 10-K/A files which are amendments
        if '[Amend]' in text:
            #print('10K/A file encountered')
            continue
        
        # parse to get each report number
        index_of_start = text.find('Acc-no:')
        index_of_end = text.find('(34 Act)') # stop before '('

        # some old ones has no (34 Act)
        if index_of_end == -1:
            index_of_end = text.find('Size:')

        if index_of_start == -1 or index_of_end == -1:
            # guess no annual report in here
            print('no annual report in this <td> class')
            print(td_instance.text)
            continue

        index_of_start += len('Acc-no:') # start after this string

        # get report number
        report_number = text[index_of_start: index_of_end]
        # remove - and whitespace
        report_number = report_number.strip().replace('-', '')
        # find date
        report_date = td_instance.findNext('td').text
        # update df
        #print(pd.DataFrame([name, report_number, report_date], columns=list(firms_info.columns)))
        firms_info.append([company_name, company_ticker, company_cik, company_sector, report_number, report_date])

#convert to dataframe
firms_info = pd.DataFrame(firms_info, columns=['Company Name', 'Ticker', 'CIK', 'Sector', '10-K Report Number', '10-K Report Date'])
firms_info['Company Name'] = [name[:name.index('(')].strip() for name in firms_info['Company Name']]

In [43]:
firms_info.head()

Unnamed: 0,Company Name,Ticker,CIK,Sector,10-K Report Number,10-K Report Date
0,Qumu,QUMU,892482,Information Technology,89248220000007,2020-03-06
1,Qumu,QUMU,892482,Information Technology,89248219000013,2019-03-15
2,Qumu,QUMU,892482,Information Technology,89248218000015,2018-03-23
3,Qumu,QUMU,892482,Information Technology,89248217000024,2017-03-31
4,Qumu,QUMU,892482,Information Technology,89248216000029,2016-03-15


In [44]:
#firms_info.iloc[0,:].to_list()+['s']

In [45]:
#report_number = firms_info.iloc[0,:]['10-K Report Number']
#CIK = firms_info.iloc[0,:]['CIK']
#xlm_base_url = r'https://www.sec.gov/Archives/edgar/data/CIK/report_number/FilingSummary.xml'
#xlm_base_url.replace('CIK', CIK).replace('report_number', report_number)

In [46]:
#get_10_K_page_url
def get_10_K_page_url(xml_summary):
    page_10K_url = xml_summary.replace('/FilingSummary.xml', '')
    page_10K_url = page_10K_url[:-8] + '-' + page_10K_url[-8:]
    page_10K_url = page_10K_url[:-6] + '-' + page_10K_url[-6:]
    page_10K_url += '-index.htm'
    return page_10K_url

#get 10 K html text file
def get_10_K_txt(xml_summary):
    page_10K_url = get_10_K_page_url(xml_summary)
    return page_10K_url.replace('-index.htm', '.txt')
get_10_K_page_url('https://www.sec.gov/Archives/edgar/data/0000892482/000089248220000007/FilingSummary.xml')
get_10_K_txt('https://www.sec.gov/Archives/edgar/data/0000892482/000089248220000007/FilingSummary.xml')

'https://www.sec.gov/Archives/edgar/data/0000892482/0000892482-20-000007.txt'

In [47]:
# decode weird strings...not really working
def restore_windows_1252_characters(restore_string):
    """
        Replace C1 control characters in the Unicode string s by the
        characters at the corresponding code points in Windows-1252,
        where possible.
    """

    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
        
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

In [48]:
#import webbrowser
#webbrowser.open_new_tab('sec_data\\nmsl.html')

In [100]:
############################################
# helper functions for reports without .xlm summary
# older ones before 2012
############################################
def parse_acquisition_table(table_instance): # a bs4 table instance
    acquisition_related = False
    
    # define a dictionary that will store the different parts of the statement.
    table_data = {}
    table_data['headers'] = []
    table_data['sections'] = []
    table_data['data'] = []

    # find all the rows, figure out what type of row it is, parse the elements, and store in the statement file list.
    for index, row in enumerate(table_instance.find_all('tr')):
        
        # first let's get all the elements.
        cols = row.find_all('td')
        
        # get data by row
        reg_row = [ele.text.strip() for ele in cols] # remove unicode char
        table_data['data'].append(reg_row)

        if True in ['acquisition' in element.lower() for element in reg_row]:
            acquisition_related = True
        '''
        # if it's a regular row and not a section or a table header
        if (len(row.find_all('th')) == 0 and len(row.find_all('strong')) == 0): 
            reg_row = [ele.text.strip() for ele in cols]
            table_data['data'].append(reg_row)
                        
            if True in ['acquisition' in element.lower() for element in reg_row]:
                acquisition_related = True
            
        # if it's a regular row and a section but not a table header
        elif (len(row.find_all('th')) == 0 and len(row.find_all('strong')) != 0):
            sec_row = cols[0].text.strip()
            table_data['sections'].append(sec_row)
            
            if True in ['acquisition' in element.lower() for element in sec_row]:
                acquisition_related = True
            
        # finally if it's not any of those it must be a header
        elif (len(row.find_all('th')) != 0):            
            hed_row = [unicodedata.normalize('NFKD',ele.text.strip()) for ele in row.find_all('th')]
            table_data['headers'].append(hed_row)
             
            if True in ['acquisition' in element.lower() for element in hed_row]:
                acquisition_related = True
            
        else:            
            print('We encountered an error.')
        ''' 
    '''
    # turn to dataframe
    table = pd.DataFrame(table_data['data'])
    if acquisition_related: 
        # only return when it is acquisition related
        return table
    return # return None if not
    '''
    return acquisition_related

# get acquisition related reports for older 10-Ks using .txt html url, return a list of urls where we save them
def get_info_from_txt(txt_url, name):
    
    content = requests.get(txt_url).content
    soup = BeautifulSoup(content, 'html')
    # get notes sections for acquisition
    
    stuff = []
    bolds = soup.find_all('b')
    if bolds is not None:
        for bold_instance in bolds:
            bold_font = bold_instance.findChild('font' , recursive=False)
            if bold_font is not None:
                note_html = str(bold_font) # new way

                bold_text = bold_font.text
                if bold_text.lower() == 'acquisitions':
                    #acquisition_notes = bold_text + '\n'

                    # this is the next note section's title
                    next_bold_instance = bold_instance.find_next('b')
                    div_to_stop = next_bold_instance.parent

                    # the paragraphs we want
                    div_instance = bold_instance.find_next('div')

                    while div_instance != div_to_stop:
                        #paragraph = unicodedata.normalize('NFKD', div_instance.text)
                        #acquisition_notes += paragraph

                        # new way
                        note_html += str(div_instance)

                        div_instance = div_instance.find_next('div')

                    #stuff.append(acquisition_notes.encode('ascii', 'ignore')) # remove unicode char
                    # new way
                    stuff.append(note_html) # remove unicode char

    # get table sections for acquisition
    tables = soup.find_all('table')
    if tables is not None:
        for table_instance in tables:
                acquisition_related = parse_acquisition_table(table_instance)
                if acquisition_related: # if table is not None:
                    stuff.append(table_instance)

    report_urls = []
    # for all related stuff, write to html and get the link for them
    if stuff:
        for index, html_instance in enumerate(stuff):
            report_url = add_link_and_write(html_instance, name + '_report_' + str(index), dir_name='sec_data')
            report_urls.append(['report ' + str(index), report_url])

    return report_urls

# add a parent tag for the html instanse parsed
def wrap(to_wrap, wrap_in):
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
    
# add link to the html instance we have s.t. we can write it to a html file and call later
def add_link_and_write(html_instance, link_name, dir_name='sec_data'):
    # turn tag into a bs4 instance
    info = BeautifulSoup(str(html_instance))
    wrap(info.table, info.new_tag('a'))
    
    url = r"file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\{}\{}.html".format(dir_name, link_name)
    info.a['href'] = url
    #<a href="where/you/want/the/link/to/go">text of the link</a>
    
    # write to html
    f = open('{}\\{}.html'.format(dir_name, link_name),'wb')
    f.write(str(info).encode('ascii', 'ignore'))
    f.close()
    
    return url

In [128]:
# main function
def find_relevent_reports(firms_info):
    # to be returned
    #relevant_table_links = []
    # to be returned and saved for Professor
    relevant_table_links = pd.DataFrame(' ', index=list(firms_info['Ticker'].unique()), columns=np.arange(1950,2021)[::-1].astype(str))
    
    # failed to parse one
    failures = 0
    failed_table_links = []
    
    # base link for every xlm doc for each 10-K
    xlm_base_url = r'https://www.sec.gov/Archives/edgar/data/CIK/report_number/FilingSummary.xml'
    
    # for each 10-K we have
    for i in range(firms_info.shape[0]):
        
        #if i > 10:# or i > 200:
        #    continue
            
        print(i)
        # current row
        report_info_df = firms_info.iloc[i,:]
        #if report_info_df['Ticker'] != 'ATVI':
        #    continue
        
        print(report_info_df['Ticker'])
        print(report_info_df['10-K Report Date'])
        print('-'*100)
        
        # xlm link for current 10-K
        xml_summary = xlm_base_url.replace('CIK', report_info_df['CIK']).replace('report_number', report_info_df['10-K Report Number'])
    
        # define a new base url that represents the filing folder. This will come in handy when we need to download the reports.
        base_url = xml_summary.replace('FilingSummary.xml', '')
        
        print(xml_summary)
        
        try:
            # request and parse the content
            content = requests.get(xml_summary).content
            soup = BeautifulSoup(content, 'lxml')
            
        except:
            #print(report_info_df['Company Name'])
            print('Link does not exist')
            print(xml_summary)
            print('-'*100)

        # find the 'myreports' tag because this contains all the individual reports submitted.
        reports = soup.find('myreports')

        '''
        # some AVTI .xlm has no htmlfilename but only xml instance, which is not readable... 
        if report_info_df['Ticker'] in ['ATVI', 'NRP', 'HES', 'SLB', 'CE'] and int(report_info_df['10-K Report Date'].split('-')[0]) <= 2011:
            continue
            # one more try with Report
            #reports = soup.find('report')
            # print(reports)
        '''
        if reports == None:
            # this is the case when the 10-Ks are early than (including 2011)
            # no .xlm summary exist so directly parse .txt html file
            print('Directly parse .txt html file')
            try:
                page_10_K_url = get_10_K_page_url(xml_summary)
                txt_url = get_10_K_txt(xml_summary)
                reports_url = get_info_from_txt(txt_url, name=report_info_df['Ticker']+'_'+report_info_df['10-K Report Date'])

                # of form [report name, report url]
                for report_name_and_url in reports_url:

                    #relevant_table_links.append(report_info_df.tolist() + [page_10_K_url] + report_url)
                    relevant_table_links.loc[report_info_df['Ticker'], report_info_df['10-K Report Date'].split('-')[0]] += report_name_and_url[1] + ' | '

                    print(page_10_K_url)
                    print(report_name_and_url[0])
                    print(report_name_and_url[1])
                    print('-'*100)

            # real issue didnt catch
            except:
                print('got fucked 1!')
                print(get_10_K_txt(xml_summary))
                print('-'*100)
                failures += 1
                failed_table_links.append([report_info_df['Ticker'], report_info_df['10-K Report Date'], get_10_K_txt(xml_summary)])
                continue
            
        else: 
            # loop through each report in the 'myreports' tag but avoid the last one as this will cause an error.
            try:
                for report in reports.find_all('report')[:-1]:

                    # acquisition related
                    # what we want
                    if 'acquisition' in report.shortname.text.lower() or 'aquisition' in report.longname.text.lower():

                        # update our list
                        page_10_K_url = get_10_K_page_url(xml_summary)
                        table_name = report.shortname.text # report.longname.text # both long or short is fine
                        table_url = base_url + report.htmlfilename.text

                        #relevant_table_links.append(report_info_df.tolist() + [page_10_K_url, table_name, table_url])
                        relevant_table_links.loc[report_info_df['Ticker'], report_info_df['10-K Report Date'].split('-')[0]] += table_url + ' | '
                        
                        print(page_10_K_url)
                        print(table_name)
                        print(table_url)
                        print('-'*100)
                        
            except:
                # some .xlm summary have acquisition related report but not .htm instance
                try:
                    page_10_K_url = get_10_K_page_url(xml_summary)
                    txt_url = get_10_K_txt(xml_summary)
                    reports_url = get_info_from_txt(txt_url, name=report_info_df['Ticker']+'_'+report_info_df['10-K Report Date'])

                    # of form [report name, report url]
                    for report_name_and_url in reports_url:
                        
                        #relevant_table_links.append(report_info_df.tolist() + [page_10_K_url] + report_url)
                        relevant_table_links.loc[report_info_df['Ticker'], report_info_df['10-K Report Date'].split('-')[0]] += report_name_and_url[1] + ' | '
                        

                        print(page_10_K_url)
                        print(report_name_and_url[0])
                        print(report_name_and_url[1])
                        print('-'*100)

                # real issue didnt catch
                except:
                    print('got fucked 2!')
                    failures += 1
                    failed_table_links.append([report_info_df['Ticker'], report_info_df['10-K Report Date'], get_10_K_txt(xml_summary)])
                    continue

                
    #relevant_table_links = pd.DataFrame(relevant_table_links, columns=['Company Name', 'Ticker', 'CIK', 'Sector', 
    #                                                                   '10-K Report Number', '10-K Report Date', 
    #                                                                   '10-K SEC Page', 'Table Name', 'Table Link'])
    #relevant_table_links = relevant_table_links.drop(columns=['CIK', 'Sector', '10-K Report Number'])
    
    failed_table_links = pd.DataFrame(failed_table_links, columns=['Ticker', '10-K Report Date', '10-K Report URL'])

    return relevant_table_links, failed_table_links, failures

In [129]:
relevant_table_links, failed_tables, failed_count = find_relevent_reports(firms_info)

0
QUMU
2020-03-06
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/000089248220000007/FilingSummary.xml
1
QUMU
2019-03-15
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/000089248219000013/FilingSummary.xml
2
QUMU
2018-03-23
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/000089248218000015/FilingSummary.xml
3
QUMU
2017-03-31
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/000089248217000024/FilingSummary.xml
4
QUMU
2016-03-15
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/00

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000892482/0000897101-10-000537-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\QUMU_2010-03-12_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/0000897101-10-000537-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\QUMU_2010-03-12_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000892482/0000897101-10-000537-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\QUMU_2010-03-12_report_2.html
----------------------------------------------------------------------------------------------------
11
QUMU
2009-03-16
---------------------------

30
MICT
2014-03-19
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000854800/000117891314000982/FilingSummary.xml
31
MICT
2013-03-29
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000854800/000117891313000945/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0000854800/0001178913-13-000945-index.htm
ACQUISITION OF NON-CONTROLLING INTEREST
https://www.sec.gov/Archives/edgar/data/0000854800/000117891313000945/R10.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000854800/0001178913-13-000945-index.htm
ACQUISITION OF NON-CONTROLLING INTEREST (Details)
https://www.sec.gov/Archives/edgar/data/0000854800/000117891313000945/R46.htm
-------------------------------------------------------------------------

42
ATVI
2014-03-03
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000104746914001688/FilingSummary.xml
43
ATVI
2013-02-22
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000104746913001506/FilingSummary.xml
44
ATVI
2012-02-28
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000104746912001775/FilingSummary.xml
45
ATVI
2011-02-25
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000104746911001413/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0000718877/0001047469-11-001413-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\

Directly parse .txt html file
57
ATVI
1999-06-29
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000104746999025826/FilingSummary.xml
Directly parse .txt html file
58
ATVI
1998-06-15
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000104746998024188/FilingSummary.xml
Directly parse .txt html file
59
ATVI
1997-06-16
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000091205797020468/FilingSummary.xml
Directly parse .txt html file
60
ATVI
1996-07-08
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000718877/000071887796000011/FilingSummary.xml
Directly parse .txt html file
61
ATVI
1995-06-30

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001377789/0000950123-10-084926-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\AVNW_2010-09-09_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001377789/0000950123-10-084926-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\AVNW_2010-09-09_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001377789/0000950123-10-084926-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\AVNW_2010-09-09_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000137

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001377789/0000950134-08-017170-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\AVNW_2008-09-25_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001377789/0000950134-08-017170-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\AVNW_2008-09-25_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001377789/0000950134-08-017170-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\AVNW_2008-09-25_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000137

78
XSPA
2019-04-01
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001410428/000114420419017496/FilingSummary.xml
79
XSPA
2018-03-29
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001410428/000114420418017944/FilingSummary.xml
80
XSPA
2017-03-30
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001410428/000114420417017688/FilingSummary.xml
81
XSPA
2016-03-10
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001410428/000114420416087235/FilingSummary.xml
82
XSPA
2015-03-16
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/00014104

Directly parse .txt html file
got fucked 1!
https://www.sec.gov/Archives/edgar/data/0000107687/0000107687-10-000020.txt
----------------------------------------------------------------------------------------------------
97
WGO
2009-10-27
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000107687/000089710109002143/FilingSummary.xml
Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000107687/0000897101-09-002143-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\WGO_2009-10-27_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000107687/0000897101-09-002143-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\WGO_2009-10-27_report_1.html
-----------------------------

https://www.sec.gov/Archives/edgar/data/0001038133/0001038133-19-000015-index.htm
ACQUISITION AND RELATED PARTY ITEMS
https://www.sec.gov/Archives/edgar/data/0001038133/000103813319000015/R10.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/0001038133-19-000015-index.htm
ACQUISITION AND RELATED PARTY ITEMS (Tables)
https://www.sec.gov/Archives/edgar/data/0001038133/000103813319000015/R28.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/0001038133-19-000015-index.htm
ACQUISITION AND RELATED PARTY ITEMS - ACQUISITION INFORMATION (Details)
https://www.sec.gov/Archives/edgar/data/0001038133/000103813319000015/R45.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/0001038133-19

Directly parse .txt html file
123
HSKA
2010-02-22
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/000095012310015125/FilingSummary.xml
Directly parse .txt html file
124
HSKA
2009-03-16
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/000110465909017873/FilingSummary.xml
Directly parse .txt html file
125
HSKA
2008-03-03
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/000110465908014804/FilingSummary.xml
Directly parse .txt html file
126
HSKA
2007-03-30
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001038133/000110465907024411/FilingSummary.xml
Directly parse .txt html file
127
HSKA
2006-

https://www.sec.gov/Archives/edgar/data/0001290677/0001567619-17-000434-index.htm
Acquisitions
https://www.sec.gov/Archives/edgar/data/0001290677/000156761917000434/R14.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001290677/0001567619-17-000434-index.htm
Acquisitions (Tables)
https://www.sec.gov/Archives/edgar/data/0001290677/000156761917000434/R38.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001290677/0001567619-17-000434-index.htm
Acquisitions, Wind River Tobacco Company (Details)
https://www.sec.gov/Archives/edgar/data/0001290677/000156761917000434/R60.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001290677/0001567619-17-000434-index.htm
Acquisitions, VaporBeast (Details)
https://www.se

https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-17-000006-index.htm
Apex Acquisition
https://www.sec.gov/Archives/edgar/data/0000911147/000091114717000006/R10.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-17-000006-index.htm
Apex Acquisition (Tables)
https://www.sec.gov/Archives/edgar/data/0000911147/000091114717000006/R28.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-17-000006-index.htm
Apex Acquisition (Narrative) (Details)
https://www.sec.gov/Archives/edgar/data/0000911147/000091114717000006/R47.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-17-000006-index.htm
Apex Acquisition (Schedule of Estimated Fair Values o

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-11-000003-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CNTY_2011-03-31_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-11-000003-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CNTY_2011-03-31_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-11-000003-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CNTY_2011-03-31_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000091

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-07-000007-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CNTY_2007-03-16_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-07-000007-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CNTY_2007-03-16_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/0000911147-07-000007-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CNTY_2007-03-16_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000091

Directly parse .txt html file
160
CNTY
2003-03-13
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000911147/000091114703000003/FilingSummary.xml
Directly parse .txt html file
161
MOH
2020-02-14
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001179929/000117992920000026/FilingSummary.xml
162
MOH
2019-02-19
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001179929/000117992919000032/FilingSummary.xml
163
MOH
2018-03-01
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001179929/000117992918000040/FilingSummary.xml
164
MOH
2017-03-01
----------------------------------------------------------------------------------------

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001179929/0001193125-07-054663-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\MOH_2007-03-14_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001179929/0001193125-07-054663-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\MOH_2007-03-14_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001179929/0001193125-07-054663-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\MOH_2007-03-14_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000117992

https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-17-065814-index.htm
Acquisitions
https://www.sec.gov/Archives/edgar/data/0001043000/000119312517065814/R10.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-17-065814-index.htm
Acquisitions (Tables)
https://www.sec.gov/Archives/edgar/data/0001043000/000119312517065814/R28.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-17-065814-index.htm
Acquisitions - Additional Information (Detail)
https://www.sec.gov/Archives/edgar/data/0001043000/000119312517065814/R43.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-17-065814-index.htm
Acquisitions - Schedule of Pro Forma Combined Results

https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-13-098934-index.htm
Acquisitions
https://www.sec.gov/Archives/edgar/data/0001043000/000119312513098934/R11.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-13-098934-index.htm
Acquisitions (Tables)
https://www.sec.gov/Archives/edgar/data/0001043000/000119312513098934/R29.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-13-098934-index.htm
Acquisitions (Details)
https://www.sec.gov/Archives/edgar/data/0001043000/000119312513098934/R46.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0001193125-13-098934-index.htm
Acquisitions (Details Textual)
https://www.sec.gov/Archives/edgar/data/000104

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001043000/0000950134-08-004629-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CSU_2008-03-12_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0000950134-08-004629-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CSU_2008-03-12_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/0000950134-08-004629-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CSU_2008-03-12_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000104300

Directly parse .txt html file
got fucked 1!
https://www.sec.gov/Archives/edgar/data/0001043000/0000950134-04-004249.txt
----------------------------------------------------------------------------------------------------
195
CSU
2003-03-28
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/000095013403004867/FilingSummary.xml
Directly parse .txt html file
196
CSU
2002-03-28
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/000095013402002869/FilingSummary.xml
Directly parse .txt html file
197
CSU
2001-03-21
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001043000/000095013401002395/FilingSummary.xml
Directly parse .txt html file
198
CSU
2000-03-30
----------------------------------------------

https://www.sec.gov/Archives/edgar/data/0001320414/0001047469-17-000890-index.htm
Acquisitions
https://www.sec.gov/Archives/edgar/data/0001320414/000104746917000890/R9.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001320414/0001047469-17-000890-index.htm
Acquisitions (Tables)
https://www.sec.gov/Archives/edgar/data/0001320414/000104746917000890/R30.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001320414/0001047469-17-000890-index.htm
Acquisitions - Physiotherapy Acquisition (Details)
https://www.sec.gov/Archives/edgar/data/0001320414/000104746917000890/R49.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001320414/0001047469-17-000890-index.htm
Acquisitions - Concentra Acquisition (Details)
htt

220
BPMC
2019-02-26
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001597264/000155837019001094/FilingSummary.xml
221
BPMC
2018-02-21
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001597264/000155837018000862/FilingSummary.xml
222
BPMC
2017-03-09
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001597264/000155837017001547/FilingSummary.xml
223
BPMC
2016-03-11
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001597264/000155837016004066/FilingSummary.xml
224
NRP
2020-02-27
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001

https://www.sec.gov/Archives/edgar/data/0001171486/0001193125-14-077852-index.htm
Significant Acquisitions
https://www.sec.gov/Archives/edgar/data/0001171486/000119312514077852/R9.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001171486/0001193125-14-077852-index.htm
Significant Acquisitions - Additional Information (Detail)
https://www.sec.gov/Archives/edgar/data/0001171486/000119312514077852/R39.htm
----------------------------------------------------------------------------------------------------
231
NRP
2013-02-28
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001171486/000119312513083594/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0001171486/0001193125-13-083594-index.htm
Significant acquisitions
https://www.sec.gov/Archives/edgar/data/0001171486/000119312513083594/R9.htm
---

Directly parse .txt html file
got fucked 1!
https://www.sec.gov/Archives/edgar/data/0001171486/0000950129-04-000984.txt
----------------------------------------------------------------------------------------------------
241
NRP
2003-03-31
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001171486/000095012903001690/FilingSummary.xml
Directly parse .txt html file
242
FTSI
2020-02-27
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001529463/000155837020001639/FilingSummary.xml
243
FTSI
2019-02-28
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001529463/000155837019001357/FilingSummary.xml
244
FTSI
2018-03-09
----------------------------------------------------------------------------------------------------
ht

https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-11-018415-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HES_2011-02-25_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-11-018415-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HES_2011-02-25_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-11-018415-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HES_2011-02-25_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-11-018415-index.h

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-06-002849-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HES_2006-03-09_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-06-002849-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HES_2006-03-09_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/0000950123-06-002849-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HES_2006-03-09_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000000444

Directly parse .txt html file
267
HES
2000-03-27
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/000095012300002763/FilingSummary.xml
Directly parse .txt html file
268
HES
1999-03-29
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/000095012399002682/FilingSummary.xml
Directly parse .txt html file
269
HES
1998-03-30
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/000095012398003112/FilingSummary.xml
Directly parse .txt html file
270
HES
1997-03-26
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000004447/000095012397002524/FilingSummary.xml
Directly parse .txt html file
271
HES
1996-03-27

https://www.sec.gov/Archives/edgar/data/0000087347/0001564590-16-012009-index.htm
Acquisitions
https://www.sec.gov/Archives/edgar/data/0000087347/000156459016012009/R12.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0001564590-16-012009-index.htm
Acquisitions - Additional Information (Detail)
https://www.sec.gov/Archives/edgar/data/0000087347/000156459016012009/R52.htm
----------------------------------------------------------------------------------------------------
279
SLB
2015-01-29
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/000156459015000337/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0000087347/0001564590-15-000337-index.htm
Acquisitions
https://www.sec.gov/Archives/edgar/data/0000087347/000156459015000337/R12.htm
-------------------------------------

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000087347/0001193125-09-024868-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2009-02-11_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0001193125-09-024868-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2009-02-11_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0001193125-09-024868-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2009-02-11_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000008734

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000087347/0001193125-06-037849-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2006-02-24_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0001193125-06-037849-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2006-02-24_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0001193125-06-037849-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2006-02-24_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000008734

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000087347/0000950130-03-001530-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2003-02-27_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0000950130-03-001530-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2003-02-27_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/0000950130-03-001530-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\SLB_2003-02-27_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000008734

Directly parse .txt html file
296
SLB
1998-03-30
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/000095013098001558/FilingSummary.xml
Directly parse .txt html file
297
SLB
1997-03-31
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000087347/000095013097001348/FilingSummary.xml
Directly parse .txt html file
298
ENV
2020-02-28
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001337619/000162828020002576/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0001337619/0001628280-20-002576-index.htm
Business Acquisitions
https://www.sec.gov/Archives/edgar/data/0001337619/000162828020002576/R10.htm
----------------------------------------------------------------------------------------------------
htt

https://www.sec.gov/Archives/edgar/data/0001337619/0001047469-15-001517-index.htm
Business Acquisitions
https://www.sec.gov/Archives/edgar/data/0001337619/000104746915001517/R9.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001337619/0001047469-15-001517-index.htm
Business Acquisitions (Tables)
https://www.sec.gov/Archives/edgar/data/0001337619/000104746915001517/R28.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001337619/0001047469-15-001517-index.htm
Business Acquisitions (Details)
https://www.sec.gov/Archives/edgar/data/0001337619/000104746915001517/R46.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001337619/0001047469-15-001517-index.htm
Business Acquisitions (Details 2)
https://www.sec.g

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001436208/0001558370-19-002900-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\LEGH_2019-04-09_report_0.html
----------------------------------------------------------------------------------------------------
310
PBT
2020-03-16
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000319654/000119312520074441/FilingSummary.xml
Directly parse .txt html file
311
PBT
2019-03-18
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000319654/000119312519078294/FilingSummary.xml
Directly parse .txt html file
312
PBT
2018-03-15
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000319654/000119312518083

343
MARPS
2011-09-16
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000062362/000095012311085006/FilingSummary.xml
Directly parse .txt html file
344
MARPS
2010-09-27
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000062362/000095012310089062/FilingSummary.xml
Directly parse .txt html file
345
MARPS
2009-09-21
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000062362/000095012309044552/FilingSummary.xml
Directly parse .txt html file
346
MARPS
2008-09-25
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000062362/000095013408017168/FilingSummary.xml
Directly parse .txt html file
347
MARPS
2007-09-28
-------------------

Directly parse .txt html file
378
NRT
2002-01-14
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000072633/000007263302000002/FilingSummary.xml
Directly parse .txt html file
379
NRT
2001-01-10
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000072633/000007263301000001/FilingSummary.xml
Directly parse .txt html file
380
NRT
2000-01-12
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000072633/000007263300000002/FilingSummary.xml
Directly parse .txt html file
381
NRT
1999-01-13
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000072633/000007263399000002/FilingSummary.xml
Directly parse .txt html file
382
NRT
1998-01-13

https://www.sec.gov/Archives/edgar/data/0001306830/0001306830-16-000212-index.htm
Acquisitions, Dispositions and Plant Closures
https://www.sec.gov/Archives/edgar/data/0001306830/000130683016000212/R11.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0001306830-16-000212-index.htm
Acquisitions, Dispositions and Plant Closures (Tables)
https://www.sec.gov/Archives/edgar/data/0001306830/000130683016000212/R38.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0001306830-16-000212-index.htm
Acquisitions, Dispositions and Plant Closures (Schedule of Restructuring and Related Costs) (Details)
https://www.sec.gov/Archives/edgar/data/0001306830/000130683016000212/R73.htm
----------------------------------------------------------------------------------------------------
https://www.se

https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-11-012862-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2011-02-11_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-11-012862-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2011-02-11_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-11-012862-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2011-02-11_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-11-012862-index.htm


Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-08-002318-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2008-02-29_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-08-002318-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2008-02-29_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950123-08-002318-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2008-02-29_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001306830/0000950136-06-002553-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2006-03-31_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950136-06-002553-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2006-03-31_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0000950136-06-002553-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CE_2006-03-31_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001306830/0

402
CMP
2019-03-01
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/000122765419000037/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0001227654/0001227654-19-000037-index.htm
ACQUISITION
https://www.sec.gov/Archives/edgar/data/0001227654/000122765419000037/R12.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/0001227654-19-000037-index.htm
ACQUISITION (Tables)
https://www.sec.gov/Archives/edgar/data/0001227654/000122765419000037/R30.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/0001227654-19-000037-index.htm
ACQUISITION (Additional Information) (Details)
https://www.sec.gov/Archives/edgar/data/0001227654/000122765419000037/R47.htm
------------------------------

408
CMP
2013-02-21
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/000114036113008641/FilingSummary.xml
409
CMP
2012-02-22
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/000114036112010153/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0001227654/0001140361-12-010153-index.htm
ACQUISITION
https://www.sec.gov/Archives/edgar/data/0001227654/000114036112010153/R13.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/0001140361-12-010153-index.htm
ACQUISITION (Tables)
https://www.sec.gov/Archives/edgar/data/0001227654/000114036112010153/R32.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0001227654/0000950137-05-003091-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CMP_2005-03-16_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/0000950137-05-003091-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CMP_2005-03-16_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0001227654/0000950137-05-003091-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\CMP_2005-03-16_report_2.html
----------------------------------------------------------------------------------------------------
417
CMP
2004-03-19
------------------------------

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000046250/0000897101-08-001344-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HWKN_2008-06-13_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000046250/0000897101-08-001344-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HWKN_2008-06-13_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000046250/0000897101-08-001344-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\HWKN_2008-06-13_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000004

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000778164/0001019687-16-005470-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\PEIX_2016-03-15_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000778164/0001019687-16-005470-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\PEIX_2016-03-15_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000778164/0001019687-16-005470-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\PEIX_2016-03-15_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000077

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000778164/0001019687-08-001360-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\PEIX_2008-03-27_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000778164/0001019687-08-001360-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\PEIX_2008-03-27_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000778164/0001019687-08-001360-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\PEIX_2008-03-27_report_2.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/000077

https://www.sec.gov/Archives/edgar/data/0000833079/0000833079-18-000007-index.htm
ACQUISITIONS AND GOODWILL
https://www.sec.gov/Archives/edgar/data/0000833079/000083307918000007/R15.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000833079/0000833079-18-000007-index.htm
ACQUISITIONS AND GOODWILL (Tables)
https://www.sec.gov/Archives/edgar/data/0000833079/000083307918000007/R33.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000833079/0000833079-18-000007-index.htm
ACQUISITIONS AND GOODWILL - Narrative (Details)
https://www.sec.gov/Archives/edgar/data/0000833079/000083307918000007/R62.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000833079/0000833079-18-000007-index.htm
ACQUISITIONS AND GOODWILL 

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000833079/0001104659-08-012616-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\MTH_2008-02-25_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000833079/0001104659-08-012616-index.htm
report 1
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\MTH_2008-02-25_report_1.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000833079/0001104659-08-012616-index.htm
report 2
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\MTH_2008-02-25_report_2.html
----------------------------------------------------------------------------------------------------
478
MTH
2007-02-26
------------------------------

Directly parse .txt html file
491
MTH
1994-03-31
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000833079/000095014794000036/FilingSummary.xml
Directly parse .txt html file
492
CVCO
2020-05-27
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000278166/000027816620000028/FilingSummary.xml
https://www.sec.gov/Archives/edgar/data/0000278166/0000278166-20-000028-index.htm
Acquisition of Destiny Homes
https://www.sec.gov/Archives/edgar/data/0000278166/000027816620000028/R28.htm
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000278166/0000278166-20-000028-index.htm
Acquisition of Destiny Homes (Details)
https://www.sec.gov/Archives/edgar/data/0000278166/000027816620000028/R121.htm
---------------------------------

https://www.sec.gov/Archives/edgar/data/0000798081/0001144204-12-022494-index.htm
BUSINESS COMBINATIONS-Acquisition of Qualytextil, S.A.
https://www.sec.gov/Archives/edgar/data/0000798081/000114420412022494/R12.htm
----------------------------------------------------------------------------------------------------
520
LAKE
2011-04-07
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000798081/000114420411020736/FilingSummary.xml
Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000798081/0001144204-11-020736-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\LAKE_2011-04-07_report_0.html
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000798081/0001144204-11-020736-index.htm
report 1
file:///C:\Users\happy\OneDrive - California I

Directly parse .txt html file
https://www.sec.gov/Archives/edgar/data/0000798081/0000914317-07-001003-index.htm
report 0
file:///C:\Users\happy\OneDrive - California Institute of Technology\Desktop\sec_data\LAKE_2007-04-12_report_0.html
----------------------------------------------------------------------------------------------------
525
LAKE
2006-04-17
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000798081/000091431706001087/FilingSummary.xml
Directly parse .txt html file
526
LAKE
2005-04-15
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000798081/000091431705001282/FilingSummary.xml
Directly parse .txt html file
527
LAKE
2004-04-30
----------------------------------------------------------------------------------------------------
https://www.sec.gov/Archives/edgar/data/0000798081/000091431704

In [130]:
print('Unable to process', failed_count, '10-Ks in total.')

Unable to process 16 10-Ks in total.


In [131]:
failed_tables

Unnamed: 0,Ticker,10-K Report Date,10-K Report URL
0,WGO,2010-10-26,https://www.sec.gov/Archives/edgar/data/000010...
1,MOH,2011-03-08,https://www.sec.gov/Archives/edgar/data/000117...
2,MOH,2010-03-16,https://www.sec.gov/Archives/edgar/data/000117...
3,MOH,2008-03-17,https://www.sec.gov/Archives/edgar/data/000117...
4,CSU,2007-03-16,https://www.sec.gov/Archives/edgar/data/000104...
5,CSU,2004-03-29,https://www.sec.gov/Archives/edgar/data/000104...
6,SEM,2011-03-09,https://www.sec.gov/Archives/edgar/data/000132...
7,SEM,2010-03-17,https://www.sec.gov/Archives/edgar/data/000132...
8,NRP,2011-02-28,https://www.sec.gov/Archives/edgar/data/000117...
9,NRP,2010-02-26,https://www.sec.gov/Archives/edgar/data/000117...


In [133]:
relevant_table_links

Unnamed: 0,2020,2019,2018,2017,2016,2015,2014,2013,2012,2011,...,1959,1958,1957,1956,1955,1954,1953,1952,1951,1950
QUMU,,,,,https://www.sec.gov/Archives/edgar/data/00008...,https://www.sec.gov/Archives/edgar/data/00008...,https://www.sec.gov/Archives/edgar/data/00008...,https://www.sec.gov/Archives/edgar/data/00008...,https://www.sec.gov/Archives/edgar/data/00008...,file:///C:\Users\happy\OneDrive - California ...,...,,,,,,,,,,
MICT,,,,,,,,https://www.sec.gov/Archives/edgar/data/00008...,https://www.sec.gov/Archives/edgar/data/00008...,,...,,,,,,,,,,
ATVI,,https://www.sec.gov/Archives/edgar/data/00007...,https://www.sec.gov/Archives/edgar/data/00007...,https://www.sec.gov/Archives/edgar/data/00007...,https://www.sec.gov/Archives/edgar/data/00007...,,,,,file:///C:\Users\happy\OneDrive - California ...,...,,,,,,,,,,
ESTC,https://www.sec.gov/Archives/edgar/data/00017...,https://www.sec.gov/Archives/edgar/data/00017...,,,,,,,,,...,,,,,,,,,,
AVNW,,,,,,,,,https://www.sec.gov/Archives/edgar/data/00013...,file:///C:\Users\happy\OneDrive - California ...,...,,,,,,,,,,
XSPA,,,,,,,,https://www.sec.gov/Archives/edgar/data/00014...,,file:///C:\Users\happy\OneDrive - California ...,...,,,,,,,,,,
WGO,,,,,,,,https://www.sec.gov/Archives/edgar/data/00001...,https://www.sec.gov/Archives/edgar/data/00001...,file:///C:\Users\happy\OneDrive - California ...,...,,,,,,,,,,
HSKA,https://www.sec.gov/Archives/edgar/data/00010...,https://www.sec.gov/Archives/edgar/data/00010...,https://www.sec.gov/Archives/edgar/data/00010...,https://www.sec.gov/Archives/edgar/data/00010...,https://www.sec.gov/Archives/edgar/data/00010...,https://www.sec.gov/Archives/edgar/data/00010...,https://www.sec.gov/Archives/edgar/data/00010...,,,,...,,,,,,,,,,
TPB,https://www.sec.gov/Archives/edgar/data/00012...,https://www.sec.gov/Archives/edgar/data/00012...,https://www.sec.gov/Archives/edgar/data/00012...,https://www.sec.gov/Archives/edgar/data/00012...,,,,,,,...,,,,,,,,,,
CNTY,https://www.sec.gov/Archives/edgar/data/00009...,https://www.sec.gov/Archives/edgar/data/00009...,https://www.sec.gov/Archives/edgar/data/00009...,https://www.sec.gov/Archives/edgar/data/00009...,,https://www.sec.gov/Archives/edgar/data/00009...,https://www.sec.gov/Archives/edgar/data/00009...,,file:///C:\Users\happy\OneDrive - California ...,file:///C:\Users\happy\OneDrive - California ...,...,,,,,,,,,,


In [135]:
relevant_table_links.loc['QUMU','2016']

' https://www.sec.gov/Archives/edgar/data/0000892482/000089248216000029/R9.htm | https://www.sec.gov/Archives/edgar/data/0000892482/000089248216000029/R24.htm | https://www.sec.gov/Archives/edgar/data/0000892482/000089248216000029/R37.htm | https://www.sec.gov/Archives/edgar/data/0000892482/000089248216000029/R38.htm | '

In [134]:
# save file for RA
relevant_table_links.to_csv('reports_links_for_selected_firms.csv')
failed_tables.to_csv('failed_to_parse_for_selected_firms.csv')

In [None]:
################################

In [9]:
# some checking
relative_fp = os.path.realpath(os.getcwd())
fp = relative_fp + 'reports_links_for_selected_firms.csv'
report_names = pd.read_csv('reports_links_for_selected_firms.csv')['Ticker']

In [11]:
report_names


0       QUMU
1       QUMU
2       QUMU
3       QUMU
4       QUMU
        ... 
1167    LAKE
1168    LAKE
1169    LAKE
1170    LAKE
1171    LAKE
Name: Ticker, Length: 1172, dtype: object

In [14]:
names = pd.read_csv('company basic info')['Ticker']
for name in names: 
    #print(name)
    if name not in list(report_names):
        print(name)
        
# AKBA new firm since 2018, no acq
# BPMC new firm since 2016, no acq
# PBT firm since 1996, checked .txt 10K no acq
# MARPS firm since 1995, checked .txt 10K no acq
# NRT firm since 1996, checked .txt 10K no acq
# HYMC new firm since 2019, no acq
# WRTC new firm since 2018, no acq
# RDVT new firm since 2019, no acq

AKBA
BPMC
PBT
MARPS
NRT
HYMC
WRTC
RDVT


***
## Grab the Filing XML Summary
Something that makes 10-K and for that matter 10-Q filings so unique is we have access to a particular document that gives us a quick way to grab the data we need from a 10-K. This file is the **filing summary** and comes in an either an `XML` or `xlsx` format. While you would think these two files would be identical, they are not, the `XML` version of the file provides us with a quick way to see the structure of the 10-K, defines whether a section is a note, table or details, and the name and each corresponding file for each section.

The `xlsx` file, on the other hand, contains each section of the 10K in an excel style format. This file can come in handy if we want to parse just a single location, but be warned that formatting issues will not make it a simple load.

Let's assume we want to parse the `XML` file as we want to leverage the underlying structure of the 10-K report. In the section below, I outline how you would go about this process and use a sample document URL for our demonstration.

In [13]:
# define the base url needed to create the file url.
base_url = r"https://www.sec.gov"

# convert a normal url to a document url
normal_url = r"https://www.sec.gov/Archives/edgar/data/50863/0000050863-20-000011.txt"

# define a url that leads to a 10k document landing page
documents_url = normal_url.replace('-','').replace('.txt','/index.json')

# request the url and decode it.
content = requests.get(documents_url).json()

for file in content['directory']['item']:
    
    # Grab the filing summary and create a new url leading to the file so we can download it.
    if file['name'] == 'FilingSummary.xml':

        xml_summary = base_url + content['directory']['name'] + "/" + file['name']
        
        print('-' * 100)
        print('File Name: ' + file['name'])
        print('File Path: ' + xml_summary)       

----------------------------------------------------------------------------------------------------
File Name: FilingSummary.xml
File Path: https://www.sec.gov/Archives/edgar/data/50863/000005086320000011/FilingSummary.xml


***
## Parsing the Filing Summary
Okay, we now have access to a filing summary file. The first thing we need to do is request the file using the `requests` library we will then take the contents of that request and pass through our `BeautifulSoup` object. I encourage individuals who are new to this process to look at the file itself, so you better understanding of the structure, this will reinforce my approach below.

The main section of the file we want to grab belongs under the `myreports` tag. This contains a list of each of the reports in the document. Each report falls under a `report` tag and has the following structure:

`<Report instance="mtii-20181231.xml">
    <IsDefault>false</IsDefault>
    <HasEmbeddedReports>false</HasEmbeddedReports>
    <HtmlFileName>R1.htm</HtmlFileName>
    <LongName>0001000 - Document - Document and Entity Information</LongName>
    <ReportType>Sheet</ReportType>
    <Role>http://www.monitronics.com/role/DocumentAndEntityInformation</Role>
    <ShortName>Document and Entity Information</ShortName>
    <MenuCategory>Cover</MenuCategory>
    <Position>1</Position>
</Report>`

The main tags we will be concerned with are the following:

1. **`HtmlFileName`** - This is the name of the file and will be needed to build the file URL.
2. **`LongName`** - This is the long name of the report, with its ID. Keep in mind the ID can be leveraged in other 10Ks of other companies, but unfortunately, it is not always guaranteed.
3. **`ShortName`** - The short name of the report, this is surprisingly more consistent across companies compared to the long name which includes the ID.
4. **`MenuCategory`** - This can be thought of as a category the report falls under, a table, notes, details, cover, or statements. This will be leveraged as another filtering mechanism.
5. **`Position`** - This is the position of the report in the main document and also corresponds to the `HtmlFileName.`

In [127]:
company_xml_summaries = {'INTC': r'https://www.sec.gov/Archives/edgar/data/50863/000005086320000011/FilingSummary.xml', 
                         'GOOG': r'https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/FilingSummary.xml',
                         'AMZN': r'https://www.sec.gov/Archives/edgar/data/1018724/000101872420000004/FilingSummary.xml'}

In [None]:
def find_relevent_reports(xlm_summaries):
    #dict for relevant reports for each company
    company_reports = {}
    
    for company in company_xml_summaries.keys():
        xml_summary = company_xml_summaries[company]
    
        # define a new base url that represents the filing folder. This will come in handy when we need to download the reports.
        base_url = xml_summary.replace('FilingSummary.xml', '')

        # request and parse the content
        content = requests.get(xml_summary).content

        soup = BeautifulSoup(content, 'lxml')

        # find the 'myreports' tag because this contains all the individual reports submitted.
        reports = soup.find('myreports')

        # I want a list to store all the individual components of the report, so create the master list.
        master_reports = []

        # for aquisition related reports#############
        acquisition_reports = []

        # loop through each report in the 'myreports' tag but avoid the last one as this will cause an error.
        for report in reports.find_all('report')[:-1]:

            # let's create a dictionary to store all the different parts we need.
            report_dict = {}
            report_dict['name_short'] = report.shortname.text
            report_dict['name_long'] = report.longname.text
            report_dict['position'] = report.position.text
            report_dict['category'] = report.menucategory.text
            report_dict['url'] = base_url + report.htmlfilename.text

            # append the dictionary to the master list.
            master_reports.append(report_dict)

            # acquisition related
            #what we want
            if 'acquisition' in report.shortname.text.lower() or 'aquisition' in report.longname.text.lower():
                acquisition_reports.append(report_dict)

                # print the info to the user.
                #print('-'*100)
                print(company)
                print(base_url + report.htmlfilename.text)
                print(report.longname.text)
                print(report.shortname.text)
                print(report.menucategory.text)
                print(report.position.text)
            
        company_reports[company] = acquisition_reports

    return company_reports


In [129]:
company_relevent_reports = find_relevent_reports(company_xml_summaries)

INTC
https://www.sec.gov/Archives/edgar/data/50863/000005086320000011/R19.htm
2112100 - Disclosure - Acquisitions & Divestitures
Acquisitions & Divestitures
Notes
19
INTC
https://www.sec.gov/Archives/edgar/data/50863/000005086320000011/R67.htm
2412401 - Disclosure - Acquisitions (Details)
Acquisitions (Details)
Details
67
GOOG
https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/R16.htm
2111100 - Disclosure - Acquisitions
Acquisitions
Notes
16
GOOG
https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/R65.htm
2411401 - Disclosure - Acquisitions (Narrative) (Details)
Acquisitions (Narrative) (Details)
Details
65
GOOG
https://www.sec.gov/Archives/edgar/data/1652044/000165204420000008/R67.htm
2412403 - Disclosure - Goodwill and Other Intangible Assets (Acquisition-Related Intangible Assets that are being Amortized) (Details)
Goodwill and Other Intangible Assets (Acquisition-Related Intangible Assets that are being Amortized) (Details)
Details
67
GOOG
https://www

In [141]:
# function to help find words related to acquisition and number and cost, and n words before and after them
def extract_info(lst, n_words=16):
    # a set for the indexes to pull out the important words
    info_indexes = set()
    
    acquisition_related_words = ['acquire', 'acquisition']
    numbers = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
    digits = [str(i) for i in range(1, 11)]
    dollar = ['$']
    numerical_keywords = numbers + digits + dollar
    
    for i in range(len(lst)):
        word = lst[i]
        
        # find if any word is a keyword of any variation
        if True in [keyword.lower() in word for keyword in acquisition_related_words]:

            # check if numbers or digits or dollars are n words adjacent
            front = lst[i-n_words : i+1]
            #for front
            for front_word in front: 
                if True in [keyword.lower() in front_word for keyword in numerical_keywords]: 
                    info_indexes.update([index for index in range(max(i-n_words, 0), i+1)])
            
            end = lst[i : i+n_words+1]
            #for end
            for end_word in end: 
                if True in [keyword.lower() in end_word for keyword in numerical_keywords]:
                    info_indexes.update([index for index in range(i, min(i+n_words+1, len(lst)))])   
            
        '''
        # find if any word is a keyword of any variation
        if True in [word in keyword.lower() for keyword in keywords]:
            # update set
            info_indexes.update([index for index in range(i-n_words, i+n_words+1)])
        '''
    
    return np.array(lst)[list(info_indexes)] # important words
            

In [152]:
def company_acquisition_info(company_relevent_reports):
    for company in company_relevent_reports.keys():
        # get report dictionaries for each company
        report_dicts = company_relevent_reports[company]
        has_info_on_acquisition = False

        for report_dict in report_dicts:
            # access each relevant report using url
            url = report_dict['url']
            content = requests.get(url).content
            soup = BeautifulSoup(content, 'html')

            for paragraph in soup.find_all('td',{'class':'text'}):
                words = paragraph.text.split()
                if words != []:
                    # only works for plain texts report, where each comapny should have a note section for acquisition
                    info = extract_info(words)
                    info 
                    info_string = ' '.join(info)
                    if info_string.strip() != '':
                        no_info_on_acquisition = True
                        print(company)
                        print(info_string)#.replace('.', '.\n'))
                        print()
                    

        #inform user if this 10-K does not disclose acquisition info for this firm
        if not no_info_on_acquisition:
            print('No acquisition information disclosed for ' + company)
    
    return

In [153]:
company_acquisition_info(company_relevent_reports)

INTC
acquisitions in both 2019 and 2018, all of which qualified as business combinations. Except for the acquisition of Habana Labs, these acquisitions acquisitions in 2019 and 2018 primarily consisted of cash and was allocated to goodwill and identified intangible the classification of intangible assets, see "Note 13: Identified Intangible Assets."Habana LabsOn December 12, 2019, we acquired opportunity. Total consideration to acquire Habana Labs was $1.7 billion. The fair values of the assets acquired relate to goodwill of $1.5 billion and acquisition-related intangible assets of $250 million, which was primarily in-process research and development. The goodwill and operating of Habana Labs are included in our DCG operating segment.Goodwill of $1.5 billion arising from the acquisition

GOOG
acquisition of Looker, a unified platform for business intelligence, data applications and embedded analytics for $2.4 billion, measurement period. The $2.4 billion purchase price includes our pre

In [144]:
'sdfsdfs. sfsfsf..fs'.replace('.', '.\n')

'sdfsdfs.\n sfsfsf.\n.\nfs'

In [151]:
a = ['2',  'sddfs.']
[i[-1] == '.' for i in a]

TypeError: list indices must be integers or slices, not list

In [None]:
# below unrelated

***
## Grabbing the Financial Statements
We now have a nice organized list of all the different components of the 10-K filing, while it won't have all the info it makes the process of getting the data tables a lot easier. We can always revisit the actual text but at this point let's move forward assuming that we want to get the company financial statements. This will include the following:

1. Balance Sheet
2. Statement of Cash Flows
3. Income Statement
4. Statement of Stock Holder Equity

The first thing we need to do is a loop through each report dictionary, see if the financial statement we are looking for exists in that dictionary and if it does append it to a new list called `url_list`. The `url_list` will contain a URL to each of the statements, and each statement will exist in an HTML format that we can scrape relatively quickly.

***
*As a side note, I will be working on a list of report naming conventions across companies. This way, for example, if we want to find the balance sheet we have a list of potential names and IDs we can try.*

In [4]:
# create the list to hold the statement urls
statements_url = []

for report_dict in master_reports:
    
    # define the statements we want to look for.
    item1 = r"Consolidated Balance Sheets"
    item2 = r"Consolidated Statements of Operations and Comprehensive Income (Loss)"
    item3 = r"Consolidated Statements of Cash Flows"
    item4 = r"Consolidated Statements of Stockholder's (Deficit) Equity"
    
    # store them in a list.
    report_list = [item1, item2, item3, item4]
    
    # if the short name can be found in the report list.
    if report_dict['name_short'] in report_list:
        
        # print some info and store it in the statements url.
        print('-'*100)
        print(report_dict['name_short'])
        print(report_dict['url'])
        
        statements_url.append(report_dict['url'])

----------------------------------------------------------------------------------------------------
Consolidated Balance Sheets
https://www.sec.gov/Archives/edgar/data/1265107/000126510719000004/R2.htm
----------------------------------------------------------------------------------------------------
Consolidated Statements of Operations and Comprehensive Income (Loss)
https://www.sec.gov/Archives/edgar/data/1265107/000126510719000004/R4.htm
----------------------------------------------------------------------------------------------------
Consolidated Statements of Cash Flows
https://www.sec.gov/Archives/edgar/data/1265107/000126510719000004/R5.htm
----------------------------------------------------------------------------------------------------
Consolidated Statements of Stockholder's (Deficit) Equity
https://www.sec.gov/Archives/edgar/data/1265107/000126510719000004/R6.htm


***
## Scraping the Financial Statements
We now have each financial statement's URL that we can now request for the content of that specific statement. The first thing we will need to do is a loop through all the URLs, request each one, and then parse the content. Like the **filing xml summary** up above, I encourage individuals new to scraping the documents to visit each HTML file up above to see what the data looks like.

You'll first notice, it's a simple HTML table. Depending on the statement you're looking at, the structure may be slightly different, but a general hierarchy does exist. My approach to parsing the table falls into three significant steps:

1. Parsing the table headers.
2. Parsing the table rows.
3. Parsing the table sections.

I find this approach to be the most reliable as it allows us to loop through each row in the table, but we ask a specific question for each row. The question is, what type of row are you?

Depending on the answer to this question, we can determine which section of the `statement dictionary` the row should be inserted to. Table headers will contain important information regarding the time horizon of the financial statement. Table sections, help us to distinguish different parts of the statement easily.

Finally, table rows contain the data we want to parse. We distinguish these rows, by seeing if certain elements exist in each. For example, only header row would include a `th` tag inside of it; otherwise, it's not considered a header. Section headers contain a `strong` element but no `th` tags.

Hopefully, you're catching on to my approach when it comes to grabbing each row. I ask a simple question to distinguish each row. Once we have the row, we loop through each of the `td` tags, strip text, store it in a list using a list comprehension, and store it in the appropriate section of our statement dictionary.

In [5]:
# let's assume we want all the statements in a single data set.
statements_data = []

# loop through each statement url
for statement in statements_url:

    # define a dictionary that will store the different parts of the statement.
    statement_data = {}
    statement_data['headers'] = []
    statement_data['sections'] = []
    statement_data['data'] = []
    
    # request the statement file content
    content = requests.get(statement).content
    report_soup = BeautifulSoup(content, 'html')

    # find all the rows, figure out what type of row it is, parse the elements, and store in the statement file list.
    for index, row in enumerate(report_soup.table.find_all('tr')):
        
        # first let's get all the elements.
        cols = row.find_all('td')
        
        # if it's a regular row and not a section or a table header
        if (len(row.find_all('th')) == 0 and len(row.find_all('strong')) == 0): 
            reg_row = [ele.text.strip() for ele in cols]
            statement_data['data'].append(reg_row)
            
        # if it's a regular row and a section but not a table header
        elif (len(row.find_all('th')) == 0 and len(row.find_all('strong')) != 0):
            sec_row = cols[0].text.strip()
            statement_data['sections'].append(sec_row)
            
        # finally if it's not any of those it must be a header
        elif (len(row.find_all('th')) != 0):            
            hed_row = [ele.text.strip() for ele in row.find_all('th')]
            statement_data['headers'].append(hed_row)
            
        else:            
            print('We encountered an error.')

    # append it to the master list.
    statements_data.append(statement_data)  

***
## Converting the Data into a Data Frame
Great, we now have all the data for all the financial statements, and it's in a much better structure that will allow us to work with it. We still have some work to do regarding transforming it into the right data type, but we will handle that later. Let's first get it into a data frame, and from there we need to do some massaging to the data.

The first thing we will notice is the index won't work with what we have so we take the first column which contains the indexes and set it as the index. Let's also make sure to rename the index column to something more meaningful.

From here, we need to remove certain characters before we do our type conversion. We can use the `replace` method specifying the `regex` parameter to true. I have to do three separate replace because one handles positive data, one handles negative data, and one handles blank values. After the regex, we can do a type conversion to the whole data frame and then assign our column headers.

In [6]:
# Grab the proper components
income_header =  statements_data[1]['headers'][1]
income_data = statements_data[1]['data']

# Put the data in a DataFrame
income_df = pd.DataFrame(income_data)

# Display
print('-'*100)
print('Before Reindexing')
print('-'*100)
display(income_df.head())

# Define the Index column, rename it, and we need to make sure to drop the old column once we reindex.
income_df.index = income_df[0]
income_df.index.name = 'Category'
income_df = income_df.drop(0, axis = 1)

# Display
print('-'*100)
print('Before Regex')
print('-'*100)
display(income_df.head())

# Get rid of the '$', '(', ')', and convert the '' to NaNs.
income_df = income_df.replace('[\$,)]','', regex=True )\
                     .replace( '[(]','-', regex=True)\
                     .replace( '', 'NaN', regex=True)

# Display
print('-'*100)
print('Before type conversion')
print('-'*100)
display(income_df.head())

# everything is a string, so let's convert all the data to a float.
income_df = income_df.astype(float)

# Change the column headers
income_df.columns = income_header

# Display
print('-'*100)
print('Final Product')
print('-'*100)

# show the df
income_df

# drop the data in a CSV file if need be.
# income_df.to_csv('income_state.csv')

----------------------------------------------------------------------------------------------------
Before Reindexing
----------------------------------------------------------------------------------------------------


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,Net revenue,"$ 134,436","$ 137,156","$ 135,013","$ 133,753","$ 133,546","$ 138,211","$ 140,498","$ 141,200","$ 540,358","$ 553,455","$ 570,372"
1,Cost of services,,,,,,,,,128939,119193,115236
2,"Selling, general and administrative, including...",,,,,,,,,118940,155902,114152
3,Radio conversion costs,,,,,,,,,0,450,18422
4,"Amortization of subscriber accounts, deferred ...",,,,,,,,,211639,236788,246753


----------------------------------------------------------------------------------------------------
Before Regex
----------------------------------------------------------------------------------------------------


Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,11
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Net revenue,"$ 134,436","$ 137,156","$ 135,013","$ 133,753","$ 133,546","$ 138,211","$ 140,498","$ 141,200","$ 540,358","$ 553,455","$ 570,372"
Cost of services,,,,,,,,,128939,119193,115236
"Selling, general and administrative, including stock-based and long-term incentive compensation",,,,,,,,,118940,155902,114152
Radio conversion costs,,,,,,,,,0,450,18422
"Amortization of subscriber accounts, deferred contract acquisition costs and other intangible assets",,,,,,,,,211639,236788,246753


----------------------------------------------------------------------------------------------------
Before type conversion
----------------------------------------------------------------------------------------------------


Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,11
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Net revenue,134436.0,137156.0,135013.0,133753.0,133546.0,138211.0,140498.0,141200.0,540358,553455,570372
Cost of services,,,,,,,,,128939,119193,115236
"Selling, general and administrative, including stock-based and long-term incentive compensation",,,,,,,,,118940,155902,114152
Radio conversion costs,,,,,,,,,0,450,18422
"Amortization of subscriber accounts, deferred contract acquisition costs and other intangible assets",,,,,,,,,211639,236788,246753


----------------------------------------------------------------------------------------------------
Final Product
----------------------------------------------------------------------------------------------------


Unnamed: 0_level_0,"Dec. 31, 2018","Sep. 30, 2018","Jun. 30, 2018","Mar. 31, 2018","Dec. 31, 2017","Sep. 30, 2017","Jun. 30, 2017","Mar. 31, 2017","Dec. 31, 2018","Dec. 31, 2017","Dec. 31, 2016"
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Net revenue,134436.0,137156.0,135013.0,133753.0,133546.0,138211.0,140498.0,141200.0,540358.0,553455.0,570372.0
Cost of services,,,,,,,,,128939.0,119193.0,115236.0
"Selling, general and administrative, including stock-based and long-term incentive compensation",,,,,,,,,118940.0,155902.0,114152.0
Radio conversion costs,,,,,,,,,0.0,450.0,18422.0
"Amortization of subscriber accounts, deferred contract acquisition costs and other intangible assets",,,,,,,,,211639.0,236788.0,246753.0
Depreciation,,,,,,,,,11434.0,8818.0,8160.0
Loss on goodwill impairment,,,,,,,,,563549.0,0.0,0.0
Total operating expenses,,,,,,,,,1034501.0,521151.0,502723.0
Operating income (loss),-316590.0,12280.0,-201845.0,12012.0,14647.0,12896.0,-11848.0,16609.0,-494143.0,32304.0,67649.0
Interest expense,,,,,,,,,180770.0,145492.0,127308.0
