The SEC website has index files that provide information on all filings available through EDGAR.

We will combine these index files to make a database of filings. This database contains information about these filings and the URLs. Building this database allows us to create samples of filings that we wish to download. Alternatively, you can use the WRDS module. 

After we have imported our modules, we need to generate the URLs needed to scrape the index files. The first part of the URL is static: https://www.sec.gov/Archives/edgar/full-index/. The second part changes, but it follows the pattern [  /X/QTR(Y)/master.idx  ], with X being the year and Y being the quarter number. Since our URLs follow a predefined pattern, we can create a for loop to generate URLs for as many years as we would like. For this walkthrough, however, I have limited the database to 2019. We can expand the range by changing the start_year and end_year variables.
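As a minimal sketch of the URL pattern described above, the helper below builds one URL per quarter for a range of years. The function name build_index_urls is hypothetical and not part of the notebook; the base URL and key format are the ones used in this walkthrough.

```python
# Base URL taken from the walkthrough; build_index_urls is a hypothetical helper name.
URL_BASE = 'https://www.sec.gov/Archives/edgar/full-index/'

def build_index_urls(start_year, end_year):
    """Return {file name: URL} for each quarter from start_year up to, but not including, end_year."""
    urls = {}
    for year in range(start_year, end_year):
        for quarter in range(1, 5):
            key = f'{year}_Q{quarter}_Master.idx'
            urls[key] = f'{URL_BASE}{year}/QTR{quarter}/master.idx'
    return urls

# Two years times four quarters
urls = build_index_urls(2019, 2021)
print(len(urls))
print(urls['2020_Q3_Master.idx'])
```

Using the year from the outer loop in both the key and the URL is what lets the range grow beyond a single year.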

Once we have generated our dictionary of URLs with the for loop, we create a list of the files already downloaded in the index folder and a dictionary to store the URLs of the files we still need to download. This list prevents you from downloading these files every time you run the program.
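The check against already-downloaded files can be sketched on its own. This is an illustration under assumptions: files_to_download is a hypothetical helper, the folder is a temporary one, and the URLs are placeholders.

```python
import os
import tempfile

def files_to_download(url_dict, folder):
    """Return the entries of url_dict whose file name is not yet present in folder."""
    already_down = set(os.listdir(folder))
    return {name: url for name, url in url_dict.items() if name not in already_down}

# Demonstration in a temporary folder with made-up placeholder URLs
with tempfile.TemporaryDirectory() as folder:
    # Pretend the first quarter was downloaded on an earlier run
    open(os.path.join(folder, '2019_Q1_Master.idx'), 'w').close()
    url_dict = {
        '2019_Q1_Master.idx': 'https://example.com/q1',
        '2019_Q2_Master.idx': 'https://example.com/q2',
    }
    pending = files_to_download(url_dict, folder)

print(pending)
```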

In [8]:
import pandas as pd
from os.path import join
import spacy
import unidecode, requests, tqdm
import lxml.html
from glob import glob
import os, re, sys, time

#Url_dict will store filename, URL
url_dict = {}
url_base = 'https://www.sec.gov/Archives/edgar/full-index/'

#Set our data path to the filings folder
data_path = join(os.getcwd(), 'Index Files')

#Variables to configure the for loop. end_year is exclusive, so this range only covers 2019.
start_year = 2019
end_year = 2020

#Loop through each year, then through the four quarters of that year.
for year in range(start_year, end_year):
    for qtr in range(1, 5):
        dict_key = str(year) + '_Q' + str(qtr) + '_Master.idx'
        url = url_base + str(year) + '/QTR' + str(qtr) + '/master.idx'
        url_dict[dict_key] = url
        
#Create the folder on the first run, then list the files it already holds.
os.makedirs(data_path, exist_ok=True)
files_down = os.listdir(data_path)
#Store downloaded files
download_dict = {}

#Loop through url_dict and check whether each file is already in the Index Files folder; if not, queue it for download.
#.items() lets us loop over a dictionary's keys and values together, so we name two loop variables
#before the dictionary. For simplicity I usually use key and value.
for key, value in url_dict.items():
    if key not in files_down:
        download_dict[key] = value

#Get our data_path (the same Index Files folder as above)
data_path = join(os.getcwd(), 'Index Files')

#data_dict will store the raw text data
data_dict = {}

#Loop through download_dict and download each file; if a download fails, print an error.
#Note: SEC.gov asks automated clients to declare a User-Agent header and may reject requests without one.
for key, value in download_dict.items():
    #requests.get returns a Response object, which exposes information through its attributes.
    res = requests.get(value)
    
    #Two examples of these attributes are .status_code and .text
    if res.status_code == 200:
        print('Found...... Downloading  ' + str(key))
        data_dict[key] = res.text
        
    else:
        print('Error.....' + str(key))

#Loop through the data_dict and remove everything before CIK, and the ------- in the document. Save to the Index Files folder.
#This loop is just for formatting, don't worry about it too much.
for key, text in data_dict.items():
    #Split once, at the first 'CIK', so a later 'CIK' in the data cannot truncate the file
    text = 'CIK' + text.split('CIK', 1)[1]
    #Remove the 80 dashes that make up the rule under the column header
    text = re.sub("-", "", text, count=80)
    #'w' overwrites, so re-running the cell does not append a second copy
    with open(join(data_path, key), 'w') as index_file:
        index_file.write(text)
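The header-stripping step can be tried on a small sample before running it on real index files. This is a sketch under assumptions: the sample text is a made-up miniature that only mimics the layout of master.idx, strip_index_header is a hypothetical helper, and it removes the dashed rule as a whole line rather than counting hyphens.

```python
import re

# Hypothetical miniature index file; only the layout mimics master.idx
sample = (
    "Description: Master Index of EDGAR Dissemination Feed\n"
    "\n"
    "CIK|Company Name|Form Type|Date Filed|Filename\n"
    + "-" * 80 + "\n"
    + "1000045|NICHOLAS FINANCIAL INC|10-Q|2019-02-14|edgar/data/1000045/0001193125-19-039489.txt\n"
)

def strip_index_header(text):
    """Drop everything before the column header and remove the dashed rule under it."""
    # maxsplit=1 so a later 'CIK' inside the data cannot cut the file short
    body = 'CIK' + text.split('CIK', 1)[1]
    # Remove the first line made only of dashes; hyphens in form types and file names stay
    return re.sub(r'^-+\n', '', body, count=1, flags=re.MULTILINE)

cleaned = strip_index_header(sample)
print(cleaned)
```

Matching the whole dashed line leaves the hyphens in values such as 10-Q untouched even if the rule is not exactly 80 characters long.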

We now need to take the index files we downloaded and altered, and put them into a dataframe.

We will begin by creating another list containing the files downloaded in the Index Files folder. This list is files_0, and we will use it to gather the names of the files. We will also create a second list, files_1, to store the full paths of the files in files_0. We build these paths in a for loop by combining the folder's path with the file name. If you write the Windows separator by hand, note that a single backslash cannot stand alone in a Python string because the backslash is the escape character. Instead, you write two backslashes, which Python recognizes as one. The join function from os.path avoids the issue by adding the correct separator for the operating system.
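As a small illustration of the point about backslashes, the sketch below builds paths with os.path.join and shows that an escaped backslash is a single character. The folder and file names are placeholders.

```python
import os

folder = os.path.join('project', 'Index Files')   # hypothetical folder
names = ['2019_Q1_Master.idx', '2019_Q2_Master.idx']

# os.path.join inserts the separator that the current operating system expects
paths = [os.path.join(folder, name) for name in names]
print(paths[0])

# Written by hand, the Windows separator needs two backslashes in source code,
# yet the resulting string holds a single character
print(len('\\'))
```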

Once we have the files_1 list, we can create a dataframe that stores all of the information from these files. We then do some minor housekeeping: we add back the leading zeros to the CIK number and convert the date from a string to a datetime, in case we want to create a sample based on date. We also need to create the URLs for the filings. The index files contain only the second half of each URL, the part that is unique to the filing. Luckily, the first half is the same for every filing, so we can simply prepend it. We also clean up the company names, removing special characters so we can use them to create file names later. Finally, Pandas truncates long values when it displays a dataframe, so we change a display option to show them in full.
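The housekeeping steps can be sketched for a single row in plain Python. tidy_row is a hypothetical helper, and the punctuation in the sample company name is made up for the demonstration; the regular expression and the URL prefix are the ones used in this walkthrough.

```python
import re

ARCHIVE_BASE = 'https://www.sec.gov/Archives/'  # first half of every filing URL

def tidy_row(cik, company, filename):
    """Return (padded CIK, cleaned company name, full URL) for one index row."""
    padded_cik = str(cik).zfill(10)                       # restore leading zeros
    clean_name = re.sub(r'[^A-Za-z0-9- ]+', '', company)  # drop special characters
    url = ARCHIVE_BASE + filename                         # join the two halves of the URL
    return padded_cik, clean_name, url

row = tidy_row(1000045, 'NICHOLAS FINANCIAL, INC.',
               'edgar/data/1000045/0001193125-19-039489.txt')
print(row)
```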

Finally, we save the dataframe to the folder in which the program is running. This saves us from having to repeat this process in the future. We save it as a .pkl file, which is short for pickle. Pickle serializes Python objects and stores them so that they can be loaded into another Python script without having to reformat the information.
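To show what pickling buys us, here is a minimal round trip with the standard library pickle module, which is what the .pkl format is based on. The record is made-up sample data, and the file lives in a temporary folder.

```python
import os
import pickle
import tempfile
from datetime import date

# Made-up record; note the date object, which a plain text file could not store as-is
record = {'CIK': '0001000045', 'Form Type': '10-Q', 'Date Filed': date(2019, 2, 14)}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'record.pkl')
    with open(path, 'wb') as f:      # pickle files are binary
        pickle.dump(record, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored object keeps its types, so no reformatting is needed
print(restored == record, type(restored['Date Filed']).__name__)
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute code.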

In [9]:
#Set our data path to the filings folder
path = join(os.getcwd(), 'Index Files')

#Create List of Files in Index Files
files_0 = os.listdir(path)

#Second List to add path to Index Files
files_1 = []

#Add Path to Index
for file in files_0:
    #join adds the correct separator for the operating system, so no escaped backslash is needed
    files_1.append(join(path, file))

#Create Data Frame
Edgar_df = pd.concat((pd.read_table(file, encoding="latin1", sep='|') for file in files_1))

#Add back leading zeros
Edgar_df['CIK'] = Edgar_df['CIK'].apply(lambda x: '{0:0>10}'.format(x))

#Need to Format Date Filed as a datetime, so we can search it later. 
Edgar_df['Date Filed'] =  pd.to_datetime(Edgar_df['Date Filed'])

#Add the full URL to the df so it can be fed directly into a downloader
Edgar_df['URL'] = 'https://www.sec.gov/Archives/' + Edgar_df['Filename']

#Remove special characters from the company names, some of these can cause problems
Edgar_df['Company Name'] = Edgar_df['Company Name'].replace('[^A-Za-z0-9- ]+', '', regex=True)

#Stops Pandas from truncating values when displaying them.
pd.set_option('display.max_colwidth', None)  #None replaces the deprecated -1

#Save the dataframe so we can open it again later without having to recreate it. 
Edgar_df.to_pickle('Edgar_df.pkl')

Edgar_df

Unnamed: 0,CIK,Company Name,Form Type,Date Filed,Filename,URL
0,0001000045,NICHOLAS FINANCIAL INC,10-Q,2019-02-14,edgar/data/1000045/0001193125-19-039489.txt,https://www.sec.gov/Archives/edgar/data/1000045/0001193125-19-039489.txt
1,0001000045,NICHOLAS FINANCIAL INC,4,2019-01-15,edgar/data/1000045/0001357521-19-000001.txt,https://www.sec.gov/Archives/edgar/data/1000045/0001357521-19-000001.txt
2,0001000045,NICHOLAS FINANCIAL INC,4,2019-02-19,edgar/data/1000045/0001357521-19-000002.txt,https://www.sec.gov/Archives/edgar/data/1000045/0001357521-19-000002.txt
3,0001000045,NICHOLAS FINANCIAL INC,4,2019-03-15,edgar/data/1000045/0001357521-19-000003.txt,https://www.sec.gov/Archives/edgar/data/1000045/0001357521-19-000003.txt
4,0001000045,NICHOLAS FINANCIAL INC,8-K,2019-02-01,edgar/data/1000045/0001193125-19-024617.txt,https://www.sec.gov/Archives/edgar/data/1000045/0001193125-19-024617.txt
...,...,...,...,...,...,...
206005,0000009984,BARNES GROUP INC,4,2019-12-02,edgar/data/9984/0000009984-19-000109.txt,https://www.sec.gov/Archives/edgar/data/9984/0000009984-19-000109.txt
206006,0000009984,BARNES GROUP INC,4,2019-12-04,edgar/data/9984/0000009984-19-000113.txt,https://www.sec.gov/Archives/edgar/data/9984/0000009984-19-000113.txt
206007,0000009984,BARNES GROUP INC,4,2019-12-12,edgar/data/9984/0000009984-19-000114.txt,https://www.sec.gov/Archives/edgar/data/9984/0000009984-19-000114.txt
206008,0000009984,BARNES GROUP INC,8-K,2019-10-25,edgar/data/9984/0000009984-19-000098.txt,https://www.sec.gov/Archives/edgar/data/9984/0000009984-19-000098.txt


We can use this dataframe to create samples, but for this tutorial, we will use a preselected sample.

In [10]:
#Load predefined sample
Sample_df = pd.read_pickle('Sample_df.pkl')

# Reset the index to use as an identifier; with an identifier the df will play nicely with flow control
Sample_df = Sample_df.reset_index(drop=True)
Sample_df.index.name = 'File ID'

#Convert Date Filed back to a string so we can combine it with Company Name, and Form Type
#to get our save name.
Sample_df['Date Filed'] = Sample_df['Date Filed'].astype(str)
Sample_df['Save_Name'] = Sample_df['Company Name'] + ' ' + Sample_df['Form Type'] + ' ' + Sample_df['Date Filed'] + '.txt'
Sample_df

File ID,CIK,Company Name,Form Type,Date Filed,Filename,URL,Save_Name
0,0001522420,BSB Bancorp Inc,10-K,2019-03-15,edgar/data/1522420/0001193125-19-076573.txt,https://www.sec.gov/Archives/edgar/data/1522420/0001193125-19-076573.txt,BSB Bancorp Inc 10-K 2019-03-15.txt
1,0000896264,USANA HEALTH SCIENCES INC,10-K,2019-02-26,edgar/data/896264/0001047469-19-000707.txt,https://www.sec.gov/Archives/edgar/data/896264/0001047469-19-000707.txt,USANA HEALTH SCIENCES INC 10-K 2019-02-26.txt
2,0000719220,ST BANCORP INC,10-K,2019-02-21,edgar/data/719220/0000719220-19-000017.txt,https://www.sec.gov/Archives/edgar/data/719220/0000719220-19-000017.txt,ST BANCORP INC 10-K 2019-02-21.txt
3,0001280784,Hercules Capital Inc,10-K,2019-02-21,edgar/data/1280784/0001564590-19-003680.txt,https://www.sec.gov/Archives/edgar/data/1280784/0001564590-19-003680.txt,Hercules Capital Inc 10-K 2019-02-21.txt
4,0000311817,HMG COURTLAND PROPERTIES INC,10-K,2019-03-28,edgar/data/311817/0001575872-19-000071.txt,https://www.sec.gov/Archives/edgar/data/311817/0001575872-19-000071.txt,HMG COURTLAND PROPERTIES INC 10-K 2019-03-28.txt
...,...,...,...,...,...,...,...
95,0001178670,ALNYLAM PHARMACEUTICALS INC,10-K,2019-02-14,edgar/data/1178670/0001564590-19-003022.txt,https://www.sec.gov/Archives/edgar/data/1178670/0001564590-19-003022.txt,ALNYLAM PHARMACEUTICALS INC 10-K 2019-02-14.txt
96,0001590364,Fortress Transportation Infrastructure Investors LLC,10-K,2019-02-28,edgar/data/1590364/0001590364-19-000002.txt,https://www.sec.gov/Archives/edgar/data/1590364/0001590364-19-000002.txt,Fortress Transportation Infrastructure Investors LLC 10-K 2019-02-28.txt
97,0001428439,ROKU INC,10-K,2019-03-01,edgar/data/1428439/0001564590-19-005829.txt,https://www.sec.gov/Archives/edgar/data/1428439/0001564590-19-005829.txt,ROKU INC 10-K 2019-03-01.txt
98,0001293282,TechTarget Inc,10-K,2019-03-13,edgar/data/1293282/0001564590-19-007403.txt,https://www.sec.gov/Archives/edgar/data/1293282/0001564590-19-007403.txt,TechTarget Inc 10-K 2019-03-13.txt


We will now need to download the 10-K filings from the SEC website.

We start by defining a function, download_file. This function takes the URL, date, form, and company name, and returns a success flag along with the downloaded text. This function is originally from Dr. Kok's GitHub and has been modified for our purposes.
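The retry behaviour of a download function like this one can be sketched without touching the network. This is an illustration under assumptions: fetch_with_retries and flaky_fetch are hypothetical names, and fetch stands in for requests.get by returning a (status code, text) pair.

```python
import time

def fetch_with_retries(fetch, url, max_tries=4, sleep_time=0):
    """Call fetch(url) until it reports status 200 or the retries run out.

    fetch is a stand-in for requests.get: it returns (status_code, text).
    """
    failed_attempts = 0
    while True:
        status, text = fetch(url)
        if status == 200:
            return True, text
        if failed_attempts < max_tries:
            failed_attempts += 1
            time.sleep(sleep_time)
        else:
            return False, 'Could not download'

# Fake fetcher that fails twice, then succeeds on the third call
calls = []
def flaky_fetch(url):
    calls.append(url)
    return (200, 'filing text') if len(calls) == 3 else (503, '')

result = fetch_with_retries(flaky_fetch, 'https://example.com/filing.txt')
print(result, len(calls))
```

With max_tries=4 the loop makes one initial request plus up to four retries, so a URL that never works is requested five times before the function gives up.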

After we have defined our download function, we will get ready to download the files. We will begin by setting our path equal to the path of the folder where we will store our filings. This will allow us to check for any filings present in the folder, and tell our program where to store them once we have downloaded them.

We also need to create two dictionaries to store the information for use in our program. I use dictionaries a lot in my programs because information can be stored by an identifier called a key. For example, you may save a cleaned 10-K in a dictionary using the company name as the key, which gives you a unique identifier to retrieve the information, as opposed to a list, where you can only retrieve stored information by its index. Dictionaries and lists can be combined, however, to store multiple pieces of information under one key. I have found it is often better to use multiple dictionaries than to nest data structures, as nesting creates some unique problems and messy code. Sometimes nesting is necessary and more convenient, but it is rare.
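A short comparison may make the difference concrete. The contents below are placeholders; the keys follow the Save_Name pattern used in this walkthrough.

```python
# Placeholder contents; the keys follow the Save_Name pattern used in this walkthrough
filings_list = ['text of first filing', 'text of second filing']
filings_dict = {
    'ROKU INC 10-K 2019-03-01.txt': 'text of first filing',
    'TechTarget Inc 10-K 2019-03-13.txt': 'text of second filing',
}

# A list is read by position, so you must remember which slot holds which filing
print(filings_list[1])

# A dictionary is read by a meaningful key
print(filings_dict['TechTarget Inc 10-K 2019-03-13.txt'])

# Combining the two: one key holding several pieces of information
combined = {'ROKU INC 10-K 2019-03-01.txt': ['text of first filing', True]}
text, downloaded = combined['ROKU INC 10-K 2019-03-01.txt']
print(text, downloaded)
```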

Once we have initialized our data structures and have our path and files list, we can download our files. We will create a for loop for this task. Once again, we need to name two loop variables before the data structure we wish to loop over. We also have to do something new in this for loop: we call the .iterrows() method on our data structure, because we are dealing with a dataframe object and not a dictionary. The .iterrows() method works similarly to the .items() method. Inside the loop, we check whether the file is already saved in our Filings folder; if it is, we open the file, and if it is not, we download it.

In [11]:
#Define function to download files
def download_file(url, date, form, company, max_tries=4, sleep_time = 1):
    failed_attempts = 0
    while True: 
        res = requests.get(url)
        
        #status_code == 200 means the request succeeded
        if res.status_code == 200:
            #Only save the filing once we know the download worked
            if '.txt' in url:
                # Define filename
                filename = company + ' ' + form + ' ' + date + '.txt'
                
                #Convert the text to plain ASCII and save it; 'w' overwrites instead of appending
                with open(join(data_path, filename), 'w') as html_file:
                    html_file.write(unidecode.unidecode(res.text))
                    
            return True, res.text
        
        #On failure, wait and retry until max_tries retries have been used
        if failed_attempts < max_tries:
            failed_attempts += 1
            time.sleep(sleep_time)
        else:
            return False, 'Could not download'


#Set our data path to the filings folder
data_path = join(os.getcwd(), 'Filings')

#List of filenames in the filing folder
files_down = os.listdir(data_path)

result_10k_dict = {}
status_dict = {}

for index, row in Sample_df.iterrows():
        
    #Reload the file list in case the program is used out of order    
    files_down = os.listdir(data_path)
    
    #If the file exists, load it instead of downloading it again
    if row['Save_Name'] in files_down:
        with open(join(data_path, row['Save_Name']), 'r') as file:
            result_10k_dict[row['Save_Name']] = file.read()
            
    #If the file does not exist, call the download function    
    else:
        download_res = download_file(row['URL'], row['Date Filed'], row['Form Type'], row['Company Name'])
    
        status_dict[index] = download_res[0]
    
        if download_res[0]:
            result_10k_dict[row['Save_Name']] = download_res[1]
                   
#The if not in x part of the loop works correctly and the files download, need to test after they have been saved.


This section of code is largely taken from Dr. Kok's GitHub. It is the pipeline that processes the files we have downloaded into clean text. I haven't spent much time on this section of code, so my understanding of it is currently limited.

We start by defining pattern_dict, a dictionary that contains compiled regular expressions.

Regular expressions are a very valuable, and very frustrating, tool. They are confusing and unintuitive, and they will only make sense after practice, like organic chemistry. I would recommend reading the Python documentation for regular expressions here: https://docs.python.org/3/library/re.html.

We will use this dictionary to extract metadata from 10-K documents. We do this through the extract_metadata function.
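To see what these patterns pick out, the sketch below runs three of them on a tiny made-up filing. The sample text is hypothetical and only imitates the tag layout that the patterns expect; the regular expressions are copied from pattern_dict.

```python
import re

# Made-up filing with two documents; only the tag layout matters here
sample_filing = (
    "<DOCUMENT>\n<TYPE>10-K\n<SEQUENCE>1\n<FILENAME>form10k.htm\n"
    "<TEXT>\nAnnual report body\n</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>EX-31.1\n<SEQUENCE>2\n<FILENAME>ex311.htm\n"
    "<TEXT>\nCertification\n</TEXT>\n</DOCUMENT>\n"
)

doc_pattern = re.compile(r"<document>(.*?)</document>", re.IGNORECASE | re.DOTALL)
type_pattern = re.compile(r"<type>(.*?)\n", re.IGNORECASE | re.DOTALL)
text_pattern = re.compile(r"<text>(.*?)</text>", re.IGNORECASE | re.DOTALL)

found = []
for doc in doc_pattern.findall(sample_filing):
    doc_type = type_pattern.findall(doc)[0]        # text after <TYPE> up to the line break
    body = text_pattern.findall(doc)[0].strip()    # everything between <TEXT> and </TEXT>
    found.append((doc_type, body))

# Keep only the 10-K document, as the pipeline does
ten_k = [body for doc_type, body in found if doc_type == '10-K']
print(found)
print(ten_k)
```

The lazy quantifier (.*?) stops at the first closing tag, which is why each document is returned separately instead of as one long match.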


In [12]:
%pylab inline
import numpy as np  #explicit import for np.nan, rather than relying on %pylab

#Define regular expressions dictionary
pattern_dict = {
    'documents' : re.compile(r"<document>(.*?)</document>", re.IGNORECASE | re.DOTALL),
    'metadata' : {
        'type' : re.compile(r"<type>(.*?)\n", re.IGNORECASE | re.DOTALL),
        'sequence' : re.compile(r"<sequence>(.*?)\n", re.IGNORECASE | re.DOTALL),
        'Filename' : re.compile(r"<filename>(.*?)\n", re.IGNORECASE | re.DOTALL),
        'description' : re.compile(r"<description>(.*?)\n", re.IGNORECASE | re.DOTALL)
    },
    'text' : re.compile(r"<text>(.*?)</text>", re.IGNORECASE | re.DOTALL)
}

#Define extract_metadata function
def extract_metadata(doc, pattern_dict=pattern_dict):
    data_dict = {}
    
    data_dict['metadata'] = {}
    for key, pattern in pattern_dict['metadata'].items():
        matches = pattern.findall(doc)
        if matches:
            data_dict['metadata'][key] = matches[0]
        else:
            data_dict['metadata'][key] = np.nan
            
    text_match = pattern_dict['text'].findall(doc)
    if text_match:
        data_dict['text'] = text_match[0]
    else:
        data_dict['text'] = np.nan
        
    return data_dict

data_10k_dict = {}
for Filename, data in result_10k_dict.items():
    docs_split = pattern_dict['documents'].findall(data)
        
    for doc in docs_split:
        doc_data = extract_metadata(doc)
        
        ## Only keep 10-K document
        if doc_data['metadata']['type'] == '10-K':
            data_10k_dict[Filename] = doc_data['text']
            break
            
html_10k_dict = {}
text_10k_dict = {}
for Filename, raw_text in data_10k_dict.items():
    html = lxml.html.fromstring(raw_text)
    html_10k_dict[Filename] = html
    text_10k_dict[Filename] = html.text_content()
    
path = join(os.getcwd(), 'Clean Files')
cleantext_10k_dict = {}
for Filename, text in text_10k_dict.items():
    ## Fix encoding
    clean_text = unidecode.unidecode(text)
    
    ## Replace newline characters with space
    clean_text = re.sub(r'\s', ' ', clean_text)
    
    ## Remove duplicate whitespaces
    clean_text = ' '.join([word for word in clean_text.split(' ') if word])
    
    ## Replace "Page number + Table of Contents footer"
    clean_text = re.sub(r' \d+ Table of Contents ', ' ', clean_text)
    
    cleantext_10k_dict[Filename] = clean_text
    
    ## Save the cleaned text to the Clean Files folder
    with open(join(path, Filename), 'w') as clean_file:
        clean_file.write(clean_text)

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


We are now going to shorten the clean text we just produced and store it in a new dictionary, appropriately named cleantext_10k_dict_short. By doing this, we can significantly decrease the amount of time our program takes to run. Since the aggregate market value (AMV) is listed early in the 10-K filing, we can remove the last four-fifths of each filing with little risk of losing it.
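The shortening step is a plain string slice. A minimal sketch, with first_fraction as a hypothetical helper name and a dummy document:

```python
def first_fraction(text, fraction=0.2):
    """Return the leading fraction of text (the first fifth by default)."""
    return text[:int(len(text) * fraction)]

sample = 'abcdefghij' * 10          # dummy document of 100 characters
short = first_fraction(sample)
print(len(sample), len(short))
```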

In [13]:
cleantext_10k_dict_short = {}

#Loop through cleantext_10k_dict and store the first 1/5 of each document in cleantext_10k_dict_short
for key, text in cleantext_10k_dict.items():
    doc = str(text)
    new_len = int(len(doc) * .2)
    cleantext_10k_dict_short[key] = doc[:new_len]

We will now take the clean text and search for the aggregate market value using a for loop and a regular expression. We start the loop by creating a variable to hold our text, doc, and set doc equal to the clean text passed through str(). The clean text is already a string, so str() is only a safeguard. We then call .lower(), which changes any capital letters to lowercase letters. This way, we do not need to worry about case sensitivity. We then define our regular expression in the result variable; we give it two cases to search for, separated by the '|' character. The first case matches the text leading up to and including 'aggregate market value', continues to the next period, and then requires a digit followed by any eight characters. This is meant to capture values such as $7.0 billion or $8.5 million, where the period is a decimal point. The second case returns the sentence containing 'aggregate market value' up to its first period. We then use an if statement to eliminate any documents that did not return a result. If nothing was found, re.findall returns an empty list, so we store the result in amv_dict only when the list is not empty.
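The two cases of the regular expression are easier to follow on short inputs. In this sketch the pattern is copied from the notebook, find_amv_sentences is a hypothetical wrapper, and the sample sentences are shortened variants of the wording found in filings.

```python
import re

AMV_PATTERN = re.compile(
    r"([^.]*?aggregate market value[^.]*\.[0-9]........"
    r"|[^.]*?aggregate market value[^.]*\.)"
)

def find_amv_sentences(text):
    """Return every match of the AMV pattern in the lowercased text."""
    return AMV_PATTERN.findall(text.lower())

# First case: the period is a decimal point, so the match runs on past it
decimal_case = ("Yes No The aggregate market value of the common stock held by "
                "non-affiliates was approximately $1.2 billion. Other text follows.")

# Second case: the period ends the sentence, so the match stops there
plain_case = ("The aggregate market value of the common equity held by nonaffiliates "
              "was approximately $276,496,828. Other text follows.")

print(find_amv_sentences(decimal_case))
print(find_amv_sentences(plain_case))
```

One consequence of the fixed eight trailing characters is that a value with two decimals, such as $1.25 billion, is captured as '$1.25 billio'. The match still contains 'bill', which is all the later sorting step looks for.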

In [14]:
amv_dict = {}
#Loops through the short dict and searches for the aggregate market value in sentences
for key, item in cleantext_10k_dict_short.items():
    #Get the doc as a lowercase string so we don't have to worry about case sensitivity
    doc = str(item).lower()
    
    #Search for aggregate market value followed by .X XXXXXXXX, then just aggregate market value in a sentence
    #First case will return the million/billion values if present
    result = re.findall(r"([^.]*?aggregate market value[^.]*\.[0-9]........|[^.]*?aggregate market value[^.]*\.)", doc)
    
    #Only save if we have a result (an empty list means nothing was found)
    if result:
        amv_dict[key] = result

#Dictionary examples
print(amv_dict['BSB Bancorp Inc 10-K 2019-03-15.txt'])
print(amv_dict['Hercules Capital Inc 10-K 2019-02-21.txt'])
print(amv_dict['ST BANCORP INC 10-K 2019-02-21.txt'])    


[' yes no the aggregate market value of the voting and non-voting common equity held by nonaffiliates as of june 30, 2018 was approximately $276,496,828.']
[" yes no the aggregate market value of the voting and non-voting common stock held by non-affiliates of the registrant as of the last business day of the registrant's most recently completed second fiscal quarter was approximately $1.2 billion"]
[" yes o no xstate the aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold, or the average bid and asked price of such common equity, as of the last business day of the registrant's most recently completed second fiscal quarter."]


So we have now stored the sentence that likely contains the aggregate market value information in amv_dict. We now need to start extracting the numbers from that sentence. This is quite easy for the sentences that say $7.0 billion or $8 million, but quite difficult for the ones that say: Our AMV was approximately $1,302,302 based on a total share number of 304,303, trading at X price. We will cover the millions and billions right now and pick up this conversation in the next code segment.

We are going to create 4 dictionaries to store information. The functions and data collection in the following section can and will be cleaned up significantly. We will use match_dict to store all of the numbers we extract from these sentences, and we will use match_dict_mb to store strings such as '$4 million' and '$8 billion'. million_dict will store the strings denominated in millions, and billion_dict will store the strings denominated in billions.

We will use a for loop and regular expression to extract the strings in the form of '$4 million' and '$8 billion' from amv_dict. We will then store these in match_dict_mb and use another for loop and a couple of if statements to sort these into million_dict and billion_dict. We will sort these simply by checking if 'mill' or 'bill' is in the text itself. We use a shorter string than 'million' or 'billion' because these filings often contain typos.

We finish by using a for loop to extract any number present in a sentence from the amv_dict and store these numbers in the match_dict. We will use this match_dict to attempt to extract the AMV from the rest of the companies. 
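The sorting step can be sketched on a few sentences. The dollar pattern is copied from the notebook; bucket_by_unit is a hypothetical helper and the keys A, B, and C are placeholders for file names.

```python
import re

DOLLAR_PATTERN = re.compile(r"(\$[.\d,]+........)")

def bucket_by_unit(sentences):
    """Split {key: sentence} into million and billion dictionaries of matched strings."""
    million, billion = {}, {}
    for key, sentence in sentences.items():
        matches = DOLLAR_PATTERN.findall(sentence)
        text = str(matches)
        # 'mill' and 'bill' rather than the full words, to tolerate typos and cut-off matches
        if 'mill' in text:
            million[key] = matches
        if 'bill' in text:
            billion[key] = matches
    return million, billion

sentences = {
    'A': 'the aggregate market value was approximately $59.0 million',
    'B': 'the aggregate market value was approximately $1.2 billion',
    'C': 'the aggregate market value was approximately $276,496,828.',
}
million, billion = bucket_by_unit(sentences)
print(million)
print(billion)
```

Sentence C still matches the dollar pattern, but it contains neither 'mill' nor 'bill', so it lands in neither dictionary and is left for the plain-number approach.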

In [15]:
match_dict = {}
match_dict_mb = {}
million_dict = {}
billion_dict = {}

#Search the amv_dict for the millions and billions value and store them in match_dict_mb
for key, text in amv_dict.items():
    doc = str(amv_dict[key])
    match = re.findall(r"(\$[.\d,]+........)", doc)
    if match:
        match_dict_mb[key] = str(match)

#Loop through match_dict_mb; if 'mill' is in the text store it in million_dict, if 'bill' store it in billion_dict
for key, text in match_dict_mb.items():
    if 'mill' in text:
        million_dict[key] = text
    if 'bill' in text:
        billion_dict[key] = text
        
#Extract just the numerical values from all the sentences in amv_dict
for key, text in amv_dict.items():
    doc = str(amv_dict[key])
    match = re.findall(r"([.\d,]+)", doc)
    if match:
        match_dict[key] = match

#Examples
print(match_dict['BSB Bancorp Inc 10-K 2019-03-15.txt'])
print(million_dict['SECOND SIGHT MEDICAL PRODUCTS INC 10-K 2019-03-19.txt'])
print(billion_dict['Hercules Capital Inc 10-K 2019-02-21.txt'])

['30,', '2018', '276,496,828.']
['$59.0 million']
['$1.2 billion']


Now that we have a dictionary containing the numbers extracted from the sentences in amv_dict, we need to strip the special characters from each number string so we can convert it into an integer. The results that are stored in match_dict are lists. We want to loop through each list and do the necessary formatting while keeping the data structure. To accomplish this we will use a nested for loop: the first for loop will loop through the dictionary, and the second will loop through the list stored under the dictionary key. Once we have accomplished this, we will use another nested for loop to convert the strings into integers.

At this point, our lists can have up to 4 numbers in them: Year, AMV, Share Price, and Number of Shares. My current approach is to keep the largest number in each list and discard it if it is not above 10,000. This gets rid of the Year and the Share Price. It should also eliminate most entries that belong in the million or billion dictionaries. When one of those filings lists a share count above 10,000, the results can be cross-referenced to reduce error.

If you're paying attention, you have probably noticed an issue in the algorithm. The number of shares can be higher than the AMV. We will get two errors out of 68 observations because of this. I am working on a couple of solutions around this flaw. I am sure a few more imperfections will be discovered as the algorithm is expanded to a larger sample.
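Both the heuristic and its flaw can be shown in a few lines. This is a sketch, not the notebook's code: largest_number is a hypothetical helper that condenses the cleaning, conversion, and threshold steps, and the second sentence is a made-up example of a share count that exceeds the AMV.

```python
import re

def largest_number(sentence, floor=10000):
    """Return the largest number in sentence if it exceeds floor, else None."""
    numbers = []
    for token in re.findall(r"[.\d,]+", sentence):
        digits = token.replace(',', '').replace('.', '')   # '276,496,828.' -> '276496828'
        numbers.append(int(digits) if digits else 0)        # a lone '.' or ',' counts as 0
    best = max(numbers, default=0)
    return best if best > floor else None

# Works: the year and the day of the month are smaller than the AMV
ok = largest_number('as of june 30, 2018 was approximately $276,496,828.')

# Flaw: the share count is larger than the AMV, so the wrong number wins
flawed = largest_number('our amv was approximately $1,302,302 based on 30,430,300 shares')

# Decimal amounts collapse ('1.2' becomes 12) and fall below the floor
decimal = largest_number('was approximately $1.2 billion')

print(ok, flawed, decimal)
```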

In [16]:
#Loop through match_dict and strip the punctuation from each number string
for key, text in match_dict.items():
    item_lis = []
    for item in text:
        #Remove commas and periods so that '276,496,828.' becomes '276496828'
        doc = str(item).replace(',', '').replace('.', '')
        item_lis.append(doc)
    match_dict[key] = item_lis

#Loop through match_dict, treat blank values as 0, and convert the strings to integers
for key, text in match_dict.items():
    num_lis = []
    for item in text:
        if item.strip() == '':
            num_lis.append(0)
        else:
            num_lis.append(int(item))
    #Keep only the largest number found in the sentence
    match_dict[key] = max(num_lis)

likely_dict = {}
    
for key, text in match_dict.items():
    item = match_dict[key]
    if item > 10000:
        likely_dict[key] = item

#Examples
print(match_dict['BSB Bancorp Inc 10-K 2019-03-15.txt'])
print(likely_dict['BSB Bancorp Inc 10-K 2019-03-15.txt'])

276496828
276496828


We will now process the filings that contain $X.X million in the AMV sentence. Processing these filings is significantly easier than processing the entries in likely_dict, as we are almost certain the number we extract will be the actual AMV.

We begin by extracting the decimal numbers from million_dict and storing them back in the dict. While I say they are numbers, they are, in fact, lists of strings in Python, as we have used a regular expression to return the results.

We should have only returned one decimal number from this match. To make sure, we create another dictionary, million_dict_step2, and only store the results that have returned a single number.

We then need to turn each one-item list into a plain number string. In this case, we are not concerned about keeping the list data structure intact because we are only dealing with one item.

Finally, we convert the string into a number and multiply it by a million to return the final AMV value.
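The conversion can also be written as one small function that handles both units. This is a sketch under assumptions: scaled_amount and MULTIPLIERS are hypothetical names, and the number pattern here is stricter than the one used in the notebook.

```python
import re

MULTIPLIERS = {'mill': 1_000_000, 'bill': 1_000_000_000}

def scaled_amount(text):
    """Convert a string such as '$59.0 million' to an integer dollar amount.

    Returns None unless exactly one number and one known unit are found.
    """
    numbers = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    units = [unit for unit in MULTIPLIERS if unit in text]
    if len(numbers) != 1 or len(units) != 1:
        return None
    value = float(numbers[0].replace(',', ''))      # drop thousands separators
    # round() guards against float artifacts before truncating to int
    return int(round(value * MULTIPLIERS[units[0]]))

print(scaled_amount('$59.0 million'))
print(scaled_amount('$1.2 billion'))
print(scaled_amount('$5 million and $6 million'))
```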

In [17]:
#Find all numbers in the million_dict sentence
for key, text in million_dict.items():
    million_dict[key] = re.findall(r"([.\d,]+)", text)

#Initialize new dict
million_dict_step2 = {}

#Loop through and make sure there was only 1 match returned
for key, lis in million_dict.items():
    if len(lis) == 1:
        million_dict_step2[key] = lis

#Loop through and format: pull the single match out of its list and drop thousands separators
for key, lis in million_dict_step2.items():
    million_dict_step2[key] = lis[0].replace(',', '')

#Loop through and convert the number strings to floats, multiply by a million, and round to an integer
for key, text in million_dict_step2.items():
    million_dict_step2[key] = int(round(float(text) * 1000000))

#Print
print(million_dict_step2['SECOND SIGHT MEDICAL PRODUCTS INC 10-K 2019-03-19.txt'])


59000000


Below, we do the same as above, but for billion_dict.

In [18]:
#Find all numbers in the sentence stored under billion_dict
for key, text in billion_dict.items():
    billion_dict[key] = re.findall(r"([.\d,]+)", text)

#Create new dictionary
billion_dict_step2 = {}

#Error checking loop: if a sentence returned more than one number, do not add it to billion_dict_step2
for key, lis in billion_dict.items():
    if len(lis) == 1:
        billion_dict_step2[key] = lis

#Loop through and format: pull the single match out of its list and drop thousands separators
for key, lis in billion_dict_step2.items():
    billion_dict_step2[key] = lis[0].replace(',', '')

#Loop through billion_dict_step2 and convert the number strings to floats, multiply by 1 billion, and round to an integer
for key, text in billion_dict_step2.items():
    billion_dict_step2[key] = int(round(float(text) * 1000000000))

#Print Dict
print(billion_dict_step2['Hercules Capital Inc 10-K 2019-02-21.txt'])

1200000000
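
The million and billion steps differ only in the multiplier, so the same logic can be sketched as one helper. This is a minimal sketch and not the notebook's own code: the function name scale_amount, the UNIT_MULTIPLIERS table, and the stricter number pattern are assumptions made for illustration. It uses Decimal so that a value such as 1.2 billion converts to an exact integer.

```python
import re
from decimal import Decimal

# Hypothetical lookup table: unit word -> multiplier
UNIT_MULTIPLIERS = {"million": 10**6, "billion": 10**9}

# Matches a number such as 59, 1.2 or 1,234.5 (digits, optional commas, optional decimals)
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def scale_amount(sentence, unit):
    """Return the dollar value as an int, or None unless exactly one number is found."""
    matches = NUMBER_PATTERN.findall(sentence)
    if len(matches) != 1:
        return None  # same error check as above: skip ambiguous sentences
    number = Decimal(matches[0].replace(",", ""))
    return int(number * UNIT_MULTIPLIERS[unit])


print(scale_amount("approximately $1.2 billion", "billion"))
print(scale_amount("was $59 million", "million"))
print(scale_amount("$7.3 billion on June 30", "billion"))
```

The last call returns None because the sentence contains two numbers, which mirrors the error-checking loop that discards such sentences.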


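We now need to combine the results from million_dict_step2, billion_dict_step2, and likely_dict. We create a new dictionary, all_dict, and loop through each of the three dictionaries in turn, storing its entries in all_dict. If the same filing appears in more than one dictionary, the value from the dictionary processed last is the one that is kept.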

In [19]:
#Create dictionary to store all other dictionaries containing results
all_dict = {}

#Loop through million_dict_step2 and add all items to all_dict
for key, text in million_dict_step2.items():
    item = million_dict_step2[key]
    all_dict[key] = item
    
#Loop through billion_dict_step2 and add all items to all_dict
for key, text in billion_dict_step2.items():
    item = billion_dict_step2[key]
    all_dict[key] = item

#Loop through likely_dict and add all items to all_dict
for key, text in likely_dict.items():
    item = likely_dict[key]
    all_dict[key] = item

#Print all_dict
print(all_dict['Hercules Capital Inc 10-K 2019-02-21.txt'])
print(all_dict['SECOND SIGHT MEDICAL PRODUCTS INC 10-K 2019-03-19.txt'])
print(all_dict['BSB Bancorp Inc 10-K 2019-03-15.txt'])

1200000000
59000000
276496828
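
The three loops above can also be written with dict.update, which copies every entry of one dictionary into another. The sketch below uses small made-up dictionaries (the file names and values are placeholders, not results from the filings) to show that the dictionary merged last wins when a key appears more than once.

```python
# Placeholder data standing in for million_dict_step2, billion_dict_step2 and likely_dict
million_example = {"A 10-K.txt": 59000000, "B 10-K.txt": 24000000}
billion_example = {"C 10-K.txt": 1200000000}
likely_example = {"B 10-K.txt": 24100000, "D 10-K.txt": 276496828}

all_example = {}
for source in (million_example, billion_example, likely_example):
    # update() adds new keys and overwrites keys that already exist
    all_example.update(source)

print(len(all_example))
print(all_example["B 10-K.txt"])
```

Here "B 10-K.txt" ends up with the value from likely_example, because that dictionary is merged last.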


Below we will create a Pandas dataframe for the results. Dataframes can simplify data manipulation and make it easy to save and later load datasets into other Python programs.

Once we have created it, we will save it, then print the output.

In [20]:
#Create dataframe to store results, make 'Dollar Value of AMV' the column heading
Results_df = pd.DataFrame.from_dict(all_dict, orient='index', columns = ['Dollar Value of AMV'])

#Save file to the Output folder, creating the folder first if it does not exist
path = join(os.getcwd(), 'Output')
os.makedirs(path, exist_ok=True)
Results_df.to_excel(join(path, 'results.xlsx'), index = True)

#print dataframe
Results_df

Unnamed: 0,Dollar Value of AMV
RTI SURGICAL INC 10-K 2019-03-05.txt,286000000
CBA Florida Inc 10-K 2019-04-01.txt,6870000
PEAPACK GLADSTONE FINANCIAL CORP 10-K 2019-03-14.txt,632000000
DITECH HOLDING Corp 10-K 2019-04-16.txt,24100000
SecureWorks Corp 10-K 2019-03-28.txt,147300000
...,...
WEYLAND TECH INC 10-K 2019-04-15.txt,19481155
ABRAXAS PETROLEUM CORP 10-K 2019-03-15.txt,470774656
MARINE PRODUCTS CORP 10-K 2019-02-28.txt,131839491
AMGEN INC 10-K 2019-02-13.txt,119629312769


The following code runs the documents through the spaCy pipeline. The pipeline takes each document and returns a spaCy Doc object. From that object we can retrieve the sentences of the document, the tokens of the document (a fancy name for words and punctuation marks), and the part of speech of each token.

In [None]:
path = join(os.getcwd(), 'Spacy Files')

import spacy
#spacy.require_gpu()
nlp = spacy.load("en_core_web_sm")

#Raise the maximum number of characters allowed per document so that long filings can be processed.
nlp.max_length = 15000000

#Create dict to store spacy files
spacy_dict = {}

#Process clean text files into spacy documents for better data extraction
for filename, document in cleantext_10k_dict.items():
    spacy_dict[filename] = nlp(document)
    
    #To do: build a list of the files that already exist in the folder and skip them, so they are not processed again.
    #Write the text of the spacy document to the Spacy Files folder
    with open(join(path, filename), 'w', encoding='utf-8') as out_file:
        out_file.write(str(spacy_dict[filename]))
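
The note in the loop above about skipping files that already exist could be handled along these lines. This is only a sketch: files_to_process is a hypothetical helper, and it assumes that a file's presence in the output folder means it was processed completely.

```python
import os
import tempfile


def files_to_process(filenames, folder):
    """Return the filenames that do not yet exist in folder."""
    existing = set(os.listdir(folder)) if os.path.isdir(folder) else set()
    return [name for name in filenames if name not in existing]


# Small demonstration in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "done.txt"), "w") as handle:
        handle.write("already processed")
    remaining = files_to_process(["done.txt", "new.txt"], folder)

print(remaining)
```

In the loop above, this would mean iterating only over the names returned by the helper before calling nlp, since parsing is the expensive step.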

The code below is one way to extract AMV information from 10-Ks using spaCy. It cannot be run in Jupyter notebooks and must be run in a Python environment that does not use Interactive Python; Spyder is a safe choice. The code reads its documents from small_spacy_dict, so you will need to include the code that generates small_spacy_dict.

In [None]:
#This code will not run in Jupyter, please move to Spyder. 
# This code is an alteration of:Extracting entity relations @ https://spacy.io/usage/examples
import plac
import spacy
#spacy.require_gpu() #uncomment for GPU



#Initialize list to store output
output = []

#plac builds the command-line interface for main() from these annotations:
#each entry is (help text, argument kind, abbreviation, type)
@plac.annotations(
    model=("Model to load (needs parser and NER)", "positional", None, str)
)
#Main function, main function calls the functions below.
def main(model="en_core_web_lg"):
    nlp = spacy.load(model)
    print("Loaded model '%s'" % model)
    print("Processing %d texts" % len(small_spacy_dict))

    for key, text in small_spacy_dict.items():
        doc = nlp(str(text))
        relations = extract_currency_relations(doc, key)
        output.append(relations)
  


#This function removes overlapping spans, keeping the longest span when two spans overlap
def filter_spans(spans):
    # Filter a sequence of spans so they don't contain overlaps
    # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans()
    get_sort_key = lambda span: (span.end - span.start, -span.start)
    sorted_spans = sorted(spans, key=get_sort_key, reverse=True)
    result = []
    seen_tokens = set()
    for span in sorted_spans:
        # Check for end - 1 here because boundaries are inclusive
        if span.start not in seen_tokens and span.end - 1 not in seen_tokens:
            result.append(span)
        seen_tokens.update(range(span.start, span.end))
    result = sorted(result, key=lambda span: span.start)
    return result

#This function extracts the relations: for every token whose entity type is MONEY, it pairs the amount with the word it relates to
def extract_currency_relations(doc, key):
    # Merge entities and noun chunks into one token
    spans = list(doc.ents) + list(doc.noun_chunks)
    spans = filter_spans(spans)
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)

    relations = []
    for money in filter(lambda w: w.ent_type_ == "MONEY", doc):
        if money.dep_ in ("attr", "dobj"):
            subject = [w for w in money.head.lefts if w.dep_ == "nsubj"]
            if subject:
                subject = subject[0]
                relations.append((subject, money, key))
        elif money.dep_ == "pobj" and money.head.dep_ == "prep":
            relations.append((money.head.head, money, key))
    return relations

#Calls the main function
if __name__ == "__main__":
    plac.call(main)

#Output holds one list of tuples for every document that is run through the function.
#That gives us a list of lists of tuples, but we just want a single list of tuples,
#so we add all of the tuples to a new list.

flat_list = []

for sublist in output:
    for item in sublist:
        flat_list.append(item)
print(flat_list)

#Initialize a list to store our results
result = []

#Now that we have our list of tuples, we keep only the tuples whose noun value is the AMV phrase.
#The comparison is case sensitive, so we check both capitalizations of the phrase.
for i in flat_list:
    if str(i[0]) in ('The aggregate market value', 'the aggregate market value'):
        result.append(i)

print(result)
print(len(result))
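
The flattening and filtering steps can be sketched more compactly with itertools.chain.from_iterable and a case-insensitive comparison. The tuples below are plain strings standing in for the spaCy tokens that extract_currency_relations returns, and is_amv_subject is a hypothetical helper name.

```python
from itertools import chain

# Stand-in for output: one list of (subject, money, key) tuples per document
example_output = [
    [("The aggregate market value", "$7.3 billion", "a.txt")],
    [],
    [("the Company", "$4.07", "b.txt"),
     ("the aggregate market value", "$470,774,656", "b.txt")],
]


def is_amv_subject(subject):
    """Compare the subject text to the AMV phrase, ignoring case."""
    return str(subject).lower() == "the aggregate market value"


# chain.from_iterable joins the per-document lists into one sequence
flat_example = list(chain.from_iterable(example_output))
amv_example = [item for item in flat_example if is_amv_subject(item[0])]

print(len(flat_example))
print(amv_example)
```

Lower-casing the subject is slightly broader than the two explicit checks, since it also accepts other capitalizations of the phrase.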

The following code does not run in Jupyter notebooks and must be run in an editor that does not use Interactive Python, such as Spyder. It is an alteration of Training NER at https://spacy.io/usage/examples; the label and the training data have been altered for our purposes. There are two challenges when building spaCy models: compute power and datasets. Training a model can take anywhere from minutes to days, depending on the size of the dataset, the number of times the training loop is run, and the compute power at your disposal.

In [None]:
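#This code will not run in Jupyter, please move to Spyder. 
#This code is an alteration of Training NER @ https://spacy.io/usage/examples
#It was written against the spaCy 2.x training API; spaCy 3 changed how pipes are added and how nlp.update is called.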
import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
# spacy.require_gpu() #uncomment for GPU


# new entity label
LABEL = "AMV"

# training data
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting

TRAIN_DATA = [
    ("As of June 30, 2018, the last day of the registrant's most recently completed second fiscal quarter, the aggregate market value of the common stock held by non-affiliates of the registrant was $470,774,656 based on the closing sale price as reported on The NASDAQ Stock Market.", {'entities': [(193, 205, "AMV")]}),
    ("The aggregate market value of the registrant's common stock held by non-affiliates of the registrant was $22,262,043,858 as of June 29, 2018 based on the closing sale price of the registrant's common stock on the NASDAQ Global Market on such date.", {'entities': [(105, 120, "AMV")]}),
    ("The aggregate market value of ordinary shares held by non-affiliates on June 30, 2018 was approximately $7.3 billion based on the closing price of such stock on the New York Stock Exchange.", {'entities': [(104, 116, "AMV")]}),
    ("The aggregate market value of the registrant's common stock, $0.01 par value per share Common Stock, held by non-affiliates of the registrant, based on the last sale price of the Common Stock at the close of business on June 29, 2018, was $9,819,826,967.", {'entities': [(239, 253, "AMV")]}),
    ("The aggregate market value of the voting stock held by non-affiliates of the registrant on June 30, 2018, based upon the closing price of $4.07 of the registrant's Class A Common Stock as reported on the NASDAQ Global Select Market, was approximately $3.1 billion, which excludes 87.1 million shares of the registrant's common stock held on June 30, 2018 by then current executive officers, directors, and stockholders that the registrant has concluded are affiliates of the registrant.", {'entities': [(251, 263, "AMV")]}),
    ("The aggregate market value of the shares of Class A Common Stock held by non-affiliates of the registrant, computed by reference to the closing price of such stock as of the last business day of the registrant's most recently completed second quarter, was $7.6 billion.", {'entities': [(256, 268, "AMV")]})
]

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        # note: the number of passes is hard-coded to 100 here; the n_iter argument of main() is not used
        for itn in range(100):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = cleantext_10k_dict['BELDEN INC 10-K 2019-02-20.txt']
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
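
Because the entity annotations in TRAIN_DATA are character offsets, a wrong offset trains the model on the wrong text. A quick check like the sketch below, run before training, shows what each (start, end) pair actually covers. The helper name annotated_spans is an assumption for illustration; the sample is the third entry of TRAIN_DATA.

```python
def annotated_spans(train_data):
    """Return (label, covered text) for every entity annotation."""
    spans = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            spans.append((label, text[start:end]))
    return spans


sample = [
    ("The aggregate market value of ordinary shares held by non-affiliates on June 30, 2018 was approximately $7.3 billion based on the closing price of such stock on the New York Stock Exchange.", {"entities": [(104, 116, "AMV")]}),
]

for label, covered in annotated_spans(sample):
    print(label, repr(covered))
```

For this entry the offsets cover '$7.3 billion', which is the span we want the model to learn as AMV.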