This notebook attempts to extract the Risk Factors section from a large sample of raw-text HTML 10-K/10-Q filings and then explores the tractability of a simple LDA analysis on those sections.

In [1]:
import pandas as pd
import numpy as np
import unicodedata
import re
from timeit import default_timer as timer
import pickle

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_colwidth', None) #None means no truncation (-1 is deprecated in newer pandas)

#makes a pandas cell clickable
def cell_clickable(val):
    return '<a href="{}">{}</a>'.format(val,val)
#makes pandas dataframe's col 'htmURL' clickable from the ipython notebook
def make_clickable(df):
    if len(df)>100: df=df.sample(n=100)  #safeguard against displaying large dataframes
    return df.style.format({'htmURL': cell_clickable})

#description: attempts to remove html code from a raw text html file, leaving mostly readable text
#input: raw text html string
#output: lowercase readable text string
def clean_html(text):
    #tag removal first pass
    text = re.sub('<.*?>', ' ', text) 
    #attempts to standardize characters, may or may not work perfectly
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore') 
    #removes html special chars, non-printable ascii characters,
    #string literals, and a second pass of tag removal
    text = re.sub('&.*?;|[^\x00-\x7F]+|\r|\n|\a|\b|\f|\t|\v|<.*?>', ' ',text)
    text = re.sub(' +', ' ', text) #removes extra whitespace
    return text.lower().strip() #returns lower case

#description: takes text string and removes some 'unimportant' portions
#input: text string
#output: reduced text string
def clean(text):
    text = re.sub(r'\d', ' ', text) #removes numbers
    text = re.sub(r'[?!\'"#.,)(|/$:;&^%@_-]', '', text) #removes punctuation
    text = re.sub(r'(?:^| )\w(?:$| )', ' ', text) #removes single characters
    text = re.sub(' +', ' ', text)
    return text.strip()

In [2]:
df_complete = pd.read_csv('df_coords.csv')
df_reduced = df_complete.filter(items=['CIK','COMN','DATE','FILE','FORM','fdate','formt','ftime','htmLOC','htmURL','locURL','period'])
df_reduced.shape

(95506, 12)

This is a mixed sample of randomly selected 10-Ks and 10-Qs pulled from the SEC website. Because this data comes from another project, the filing dates are skewed towards 2010-2015, but this shouldn't matter much for a short demo. For now, let's focus on a smaller sample.

In [3]:
df=df_reduced.sample(n=1000).reset_index(drop=True)
make_clickable(df.head())

Unnamed: 0,CIK,COMN,DATE,FILE,FORM,fdate,formt,ftime,htmLOC,htmURL,locURL,period
0,318154,AMGEN INC,2012-08-08,edgar/data/318154/0001193125-12-344279.txt,10-Q,20120808,False,17:03:20,E:\Data\SEC\HTML3\31\81\54\0001193125-12-344279.htm,https://www.sec.gov/Archives/edgar/data/318154/000119312512344279/d351286d10q.htm,E:\Data\SEC\HTML\b\75\318154_61,20120630
1,870826,VALUEVISION MEDIA INC,2012-09-04,edgar/data/870826/0000870826-12-000010.txt,10-Q,20120831,False,17:40:38,E:\Data\SEC\HTML3\87\8\26\0000870826-12-000010.htm,https://www.sec.gov/Archives/edgar/data/870826/000087082612000010/vvtv10q07282012.htm,E:\Data\SEC\HTML\d\63\870826_31,20120728
2,1487101,KEYW HOLDING CORP,2011-08-04,edgar/data/1487101/0001144204-11-043965.txt,10-Q,20110804,False,16:05:56,E:\Data\SEC\HTML3\148\71\1\0001144204-11-043965.htm,https://www.sec.gov/Archives/edgar/data/1487101/000114420411043965/v230285_10q.htm,E:\Data\SEC\HTML\i\99\1487101_88,20110630
3,1091325,"CHINA YIDA HOLDING, CO.",2012-05-11,edgar/data/1091325/0001213900-12-002399.txt,10-Q,20120511,False,08:53:52,E:\Data\SEC\HTML3\109\13\25\0001213900-12-002399.htm,https://www.sec.gov/Archives/edgar/data/1091325/000121390012002399/f10q0312_chinayida.htm,E:\Data\SEC\HTML\l\11\1091325_11,20120331
4,1378706,AMERICAN DG ENERGY INC,2014-04-09,edgar/data/1378706/0001378706-14-000002.txt,10-K,20140409,True,13:54:06,E:\Data\SEC\HTML3\137\87\6\0001378706-14-000002.htm,https://www.sec.gov/Archives/edgar/data/1378706/000137870614000002/adge-20131231x10k.htm,E:\Data\SEC\HTML\i\34\1378706_74,20131231


In [4]:
with open(df.loc[0,'htmLOC'], "r") as f: 
    docPage = f.read()

docPage[:1000]

'<DOCUMENT>\n<TYPE>10-Q\n<SEQUENCE>1\n<FILENAME>d351286d10q.htm\n<DESCRIPTION>FORM 10-Q\n<TEXT>\n<HTML><HEAD>\n<TITLE>Form 10-Q</TITLE>\n</HEAD>\n <BODY BGCOLOR="WHITE">\n<h5 align="left"><a href="#toc">Table of Contents</a></h5>\n\n <P STYLE="line-height:3px;margin-top:0px;margin-bottom:0px;border-bottom:3pt solid #000000">&nbsp;</P>\n<P STYLE="line-height:3px;margin-top:0px;margin-bottom:2px;border-bottom:0.5pt solid #000000">&nbsp;</P> <P STYLE="margin-top:12px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New Roman" SIZE="5"><B>UNITED STATES </B></FONT></P>\n<P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New Roman" SIZE="5"><B>SECURITIES AND EXCHANGE COMMISSION </B></FONT></P> <P STYLE="margin-top:0px;margin-bottom:0px" ALIGN="center"><FONT\nSTYLE="font-family:Times New Roman" SIZE="3"><B>Washington, D.C. 20549 </B></FONT></P> <P STYLE="margin-top:12px;margin-bottom:0px" ALIGN="center"><FONT STYLE="font-family:Times New 

This looks like a typical HTML file, full of markup, punctuation, and non-visible characters, so it will need to be stripped down before any sort of analysis.

The main goal now is to extract each Risk Factors section and store it in a list. To identify the section, this code assumes that each file has a table of contents, and that the table of contents follows the naming convention described at https://www.sec.gov/fast-answers/answersreada10khtm.html.


TABLE OF CONTENTS

Item 1.    Business                     1
Item 1A.   Risk Factors                 5
Item 1B.   Unresolved Staff Comments    9


According to the SEC website, the Risk Factors section should always be listed and indexed by "Item 1A.", as in the sample table of contents above. The code also takes advantage of the fact that most tables of contents link each entry to the corresponding page element. It first looks for the Risk Factors entry and its anchor link. If that succeeds, it looks for the next entry's anchor link, in the hope that this second anchor can be used to mark the end of the Risk Factors section.
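The link-following idea can be sketched on a toy document before looking at the full loop. This is a minimal illustration, not the notebook's exact logic: the `extract_section` helper, the `SAMPLE` string, and the anchor names (`tx_1`, `tx_1a`, `tx_1b`) are all made up for the example, and it assumes anchors are written as `<a name="...">`.

```python
import re

# Toy filing: a linked table of contents followed by the anchored sections.
# The anchor names below are hypothetical, chosen only for this illustration.
SAMPLE = (
    '<a href="#tx_1">item 1. business</a> '
    '<a href="#tx_1a">item 1a. risk factors</a> '
    '<a href="#tx_1b">item 1b. unresolved staff comments</a> '
    '<a name="tx_1"></a><p>item 1. business. we make widgets.</p>'
    '<a name="tx_1a"></a><p>item 1a. risk factors. demand may fall.</p>'
    '<a name="tx_1b"></a><p>item 1b. unresolved staff comments. none.</p>'
)

def extract_section(doc, heading="risk factors"):
    """Return the raw HTML between the heading's anchor and the next anchor, or None."""
    toc_pos = doc.find(heading)          # first mention is assumed to be the TOC entry
    if toc_pos < 0:
        return None
    links = list(re.finditer(r'href="#([^"]+)"', doc))
    before = [m for m in links if m.start() < toc_pos]
    after = [m for m in links if m.start() > toc_pos]
    if not before:
        return None                      # no linked table of contents
    start_tag = before[-1].group(1)      # link closest before the heading
    start = doc.find('<a name="%s">' % start_tag)
    if start < 0:
        return None
    for m in after:                      # first later link that points past the start
        tag = m.group(1)
        if tag == start_tag:
            continue
        end = doc.find('<a name="%s">' % tag, start)
        if end > start:
            return doc[start:end]
    return None

section = extract_section(SAMPLE)
print(section)
```

Unlike a bare `find(...) + offset`, each lookup here is checked for `-1` before it is used, which avoids silently slicing from the wrong position when a heading or anchor is missing.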

In [5]:
start = timer()
load_time = 0

risk_factors_corpus = []
cleaner = re.compile('&.*?;|\xa0|\r|\n|\a|\b|\f|\t|\v')
for docLoc in df['htmLOC'].tolist():
    start_load = timer()
    with open(docLoc, "r") as f: 
        docPage = f.read()
    end_load = timer()
    load_time += (end_load - start_load)
    

    docPage = docPage.lower()
    docPage = cleaner.sub(' ',docPage)

    
    item_1a_index = docPage.find('item 1a') 
    risk_toc_index = docPage[item_1a_index:].find('risk factors')+item_1a_index
    href_index = docPage[:risk_toc_index].rfind('href="#') #looks for first link before 'risk factors'
    if(href_index <0): continue #skips 10k/10qs that don't have hotlinked table of contents
    left_tag_index = href_index+7
    right_tag_index = docPage[left_tag_index:].find('"')+left_tag_index 
    risk_element_tag = docPage[left_tag_index:right_tag_index] #isolated risk factor tag
    if len(risk_element_tag)==0: continue #this likely never happens

    next_element_tag = ""
    post_risk_toc_index = -1
    risk_start_index = -1
    risk_end_index = -1
    next_hrefs=[i.end() for i in re.finditer('href="#', docPage[risk_toc_index:risk_toc_index+2500])] #looks for links after 'risk factors'
    for i in next_hrefs:
        left_tag_index = risk_toc_index+i
        right_tag_index = docPage[left_tag_index:].find('"')+left_tag_index #looks for end of element tag
        if(right_tag_index < left_tag_index): continue  #this also likely never happens
        next_element_tag = docPage[left_tag_index:right_tag_index]
        #this checks to see if the next url links beyond the risk section
        #this looks inefficient but is needed because sometimes the links are out of order
        if (next_element_tag != risk_element_tag) and (next_element_tag != ""): 
            post_risk_toc_index = right_tag_index
            risk_start_index = docPage[post_risk_toc_index:].find('<a name="' + risk_element_tag  +'">')+post_risk_toc_index
            risk_end_index = docPage[risk_start_index:].find('<a name="' + next_element_tag  +'">')+risk_start_index
            if(risk_start_index < risk_end_index): break #the next unique url after the risk factors toc entry that links beyond where the risk factors url links to has been found
    if(risk_start_index >= risk_end_index):continue            

    risk_factors = docPage[risk_start_index:risk_end_index]
    risk_factors = risk_factors[risk_factors.find(">")+1:risk_factors.rfind("<")]
    risk_factors = clean_html(risk_factors)
    
    risk_factors_corpus.append(risk_factors)
    
    
end = timer()
print('Total loop time: ' + str(end - start))
print('IO load time: ' + str(load_time))
print('Avg. iteration time: ' + str(((end - start)/len(df))))
print(str(len(risk_factors_corpus))+" entries were retrieved with a success rate of: " + str(len(risk_factors_corpus)/len(df)))

Total loop time: 33.321857795
IO load time: 12.38193154700002
Avg. iteration time: 0.033321857795
674 entries were retrieved with a success rate of: 0.674


This obviously doesn't work for every 10-K/10-Q, but a well-labelled and linked table of contents is a common enough convention that the code pulled a section for ~67% of the documents. It's also worth noting that this isn't particularly fast, and some of the regex cleaning could likely be improved.
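One possible direction for the cleaning step, offered as a sketch rather than a drop-in replacement: precompile the patterns once, match tags with `[^>]*` so that tags spanning line breaks are caught, and decode HTML entities with the standard library instead of deleting them. The `strip_markup` name is hypothetical, and its output will differ slightly from `clean_html` (for example, decoded curly apostrophes are dropped rather than replaced by a space).

```python
import html
import re

TAG_RE = re.compile(r'<[^>]*>')   # [^>] also matches newlines, so multi-line tags are caught
SPACE_RE = re.compile(r'\s+')     # \s covers \r, \n, \t, \f, \v and the non-breaking space

def strip_markup(raw):
    """Remove tags, decode entities, drop non-ASCII characters, and normalize whitespace."""
    text = TAG_RE.sub(' ', raw)
    text = html.unescape(text)                  # '&amp;' -> '&', '&nbsp;' -> '\xa0'
    text = SPACE_RE.sub(' ', text)              # turn every whitespace run into one space
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop remaining non-ASCII
    text = SPACE_RE.sub(' ', text)              # collapse gaps left by dropped characters
    return text.lower().strip()

cleaned = strip_markup('<p\nclass="x">Risk&nbsp;&amp; Reward</p>')
print(cleaned)
```

Whether this is actually faster than the existing approach on real filings would need to be timed; the main gain here is correctness on tags that wrap across lines.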

In [22]:
for i in range (10):
    if len(risk_factors_corpus[i])<1000: continue
    display(risk_factors_corpus[i][:500]+" ... "+risk_factors_corpus[i][-500:])

'item 1a. risk factors this report and other documents we file with the sec contain forward-looking statements that are based on current expectations, estimates, forecasts and projections about us, our future performance, our business, our beliefs and our management s assumptions. these statements are not guarantees of future performance and involve certain risks, uncertainties and assumptions that are difficult to predict. you should carefully consider the risks and uncertainties facing our busi ... erations. further, additional risks not currently known to us or that we currently believe are immaterial may in the future materially and adversely affect our business, operations, liquidity and stock price. there are no material updates from the risk factors previously disclosed in part i, item 1a, of our annual report on form 10-k for the fiscal year ended december 31, 2011, and in part ii, item ia, of our quarterly report on form 10-q for the period ended march 31, 2012. 36 table of co

'item 1a. risk factors in addition to the other information contained in this annual report on form 10-k, you should consider the following risk factors in evaluating our results of operations, financial condition, business and operations or an investment in the shares of our company. the risk factors described in this section have been separated into four groups: risks that relate to the competition we face and the technology used in our businesses; risks that relate to our operating in overseas ...  it would be to bring similar claims against a u.s. company in a u.s. court. in particular, english law significantly limits the circumstances under which shareholders of english companies may bring derivative actions. under english law generally, only the company can be the proper plaintiff in proceedings in respect of wrongful acts committed against us. our articles of association provide for the exclusive jurisdiction of the english courts for shareholder lawsuits against us or our dire

'item 1a. risk factors we operate in a business environment that involves numerous known and unknown risks and uncertainties that could have a materially adverse impact on our operations. the risks described below highlight some of the factors that have affected, and in the future could affect, our operations. you should carefully consider these risks. these risks are not the only ones we may face. additional risks and uncertainties of which we are unaware or that we currently deem immaterial als ... of our available cash, if any, for working capital and other general corporate purposes. any payment of future dividends will be at the discretion of our board of directors and will depend upon, among other things, our earnings, financial condition, capital requirements, debt levels, statutory and contractual restrictions applying to the payment of dividends and other considerations that our board of directors deems relevant. investors seeking cash dividends should not purchase our common 

There may be some oddities somewhere, but this looks good enough to continue. So we'll save the corpus, which so far has only had its tags, code, and non-printable characters removed and will need further preprocessing before analysis.

In [7]:
lengths = np.array([len(x) for x in risk_factors_corpus]) #each entry is a string, so these are character counts
print("Avg chars per document: " + str(np.average(lengths)))
print("Std of chars per document: " + str(np.std(lengths)))
print("Most chars in one document: " + str(max(lengths)))

Avg chars per document: 28495.32640949555
Std of chars per document: 45241.50133404879
Most chars in one document: 331174


In [8]:
with open('risk_factors_corpus.pkl', 'wb') as f:
    pickle.dump(risk_factors_corpus, f)

This is all pretty standard preprocessing. I first remove all numbers, punctuation, and one- and two-letter words. I then apply the prebuilt WordNet lemmatizer and Snowball stemmer and store the result as word-tokenized data. While each 10-K/10-Q still has its own bag of words, most of the contextual data has, for better or worse, been stripped out.

In [9]:
import gensim
import nltk

from gensim import models
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.tokenize import word_tokenize
numbers = ['twenty','thirty','forty','fifty','sixty','seventy','eighty','ninety',
           'zero','one','two','three','four','five','six','seven','eight','nine',
           'ten','eleven','twelve','thirteen','fourteen','fifteen','sixteen',
           'seventeen','eighteen','nineteen','hundred','thousand','million','billion']
nltk.download('wordnet')
nltk.download('punkt')
stemmer = SnowballStemmer("english") #generally preferred over the original Porter stemmer
lemmatizer = WordNetLemmatizer() #created once rather than once per token
stopwords = set([x for x in gensim.parsing.preprocessing.STOPWORDS] + numbers) #a set gives faster membership checks than a list

#input: single body text string
#output: tokenized/stemmed/lemmatized/filtered word string list
def stem_lem_clean(text):
    text = re.sub(r'\d', ' ', text) #removes numbers
    text = re.sub(r'[?!\'"#.,)(|/$:;&^%@_-]', '', text) #removes punctuation
    text = re.sub(r'\W*\b\w{1,2}\b', '', text) #removes words that are 1 or 2 chars long
    text = word_tokenize(text) #converts from paragraph to list
    #lemmatizes (as verbs) then stems each word, throws out stopwords
    text = [stemmer.stem(lemmatizer.lemmatize(x, pos='v')) for x in text if x not in stopwords]
    return text

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Zach\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Zach\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
start = timer()
tokenized_risk = [stem_lem_clean(x) for x in risk_factors_corpus]
end = timer()
print('Total loop time: ' + str(end - start))
print('Avg. iteration time: ' + str(((end - start)/len(risk_factors_corpus))))

Total loop time: 34.22308647900001
Avg. iteration time: 0.05077609269881307


In [11]:
lengths = np.array([len(x) for x in tokenized_risk])
print(tokenized_risk[0][:20])
print("Avg words per bag: " + str(np.average(lengths)))
print("Std of words per bag: " + str(np.std(lengths)))
print("Most words in one bag: " + str(max(lengths)))

['item', 'risk', 'factor', 'report', 'document', 'file', 'sec', 'contain', 'forwardlook', 'statement', 'base', 'current', 'expect', 'estim', 'forecast', 'project', 'futur', 'perform', 'busi', 'belief']
Avg words per bag: 2250.6958456973293
Std of words per bag: 3550.065799249855
Most words in one bag: 26948


We can clearly see the effects of the stemming and lemmatization on the above sample. In addition, word counts have been drastically reduced by the removal of stopwords, punctuation, and numbers. This is a good thing because most of those words would just become noise later: without context they add almost no information.

In [12]:
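#note: despite its name, word2vec here is a gensim Dictionary (token -> integer id), not a Word2Vec model
word2vec = gensim.corpora.Dictionary(tokenized_risk,prune_at=1000000) #this by itself creates numerical pairing for each unique word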
#internally keeps track of which bags of words have contributed to the dictionary
#so a threshold can be set for word membership
word2vec.filter_extremes(no_below=3, no_above=0.5)
lengths = np.array([len(x) for x in tokenized_risk])
print(word2vec)


Dictionary(5088 unique tokens: ['assumpt', 'base', 'belief', 'believ', 'care']...)


Even after setting membership thresholds, there are still about 5,000 unique words across the corpus. One thing to improve here is that the word membership threshold should be applied on a cross-section of firms. E.g., one company may really like a particular word, say "motorhome", because it's central to their business. That firm uses it a lot in its own 10-K/10-Qs over time, but other firms may never mention it. It's unlikely the word "motorhome" would add much information, but dealing with it would require a slightly more advanced dictionary implementation than gensim provides. Next, we use the dictionary to convert each document into a bag-of-words count vector.

In [13]:
counter = [word2vec.doc2bow(x) for x in tokenized_risk] #creates a dictionary style counter for each bag of words

It's important to note that this counter dictionary is using numerical indexing and membership restrictions previously defined by the word2vec dictionary.
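Since membership is decided before the counts are built, the cross-firm threshold suggested earlier could be applied as a pre-filter on the token lists. The sketch below is one way to do that in plain Python; the function name, the toy tokens, and the firm ids are made up for illustration. It also assumes a firm identifier (such as CIK) is available for every document, which the extraction loop above does not currently record for the documents it keeps.

```python
from collections import defaultdict

def tokens_used_by_enough_firms(docs, firm_ids, min_firms=2):
    """Return the set of tokens that appear in filings of at least `min_firms` distinct firms."""
    firms_per_token = defaultdict(set)
    for tokens, firm in zip(docs, firm_ids):
        for tok in set(tokens):              # count each firm once per token
            firms_per_token[tok].add(firm)
    return {tok for tok, firms in firms_per_token.items() if len(firms) >= min_firms}

# Hypothetical data: the first two filings belong to the same firm.
docs = [
    ["motorhom", "sale", "risk"],
    ["motorhom", "dealer", "risk"],
    ["oil", "price", "risk"],
]
firm_ids = [101, 101, 202]

keep = tokens_used_by_enough_firms(docs, firm_ids)
filtered = [[t for t in d if t in keep] for d in docs]
print(keep)
print(filtered)
```

Here "motorhom" appears in two documents, so a per-document threshold of two would keep it, while the per-firm threshold drops it because both documents come from one firm.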

Tf-idf scores are commonly used as a means of normalizing the data before applying many nlp algos but it's not really required for LDA work. It can be useful to help reduce dimensionality for extremely large datasets. (See page 11 of http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf) This dataset is already manageable though, so I will forgo it for the final LDA model. Nevertheless this is how you'd implement it with gensim if you wanted to use it.

In [23]:
tf_idf_modeller = models.TfidfModel(counter)
tf_idf_matrix = tf_idf_modeller[counter]
tf_idf_matrix[0][:10]

[(0, 0.2717032165200514),
 (1, 0.08996560184300338),
 (2, 0.3374346469439526),
 (3, 0.08223323604442176),
 (4, 0.0933074473205476),
 (5, 0.0722524688999295),
 (6, 0.07164114101988502),
 (7, 0.08463003575826447),
 (8, 0.09180853907078416),
 (9, 0.08497711828159607)]

There's one list of tuples like the one above for each of the previously pulled 10K/10Qs. The 1st number in each tuple is a word's dictionary key, and the 2nd number is that word's bag specific tf-idf score.
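To make the meaning of those scores concrete, here is a small pure-Python version of one common tf-idf variant: raw term count times log2(N / document frequency), followed by L2 normalization within each bag. The `tfidf` function and toy bags are illustrative only. gensim's `TfidfModel` exposes its own weighting and normalization options, so its numbers need not match this sketch exactly.

```python
import math
from collections import Counter

def tfidf(bags):
    """Return one {token: score} dict per bag, L2-normalized within each bag."""
    n_docs = len(bags)
    doc_freq = Counter(tok for bag in bags for tok in set(bag))
    scored = []
    for bag in bags:
        counts = Counter(bag)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        scored.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return scored

bags = [
    ["risk", "oil", "oil"],
    ["risk", "stock"],
    ["risk", "oil", "price"],
]
scores = tfidf(bags)
print(scores[2])
```

A token that appears in every bag ("risk" here) gets a weight of zero, which is the sense in which tf-idf down-weights words that do not discriminate between documents.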

In [15]:
start = timer()
lda_modeller = gensim.models.LdaMulticore(counter, num_topics=12, id2word=word2vec, passes=10, per_word_topics=True, workers=4)
end = timer()
print('Total loop time: ' + str(end - start))

Total loop time: 23.521939949


In [16]:
for idx, topic in lda_modeller.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.027*"product" + 0.015*"custom" + 0.010*"market" + 0.010*"cost" + 0.009*"sale" + 0.009*"signific" + 0.009*"price" + 0.008*"increas" + 0.008*"manufactur" + 0.008*"new"
Topic: 1 Word: 0.015*"gas" + 0.015*"oil" + 0.012*"natur" + 0.011*"price" + 0.011*"product" + 0.010*"requir" + 0.009*"cost" + 0.008*"abil" + 0.008*"properti" + 0.008*"general"
Topic: 2 Word: 0.011*"incom" + 0.011*"net" + 0.011*"asset" + 0.010*"cash" + 0.009*"month" + 0.009*"tax" + 0.008*"total" + 0.008*"march" + 0.008*"valu" + 0.008*"cost"
Topic: 3 Word: 0.028*"stock" + 0.024*"product" + 0.014*"market" + 0.011*"develop" + 0.011*"prefer" + 0.010*"common" + 0.009*"seri" + 0.009*"price" + 0.009*"requir" + 0.008*"agreement"
Topic: 4 Word: 0.016*"cost" + 0.012*"gas" + 0.012*"regul" + 0.010*"requir" + 0.009*"impact" + 0.009*"increas" + 0.009*"qep" + 0.008*"price" + 0.008*"natur" + 0.007*"product"
Topic: 5 Word: 0.020*"servic" + 0.009*"market" + 0.007*"abil" + 0.007*"certain" + 0.007*"new" + 0.007*"requir" + 0.007

In [17]:
lda_modeller.log_perplexity(counter)

-6.7088100156030634

From here you could further optimize this likelihood metric (it is a per-word log-likelihood bound, so values closer to zero are better) by changing the corpus filtering, the assumed number of topics, adding prior beliefs, tweaking how the calculations are iterated, and perhaps using n-grams. I think 2-grams could be really interesting in particular because in finance simple qualifiers can carry a ton of informational context (e.g., "increased revenue", "project decreases", "cash flow"). In addition, training/test sets should be used if quantitative metrics are going to be used for evaluation. In particular, the above metric should not be evaluated on the training set (counter is the training set) for a serious optimization attempt. But with all that said, here's a model-based inference on the topic classification for one of the in-set documents:

In [33]:
lda_modeller[counter[8]][0]

[(0, 0.13253443), (3, 0.015137215), (8, 0.8439119)]

In [35]:
risk_factors_corpus[8][:1000]+"..."

'item 1a. risk factors we operate in a business environment that involves numerous known and unknown risks and uncertainties that could have a materially adverse impact on our operations. the risks described below highlight some of the factors that have affected, and in the future could affect, our operations. you should carefully consider these risks. these risks are not the only ones we may face. additional risks and uncertainties of which we are unaware or that we currently deem immaterial also may become important factors that affect us. if any of the events or circumstances described in the followings risks occurs, our business, financial condition, results of operations, cash flows, or any combination of the foregoing, could be materially and adversely affected. our risks are described in detail below; however, the more significant risks we face include the following: if we are unable to attract new customers, or if our existing customers do not purchase additional products or se

Here the model is saying that SciQuest's 10-K for fiscal year 2012 ( https://www.sec.gov/Archives/edgar/data/1082526/000119312513098097/d444510d10k.htm#tx444510_3 ) should be classified as topic 8. Recall:

Topic: 8 Word: 0.018*"custom" + 0.016*"product" + 0.015*"servic" + 0.011*"market" + 0.009*"requir" + 0.008*"revenu" + 0.008*"signific" + 0.007*"stock" + 0.007*"technolog" + 0.007*"increas"

And this makes sense, seeing as SciQuest describes its offering as follows: "The SciQuest Supplier Network is a SaaS communications hub that enables efficient and automated transaction interactions between our customers and their suppliers through our procurement, accounts payable and supplier management solutions.". So their revenue is derived from a subscription-based service model, which fits the topic's top terms: "custom", "product", "servic", etc.
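Returning to two of the tuning ideas raised earlier, 2-grams and a held-out evaluation set, both can be sketched in a few lines of plain Python. The helper names `add_bigrams` and `split_corpus` are hypothetical. Note that because stopwords are removed before this point, "adjacent" tokens may not have been adjacent in the original text.

```python
import random

def add_bigrams(tokens, sep="_"):
    """Return the tokens plus every adjacent pair joined by `sep`."""
    pairs = [a + sep + b for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs

def split_corpus(bags, test_fraction=0.2, seed=0):
    """Randomly split a list of documents into (train, test) lists."""
    idx = list(range(len(bags)))
    random.Random(seed).shuffle(idx)          # seeded so the split is reproducible
    n_test = max(1, int(len(bags) * test_fraction))
    test_idx = set(idx[:n_test])
    train = [b for i, b in enumerate(bags) if i not in test_idx]
    test = [b for i, b in enumerate(bags) if i in test_idx]
    return train, test

augmented = add_bigrams(["increas", "revenu", "cash", "flow"])
print(augmented)

toy_corpus = [["doc%d" % i] for i in range(10)]
train, test = split_corpus(toy_corpus)
print(len(train), len(test))
```

Under this setup the dictionary and the LDA model would be fit on the training documents only, and `log_perplexity` would then be evaluated on the bag-of-words vectors of the held-out documents. gensim also ships a `Phrases` model that learns which pairs to merge from co-occurrence statistics, which may be preferable to keeping every adjacent pair.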

In conclusion, there's certainly tractability here. One could tune the model to identify particular phrases of known interest or use the model to look for new topics of interest. A potential strategy is to automatically classify firms by topic likelihoods and then weight the topics by their historical risk-adjusted returns. From there you could construct an expected score for each topic. Then if a company is later classified (once the model is trained, this can be done the moment a new SEC document is posted) into topics with high scores, you'd have good reason to examine it further.
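The scoring step of that strategy amounts to a probability-weighted average. A minimal sketch follows, assuming per-topic scores have already been estimated somehow. The `topic_scores` values are placeholders invented for the example, not results, and `expected_score` is a hypothetical helper. The topic distribution is the one the model returned for document 8 above.

```python
def expected_score(topic_probs, topic_scores):
    """Probability-weighted average of per-topic scores for one document."""
    total = sum(p for _, p in topic_probs)
    if total == 0:
        return 0.0
    weighted = sum(p * topic_scores.get(t, 0.0) for t, p in topic_probs)
    return weighted / total       # renormalize, since low-probability topics are omitted

# Topic distribution reported above for counter[8].
doc_topics = [(0, 0.13253443), (3, 0.015137215), (8, 0.8439119)]
# Placeholder scores per topic id; purely hypothetical.
topic_scores = {0: 0.2, 3: -0.1, 8: 0.5}

score = expected_score(doc_topics, topic_scores)
print(round(score, 3))
```

Topics missing from `topic_scores` contribute zero here. Whether that default is sensible depends on how the scores are estimated.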