## Unigram Visualizations Using Latent Dirichlet Allocation

This tutorial illustrates how Latent Dirichlet Allocation (LDA) can uncover trends in the quarterly and annual public filings required by the SEC. We analyzed 10 years of data (2008 through 2017); unigram and bigram visualizations for each year are linked at the end of this notebook. If you have not yet downloaded and prepared your data set, please see our [Downloading and Extracting MD&A Text](Downloading_and_Extracting_MDA_Text.ipynb) tutorial before proceeding.

We use Gensim for the topic modeling and pyLDAvis for the visualizations.

> **Note:**
>
> We recommend filtering out warnings, as certain versions of pyLDAvis use deprecated parsing tools.

In [1]:
import warnings
warnings.filterwarnings("ignore")

First we need to import all the MD&A text we have located using the three Python files previously mentioned.

In [2]:
import os

file_directory = 'data/'
file_stopwords = 'stopwords.txt'

# Prepare list of all txt files in target directory
onlyfiles = []
for file in os.listdir(file_directory):
    if file.endswith('.txt'):
        onlyfiles.append(os.path.join(file_directory, file))

print('{} files loaded'.format(len(onlyfiles)))

574 files loaded
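
Note that `os.listdir` returns entries in arbitrary order, so the order of `onlyfiles` (and therefore of the documents built from it later) can differ between machines. The following is a minimal sketch of a sorted variant; the helper name `list_text_files` and the sample file names are hypothetical and not part of the original project.

```python
import os
import tempfile

def list_text_files(directory):
    """Return the sorted paths of all .txt files directly inside directory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith('.txt')
    )

# Demonstration on a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b_2017.txt', 'a_2016.txt', 'notes.md']:
        open(os.path.join(tmp, name), 'w').close()
    found = [os.path.basename(p) for p in list_text_files(tmp)]

print(found)
```

Sorting costs almost nothing at this scale and keeps document indices stable from run to run.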


Next we will load our stopwords list, which consists of 800+ common and financial terms.

In [3]:
with open(file_stopwords, 'r') as f:
    stopwords = f.readlines()
    stopwords = [x.strip() for x in stopwords]
    print('{} stopwords'.format(len(stopwords)))

880 stopwords
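
Because every token is later checked against this list, it may be worth storing the stopwords in a set, which gives constant-time membership tests. The sketch below assumes a hypothetical helper `load_stopwords`; lowercasing the entries and skipping blank lines are assumptions on our part, chosen to match the lowercased tokens produced later in this notebook.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stopword per line, skipping blank lines, into a set."""
    with open(path, 'r') as f:
        return {line.strip().lower() for line in f if line.strip()}

# Demonstration with a hypothetical three-word stopword file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'stopwords.txt')
    with open(sample_path, 'w') as f:
        f.write('the\nCompany\n\nfiscal\n')
    stopword_set = load_stopwords(sample_path)

print(sorted(stopword_set))
```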


The function below removes any remaining stopwords that may have been missed during the initial parsing of the EDGAR data. We apply it again here in case the user would like to test on raw text filings as well. The function takes a list of words. Each word is lowercased, and whitespace, digits, and common punctuation and currency characters (commas, hyphens, parentheses, periods, dollar signs, and apostrophes) are stripped using a regular expression. Words ending in "ed", tokens of one character or fewer, and stopwords are then discarded.

In [4]:
import re

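def remove_stopwords_list(word_list):
    """Clean each word and drop stopwords, 'ed' words, and 1-character tokens."""
    filtered_word_list = []

    for word in word_list:
        # Lowercase, then strip whitespace, digits, and punctuation/currency characters
        token = re.sub(r'[\s\d\,\-\(\)\.\$\']', '', word.lower())
        if token.endswith('ed'):  # remove any word ending with 'ed'
            continue
        if len(token) > 1 and token not in stopwords:
            filtered_word_list.append(token)

    return filtered_word_list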
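
To make the cleaning rules concrete, here is a small self-contained sketch that applies the same logic to a handful of made-up tokens. The three-word `stopwords` set and the `sample` list are hypothetical stand-ins for the real stopword file and filing text.

```python
import re

# Hypothetical stand-in for the list loaded from stopwords.txt.
stopwords = {'the', 'of', 'and'}

def remove_stopwords_list(word_list):
    filtered_word_list = []
    for word in word_list:
        # Same character class as the function above
        token = re.sub(r'[\s\d\,\-\(\)\.\$\']', '', word.lower())
        if token.endswith('ed'):
            continue
        if len(token) > 1 and token not in stopwords:
            filtered_word_list.append(token)
    return filtered_word_list

sample = ['The', 'Company', 'reported', '$1,250.00', 'of',
          'net-revenue', "(company's)", 'bed']
result = remove_stopwords_list(sample)
print(result)
```

Two side effects are worth keeping in mind when reading the topics later: hyphenated terms are fused into a single token (`net-revenue` becomes `netrevenue`), and the "ed" rule also drops words that are not past-tense verbs (such as `bed`).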

Here we read every file in our document list line by line. Each line is split on whitespace, and the words collected from a file are passed to the ```remove_stopwords_list``` function above.

In [5]:
docs = []
for file in onlyfiles:
    # Collect every whitespace-separated word in the file, then filter once
    temp = []
    with open(file, 'r') as f:
        for line in f:
            temp.extend(line.split())
    docs.append(remove_stopwords_list(temp))
print('{} documents in docs array'.format(len(docs)))

574 documents in docs array


Our documents need to be converted to vector representations using the bag-of-words method. We use Gensim to assign a unique integer id to every word appearing in our documents (the corpus), collecting word counts and relevant statistics along the way. Gensim builds a dictionary object that we then use for our modeling.

As the output below shows, our dictionary contains 18,270 unique tokens.

In [6]:
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(docs)
print(dictionary)

Dictionary(18270 unique tokens: ['operationsitem', 'proceduresthe', 'evaluation', 'supervision', 'participation']...)


We then convert the tokenized documents to vectors and display just one result (printing the entire ```corpus``` would likely trigger data rate warnings in the notebook). The method ```doc2bow()``` counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector.

In [7]:
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus[:1])

[[(0, 1), (1, 1), (2, 3), (3, 1), (4, 1), (5, 9), (6, 1), (7, 4), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 4), (15, 2), (16, 2), (17, 1), (18, 1), (19, 1), (20, 2), (21, 2), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 2), (44, 1), (45, 1), (46, 2), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 2), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 4), (84, 2), (85, 2), (86, 2), (87, 2), (88, 3), (89, 2), (90, 2), (91, 1), (92, 1), (93, 5), (94, 1), (95, 5), (96, 4), (97, 1), (98, 1), (99, 1), (100, 4), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 2), (109, 2), (110, 1)
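
The bag-of-words idea can be illustrated without Gensim. The following pure-Python sketch mimics the concept only: it is not Gensim's implementation, and Gensim's actual id assignment may differ. The names `token2id` and `to_bow` and the toy documents are hypothetical.

```python
from collections import Counter

# Two toy tokenized documents (made up for illustration).
toy_docs = [
    ['revenue', 'increased', 'revenue', 'costs'],
    ['costs', 'decreased', 'margin'],
]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sparse vector: sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Each pair reads as "token id, number of occurrences in this document"; tokens that do not occur in a document are simply absent, which is what makes the vector sparse.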

We are now ready to train the LDA model with 90 topics. The topic with id 1 is displayed below as its ten most probable words, each paired with that word's probability in the topic.

In [8]:
model = models.LdaModel(corpus, id2word=dictionary, num_topics=90)

print(model)
model.show_topic(topicid=1, topn=10)

LdaModel(num_terms=18270, num_topics=90, decay=0.5, chunksize=2000)


[('share', 0.0069929123801531342),
 ('technology', 0.0050395426295979703),
 ('license', 0.0050107988658426783),
 ('it', 0.0048795209949297246),
 ('acquisition', 0.0044708725532913284),
 ('maintenance', 0.0044499556717575673),
 ('support', 0.0039935359676677318),
 ('research', 0.0038583316530521198),
 ('lower', 0.0037165065389233487),
 ('rates', 0.0036735049657321173)]
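
Each topic is a probability distribution over the entire vocabulary, so individual word probabilities are small. As a rough way to read the output, the sketch below sums the probabilities of the ten words listed above; the values are copied from that output, and the variable names are our own.

```python
# (word, probability) pairs copied from the show_topic output above.
topic_words = [
    ('share', 0.0069929123801531342),
    ('technology', 0.0050395426295979703),
    ('license', 0.0050107988658426783),
    ('it', 0.0048795209949297246),
    ('acquisition', 0.0044708725532913284),
    ('maintenance', 0.0044499556717575673),
    ('support', 0.0039935359676677318),
    ('research', 0.0038583316530521198),
    ('lower', 0.0037165065389233487),
    ('rates', 0.0036735049657321173),
]

# Total probability mass carried by the listed words.
top_mass = sum(prob for _, prob in topic_words)
print('Top {} words cover {:.1%} of the topic'.format(len(topic_words), top_mass))
```

Here the ten most probable words account for less than 5% of the topic's probability mass, which suggests the topic is spread across many terms rather than dominated by a few.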

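We use pyLDAvis to generate a visualization of the topics and their word distributions based on the LDA model trained above. (In pyLDAvis 3.0 and later, the Gensim helper module is named ```pyLDAvis.gensim_models``` rather than ```pyLDAvis.gensim```.)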

In [9]:
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
data = pyLDAvis.gensim.prepare(model, corpus, dictionary)

Finally, we display the visualization for this topic model. To save it to disk, uncomment the two lines that call ```pyLDAvis.save_html()```; note that ```year``` must be defined and the ```visuals/``` folder must exist beforehand.

In [10]:
# visuals = 'visuals/'
# pyLDAvis.save_html(data, visuals + 'vis_{}.html'.format(year))

pyLDAvis.display(data)
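
If you save one visualization per year, a small helper can build the file name and make sure the output folder exists. This is a minimal sketch under our own assumptions: the helper name `visualization_path` is hypothetical, and the `unigram_vis` prefix simply mirrors the file names in the links at the end of this notebook.

```python
import os
import tempfile

def visualization_path(year, visuals_dir='visuals', prefix='unigram_vis'):
    """Return the HTML path for a given year, creating the folder if needed."""
    os.makedirs(visuals_dir, exist_ok=True)
    return os.path.join(visuals_dir, '{}_{}.html'.format(prefix, year))

# Demonstrated in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = visualization_path(2017, visuals_dir=os.path.join(tmp, 'visuals'))
    folder_exists = os.path.isdir(os.path.dirname(target))
    file_name = os.path.basename(target)

print(file_name)
```

The returned path can then be passed as the second argument to `pyLDAvis.save_html(data, ...)`.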

### Unigram visualizations for 2008 through 2017

[2008](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2008.html) | 
[2009](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2009.html) | 
[2010](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2010.html) | 
[2011](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2011.html) | 
[2012](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2012.html) | 
[2013](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2013.html) | 
[2014](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2014.html) | 
[2015](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2015.html) | 
[2016](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2016.html) | 
[2017](https://rgjeldum.github.io/cs410-final-project-team32/unigram_vis_2017.html)

### Bigram visualizations

[2008](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2008.html) | 
[2009](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2009.html) | 
[2010](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2010.html) | 
[2011](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2011.html) | 
[2012](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2012.html) | 
[2013](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2013.html) | 
[2014](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2014.html) | 
[2015](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2015.html) | 
[2016](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2016.html) | 
[2017](https://rgjeldum.github.io/cs410-final-project-team32/bigram_vis_2017.html)


For a detailed description of how the pyLDAvis visualization is prepared, check out this [paper](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) (credit unknown).