# Introduction

This tutorial introduces a popular data science technique known as statistical topic modeling. In machine learning and natural language processing, a topic model is used to find the "topics" that occur in a collection of documents. Topic models rely on the intuition that if a document is about a particular topic, e.g. "biology", then the word "biology" and related words such as "cell" or "membrane" will occur more frequently in that document. The algorithm is therefore useful for quickly gaining insight into large collections of unstructured text. A good way to think of it is as a word cloud, or a Wordle: the largest words in a word cloud are the words that occur most often in the document.

<img src="https://pbs.twimg.com/media/BB4Tr9FCYAISYdu.jpg" width="800" height="800">

If we were to run topic modeling on the document behind this word cloud, words like "biology", "life", and "organisms" would have the highest probabilities. The categories that these words fall under would then be identified as the topics.

# Tutorial Content

In particular, this tutorial focuses on Latent Dirichlet Allocation (LDA), one of the simplest and most popular topic modeling algorithms.

The purpose of this tutorial is to expose readers to the basics of the algorithm and to its implementation in several popular Python LDA libraries.

First, we'll discuss the mathematical basis of the algorithm and gain a basic understanding of how it calculates the topics in a document.

Then, we'll explore how to use the algorithm on various simple datasets using gensim, a popular topic modeling package. We'll go over proper preprocessing protocol and run LDA to explore the results.

Next, we'll look at two other topic modeling packages, lda and scikit-learn, and interpret their similar output.

And finally, we'll use some basic visualization techniques to gain a visual and interactive understanding of the results of a topic model.


## Setup

Before getting started, you'll need to install the libraries that we will use. The following libraries are used for tokenization (NLTK and stop-words), LDA implementation (Gensim and lda), and visualization (pyLDAvis), respectively. NumPy and scikit-learn, which are also imported below, need to be installed as well.

NLTK:
    
    $ pip install -U nltk

Stop-words: 
    
    $ pip install stop-words

Gensim: 
    
    $ pip install gensim
    
LDA:

    $ pip install lda

PyLDAVis: 
    
    $ pip install pyldavis
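
After installing, it can help to confirm that every package is importable before running the notebook. The following is a minimal sketch, not part of the original tutorial: `check_imports` and `REQUIRED` are hypothetical names, and the list uses the import names that appear in the import cell of this tutorial (for example `sklearn` for scikit-learn and `stop_words` for stop-words).

```python
import importlib

# Import names (not pip names) used in this tutorial.
REQUIRED = ["numpy", "nltk", "stop_words", "gensim", "sklearn", "lda", "pyLDAvis"]

def check_imports(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except Exception:
            # Missing or broken installation: report it instead of crashing.
            status[name] = False
    return status

status = check_imports(REQUIRED)
for name in REQUIRED:
    print("{:<12} {}".format(name, "ok" if status[name] else "MISSING"))
```

Any package reported as `MISSING` can be installed with the matching `pip install` command above.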
    
   

In [27]:
import numpy as np
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer


import lda

## The Algorithm

So how does LDA actually work? 

Starting with a collection of documents, our primary objective is to identify the topics that this particular collection is about. Assuming that every document covers a varying number and range of topics, we can define each topic as a collection of words, each of which has a certain probability of occurring in a document about that topic.

We can't observe the topics directly; we only have the documents. So instead, we iterate through every word in a document and determine which topic this instance of the word 'belongs' to. We then do this over all of the documents.

So, for every possible topic T, we multiply the frequency of word W in T by the number of other words in document D that are already assigned to topic T. The result is proportional to the probability that this instance of word W belongs to topic T.

On a fundamental level, this mathematically translates into

$P(T \mid W, D) \propto \frac{\text{number of times word } W \text{ is assigned to topic } T + \beta(w)}{\text{total tokens in } T + \beta} \times (\text{number of words in } D \text{ that belong to } T + \alpha)$

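The extra β(w) term gives a word some chance of belonging to a topic even if it has not yet been assigned to that topic anywhere in the collection, and β in the denominator is the sum of β(w) over the whole vocabulary. Similarly, α gives a topic some chance of being chosen even if the document doesn't yet contain any words assigned to that topic.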
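
To make the formula concrete, here is a small worked example with made-up numbers. It is only a sketch: the counts, the vocabulary size, and the values of α and β(w) are all hypothetical, and it assumes the same β(w) for every word so that the β in the denominator is simply the vocabulary size times β(w).

```python
def topic_score(word_in_topic, tokens_in_topic, doc_words_in_topic,
                vocab_size, alpha=0.1, beta_w=0.01):
    """Unnormalized score that one instance of word W belongs to topic T in document D."""
    # (times W is assigned to T + beta(w)) / (total tokens in T + beta)
    word_part = (word_in_topic + beta_w) / (tokens_in_topic + vocab_size * beta_w)
    # (words in D that belong to T + alpha)
    doc_part = doc_words_in_topic + alpha
    return word_part * doc_part

# Hypothetical counts for the word "cell":
# (times "cell" is in the topic, total tokens in the topic, words of D in the topic)
counts = {
    "biology": (30, 1000, 12),
    "sports": (1, 1000, 3),
}

scores = {t: topic_score(*c, vocab_size=5000) for t, c in counts.items()}

# Dividing by the sum turns the scores into probabilities over the topics.
total = sum(scores.values())
probs = {t: s / total for t, s in scores.items()}

for topic in sorted(probs):
    print("{}: {:.4f}".format(topic, probs[topic]))
```

With these numbers "biology" receives almost all of the probability, because "cell" is frequent in that topic and the document already contains many "biology" words.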

As we go through the collection, word by word, and assign words to topics, our model becomes more and more consistent. Topics begin to point to specific words occurring in specific documents. In essence, topic modeling assigns words to topics by guessing, and then keeps repeating this process, making more accurate guesses as it works its way through the collection.
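
The guess-and-refine loop described above can be sketched in a few lines of NumPy. This is a toy illustration of one common way to fit LDA (collapsed Gibbs sampling), not the implementation used by any of the libraries below; `gibbs_lda` and its parameters are hypothetical names, documents are assumed to be lists of integer word ids, and the same β(w) is assumed for every word.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta_w=0.01, seed=0):
    """Toy sampler: docs is a list of lists of integer word ids."""
    rng = np.random.RandomState(seed)
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    doc_topic = np.zeros((len(docs), n_topics))    # words in document d assigned to topic t
    topic_total = np.zeros(n_topics)               # total tokens assigned to topic t

    # Start by guessing: give every word instance a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.randint(n_topics, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            topic_word[t, w] += 1
            doc_topic[d, t] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this word instance's current guess from the counts.
                t = assignments[d][i]
                topic_word[t, w] -= 1
                doc_topic[d, t] -= 1
                topic_total[t] -= 1
                # Score every topic with the formula above and normalize.
                scores = ((topic_word[:, w] + beta_w)
                          / (topic_total + vocab_size * beta_w)
                          * (doc_topic[d] + alpha))
                t = rng.choice(n_topics, p=scores / scores.sum())
                # Record the new guess.
                assignments[d][i] = t
                topic_word[t, w] += 1
                doc_topic[d, t] += 1
                topic_total[t] += 1
    return topic_word, doc_topic

# Word ids 0-2 stand for "food" words, ids 3-5 for "heart" words.
docs = [[0, 1, 2, 0, 1, 2, 0, 1],
        [3, 4, 5, 3, 4, 5, 3, 4],
        [0, 2, 1, 1, 0, 2],
        [5, 3, 4, 4, 5, 3]]

topic_word, doc_topic = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topic)  # row d: how many words of document d ended up in each topic
```

Each row of `doc_topic` shows how a document's words were split between the two topics after the final pass.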

## Simple Data Processing & Gensim Usage Example

Now let's work through a very simple example just to see how to implement LDA using a fairly popular library called Gensim. We'll discuss a basic data processing and cleaning methodology using nltk.

Say we start with the following two documents. In reality, LDA would never be run on such simple documents, but for the sake of an example that is easy to interpret, they do the job.

From simple inspection we can tell that the first document is about eating, and the second is about cardiology.

In [29]:
doc_A = "I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast, and it was really good."
doc_B = "Cardiology is a study of medicine dealing with disorders of the heart and parts of the circulatory system."



Reading from a file isn't too difficult either, and in most cases with LDA, you'll have a folder of documents in the form of text files, so the following method will be useful.

In [30]:
def readDocument(doc_filepath):
    
    with open(doc_filepath, 'r') as myfile:
        # replace newlines with spaces so words on adjacent lines are not glued together
        data=myfile.read().replace('\n', ' ')
        
    return data
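
Since the documents usually sit together in one folder, a small helper that reads all of them at once can be handy. The sketch below is an illustration only: `read_corpus` is a hypothetical name, and it assumes the files end in `.txt` and are UTF-8 encoded. The demo at the bottom writes a throwaway file to a temporary directory so that the example runs on its own.

```python
import glob
import io
import os
import tempfile

def read_corpus(folder):
    """Return {filename: text} for every .txt file directly inside folder."""
    corpus = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with io.open(path, "r", encoding="utf-8") as f:
            # Collapse all whitespace (including newlines) into single spaces.
            corpus[os.path.basename(path)] = " ".join(f.read().split())
    return corpus

# Demo on a temporary folder holding one two-line document.
demo_dir = tempfile.mkdtemp()
with io.open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as f:
    f.write(u"I like to eat\nbroccoli and bananas.")

docs = read_corpus(demo_dir)
print(docs)
```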

## Tokenization

Just like in previous homework assignments, data cleaning is essential to providing a standardized input for algorithms to work with. Here we'll use tokenization, which strips unstructured text down to its atomic elements. I'm using a regular expression tokenizer, but nltk has many different tokenizers that perform a variety of specific actions. Alternatively, one could use the function we created in HW3 for text processing.

In [31]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

raw = doc_A.lower()
tokens = tokenizer.tokenize(raw)
print(tokens)

['i', 'like', 'to', 'eat', 'broccoli', 'and', 'bananas', 'i', 'ate', 'a', 'banana', 'and', 'spinach', 'smoothie', 'for', 'breakfast', 'and', 'it', 'was', 'really', 'good']
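
For intuition about what the `\w+` pattern does, the same tokens can be produced with Python's built-in `re` module. This is a sketch for illustration (`simple_tokenize` is a hypothetical helper), not a replacement for nltk's tokenizers.

```python
import re

doc_A = "I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast, and it was really good."

def simple_tokenize(text):
    """Lowercase the text and return every run of letters, digits, or underscores."""
    return re.findall(r"\w+", text.lower())

tokens = simple_tokenize(doc_A)
print(tokens)
```

One consequence of this pattern is that punctuation acts as a separator, so a contraction such as "don't" is split into `don` and `t`.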


Now, similar to the text processing we performed in HW3, we want to use a stop word list to remove frequent words like "the" and "I" from the corpus, so that we're only left with words that might point to potential "topics". For example, if we looked solely at doc_A and removed the stop words from its tokens, the code and result would appear as follows:

In [32]:
from stop_words import get_stop_words

# create English stop words list
en_stop = get_stop_words('en')

print(en_stop[:10])

# remove stop words from tokens
stopped_tokens = [i for i in tokens if i not in en_stop]

print(stopped_tokens)

[u'a', u'about', u'above', u'after', u'again', u'against', u'all', u'am', u'an', u'and']
['like', 'eat', 'broccoli', 'bananas', 'ate', 'banana', 'spinach', 'smoothie', 'breakfast', 'really', 'good']


## Gensim Implementation

Putting everything together, the following code shows how to combine the preprocessing steps above with Gensim's LDA implementation.

In [51]:
import string
tokenizer = RegexpTokenizer(r'\w+')

# get a list of English Stopwords  & punctuation
stopwords = get_stop_words('en')
punc = set(string.punctuation)

# create sample documents
doc_A = "I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast, and it was really good."
doc_B = "The word broccoli comes from the Italian plural of broccolo, which means 'the flowering crest of a cabbage'."
doc_C = "Broccoli is often boiled or steamed but may be eaten raw"
doc_D = "Broccoli is a result of careful breeding of cultivated Brassica crops in the northern Mediterranean starting in about the 6th century BC."
doc_E = "The perceived bitterness of cruciferous vegetables such as broccoli varies from person to person, but the functional underpinnings of this variation are not known. " 

#combine the corpus
corp = [doc_A, doc_B, doc_C, doc_D, doc_E]

allText = []

#Clean up all the text
for doc in corp:
    
    raw = doc.lower()
    
    tokens = tokenizer.tokenize(raw)
    
    #don't include stopwords in the final allText list
    stopped_tokens = [t for t in tokens if t not in stopwords]
    #the \w+ tokenizer already drops punctuation, so this is only a safeguard
    noPunc = [t for t in stopped_tokens if t not in punc]
    
    allText.append(noPunc)

print("Clean Text")
print(allText)
bag = corpora.Dictionary(allText)

#doc2bow uses bag to turn each document in allText into a list of (wordID, wordFrequency) pairs
corpus = [bag.doc2bow(t) for t in allText]

print("Bag of Words")
print(corpus)
#because our corpus is small, we set number of topics to 2
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = bag, passes=20)

Clean Text
[['like', 'eat', 'broccoli', 'bananas', 'ate', 'banana', 'spinach', 'smoothie', 'breakfast', 'really', 'good'], ['word', 'broccoli', 'comes', 'italian', 'plural', 'broccolo', 'means', 'flowering', 'crest', 'cabbage'], ['broccoli', 'often', 'boiled', 'steamed', 'may', 'eaten', 'raw'], ['broccoli', 'result', 'careful', 'breeding', 'cultivated', 'brassica', 'crops', 'northern', 'mediterranean', 'starting', '6th', 'century', 'bc'], ['perceived', 'bitterness', 'cruciferous', 'vegetables', 'broccoli', 'varies', 'person', 'person', 'functional', 'underpinnings', 'variation', 'known']]
Bag of Words
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)], [(4, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)], [(4, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(4, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1)], [(4, 1), (38, 1), (39, 1),
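
To see where the `(wordID, wordFrequency)` pairs come from, the same bag-of-words structure can be built by hand. This is a sketch for illustration only: `build_bow` is a hypothetical helper that numbers words in order of first appearance, so its ids will not necessarily match the ones Gensim assigned above.

```python
from collections import Counter

def build_bow(tokenized_docs):
    """Assign each distinct word an integer id and count the words per document."""
    word_ids = {}
    corpus = []
    for tokens in tokenized_docs:
        for word in tokens:
            if word not in word_ids:
                word_ids[word] = len(word_ids)
        counts = Counter(word_ids[word] for word in tokens)
        corpus.append(sorted(counts.items()))
    return word_ids, corpus

docs = [["broccoli", "boiled", "steamed"],
        ["broccoli", "varies", "person", "person"]]

word_ids, corpus = build_bow(docs)
print(word_ids)
print(corpus)
```

The word "person" occurs twice in the second document, so its pair carries a frequency of 2, while "broccoli" keeps the same id in both documents.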

In [52]:
print(ldamodel.show_topics())

[(0, u'0.062*broccoli + 0.045*person + 0.027*perceived + 0.027*underpinnings + 0.027*variation + 0.027*cruciferous + 0.027*known + 0.027*result + 0.027*vegetables + 0.027*bitterness'), (1, u'0.056*broccoli + 0.033*breakfast + 0.033*eat + 0.033*bananas + 0.033*banana + 0.033*really + 0.033*ate + 0.033*good + 0.033*spinach + 0.033*smoothie')]
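
Each topic comes back as a single string of `weight*word` terms. Note that "broccoli" leads both topics, which is expected since it appears in every document. If you want to work with the weights as numbers, a small parser is enough. The helper below is a hypothetical sketch based only on the string format shown above; it also strips double quotes in case a Gensim version wraps each word in them.

```python
def parse_topic(topic_string):
    """Turn '0.062*broccoli + 0.045*person' into [('broccoli', 0.062), ('person', 0.045)]."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

topic_0 = "0.062*broccoli + 0.045*person + 0.027*perceived"
print(parse_topic(topic_0))
```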


## SciKit Learn Implementation 

Another frequently used library for LDA is scikit-learn. In this case, we'll be using the 20 newsgroups text dataset, a standard dataset that scikit-learn provides so that users can test out its algorithms. This implementation relies on sklearn.feature_extraction.text.CountVectorizer to turn the documents into word counts, and it filters out English stop words by setting the vectorizer's stop_words parameter. A basic implementation of LDA using scikit-learn would be:

In [49]:
# Initialize variables
samples = 2500
features = 1000
topics = 10

dataset = fetch_20newsgroups(shuffle=True, random_state=1,remove=('headers', 'footers', 'quotes'))
d = dataset.data[:samples]
rawData = dataset.data

#Set up vectorizer
vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=features,stop_words='english')

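#Transform and create model
# note: despite the variable name, these are raw term counts, not tf-idf weights
tfidf = vectorizer.fit_transform(d)

# reuse the fitted vocabulary; refitting here would change the feature indices the model is trained on
dtm_tf = vectorizer.transform(rawData)

# newer scikit-learn releases name this parameter n_components instead of n_topics
ldasck_model = LatentDirichletAllocation(n_topics=topics, max_iter=5,learning_method='online', learning_offset=50.,random_state=0)

ldasck_model.fit(tfidf)

# newer scikit-learn releases provide get_feature_names_out() instead
names = vectorizer.get_feature_names()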

print("Feature Names")
print (names)

print("Shape of Model")
print(tfidf.shape)

doc_topic_distrib = ldasck_model.transform(tfidf)

print("Probability DocTopic Distribution")
print(doc_topic_distrib)

def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Num.%d:' % int(topic_id + 1)) 
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              +' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 4

print("Topic Distribution of Top Words per Topic")
print_top_words(ldasck_model, names, n_top_words)   

Feature Names
[u'00', u'000', u'01', u'02', u'03', u'04', u'0d', u'0t', u'10', u'100', u'11', u'12', u'128', u'13', u'14', u'145', u'15', u'16', u'17', u'18', u'19', u'1990', u'1991', u'1992', u'1993', u'1d9', u'1st', u'1t', u'20', u'200', u'21', u'22', u'23', u'24', u'25', u'250', u'26', u'27', u'28', u'29', u'2di', u'2tm', u'30', u'300', u'31', u'32', u'33', u'34', u'34u', u'35', u'36', u'37', u'38', u'39', u'3d', u'3t', u'40', u'42', u'43', u'44', u'45', u'50', u'500', u'55', u'60', u'64', u'6ei', u'70', u'75', u'75u', u'7ey', u'7u', u'80', u'800', u'86', u'90', u'91', u'92', u'93', u'9v', u'a86', u'able', u'ac', u'accept', u'access', u'according', u'act', u'action', u'actually', u'add', u'addition', u'address', u'administration', u'advance', u'age', u'ago', u'agree', u'ah', u'air', u'al', u'algorithm', u'allow', u'allowed', u'alt', u'america', u'american', u'analysis', u'anonymous', u'answer', u'answers', u'anti', u'anybody', u'apparently', u'appears', u'apple', u'application', u'a
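
One detail worth knowing is that the rows of `components_` in scikit-learn are unnormalized pseudo-counts rather than probabilities, so the values printed by `print_top_words` do not sum to 1. Dividing each row by its sum gives a proper word distribution per topic. The sketch below uses a small made-up matrix in place of `ldasck_model.components_`; the words and numbers are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for ldasck_model.components_: 2 topics x 5 words.
feature_names = ["space", "nasa", "car", "engine", "game"]
components = np.array([[9.0, 6.0, 0.5, 0.5, 4.0],
                       [0.2, 0.3, 7.0, 5.5, 2.0]])

# Divide each row by its sum so that every topic becomes a probability distribution.
topic_word = components / components.sum(axis=1)[:, np.newaxis]

def top_words(topic_row, names, n_top_words):
    """Return the n_top_words highest-weighted (word, weight) pairs, largest first."""
    # argsort is ascending, so walk backwards from the end of the array.
    order = topic_row.argsort()[:-n_top_words - 1:-1]
    return [(names[i], round(float(topic_row[i]), 3)) for i in order]

for topic_id, row in enumerate(topic_word):
    print("Topic {}: {}".format(topic_id + 1, top_words(row, feature_names, 2)))
```

The slice `[:-n_top_words - 1:-1]` is the same one used in `print_top_words` above: it reads the sorted indices from the end and stops after `n_top_words` entries.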

## LDA Implementation

As detailed in its documentation at https://pythonhosted.org/lda/, lda is a really easy library to work with, and it accomplishes the same topic modeling tasks that we've seen so far. It also makes it very convenient to look through the fitted model and find the topics.

In [50]:


#gets data from one of lda's bundled datasets
data = lda.datasets.load_reuters()

#Setup and Fit the Model
lda_model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
lda_model.fit(data)  

#Loading in the vocab and the titles
corp_vocabulary = lda.datasets.load_reuters_vocab()
corp_titles = lda.datasets.load_reuters_titles()


print("Corpus Size")
print (data.shape)
print()

t_word_distribution = lda_model.topic_word_ # in other libraries, this is the same as model.components_

print("Topic Word Distribution\n")
print (t_word_distribution)

n_top_words = 8

print("Words per Topic")
for i, topic in enumerate(t_word_distribution):
    # note: this slice returns n_top_words - 1 words per topic (7 here)
    topic_words = np.array(corp_vocabulary)[np.argsort(topic)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

    
doc_topic = lda_model.doc_topic_
print

print

print("Top Topics for the First 20 Documents in the Corpus")
for i in range(20):
    print("{} (Topic: {})".format(corp_titles[i], doc_topic[i].argmax()))

Corpus Size
(395, 4258)
()
Topic Word Distribution

[[  3.62505347e-06   3.62505347e-06   3.62505347e-06 ...,   3.62505347e-06
    3.62505347e-06   3.62505347e-06]
 [  1.87498968e-02   1.17916463e-06   1.17916463e-06 ...,   1.17916463e-06
    1.17916463e-06   1.17916463e-06]
 [  1.52206232e-03   5.05668544e-06   4.05040504e-03 ...,   5.05668544e-06
    5.05668544e-06   5.05668544e-06]
 ..., 
 [  4.17266923e-02   3.93610908e-06   9.05698699e-03 ...,   3.93610908e-06
    3.93610908e-06   3.93610908e-06]
 [  2.37609835e-06   2.37609835e-06   2.37609835e-06 ...,   2.37609835e-06
    2.37609835e-06   2.37609835e-06]
 [  3.46310752e-06   3.46310752e-06   3.46310752e-06 ...,   3.46310752e-06
    3.46310752e-06   3.46310752e-06]]
Words per Topic
Topic 0: british churchill sale million major letters west
Topic 1: church government political country state people party
Topic 2: elvis king fans presley life concert young
Topic 3: yeltsin russian russia president kremlin moscow michael
Topic 4: pop
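
The loop above reports only the single most probable topic per document via `argmax`. Since each row of the document-topic matrix is a full probability distribution, it is often informative to look at the runner-up topics too. The sketch below uses a small made-up matrix in place of `lda_model.doc_topic_`; `top_topics` is a hypothetical helper.

```python
import numpy as np

# Hypothetical stand-in for lda_model.doc_topic_: 3 documents x 4 topics, rows sum to 1.
doc_topic = np.array([[0.70, 0.10, 0.15, 0.05],
                      [0.05, 0.45, 0.40, 0.10],
                      [0.28, 0.22, 0.20, 0.30]])

def top_topics(row, k):
    """Return the k most probable (topic_id, probability) pairs, most probable first."""
    order = np.argsort(row)[::-1][:k]
    return [(int(t), float(row[t])) for t in order]

for i, row in enumerate(doc_topic):
    print("doc {}: {}".format(i, top_topics(row, 2)))
```

In this made-up example the second document is split almost evenly between topics 1 and 2, which `argmax` alone would hide.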

## Visualization

Being able to get a topic model from a collection of documents is useful, but as it stands, the numerical output isn't easy to process and understand. Moreover, the larger the collection of documents, the greater the number of topics, and the harder it becomes to differentiate between the probabilities. A useful way to understand the results of a model is therefore to visualize them. One library that does a fantastic job of this is pyLDAvis. It is a port of LDAvis, which was originally written for R and has since been adapted for other languages. The visualization is interactive and ranks the 30 most salient terms for ease of use. Hovering over any of the terms in the list lets you view a dynamic visualization of the intertopic distribution of your chosen term across the topics.

In [46]:
import pyLDAvis
import pyLDAvis.sklearn

sklearnDisplay = pyLDAvis.sklearn.prepare(ldasck_model, dtm_tf, vectorizer)

pyLDAvis.display(sklearnDisplay)


pyLDAvis makes it incredibly easy to display a visualization of topic modeling data. Above was the scikit-learn implementation, but it is also compatible with Gensim.
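
The ranking of "most salient terms" can be made a little less mysterious with a short computation. The sketch below follows the saliency definition commonly cited for LDAvis (a term's overall probability multiplied by how much knowing the term changes the topic distribution); treat it as an illustration under that assumption, not as pyLDAvis's actual code. The function name and the input matrices are hypothetical, and every term is assumed to have a nonzero overall probability.

```python
import numpy as np

def term_saliency(topic_word, topic_weights):
    """topic_word: (topics x terms) with rows summing to 1. topic_weights: P(T), summing to 1."""
    joint = topic_word * topic_weights[:, np.newaxis]  # P(T, w)
    term_prob = joint.sum(axis=0)                       # P(w)
    topic_given_term = joint / term_prob                # P(T | w)
    # Distinctiveness of w: sum over T of P(T | w) * log(P(T | w) / P(T)).
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(topic_given_term > 0,
                             np.log(topic_given_term / topic_weights[:, np.newaxis]),
                             0.0)
    distinctiveness = (topic_given_term * log_ratio).sum(axis=0)
    return term_prob * distinctiveness

# Two equally weighted topics over three terms.
# Term 0 is shared by both topics; terms 1 and 2 each belong to one topic only.
topic_word = np.array([[0.5, 0.5, 0.0],
                       [0.5, 0.0, 0.5]])
topic_weights = np.array([0.5, 0.5])

saliency = term_saliency(topic_word, topic_weights)
print(saliency)
```

The shared term gets a saliency of zero even though it is the most frequent one, which mirrors how a word like "broccoli" that appears in every topic tells you little about which topic you are looking at.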

## References

This tutorial only scratched the surface of topic modeling and LDA, but there are plenty of resources and libraries online that are worth exploring to gain a deeper understanding. Links to the documentation of some of the tools we've used in this tutorial are listed below. 

Gensim: https://radimrehurek.com/gensim/models/ldamodel.html
Scikit Learn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
lda: http://pythonhosted.org/lda/api.html
pyLDAvis: https://pyldavis.readthedocs.io/en/latest/


For more details about the LDA algorithm itself, the 2003 paper by Blei, Ng, and Jordan, hosted by Princeton University, breaks down the derivation of the model pretty intuitively: https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf