# Introduction to LDA
In this tutorial, we will learn about the text-mining aspect of an algorithm called Latent Dirichlet Allocation, or LDA for short. We have already spent a decent amount of time learning about text mining in this class, so I won't go over those basics now, but have a look at this wikipedia page if you would like a refresher: [Text Mining](https://en.wikipedia.org/wiki/Text_mining "link to wikipedia page")

Latent Dirichlet Allocation was developed by David Blei, Andrew Ng, and Michael I. Jordan, and was published in their 2003 paper about topic discovery called [Latent Dirichlet Allocation](http://jmlr.csail.mit.edu/papers/v3/blei03a.html). 

As per the wikipedia definition: 
>Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.


<img style="float: right;" src="k-means.png" width="200">
In other words, LDA attempts to guess the topic of a document based on the words that are contained within that document and based on other documents within the same corpus that have similar words. 

LDA can be thought of as a type of clustering algorithm, where each observation is classified as one of a number of classifications. 

In K-Means, all observations are classified into mutually exclusive groups, as shown in the picture to the right. LDA differs from K-means in this respect however, because each observation (word) will be assigned to a topic using a continuous scale, so each word can (and often does) belong to many different topics. This ensures that certain documents share topics with other documents within the same corpus (which is to be expected. If you pass in all of Shakespeare's works into a topic model, you would expect to see recurring themes throughout his work). 

### LDA 
The LDA model can be described using the plate notation picture shown below:
<img src="LDA plate.png" width="500">
In this diagram, 
* ${D}$ represents the number of documents (in the example below, Tweets)
* ${N}$ represents the number of words in a document
* ${K}$ represents the number of topics
* ${\eta}$ represents the topic hyperparameter
* ${\alpha}$ represents a parameter of the Dirichlet distribution
* ${\beta}$ represents the distribution of topics
* ${\theta}$ represents the vector of topic proportions within a document ${d}$ within ${D}$
* ${Z}$ represents the distribution of topics for a given word ${n}$ in a document ${d}$
* ${W}$ represents a word found within a document ${d}$

We will use the same names in the code below. To begin, before we have trained the model, all we have is the last parameter ${W}$, or the words within documents. LDA infers the topic "structure" and distribution based on the frequency and co-occurance of words within and between documents. All variables besides ${W}$, ${\alpha}$, and ${\eta}$ are latent, meaning that they are inferred based on the data that the model is given (${i.e.}$ the words and the documents that those words are found in). 

>The generative process is as follows. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for a corpus ${D}$ consisting of ${D}$ documents each of length ${N}$
1. Choose ${\theta_i \sim Dirichlet(\alpha)}$ where ${i \in \{1 \dotsc D\}}$
2. Choose ${\eta_k \sim Dirichlet(\beta)}$ where ${k \in \{1 \dotsc K\}}$
3. For each word position ${i,j}$
   * choose a topic ${z_i,_j \sim Multinomial(\theta_i)}$
   * choose a word ${w_i,_j \sim Multinomial(\eta_i,_j)}$

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Unlike other parameters, ${\alpha}$ and ${\eta}$ can be manually manipulated:
* ${\alpha}$< 1 makes fewer topics, ${\alpha}$ > 1 more topics per doc
* ${\eta}$< 1 less similarity within topic, ${\eta}$ > 1 more similarity within topic

### Variational Bayes
Learning the distributions ${\beta}$ and ${\theta}$ is a Bayesian inference problem. 
Learning the various distributions (the set of topics, their associated word probabilities, the topic of each word, and the particular topic mixture of each document) is a problem of Bayesian inference. There are several options for Bayesian inference, including Gibbs Sampling, and Expectation Propogation. In this implementation we'll use Variational Bayes to iteratively estimate the posterior distribution. Variational Bayes is used in this circumstance as a means to approximate the posterior probability of the unobserved topics. 

# Let's get started
I have pulled a little more than 300 tweets that contain "#debates" (using the Twitter API and tweepy), which is what I'll be using to demonstrate LDA. You can do the same if you like. Simply go to [https://apps.twitter.com/](https://apps.twitter.com/), request API keys, and off you go. 

To create a topic model using LDA, we'll need to first start by cleaning the data. Much of the code you see below is from the text_classification homework. I have created a function below called clean_up, which will take in my data (one row (tweet) at a time), and clean it up:
1. Convert all the text to lower-case, remove "'s" from all the words (it's to it; this helps to identify stop words, and generally makes better features), get rid of punctuation, etc.
2. Tokenize the words - each word will be considered independently (in context) within the topic model
3. Lemmatize the words - remove conjugations and other variations of words to get words to their infinitive form - makes for better features
4. Stem the words using Porter Stemmer. This will reduce the words down to their roots, which makes the features a little hard to understand sometimes, but groups similar tokens to make for better features

In order to run the code below, you may need to install a couple of packages:
* ```pip install lda```
* ```pip install pyldavis```

### Clean the data

In [19]:
import nltk
from nltk.corpus import stopwords
import string
from nltk.stem.porter import PorterStemmer
import sklearn as sk
import lda
import numpy as np
import math

# filtered_words = [word for word in word_list if word not in stopwords.words('english')]

tweets = []

def clean_up(text):       
#     print text
    lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
    text = text.lower()
    text = text.replace("'s", '')
    text = text.replace("'", '')
    replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
    text = text.translate(replace_punctuation)
    tokens = nltk.word_tokenize(text)
    
    lemmatized = []
    
    
    stopwords= set(nltk.corpus.stopwords.words('english'))    

    for i in tokens:
        try:
            lemma = lemmatizer.lemmatize(i)
        except UnicodeDecodeError:
            continue
        lemmatized.append(lemma)
    stopped_tokens = [i for i in lemmatized if not i in stopwords]
    
#     Create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()
    features = [p_stemmer.stem(i) for i in stopped_tokens]
    
    return stopped_tokens
    pass


### Format data
Now we'll call the clean_up function and pass in the data, tweet by tweet. Clean_up outputs a list of strings (a list of tokens, one list per tweet). We'll take that output, and create another list below. Ultimately, we end up with <code>features</code>, which is a list of lists that contain each of the tokens, grouped by tweet. 

In [20]:
features = []

with open("output.txt") as tweets:
    lines = tweets.readlines()
    
for line in lines:
    line_s = str(line)
    features.append(clean_up(line_s))    

#put the features back into a format LDA can handle:

features_l = [' '.join(term) for term in features]
print "features_l:\n", features_l

features_l:
[u'politico mike penny http co udnyb9fshg', 'rt polltakerguy next president please retweet thank maga trump tv debate iregistered teamtrump debatenight', u'ok realdonaldtrump govpencein carry debate imwithher angela rye http co acd7hjfysb', u'called allergy going thru box tissue day cnn news maybe track awol http co 7mpjvjfw4h', u'real http co arrloguzlo', u'earnest candidate way debate accusing hillary clinton http co dglg6skmny', u'rt blueprint trump campaign ha something awful movie check watch yup end scarface', u'rt peggy7172 trump never drink smoked cigarette drug', '', u'guess obama cocaine habit grab http co', u'rt bpolitics josh earnest mock donald trump saying snorted way first two debate http co 9zerlzty0j http', u'rt jeanettejing hillary2016 debate trump2016 immigration http co kqthqaadps', u'rt cspan presssec joke trump candidate snorted way first two debate accusing', u'rt bpolitics josh earnest mock donald trump saying snorted way first two debate http co 9ze

### Additional Formatting

Now that we have all the tokens organized and cleaned up, let's create a document term matrix. This is a matrix of document ids along one axis, and terms along the other. It is composed of the number of a given term in a given document (tweets in this case).  First, let's loop through the tweets (they're currently organized in a list of lists: [[features in tweet 1],[features in tweet 2],...,[features in tweet n]]), and create the dtm by looping through the values. 

In [21]:
token_dict = {}

for i in range(len(features_l)):
    token_dict[i] = features_l[i]

print len(token_dict)
print token_dict

dtm = sk.feature_extraction.text.CountVectorizer(stop_words='english') # dtm
dtm_fit = dtm.fit_transform(token_dict.values())                       # fit dtm

312
{0: u'politico mike penny http co udnyb9fshg', 1: 'rt polltakerguy next president please retweet thank maga trump tv debate iregistered teamtrump debatenight', 2: u'ok realdonaldtrump govpencein carry debate imwithher angela rye http co acd7hjfysb', 3: u'called allergy going thru box tissue day cnn news maybe track awol http co 7mpjvjfw4h', 4: u'real http co arrloguzlo', 5: u'earnest candidate way debate accusing hillary clinton http co dglg6skmny', 6: u'rt blueprint trump campaign ha something awful movie check watch yup end scarface', 7: u'rt peggy7172 trump never drink smoked cigarette drug', 8: '', 9: u'guess obama cocaine habit grab http co', 10: u'rt bpolitics josh earnest mock donald trump saying snorted way first two debate http co 9zerlzty0j http', 11: u'rt jeanettejing hillary2016 debate trump2016 immigration http co kqthqaadps', 12: u'rt cspan presssec joke trump candidate snorted way first two debate accusing', 13: u'rt bpolitics josh earnest mock donald trump saying sn

### Create Model

Now that we have the Document Term Matrix (dtm), we're ready to build the model! We only need to select a couple of parameters: number of topics and number of words to display for each topic. For the number of topics, we'll start with a nebulous k:
$$k	= \sqrt{\frac{n}{2}}$$
where n is the number of documents in a corpus, in this case, the number of tweets. 


In [22]:
k = int(math.ceil(math.sqrt(len(lines)/2)))              # set the number of topics to look for

model = lda.LDA(n_topics=k, n_iter=1500, random_state=1) # create model
dtm_tf = model.fit_transform(dtm_fit)                    # fit dtm to LDA model
topic_word = model.topic_word_                           # get words that have a high probability in a given topic
vocab = dtm.get_feature_names()                          # feature names

w_t = 10                                                 # the number of words to display from each topic
print "number of topics:",k
print "number of words to display per topic:",w_t, "\n"

#print out topics with the words that compose the topics
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(w_t+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

number of topics: 13
number of words to display per topic: 10 

Topic 0: trump drug test http way presssec debate candidate rt snorted
Topic 1: rt debate tv president debatenight thank teamtrump maga trump iregistered
Topic 2: http rt talk going issue fired went stance parliament shot
Topic 3: http trump rt way house white snorted debate donald spokesman
Topic 4: rt ha like watch campaign trump end check msnbc movie
Topic 5: http debate josh press secretary earnest trumptv question cnnpolitics joke
Topic 6: candidate snorted debate way rt accusing presssec taking drug trump
Topic 7: http debate wednesday imwithher called news ou number immigration election2016
Topic 8: debate los rt en la el hoy medium sin para
Topic 9: http debate rt realdonaldtrump hillaryclinton say email coke rushing erinmcunningham
Topic 10: http debate saying josh earnest bpolitics 9zerlzty0j mock donald 2016
Topic 11: rt hillary said clinton amp obama aarpnv chair damian grant
Topic 12: rt candidate debate earne

### Visualize the Model

Next we'll visualize our model. We can do this through the sklearn library. We call the prepare function, which transforms and prepares a LDA model’s data for visualization. In creating the LDA model, Document Term Matrix, and the DTM fit to the LDA model, we have done most of the preparation already. The remaining parameters we'll leave at their defaults for the moment to see how well the visualization looks. 

Before we visualize our model, let's have a look at the parameters we're passing into the visualization:

In [23]:
print "\nvocab[100:120]:\n",vocab[100:120]      # unique words from our tweets. 100-120 is just a sample that makes some sense          
print "dtm_tf:\n",dtm_tf, "\n"                  # probability of a word belonging to a topic columns are topics, rows are words


vocab[100:120]:
[u'brazile', u'brother', u'bu', u'build', u'bumper', u'businessinsider', u'busted', u'ca', u'cabe', u'called', u'calling', u'camera', u'camp', u'campaign', u'candidate', u'cara', u'card', u'carry', u'carta', u'case']
dtm_tf:
[[ 0.33333333  0.33333333  0.17460317 ...,  0.01587302  0.01587302
   0.01587302]
 [ 0.0075188   0.90977444  0.0075188  ...,  0.0075188   0.0075188
   0.0075188 ]
 [ 0.00884956  0.00884956  0.00884956 ...,  0.00884956  0.00884956
   0.00884956]
 ..., 
 [ 0.00813008  0.00813008  0.00813008 ...,  0.00813008  0.00813008
   0.7398374 ]
 [ 0.92638037  0.00613497  0.00613497 ...,  0.00613497  0.00613497
   0.00613497]
 [ 0.0075188   0.90977444  0.0075188  ...,  0.0075188   0.0075188
   0.0075188 ]] 



Now, let's visualize. We'll use a package originally created for R called LDAvis. Below you will find each topic plotted on a graph with the first two principal components as the axes. The size of the circles indicates the relative size of the topic within the corpus. 

In [24]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(model, dtm_fit, dtm)

### Refine the Results 

As you can tell from the terms that compose our clusters, we have some more cleaning up to do. Words like "http" and "rt" shouldn't have made it this far in the analysis, but it is easy to see in LDAvis that we should remove them to get a clearer picture of the topics and topic distributions. Additionally, I'm going to remove spanish words using the nltk list of stopwords, since the majority of the tweets are in english, and all of the spanish words are getting grouped together. I'm going to remove any terms that have a number in them, since most of the features that contain numbers are part a url that has been broken into pieces instead of useful features. Last but not least, there is clustering of clusters here, which may mean there are naturally fewer topics than I have specified for the model to find, so I'll reduce that number and check the results empirically. 

In [25]:
tweets = []

def more_cleaning(text):       
#     print text
    lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
    text = text.lower()
    text = text.replace("'s", '')
    text = text.replace("'", '')
    replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
    text = text.translate(replace_punctuation)
    tokens = nltk.word_tokenize(text)
    
    lemmatized = []   
    
    stopwords= set(nltk.corpus.stopwords.words('english'))    
    stopwords.update('rt','retweet','amp','http')
    stopwords.update(nltk.corpus.stopwords.words('spanish'))
    
        
    for i in tokens:
        try:
            lemma = lemmatizer.lemmatize(i)
        except UnicodeDecodeError:
            continue
        lemmatized.append(lemma)
    stopped_tokens = [i for i in lemmatized if not i in stopwords]
    
#     Create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()
    features = [p_stemmer.stem(i) for i in stopped_tokens]
    
    return stopped_tokens
    pass

features = []

with open("output.txt") as tweets:
    lines = tweets.readlines()
    
for line in lines:
    line_s = str(line)
    features.append(clean_up(line_s))    

#put the features back into a format LDA can handle:
features_l = [' '.join(term) for term in features]

token_dict = {}

for i in range(len(features_l)):
    token_dict[i] = features_l[i]

dtm = sk.feature_extraction.text.CountVectorizer(stop_words='english') # dtm
dtm_fit = dtm.fit_transform(token_dict.values())                       # fit dtm



In [26]:
k = 6                                                    # based on the LDA vis output from before

model = lda.LDA(n_topics=k, n_iter=1500, random_state=1) # create model
dtm_tf = model.fit_transform(dtm_fit)                    # fit dtm to LDA model
topic_word = model.topic_word_                           # get words that have a high probability in a given topic
vocab = dtm.get_feature_names()                          # feature names

pyLDAvis.sklearn.prepare(model, dtm_fit, dtm)

### More Visualization

You'll notice now that the clusters are more independent than before. There is more separation between them, though there is still some overlap. The features have been cleaned up, but there is yet cleaning to do. After some additional cleaning, it was apparent that the tweets no longer produce 13 different topics, but only about 6. Now we have 6 topics that do a pretty good job of describing the set of tweets that we've given to it.

However, now that the model is built, we can do more to visualize the results. The first method is to find for each topic, the tweet that best represents it. I'll first print off the first 5 tweets as they appeared before processing (to remind you of how far we've come), and then we'll print off the tweet that has the highest probability of "belonging" to a topic. 

In [27]:
# from http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html

doc_topic = model.doc_topic_
print lines[:5]

for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n{}...".format(n,
                                            topic_most_pr,
                                            lines[n][:50]))

['@politico @mike_pence https://t.co/UDnYb9fSHG\n', 'RT @polltakerguy: Who is our next president? PLEASE RETWEET! THANK YOU! #MAGA Trump TV #debate #iRegistered @TeamTrump #debatenight #debate\xe2\x80\xa6\n', "It's ok @realDonaldTrump @GovPenceIN will carry you #debates #ImWithHer @angela_rye https://t.co/aCD7hjfySb\n", "It's called allergies, I am going thru a box of tissues a day @CNN HOW IS THIS NEWS? Maybe track down AWOL Hillary\xe2\x80\xa6 https://t.co/7MPJvJFW4h\n", 'Is this for real?! \xf0\x9f\x98\xb3 https://t.co/ARRLOGUZLo\n']
doc: 0 topic: 0
@politico @mike_pence https://t.co/UDnYb9fSHG
...
doc: 1 topic: 1
RT @polltakerguy: Who is our next president? PLEAS...
doc: 2 topic: 2
It's ok @realDonaldTrump @GovPenceIN will carry yo...
doc: 3 topic: 0
It's called allergies, I am going thru a box of ti...
doc: 4 topic: 5
Is this for real?! 😳 https://t.co/ARRLOGUZLo
...
doc: 5 topic: 4
Earnest: Candidate Who “Snorted His Way Through�...
doc: 6 topic: 2
RT @the_blueprint: trump's campa

Some interesting results, to say the least; Twitter is a wild and crazy place. 

### "Validate" with Another Model

Let's build the model using another method to see if we get similar results. This will serve as a means to pseudo-validate the model(s). This time around, we'll use gensim, which requires a slightly different structure for the inputs. 

In [28]:
from gensim import corpora, models
import gensim.models.ldamodel as lda

vocab_g = corpora.Dictionary(features)
dtm_g = [vocab_g.doc2bow(feature) for feature in features]
print(dtm_g[0])


[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]


Next, we'll build the model again using the newly formatted data. We'll use the same parameters from before where possible. 

In [29]:
print dtm_g[:1]
ldamodel = lda.LdaModel(dtm_g, num_topics=k, id2word = vocab_g, passes=30)
ldamodel.print_topics(num_topics=k, num_words=10)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]]


[(0,
  u'0.076*candidate + 0.050*debate + 0.046*rt + 0.040*accusing + 0.035*way + 0.035*snorted + 0.028*1st + 0.025*two + 0.023*earnest + 0.022*w'),
 (1,
  u'0.072*co + 0.071*http + 0.050*debate + 0.042*rt + 0.035*white + 0.035*house + 0.032*way + 0.032*trump + 0.030*first + 0.029*two'),
 (2,
  u'0.017*debate + 0.011*v + 0.011*curious + 0.011*jackposobiec + 0.006*rt + 0.006*amp + 0.006*realdonaldtrump + 0.006*question + 0.006*longer + 0.006*cant'),
 (3,
  u'0.032*co + 0.032*http + 0.025*debate + 0.024*rt + 0.013*ha + 0.011*campaign + 0.008*trump + 0.007*watch + 0.007*something + 0.007*scarface'),
 (4,
  u'0.070*http + 0.060*co + 0.054*debate + 0.052*rt + 0.050*trump + 0.035*way + 0.034*snorted + 0.033*drug + 0.031*presssec + 0.031*test'),
 (5,
  u'0.046*rt + 0.045*debate + 0.033*trump + 0.030*president + 0.030*tv + 0.028*debatenight + 0.028*maga + 0.028*teamtrump + 0.027*thank + 0.027*polltakerguy')]

Above, the words that compose each topic are contained within tuples, which are in turn contained within a list. The output from this model is slightly different from the previous one in terms of the terms that are chosen for each topic, but the topics remain about the same.  Before, one topic was about Hillary Clinton generally, one about "snorting", one about Donald Trump generally, etc, which is roughly the same output as we see above. 

A benefit of the ```gensim``` pacakge is that it outputs the likelihood of each term belonging to a topic, which gives us a little more insight into why each term was chosen for a given topic. 

This concludes your crash course of Latent Dirichlet Allocation in python. 

For additional research, I recommend:
* David Blei's article (written in layman's terms): https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
* a two-part lecture series on LDA: http://videolectures.net/mlss09uk_blei_tm/
