# Introduction
Latent Dirichlet allocation (LDA) is particularly useful for finding reasonably accurate mixtures of topics within a given document set. We will explore generating an LDA model on a small sample data set as well as on a large data set of tweets from New York Times Health News. We will compare the outputs from two different LDA implementations, one from Gensim and one from scikit-learn. The LDA model will help us find the most common topic words in health news.
### Tutorial content
In this tutorial, we will build a basic LDA model in Python, specifically using Gensim and scikit-learn. We'll be using a delimited text file of data collected from Twitter, obtained from the UCI ML Repository [here](https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter).

We will cover the following topics in this tutorial:

- [Understanding LDA](#understanding) <br><br>
- [Building an LDA Model](#building)
    - [Libraries Required](#libraries)
    - [Importing and Cleaning your Documents](#cleaning) (Tokenization, Stop words, Stemming)
    - [Constructing a document-term matrix](#matrix)<br><br>
- [Applying the LDA model](#applying) 
    - [Gensim](#gen)
    - [Scikit-learn](#scikit)<br><br>
- [Examining the Results](#conclusion)

<a id='understanding'></a>

# Understanding Latent Dirichlet Allocation (LDA)
LDA is a generative statistical model, meaning that it describes a random process by which the observed data could have been produced. The most important aspect of LDA is that it allows observations to be explained by unobserved groups. In an NLP application, this means that the model looks through documents, finds words that tend to occur together, and returns groups of words that belong to a similar topic. However, the model does not know what the topics relating the words actually are. The topics are thus unobserved groups that explain why some parts of the data are similar. With the exception of the actual words, all of the other variables are unobserved, or *latent*. Hence the name "Latent Dirichlet Allocation".

### A Simple Explanation
LDA assumes that each document's mixture of topics is drawn from a Dirichlet distribution. A Dirichlet distribution is a multivariate generalization of the beta distribution and is used to model probabilities over two or more mutually exclusive events. More information on this distribution can be found [here](https://en.wikipedia.org/wiki/Dirichlet_distribution). For example, a document can be 30% about school and 70% about eating. Every word in the document is attributed to one of these topics. Additionally, documents are assumed to have a sparse Dirichlet prior, meaning that each document covers only a few topics and each topic uses only a few words frequently. The nature of the LDA model means topics are defined quite loosely, purely by the likelihood that terms co-occur.

Just as a document is a distribution over topics, a topic is a distribution over words:<br>
The topic "school" could have the following distribution: 40% "study", 30% "homework", 5% "food", 5% "dinner", ... <br>
The topic "eating" could have the following distribution: 5% "study", 5% "homework", 40% "food", 30% "dinner", ...<br>
It is important to note that the words pertaining to the other topic are also included. All the words are in every topic but the distribution is what defines the topic.  

Thus, given the above assumptions about the composition of documents, LDA works backwards to figure out which topics would have created the input documents in the first place.
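To make the mixture idea concrete, the toy numbers above can be combined into the probability of seeing a particular word in the 30% school / 70% eating document. This is only a sketch: the `"other"` bucket is an assumption added here so that each topic sums to 1 (the original lists end in "..."), and `word_probability` is a hypothetical helper name.

```python
# Toy topic-word distributions from the example above.
# "other" is an assumed catch-all bucket so that each topic sums to 1.
topics = {
    "school": {"study": 0.40, "homework": 0.30, "food": 0.05, "dinner": 0.05, "other": 0.20},
    "eating": {"study": 0.05, "homework": 0.05, "food": 0.40, "dinner": 0.30, "other": 0.20},
}

# Document-topic mixture: 30% school, 70% eating.
doc_mixture = {"school": 0.30, "eating": 0.70}

def word_probability(word, mixture, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * topics[topic][word] for topic, weight in mixture.items())

for word in ["study", "homework", "food", "dinner"]:
    print(word, round(word_probability(word, doc_mixture, topics), 3))
```

For "food", this works out to 0.30 × 0.05 + 0.70 × 0.40 = 0.295, so the word is likely in this document mostly because of the "eating" topic.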

### A Mathematical Explanation
Let $\theta_d$ represent the distribution over topics for document $d$. Each word in the document is assigned a topic. Let $N_d$ be the number of words in document $d$, and let $z_{dn}$ be the topic of the $n$-th word in document $d$. Thus $z_{dn} \in \{1, \dots, T\}$, meaning $z_{dn}$ can take a value from $1$ to $T$, where $T$ is the number of topics we are trying to find in the corpus. The words $w_{dn}$ are then sampled from the corresponding topics. Thus $w_{dn} \in \{1, \dots, V\}$, meaning $w_{dn}$ can take a value from $1$ to $V$, where $V$ is the size of the vocabulary. These dependencies create the following Bayesian network (in plate notation):
<img src="Bayesian Dependency.png" width="300" height="600">
The diagram shows that $\theta$ is repeated $D$ times, once for each document. $\theta_d$ then determines the topics $z$, which in turn determine the words $w$; both are repeated $N_d$ times, once for each word in the document. <br><br>
#### The Model
The joint probability for $W, Z, \theta$ is given below:

$p(W,Z,\Theta) = {\displaystyle \prod_{d=1}^{D} p(\theta_d)} {\displaystyle \prod_{n=1}^{N_d} p(z_{dn}|\theta_d)p(w_{dn}| z_{dn})}$

This means that for each document ($1$ to $D$), we generate topic probabilities, $p(\theta_d)$. Then for each word in the document ($1$ to $N_d$), we select a topic, $p(z_{dn}|\theta_d)$. Lastly, we select a word from this topic, $p(w_{dn}| z_{dn})$.<br><br>
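The three steps of this generative story can be sketched in a few lines of standard-library Python. The function names, the value of `alpha`, the four-word vocabulary and the topic-word matrix `phi` below are all illustrative assumptions, not part of the tutorial's data; the Dirichlet draw uses the standard trick of normalizing independent gamma draws.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, phi, vocab, rng):
    """Follow the generative story: theta_d, then z_dn | theta_d, then w_dn | z_dn."""
    theta = sample_dirichlet(alpha, rng)                 # topic proportions for this document
    topic_ids = list(range(len(phi)))
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]     # pick a topic for this word
        w = rng.choices(vocab, weights=phi[z])[0]        # pick a word from that topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Assumed toy setup: two topics over a four-word vocabulary.
vocab = ["study", "homework", "food", "dinner"]
phi = [
    [0.50, 0.40, 0.05, 0.05],   # a "school"-like topic
    [0.05, 0.05, 0.50, 0.40],   # an "eating"-like topic
]
rng = random.Random(0)
theta, assignments, words = generate_document(8, [0.5, 0.5], phi, vocab, rng)
print(theta)
print(list(zip(assignments, words)))
```

LDA inference runs this story in reverse: only `words` is observed, and `theta`, `assignments` and `phi` have to be estimated.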

#### Defining the Probabilities
The probabilities are as defined below: <br><br>
$p(\theta_d) = Dir(\alpha)$ <br>
This is the Dirichlet Distribution. <br><br>
$p(z_{dn}|\theta_d) = \theta_{d z_{dn}}$ <br>
This is the component of the vector $\theta_d$ at index $z_{dn}$, that is, the probability of topic $z_{dn}$ in document $d$. <br><br>
$p(w_{dn}| z_{dn}) = \Phi_{z_{dn}w_{dn}}$ <br>
This is used to select the words. $\Phi$ is a matrix that stores these probabilities, so the corresponding probability is located at row $z_{dn}$ and column $w_{dn}$. <br><br>
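As a small worked example of these lookups, the sketch below evaluates the per-word terms $p(z_{dn}|\theta_d)\,p(w_{dn}|z_{dn})$ for one short document. All numbers are made up for illustration, `log_likelihood_given_theta` is a hypothetical helper name, and indices are 0-based here (Python lists) rather than the 1-based indices used in the text.

```python
import math

theta_d = [0.3, 0.7]                 # assumed topic proportions for one document
phi = [                              # assumed topic-word matrix: rows = topics, columns = words
    [0.40, 0.30, 0.15, 0.15],
    [0.10, 0.10, 0.45, 0.35],
]
z_d = [1, 1, 0]                      # topic assigned to each word position
w_d = [2, 3, 0]                      # vocabulary index of each word

def log_likelihood_given_theta(theta, phi, z, w):
    """Sum of log p(z_dn | theta_d) + log p(w_dn | z_dn) over the words of one document."""
    return sum(math.log(theta[z_n]) + math.log(phi[z_n][w_n]) for z_n, w_n in zip(z, w))

log_p = log_likelihood_given_theta(theta_d, phi, z_d, w_d)
print(log_p, math.exp(log_p))
```

Working in log space avoids numerical underflow when the product runs over thousands of words.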

#### The Overall Goal
The goal of the LDA model is to find the probability matrix $\Phi$ (along with the per-document topic distributions $\theta_d$), under the constraints that all probabilities are non-negative and that each distribution sums to 1. Learning the probability distributions is a problem of [Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference). Inference techniques include [variational Bayes approximation](https://en.wikipedia.org/wiki/Variational_Bayesian_methods), [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling), and [expectation propagation](https://en.wikipedia.org/wiki/Expectation_propagation). Gibbs sampling is the easiest of these to describe intuitively, so it is outlined here; note, however, that the Gensim and scikit-learn implementations used later in this tutorial are both based on variational Bayes. The derivation of the equations for collapsed Gibbs sampling is available under "Inference" [here](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). In simple terms, Gibbs sampling starts by randomly assigning each word to one of the topics. This gives a topic representation for every document as well as a word distribution for every topic, although these initial estimates are not very good. The model then goes through the words one at a time, assuming that all topic assignments are correct except for the current word, and updates the assignment of the current word based on the current topic proportions of its document and the current word distributions, $\Phi$. Once this is repeated many times, the topic assignments become stable and the words associated with each topic emerge.
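The random-assign-then-resample loop described above can be sketched as a minimal collapsed Gibbs sampler in plain Python. This is an illustration under assumptions, not the algorithm used by either library later in the tutorial: `gibbs_lda` is a hypothetical function name, `beta` is an assumed symmetric Dirichlet prior on the topic-word distributions (the model description above only names `alpha`), and the tiny `docs` list is made up from stems that appear in the sample documents.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token lists (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]                        # words in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                                      # total words assigned to topic k
    assignments = []

    # Step 1: randomly assign every word to a topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly resample each word's topic, holding all other assignments fixed.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assignments[d][n]
                doc_topic[d][k] -= 1          # remove the current word from the counts
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]

                assignments[d][n] = k         # record the new assignment and restore the counts
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word

docs = [
    ["comput", "code", "code", "vision"],
    ["korea", "rocket", "man"],
    ["comput", "vision", "code"],
    ["korea", "rocket", "man", "korea"],
]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    print(k, sorted(counts, key=counts.get, reverse=True)[:3])
```

Because the sampler is stochastic, the topic numbering and the exact counts depend on the seed; only the grouping of co-occurring words is expected to be stable.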

#### Importance
LDA provides a generative probabilistic framework that also serves as a building block for more complex topic modeling applications. Topic modeling is very useful in exploratory data analysis, when it is unclear in advance what structure the data might contain.

<a id='libraries'></a>

## Libraries Required 

Before getting started, you'll need to install the following packages. NLTK is a package for natural language processing in Python, stop-words is a Python package with basic stop-word lists, and gensim is a topic modeling package that provides one of the LDA models we will be using. You can install these three using `pip`:

- Pandas/NumPy/Scikit
  - The easiest way to install pandas is through the Anaconda distribution
- NLTK
  - For Mac/Unix with pip: `sudo pip install -U nltk`
- stop-words
  - For Mac/Unix with pip: `sudo pip install stop-words`
- gensim
  - For Mac/Unix with pip: `sudo pip install gensim`
 

In [1]:
import pandas, string
import numpy as np

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from stop_words import get_stop_words

from gensim import corpora, models

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

<a id='building'></a>

# Building an LDA Model - Gensim
To quickly see how the model works we will create a small data set of sample documents. In a later section of this tutorial we will run two different implementations of the LDA Model on health data. 

## Loading Document Data
In this example, the "documents" are short strings, so we do not need to do any work to load the data, but we must place our "documents" into a list as shown below:

In [2]:
str_1 = "All my boyfriend ever does is sit infront of a computer and do hacky code things. Code all day."
str_2 = "The Chinese Envoy, who just returned from North Korea, seems to have had no impact on Little Rocket Man."
str_3 = '''CVS (computer vision syndrome), more commonly known as computer eye strain, is a combination of vision 
            problems noticed during and after working long hours on the computer. This is common in sofware 
            engineers who code for long streches.'''
str_4 = "Soldiers are dangerously fleeing to South Korea. Rocket man now wants to talk for first time."

documents = [str_1, str_2, str_3, str_4] # the list of documents

<a id='cleaning'></a>

## Cleaning 
First we must tokenize each document, or convert it into its elements. We will use simple methods from the string library: converting the text to lowercase, removing punctuation, and then splitting it into words.
Note that this is a very simple tokenization that works on the sample documents. A more thorough tokenization is presented for the health data.

In [3]:
table = str.maketrans(dict.fromkeys(string.punctuation + '—')) #handy way to remove punctuation
str_4 = str_4.lower().translate(table) #lowercase!
tokens = str_4.split()
print(tokens) #let's see how we did!

['soldiers', 'are', 'dangerously', 'fleeing', 'to', 'south', 'korea', 'rocket', 'man', 'now', 'wants', 'to', 'talk', 'for', 'first', 'time']


The next step in cleaning our data is to remove words that carry little meaning on their own, such as "the". It is often helpful to construct your own list of stop words based on the text contained in the documents. For this simple example, however, we will use the existing stop-words list that we imported above.

In [4]:
stop_words = get_stop_words('en')
print(stop_words[:10]) #what sort of words are contained?

['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and']


To remove the stop words we do the following:

In [5]:
stopped = [i for i in tokens if not i in stop_words]
print(stopped)

['soldiers', 'dangerously', 'fleeing', 'south', 'korea', 'rocket', 'man', 'now', 'wants', 'talk', 'first', 'time']


The last step in cleaning our data is to stem the words. This means we reduce words to a common root so that words like "run" and "running" are not treated as separate; we stem "running" into "run". There are many algorithms for stemming, some stricter than others. One of the most popular is the Porter stemming algorithm, which we imported above.


In [6]:
p_stemmer = PorterStemmer()
stemmed_stopped = [p_stemmer.stem(i) for i in stopped]
print(stemmed_stopped) 

['soldier', 'danger', 'flee', 'south', 'korea', 'rocket', 'man', 'now', 'want', 'talk', 'first', 'time']


The word "soldiers" became "soldier", "dangerously" became "danger", and "fleeing" became "flee". Thus, str_4 is tokenized, stopped, and stemmed. The following code does this on all four of the input strings. 

In [7]:
cleaned_texts = []

# all input strings
for input_string in documents:
    
    # tokenize 
    input_string = input_string.lower().translate(table) #punctuation table from early example
    tokens = input_string.split()

    # stop words
    stopped = [i for i in tokens if not i in stop_words]
    
    # stemmed
    stemmed_stopped = [p_stemmer.stem(i) for i in stopped]
    
    # concatenate
    cleaned_texts.append(stemmed_stopped)
    
print(cleaned_texts)

[['boyfriend', 'ever', 'sit', 'infront', 'comput', 'hacki', 'code', 'thing', 'code', 'day'], ['chines', 'envoy', 'just', 'return', 'north', 'korea', 'seem', 'impact', 'littl', 'rocket', 'man'], ['cv', 'comput', 'vision', 'syndrom', 'commonli', 'known', 'comput', 'eye', 'strain', 'combin', 'vision', 'problem', 'notic', 'work', 'long', 'hour', 'comput', 'common', 'sofwar', 'engin', 'code', 'long', 'strech'], ['soldier', 'danger', 'flee', 'south', 'korea', 'rocket', 'man', 'now', 'want', 'talk', 'first', 'time']]


<a id='matrix'></a>

## Document Term Matrix

Next we will use the gensim corpora package to analyze our text and understand word frequencies. The `corpora.Dictionary()` class is very handy for this. As its name suggests, it acts like a dictionary and maps words to integer IDs. It also collects word counts and other statistics.

In [8]:
dictionary = corpora.Dictionary(cleaned_texts)
print(dictionary.token2id)

{'boyfriend': 0, 'code': 1, 'comput': 2, 'day': 3, 'ever': 4, 'hacki': 5, 'infront': 6, 'sit': 7, 'thing': 8, 'chines': 9, 'envoy': 10, 'impact': 11, 'just': 12, 'korea': 13, 'littl': 14, 'man': 15, 'north': 16, 'return': 17, 'rocket': 18, 'seem': 19, 'combin': 20, 'common': 21, 'commonli': 22, 'cv': 23, 'engin': 24, 'eye': 25, 'hour': 26, 'known': 27, 'long': 28, 'notic': 29, 'problem': 30, 'sofwar': 31, 'strain': 32, 'strech': 33, 'syndrom': 34, 'vision': 35, 'work': 36, 'danger': 37, 'first': 38, 'flee': 39, 'now': 40, 'soldier': 41, 'south': 42, 'talk': 43, 'time': 44, 'want': 45}


The dictionary is then used to convert each document into a bag-of-words. The result is one list per document containing (term ID, term frequency) pairs. For example, "code" shows up twice in the first document, thus the second entry is (1, 2).

In [9]:
corpus = [dictionary.doc2bow(text) for text in cleaned_texts]
print(corpus[0])

[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
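To show what the dictionary and `doc2bow` steps compute, here is a rough standard-library equivalent. It is a sketch only: `build_token2id` and `to_bow` are hypothetical helper names, the two short token lists are made up, and ids are assigned in sorted order over the whole vocabulary, which is a simplification and not necessarily how gensim numbers its tokens.

```python
from collections import Counter

def build_token2id(texts):
    """Map every distinct token to an integer id (sorted order; an assumption, not gensim's rule)."""
    vocab = sorted({token for text in texts for token in text})
    return {token: idx for idx, token in enumerate(vocab)}

def to_bow(text, token2id):
    """Return sorted (term id, term frequency) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())

texts = [["code", "comput", "code", "day"], ["comput", "vision", "vision"]]
token2id = build_token2id(texts)
print(token2id)
print([to_bow(text, token2id) for text in texts])
```

Word order is discarded in this representation; only how often each term occurs in each document is kept.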


<a id='applying'></a>

## Applying the LDA Model

The ldamodel class is described [here](https://radimrehurek.com/gensim/models/ldamodel.html) in the gensim documentation.

We use the following parameters on this simple example:
- num_topics: How many topics are we looking for? We hope the model will find the two. (Gensim falls back to a default of 100 topics if this is omitted, so it should always be set explicitly.)
- id2word: The dictionary the model uses to map IDs back to strings.
- passes: (optional) How many times do we want the model to go through the data? Larger numbers take longer.

In [10]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=200)
print(ldamodel.print_topics(num_words=3))

[(0, '0.054*"rocket" + 0.054*"korea" + 0.054*"man"'), (1, '0.080*"comput" + 0.062*"code" + 0.045*"vision"')]


#### What does that output even tell us?
Each topic is a tuple of (topic number, a string listing the top three words with their weights). 
It looks like this model worked very well on the sample texts. The first topic is clearly Donald Trump's statements about North Korea, calling its leader "Rocket Man", and the second topic is about using computers. Next, we will take a look at a much larger and more varied data set.

<a id='gen'></a>

# LDA Model on Twitter Data - Gensim
Next we will use the model on a larger data set of all the tweets from the New York Times Health News Twitter account. Each line contains tweet id|date and time|tweet, the separators being '|'. The data set can be found [here](https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter). Because the file is delimited text and we need the tweets in the form of a list, we will enlist the help of the pandas `read_csv()` function.

In [11]:
colnames = ["id", "date", "text"]
data = pandas.read_csv('nytimeshealth.txt', names=colnames, delimiter='|', quoting=3, encoding = "utf-8")

# list of documents
documents = list(data.text)
print(documents[:5])

['Risks in Using Social Media to Spot Signs of Mental Distress http://nyti.ms/1rqi9I1', "RT @paula_span: The most effective nationwide diabetes prevention program you've probably never heard of:  http://newoldage.blogs.nytimes.com/2014/12/26/diabetes-prevention-that-works/", 'The New Old Age Blog: Diabetes Prevention That Works http://nyti.ms/1xm7fTi', 'Well: Comfort Casseroles for Winter Dinners http://nyti.ms/1xTNoO0', 'High-Level Knowledge Before Veterans Affairs Scandal http://nyti.ms/13yCpvS']
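The `quoting=3` argument in the cell above is the integer value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text rather than as field delimiters. The sketch below shows the same behaviour with the standard-library `csv` module on a single hypothetical line; the id, the date and the added quotes in the tweet text are made up for illustration.

```python
import csv
import io

# Hypothetical sample line in the "id|date|text" layout; the id and date are made up.
sample = '123|Thu Jan 01 00:00:00 +0000 2015|Well: "Comfort" Casseroles for Winter Dinners\n'

reader = csv.reader(io.StringIO(sample), delimiter='|', quoting=csv.QUOTE_NONE)
rows = list(reader)

print(csv.QUOTE_NONE)   # the integer constant behind quoting=3
print(rows[0])          # three fields, with the double quotes kept as literal characters
```

Tweets often contain unbalanced quotation marks, so turning off quote handling is a reasonable way to keep a stray `"` from swallowing the rest of the file.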


## Cleaning

There are many different tokenizers available. Here, however, I have written my own tokenization function. I chose to leave out website links, the "RT" retweet marker, and @-mentions.

In [12]:
def tokenized(string):
    tokenized = []
    words = string.split()
    for word in words:
        if not word.startswith('http') and not word.startswith('RT') and not word.startswith('@'):
            tokenized.append(word.lower().translate(table))
    return tokenized

print(tokenized('''RT @lpolgreen: "The risk that anyone will contract Ebola in the United States is extremely small.”
                    http://www.nytimes.com/interactive/2014/07/31/world/africa/ebola-virus-outbreak-qa.html'''))

['the', 'risk', 'that', 'anyone', 'will', 'contract', 'ebola', 'in', 'the', 'united', 'states', 'is', 'extremely', 'small”']
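The last token in the output above still carries a trailing curly quote, because `string.punctuation` only contains ASCII characters. One possible refinement, sketched here as an assumption rather than as part of the original pipeline, is to extend the translation table with common typographic punctuation. The names `extra_punctuation`, `extended_table` and `strip_punctuation` are hypothetical.

```python
import string

# Assumed extension: typographic quotes, dashes and the ellipsis in addition to ASCII punctuation.
extra_punctuation = '\u201c\u201d\u2018\u2019\u2014\u2013\u2026'
extended_table = str.maketrans(dict.fromkeys(string.punctuation + extra_punctuation))

def strip_punctuation(word):
    """Lowercase a word and delete every character listed in the extended table."""
    return word.lower().translate(extended_table)

print(strip_punctuation('small\u201d'))
print(strip_punctuation('\u201cThe'))
```

Swapping this table into the tokenizer would change the vocabulary slightly, so the topic output shown later would not be reproduced exactly.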


## Stop Words 
Because Twitter data tends to be colloquial, a broader stop-word list is helpful. We will use the same pre-curated stop-words list as above, but also add the words from the `nltk.corpus` stop-word list.

In [13]:
# joining two English stop words list
more_words = stopwords.words('english') # new words
en_stop = get_stop_words('en')          # original words
en_stop.extend(more_words)
stop_words = list(set(en_stop))

The process is then the same for the health data, using the new tokenizing method and stop words list. The LDA model code is presented below. 

In [14]:
cleaned_texts = []

# goes through each document
for input_string in documents:
    # tokenize
    tokens = tokenized(input_string)
    # stop words
    stopped = [i for i in tokens if not i in en_stop]
    # stemmed
    stemmed_stopped = [p_stemmer.stem(i) for i in stopped]    
    # concatenate
    cleaned_texts.append(stemmed_stopped)

dictionary = corpora.Dictionary(cleaned_texts)
    
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in cleaned_texts]
# generate LDA model
n_topics = 5
n_words = 3
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=n_topics, id2word = dictionary, passes=10)

Now that we have our model, let's look at its output.

In [15]:
#text formatting of output
n = 1
for item in ldamodel.print_topics(num_topics=n_topics, num_words=n_words):
    number, topics = item
    topics = topics.split(' + ')
    words = []
    for topic in topics:
        words.append(topic[7:-1])
    print('Topic: %d' % n)
    print(words)
    n += 1

Topic: 1
['ebola', 'patient', 'hospit']
Topic: 2
['new', 'health', 'age']
Topic: 3
['drug', 'letter', 'studi']
Topic: 4
['brief', 'risk', 'nation']
Topic: 5
['well', 'ask', 'exercis']
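The formatting cell above slices each term with `topic[7:-1]`, which relies on every weight being printed with the same width. A regular expression that pulls out the quoted terms is a slightly more forgiving alternative. The sketch below uses only the standard library; `top_words` is a hypothetical helper name, and the example string is the first topic printed for the sample documents earlier.

```python
import re

def top_words(topic_string):
    """Extract the quoted terms from a string such as '0.054*"rocket" + 0.054*"korea"'."""
    return re.findall(r'"([^"]+)"', topic_string)

example = '0.054*"rocket" + 0.054*"korea" + 0.054*"man"'
print(top_words(example))
```

The same function could be applied to the second element of each tuple returned by `print_topics`.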


<a id='scikit'></a>

## Basic usage of the scikit-learn package for Latent Dirichlet Allocation (LDA)
Above we looked at the gensim implementation of LDA. Here we will explore the scikit-learn implementation.

In [16]:
# cleaning function - very similar to the one used in the above example
# returns a string of words instead of a list of each word

def cleaned(string):
    cleaned_words = ''
    words = string.split()
    for word in words:
        if not word.startswith('http') and not word.startswith('RT') and not word.startswith('@'):
            translated = word.translate(table)
            stemmed = p_stemmer.stem(translated)
            if not stemmed in en_stop:
                cleaned_words = cleaned_words + stemmed.lower() + ' '
    return cleaned_words

# cleans each document, cleaned_text contains the list of each cleaned document string
cleaned_texts = []
for input_string in documents:
    tokens = cleaned(input_string)
    cleaned_texts.append(tokens)


Now that the data is cleaned, we will convert it to a document-term matrix using the `.fit_transform()` method of scikit-learn's `CountVectorizer`. It learns the vocabulary dictionary and returns the document-term matrix. The `.get_feature_names_out()` method maps the indices to the words (it replaces `.get_feature_names()`, which was removed in scikit-learn 1.2).

In [17]:
#convert a collection of text documents to a matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned_texts)

# vocab is an array mapping from feature integer indices to feature name
vocab = vectorizer.get_feature_names_out()

# n_words and n_topics are the same as in the Gensim example

model = LatentDirichletAllocation(n_components=n_topics, random_state=100, learning_method = 'online')

id_topic = model.fit_transform(X)

The following topic output formatting is taken from [here](https://stackoverflow.com/questions/44208501/getting-topic-word-distribution-from-lda-in-scikit-learn).

In [18]:
topic_words = {}
for topic, comp in enumerate(model.components_):
    word_idx = np.argsort(comp)[::-1][:n_words]
    topic_words[topic] = [vocab[i] for i in word_idx]
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  ebola, say, patient
Topic: 1
  well, new, age
Topic: 2
  letter, drug, stori
Topic: 3
  health, well, recip
Topic: 4
  well, cancer, studi


<a id='conclusion'></a>

## Conclusion and Reference
Both implementations pick out similar topics; in the pregenerated output above, the Ebola topic is the clearest example.<br>
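One rough way to quantify that similarity is the Jaccard overlap between the top words of each topic. The sketch below copies the top-three word lists from the two pregenerated outputs above; `jaccard` is a hypothetical helper, and with only three words per topic the scores are coarse, so this is a sanity check rather than a rigorous comparison.

```python
# Top-three words per topic, copied from the pregenerated outputs above.
gensim_topics = [
    {"ebola", "patient", "hospit"},
    {"new", "health", "age"},
    {"drug", "letter", "studi"},
    {"brief", "risk", "nation"},
    {"well", "ask", "exercis"},
]
sklearn_topics = [
    {"ebola", "say", "patient"},
    {"well", "new", "age"},
    {"letter", "drug", "stori"},
    {"health", "well", "recip"},
    {"well", "cancer", "studi"},
]

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    return len(a & b) / len(a | b)

# For each gensim topic, find the scikit-learn topic with the largest overlap.
for i, g_topic in enumerate(gensim_topics, start=1):
    best = max(range(len(sklearn_topics)), key=lambda j: jaccard(g_topic, sklearn_topics[j]))
    print('gensim topic %d ~ scikit-learn topic %d (overlap %.2f)'
          % (i, best, jaccard(g_topic, sklearn_topics[best])))
```

The Ebola topics share two of their three top words, while the gensim topic built around "brief", "risk" and "nation" has no counterpart among the scikit-learn top words.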

For more information check out these sources:<br>
[LDA model](https://ai.stanford.edu/~ang/papers/nips01-lda.pdf)<br>
[Gensim LDA](https://radimrehurek.com/gensim/wiki.html)<br>
[Scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)<br>

More examples of code using these two packages can be found [here](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html), [here](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730), as well as [here](http://christop.club/2014/05/06/using-gensim-for-lda/).

The data used in this tutorial is available from the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter). 