# Topic Models - Latent Dirichlet Allocation (LDA) #

In our previous class we covered Latent Dirichlet Allocation (LDA), which is the most widely used topic model. We also discussed some LDA extensions such as the polylingual topic model (PLTM). For a collection of $D$ documents with a collection-wide vocabulary of $V$ words, the model assumes that its only observable variables are the actual words in the documents and that the number of topics in the collection is set a priori to $T$. For each topic $t=1,2,3,...,T$ in the collection, the model first draws a $V$-dimensional multinomial distribution $\beta_t$ from a prior Dirichlet distribution with hyperparameter $\eta$: $\beta_t \sim Dir(\eta)$. For each document $d=1,2,3,...,D$ in the collection, LDA's generative process includes the following steps:
* Draw a multinomial distribution $\theta_d$ from a collection wide Dirichlet distribution with hyperparameter $\alpha: \theta_d \sim Dir(\alpha)$.
* Go over each word position $n=1,2,3,..., N_d$ in document $d$ and assign a topic indicator $z_n$ by drawing a topic from $\theta_d$: $z_n \sim Multinomial(\theta_d)$.
* Based on the drawn topic assignment $z_n=t$, draw the actual word $w_n$ from the topic-specific distribution over words $\beta_t$: $w_n \sim Multinomial(\beta_{z_n})$.  
The above process is then repeated for every document in the collection. In LDA we deal with two Dirichlet distributions which are used as prior distributions for the document-topic distribution $\theta_d$ and the topic-word distributions $\beta_t$. 
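
The generative process above can be simulated directly. The following is a minimal sketch using only the Python standard library, under some simplifying assumptions: symmetric Dirichlet priors, a fixed document length for every document, and a toy vocabulary. The names `sample_dirichlet`, `generate_corpus` and `toy_vocab` are introduced here for illustration only.

```python
import random

def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(num_docs, num_topics, vocab, doc_length, alpha, eta, seed=0):
    """Simulate the LDA generative process (assumes every document has doc_length words)."""
    rng = random.Random(seed)
    # One topic-word distribution per topic: beta_t ~ Dir(eta)
    beta = [sample_dirichlet(eta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Document-topic distribution: theta_d ~ Dir(alpha)
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_length):
            # Topic indicator: z_n ~ Multinomial(theta_d)
            z = rng.choices(range(num_topics), weights=theta)[0]
            # Word: w_n ~ Multinomial(beta_{z_n})
            doc.append(rng.choices(vocab, weights=beta[z])[0])
        corpus.append(doc)
    return beta, corpus

toy_vocab = ["phone", "case", "battery", "charger", "screen", "cable"]
beta, toy_corpus = generate_corpus(num_docs=3, num_topics=2, vocab=toy_vocab,
                                   doc_length=8, alpha=0.5, eta=0.5)
for doc in toy_corpus:
    print(doc)
```

Topic modeling runs this process in reverse: given only the words, it infers the hidden $\theta_d$, $z_n$ and $\beta_t$.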

In this lab session we are going to use LDA as a topic model to infer topics over documents and represent documents in a shared topic space. We'll be working with the Gensim implementation of LDA. As our dataset we will use the Amazon product reviews collection. The Gensim implementation of LDA uses online variational inference to infer the posterior distributions. More about this approach can be found in the following paper:  
* M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. Neural Information Processing Systems, 2010.  
http://www.cs.columbia.edu/~blei/papers/HoffmanBleiBach2010b.pdf

## Loading the Collection ##

Let's load the Amazon product reviews data. As a reminder, the reviews data is semi-structured and stored in JSON format. Below is a preview of this data, showing the entry for one review:  
`
{
  "reviewerID": "A3HVRXV0LVJN7",
  "asin": "0110400550",
  "reviewerName": "BiancaNicole",
  "helpful": [
    4,
    4
  ],
  "reviewText": "Best phone case ever . Everywhere I go I get a ton of compliments on it. It was in perfect condition as well.",
  "overall": 5.0,
  "summary": "A++++",
  "unixReviewTime": 1358035200,
  "reviewTime": "01 13, 2013"
}
`
This dataset comes with a set of Python functions that will help us convert the reviews from JSON format to Pandas dataframes. 

In [None]:
def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')
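
The `parse` helper above calls `eval` on every line, which executes whatever the file happens to contain. Below is a minimal sketch of a more defensive variant. It assumes that each line is either strict JSON or a Python dict literal. `parse_safe` and the demo file are hypothetical names introduced for illustration and are not part of the dataset's helper code.

```python
import ast
import gzip
import json
import os
import tempfile

def parse_safe(path):
    """Yield one review dict per line without executing arbitrary code."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)        # strict JSON
            except json.JSONDecodeError:
                yield ast.literal_eval(line)  # Python-literal dicts (single quotes)

# Small demo on a temporary file containing one line of each kind:
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json.gz")
    with gzip.open(demo_path, 'wt', encoding='utf-8') as out:
        out.write('{"reviewText": "Great case", "overall": 5.0}\n')
        out.write("{'reviewText': 'Stopped charging', 'overall': 1.0}\n")
    demo_reviews = list(parse_safe(demo_path))
print(demo_reviews)
```

`parse_safe` yields the same kind of dictionaries as `parse`, so it could be used in `getDF` in its place.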

With these helper functions we'll extract the "reviewText" field from each review:

In [None]:
import pandas as pd
import gzip

review_file = "../../../data/amazon_reviews/cp/reviews_Cell_Phones_and_Accessories_h10k.json.gz"

df = getDF(review_file)
print (df['reviewText'])

## Text Processing ##
Now that we've extracted the reviews we'll proceed by tokenizing them. In this next step we'll perform the following:  
* Extract sentences
* Tokenize words
* Remove stopwords
* Remove punctuation marks

Note that we now focus on whole reviews rather than individual sentences. LDA does not work well at the sentence level: a typical sentence contains too few words to provide sufficient statistics for inferring meaningful topic distributions. 

In [None]:
import nltk
# Use a set for fast stopword lookups:
stopwords_list = set(nltk.corpus.stopwords.words('english'))
# Create a list for the tokenized sentences:
tok_sentences = list()
# Create a list for the tokenized reviews:
tok_reviews = list()

all_words = list()
r_count = 0
for review in df['reviewText']:
    r_count += 1
    if (r_count % 1000 == 0):
        print(r_count)
    sentences = nltk.sent_tokenize(review)
    review_words= list()
    for sentence in sentences:
        sent_words = nltk.word_tokenize(sentence)
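        # Compare the lowercased word so capitalized stopwords (e.g. "The") are also removed;
        # isalpha() drops punctuation marks and numeric tokens.
        sent_words_tok = [word.lower() for word in sent_words if word.lower() not in stopwords_list and word.isalpha()]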
        tok_sentences.append(sent_words_tok)
        for words in sent_words_tok:
            all_words.append(words)
            review_words.append(words)
    tok_reviews.append(review_words)


Topic models use the bag-of-words representation of documents. When inferring topics they rely on sufficient statistics over the number of times words appear in a document. For that reason rare words are typically removed from the collection. Hapax words are always removed, and in many instances, depending on the vocabulary size, words that appear fewer than 10 times are removed as well. For more details on how to preprocess collections for topic modeling, refer to the following book chapter, which is a very useful resource:
* Boyd-Graber, J., Mimno, D., & Newman, D. (2014). Care and feeding of topic models: Problems, diagnostics, and improvements. Handbook of Mixed Membership Models and Their Applications, 225-255.  
https://mimno.infosci.cornell.edu/papers/2014_book_chapter_care_and_feeding.pdf  

In the next step we'll also obtain the frequency count of words which would help us generate the effective vocabulary.

In [None]:
import numpy as np
frequency_count = nltk.FreqDist(all_words)
words = np.array(list(frequency_count.keys()))
word_freq = np.array(list(frequency_count.values()))
# Sort the words by decreasing frequency:
freq_sort = np.argsort(word_freq)[::-1]
word_freq_sort = word_freq[freq_sort]
words_sorted = words[freq_sort]

The effective vocabulary consists of the words that we will choose to use to represent the collection. In this case we'll discard hapax words (i.e. words whose frequency of occurrence is 1). We will also treat the top 25 words in the collection as stop words. These words will not be included in our effective vocabulary. Let's create our effective vocabulary:

In [None]:
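# Number of most frequent words to treat as stop words:
num_top_words = 25
effective_vocab = list()
for rank, word in enumerate(words_sorted, start=1):
    # Skip the top 25 words and the hapax words (frequency 1):
    if rank > num_top_words and frequency_count[word] > 1:
        effective_vocab.append(word)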
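
The filtering rule described above (drop the most frequent words and the hapax words) is easy to check on a toy example. The following is a minimal sketch using `collections.Counter`. The function `build_effective_vocab` and the toy tokens are introduced here for illustration only, and the cut-offs are much smaller than the ones used in the lab.

```python
from collections import Counter

def build_effective_vocab(tokens, num_top_to_drop, min_count):
    """Keep words that are outside the top ranks and occur at least min_count times."""
    counts = Counter(tokens)
    # most_common() returns the words sorted by decreasing frequency
    ranked = [word for word, _ in counts.most_common()]
    kept = ranked[num_top_to_drop:]
    return [word for word in kept if counts[word] >= min_count]

toy_tokens = (["phone"] * 5 + ["case"] * 4 + ["battery"] * 3
              + ["charger"] * 2 + ["flimsy"])
toy_effective_vocab = build_effective_vocab(toy_tokens, num_top_to_drop=2, min_count=2)
print(toy_effective_vocab)  # ['battery', 'charger']
```

Here "phone" and "case" are dropped as the two most frequent words and "flimsy" is dropped as a hapax word.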

Now that we've created our effective vocabulary we'll go back and represent our set of reviews using this vocabulary:

In [None]:
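# Use a set for fast membership tests:
effective_vocab_set = set(effective_vocab)
tok_reviews_ev = list()
for review in tok_reviews:
    review_words_ev = [word for word in review if word in effective_vocab_set]
    tok_reviews_ev.append(review_words_ev)
# Preview the first few reviews rather than printing the whole collection:
print(tok_reviews_ev[:5])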

## LDA Model Training ##
Next we'll train an LDA model. We do that using the __gensim.models.LdaModel__ class. Before training the model we will create a bag-of-words representation of our collection using the following code:

In [None]:
import gensim
dictionary = gensim.corpora.Dictionary(tok_reviews_ev)
dictionary.save("../../../data/amazon_reviews/cp/reviews_Cell_Phones_and_Accessories_h1k.dict")
corpus = [dictionary.doc2bow(doc) for doc in tok_reviews_ev]
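
To make the bag-of-words representation concrete: each document becomes a sorted list of `(word_id, count)` pairs, and words that are not in the dictionary are ignored. The following pure-Python sketch imitates that output format on a toy collection. `toy_doc2bow` and `token2id` are hypothetical names, and the ids are assigned here by order of first appearance, which is an assumption and may differ from the ids Gensim assigns.

```python
from collections import Counter

def toy_doc2bow(doc, token2id):
    """Return sorted (word_id, count) pairs, skipping words without an id."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [["battery", "charger", "battery"], ["charger", "cable"]]

# Assign integer ids in order of first appearance (assumption for this sketch):
token2id = {}
for doc in toy_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

toy_corpus = [toy_doc2bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'battery': 0, 'charger': 1, 'cable': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation. Only the per-document counts are passed on to the model.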

In [None]:
lda = gensim.models.LdaModel(corpus,id2word=dictionary, num_topics=30, iterations=2000, alpha="auto")

**[Assignment 1]**  
The Gensim LDA implementation contains various useful methods for exploring the inferred topic space. For example, the method __lda.print_topics(num_topics=, num_words=)__ lets us obtain the top __num_words__ words for each of __num_topics__ topics. Use this method to explore the inferred set of topic-word distributions. More about the various LDA methods, including their descriptions, can be found here:  
https://radimrehurek.com/gensim/models/ldamodel.html


**[Solution 1]**

**[Assignment 2]**
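Look into the LDA model parameters such as the number of passes over the collection (__passes__), the maximum number of inference iterations per document (__iterations__), the number of topics (__num_topics__), the __alpha__ and __eta__ hyperparameters, and the __decay__ and __offset__ parameters, which control the learning rate rho of the online updates. Train the LDA model with different parameter configurations and observe how they affect the obtained topics.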

**[Solution 2]**

**[Assignment 3]**


Once we have trained an LDA model we can use it to infer topics for unseen documents. We do this with the following call:  
review_topics = lda[__new_doc_content__]  
where __new_doc_content__ is the bag-of-words representation of the new document, as in the example below:

In [None]:
new_review = ['very', 'cheap', 'exactly', 'supposed', 'traveled', 'hands', 'free', 'problem', 'more', 'problem','yet', 'feels', 'durable']
new_review_bow = dictionary.doc2bow(new_review)

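Generate a new review using words from the effective vocabulary and use this method to infer its topics. Store the result in a variable named __new_review_topics__, which the plotting code further down expects.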

**[Solution 3]**

You could also plot the document-topic, i.e. the review-topic distribution:

In [None]:
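import matplotlib.pyplot as plt
# The 'seaborn' style was renamed to 'seaborn-v0_8' in newer matplotlib releases:
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
%matplotlib inline
# Dense vector with one entry per topic (topics missing from the sparse output stay at 0):
topic_vec = np.zeros(lda.num_topics)
for k, prob in new_review_topics:
    topic_vec[k] = prob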
    
plt.rcParams['figure.figsize'] = (8,4)
fig = plt.figure()
x_pos = np.arange(len(topic_vec))
plt.bar(x_pos,topic_vec)
plt.ylabel('Probability(Topic)')
plt.xlabel('Topic #')
plt.title('Per Review Topic Distribution')

**[Assignment 4]**  
For a given topic you can use the same approach to plot the per-topic word distribution. Choose a topic and analyze its distribution over words. Note that the numbers on the x-axis will correspond to the indices assigned to the words. To obtain the ten most probable words for a topic you can use the __show_topic(topicid, topn=)__ method, where the __topn__ argument lets you choose how many of the most probable words are retrieved for that topic.

In [None]:
top_words = lda.show_topic(10, topn=10)

In [None]:
words = list()
probs = list()
for word, prob in top_words:
    words.append(word)
    probs.append(prob)
x_pos = np.arange(len(probs))
plt.bar(x_pos, probs, align='center', alpha=0.5)
plt.xticks(x_pos, words)
plt.ylabel('Probability(Word | Topic)')
plt.title('Per Topic Word Distribution')

Pick a certain topic and use the code above to analyze its probability distribution across the topn words.

**[Solution 4]**

**[Assignment 5]**
Explore the topic space on your own. 

**[Solution 5]**