# Latent Dirichlet Allocation - Implementation



In this notebook we perform Latent Dirichlet Allocation (LDA) to identify common topics in a set of documents. 

We use the **Gensim** topic modeling API. 

https://radimrehurek.com/gensim/models/ldamodel.html

Although there is a Scikit-Learn implementation of LDA, we prefer Gensim’s LDA because it provides more built-in functionality around the model, such as a topic coherence pipeline and dynamic topic modeling. 


We build an **end-to-end Natural Language Processing (NLP) pipeline** that starts with raw data and runs through preprocessing, modeling, and visualization.


- Pre-process Data
- Topic modeling with LDA
- Determine Optimal Number of Topics
- Visualizing topic models with pyLDAvis


## Dataset

We use a dataset containing scientific papers published at the 2015 Neural Information Processing Systems (NIPS) conference, one of the top machine learning conferences in the world. It covers topics ranging from deep learning and computer vision to cognitive science and reinforcement learning.

The input CSV file contains one row for each of the 403 NIPS papers from the 2015 conference. It includes the following fields:

- Id - unique identifier for the paper (equivalent to the one in NIPS's system)
- Title - title of the paper
- EventType - whether it's a poster, oral, or spotlight presentation
- PdfName - filename for the PDF document
- Abstract - text for the abstract (scraped from the NIPS website)
- PaperText - raw text from the PDF document (created using the tool pdftotext)


In [20]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

%pylab inline
import pandas as pd
import pickle as pk
from scipy import sparse as sp

2020-02-27 20:00:00,732 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.


Populating the interactive namespace from numpy and matplotlib


## Load & Explore the Data

In [2]:
df = pd.read_csv('/Users/hasan/datasets/NIPS2015_Papers.csv')

df.head()

Unnamed: 0,Id,Title,EventType,PdfName,Abstract,PaperText
0,5677,Double or Nothing: Multiplicative Incentive Me...,Poster,5677-double-or-nothing-multiplicative-incentiv...,Crowdsourcing has gained immense popularity in...,Double or Nothing: Multiplicative\nIncentive M...
1,5941,Learning with Symmetric Label Noise: The Impor...,Spotlight,5941-learning-with-symmetric-label-noise-the-i...,Convex potential minimisation is the de facto ...,Learning with Symmetric Label Noise: The\nImpo...
2,6019,Algorithmic Stability and Uniform Generalization,Poster,6019-algorithmic-stability-and-uniform-general...,One of the central questions in statistical le...,Algorithmic Stability and Uniform Generalizati...
3,6035,Adaptive Low-Complexity Sequential Inference f...,Poster,6035-adaptive-low-complexity-sequential-infere...,We develop a sequential low-complexity inferen...,Adaptive Low-Complexity Sequential Inference f...
4,5978,Covariance-Controlled Adaptive Langevin Thermo...,Poster,5978-covariance-controlled-adaptive-langevin-t...,Monte Carlo sampling for Bayesian posterior in...,Covariance-Controlled Adaptive Langevin\nTherm...


## Description of the Data

DataFrame’s info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403 entries, 0 to 402
Data columns (total 6 columns):
Id           403 non-null int64
Title        403 non-null object
EventType    403 non-null object
PdfName      403 non-null object
Abstract     403 non-null object
PaperText    403 non-null object
dtypes: int64(1), object(5)
memory usage: 19.0+ KB


## Dimensions of the Data

Get the dimensions (number of rows and columns) of the data using the DataFrame's shape attribute.

In [4]:
print("Dimension of the data: ", df.shape)

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (403, 6)
No. of Rows: 403
No. of Columns: 6


## Convert the DataFrame Column into an Array of Documents

We convert the PaperText column of the DataFrame into an array of documents.

It is a 1D array in which each element represents a document (the full text of one paper).

In [5]:
docs_array = array(df['PaperText'])

print("Dimension of the documents array: ", docs_array.shape)

# Display the first document
#print(docs_array[0])

Dimension of the documents array:  (403,)


## Pre-process the Data


We pre-process the data as follows. 

- Convert to lowercase 
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove short words (three characters or fewer)
- Lemmatize the tokens/words


### Tokenization

We tokenize the text using a regular expression tokenizer from NLTK. We remove numeric tokens and short tokens (three characters or fewer), as they don’t tend to be useful, and the dataset contains a lot of them.


The NLTK regular-expression tokenizer class "RegexpTokenizer" splits a string into substrings using a regular expression. We use the regular expression "\w+", which matches runs of word characters (letters, digits, and underscores). 

See the following two links for a list of regular expressions and NLTK tokenize module.
https://github.com/tartley/python-regex-cheatsheet/blob/master/cheatsheet.rst
https://www.nltk.org/api/nltk.tokenize.html
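
To make the tokenization and filtering steps concrete, here is a minimal sketch that uses only Python's standard `re` module rather than NLTK. The function name `simple_preprocess_doc`, the `min_len` parameter, and the sample sentence are illustrative assumptions, not part of the notebook. The sketch keeps tokens of at least four characters, which corresponds to the `len(token) > 3` filter used in the preprocessing function, and it skips lemmatization.

```python
import re

def simple_preprocess_doc(text, min_len=4):
    """Lowercase, split into word tokens, drop pure numbers and short tokens.

    Hypothetical helper for illustration; it does not lemmatize.
    """
    tokens = re.findall(r"\w+", text.lower())        # runs of word characters
    tokens = [t for t in tokens if not t.isdigit()]  # drop tokens that are only digits
    return [t for t in tokens if len(t) >= min_len]  # drop short tokens

# Made-up sample sentence
sample = "In 2015, LSTM networks and conv2d layers were popular at NIPS."
print(simple_preprocess_doc(sample))
```

Note that "2015" is removed because it consists only of digits, while "conv2d" is kept because it merely contains a digit.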


## Function to Convert the Document Array into a List of Tokenized Documents

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+') # Tokenize the words.
    
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove short words (three characters or fewer).
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs

## Convert the Document Array into a List of Tokenized Documents

In [7]:
# Tokenize and clean every document; the result is a list of token lists.
%time docs = docs_preprocessor(docs_array)

CPU times: user 8.06 s, sys: 182 ms, total: 8.24 s
Wall time: 8.25 s


In [8]:
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

# Display the first two documents
#print(docs[0:2])

Length of the 2D Array of Tokenized Documents:  403


## Compute Bigrams/Trigrams:


When topics are very similar, we may **use phrases** rather than individual words to distinguish the topics. 

Thus, we compute both bigrams and trigrams. Depending on the dataset it may not be necessary to create trigrams.

Note that we only keep the **frequent** phrases (bigrams/trigrams).

#### Bigrams
Bigrams are sets of two adjacent words. Using bigrams we can get phrases like “machine_learning” in our output (spaces are replaced with underscores). Without bigrams we would only get “machine” and “learning”.

Note that in the code below, we find bigrams and then add them to the original data, because we would like to keep the words “machine” and “learning” as well as the bigram “machine_learning”.
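
The idea of appending frequent bigrams to each document can be sketched with the standard library alone. This is a deliberately simplified illustration: it keeps a bigram purely on the basis of its raw count, whereas Gensim's `Phrases` additionally applies a score threshold. The function name `add_frequent_bigrams` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def add_frequent_bigrams(docs, min_count=2):
    """Append 'w1_w2' tokens for adjacent pairs seen at least min_count times.

    Hypothetical, count-only simplification of phrase detection.
    """
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))  # count adjacent word pairs
    augmented = []
    for doc in docs:
        phrases = [f"{a}_{b}" for a, b in zip(doc, doc[1:])
                   if counts[(a, b)] >= min_count]
        augmented.append(doc + phrases)   # keep the original words as well
    return augmented

# Made-up toy corpus
toy_docs = [
    ["machine", "learning", "model"],
    ["machine", "learning", "theory"],
    ["deep", "model"],
]
print(add_frequent_bigrams(toy_docs))
```

Only "machine_learning" occurs twice, so it is the only phrase appended, and the individual words "machine" and "learning" remain in the documents.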

In [9]:
from gensim.models import Phrases

# Detect frequent phrases. Bigrams must appear at least 10 times;
# the trigram model uses the Phrases default min_count (5, see the log below).
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token contains '_' (a phrase found by the trigram model,
            # or a bigram appended above); add to document.
            docs[idx].append(token)

2020-02-27 19:07:56,786 : INFO : collecting all words and their counts
2020-02-27 19:07:56,787 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-02-27 19:07:58,882 : INFO : collected 556123 word types from a corpus of 1141175 words (unigram + bigrams) and 403 sentences
2020-02-27 19:07:58,883 : INFO : using 556123 counts as vocab in Phrases<0 vocab, min_count=10, threshold=10.0, max_vocab_size=40000000>
2020-02-27 19:07:58,884 : INFO : collecting all words and their counts
2020-02-27 19:07:58,897 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-02-27 19:08:05,368 : INFO : collected 616916 word types from a corpus of 1020757 words (unigram + bigrams) and 403 sentences
2020-02-27 19:08:05,369 : INFO : using 616916 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>


## Remove Rare and Common Tokens/Words

We remove rare words and common words based on their document frequency. 

For example, we may remove words that appear in less than 10 documents or in more than 20% of the documents. 
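
The document-frequency filter can be illustrated with a small standard-library sketch. The function name `filter_by_doc_frequency` and the toy corpus are assumptions for this example. The upper bound is truncated to an integer number of documents, which matches the training log further down ("no more than 80 (=20.0%)" for 403 documents).

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=2, no_above=0.75):
    """Return the sorted tokens whose document frequency lies within the bounds.

    Hypothetical helper; it mimics the idea of Dictionary.filter_extremes.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))            # count each token once per document
    max_docs = int(no_above * len(docs))     # absolute upper bound on document frequency
    return sorted(tok for tok, df in doc_freq.items()
                  if no_below <= df <= max_docs)

# Made-up toy corpus of four tokenized documents
toy_docs = [
    ["model", "data", "regret"],
    ["model", "data", "bandit"],
    ["model", "regret", "kernel"],
    ["model", "kernel", "data"],
]
print(filter_by_doc_frequency(toy_docs))
```

Here "model" is dropped because it appears in every document, and "bandit" is dropped because it appears in only one.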

In [10]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initital documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents or in more than 20% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.2)
print('Number of unique words after removing rare and common words:', len(dictionary))

2020-02-27 19:08:14,827 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-02-27 19:08:15,910 : INFO : built Dictionary(39534 unique tokens: ['abdel', 'ability', 'about', 'above', 'abstract']...) from 403 documents (total 1544630 corpus positions)
2020-02-27 19:08:15,986 : INFO : discarding 33533 tokens: [('abdel', 4), ('ability', 104), ('about', 266), ('above', 300), ('abstract', 402), ('according', 204), ('accuracy', 210), ('across', 183), ('across_trial', 7), ('added', 85)]...
2020-02-27 19:08:15,987 : INFO : keeping 6001 tokens which were in no less than 10 and no more than 80 (=20.0%) documents
2020-02-27 19:08:16,002 : DEBUG : rebuilding dictionary, shrinking gaps
2020-02-27 19:08:16,008 : INFO : resulting dictionary: Dictionary(6001 unique tokens: ['accessed', 'acoustic', 'acquisition', 'additive', 'address_this']...)


Number of unique words in initital documents: 39534
Number of unique words after removing rare and common words: 6001


## Bag-of-Words Representation of Data


Finally, we transform the documents to a **vectorized form**. 

We simply compute the frequency of each word, including the bigrams.
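
As a rough illustration of what a bag-of-words vector looks like, the sketch below maps each token to an integer id and counts occurrences per document. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins and are not Gensim functions. Gensim may assign ids in a different order, but the result has the same shape: a list of `(token_id, count)` pairs per document.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token (hypothetical helper)."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Made-up toy corpus
toy_docs = [["regret", "bandit", "regret"], ["kernel", "bandit"]]
token2id = build_vocab(toy_docs)
print(token2id)
print([to_bow(doc, token2id) for doc in toy_docs])
```

Tokens that are missing from the vocabulary (for example, words removed by the frequency filter) simply do not contribute to the vector.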

In [11]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 6001
Number of documents: 403


## Training the LDA Model

We use the gensim.models.LdaModel class for performing LDA.

We need to set the parameters of the LdaModel object carefully. The full list of parameters is given at:

https://radimrehurek.com/gensim/models/ldamodel.html


#### Below we discuss the setting of some of the key parameters.

- num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

 
LDA is an unsupervised technique, meaning that we don't know prior to running the model how many topics exist in our corpus. It depends on the data and the application. We may use the following two techniques to determine the number of topics.


        Technique 1: Topic Coherence 
The main technique to determine the number of topics is **Topic coherence**:
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf


        Technique 2: Visualizing Inter-Topic Distance 
Use the LDA visualization tool pyLDAvis to inspect the Intertopic Distance Map (discussed later). By varying the number of topics, we can judge from the visualization which value separates the topics best.

We **use both techniques** to determine the optimal number of topics.


- chunksize (int, optional) – Number of documents to be used in each training chunk.

It controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fit into memory. 

We set chunksize = 500, which is more than the number of documents (403). Thus, it processes all the data in one go. 

Chunksize can however influence the quality of the model.


- passes (int, optional) – Number of passes through the corpus during training.

It controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. 


- iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

It is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. 

        It is important to set the number of “passes” and “iterations” high enough.



#### How to Set "passes" and "iterations":

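First, enable logging at the DEBUG level. Setting eval_every = 1 in LdaModel additionally logs the perplexity after every update, but this slows down training, so we use eval_every = None in this notebook. 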

When training the model look for a line in the log that looks something like this:

        2020-02-25 19:07:04,716 : DEBUG : 49/403 documents converged within 400 iterations

If we set passes = 20, we will see this line 20 times. 

### Important: We need to make sure that by the final passes, most of the documents have converged. 

For example, if passes = 20 and iterations = 400, then we should see something like the following:


        2020-02-25 19:07:18,041 : INFO : PROGRESS: pass 19, at document #403/403
        2020-02-25 19:07:18,042 : DEBUG : performing inference on a chunk of 403 documents
        2020-02-25 19:07:18,627 : DEBUG : 402/403 documents converged within 400 iterations

Thus, we want to choose both passes and iterations to be high enough for this to happen.
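
Rather than reading the log by eye, the convergence lines can be extracted with a regular expression. The sketch below is one possible way to do this; the function name `convergence_per_update` is an assumption, and the sample lines are copied from the training log of this notebook. It assumes the log message format shown above.

```python
import re

# Pattern for lines such as "402/403 documents converged within 400 iterations"
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")

def convergence_per_update(log_lines):
    """Return the fraction of converged documents for each matching log line."""
    fractions = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            converged, total = int(match.group(1)), int(match.group(2))
            fractions.append(converged / total)
    return fractions

log_lines = [
    "2020-02-27 19:08:19,757 : DEBUG : 60/403 documents converged within 400 iterations",
    "2020-02-27 19:08:24,186 : DEBUG : 401/403 documents converged within 400 iterations",
    "2020-02-27 19:08:30,936 : DEBUG : 402/403 documents converged within 400 iterations",
]
for fraction in convergence_per_update(log_lines):
    print(f"{fraction:.1%} of documents converged")
```

If the fractions for the last passes stay well below 100%, that suggests increasing passes, iterations, or both.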


- eval_every (int, optional) – Log perplexity is estimated every that many updates. Setting this to 1 slows down training by ~2x.


- alpha ({numpy.ndarray, str}, optional): Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief about each topic's probability. 

Alternatively default prior selecting strategies can be employed by supplying a string:

        ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.

        ’auto’: Learns an asymmetric prior from the corpus (not available if distributed==True).
        
        
- eta ({float, np.array, str}, optional) – A-priori belief on word probability.

It can be:

        scalar for a symmetric prior over topic/word probability,

        vector of length num_words to denote an asymmetric user defined probability for each word,

        matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,

        the string ‘auto’ to learn the asymmetric prior from the data.


We set alpha = 'auto' and eta = 'auto'. Again this is somewhat technical, but essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.

In [12]:
from gensim.models import LdaModel

# Set training parameters.
num_topics = 4
chunksize = 500 # Number of documents processed in each training chunk
passes = 20 # Number of passes through documents
iterations = 400 # Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

2020-02-27 19:08:16,576 : INFO : using autotuned alpha, starting with [0.25, 0.25, 0.25, 0.25]
2020-02-27 19:08:16,579 : INFO : using serial LDA version on this node
2020-02-27 19:08:16,585 : INFO : running online (multi-pass) LDA training, 4 topics, 20 passes over the supplied corpus of 403 documents, updating model once every 403 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2020-02-27 19:08:16,586 : INFO : PROGRESS: pass 0, at document #403/403
2020-02-27 19:08:16,586 : DEBUG : performing inference on a chunk of 403 documents
2020-02-27 19:08:19,757 : DEBUG : 60/403 documents converged within 400 iterations
2020-02-27 19:08:19,761 : INFO : optimized alpha [0.06745957, 0.14463517, 0.05922535, 0.19062406]
2020-02-27 19:08:19,761 : DEBUG : updating topics
2020-02-27 19:08:19,765 : INFO : topic #0 (0.067): 0.004*"proposal" + 0.003*"covariance_matrix" + 0.003*"variational_inference" + 0.003*"data_set" + 0.003*"document" + 0.00

2020-02-27 19:08:23,489 : INFO : topic diff=0.128294, rho=0.408248
2020-02-27 19:08:23,495 : INFO : PROGRESS: pass 5, at document #403/403
2020-02-27 19:08:23,496 : DEBUG : performing inference on a chunk of 403 documents
2020-02-27 19:08:24,186 : DEBUG : 401/403 documents converged within 400 iterations
2020-02-27 19:08:24,190 : INFO : optimized alpha [0.052063763, 0.055151038, 0.05386876, 0.0699383]
2020-02-27 19:08:24,191 : DEBUG : updating topics
2020-02-27 19:08:24,195 : INFO : topic #0 (0.052): 0.005*"convolutional" + 0.005*"fully_connected" + 0.005*"proposal" + 0.004*"recurrent_neural" + 0.004*"recurrent" + 0.004*"hidden_unit" + 0.004*"lstm" + 0.004*"deep_learning" + 0.004*"document" + 0.004*"ground_truth"
2020-02-27 19:08:24,196 : INFO : topic #1 (0.055): 0.012*"regret" + 0.007*"bandit" + 0.006*"active_learning" + 0.006*"policy" + 0.005*"game" + 0.005*"reward" + 0.005*"regret_bound" + 0.005*"query" + 0.004*"sample_complexity" + 0.004*"online_learning"
2020-02-27 19:08:24,197 : 

2020-02-27 19:08:27,368 : INFO : topic #0 (0.049): 0.006*"convolutional" + 0.005*"fully_connected" + 0.005*"recurrent" + 0.005*"recurrent_neural" + 0.005*"proposal" + 0.005*"hidden_unit" + 0.005*"deep_learning" + 0.004*"lstm" + 0.004*"hidden_layer" + 0.004*"ground_truth"
2020-02-27 19:08:27,369 : INFO : topic #1 (0.050): 0.012*"regret" + 0.007*"bandit" + 0.006*"policy" + 0.006*"active_learning" + 0.005*"reward" + 0.005*"game" + 0.005*"regret_bound" + 0.005*"query" + 0.005*"submodular" + 0.004*"item"
2020-02-27 19:08:27,371 : INFO : topic #2 (0.055): 0.007*"matrix_completion" + 0.006*"rank_matrix" + 0.005*"convergence_rate" + 0.004*"sample_complexity" + 0.004*"recovery" + 0.004*"gradient_descent" + 0.004*"line_search" + 0.004*"strongly_convex" + 0.004*"regularization_parameter" + 0.004*"step_size"
2020-02-27 19:08:27,372 : INFO : topic #3 (0.065): 0.007*"gaussian_process" + 0.005*"variational_inference" + 0.004*"markov_chain" + 0.003*"posterior_distribution" + 0.003*"step_size" + 0.003*

2020-02-27 19:08:30,382 : INFO : topic #2 (0.058): 0.006*"matrix_completion" + 0.006*"rank_matrix" + 0.006*"convergence_rate" + 0.004*"gradient_descent" + 0.004*"sample_complexity" + 0.004*"singular_value" + 0.004*"step_size" + 0.004*"recovery" + 0.004*"regularization_parameter" + 0.004*"strongly_convex"
2020-02-27 19:08:30,384 : INFO : topic #3 (0.065): 0.008*"gaussian_process" + 0.006*"variational_inference" + 0.004*"markov_chain" + 0.004*"posterior_distribution" + 0.003*"sampler" + 0.003*"covariance_matrix" + 0.003*"mcmc" + 0.003*"step_size" + 0.003*"mixture_model" + 0.003*"graphical_model"
2020-02-27 19:08:30,385 : INFO : topic diff=0.043159, rho=0.242536
2020-02-27 19:08:30,391 : INFO : PROGRESS: pass 16, at document #403/403
2020-02-27 19:08:30,392 : DEBUG : performing inference on a chunk of 403 documents
2020-02-27 19:08:30,936 : DEBUG : 402/403 documents converged within 400 iterations
2020-02-27 19:08:30,940 : INFO : optimized alpha [0.048441563, 0.05053887, 0.058612633, 0.06

CPU times: user 47.3 s, sys: 1.42 s, total: 48.8 s
Wall time: 16 s


## Technique 1 for Determining Optimal Number of Topics: Topic Coherence

Topic Coherence is a measure used to evaluate topic models. Each such generated topic consists of words, and the topic coherence is applied to the top N words from the topic. 

Topic Coherence measures score a single topic by **measuring the degree of semantic similarity between high scoring words in the topic**. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference. 

A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

Topic Coherence is defined as the average of the pairwise word-similarity scores of the words in the topic.

A good model will generate coherent topics, i.e., topics with high topic coherence scores. Good topics are topics that can be described by a short label, therefore this is what the topic coherence measure should capture.


Below we
- display the average topic coherence and
- print the topics in order of topic coherence.


We use LdaModel's "top_topics" method, which returns the topics sorted by coherence score, together with the coherence of each topic.

Note that we use the “Umass” topic coherence measure here (see gensim.models.ldamodel.LdaModel.top_topics()).
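
To give an intuition for a UMass-style score, the following sketch computes the average of log((D(w_i, w_j) + 1) / D(w_i)) over all pairs of top words, where D counts the documents containing the given words and w_i is the higher-ranked word of the pair. This is a simplified illustration under those assumptions. Gensim's implementation may differ in details such as smoothing and aggregation, and the function name `umass_coherence` and the toy corpus are made up for this example.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Average pairwise UMass-style score for a ranked list of topic words.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        w_hi, w_lo = top_words[i], top_words[j]  # w_hi is ranked above w_lo
        scores.append(math.log((doc_count(w_hi, w_lo) + 1) / doc_count(w_hi)))
    return sum(scores) / len(scores)

# Made-up toy corpus
toy_docs = [
    ["regret", "bandit", "reward"],
    ["regret", "bandit"],
    ["regret", "kernel"],
    ["kernel", "pixel"],
]
print(umass_coherence(["regret", "bandit"], toy_docs))  # words that co-occur
print(umass_coherence(["regret", "pixel"], toy_docs))   # words that never co-occur
```

Words that tend to appear in the same documents yield a higher score than words that never co-occur, which is why a higher average coherence indicates more interpretable topics.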

In [13]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2020-02-27 19:08:32,613 : DEBUG : Setting topics to those of the model: LdaModel(num_terms=6001, num_topics=4, decay=0.5, chunksize=500)


Average topic coherence: -1.6154.
[([(0.006503164, 'convolutional'),
   (0.005490133, 'fully_connected'),
   (0.0054501276, 'recurrent'),
   (0.0052782292, 'deep_learning'),
   (0.0052202637, 'recurrent_neural'),
   (0.004987433, 'hidden_unit'),
   (0.0047173062, 'hidden_layer'),
   (0.0046798713, 'lstm'),
   (0.0045049815, 'generative_model'),
   (0.0044646817, 'proposal'),
   (0.004120675, 'convolutional_neural'),
   (0.0040228195, 'segmentation'),
   (0.0039855647, 'ground_truth'),
   (0.003952189, 'pixel'),
   (0.0039449367, 'during_training'),
   (0.0034225394, 'sentence'),
   (0.0032817528, 'convolutional_network'),
   (0.003246106, 'document'),
   (0.0032047795, 'deep_network'),
   (0.003117191, 'embedding')],
  -1.120357126950858),
 ([(0.0063719475, 'matrix_completion'),
   (0.0060113417, 'rank_matrix'),
   (0.005963978, 'convergence_rate'),
   (0.0045467974, 'gradient_descent'),
   (0.0044015325, 'singular_value'),
   (0.004283363, 'step_size'),
   (0.0040224795, 'regularizati

## Technique 2 for Determining Optimal Number of Topics: Visualization

We use **pyLDAvis** to interpret the topics in a topic model that has been fit to a corpus of text data. 

It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [14]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary)

2020-02-27 19:08:32,828 : DEBUG : performing inference on a chunk of 403 documents
2020-02-27 19:08:33,291 : DEBUG : 403/403 documents converged within 400 iterations



## Interpretation of the Visualization 



- Left Panel: 
The left panel shows the labeled Intertopic Distance Map, in which circles represent the topics and the distances between circles reflect how similar the topics are. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus. An individual topic may be selected for closer scrutiny by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left.



- Right Panel:
It shows a bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the 30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics. Selecting a topic on the left changes the bar chart to show the most "relevant" terms for the selected topic. 

Relevance is defined as in footnote 2 of the visualization and can be tuned by the parameter $\lambda$.
- A smaller $\lambda$ gives higher weight to the term's distinctiveness for the topic.
- A larger $\lambda$ gives higher weight to the probability of the term within the topic.

Therefore, to get a better sense of terms per topic we use $\lambda = 0$.
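
The effect of $\lambda$ can be illustrated with a small sketch. It assumes the relevance definition from Sievert and Shirley (2014), $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$, which is the definition pyLDAvis is based on. The function names and the probabilities below are made up for illustration and are not taken from the fitted model.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term for a topic, assuming the Sievert & Shirley definition."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: "model" is frequent everywhere,
# "bandit" is rarer overall but concentrated in this topic.
terms = {
    "model": {"p_w_given_t": 0.020, "p_w": 0.020},
    "bandit": {"p_w_given_t": 0.007, "p_w": 0.002},
}

def rank_terms(terms, lam):
    """Sort term names by decreasing relevance for a given lambda."""
    return sorted(terms, key=lambda w: relevance(lam=lam, **terms[w]), reverse=True)

print(rank_terms(terms, lam=1.0))  # ranking by within-topic probability
print(rank_terms(terms, lam=0.0))  # ranking by distinctiveness (lift)
```

With $\lambda = 1$ the generally frequent term ranks first, while with $\lambda = 0$ the term that is distinctive for the topic ranks first.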

## Display the Top Words in the Topics

In [15]:
def get_lda_topics(model, num_topics, top_words):
    # Build one column per topic containing its most probable words.
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=top_words)
        word_dict['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [16]:
get_lda_topics(model, num_topics, 20)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04
0,convolutional,regret,matrix_completion,gaussian_process
1,fully_connected,bandit,rank_matrix,variational_inference
2,recurrent,policy,convergence_rate,markov_chain
3,deep_learning,active_learning,gradient_descent,posterior_distribution
4,recurrent_neural,submodular,singular_value,sampler
5,hidden_unit,reward,step_size,mcmc
6,hidden_layer,game,regularization_parameter,covariance_matrix
7,lstm,regret_bound,sample_complexity,graphical_model
8,generative_model,query,recovery,gibbs
9,proposal,item,strongly_convex,mixture_model


## Generate Labels for the Topics

We can manually generate human-interpretable labels for each topic by looking at the terms that appear more in each topic.


We use LdaModel's "show_topic" method that returns **Word-probability pairs** for the most relevant words generated by the topic.

In [17]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LDA model, a topic number, and the number of top terms (topn).
    Optionally print the topn terms with their probabilities, and return the terms.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:30} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

In [18]:
topic_summaries = []

print(u'{:25} {}'.format(u'term', u'frequency') + u'\n')

for i in range(num_topics):
    print('\nTopic '+str(i)+' |---------------------------\n')
    tmp = explore_topic(model, topic_number=i, topn=10, output=True)
    # Keep the five most probable terms as a short summary of the topic.
    topic_summaries += [tmp[:5]]

term                      frequency


Topic 0 |---------------------------

convolutional                  0.007
fully_connected                0.005
recurrent                      0.005
deep_learning                  0.005
recurrent_neural               0.005
hidden_unit                    0.005
hidden_layer                   0.005
lstm                           0.005
generative_model               0.005
proposal                       0.004

Topic 1 |---------------------------

regret                         0.012
bandit                         0.007
policy                         0.006
active_learning                0.006
submodular                     0.005
reward                         0.005
game                           0.005
regret_bound                   0.005
query                          0.005
item                           0.005

Topic 2 |---------------------------

matrix_completion              0.006
rank_matrix                    0.006
convergence_rate               0

## Manually Generate Topic Labels

Based on the most probable words generated by each topic, we assign human-interpretable labels for the topics.

In [19]:
# Labels follow the topic order printed above; topic indices can change between training runs.
top_labels = {0: 'Deep Learning', 1: 'Online Learning', 2: 'Numerical Analysis', 3: 'Statistics'}