<i> Note: This notebook is inspired by the Topic-Modeling-Latent-Dirichlet-Allocation series at: https://github.com/rhasanbd/Topic-Modeling-Latent-Dirichlet-Allocation </i>

## Latent Dirichlet Allocation - Implementation on Yelp dataset

In this notebook, we implement Latent Dirichlet Allocation (LDA) on Yelp review data to carry out topic modeling. We use the Gensim topic modeling API (https://radimrehurek.com/gensim/models/ldamodel.html). A Scikit-Learn implementation is also available; we use Gensim because it provides additional functionality, such as the Topic Coherence Pipeline and Dynamic Topic Modeling.

We build an **end-to-end Natural Language Processing (NLP) pipeline** that starts with raw data and runs through preparation, modeling, and visualization.
The steps that we carry out are the following:
1. Exploratory Data Analysis
2. Data Cleaning and Pre-processing
3. Topic modeling with LDA
4. Determine optimal number of Topics
5. Visualize topic model using pyLDAvis

### Yelp Review Dataset
The Yelp Review Dataset is a CSV file that contains a sub-sample of 10,000 reviews extracted from the Yelp dataset available at: https://www.yelp.com/dataset.

The review dataset contains the following fields:
- business_id : Unique identifier of business
- date : Date the review was posted (stored as M/D/YYYY in this CSV, e.g. 1/26/2011)
- review_id : Unique identifier of review
- stars : Star rating (1 to 5 stars)
- text : Review text
- user_id : Unique identifier of user who posted the review
- cool : Number of cool votes received
- useful : Number of useful votes received
- funny : Number of funny votes received

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

%pylab inline
import pandas as pd
import pickle as pk
from scipy import sparse as sp

import nltk
nltk.download('wordnet')

2020-03-17 21:25:41,307 : DEBUG : $HOME=C:\Users\rojin
2020-03-17 21:25:41,310 : DEBUG : CONFIGDIR=C:\Users\rojin\.matplotlib
2020-03-17 21:25:41,311 : DEBUG : matplotlib data path: c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data
2020-03-17 21:25:41,315 : DEBUG : loaded rc file c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data\matplotlibrc
2020-03-17 21:25:41,319 : DEBUG : matplotlib version 3.1.2
2020-03-17 21:25:41,320 : DEBUG : interactive is False
2020-03-17 21:25:41,320 : DEBUG : platform is win32


2020-03-17 21:25:41,378 : DEBUG : CACHEDIR=C:\Users\rojin\.matplotlib
2020-03-17 21:25:41,384 : DEBUG : Using fontManager instance from C:\Users\rojin\.matplotlib\fontlist-v310.json
2020-03-17 21:25:41,552 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-03-17 21:25:41,559 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-03-17 21:25:41,563 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.


Populating the interactive namespace from numpy and matplotlib


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load & Explore the Data

In [2]:
df = pd.read_csv('Data/yelp_academic_dataset_review_10000.csv')

df.head()

Unnamed: 0,business_id,date,review_id,stars,text,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,1/26/2011,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,7/27/2011,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,6/14/2012,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,5/27/2010,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,1/5/2012,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Description of the Data

DataFrame’s info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(5)
memory usage: 703.2+ KB


## Dimensions of the Data

Get the dimensions (number of rows and columns) of the data using DataFrame's shape attribute.

In [4]:
print("Dimension of the data: ", df.shape)

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (10000, 9)
No. of Rows: 10000
No. of Columns: 9


## Convert the DataFrame Column into a 1D Array of Documents

We convert the review text from the DataFrame object into an array of documents.

It is a 1D array in which each element represents one document (a review string).

In [5]:
docs_array = array(df['text'])

print("Dimension of the documents array: ", docs_array.shape)

# Display the first document
#print(docs_array[0])

Dimension of the documents array:  (10000,)


## Pre-process the Data

Pre-processing of the text data is done using the following steps:

- Convert to lowercase 
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove short words (three characters or fewer)
- Lemmatize the tokens/words


### Tokenization and Lemmatization

We convert all the text to lowercase and then tokenize each document using the NLTK regular-expression tokenizer class "RegexpTokenizer", which splits a given string into substrings using a regular expression. Then we remove numbers and very short words (three characters or fewer), since they usually carry little useful information and are very numerous. Finally, we lemmatize the tokens using WordNetLemmatizer from NLTK, which maps each token to its dictionary base form (lemma) using WordNet.
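The steps above can be tried on a single sentence before running them over the whole corpus. The following is a minimal sketch using only the standard library; the helper name `preprocess` and the sample sentence are illustrative assumptions, and the lemmatization step is deliberately left out. The length threshold mirrors the `len(token) > 3` test used in the notebook's function.

```python
import re

def preprocess(text, min_len=4):
    """Lowercase, tokenize on word characters, drop digit-only and short tokens."""
    # \w+ is the same pattern the notebook passes to RegexpTokenizer.
    tokens = re.findall(r"\w+", text.lower())
    # Drop tokens made only of digits; tokens that merely contain digits survive.
    tokens = [t for t in tokens if not t.isdigit()]
    # Keep tokens with at least min_len characters (same as len(token) > 3).
    return [t for t in tokens if len(t) >= min_len]

# Hypothetical sample sentence, not taken from the dataset.
sample = "My wife took me here on my birthday in 2011 for breakfast near Route66!"
print(preprocess(sample))
# ['wife', 'took', 'here', 'birthday', 'breakfast', 'near', 'route66']
```

Note that "2011" is removed as a pure number while "route66" is kept, and that three-letter words such as "for" are dropped along with the one- and two-letter ones.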

## Function to Convert the Document Array into a List of Tokenized Documents

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+') # Tokenize the words.
    
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove words of three characters or fewer.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words in documents.
    lemmatizer = WordNetLemmatizer()
    
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs

## Convert the Document Array into a List of Tokenized Documents

In [7]:
# Convert the document array into a list of tokenized documents
%time docs = docs_preprocessor(docs_array)

Wall time: 7.34 s


In [8]:
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

# Display the first two documents
print(docs[0:2])

Length of the 2D Array of Tokenized Documents:  10000
[['wife', 'took', 'here', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'ground', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrived', 'quickly', 'semi', 'busy', 'saturday', 'morning', 'looked', 'like', 'place', 'fill', 'pretty', 'quickly', 'earlier', 'here', 'better', 'yourself', 'favor', 'their', 'bloody', 'mary', 'phenomenal', 'simply', 'best', 'ever', 'pretty', 'sure', 'they', 'only', 'ingredient', 'from', 'their', 'garden', 'blend', 'them', 'fresh', 'when', 'order', 'amazing', 'while', 'everything', 'menu', 'look', 'excellent', 'white', 'truffle', 'scrambled', 'egg', 'vegetable', 'skillet', 'tasty', 'delicious', 'came', 'with', 'piece', 'their', 'griddled', 'bread', 'with', 'amazing', 'absolutely', 'made', 'meal', 'complete', 'best', 'toast', 'ever', 'anyway', 'wait', 'back'], ['have', 'idea', 'some', 'people', 'give', 'review', 'about',

## Compute Bigrams/Trigrams:

N-grams are sequences of 'n' adjacent words (or letters) in the source text. Some of these sequences carry a meaning of their own. For example, car-pool is an n-gram formed from the two words car and pool, and it carries a meaning that is distinct from that of the individual words. 

If n=2, it is called a Bigram and if n=3, it is called a Trigram.

We find all the combinations of Bigrams and Trigrams. Then, we keep only the frequent phrases. We finally add the frequent phrases to the original data, since we would like to keep the words “car” and “pool” as well as the bigram “car_pool”.
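To make "keep only the frequent phrases" concrete, here is a small standard-library sketch of the kind of score used to decide whether two adjacent words form a phrase. It follows the Mikolov-style formula `(count(a, b) - min_count) / (count(a) * count(b)) * vocab_size`, which is, to our understanding, what gensim's default scorer computes. The function name, the toy corpus, and the use of the unigram vocabulary size are simplifying assumptions, so the numbers will not match gensim exactly.

```python
from collections import Counter

def score_bigrams(docs, min_count=1, threshold=1.0):
    """Return {"a_b": score} for adjacent word pairs whose score exceeds threshold."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: unigram vocabulary only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pair is too rare to be considered a phrase
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            scores[a + "_" + b] = score
    return scores

# Toy corpus (hypothetical), already tokenized.
toy_docs = [
    ["car", "pool", "today"],
    ["car", "pool", "lane"],
    ["car", "pool", "again"],
    ["swimming", "pool", "closed"],
]
scores = score_bigrams(toy_docs, min_count=2, threshold=0.5)
print({phrase: round(value, 3) for phrase, value in scores.items()})
# {'car_pool': 0.583}
```

Only "car_pool" passes, because it is the only pair that occurs at least `min_count` times and whose words co-occur often relative to their individual frequencies.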

In [9]:
from gensim.models import Phrases

# Add bigrams and trigrams to docs. The bigram model only keeps pairs that appear 200 times or more;
# the trigram model uses the gensim default min_count (5).
bigram = Phrases(docs, min_count=200)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a trigram, add to document.
            docs[idx].append(token)

2020-03-17 21:25:51,086 : INFO : collecting all words and their counts
2020-03-17 21:25:51,086 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-03-17 21:25:52,996 : INFO : collected 410104 word types from a corpus of 733158 words (unigram + bigrams) and 10000 sentences
2020-03-17 21:25:52,997 : INFO : using 410104 counts as vocab in Phrases<0 vocab, min_count=200, threshold=10.0, max_vocab_size=40000000>
2020-03-17 21:25:52,998 : INFO : collecting all words and their counts
2020-03-17 21:25:53,000 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-03-17 21:25:59,084 : INFO : collected 412756 word types from a corpus of 726457 words (unigram + bigrams) and 10000 sentences
2020-03-17 21:25:59,085 : INFO : using 412756 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>


In [10]:
from gensim.corpora import Dictionary
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initial documents:', len(dictionary))

2020-03-17 21:26:07,458 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-03-17 21:26:09,175 : INFO : built Dictionary(26649 unique tokens: ['absolute', 'absolutely', 'amazing', 'anyway', 'arrived']...) from 10000 documents (total 803130 corpus positions)


Number of unique words in initial documents: 26649


In [11]:
for i in range (0 , 300):
    print(dictionary[i])

absolute
absolutely
amazing
anyway
arrived
back
best
best_ever
better
birthday
blend
bloody
bloody_mary
bread
breakfast
busy
came
complete
delicious
earlier
egg
ever
everything
excellent
favor
fill
food
fresh
from
garden
griddled
ground
here
ingredient
like
look
looked
looked_like
made
mary
meal
menu
morning
only
order
outside
overlooking
perfect
phenomenal
piece
place
pleasure
pretty
pretty_quickly
quickly
saturday
saturday_morning
scrambled
scrambled_egg
semi
simply
sitting
sitting_outside
skillet
sure
tasty
their
them
they
toast
took
truffle
vegetable
wait
waitress
weather
when
which
while
white
wife
with
yourself
yourself_favor
about
awesome
baked
because
beef
both
calzone
case
come
come_back
crowded
decided
doe
door
drink
else
evening
everyone
fault
forever
friend
girl
give
go
good
great
griping
have
home
host
huge
idea
issue
liked
many
many_people
more
more_than
once
part
past
people
personal
pizza
placed
placed_order
pleasant
please
price
probably
review
reviewer
said
sauce
seat

## Remove Rare and Common Tokens/Words

Now we remove infrequent words from our dictionary. We also remove words that appear in a large share of the documents.

In [12]:
# Filter out words that occur in fewer than 200 documents, or in more than 20% of the documents.
dictionary.filter_extremes(no_below=200, no_above=0.20) #100,60

print('Number of unique words after removing rare and common words:', len(dictionary))

2020-03-17 21:26:09,276 : INFO : discarding 26118 tokens: [('absolute', 55), ('back', 2326), ('best_ever', 62), ('birthday', 170), ('blend', 44), ('bloody', 46), ('bloody_mary', 37), ('complete', 99), ('earlier', 71), ('egg', 178)]...
2020-03-17 21:26:09,277 : INFO : keeping 531 tokens which were in no less than 200 and no more than 2000 (=20.0%) documents
2020-03-17 21:26:09,285 : DEBUG : rebuilding dictionary, shrinking gaps
2020-03-17 21:26:09,288 : INFO : resulting dictionary: Dictionary(531 unique tokens: ['absolutely', 'amazing', 'anyway', 'arrived', 'best']...)


Number of unique words after removing rare and common words: 531
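
The filtering rule reported in the log above ("no less than 200 and no more than 2000 (=20.0%) documents") can be reproduced on a toy corpus. This is a standard-library sketch of document-frequency filtering, not gensim's implementation; the function name and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below docs and at most no_above (a fraction) of docs."""
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = int(no_above * len(docs))  # convert the fraction to an absolute count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus (hypothetical).
toy_docs = [
    ["food", "pizza", "great"],
    ["food", "sushi", "great"],
    ["food", "pizza", "slow"],
    ["food", "beer"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
# ['great', 'pizza']
```

Here "food" is dropped because it appears in every document, while "sushi", "slow", and "beer" are dropped because each appears in only one.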


## Bag-of-Words Representation of Data


Finally, we transform the documents to a **vectorized form**. 

We simply compute the frequency of each word, including the bigrams.

In [13]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 531
Number of documents: 10000
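
To show what a bag-of-words vector looks like, here is a standard-library sketch that turns a tokenized document into sorted `(token_id, count)` pairs, which is the shape of output that `doc2bow` produces. The helpers `build_token2id` and `to_bow` are hypothetical names, and ids are assigned in first-seen order here, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Toy corpus (hypothetical).
toy_docs = [["pizza", "cheese", "pizza"], ["beer", "pizza"]]
token2id = build_token2id(toy_docs)
print(token2id)
# {'pizza': 0, 'cheese': 1, 'beer': 2}
print(to_bow(["pizza", "beer", "pizza", "unknown"], token2id))
# [(0, 2), (2, 1)]
```

Tokens that were filtered out of the dictionary (like "unknown" here) simply do not appear in the vector, which is why the dictionary filtering step directly shapes the corpus the model sees.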


In [14]:
for i in range (0 , 500):
    print(dictionary[i])

absolutely
amazing
anyway
arrived
best
better
bread
breakfast
busy
came
delicious
ever
everything
excellent
fresh
ingredient
look
looked
made
meal
menu
morning
order
outside
perfect
piece
pretty
quickly
saturday
sitting
sure
tasty
them
took
wait
waitress
while
white
wife
awesome
because
beef
both
come
come_back
decided
doe
door
drink
else
evening
everyone
friend
girl
give
home
huge
idea
liked
many
more_than
once
part
past
people
pizza
please
price
probably
review
said
sauce
seat
seated
seating
server
show
small
someone
something
sunday
take
than
these
thing
thought
waiter
wanted
well
also
love
plate
rice
selection
area
clean
find
located
over
pick
scottsdale
wonderful
always
customer
into
life
manager
staff
surprised
thanks
totally
walk
your
after
almost
another
beautiful
before
bill
bring
butter
cake
chef
couple
day
definitely
dessert
didn
enough
entree
even
feeling
five
full
glass
impressed
inside
kitchen
know
lady
later
live
long
maybe
meat
minute
much
offer
ordered
pork
problem
qui

## Training the LDA Model

We use the gensim.models.LdaModel class for performing LDA. [https://radimrehurek.com/gensim/models/ldamodel.html]


#### Below we discuss the setting of some of the key parameters.

- num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

 
LDA is an unsupervised technique, meaning that we don't know prior to running the model how many topics exist in our corpus. It depends on the data and the application. We may use the following two techniques to determine the number of topics.


        Technique 1: Topic Coherence 
The main technique to determine the number of topics is **Topic coherence** [http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf]


        Technique 2: Visualizing Inter-Topic Distance 
Use the LDA visualization tool pyLDAvis to observe the Intertopic Distance Map (discussed later). By varying the number of topics, we can pick a suitable value from the visualization.

We **use both techniques** to determine the optimal number of topics.

- chunksize (int, optional) – Number of documents to be used in each training chunk.

It controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. 

We set chunksize = 10000, which equals the number of documents. Thus, each pass processes all the data in one chunk. Note that chunksize can influence the quality of the model.

- passes (int, optional) – Number of passes through the corpus during training.

It controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. 
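
As a rough guide to how chunksize and passes interact, the number of model updates can be estimated with simple arithmetic. The sketch below assumes the model is updated once after every chunk (which we believe is the behaviour with gensim's default update_every = 1); the function name is hypothetical.

```python
import math

def count_model_updates(num_docs, chunksize, passes):
    """Estimate how many times the topics are updated during training."""
    chunks_per_pass = math.ceil(num_docs / chunksize)  # last chunk may be smaller
    return chunks_per_pass * passes

# Settings used in the training cell: 10000 documents, chunksize 10000, 20 passes.
print(count_model_updates(10000, 10000, 20))
# 20
# A smaller chunksize gives many more, smaller updates for the same number of passes.
print(count_model_updates(10000, 500, 20))
# 400
```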

- iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

It is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. 

        It is important to set the number of “passes” and “iterations” high enough.
        

#### How to Set "passes" and "iterations":

First, enable logging and set eval_every = 1 in LdaModel. (This can slow training down, so in the training cell below we use eval_every = None.)

When training the model look for a line in the log that looks something like this:

        2020-02-25 19:07:04,716 : DEBUG : 49/403 documents converged within 400 iterations

If we set passes = 20, we will see this line 20 times. 

### Important: We need to make sure that by the final passes, most of the documents have converged. 

For example, if passes = 20 and iterations = 400, then, we should see something like following:


        2020-02-25 19:07:18,041 : INFO : PROGRESS: pass 19, at document #403/403
        2020-02-25 19:07:18,042 : DEBUG : performing inference on a chunk of 403 documents
        2020-02-25 19:07:18,627 : DEBUG : 402/403 documents converged within 400 iterations

Thus, we want to choose both passes and iterations to be high enough for this to happen.
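
Reading these lines by eye becomes tedious for long logs. The following standard-library sketch extracts the convergence ratio from every matching log line; the regular expression is written against the log format shown above, and the function name is an assumption for illustration.

```python
import re

# Matches e.g. "402/403 documents converged within 400 iterations".
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")

def convergence_ratios(log_text):
    """Return the fraction of converged documents for each matching log line, in order."""
    ratios = []
    for match in CONVERGED.finditer(log_text):
        converged, total = int(match.group(1)), int(match.group(2))
        ratios.append(converged / total)
    return ratios

# The two example log lines quoted above.
sample_log = """\
2020-02-25 19:07:04,716 : DEBUG : 49/403 documents converged within 400 iterations
2020-02-25 19:07:18,627 : DEBUG : 402/403 documents converged within 400 iterations
"""
ratios = convergence_ratios(sample_log)
print([round(r, 3) for r in ratios])
# [0.122, 0.998]
```

If the last ratio is still well below 1.0, that is a hint to increase passes, iterations, or both.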


- eval_every (int, optional) – Log perplexity is estimated once every eval_every updates. Setting this to 1 slows down training by ~2x.


- alpha ({numpy.ndarray, str}, optional): Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief about each topic's probability. 

Alternatively, a default prior-selection strategy can be chosen by supplying a string:

        'asymmetric': Uses a fixed normalized asymmetric prior of 1.0 / (topic_index + sqrt(num_topics)).

        'auto': Learns an asymmetric prior from the corpus (not available if distributed==True).
        
        
- eta ({float, np.array, str}, optional) – A-priori belief on word probability.

It can be:

        scalar for a symmetric prior over topic/word probability,

        vector of length num_words to denote an asymmetric user defined probability for each word,

        matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,

        the string ‘auto’ to learn the asymmetric prior from the data.


We set alpha = 'auto' and eta = 'auto'. Again this is somewhat technical, but essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.
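
To give a feeling for what these priors look like numerically, the sketch below computes a symmetric prior and a fixed asymmetric prior for 14 topics. The asymmetric formula, `1 / (topic_index + sqrt(num_topics))` followed by normalization, reflects our understanding of gensim's 'asymmetric' option and should be treated as an assumption; the function names are hypothetical. With 'auto', training starts from the symmetric value and then adjusts it, as the "using autotuned alpha, starting with [0.0714..., ...]" log line in the training output indicates.

```python
import math

def symmetric_prior(num_topics):
    """Every topic gets the same prior weight 1 / num_topics."""
    return [1.0 / num_topics] * num_topics

def asymmetric_prior(num_topics):
    """Earlier topics get more weight; values are normalized to sum to 1."""
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]

sym = symmetric_prior(14)
asym = asymmetric_prior(14)
print(round(sym[0], 6))
# 0.071429
print(asym[0] > asym[-1])
# True
```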

In [23]:
from gensim.models import LdaModel

# Set training parameters.
num_topics = 14
chunksize = 10000  # Number of documents per training chunk (here: the whole corpus).
passes = 20  # Number of passes through the corpus.
iterations = 500  # Maximum number of iterations per document when inferring its topic distribution.
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

2020-03-17 21:32:58,000 : INFO : using autotuned alpha, starting with [0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575]
2020-03-17 21:32:58,001 : INFO : using serial LDA version on this node
2020-03-17 21:32:58,003 : INFO : running online (multi-pass) LDA training, 14 topics, 20 passes over the supplied corpus of 10000 documents, updating model once every 10000 documents, evaluating perplexity every 0 documents, iterating 500x with a convergence threshold of 0.001000
2020-03-17 21:32:58,005 : INFO : PROGRESS: pass 0, at document #10000/10000
2020-03-17 21:32:58,006 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-17 21:33:14,903 : DEBUG : 9797/10000 documents converged within 500 iterations
2020-03-17 21:33:14,962 : INFO : optimized alpha [0.060104355, 0.057134457, 0.059176445, 0.057408243, 0.05710619, 0.05619271, 0.0606529, 0.05568631

2020-03-17 21:34:04,578 : INFO : topic #6 (0.048): 0.022*"your" + 0.018*"highly_recommend" + 0.013*"dish" + 0.010*"make" + 0.010*"flavor" + 0.010*"little" + 0.010*"recommend" + 0.009*"best" + 0.009*"know" + 0.008*"chicken"
2020-03-17 21:34:04,580 : INFO : topic #5 (0.051): 0.019*"always" + 0.019*"love" + 0.016*"staff" + 0.015*"friendly" + 0.014*"drink" + 0.012*"location" + 0.010*"beer" + 0.009*"never" + 0.009*"night" + 0.009*"people"
2020-03-17 21:34:04,581 : INFO : topic #0 (0.055): 0.027*"pizza" + 0.017*"salad" + 0.013*"sauce" + 0.013*"sandwich" + 0.012*"also" + 0.012*"delicious" + 0.011*"cheese" + 0.011*"chicken" + 0.010*"fresh" + 0.009*"lunch"
2020-03-17 21:34:04,582 : INFO : topic diff=0.094048, rho=0.408248
2020-03-17 21:34:04,587 : INFO : PROGRESS: pass 5, at document #10000/10000
2020-03-17 21:34:04,588 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-17 21:34:15,426 : DEBUG : 9995/10000 documents converged within 500 iterations
2020-03-17 21:34:15,482 : INF

2020-03-17 21:34:53,034 : INFO : topic #9 (0.046): 0.023*"store" + 0.023*"your" + 0.015*"nice" + 0.014*"also" + 0.012*"price" + 0.011*"room" + 0.011*"find" + 0.010*"little" + 0.010*"area" + 0.010*"shop"
2020-03-17 21:34:53,035 : INFO : topic #5 (0.052): 0.024*"love" + 0.021*"always" + 0.017*"staff" + 0.017*"drink" + 0.017*"friendly" + 0.015*"beer" + 0.014*"location" + 0.012*"night" + 0.011*"music" + 0.009*"people"
2020-03-17 21:34:53,036 : INFO : topic #0 (0.055): 0.034*"pizza" + 0.022*"salad" + 0.018*"sandwich" + 0.014*"sauce" + 0.014*"cheese" + 0.013*"chicken" + 0.012*"also" + 0.012*"delicious" + 0.011*"sweet" + 0.011*"lunch"
2020-03-17 21:34:53,037 : INFO : topic diff=0.074428, rho=0.301511
2020-03-17 21:34:53,040 : INFO : PROGRESS: pass 10, at document #10000/10000
2020-03-17 21:34:53,041 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-17 21:35:01,928 : DEBUG : 9998/10000 documents converged within 500 iterations
2020-03-17 21:35:01,984 : INFO : optimized alpha

2020-03-17 21:35:36,815 : INFO : topic #9 (0.047): 0.025*"store" + 0.023*"your" + 0.015*"nice" + 0.015*"also" + 0.014*"room" + 0.012*"price" + 0.012*"find" + 0.011*"shop" + 0.010*"area" + 0.010*"little"
2020-03-17 21:35:36,816 : INFO : topic #5 (0.053): 0.027*"love" + 0.022*"always" + 0.019*"beer" + 0.019*"drink" + 0.017*"staff" + 0.016*"friendly" + 0.016*"location" + 0.014*"night" + 0.013*"music" + 0.010*"nice"
2020-03-17 21:35:36,817 : INFO : topic #0 (0.056): 0.038*"pizza" + 0.025*"salad" + 0.024*"sandwich" + 0.017*"cheese" + 0.015*"sauce" + 0.013*"chicken" + 0.012*"also" + 0.012*"delicious" + 0.012*"sweet" + 0.012*"lunch"
2020-03-17 21:35:36,818 : INFO : topic diff=0.079000, rho=0.250000
2020-03-17 21:35:36,821 : INFO : PROGRESS: pass 15, at document #10000/10000
2020-03-17 21:35:36,822 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-17 21:35:45,371 : DEBUG : 10000/10000 documents converged within 500 iterations
2020-03-17 21:35:45,430 : INFO : optimized alpha 

2020-03-17 21:36:19,216 : INFO : topic #9 (0.048): 0.026*"store" + 0.023*"your" + 0.016*"room" + 0.015*"nice" + 0.015*"also" + 0.013*"find" + 0.012*"price" + 0.012*"shop" + 0.010*"area" + 0.010*"little"
2020-03-17 21:36:19,217 : INFO : topic #5 (0.054): 0.029*"love" + 0.022*"beer" + 0.021*"always" + 0.021*"drink" + 0.017*"location" + 0.016*"staff" + 0.016*"night" + 0.014*"friendly" + 0.014*"music" + 0.011*"nice"
2020-03-17 21:36:19,218 : INFO : topic #0 (0.056): 0.041*"pizza" + 0.028*"sandwich" + 0.028*"salad" + 0.019*"cheese" + 0.015*"sauce" + 0.013*"chicken" + 0.013*"also" + 0.012*"lunch" + 0.012*"sweet" + 0.012*"delicious"
2020-03-17 21:36:19,220 : INFO : topic diff=0.082916, rho=0.218218


Wall time: 3min 21s


## Technique 1 for Determining Optimal Number of Topics: Topic Coherence

Topic Coherence is a measure used to evaluate topic models. Each generated topic consists of words, and the topic coherence is computed over the top N words of the topic. 

Topic Coherence measures score a single topic by **measuring the degree of semantic similarity between high scoring words in the topic**. These measurements help distinguish between semantically interpretable topics and topics that are artifacts of statistical inference. 

A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.

Topic Coherence is defined as the average of the pairwise word-similarity scores of the words in the topic.

A good model will generate coherent topics, i.e., topics with high topic coherence scores. Good topics are topics that can be described by a short label, therefore this is what the topic coherence measure should capture.


Below we
- display the average topic coherence and
- print the topics in order of topic coherence


We use LdaModel's "top_topics" method, which returns the topics sorted by coherence, together with the coherence score of each topic.

Note that we use the “UMass” topic coherence measure here (see gensim.models.ldamodel.LdaModel.top_topics()).
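
To illustrate the idea behind a UMass-style score, here is a standard-library sketch that averages `log((D(w_i, w_j) + epsilon) / D(w_j))` over ordered pairs of a topic's top words, where `D` counts the documents containing the given words. The smoothing constant `epsilon = 1.0` and the toy documents are assumptions for illustration, and gensim's implementation may differ in such details, so the values are not comparable with the notebook's output. The sketch also assumes every top word occurs in at least one document.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs, epsilon=1.0):
    """Average UMass-style pairwise score for a topic's top words (most probable first)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # pairs with j < i
        w_j, w_i = top_words[j], top_words[i]
        scores.append(math.log((doc_count(w_i, w_j) + epsilon) / doc_count(w_j)))
    return sum(scores) / len(scores)

# Toy corpus (hypothetical).
toy_docs = [
    ["pizza", "cheese", "sauce"],
    ["pizza", "cheese"],
    ["pizza", "beer"],
    ["beer", "music"],
]
print(round(umass_coherence(["pizza", "cheese", "beer"], toy_docs), 4))
# -0.3662
print(round(umass_coherence(["pizza", "cheese"], toy_docs), 4))
# 0.0
```

Words that tend to appear in the same documents ("pizza" and "cheese") push the score towards zero, while words that rarely co-occur pull it further negative, which is why less negative averages are read as more coherent topics.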

In [24]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2020-03-17 21:36:19,229 : DEBUG : Setting topics to those of the model: LdaModel(num_terms=531, num_topics=14, decay=0.5, chunksize=10000)
2020-03-17 21:36:19,258 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2020-03-17 21:36:19,281 : INFO : CorpusAccumulator accumulated stats from 2000 documents
2020-03-17 21:36:19,300 : INFO : CorpusAccumulator accumulated stats from 3000 documents
2020-03-17 21:36:19,320 : INFO : CorpusAccumulator accumulated stats from 4000 documents
2020-03-17 21:36:19,341 : INFO : CorpusAccumulator accumulated stats from 5000 documents
2020-03-17 21:36:19,364 : INFO : CorpusAccumulator accumulated stats from 6000 documents
2020-03-17 21:36:19,388 : INFO : CorpusAccumulator accumulated stats from 7000 documents
2020-03-17 21:36:19,410 : INFO : CorpusAccumulator accumulated stats from 8000 documents
2020-03-17 21:36:19,433 : INFO : CorpusAccumulator accumulated stats from 9000 documents
2020-03-17 21:36:19,456 : INFO : CorpusAccumulator accumulat

Average topic coherence: -1.8491.
[([(0.022851782, 'customer_service'),
   (0.018153325, 'customer'),
   (0.017723894, 'them'),
   (0.012956987, 'then'),
   (0.012516752, 'after'),
   (0.012309641, 'told'),
   (0.011260493, 'said'),
   (0.010750193, 'minute'),
   (0.010196096, 'because'),
   (0.009698457, 'asked'),
   (0.009417125, 'your'),
   (0.009396388, 'didn'),
   (0.009390397, 'could'),
   (0.008917284, 'went'),
   (0.0085692, 'make'),
   (0.0085212635, 'order'),
   (0.00840338, 'never'),
   (0.008251017, 'another'),
   (0.008216825, 'over'),
   (0.008032461, 'people')],
  -1.5326407886621778),
 ([(0.03609862, 'even_though'),
   (0.02889225, 'even'),
   (0.02646996, 'coffee'),
   (0.022566702, 'though'),
   (0.01795907, 'drink'),
   (0.013913874, 'didn'),
   (0.009937388, 'night'),
   (0.0097421575, 'table'),
   (0.009143943, 'went'),
   (0.008727885, 'nice'),
   (0.008644028, 'come'),
   (0.008509229, 'because'),
   (0.008310297, 'before'),
   (0.0081869345, 'after'),
   (0.0078

## Technique 2 for Determining Optimal Number of Topics: Visualization

We use **pyLDAvis** to interpret the topics in a topic model that has been fit to a corpus of text data. 

It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [None]:
import pyLDAvis.gensim  # Note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary)

2020-03-17 21:36:19,698 : DEBUG : performing inference on a chunk of 10000 documents



## Interpretation of the Visualization 



- Left Panel: 
The labeled Intertopic Distance Map: circles represent the topics, and the distances between circles represent the distances between topics. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus. An individual topic may be selected for closer scrutiny by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left.



- Right Panel:
It includes the bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics. Selecting a topic on the left modifies the bar chart to show the most "relevant" terms for the selected topic. 

Relevance is defined as in footnote 2 of the visualization and can be tuned by the parameter $\lambda$.
- Smaller $\lambda$ gives higher weight to the term's distinctiveness.
- Larger $\lambda$ gives higher weight to the probability of the term within the topic.

Therefore, to emphasize the terms that are most distinctive for each topic, we use $\lambda = 0$.
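
The effect of $\lambda$ can be checked with a few numbers. The sketch below uses the relevance definition from Sievert and Shirley (2014), $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$; the function name and the two example terms with their probabilities are made up for illustration.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term for a topic; lam=1 ranks by probability, lam=0 by lift."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1.0 - lam) * math.log(lift)

# Hypothetical terms: (p(term | topic), p(term) in the whole corpus).
common = (0.05, 0.04)        # frequent everywhere, only slightly more so in this topic
distinctive = (0.01, 0.001)  # rarer, but ten times more likely inside this topic

for lam in (1.0, 0.0):
    print(lam, round(relevance(*common, lam), 3), round(relevance(*distinctive, lam), 3))
# 1.0 -2.996 -4.605
# 0.0 0.223 2.303
```

With $\lambda = 1$ the common term ranks first, and with $\lambda = 0$ the distinctive term does, which matches the two bullet points above.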

## Display the Top Words in the Topics

In [None]:
def get_lda_topics(model, num_topics, top_words):
    """Return a DataFrame with one column per topic holding its top words."""
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=top_words)
        word_dict['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [None]:
get_lda_topics(model, num_topics, 20)

## Generate Labels for the Topics

We can manually generate human-interpretable labels for each topic by looking at the terms that appear more in each topic.


We use LdaModel's "show_topic" method that returns **Word-probability pairs** for the most relevant words generated by the topic.

In [None]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LdaModel, a topic number, and the number of top terms (topn).
    Print a formatted list of the topn terms with their probabilities and return the terms.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:30} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

In [None]:
topic_summaries = []

# Header; the column width matches the row format used in explore_topic.
print(u'{:30} {}'.format(u'term', u'frequency') + u'\n')

for i in range(num_topics):
    print('\nTopic ' + str(i) + ' |---------------------------\n')
    tmp = explore_topic(model, topic_number=i, topn=10, output=True)
    topic_summaries += [tmp[:5]]  # Keep the five most probable terms of each topic.
    print()

## Manually Generate Topic Labels

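Based on the most probable words generated by each topic, we assign human-interpretable labels to the topics. The dictionary below holds nine labels (topic ids 0 to 8); with num_topics = 14 in the training cell above, topics 9 to 13 still need labels.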

In [22]:
top_labels = {
    0: 'Asian Cuisine',
    1: 'Mall',
    2: 'First Visit',
    3: 'Customer Service',
    4: 'Comparison',
    5: 'Store',
    6: 'Pizza',
    7: 'Night',
    8: 'Happy Hour',
}
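
When a label dictionary does not cover every topic id, a lookup with a fallback avoids a KeyError while browsing topics. This is a minimal sketch; the helper name `label_for` and the fallback text are assumptions, and only three of the labels above are copied here to keep the example short.

```python
def label_for(topic_id, labels):
    """Return the manual label for a topic, or a placeholder if none was assigned."""
    return labels.get(topic_id, 'Topic {:02d} (unlabeled)'.format(topic_id))

# Subset of the manual labels, used only for this illustration.
example_labels = {0: 'Asian Cuisine', 3: 'Customer Service', 6: 'Pizza'}

print(label_for(3, example_labels))
# Customer Service
print(label_for(12, example_labels))
# Topic 12 (unlabeled)
```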