<i> Note: This notebook is inspired by the Topic-Modeling-Latent-Dirichlet-Allocation series at: https://github.com/rhasanbd/Topic-Modeling-Latent-Dirichlet-Allocation </i>

## Latent Dirichlet Allocation - Implementation on Yelp dataset

In this notebook, we apply Latent Dirichlet Allocation (LDA) to the Yelp reviews data to carry out topic modeling. We use the Gensim topic modeling API (https://radimrehurek.com/gensim/models/ldamodel.html). A Scikit-Learn implementation is also available; we use Gensim because it provides additional functionality, such as the Topic Coherence pipeline and Dynamic Topic Modeling.

We build an **end-to-end Natural Language Processing (NLP) pipeline** that starts with raw data and runs through preparation, modeling, and visualization.
The steps that we carry out are:
1. Exploratory Data Analysis
2. Data Cleaning and Pre-processing
3. Topic modeling with LDA
4. Determine optimal number of Topics
5. Visualize topic model using pyLDAvis

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

%pylab inline
import pandas as pd
import pickle as pk
from scipy import sparse as sp

import nltk
nltk.download('wordnet')

2020-03-18 19:28:32,787 : DEBUG : $HOME=C:\Users\rojin
2020-03-18 19:28:32,792 : DEBUG : CONFIGDIR=C:\Users\rojin\.matplotlib
2020-03-18 19:28:32,794 : DEBUG : matplotlib data path: c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data
2020-03-18 19:28:32,817 : DEBUG : loaded rc file c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data\matplotlibrc
2020-03-18 19:28:32,826 : DEBUG : matplotlib version 3.1.2
2020-03-18 19:28:32,828 : DEBUG : interactive is False
2020-03-18 19:28:32,830 : DEBUG : platform is win32


2020-03-18 19:28:32,975 : DEBUG : CACHEDIR=C:\Users\rojin\.matplotlib
2020-03-18 19:28:32,988 : DEBUG : Using fontManager instance from C:\Users\rojin\.matplotlib\fontlist-v310.json
2020-03-18 19:28:33,454 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-03-18 19:28:33,472 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-03-18 19:28:33,478 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.


Populating the interactive namespace from numpy and matplotlib


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load & Explore the Data

In [2]:
df = pd.read_csv('Data/yelp_academic_dataset_review_10000.csv') # Read data into pandas dataframe

df.head() # Quick check of the data samples

Unnamed: 0,business_id,date,review_id,stars,text,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,1/26/2011,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,7/27/2011,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,6/14/2012,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,5/27/2010,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,1/5/2012,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
df.info() # View data description (Total rows, Column names, type and number of non-null values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(5)
memory usage: 703.2+ KB


In [4]:
print("Dimension of the data: ", df.shape) # View data dimension

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (10000, 9)
No. of Rows: 10000
No. of Columns: 9


## Convert the Text Column into an Array of Documents

- We convert the documents in the text column into an array of documents.

- It is a 1D array in which each element represents a document.

In [5]:
docs_array = array(df['text']) # Convert the 'text' column into array

print("Dimension of the documents array: ", docs_array.shape) # View dimensions of new array

#print(docs_array[0]) # Display the first document

Dimension of the documents array:  (10000,)


## Pre-process the Data

Pre-processing of the text data is done using the following steps:

- Convert to lowercase 
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove short words (three characters or fewer)
- Lemmatize the tokens/words


### Tokenization and Lemmatization

- We convert each document to lowercase, then split it into tokens using the NLTK Regular-Expression Tokenizer class "RegexpTokenizer".
- The tokenizer splits a given string into substrings using a regular expression.
- Then we remove numbers and short words (three characters or fewer), since they usually carry little useful information and occur very often.
- Finally, we lemmatize the tokens using WordNetLemmatizer from NLTK, which reduces each token to its dictionary base form (lemma).

In [6]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    '''Convert an array of raw documents into a list of token lists (one list per document)'''
    tokenizer = RegexpTokenizer(r'\w+') # Tokenizer that extracts runs of word characters
    
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert doc to lowercase
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split doc into words

    # Remove numbers, but not words that contain numbers
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove short words (three characters or fewer)
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs
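
To make the filtering steps concrete, here is a minimal standard-library sketch applied to a made-up sentence. It assumes that `re.findall(r'\w+', ...)` splits text in the same way as `RegexpTokenizer(r'\w+')`, and it skips lemmatization because that step needs the WordNet data. The name `preprocess_sketch` and the sample sentence are illustrative only and are not part of the notebook.

```python
import re

def preprocess_sketch(text, min_len=4):
    """Lowercase, tokenize, drop pure numbers, drop short tokens (no lemmatization)."""
    tokens = re.findall(r"\w+", text.lower())          # runs of word characters
    tokens = [t for t in tokens if not t.isdigit()]    # "2" is dropped, "route66" is kept
    tokens = [t for t in tokens if len(t) >= min_len]  # same effect as len(token) > 3
    return tokens

sample = "We ordered 2 pizzas near route66 and the crust was great"
print(preprocess_sketch(sample))
```

Note that the length filter removes many common stop words ("we", "and", "the") before the dedicated stop-word step runs.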

- Now we convert the array of documents into a 2D array of tokenized words using the function above.

In [7]:
%time docs = docs_preprocessor(docs_array)
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

Wall time: 10.6 s
Length of the 2D Array of Tokenized Documents:  10000


In [8]:
print(docs[0:2]) #Display the first two documents

[['wife', 'took', 'here', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'ground', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrived', 'quickly', 'semi', 'busy', 'saturday', 'morning', 'looked', 'like', 'place', 'fill', 'pretty', 'quickly', 'earlier', 'here', 'better', 'yourself', 'favor', 'their', 'bloody', 'mary', 'phenomenal', 'simply', 'best', 'ever', 'pretty', 'sure', 'they', 'only', 'ingredient', 'from', 'their', 'garden', 'blend', 'them', 'fresh', 'when', 'order', 'amazing', 'while', 'everything', 'menu', 'look', 'excellent', 'white', 'truffle', 'scrambled', 'egg', 'vegetable', 'skillet', 'tasty', 'delicious', 'came', 'with', 'piece', 'their', 'griddled', 'bread', 'with', 'amazing', 'absolutely', 'made', 'meal', 'complete', 'best', 'toast', 'ever', 'anyway', 'wait', 'back'], ['have', 'idea', 'some', 'people', 'give', 'review', 'about', 'this', 'place', 'go', 'show', 'please', 'everyone', 

## Remove all stop words

- Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text. 
- The stop words may be removed to avoid them being construed as signal for prediction.
- To remove the stop words, we use the "stopwords" module from the nltk library.

In [9]:
# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

# Load stop words
stop_words = stopwords.words('english')

# Show stop words
stop_words[:5]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we']

In [10]:
# Remove all stop words from the doc
for i in range(len(docs)):
    docs[i] = [word for word in docs[i] if word not in stop_words]
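
As a small illustration of the filtering step, the sketch below removes stop words from two toy documents. The stop-word list and documents here are made up for the example; the only design change from the cell above is that the stop words are held in a `set`, which makes each membership test constant-time instead of a scan through a list.

```python
# Hypothetical stop words and documents, for illustration only
toy_stop_words = {"this", "that", "with", "have", "from"}
toy_docs = [
    ["this", "place", "have", "great", "pizza"],
    ["coffee", "from", "that", "cafe"],
]

# Keep only the tokens that are not in the stop-word set
filtered = [[w for w in doc if w not in toy_stop_words] for doc in toy_docs]
print(filtered)
```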

## Compute Bigrams/Trigrams:

- N-grams are sequences of 'n' adjacent words or letters in the source text. Some of these sequences carry a meaning of their own. For example, car-pool is an n-gram formed from the two words car and pool, and its meaning is distinct from that of the individual words.

- If n=2, it is called a Bigram and if n=3, it is called a Trigram.

- We find all the combinations of Bigrams and Trigrams. Then, we keep only the frequent phrases. 
- We finally add the frequent phrases to the original data, since we would like to keep the words “car” and “pool” as well as the bigram “car_pool”.
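
The idea of "keeping only the frequent phrases" can be sketched with the standard library as follows. This is a simplified illustration, not how Gensim's `Phrases` works internally: `Phrases` also applies a scoring threshold (the log below shows `threshold=10.0`), which this sketch leaves out. The function name `frequent_bigrams` and the toy documents are assumptions made for the example.

```python
from collections import Counter

def frequent_bigrams(docs, min_count):
    """Return adjacent word pairs that occur at least min_count times, joined by '_'."""
    pair_counts = Counter()
    for doc in docs:
        pair_counts.update(zip(doc, doc[1:]))  # all adjacent pairs in the document
    return {f"{a}_{b}" for (a, b), n in pair_counts.items() if n >= min_count}

toy_docs = [
    ["happy", "hour", "drink"],
    ["great", "happy", "hour"],
    ["happy", "hour", "menu"],
]
print(frequent_bigrams(toy_docs, min_count=3))
```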

In [11]:
from gensim.models import Phrases

bigram = Phrases(docs, min_count=200) # Add bigrams (if they appear 200 times or more)
trigram = Phrases(bigram[docs], min_count=200) # Add trigrams (if they appear 200 times or more)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a bigram, add to document
    for token in trigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a trigram, add to document

2020-03-18 19:28:58,361 : INFO : collecting all words and their counts
2020-03-18 19:28:58,363 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-03-18 19:29:00,460 : INFO : collected 393383 word types from a corpus of 581861 words (unigram + bigrams) and 10000 sentences
2020-03-18 19:29:00,461 : INFO : using 393383 counts as vocab in Phrases<0 vocab, min_count=200, threshold=10.0, max_vocab_size=40000000>
2020-03-18 19:29:00,462 : INFO : collecting all words and their counts
2020-03-18 19:29:00,473 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-03-18 19:29:08,361 : INFO : collected 394023 word types from a corpus of 580378 words (unigram + bigrams) and 10000 sentences
2020-03-18 19:29:08,363 : INFO : using 394023 counts as vocab in Phrases<0 vocab, min_count=200, threshold=10.0, max_vocab_size=40000000>


In [12]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs) # Create a dictionary representation of the documents
print('Number of unique words in initial documents:', len(dictionary))

2020-03-18 19:29:19,551 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-03-18 19:29:21,489 : INFO : built Dictionary(24215 unique tokens: ['absolute', 'absolutely', 'amazing', 'anyway', 'arrived']...) from 10000 documents (total 586852 corpus positions)


Number of unique words in initial documents: 24215


## Remove Rare and Common Tokens/Words

- We remove infrequent words from our dictionary. 
- We also remove words that appear in a large share of the documents.
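
A minimal sketch of this kind of document-frequency filter, using only the standard library, is shown below. It assumes the rule reported in the log output of the next cell: keep a token if it appears in no fewer than `no_below` documents and in no more than `no_above` (as a fraction) of all documents. The function name `filter_by_doc_freq` and the toy corpus are illustrative, not part of Gensim.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens whose document frequency lies in [no_below, no_above * len(docs)]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))            # count each token once per document
    max_docs = int(no_above * len(docs))     # fraction converted to an absolute count
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["food", "pizza"],
    ["food", "pizza", "rare"],
    ["food", "sushi"],
    ["food", "sushi"],
    ["food"],
]
# "food" is too common (5 of 5 documents), "rare" is too rare (1 document)
print(sorted(filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)))
```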

In [13]:
# Filter out words that occur in fewer than 200 documents, or in more than 20% of the documents
dictionary.filter_extremes(no_below=200, no_above=0.20) 

print('Number of unique words after removing rare and common words:', len(dictionary))

2020-03-18 19:29:21,611 : INFO : discarding 23736 tokens: [('absolute', 55), ('back', 2326), ('birthday', 170), ('blend', 44), ('bloody', 46), ('complete', 99), ('earlier', 71), ('egg', 178), ('favor', 51), ('fill', 107)]...
2020-03-18 19:29:21,613 : INFO : keeping 479 tokens which were in no less than 200 and no more than 2000 (=20.0%) documents
2020-03-18 19:29:21,629 : DEBUG : rebuilding dictionary, shrinking gaps
2020-03-18 19:29:21,633 : INFO : resulting dictionary: Dictionary(479 unique tokens: ['absolutely', 'amazing', 'anyway', 'arrived', 'best']...)


Number of unique words after removing rare and common words: 479


## Bag-of-Words Representation of Data


- We transform the documents to a **vectorized form**. 

- We simply compute the frequency of each word, including the bigrams/trigrams.
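
To show what a bag-of-words vector looks like, here is a small standard-library sketch. It mimics the shape of the `doc2bow` result, a sorted list of `(token_id, count)` pairs with out-of-dictionary tokens ignored. The mapping `token2id` and the document are made up for the example.

```python
from collections import Counter

# Hypothetical dictionary (token -> integer id) and a toy tokenized document
token2id = {"burger": 0, "fry": 1, "pizza": 2}
doc = ["burger", "fry", "burger", "salad"]   # "salad" is not in the dictionary

# Count the ids of known tokens, then sort by id
counts = Counter(token2id[t] for t in doc if t in token2id)
bow = sorted(counts.items())
print(bow)
```

Each document becomes a sparse vector: only the tokens that actually occur are stored, which keeps the corpus compact even with a large vocabulary.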

In [14]:
corpus = [dictionary.doc2bow(doc) for doc in docs] # Bag-of-words representation of the docs

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 479
Number of documents: 10000


In [15]:
for i in range(0, 324):
    print(dictionary[i]) # View the first 324 words in the dictionary

absolutely
amazing
anyway
arrived
best
better
bread
breakfast
busy
came
delicious
ever
everything
excellent
fresh
ingredient
look
looked
made
meal
menu
morning
order
outside
perfect
piece
pretty
quickly
saturday
sitting
sure
tasty
took
wait
waitress
white
wife
awesome
beef
come
decided
doe
door
drink
else
evening
everyone
friend
girl
give
home
huge
idea
liked
many
part
past
people
pizza
please
price
probably
review
said
sauce
seat
seated
seating
server
show
small
someone
something
sunday
take
thing
thought
waiter
wanted
well
also
love
plate
rice
selection
area
clean
find
located
pick
scottsdale
wonderful
always
customer
life
manager
staff
surprised
thanks
totally
walk
almost
another
beautiful
bill
bring
butter
cake
chef
couple
day
definitely
dessert
enough
entree
even
feeling
five
full
glass
impressed
inside
kitchen
know
lady
later
live
long
maybe
meat
minute
much
offer
ordered
pork
problem
quite
restaurant
return
salad
sandwich
seemed
slice
star
start
started
tried
veggie
waiting
wall

## Training the LDA Model

- We use the gensim.models.LdaModel class for performing LDA. [https://radimrehurek.com/gensim/models/ldamodel.html]
- This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. 

#### The key parameters in this model are chosen as shown below:

- **num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.**

Since this is an unsupervised learning problem, we do not know how many topics are present in the given dataset. In order to determine the number of topics, we use the following techniques:

Technique 1: Topic Coherence 
The main technique to determine the number of topics is **Topic coherence** [http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf]

Technique 2: Visualizing Inter-Topic Distance 
We use the LDA visualization tool pyLDAvis to inspect the Intertopic Distance Map (discussed later). By varying the number of topics, we can judge from the visualization which value works best.

- **chunksize (int, optional) – Number of documents to be used in each training chunk.**

It controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. 

We set chunksize = 10000, which is equal to the number of documents, so all the data is processed in one go. Note, however, that chunksize can influence the quality of the model.

- **passes (int, optional) – Number of passes through the corpus during training.**

It controls how many times we train the model on the entire corpus. Another word for passes might be “epochs”. 

- **iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.**

It controls how many times we repeat a particular loop over each document.

- **eval_every (int, optional) – Log perplexity is estimated once every this many updates.**

Setting this to 1 slows down training by ~2x.


- **alpha ({numpy.ndarray, str}, optional): Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief about each topic's probability.**
        
- **eta ({float, np.array, str}, optional) – A-priori belief on word probability.**

We set alpha = 'auto' and eta = 'auto'. Essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.
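
The arithmetic behind these settings can be checked with a few lines of Python. This is a sketch under two assumptions taken from the training log below: the model is updated once per chunk ("updating model once every 10000 documents"), and the automatically tuned alpha starts from a symmetric value of 1 / num_topics (the log shows 0.071428575 for 14 topics).

```python
import math

num_docs = 10000
chunksize = 10000
passes = 20
num_topics = 14

# With chunksize equal to the corpus size, there is a single chunk per pass
updates_per_pass = math.ceil(num_docs / chunksize)
total_updates = passes * updates_per_pass

# Symmetric starting point for the learned alpha prior
initial_alpha = 1.0 / num_topics

print(updates_per_pass, total_updates, round(initial_alpha, 6))
```

A smaller chunksize would give more updates per pass at the cost of noisier individual updates, which is one way in which chunksize can affect model quality.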

In [25]:
from gensim.models import LdaModel

#------Set training parameters
num_topics = 14 # Number of topics to discover
chunksize = 10000 # Number of documents processed in each training chunk
passes = 20 # Number of passes through the corpus
iterations = 500 # Maximum number of iterations through the corpus when inferring the topic distribution of a corpus
eval_every = None  # Don't evaluate model perplexity, takes too much time.

#-------Make an index to word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

2020-03-18 19:36:11,932 : INFO : using autotuned alpha, starting with [0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575]
2020-03-18 19:36:11,934 : INFO : using serial LDA version on this node
2020-03-18 19:36:11,937 : INFO : running online (multi-pass) LDA training, 14 topics, 20 passes over the supplied corpus of 10000 documents, updating model once every 10000 documents, evaluating perplexity every 0 documents, iterating 500x with a convergence threshold of 0.001000
2020-03-18 19:36:11,939 : INFO : PROGRESS: pass 0, at document #10000/10000
2020-03-18 19:36:11,941 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-18 19:36:37,951 : DEBUG : 9875/10000 documents converged within 500 iterations
2020-03-18 19:36:38,020 : INFO : optimized alpha [0.055341482, 0.05646221, 0.05717276, 0.062277704, 0.058756128, 0.059127245, 0.057126306, 0.059545

2020-03-18 19:37:54,589 : INFO : topic #3 (0.051): 0.015*"restaurant" + 0.014*"sushi" + 0.014*"roll" + 0.012*"menu" + 0.012*"rice" + 0.011*"star" + 0.010*"sauce" + 0.009*"beef" + 0.009*"ordered" + 0.009*"order"
2020-03-18 19:37:54,590 : INFO : topic #12 (0.053): 0.021*"always" + 0.020*"store" + 0.012*"well" + 0.011*"love" + 0.010*"year" + 0.010*"also" + 0.010*"people" + 0.010*"work" + 0.010*"know" + 0.009*"staff"
2020-03-18 19:37:54,592 : INFO : topic diff=0.098870, rho=0.408248
2020-03-18 19:37:54,596 : INFO : PROGRESS: pass 5, at document #10000/10000
2020-03-18 19:37:54,599 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-18 19:38:10,315 : DEBUG : 10000/10000 documents converged within 500 iterations
2020-03-18 19:38:10,394 : INFO : optimized alpha [0.044673413, 0.043061648, 0.04561814, 0.049796727, 0.04538715, 0.047351837, 0.04403644, 0.04925276, 0.042509515, 0.041081168, 0.04302194, 0.04709848, 0.053057656, 0.046172522]
2020-03-18 19:38:10,395 : DEBUG : updatin

2020-03-18 19:39:09,281 : INFO : topic #7 (0.047): 0.025*"table" + 0.017*"order" + 0.016*"ordered" + 0.015*"minute" + 0.014*"server" + 0.013*"wait" + 0.013*"came" + 0.011*"pizza" + 0.010*"menu" + 0.010*"night"
2020-03-18 19:39:09,283 : INFO : topic #12 (0.055): 0.022*"store" + 0.021*"always" + 0.012*"love" + 0.012*"year" + 0.012*"well" + 0.011*"know" + 0.011*"never" + 0.011*"also" + 0.010*"work" + 0.010*"people"
2020-03-18 19:39:09,285 : INFO : topic diff=0.082371, rho=0.301511
2020-03-18 19:39:09,289 : INFO : PROGRESS: pass 10, at document #10000/10000
2020-03-18 19:39:09,290 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-18 19:39:23,365 : DEBUG : 9999/10000 documents converged within 500 iterations
2020-03-18 19:39:23,459 : INFO : optimized alpha [0.04115383, 0.040803753, 0.040649623, 0.04608517, 0.040144037, 0.045671053, 0.03822016, 0.046554968, 0.039678257, 0.0347954, 0.040491406, 0.04482319, 0.0557692, 0.044775072]
2020-03-18 19:39:23,461 : DEBUG : updating t

2020-03-18 19:40:17,728 : INFO : topic #7 (0.046): 0.029*"table" + 0.019*"order" + 0.018*"ordered" + 0.018*"minute" + 0.015*"server" + 0.015*"came" + 0.015*"wait" + 0.011*"asked" + 0.010*"menu" + 0.010*"said"
2020-03-18 19:40:17,730 : INFO : topic #12 (0.058): 0.023*"store" + 0.021*"always" + 0.013*"year" + 0.012*"love" + 0.012*"never" + 0.012*"know" + 0.012*"also" + 0.011*"well" + 0.011*"work" + 0.010*"people"
2020-03-18 19:40:17,733 : INFO : topic diff=0.088074, rho=0.250000
2020-03-18 19:40:17,738 : INFO : PROGRESS: pass 15, at document #10000/10000
2020-03-18 19:40:17,739 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-18 19:40:31,087 : DEBUG : 9999/10000 documents converged within 500 iterations
2020-03-18 19:40:31,185 : INFO : optimized alpha [0.039892215, 0.040928133, 0.037746377, 0.04483856, 0.037387565, 0.045229323, 0.03499313, 0.0455719, 0.040061288, 0.031309586, 0.040278073, 0.04486677, 0.057997096, 0.044705972]
2020-03-18 19:40:31,186 : DEBUG : updating

2020-03-18 19:41:22,823 : INFO : topic #7 (0.045): 0.031*"table" + 0.020*"order" + 0.020*"minute" + 0.019*"ordered" + 0.017*"came" + 0.016*"server" + 0.016*"wait" + 0.013*"asked" + 0.011*"said" + 0.010*"went"
2020-03-18 19:41:22,826 : INFO : topic #12 (0.059): 0.025*"store" + 0.022*"always" + 0.013*"year" + 0.013*"never" + 0.012*"love" + 0.012*"know" + 0.012*"also" + 0.011*"well" + 0.011*"work" + 0.010*"staff"
2020-03-18 19:41:22,827 : INFO : topic diff=0.093212, rho=0.218218


Wall time: 5min 10s


## Technique 1 for Determining Optimal Number of Topics: Topic Coherence

- Topic Coherence is a measure used to evaluate topic models. 
- A set of statements or facts is said to be coherent, if they support each other. 
- An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”. In a topic model, each topic consists of words, and topic coherence is computed over the top N words of the topic. 

Below we
- display the average topic coherence and
- print the topics in order of topic coherence.

- We use LdaModel's "top_topics" method to get the coherence score of each topic, with the topics sorted from highest to lowest coherence.
- Note that we use the “u_mass” topic coherence measure here, which is the default of this method (see gensim.models.ldamodel.LdaModel.top_topics()).
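
To give an intuition for what a UMass-style score measures, here is a minimal standard-library sketch based on the commonly cited formulation: for each pair of top words, take the log of (number of documents containing both words + 1) divided by the number of documents containing the higher-ranked word, and sum over all pairs. Gensim's own implementation may differ in smoothing and aggregation, so the numbers from this sketch are not expected to match the output of `top_topics`. The function name `umass_coherence` and the toy corpus are assumptions made for the example.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence of a topic's top words (ordered by descending probability)."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):   # l < m: w_l ranks higher
        w_l, w_m = top_words[l], top_words[m]
        score += math.log((doc_count(w_m, w_l) + 1) / doc_count(w_l))
    return score

toy_docs = [
    ["burger", "fry", "cheese"],
    ["burger", "fry"],
    ["sushi", "roll"],
    ["sushi", "roll", "rice"],
]
coherent = umass_coherence(["burger", "fry"], toy_docs)   # words co-occur
mixed = umass_coherence(["burger", "sushi"], toy_docs)    # words never co-occur
print(coherent, mixed)
```

Words that tend to appear in the same documents give a higher (less negative) score, which is why a higher average coherence is taken as a sign of more interpretable topics.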

In [26]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2020-03-18 19:41:22,849 : DEBUG : Setting topics to those of the model: LdaModel(num_terms=479, num_topics=14, decay=0.5, chunksize=10000)
2020-03-18 19:41:22,899 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2020-03-18 19:41:22,952 : INFO : CorpusAccumulator accumulated stats from 2000 documents
2020-03-18 19:41:22,985 : INFO : CorpusAccumulator accumulated stats from 3000 documents
2020-03-18 19:41:23,016 : INFO : CorpusAccumulator accumulated stats from 4000 documents
2020-03-18 19:41:23,043 : INFO : CorpusAccumulator accumulated stats from 5000 documents
2020-03-18 19:41:23,066 : INFO : CorpusAccumulator accumulated stats from 6000 documents
2020-03-18 19:41:23,094 : INFO : CorpusAccumulator accumulated stats from 7000 documents
2020-03-18 19:41:23,126 : INFO : CorpusAccumulator accumulated stats from 8000 documents
2020-03-18 19:41:23,162 : INFO : CorpusAccumulator accumulated stats from 9000 documents
2020-03-18 19:41:23,201 : INFO : CorpusAccumulator accumulat

Average topic coherence: -1.9597.
[([(0.014500951, 'make'),
   (0.014462573, 'even'),
   (0.014384397, 'know'),
   (0.013427718, 'thing'),
   (0.013147641, 'could'),
   (0.012447422, 'think'),
   (0.011965686, 'much'),
   (0.011433464, 'people'),
   (0.011416536, 'better'),
   (0.010520441, 'take'),
   (0.010044893, 'friend'),
   (0.009982269, 'review'),
   (0.009933541, 'need'),
   (0.0098300455, 'sure'),
   (0.009385238, 'home'),
   (0.009070197, 'going'),
   (0.009027183, 'next'),
   (0.009021322, 'star'),
   (0.0088329725, 'right'),
   (0.008640835, 'come')],
  -1.743831219453768),
 ([(0.07707651, 'burger'),
   (0.055282053, 'fry'),
   (0.019355007, 'sweet'),
   (0.017651645, 'potato'),
   (0.017036986, 'cheese'),
   (0.015218481, 'sandwich'),
   (0.014760232, 'onion'),
   (0.013069318, 'ordered'),
   (0.012904886, 'well'),
   (0.01212739, 'nice'),
   (0.011381641, 'steak'),
   (0.010598591, 'also'),
   (0.009919349, 'bacon'),
   (0.009762588, 'side'),
   (0.009751068, 'pretty'),
 

## Technique 2 for Determining Optimal Number of Topics: Visualization

- We use **pyLDAvis** to interpret the topics in a topic model that has been fit to a corpus of text data. 

- It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [27]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary)

2020-03-18 19:41:23,532 : DEBUG : performing inference on a chunk of 10000 documents
2020-03-18 19:41:35,948 : DEBUG : 10000/10000 documents converged within 500 iterations



## Interpretation of the Visualization 

- Relevance is defined as in footer 2 of the visualization and can be tuned by the parameter $\lambda$.

Smaller $\lambda$ gives higher weight to the term's distinctiveness.

Larger $\lambda$ gives higher weight to the probability of the term occurring in the topic.

- Therefore, to get a better sense of the terms that distinguish each topic, we use $\lambda = 0$.
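
The effect of $\lambda$ can be illustrated with a short sketch. It assumes the relevance definition from Sievert and Shirley's LDAvis paper, relevance(w | t) = $\lambda \log p(w|t) + (1 - \lambda) \log \frac{p(w|t)}{p(w)}$, where $p(w)$ is the overall probability of the term in the corpus. The probabilities below are made up for the example.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term for a topic, following the LDAvis definition."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical terms: "also" is common everywhere, "burger" is specific to the topic
terms = {
    "also":   {"p_w_given_t": 0.02, "p_w": 0.02},
    "burger": {"p_w_given_t": 0.01, "p_w": 0.001},
}

for lam in (1.0, 0.0):
    ranked = sorted(terms, key=lambda w: relevance(lam=lam, **terms[w]), reverse=True)
    print(lam, ranked)
```

With $\lambda = 1$ the generic term ranks first because it is more probable within the topic; with $\lambda = 0$ the topic-specific term ranks first because it is far more probable in the topic than in the corpus overall.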

## Display the Top Words in the Topics

- We display the top 10 words for each topic.

In [28]:
def get_lda_topics(model, num_topics, top_words):
    '''Function to return top words for num_topics'''
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=top_words)
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [29]:
get_lda_topics(model, num_topics, 10) #View top 10 words for each topic

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10,Topic # 11,Topic # 12,Topic # 13,Topic # 14
0,highly_recommend,area,burger,rice,customer_service,pizza,first_time,table,make,happy_hour,beer,wine,store,breakfast
1,price,restaurant,fry,roll,customer,salad,first,order,even,hour,drink,restaurant,always,coffee
2,recommend,phoenix,sweet,sushi,staff,chicken,best,minute,know,happy,night,dish,year,sandwich
3,love,location,potato,sauce,room,love,ever,ordered,thing,drink,game,meal,never,taco
4,highly,scottsdale,cheese,chicken,friendly,lunch,amazing,came,could,menu,music,dinner,love,chip
5,always,nice,sandwich,restaurant,nice,sauce,never,server,think,price,pretty,also,know,cheese
6,quality,little,onion,beef,hotel,sandwich,went,wait,much,sushi,friend,bread,also,mexican
7,best,room,ordered,dish,pool,always,love,asked,people,special,people,little,well,salsa
8,every,mall,well,meat,clean,delicious,year,said,better,pretty,bartender,dessert,work,burrito
9,worth,town,nice,spicy,stay,also,visit,went,take,love,selection,menu,staff,cream


## Generate Labels for the Topics

- We can manually generate human-interpretable labels for each topic by looking at the most probable terms in each topic.


- We use LdaModel's "show_topic" method that returns **Word-probability pairs** for the most relevant words generated by the topic.

In [30]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    accept a ldamodel, a topic number and topn vocabs of interest
    prints a formatted list of the topn terms
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:30} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

In [31]:
topic_summaries = []

print(u'{:25} {}'.format(u'term', u'frequency') + u'\n')

for i in range(num_topics):
    print('\nTopic '+str(i)+' |---------------------------\n')
    tmp = explore_topic(model,topic_number=i, topn=8, output=True )
    topic_summaries += [tmp[:5]]

term                      frequency


Topic 0 |---------------------------

highly_recommend               0.048
price                          0.044
recommend                      0.036
love                           0.034
highly                         0.025
always                         0.020
quality                        0.020
best                           0.015

Topic 1 |---------------------------

area                           0.025
restaurant                     0.023
phoenix                        0.022
location                       0.022
scottsdale                     0.019
nice                           0.017
little                         0.014
room                           0.013

Topic 2 |---------------------------

burger                         0.077
fry                            0.055
sweet                          0.019
potato                         0.018
cheese                         0.017
sandwich                       0.015
onion                          0

## Manually Generate Topic Labels

- Based on the most probable words generated by each topic, we assign human-interpretable labels for the topics.

In [32]:
top_labels = {0: 'Recommendation', 1:'Location', 2:'Burger', 3:'Asian Cuisine', 4:'Hotel', 5:'Fast Food', 
              6:'Experience', 7:'Order', 8:'Opinion', 9:'Happy Hour', 10:'Beer', 11:'Fine Dining', 12:'Store',
             13:'Breakfast'}

In [33]:
top_labels

{0: 'Recommendation',
 1: 'Location',
 2: 'Burger',
 3: 'Asian Cuisine',
 4: 'Hotel',
 5: 'Fast Food',
 6: 'Experience',
 7: 'Order',
 8: 'Opinion',
 9: 'Happy Hour',
 10: 'Beer',
 11: 'Fine Dining',
 12: 'Store',
 13: 'Breakfast'}
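
One way such labels might be used is to name the dominant topic of a document. The sketch below assumes a topic distribution given as a list of `(topic_id, probability)` pairs, which is the shape Gensim returns when inferring topics for a document. The distribution and the reduced label dictionary here are made up for illustration.

```python
# Hypothetical subset of the labels and a made-up topic distribution for one review
toy_labels = {2: 'Burger', 3: 'Asian Cuisine', 13: 'Breakfast'}
toy_doc_topics = [(2, 0.15), (3, 0.70), (13, 0.15)]

# Pick the topic with the highest probability and look up its label
topic_id, prob = max(toy_doc_topics, key=lambda pair: pair[1])
label = toy_labels[topic_id]
print(f"{label} ({prob:.2f})")
```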