<i> Note: This notebook is inspired by the Topic-Modeling-Latent-Dirichlet-Allocation series at: https://github.com/rhasanbd/Topic-Modeling-Latent-Dirichlet-Allocation </i>

## Latent Dirichlet Allocation - Implementation on Yelp dataset

In this notebook, we apply Latent Dirichlet Allocation (LDA) to Yelp review data to carry out topic modelling. We use the Gensim topic-modelling API (https://radimrehurek.com/gensim/models/ldamodel.html). A Scikit-Learn implementation is also available; we use Gensim because it provides additional functionality such as the Topic Coherence Pipeline and Dynamic Topic Modeling.

We build an **end-to-end Natural Language Processing (NLP) pipeline**, starting with raw data and running through preparation, modeling, and visualization.
The steps we carry out are the following:
1. Exploratory Data Analysis
2. Data Cleaning and Pre-processing
3. Topic modeling with LDA
4. Determine optimal number of Topics
5. Visualize topic model using pyLDAvis

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

%pylab inline
import pandas as pd
import pickle as pk
from scipy import sparse as sp

import nltk
nltk.download('wordnet')

from pymongo import MongoClient

2020-04-23 10:02:11,633 : DEBUG : $HOME=C:\Users\rojin
2020-04-23 10:02:11,637 : DEBUG : CONFIGDIR=C:\Users\rojin\.matplotlib
2020-04-23 10:02:11,638 : DEBUG : matplotlib data path: c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data
2020-04-23 10:02:11,641 : DEBUG : loaded rc file c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data\matplotlibrc
2020-04-23 10:02:11,644 : DEBUG : matplotlib version 3.1.2
2020-04-23 10:02:11,645 : DEBUG : interactive is False
2020-04-23 10:02:11,645 : DEBUG : platform is win32


2020-04-23 10:02:11,689 : DEBUG : CACHEDIR=C:\Users\rojin\.matplotlib
2020-04-23 10:02:11,694 : DEBUG : Using fontManager instance from C:\Users\rojin\.matplotlib\fontlist-v310.json
2020-04-23 10:02:11,813 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-04-23 10:02:11,817 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-04-23 10:02:11,820 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.


Populating the interactive namespace from numpy and matplotlib


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load & Explore the Data

In [2]:
client = MongoClient("mongodb://localhost:27017/")
db = client.yelp_database
df = pd.DataFrame(db.business_restaurant.find({},{"reviews.text":1, "_id":0}))
df = df.applymap(lambda x : x[0]['text'])
df.head() #Quick Check of the data

Unnamed: 0,reviews
0,During the recent Yelp scavenger hunt event my...
1,Bolt is within walking distance of The Drake H...
2,Apteka was one the highest rated places I have...
3,"When people say Korean food, what do you think..."
4,I'm SO glad I finally got to get SERVED last w...


In [3]:
print("The total number of reviews is:", df.shape[0])

The total number of reviews is: 8688


In [4]:
df.info() # View data description (Total rows, Column names, type and number of non-null values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8688 entries, 0 to 8687
Data columns (total 1 columns):
reviews    8688 non-null object
dtypes: object(1)
memory usage: 68.0+ KB


In [5]:
print("Dimension of the data: ", df.shape) # View data dimension

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (8688, 1)
No. of Rows: 8688
No. of Columns: 1


## Convert the Text Column into an Array of Documents

- We convert the documents in the `reviews` column to a NumPy array of documents.

- It is a 1D array (shape `(8688,)`, as printed below) in which each element represents one document.

In [86]:
docs_array = np.array(df['reviews']) # Convert the 'reviews' column into an array

print("Dimension of the documents array: ", docs_array.shape) # View dimensions of new array
print()
print(docs_array[6]) # View a document

Dimension of the documents array:  (8688,)

Have been to the Salt Cellar countless times over the years. Cannot believe I've never left a review here. Interesting underground restaurant that is very easy to miss if you do not know where it is. It's a small little door on the top but a huge restaurant Underground. Has very good seafood for Arizona. Flown in fresh and cooked to order properly. They also do a great happy hour so you can try some of their Specialties at a discounted price on food and drink. Fun place to come and check out with some friends.


## Pre-process the Data

Pre-processing of the text data is done using the following steps:

- Convert to lowercase 
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove words that are three characters or shorter (the code keeps only tokens with `len(token) > 3`)
- Lemmatize the tokens/words


### Tokenization and Lemmatization

- We convert each document to lowercase and then tokenize it with NLTK's regular-expression tokenizer class, `RegexpTokenizer`. 
- It splits a given string into substrings using a regular expression; the pattern `\w+` keeps runs of letters, digits and underscores and drops punctuation. 
- Then we remove purely numeric tokens and very short words (three characters or fewer in the code below), since they usually carry little useful information and are very numerous.
- Finally, we lemmatize the tokens using `WordNetLemmatizer` from NLTK, which uses the WordNet dictionary to reduce each token to its base form (lemma).
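
As a minimal illustration of the first steps on a single made-up sentence, the sketch below uses Python's `re` module in place of `RegexpTokenizer` (an assumption made to keep it dependency-free) and leaves out lemmatization. The function name, the sample sentence and the `min_len` parameter are hypothetical; `min_len=4` mirrors the `len(token) > 3` test used in the notebook's code.

```python
import re

def simple_preprocess(text, min_len=4):
    """Lowercase, tokenize on runs of word characters, drop pure numbers and short tokens."""
    tokens = re.findall(r"\w+", text.lower())          # same pattern as the tokenizer
    tokens = [t for t in tokens if not t.isdigit()]    # drops "2" but would keep "7pm"
    return [t for t in tokens if len(t) >= min_len]    # drops "we", "and", "7pm", ...

sample = "We ordered 2 Gyros and 1 soup at 7pm; service was great!"
tokens = simple_preprocess(sample)
print(tokens)
```

Note how aggressive the length filter is: short but meaningful words such as "pho" or "bbq" would be discarded along with "and" and "was".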

In [87]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    '''Convert an array of raw documents into a list of token lists (one list per document).'''
    tokenizer = RegexpTokenizer(r'\w+') # Matches runs of word characters, dropping punctuation

    # Lowercase each document and split it into tokens (the input array is left unmodified)
    docs = [tokenizer.tokenize(doc.lower()) for doc in docs]

    # Remove numbers, but not words that contain numbers
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove short words: only tokens longer than three characters are kept
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words (WordNetLemmatizer treats each token as a noun by default)
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs

- Now we convert the array of documents into a list of tokenized documents using the above function

In [88]:
# `%time` must share a line with the statement it times; on a line of its own it measures nothing
%time docs = docs_preprocessor(docs_array)
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

Length of the 2D Array of Tokenized Documents:  8688


In [89]:
print(docs[0:2]) #Display the first two documents

[['during', 'recent', 'yelp', 'scavenger', 'hunt', 'event', 'husband', 'this', 'place', 'last', 'venue', 'were', 'pretty', 'full', 'from', 'eating', 'elsewhere', 'told', 'them', 'they', 'would', 'sample', 'would', 'home', 'they', 'were', 'more', 'than', 'happy', 'this', 'asked', 'them', 'other', 'location', 'were', 'concerned', 'this', 'only', 'city', 'greek', 'themed', 'when', 'finally', 'were', 'able', 'have', 'doggie', 'pleased', 'that', 'they', 'gave', 'small', 'gyro', 'which', 'their', 'specialty', 'along', 'with', 'lemon', 'chicken', 'soup', 'oyster', 'cracker', 'believe', 'disappoint', 'just', 'enough', 'light', 'meal', 'later', 'they', 'have', 'reward', 'program', 'that', 'sandwich', 'salad', 'then', 'free', 'when', 'home', 'chance', 'review', 'menu', 'detail', 'serve', 'vegetarian', 'gyro', 'specialty', 'burger', 'specialty', 'sandwich', 'along', 'with', 'side', 'everything', 'carte', 'however', 'side', 'fry', 'coleslaw', 'reduced', 'price', 'order', 'sandwich', 'breakfast', '

## Remove all stop words

- Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text. 
- We remove them so that the topic model does not treat them as signal.
- To remove the stop words, we use the `stopwords` corpus from the NLTK library.

In [90]:
# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

# Load stop words
stop_words = stopwords.words('english')

# Show stop words
stop_words[:5]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we']

In [91]:
# Remove all stop words from the docs (a set gives fast membership tests)
stop_set = set(stop_words)
for i in range(len(docs)):
    docs[i] = [word for word in docs[i] if word not in stop_set]

## Compute Bigrams/Trigrams:

- N-grams are sequences of 'n' adjacent words (or letters) found in the source text. Some of these sequences carry a meaning of their own. For example, "car pool" is a two-word phrase whose meaning differs from that of the individual words "car" and "pool". 

- If n=2, it is called a bigram, and if n=3, a trigram.

- We find all candidate bigrams and trigrams, and keep only the frequent phrases. 
- Finally, we add the frequent phrases to the original data, since we would like to keep the words “car” and “pool” as well as the bigram “car_pool”.
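
To make "frequent phrases" concrete, here is a small plain-Python sketch of the Mikolov-style score that phrase detectors of this kind commonly use: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. A pair is merged when its score exceeds a threshold. The function name and toy documents are hypothetical, and `vocab_size` is simplified to the number of distinct unigrams, so the numbers will not match what Gensim computes internally.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score adjacent word pairs: (count(a,b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
        if n >= min_count
    }

toy_docs = [
    ["happy", "hour", "beer"],
    ["happy", "hour", "wine"],
    ["great", "beer", "great", "wine"],
]
scores = score_bigrams(toy_docs, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Pairs that co-occur only `min_count` times score zero, so a higher `min_count` or `threshold` yields fewer, more reliable phrases.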

In [92]:
from gensim.models import Phrases

bigram = Phrases(docs, min_count=10, threshold=100) # Detect bigrams (pairs seen at least 10 times that score above the threshold)
trigram = Phrases(bigram[docs], min_count=10, threshold=100) # Detect trigrams the same way, on top of the bigram-merged text

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a bigram, add to document
    for token in trigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a trigram, add to document

2020-04-23 10:43:36,134 : INFO : collecting all words and their counts
2020-04-23 10:43:36,136 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-04-23 10:43:39,217 : INFO : collected 497941 word types from a corpus of 812038 words (unigram + bigrams) and 8688 sentences
2020-04-23 10:43:39,218 : INFO : using 497941 counts as vocab in Phrases<0 vocab, min_count=10, threshold=100, max_vocab_size=40000000>
2020-04-23 10:43:39,254 : INFO : collecting all words and their counts
2020-04-23 10:43:39,257 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-04-23 10:43:51,401 : INFO : collected 502177 word types from a corpus of 800933 words (unigram + bigrams) and 8688 sentences
2020-04-23 10:43:51,403 : INFO : using 502177 counts as vocab in Phrases<0 vocab, min_count=10, threshold=100, max_vocab_size=40000000>


In [93]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs) # Create a dictionary representation of the documents
print('Number of unique words in initial documents:', len(dictionary))

2020-04-23 10:44:05,905 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-23 10:44:09,218 : INFO : built Dictionary(26831 unique tokens: ['able', 'airport', 'airside', 'along', 'approach']...) from 8688 documents (total 845958 corpus positions)


Number of unique words in initial documents: 26831


## Remove Rare and Common Tokens/Words

- We remove infrequent words from our dictionary. 
- We also remove words that appear in a large share of the documents.
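
The idea behind this filter can be sketched in plain Python: count in how many documents each token appears, then keep only tokens within the allowed range. The helper name and toy documents below are hypothetical, and the sketch ignores any additional cap on vocabulary size that a library implementation may apply.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(docs)
    # set(doc) makes sure a token is counted once per document
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs}

toy_docs = [
    ["food", "pizza", "crust"],
    ["food", "pizza", "slice"],
    ["food", "taco", "salsa"],
    ["food", "taco", "pizza"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "food" is dropped because it appears in every document, while "crust", "slice" and "salsa" are dropped because each appears in only one.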

In [94]:
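# Filter out words that occur in fewer than 10 documents, or in more than 20% of the documents
# Note: the log recorded below reports a 10% cutoff (868 documents), i.e. it comes from a run with no_above=0.10
dictionary.filter_extremes(no_below=10, no_above=0.20) 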

print('Number of unique words after removing rare and common words:', len(dictionary))

2020-04-23 10:44:09,452 : INFO : discarding 21463 tokens: [('airside', 3), ('area', 1359), ('asked', 962), ('chicken', 2006), ('doggie', 6), ('enough', 963), ('everything', 1289), ('find', 1015), ('food', 4851), ('fry', 921)]...
2020-04-23 10:44:09,454 : INFO : keeping 5368 tokens which were in no less than 10 and no more than 868 (=10.0%) documents
2020-04-23 10:44:09,477 : DEBUG : rebuilding dictionary, shrinking gaps
2020-04-23 10:44:09,486 : INFO : resulting dictionary: Dictionary(5368 unique tokens: ['able', 'airport', 'along', 'approach', 'available']...)


Number of unique words after removing rare and common words: 5368


## Bag-of-Words Representation of Data


- We transform the documents to a **vectorized form**. 

- We simply compute the frequency of each word, including the bigrams/trigrams.
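
As a rough sketch of what a bag-of-words vector looks like, the snippet below maps a tokenized document to sorted `(token_id, count)` pairs and skips tokens that are not in the vocabulary. The helper `to_bow` and the tiny `token2id` mapping are hypothetical stand-ins for the dictionary built above.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Map a tokenized document to sorted (token_id, count) pairs,
    skipping tokens that are not in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

token2id = {"beer": 0, "pizza": 1, "happy_hour": 2}  # hypothetical vocabulary
bow = to_bow(["pizza", "beer", "pizza", "airside", "happy_hour"], token2id)
print(bow)
```

Word order is discarded in this representation; only how often each vocabulary token occurs in the document is kept.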

In [95]:
corpus = [dictionary.doc2bow(doc) for doc in docs] # Bag-of-words representation of the docs

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 5368
Number of documents: 8688


In [61]:
# Check whether the bigram 'feel_like' is among the first 912 dictionary ids
for i in range(912):
    if dictionary[i] == 'feel_like':
        print(dictionary[i])

## Training the LDA Model

- We use the `gensim.models.LdaModel` class to perform LDA. [https://radimrehurek.com/gensim/models/ldamodel.html]
- This class supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents. 

#### The key parameters in this model are chosen as shown below:

- **num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.**

Since this is an unsupervised learning problem, we do not know how many topics are present in the given dataset. In order to determine the number of topics, we use the following techniques:

Technique 1: Topic Coherence 
The main technique to determine the number of topics is **Topic coherence** [http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf]

Technique 2: Visualizing Inter-Topic Distance 
Use the LDA visualization tool pyLDAvis to observe Intertopic Distance Map (discussed later). By varying the number of topics we could determine the optimal value from the visualization.
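
One way to organize the search over topic counts is a small helper that scores each candidate and keeps the best one. This is only a sketch: `select_num_topics` is a hypothetical helper, and the scorer used here returns made-up numbers purely so the example runs on its own.

```python
def select_num_topics(candidates, score_fn):
    """Score every candidate topic count and return the best one plus all scores."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)  # higher (closer to zero) coherence wins
    return best_k, scores

# Stand-in scorer with artificial values, NOT results from this notebook.
fake_coherence = {2: -5.0, 4: -1.0, 6: -3.0}
best_k, scores = select_num_topics([2, 4, 6], fake_coherence.get)
print(best_k, scores)
```

In the notebook's setting, `score_fn` could train an `LdaModel` with `num_topics=k` and return the average coherence of its topics; since each call retrains the model, the candidate list should stay short.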

- **chunksize (int, optional) – Number of documents to be used in each training chunk.**

It controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. 

We set chunksize = 8688, which is equal to the number of documents. Thus, each pass processes all the data in one go. Chunksize can, however, influence the quality of the model.
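
The effect of chunksize on the number of chunks processed per pass is simple arithmetic, sketched below. The helper name is hypothetical, and the second chunksize (2000) is an arbitrary value chosen for comparison.

```python
import math

def chunks_per_pass(n_docs, chunksize):
    """Number of chunks needed to cover the corpus once."""
    return math.ceil(n_docs / chunksize)

whole_corpus = chunks_per_pass(8688, 8688)   # one chunk holds every document
smaller_chunks = chunks_per_pass(8688, 2000) # last chunk is only partly filled
print(whole_corpus, smaller_chunks)
```

With the whole corpus in one chunk the model is updated once per pass, which matches the "updating model once every 8688 documents" message in the training log.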

- **passes (int, optional) – Number of passes through the corpus during training.**

It controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. 

- **iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.**

It controls how often we repeat a particular loop over each document.

- **eval_every (int, optional) – Log perplexity is estimated every that many updates.**

Setting this to 1 slows down training by ~2x.


- **alpha ({numpy.ndarray, str}, optional): Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief for each topic's probability.**
        
- **eta ({float, np.array, str}, optional) – A-priori belief on word probability.**

We set alpha = 'auto' and eta = 'auto'. Essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.
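
The training logs in this notebook show that, with alpha = 'auto', the prior starts out symmetric at 1/num_topics for every topic (about 0.0714 for 14 topics and 0.1111 for 9) before it is adjusted during training. A tiny sketch of that starting point, with a hypothetical variable name:

```python
num_topics = 14

# Symmetric starting prior: every topic gets the same weight, 1 / num_topics
initial_alpha = [1.0 / num_topics] * num_topics
print(round(initial_alpha[0], 6), round(sum(initial_alpha), 6))
```

After training, the learned values are no longer equal: topics with a larger alpha are expected to appear in more documents.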

In [96]:
from gensim.models import LdaModel

#------Set training parameters
num_topics = 14 # Number of topics to discover
chunksize = 8688 # Number of documents per training chunk (here: the whole corpus)
passes = 30 # Number of passes through the corpus
iterations = 400 # Maximum number of iterations through the corpus when inferring the topic distribution of a corpus
# Note: the log recorded below reports "iterating 250x", i.e. it comes from a run with iterations=250
eval_every = None  # Don't evaluate model perplexity, takes too much time.

#-------Make an index to word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every, random_state=0)

2020-04-23 10:44:33,159 : INFO : using autotuned alpha, starting with [0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575, 0.071428575]
2020-04-23 10:44:33,162 : INFO : using serial LDA version on this node
2020-04-23 10:44:33,181 : INFO : running online (multi-pass) LDA training, 14 topics, 30 passes over the supplied corpus of 8688 documents, updating model once every 8688 documents, evaluating perplexity every 0 documents, iterating 250x with a convergence threshold of 0.001000
2020-04-23 10:44:33,183 : INFO : PROGRESS: pass 0, at document #8688/8688
2020-04-23 10:44:33,185 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 10:45:09,564 : DEBUG : 7248/8688 documents converged within 250 iterations
2020-04-23 10:45:09,650 : INFO : optimized alpha [0.054211393, 0.05330117, 0.05547096, 0.054973148, 0.058158416, 0.05327438, 0.055156033, 0.05503418, 0.0

2020-04-23 10:46:21,018 : INFO : topic #4 (0.046): 0.007*"sushi" + 0.006*"soup" + 0.005*"spicy" + 0.005*"bowl" + 0.004*"indian" + 0.004*"sandwich" + 0.004*"dessert" + 0.004*"clean" + 0.003*"beer" + 0.003*"fast"
2020-04-23 10:46:21,022 : INFO : topic #8 (0.048): 0.009*"beer" + 0.006*"sandwich" + 0.004*"pizza" + 0.003*"pulled_pork" + 0.003*"game" + 0.003*"probably" + 0.003*"happy_hour" + 0.003*"pork" + 0.003*"year" + 0.003*"plate"
2020-04-23 10:46:21,023 : INFO : topic #6 (0.048): 0.007*"coffee" + 0.006*"told" + 0.005*"sandwich" + 0.005*"manager" + 0.004*"breakfast" + 0.004*"business" + 0.003*"owner" + 0.003*"tell" + 0.003*"year" + 0.003*"line"
2020-04-23 10:46:21,024 : INFO : topic diff=0.339353, rho=0.408248
2020-04-23 10:46:21,041 : INFO : PROGRESS: pass 5, at document #8688/8688
2020-04-23 10:46:21,042 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 10:46:36,104 : DEBUG : 8634/8688 documents converged within 250 iterations
2020-04-23 10:46:36,204 : INFO : optim

2020-04-23 10:47:30,028 : INFO : topic #12 (0.036): 0.010*"goat_cheese" + 0.007*"foie_gras" + 0.006*"breakfast" + 0.006*"bloody_mary" + 0.005*"bread" + 0.005*"biscuit" + 0.004*"short_rib" + 0.004*"brunch" + 0.004*"dessert" + 0.004*"potato"
2020-04-23 10:47:30,030 : INFO : topic #3 (0.043): 0.011*"cream" + 0.010*"spring_roll" + 0.009*"cake" + 0.007*"chocolate" + 0.007*"roll" + 0.006*"coffee" + 0.006*"milk" + 0.005*"dessert" + 0.004*"thai" + 0.004*"spring"
2020-04-23 10:47:30,031 : INFO : topic #8 (0.046): 0.013*"beer" + 0.006*"sandwich" + 0.005*"pulled_pork" + 0.004*"game" + 0.003*"probably" + 0.003*"year" + 0.003*"pork" + 0.003*"brisket" + 0.003*"especially" + 0.003*"plate"
2020-04-23 10:47:30,033 : INFO : topic #6 (0.052): 0.008*"told" + 0.006*"manager" + 0.006*"coffee" + 0.005*"business" + 0.004*"owner" + 0.004*"sandwich" + 0.004*"waiting" + 0.004*"employee" + 0.003*"tell" + 0.003*"left"
2020-04-23 10:47:30,036 : INFO : topic diff=0.264184, rho=0.301511
2020-04-23 10:47:30,050 : INFO

2020-04-23 10:48:28,657 : DEBUG : updating topics
2020-04-23 10:48:28,678 : INFO : topic #10 (0.033): 0.015*"sandwich" + 0.007*"room" + 0.006*"tucked_away" + 0.006*"vega" + 0.005*"looking_forward" + 0.005*"bread" + 0.004*"front_desk" + 0.004*"quick" + 0.004*"away" + 0.004*"five_star"
2020-04-23 10:48:28,680 : INFO : topic #12 (0.033): 0.011*"goat_cheese" + 0.008*"foie_gras" + 0.008*"bloody_mary" + 0.006*"bread" + 0.006*"biscuit" + 0.006*"short_rib" + 0.006*"brussels_sprout" + 0.005*"call_ahead" + 0.005*"goat" + 0.005*"pour"
2020-04-23 10:48:28,683 : INFO : topic #3 (0.043): 0.012*"cream" + 0.010*"spring_roll" + 0.009*"cake" + 0.009*"coffee" + 0.008*"chocolate" + 0.007*"milk" + 0.006*"dessert" + 0.006*"roll" + 0.004*"strip_mall" + 0.004*"vegan"
2020-04-23 10:48:28,684 : INFO : topic #8 (0.047): 0.015*"beer" + 0.006*"sandwich" + 0.005*"game" + 0.005*"pulled_pork" + 0.003*"year" + 0.003*"probably" + 0.003*"local" + 0.003*"brisket" + 0.003*"pork" + 0.003*"especially"
2020-04-23 10:48:28,68

2020-04-23 10:49:12,700 : INFO : PROGRESS: pass 19, at document #8688/8688
2020-04-23 10:49:12,702 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 10:49:23,270 : DEBUG : 8681/8688 documents converged within 250 iterations
2020-04-23 10:49:23,351 : INFO : optimized alpha [0.043072697, 0.04132358, 0.038351174, 0.044255696, 0.040036146, 0.037776083, 0.06520288, 0.036439084, 0.04957697, 0.03716778, 0.031987567, 0.040339913, 0.030595725, 0.035012126]
2020-04-23 10:49:23,352 : DEBUG : updating topics
2020-04-23 10:49:23,372 : INFO : topic #12 (0.031): 0.013*"goat_cheese" + 0.009*"foie_gras" + 0.009*"bloody_mary" + 0.007*"biscuit" + 0.007*"short_rib" + 0.006*"brussels_sprout" + 0.006*"bread" + 0.006*"call_ahead" + 0.006*"goat" + 0.006*"pour"
2020-04-23 10:49:23,373 : INFO : topic #10 (0.032): 0.019*"sandwich" + 0.007*"vega" + 0.007*"tucked_away" + 0.007*"room" + 0.006*"looking_forward" + 0.005*"five_star" + 0.005*"bread" + 0.005*"quick" + 0.004*"hotel" + 0.004*"front_de

2020-04-23 10:50:05,515 : INFO : topic #6 (0.071): 0.009*"told" + 0.006*"manager" + 0.005*"business" + 0.005*"owner" + 0.005*"waitress" + 0.004*"left" + 0.004*"waiting" + 0.004*"walked" + 0.004*"employee" + 0.004*"away"
2020-04-23 10:50:05,518 : INFO : topic diff=0.109915, rho=0.200000
2020-04-23 10:50:05,536 : INFO : PROGRESS: pass 24, at document #8688/8688
2020-04-23 10:50:05,538 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 10:50:15,565 : DEBUG : 8683/8688 documents converged within 250 iterations
2020-04-23 10:50:15,648 : INFO : optimized alpha [0.044278793, 0.042996723, 0.038498938, 0.04540979, 0.04033374, 0.03714197, 0.07250489, 0.036167152, 0.05214275, 0.037615627, 0.03214283, 0.04067851, 0.029062424, 0.034742832]
2020-04-23 10:50:15,650 : DEBUG : updating topics
2020-04-23 10:50:15,670 : INFO : topic #12 (0.029): 0.014*"goat_cheese" + 0.010*"foie_gras" + 0.010*"bloody_mary" + 0.007*"short_rib" + 0.007*"brussels_sprout" + 0.007*"biscuit" + 0.007*"call_a

2020-04-23 10:50:56,473 : INFO : topic #3 (0.046): 0.014*"coffee" + 0.014*"cream" + 0.010*"spring_roll" + 0.010*"cake" + 0.009*"chocolate" + 0.008*"dessert" + 0.008*"milk" + 0.005*"shop" + 0.005*"cafe" + 0.005*"sugar"
2020-04-23 10:50:56,475 : INFO : topic #8 (0.054): 0.017*"beer" + 0.006*"game" + 0.005*"pulled_pork" + 0.004*"sandwich" + 0.004*"local" + 0.004*"year" + 0.003*"pork" + 0.003*"brisket" + 0.003*"bartender" + 0.003*"probably"
2020-04-23 10:50:56,477 : INFO : topic #6 (0.078): 0.009*"told" + 0.006*"manager" + 0.005*"waitress" + 0.005*"business" + 0.005*"owner" + 0.005*"left" + 0.004*"waiting" + 0.004*"walked" + 0.004*"away" + 0.004*"employee"
2020-04-23 10:50:56,479 : INFO : topic diff=0.085145, rho=0.182574
2020-04-23 10:50:56,500 : INFO : PROGRESS: pass 29, at document #8688/8688
2020-04-23 10:50:56,502 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 10:51:06,217 : DEBUG : 8683/8688 documents converged within 250 iterations
2020-04-23 10:51:06,283 : I

Wall time: 6min 33s


In [124]:
#------Set training parameters
num_topics = 9 # Number of topics to discover
chunksize = 8688 # Number of documents per training chunk (here: the whole corpus)
passes = 35 # Number of passes through the corpus
iterations = 400 # Maximum number of iterations through the corpus when inferring the topic distribution of a corpus
eval_every = None  # Don't evaluate model perplexity, takes too much time.

#-------Make an index to word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every, random_state=0)

2020-04-23 11:45:23,901 : INFO : using autotuned alpha, starting with [0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111]
2020-04-23 11:45:23,903 : INFO : using serial LDA version on this node
2020-04-23 11:45:23,908 : INFO : running online (multi-pass) LDA training, 9 topics, 35 passes over the supplied corpus of 8688 documents, updating model once every 8688 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2020-04-23 11:45:23,910 : INFO : PROGRESS: pass 0, at document #8688/8688
2020-04-23 11:45:23,910 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:45:40,016 : DEBUG : 7720/8688 documents converged within 400 iterations
2020-04-23 11:45:40,047 : INFO : optimized alpha [0.06737, 0.066226706, 0.07057916, 0.06962964, 0.077339575, 0.06567255, 0.06939833, 0.06990467, 0.075203344]
2020-04-23 11:45:40,048 : DEBUG : updating topics
2020-04-23 11:45:40,

2020-04-23 11:46:13,234 : INFO : topic #8 (0.059): 0.009*"happy_hour" + 0.008*"sandwich" + 0.007*"beer" + 0.004*"pizza" + 0.003*"game" + 0.003*"burger" + 0.003*"year" + 0.003*"away" + 0.003*"pork" + 0.003*"pulled_pork"
2020-04-23 11:46:13,235 : INFO : topic diff=0.230068, rho=0.408248
2020-04-23 11:46:13,242 : INFO : PROGRESS: pass 5, at document #8688/8688
2020-04-23 11:46:13,242 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:46:19,630 : DEBUG : 8672/8688 documents converged within 400 iterations
2020-04-23 11:46:19,675 : INFO : optimized alpha [0.050593093, 0.047887098, 0.04989773, 0.05116678, 0.054736998, 0.048659768, 0.05543188, 0.050936602, 0.057203665]
2020-04-23 11:46:19,676 : DEBUG : updating topics
2020-04-23 11:46:19,682 : INFO : topic #1 (0.048): 0.012*"noodle" + 0.009*"soup" + 0.008*"pork" + 0.006*"shrimp" + 0.005*"mashed_potato" + 0.005*"plate" + 0.004*"broth" + 0.004*"pork_chop" + 0.004*"goat_cheese" + 0.004*"thai"
2020-04-23 11:46:19,682 : INFO

2020-04-23 11:46:43,176 : INFO : topic diff=0.169080, rho=0.301511
2020-04-23 11:46:43,182 : INFO : PROGRESS: pass 10, at document #8688/8688
2020-04-23 11:46:43,183 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:46:48,877 : DEBUG : 8679/8688 documents converged within 400 iterations
2020-04-23 11:46:48,918 : INFO : optimized alpha [0.048909727, 0.045246225, 0.044336345, 0.047880687, 0.049626507, 0.04508358, 0.057567675, 0.04673463, 0.05546772]
2020-04-23 11:46:48,918 : DEBUG : updating topics
2020-04-23 11:46:48,924 : INFO : topic #2 (0.044): 0.047*"pizza" + 0.009*"crust" + 0.006*"deep_fried" + 0.006*"topping" + 0.006*"lobster" + 0.005*"steak" + 0.005*"slice" + 0.005*"thin_crust" + 0.004*"wing" + 0.004*"italian"
2020-04-23 11:46:48,925 : INFO : topic #5 (0.045): 0.026*"taco" + 0.010*"roll" + 0.009*"salsa" + 0.009*"spicy" + 0.009*"burrito" + 0.008*"carne_asada" + 0.007*"mexican" + 0.007*"bean" + 0.007*"chinese" + 0.007*"chip_salsa"
2020-04-23 11:46:48,927 : I

2020-04-23 11:47:11,673 : INFO : PROGRESS: pass 15, at document #8688/8688
2020-04-23 11:47:11,674 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:47:17,368 : DEBUG : 8687/8688 documents converged within 400 iterations
2020-04-23 11:47:17,404 : INFO : optimized alpha [0.049751673, 0.0454386, 0.042084824, 0.047240432, 0.048446093, 0.04321949, 0.062439155, 0.045434147, 0.05715317]
2020-04-23 11:47:17,405 : DEBUG : updating topics
2020-04-23 11:47:17,410 : INFO : topic #2 (0.042): 0.053*"pizza" + 0.010*"crust" + 0.007*"wing" + 0.006*"topping" + 0.006*"deep_fried" + 0.006*"lobster" + 0.006*"slice" + 0.005*"italian" + 0.005*"steak" + 0.005*"thin_crust"
2020-04-23 11:47:17,411 : INFO : topic #5 (0.043): 0.030*"taco" + 0.011*"salsa" + 0.010*"burrito" + 0.009*"carne_asada" + 0.009*"roll" + 0.009*"spicy" + 0.008*"mexican" + 0.008*"bean" + 0.008*"chip_salsa" + 0.008*"chip"
2020-04-23 11:47:17,412 : INFO : topic #0 (0.050): 0.008*"highly_recommend" + 0.008*"wine" + 0.008

2020-04-23 11:47:38,535 : INFO : PROGRESS: pass 20, at document #8688/8688
2020-04-23 11:47:38,535 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:47:43,406 : DEBUG : 8686/8688 documents converged within 400 iterations
2020-04-23 11:47:43,439 : INFO : optimized alpha [0.05130225, 0.046649225, 0.041139908, 0.04776195, 0.048590716, 0.042009056, 0.067908734, 0.045156527, 0.060055256]
2020-04-23 11:47:43,440 : DEBUG : updating topics
2020-04-23 11:47:43,445 : INFO : topic #2 (0.041): 0.057*"pizza" + 0.010*"crust" + 0.009*"wing" + 0.007*"topping" + 0.006*"deep_fried" + 0.006*"lobster" + 0.006*"slice" + 0.006*"italian" + 0.006*"pasta" + 0.006*"garlic"
2020-04-23 11:47:43,446 : INFO : topic #5 (0.042): 0.034*"taco" + 0.012*"salsa" + 0.011*"burrito" + 0.010*"carne_asada" + 0.009*"mexican" + 0.009*"bean" + 0.009*"chip" + 0.009*"chip_salsa" + 0.008*"spicy" + 0.006*"tortilla"
2020-04-23 11:47:43,447 : INFO : topic #0 (0.051): 0.009*"highly_recommend" + 0.008*"wine" + 0.0

2020-04-23 11:48:03,329 : INFO : PROGRESS: pass 25, at document #8688/8688
2020-04-23 11:48:03,330 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:48:08,121 : DEBUG : 8687/8688 documents converged within 400 iterations
2020-04-23 11:48:08,161 : INFO : optimized alpha [0.053008087, 0.04830505, 0.040815763, 0.0487734, 0.049217615, 0.041125588, 0.073615454, 0.04542962, 0.06341148]
2020-04-23 11:48:08,162 : DEBUG : updating topics
2020-04-23 11:48:08,167 : INFO : topic #2 (0.041): 0.060*"pizza" + 0.011*"crust" + 0.011*"wing" + 0.007*"italian" + 0.007*"topping" + 0.007*"slice" + 0.007*"deep_fried" + 0.007*"pasta" + 0.006*"lobster" + 0.006*"garlic"
2020-04-23 11:48:08,168 : INFO : topic #5 (0.041): 0.037*"taco" + 0.014*"salsa" + 0.012*"burrito" + 0.010*"mexican" + 0.010*"carne_asada" + 0.010*"chip" + 0.010*"bean" + 0.010*"chip_salsa" + 0.008*"spicy" + 0.007*"tortilla"
2020-04-23 11:48:08,169 : INFO : topic #0 (0.053): 0.009*"highly_recommend" + 0.009*"wine" + 0.008*

2020-04-23 11:48:29,111 : INFO : PROGRESS: pass 30, at document #8688/8688
2020-04-23 11:48:29,112 : DEBUG : performing inference on a chunk of 8688 documents
2020-04-23 11:48:33,599 : DEBUG : 8686/8688 documents converged within 400 iterations
2020-04-23 11:48:33,634 : INFO : optimized alpha [0.054743923, 0.05024816, 0.04082915, 0.049905956, 0.05012922, 0.040521156, 0.07941794, 0.045998, 0.06697818]
2020-04-23 11:48:33,635 : DEBUG : updating topics
2020-04-23 11:48:33,643 : INFO : topic #5 (0.041): 0.039*"taco" + 0.014*"salsa" + 0.013*"burrito" + 0.011*"mexican" + 0.011*"chip" + 0.011*"bean" + 0.011*"carne_asada" + 0.010*"chip_salsa" + 0.007*"tortilla" + 0.007*"sour_cream"
2020-04-23 11:48:33,645 : INFO : topic #2 (0.041): 0.062*"pizza" + 0.012*"wing" + 0.011*"crust" + 0.008*"italian" + 0.007*"topping" + 0.007*"pasta" + 0.007*"slice" + 0.007*"garlic" + 0.007*"deep_fried" + 0.006*"lobster"
2020-04-23 11:48:33,646 : INFO : topic #0 (0.055): 0.010*"highly_recommend" + 0.009*"wine" + 0.00

Wall time: 3min 28s


## Technique 1 for Determining Optimal Number of Topics: Topic Coherence

- Topic coherence is a measure used to evaluate topic models. 
- A set of statements or facts is said to be coherent if they support each other. 
- An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”. 
- Each topic produced by the model consists of words, and topic coherence is computed on the top N words of the topic. 

Below we 
- display the average topic coherence and
- print the topics in order of topic coherence.

- We use LdaModel's `top_topics` method to get the topics ranked by coherence score, together with the score of each topic.
- Note that we use the UMass topic coherence measure here (see gensim.models.ldamodel.LdaModel.top_topics()).
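
The UMass measure is usually written as a sum, over ordered pairs of a topic's top words, of log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents containing the given words and w_j is the higher-ranked word. The sketch below implements that textbook form in plain Python on made-up documents. It assumes every top word occurs in at least one document, and Gensim's own implementation may differ in smoothing and in how pair scores are aggregated, so the numbers are not comparable with the output of `top_topics`.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Textbook UMass coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i: w_j is ranked higher
        w_j, w_i = top_words[j], top_words[i]
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

toy_docs = [
    ["pizza", "crust", "slice"],
    ["pizza", "crust"],
    ["taco", "salsa"],
    ["taco", "salsa", "burrito"],
]
coherent = umass_coherence(["pizza", "crust"], toy_docs)
incoherent = umass_coherence(["pizza", "salsa"], toy_docs)
print(coherent, incoherent)
```

Words that tend to appear in the same documents push the score up, while words that never co-occur pull it down, which is why the "pizza, crust" topic scores higher than "pizza, salsa".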

In [125]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2020-04-23 11:48:52,054 : DEBUG : Setting topics to those of the model: LdaModel(num_terms=5368, num_topics=9, decay=0.5, chunksize=8688)
2020-04-23 11:48:52,074 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2020-04-23 11:48:52,089 : INFO : CorpusAccumulator accumulated stats from 2000 documents
2020-04-23 11:48:52,103 : INFO : CorpusAccumulator accumulated stats from 3000 documents
2020-04-23 11:48:52,117 : INFO : CorpusAccumulator accumulated stats from 4000 documents
2020-04-23 11:48:52,131 : INFO : CorpusAccumulator accumulated stats from 5000 documents
2020-04-23 11:48:52,154 : INFO : CorpusAccumulator accumulated stats from 6000 documents
2020-04-23 11:48:52,167 : INFO : CorpusAccumulator accumulated stats from 7000 documents
2020-04-23 11:48:52,188 : INFO : CorpusAccumulator accumulated stats from 8000 documents


Average topic coherence: -3.0070.
[([(0.008108814, 'told'),
   (0.0055919816, 'manager'),
   (0.004920743, 'waitress'),
   (0.004616446, 'left'),
   (0.00409281, 'business'),
   (0.00401735, 'waiting'),
   (0.00400646, 'owner'),
   (0.0038892885, 'walked'),
   (0.0038374988, 'away'),
   (0.0036012216, 'looked'),
   (0.0035949524, 'water'),
   (0.0034833462, 'someone'),
   (0.003462351, 'gave'),
   (0.0034219755, 'later'),
   (0.003268127, 'employee'),
   (0.0032047888, 'front'),
   (0.0031672341, 'waited'),
   (0.003151949, 'seated'),
   (0.0031442423, 'anything'),
   (0.0031044954, 'year')],
  -2.1672690709675178),
 ([(0.009913259, 'highly_recommend'),
   (0.008971715, 'wine'),
   (0.0079843, 'dining_room'),
   (0.0077860747, 'room'),
   (0.007014186, 'dining'),
   (0.00695396, 'gluten_free'),
   (0.006884391, 'dessert'),
   (0.006162452, 'steak'),
   (0.005223123, 'chef'),
   (0.0047742114, 'reservation'),
   (0.004683277, 'bread'),
   (0.004670844, 'cocktail'),
   (0.0045223883, 'hi

## Technique 2 for Determining Optimal Number of Topics: Visualization

- We use **pyLDAvis** to interpret the topics in a topic model that has been fit to a corpus of text data. 

- It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [122]:
# import pyLDAvis.gensim
# pyLDAvis.enable_notebook()

# import warnings
# warnings.filterwarnings("ignore", category=DeprecationWarning) 

# pyLDAvis.gensim.prepare(model, corpus, dictionary)


## Interpretation of the Visualization 

- Relevance is defined as in footer 2 and can be tuned with the parameter $\lambda$.

Smaller $\lambda$ gives higher weight to the term's distinctiveness.

Larger $\lambda$ gives higher weight to the probability of the term occurring within the topic.

- Therefore, to get a better sense of the terms that characterize each topic, we use $\lambda = 0$.
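The effect of $\lambda$ can be sketched with the relevance formula from the LDAvis paper by Sievert and Shirley, on which pyLDAvis is based: $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The helper names and the probabilities below are hypothetical and chosen only to show how the ranking changes; they are not taken from the fitted model.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term for a topic, weighted by lam in [0, 1]."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

def rank_terms(topic_probs, corpus_probs, lam):
    """Return the terms sorted from most to least relevant."""
    scores = {
        term: relevance(p, corpus_probs[term], lam)
        for term, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical probabilities: p(term | topic) and p(term) in the whole corpus.
toy_topic_probs = {'good': 0.05, 'taco': 0.04, 'carne_asada': 0.01}
toy_corpus_probs = {'good': 0.05, 'taco': 0.005, 'carne_asada': 0.0005}

by_probability = rank_terms(toy_topic_probs, toy_corpus_probs, lam=1.0)
by_distinctiveness = rank_terms(toy_topic_probs, toy_corpus_probs, lam=0.0)
print(by_probability)
print(by_distinctiveness)
```

With `lam=1.0` the generic but frequent term ranks first, while with `lam=0.0` the term that is rare in the corpus but concentrated in the topic ranks first.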

## Display the Top Words in the Topics

- We display the top 10 words for each topic.

In [98]:
def get_lda_topics(model, num_topics, top_words):
    """Return a DataFrame with the top_words most probable words for each of num_topics topics."""
    word_dict = {}
    for topic_id in range(num_topics):
        # show_topic returns (word, probability) pairs; keep only the words
        words = model.show_topic(topic_id, topn=top_words)
        word_dict['Topic # {:02d}'.format(topic_id + 1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [126]:
get_lda_topics(model, num_topics, 10) #View top 10 words for each topic

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | highly_recommend | soup | pizza | coffee | sushi | taco | told | burger | happy_hour |
| 1 | wine | noodle | wing | cream | roll | salsa | manager | breakfast | beer |
| 2 | dining_room | pork | crust | spring_roll | bowl | burrito | waitress | bacon | sandwich |
| 3 | room | shrimp | italian | cake | spicy | chip | left | egg | game |
| 4 | dining | thai | pasta | chocolate | ramen | mexican | business | drive_thru | local |
| 5 | gluten_free | mashed_potato | topping | dessert | fish | bean | waiting | sandwich | year |
| 6 | dessert | broth | slice | milk | soup | carne_asada | owner | brunch | pulled_pork |
| 7 | steak | pork_belly | garlic | cafe | buffet | chip_salsa | walked | onion | five_star |
| 8 | chef | plate | deep_fried | shop | flavour | black_bean | away | toast | open |
| 9 | reservation | cooked | lobster | vegan | korean | tortilla | looked | onion_ring | music |


## Generate Labels for the Topics

- We can manually generate human-interpretable labels for each topic by looking at the most probable terms in each topic.


- We use LdaModel's "show_topic" method, which returns **word-probability pairs** for the most probable words in the topic.

In [149]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LdaModel, a topic number, and the number of top terms (topn) of interest.
    Optionally print a formatted list of the topn terms with their probabilities,
    and return the terms as a list.
    """
    terms = []
    # show_topic yields (term, probability) pairs, most probable first
    for term, prob in lda_model.show_topic(topic_number, topn=topn):
        terms.append(term)
        if output:
            print(u'{:30} {:.3f}'.format(term, round(prob, 3)))

    return terms

In [150]:
topic_summaries = []

print(u'{:25} {}'.format(u'term', u'frequency') + u'\n')

for i in range(num_topics):
    print('\nTopic '+str(i)+' |---------------------------\n')
    tmp = explore_topic(model, topic_number=i, topn=8, output=True)
    topic_summaries.append(tmp[:5])  # keep the top 5 terms as a short summary

term                      frequency


Topic 0 |---------------------------

highly_recommend               0.010
wine                           0.009
dining_room                    0.008
room                           0.008
dining                         0.007
gluten_free                    0.007
dessert                        0.007
steak                          0.006

Topic 1 |---------------------------

soup                           0.016
noodle                         0.016
pork                           0.013
shrimp                         0.009
thai                           0.009
mashed_potato                  0.006
broth                          0.005
pork_belly                     0.005

Topic 2 |---------------------------

pizza                          0.060
wing                           0.012
crust                          0.011
italian                        0.008
pasta                          0.007
topping                        0.007
slice                          0

## Manually Generate Topic Labels

- Based on the most probable words in each topic, we assign a human-interpretable label to each topic.

In [165]:
top_labels = {0: 'Fine Dining', 1:'Thai Food', 2:'Italian Food', 3:'Bakery', 4:'Asian Food', 5:'Mexican Food',
              6:'Customer Experience', 7:'Fast Food', 8:'Happy Hour'}

In [166]:
top_labels

{0: 'Fine Dining',
 1: 'Thai Food',
 2: 'Italian Food',
 3: 'Bakery',
 4: 'Asian Food',
 5: 'Mexican Food',
 6: 'Customer Experience',
 7: 'Fast Food',
 8: 'Happy Hour'}

In [128]:
from gensim.models import LdaModel

# Save model to disk.
model.save('LDA_model')
# Load the saved model back from disk.
model = LdaModel.load('LDA_model')

2020-04-23 11:52:58,318 : INFO : saving LdaState object under LDA_model.state, separately None
2020-04-23 11:52:58,318 : DEBUG : {'uri': 'LDA_model.state', 'mode': 'wb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2020-04-23 11:52:58,321 : INFO : saved LDA_model.state
2020-04-23 11:52:58,321 : DEBUG : {'uri': 'LDA_model.id2word', 'mode': 'wb', 'buffering': -1, 'encoding': None, 'errors': None, 'newline': None, 'closefd': True, 'opener': None, 'ignore_ext': False, 'transport_params': None}
2020-04-23 11:52:58,324 : INFO : saving LdaModel object under LDA_model, separately ['expElogbeta', 'sstats']
2020-04-23 11:52:58,325 : INFO : storing np array 'expElogbeta' to LDA_model.expElogbeta.npy
2020-04-23 11:52:58,327 : INFO : not storing attribute dispatcher
2020-04-23 11:52:58,328 : INFO : not storing attribute id2word
2020-04-23 11:52:58,328 : INFO : not storing attribute state
2020-04-2

In [134]:
# Create a new corpus, made of previously unseen documents.
other_texts = [
    ['taco', 'with', 'salsa'],
    ['taco', 'mexican', 'burrito', 'wife'],
    ['tortilla', 'chips', 'saturday'],
]
other_corpus = [dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[0]
vector = model[unseen_doc]  # get topic probability distribution for a document
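One detail worth keeping in mind when scoring unseen text: by default, Gensim's `Dictionary.doc2bow` keeps only tokens that already exist in the dictionary and silently ignores the rest, so words removed or normalized during pre-processing do not contribute to the topic distribution. The sketch below imitates that behaviour in plain Python. `toy_doc2bow` and `toy_token2id` are hypothetical stand-ins, not the notebook's dictionary.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Map tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; the real dictionary was built from the Yelp reviews.
toy_token2id = {'taco': 0, 'salsa': 1, 'chip': 2}

known_and_unknown = toy_doc2bow(['taco', 'with', 'salsa'], toy_token2id)
all_unknown = toy_doc2bow(['tortilla', 'chips', 'saturday'], toy_token2id)
print(known_and_unknown)
print(all_unknown)
```

In this toy vocabulary the plural `chips` is not matched by `chip`, which illustrates why unseen documents should go through the same cleaning and lemmatization steps as the training data before being converted to bag-of-words.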

In [135]:
#Update the model by incrementally training on the new corpus
model.update(other_corpus)
vector = model[unseen_doc]

2020-04-23 11:59:19,705 : INFO : running online (multi-pass) LDA training, 9 topics, 35 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2020-04-23 11:59:19,706 : INFO : PROGRESS: pass 0, at document #3/3
2020-04-23 11:59:19,707 : DEBUG : performing inference on a chunk of 3 documents
2020-04-23 11:59:19,708 : DEBUG : 3/3 documents converged within 400 iterations
2020-04-23 11:59:19,708 : INFO : optimized alpha [0.05022064, 0.046735425, 0.037715655, 0.045930922, 0.046018545, 0.07255265, 0.071380675, 0.04243688, 0.060923647]
2020-04-23 11:59:19,709 : DEBUG : updating topics
2020-04-23 11:59:19,711 : INFO : merging changes from 3 documents into a model of 8694 documents
2020-04-23 11:59:19,715 : INFO : topic #2 (0.038): 0.062*"pizza" + 0.012*"wing" + 0.012*"crust" + 0.008*"italian" + 0.008*"pasta" + 0.007*"topping" + 0.007*"slice" + 0.007*"garlic" + 0.007

2020-04-23 11:59:19,799 : INFO : topic #6 (0.070): 0.008*"told" + 0.006*"manager" + 0.005*"waitress" + 0.005*"left" + 0.004*"business" + 0.004*"waiting" + 0.004*"owner" + 0.004*"walked" + 0.004*"away" + 0.004*"looked"
2020-04-23 11:59:19,799 : INFO : topic #5 (0.077): 0.169*"taco" + 0.060*"salsa" + 0.059*"burrito" + 0.058*"mexican" + 0.056*"tortilla" + 0.049*"chip" + 0.009*"wife" + 0.009*"saturday" + 0.007*"bean" + 0.006*"carne_asada"
2020-04-23 11:59:19,800 : INFO : topic diff=0.010082, rho=0.018560
2020-04-23 11:59:19,806 : INFO : PROGRESS: pass 5, at document #3/3
2020-04-23 11:59:19,807 : DEBUG : performing inference on a chunk of 3 documents
2020-04-23 11:59:19,809 : DEBUG : 3/3 documents converged within 400 iterations
2020-04-23 11:59:19,810 : INFO : optimized alpha [0.049527153, 0.046134103, 0.037322816, 0.04534996, 0.045435376, 0.07851374, 0.069991186, 0.041940335, 0.05990719]
2020-04-23 11:59:19,811 : DEBUG : updating topics
2020-04-23 11:59:19,813 : INFO : merging changes fr

2020-04-23 11:59:19,904 : INFO : topic #8 (0.059): 0.017*"happy_hour" + 0.013*"beer" + 0.010*"sandwich" + 0.005*"game" + 0.004*"local" + 0.004*"year" + 0.003*"pulled_pork" + 0.003*"five_star" + 0.003*"open" + 0.003*"music"
2020-04-23 11:59:19,906 : INFO : topic #6 (0.069): 0.008*"told" + 0.006*"manager" + 0.005*"waitress" + 0.005*"left" + 0.004*"business" + 0.004*"waiting" + 0.004*"owner" + 0.004*"walked" + 0.004*"away" + 0.004*"looked"
2020-04-23 11:59:19,907 : INFO : topic #5 (0.084): 0.174*"taco" + 0.064*"salsa" + 0.064*"burrito" + 0.063*"mexican" + 0.061*"tortilla" + 0.046*"chip" + 0.017*"wife" + 0.017*"saturday" + 0.006*"bean" + 0.006*"carne_asada"
2020-04-23 11:59:19,908 : INFO : topic diff=0.010270, rho=0.018544
2020-04-23 11:59:19,913 : INFO : PROGRESS: pass 10, at document #3/3
2020-04-23 11:59:19,914 : DEBUG : performing inference on a chunk of 3 documents
2020-04-23 11:59:19,915 : DEBUG : 3/3 documents converged within 400 iterations
2020-04-23 11:59:19,916 : INFO : optimize

2020-04-23 11:59:19,998 : INFO : topic #7 (0.041): 0.034*"burger" + 0.017*"breakfast" + 0.012*"bacon" + 0.011*"egg" + 0.010*"drive_thru" + 0.009*"sandwich" + 0.008*"brunch" + 0.008*"onion" + 0.007*"toast" + 0.007*"onion_ring"
2020-04-23 11:59:19,999 : INFO : topic #8 (0.058): 0.017*"happy_hour" + 0.013*"beer" + 0.010*"sandwich" + 0.005*"game" + 0.004*"local" + 0.004*"year" + 0.003*"pulled_pork" + 0.003*"five_star" + 0.003*"open" + 0.003*"music"
2020-04-23 11:59:20,000 : INFO : topic #6 (0.068): 0.008*"told" + 0.006*"manager" + 0.005*"waitress" + 0.005*"left" + 0.004*"business" + 0.004*"waiting" + 0.004*"owner" + 0.004*"walked" + 0.004*"away" + 0.004*"looked"
2020-04-23 11:59:20,001 : INFO : topic #5 (0.090): 0.179*"taco" + 0.069*"salsa" + 0.068*"burrito" + 0.067*"mexican" + 0.065*"tortilla" + 0.042*"chip" + 0.025*"wife" + 0.025*"saturday" + 0.006*"bean" + 0.006*"carne_asada"
2020-04-23 11:59:20,002 : INFO : topic diff=0.010471, rho=0.018528
2020-04-23 11:59:20,009 : INFO : PROGRESS: pa

2020-04-23 11:59:20,101 : DEBUG : updating topics
2020-04-23 11:59:20,104 : INFO : merging changes from 3 documents into a model of 8694 documents
2020-04-23 11:59:20,108 : INFO : topic #2 (0.036): 0.061*"pizza" + 0.012*"wing" + 0.011*"crust" + 0.008*"italian" + 0.007*"pasta" + 0.007*"topping" + 0.007*"slice" + 0.007*"garlic" + 0.007*"deep_fried" + 0.006*"lobster"
2020-04-23 11:59:20,108 : INFO : topic #7 (0.041): 0.034*"burger" + 0.017*"breakfast" + 0.012*"bacon" + 0.011*"egg" + 0.010*"drive_thru" + 0.009*"sandwich" + 0.008*"brunch" + 0.008*"onion" + 0.007*"toast" + 0.007*"onion_ring"
2020-04-23 11:59:20,109 : INFO : topic #8 (0.057): 0.017*"happy_hour" + 0.013*"beer" + 0.010*"sandwich" + 0.005*"game" + 0.004*"local" + 0.004*"year" + 0.003*"pulled_pork" + 0.003*"five_star" + 0.003*"open" + 0.003*"music"
2020-04-23 11:59:20,110 : INFO : topic #6 (0.066): 0.008*"told" + 0.006*"manager" + 0.005*"waitress" + 0.005*"left" + 0.004*"business" + 0.004*"waiting" + 0.004*"owner" + 0.004*"walked

2020-04-23 11:59:20,198 : DEBUG : performing inference on a chunk of 3 documents
2020-04-23 11:59:20,199 : DEBUG : 3/3 documents converged within 400 iterations
2020-04-23 11:59:20,200 : INFO : optimized alpha [0.04714527, 0.04406002, 0.03595281, 0.04334414, 0.043422174, 0.105097435, 0.065336674, 0.04021866, 0.056460347]
2020-04-23 11:59:20,200 : DEBUG : updating topics
2020-04-23 11:59:20,203 : INFO : merging changes from 3 documents into a model of 8694 documents
2020-04-23 11:59:20,206 : INFO : topic #2 (0.036): 0.061*"pizza" + 0.012*"wing" + 0.011*"crust" + 0.008*"italian" + 0.007*"pasta" + 0.007*"topping" + 0.007*"slice" + 0.007*"garlic" + 0.007*"deep_fried" + 0.006*"lobster"
2020-04-23 11:59:20,207 : INFO : topic #7 (0.040): 0.033*"burger" + 0.017*"breakfast" + 0.012*"bacon" + 0.010*"egg" + 0.010*"drive_thru" + 0.009*"sandwich" + 0.008*"brunch" + 0.008*"onion" + 0.007*"toast" + 0.007*"onion_ring"
2020-04-23 11:59:20,208 : INFO : topic #8 (0.056): 0.017*"happy_hour" + 0.013*"beer"

2020-04-23 11:59:20,284 : INFO : topic #5 (0.112): 0.192*"taco" + 0.079*"salsa" + 0.079*"burrito" + 0.078*"mexican" + 0.076*"tortilla" + 0.044*"wife" + 0.044*"saturday" + 0.034*"chip" + 0.005*"bean" + 0.004*"carne_asada"
2020-04-23 11:59:20,284 : INFO : topic diff=0.011086, rho=0.018484
2020-04-23 11:59:20,290 : INFO : PROGRESS: pass 29, at document #3/3
2020-04-23 11:59:20,290 : DEBUG : performing inference on a chunk of 3 documents
2020-04-23 11:59:20,291 : DEBUG : 3/3 documents converged within 400 iterations
2020-04-23 11:59:20,292 : INFO : optimized alpha [0.046581637, 0.043567233, 0.035623822, 0.042867124, 0.042943448, 0.11322573, 0.06426126, 0.039807532, 0.05565462]
2020-04-23 11:59:20,293 : DEBUG : updating topics
2020-04-23 11:59:20,295 : INFO : merging changes from 3 documents into a model of 8694 documents
2020-04-23 11:59:20,298 : INFO : topic #2 (0.036): 0.061*"pizza" + 0.012*"wing" + 0.011*"crust" + 0.008*"italian" + 0.007*"pasta" + 0.007*"topping" + 0.007*"slice" + 0.007

2020-04-23 11:59:20,378 : INFO : topic #6 (0.063): 0.008*"told" + 0.006*"manager" + 0.005*"waitress" + 0.005*"left" + 0.004*"business" + 0.004*"waiting" + 0.004*"owner" + 0.004*"walked" + 0.004*"away" + 0.004*"looked"
2020-04-23 11:59:20,379 : INFO : topic #5 (0.120): 0.196*"taco" + 0.083*"salsa" + 0.082*"burrito" + 0.081*"mexican" + 0.080*"tortilla" + 0.050*"wife" + 0.050*"saturday" + 0.031*"chip" + 0.004*"bean" + 0.004*"carne_asada"
2020-04-23 11:59:20,380 : INFO : topic diff=0.011323, rho=0.018468
2020-04-23 11:59:20,385 : INFO : PROGRESS: pass 34, at document #3/3
2020-04-23 11:59:20,385 : DEBUG : performing inference on a chunk of 3 documents
2020-04-23 11:59:20,386 : DEBUG : 3/3 documents converged within 400 iterations
2020-04-23 11:59:20,387 : INFO : optimized alpha [0.04604281, 0.043095414, 0.03530757, 0.042410247, 0.04248495, 0.121869005, 0.06324221, 0.039413154, 0.054887854]
2020-04-23 11:59:20,389 : DEBUG : updating topics
2020-04-23 11:59:20,392 : INFO : merging changes fr

In [164]:
print("Probabilities of belonging to each Topic: ", vector) #Show the probability to belong to each topic


vector.sort(key = lambda x: x[1],reverse=True)


print("\n\nThe given document belongs to the Topic: ", top_labels[vector[0][0]])

Probabilities of belonging to each Topic:  [(5, 0.852583), (6, 0.025411228), (8, 0.02205438), (0, 0.01850037), (1, 0.017316082), (4, 0.017070794), (3, 0.017040778), (7, 0.015836522), (2, 0.014186866)]


The given document belongs to the Topic:  Mexican Food
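The same lookup can be wrapped in a small helper that does not reorder the original list and that can decline to assign a label when no topic is dominant. This is only a sketch: `dominant_topic` and its `min_prob` threshold are hypothetical additions, and the example values are copied from the output above.

```python
def dominant_topic(topic_probs, labels, min_prob=0.0):
    """Return (label, probability) of the most probable topic.

    topic_probs is a list of (topic_id, probability) pairs.
    The label is None when the best probability is below min_prob.
    """
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    if prob < min_prob:
        return None, prob
    return labels[topic_id], prob

# Values copied from the notebook output above.
example_vector = [(5, 0.852583), (6, 0.025411228), (8, 0.02205438),
                  (0, 0.01850037), (1, 0.017316082), (4, 0.017070794),
                  (3, 0.017040778), (7, 0.015836522), (2, 0.014186866)]
example_labels = {0: 'Fine Dining', 1: 'Thai Food', 2: 'Italian Food',
                  3: 'Bakery', 4: 'Asian Food', 5: 'Mexican Food',
                  6: 'Customer Experience', 7: 'Fast Food', 8: 'Happy Hour'}

label, prob = dominant_topic(example_vector, example_labels)
print(label, prob)
```

Passing a threshold such as `min_prob=0.5` is one way to avoid forcing a label onto documents whose probability mass is spread across several topics.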


In [138]:
# unseen_document = 'I just love taco and chips.'
# bow_vector = dictionary.doc2bow(preprocess(unseen_document))
# for index, score in sorted(model[bow_vector], key=lambda tup: -1*tup[1]):
#     print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 5)))