|<i> Note: This notebook is inspired by the Topic-Modeling-Latent-Dirichlet-Allocation series at: https://github.com/rhasanbd/Topic-Modeling-Latent-Dirichlet-Allocation </i>

## Latent Dirichlet Allocation - Implementation on Yelp dataset

In this notebook, we implement Latent Dirichlet Allocation(LDA) on the Yelp reviews data to carry out Topic Modelling. We use the Gensim topic modelling API https://radimrehurek.com/gensim/models/ldamodel.html. Scikit-Learn implementation is also available (we use Gensim since it provides more functionality and application like Topic Coherence Pipeline or Dynamic Topic Modeling.)

We build an **end-to-end Natural Language Processing (NLP) pipeline**, starting with raw data and running through preparing, modeling, visualization.
The steps that we will carry out involves the following:
1. Exploratory Data Analysis
2. Data Cleaning and Pre-processing
3. Topic modeling with LDA
4. Determine optimal number of Topics
5. Visualize topic model using pyLDAvis

In [1]:
%pylab inline
import pandas as pd
import pickle as pk
from scipy import sparse as sp

import nltk
nltk.download('wordnet')

from pymongo import MongoClient

Populating the interactive namespace from numpy and matplotlib


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load & Explore the Data

In [2]:
client = MongoClient("mongodb://localhost:27017/")
db = client.yelp_database
df = pd.DataFrame(db.business_restaurant.find({},{"reviews.text":1, "_id":0}))
df = df.applymap(lambda x : x[0]['text'])
df.head() #Quick Check of the data

Unnamed: 0,reviews
0,During the recent Yelp scavenger hunt event my...
1,Bolt is within walking distance of The Drake H...
2,Apteka was one the highest rated places I have...
3,"When people say Korean food, what do you think..."
4,NOM NOM NOM! \n\nHow did it take me so long to...


In [3]:
print("The total number of reviews is:", df.shape[0])

The total number of reviews is: 8688


In [4]:
df.info() # View data description (Total rows, Column names, type and number of non-null values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8688 entries, 0 to 8687
Data columns (total 1 columns):
reviews    8688 non-null object
dtypes: object(1)
memory usage: 68.0+ KB


In [5]:
print("Dimension of the data: ", df.shape) # View data dimension

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (8688, 1)
No. of Rows: 8688
No. of Columns: 1


## Convert the Text column into a 2D Array of Documents

- We convert the documents from the text column to an array of documents.

- It's a 2D array in which each row reprents a document.

In [6]:
from array import array

docs_array = np.array(df['reviews']) # Convert the 'reviews' column into array

print("Dimension of the documents array: ", docs_array.shape) # View dimensions of new array
print()
print(docs_array[6]) # View a document

Dimension of the documents array:  (8688,)

Have been to the Salt Cellar countless times over the years. Cannot believe I've never left a review here. Interesting underground restaurant that is very easy to miss if you do not know where it is. It's a small little door on the top but a huge restaurant Underground. Has very good seafood for Arizona. Flown in fresh and cooked to order properly. They also do a great happy hour so you can try some of their Specialties at a discounted price on food and drink. Fun place to come and check out with some friends.


## Pre-process the Data

Pre-processing of the text data is done using the following steps:

- Convert to lowercase 
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove words that are only a single character
- Lemmatize the tokens/words


### Tokenization and Lemmatization

- We convert all the words into lowercase then tokenize each word using NLTK Regular-Expression Tokenizer class "RegexpTokenizer". 
- It splits a given string to substrings using a regular expression. 
- Then we remove numbers and single character words since they usually don't impart much useful information and are very high in number.
- Finally, we lemmatize the tokens using WordNetLemmatizer from NLTK, where we extract the root words of the tokens using the dictionary.

In [7]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    '''Function to Convert the 2D Document Array into a 2D Array of Processed Words'''
    tokenizer = RegexpTokenizer(r'\w+') # Tokenize the words
    
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert doc to lowercase
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split doc into words

    # Remove numbers, but not words that contain numbers
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    
    # Remove words with only one character
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    
    # Lemmatize all words
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
  
    return docs

- Now we convert the 2D Document Array into a 2D Array of Tokenized Words using the above function

In [8]:
%time 
docs = docs_preprocessor(docs_array)
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

Wall time: 0 ns
Length of the 2D Array of Tokenized Documents:  8688


In [9]:
print(docs[0:2]) #Display the first two documents with tokenized words

[['during', 'recent', 'yelp', 'scavenger', 'hunt', 'event', 'husband', 'this', 'place', 'last', 'venue', 'were', 'pretty', 'full', 'from', 'eating', 'elsewhere', 'told', 'them', 'they', 'would', 'sample', 'would', 'home', 'they', 'were', 'more', 'than', 'happy', 'this', 'asked', 'them', 'other', 'location', 'were', 'concerned', 'this', 'only', 'city', 'greek', 'themed', 'when', 'finally', 'were', 'able', 'have', 'doggie', 'pleased', 'that', 'they', 'gave', 'small', 'gyro', 'which', 'their', 'specialty', 'along', 'with', 'lemon', 'chicken', 'soup', 'oyster', 'cracker', 'believe', 'disappoint', 'just', 'enough', 'light', 'meal', 'later', 'they', 'have', 'reward', 'program', 'that', 'sandwich', 'salad', 'then', 'free', 'when', 'home', 'chance', 'review', 'menu', 'detail', 'serve', 'vegetarian', 'gyro', 'specialty', 'burger', 'specialty', 'sandwich', 'along', 'with', 'side', 'everything', 'carte', 'however', 'side', 'fry', 'coleslaw', 'reduced', 'price', 'order', 'sandwich', 'breakfast', '

## Remove all stop words

- Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text. 
- The stop words may be removed to avoid them being construed as signal for prediction.
- To remove the stop words, we use the "stopwords" module from the nltk library.

In [10]:
# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

# Load stop words
stop_words = stopwords.words('english')

# Show stop words
stop_words[:5]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we']

In [11]:
# Remove all stop words from the doc
for i in range(len(docs)):
    docs[i] = [word for word in docs[i] if word not in stop_words]

## Compute Bigrams/Trigrams:

- N-grams are combinations of adjacent words or letters of length 'n' that you can find in your source text. These combinations of words carry a special meaning. For example: car-pool is an n-gram formed using the two words car and pool that carries a distinct meaning different from the individual words. 

- If n=2, it is called a Bigram and if n=3, it is called a Trigram.

- We find all the combinations of Bigrams and Trigrams. Then, we keep only the frequent phrases. 
- We finally add the frequent phrases to the original data, since we would like to keep the words “car” and “pool” as well as the bigram “car_pool”.

In [12]:
from gensim.models import Phrases

bigram = Phrases(docs, min_count=10, threshold=100) # Add bigrams (if appears 10 times or more)
trigram = Phrases(bigram[docs], min_count=10, threshold=100) # Add trigrams (if appears 10 times or more)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a bigram, add to document
    for token in trigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a trigram, add to document

In [13]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs) # Create a dictionary representation of the documents
print('Number of unique words in initital documents:', len(dictionary))

Number of unique words in initital documents: 26831


## Remove Rare and Common Tokens/Words

- We remove in-frequent words from our dictionary. 
- We also remove words that appear frequently in most documents.

In [14]:
# Filter out words that occur less than 10 documents, or more than 10% of the documents
dictionary.filter_extremes(no_below=10, no_above=0.10) 

print('Number of unique words after removing rare and common words:', len(dictionary))

Number of unique words after removing rare and common words: 5368


## Bag-of-Words Representation of Data


- We transform the documents to a **vectorized form**. 

- We simply compute the frequency of each word, including the bigrams/trigrams.

In [15]:
corpus = [dictionary.doc2bow(doc) for doc in docs] # Bag-of-words representation of the docs

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 5368
Number of documents: 8688


## Training the LDA Model

- We use the gensim.models.LdaModel class for performing LDA. [https://radimrehurek.com/gensim/models/ldamodel.html]
- This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. 

#### The key parameters in this model are chosen as shown beloew:

- **num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.**

Since this is an supervised learning problem, we do not know how many topics are present in the given dataset. Inroder to determine the number of topics we use the following techniques:

Technique 1: Topic Coherence 
The main technique to determine the number of topics is **Topic coherence** [http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf]

Technique 2: Visualizing Inter-Topic Distance 
Use the LDA visualization tool pyLDAvis to observe Intertopic Distance Map (discussed later). By varying the number of topics we could determine the optimal value from the visualization.

- **chunksize (int, optional) – Number of documents to be used in each training chunk.**

It controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fit into memory. 

We set chunksize = 10000, which is equal to the amount of documents. Thus, it processes all the data in one go. Chunksize can however influence the quality of the model.

- **passes (int, optional) – Number of passes through the corpus during training.**

It controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. 

- **iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.**

It controls how often we repeat a particular loop over each document.

- **eval_every (int, optional) – Log perplexity is estimated every that many updates.**

Setting this to 1 slows down training by ~2x.


- **alpha ({numpy.ndarray, str}, optional): Can be set to an 1D array of length equal to the number of expected topics that expresses our a-priori belief for the each topics’ probability.**         
        
- **eta ({float, np.array, str}, optional) – A-priori belief on word probability.**

We set alpha = 'auto' and eta = 'auto'. Essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.

In [16]:
from gensim.models import LdaModel

#------Set training parameters
num_topics = 9 # Number of topics to discover
chunksize = 8688 # Size of the doc looked at every pass
passes = 35 # Number of passes through the corpus
iterations = 400 # Maximum number of iterations through the corpus when inferring the topic distribution of a corpus
eval_every = None  # Don't evaluate model perplexity, takes too much time.

#-------Make an index to word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every, random_state=0)

Wall time: 3min 21s


## Technique 1 for Determining Optimal Number of Topics: Topic Coherence

- Topic Coherence is a measure used to evaluate topic models. 
- A set of statements or facts is said to be coherent, if they support each other. 
- An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”. Each such generated topic consists of words, and the topic coherence is applied to the top N words from the topic. 

Below we display 
- the average topic coherence and
- print the topics in order of topic coherence

- We use LdaModel's "top_topics" method to get the topics with highest coherence score for each topic.
- Note that we use the “Umass” topic coherence measure here (see gensim.models.ldamodel.LdaModel.top_topics()).

In [17]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -3.2542.
[([(0.007950498, 'told'),
   (0.0053800517, 'manager'),
   (0.0050074374, 'waitress'),
   (0.004569946, 'left'),
   (0.0041263755, 'business'),
   (0.0040160106, 'away'),
   (0.0039047233, 'walked'),
   (0.003816554, 'waiting'),
   (0.003791713, 'owner'),
   (0.0036249827, 'looked'),
   (0.003618072, 'gave'),
   (0.003615429, 'year'),
   (0.0034398541, 'someone'),
   (0.003400193, 'water'),
   (0.0033694864, 'line'),
   (0.0033534097, 'later'),
   (0.0033080312, 'anything'),
   (0.0032832872, 'employee'),
   (0.0031821125, 'waited'),
   (0.0031360579, 'finally')],
  -2.2501967315398983),
 ([(0.013167798, 'noodle'),
   (0.013112263, 'sushi'),
   (0.012789463, 'soup'),
   (0.011534428, 'roll'),
   (0.009455893, 'bowl'),
   (0.008074639, 'spicy'),
   (0.0058222134, 'fish'),
   (0.005401786, 'pork'),
   (0.005258772, 'curry'),
   (0.005045314, 'broth'),
   (0.004900553, 'flavour'),
   (0.0048907343, 'ramen'),
   (0.004621502, 'chinese'),
   (0.004372038, '

## Technique 2 for Determining Optimal Number of Topics: Visualization

- We use **pyLDAvis** to interpret the topics in a topic model that has been fit to a corpus of text data. 

- It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [18]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))



## Interpretation of the Visualization 

- Relevence is defined as in footer 2 and can be tuned by parameter $\lambda$.

Smaller $\lambda$ gives higher weight to the term's distinctiveness.

Larger $\lambda$ corresponds to probablity of the term occurance per topics.

- Therefore, to get a better sense of terms per topic we use $\lambda = 0$.

## Display the Top Words in the Topics

- We display the top 10 words for each topic.

In [19]:
def get_lda_topics(model, num_topics, top_words):
    '''Function to return top words for num_topics'''
    word_dict = {};
    for i in range(num_topics):
        words = model.show_topic(i, topn = top_words);
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [i[0] for i in words];
    return pd.DataFrame(word_dict)

In [20]:
get_lda_topics(model, num_topics, 10) #View top 10 words for each topic

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09
0,room,steak,pizza,coffee,noodle,taco,told,burger,happy_hour
1,dining_room,bread,wing,cream,sushi,salsa,manager,breakfast,sandwich
2,gluten_free,dessert,burger,chocolate,soup,burrito,waitress,egg,beer
3,dining,potato,crust,cake,roll,bean,left,bacon,five_star
4,highly_recommend,shrimp,onion_ring,beer,bowl,thai,business,drive_thru,pulled_pork
5,hotel,cooked,topping,shop,spicy,chip,away,toast,bread
6,free,plate,onion,cafe,fish,mexican,walked,brunch,wine
7,chef,lobster,slice,milk,pork,carne_asada,waiting,french_toast,turkey
8,game,mashed_potato,garlic,store,curry,spring_roll,owner,potato,five
9,vega,appetizer,thin_crust,parking,broth,chip_salsa,looked,pancake,corned_beef


## Generate Labels for the Topics

- We can manually generate human-interpretable labels for each topic by looking at the terms that appear more in each topic.


- We use LdaModel's "show_topic" method that returns **Word-probability pairs** for the most relevant words generated by the topic.

In [21]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    accept a ldamodel, a topic number and topn vocabs of interest
    prints a formatted list of the topn terms
    """
    terms = []
    probabilities=[]
    for term, probability in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            probabilities += [np.float64(probability)]
            print(u'{:30} {:.3f}'.format(term, round(probability, 3)))
    
    return terms, probabilities

In [22]:
topic_summaries = []

print(u'{:25} {}'.format(u'term', u'probability') + u'\n')

for i in range(num_topics):
    print('\nTopic '+str(i)+' |---------------------------\n')
    tmp = explore_topic(model, topic_number=i, topn=10, output=True )
    topic_summaries += [tmp[:5]]

term                      probability


Topic 0 |---------------------------

room                           0.010
dining_room                    0.008
gluten_free                    0.007
dining                         0.007
highly_recommend               0.006
hotel                          0.005
free                           0.005
chef                           0.005
game                           0.004
vega                           0.004

Topic 1 |---------------------------

steak                          0.010
bread                          0.009
dessert                        0.007
potato                         0.007
shrimp                         0.006
cooked                         0.006
plate                          0.006
lobster                        0.006
mashed_potato                  0.006
appetizer                      0.005

Topic 2 |---------------------------

pizza                          0.051
wing                           0.017
burger                        

## Manually Generate Topic Labels

- Based on the most probable words generated by each topic, we assign human-interpretable labels for the topics.

In [23]:
top_labels = {"0": 'Fine Dining', "1":'Thai Food', "2":'Italian Food', "3":'Bakery', "4":'Asian Food', "5":'Mexican Food', "6":'Customer Experience', "7":'Fast Food', "8":'Happy Hour'}

for k, item in top_labels.items():
    top_labels[k]= {"topic": item, 
                    "details":[{"word": topic_summaries[int(k[-1])][0][i], "probability": topic_summaries[int(k[-1])][1][i]} for i in range(10)]
                   }
    
top_labels #View dict with the topic labels and the words

{'0': {'topic': 'Fine Dining',
  'details': [{'word': 'room', 'probability': 0.010474324226379395},
   {'word': 'dining_room', 'probability': 0.00830595288425684},
   {'word': 'gluten_free', 'probability': 0.00731181213632226},
   {'word': 'dining', 'probability': 0.0065196724608540535},
   {'word': 'highly_recommend', 'probability': 0.006396329030394554},
   {'word': 'hotel', 'probability': 0.00537074776366353},
   {'word': 'free', 'probability': 0.005216171499341726},
   {'word': 'chef', 'probability': 0.004911673720926046},
   {'word': 'game', 'probability': 0.004430868662893772},
   {'word': 'vega', 'probability': 0.004381795413792133}]},
 '1': {'topic': 'Thai Food',
  'details': [{'word': 'steak', 'probability': 0.010466063395142555},
   {'word': 'bread', 'probability': 0.008790980093181133},
   {'word': 'dessert', 'probability': 0.007166932802647352},
   {'word': 'potato', 'probability': 0.007165350951254368},
   {'word': 'shrimp', 'probability': 0.006414857227355242},
   {'word'

In [24]:
# Add average topic coherence to the top labels dictionary
i = 0
for key, item in top_labels.items():
    value = top_topics[i][1]
    top_labels[key]["coherence"] = value
    i +=1

In [25]:
top_labels

{'0': {'topic': 'Fine Dining',
  'details': [{'word': 'room', 'probability': 0.010474324226379395},
   {'word': 'dining_room', 'probability': 0.00830595288425684},
   {'word': 'gluten_free', 'probability': 0.00731181213632226},
   {'word': 'dining', 'probability': 0.0065196724608540535},
   {'word': 'highly_recommend', 'probability': 0.006396329030394554},
   {'word': 'hotel', 'probability': 0.00537074776366353},
   {'word': 'free', 'probability': 0.005216171499341726},
   {'word': 'chef', 'probability': 0.004911673720926046},
   {'word': 'game', 'probability': 0.004430868662893772},
   {'word': 'vega', 'probability': 0.004381795413792133}],
  'coherence': -2.2501967315398983},
 '1': {'topic': 'Thai Food',
  'details': [{'word': 'steak', 'probability': 0.010466063395142555},
   {'word': 'bread', 'probability': 0.008790980093181133},
   {'word': 'dessert', 'probability': 0.007166932802647352},
   {'word': 'potato', 'probability': 0.007165350951254368},
   {'word': 'shrimp', 'probability

In [26]:
topics = db['topic'] #Create new collection named topic in the database
for key in top_labels.keys(): #Insert each key as a new document to Mongodb
    topics.insert_one(top_labels[key])

## Save Model

In [27]:
from gensim.test.utils import datapath

# Save model to disk.
model.save('Models/LDA_model')

# Load a potentially pretrained model from disk.
model = LdaModel.load('Models/LDA_model')

## Test Model on Unseen Data 

In [28]:
# Create a new corpus, made of previously unseen documents.
other_texts = [
['taco', 'with', 'salsa'],
['taco', 'mexican', 'burrito', 'wife'],
['tortilla', 'chips', 'saturday']
]
other_corpus = [dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[0]

vector = model[unseen_doc]  # get topic probability distribution for a document

In [29]:
#Update the model by incrementally training on the new corpus
model.update(other_corpus)
vector = model[unseen_doc]

In [30]:
print("Probabilities of belonging to each Topic: ", vector) #Show the probability to belong to each topic


vector.sort(key = lambda x: x[1],reverse=True)

max_index = str(vector[0][0]) # Get topic index with highest probability

print("\n\nThe given document belongs to the Topic: ", top_labels[max_index]["topic"])

Probabilities of belonging to each Topic:  [(0, 0.020468578), (1, 0.018489294), (2, 0.016610784), (3, 0.01959241), (4, 0.023518225), (5, 0.8391875), (6, 0.030798083), (7, 0.016047796), (8, 0.015287285)]


The given document belongs to the Topic:  Mexican Food


## Conclusion
Our model was able to predict a document belonging to a topic well. The model has also given us 9 topics that are relevant to any restaurant business. 