<i> Note: This notebook is inspired by the Topic-Modeling-Latent-Dirichlet-Allocation series at: https://github.com/rhasanbd/Topic-Modeling-Latent-Dirichlet-Allocation </i>

## Latent Dirichlet Allocation - Implementation on Yelp dataset

In this notebook, we implement Latent Dirichlet Allocation (LDA) on Yelp review data to carry out topic modeling. We use the Gensim topic modeling API: https://radimrehurek.com/gensim/models/ldamodel.html. A Scikit-Learn implementation is also available; we use Gensim because it provides additional functionality such as the Topic Coherence Pipeline and Dynamic Topic Modeling.

We build an **end-to-end Natural Language Processing (NLP) pipeline**, starting with raw data and running through preparation, modeling, and visualization.
The steps that we will carry out are the following:
1. Exploratory Data Analysis
2. Data Cleaning and Pre-processing
3. Topic modeling with LDA
4. Determine optimal number of Topics
5. Visualize topic model using pyLDAvis

### Yelp Review Dataset
The Yelp Review Dataset is a CSV file that contains a sub-sample of 10,000 reviews extracted from the Yelp dataset available at: https://www.yelp.com/dataset.

The review dataset contains the following fields:
- business_id : Unique identifier of business
- date : Date the review was posted (shown as M/D/YYYY in this sub-sample)
- review_id : Unique identifier of review
- stars : Star rating (1 to 5 stars)
- text : Review text
- user_id : Unique identifier of user who posted the review
- cool : Number of cool votes received
- useful : Number of useful votes received
- funny : Number of funny votes received

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

%pylab inline
import pandas as pd
import pickle as pk
from scipy import sparse as sp

import nltk
nltk.download('wordnet')

2020-03-18 10:14:17,433 : DEBUG : $HOME=C:\Users\rojin
2020-03-18 10:14:17,439 : DEBUG : CONFIGDIR=C:\Users\rojin\.matplotlib
2020-03-18 10:14:17,441 : DEBUG : matplotlib data path: c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data
2020-03-18 10:14:17,462 : DEBUG : loaded rc file c:\users\rojin\appdata\local\programs\python\python37\lib\site-packages\matplotlib\mpl-data\matplotlibrc
2020-03-18 10:14:17,470 : DEBUG : matplotlib version 3.1.2
2020-03-18 10:14:17,472 : DEBUG : interactive is False
2020-03-18 10:14:17,474 : DEBUG : platform is win32


2020-03-18 10:14:17,618 : DEBUG : CACHEDIR=C:\Users\rojin\.matplotlib
2020-03-18 10:14:17,638 : DEBUG : Using fontManager instance from C:\Users\rojin\.matplotlib\fontlist-v310.json
2020-03-18 10:14:18,340 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-03-18 10:14:18,352 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.
2020-03-18 10:14:18,361 : DEBUG : Loaded backend module://ipykernel.pylab.backend_inline version unknown.


Populating the interactive namespace from numpy and matplotlib


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rojin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load & Explore the Data

In [2]:
df = pd.read_csv('Data/yelp_academic_dataset_review_10000.csv') # Read data into pandas dataframe

df.head() # Quick check of the data samples

| | business_id | date | review_id | stars | text | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 1/26/2011 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 7/27/2011 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 2 | 6oRAC4uyJCsJl1X0WZpVSA | 6/14/2012 | IESLBzqUCLdSzSqm0eCSxQ | 4 | love the gyro plate. Rice is so good and I als... | 0hT2KtfLiobPvh6cDC8JQg | 0 | 1 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 5/27/2010 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 1/5/2012 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |


In [3]:
df.info() # View data description (Total rows, Column names, type and number of non-null values)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(5)
memory usage: 703.2+ KB


In [4]:
print("Dimension of the data: ", df.shape) # View data dimension

no_of_rows = df.shape[0]
no_of_columns = df.shape[1]

print("No. of Rows: %d" % no_of_rows)
print("No. of Columns: %d" % no_of_columns)

Dimension of the data:  (10000, 9)
No. of Rows: 10000
No. of Columns: 9


## Convert the Text Column into an Array of Documents

We convert the documents from the text column to an array of documents.

It is a 1D array in which each element is the raw text of one document.

In [5]:
docs_array = array(df['text']) # Convert the 'text' column into array

print("Dimension of the documents array: ", docs_array.shape) # View dimensions of new array

#print(docs_array[0]) # Display the first document

Dimension of the documents array:  (10000,)


## Pre-process the Data

Pre-processing of the text data is done using the following steps:

- Convert to lowercase 
- Tokenize (split the documents into tokens or words)
- Remove numbers, but not words that contain numbers
- Remove short words (three characters or fewer)
- Lemmatize the tokens/words


### Tokenization and Lemmatization

We convert all the words to lowercase, then tokenize each document using NLTK's regular-expression tokenizer class "RegexpTokenizer", which splits a given string into substrings using a regular expression. Next we remove numbers and very short words (three characters or fewer), since they usually carry little useful information and are very numerous. Finally, we lemmatize the tokens using WordNetLemmatizer from NLTK, which maps each token to its dictionary base form (lemma) using WordNet.

In [17]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    '''Convert an array of raw documents into a list of tokenized, lemmatized documents'''
    tokenizer = RegexpTokenizer(r'\w+') # Tokenize on runs of word characters

    # Lowercase and tokenize each doc into a new list, leaving the input array unchanged
    # (overwriting the input in place makes a second call fail with
    # "AttributeError: 'list' object has no attribute 'lower'")
    docs = [tokenizer.tokenize(doc.lower()) for doc in docs]

    # Remove numbers, but not words that contain numbers
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove short words (three characters or fewer)
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs
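
The filtering steps used in this section can be sketched on a single sentence. This is a minimal illustration that uses Python's standard `re` module as a stand-in for "RegexpTokenizer"; the helper name `simple_preprocess`, its `min_len` parameter, and the sample sentence are assumptions made for the example, and lemmatization is left out because it needs the WordNet data.

```python
import re

def simple_preprocess(text, min_len=4):
    """Lowercase, tokenize on word characters, drop pure numbers and short tokens."""
    tokens = re.findall(r'\w+', text.lower())         # Same pattern as RegexpTokenizer(r'\w+')
    tokens = [t for t in tokens if not t.isdigit()]   # Drop tokens made only of digits
    return [t for t in tokens if len(t) >= min_len]   # Keep tokens longer than three characters

sample = "My wife took me here on my birthday for breakfast in 2011!"
tokens = simple_preprocess(sample)
print(tokens)
```

Short function words such as "my", "on", and "for" are dropped by the length filter, which acts as a crude stop-word removal.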

Now we convert the array of documents into a list of tokenized documents using the above function.

In [19]:
%time docs = docs_preprocessor(docs_array)
print("Length of the 2D Array of Tokenized Documents: ", len(docs))

Length of the 2D Array of Tokenized Documents:  10000


In [8]:
print(docs[0:2]) #Display the first two documents

[['wife', 'took', 'here', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'ground', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrived', 'quickly', 'semi', 'busy', 'saturday', 'morning', 'looked', 'like', 'place', 'fill', 'pretty', 'quickly', 'earlier', 'here', 'better', 'yourself', 'favor', 'their', 'bloody', 'mary', 'phenomenal', 'simply', 'best', 'ever', 'pretty', 'sure', 'they', 'only', 'ingredient', 'from', 'their', 'garden', 'blend', 'them', 'fresh', 'when', 'order', 'amazing', 'while', 'everything', 'menu', 'look', 'excellent', 'white', 'truffle', 'scrambled', 'egg', 'vegetable', 'skillet', 'tasty', 'delicious', 'came', 'with', 'piece', 'their', 'griddled', 'bread', 'with', 'amazing', 'absolutely', 'made', 'meal', 'complete', 'best', 'toast', 'ever', 'anyway', 'wait', 'back'], ['have', 'idea', 'some', 'people', 'give', 'review', 'about',

## Compute Bigrams/Trigrams

N-grams are sequences of 'n' adjacent words (or letters) found in the source text. Some of these sequences carry a meaning of their own. For example, car-pool is an n-gram formed from the two words car and pool, and its meaning differs from that of the individual words.

If n=2, it is called a Bigram and if n=3, it is called a Trigram.

We find all the combinations of Bigrams and Trigrams. Then, we keep only the frequent phrases. We finally add the frequent phrases to the original data, since we would like to keep the words “car” and “pool” as well as the bigram “car_pool”.

In [9]:
from gensim.models import Phrases

bigram = Phrases(docs, min_count=300) # Add bigrams (if appears 300 times or more)
trigram = Phrases(bigram[docs], min_count=300) # Add trigrams (if appears 300 times or more)

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a bigram, add to document
    for token in trigram[docs[idx]]:
        if '_' in token:
            docs[idx].append(token)  # Token is a trigram, add to document

2020-03-18 10:14:52,010 : INFO : collecting all words and their counts
2020-03-18 10:14:52,011 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-03-18 10:14:54,643 : INFO : collected 410104 word types from a corpus of 733158 words (unigram + bigrams) and 10000 sentences
2020-03-18 10:14:54,645 : INFO : using 410104 counts as vocab in Phrases<0 vocab, min_count=300, threshold=10.0, max_vocab_size=40000000>
2020-03-18 10:14:54,647 : INFO : collecting all words and their counts
2020-03-18 10:14:54,649 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-03-18 10:15:04,255 : INFO : collected 411739 word types from a corpus of 728342 words (unigram + bigrams) and 10000 sentences
2020-03-18 10:15:04,258 : INFO : using 411739 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
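
To give a sense of how frequent phrases are selected, the sketch below scores a candidate bigram with the formula that Gensim documents for its default Phrases scorer: (bigram_count - min_count) / (count_a * count_b) * vocab_size. A pair is kept when its score exceeds the threshold (10.0 in the log above). The toy documents and the helper name `phrase_score` are assumptions for illustration, and `vocab_size` is simplified here to the number of distinct unigrams.

```python
from collections import Counter

def phrase_score(bigram_count, count_a, count_b, vocab_size, min_count):
    """Score a word pair; higher means the pair co-occurs more than chance suggests."""
    return (bigram_count - min_count) / (count_a * count_b) * vocab_size

# Hypothetical toy corpus of tokenized documents
docs_toy = [
    ["bloody", "mary", "was", "great"],
    ["best", "bloody", "mary", "ever"],
    ["great", "food", "great", "service"],
]

unigrams = Counter(tok for doc in docs_toy for tok in doc)
bigrams = Counter(pair for doc in docs_toy for pair in zip(doc, doc[1:]))
vocab_size = len(unigrams)

score_bloody_mary = phrase_score(bigrams[("bloody", "mary")],
                                 unigrams["bloody"], unigrams["mary"],
                                 vocab_size, min_count=1)
score_great_food = phrase_score(bigrams[("great", "food")],
                                unigrams["great"], unigrams["food"],
                                vocab_size, min_count=1)
print(score_bloody_mary, score_great_food)
```

A pair that appears together only about as often as `min_count` scores near zero, which is why a high `min_count` leaves very few phrases.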


In [10]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs) # Create a dictionary representation of the documents
print('Number of unique words in initial documents:', len(dictionary))

2020-03-18 10:15:17,266 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-03-18 10:15:20,279 : INFO : built Dictionary(26643 unique tokens: ['absolute', 'absolutely', 'amazing', 'anyway', 'arrived']...) from 10000 documents (total 799182 corpus positions)


Number of unique words in initial documents: 26643


In [11]:
for i in range(100):
    print(dictionary[i])  # View the first 100 words in the dictionary

absolute
absolutely
amazing
anyway
arrived
back
best
best_ever
better
birthday
blend
bloody
bloody_mary
bread
breakfast
busy
came
complete
delicious
earlier
egg
ever
everything
excellent
favor
fill
food
fresh
from
garden
griddled
ground
here
ingredient
like
look
looked
looked_like
made
mary
meal
menu
morning
only
order
outside
overlooking
perfect
phenomenal
piece
place
pleasure
pretty
pretty_quickly
quickly
saturday
saturday_morning
scrambled
scrambled_egg
semi
simply
sitting
sitting_outside
skillet
sure
tasty
their
them
they
toast
took
truffle
vegetable
wait
waitress
weather
when
which
while
white
wife
with
yourself
yourself_favor
about
awesome
baked
because
beef
both
calzone
case
come
come_back
crowded
decided
doe
door
drink
else
evening
everyone
fault
forever
friend
girl
give
go
good
great
griping
have
home
host
huge
idea
issue
liked
many
many_people
more
more_than
once
part
past
people
personal
pizza
placed
placed_order
pleasant
please
price
probably
review
reviewer
said
sauce
seat

## Remove Rare and Common Tokens/Words

Now we remove infrequent words from our dictionary. We also remove words that appear in a large share of the documents.

In [12]:
# Filter out words that occur in fewer than 300 documents, or in more than 20% of the documents
dictionary.filter_extremes(no_below=300, no_above=0.20) 

print('Number of unique words after removing rare and common words:', len(dictionary))

2020-03-18 10:15:20,442 : INFO : discarding 26290 tokens: [('absolute', 55), ('absolutely', 298), ('anyway', 238), ('arrived', 245), ('back', 2326), ('best_ever', 62), ('birthday', 170), ('blend', 44), ('bloody', 46), ('bloody_mary', 37)]...
2020-03-18 10:15:20,444 : INFO : keeping 353 tokens which were in no less than 300 and no more than 2000 (=20.0%) documents
2020-03-18 10:15:20,463 : DEBUG : rebuilding dictionary, shrinking gaps
2020-03-18 10:15:20,469 : INFO : resulting dictionary: Dictionary(353 unique tokens: ['amazing', 'best', 'better', 'bread', 'breakfast']...)


Number of unique words after removing rare and common words: 353
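
The effect of the two thresholds can be illustrated with a small pure-Python sketch. It is not Gensim's implementation: the helper name `filter_by_doc_freq` and the toy documents are assumptions, and it only mimics the document-frequency rule reported in the log above (keep tokens found in no fewer than `no_below` documents and in no more than a `no_above` fraction of them).

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within the given bounds."""
    n_docs = len(docs)
    # Count each token once per document (document frequency, not term frequency)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok for tok, df in doc_freq.items()
            if no_below <= df <= no_above * n_docs}

# Hypothetical toy corpus
docs_toy = [
    ["food", "great", "pizza"],
    ["food", "great", "service"],
    ["food", "pizza"],
    ["food", "bar"],
    ["food", "great"],
]

kept = filter_by_doc_freq(docs_toy, no_below=2, no_above=0.7)
print(sorted(kept))
```

Here "food" is dropped because it occurs in every document, while "service" and "bar" are dropped because they occur in only one.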


## Bag-of-Words Representation of Data


Finally, we transform the documents to a **vectorized form**. 

We simply compute the frequency of each word, including the bigrams/trigrams.

In [13]:
corpus = [dictionary.doc2bow(doc) for doc in docs] # Bag-of-words representation of the docs

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 353
Number of documents: 10000
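
For intuition, the bag-of-words form of a single document can be reproduced with the standard library. This sketch only imitates the output shape of "doc2bow", a sorted list of (token id, count) pairs with out-of-dictionary tokens ignored; the small `token2id` mapping and the sample document are assumptions for the example.

```python
from collections import Counter

# Hypothetical dictionary mapping tokens to integer ids
token2id = {"amazing": 0, "best": 1, "bread": 2}

doc = ["best", "bread", "toast", "best", "amazing"]

# Count only tokens that are in the dictionary ("toast" was filtered out)
counts = Counter(token2id[tok] for tok in doc if tok in token2id)
bow = sorted(counts.items())
print(bow)
```

Each document therefore becomes a sparse vector: only the ids of words that actually occur are stored, together with their counts.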


## Training the LDA Model

We use the gensim.models.LdaModel class for performing LDA. [https://radimrehurek.com/gensim/models/ldamodel.html]


#### Below we discuss the setting of some of the key parameters.

- num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.

 
LDA is an unsupervised technique, meaning that we don't know before running the model how many topics exist in our corpus. The right number depends on the data and the application. We may use the following two techniques to determine the number of topics.


        Technique 1: Topic Coherence 
The main technique to determine the number of topics is **Topic coherence** [http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf]


        Technique 2: Visualizing Inter-Topic Distance 
Use the LDA visualization tool pyLDAvis to observe Intertopic Distance Map (discussed later). By varying the number of topics we could determine the optimal value from the visualization.

We **use both techniques** to determine the optimal number of topics.

- chunksize (int, optional) – Number of documents to be used in each training chunk.

It controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fits into memory. 

We set chunksize = 10000, which is equal to the number of documents. Thus, it processes all the data in one go. Chunksize can however influence the quality of the model.

- passes (int, optional) – Number of passes through the corpus during training.

It controls how often we train the model on the entire corpus. Another word for passes might be “epochs”. 

- iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.

It is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. 

        It is important to set the number of “passes” and “iterations” high enough.
        

#### How to Set "passes" and "iterations":

First, enable logging and set eval_every = 1 in LdaModel. (Evaluating this often can slow training down, so in this notebook we use eval_every = None.)

When training the model look for a line in the log that looks something like this:

        2020-02-25 19:07:04,716 : DEBUG : 9985/10000 documents converged within 300 iterations

If we set passes = 20, we will see this line 20 times. 

### Important: We need to make sure that by the final passes, most of the documents have converged. Thus, we want to choose both passes and iterations to be high enough for this to happen.

- eval_every (int, optional) – Log perplexity is estimated every that many updates. Setting this to 1 slows down training by ~2x.


- alpha ({numpy.ndarray, str}, optional): Can be set to a 1D array of length equal to the number of expected topics that expresses our a-priori belief for each topic’s probability. 

Alternatively default prior selecting strategies can be employed by supplying a string:

        ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.

        ’auto’: Learns an asymmetric prior from the corpus (not available if distributed==True).
        
        
- eta ({float, np.array, str}, optional) – A-priori belief on word probability.

It can be:

        scalar for a symmetric prior over topic/word probability,

        vector of length num_words to denote an asymmetric user defined probability for each word,

        matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,

        the string ‘auto’ to learn the asymmetric prior from the data.


We set alpha = 'auto' and eta = 'auto'. Again this is somewhat technical, but essentially we are automatically learning two parameters in the model that we usually would have to specify explicitly.

In [None]:
from gensim.models import LdaModel

#------Set training parameters
num_topics = 14 # Number of topics to discover
chunksize = 10000 # Number of documents processed in each training chunk
passes = 20 # Number of passes through the corpus
iterations = 300 # Maximum number of iterations through the corpus when inferring the topic distribution of a corpus
eval_every = None  # Don't evaluate model perplexity, takes too much time.

#-------Make an index to word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary
id2word = dictionary.id2token

%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)
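
When training is run with DEBUG logging, the convergence lines described earlier can also be checked programmatically rather than by eye. The sketch below assumes log lines of exactly the form quoted in this notebook; the helper name `convergence_ratio` is hypothetical.

```python
import re

# Matches e.g. "9985/10000 documents converged within 300 iterations"
CONVERGENCE_RE = re.compile(r'(\d+)/(\d+) documents converged within (\d+) iterations')

def convergence_ratio(log_line):
    """Return the fraction of converged documents, or None if the line does not match."""
    match = CONVERGENCE_RE.search(log_line)
    if match is None:
        return None
    converged, total, _ = map(int, match.groups())
    return converged / total

log_line = "2020-02-25 19:07:04,716 : DEBUG : 9985/10000 documents converged within 300 iterations"
ratio = convergence_ratio(log_line)
print(ratio)
```

If the ratio in the last passes stays well below 1, increasing "passes" or "iterations" is the usual remedy.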

## Technique 1 for Determining Optimal Number of Topics: Topic Coherence

Topic Coherence is a measure used to evaluate topic models. Each topic generated by the model consists of words, and topic coherence is computed over the top N words of the topic. 

Topic Coherence measures score a single topic by **measuring the degree of semantic similarity between high scoring words in the topic**. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. 

A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”

Topic Coherence is defined as the average of the pairwise word-similarity scores of the words in the topic.

A good model will generate coherent topics, i.e., topics with high topic coherence scores. Good topics are topics that can be described by a short label, therefore this is what the topic coherence measure should capture.


Below we
- display the average topic coherence and
- print the topics in order of topic coherence


We use LdaModel's "top_topics" method to get the topics ordered by coherence score, together with the coherence of each topic.

Note that we use the “UMass” topic coherence measure here (see gensim.models.ldamodel.LdaModel.top_topics()).

In [None]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)
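
To make the UMass measure more concrete, here is a small pure-Python sketch. It assumes the commonly cited form of the score, the sum over ordered pairs of top words of log((D(w_i, w_j) + 1) / D(w_j)), where D counts the documents containing the given words. The helper name `umass_coherence` and the toy documents are assumptions, and Gensim's own implementation may differ in details such as normalization.

```python
import math

def umass_coherence(top_words, docs):
    """Sum of log((co-document frequency + 1) / document frequency) over word pairs.

    top_words must be ordered from most to least probable, and every word
    must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score

# Hypothetical toy corpus
docs_toy = [
    ["pizza", "crust", "cheese"],
    ["pizza", "cheese"],
    ["pizza", "crust"],
    ["bar", "beer"],
]

coherent = umass_coherence(["pizza", "cheese", "crust"], docs_toy)
incoherent = umass_coherence(["pizza", "beer", "crust"], docs_toy)
print(coherent, incoherent)
```

Words that tend to appear in the same documents give a score closer to zero, while a topic that mixes unrelated words ("pizza" and "beer" never co-occur here) gets a more negative score.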

## Technique 2 for Determining Optimal Number of Topics: Visualization

We use **pyLDAvis** to interpret the topics in a topic model that has been fit to a corpus of text data. 

It extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [None]:
import pyLDAvis.gensim  # In recent pyLDAvis releases this module is named pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pyLDAvis.gensim.prepare(model, corpus, dictionary)


## Interpretation of the Visualization 



- Left Panel: 
The panel labeled Intertopic Distance Map shows circles that represent the different topics and the distances between them. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus. An individual topic may be selected for closer scrutiny by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left.



- Right Panel:
It shows a bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the 30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics. Selecting a topic on the left changes the bar chart to show the most "relevant" terms for the selected topic. 

Relevance is defined as in footer 2 of the visualization and can be tuned by the parameter $\lambda$.
- Smaller $\lambda$ gives higher weight to the term's distinctiveness.
- Larger $\lambda$ gives higher weight to the probability of the term within the topic.

Therefore, to see the terms that are most distinctive of each topic, we use $\lambda = 0$.
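
For reference, the relevance of a term $w$ to a topic $t$ is, to our knowledge, defined by Sievert and Shirley (2014) as follows, where $p(w \mid t)$ is the probability of the term within the topic and $p(w)$ is its overall probability in the corpus. The symbols are chosen here for illustration, and the footer of the visualization may display the formula in a slightly different form.

$$
r(w, t \mid \lambda) = \lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}
$$

With $\lambda = 1$ the terms are ranked purely by their probability within the topic, and with $\lambda = 0$ they are ranked purely by the ratio $p(w \mid t) / p(w)$, which favors terms that are rare elsewhere in the corpus.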

## Display the Top Words in the Topics

In [None]:
def get_lda_topics(model, num_topics, top_words):
    '''Return a DataFrame with one column per topic, listing its top words'''
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=top_words)  # List of (word, probability) pairs
        word_dict['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words]
    return pd.DataFrame(word_dict)

In [None]:
get_lda_topics(model, num_topics, 20)

## Generate Labels for the Topics

We can manually generate human-interpretable labels for each topic by looking at the terms that are most probable in each topic.


We use LdaModel's "show_topic" method that returns **Word-probability pairs** for the most relevant words generated by the topic.

In [None]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LdaModel, a topic number, and the number of top terms (topn) of interest.
    Print a formatted list of the topn terms with their probabilities and return the terms.
    """
    terms = []
    for term, probability in lda_model.show_topic(topic_number, topn=topn):
        terms.append(term)
        if output:
            print(u'{:30} {:.3f}'.format(term, round(probability, 3)))

    return terms

In [None]:
topic_summaries = []

# Header row; the column width matches the one used in explore_topic
print(u'{:30} {}'.format(u'term', u'probability') + u'\n')

for i in range(num_topics):
    print('\nTopic ' + str(i) + ' |---------------------------\n')
    tmp = explore_topic(model, topic_number=i, topn=10, output=True)
    topic_summaries += [tmp[:5]]  # Keep the top five terms as a short summary of the topic
    print()

## Manually Generate Topic Labels

Based on the most probable words generated by each topic, we assign human-interpretable labels for the topics.

In [None]:
top_labels = {0: 'Asian Cuisine', 1:'Mall', 2:'First Visit', 3:'Customer Service', 4:'Comparison', 5:'Store', 6:'Pizza', 7:'Night', 8:'Happy Hour'}