# Yelp Review LDA

Use Latent Dirichlet Allocation (LDA) to extract topics from Yelp reviews.

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [35]:
# Read toronto restaurant review data
# Read data from gcp
# df = pd.read_csv('gs://yelp_review_toronto_restaurant/toronto_restaurant_review.csv', index_col=0)
# Read data from local file
df = pd.read_csv('toronto_restaurant_review.csv')
toronto_restaurant_review = df.text

# Preview review data
print(f'{df.shape[0]} reviews')
df.head()

376702 reviews


Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars_x,text,useful,user_id,...,city,hours,is_open,latitude,longitude,name,postal_code,review_count,stars_y,state
0,0,AakkkTuGZA2KBodKi2_u8A,0,2012-07-16 00:37:14,1,JVcjMhlavKKn3UIt9p9OXA,1,I cannot believe how things have changed in 3 ...,1,TpyOT5E16YASd7EWjLQlrw,...,Toronto,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",1,43.649674,-79.435116,Pho Phuong,M6K 1T9,55,3.5,ON
1,1,AakkkTuGZA2KBodKi2_u8A,0,2014-02-24 01:45:02,0,vKhtzhPUz9RJbllyvHm3qA,3,"Pretty good, food,, about the same as other vi...",0,G-9ujgKmc1J2k7HSqXszsw,...,Toronto,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",1,43.649674,-79.435116,Pho Phuong,M6K 1T9,55,3.5,ON
2,2,AakkkTuGZA2KBodKi2_u8A,0,2016-02-12 00:25:23,0,Je6AF9sTKwXwOVw2YHR1dg,5,I've been going to this place since it opened ...,0,NA4sslQXta6U263fqzwKiw,...,Toronto,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",1,43.649674,-79.435116,Pho Phuong,M6K 1T9,55,3.5,ON
3,3,AakkkTuGZA2KBodKi2_u8A,0,2013-05-07 06:03:17,0,b_xVF8U5Vqljz58OUEjqgA,4,One of the best Vietnamese places I`ve tried i...,1,1fNQRju9gmoCEvbPQBSo7w,...,Toronto,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",1,43.649674,-79.435116,Pho Phuong,M6K 1T9,55,3.5,ON
4,4,AakkkTuGZA2KBodKi2_u8A,0,2011-11-30 16:46:24,0,vFPpG1xDBSWcvy_165fxKg,3,"This place is just ok. Nice atmosphere, big op...",0,fYJGKhZK2FZckYWDMdCooA,...,Toronto,"{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...",1,43.649674,-79.435116,Pho Phuong,M6K 1T9,55,3.5,ON


In [15]:
# Convert a collection of text documents to a matrix of token counts
vectorizer = CountVectorizer()

# LDA model
lda = LatentDirichletAllocation(max_iter=5, 
                                learning_method='online', learning_offset=50., 
                                random_state=0)

In [14]:
def fit_lda(data, vectorizer, lda):
    # Build the document-term matrix from the raw text
    trr = vectorizer.fit_transform(data)
    # Fit the LDA model that was passed in (so settings such as n_components are respected)
    return lda.fit(trr)

In [10]:
# Define the method to print topic words
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [11]:
# Define the method to return topic words as List
def get_top_words(model, feature_names, n_top_words):
    top_words = []
    for topic_idx, topic in enumerate(model.components_):
        top_words.append(' '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    return top_words
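The slice `[:-n_top_words - 1:-1]` used in both helpers is compact but easy to misread. The sketch below walks through it on a small, made-up weight vector (the words and values are illustrative, not taken from the fitted model):

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (illustrative values only)
feature_names = ["curry", "the", "pad", "rice", "thai", "wait"]
topic = np.array([0.9, 0.1, 0.7, 0.3, 1.5, 0.2])

n_top_words = 3
# argsort() returns indices in ascending order of weight; the slice walks
# backwards from the last index and stops after n_top_words entries,
# so the largest weights come first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['thai', 'curry', 'pad']
```

An equivalent, arguably more readable form is `topic.argsort()[::-1][:n_top_words]`.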

In [12]:
model = fit_lda(toronto_restaurant_review, vectorizer, lda)

In [13]:
print_top_words(model, vectorizer.get_feature_names(), 20)

Topic #0: pizza pasta italian shawarma market toppings falafel kensington thin gourmet veal dough bulgogi italy vermicelli delight style prosciutto hoof pepperoni
Topic #1: the was and it we were to of for but had with good very my not on that food ordered
Topic #2: the and is food great for to are place this in good service very of it you here have my
Topic #3: and chicken the rice sushi with of soup thai spicy ramen noodles fried beef in pork is curry rolls sauce
Topic #4: the to it is you and of that in for but they this not have if are on place so
Topic #5: and of the bar with on beer great drinks night patio music drink in wine for to selection an nice
Topic #6: the and with of was cheese on it fries burger sauce chicken in my sandwich had to salad sweet fried
Topic #7: the to and we was our my for us were that had food of she at he in they service
Topic #8: coffee tea cream ice cake chocolate dessert cheesecake dim cafe sum milk with desserts latte matcha crepe cup green their
To

We can see that some words in the topics do not contribute to understanding them (words like the, was, and, it, etc.).

Let's tune the parameters to get better topics.

## Tuning the Parameters

### stop_words
The first thing we'll try is to specify stop words, i.e. words that carry little information about the content of the text. The stop words are removed from the document-term matrix.

In [16]:
vectorizer = CountVectorizer(stop_words='english')
model = fit_lda(toronto_restaurant_review, vectorizer, lda)
print_top_words(model, vectorizer.get_feature_names(), 20)

Topic #0: fish toronto tacos best beer chips jerk mexican taco gem ve city roti local hidden beers indian great authentic love
Topic #1: pizza pasta italian sauce wine cheese crust bread tomato pizzas oil fresh mushrooms sea truffle spaghetti gnocchi slice risotto veal
Topic #2: place like bar just location good area don little street people menu nice open new really want patio right tables
Topic #3: vegetarian vegan overpriced shawarma gluten greek original options comfort fare exceptional falafel salsa danforth hummus uncle le la wing souvlaki
Topic #4: food just time order didn service like table don came asked ordered minutes said did restaurant got server people bad
Topic #5: food good great place service really restaurant time nice definitely menu delicious friendly amazing wait try ve staff dinner come
Topic #6: dish dessert salad steak cake sauce lobster ordered cooked sweet served duck cheese came delicious bread meal perfectly cheesecake like
Topic #7: fries chicken burger sa

With the stop words removed, we have more meaningful words to represent the topics.
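To see what `stop_words='english'` does in isolation, here is a minimal sketch on two invented sentences (not from the Yelp data). It compares the vocabulary learned with and without scikit-learn's built-in English stop-word list:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration
docs = [
    "The pizza was great and the service was good",
    "We had the sushi and it was fresh",
]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

plain_vocab = sorted(plain.vocabulary_)
filtered_vocab = sorted(filtered.vocabulary_)
# Words present without filtering but missing once stop words are removed
removed = sorted(set(plain_vocab) - set(filtered_vocab))

print("kept:   ", filtered_vocab)
print("removed:", removed)
```

The built-in list is generic; restaurant-specific filler such as "place" or "food" is not on it, which is one reason those words still dominate some topics above.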

### max_df and min_df

Document frequency (df) measures how many documents in the corpus contain a given word.

By specifying max_df, the vectorizer ignores all words whose df is higher than the threshold.
By specifying min_df, the vectorizer ignores all words whose df is lower than the threshold.
In both cases, a float is read as a proportion of documents and an integer as an absolute document count.

If a word appears in only a few documents, it probably doesn't help in understanding the topics of the corpus.
Similarly, if a word appears in almost all documents, it doesn't help to tell topics apart and can probably be removed.
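The effect of the two thresholds can be checked on a tiny corpus. This is a sketch with four invented documents and thresholds picked to suit the toy data (`min_df=2` rather than the 5 used in the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "food" is in every document, "pizza" and "sushi" in two each,
# and the remaining words in one document each
docs = [
    "food pizza cheese",
    "food sushi fish",
    "food pizza beer",
    "food sushi rice",
]

unfiltered = CountVectorizer().fit(docs)
# max_df=0.8 drops words found in more than 80% of documents;
# min_df=2 drops words found in fewer than 2 documents
pruned = CountVectorizer(max_df=0.8, min_df=2).fit(docs)

kept = sorted(pruned.vocabulary_)
dropped = sorted(set(unfiltered.vocabulary_) - set(pruned.vocabulary_))
print("kept:   ", kept)     # ['pizza', 'sushi']
print("dropped:", dropped)  # ['beer', 'cheese', 'fish', 'food', 'rice']
```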

In [17]:
# Ignore words that appear in more than 80% of the documents
max_df = 0.8
# Ignore words that appear in fewer than 5 documents
min_df = 5

vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english')
model = fit_lda(toronto_restaurant_review, vectorizer, lda)
print_top_words(model, vectorizer.get_feature_names(), 20)

Topic #0: menu restaurant wine steak dish great pasta dinner meal service salad dessert experience delicious night dishes good nice main lobster
Topic #1: food great place good service friendly staff nice delicious definitely amazing best really love restaurant ve recommend atmosphere toronto menu
Topic #2: sushi fish good fresh roll tacos japanese chips place sashimi rolls music burrito lunch small quality menu mexican price live
Topic #3: chicken rice good soup sauce fried beef ordered pork spicy like ramen dish meat noodles really dishes hot curry taste
Topic #4: coffee brunch cream tea sweet ice breakfast like eggs chocolate cake good dessert try really cheesecake delicious got just milk
Topic #5: pizza fries burger cheese good sandwich beer chicken great sauce meat wings burgers try poutine salad really best sandwiches ordered
Topic #6: like place food just don ve good better really time think know bad want people eat way make say thing
Topic #7: food came time table service order

From the top words above, some of the topics can be labelled (numbering as in the output):

* Topic #1: Good restaurant and service
* Topic #2: Japanese and Mexican food
* Topic #4: Breakfast and dessert
* Topic #5: Bar and comfort food
* Topic #8: Location

### ngram_range and token_pattern

In [20]:
ngram_range = (1, 2)
token_pattern = r'\b\w+\b'

vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english', ngram_range=ngram_range, token_pattern=token_pattern)
model = fit_lda(toronto_restaurant_review, vectorizer, lda)
print_top_words(model, vectorizer.get_feature_names(), 20)

Topic #0: pizza dessert cream pasta ice cake tea chocolate sweet cheese bread ice cream cheesecake dim sum desserts dim sum delicious crust milk
Topic #1: food t order service time ordered came wait minutes asked table took got said didn didn t told server long did
Topic #2: t s like good just really place don don t m ve didn didn t wasn t wasn pretty think try got food
Topic #3: chicken sauce fried rice dish pork beef soup ordered spicy meat dishes noodles hot curry shrimp came flavour crispy cooked
Topic #4: sandwich coffee brunch beer breakfast cheese eggs bacon patio great french fries good sandwiches wings delicious menu toast nice bread
Topic #5: s restaurant night bar experience place food menu people table like drinks just time dinner friends service server drink group
Topic #6: great food place service good friendly delicious amazing best definitely staff love toronto nice recommend restaurant atmosphere ve thai really
Topic #7: burger fries fish tacos chips burgers s burrito 

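Adding bigrams doesn't noticeably improve our understanding of the topics. The token pattern r'\b\w+\b' also admits single-character tokens, which is why fragments of contractions such as t, s and m (from didn't, it's, I'm) appear among the top words. We'll keep using single words.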
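The following sketch shows how the analyzer tokenizes one invented sentence with the default settings and with the `token_pattern` and `ngram_range` used in this section. `build_analyzer()` returns the function the vectorizer applies to each document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentence containing a contraction
text = "I didn't like the ice cream"

# Default pattern keeps only tokens of two or more word characters
default_tokens = CountVectorizer().build_analyzer()(text)

# Pattern from this section: single characters are kept, and bigrams are added
custom = CountVectorizer(token_pattern=r"\b\w+\b", ngram_range=(1, 2))
custom_tokens = custom.build_analyzer()(text)

print(default_tokens)  # ['didn', 'like', 'the', 'ice', 'cream']
print(custom_tokens)
```

With the custom pattern, "didn't" is split into "didn" and "t", and the bigram "didn t" is produced as well, which matches the fragments seen in the topics above.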

### TF-IDF (Term-Frequency Times Inverse Document-Frequency)
Some words in the corpus carry little information ('a', 'the' in English). TF-IDF re-weights the raw counts into floating-point values that down-weight terms occurring in many documents.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
model = fit_lda(toronto_restaurant_review, tfidf, lda)
print_top_words(model, vectorizer.get_feature_names(), 20)

Topic #0: criminal bayildi beer preference burgers excellent arrived outside closing really bums bit gamey better santouka app love bites special cool innovative bun noodles bourgeois cabana la anthony rose choice served add sort classic margherita compared chinese
Topic #1: boston pizza coconut gelato 407 clams way benny really alright pork birthday lobster classic neighbourhood 40c beach food big chewy alright pretty cheaper fish beginning said burnt cheese bland worst box regular authentic okonomiyaki athens restaurant brunch atmosphere
Topic #2: complain try 30 reservations covered huge concept work bits sauce bites sauce cakes bought base want bit air cozy experience crisply afternoon breakfast cream churro bartenders complemented pork complain rave complaints enjoy business thing beer looking cafe huge
Topic #3: bigger 9 range apt description company overall bated breath comfortable benches cooked oven aesthetics service cheap options bland menu available 1 cooked crisp bits plat

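Here, new and confusing words show up in the result (criminal, aesthetics, etc.), and the topics are harder to interpret than before. Two things are worth noting before blaming TF-IDF alone. First, the cell above takes the feature names from `vectorizer` (the bigram CountVectorizer of the previous section) instead of from `tfidf`, so the printed words, including bigrams such as "bated breath", do not line up with the columns the model was fitted on; the names should come from `tfidf.get_feature_names()`. Second, `TfidfVectorizer()` is created with default settings, so the stop_words, max_df and min_df filtering is not applied, and rare words receive higher weights than common ones. LDA is also formulated as a model of word counts, so raw counts are usually the more natural input.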
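When swapping vectorizers, it helps to confirm that the feature names come from the same fitted vectorizer that produced the matrix. The sketch below does that on three invented documents and also shows the re-weighting TF-IDF applies. The helper `feature_names_of` is a hypothetical convenience, not part of the notebook; it covers both `get_feature_names_out()` (newer scikit-learn releases) and the older `get_feature_names()`:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for illustration; "great" occurs in all of them
docs = ["great pad thai", "great pizza", "great sushi and great ramen"]

counts = CountVectorizer()
tfidf = TfidfVectorizer()
X_counts = counts.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)


def feature_names_of(vec):
    """Return the vocabulary of a fitted vectorizer, in column order."""
    if hasattr(vec, "get_feature_names_out"):
        return list(vec.get_feature_names_out())
    return vec.get_feature_names()


names = feature_names_of(tfidf)
# The number of names must equal the number of columns in the matrix
assert len(names) == X_tfidf.shape[1]

# In "great pizza" both words occur once, but TF-IDF weights the rarer one higher
great_w = X_tfidf[1, tfidf.vocabulary_["great"]]
pizza_w = X_tfidf[1, tfidf.vocabulary_["pizza"]]
print(f"great: {great_w:.3f}, pizza: {pizza_w:.3f}")
```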

### Topics for a Single Restaurant's Reviews

In [24]:
# Find the restaurant which has most comments
business_id_with_most_comments = df.business_id.value_counts().index[0]

In [25]:
# Out of my own interest, print the name and score of this restaurant
# Unfortunately, the restaurant name is stored in business.json, so I looked it up and hard-coded it here
restaurant_name = 'Pai Northern Thai Kitchen'
restaurant_score = df[df.business_id == business_id_with_most_comments].stars_y.iloc[0]
print(f'The restaurant with most comments in Toronto is: {restaurant_name}')
print(f'The score is: {restaurant_score}')

The restaurant with most comments in Toronto is: Pai Northern Thai Kitchen
The score is: 4.5


In [26]:
# Get all comments for the restaurant
review_for_single_restaurant = df[df.business_id == business_id_with_most_comments].text

In [33]:
n_topics = 3

vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, stop_words='english')
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, 
                                learning_method='online', learning_offset=50., 
                                random_state=0)
lda_single_restaurant = fit_lda(review_for_single_restaurant, vectorizer, lda)

In [34]:
print_top_words(lda_single_restaurant, vectorizer.get_feature_names(), 20)

Topic #0: thai curry pad food place khao rice dish good ve chicken ordered best really pai sauce beef great green soi
Topic #1: food place good thai great service wait time restaurant table really just like menu came pai definitely make come busy
Topic #2: thai food pad service just like spicy taste know order pai place good bit did thought ice high morning don



Reading the top words, the three topics can be labelled as:

* Topic #0: Type of food
* Topic #1: Busy place
* Topic #2: No clear theme
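One way to sanity-check labels like these is to look at which reviews load most heavily on each topic. The sketch below uses four invented reviews and a two-topic model; `fit_transform` on a LatentDirichletAllocation instance returns one topic mixture per document. Which topic index ends up holding which theme depends on the random initialisation, so no particular assignment is assumed:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, invented for illustration only
reviews = [
    "pad thai curry rice curry",
    "green curry pad thai rice",
    "long wait busy table wait",
    "busy table long wait service",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one review's topic mixture; the entries of a row sum to 1
doc_topic = lda.fit_transform(counts)

# Index of the highest-weighted topic for every review
dominant_topic = doc_topic.argmax(axis=1)
for review, topic_idx in zip(reviews, dominant_topic):
    print(f"Topic #{topic_idx}: {review}")
```

Applied to the real data, the same idea would use `lda_single_restaurant.transform(...)` on the matrix built by the fitted vectorizer.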

## Summary

In this notebook, we explored the topics extracted from Yelp's Toronto restaurant reviews using LDA, and how the vectorizer parameters affect the resulting topics.

Due to the limited resources of the project, we'll stop here. However, the parameters can be tuned further to get more interpretable topics. Unsupervised learning could also be applied to cluster the topics and categorise them.

As we have seen, human judgement is still required to interpret the feature words of each topic. However, if the reviews come with pre-defined topic labels, the feature words can be used to train a classifier that predicts the topic of a review.
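As a rough illustration of that last idea, the sketch below trains a bag-of-words classifier on a handful of labelled reviews. Both the texts and the labels ("food", "service") are hypothetical; the Yelp data used in this notebook carries no such labels, and logistic regression is only one possible choice of model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled reviews, invented for illustration
texts = [
    "the pad thai and green curry were delicious",
    "great pizza with a thin crispy crust",
    "we waited forty minutes for a table",
    "the server was rude and slow",
    "tasty ramen with rich pork broth",
    "friendly staff and quick service",
]
labels = ["food", "food", "service", "service", "food", "service"]

# Word counts as features, followed by a linear classifier
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

predicted = clf.predict(["the curry and noodles were tasty"])
print(predicted[0])
```

With a realistically sized labelled set, the pipeline would be evaluated on held-out reviews rather than inspected by hand.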