# LDA Model Results
<p>Here we review the results of the various LDA models that were generated. The goal is to identify a model that has enough topics of interest while still providing significant delineation between the topics. Models are numbered Model 1 through Model 5, and descriptions of each are included below.</p>

# Results
<p><b>Model 5</b> provides the best results and includes all reviews, so it is the model used to identify subtopics in the review texts.</p>
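
<p>One rough way to put a number on "delineation between the topics" is the Jaccard overlap of each pair of topics' top terms. The sketch below only illustrates that idea and is not the procedure that was used to pick Model 5. The helper name <code>mean_pairwise_jaccard</code> is hypothetical, and the term lists are copied from the Model 5 output for topics 4, 7 and 9.</p>

```python
from itertools import combinations

def mean_pairwise_jaccard(topic_terms):
    """Average Jaccard overlap between the top-term sets of every topic pair.

    Lower values suggest topics that are more clearly separated.
    """
    sets = [set(terms) for terms in topic_terms.values()]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# top terms for three topics, copied from the Model 5 output
topics = {
    4: ['pizza', 'crust', 'slice', 'topping', 'cheese', 'pie', 'thin', 'good', 'sauce', 'pepperoni'],
    7: ['wait', 'minute', 'time', 'food', 'get', 'order', 'long', 'line', 'hour', 'waiting'],
    9: ['order', 'ordered', 'called', 'delivery', 'extra', 'got', 'time', 'get', 'card', 'call'],
}
overlap = mean_pairwise_jaccard(topics)
print(round(overlap, 3))
```

<p>Here only topics 7 and 9 share terms ('time', 'get', 'order'), so the average overlap is small. Comparing this number across candidate models could complement a manual read of the topics.</p>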

## Step 0: Import packages

In [1]:
# IPython.core.display is deprecated for these imports; use IPython.display
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [2]:
from gensim.models.ldamulticore import LdaMulticore
import itertools
from collections import Counter

import pandas as pd
import numpy as np
import datetime

def time_marker(text=''):
    print('[{}] {}'.format(datetime.datetime.now().time(), text.lower()))

___

### Pretty Printer Function

In [3]:
def print_topic_terms(model, num_topics=-1, num_words=10, unique=False, topics_of_interest={}):
    results = model.print_topics(num_topics=num_topics, num_words=num_words)
    if not unique:
        print('=============================== Terms Per Topic ===============================')
        for r in results:
            topic = r[0]
            term_list = r[1]

            term_list = term_list.split('"')[1::2]
            topic_terms = [term for term in term_list]
            
            if len(topics_of_interest) > 0:
                if topic in list(topics_of_interest.values()):
                    
                    print('{}\t{}'.format(topic, topic_terms))
            else:
                print('{:>2}\t{}'.format(topic, topic_terms))
    else:
        terms = [x[1] for x in results]
        term_lists = [x.split('"')[1::2] for x in terms]

        flatList = itertools.chain.from_iterable(term_lists)
        term_counts = Counter(flatList)

        # extract terms that appear in more than one topic
        non_unique_terms = [term for term, count in term_counts.items() if count > 1]
        
        
        print('============================ Unique Terms Per Topic ===========================')
        for r in results:
            topic = r[0]
            term_list = r[1]

            term_list = term_list.split('"')[1::2]
            topic_terms = [term for term in term_list if term not in non_unique_terms]
            if len(topics_of_interest) > 0:
                if topic in list(topics_of_interest.values()):
                    
                    print('{}\t{}'.format(topic, topic_terms))
            else:
                print('{:>2}\t{}'.format(topic, topic_terms))
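
<p>The function above relies on the string format returned by <code>print_topics</code>, where each term is wrapped in double quotes. The following is a minimal sketch of that parsing step and of the "unique terms" filter, run on hand-written strings. The weights are invented for illustration and no model is loaded.</p>

```python
from collections import Counter
from itertools import chain

# Hand-written strings in the format print_topics returns; the weights are invented.
results = [
    (7, '0.050*"wait" + 0.040*"minute" + 0.030*"time"'),
    (9, '0.060*"order" + 0.030*"time" + 0.020*"delivery"'),
]

# Splitting on the double quote leaves the terms at the odd positions.
term_lists = {topic: text.split('"')[1::2] for topic, text in results}

# Terms that occur in more than one topic are treated as non-unique.
counts = Counter(chain.from_iterable(term_lists.values()))
unique_terms = {topic: [term for term in terms if counts[term] == 1]
                for topic, terms in term_lists.items()}

print(term_lists)
print(unique_terms)
```

<p>In this toy input 'time' appears in both topics, so it is dropped from the unique view of each.</p>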
            

___

## Step 1: Review Model 5 - All Reviews, All Tokens
<p>This model uses all reviews and limits tokens to nouns and verbs that are more common than the 10,000th most common noun or verb token.</p>

<ul>
    <li>Num Topics: 50</li>
    <li>Num Terms: 10</li>
    <li>Num Passes: 50</li>
    <li>Key Topics Identified: 1:Loyalty, 7:Wait Time, 8:Atmosphere, 9:Ordering, 13:Cleanliness, 15:Food Quality, 17:Customer Service, 26:Lunch Parking, 35:Price Value</li>
</ul>
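
<p>The token limit described above can be sketched as a frequency cutoff. This is an illustration under assumptions, not the preprocessing code that produced the model: it assumes the documents have already been reduced to noun and verb tokens, and the function name <code>filter_to_common_tokens</code> and its <code>rank_cutoff</code> parameter are hypothetical.</p>

```python
from collections import Counter

def filter_to_common_tokens(docs, rank_cutoff):
    """Keep tokens that occur more often than the token at position rank_cutoff."""
    counts = Counter(token for doc in docs for token in doc)
    ranked = counts.most_common()
    if len(ranked) < rank_cutoff:
        return docs  # fewer distinct tokens than the cutoff: keep everything
    threshold = ranked[rank_cutoff - 1][1]
    keep = {token for token, count in ranked if count > threshold}
    return [[token for token in doc if token in keep] for doc in docs]

# toy documents, assumed to contain only noun and verb tokens
docs = [['wait', 'table', 'wait'], ['table', 'wait', 'pizza'], ['crust']]
filtered = filter_to_common_tokens(docs, rank_cutoff=3)
print(filtered)
```

<p>With the real data the cutoff would be 10,000 rather than 3. Note that a document can end up empty when all of its tokens are rare.</p>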

In [4]:
model_05 = LdaMulticore.load('../models/ldam_all_restaurants_50_topics_10_terms_50_passes.model')

In [5]:
print_topic_terms(model_05, num_topics=-1, num_words=10, unique=False)

 0	['dessert', 'wine', 'ice', 'cream', 'cake', 'entree', 'meal', 'chocolate', 'course', 'appetizer']
 1	['time', 'back', 'first', 'try', 'place', 'went', 'definitely', 'go', 'next', 'great']
 2	['u', 'table', 'came', 'server', 'food', 'drink', 'waitress', 'asked', 'ordered', 'minute']
 3	['dish', 'flavor', 'sauce', 'like', 'taste', 'menu', 'one', 'would', 'bit', 'meat']
 4	['pizza', 'crust', 'slice', 'topping', 'cheese', 'pie', 'thin', 'good', 'sauce', 'pepperoni']
 5	['crab', 'leg', 'shell', 'pound', 'coworker', 'saving', 'panini', 'hub', 'e', 'angry']
 6	['chicken', 'rice', 'chinese', 'fried', 'food', 'beef', 'soup', 'egg', 'orange', 'sour']
 7	['wait', 'minute', 'time', 'food', 'get', 'order', 'long', 'line', 'hour', 'waiting']
 8	['great', 'nice', 'patio', 'atmosphere', 'outside', 'place', 'cool', 'fun', 'inside', 'food']
 9	['order', 'ordered', 'called', 'delivery', 'extra', 'got', 'time', 'get', 'card', 'call']
10	['tempe', 'school', 'opening', 'mill', 'w', 'college', 'b', 'asu',

## Step 2: Assign labels to interesting topics
<p>The goal here is to inspect qualities and attributes of the restaurant, not what is on the menu. Many of the topics identified contain highly specific menu categories. This information is useful, but it is set aside from the other subtopics.</p>
<p>In another pass, these menu topics could be used to double-check the cuisine categories assigned to each restaurant.</p>

In [6]:
# label -> topic id; the numeric suffix in each key repeats the topic id
topics_of_interest = {'retention_1': 1,
                      'food_quality_3': 3,
                      'wait_time_7': 7,
                      'atmosphere_8': 8,
                      'ordering_9': 9,
                      'cleanliness_13': 13,
                      'menu_options_19': 19,
                      'food_quality_20': 20,
                      'food_quality_21': 21,
                      'customer_service_27': 27,
                      'customer_service_44': 44,
                      'value_35': 35}

In [7]:
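# strip the trailing topic id from each key and de-duplicate,
# e.g. 'food_quality_3' and 'food_quality_20' both become 'food_quality'
subtopic_labels = list({'_'.join(key.split('_')[:-1]).lower() for key in topics_of_interest})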

## Step 3: Inspect Topics of Interest

In [8]:
print_topic_terms(model_05, num_topics=-1, num_words=10, unique=False, topics_of_interest=topics_of_interest)

1	['time', 'back', 'first', 'try', 'place', 'went', 'definitely', 'go', 'next', 'great']
3	['dish', 'flavor', 'sauce', 'like', 'taste', 'menu', 'one', 'would', 'bit', 'meat']
7	['wait', 'minute', 'time', 'food', 'get', 'order', 'long', 'line', 'hour', 'waiting']
8	['great', 'nice', 'patio', 'atmosphere', 'outside', 'place', 'cool', 'fun', 'inside', 'food']
9	['order', 'ordered', 'called', 'delivery', 'extra', 'got', 'time', 'get', 'card', 'call']
13	['table', 'dirty', 'clean', 'bathroom', 'floor', 'plate', 'hand', 'cup', 'paper', 'chair']
19	['option', 'menu', 'free', 'gyro', 'meat', 'vegetarian', 'veggie', 'choose', 'vegan', 'choice']
20	['food', 'like', 'place', 'ordered', 'tasted', 'bad', 'even', 'back', 'cold', 'taste']
21	['good', 'food', 'place', 'price', 'pretty', 'service', 'better', 'like', 'would', 'really']
27	['great', 'food', 'service', 'place', 'friendly', 'good', 'recommend', 'staff', 'delicious', 'price']
35	['good', 'really', 'got', 'ordered', 'little', 'nice', 'pretty

## Step 4: Assigning Topic to Reviews

### Step 4a: Load Review Data and Restaurant Business Data

In [9]:
time_marker('Loading Restaurant Review data...')
reviews = pd.read_csv('../clean_data/az_restaurant_reviews.csv', index_col=0, parse_dates=['date'], low_memory=False)

reviews.dropna(how='any', inplace=True)
reviews.reset_index(inplace=True, drop=True)

time_marker('Loading Restaurant Business data...')
biz = pd.read_csv('../clean_data/az_restaurant_business_clean.csv', index_col=0)
biz = biz.iloc[:,:9].copy()

time_marker('done')

[22:01:21.965839] loading restaurant review data...
[22:02:21.280660] loading restaurant business data...
[22:02:21.331491] done


### Step 4b: Merge Restaurant Name to Reviews

In [10]:
review_df = reviews.merge(biz[['name', 'business_id']], on='business_id', how='left')

In [11]:
review_df['business_id']  = review_df['business_id'].astype('str')
review_df['cool']         = review_df['cool'].astype('int')
review_df['date']         = pd.to_datetime(review_df['date'])
review_df['funny']        = review_df['funny'].astype('int')
review_df['review_id']    = review_df['review_id'].astype('str')
review_df['stars']        = review_df['stars'].astype('int')
review_df['text']         = review_df['text'].astype('str')
review_df['useful']       = review_df['useful'].astype('int')
review_df['user_id']      = review_df['user_id'].astype('str')
review_df['is_fast_food'] = review_df['is_fast_food'].astype('bool')
review_df['review_len']   = review_df['review_len'].astype('int')
review_df['name']         = review_df['name'].astype('str')

In [12]:
review_df.head(3)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,is_fast_food,review_len,name
0,JlNeaOymdVbE6_bubqjohg,0,2014-08-09,0,BF0ANB54sc_f-3_howQBCg,1,we always go to the chevo's in chandler which ...,3,ssuXFjkH4neiBgwv-oN4IA,False,422,Papa Chevo's Taco Shop
1,0Rni7ocMC_Lg2UH0lDeKMQ,0,2014-08-09,0,DbLUpPT61ykLTakknCF9CQ,1,this place is always so dirty and grimy been t...,6,ssuXFjkH4neiBgwv-oN4IA,False,111,Barro's Pizza
2,S-oLPRdhlyL5HAknBKTUcQ,0,2017-11-30,0,z_mVLygzPn8uHp63SSCErw,4,holy portion sizes! you get a lot of bang for ...,0,MzEnYCyZlRYQRISNMXTWIg,False,130,Harumi Sushi


In [13]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 495893 entries, 0 to 495892
Data columns (total 12 columns):
business_id     495893 non-null object
cool            495893 non-null int64
date            495893 non-null datetime64[ns]
funny           495893 non-null int64
review_id       495893 non-null object
stars           495893 non-null int64
text            495893 non-null object
useful          495893 non-null int64
user_id         495893 non-null object
is_fast_food    495893 non-null bool
review_len      495893 non-null int64
name            495893 non-null object
dtypes: bool(1), datetime64[ns](1), int64(5), object(5)
memory usage: 45.9+ MB


## Step 5: Examples - Extract Reviews From Most Reviewed Fast Food and Non Fast Food Restaurants

In [14]:
nff_review_counter = Counter(review_df[review_df.is_fast_food == 0].business_id.values).most_common(5)
nff_most_reviewed_bid = nff_review_counter[0][0]
nff_most_reviewed = review_df[review_df.business_id == nff_most_reviewed_bid].name.unique()[0]
print('Name: {}\tBusiness ID: {}'.format(nff_most_reviewed, nff_most_reviewed_bid))


most_nff_reviews = review_df[review_df.business_id == nff_most_reviewed_bid].copy()
print(most_nff_reviews.shape[0])

Name: Pizzeria Bianco	Business ID: pSQFynH1VxkfSmehRXlZWw
2004


In [15]:
ff_review_counter = Counter(review_df[review_df.is_fast_food == 1].business_id.values).most_common(5)
ff_most_reviewed_bid = ff_review_counter[0][0]
ff_most_reviewed = review_df[review_df.business_id == ff_most_reviewed_bid].name.unique()[0]
print('Name: {}\tBusiness ID: {}'.format(ff_most_reviewed, ff_most_reviewed_bid))


most_ff_reviews = review_df[review_df.business_id == ff_most_reviewed_bid].copy()
print(most_ff_reviews.shape[0])

Name: Portillo's Hot Dogs	Business ID: 0W_pPAiTXgazY2mtX6o0_w
633


## Step 6: Print Most Frequent Subtopics Identified in a Given Review

In [16]:
def print_top_n_review_topics(model, review, n_topics=5, valid_topics=None):
    """Print the review text and the labels of its most frequent topics."""
    valid_topics = valid_topics or {}

    # collect the topic ids associated with each word in the review
    review_topic_categories = []
    for word in review.split(' '):
        try:
            term_topics = model.get_term_topics(word_id=word)
        except Exception:
            # skip words the model cannot look up (e.g. out of vocabulary)
            continue
        review_topic_categories.extend(x[0] for x in term_topics)

    # count occurrences of each identified topic
    topic_counter = Counter(review_topic_categories)
    top_n_topics = [x[0] for x in topic_counter.most_common(n_topics)]

    if valid_topics:
        # prune to only topics we care about (use the argument, not the global dict)
        valid_topic_ids = set(valid_topics.values())
        topics = [topic for topic in top_n_topics if topic in valid_topic_ids]
    else:
        topics = top_n_topics

    # reverse lookup: topic id -> label
    id_to_label = {topic_id: label for label, topic_id in valid_topics.items()}

    print('Review Text:\n\t{}'.format(review.replace('\n', ' ')))
    print('Topics Identified:')
    for n in topics:
        # fall back to the raw topic id when no label is available
        print('\t{}'.format(id_to_label.get(n, n)))
    print(topics)
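
<p>The logic of <code>print_top_n_review_topics</code> is a simple vote: every word in the review votes for the topics it is associated with, the most voted topics are kept, and the result is pruned to the labeled topics. The sketch below reproduces that flow without a trained model. The <code>term_topics</code> lookup is a hypothetical stand-in for <code>model.get_term_topics</code>, and its contents are made up.</p>

```python
from collections import Counter

# Hypothetical word -> topic-id lookup standing in for model.get_term_topics.
term_topics = {
    'wait': [7], 'line': [7], 'minute': [2, 7],
    'back': [1, 20], 'dirty': [13], 'pizza': [4],
}
labels = {'retention_1': 1, 'wait_time_7': 7, 'cleanliness_13': 13}

def top_topics(review, n_topics=5, valid_topics=None):
    votes = Counter()
    for word in review.split(' '):
        votes.update(term_topics.get(word, []))  # unknown words cast no vote
    ranked = [topic for topic, _ in votes.most_common(n_topics)]
    if valid_topics:
        allowed = set(valid_topics.values())
        ranked = [topic for topic in ranked if topic in allowed]
    return ranked

found = top_topics('long wait in line for pizza not going back', valid_topics=labels)
print(found)
```

<p>Because pruning happens after the top <code>n_topics</code> are selected, fewer than <code>n_topics</code> labeled topics may remain, which matches the short lists in the sample output.</p>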

### Step 6a: Non Fast Food Sample Reviews

In [17]:
# select the review text by column name rather than by position (column 6)
nff_bad_reviews = most_nff_reviews[most_nff_reviews.stars < 3]['text'].iloc[1:5]
nff_good_reviews = most_nff_reviews[most_nff_reviews.stars > 3]['text'].iloc[1:5]

In [18]:
for rev in nff_bad_reviews:
    print_top_n_review_topics(model_05, rev, n_topics=5, valid_topics = topics_of_interest)
    print('='*80)

Review Text:
	this place is over-rated and expensive for what you get.  apparently from some glowing reviews this place either gives a wide range of experiences or is living off of some great past reputation.  the food was actually pretty tasty but the cost the wait the aging center it is located in and the small portions definitely do not give a value to customers. i won't be going back
Topics Identified:
	retention_1
	customer_service_27
	value_35
	atmosphere_8
[1, 27, 35, 8]
Review Text:
	ok so i've gone a few times now.  first time great experience but now this is what i think.  1. over priced ..the green salad..greens and 3 green olives...bland dressing..$6..pizzas good but really not worth the wait nor cost.  2. requesting parmesan will cost you $2 and they will not let  you take it unless you request it. 3. customer service i've had really better service other places. this past week i was in there for a lunch meeting and this group of 25 came in complaining they couldn't sit tog

In [19]:
for rev in nff_good_reviews:
    print_top_n_review_topics(model_05, rev, n_topics=5, valid_topics = topics_of_interest)
    print('='*80)

Review Text:
	family and i waited close to an hour for a table. by the time we were seated all i could think of was "this frick'n pizza better be worth the wait!!" well if you have the patience of a saint - it is worth every minute and bite! my family and i are from california and do not believe in lines. we are passing through phoenix. the pizza is so good we are going to stop by phoenix on the way back just for the pizza - again! we ordered the wiseguy marguerita and one with salami (i forget the name) - all delicious! i would recommend all of them! did i mention there is a bar next door? while you wait you can slam down a few drinks...that should take off the edge for the impatient  ones.
Topics Identified:
	wait_time_7
	retention_1
[7, 1]
Review Text:
	you are missing out if you haven't been here yet. the bartender was knowledgeable and obviously good pie!!
Topics Identified:
	retention_1
[1]
Review Text:
	it's all been said before so i'll keep this short.  ----------this is the be

### Step 6b: Fast Food Sample Reviews

In [20]:
# select the review text by column name rather than by position (column 6)
ff_bad_reviews = most_ff_reviews[most_ff_reviews.stars < 3]['text'].iloc[1:5]
ff_good_reviews = most_ff_reviews[most_ff_reviews.stars > 3]['text'].iloc[1:5]

In [21]:
for rev in ff_bad_reviews:
    print_top_n_review_topics(model_05, rev, n_topics=5, valid_topics = topics_of_interest)
    print('='*80)

Review Text:
	i was expecting more out of the food really bland. the service was like a fast food restraint. long lines for such marginal product
Topics Identified:
	food_quality_21
	customer_service_27
	wait_time_7
[21, 27, 7]
Review Text:
	this place is overhyped.  the food is ok the portions a bit small for the price.  it's not the oregano's of hot dogs like most of the reviews would lead you to believe.
Topics Identified:
	food_quality_21
	food_quality_20
	value_35
	customer_service_27
[21, 20, 35, 27]
Review Text:
	i love portillos but every time i come to this location the hotdogs taste rubbery.  i dont know what they're doing wrong  but to me it seems like they're cooking them too long....
Topics Identified:
	food_quality_3
	food_quality_20
[3, 20]
Review Text:
	went here today for the first time. my wife had been here before and did nothing but talk it up. so i gave in and tried it. the food was too greasy and when we ordered the fries i opened the bag and went to grab them out

In [22]:
for rev in ff_good_reviews:
    print_top_n_review_topics(model_05, rev, n_topics=5, valid_topics = topics_of_interest)
    print('='*80)

Review Text:
	this place is really good. we got the cheese fries italian beefand a chocolate cake shake and they were the business. we made the trip from peoria and it was worth it.
Topics Identified:
	value_35
	retention_1
	food_quality_20
	food_quality_21
[35, 1, 20, 21]
Review Text:
	went here and got a hot dog and cheeseburger. also fries and a sprite. bun was very moist for the hotdog so you must like it that way. unlike some reviews on here refills are free it just seems they are not advertised. all you have to do is go up to the counter where you got your food take off your lid and straw and request a refill and its done!  downfall: it always seems to be busy and is hard to find a table. get there when they open to avoid chaos! i'm also a chicago native and can atest to the food being no different compared to the locations back home.  enjoy! i highly recommend their cheeseburgers!
Topics Identified:
	food_quality_20
	value_35
	wait_time_7
	retention_1
[20, 35, 7, 1]
Review Text:

## Step 7: Assign Identified Topics to Review Records

In [27]:
review_df['subtopics'] = np.nan

In [28]:
review_df.head(3)

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id,is_fast_food,review_len,name,subtopics
0,JlNeaOymdVbE6_bubqjohg,0,2014-08-09,0,BF0ANB54sc_f-3_howQBCg,1,we always go to the chevo's in chandler which ...,3,ssuXFjkH4neiBgwv-oN4IA,False,422,Papa Chevo's Taco Shop,
1,0Rni7ocMC_Lg2UH0lDeKMQ,0,2014-08-09,0,DbLUpPT61ykLTakknCF9CQ,1,this place is always so dirty and grimy been t...,6,ssuXFjkH4neiBgwv-oN4IA,False,111,Barro's Pizza,
2,S-oLPRdhlyL5HAknBKTUcQ,0,2017-11-30,0,z_mVLygzPn8uHp63SSCErw,4,holy portion sizes! you get a lot of bang for ...,0,MzEnYCyZlRYQRISNMXTWIg,False,130,Harumi Sushi,


In [29]:
def get_subtopics(review_record, model=model_05, n_topics=5, valid_topics = {}):
    
    # collect the topic ids associated with each word in the review text
    review_topic_categories = []
    for word in review_record.text.split(' '):
        try:
            word_topic = model.get_term_topics(word_id=word)
        except Exception:
            # skip words the model cannot look up (e.g. out of vocabulary)
            continue
        review_topic_categories.extend(x[0] for x in word_topic)
    
    # count occurrences of each identified topic
    topic_counter = Counter(review_topic_categories)
    top_n_topics = [x[0] for x in topic_counter.most_common(n_topics)]
    
    if len(valid_topics) > 0:
        
        valid_topic_ids = list(valid_topics.values())
        
        # prune to only topics we care about
        topics = [topic for topic in top_n_topics if topic in valid_topic_ids]
    else:
        topics = top_n_topics

    # start with every subtopic unrated (NaN), in subtopic_labels order
    subtopic_dict = dict.fromkeys(subtopic_labels, np.nan)

    for t in topics:
        if t in [35]:
            subtopic_dict['value']            = review_record.stars
        if t in [19]:
            subtopic_dict['menu_options']     = review_record.stars
        if t in [8]:
            subtopic_dict['atmosphere']       = review_record.stars
        if t in [1]:
            subtopic_dict['retention']        = review_record.stars
        if t in [13]:
            subtopic_dict['cleanliness']      = review_record.stars
        if t in [7]:
            subtopic_dict['wait_time']        = review_record.stars
        if t in [9]:
            subtopic_dict['ordering']         = review_record.stars
        if t in [44, 27]:
            subtopic_dict['customer_service'] = review_record.stars
        if t in [3, 20, 21]:
            subtopic_dict['food_quality']     = review_record.stars
                  
    return [list(subtopic_dict.values())]
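
<p>The if-chain in <code>get_subtopics</code> amounts to a lookup from topic id to subtopic label, followed by copying the review's star rating into every subtopic that was detected. A compact sketch of that step, using only the standard library, is shown below. The names <code>TOPIC_TO_LABEL</code> and <code>subtopic_ratings</code> are hypothetical, and the labels are sorted here for a stable order, which the notebook does not do.</p>

```python
import math

# Topic id -> subtopic label, mirroring the if-chain in get_subtopics.
TOPIC_TO_LABEL = {
    1: 'retention', 3: 'food_quality', 7: 'wait_time', 8: 'atmosphere',
    9: 'ordering', 13: 'cleanliness', 19: 'menu_options', 20: 'food_quality',
    21: 'food_quality', 27: 'customer_service', 35: 'value', 44: 'customer_service',
}

def subtopic_ratings(topics, stars):
    """Give each detected subtopic the review's star rating; others stay NaN."""
    ratings = dict.fromkeys(sorted(set(TOPIC_TO_LABEL.values())), math.nan)
    for topic in topics:
        label = TOPIC_TO_LABEL.get(topic)
        if label is not None:
            ratings[label] = stars
    return ratings

ratings = subtopic_ratings([21, 27, 7], stars=2)
print(ratings)
```

<p>Every detected subtopic inherits the overall star rating of the review, so a per-subtopic average is an average of overall ratings across the reviews that mention the subtopic, not a rating of the subtopic in isolation.</p>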

In [30]:
time_marker('getting subtopics for each review')
# get list of subtopic star ratings
review_df['subtopics'] = review_df.apply(lambda row: get_subtopics(row, valid_topics=topics_of_interest), axis=1)

[22:41:30.795339] getting subtopics for each review


In [31]:
time_marker('splitting subtopic label lists into columns')
# split list into separate columns
review_df[subtopic_labels] = pd.DataFrame(review_df.subtopics.values.tolist(), index= review_df.index)

[22:58:59.140285] splitting subtopic label lists into columns


In [32]:
time_marker('Cleaning up...')
# drop dummy column
review_df.drop(['subtopics'], inplace=True, axis=1)

[22:58:59.794379] cleaning up...


In [33]:
review_df.head(3).transpose()

Unnamed: 0,0,1,2
business_id,JlNeaOymdVbE6_bubqjohg,0Rni7ocMC_Lg2UH0lDeKMQ,S-oLPRdhlyL5HAknBKTUcQ
cool,0,0,0
date,2014-08-09 00:00:00,2014-08-09 00:00:00,2017-11-30 00:00:00
funny,0,0,0
review_id,BF0ANB54sc_f-3_howQBCg,DbLUpPT61ykLTakknCF9CQ,z_mVLygzPn8uHp63SSCErw
stars,1,1,4
text,we always go to the chevo's in chandler which ...,this place is always so dirty and grimy been t...,holy portion sizes! you get a lot of bang for ...
useful,3,6,0
user_id,ssuXFjkH4neiBgwv-oN4IA,ssuXFjkH4neiBgwv-oN4IA,MzEnYCyZlRYQRISNMXTWIg
is_fast_food,False,False,False


In [34]:
review_df.to_csv('../clean_data/az_restaurant_reviews_with_subtopics.csv')

In [35]:
review_df.describe()

Unnamed: 0,cool,funny,stars,useful,review_len,atmosphere,value,retention,cleanliness,ordering,customer_service,wait_time,menu_options,food_quality
count,495893.0,495893.0,495893.0,495893.0,495893.0,94385.0,159932.0,225063.0,1795.0,36124.0,285977.0,89636.0,12243.0,299715.0
mean,0.550042,0.465808,3.689457,1.127642,546.456032,4.238841,3.620145,3.853637,2.854596,2.762014,3.94346,3.220737,3.879686,3.448269
std,2.035899,1.708676,1.415221,2.561864,511.313496,1.108483,1.293034,1.339301,1.637603,1.616081,1.317797,1.573312,1.350567,1.439882
min,-1.0,0.0,1.0,-1.0,80.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,0.0,0.0,3.0,0.0,220.0,4.0,3.0,3.0,1.0,1.0,3.0,2.0,3.0,2.0
50%,0.0,0.0,4.0,0.0,383.0,5.0,4.0,4.0,3.0,2.0,4.0,4.0,4.0,4.0
75%,1.0,0.0,5.0,1.0,688.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0
max,221.0,161.0,5.0,216.0,4989.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [37]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 495893 entries, 0 to 495892
Data columns (total 21 columns):
business_id         495893 non-null object
cool                495893 non-null int64
date                495893 non-null datetime64[ns]
funny               495893 non-null int64
review_id           495893 non-null object
stars               495893 non-null int64
text                495893 non-null object
useful              495893 non-null int64
user_id             495893 non-null object
is_fast_food        495893 non-null bool
review_len          495893 non-null int64
name                495893 non-null object
atmosphere          94385 non-null float64
value               159932 non-null float64
retention           225063 non-null float64
cleanliness         1795 non-null float64
ordering            36124 non-null float64
customer_service    285977 non-null float64
wait_time           89636 non-null float64
menu_options        12243 non-null float64
food_quality        299715