## Rating Features

### Questions:
1. Based on sentiment score, how are the key restaurant features scored at Mon Ami Gabi?  
    1. Food
    1. Service
    1. Cleanliness
    1. Ambiance
    1. Authenticity
    1. Portion Size

### Import libraries

In [13]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import itertools as it
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from IPython.display import clear_output
from time import time 

# For loading secret environment variables, e.g. postgres username and password
import os
from dotenv import load_dotenv, find_dotenv

# For reading from Postgres
from sqlalchemy import create_engine

# Display full content of column
pd.set_option('display.max_colwidth', 100)

### Set filepaths

In [9]:
# Directories
external_data_directory = os.path.join('..', 'data', 'external')
interim_data_directory = os.path.join('..', 'data', 'interim')
raw_data_directory = os.path.join('..', 'data', 'raw')

# Filepaths
review_csv_filepath = os.path.join(raw_data_directory, 
                                   'yelp_academic_dataset_review.csv')
restaurant_review_csv_filepath = os.path.join(interim_data_directory,
                                              'restaurant_review.csv')

mon_ami_gabi_filepath = os.path.join(interim_data_directory, 'mon_ami_gabi.csv')

### Set environment variables

In [14]:
public_ip = 'localhost'
username = 'postgres'
password = 'password'
port = '5432'
database = 'yelp'

# Construct database URL from the connection settings above
uri = f'postgresql://{username}:{password}@{public_ip}:{port}/{database}'

# Connection to Postgres database
engine = create_engine(uri)
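
The connection settings above are hardcoded. A minimal sketch of reading them from the environment instead, using only the standard library; the variable names (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`) and the helper `build_postgres_uri` are assumptions for illustration, not names used elsewhere in this project:

```python
import os

def build_postgres_uri(env=os.environ):
    """Build a Postgres URI from a mapping of settings (hypothetical variable names)."""
    username = env.get('POSTGRES_USER', 'postgres')
    password = env.get('POSTGRES_PASSWORD', '')
    host = env.get('POSTGRES_HOST', 'localhost')
    port = env.get('POSTGRES_PORT', '5432')
    database = env.get('POSTGRES_DB', 'yelp')
    return f'postgresql://{username}:{password}@{host}:{port}/{database}'

# Passing a plain dict here only to show the result; in practice use os.environ
example_uri = build_postgres_uri({'POSTGRES_USER': 'postgres',
                                  'POSTGRES_PASSWORD': 'password'})
print(example_uri)
```

The resulting string can be passed to `create_engine` in the same way as `uri`.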

### Get reviews for one restaurant by matching `business_id`

In [10]:
mon_ami_gabi_id = '4JNXUYY8wbaaDmk3BPzlWw'

In [15]:
%%time

mon_ami_gabi_id = '4JNXUYY8wbaaDmk3BPzlWw' 

SQL = f'''
SELECT date, stars, text, review_id, business_id, business_name
FROM reviews
WHERE business_id = '{mon_ami_gabi_id}'
'''

mon_ami_gabi = pd.read_sql(SQL, con = engine)

CPU times: user 99.1 ms, sys: 81.5 ms, total: 181 ms
Wall time: 6.59 s


In [16]:
mon_ami_gabi.head()

Unnamed: 0,date,stars,text,review_id,business_id,business_name
0,2017-05-10,5,Loved the fried goat cheese in tomato sauce along with dogfish 60 minutes IPA. Very nice view of...,BvmhSQ6WFm2Jxu01G8OpdQ,4JNXUYY8wbaaDmk3BPzlWw,Mon Ami Gabi
1,2012-06-10,4,"I booked a table here for brunch and it did not disappoint, it was a great experience and more r...",wl8BO_I-is-JaMwMW5c_gQ,4JNXUYY8wbaaDmk3BPzlWw,Mon Ami Gabi
2,2012-01-20,4,"Came here for lunch after a long night of partying. I'm a huge fan of French food, and had defi...",cf9RrqHY9eQ9M53OPyXLtg,4JNXUYY8wbaaDmk3BPzlWw,Mon Ami Gabi
3,2014-06-04,5,Best steak in Vegas. Best mashed potatoes in Vegas. Best French restaurant in Vegas. MAKE MAKE ...,7YNmSq7Lb1zi4SUKXaSjfg,4JNXUYY8wbaaDmk3BPzlWw,Mon Ami Gabi
4,2014-05-03,5,"Love the outdoor atmosphere. Price was right, service exceptional and the food tasted fantastic",IoKp9n1489XohTV_-EJ0IQ,4JNXUYY8wbaaDmk3BPzlWw,Mon Ami Gabi


In [17]:
mon_ami_gabi.shape

(7968, 6)

### Load Key Features

In [78]:
features = pickle.load(open('../data/interim/key_features.pk', 'rb'))
features.shape

(6, 3)

In [79]:
features

Unnamed: 0,id,name,variations
0,food,food,"[food, meal, flavor, tast, delicious, gross, disgusting, seasoned, perfection, amazing, cold]"
1,service,service,"[service, waiter, waitress, server, manager, management, staff, polite, rude, slow, fast]"
2,cleanliness,cleanliness,"[clean, dirty, dusty]"
3,ambiance,ambiance,"[ambiance, view, chic, atmosphere]"
4,authenticity,authenticity,[authentic]
5,portion_size,portion_size,[portion]


### Preprocess restaurant reviews

#### Lowercase, drop punctuation

In [21]:
import re

def multiple_replace(string, replace_dict):
    '''
    Performs multiple string replacements at once
    '''
    pattern = re.compile('|'.join([re.escape(k) for k in sorted(replace_dict, key = len, reverse = True)]), flags = re.DOTALL)
    return pattern.sub(lambda x : replace_dict[x.group(0)], string)                  

stop_chars = []
replace_dict = {','   : '', 
                ';'   : '',
                '('   : '', 
                ')'   : '', 
                '\n' : ' ', 
                '!'   : '.',
                '?'   : '.'}

In [22]:
mon_ami_gabi['text'] = mon_ami_gabi['text'].apply(lambda text: multiple_replace(text.lower(), replace_dict))


In [24]:
mon_ami_gabi[['text']].head()

Unnamed: 0,text
0,loved the fried goat cheese in tomato sauce along with dogfish 60 minutes ipa. very nice view of...
1,i booked a table here for brunch and it did not disappoint it was a great experience and more re...
2,came here for lunch after a long night of partying. i'm a huge fan of french food and had defin...
3,best steak in vegas. best mashed potatoes in vegas. best french restaurant in vegas. make make ...
4,love the outdoor atmosphere. price was right service exceptional and the food tasted fantastic
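
To make the preprocessing step concrete, here is a small self-contained sketch that applies the same replacement rules to a made-up sentence (the sample text is illustrative, not taken from the dataset). Sentence-ending `!` and `?` become `.` so that reviews can later be split into sentences on `.`:

```python
import re

def multiple_replace(string, replace_dict):
    # Try longer keys first so a short key never shadows a longer one
    keys = sorted(replace_dict, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys), flags=re.DOTALL)
    return pattern.sub(lambda m: replace_dict[m.group(0)], string)

replace_dict = {',': '', ';': '', '(': '', ')': '', '\n': ' ', '!': '.', '?': '.'}

raw = 'Great steak, really!\nWould I return? (Yes)'
cleaned = multiple_replace(raw.lower(), replace_dict)
print(cleaned)  # great steak really. would i return. yes
```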


### For each item, extract sentence fragments

In [37]:
def find_term(word_list, term):
    '''
    Arguments:
    word_list : List of words, or a string that is split on whitespace
    term      : List or string of words to search for
    
    Return:
    results : List of `(start, end)` tuples of word indices, where `start` is the
              index in word_list of the first word of `term`,
              and `end` is the index in word_list of the last word of `term`.
    '''    
    # Check if word_list is a string or list
    if type(word_list) is str:
        word_list = word_list.split()
    elif type(word_list) is not list:
        print('Error: word_list must be a list or string.')
        return None

    # Check if term is a string or list    
    if type(term) is str:
        term = term.split()
    elif type(term) is not list:
        print('Error: term must be a list or string.')
        return None

    results = []
    term_length = len(term)

    # Find indices of term[0] in sentence
    for ind in (i for i, word in enumerate(word_list) if word == term[0]):
        # Check if rest of the term matches
        if word_list[ind:ind + term_length] == term:
            results.append((ind, ind+term_length-1))

    return results

In [36]:
def get_chunks(word_list, term, n_before = 5, n_after = 5):
    '''
    Arguments:
    word_list : List or string of words
    term      : List or string of words to search for
    n_before  : Number of words to include before term
    n_after   : Number of words to include after term
    
    Builds a list of chunks containing term
    
    Return:
    chunks : List of chunks
    
    '''
    # Check if word_list is a string or list
    if type(word_list) is str:
        word_list = word_list.split()
    elif type(word_list) is not list:
        print('Error: word_list must be a list or string.')
        return None
    
    # Check if term is a string or list    
    if type(term) is str:
        term = term.split()
    elif type(term) is not list:
        print('Error: term must be a list or string.')
        return None    
    
    indices = find_term(word_list, term)
#     print(indices)
    chunks = []

    for start, end in indices:
        before = n_before
        after = n_after
        
        # Check if start index is near the beginning of the word_list
        if start < n_before:
            before = start
        # Check if end index is near the end of the word_list
        if end > len(word_list) - n_after:
            after = len(word_list) - end
            
        chunks.append(' '.join(word_list[start-before : end+after+1]))
        
    return chunks
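
A worked example may help show what a chunk looks like. The sketch below is a condensed re-implementation of the two helpers (same word-level matching, without the input type checks), applied to one of the preprocessed review sentences shown earlier:

```python
def find_term(words, term):
    """Return (start, end) word indices for every occurrence of term in words."""
    n = len(term)
    return [(i, i + n - 1)
            for i in range(len(words) - n + 1)
            if words[i:i + n] == term]

def get_chunks(sentence, term, n_before=5, n_after=5):
    """Return each occurrence of term with up to n_before/n_after surrounding words."""
    words, term_words = sentence.split(), term.split()
    chunks = []
    for start, end in find_term(words, term_words):
        first = max(start - n_before, 0)  # clamp at the start of the sentence
        chunks.append(' '.join(words[first:end + n_after + 1]))  # slicing clamps the end
    return chunks

sentence = 'price was right service exceptional and the food tasted fantastic'
print(get_chunks(sentence, 'service', 2, 2))   # ['was right service exceptional and']
print(get_chunks(sentence, 'the food', 1, 1))  # ['and the food tasted']
```

Note that matching is on whole words, so a term that only appears as part of a longer word yields no chunk.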



### Extract Sentence Fragments for each key feature

In [26]:
review_text = mon_ami_gabi[mon_ami_gabi['text'].str.contains('service')]

In [32]:
pd.set_option('display.max_colwidth', 200)
review_text[['text']].head()

Unnamed: 0,text
0,loved the fried goat cheese in tomato sauce along with dogfish 60 minutes ipa. very nice view of the vegas strip along with excellent service
4,love the outdoor atmosphere. price was right service exceptional and the food tasted fantastic
7,i would've given this 5 stars if my steak didn't have to be sent back 4 times... i asked for medium to medium well my first steak was pretty much jerky and the next 2 were raw. completely raw. the...
9,coming here for many years and this is my first bad experience. the food was still great but the service and the new policy of not allowing you to have a sample tasting of wine before ordering sp...
11,believe it or not my friends and i got a recommendation for mon ami gabi from our cab driver. on our cab ride from the airport to the hotel the cab driver was giving us a song and dance about bein...


In [82]:
%%time

# Initialize a dict mapping each feature id to its list of chunks
features_chunks = {feature_id : [] for feature_id in features['id'] }

nrows = len(mon_ami_gabi)
# each review
for i, review in mon_ami_gabi.iterrows():
    if (i+1)%100==0:
        clear_output(wait = True)
        print(f'Parsing {i+1}/{nrows} reviews')
    
    # each sentence
    for sentence in review['text'].split('.'):
        # each feature item
        for _, item in features.iterrows():            
            # each feature variation
            for variation in item['variations']:
                # get sentence fragment if sentence contains feature item
                if variation in sentence:
                    chunks = get_chunks(sentence, variation, 7, 7)
                    features_chunks[item['id']].append(chunks)
                    break


Parsing 7900/7968 reviews
CPU times: user 34.5 s, sys: 152 ms, total: 34.6 s
Wall time: 34.7 s


#### Drop any empty lists

In [83]:
for k,v in features_chunks.items():
    features_chunks[k] = [c for c in v if c]
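
Empty lists appear because the loop tests `variation in sentence`, which is a substring check, while `get_chunks` matches whole words. A stem such as `tast` is a substring of `tasted` but never a word on its own, so the check passes and the chunk list comes back empty. A minimal sketch of the difference (the helper names are illustrative only):

```python
def substring_match(sentence, variation):
    # What the parsing loop checks: variation anywhere inside the sentence string
    return variation in sentence

def whole_word_match(sentence, variation):
    # What the chunk extraction effectively requires: variation as its own word
    return variation in sentence.split()

sentence = 'the soup tasted fantastic'
print(substring_match(sentence, 'tast'))   # True
print(whole_word_match(sentence, 'tast'))  # False

# The same filter as above, on a toy list of chunk lists
chunk_lists = [['was right service exceptional and'], [], ['the food was great']]
non_empty = [c for c in chunk_lists if c]
print(len(non_empty))  # 2
```

Because the loop breaks after the first variation that passes the substring check, a sentence matched only through such a stem contributes nothing for that feature.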

In [40]:
def flatten(superlist): 
    '''
    Arguments: 
    superlist : A list of list of strings.

    Requirements: 
    Each element in superlist must be a list.
    
    Return:
    A flattened list of strings.

    ex: 
    flatten([['a'], ['b', 'c'], ['d', 'e', 'f']])
    >> ['a', 'b', 'c', 'd', 'e', 'f']
    '''    
    return [item \
            for sublist in superlist \
            for item in sublist]
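
As an aside, the standard library offers an equivalent one-liner. The sketch below uses `itertools.chain.from_iterable`; the name `flatten_chain` is only for illustration and is not used elsewhere in this notebook:

```python
from itertools import chain

def flatten_chain(superlist):
    """Flatten one level of nesting; each element of superlist must be iterable."""
    return list(chain.from_iterable(superlist))

print(flatten_chain([['a'], ['b', 'c'], [], ['d', 'e', 'f']]))
# ['a', 'b', 'c', 'd', 'e', 'f']
```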

#### Flatten list of lists of feature chunks

In [84]:
# Flatten each feature's list of chunks
features_chunks2 = {feature_id : flatten(chunks) for feature_id, chunks in features_chunks.items()}

In [85]:
# save for later
pickle.dump(features_chunks2, open("../data/interim/features_chunks.pk", "wb"))

### Predict sentiment score using `SentimentIntensityAnalyzer`

`SentimentIntensityAnalyzer` returns four scores per text: positive, negative, and neutral scores, each in `[0, 1]`, and a compound score in `[-1, 1]`.

In [45]:
def get_sentiments(docs):
    '''
    Returns a Dataframe of sentiment scores with columns:
    'compound', 'pos', 'neu', 'neg'
    
    For each doc, 'pos', 'neu', 'neg' scores add to 1.
    'compound' is an overall sentiment score of the doc.
    '''
    # Instantiate SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()

    sentiments = []

    # Generate sentiment score for each review
    for doc in docs:
        sentiment = sia.polarity_scores(doc)
        sentiments.append(sentiment)

    return pd.DataFrame(sentiments)

In [86]:
service_chunks = flatten(features_chunks['service'])

service_sentiments = get_sentiments(service_chunks)

In [90]:
service_sentiments.head()

Unnamed: 0,compound,neg,neu,pos
0,0.5719,0.0,0.654,0.346
1,0.4404,0.0,0.707,0.293
2,0.5574,0.0,0.714,0.286
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0


### Predict subjectivity score

A sentence can be **objective** (expresses no opinion: "I ordered the onion soup") or **subjective** (expresses an opinion: "The onion soup was great"). I want to predict the ratings of key features using only the subjective text.

In [93]:
# TODO

### Predictions without dropping `compound=0`

In [88]:
features_sentiments = {}

for feature_id, chunks in features_chunks2.items():
    # Run sentiment analyzer if item has chunks, otherwise record NaN
    if len(chunks) > 0:
        features_sentiments[feature_id] = get_sentiments(chunks)['compound'].mean()
    else:
        features_sentiments[feature_id] = np.nan

### Convert Sentiment Score to Stars

Transform the continuous `[-1.0, 1.0]` compound score to a discrete star rating in `[1, 5]`, in half-star steps.  
Metric: RMSE of `stars` and `mean_compound`.
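
For reference, a minimal sketch of the RMSE metric named above, assuming the predicted and actual ratings are on the same star scale. The numbers are made-up toy values, not results from this dataset:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual_stars = [5, 4, 4, 5]          # toy values
predicted_stars = [4.5, 4, 3, 5]     # toy values
print(round(rmse(actual_stars, predicted_stars), 3))  # 0.559
```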

In [91]:
def transform_to_stars(n, lower = -1, upper = 1):
    '''
    Transform n to a discrete star rating between 1 and 5, inclusive,
    in half-star steps, using fixed thresholds on the sentiment score.
    lower and upper set the valid bounds of n.
    '''
    # A missing score (feature with no chunks) has no rating
    if pd.isna(n):
        return np.nan

    if n < lower or n > upper:
        print(f'OutOfRangeError: Set n between [{lower},{upper}]')
        return None

    # One star is the lowest rating; there is no 0.5-star rating
    if n >= 0.5:
        return 5.0
    elif n >= 0.46:
        return 4.5
    elif n >= .42: 
        return 4.0
    elif n >= .38:
        return 3.5
    elif n >= .32:
        return 3.0
    elif n >= .28:        
        return 2.5
    elif n >= .24:
        return 2.0
    elif n >= .20:
        return 1.5
    else: 
        return 1.0
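
The threshold ladder in `transform_to_stars` can also be written as a lookup table. The sketch below is one possible table-driven equivalent using `bisect`; the names `THRESHOLDS` and `stars_from_thresholds` are illustrative, and the threshold values are copied from the function above:

```python
from bisect import bisect_right

# Lower bounds for 1.5, 2.0, ..., 5.0 stars, copied from transform_to_stars
THRESHOLDS = [0.20, 0.24, 0.28, 0.32, 0.38, 0.42, 0.46, 0.50]

def stars_from_thresholds(score):
    """Start at 1 star and add half a star for every threshold the score reaches."""
    return 1.0 + 0.5 * bisect_right(THRESHOLDS, score)

for score in (0.10, 0.20, 0.33, 0.45, 0.50, 0.90):
    print(score, stars_from_thresholds(score))
```

Keeping the thresholds in one list makes it easier to see that the steps are not evenly spaced: the 3-star band (`0.32` to `0.38`) is wider than the others.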


In [92]:
def rescale_to_stars(scores, lower = -1, upper = 1):
    '''
    Rescale list of values to discrete values between 1 to 5, inclusive.

    Splits the input range into five equal-width bins and labels them 1 to 5.
    '''    
    interval = np.round((upper - lower) / 5, 2)
    bins = [np.round(lower + i*interval, 2) for i in range(0, 6)]

    return list(pd.cut(scores, bins = bins, labels = np.arange(1, 6, 1), include_lowest = True))
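
To see what the equal-width binning does without running pandas, here is a pure-Python sketch that should give the same labels for in-range scores, assuming `pd.cut`'s default right-closed bins. The function name `rescale_to_stars_py` is hypothetical:

```python
from bisect import bisect_left

def rescale_to_stars_py(scores, lower=-1.0, upper=1.0, n_bins=5):
    """Map each score to 1..n_bins using equal-width, right-closed bins."""
    width = (upper - lower) / n_bins
    # Interior bin edges; for [-1, 1] and 5 bins these are -0.6, -0.2, 0.2, 0.6
    inner_edges = [round(lower + i * width, 2) for i in range(1, n_bins)]
    # bisect_left keeps a score equal to an edge in the lower bin (right-closed)
    return [1 + bisect_left(inner_edges, s) for s in scores]

print(rescale_to_stars_py([-1.0, -0.3, 0.2, 0.42, 1.0]))  # [1, 2, 3, 4, 5]
```

With equal-width bins, any score between `0.2` and `0.6` maps to 4 stars, whereas the hand-set thresholds in `transform_to_stars` spread that same range across several half-star steps.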


### Determine the overall sentiment score of each key feature

In [93]:
sentiment_df = pd.DataFrame(list(zip(features_sentiments.keys(), features_sentiments.values())), columns = ['feature_id', 'sentiment_score'])
sentiment_df

Unnamed: 0,feature_id,sentiment_score
0,food,0.417114
1,service,0.322873
2,cleanliness,0.180028
3,ambiance,0.443734
4,authenticity,0.22632
5,portion_size,0.317824


In [94]:
sentiment_df['stars_pred'] = sentiment_df['sentiment_score'].apply(lambda score : transform_to_stars(score))
sentiment_df['mentions'] = sentiment_df['feature_id'].apply(lambda feature_id : len(features_chunks[feature_id]))

## Overall sentiment of each key feature

In [95]:
sentiment_df.head()

Unnamed: 0,feature_id,sentiment_score,stars_pred,mentions
0,food,0.417114,3.5,10875
1,service,0.322873,3.0,6995
2,cleanliness,0.180028,1.0,100
3,ambiance,0.443734,4.0,3285
4,authenticity,0.22632,1.5,97


## Food

In [147]:
food_df = pd.concat([pd.DataFrame(features_chunks2['food'], columns=['text']), get_sentiments(features_chunks2['food'])['compound']], axis = 1)
food_df.head()

Unnamed: 0,text,compound
0,i'm a huge fan of french food and had definitely been craving it ever,0.743
1,the quiche lorraine was delicious but i was so full by the,0.3291
2,but highly disappointing was the cold butter,-0.6946
3,definitely worth coming here for a meal that is nice but not too nice,0.7845
4,to leave this place as our last meal in vegas and i wouldn't want to,-0.1083


### Most positive reviews on food

In [150]:
food_df.nlargest(5, 'compound')

Unnamed: 0,text,compound
3298,be thicker but it's perfectly cooked with delicious sauce and a huge side of perfectly,0.9705
1244,great atmosphere great location great service great food great prices,0.9702
1730,love about fine dining: beautiful scenery delicious food great music perfect ambiance,0.9698
9776,the name but the sauce was wow amazing perfectly cooked the waiter was really attentive,0.9595
1964,tourist attraction but who cares if the food is delicious and the service is great,0.9565


### Most negative reviews on food 

In [151]:
food_df.nsmallest(5, 'compound')

Unnamed: 0,text,compound
2656,guy but bad service can kill a meal for me,-0.9231
9827,avoid this disgusting and dishonest place,-0.8519
1098,just got an off evening but the food was a severe disappointment,-0.8338
7938,the creme brulee is excruciatingly delicious it hurts so bad,-0.8312
8919,in our poorly made drinks very disappointing food and terrible service we decided to cut,-0.8268


## Service

In [123]:
service_df = pd.concat([pd.DataFrame(features_chunks2['service'], columns=['text']), get_sentiments(features_chunks2['service'])['compound']], axis = 1)
service_df.head()

Unnamed: 0,text,compound
0,of the vegas strip along with excellent service,0.5719
1,our server made our experience 10x times better,0.4404
2,price was right service exceptional and the food tasted fantastic,0.5574
3,once seated we waited forever for a server,0.0
4,the entire experience was just slow,0.0


### Most positive reviews on service 

In [152]:
service_df.nlargest(5, 'compound')

Unnamed: 0,text,compound
865,great atmosphere great location great service great food great prices,0.9702
2939,great view great food great friendly service and the price is great,0.9666
58,excellent service excellent burger excellent steak excellent chicken excellent,0.9612
7016,a great romantic and relaxing environment great service and great food,0.9595
2256,great food great service great price great setting,0.9545


### Most negative reviews on service 

In [146]:
service_df.nsmallest(5, 'compound')

Unnamed: 0,text,compound
1709,off due to that guy but bad service can kill a meal for me,-0.9231
5734,made drinks very disappointing food and terrible service we decided to cut out losses and,-0.8858
436,one i won't forget but the horrible service left a bad taste in my mouth,-0.8798
1858,but when the service is this bad no matter how hard,-0.8732
5463,with my boyfriend i got a horrible waiter who was obnoxiously rude and just didn't,-0.8689


## Cleanliness

In [124]:
cleanliness_df = pd.concat([pd.DataFrame(features_chunks2['cleanliness'], columns=['text']), get_sentiments(features_chunks2['cleanliness'])['compound']], axis = 1)
cleanliness_df.head()

Unnamed: 0,text,compound
0,with dirty forks,-0.4404
1,this place is very clean and all the servers was friendly and,0.7346
2,10 food - 10 wait time- 9 clean - 10,0.4019
3,i saw another person ask for a clean glass also while i was there,0.4019
4,overall very clean well set place,0.6566


### Most positive reviews on cleanliness 

In [145]:
cleanliness_df.nlargest(5, 'compound')

Unnamed: 0,text,compound
60,awesome and i had the shrimp cocktail clean and good while my daughter loved the,0.9274
84,still on the strip but quiet clean friendly and not bad prices,0.9122
88,were perfectly ripe and such a nice clean dessert after something like steak,0.9042
36,inside of the restaurant is very clean seats are very comfortable and very relaxing,0.8773
40,sandwich and crepe were amazing- crepe was clean and the peas went nicely with the,0.8555


### Most negative reviews on cleanliness 

In [144]:
cleanliness_df.nsmallest(5, 'compound')

Unnamed: 0,text,compound
21,napkin because the first was stained and dirty and our server brought us the wrong,-0.7184
44,dirty plates dirty everything,-0.7003
45,dirty plates dirty everything,-0.7003
86,a very unimpressive meal i didn't even clean my plate,-0.6061
26,i was sticky dirty blurry at some parts,-0.5106


## Ambiance

In [153]:
ambiance_df = pd.concat([pd.DataFrame(features_chunks2['ambiance'], columns=['text']), get_sentiments(features_chunks2['ambiance'])['compound']], axis = 1)
ambiance_df.head()

Unnamed: 0,text,compound
0,very nice view of the vegas strip along with excellent,0.7778
1,love the outdoor atmosphere,0.6369
2,amazing service yummy food friendly staff beautiful view and free dessert for the hassle with,0.9559
3,the ambiance was good too,0.4404
4,a perfect view with a perfect breakfast,0.8126


### Most positive reviews on ambiance

In [154]:
ambiance_df.nlargest(5, 'compound')

Unnamed: 0,text,compound
414,great atmosphere great location great service great food great,0.9702
2,amazing service yummy food friendly staff beautiful view and free dessert for the hassle with,0.9559
1503,great good great atmosphere and great value,0.9559
2832,overall great food great service great ambiance and beautiful views,0.9531
1360,great view great food great friendly service and the,0.9477


### Most negative reviews on ambiance

In [155]:
ambiance_df.nsmallest(5, 'compound')

Unnamed: 0,text,compound
164,terrible the food not that great the view was blocked by pedestrians and trees,-0.8173
847,crappy service mediocre food loud obnoxious atmosphere i was thinking this place might have,-0.765
2947,drop dead view of the bellagio fountains if you can,-0.7506
2764,are inside dim strip hotels with no view inconvenient from the street with limited menus,-0.6705
2085,we requested outside seating the view was good but the only negative here,-0.6249


## Authenticity

In [156]:
authenticity_df = pd.concat([pd.DataFrame(features_chunks2['authenticity'], columns=['text']), get_sentiments(features_chunks2['authenticity'])['compound']], axis = 1)
authenticity_df.head()

Unnamed: 0,text,compound
0,it's classy and very authentic french cuisine,0.4404
1,authentic is the word you should use to,0.0
2,that being said it's not the most authentic french restaurant i've ever been to but,0.0
3,very authentic french atmosphere,0.0
4,the food is delicious and authentic,0.5719


### Most positive reviews on authenticity

In [157]:
authenticity_df.nlargest(5, 'compound')

Unnamed: 0,text,compound
90,chris topped off the great food and authentic french ambiance with gracious and extremely competent,0.8832
32,nice relaxing place to enjoy a near authentic parisienne bistro experience,0.8481
23,service is excellent and the decor is authentic and quite lovely,0.8313
80,the food feels authentic the staff is incredibly friendly fun and,0.7947
34,fabulous service authentic french cruise and divine atmosphere,0.7906


### Most negative reviews on authenticity

In [158]:
authenticity_df.nsmallest(5, 'compound')

Unnamed: 0,text,compound
77,it would be a little bit more authentic but was sadly disappointed,-0.8338
41,we got an authentic french waiter but the poor thing was,-0.631
44,give you homemade bread which is very authentic unfortunately i didn't eat much of it,-0.3976
94,hostesses aren't french they certainly have that authentic french snooty attitude,-0.2584
85,it's a great baguette not authentic but pretty close,-0.2244


Certain features are likely subject to bias. Reviewers are more likely to write negatively about service and cleanliness.


In [131]:
sentiment_df.to_csv('../data/predictions/features_sentiment.csv', index=False)

### Predictions after dropping `compound=0`
I am treating chunks with `compound=0` as chunks that express no opinion.

In [134]:
features_sentiments2 = {}

for feature_id, chunks in features_chunks2.items():
    # Run sentiment analyzer if item has chunks, otherwise record NaN
    if len(chunks) > 0:
        sentiment_score = get_sentiments(chunks)['compound']
        features_sentiments2[feature_id] = sentiment_score[sentiment_score != 0].mean()
    else:
        features_sentiments2[feature_id] = np.nan
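
The effect of dropping zero scores on the mean can be checked by hand. The sketch below uses the first five service chunk scores shown earlier in this notebook (`0.5719`, `0.4404`, `0.5574`, `0.0`, `0.0`); the helper `mean_compound` is illustrative and not part of the pipeline:

```python
def mean_compound(scores, drop_neutral=False):
    """Mean of compound scores, optionally ignoring chunks scored exactly 0."""
    if drop_neutral:
        scores = [s for s in scores if s != 0]
    return sum(scores) / len(scores) if scores else float('nan')

scores = [0.5719, 0.4404, 0.5574, 0.0, 0.0]
print(round(mean_compound(scores), 4))                     # 0.3139
print(round(mean_compound(scores, drop_neutral=True), 4))  # 0.5232
```

Zero-scored chunks pull the mean toward 0, so removing them moves each feature's score further from 0 in whichever direction the remaining chunks lean.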

In [135]:
sentiment_df2 = pd.DataFrame(list(zip(features_sentiments2.keys(), features_sentiments2.values())), columns = ['feature_id', 'sentiment_score'])


In [137]:
sentiment_df2['stars_pred'] = sentiment_df2['sentiment_score'].apply(lambda score : transform_to_stars(score))
sentiment_df2['mentions'] = sentiment_df2['feature_id'].apply(lambda feature_id : len(features_chunks[feature_id]))

In [138]:
sentiment_df2.to_csv('../data/predictions/features_sentiment_subjective.csv', index=False)

In [143]:
sentiment_df2

Unnamed: 0,feature_id,sentiment_score,stars_pred,mentions
0,food,0.512972,5.0,10875
1,service,0.433615,4.0,6995
2,cleanliness,0.185539,1.0,100
3,ambiance,0.575814,5.0,3285
4,authenticity,0.396061,3.5,97
5,portion_size,0.488959,4.5,295
