ABSA (aspect-based sentiment analysis) plan:

- 5 key aspects: location, service, quality, atmosphere, value
- Rate each review on sentiment toward each of these topics
- Rate each review on strength of that sentiment in the review (maybe)
- For a business, summarize overall sentiment toward each of these aspects
    - Pull out positive or negative aspects of reviews
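
The per-business summary in the plan could look roughly like the sketch below. It assumes every review row already has one numeric score column per aspect plus a `business_id` column; the helper name `summarize_business_sentiment` and the toy scores are hypothetical and only illustrate the grouping step.

```python
import pandas as pd

ASPECTS = ['location', 'service', 'quality', 'atmosphere', 'value']

def summarize_business_sentiment(reviews, aspects=ASPECTS):
    """Mean aspect score and review count per business (hypothetical helper)."""
    grouped = reviews.groupby('business_id')
    summary = grouped[aspects].mean()
    summary['n_reviews'] = grouped.size()
    return summary

# toy scores, made up for illustration only
toy = pd.DataFrame({
    'business_id': ['a', 'a', 'b'],
    'location':   [0.5, -0.5, 0.25],
    'service':    [0.25, 0.75, 0.0],
    'quality':    [1.0, 0.0, 0.5],
    'atmosphere': [0.0, 0.0, -0.25],
    'value':      [0.5, 0.5, 0.75],
})
toy_summary = summarize_business_sentiment(toy)
print(toy_summary)
```

Reporting `n_reviews` next to the means makes it easier to spot businesses whose summary rests on only one or two reviews.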

In [4]:
import spacy
import pandas as pd
import numpy as np
import textblob
from textblob import TextBlob
import math 
import matplotlib.pyplot as plt

sp = spacy.load('en_core_web_md')

#### Load datasets: 

In [5]:
# original dataset, sampled
df = pd.read_csv('reviews_with_parents.csv')

In [None]:
# tokenized dataset
df = pd.read_csv('yelp_processed_v3.csv')

# turn tokens back into arrays
import ast
df['text'] = df['text'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# if the rows are still token lists, join them back into plain strings
if not isinstance(df['text'].iloc[0], str):
    df['text'] = df['text'].apply(lambda x: ' '.join(x))

In [4]:
df = pd.read_csv('yelp_processed_v4.csv')
df = df[:500]

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,category
0,0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11,food
1,1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18,entertainment
2,2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30,food
3,3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03,food
4,4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15,food


#### Model Architecture

Input: Reviews

Output: Binary output for each sentiment category (sigmoid)

Loss: BCE 

Optim: Adam 
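
The notes above describe a multi-label setup: one sigmoid output per aspect, trained with binary cross-entropy. The model body is not specified here, so the sketch below only shows the output activation and the loss in NumPy, on made-up logits and labels. The function names are hypothetical, and the Adam update step is left out.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function: maps logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, targets):
    """Mean binary cross-entropy over all reviews and aspects, computed from logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    per_label = (np.maximum(logits, 0) - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))
    return per_label.mean()

# 2 reviews x 5 aspects; values are invented for illustration
logits = np.zeros((2, 5))
targets = np.array([[1, 0, 1, 0, 1],
                    [0, 0, 1, 1, 0]], dtype=float)

probs = sigmoid(logits)          # every probability is 0.5 for zero logits
loss = bce_loss(logits, targets)
print(probs.shape, loss)
```

Each aspect gets its own independent yes/no prediction, which is why sigmoid is used here rather than a softmax across aspects.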

If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. 

We have tried it multiple times, because I want to like it! 

I have been to its other locations in NJ and never had a bad experience. 

The food is good, but it takes a very long time to come out. 

The waitstaff is very young, but usually pleasant. 

We have just had too many experiences where we spent way too long waiting. 

We usually opt for another diner or restaurant on the weekends, in order to be done quicker.

#### Calculate aspect-based sentiments

In [7]:
def get_similar_words(topics, vocab_doc, threshold=0.55):
    words_dict = {}
    for topic in topics: 
        topic_token = sp(topic)
        similar_words = set()
        
        # compare with words in vocabulary
        for word in vocab_doc: 
            if word.has_vector and word.is_alpha: 
                similarity = topic_token.similarity(word)
                if similarity > threshold: 
                    similar_words.add((word.text.lower(), similarity))

        # sort by similarity score
        similar_words = sorted(similar_words, key=lambda x: x[1], reverse=True)# [:n] 
        words_dict[topic] = [word[0] for word in similar_words] 
    
    return words_dict

In [8]:
# set topics
topics = ['location', 'service', 'quality', 'atmosphere', 'value']

# set vocab
vocab_doc = " ".join([w for w in df['text']])
vocab_doc = " ".join(set(vocab_doc.split()))
vocab_doc = sp(vocab_doc)

# get similar words
similar_words = get_similar_words(topics, vocab_doc, threshold=0.55)
print(similar_words)

{'location': ['location', 'accurately', 'exact', 'verify', 'readings', 'precisely', 'gauges', 'correct', 'detailed', 'consistent', 'precision', 'pinpoint', 'accurate', 'timing', 'location', 'precise', 'timely', 'gauge', 'locate', 'dated', 'inaccurate', 'incorrect', 'questionable'], 'service': ['service', 'service', 'ordering', 'options', 'pricing', 'specifications', 'details', 'bundle', 'availability', 'cart', 'bought', 'pack', 'price', 'item', 'value', 'purchased', 'buying', 'order', 'shipping', 'priced', 'shops', 'prices', 'option', 'save', 'supplies', 'upgrades', 'plan', 'delivery', 'available', 'anywhere', 'bungalow', 'purchasing', 'cost', 'plans', 'for', 'cheaper', 'replacement', 'package', 'accommodations', 'travelers', 'reservations', 'vary', 'motels', 'destinations', 'rental', 'occupancy', 'rates', 'lodging', 'rooms', 'transportation', 'domestic', 'savings', 'burden', 'budget', 'costs', 'freight', 'warehouse', 'buyers', 'stores', 'franchises', 'showroom', 'consumer', 'dealer', 
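
The printed lists contain repeats (for example `'location'` and `'service'` each appear twice). The most likely cause is that `get_similar_words` stores `(word, similarity)` tuples in a set, so two vocabulary tokens that lowercase to the same text but have different similarity scores both survive. A minimal sketch of one way to collapse them, keeping the best score per word; the helper name and the scores below are hypothetical.

```python
def dedupe_by_best_score(scored_words):
    """Keep one entry per word (its highest similarity); return words best-first."""
    best = {}
    for word, score in scored_words:
        if word not in best or score > best[word]:
            best[word] = score
    return sorted(best, key=best.get, reverse=True)

# made-up scores, for illustration only
scored = [('location', 1.0), ('location', 0.93), ('exact', 0.6), ('locate', 0.58)]
unique_words = dedupe_by_best_score(scored)
print(unique_words)
```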

In [9]:
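# 0.55: highest threshold without significant loss
# note: `threshold` is not used inside this function; the filtering already happened in get_similar_words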
def extract_topic_sentiment(text, topics, similar_words, threshold=0.55): 
    doc = sp(text)
    # sentiment score for extracted sentence
    sentiment_scores = {}

    for topic in topics:

        # sentences for now, TODO make it phrases later 
        topic_sentences = [sent.text for sent in doc.sents if any(word in sent.text.lower() for word in similar_words[topic])]

        topic_sentiments = []
        for sentence in topic_sentences: 
            blob = TextBlob(sentence)
            topic_sentiments.append(blob.sentiment.polarity)

        # average sentiments for each topic
        if topic_sentiments: 
            sentiment_scores[topic] = (sum(topic_sentiments) / len(topic_sentiments))
        else: 
            sentiment_scores[topic] = None

    # print(sentiment_scores)
        
    # fall back to the polarity of the whole review for any topic with no matching sentences
    blob = TextBlob(text)
    avg_sentiment = blob.sentiment.polarity

    # print([s for s in sentiment_scores])

    for s in sentiment_scores: 
        if sentiment_scores[s] is None: 
            sentiment_scores[s] = avg_sentiment

    sentiment_scores['overall'] = avg_sentiment
    return sentiment_scores
        
sample_text = 'If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to its other locations in NJ and never had a bad experience. The food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.'
        
sentiment_scores = extract_topic_sentiment('Wow! Yummy, different, delicious. Our favo...	', topics, similar_words, threshold=0.55)

In [10]:
print(sentiment_scores)

{'location': 0.375, 'service': 0.375, 'quality': 0.0, 'atmosphere': 0.0, 'value': 0.375, 'overall': 0.375}
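
One thing to watch in `extract_topic_sentiment`: `word in sent.text.lower()` is a substring test, so a short related word such as `'for'` (which is in the `service` list printed earlier) also matches inside "before" or "information". The sketch below shows whole-word matching with a compiled regex instead. The helper name `build_topic_pattern` and the example sentences are hypothetical, and the helper assumes a non-empty word list.

```python
import re

def build_topic_pattern(words):
    """Compile one regex that matches any of the given words as a whole word."""
    # longest first, so multi-character alternatives are tried before their prefixes
    ordered = sorted(set(words), key=len, reverse=True)
    alternatives = '|'.join(re.escape(w) for w in ordered)
    return re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)

service_pattern = build_topic_pattern(['service', 'order', 'for'])

sentences = [
    'The service was slow.',       # contains "service"
    'We left before dessert.',     # "for" only appears inside "before"
    'Great food for the price.',   # contains "for" as its own word
]
hits = [bool(service_pattern.search(s)) for s in sentences]
print(hits)
```

Inside the function, the `any(word in sent.text.lower() ...)` check could then become a single `pattern.search(sent.text)` call per topic, with the patterns built once outside the loop.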


In [128]:
def add_topic_cols(df, sentiment_scores):
    for topic in sentiment_scores.keys(): 
        df[topic] = None
    return df

In [None]:
df = df.drop(['location', 'service', 'quality', 'atmosphere', 'value', 'overall'], axis=1)

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,category
0,0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11,food
1,1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18,entertainment
2,2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30,food
3,3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03,food
4,4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15,food


In [12]:
# get topic sentiments for each review 
sentiments = df['text'].apply(lambda x: extract_topic_sentiment(x, topics, similar_words))

In [13]:
# convert the per-review score dicts to a DataFrame, one column per topic
sentiment_df = pd.json_normalize(sentiments)

df = pd.concat([df, sentiment_df], axis=1)

In [16]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,category,location,service,quality,atmosphere,value,overall
0,0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11,food,-0.4125,-0.25,0.25,0.064762,-0.25,0.085278
1,1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18,entertainment,0.402273,0.264583,0.315,0.225,0.227778,0.402273
2,2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30,food,0.139935,0.1,0.0,0.123214,0.05,0.139935
3,3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03,food,0.302557,0.085227,0.292614,0.302557,0.085227,0.302557
4,4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15,food,0.400969,0.491667,0.716667,0.253267,0.133333,0.400969


In [15]:
df.to_csv('df_withsentiments.csv')
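
The `Unnamed: 0` and `Unnamed: 0.1` columns in the tables above are what pandas produces when a DataFrame's index is written to CSV and then read back as an ordinary column; each save and reload adds another one. A small sketch of the effect and of `index=False`, using an in-memory buffer and made-up rows instead of the real files:

```python
import io
import pandas as pd

# toy frame, values invented for illustration
frame = pd.DataFrame({'review_id': ['r1', 'r2'], 'overall': [0.25, -0.5]})

# default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# index=False: the round trip keeps exactly the original columns
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

An alternative when the index does carry information is to keep the default write and read it back with `pd.read_csv(..., index_col=0)`.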

#### Add names of businesses to dataframe

In [17]:
reviews_df = pd.read_csv('df_withsentiments.csv')
bus_df = pd.read_csv('yelp_businesses.csv')

In [20]:
bus_df.head()

Unnamed: 0.1,Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


In [31]:
business_cols = ['business_id', 'name', 'address', 'city', 'state',
                 'postal_code', 'latitude', 'longitude', 'review_count']
merged_df = reviews_df.merge(bus_df[business_cols], on='business_id', how='left')

In [32]:
merged_df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,...,value,overall,name,address,city,state,postal_code,latitude,longitude,review_count
0,0,0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",...,-0.25,0.085278,Turning Point of North Wales,1460 Bethlehem Pike,North Wales,PA,19454,40.210196,-75.223639,169
1,1,1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,...,0.227778,0.402273,Body Cycle Spinning Studio,"1923 Chestnut St, 2nd Fl",Philadelphia,PA,19119,39.952103,-75.172753,144
2,2,2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,...,0.05,0.139935,Kettle Restaurant,748 W Starr Pass Blvd,Tucson,AZ,85713,32.207233,-110.980864,47
3,3,3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",...,0.085227,0.302557,Zaika,2481 Grant Ave,Philadelphia,PA,19114,40.079848,-75.02508,181
4,4,4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,...,0.133333,0.400969,Melt,2549 Banks St,New Orleans,LA,70119,29.962102,-90.087958,32
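
A left merge like the one above silently duplicates review rows if a `business_id` appears more than once in the business table, and leaves NaN in the business columns when a review has no match. The sketch below, on made-up frames, shows two `DataFrame.merge` options that surface both cases: `validate='many_to_one'` raises a `MergeError` when the right-hand keys are not unique, and `indicator=True` adds a `_merge` column.

```python
import pandas as pd

# toy frames, invented for illustration
toy_reviews = pd.DataFrame({
    'review_id': ['r1', 'r2', 'r3'],
    'business_id': ['a', 'a', 'zzz'],   # 'zzz' has no business record
})
toy_businesses = pd.DataFrame({
    'business_id': ['a', 'b'],
    'name': ['Cafe A', 'Bar B'],
})

checked = toy_reviews.merge(
    toy_businesses,
    on='business_id',
    how='left',
    validate='many_to_one',  # many reviews per business, one row per business
    indicator=True,          # adds '_merge': 'both' or 'left_only'
)

# reviews whose business_id was not found in the business table
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched)
```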


In [33]:
merged_df.to_csv('reviews_wpars_wsents_wbusinfo.csv')