# Objective: Get the top 5 bigrams for low-rated businesses from negative reviews, based on polarity score

The objective of this document is to get the top 5 most negative bigrams for each individual business.
We go through the following steps:

1. Read the data from the file.
2. Filter the data to keep only the low-rated reviews.
3. Identify the non-English reviews and remove them, as we can work only with English reviews.
4. Create a business dataframe by taking only the business information from the dataframe above.
5. Add new columns to the business dataframe to keep 5 sample reviews for UI display.
6. Join all the reviews of each individual business.
7. Remove stop words from the review text using the English stop words from the nltk.corpus library.
8. Remove special characters and numbers from the review text.
9. Convert the text to lower case so that all tokens share the same format.
10. Create the bigrams from the cleaned text.

To create the bigrams and take the top 5, we perform the following on the joined text of each business:

1. Take the cleaned text for each business and create a list of bigrams from each pair of consecutive words.
2. Calculate the polarity of each bigram using the TextBlob library.
3. Sort the bigrams in ascending order of polarity, so that the most negative come first.
4. Take the top 5 bigrams based on polarity score.
5. Save the data in a CSV file for presentation on the UI.

***Note that we are not lemmatizing the tokens here, as the bigrams make more sense in their original form.***


### Importing libraries

In [1]:
import pandas as pd
import numpy as np

#To detect the languages
from langdetect import detect

#for distributed task, to make the processing faster
import dask.dataframe as dd
from dask.multiprocessing import get

#importing regular expressions to handle the text data
import re
from nltk.corpus import stopwords

from collections import Counter
from textblob import TextBlob

### Reading data from csv files

In [2]:
reviews_data = pd.read_csv("data/AZ_restaurant_low_high_review_final.csv", sep="\t")

In [3]:
#Checking the data distribution for different price ranges

reviews_data.RestaurantsPriceRange2.value_counts()

2.0    171194
1.0     90216
3.0     13236
4.0      1675
Name: RestaurantsPriceRange2, dtype: int64

In [4]:
reviews_data.head(5)

| | User_id | Business_id | Name | Address | City | State | Postal_code | Review_count | Restaurant_ratings | RestaurantsPriceRange2 | Review_ratings | Review_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ck73f1qtZbu68F_vjzsBrQ | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | 1.0 | The speed of delivery of my food order was ter... |
| 1 | F95NFEFwuwA__SIRt9IJNA | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | 1.0 | I stopped by for a double quarter pounder with... |
| 2 | 2gWCW1oEuyhaxrlTTghvtQ | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | 1.0 | I was told tonight at 8:30 pm that they were n... |
| 3 | yKyfDC9EPHvSuBXPCP-EmQ | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | 1.0 | Cashier was disgusting and unsanitary. Picking... |
| 4 | wxu6RAQqre73_id5lALttA | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | 1.0 | Don't waste your money! Terrible service. The... |


In [5]:
reviews_data.shape

(276327, 12)

In [6]:
#Checking the distribution of review ratings

reviews_data.Review_ratings.value_counts()

5.0    181549
4.0     78005
1.0     12601
2.0      4166
Name: Review_ratings, dtype: int64

In [7]:
reviews_data.columns

Index(['User_id', 'Business_id', 'Name', 'Address', 'City', 'State',
       'Postal_code', 'Review_count', 'Restaurant_ratings',
       'RestaurantsPriceRange2', 'Review_ratings', 'Review_text'],
      dtype='object')

### Taking only low rated reviews data

In [8]:
low_rated_reviews =  reviews_data[reviews_data['Review_ratings'] < 3]

In [9]:
low_rated_reviews.shape

(16767, 12)

    There are 16,767 low-rated reviews.

In [10]:
columnnames = ['Business_id', 'Name', 'Address', 'City', 'State','Postal_code', 'Review_count', 'Restaurant_ratings','RestaurantsPriceRange2']


### Taking out the business information in a new dataframe

In [11]:
low_rated_business_data = low_rated_reviews[columnnames]

In [12]:
low_rated_business_data.shape

(16767, 9)

In [13]:
low_rated_business_data = low_rated_business_data.drop_duplicates()

In [14]:
low_rated_business_data.shape

(1001, 9)

    There are 1001 unique businesses for which we are going to show improvements, as they are low-rated businesses.

In [15]:
low_rated_business_data.RestaurantsPriceRange2.value_counts()

1.0    771
2.0    219
3.0      8
4.0      3
Name: RestaurantsPriceRange2, dtype: int64

    This checks the distribution of businesses across price ranges. The price range is not used in further processing; it is only presented on the UI.

### Filtering Out non English Data

In [16]:
#This method returns the language of a review text.
#If the language cannot be identified by the langdetect library, it returns 'notknown'
def removeSpecialCharsandGetLanguage(text):
    if(pd.isnull(text)):
        text=""    
    lettersonly =  re.sub("[^a-zA-Z]", " ", text)
    # 豬肋排好吃不油膩, 加點醋醬風味更佳。\r\n炭烤半雞也不賴。小菜不錯, 服務也很貼心, 吃...
    #A review like the one above gives an error in language detection.
    #Hence the try/except block ensures that unidentifiable text does not raise an exception
    try:
        lang=detect(lettersonly)
    except Exception:
        lang='notknown'
    
    return lang

In [17]:
#Language detection is very slow, so it is run in parallel with the Dask library.
def filterNonEnglishReviewsDask(df):
    ddata = dd.from_pandas(df, npartitions=30)
    #Recent Dask versions select the multiprocessing scheduler with scheduler='processes' instead of the older get=get argument
    lang = ddata.map_partitions(lambda df: df.apply((lambda row: removeSpecialCharsandGetLanguage(row["Review_text"])), axis=1)).compute(scheduler='processes')
    return lang

In [18]:
#Apply language detection to the data

lang = filterNonEnglishReviewsDask(reviews_data)
reviews_data["lang"] = lang



In [19]:
totalreviewscount = reviews_data.shape[0]
print("# of records before filtering for non english: %d"%(totalreviewscount))
    
reviews_data= reviews_data[reviews_data["lang"] == "en"] 
cntAfterFilter = reviews_data.shape[0]
cntFiltered = totalreviewscount - cntAfterFilter
print("# of non english records filtered out: %d"%(cntFiltered))
    

# of records before filtering for non english: 276327
# of non english records filtered out: 246
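The guard inside `removeSpecialCharsandGetLanguage` can be illustrated without the detection library. The following is a minimal sketch, not the notebook's implementation: `letters_only` mirrors the regular expression used above, while `safe_language`, its `detector` parameter, and `always_english` are hypothetical names introduced here. Any callable that maps a string to a language code could be passed as `detector`. The sketch shows why a review with no Latin letters ends up labelled `notknown`.

```python
import re


def letters_only(text):
    # Same substitution as in the notebook: every non-letter becomes a space
    return re.sub("[^a-zA-Z]", " ", text or "")


def safe_language(text, detector):
    # `detector` is a hypothetical stand-in for a language detection function
    stripped = letters_only(text)
    if not stripped.strip():
        # Nothing left to analyse, e.g. a review written entirely in Chinese
        return "notknown"
    try:
        return detector(stripped)
    except Exception:
        # Any detector failure is mapped to the same fallback label
        return "notknown"


def always_english(_text):
    # Toy detector used only for this illustration
    return "en"


print(repr(letters_only("豬肋排好吃")))
print(safe_language("豬肋排好吃", always_english))
print(safe_language("Terrible service!", always_english))
```

Checking for an empty result before calling the detector makes the fallback explicit, so the `try`/`except` only has to cover unexpected failures.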


In [20]:
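#This method returns the joined text of all reviews for a business.
#The joined text is needed for calculating the bigrams.
#Note: it reads from reviews_data (all English reviews), not only from low_rated_reviews.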

def JoinReviewsData(business_id):
    data = reviews_data[reviews_data['Business_id'] == business_id]
    text_rows = data['Review_text']
    text = " ".join(text_rows)
    return text

In [21]:
#Join the reviews for each business
textAll = low_rated_business_data.Business_id.apply(JoinReviewsData)
#Assign the joined reviews to the business data
low_rated_business_data["JoinedText"] = textAll

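`JoinReviewsData` scans the full review table once for every business. As a hedged alternative, the same joined strings can be built in a single pass by grouping on the business id. The sketch below uses plain Python to show the idea; `join_reviews_by_business` and the sample rows are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def join_reviews_by_business(rows):
    # rows: iterable of (business_id, review_text) pairs
    grouped = defaultdict(list)
    for business_id, review_text in rows:
        grouped[business_id].append(review_text)
    # One joined string per business, with reviews kept in input order
    return {business_id: " ".join(texts) for business_id, texts in grouped.items()}


rows = [
    ("b1", "Slow service."),
    ("b2", "Cold fries."),
    ("b1", "Rude cashier."),
]
joined = join_reviews_by_business(rows)
print(joined["b1"])
```

In pandas, a grouping such as `reviews_data.groupby("Business_id")["Review_text"].agg(" ".join)` should give a comparable result, assuming the review column contains only strings.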
### Get 5 sample reviews per business

In [22]:
#The following code adds 5 new columns to the dataframe low_rated_business_data.
#Each column holds one sample review for the specific business id.
#We first collect the data in a dataframe named df,
#which is merged with low_rated_business_data later.
#If a business has fewer than 5 sample reviews, the remaining columns stay empty and no exception is raised.

cols = ['Business_id',"Review_1","Review_2","Review_3","Review_4","Review_5"]  
df = pd.DataFrame( columns=cols)

for business_id in low_rated_business_data["Business_id"]:
    df2 = pd.DataFrame([[ business_id,"","","","",""]], columns=cols)
    df = pd.concat([df, df2])  #DataFrame.append was removed in pandas 2.0
    data = reviews_data[reviews_data['Business_id'] == business_id].head(5)["Review_text"]
    length = data.count()
    i=0
    for review in data:
        if i<length:
            colname = "Review_"+str(i+1)
            df.loc[df['Business_id'] == business_id, colname] = review
        i=i+1
        
    

In [23]:
df.head(2)

| | Business_id | Review_1 | Review_2 | Review_3 | Review_4 | Review_5 |
|---|---|---|---|---|---|---|
| 0 | rDMptJYWtnMhpQu_rRXHng | The speed of delivery of my food order was ter... | I stopped by for a double quarter pounder with... | I was told tonight at 8:30 pm that they were n... | Cashier was disgusting and unsanitary. Picking... | Don't waste your money! Terrible service. The... |
| 0 | qB15WElGAlI_eGWjn0kT2w | Very polite, decent customer service. The rice... | Wow. You guys should really work on ur service... | Slowest drive thru I have ever gone through. T... | Terrible service. Always a long wait. If you g... | Literally the slowest drive thru I've ever exp... |

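The padding logic for the sample-review columns can also be expressed as a small helper that always returns five slots. This is a sketch under stated assumptions: `first_n_padded` and `sample_review_row` are hypothetical names, and the empty string is used as the filler to match the loop above.

```python
def first_n_padded(reviews, n=5, fill=""):
    # Keep at most n reviews and pad with `fill` so every business has n slots
    head = list(reviews)[:n]
    return head + [fill] * (n - len(head))


def sample_review_row(business_id, reviews, n=5):
    # Build {"Business_id": ..., "Review_1": ..., ..., "Review_n": ...}
    row = {"Business_id": business_id}
    for i, review in enumerate(first_n_padded(reviews, n), start=1):
        row["Review_" + str(i)] = review
    return row


print(sample_review_row("b1", ["Slow service.", "Cold fries."]))
```

A list of such dictionaries can be passed to `pd.DataFrame` in one call, which avoids growing the dataframe row by row.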

In [24]:
low_rated_business_data_w_reviews = pd.merge(low_rated_business_data, df, on='Business_id')

In [25]:
low_rated_business_data_w_reviews.head(2)

| | Business_id | Name | Address | City | State | Postal_code | Review_count | Restaurant_ratings | RestaurantsPriceRange2 | JoinedText | Review_1 | Review_2 | Review_3 | Review_4 | Review_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | The speed of delivery of my food order was ter... | The speed of delivery of my food order was ter... | I stopped by for a double quarter pounder with... | I was told tonight at 8:30 pm that they were n... | Cashier was disgusting and unsanitary. Picking... | Don't waste your money! Terrible service. The... |
| 1 | qB15WElGAlI_eGWjn0kT2w | Taco Bell | 15240 N. 32nd Street | Phoenix | AZ | 85032.0 | 18.0 | 2.0 | 1.0 | Very polite, decent customer service. The rice... | Very polite, decent customer service. The rice... | Wow. You guys should really work on ur service... | Slowest drive thru I have ever gone through. T... | Terrible service. Always a long wait. If you g... | Literally the slowest drive thru I've ever exp... |


In [26]:
#Saving the intermediate results
low_rated_business_data_w_reviews.to_csv("low_rated_business_data_w_reviews.csv", index=False)

In [27]:
#Loading from intermediate csvs

lr_business_reviews = pd.read_csv("low_rated_business_data_w_reviews.csv")
                               

In [28]:
#verify a sample of the data
lr_business_reviews.head(2)

| | Business_id | Name | Address | City | State | Postal_code | Review_count | Restaurant_ratings | RestaurantsPriceRange2 | JoinedText | Review_1 | Review_2 | Review_3 | Review_4 | Review_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | The speed of delivery of my food order was ter... | The speed of delivery of my food order was ter... | I stopped by for a double quarter pounder with... | I was told tonight at 8:30 pm that they were n... | Cashier was disgusting and unsanitary. Picking... | Don't waste your money! Terrible service. The... |
| 1 | qB15WElGAlI_eGWjn0kT2w | Taco Bell | 15240 N. 32nd Street | Phoenix | AZ | 85032.0 | 18.0 | 2.0 | 1.0 | Very polite, decent customer service. The rice... | Very polite, decent customer service. The rice... | Wow. You guys should really work on ur service... | Slowest drive thru I have ever gone through. T... | Terrible service. Always a long wait. If you g... | Literally the slowest drive thru I've ever exp... |


### Preprocessing and cleaning data for getting top 5 bigrams

In [29]:
#Define the stop words list
stop =  set(stopwords.words("english"))
    
#This method performs the following steps, in this order:
'''
1. Removing any HTML tags
2. Filtering out the stop words
3. Replacing special characters and numeric characters with spaces
4. Collapsing repeated spaces
5. Converting all the words to lower case
'''
def cleanText(reviewText):
    htmlremoved = re.sub(re.compile('<.*?>'), '', str(reviewText)) 
    filtered_review_token =  [word for word in htmlremoved.split() if word.lower() not in stop]
    joined = " ".join(filtered_review_token)
    lettersonly =  re.sub("[^a-zA-Z]", " ", joined)
    spaceremoved = re.sub(' +',' ',str(lettersonly)) #re.sub('  ', ' ', str(lettersonly))
    lower = spaceremoved.lower()
    return lower

In [30]:
#This step is fast and does not require the Dask implementation
lr_business_reviews["cleantext"] = lr_business_reviews.JoinedText.apply(cleanText)

In [31]:
lr_business_reviews.head(2)

| | Business_id | Name | Address | City | State | Postal_code | Review_count | Restaurant_ratings | RestaurantsPriceRange2 | JoinedText | Review_1 | Review_2 | Review_3 | Review_4 | Review_5 | cleantext |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | The speed of delivery of my food order was ter... | The speed of delivery of my food order was ter... | I stopped by for a double quarter pounder with... | I was told tonight at 8:30 pm that they were n... | Cashier was disgusting and unsanitary. Picking... | Don't waste your money! Terrible service. The... | speed delivery food order terrible took minute... |
| 1 | qB15WElGAlI_eGWjn0kT2w | Taco Bell | 15240 N. 32nd Street | Phoenix | AZ | 85032.0 | 18.0 | 2.0 | 1.0 | Very polite, decent customer service. The rice... | Very polite, decent customer service. The rice... | Wow. You guys should really work on ur service... | Slowest drive thru I have ever gone through. T... | Terrible service. Always a long wait. If you g... | Literally the slowest drive thru I've ever exp... | polite decent customer service rice burrito ha... |

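The order of the cleaning steps matters. `cleanText` filters stop words before it strips punctuation, so a token such as `me,` is not recognised as the stop word `me` and survives into the cleaned text. This may explain bigrams such as `me terrible` in the results. The sketch below compares the two orders on one sentence; it is an illustration only, with a tiny hypothetical stop-word set `STOP` in place of the NLTK list, and both function names are introduced here.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead
STOP = {"the", "was", "to", "me", "it"}


def clean_stop_first(text):
    # Order used in cleanText: stop-word filter, then letters only, then lower case
    kept = [w for w in text.split() if w.lower() not in STOP]
    letters = re.sub("[^a-zA-Z]", " ", " ".join(kept))
    # strip() is an addition in this sketch to drop the trailing space
    return re.sub(" +", " ", letters).strip().lower()


def clean_letters_first(text):
    # Alternative order: letters only and lower case first, then the stop-word filter
    letters = re.sub("[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in letters.split() if w not in STOP)


sample = "The service was rude to me, terrible!"
print(clean_stop_first(sample))     # keeps "me" because "me," was not a stop word
print(clean_letters_first(sample))  # drops "me" after punctuation is removed
```

Switching the order would change the bigrams produced downstream, so the results shown in this document correspond to the stop-words-first order.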

## Creating top 5 Bigrams


In [32]:
#This function creates the bigrams.
#It pairs each word with the word that follows it, in the order they appear in the document

def calculateBigrams(tokens):
    bigrams = [(tokens[i],tokens[i+1]) for i in range(0,len(tokens)-1)]
    return bigrams
    

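As a worked example of what the bigram step produces, the sketch below builds the pairs with `zip`, which is equivalent to the index-based comprehension in `calculateBigrams`. The function name `bigrams` and the sample tokens are introduced here for illustration.

```python
def bigrams(tokens):
    # Pair each token with its successor; zip stops at the shorter sequence
    return list(zip(tokens, tokens[1:]))


tokens = "speed delivery food order terrible".split()
print(bigrams(tokens))
# [('speed', 'delivery'), ('delivery', 'food'), ('food', 'order'), ('order', 'terrible')]
```

Note that `split()` without an argument drops empty tokens, whereas `split(" ")` keeps an empty string for a leading, trailing, or doubled space.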

In [33]:
bigramCols = ['Business_id',"Top5Bigrams_1","Top5Bigrams_2","Top5Bigrams_3","Top5Bigrams_4","Top5Bigrams_5"]
bigramDf = pd.DataFrame( columns= bigramCols)

In [34]:
#This function performs the following steps:

#1. Take the clean text and split it into tokens
#2. Get the bigrams using calculateBigrams
#3. Get the polarity of each bigram using the TextBlob library
#4. If the bigram has negative polarity, save it in a dataframe along with its polarity score
#5. Remove duplicate rows from the bigram dataframe
#6. Sort the dataframe by polarity score in ascending order (most negative first)
#7. Take the 5 most negative bigrams
#8. Join those bigrams into a "|"-separated string for presentation on the UI

def gettop5Bigrams(cleantext):
    tokens = cleantext.split(" ")
    bigram = calculateBigrams(tokens)
    #Dataframe with 2 columns: the bigram text and its polarity
    df = pd.DataFrame( columns=["Bigram","polarity"])
    for x,y in bigram:
        #print(x +" "+y)
        text=x +" "+y
        analysis = TextBlob(text)
        polarity = analysis.sentiment.polarity
        #print(x +" "+y + str(polarity))
        if polarity < 0:
            df.loc[len(df)] = [text, polarity]
            #print(text)
    df = df.drop_duplicates()
    df_sorted = df.sort_values('polarity')
    #df_sorted.head(50)
    head = df_sorted.head(5).Bigram
    #print(df_sorted.head(50))
    top5 = ''
    i=0
    for t in head:
        i = i+1
        top5 = top5 + "|"+ t
    return top5
   
    

In [35]:
gettop5Bigrams("Sun Devil Dining on Lemon Street has probably been the worst dining experience I have ever had. Day after day, they run out of pretty much everything half way through the day, including probably the most popular item, grilled chicken. They don't even care to have the food prepared by 11am, which is the time they tell everyone the main food options will be available by. Along with many other students, I am sick and tired of shitty food and horrible service by the workers. Except for Ashley, she has always been happy and smiling. Everything about this dining hall has been awful and if there were a way for me to take my money back, and put that towards something else, preferably food that is edible, I would greatly appreciate it. Hopefully this will reach the eyes of someone in charge so they can change these ways and start making students happy. But until then, Keep up the shitty service.")

'|the worst|worst dining|and horrible|horrible service|been awful'
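The selection logic in `gettop5Bigrams` (score each bigram, keep the negative ones, de-duplicate, sort ascending, take five) can be sketched without a dataframe. The following is an illustration under stated assumptions, not the notebook's method: `TOY_POLARITY` and `toy_polarity` are a hypothetical stand-in for the TextBlob score, and `top_negative_bigrams` is a name introduced here.

```python
import heapq

# Hypothetical toy lexicon; the notebook scores bigrams with TextBlob instead
TOY_POLARITY = {"worst": -1.0, "horrible": -1.0, "awful": -1.0,
                "bad": -0.7, "slow": -0.3, "good": 0.7}


def toy_polarity(bigram):
    # Average the scores of the words found in the lexicon; 0.0 when none match
    scores = [TOY_POLARITY[w] for w in bigram.split() if w in TOY_POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


def top_negative_bigrams(tokens, score=toy_polarity, n=5):
    scored = {}
    for first, second in zip(tokens, tokens[1:]):
        text = first + " " + second
        if text not in scored:  # de-duplicate before scoring
            scored[text] = score(text)
    negative = [(polarity, text) for text, polarity in scored.items() if polarity < 0]
    # nsmallest returns the n lowest (most negative) polarities in ascending order
    return [text for polarity, text in heapq.nsmallest(n, negative)]


tokens = "worst food slow service good staff".split()
print(top_negative_bigrams(tokens))
```

Collecting scores in a dictionary avoids appending to a dataframe one row at a time. Ties are broken alphabetically here because tuples are compared element by element, which can differ from the row order that a dataframe sort returns.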

In [36]:
#Calculating the top 5 bigrams using the distributed task (Dask) library for better performance
def getNegativeDaskTop5(df):
    ddata1 = dd.from_pandas(df, npartitions=30)
    #Recent Dask versions select the multiprocessing scheduler with scheduler='processes' instead of the older get=get argument
    txt = ddata1.map_partitions(lambda df: df.apply((lambda row: gettop5Bigrams(row["cleantext"])), axis=1)).compute(scheduler='processes')
    return txt


In [37]:
lr_business_reviews["Top5Bigrams"] = getNegativeDaskTop5(lr_business_reviews)

In [38]:
lr_business_reviews.shape


(1001, 17)

In [39]:
lr_business_reviews.columns

Index(['Business_id', 'Name', 'Address', 'City', 'State', 'Postal_code',
       'Review_count', 'Restaurant_ratings', 'RestaurantsPriceRange2',
       'JoinedText', 'Review_1', 'Review_2', 'Review_3', 'Review_4',
       'Review_5', 'cleantext', 'Top5Bigrams'],
      dtype='object')

In [40]:
lr_business_reviews_copy  = lr_business_reviews.copy(deep=True)

In [41]:
lr_business_reviews.count()

Business_id               1001
Name                      1001
Address                   1001
City                      1001
State                     1001
Postal_code               1001
Review_count              1001
Restaurant_ratings        1001
RestaurantsPriceRange2    1001
JoinedText                1001
Review_1                  1001
Review_2                   981
Review_3                   938
Review_4                   871
Review_5                   821
cleantext                 1001
Top5Bigrams               1001
dtype: int64

In [42]:
lr_business_reviews.to_csv("lr_business_reviews_top5.csv", index=False, sep="\t")

In [43]:
#The following lines of code create a dataframe with 5 bigram columns for each business id.
#The bigrams are ordered from the most negative polarity onwards.

for index,row in lr_business_reviews.iterrows():
    business_id = row["Business_id"]
    
    bigrams = row["Top5Bigrams"]
    #Drop empty entries (the string starts with "|").
    #A list comprehension avoids removing items from a list while iterating over it.
    lst = [item for item in bigrams.split("|") if item != ""]
                   
    df2 = pd.DataFrame([[ business_id,"","","","",""]], columns=bigramCols)
    #DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    bigramDf = pd.concat([bigramDf, df2])
    
    #lst holds at most 5 bigrams, so the column name never goes past Top5Bigrams_5
    for i, bigram in enumerate(lst):
        colname = "Top5Bigrams_"+str(i+1)
        bigramDf.loc[bigramDf['Business_id'] == business_id, colname] = bigram
        
    
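When the `|`-separated string is split back into columns, the empty entries have to be dropped. Removing items from a list inside a loop over that same list can skip elements, because the remaining items shift position during iteration. The sketch below contrasts that pattern with a list comprehension; both function names are hypothetical and used only for illustration.

```python
def remove_while_iterating(items):
    # Pattern to avoid: mutating the list shifts elements under the loop
    items = list(items)
    for item in items:
        if item == "":
            items.remove(item)
    return items


def parse_bigrams(packed, n=5):
    # Split the "|"-separated string, drop empty entries, pad to n columns
    parts = [p for p in packed.split("|") if p]
    return parts[:n] + [""] * max(0, n - len(parts))


print(remove_while_iterating(["", "", "order terrible"]))  # one empty string survives
print(parse_bigrams("|order terrible|horrible gave"))
```

With a single leading `|` the in-place removal happens to work, but two adjacent empty entries are enough to leave one behind.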

In [44]:
bigramDf.head()

| | Business_id | Top5Bigrams_1 | Top5Bigrams_2 | Top5Bigrams_3 | Top5Bigrams_4 | Top5Bigrams_5 |
|---|---|---|---|---|---|---|
| 0 | rDMptJYWtnMhpQu_rRXHng | order terrible | horrible gave | wage horrible | worst mcdonald | world worst |
| 0 | qB15WElGAlI_eGWjn0kT2w | terrible experiences | bell terrible | me terrible | terrible service | one awful |
| 0 | 1Nq7GxjvEDgAJxBeOjR_9Q | worst part | place disgusting | working terrible | terrible went | horrible people |
| 0 | mZK8IBkMFzOX2UmA7_BylA | worst ihop | make horrible | truly worst | horrible burnt | worst experience |
| 0 | mI5UYpuYxjiumMLgANoa9A | terrible everything | location worst | terrible business | onions terrible | terrible every |


In [45]:
bigramDf.shape

(1001, 6)

In [46]:
#Merge bigrams data with the business data to add 5 new columns for 5 bigrams
lr_business_reviews_bigrams = pd.merge(lr_business_reviews , bigramDf, on='Business_id')

In [47]:
lr_business_reviews_bigrams.shape

(1001, 22)

In [48]:
lr_business_reviews_bigrams.head(2)

| | Business_id | Name | Address | City | State | Postal_code | Review_count | Restaurant_ratings | RestaurantsPriceRange2 | JoinedText | ... | Review_3 | Review_4 | Review_5 | cleantext | Top5Bigrams | Top5Bigrams_1 | Top5Bigrams_2 | Top5Bigrams_3 | Top5Bigrams_4 | Top5Bigrams_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rDMptJYWtnMhpQu_rRXHng | McDonald's | 719 E Thunderbird Rd | Phoenix | AZ | 85022.0 | 10.0 | 1.0 | 1.0 | The speed of delivery of my food order was ter... | ... | I was told tonight at 8:30 pm that they were n... | Cashier was disgusting and unsanitary. Picking... | Don't waste your money! Terrible service. The... | speed delivery food order terrible took minute... | \|order terrible\|horrible gave\|wage horrible\|wo... | order terrible | horrible gave | wage horrible | worst mcdonald | world worst |
| 1 | qB15WElGAlI_eGWjn0kT2w | Taco Bell | 15240 N. 32nd Street | Phoenix | AZ | 85032.0 | 18.0 | 2.0 | 1.0 | Very polite, decent customer service. The rice... | ... | Slowest drive thru I have ever gone through. T... | Terrible service. Always a long wait. If you g... | Literally the slowest drive thru I've ever exp... | polite decent customer service rice burrito ha... | \|terrible experiences\|bell terrible\|me terribl... | terrible experiences | bell terrible | me terrible | terrible service | one awful |


In [49]:
#Saving the data to a csv for UI presentation
columns = ['Business_id','Name','Address','City','State','Postal_code','Review_count','Restaurant_ratings','RestaurantsPriceRange2',"Review_1","Review_2","Review_3","Review_4","Review_5","Top5Bigrams_1","Top5Bigrams_2","Top5Bigrams_3","Top5Bigrams_4","Top5Bigrams_5"]
low_rated_business_reviews_csv =lr_business_reviews_bigrams[columns]


In [50]:
#Save only the columns selected above for the UI
low_rated_business_reviews_csv.to_csv("low_rated_business_reviews.csv",encoding='utf-8', index=False)
    

In [51]:
pd.reset_option('display.max_colwidth')