<h3>Ratings of Tags</h3>
<h7>We would be calculating ratings of the tags using reviews provided by travelers.</h7>

In [48]:
#importing libraries
import pandas as pd
import numpy as np
import csv
import datetime
import random
import math  
import matplotlib.pyplot as plt
import scipy.stats as st

<h4>Initial Dataset Used</h4>
<h7>This is data for reviews of various hotels along with sentiment scores of each</h7>

In [49]:
data = pd.read_csv("TestData.csv")
data.head()

Unnamed: 0,jarvis_tags.reviewid,jarvis_tags.hotel_id,jarvis_tags.brandtype,Unnamed: 3,jarvis_tags.snippet_text,jarvis_tags.text_type,jarvis_tags.sentiment_class,jarvis_tags.sentiment_score
0,99145909,Hotel 1,HomeAway,,"This is what hotels should be like. Clean, lar...",text,positive,0.9534
1,97597588,Hotel 1,HomeAway,,"The room was beautiful and spacious, but the f...",text,negative,0.862
2,96239840,Hotel 1,HomeAway,,I have been staying here off and on most weeks...,text,positive,0.9516
3,96025643,Hotel 1,HomeAway,,"Dear M, I just wanted to write a note of thank...",text,positive,0.9932
4,95527737,Hotel 1,HomeAway,,Stayed here for two nights in Feb 2011. Great ...,text,positive,0.9543


<h5>Random Date Generator</h5>
<h7>A function to generate random dates from 1999-01-01 to 2020-02-01</h7>

In [50]:
#random date generator
def random_date_generator():
    start_date = datetime.date(1999, 1, 1)
    end_date = datetime.date(2020, 2, 1)

    time_between_dates = end_date - start_date
    days_between_dates = time_between_dates.days
    random_number_of_days = random.randrange(days_between_dates)
    random_date = start_date + datetime.timedelta(days=random_number_of_days)
    return random_date

<h5>Desmet Dictionary</h5>
<h7>Currently we use the desmet dictionary in Expedia which conatains all the tags under various categories like activities,amenities,food,etc.</h7>
<h7>Downloaded the csv file and created a dictionary of tags.</h7>

In [51]:
#creating desmet dictionary from desmet.csv used in Expedia
reader = csv.DictReader(open('desmet.csv'))
desmet_dict = {}
for row in reader:
    key = row.pop('loc_value')
    if key in desmet_dict:
        # implement your duplicate row handling here
        pass
    desmet_dict[key] = row

In [52]:
#first five key value pairs of desmet dictionary
list(desmet_dict.items())[:5]

[('Activity',
  OrderedDict([('category', 'activities'),
               ('name', 'activity'),
               ('loc_type', 'default'),
               ('urn', 'ugc:activities:activity'),
               ('loc_lcid', '1033'),
               ('loc_lang', 'en')])),
 ('Activities',
  OrderedDict([('category', 'activities'),
               ('name', 'activity'),
               ('loc_type', 'plural'),
               ('urn', 'ugc:activities:activity'),
               ('loc_lcid', '1033'),
               ('loc_lang', 'en')])),
 ('Adventure',
  OrderedDict([('category', 'activities'),
               ('name', 'adventure'),
               ('loc_type', 'default'),
               ('urn', 'ugc:activities:adventure'),
               ('loc_lcid', '1033'),
               ('loc_lang', 'en')])),
 ('Adventures',
  OrderedDict([('category', 'activities'),
               ('name', 'adventure'),
               ('loc_type', 'plural'),
               ('urn', 'ugc:activities:adventure'),
               ('loc_lcid', 

<h5>Creating the final dataset</h5>
<h7>We had the sentiment score of every sentence. If a tag is present in a sentence, we allocate the same sentiment score to that tag. The sentiment score can range between 0 to 1, with 0 being highly negative, while 1 being highly positive. A tag with rating 0.9 would have better sentiment than a tag having sentiment as 0.5</h7>

In [53]:
#creating dataset for particular hotel
reviewid = []
hotel_id = []
brandtype = []
tag_type = []
sub_tag_type = []
tag_id = []
tag_name = []
snippet_text = []
text_type = []
sentiment_class = []
sentiment_score = []
string_match_score = []
update_dt = []

#Reading dataset and adding sentiment score to tag
with open('TestData.csv', encoding="utf8") as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=',')
    for row in csv_reader:
        if row[1] == "jarvis_tags.hotel_id":
            continue
        row[7] = (float(row[7])+1)/2
        if row[1] == "Hotel 2":
            for tag in desmet_dict.keys():
                if tag in row[4]:
                    reviewid.append(row[0])
                    hotel_id.append(row[1])
                    brandtype.append(row[2])
                    tag_type.append("tag")
                    sub_tag_type.append(desmet_dict[tag]["category"])
                    tag_id.append(desmet_dict[tag]["name"])
                    tag_name.append(tag)
                    snippet_text.append(row[4])
                    text_type.append("text")
                    sentiment_class.append(row[6])
                    sentiment_score.append(row[7])
                    string_match_score.append(0)
                    update_dt.append(random_date_generator())
                    #print(tag)


data = {'reviewid':reviewid, 'hotel_id':hotel_id, 'brandtype':brandtype, 'tag_type':tag_type,'sub_tag_type':sub_tag_type,
        'tag_id':tag_id,'tag_name':tag_name,'snippet_text':snippet_text,'text_type':text_type,
        'sentiment_class':sentiment_class,'sentiment_score':sentiment_score,'string_match_score':string_match_score,'update_dt':update_dt}

finalDataset = pd.DataFrame(data)

In [54]:
finalDataset.head(5)

Unnamed: 0,reviewid,hotel_id,brandtype,tag_type,sub_tag_type,tag_id,tag_name,snippet_text,text_type,sentiment_class,sentiment_score,string_match_score,update_dt
0,122898565,Hotel 2,HomeAway,tag,placesOfInterest,hotel,Hotel,My husband and I were looking for a place to c...,text,negative,0.01665,0,2017-06-06
1,58596899,Hotel 2,HomeAway,tag,amenities,internet,Internet,I was looking for a place to rest my head afte...,text,positive,0.9863,0,2007-03-12
2,206096605,Hotel 2,HomeAway,tag,placesOfInterest,hotel,Hotel,Stayed here on Friday was clean quiet and the ...,text,positive,0.564,0,2007-02-15
3,101871687,Hotel 2,HomeAway,tag,attributes,location,Location,"This hotel is one of the newest in Barrie, so ...",text,negative,0.9417,0,2008-03-02
4,95715832,Hotel 2,HomeAway,tag,placesOfInterest,restaurant,Restaurant,Got a decent rate on a room with a fireplace a...,text,positive,0.99665,0,2004-09-09


<h4>Finding Tag rating</h4>
<h7>Currently the sentiment score of each tag is from 0-1. However for bayesian model to work, we need values greater then 1. So, scaling the values in range [0,1] to [0,5] and created a new variable called "tag_rating"</h7>

In [55]:
y  = (finalDataset["sentiment_score"]*10)/2
finalDataset['tag_rating'] = y
finalDataset.head(5)

Unnamed: 0,reviewid,hotel_id,brandtype,tag_type,sub_tag_type,tag_id,tag_name,snippet_text,text_type,sentiment_class,sentiment_score,string_match_score,update_dt,tag_rating
0,122898565,Hotel 2,HomeAway,tag,placesOfInterest,hotel,Hotel,My husband and I were looking for a place to c...,text,negative,0.01665,0,2017-06-06,0.08325
1,58596899,Hotel 2,HomeAway,tag,amenities,internet,Internet,I was looking for a place to rest my head afte...,text,positive,0.9863,0,2007-03-12,4.9315
2,206096605,Hotel 2,HomeAway,tag,placesOfInterest,hotel,Hotel,Stayed here on Friday was clean quiet and the ...,text,positive,0.564,0,2007-02-15,2.82
3,101871687,Hotel 2,HomeAway,tag,attributes,location,Location,"This hotel is one of the newest in Barrie, so ...",text,negative,0.9417,0,2008-03-02,4.7085
4,95715832,Hotel 2,HomeAway,tag,placesOfInterest,restaurant,Restaurant,Got a decent rate on a room with a fireplace a...,text,positive,0.99665,0,2004-09-09,4.98325


<h5>Creating Model 1</h5>
<h7>The approximate model for the Bayesian average includes the following - </h7>
<ul>
<li>Select some m that represents a prior for the average of stars.</li>
<li>Select a C that represents our confidence in the prior.</li>
</ul>
<img src ="BayesianAverageFormula.png" />
<p>This is as per the wikipedia link "https://en.wikipedia.org/wiki/Bayesian_average" and python sample code  "https://medium.com/district-data-labs/computing-a-bayesian-estimate-of-star-rating-means-651496a890ab"</p>

In [56]:
#Using first model as given in wikipedia

#assuming a constant
confidence = 50
#Setting prior mean rating 
prior = 3.5 
#calculating
x1 = ((confidence * prior + finalDataset.groupby(["tag_id"]).sum()) /
                (confidence + finalDataset.groupby(["tag_id"]).count()))


grid1   = pd.DataFrame({
                    'mean_rating':  finalDataset.groupby(["tag_id"]).mean()["tag_rating"],
                    'count': finalDataset.groupby(["tag_id"]).count()["tag_rating"],
                    'bayes_rating': x1["tag_rating"]
                 })

In [57]:
grid1

Unnamed: 0_level_0,mean_rating,count,bayes_rating
tag_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bathroom,4.6349,5,3.603173
beach,4.95275,1,3.528485
booking,3.684021,12,3.535617
breakfast,4.123597,36,3.761041
business,4.9895,1,3.529206
casino,4.749,5,3.613545
checkin,2.690875,2,3.46888
chocolate,4.8975,1,3.527402
coffee,4.871667,3,3.577642
elevator,3.496083,3,3.499778


<h4>Result</h4>
<p>We can see that the Bayesian averages are much better than mean of tag_ratings. Those tags which have large number of positive reviews are much far than 3.5 (prior mean) in positive side while those tags which have small number of positive reviews are not much far from 3.5 (prior mean)</p>
<p>Similarly, the case for negative reviews. </p>

<h3>Problem with this model</h3>
<p>As the second part of our task, we would need confidence score for every rating of a tag which would show how much confident we are about the rating. The purpose is that the client can take a decision for which top x tags would they like to display the ratings. This would range between 0 to 1 and would be relative to a hotel.</p>

<p>This cannot be done using the above formula as this does not provide a method for calcutaing confidence scores.</p>

<h3>Creating Model 2</h3>
<p>Following this link (https://www.evanmiller.org/ranking-items-with-star-ratings.html), applied this formula to calculate the confidence score of each tag so that we can sort them on basis of confidence_scores and not on the bayesian_scores.</p>
<p>The first part of the formula is used to calculate Bayesian_Average_score (without subtracting the Z alpha over two part).</p>

In [58]:
#Using Exact Formula
finalDataset["tag_rating"] = round(finalDataset["tag_rating"],1)
points_summation = 0.0
points_square_summation = 0.0
confidence = 10
for i in np.arange(0, 5, 0.1):
    points_summation = points_summation + i
    points_square_summation = points_square_summation + i*i
tags = finalDataset["tag_id"].unique()
x2 = []
variance = []
for tag in tags: 
    reviews = 0
    reviews_square = 0
    num_review = 0
    for i in np.arange(0, 5, 0.1):
        for ind in finalDataset.index:
            if(finalDataset["tag_id"][ind]==tag and finalDataset["tag_rating"][ind]==i):
                reviews = reviews + i
                reviews_square = reviews_square + i*i
                num_review += 1
    temp = (reviews + points_summation)/(50 + num_review)
    x2.append(temp)
    temp2 = (reviews_square + points_square_summation)/(50 + num_review)
    temp_variance = math.sqrt((temp2 - temp*temp)/(50 + num_review+1))
    variance.append(temp_variance)

x2_df = pd.DataFrame(x2) 
variance_df = pd.DataFrame(variance) 
confidence_score = x2_df - st.norm.ppf(.95)*variance_df
grid2   = pd.DataFrame({
                    'mean_rating':  finalDataset.groupby(["tag_id"]).mean()["tag_rating"],
                    'count': finalDataset.groupby(["tag_id"]).count()["tag_rating"],
                    'bayes_rating': x2,
                    'confidence_score': confidence_score.values.tolist()
                 })

In [59]:
grid2

Unnamed: 0_level_0,mean_rating,count,bayes_rating,confidence_score
tag_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bathroom,4.64,5,2.860937,[2.534534679052758]
beach,5.0,1,2.579245,[2.2437766031778144]
booking,3.7,12,2.611111,[2.278021727776261]
breakfast,4.122222,36,2.648214,[2.3186774419580227]
business,5.0,1,2.45,[2.1176201604957123]
casino,4.76,5,2.750877,[2.4112062723143843]
checkin,2.7,2,2.811268,[2.4941493894154854]
chocolate,4.9,1,2.540385,[2.204729906276039]
coffee,4.866667,3,2.564151,[2.2329436948135646]
elevator,3.533333,3,2.524074,[2.190594693094195]


<h4>Result</h4>
<p>We can see that the bayesian_score in this case also gave very good result. Also, we can sort the tags on the basis of confidence_score of each tags. <b>However, we don't have any control on the initial average rating for each tag (like it was 3.5 in Model 1). So it will depend entirely on data.</b></p>

<h4>Work Remaining</h4>
<ul>
<li>We need to apply the decay factor. This can be easily done by multiplting an exponential decay function to the tag rating. h = Differnece in update date and current date. tag_rating = e^(-h)*tag_rating + bias.</li>
</ul>