In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
neg_word_sen_score=pd.read_csv("neg_word_sen_score.csv",converters={"score": float})
pos_word_sen_score=pd.read_csv("pos_word_sen_score.csv",converters={"score": float})

In [3]:
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import precision, recall, f_measure
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import ToktokTokenizer

In [4]:
toktok = ToktokTokenizer()
stop_words_nltk = set(stopwords.words('english'))
def word_process(text):
    # Keep letters only, then lowercase.
    clean_words = re.sub("[^a-zA-Z]"," ", text)
    clean_words = clean_words.lower()
    # Punctuation is already stripped, so sent_tokenize normally yields a single "sentence".
    words = [toktok.tokenize(k) for k in sent_tokenize(clean_words)]
    result = []
    if words:
        # Only the first sentence is used; drop English stop words.
        for w in words[0]:
            if w not in stop_words_nltk:
                result.append(w)
    return result

In [5]:
set_of_pos_word = set(pos_word_sen_score.word) - set(['no', 'negative', 'positive'])
set_of_neg_word = set(neg_word_sen_score.word) - set(['no', 'negative', 'positive'])
def pos_sentiment_score(review):
    pos = 0
    clean = word_process(review)
    for w in clean:
        if w in set_of_pos_word:
            pos += pos_word_sen_score[pos_word_sen_score['word'] == w].score.iloc[0]
    return pos
def neg_sentiment_score(review):
    neg = 0
    clean = word_process(review)
    for w in clean:
        if w in set_of_neg_word:
            neg += neg_word_sen_score[neg_word_sen_score['word'] == w].score.iloc[0]
    return neg
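
The two scoring functions above sum, for every token that survives stop-word removal, that word's score in the lexicon. A minimal sketch of the same idea with a toy lexicon follows. The words, scores, stop-word list, and the `toy_tokens` and `toy_score` helpers are all hypothetical, and a plain whitespace split stands in for the NLTK tokenizers. With the real data, a dictionary built once from the `word` and `score` columns would likely avoid filtering the DataFrame for every token.

```python
import re

# Hypothetical lexicon and stop words, for illustration only.
toy_pos_scores = {"nice": 2.0, "lovely": 3.5, "great": 4.0}
toy_stop_words = {"the", "and", "a", "in", "are"}

def toy_tokens(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    cleaned = re.sub("[^a-zA-Z]", " ", text).lower()
    return [w for w in cleaned.split() if w not in toy_stop_words]

def toy_score(text, lexicon):
    """Sum the lexicon score of every token found in the lexicon."""
    return sum(lexicon.get(w, 0.0) for w in toy_tokens(text))

# "great" (4.0) + "nice" (2.0) + "nice" (2.0)
print(toy_score("Great location, nice bar and nice restaurant", toy_pos_scores))  # 8.0
```

Repeated words count each time they occur, which matches the loop in the functions above.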

- Application 1: Predict the **strength of sentiment** of a given review text. Two functions are provided: pos_sentiment_score() and neg_sentiment_score(). Input: a string of review text. Output: a positive sentiment score from pos_sentiment_score() and a negative sentiment score from neg_sentiment_score().

 Example 1: My room was dirty and I was afraid to walk barefoot on the floor which looked as if it was not cleaned in weeks White furniture which looked nice in pictures was dirty too and the door looked like it was attacked by an angry dog My shower drain was clogged and the staff did not respond to my request to clean it On a day with heavy rainfall a pretty common occurrence in Amsterdam the roof in my room was leaking luckily not on the bed you could also see signs of earlier water damage I also saw insects running on the floor Overall the second floor of the property looked dirty and badly kept On top of all of this a repairman who came to fix something in a room next door at midnight was very noisy as were many of the guests I understand the challenges of running a hotel in an old building but this negligence is inconsistent with prices demanded by the hotel On the last night after I complained about water damage the night shift manager offered to move me to a different room but that offer came pretty late around midnight when I was already in bed and ready to sleep 

Type of example: **string** (copied from the above example)

In [6]:
test_text1 = 'My room was dirty and I was afraid to walk barefoot on the floor which looked as if it was not cleaned in weeks White furniture which looked nice in pictures was dirty too and the door looked like it was attacked by an angry dog My shower drain was clogged and the staff did not respond to my request to clean it On a day with heavy rainfall a pretty common occurrence in Amsterdam the roof in my room was leaking luckily not on the bed you could also see signs of earlier water damage I also saw insects running on the floor Overall the second floor of the property looked dirty and badly kept On top of all of this a repairman who came to fix something in a room next door at midnight was very noisy as were many of the guests I understand the challenges of running a hotel in an old building but this negligence is inconsistent with prices demanded by the hotel On the last night after I complained about water damage the night shift manager offered to move me to a different room but that offer came pretty late around midnight when I was already in bed and ready to sleep'

In [7]:
print("The positive sentiment score of the given text is %.2f" %pos_sentiment_score(test_text1), 
      "The negative sentiment score of the given text is %.2f" %neg_sentiment_score(test_text1))

('The positive sentiment score of the given text is 31.50', 'The negative sentiment score of the given text is 466.00')


From the above values, we can tell that the test review is classified as a **negative review**, since the negative sentiment score (466.00) is much larger than the positive sentiment score (31.50). The result matches the categorization in the dataset (the review text was copied from the "Negative_Review" column). The two scores also indicate how negative and how positive the review is.

 Example 2: Great location in nice surroundings the bar and restaurant are nice and have a lovely outdoor area The building also has quite some character 

In [8]:
test_text2 = 'Great location in nice surroundings the bar and restaurant are nice and have a lovely outdoor area The building also has quite some character'

In [9]:
print("The positive sentiment score of the given text is %.2f" %pos_sentiment_score(test_text2), 
      "The negative sentiment score of the given text is %.2f" %neg_sentiment_score(test_text2))

('The positive sentiment score of the given text is 63.60', 'The negative sentiment score of the given text is 0.00')


From the above values, we can tell that the test review is classified as a **100% positive review**, since the negative sentiment score is 0. The result matches the categorization in the dataset (the review text was copied from the "Positive_Review" column). The two scores also indicate how negative and how positive the review is.

Also, given the two values returned by pos_sentiment_score() and neg_sentiment_score() for a review text, if one value is significantly larger than the other, or one of them is 0, we can categorize the text as a positive or a negative review.
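
The rule just described can be sketched as a small function. This is only an illustration: the helper name `categorize_review`, the `ratio` threshold for "significantly larger", and the "mixed" and "undetermined" labels are assumptions, not part of the notebook. It also assumes both scores are non-negative, as in the two examples.

```python
def categorize_review(pos_score, neg_score, ratio=2.0):
    """Label a review from its two sentiment scores.

    `ratio` is an assumed threshold for "significantly larger".
    """
    if pos_score == 0 and neg_score == 0:
        return "undetermined"
    if neg_score == 0 or pos_score >= ratio * neg_score:
        return "positive"
    if pos_score == 0 or neg_score >= ratio * pos_score:
        return "negative"
    return "mixed"

print(categorize_review(31.50, 466.00))  # scores from Example 1 -> negative
print(categorize_review(63.60, 0.00))    # scores from Example 2 -> positive
```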

- Application 2: Let's predict the **Reviewer_Score** based on 1 positive review and 1 negative review from the same hotel. Input: two different strings of review texts; Output: a predicted reviewer score. 

From the two texts above, we can tell that example 1 is categorized as a negative review and example 2 as a positive review. Assume these two reviews were given to the same hotel. Then we can predict the **Reviewer Score** from the sentiment scores of the two texts.

In [10]:
model_para = pd.read_pickle('reviewer_score_model_para') ###read model parameters
model_para = model_para['para'].values.tolist()
model_para

[0.011836679330565333, -0.016977716121541978, 8.24393084976765]

In [11]:
def Reviewer_score_cal(a, b):
    if a == b:
        return('Error: Given two texts are the same. Please give two different texts')
    pos_score_a = pos_sentiment_score(a)
    neg_score_a = neg_sentiment_score(a)
    pos_score_b = pos_sentiment_score(b)
    neg_score_b = neg_sentiment_score(b)
    if pos_score_a > neg_score_a and neg_score_b > pos_score_b:
        pos = pos_score_a
        neg = neg_score_b
        result = model_para[0] * pos + model_para[1] * neg + model_para[2]
        return result
    elif neg_score_a > pos_score_a and pos_score_b > neg_score_b:
        pos = pos_score_b
        neg = neg_score_a
        result = model_para[0] * pos + model_para[1] * neg + model_para[2]
        return result
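    elif pos_score_a > neg_score_a and pos_score_b > neg_score_b:
        return('Error: the given two reviews are both positive reviews')
    elif pos_score_a < neg_score_a and pos_score_b < neg_score_b:
        return('Error: the given two reviews are both negative reviews')
    else:
        # A review with equal positive and negative scores cannot be categorized.
        return('Error: at least one review has equal positive and negative scores')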

In [12]:
Reviewer_score_cal(test_text1, test_text2)

1.0851279425530453

In [13]:
print('The predicted Reviewer Score of the two given texts is %.2f' %Reviewer_score_cal(test_text1, test_text2))

The predicted Reviewer Score of the two given texts is 1.09
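
The prediction can be checked by hand, since the model is a linear combination of the positive score of the positive review and the negative score of the negative review, plus an intercept. The sketch below uses the parameter values and the two-decimal scores printed earlier in this notebook, so it reproduces the result only up to rounding.

```python
# Parameters as printed above: [positive coefficient, negative coefficient, intercept].
model_para = [0.011836679330565333, -0.016977716121541978, 8.24393084976765]

# Scores as printed (rounded to two decimals) for the two examples.
pos = 63.60   # positive score of the positive review (Example 2)
neg = 466.00  # negative score of the negative review (Example 1)

predicted = model_para[0] * pos + model_para[1] * neg + model_para[2]
print("%.2f" % predicted)  # 1.09
```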


- Application 3: Return a **ranking list of hotels** from interested aspect based on aspect analysis. Optional aspects from positive reviews are **staff**, **location**, **room**, **breakfast**, and **bed**. Optional aspects from negative reviews are **room**, **breakfast**, **staff**, **bed** and **bathroom**. 

Based on the analysis in "data_process.ipynb", the top 5 aspects in positive reviews are "staff", "location", "room", "breakfast", and "bed". The top 5 aspects in negative reviews are "room", "breakfast", "staff", "bed", and "bathroom". Each ranked list is saved to a .csv file for the user to read directly.
'loc_pos_list.csv' is a ranked list of hotels that have "location" in positive reviews.
'staff_pos_list.csv' is a ranked list of hotels that have "staff" in positive reviews.
'rm_pos_list.csv' is a ranked list of hotels that have "room" in positive reviews.
'bk_pos_list.csv' is a ranked list of hotels that have "breakfast" in positive reviews.
'bed_pos_list.csv' is a ranked list of hotels that have "bed" in positive reviews.
'rm_neg_list.csv' is a ranked list of hotels that have "room" in negative reviews.
'bk_neg_list.csv' is a ranked list of hotels that have "breakfast" in negative reviews.
'staff_neg_list.csv' is a ranked list of hotels that have "staff" in negative reviews.
'bed_neg_list.csv' is a ranked list of hotels that have "bed" in negative reviews.
'bath_neg_list.csv' is a ranked list of hotels that have "bathroom" in negative reviews.
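
One way to pick the right file for a chosen aspect is a small lookup table. The file names below are the ones listed above; the `ASPECT_FILES` table and the `ranking_file` helper are hypothetical, and the sketch assumes that 'bk_pos_list.csv' holds the breakfast list, as its name and the top-5 aspects suggest.

```python
# (aspect, polarity) -> ranked-list file, using the file names listed above.
ASPECT_FILES = {
    ("location", "positive"): "loc_pos_list.csv",
    ("staff", "positive"): "staff_pos_list.csv",
    ("room", "positive"): "rm_pos_list.csv",
    ("breakfast", "positive"): "bk_pos_list.csv",  # assumed to be the breakfast list
    ("bed", "positive"): "bed_pos_list.csv",
    ("room", "negative"): "rm_neg_list.csv",
    ("breakfast", "negative"): "bk_neg_list.csv",
    ("staff", "negative"): "staff_neg_list.csv",
    ("bed", "negative"): "bed_neg_list.csv",
    ("bathroom", "negative"): "bath_neg_list.csv",
}

def ranking_file(aspect, polarity):
    """Return the ranked-list file name for an aspect and review polarity."""
    key = (aspect.lower(), polarity.lower())
    if key not in ASPECT_FILES:
        raise ValueError("No ranked list for aspect %r in %s reviews" % (aspect, polarity))
    return ASPECT_FILES[key]

print(ranking_file("location", "positive"))  # loc_pos_list.csv
```

The returned name could then be passed to `pd.read_csv`, as the notebook does for 'loc_pos_list.csv'.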

Assume a user is interested in hotels whose positive reviews mention **location**. The cell below returns the top 20 hotels from that ranked list.

In [14]:
import pandas as pd
loc_pos = pd.read_csv('loc_pos_list.csv')
loc_pos['Hotel_name'][:20]

0                                    11 Cadogan Gardens
1                                              1K Hotel
2                    25hours Hotel beim MuseumsQuartier
3                                                    41
4                    45 Park Lane Dorchester Collection
5                                            88 Studios
6                                     9Hotel Republique
7                                     A La Villa Madame
8          ABaC Restaurant Hotel Barcelona GL Monumento
9     AC Hotel Barcelona Forum a Marriott Lifestyle ...
10    AC Hotel Diagonal L Illa a Marriott Lifestyle ...
11             AC Hotel Irla a Marriott Lifestyle Hotel
12           AC Hotel Milano a Marriott Lifestyle Hotel
13             AC Hotel Paris Porte Maillot by Marriott
14            AC Hotel Sants a Marriott Lifestyle Hotel
15    AC Hotel Victoria Suites a Marriott Lifestyle ...
16                                ADI Doria Grand Hotel
17                            ADI Hotel Polizian

Visualize the result on an interactive map with folium.

In [15]:
import folium
df = pd.read_pickle('Filling_nans')
loc_pos_hotel = loc_pos['Hotel_name'][:20]
loc_pos_data = df.loc[df['Hotel_Name'].isin(loc_pos_hotel)][["Hotel_Name","Hotel_Address",
                                                             'lat','lng']].drop_duplicates()
loc_pos_map = folium.Map(location = [52, 17], zoom_start = 1)
loc_pos_data.apply(lambda row:folium.Marker(location=[row["lat"], row["lng"]])
                                             .add_to(loc_pos_map), axis=1)

loc_pos_map
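
The map above is centered on a fixed point, `[52, 17]`. As an optional variation, the center could be derived from the marker coordinates themselves. The sketch below is an assumption rather than part of the notebook: `map_center` is a hypothetical helper that takes a plain arithmetic mean, which is a rough but usually adequate choice for points within one region. The three sample coordinates are taken from the `loc_pos_data` output of this notebook.

```python
def map_center(points):
    """Return the mean [lat, lng] of a list of (lat, lng) pairs."""
    lats = [p[0] for p in points]
    lngs = [p[1] for p in points]
    return [sum(lats) / len(lats), sum(lngs) / len(lngs)]

# Three coordinates from the loc_pos_data output.
sample_points = [
    (51.493616, -0.159235),  # 11 Cadogan Gardens
    (48.863932, 2.365874),   # 1K Hotel
    (41.389961, 2.135684),   # AC Hotel Diagonal L Illa a Marriott Lifestyle ...
]
center = map_center(sample_points)
print(center)
```

The resulting pair could be passed as the `location` argument of `folium.Map` in place of the fixed coordinates.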

In [16]:
loc_pos_data

| | Hotel_Name | Hotel_Address | lat | lng |
|---|---|---|---|---|
| 33703 | 11 Cadogan Gardens | 11 Cadogan Gardens Sloane Square Kensington an... | 51.493616 | -0.159235 |
| 43148 | 1K Hotel | 13 Boulevard Du Temple 3rd arr 75003 Paris France | 48.863932 | 2.365874 |
| 185602 | 41 | 41 Buckingham Palace Road Westminster Borough ... | 51.498147 | -0.143649 |
| 191715 | A La Villa Madame | 44 Rue Madame 6th arr 75006 Paris France | 48.848861 | 2.331526 |
| 191786 | 45 Park Lane Dorchester Collection | 45 Park Lane Westminster Borough London W1K 1P... | 51.506371 | -0.151536 |
| 221503 | AC Hotel Paris Porte Maillot by Marriott | 6 rue Gustave Charpentier 17th arr 75017 Paris... | 48.882005 | 2.281854 |
| 234572 | 9Hotel Republique | 7 9 Rue Pierre Chausson 10th arr 75010 Paris F... | 48.870842 | 2.360586 |
| 256880 | 88 Studios | 88 Holland Road Kensington and Chelsea London ... | 51.499279 | -0.209073 |
| 277378 | AC Hotel Diagonal L Illa a Marriott Lifestyle ... | Avenida Diagonal 555 Les Corts 08029 Barcelona... | 41.389961 | 2.135684 |
| 280541 | ABaC Restaurant Hotel Barcelona GL Monumento | Avenida Tibidabo 1 Sarri St Gervasi 08022 Barc... | 41.410694 | 2.136294 |
