In [7]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import nltk
%matplotlib inline

## Explore the data

In [8]:
#import dataset
data = pd.read_csv("./hotel_reviews_raw_data.csv") 

In [9]:
#look at the last five reviews
data.head() 

Unnamed: 0,deceptive,hotel,polarity,source,text
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...


In [10]:
#list all the column headers:
for i in data.columns:
    print(i)

deceptive
hotel
polarity
source
text


### Information from dataset uploader on Kaggle:
"The csv file contains 17 fields. The description of each field is as below:

Hotel_Address: Address of hotel.

Review_Date: Date when reviewer posted the corresponding review.

Average_Score: Average Score of the hotel, calculated based on the latest comment in the last year.

Hotel_Name: Name of Hotel

Reviewer_Nationality: Nationality of Reviewer

Negative_Review: Negative Review the reviewer gave to the hotel. If the reviewer does not give the negative review, then it should be: 'No Negative'

Review_Total_Negative_Word_Counts: Total number of words in the negative review.

Positive_Review: Positive Review the reviewer gave to the hotel. If the reviewer does not give the positive review, then it should be: 'No Positive'

Review_Total_Positive_Word_Counts: Total number of words in the positive review.

Reviewer_Score: Score the reviewer has given to the hotel, based on his/her experience

Total_Number_of_Reviews_Reviewer_Has_Given: Number of Reviews the reviewers has given in the past.

Total_Number_of_Reviews: Total number of valid reviews the hotel has.

Tags: Tags reviewer gave the hotel.

days_since_review: Duration between the review date and scrape date.

Additional_Number_of_Scoring: There are also some guests who just made a scoring on the service rather than a review. This number indicates how many valid scores without review in there.

lat: Latitude of the hotel

lng: longtitude of the hotel

In order to keep the text data clean, I removed unicode and punctuation in the text data and transform text into lower case. No other preprocessing was performed."

In [11]:
 #total number of reviews
len(data)

1600

In [None]:
data_plot = data[["source","text"]].drop_duplicates()
data_plot_avg = data_plot.plot.hist()
plt.show()

As a quick exploration of data, we plot the number of hotels versus their average rating:

In [13]:
max_rating = data.max()
max_rating

deceptive                                             truthful
hotel                                                  talbott
polarity                                              positive
source                                                     Web
text         where do I start. The GM was a nice guy for ha...
dtype: object

In [14]:
min_rating = data.min() 
min_rating

deceptive                                            deceptive
hotel                                                  affinia
polarity                                              negative
source                                                   MTurk
text          I stayed at the Sheraton Chicago Hotel and To...
dtype: object

# Natural Language Processing - Tokenize the reviews and build a bag-of-words model
The first goal is to do sentiment analysis on the positive and negative reviews. To do this, we need to first tokenize the words using nltk, remove the stopwords, and build a bag-of-words model.

In [15]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [22]:
data = pd.read_csv("./hotel_reviews_raw_data.csv") 

In [29]:

pos_reviews = data.polarity
neg_reviews = data.polarity
print(type(pos_reviews))

<class 'pandas.core.series.Series'>


In [30]:
#word_tokenize only works for text file, not whole series
pos_reviews_words = nltk.word_tokenize(pos_reviews[1]) 
#len(pos_reviews[0])
#print(pos_reviews[1])
#tokenize and print the second review (the first was to
print(pos_reviews_words) 

['positive']


In [31]:
print(type(pos_reviews[:5]))

<class 'pandas.core.series.Series'>


In [32]:
len(pos_reviews)

1600

pos_reviews_wordslist = []  

*Dec 15 doesn't work for bag-of-words because can't distinguish between neighbouring reviews

for i in range(5):

for i in range(515738): 

get error if put len+1 here, needed to switch from pos_reviews[1] to .iloc[1]

pos_reviews_wordslist.extend(nltk.word_tokenize(pos_reviews.iloc[i])) 

tokenize text in each positive review

 return pos_reviews_wordslist

In [34]:
pos_reviews_wordslist = []  
#for i in range(5):
 #get error if put len+1 here, needed to switch from pos_reviews[1] to .iloc[1]
for i in range(515):
   #tokenize te
    pos_reviews_wordslist.append(nltk.word_tokenize(pos_reviews.iloc[i])) 

In [35]:
print(pos_reviews_wordslist[:5])

[['positive'], ['positive'], ['positive'], ['positive'], ['positive']]


In [36]:
len(pos_reviews_wordslist)

515

In [37]:
type(pos_reviews_wordslist)

list

 Now we have tokenized all the positive and negative reviews with punctuation already removed in the raw data, we will remove the stop words and build a bag-of-words model with the filtered words

In [None]:
import nltk
nltk.download()
nltk.download("text")

In [None]:
 #all the reviews in this dataset are in text
len(nltk.corpus.stopwords.words("text"))

In [None]:
nltk.corpus.stopwords.words("text")[:10]

In [None]:
useless_words = nltk.corpus.stopwords.words("text")
type(useless_words)


# useless_words

In [None]:
def build_bag_of_words_filtered(words):
    return {
        #word:1 for word in words
        word:1 for word in words \
        if not word in useless_words} 

In [None]:
assert len(build_bag_of_words_filtered(["We stayed", "stayed"]))==0, "We stayed for a one night getaway with family on a thursday.

 We can build the negative and positive features separately using the build-bag-of-words function The format of the positive features should be:

[
    ( { "here":1, "some":1, "words":1 }, "pos" ),
    
    ( { "another":1, "tweet":1}, "pos" )
]

It is a list of tuples, the first element is a dictionary of the words with 1 if that word appears, the second the "pos" or "neg" string.

In [None]:
positive_features = None
positive_features = [
    (build_bag_of_words_filtered(review),'pos') \
    for review in pos_reviews_wordslist 
]

In [None]:
positive_features[-1]

In [None]:
type(positive_features)

In [None]:
negative_features[-2:]

# Train a classifier for sentiment analysis


We will use the Naive Bayes classifier as explained in lecture; train it on 80 percent of the data, and test on the remaining 20 percent

In [None]:
from nltk.classify import NaiveBayesClassifier

In [None]:
#Using 80% of the data for training, the rest for validation:
split = int(len(positive_features) * 0.8)
split

In [None]:
classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])

### check the accuracy on the training and test sets, turning accuracy into percentage:

In [None]:
training_accuracy = None #check accuracy of training set
training_accuracy = nltk.classify.util.accuracy(classifier, positive_features[:split] + negative_features[:split])*100
training_accuracy

The training accuracy is around 93.5 percent, which is quite good, as expected since the classifier has seen the data (I was actually expecting it to be a bit higher)

In [None]:
test_accuracy = None #check accuracy of test set
test_accuracy = nltk.classify.util.accuracy(classifier, positive_features[split:] + negative_features[split:])*100
test_accuracy

 The test accuracy is over 92.5 percent, which is really good and almost as high as the training accuracy. It is also significantly higher for the estimated human prediction accuracy of 80%.
 
This shows the Naive Bayes Classifier is a good method to use for this analysis since it performs well for this type of dataset.

The accuracy for the test is also very high compared to the movie review dataset from lecture. We can now print the most informative features below to understand why. The most informative features are the words that mostly identify a positive or a negative review, or the words that had the greatest effect on the prediction accuracy.

In [None]:
classifier.show_most_informative_features()

As we can see, the words "negative" and "positive" appeared in lots of reviews and are quite informative. But due to the nature of the dataset (worded like a questionaire), lots of reviews actually say "no positive" or "no negative" which can be tricky since they would represent the opposite sentiment, which is why the ratio is not 100 percent (number of reviews versus 1).

Since 9 out of 10 most informative features indicate high accuracy for a positive prediction, I decided to look at more of these features:

In [None]:
classifier.show_most_informative_features(50)

From this list of most informative features, it's interesting to note that quite a few of the informative words from positive reviews refer to the hotel staff (Friendly, Helpful, Efficient) and location (Convenient, Conveniently, Convenience), while the most informative words for negative reviews seem to refer mostly to the facilities (Thin, Charged, Unusable, Lack, unreliable, damaged, Loud, Noisy, Smelly, Missing, loudly)

### Relationship between reviewer nationality and rating
For the second research question, I am interested in finding out the relationship between reviewer nationality and their ratings. Since all the hotels in this dataset are located in Europe, do European travellers tend to give a higher or lower rating? Which country gives the highest and lowest ratings on average? As part of the analysis, I will try to use different types of visualizations to present findings.

 We also want to explore whether there is a relationship between the number of reviews a reviewer has given in this dataset and the review score they give to a particular hotel (a larger number of reviews indicates the reviewer is an experienced traveller and often stays at this type of hotels). One hypothesis is a more experienced traveller might have higher standards and give a lower rating.

### try using seaborn# for scatter plot, both axes need to be numbers! reviewer experie

In [None]:
#sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=iris, size=5)
sns.jointplot(x="deceptive", y="hotel")

 <seaborn.axisgrid.JointGrid at 0x1c3919ba8>

In [None]:
sns.axes_style() #check plot style 

 From SciPy documentation, equation displayed shows Pearson correlation: 

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearsonâ€™s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.


Almost 0 --> no correlation between reviewer score and number of  reviews given

From the plot above, we do not see a clear relationship between reviewer score for a particular hotel and the total number of reviews the reviewer has given. There are even a few outliers where the number of reviews is high, and the reviewer score is also above 8.5. It seems most reviewers have given a low number of reviews in this dataset, I want to adjust the scale of the plot axes to see this better.

In [None]:
sns_plot=sns.jointplot(x="deceptive", y="hotel", data = data, size=15) 


In [None]:
sns_plot.savefig('sns_scatter.png') #saved sns scatter plot in JupyterNotebooks folder


#### change plot to presentation style, from seaborn documentation:


In [None]:
sns.set() #reset default parameters
sns.set_context("text")
sns_plot=sns.jointplot(x="deceptive", y="hotel", d

In [None]:
sns_plot.savefig('sns_scatter_talk.png') #save larger plot for presentation

#### Try plotting this without seaborn library:

In [None]:
def plot_scatter(df, x, y):
    ### BEGIN SOLUTION
    fig, axis = plt.subplots()
    # Grid lines, Xticks, Xlabel, Ylabel
    
    axis.yaxis.grid(True)
    axis.set_title('deceptive',fontsize=10)
    axis.set_xlabel(x,fontsize=10)
    axis.set_ylabel(y,fontsize=10)


    X = df[x]
    Y = df[y]

    #axis.semilogx(X)
    axis.scatter(X, Y)
    plt.show()

plot_scatter(data, 'deceptive', 'hotel')

In [None]:
#scatter-bubble plot for nations :
bbplot = data.plot.scatter(x='deceptive', y='hotel', s=data['text'])

In [None]:
max(data.text.unique())


(When revisiting this project in May 2019) Realized there is no correlation because the column Reviewer_Score contains scores for all the hotels from all reviewers, plus the ones that only left a score without reviewing. To take care of that discrepancy, would probably need to filter out the scores posted by those who did not leave a review...(but focusing on another project at the moment, and the main goal of this project -- using Naive Bayes Classifier to do sentiment analysis on hotel reviews dataset -- was successfully achieved