# Prediction Modelling for Hotel Ratings from Reviews, Reviewer Nationality and Hotel Location



## Business Problem

#### Goals
- Determine the relationship between the text of a review and the rating the reviewer gives.
- Determine whether reviewer nationality affects ratings.

#### Other Impacts on Review Ratings
- Stay Duration
- Reviewer Nationality
- Location of the Hotel
- Number of Reviews
- Tags associated with the Trip

#### Why?
- Review apps and websites could pre-fill a suggested rating based on the text of a review.
- Hotels could use the findings to make improvements based on their reviews.


## Risks and Limitations
- Reviews are assumed to be honest.
- Each hotel is treated as constant.
- A low score is assumed to be due solely to the factors discussed in the written review.
- Each review is treated as separate.
- External factors are ignored.

### Data Access
Data was obtained from [Kaggle](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe)

## The Review Data

#### Acknowledgements
The data was scraped from Booking.com by [Jason Liu](https://www.kaggle.com/jiashenliu)

#### Data Context
- 515,000 customer reviews
- 1493 luxury hotels within Europe

#### Data Content
- **Hotel_Address:** Address of the hotel.
- **Review_Date:** Date when the reviewer posted the corresponding review.
- **Average_Score:** Average score of the hotel, calculated based on the latest comment in the last year.
- **Hotel_Name:** Name of the hotel.
- **Reviewer_Nationality:** Nationality of the reviewer.
- **Negative_Review:** Negative review the reviewer gave to the hotel. If the reviewer gave no negative review, the value is 'No Negative'.
- **Review_Total_Negative_Word_Counts:** Total number of words in the negative review.
- **Positive_Review:** Positive review the reviewer gave to the hotel. If the reviewer gave no positive review, the value is 'No Positive'.
- **Review_Total_Positive_Word_Counts:** Total number of words in the positive review.
- **Reviewer_Score:** Score the reviewer gave the hotel, based on their experience.
- **Total_Number_of_Reviews_Reviewer_Has_Given:** Number of reviews the reviewer has given in the past.
- **Total_Number_of_Reviews:** Total number of valid reviews the hotel has.
- **Tags:** Tags the reviewer gave the hotel.
- **days_since_review:** Duration between the review date and the scrape date.
- **Additional_Number_of_Scoring:** Some guests gave a score without writing a review. This number indicates how many valid scores without a review the hotel has.
- **lat:** Latitude of the hotel.
- **lng:** Longitude of the hotel.
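
Two of the fields above need a little handling before they can be used: the review columns carry the placeholders 'No Negative' / 'No Positive', and **Tags** is stored as the string form of a list. The following is a minimal sketch of one way to deal with both, not necessarily what this notebook does; the helper names `parse_tags` and `clean_review` are hypothetical.

```python
import ast


def parse_tags(raw: str) -> list[str]:
    """Turn the string form of a tag list into a list of stripped tags."""
    return [tag.strip() for tag in ast.literal_eval(raw)]


def clean_review(text: str) -> str:
    """Map the 'No Negative' / 'No Positive' placeholders to an empty string."""
    text = text.strip()
    return "" if text in {"No Negative", "No Positive"} else text


# A tag string in the same shape as the Tags column
example_tags = "[' Leisure trip ', ' Couple ']"
print(parse_tags(example_tags))
print(repr(clean_review("No Negative")))
```

On the full DataFrame these helpers could be applied column-wise, for example `hotels['Tags'].apply(parse_tags)`.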

In [5]:
# Script Name: EDA of Hotel Reviews Data
# Author: Rahul Kumar
# Date: 2-Jan-20
# Description: Clean the data in preparation for modelling

import pandas as pd
import numpy as np
from math import sqrt
import seaborn as sns
import scipy as sp
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

# pycountry supplies ISO country data, intended here for pulling a country out of a line of text
import pycountry

import matplotlib.pyplot as plt
%matplotlib inline

# Show all rows and columns when displaying a DataFrame,
# instead of truncating the middle
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
hotels = pd.read_csv('../Hotel_Reviews.csv')

### Preview of the Data
- Note: **Reviewer_Score** is the label to be predicted.

In [7]:
hotels.head(4)

| | Hotel_Address | Additional_Number_of_Scoring | Review_Date | Average_Score | Hotel_Name | Reviewer_Nationality | Negative_Review | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Positive_Review | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | Tags | days_since_review | lat | lng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Russia | I am so angry that i made this post available... | 397 | 1403 | Only the park outside of the hotel was beauti... | 11 | 7 | 2.9 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 1 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 8/3/2017 | 7.7 | Hotel Arena | Ireland | No Negative | 0 | 1403 | No real complaints the hotel was great great ... | 105 | 7 | 7.5 | [' Leisure trip ', ' Couple ', ' Duplex Double... | 0 days | 52.360576 | 4.915968 |
| 2 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | Australia | Rooms are nice but for elderly a bit difficul... | 42 | 1403 | Location was good and staff were ok It is cut... | 21 | 9 | 7.1 | [' Leisure trip ', ' Family with young childre... | 3 days | 52.360576 | 4.915968 |
| 3 | s Gravesandestraat 55 Oost 1092 AA Amsterdam ... | 194 | 7/31/2017 | 7.7 | Hotel Arena | United Kingdom | My room was dirty and I was afraid to walk ba... | 210 | 1403 | Great location in nice surroundings the bar a... | 26 | 1 | 3.8 | [' Leisure trip ', ' Solo traveler ', ' Duplex... | 3 days | 52.360576 | 4.915968 |


## Exploratory Data Analysis
The data was already in a workable state, so little primary cleaning was needed.

- Basic Text Cleaning
- Dropped Null Values
- Basic Text and Dataframe Manipulation
- Some relationships were observed
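
The cleaning steps listed above can be sketched roughly as follows. This is an illustrative example on a small toy DataFrame, assuming that "basic text cleaning" means trimming whitespace and lower-casing; the function name `basic_clean` and the toy rows are hypothetical, and the notebook's actual steps may differ.

```python
import pandas as pd

# Toy rows in the same shape as the review data (values are illustrative)
toy = pd.DataFrame({
    "Negative_Review": [" My room was dirty ", "No Negative"],
    "Positive_Review": [" Great location ", " Staff were ok "],
    "lat": [52.360576, None],
    "lng": [4.915968, None],
})


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows containing nulls, then trim and lower-case the review text."""
    out = df.dropna().copy()
    for col in ["Negative_Review", "Positive_Review"]:
        out[col] = out[col].str.strip().str.lower()
    return out


cleaned = basic_clean(toy)
print(cleaned)
```

Working on a copy keeps the raw `hotels` DataFrame intact, so the effect of each cleaning step can be compared against the original.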

In [8]:
hotels.dtypes

Hotel_Address                                  object
Additional_Number_of_Scoring                    int64
Review_Date                                    object
Average_Score                                 float64
Hotel_Name                                     object
Reviewer_Nationality                           object
Negative_Review                                object
Review_Total_Negative_Word_Counts               int64
Total_Number_of_Reviews                         int64
Positive_Review                                object
Review_Total_Positive_Word_Counts               int64
Total_Number_of_Reviews_Reviewer_Has_Given      int64
Reviewer_Score                                float64
Tags                                           object
days_since_review                              object
lat                                           float64
lng                                           float64
dtype: object
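
The listing shows that **Review_Date** and **days_since_review** are loaded as `object` (plain strings). A possible conversion, assuming the month/day/year dates and the "N days" strings seen in the preview, might look like the sketch below; it uses a toy frame rather than the full dataset.

```python
import pandas as pd

# Values copied from the preview rows
toy = pd.DataFrame({
    "Review_Date": ["8/3/2017", "7/31/2017"],
    "days_since_review": ["0 days", "3 days"],
})

# Parse month/day/year strings into proper timestamps
toy["Review_Date"] = pd.to_datetime(toy["Review_Date"], format="%m/%d/%Y")

# Keep only the leading number of days and store it as an integer
toy["days_since_review"] = (
    toy["days_since_review"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(toy.dtypes)
```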

In [9]:
hotels.isnull().sum().to_frame()

| | 0 |
|---|---|
| Hotel_Address | 0 |
| Additional_Number_of_Scoring | 0 |
| Review_Date | 0 |
| Average_Score | 0 |
| Hotel_Name | 0 |
| Reviewer_Nationality | 0 |
| Negative_Review | 0 |
| Review_Total_Negative_Word_Counts | 0 |
| Total_Number_of_Reviews | 0 |
| Positive_Review | 0 |


In [10]:
hotels.describe()

| | Additional_Number_of_Scoring | Average_Score | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | lat | lng |
|---|---|---|---|---|---|---|---|---|---|
| count | 515738.0 | 515738.0 | 515738.0 | 515738.0 | 515738.0 | 515738.0 | 515738.0 | 512470.0 | 512470.0 |
| mean | 498.081836 | 8.397487 | 18.53945 | 2743.743944 | 17.776458 | 7.166001 | 8.395077 | 49.442439 | 2.823803 |
| std | 500.538467 | 0.548048 | 29.690831 | 2317.464868 | 21.804185 | 11.040228 | 1.637856 | 3.466325 | 4.579425 |
| min | 1.0 | 5.2 | 0.0 | 43.0 | 0.0 | 1.0 | 2.5 | 41.328376 | -0.369758 |
| 25% | 169.0 | 8.1 | 2.0 | 1161.0 | 5.0 | 1.0 | 7.5 | 48.214662 | -0.143372 |
| 50% | 341.0 | 8.4 | 9.0 | 2134.0 | 11.0 | 3.0 | 8.8 | 51.499981 | 0.010607 |
| 75% | 660.0 | 8.8 | 23.0 | 3613.0 | 22.0 | 8.0 | 9.6 | 51.516288 | 4.834443 |
| max | 2682.0 | 9.8 | 408.0 | 16670.0 | 395.0 | 355.0 | 10.0 | 52.400181 | 16.429233 |
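
The `count` row above is lower for **lat** and **lng** than for the other columns, which suggests that some rows lack coordinates. The arithmetic can be checked quickly, using only the counts reported by `describe()`:

```python
# Counts taken from the describe() output
total_rows = 515_738        # count for every column other than lat/lng
rows_with_coords = 512_470  # count for lat and lng

missing = total_rows - rows_with_coords
share = missing / total_rows

print(f"{missing} rows without coordinates ({share:.2%} of the data)")
```

Because the share is well under one percent, dropping these rows removes very little data.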


In [11]:
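# numeric_only=True skips the text columns, which newer pandas versions refuse to average
hotels.groupby('Hotel_Name').mean(numeric_only=True)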

| Hotel_Name | Additional_Number_of_Scoring | Average_Score | Review_Total_Negative_Word_Counts | Total_Number_of_Reviews | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given | Reviewer_Score | lat | lng |
|---|---|---|---|---|---|---|---|---|---|
| 11 Cadogan Gardens | 101.0 | 8.7 | 15.528302 | 393.0 | 19.974843 | 7.226415 | 8.845283 | 51.493616 | -0.159235 |
| 1K Hotel | 69.0 | 7.7 | 24.932432 | 663.0 | 15.601351 | 9.141892 | 7.861486 | 48.863932 | 2.365874 |
| 25hours Hotel beim MuseumsQuartier | 391.0 | 8.8 | 16.161103 | 4324.0 | 21.911466 | 8.722787 | 8.983309 | 48.206474 | 16.35463 |
| 41 | 66.0 | 9.6 | 8.883495 | 244.0 | 25.300971 | 6.009709 | 9.71165 | 51.498147 | -0.143649 |
| 45 Park Lane Dorchester Collection | 27.0 | 9.4 | 6.75 | 68.0 | 11.535714 | 7.214286 | 9.603571 | 51.506371 | -0.151536 |
| 88 Studios | 197.0 | 8.4 | 23.936819 | 955.0 | 21.464052 | 7.427015 | 8.489107 | 51.499279 | -0.209073 |
| 9Hotel Republique | 100.0 | 8.8 | 16.95082 | 857.0 | 19.338798 | 8.63388 | 8.743716 | 48.870842 | 2.360586 |
| A La Villa Madame | 24.0 | 8.8 | 8.463415 | 185.0 | 19.634146 | 5.829268 | 8.853659 | 48.848861 | 2.331526 |
| ABaC Restaurant Hotel Barcelona GL Monumento | 10.0 | 8.8 | 35.225806 | 111.0 | 18.677419 | 10.258065 | 8.464516 | 41.410694 | 2.136294 |
| AC Hotel Barcelona Forum a Marriott Lifestyle Hotel | 160.0 | 8.1 | 21.626298 | 1560.0 | 14.986159 | 7.6609 | 8.001384 | 41.410131 | 2.218805 |
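
The per-hotel means above put the mean **Reviewer_Score** in the dataset next to the hotel's published **Average_Score**. One way to make that comparison explicit is a named aggregation, sketched here on the four preview rows for Hotel Arena; the output column names (`mean_reviewer_score`, `average_score`, `n_reviews`, `score_gap`) are hypothetical and not taken from the notebook.

```python
import pandas as pd

# The four preview rows shown earlier, reduced to the relevant columns
preview = pd.DataFrame({
    "Hotel_Name": ["Hotel Arena"] * 4,
    "Reviewer_Nationality": ["Russia", "Ireland", "Australia", "United Kingdom"],
    "Reviewer_Score": [2.9, 7.5, 7.1, 3.8],
    "Average_Score": [7.7] * 4,
})

# One row per hotel: mean reviewer score, published average, review count
per_hotel = preview.groupby("Hotel_Name").agg(
    mean_reviewer_score=("Reviewer_Score", "mean"),
    average_score=("Average_Score", "first"),
    n_reviews=("Reviewer_Score", "size"),
)

# Negative gap: reviewers in this sample scored below the published average
per_hotel["score_gap"] = (
    per_hotel["mean_reviewer_score"] - per_hotel["average_score"]
)

print(per_hotel)
```

Selecting columns explicitly also avoids averaging fields such as **lat** and **lng**, which are constant within a hotel.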
