# Capstone 3: Yelp's Cleanest Hotels

When users write reviews on Yelp, multiple factors go into their rating, and people prioritize different ones: whether the hotel is close to restaurants or shopping, whether it serves breakfast, whether it has a restaurant, and how far it is from the airport.

In this project we will sort through the user reviews and use natural language processing to find the highest-rated hotels whose reviews emphasize cleanliness.


## Import and clean the data

In [1]:
import pandas as pd
import numpy as np

In [50]:
import re
import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
reviews = pd.read_csv('tripadvisor_hotel_reviews.csv')
reviews.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [3]:
reviews.isna().sum()

Review    0
Rating    0
dtype: int64

To describe these ratings more aptly, we define three categories: excellent (5), satisfactory (3 and 4), and unsatisfactory (1 and 2).

In [4]:
#Create Columns for Excellent, Satisfactory, and Unsatisfactory ratings

reviews['Excellent'] = np.where(reviews['Rating'] == 5, 1, 0)
reviews['Satisfactory'] = np.where((reviews['Rating'] == 4) | (reviews['Rating'] == 3) , 1, 0)
reviews['Unsatisfactory'] = np.where((reviews['Rating'] == 2) | (reviews['Rating'] == 1) , 1, 0)

In [5]:
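# create one label column for feature engineering
# np.select takes the first matching condition, so order matters:
# ratings 1-2 match first, then 3-4, then 5
conditions = [
    (reviews['Rating'] <= 2),
    (reviews['Rating'] <= 4),
    (reviews['Rating'] == 5),
]

# create a list of the values we want to assign for each condition
values = ['unsat', 'satis', 'excel']

# create a new column and use np.select to assign values to it using our lists as arguments;
# an explicit string default keeps the result dtype consistent with the string labels
reviews['Label'] = np.select(conditions, values, default='unknown')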
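
The same three-way bucketing can also be expressed with `pd.cut`, which avoids relying on the order of conditions. This is a minimal sketch on toy ratings, not the notebook's method; the `ratings` series and the bin edges are assumptions chosen to match the definitions above.

```python
import pandas as pd

# Toy ratings standing in for reviews['Rating'] (illustrative data only)
ratings = pd.Series([1, 2, 3, 4, 5])

# Bins are right-inclusive: (0, 2] -> unsat, (2, 4] -> satis, (4, 5] -> excel
labels = pd.cut(ratings, bins=[0, 2, 4, 5], labels=['unsat', 'satis', 'excel'])
print(labels.tolist())
```

Because the bins are explicit intervals, a rating can fall into only one bucket regardless of how the labels are listed.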


In [6]:
reviews.head()

Unnamed: 0,Review,Rating,Excellent,Satisfactory,Unsatisfactory,Label
0,nice hotel expensive parking got good deal sta...,4,0,1,0,satis
1,ok nothing special charge diamond member hilto...,2,0,0,1,unsat
2,nice rooms not 4* experience hotel monaco seat...,3,0,1,0,satis
3,"unique, great stay, wonderful time hotel monac...",5,1,0,0,excel
4,"great stay great stay, went seahawk game aweso...",5,1,0,0,excel


In [7]:
excellentReviews = reviews.Excellent.sum()
satisfactoryReviews = reviews.Satisfactory.sum()
unsatisfactoryReviews = reviews.Unsatisfactory.sum()
allReviews = len(reviews)

#Double check assignments were done properly with simple T/F test
excellentReviews + satisfactoryReviews + unsatisfactoryReviews == allReviews

True

In [8]:
print('The percentage of excellent reviews is', "{0:.0%}".format(excellentReviews/allReviews) )
print('The percentage of satisfactory reviews is', "{0:.0%}".format(satisfactoryReviews/allReviews) )
print('The percentage of unsatisfactory reviews is', "{0:.0%}".format(unsatisfactoryReviews/allReviews) )

The percentage of excellent reviews is 44%
The percentage of satisfactory reviews is 40%
The percentage of unsatisfactory reviews is 16%
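
The class shares can also be read directly from a single label column with `value_counts(normalize=True)`. The sketch below uses a toy frame rather than the real `reviews` data, so the printed shares are illustrative only.

```python
import pandas as pd

# Toy labels standing in for reviews['Label'] (illustrative data only)
toy = pd.DataFrame({'Label': ['excel', 'excel', 'satis', 'satis', 'unsat']})

# normalize=True returns each class's share of rows instead of raw counts
shares = toy['Label'].value_counts(normalize=True)
print(shares.map("{:.0%}".format))
```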


## Clean the text

In [20]:
%%time
#tokenize and lower case words

tokenReview = [word_tokenize(review.lower()) for review in reviews['Review']]

CPU times: user 11.8 s, sys: 73.5 ms, total: 11.9 s
Wall time: 11.9 s


In [21]:
%%time

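# Build the stop-word set once, then filter tokens inside each review
# (each element of tokenReview is a list of tokens, not a single word)
stop_words = set(stopwords.words('english'))
noStops = [[t for t in tokens if t not in stop_words] for tokens in tokenReview]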

CPU times: user 2.34 s, sys: 673 ms, total: 3.02 s
Wall time: 3.02 s
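
A quick check on toy data shows why stop-word filtering has to happen per token: a whole token list is never equal to a single stop word, so a membership test at the review level keeps everything. The stop-word set below is a small hypothetical stand-in for `stopwords.words('english')`, and the toy reviews are made up for illustration.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list
stop_words = {'the', 'was', 'and', 'a'}

# Two tokenized toy reviews, shaped like tokenReview (a list of token lists)
token_reviews = [['the', 'room', 'was', 'clean'],
                 ['a', 'noisy', 'and', 'dirty', 'lobby']]

# Review-level test: each element is a list, so nothing is ever removed
review_level = [tokens for tokens in token_reviews if tokens not in list(stop_words)]

# Token-level test: stop words are dropped inside each review
token_level = [[t for t in tokens if t not in stop_words] for tokens in token_reviews]

print(review_level == token_reviews)
print(token_level)
```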


In [22]:
%%time

tags = [nltk.pos_tag(token) for token in noStops]

CPU times: user 2min 6s, sys: 1.1 s, total: 2min 7s
Wall time: 2min 7s


In [23]:
%%time

chunk = [ne_chunk(tag) for tag in tags]

CPU times: user 9min 53s, sys: 1.28 s, total: 9min 54s
Wall time: 9min 57s


In [39]:
%%time

chunkarray = np.array(chunk)

CPU times: user 1.41 s, sys: 21.9 ms, total: 1.43 s
Wall time: 1.44 s




In [40]:
#TRYING TO DO ML WITH CHUNK
# Create a series to store the labels: y
y = reviews.Label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(chunkarray, y, test_size = 0.33, random_state = 53)

In [51]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(X_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(X_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

#print confusion matrix
confusion_matrix = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
print (confusion_matrix)

ValueError: could not convert string to float: 'excellent little hotel husband stayed acte v hotel 3 nights happy accommodations, room clean comfortable quiet facing away rue monge, paul ana claudia desk helpful friendly, location great close notre dame metro stations rue mouffetard lots restaurants cafes, definitely stay,  '
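
The `ValueError` arises because `MultinomialNB` expects a numeric feature matrix, not raw strings or NLTK trees. The sketch below shows one way to guarantee that text is vectorized before it reaches the classifier. The `make_pipeline` wrapper and the toy reviews are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reviews and labels (illustrative only)
texts = ['room clean comfortable great stay',
         'dirty room rude staff terrible',
         'clean quiet friendly wonderful',
         'terrible dirty noisy awful']
labels = ['excel', 'unsat', 'excel', 'unsat']

# The vectorizer turns each string into token counts before Naive Bayes sees it
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict(['clean friendly stay', 'dirty awful staff'])
print(predictions)
```

Bundling the two steps means `fit` and `predict` accept raw strings, so the vectorizer cannot be skipped by accident.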

## Beginning Natural Language Processing

## Beginning Modeling


In [48]:
#COUNT VECTORIZER

# Print the head of df
print(reviews.head())

# Create a series to store the labels: y
y = reviews.Label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(reviews['Review'], y, test_size = 0.33, random_state = 53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words = 'english')

# Fit the vectorizer on the training reviews and transform them: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test reviews with the vocabulary learned from training: count_test
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
print(count_vectorizer.get_feature_names_out()[:10].tolist())

                                              Review  Rating  Excellent  \
0  nice hotel expensive parking got good deal sta...       4          0   
1  ok nothing special charge diamond member hilto...       2          0   
2  nice rooms not 4* experience hotel monaco seat...       3          0   
3  unique, great stay, wonderful time hotel monac...       5          1   
4  great stay great stay, went seahawk game aweso...       5          1   

   Satisfactory  Unsatisfactory  Label  
0             1               0  satis  
1             0               1  unsat  
2             1               0  satis  
3             0               0  excel  
4             0               0  excel  
['00', '000', '0001', '000__çî_', '000rp', '000rupiah', '000sf', '000us', '000year', '00a']
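
The first vocabulary entries are numeric or garbled tokens such as `'00'` and `'000__çî_'`. One possible way to drop them, sketched here on toy documents, is a `token_pattern` that accepts only alphabetic ASCII tokens. The pattern and the `docs` list are assumptions, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents containing numeric and mixed tokens (illustrative only)
docs = ['room 000rp clean 00', 'clean hotel 000us parking']

# Assumed pattern: whole tokens made of two or more ASCII letters
vec = CountVectorizer(stop_words='english', token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
vec.fit(docs)

vocab = sorted(vec.vocabulary_)
print(vocab)
```

A pattern this strict would also discard legitimately accented words, so it trades some coverage for a cleaner vocabulary.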


In [53]:
# TF-IDF MODEL

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10].tolist())

# Print the first 5 vectors of the tfidf training data
# (slice before densifying so only 5 rows are converted)
print(tfidf_train[:5].toarray())

['00', '000', '0001', '000__çî_', '000rp', '000rupiah', '000sf', '000us', '000year', '00a']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
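
The `max_df = 0.7` setting removes any term that appears in more than 70% of the training documents. A minimal sketch on toy documents (assumed data, not the review set) shows the effect on a word that occurs in every document:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'hotel' appears in all 4 toy documents; 'clean' appears in 2 of them
docs = ['hotel clean room', 'hotel noisy street',
        'hotel great breakfast', 'hotel clean lobby']

vec = TfidfVectorizer(max_df=0.7)
vec.fit(docs)

# Document frequency of 'hotel' is 1.0 > 0.7, so it is dropped; 'clean' (0.5) stays
print('hotel' in vec.vocabulary_)
print('clean' in vec.vocabulary_)
```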


In [54]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))


   00  000  0001  000__çî_  000rp  000rupiah  000sf  000us  000year  00a  ...  \
0   0    0     0         0      0          0      0      0        0    0  ...   
1   0    0     0         0      0          0      0      0        0    0  ...   
2   0    0     0         0      0          0      0      0        0    0  ...   
3   0    0     0         0      0          0      0      0        0    0  ...   
4   0    0     0         0      0          0      0      0        0    0  ...   

   ù_  ùn  ùtico  ûan  ü_e  üescribed  üifficult  üâjili  üè  üè__  
0   0   0      0    0    0          0          0       0   0     0  
1   0   0      0    0    0          0          0       0   0     0  
2   0   0      0    0    0          0          0       0   0     0  
3   0   0      0    0    0          0          0       0   0     0  
4   0   0      0    0    0          0          0       0   0     0  

[5 rows x 42401 columns]
    00  000  0001  000__çî_  000rp  000rupiah  000sf  000us  000year  00a

In [55]:
# Import the necessary modules
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Print the confusion matrix
confusion_matrix = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix)

0.7056040218837794
Predicted  excel  satis  unsat
Actual                        
excel       2391    611     17
satis        857   1691    139
unsat         30    337    690


In [56]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Print the confusion matrix
confusion_matrix = pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])
print(confusion_matrix)

0.6256099364187491
Predicted  excel  satis  unsat
Actual                        
excel       2468    551      0
satis        939   1748      0
unsat         57    985     15
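
Accuracy alone hides how differently the two models treat each class. The sketch below computes per-class recall from the two confusion matrices printed above. The counts are copied from those outputs, and the variable names are introduced here for illustration.

```python
import numpy as np

classes = ['excel', 'satis', 'unsat']

# Confusion matrices copied from the outputs above (rows = actual, columns = predicted)
count_cm = np.array([[2391, 611, 17], [857, 1691, 139], [30, 337, 690]])
tfidf_cm = np.array([[2468, 551, 0], [939, 1748, 0], [57, 985, 15]])

recalls = {}
for name, cm in [('count', count_cm), ('tfidf', tfidf_cm)]:
    # Recall per class: correct predictions divided by the actual class total
    recall = cm.diagonal() / cm.sum(axis=1)
    recalls[name] = {c: round(float(r), 3) for c, r in zip(classes, recall)}
    print(name, recalls[name])
```

On these numbers the TF-IDF model recovers only 15 of 1,057 unsatisfactory reviews (about 1%), against 690 (about 65%) for the count-based model, which accounts for much of the gap in overall accuracy.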
