# Happy Hotel Code Challenge

In this project, we are given hotel reviews as free text. The data is semi-labeled: we know whether each review is good/positive or bad/negative. The hotel is a chain with many branches, and each branch wants to know its good and bad attributes so it can focus on improving. Since there are over 20 thousand reviews and each hotel may be reviewed differently on different categories, knowing the worst attributes of each branch is critical: general management will create working groups, and each branch will send representatives to the groups for the areas it needs to improve.

I will try to categorize the user reviews into multiple groups. After the categorization, I will evaluate the overall response for each hotel to see which categories could be improved.

### Load Packages

In [131]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import pandas as pd
import gensim
from gensim import corpora
import numpy as np
import nltk
nltk.download('stopwords')
import warnings
warnings.filterwarnings('ignore')


[nltk_data] Downloading package stopwords to /Users/apple/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load The Reviews 

In [171]:
good=pd.read_csv('good.csv')
bad=pd.read_csv('bad.csv')

In [172]:
good.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26521 entries, 0 to 26520
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   User_ID      26521 non-null  object
 1   Description  26521 non-null  object
 2   Is_Response  26521 non-null  object
 3   hotel_ID     26521 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 828.9+ KB


In [173]:
# Check for duplicate reviews by confirming that every User_ID is unique,
# in case there are problems with the database
print('Are entries unique?')
good.User_ID.nunique() == good.shape[0]

Are entries unique?


True

In [174]:
# Inspect the structure of the bad reviews
bad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12411 entries, 0 to 12410
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   User_ID      12411 non-null  object
 1   Description  12411 non-null  object
 2   Is_Response  12411 non-null  object
 3   hotel_ID     12411 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 388.0+ KB


In [175]:
# Check for duplicate reviews by confirming that every User_ID is unique,
# in case there are problems with the database
print('Are entries unique?')
bad.User_ID.nunique() == bad.shape[0]

Are entries unique?


True

Let's look at some example reviews to get a feel for the dataset, the type of reviews, and the level of grammar and spelling errors.

In [176]:
good.Description[0]

'Stayed here with husband and sons on the way to an Alaska Cruise. We all loved the hotel, great experience. Ask for a room on the North tower, facing north west for the best views. We had a high floor, with a stunning view of the needle, the city, and even the cruise ships! We ordered room service for dinner so we could enjoy the perfect views. Room service dinners were delicious, too! You are in a perfect spot to walk everywhere, so enjoy the city. Almost forgot- Heavenly beds were heavenly, too!'

In [177]:
bad.Description[0]

"The room was kind of clean but had a VERY strong smell of dogs. Generally below average but ok for a overnight stay if you're not too fussy. Would consider staying again if the price was right. Breakfast was free and just about better than nothing."

## What is LDA: Latent Dirichlet Allocation? 

LDA is a topic model that infers topics from word frequencies in a set of documents. It assumes that the documents are generated from a small number of topics.

Each document may consist of multiple topics, and each topic is characterized by the frequent use of a subset of words.

LDA can return both the documents that belong to a topic in the corpus and the words that belong to a topic. It is a probabilistic graphical model. It takes a bag-of-words representation as input and produces two smaller matrices: a document-to-topic matrix and a word-to-topic matrix. Multiplied together, they approximately reconstruct the bag of words with the lowest error.
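
The two-matrix view can be made concrete with a toy example. The sketch below is not LDA itself: the numbers are invented for illustration, and the word-to-topic matrix is stored with one row per topic. It only shows how multiplying the two matrices gives an expected word distribution for each document.

```python
# Toy numbers, invented for illustration: 2 documents, 2 topics, 3 words.
vocab = ["breakfast", "parking", "staff"]

# Document-to-topic matrix: one row per document, each row sums to 1.
doc_topic = [
    [0.9, 0.1],  # document 0 is mostly topic 0
    [0.2, 0.8],  # document 1 is mostly topic 1
]

# Word-to-topic matrix, stored as one row per topic; each row sums to 1.
topic_word = [
    [0.7, 0.1, 0.2],  # topic 0 favours "breakfast"
    [0.1, 0.6, 0.3],  # topic 1 favours "parking"
]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Expected word distribution of each document under these two matrices.
doc_word = matmul(doc_topic, topic_word)

for row in doc_word:
    print({word: round(p, 2) for word, p in zip(vocab, row)})
```

Each row of `doc_word` is again a probability distribution over the vocabulary, which is what gets compared against the observed word counts.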

I will use LDA to categorize the reviews into groups. To work with text, the words need to be vectorized. Before that, stop words and punctuation need to be removed, words need to be lemmatized (reduced to their root form), and for simplicity everything is converted to lower case. The "clean" function below does all of this.

In [178]:
# Note: WordNetLemmatizer needs the WordNet corpus (nltk.download('wordnet')).
stop_words = set(stopwords.words('english'))
punc = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(review):
    # lower-case and remove stop words
    stop_free = " ".join(i for i in review.lower().split() if i not in stop_words)
    # remove punctuation
    punc_free = ''.join(ch for ch in stop_free if ch not in punc)
    # lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
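
One detail to watch in the cleaning order: stop words are removed before punctuation is stripped. A stop word with punctuation attached (for example "but,") therefore does not match the stop-word list, and it survives once the punctuation is gone. The sketch below illustrates the effect. It uses a tiny hard-coded stop-word set as a stand-in for NLTK's list and skips lemmatization, so it is an illustration rather than a replacement for the function above.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption, illustration only).
STOP_WORDS = {"the", "was", "but", "a", "and"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def stop_then_punct(review):
    """Same order as the cleaning step above: drop stop words, then strip punctuation."""
    kept = [w for w in review.lower().split() if w not in STOP_WORDS]
    return " ".join(kept).translate(PUNCT_TABLE).split()

def punct_then_stop(review):
    """Alternative order: strip punctuation first, then drop stop words."""
    words = review.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w not in STOP_WORDS]

sample = "Great location but, sadly, the breakfast was cold."
print(stop_then_punct(sample))  # "but" leaks through because it was "but," when filtered
print(punct_then_stop(sample))
```

Whether the leaked stop words matter depends on how often they occur. Swapping the two steps is a small change if they start showing up among the top terms of a topic.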

In [179]:
# Apply clean function to both good and bad reviews
good_cleaned = [clean(review).split() for review in good.Description]  
bad_cleaned = [clean(review).split() for review in bad.Description]     

In [180]:
# Combine all the reviews for processing, since both good and bad reviews are expected to share same categories. 
all_reviews= good_cleaned + bad_cleaned

In [181]:
#review_batch1= all_reviews[0:int(8*len(all_reviews)/10)]

In [182]:
# Create the term dictionary of the corpus, where every unique term is assigned an integer id.
dictionary = corpora.Dictionary(all_reviews)

# Convert each review into a sparse bag of words using the dictionary above:
# a list of (term id, count) pairs, one list per document.
term_matrix = [dictionary.doc2bow(review) for review in all_reviews]
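
To make the bag-of-words representation concrete, here is a rough pure-Python stand-in for what the dictionary and doc2bow step compute. The token ids are assigned in first-seen order, which is an assumption of this sketch and may differ from the ids gensim assigns. The shape of the result, a list of (id, count) pairs per document, is the point.

```python
from collections import Counter

# Two already-cleaned, tokenized toy reviews (invented for illustration).
docs = [
    ["room", "clean", "room", "quiet"],
    ["breakfast", "cold", "room"],
]

# Assign each unique token an integer id in first-seen order.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs; tokens not in the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Only tokens that actually occur in a document appear in its list, which keeps the representation small even when the vocabulary is large.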

In [144]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

In [167]:
# Train the LDA model on the document-term matrix.
ldamodel = Lda(term_matrix, num_topics=6, id2word = dictionary, passes=10)

In [168]:
# Save the model to disk because it takes time to rerun the algorithm.
ldamodel.save('6topic10pass.gensim')

In [145]:
ldamodel = Lda.load('6topic10pass.gensim')

In [186]:
#ldamodel.print_topics()

## Topic Visualization

Once the LDA model has run, the words are separated into topics, and each topic is represented by its most relevant terms (words). The model only groups the words together; the interpretation is up to the person analyzing the data.

To visualize which topics contain which words, I use pyLDAvis:

In [146]:
import pyLDAvis.gensim_models

In [147]:
#lda = gensim.models.ldamodel.LdaModel.load('7topic15pass.gensim')
#import pyLDAvis.gensim
lda_display = pyLDAvis.gensim_models.prepare(ldamodel, term_matrix, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

## Categories 

The model cannot tell you how many topics are in the documents, so a good approach is to start from an educated guess. The hotel review sites I looked at usually ask reviewers to rate 5-8 areas. I therefore started with 10 topics and reduced the number to 6, because with 10 there were many overlaps between topics. Based on the distributions in the visualization above, I suggest the following categories (numbered from 1 as in the visualization; the code below uses 0-based topic indices):

6- Food

5- Room Quality / Amenities

4- Parking / Transportation

3- Staff / Quality of Service

2- Location / Transportation

1- Special Events / Services

In [148]:
categories = {0:"Special Event Services", 1:'Location', 2:'Service Quality' , 3:'Parking/Transportation', 4:'Room Quality', 5: 'Food' }
categories = pd.DataFrame.from_dict(categories, orient='index', columns=['review_category']).reset_index().rename(columns={'index':'topics'})

### Analysis of Review Topics per Hotel

We need to apply the LDA model to the reviews and assign the categories to see which hotels have good and bad reviews in which category. 

### Assign topics to good reviews

In [149]:
term_matrix = [dictionary.doc2bow(review) for review in good_cleaned]

In [150]:
good_topics = ldamodel[term_matrix] # get topic probability distribution for a document

<gensim.interfaces.TransformedCorpus at 0x1a653e1a20>

In [152]:
topics=[]
for i in range(len(good_cleaned)):
    topics.append(sorted(np.array(good_topics[i]), key = lambda x: x[1],reverse=True)[0][0])
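
The loop above picks the most probable topic for each review. Because each result is passed through `np.array`, the integer topic id is cast to a float, which is why the `topics` column later shows values such as 5.0. A minimal sketch of the same selection that keeps the ids as integers follows. The `doc_topics` list is a hypothetical stand-in, shaped like the model output as lists of (topic_id, probability) pairs, with invented numbers.

```python
# Hypothetical per-review output: a list of (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.05), (2, 0.15), (5, 0.80)],
    [(1, 0.55), (4, 0.45)],
    [(3, 1.00)],
]

def dominant_topic(pairs):
    """Return the integer id of the most probable topic."""
    topic_id, _prob = max(pairs, key=lambda pair: pair[1])
    return topic_id

topics = [dominant_topic(pairs) for pairs in doc_topics]
print(topics)
```

Using `max` instead of a full sort does the same job in one pass over each list.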

In [153]:
good['topics']= topics
good = good.merge(categories, on='topics')

In [154]:
good.query('topics==5').head()

| | User_ID | Description | Is_Response | hotel_ID | topics | review_category |
|---|---|---|---|---|---|---|
| 25844 | id10370 | Comfortable rooms,a little glitzy for my taste... | happy | 3 | 5.0 | Food |
| 25845 | id10497 | We had to stay here for almost - weeks while m... | happy | 3 | 5.0 | Food |
| 25846 | id10558 | Lovely indoor pool area, looks like a lodge. S... | happy | 7 | 5.0 | Food |
| 25847 | id10563 | I went on a --night business trip to Chicago a... | happy | 4 | 5.0 | Food |
| 25848 | id10645 | The service was prompt and excellent. The room... | happy | 1 | 5.0 | Food |


In [155]:
print(good.review_category[25843])
print((good.Description[25843]))

Service Quality
We arrived late at night and walked in to a check-in area that had been completely flooded. There were fans running everywhere and water damage on the ceilings and walls. The computers didn't work at the front desk and they had to work from a computer in the back office. I only mention this because with these conditions,it would have been understandable if the two women behind the front desk had been unfriendly, but it was just the opposite. They were so nice and checked us in as quickly as possible.
The room was nice and clean. Breakfast was okay. Plenty of choices, but we got to breakfast kind of late . The biscuits were hard and the fruit seemed old. I'm sure it would have been better if we had arrived to breakfast earlier. 
I'd definitely stay here again.


### Assign topics to bad reviews

In [156]:
term_matrix_2 = [dictionary.doc2bow(review) for review in bad_cleaned]
bad_topics = ldamodel[term_matrix_2]  # get topic probability distribution for a document

In [157]:
topics=[]
for i in range(len(bad_cleaned)):
    topics.append(sorted(np.array(bad_topics[i]), key = lambda x: x[1],reverse=True)[0][0])

In [158]:
bad['topics']= topics
bad = bad.merge(categories, on='topics')

In [159]:
bad.head()

| | User_ID | Description | Is_Response | hotel_ID | topics | review_category |
|---|---|---|---|---|---|---|
| 0 | id10326 | The room was kind of clean but had a VERY stro... | not happy | 3 | 4.0 | Room Quality |
| 1 | id10327 | I stayed at the Crown Plaza April -- - April -... | not happy | 9 | 4.0 | Room Quality |
| 2 | id10328 | I booked this hotel through Hotwire at the low... | not happy | 3 | 4.0 | Room Quality |
| 3 | id10332 | My husband and I have stayed in this hotel a f... | not happy | 7 | 4.0 | Room Quality |
| 4 | id10355 | The public areas are nice to look at. The staf... | not happy | 1 | 4.0 | Room Quality |


## Group reviews by hotel ID

In [160]:
# Count the good reviews per hotel and category
grouped_good = good.groupby(['hotel_ID','review_category'])[['Is_Response']].count().rename(columns={'Is_Response':'review_count'})



In [161]:
grouped_good.head(10)

| hotel_ID | review_category | review_count |
|---|---|---|
| 1 | Food | 53 |
| 1 | Location | 515 |
| 1 | Parking/Transportation | 501 |
| 1 | Room Quality | 321 |
| 1 | Service Quality | 162 |
| 1 | Special Event Services | 627 |
| 2 | Food | 32 |
| 2 | Location | 245 |
| 2 | Parking/Transportation | 252 |
| 2 | Room Quality | 147 |


In [162]:
# Count the bad reviews per hotel and category
grouped_bad = bad.groupby(['hotel_ID','review_category'])[['Is_Response']].count().rename(columns={'Is_Response':'review_count'})
grouped_bad.head(10)

| hotel_ID | review_category | review_count |
|---|---|---|
| 1 | Food | 20 |
| 1 | Location | 200 |
| 1 | Parking/Transportation | 127 |
| 1 | Room Quality | 607 |
| 1 | Service Quality | 747 |
| 1 | Special Event Services | 49 |
| 2 | Food | 11 |
| 2 | Location | 94 |
| 2 | Parking/Transportation | 96 |
| 2 | Room Quality | 344 |


In [163]:
all_reviews = grouped_good.merge(grouped_bad, left_index=True, right_index=True, suffixes=('_good','_bad'))

In [164]:
all_reviews['bad_to_good_ratio'] = all_reviews['review_count_bad'] / all_reviews['review_count_good']

In [165]:
all_reviews.query('hotel_ID==10')

| hotel_ID | review_category | review_count_good | review_count_bad | bad_to_good_ratio |
|---|---|---|---|---|
| 10 | Food | 30 | 4 | 0.133333 |
| 10 | Location | 193 | 78 | 0.404145 |
| 10 | Parking/Transportation | 166 | 60 | 0.361446 |
| 10 | Room Quality | 129 | 245 | 1.899225 |
| 10 | Service Quality | 65 | 322 | 4.953846 |
| 10 | Special Event Services | 194 | 25 | 0.128866 |
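
The ratio column is plain arithmetic, so it can be checked by hand. The sketch below recomputes it for hotel 10 from the counts in the table above and ranks the categories from worst to best. It uses plain Python instead of pandas, purely to make the calculation visible.

```python
# Counts for hotel 10, copied from the table above: (good, bad) per category.
hotel_10 = {
    "Food": (30, 4),
    "Location": (193, 78),
    "Parking/Transportation": (166, 60),
    "Room Quality": (129, 245),
    "Service Quality": (65, 322),
    "Special Event Services": (194, 25),
}

# bad_to_good_ratio = number of bad reviews / number of good reviews
ratios = {cat: bad / good for cat, (good, bad) in hotel_10.items()}

# Categories sorted from worst (highest ratio) to best (lowest ratio).
ranked = sorted(ratios, key=ratios.get, reverse=True)
for cat in ranked:
    print(f"{cat:<24} {ratios[cat]:.6f}")
```

For this hotel, Service Quality and Room Quality are the only categories with more bad reviews than good ones. These would be the natural candidates for the working groups.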


In [166]:
all_reviews.query('bad_to_good_ratio>2').head()

| hotel_ID | review_category | review_count_good | review_count_bad | bad_to_good_ratio |
|---|---|---|---|---|
| 1 | Service Quality | 162 | 747 | 4.611111 |
| 2 | Room Quality | 147 | 344 | 2.340136 |
| 2 | Service Quality | 72 | 445 | 6.180556 |
| 3 | Service Quality | 252 | 730 | 2.896825 |
| 4 | Service Quality | 342 | 939 | 2.745614 |


# Ideas: How to Improve

1- Get better-labeled data, or manually review a sample for testing purposes

2- Instead of using the whole review, split it into sentences and attribute multiple topics to each review

3- Run a sentiment analysis on the reviews first to replace the good/bad label with more granular scores

4- Hyperparameter tuning: number of topics, alpha (document-topic density), and beta (word-topic density, called eta in gensim)

5- Speed: use a gensim model that supports multicore processing, such as LdaMulticore
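
As a starting point for idea 2, a review can be split into sentences before topics are assigned. The sketch below uses a naive regular expression, which is an assumption of this example. It will mis-split abbreviations such as "St." or "a.m.", so a proper sentence tokenizer would be a better choice in practice.

```python
import re

def split_sentences(review):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]

# Sentences taken from one of the example reviews shown earlier.
review = "The room was nice and clean. Breakfast was okay. I'd definitely stay here again."

for sentence in split_sentences(review):
    print(sentence)
```

Each sentence could then be cleaned and scored on its own. A review that praises the staff but criticizes the breakfast would then contribute to both categories instead of only its dominant topic.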