# BUSI/COMP 488 Major Class Project: A Competitive Analysis of Reviews for Graduate Hotels
### by Anuttam Perumal, Hugh Williamson, Matthew Jaynes, Pranav Vinaayak, Emil Foldager Jensen, and Wali Khan

We start by importing the data from Drive. We chose to use both TripAdvisor and Google reviews in order to get as broad a sample as possible. Furthermore, the two review sources have similarly formatted dataframes, which makes cleaning relatively straightforward.

In [None]:
# import the necessary packages.
!pip install BERTopic
import numpy as np
import pandas as pd
import random

# connect your google drive.
from google.colab import drive
drive.mount('/content/drive')

# navigate to 'Final Project' directory.
%cd /content/drive/MyDrive/BUSI488 - Final Project

# list files in directory.
!ls

In [None]:
# navigate to the folder containing pre-scraped reviews from google and tpadvisor.
%cd /content/drive/MyDrive/BUSI488 - Final Project/Reviews
!ls

In [None]:
# read in all raw json files by hotel and source.
# the final dataframes for each set of reviews will have a different number of columns by source,
# e.g. google dataframes will have 13 columns and tpadvisor dataframes will have 19 columns.
# the dataframe shown below contains the 'raw', unprocessed contents of a json review file for a particular hotel and source. 

ace_all_google = pd.read_json('ace_google.json', lines=True)
ace_all_tpadvisor = pd.read_json('ace_all_tpadvisor.json', lines=True)

citizenm_all_google = pd.read_json('citizenm_all_google.json', lines=True)
citizenm_all_tpadvisor = pd.read_json('citizenm_all_tpadvisor.json', lines=True)

GraduateHotels_all_google = pd.read_json('GraduateHotels_google.json', lines=True)
GraduateHotels_all_tpadvisor = pd.read_json('GraduateHotels_tpadvisor.json', lines=True)

hoxton_all_google = pd.read_json('hoxton_google.json', lines=True)
hoxton_all_tpadvisor = pd.read_json('hoxton_all_tpadvisor.json', lines=True)

standard_all_google = pd.read_json('standard_google.json', lines=True)
standard_all_tpadvisor = pd.read_json('standard_all_tpadvisor.json', lines=True)

study_all_google = pd.read_json('study_google.json', lines=True)
study_all_tpadvisor = pd.read_json('study_tpadvisor.json', lines=True) 

# e.g. the raw dataframe for all Graduate Hotel reviews from google.

GraduateHotels_all_google

Now that we've read in our JSON in the most raw/rudimentary form possible, we can start to look at cleaning. First, we define a helper function that converts the 'raw' dataframes into dataframes that we can subset and use for our analysis below.

In [None]:
# helper function to clean up messy json files of scraped reviews.
# works on both google and tpadvisor jsons.
# ignores the first column of the direct json df because it's just an unnecessary index.
# outputs a neat, simplified dataframe.

def dirty_parse(dirty_df):
  col_data = []
  col_names = list(dirty_df.columns[1:])
  for column in dirty_df.columns[1:]:
    col_data.append(dirty_df[column])
  individual_cols = pd.DataFrame()
  for i in range(len(col_data)):
    col_values = col_data[i].values
    new = col_values[0]
    new_df_column = pd.DataFrame.from_dict(new, orient='index')
    individual_cols = pd.concat([individual_cols, new_df_column], axis = 1)
  individual_cols.columns = col_names
  return individual_cols
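
To make the input layout concrete, here is a minimal sketch of the same idea on toy data. It assumes (this is inferred from how `dirty_parse` indexes the frame, not confirmed by the files themselves) that each raw frame has a single row whose cells are dictionaries mapping row labels to values. The frame `raw`, its contents, and the function name `parse_nested` are hypothetical.

```python
import pandas as pd

# Hypothetical raw frame: one row, each cell is a dict of {row label: value}.
# This layout is an assumption about the scraped JSON files.
raw = pd.DataFrame([{
    "index": {"0": 0, "1": 1},
    "hotel": {"0": "Hotel A", "1": "Hotel B"},
    "text": {"0": "Great stay", "1": None},
}])


def parse_nested(raw_df):
    """Expand each dict-valued cell (skipping the first column) into a column."""
    columns = {
        name: pd.Series(raw_df[name].iloc[0])
        for name in raw_df.columns[1:]
    }
    return pd.DataFrame(columns)


tidy = parse_nested(raw)
print(tidy)
```

Each dictionary becomes one column, aligned on the shared row labels, which is what the concatenation along `axis = 1` in `dirty_parse` achieves.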


In [None]:
# run the dirty_parse() helper function on the original json dataframes.

ace_google = dirty_parse(ace_all_google)
ace_tp = dirty_parse(ace_all_tpadvisor)

citizen_google = dirty_parse(citizenm_all_google)
citizen_tp = dirty_parse(citizenm_all_tpadvisor)

GraduateHotels_google = dirty_parse(GraduateHotels_all_google)
GraduateHotels_tp = dirty_parse(GraduateHotels_all_tpadvisor)

hoxton_google = dirty_parse(hoxton_all_google)
hoxton_tp = dirty_parse(hoxton_all_tpadvisor)

standard_google = dirty_parse(standard_all_google)
standard_tp = dirty_parse(standard_all_tpadvisor)

study_google = dirty_parse(study_all_google)
study_tp = dirty_parse(study_all_tpadvisor)

In [None]:
# some google review dataframes still contain an unnecessary id column, which we will remove.

ace_google = ace_google.iloc[: , 1:]
GraduateHotels_google = GraduateHotels_google.iloc[: , 1:]
hoxton_google = hoxton_google.iloc[: , 1:]
standard_google = standard_google.iloc[: , 1:] 
study_google = study_google.iloc[: , 1:]

In [None]:
# we'll look at the Graduate Hotel google and tpadvisor dataframes for our exploration purposes.

GraduateHotels_google.head(100)

In [None]:

GraduateHotels_tp.head(3)

In [None]:
# note that the total length of this dataframe is 272,441 entries.

GraduateHotels_google.isna().sum()

In [None]:
# note that the total length of this DataFrame is 272,441 entries.

GraduateHotels_tp.isna().sum()

Next, we drop any review whose text field is null (i.e. missing), and we rename columns in the TripAdvisor review dataframes so that they match the Google ones more closely.

In [None]:
# we will remove any reviews not containing any text. 

GraduateHotels_google = GraduateHotels_google[GraduateHotels_google['text'].notna()]
GraduateHotels_tp = GraduateHotels_tp[GraduateHotels_tp['review text'].notna()]
ace_google = ace_google[ace_google['text'].notna()]
ace_tp = ace_tp[ace_tp['review text'].notna()]
citizen_google = citizen_google[citizen_google['text'].notna()]
citizen_tp = citizen_tp[citizen_tp['review text'].notna()]
hoxton_google = hoxton_google[hoxton_google['text'].notna()]
hoxton_tp = hoxton_tp[hoxton_tp['review text'].notna()]
standard_google = standard_google[standard_google['text'].notna()]
standard_tp = standard_tp[standard_tp['review text'].notna()]
study_google = study_google[study_google['text'].notna()]
study_tp = study_tp[study_tp['review text'].notna()]

# next, we rename some tpadvisor columns to match the format of the respective columns in the google DataFrame. 

GraduateHotels_tp = GraduateHotels_tp.rename(columns={"review text": "text", "hotel name": "hotel"})
ace_tp = ace_tp.rename(columns={"review text": "text", "hotel name": "hotel"})
citizen_tp = citizen_tp.rename(columns={"review text": "text", "hotel name": "hotel"})
hoxton_tp = hoxton_tp.rename(columns={"review text": "text", "hotel name": "hotel"})
standard_tp = standard_tp.rename(columns={"review text": "text", "hotel name": "hotel"})
study_tp = study_tp.rename(columns={"review text": "text", "hotel name": "hotel"})


In [None]:
# combine the Google and TripAdvisor reviews for each hotel.
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead.

GraduateHotels_tp_text = GraduateHotels_tp[['hotel', 'text']]
GraduateHotels_google_text = GraduateHotels_google[['hotel', 'text']]
GraduateHotels_alltext = pd.concat([GraduateHotels_tp_text, GraduateHotels_google_text])

ace_tp_text = ace_tp[['hotel', 'text']]
ace_google_text = ace_google[['hotel', 'text']]
ace_alltext = pd.concat([ace_tp_text, ace_google_text])

citizen_tp_text = citizen_tp[['hotel', 'text']]
citizen_google_text = citizen_google[['hotel', 'text']]
citizen_alltext = pd.concat([citizen_tp_text, citizen_google_text])

hoxton_tp_text = hoxton_tp[['hotel', 'text']]
hoxton_google_text = hoxton_google[['hotel', 'text']]
hoxton_alltext = pd.concat([hoxton_tp_text, hoxton_google_text])

standard_tp_text = standard_tp[['hotel', 'text']]
standard_google_text = standard_google[['hotel', 'text']]
standard_alltext = pd.concat([standard_tp_text, standard_google_text])

study_tp_text = study_tp[['hotel', 'text']]
study_google_text = study_google[['hotel', 'text']]
study_alltext = pd.concat([study_tp_text, study_google_text])

# study_alltext.head(10)

Unnamed: 0,hotel,text
0,The Study at Yale,I really enjoyed our stay at The Study at Yale! It was very clean and the staff was extremely friendly. I just wish the rates would not be so high. I loved the environment and the décor; it felt like being in a library. Great place to stay as it is close to Yale University.
1,The Study at Yale,"Comfy bed, incredibly beautiful view, perfect service, awesome location, comfy bed, nicely equipped room with large windows, nice desk and chair, would come back without hesitation. The view is the best part of it!"
2,The Study at Yale,Always wonderful. I get my best nights' sleep. Thanks to all of you who make our stays so enjoyable. We will be spending many more days with you as our granddaughter continues her amazing education in New Haven.
3,The Study at Yale,"We are regular guests at the Study and the staff were helpful and our stay was pleasant except for our room assignment. Room 702 is significantly smaller than the standard queen sized room, has a significantly smaller shower, and has less furniture than is shown in the images on your website. I ..."
4,The Study at Yale,"Stylish (elegant simplicity), great location for visiting Yale again . this was the perfect base from which to revisit old familiar places on campus. Wonderful lounge/ lobby. We will definitely return!"
5,The Study at Yale,"Excellent, welcoming service. Love the fresh fruit that's always available in the lobby, as well as the complementary coffee and tea every morning. . Extremely convenient location (and free parking in the hotel's own garage). Very comfortable bed/rooms."
6,The Study at Yale,I really love this hotel it was such a wonderful experience to be there and I loved the views of the campus I loved that they had free pencils on the desk that was sharpened I loved the little reading books along each part of the hotel I deeply recommend this as a retreat especially for people w...
7,The Study at Yale,"Very friendly staff, excellent service. Great location. Dog friendly, good breakfast. Great place to bring friends for a cocktail before dinner. Fun to watch the at Patrick’s day parade from the bar!"
8,The Study at Yale,"Extremely well-located if you are visiting Yale. Comfortable, spacious rooms with great light and air. I believe The Study is the only hotel in New Haven where you can actually open the windows and breath the fresh air!"
9,The Study at Yale,"The hotel has a privileged location amid all points of interest. Very nice staff, always ready to cooperate. Parking on site is a must. The rooms are comfortable and well furnished. A fridge would have been an asset."
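
The drop, rename, and combine steps above are repeated once per hotel brand. As a hedged sketch of how they could be folded into one helper, the function below applies the same three steps to a pair of frames. The name `combine_sources` and the toy frames are hypothetical, and `ignore_index=True` is an added choice that gives the combined frame a fresh 0..n-1 index.

```python
import pandas as pd


def combine_sources(tp_df, google_df):
    """Drop reviews without text, align column names, and stack both sources."""
    tp = tp_df.rename(columns={"review text": "text", "hotel name": "hotel"})
    frames = [
        df.loc[df["text"].notna(), ["hotel", "text"]]
        for df in (tp, google_df)
    ]
    return pd.concat(frames, ignore_index=True)


# Toy frames standing in for one brand's TripAdvisor and Google reviews.
tp_demo = pd.DataFrame({
    "hotel name": ["Hotel A", "Hotel A"],
    "review text": ["Lovely lobby", None],
})
google_demo = pd.DataFrame({"hotel": ["Hotel A"], "text": ["Quiet rooms"]})

combined = combine_sources(tp_demo, google_demo)
print(combined)
```

A fresh index matters later on, because any code that looks reviews up by position (for example `sent[i]`) would otherwise hit missing or duplicated labels.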


## Classification with BERT

As a starting point, we wanted to be able to classify whether reviews belonged to the 'business' class, or the 'student' class. This allows us to dive deeper into our analysis in later portions of this notebook, where we split the hotel reviews up by franchise to analyze sentiment and topic discovery. 

When creating these steps and workflow, we took inspiration from the Class 26 lecture notes, along with the accompanying pickle file that was used to fine-tune the classifier.

In [None]:
# install and import the necessary packages.

!pip install transformers
!pip install pickle5
from tqdm import tqdm
import pickle5 as pickle
import numpy as np
import pandas as pd
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from torch.utils.data import TensorDataset, DataLoader





In [None]:
# change the directory back to the main folder in order to access the pre-trained, fine-tuned model.

%cd /content/drive/MyDrive/BUSI488 - Final Project
output_dir = 'model_1'

# assign model name.

model_name = 'distilbert-base-cased'

# load tokenizer.

tokenizer = AutoTokenizer.from_pretrained(model_name)

# we have two classes in our dataset (business and school).

numclasses = 2

# instantiate the model.

model = AutoModelForSequenceClassification.from_pretrained(
		output_dir,
		output_hidden_states=False,
		output_attentions=False,
		num_labels=numclasses
		)

/content/drive/MyDrive/BUSI488 - Final Project


loading configuration file https://huggingface.co/distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.18.0",
  "vocab_size": 28996
}

loading file https://huggingface.co/distilbert-base-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/ba377304984dc63e3ede0e23a938bbbf04d5c3835b66d5bb48343aecca188429.4

In [None]:
# set the maximum padding length value to 75 tokens.
# this saves RAM during testing and training.

max_length = 75
padding_type = 'max_length'

The following helper function applies tokenization and padding to each review in the dataframe before passing it on to our classifier model. Padding (and truncating) every review to the same length means that all of the token matrices share the same dimensions, which reduces the compute power needed to run the model on Colab.

In [None]:
# function to tokenize the sentences and return tokens and padding.
# taken from class 26 lecture notes.

def tokenize_sentences(sent):
  input_ids = []
  attention_mask = []

  # materialize the input as a plain list so that positional access works
  # even when a pandas Series with a non-default index is passed in.
  sentences = list(sent)

  for i in tqdm(range(len(sentences))):
    sentence = sentences[i]
    encoded = tokenizer.encode_plus(text=sentence,
                                    add_special_tokens=True,
                                    padding=padding_type,
                                    max_length=max_length,
                                    truncation=True,
                                    return_token_type_ids=True,
                                    return_tensors='pt')

    input_ids.append(encoded['input_ids'])
    attention_mask.append(encoded['attention_mask'])

  # stack the per-review tensors into two (n_reviews, max_length) tensors.
  input_ids = torch.cat(input_ids, dim=0)
  attention_mask = torch.cat(attention_mask, dim=0)

  return input_ids, attention_mask
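
To illustrate what padding and truncation to a fixed length produce, here is a minimal pure-Python sketch. It is not the Hugging Face implementation: the token ids are made up, and a real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), each exactly max_length entries long."""
    ids = token_ids[:max_length]          # cut off anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Hypothetical token ids for a short and a long review.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why `tokenize_sentences` returns it alongside the ids.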

In [None]:
# prepare new sentences.

test_input_ids, test_attention_mask = tokenize_sentences(GraduateHotels_alltext['text'])

# wrap the tensors in a PyTorch dataset and data loader so they can be fed to the model in batches.

test_array = TensorDataset(test_input_ids, test_attention_mask)
test_loader = DataLoader(test_array, batch_size=8)

100%|██████████| 30123/30123 [00:30<00:00, 981.39it/s] 


In [None]:
# define the training arguments.

training_args = TrainingArguments(
		output_dir=' ',
		logging_dir=' ',
		)

# load and set the Trainer class from huggingface.

trainer = Trainer(
		model=model,
		args=training_args,
		train_dataset=test_loader,
		eval_dataset=test_loader
		)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
##do not run

# model set to evaluation mode.

model.eval()

# CRUCIAL: use the GPU if generating csv files for the first time.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# make sure the model sits on the same device as the input tensors.

model.to(device)

# store the predictions in lists.

test_pred_labels = []
test_pred_scores = []
with torch.no_grad():
    for input_ids, attn_mask in test_loader:
        input_ids = input_ids.to(device)
        attn_mask = attn_mask.to(device)
        outputs = model(input_ids, attn_mask)
        outputs = outputs['logits']
        test_pred_labels.extend(torch.argmax(outputs, 1).cpu().detach().numpy().tolist())
        test_pred_scores.extend(torch.max(torch.softmax(outputs, 1), 1)[0].cpu().detach().numpy().tolist())
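
The two `extend` calls above turn each row of logits into a predicted label (the index of the largest logit) and a score (the largest softmax probability). A minimal pure-Python sketch of that computation follows; the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (label, score): the most probable class and its probability."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


label, score = predict([0.2, 1.7])   # hypothetical logits for one review
print(label, round(score, 4))
```

With only two classes the winning probability can never fall below 0.5, which is why the sorted score columns shown below start just above 0.5.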

In [None]:
# store the predictions for sentences and their labels in a new dataframe.

df = pd.DataFrame({'sentences': GraduateHotels_alltext['text'], 
                   'predicted_labels': test_pred_labels,
                   'predicted_scores': test_pred_scores
                   })

In [None]:
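# let's see which reviews our fine-tuned model assigned to label 1, sorted from least to most confident.
# (label 0 is treated as the college class later in this notebook, so label 1 is the business class.)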

pd.set_option('max_colwidth', 300)
display(df[df.predicted_labels==1].sort_values(by=['predicted_scores']))

Unnamed: 0,sentences,predicted_labels,predicted_scores
3901,We stayed here on our visit to UNC and absolutely loved it. I was very pleasantly surprised by the amenities that this hotel offered. The business center was very convo enemy and the concierge was kind enough to set up a shuttle to the university for a class visit. The beds were very comfortable...,1,0.500187
10431,We stayed here for a tournament. We asked for a microwave and Matt brought it to us very quickly and set it up for us. He was also outstanding and very kind even though it was very late Saturday night. Thank you very much.,1,0.500300
6482,"We were in town for our Daughter to interview for the Masters Program at USC. We were there for 4 days, only got a room at the Inn for the last night. Had to book elsewhere the other days.",1,0.500328
10680,I was coming to Knoxville to attend UTK's orientation and booked at this hotel as the location was so convenient. Unfortunately the booking process and billing was not. When I booked the UTK's rate my booking dates were changed without me noticing until I had confirmed the booking. I immediately...,1,0.501571
13296,"A classic hotel in downtown Providence. The suites are so comfortable, the staff is great. A great location - walking distance to many fine restaurants. I certify that this review is my genuine opinion of this hotel, and that I have no personal or business affiliation with this establishme...",1,0.503707
...,...,...,...
8168,Went for a conference. The hotel staff did a really good job.,1,0.956970
12987,Great venue for a professional conference. Staff worked hard to have a smooth schedule.,1,0.957196
13151,Great venue for a professional conference. Staff worked hard to have a smooth schedule.,1,0.957196
10140,"Had a conference there, very snazzy.",1,0.958613


In [None]:
# let's see which reviews our fine-tuned model assigned to label 0 (the college class), sorted from least to most confident.

pd.set_option('max_colwidth', 300)
display(df[df.predicted_labels==0].sort_values(by=['predicted_scores']))

Unnamed: 0,sentences,predicted_labels,predicted_scores
2392,"I'm not a fan of being so close to campus, but I enjoyed my stay at the hotel. It was convenient for my work meeting; the room was cute and adequately clean. The bed was VERY comfortable. The room smelled musty upon check-in but with open windows and the fan going, it aired out nicely. The room ...",0,0.501131
9710,"I just called the hotel today to inquire about a reservation I made in February for August 2020. I was planning on staying there during the Madison Mini-Marathon as I had the previous year but due to the pandemic the race was cancelled. I knew when I booked they had a ""no refund"" policy and I wa...",0,0.501319
11705,Internet failed quite often.,0,0.502611
4562,"Unfortunately I am still dealing with this issue two months after the fact. I made a reservation three months ago for a one night stay in late June through Hotels.com. Well before the cancellation cut off for Hotels.com, I canceled my reservation at The Graduate through Hotels.com cause I found ...",0,0.502797
5236,"I had some meetings to attend at this hotel, and did not know much about Cincinnati, but this is in a revitalizing area of the City, in the middle of the University area. And near hospitals. The rooms were very clean and well appointed. Breakfast was delightful, with beautiful fresh fruits yo...",0,0.502869
...,...,...,...
8224,"Walking distance to shops, restaurants, town square. Spacious room, good linens, great service. Just avoid the 64/94 restaurant--go hungry, don't eat there, or walk a block to one of the good restaurants. The lobby bar is GOOD. It is separate from restaurant.",0,0.960260
8185,"Wished there was a hot tub, also the staff was really friendly and helpful",0,0.960263
12321,"The roof bar has a cool atmosphere, good drinks, decent food",0,0.960540
8730,"This hotel is very clean and has great staff. It was a bit pricey for what you get. 2 sodas and an apple juice from room service was $12 . The beds were ok but hated how the frame stuck out on the bottom. Hit my shin on it twice, my daughter three times. And if you do not like really firm pil...",0,0.960691


All of the predictions our classifier made are then written to a .csv file, so that later analyses can load the results instead of rerunning the model.

In [None]:
# output the predictions to csv files.
# this allows us to read in the predicted csv's as dataframes, which is quicker than instantiating and running the model every time.

FILE_NAME_TO_SAVE = 'GraduateHotels.csv' # Change the file name to whatever you like
df.to_csv(FILE_NAME_TO_SAVE, header=True, index=False)
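
The save-once, load-later pattern can be sketched as a small helper. This is an illustration under assumptions rather than part of the original workflow: `load_or_compute`, `fake_predictions`, and the demo file name are hypothetical, and the toy frame only mimics the three columns written above.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(path, compute):
    """Return the cached CSV at `path` if it exists, else compute and cache it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = compute()
    frame.to_csv(path, header=True, index=False)
    return frame


calls = []


def fake_predictions():
    # Hypothetical stand-in for the expensive model inference step.
    calls.append(1)
    return pd.DataFrame({
        "sentences": ["Had a conference there.", "Visited campus."],
        "predicted_labels": [1, 0],
        "predicted_scores": [0.95, 0.88],
    })


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "predictions_demo.csv")
    first = load_or_compute(csv_path, fake_predictions)    # runs the stand-in model
    second = load_or_compute(csv_path, fake_predictions)   # reads the cached csv

print(len(calls))
```

The second call never touches the stand-in model, which is the time saving the csv files are meant to provide.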

## Graduate Hotels Topic Discovery and Sentiment Analysis

Now that we have classified our reviews into two classes, we can turn to topic discovery and sentiment analysis. As before, the fine-tuned BERT classifier drives these analyses: its predicted labels determine which reviews are passed on to the topic model. Note that the order in which BERTopic is installed is crucial to the code working, due to dependency issues within BERTopic.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import numpy as np
# !pip install bertopic
from bertopic import BERTopic



We use a lambda function to remove stopwords and clean the reviews a little further before passing them on to our topic discovery model.
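
As a small, self-contained illustration of this filtering step, the sketch below uses a hypothetical handful of stopwords instead of the full NLTK list. It also shows one detail worth knowing: NLTK's English stopwords are all lower-case, so a filter that compares tokens as-is leaves capitalized words such as "The" in place. The `lowercase` flag is an optional variation, not something the notebook does.

```python
# Hypothetical subset of stopwords; the notebook uses NLTK's English list plus a few extras.
STOP_WORDS = {"the", "very", "we", "was", "and", "told", "said", "get"}


def remove_stopwords(text, stop_words=STOP_WORDS, lowercase=False):
    """Drop stopwords from a whitespace-tokenized review."""
    kept = []
    for word in text.split():
        key = word.lower() if lowercase else word   # optionally ignore case when matching
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)


review = "The staff was very friendly and we loved the lobby"
as_is = remove_stopwords(review)
case_insensitive = remove_stopwords(review, lowercase=True)

print(as_is)
print(case_insensitive)
```

Using a set rather than a list for the stopwords also makes each membership check constant-time, which helps when filtering thousands of reviews.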

In [None]:
# import libraries.

import numpy as np
from bertopic import BERTopic 
import nltk
from nltk.corpus import stopwords

# instantiate the model. 

model3 = BERTopic(verbose = True, nr_topics = "auto",calculate_probabilities=True)

# navigate to the directory where the prediction csv files are stored.

%cd /content/drive/MyDrive/BUSI488 - Final Project/CSVs
gradcsv = pd.read_csv("GraduateHotels.csv")
gradcsv_college = gradcsv[gradcsv["predicted_labels"] == 0]
gradcsv_college_first5k = gradcsv_college.sample(n=15000)

nltk.download('stopwords')
stop_words = stopwords.words('english') + ['the','very','we','told','said','get']
gradcsv_college_first5k['sentences'] = gradcsv_college_first5k['sentences'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

# pass in our reviews as a list.

docs4 = gradcsv_college_first5k['sentences'].tolist()

# fit model to data to predict topics.
topics, probabilities = model3.fit_transform(docs4)


Now that the model has run, we can start to look at the topics it identified for this franchise.

In [None]:
# see how many reviews were assigned to each topic.
model3.get_topic_freq()

In [None]:
# we can also find the words that make up a topic, along with each word's c-TF-IDF score (how representative the word is of that topic).

model3.get_topic(64)


In [None]:
# interactively explore the discovered topics.
model3.visualize_topics()

In [None]:
# generates a bar chart of the most relevant words per topic. 
model3.visualize_barchart()

In [None]:
# creates an interactive heatmap.
model3.visualize_heatmap()

In [None]:
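# find_topics returns the topic numbers most similar to the search term, together with their similarity scores.
# in the dataframe below, row 0 holds the topic numbers and row 1 the similarity scores; the most relevant topic has the highest score.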

pd.DataFrame(model3.find_topics("wifi"))

In [None]:
# save model (we name the file after the hotel franchise name).

model3.save("Graduation test model")

# load model (now we name the model accordingly).

graduate_model = BERTopic.load("Graduation test model")

# check loaded model.

graduate_model.visualize_barchart()

In [None]:
model3.visualize_distribution(probabilities[0])

In [None]:
model3.visualize_term_rank()

In [None]:
##assignment of topic number
graduatetopics = pd.DataFrame(gradcsv_college_first5k)
graduatetopics['Topic'] = topics
graduatetopics.rename(columns = {0: "text"}, inplace = True)

graduatetopics.head(25)

Next, we can start to look at the sentiment of each of these topics across the different franchises. This gives us a cleaner picture of how customers feel about the different aspects of the hotels, as well as how strongly they feel about them.

In [None]:
 ## vader sentiment analysis installation.

!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
[?25l[K     |██▋                             | 10 kB 22.0 MB/s eta 0:00:01[K     |█████▏                          | 20 kB 21.7 MB/s eta 0:00:01[K     |███████▉                        | 30 kB 10.5 MB/s eta 0:00:01[K     |██████████▍                     | 40 kB 8.5 MB/s eta 0:00:01[K     |█████████████                   | 51 kB 4.6 MB/s eta 0:00:01[K     |███████████████▋                | 61 kB 5.5 MB/s eta 0:00:01[K     |██████████████████▏             | 71 kB 5.5 MB/s eta 0:00:01[K     |████████████████████▉           | 81 kB 5.5 MB/s eta 0:00:01[K     |███████████████████████▍        | 92 kB 6.0 MB/s eta 0:00:01[K     |██████████████████████████      | 102 kB 5.3 MB/s eta 0:00:01[K     |████████████████████████████▋   | 112 kB 5.3 MB/s eta 0:00:01[K     |███████████████████████████████▏| 122 kB 5.3 MB/s eta 0:00:01[K     |████████████████████████████████| 125 kB 5.3 MB

In [None]:
# define a helper that prints the polarity scores of a sentence.

def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")


# calculating compound sentiment scores for all reviews:

# import the sentiment module.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# import numpy.

import numpy as np

# instantiate the sentiment analyzer.

analyser = SentimentIntensityAnalyzer()

# get the compound sentiment score for each review.

graduatetopics['sent_score'] = np.nan 
for index, row in graduatetopics.iterrows():  
    graduatetopics.loc[index, 'sent_score'] = analyser.polarity_scores(row['sentences'])['compound']

# view results.

pd.set_option('display.max_colwidth', None)

final_graduate = graduatetopics.drop('predicted_labels', axis = 1)
final_graduate = final_graduate.drop('predicted_scores', axis = 1)
final_graduate = final_graduate[final_graduate["Topic"] >= 0]
top1 = final_graduate.sort_values(by=['sent_score'], ascending = True)
top1.head(20)




In [None]:
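# count how many reviews have a strongly negative compound score (below -0.5).

x = final_graduate['sent_score'] < -0.5
x.value_counts()


# average the numeric columns by topic, most negative topics first.
# numeric_only=True skips the text column, which cannot be averaged.

grouped = final_graduate.groupby("Topic")
grouped.mean(numeric_only=True).sort_values(by=['sent_score'], ascending = True)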
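
The per-topic summary can be illustrated on toy data. The sketch below is a hedged example with made-up scores: it drops topic -1 (which BERTopic uses for outlier reviews that fit no topic), then reports the mean score, the number of reviews, and the share of strongly negative reviews per topic. The column names `mean_score`, `n_reviews`, and `share_negative` are hypothetical.

```python
import pandas as pd

# Made-up topic assignments and compound sentiment scores.
scored = pd.DataFrame({
    "Topic": [-1, 0, 0, 1, 1, 1],
    "sent_score": [0.10, 0.80, 0.40, -0.70, -0.30, 0.10],
})

# Keep only reviews that were assigned to a real topic.
assigned = scored[scored["Topic"] >= 0]

summary = (
    assigned.groupby("Topic")["sent_score"]
    .agg(
        mean_score="mean",
        n_reviews="size",
        share_negative=lambda s: (s < -0.5).mean(),  # fraction below -0.5
    )
    .sort_values("mean_score")
)
print(summary)
```

Reporting the review count next to the mean helps to avoid over-reading a very negative average that rests on only a handful of reviews.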


Now that we've completed the topic discovery and sentiment analysis for one hotel brand, we repeat the same process for each of the remaining brands among the six we were given reviews for.
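
Because the steps are identical for every brand, they could also be wrapped in one function instead of being copied per section. The sketch below is only an outline under assumptions: `analyse_brand` and its parameters are hypothetical, and the topic model and sentiment scorer are passed in as plain callables so that the example runs with simple stand-ins rather than BERTopic and VADER.

```python
import pandas as pd


def analyse_brand(predictions, fit_topics, score_sentiment, stop_words,
                  label=0, n_sample=None, seed=0):
    """Filter one brand's predictions, assign topics, and score sentiment."""
    subset = predictions[predictions["predicted_labels"] == label]
    if n_sample is not None:
        # never ask for more rows than are available
        subset = subset.sample(n=min(n_sample, len(subset)), random_state=seed)
    subset = subset.copy()
    subset["sentences"] = subset["sentences"].apply(
        lambda text: " ".join(w for w in text.split() if w not in stop_words)
    )
    subset["Topic"] = fit_topics(subset["sentences"].tolist())
    subset["sent_score"] = subset["sentences"].apply(score_sentiment)
    return subset[subset["Topic"] >= 0]   # drop reviews without a real topic


# Stand-ins so the sketch runs without BERTopic or VADER installed.
demo = pd.DataFrame({
    "sentences": ["the wifi was great", "we told staff wifi failed", "conference venue"],
    "predicted_labels": [0, 0, 1],
})
result = analyse_brand(
    demo,
    fit_topics=lambda docs: [0 if "great" in d else -1 for d in docs],
    score_sentiment=lambda s: 0.6 if "great" in s else -0.4,
    stop_words={"the", "we", "told"},
)
print(result)
```

In the notebook's setting, `fit_topics` would wrap a BERTopic model's `fit_transform` and `score_sentiment` would return the `compound` value from the VADER analyser. Filtering on `Topic >= 0` follows the Graduate Hotels section and is a choice that the other brand sections do not all make.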

## Ace Hotels Topic Discovery and Sentiment Analysis

In [None]:

model10 = BERTopic(verbose = True, nr_topics = "auto",calculate_probabilities=True)


%cd /content/drive/MyDrive/BUSI488 - Final Project/CSVs
acecsv = pd.read_csv("AceHotels.csv")
acecsv_college = acecsv[acecsv["predicted_labels"] == 0]
acesamp = acecsv_college.sample(n=11867)


nltk.download('stopwords')


stop_words = stopwords.words('english') + ['the','very','we','told','said','get']


acesamp['sentences'] = acesamp['sentences'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))


docs10 = acesamp['sentences'].tolist()
 

topics, probabilities = model10.fit_transform(docs10)


In [None]:
docs10

In [None]:

model10.get_topic_freq()

In [None]:
model10.get_topic_info()

In [None]:
model10.get_topic(4)

In [None]:
model10.visualize_topics()

In [None]:
model10.visualize_barchart()

In [None]:
model10.visualize_heatmap()

In [None]:
pd.DataFrame(model10.find_topics("wifi"))

In [None]:
model10.save("Ace Hotel model")

ace_model = BERTopic.load("Ace Hotel model")

ace_model.visualize_barchart()

In [None]:
model10.visualize_distribution(probabilities[0])

In [None]:
model10.visualize_term_rank()

In [None]:
acetopics = pd.DataFrame(acesamp)
acetopics['Topic'] = topics
acetopics.rename(columns = {0: "text"}, inplace = True)

acetopics.head(25)



In [None]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")



from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import numpy as np

analyser = SentimentIntensityAnalyzer()

acetopics['sent_score'] = np.nan 
for index, row in acetopics.iterrows(): 
    acetopics.loc[index, 'sent_score'] = analyser.polarity_scores(row['sentences'])['compound']

pd.set_option('display.max_colwidth', None)

final_ace = acetopics.drop('predicted_labels', axis = 1)
final_ace = final_ace.drop('predicted_scores', axis = 1)
top2 = final_ace.sort_values(by=['sent_score'], ascending = False)
top2.head(20)

In [None]:
x = final_ace['sent_score'].mean()
x


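# average the numeric columns by topic, most negative topics first.
# numeric_only=True skips the text column, which cannot be averaged.
grouped = final_ace.groupby("Topic")
grouped.mean(numeric_only=True).sort_values(by=['sent_score'], ascending = True)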

## Hoxton Hotels Topic Discovery and Sentiment Analysis


In [None]:
model11 = BERTopic(verbose = True, nr_topics = "auto",calculate_probabilities=True)

%cd /content/drive/MyDrive/BUSI488 - Final Project/CSVs
hoxcsv = pd.read_csv("HoxtonHotels.csv")
hoxcsv_college = hoxcsv[hoxcsv["predicted_labels"] == 0]
hoxsamp = hoxcsv_college.sample(n=15000)


nltk.download('stopwords')


stop_words = stopwords.words('english') + ['the','very','we','told','said','get']


hoxsamp['sentences'] = hoxsamp['sentences'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))


docs11 = hoxsamp['sentences'].tolist()
 
topics, probabilities = model11.fit_transform(docs11)

In [None]:
docs11

In [None]:
model11.get_topic_freq()

In [None]:
model11.get_topic_info()

In [None]:
model11.get_topic(4)

In [None]:
model11.visualize_topics()

In [None]:
model11.visualize_barchart()

In [None]:
model11.visualize_heatmap()

In [None]:
pd.DataFrame(model11.find_topics("wifi"))

In [None]:
model11.save("Hoxton Hotel model")

hox_model = BERTopic.load("Hoxton Hotel model")

hox_model.visualize_barchart()

In [None]:
model11.visualize_distribution(probabilities[0])

In [None]:
model11.visualize_term_rank()

In [None]:
hoxtopics = pd.DataFrame(hoxsamp)
hoxtopics['Topic'] = topics
hoxtopics.rename(columns = {0: "text"}, inplace = True)

hoxtopics.head(25)

In [None]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")


from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import numpy as np

analyser = SentimentIntensityAnalyzer()

hoxtopics['sent_score'] = np.nan 
for index, row in hoxtopics.iterrows():  
    hoxtopics.loc[index, 'sent_score'] = analyser.polarity_scores(row['sentences'])['compound']

pd.set_option('display.max_colwidth', None)

final_hoxton = hoxtopics.drop('predicted_labels', axis = 1)
final_hoxton = final_hoxton.drop('predicted_scores', axis = 1)
top3 = final_hoxton.sort_values(by=['sent_score'], ascending = False)
top3.head(20)

## Study Hotels Topic Discovery and Sentiment Analysis

In [None]:
model12 = BERTopic(verbose = True, nr_topics = "auto",calculate_probabilities=True)

%cd /content/drive/MyDrive/BUSI488 - Final Project/CSVs
stucsv = pd.read_csv("StudyHotels.csv")
stucsv_college = stucsv[stucsv["predicted_labels"] == 0]
stusamp = stucsv_college.sample(n=1650)

nltk.download('stopwords')

stop_words = stopwords.words('english') + ['the','very','we','told','said','get']

stusamp['sentences'] = stusamp['sentences'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

docs12 = stusamp['sentences'].tolist()

topics, probabilities = model12.fit_transform(docs12)

In [None]:
docs12

In [None]:
model12.get_topic_freq()

In [None]:
model12.get_topic_info()

In [None]:
model12.get_topic(4)

In [None]:
model12.visualize_topics()

In [None]:
model12.visualize_barchart()

In [None]:
model12.visualize_heatmap()

In [None]:
pd.DataFrame(model12.find_topics("wifi"))

In [None]:
model12.save("Study Hotel model")

stu_model = BERTopic.load("Study Hotel model")

stu_model.visualize_barchart()

In [None]:
model12.visualize_distribution(probabilities[0])

In [None]:
model12.visualize_term_rank()

In [None]:
stutopics = pd.DataFrame(stusamp)
stutopics['Topic'] = topics
stutopics.rename(columns = {0: "text"}, inplace = True)

stutopics.head(25)



In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(sentence):
    """Print the full VADER polarity scores for a single sentence."""
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

# compound sentiment score for each sentence; apply() scores every row by position.
stutopics['sent_score'] = stutopics['sentences'].apply(
    lambda s: analyser.polarity_scores(s)['compound'])

pd.set_option('display.max_colwidth', None)

# drop the classifier columns and list the most positive sentences first.
final_study = stutopics.drop(columns=['predicted_labels', 'predicted_scores'])
top4 = final_study.sort_values(by='sent_score', ascending=False)
top4.head(20)

## Standard Hotels Topic Discovery and Sentiment Analysis


In [None]:
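# imports repeated here so the cell also runs on a fresh runtime.
import nltk
from nltk.corpus import stopwords
from bertopic import BERTopic

model13 = BERTopic(verbose=True, nr_topics="auto", calculate_probabilities=True)

%cd /content/drive/MyDrive/BUSI488 - Final Project/CSVs
stacsv = pd.read_csv("StandardHotels.csv")
# keep only sentences with predicted label 0, then draw a random sample.
stacsv_college = stacsv[stacsv["predicted_labels"] == 0]
stasamp = stacsv_college.sample(n=12856)

nltk.download('stopwords')

# NLTK's English stop words plus a few review-specific filler words.
stop_words = set(stopwords.words('english') + ['the','very','we','told','said','get'])

# drop stop words from every sentence (the match is case-sensitive).
stasamp['sentences'] = stasamp['sentences'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

docs13 = stasamp['sentences'].tolist()

topics, probabilities = model13.fit_transform(docs13)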


In [None]:
docs13

In [None]:
model13.get_topic_freq()

In [None]:
model13.get_topic_info()

In [None]:
model13.get_topic(4)

In [None]:
model13.visualize_topics()

In [None]:
model13.visualize_barchart()

In [None]:
model13.visualize_heatmap()

In [None]:
pd.DataFrame(model13.find_topics("wifi"))


In [None]:
model13.save("Standard Hotel model")

sta_model = BERTopic.load("Standard Hotel model")

sta_model.visualize_barchart()

In [None]:
model13.visualize_distribution(probabilities[0])

In [None]:
model13.visualize_term_rank()

In [None]:
statopics = pd.DataFrame(stasamp)
statopics['Topic'] = topics
statopics.rename(columns = {0: "text"}, inplace = True)

statopics.head(25)

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(sentence):
    """Print the full VADER polarity scores for a single sentence."""
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

# compound sentiment score for each sentence; apply() scores every row by position.
statopics['sent_score'] = statopics['sentences'].apply(
    lambda s: analyser.polarity_scores(s)['compound'])

pd.set_option('display.max_colwidth', None)

# drop the classifier columns and list the most positive sentences first.
final_standard = statopics.drop(columns=['predicted_labels', 'predicted_scores'])
top5 = final_standard.sort_values(by='sent_score', ascending=False)
top5.head(20)


## Altair

We reuse the sentiment analyzer code above to score every review, which gives us the data for visualizing positive and negative sentiment across the different hotel chains.
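Before charting, the compound scores need to be turned into positive and negative counts per chain. The cell below is a minimal sketch of that step on made-up data, not the project's actual pipeline: the `chain` column, the toy scores, and the helper `label_sentiment` are hypothetical, and the ±0.05 cut-offs are the thresholds conventionally used with VADER compound scores.

```python
import pandas as pd

# Hypothetical stand-in for the scored reviews; the real frames hold
# 'hotel', 'text' and 'sent_score' columns.
reviews = pd.DataFrame({
    "chain": ["Graduate", "Graduate", "Graduate", "Ace", "Ace", "Hoxton"],
    "sent_score": [0.90, -0.33, 0.02, 0.64, -0.70, 0.54],
})

def label_sentiment(score, threshold=0.05):
    """Map a VADER compound score to positive / negative / neutral."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

reviews["label"] = reviews["sent_score"].apply(label_sentiment)

# Long-form table: one row per (chain, label) with its count and share.
summary = reviews.groupby(["chain", "label"]).size().rename("n").reset_index()
summary["share"] = summary["n"] / summary.groupby("chain")["n"].transform("sum")

print(summary)
```

A long-form table like `summary` is the shape Altair works with most easily, for example with `chain` on the x axis, `share` on the y axis, and `label` as the color.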

In [None]:
# 1. Import the sentiment module (in case you haven't already done so)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 2. Instantiate the sentiment analyzer (in case you haven't already done so)
analyser = SentimentIntensityAnalyzer()

# 3. Define a function that prints the polarity scores of a sentence
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

# 4. Now get the compound sentiment score for each review.
#    apply() scores every row by position, so rows that share an index label cannot overwrite one another.
GraduateHotels_alltext['sent_score'] = GraduateHotels_alltext['text'].apply(
    lambda text: analyser.polarity_scores(text)['compound'])

# 5. Show the full review text whenever the DataFrame is displayed
pd.set_option('display.max_colwidth', None)



In [None]:
grad_appendable = GraduateHotels_alltext

In [None]:
# 1. Import the sentiment module (in case you haven't already done so)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 2. Instantiate the sentiment analyzer (in case you haven't already done so)
analyser = SentimentIntensityAnalyzer()

# 3. Define a function that prints the polarity scores of a sentence
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

# 4. Now get the compound sentiment score for each review.
#    apply() scores every row by position, so rows that share an index label cannot overwrite one another.
ace_alltext['sent_score'] = ace_alltext['text'].apply(
    lambda text: analyser.polarity_scores(text)['compound'])

# 5. Show the full review text whenever the DataFrame is displayed
pd.set_option('display.max_colwidth', None)

In [None]:
ace_appendable = ace_alltext

In [None]:
# 1. Import the sentiment module (in case you haven't already done so)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 2. Instantiate the sentiment analyzer (in case you haven't already done so)
analyser = SentimentIntensityAnalyzer()

# 3. Define a function that prints the polarity scores of a sentence
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

# 4. Now get the compound sentiment score for each review.
#    apply() scores every row by position, so rows that share an index label cannot overwrite one another.
standard_alltext['sent_score'] = standard_alltext['text'].apply(
    lambda text: analyser.polarity_scores(text)['compound'])

# 5. Show the full review text whenever the DataFrame is displayed
pd.set_option('display.max_colwidth', None)

In [None]:
standard_appendable = standard_alltext

In [None]:
# 1. Import the sentiment module (in case you haven't already done so)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 2. Instantiate the sentiment analyzer (in case you haven't already done so)
analyser = SentimentIntensityAnalyzer()

# 3. Define a function that prints the polarity scores of a sentence
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

# 4. Now get the compound sentiment score for each review.
#    apply() scores every row by position, so rows that share an index label cannot overwrite one another.
hoxton_alltext['sent_score'] = hoxton_alltext['text'].apply(
    lambda text: analyser.polarity_scores(text)['compound'])

# 5. Show the full review text whenever the DataFrame is displayed
pd.set_option('display.max_colwidth', None)

In [None]:
hoxton_appendable = hoxton_alltext

We append the DataFrames together by hotel chain rather than by source. In other words, the Google and TripAdvisor review text for a chain sits in the same DataFrame.

In [None]:
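# stack the per-chain frames; pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
comb1 = pd.concat([grad_appendable, ace_appendable])
comb2 = pd.concat([comb1, hoxton_appendable])
combined = pd.concat([comb2, standard_appendable])

grads = combined[combined['hotel'].str.contains('Graduate')]

# left commented out: the comparison value is missing, so the line is not valid Python as written.
# ace = combined[combined['hotel'] == ]
#combined = combined['hotel'].mask(grads, 'Graduate', inplace = True)
# note: replace() returns a new DataFrame and the result is not assigned, so `combined` keeps the full hotel names.
combined.replace(to_replace=r'^Graduate Ann Arbor$', value = 'Graduate', regex=True)
# grads["sent_score"].mean()
combined.head(10)
#grads.tail(20)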

Unnamed: 0,hotel,text,sent_score
0,Graduate Ann Arbor,"Hotel is VERY conveniently located but lacks the college vibe/decor/colors and space that other Graduates have. They don't have a TV in the common area and the bar doesn't open until 4:00, so that's inconvenient when you want to watch a football or basketball game that starts before 4 pm ;-(. And rooms are very small.",-0.3304
1,Graduate Ann Arbor,"Love the style and atmosphere. Everything feels so cozy and the decor is unique and quaint. Also, the furnishings and service are high quality. I REALLY enjoyed my stay! Honestly this is the best hotel I have stayed at in a very long time. Doesn't feel like a chain if you know what I mean.",-0.7003
2,Graduate Ann Arbor,"The Staff was amazing, the Lobby and all rooms were cozy, comfortable, and inviting. The little lobby cafe had everything we needed! Plus the location to campus was perfect as we could walk everywhere. The only way our stay could have been better is if we could of stayed a little longer! Thank you!",0.9017
3,Graduate Ann Arbor,Excellent service and accommodations. Staff are over the top nice and engaging. Short walking distance to about everything. Wonderful coffee bar. Great weekend getaway. Look forward to our next visit.,-0.4325
4,Graduate Ann Arbor,"Top notch Graduate Hotel. You feel like your in a community of its own inside, the staff is friendly and engaging. The lobby concept is brilliant and studious. I enjoyed meeting folks that were U Mich Alumni in town for a campus visit with their kid(s). Would highly recommend for anyone looking to stay near campus for a multitude of reasons (i.e. Gameday, Campus…)",0.8687
5,Graduate Ann Arbor,We did not have a good stay as we explained to the front desk on checkout. They placed us in room 302 which was sub standard for a four star hotel. It had an attached room it. We could hear everything from next door. The bathroom grout was disgusting and the door handles didn't work on the bathroom door. The door also had weird splattering on the back. The room should never have been rented out. If I had known this was the room quality I would not have stayed there and would have booked a different hotel. The front desk could not move us since the hotel was sold out.,0.7707
6,Graduate Ann Arbor,Quirky hotel. Rooms were very nice and clean. Check in and out was easy. The valet service was very helpful as my sister couldn't walk very far. the The restaurant was nice and the bartender was very friendly. Love the decor.,0.8225
7,Graduate Ann Arbor,Always the best go-to when in Ann Arbor. visiting our UOM daughter. The Service is excellent. We love where the hotel is situated and the feeling / look of the property. Reserve in advance on parents weekend and game days!,-0.6656
8,Graduate Ann Arbor,I recently stayed at the Graduate Ann Arbor and everything was fabulous! I was staying there to be close to UofM hospital for medical reasons and as such required a little more accommodation than an average guest. The front desk team were always friendly and helpful in ensuring deliveries made it up to my room and assisted me with getting to/from my room to the lobby for taxis to my medical appointments. The guest service/concierge Ethan made sure that my every request was promptly attended to. Can’t thank all the hotel staff for making my stay so comfortable!,-0.6125
9,Graduate Ann Arbor,"The service from reservation to check out was simply amazing. The attention to detail is impressive. I received a call and email directly from the front desk to ensure I completed an item. Also, room and environment comfortable enjoyable.",0.5423
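
The cell above stops short of collapsing individual properties into their chains. The sketch below shows one way that step could look on toy data. It is an illustration under stated assumptions: the frames, the hotel names "Graduate Athens" and "Ace Hotel New York", the scores, and the `chain_patterns` mapping are all hypothetical. It also passes `ignore_index=True` to `pd.concat` so that every row in the combined frame gets a unique index label.

```python
import pandas as pd

# Hypothetical per-chain frames standing in for the *_appendable DataFrames.
grad_demo = pd.DataFrame({"hotel": ["Graduate Ann Arbor", "Graduate Athens"],
                          "sent_score": [0.9, -0.3]})
ace_demo = pd.DataFrame({"hotel": ["Ace Hotel New York"], "sent_score": [0.6]})
standard_demo = pd.DataFrame({"hotel": ["The Standard, High Line"],
                              "sent_score": [0.64]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's labels.
combined_demo = pd.concat([grad_demo, ace_demo, standard_demo], ignore_index=True)

# Assumed mapping from a name prefix (regular expression) to a chain label.
chain_patterns = {
    r"^Graduate": "Graduate",
    r"^Ace": "Ace",
    r"^The Standard": "Standard",
}

# Start from the full hotel name, then overwrite rows that match a known prefix.
combined_demo["chain"] = combined_demo["hotel"]
for pattern, chain in chain_patterns.items():
    mask = combined_demo["hotel"].str.contains(pattern, regex=True)
    combined_demo.loc[mask, "chain"] = chain

# Average compound sentiment per chain.
mean_by_chain = combined_demo.groupby("chain")["sent_score"].mean()
print(mean_by_chain)
```

Keeping the original `hotel` column next to the derived `chain` column makes it possible to compare individual properties as well as whole chains.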


In [None]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
combined = pd.concat([comb2, standard_appendable])
combined

Unnamed: 0,hotel,text,sent_score
0,Graduate Ann Arbor,"Hotel is VERY conveniently located but lacks the college vibe/decor/colors and space that other Graduates have. They don't have a TV in the common area and the bar doesn't open until 4:00, so that's inconvenient when you want to watch a football or basketball game that starts before 4 pm ;-(. And rooms are very small.",-0.3304
1,Graduate Ann Arbor,"Love the style and atmosphere. Everything feels so cozy and the decor is unique and quaint. Also, the furnishings and service are high quality. I REALLY enjoyed my stay! Honestly this is the best hotel I have stayed at in a very long time. Doesn't feel like a chain if you know what I mean.",-0.7003
2,Graduate Ann Arbor,"The Staff was amazing, the Lobby and all rooms were cozy, comfortable, and inviting. The little lobby cafe had everything we needed! Plus the location to campus was perfect as we could walk everywhere. The only way our stay could have been better is if we could of stayed a little longer! Thank you!",0.9017
3,Graduate Ann Arbor,Excellent service and accommodations. Staff are over the top nice and engaging. Short walking distance to about everything. Wonderful coffee bar. Great weekend getaway. Look forward to our next visit.,-0.4325
4,Graduate Ann Arbor,"Top notch Graduate Hotel. You feel like your in a community of its own inside, the staff is friendly and engaging. The lobby concept is brilliant and studious. I enjoyed meeting folks that were U Mich Alumni in town for a campus visit with their kid(s). Would highly recommend for anyone looking to stay near campus for a multitude of reasons (i.e. Gameday, Campus…)",0.8687
...,...,...,...
5888,"The Standard, High Line",(Translated by Google) Good attentions (Original) Buenas atenciones,0.6369
5889,"The Standard, High Line",(Translated by Google) delete del (Original) suppr suppr,0.3182
5890,"The Standard, High Line",(Translated by Google) Good night view (Original) 야경이 멋짐,0.6369
5891,"The Standard, High Line",(Translated by Google) N/a (Original) N/a,0.3182
