# Insights from Hotel Reviews

**Goal**

This notebook aims to explore the Trip Advisor hotel reviews dataset, to show a combination of techniques that can reveal what makes a guest enjoy or dislike their hotel stay.  

Some of the techniques that will be applied are:
* **Topic Modeling** - This is the most common approach to reveal what makes a hotel good or bad, but the problem is that topics are often hard to interpret.  Even when they can be interpreted, they often overlap so much that they are not useful.  This notebook will explore ways to make the topics easier to understand.
* **Sentiment Analysis** - Sentiment can overlap with rating, but it may not.  This notebook will explore how sentiment and rating relate, and how sentiment can be used to enhance topic modeling.
* **Rating prediction** - There is no value in predicting rating.  The reviewers do that for us!  The true usefulness of a rating prediction model is in exploring which features steer it to a rating prediction.  That is, which features make the hotel rating good or bad?  This notebook will explore how deep learning can be used to create a solid prediction model, and how the black box can be peeled open to reveal the features that matter.

**Note About the Data**

The reviews themselves have been cleaned already.  The text does not read cleanly, as words have been removed.  I would have preferred the full, raw text, but the pre-cleaning will speed things up a bit.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install glimpy

In [None]:
import os
import pandas as pd
import spacy
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis
import pyLDAvis.sklearn
import tensorflow as tf
import itertools
import plotly.express as px
from collections import Counter
from tqdm import tqdm
from textblob import TextBlob
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score, f1_score, log_loss
from glimpy import GLM, Poisson
from transformers import DistilBertTokenizer, DistilBertConfig, TFDistilBertModel

In [None]:
# hardcoded values used throughout the script
NBR_EPOCHS = 2
TEST_SET_FRAC = 0.2
SEED = 14
np.random.seed(SEED)
tf.random.set_seed(SEED)

## Exploratory Data Analysis

Since this is text data, I want to check for nulls, empty strings, new line characters, and tabs that could cause trouble parsing the text.



In [None]:
d = pd.read_csv('/kaggle/input/trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv')
d.columns = [c.lower() for c in d.columns]

# OPTIONAL - reduce dataset size in case the kernel runs out of memory
d = d.loc[:3000]

print(len(d))
print(d.columns.to_list())
d.head()

In [None]:
# check for nulls
d.isna().sum()

In [None]:
# check for empty string reviews
len(d[d.review == ''])

In [None]:
# check for new lines and tabs
d.review.str.contains('[\n\t]', regex=True).sum()

### Sentiment Analysis

As part of the data exploration process, it would be interesting to see how sentiment plays into review ratings.  I want to do 2 levels of analysis:

* Sentiment at the sentence level
* Sentiment at the document (review) level

For the document level, I will average the sentiments of the sentences in each document.  To do this, I will use Spacy and TextBlob.
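
The document-level score is just a mean of sentence polarities, so it can be sketched in a few lines.  This is a minimal illustration, not the notebook's actual pipeline: the function name `document_sentiment` and the `default` argument are hypothetical, and the polarities are made-up numbers in the [-1, 1] range that TextBlob uses.  It also shows one edge case worth deciding on up front: a review with no sentences has no defined mean.

```python
from statistics import fmean


def document_sentiment(sentence_polarities, default=0.0):
    """Average sentence polarities into a single document score.

    `default` (an assumption of this sketch) is returned when a review has
    no sentences, where a plain mean would be undefined.
    """
    if not sentence_polarities:
        return default
    return fmean(sentence_polarities)


# hypothetical polarity scores for one review with three sentences
print(document_sentiment([0.5, -0.2, 0.3]))
# a review with no sentences falls back to the neutral default
print(document_sentiment([]))
```

Because this is a plain average, a single strongly worded sentence can pull the whole review's score up or down.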

In [None]:
# load Spacy language model
splg = spacy.load('en_core_web_lg')

In [None]:
# create a list of reviews to iterate over
reviews = d.review.to_list()

# create a list of tuples containing the sentences and sentiments
reviews_info = []
for r in tqdm(reviews):
    text = splg(r)
    sents = [str(s) for s in text.sents]
    sent_sentiments = [TextBlob(str(s)).sentiment[0] for s in text.sents]  # polarity is index 0
    reviews_info.append((sents, sent_sentiments))

# inspect the first review: the first tuple element should be a list of sentences
print("Nbr Sents:", len(reviews_info[0][0]))
reviews_info[0]

In [None]:
# create doc to sentence, and doc to sentence sentiment maps
# also create flat lists containing all sentences and sentiments
doc_sents = dict.fromkeys([i for i in range(len(reviews_info))])
doc_sent_sentiments = dict.fromkeys([i for i in range(len(reviews_info))])
all_sents = []
all_sent_sentiments = []

for doc_id, doc_info in enumerate(reviews_info):
    doc_sents[doc_id] = doc_info[0]  # sentences are index 0
    doc_sent_sentiments[doc_id] = doc_info[1]  # sentiments are index 1
    all_sents += doc_info[0]
    all_sent_sentiments += doc_info[1]

# create a doc to sentiment map by averaging the sentiments of a doc's sentences
doc_sentiments = {k: np.mean(v) for k, v in doc_sent_sentiments.items()}

# explore the document level sentiment distribution by rating
d['review_sentiment'] = d.index.map(doc_sentiments)
sns.boxplot(x=d.rating, y=d.review_sentiment)
plt.ylabel('Sentiment Distribution (Averaged Sentence Polarity per Review)')
plt.title('Sentiment Distribution by Rating')
plt.show()

Document sentiment tends to rise with the rating, which is to be expected.  Document sentiment could be a good feature to use when predicting rating.  There are times, however, when there are positive sentiments but low ratings, and times when there are negative sentiments but high ratings.  But from the appearance of these box plots, one could almost draw a line at doc sentiment = 0.1 to separate low (1-3) and high (3-5) ratings.  
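
One way to check the "draw a line at 0.1" idea is to measure how often a simple threshold agrees with the rating.  The sketch below is only illustrative: the function name `threshold_agreement` is hypothetical, the toy numbers are invented, and treating ratings of 4 and above as "high" is my assumption (the cutoff is exposed as `high_rating_min` so it can be changed).

```python
def threshold_agreement(sentiments, ratings, threshold=0.1, high_rating_min=4):
    """Fraction of reviews where (sentiment >= threshold) matches (rating >= high_rating_min)."""
    if len(sentiments) != len(ratings):
        raise ValueError("sentiments and ratings must have the same length")
    if not sentiments:
        raise ValueError("need at least one review")
    matches = sum(
        (s >= threshold) == (r >= high_rating_min)
        for s, r in zip(sentiments, ratings)
    )
    return matches / len(sentiments)


# invented example values: document sentiments and their star ratings
toy_sentiments = [0.35, 0.05, -0.10, 0.20, 0.12]
toy_ratings = [5, 2, 1, 4, 2]
print(threshold_agreement(toy_sentiments, toy_ratings))
```

On the real data this could be called with `d.review_sentiment.tolist()` and `d.rating.tolist()`, and repeated for a few thresholds to see whether 0.1 is actually a good cut.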

Next I want to view examples of negative sentiments with high ratings, and examples of positive sentiments with low ratings.  I will look for high sentiment sentences that might influence the document sentiment too much.

In [None]:
# view examples of negative sentiment but high rating
d[((d.rating==5) & (d.review_sentiment<0.1))].head()

In [None]:
# render the first review for inspection
spacy.displacy.render(
    splg(d[((d.rating==5) & (d.review_sentiment<0.1))]['review'].to_list()[0]),
    style='ent', 
    jupyter=True
)

In [None]:
# view examples of positive sentiment but low rating
d[((d.rating==1) & (d.review_sentiment>0.1))].head()

In [None]:
spacy.displacy.render(
    splg(d[((d.rating==1) & (d.review_sentiment>0.1))]['review'].to_list()[0]),
    style='ent', 
    jupyter=True
)

It looks like some of the high ratings with negative sentiment are complaining about things that are irrelevant to the hotel, like the flights to the city.  Others complain about one aspect of the hotel, like parking, but it did not affect the overall rating.  It looks like a combination of sentiment and topic might be insightful.

It looks like some of the low ratings with high sentiment are actually more neutral sounding.  Since 0 is neutral sentiment, it may be worth changing the threshold for what defines positive.  But it is interesting to see that even neutral sentiment reviews can have very low ratings.  Perhaps high ratings are given when people are pleasantly surprised?  Again, it looks like a combination of topic and sentiment would be interesting.

Before exploring topics, I want to look at the sentence length distribution and rating distribution.  If there are unusually short or long sentences, they might be useless to analyze.  The short ones won't contain enough information, and the long ones will contain so much that they become too noisy (conflicting sentiments, embeddings that smooth towards 0, etc.).  To do this, I need to assign each sentence an ID that is unique across all sentences, and then map each document to the global IDs of the sentences it contains.

In [None]:
# assign each sentence a unique, global ID
sent_global_id = {i: v for i, v in enumerate(all_sents)}

# map each document to the global IDs of the sents within it
doc_global_id = dict.fromkeys([n for n in range(len(doc_sents))])
doc_max_global_id = 0
for k, v in doc_sents.items():
    nbr_sents = len(v)
    new_max_global_id = doc_max_global_id + nbr_sents
    doc_global_id[k] = [n for n in range(doc_max_global_id, new_max_global_id)]
    doc_max_global_id = new_max_global_id

doc_global_id[0]

In [None]:
# check sentence length distribution
sns.distplot([len(s.split()) for s in all_sents], kde=False, bins=60)
plt.xlabel('Sentence Length (Nbr Words)')
plt.title('Sentence Length Distribution')
plt.show()

In [None]:
# rating distribution
sns.distplot(d.rating, kde=False)
plt.title('Rating Distribution')
plt.show()

The sentence length distribution shows that there are a handful of long sentences, and some with under 10 words.  

The rating distribution shows that there are more high ratings than low, overall.  I could group some of the ratings together as categories, such as 1-2 as negative, 3 as neutral, and 4-5 as positive.  But from a business perspective, ratings are vital to landing new customers, so perhaps anything < 4 should be labeled as negative.  
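
The two grouping schemes described above can be compared side by side.  This is a small sketch with hypothetical function names (`bin_three_way`, `bin_business`) and invented ratings; it only shows how the class counts shift when 3-star reviews are treated as negative.

```python
from collections import Counter


def bin_three_way(rating):
    """1-2 -> 'negative', 3 -> 'neutral', 4-5 -> 'positive'."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


def bin_business(rating):
    """Stricter business view: anything below 4 counts as negative."""
    return "positive" if rating >= 4 else "negative"


# invented ratings, for illustration only
toy_ratings = [1, 2, 3, 3, 4, 5, 5]
three_way_counts = Counter(bin_three_way(r) for r in toy_ratings)
business_counts = Counter(bin_business(r) for r in toy_ratings)
print(three_way_counts)
print(business_counts)
```

The stricter scheme gives a two-class problem, at the cost of lumping lukewarm reviews together with genuinely bad ones.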

Are there rare words that might be able to be removed?  Words that only appear once might not be useful for modeling.

In [None]:
# find rare words
minimum_word_count = 2
word_freqs = Counter(' '.join(all_sents).split())
sorted_word_freqs = sorted(word_freqs.items(), key=lambda i: i[1])
[i for i in sorted_word_freqs if i[1] < minimum_word_count][:100]

In [None]:
# for curiosity's sake, inspect most frequent words (hopefully these are not stop words)
word_freqs.most_common(5)

## Topic Modeling (LDA vs Manually Created Topics)

Topic modeling may help identify features of a hotel that could be improved, or that are a major draw for customers.  When combined with ratings, these topics could help the business identify areas of improvement.

Without writing a single line of code, I would bet that the topics that will surface might be something like:

['hotel', 'room', 'stay', 'great', 'good']

['restaurant', 'food']

['staff', 'friendly']


You don't need data science to figure out what people want in a hotel.  What is truly valuable is to find the non-obvious things, or the things a hotel might not realize are a problem.  Sometimes you might get lucky and find something, but it can require sifting through many topics to find nuggets of gold.  Who has time for that?  After taking the standard approach to topic modeling (LDA), I will look at manually creating topics.  **Spoiler: in this case, the manual approach is much better.**

In [None]:
# Limit number of words/features to use in LDA to 1000
nbr_features = 1000

# LDA uses raw term frequencies
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=nbr_features, stop_words='english')
term_freqs = tf_vectorizer.fit_transform(all_sents)
tf_feature_names = tf_vectorizer.get_feature_names()

nbr_topics = 20
lda = LatentDirichletAllocation(n_components=nbr_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=SEED).fit(term_freqs)
lda_topics = lda.transform(term_freqs)

# Plot pretty LDA output
lda_vis_data = pyLDAvis.sklearn.prepare(lda, term_freqs, tf_vectorizer)
lda_vis_data_html = pyLDAvis.prepared_data_to_html(lda_vis_data)
pyLDAvis.display(lda_vis_data)

This shows the problem with topic modeling: the topics are often hard to interpret, and they have significant overlap in their term components.  There may be some topics that pop out, like one that talks about the beds, or another that deals with beach hotels, but what practical use are these?  You don't need data science to figure out that a comfortable bed and beachside hotel are things that customers want.  Unfortunately, it seems like topic modeling has not revealed anything useful.

### Topic Exploration

If the topics were easy to distinguish, then a more detailed exploration of them would be insightful.  In this case, they are not, but I'll do this analysis anyway.

In [None]:
# Inspect the top n highest-weighted words for each topic
top_n = 10
total_rows = nbr_topics*top_n
topic = []
topic_top_ten = []
topic_top_ten_scores = []
for tid, t in enumerate(lda.components_):
    topic.append([tid+1]*top_n)
    topic_top_ten.append([tf_feature_names[i] for i in t.argsort()[:-top_n -1:-1]])
    topic_top_ten_scores.append(t[t.argsort()[:-top_n -1:-1]])
top_words = np.concatenate([np.array(topic).reshape(total_rows,1), np.array(topic_top_ten).reshape(total_rows,1), np.array(topic_top_ten_scores).reshape(total_rows,1)], axis=1)
topwordsdf = pd.DataFrame(top_words, columns=['topic', 'word', 'score'])
topwordsdf['topic'] = topwordsdf['topic'].astype('int64')
topwordsdf['score'] = topwordsdf['score'].astype('float64')

# The higher this threshold, the fewer results will be returned when data is filtered
lda_score_threshold = 0.2


# Set up bar chart of top words by topic
def create_bar_top_words(topwords_df, topic):
    topwords_df[topwords_df['topic']==topic].sort_values(by=['score'], ascending=True).plot.barh(x='word', y='score', title='Top 10 Words for Topic '+str(topic), colormap='Paired', legend=False)
    plt.xlabel('score')
    plt.tight_layout()
    plt.show()
    return

create_bar_top_words(topwordsdf, topic=1)

# Set up heatmap of document to topic probability score
def create_doc_topic_heatmap(topics_input):
    heatmap_x_labels = ["Topic %d" % i for i in range(1, topics_input.shape[1]+1)]
    heatmap_y_labels = ["%d" % i for i in range(1, topics_input.shape[0]+1)]
    sns.heatmap(topics_input, cmap='Greys', xticklabels=heatmap_x_labels, yticklabels=heatmap_y_labels)
    plt.title('Probability that Records are Related to Topics')
    plt.tight_layout()
    plt.show()
    return

create_doc_topic_heatmap(lda_topics[:24,:])

While the topics are not easy to interpret, might they be useful when combined with sentiment?  To find out, I will map each topic to a sentiment by averaging the sentiments of the sentences assigned to it.  Each sentence will be assigned to the topic with its highest probability score.  

In [None]:
# get sentiment by LDA topic
# note: LDA was fit on sentences, so each row of lda_topics is a sentence, not a full review

# map each sentence (by global ID) to its most probable topic (max probability score from LDA)
doc_topic_mapping = {sent_id: topic for sent_id, topic in enumerate(np.argmax(lda_topics, axis=1))}

# reverse the mapping to get the set of sentence IDs assigned to each topic
topic_doc_mapping = {}
for sent_id, topic_id in doc_topic_mapping.items():
    topic_doc_mapping.setdefault(topic_id, set()).add(sent_id)

# now get the topic to sentiment mapping, by averaging the sentiments of the sentences
#   that were assigned to each topic
topic_sentiment_mapping = {
    topic_id: np.mean([all_sent_sentiments[sent_id] for sent_id in sent_ids])
    for topic_id, sent_ids in topic_doc_mapping.items()
}
topic_sentiment_mapping
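
Assigning each sentence to a single topic throws away the rest of its topic probabilities.  A possible variant, sketched here as an alternative and not as what this notebook does, is to weight each sentence's sentiment by its probability for every topic.  The function name `weighted_topic_sentiment` and the toy numbers are hypothetical, and the sketch is written in plain Python for readability.

```python
def weighted_topic_sentiment(topic_probs, sentiments):
    """Probability-weighted mean sentiment per topic.

    topic_probs: one row per sentence, one probability per topic.
    sentiments:  one polarity score per sentence.
    """
    n_topics = len(topic_probs[0])
    totals = [0.0] * n_topics   # sum of probability * sentiment, per topic
    weights = [0.0] * n_topics  # sum of probabilities, per topic
    for probs, sentiment in zip(topic_probs, sentiments):
        for t, p in enumerate(probs):
            totals[t] += p * sentiment
            weights[t] += p
    return [tot / w if w > 0 else float("nan") for tot, w in zip(totals, weights)]


# invented example: 3 sentences, 2 topics
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
toy_sentiments = [0.6, -0.4, 0.0]
toy_result = weighted_topic_sentiment(toy_probs, toy_sentiments)
print(toy_result)
```

With 20 overlapping topics, this soft weighting may give smoother estimates than the hard assignment, though it will not make the topics themselves any easier to interpret.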

### Fit Rating to Topic

By modeling review rating on topic, I can determine if there is any relationship between them.  If any topics do influence rating, the regression coefficients will show how strongly.

In [None]:
# create a dataframe of review and lda_topics
dnew = pd.DataFrame(lda_topics)


# create map of sentence ID (global ID) to document by inverting doc_global_id
global_id_to_doc = {
    sent_id: doc_id
    for doc_id, sent_id_list in doc_global_id.items()
    for sent_id in sent_id_list
}
    
dnew['doc_id'] = dnew.index.map(global_id_to_doc)

# average probability that a document relates to each topic
dnew = dnew.groupby('doc_id').mean().reset_index()
dnew['rating'] = dnew.doc_id.map(dict(zip(d.index, d.rating)))
dnew.head()

In [None]:
# is there any relationship between the prob of a topic appearing and the rating?

glm = GLM(fit_intercept=True, family=Poisson())
glm.fit(X=dnew.drop(['doc_id', 'rating'], axis=1), y=dnew.rating)
pred_rating = glm.predict(dnew.drop(['doc_id', 'rating'], axis=1))
print(glm.summary())
print("RMSE:", np.sqrt(mean_squared_error(dnew.rating, pred_rating)))

sns.scatterplot(x=pred_rating, y=dnew.rating)
plt.title("Predicted vs Actual Rating")
plt.xlabel("Predicted Rating")
plt.show()

The predicted vs. actual plot shows that the ranges of predicted ratings overlap heavily across the actual ratings.  This lends support to the idea that binning them and doing classification might be more useful than trying to predict the precise rating.

This regression shows that the topic probabilities alone are not enough to reliably predict actual review rating.  There are some interesting findings.  For example, topic 14 (x15) has a large negative coefficient, suggesting that its presence indicates a low review.  Going back up to the LDA visualization, you can see that this topic contains words like "disappointed" and "issues". However, it also contains the words "enjoyed" and "friend".  This further complicates the interpretation of topics.  For this data, a better approach is to manually create topics, or basically just find sentences containing words you want to explore.  It's simple and dumb, but it works better than fancy data science in this case.

### Manually Specified Topics

Although LDA's topics were not very helpful, we could manually search for a word or string to create a topic.  For instance, if we extract all the sentences containing the word "bed", then average their sentiments, we can get an idea for what customers think of the hotel bed.  This would be much easier to interpret than topics modeled from LDA.

In [None]:
# find sentiment of all sentences containing a word or string - this can serve as a proxy to topic sentiment
sents_containing_string = [si for si, s in enumerate(all_sents) if 'bed' in s]
topic_sentiment = np.mean([all_sent_sentiments[s] for s in sents_containing_string])
topic_sentiment

So sentences containing the word "bed" were generally positive.  Note that these reviews cover many hotels.  If it were possible to focus on a single hotel, then this approach would be extremely useful for the business.
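
One caveat with a plain substring search is that "bed" also matches "bedroom" and "bedbugs".  The sketch below shows one possible refinement using a word-boundary regular expression.  The function name `keyword_sentiment`, the optional plural "s", and the toy sentences and scores are all assumptions made for illustration.

```python
import re
from statistics import fmean


def keyword_sentiment(sentences, sentiments, keyword):
    """Mean sentiment and count of sentences containing `keyword` as a whole word.

    An optional trailing "s" is allowed so simple plurals also match
    (an assumption of this sketch).
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"s?\b", flags=re.IGNORECASE)
    scores = [score for sent, score in zip(sentences, sentiments) if pattern.search(sent)]
    if not scores:
        return None, 0
    return fmean(scores), len(scores)


# invented sentences in the style of the pre-cleaned reviews
toy_sents = ["bed comfortable", "bedbugs everywhere", "stayed 2 bedroom suite", "beds hard"]
toy_scores = [0.4, -0.5, 0.0, -0.3]
print(keyword_sentiment(toy_sents, toy_scores, "bed"))
```

Returning the match count alongside the mean helps judge whether the average rests on enough sentences to be meaningful.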

## Predicting Review Rating

There is not much business use for predicting hotel ratings.  After all, the customer will provide a rating with the review, so there is no need to predict it.  But it still can be useful for finding the features that most contributed towards the rating.  By embedding the text and predicting the rating, I can explore the embeddings to see which features triggered the activations for very positive and very negative reviews.  This will be the best way to find hidden issues that the hotel may have.


### Fitting DistilBERT and a Bi-directional LSTM Classifier

I want to combine the text data with the review sentiment score for classification.  So the model will need to take multiple inputs, since there is no need to embed the sentiment, but the text needs to be embedded.

The model will take the Bert tokenized text and masks as input to a transformer that will produce the embedding.  The embedding will be fed to a bi-directional LSTM layer.  The bi-directional aspect will allow the context of the entire review to assist with the interpretation of the embedding.  The output from this layer is down-sampled in the time dimension using max pooling. This will be the text processing piece.

Another piece of the model will look at review sentiment.  Each review has an average sentiment, which is passed through its own dense layer.  The text branch (the downsampled output of the bi-directional LSTM, followed by a dense layer) is then concatenated with the sentiment branch.  The concatenated vectors will be fed to another dense layer, before ending up in the final layer with softmax activation. 

In [None]:
# bin the ratings:
#  1-2 = bad
#  3   = neutral
#  4-5 = good
dnew = d.copy()
dnew['rating'] = dnew['rating'].map({1: 0, 2: 0, 3: 1, 4: 2, 5: 2})
dnew.head()

In [None]:
# one-hot encode the target and create a train/test split

dnew_onehot = dnew.drop(['rating'], axis=1).join(pd.get_dummies(dnew.rating)).copy()
dnew_onehot.rename(columns={0: 'bad', 1: 'neutral', 2: 'good'}, inplace=True)

# uncomment to troubleshoot
#dnew = dnew[dnew['rating']==0].head(1)

x_train, x_test, y_train, y_test = train_test_split(
    dnew_onehot.drop(['bad', 'neutral', 'good'], axis=1).values, dnew_onehot[['bad', 'neutral', 'good']].values, 
    test_size=TEST_SET_FRAC, random_state=SEED
)

In [None]:
# free up memory
del d, dnew, dnew_onehot, all_sents, all_sent_sentiments, doc_topic_mapping, topic_doc_mapping, topic_sentiment_mapping, topwordsdf, lda_topics, word_freqs, sorted_word_freqs, reviews, reviews_info, doc_sents, doc_sent_sentiments

In [None]:
distil_bert = 'distilbert-base-uncased'

# tokenize the text to prepare it for modeling, using Bert's tokenization method

tokenizer = DistilBertTokenizer.from_pretrained(
    distil_bert, 
    do_lower_case=True, 
    add_special_tokens=True, 
    max_length=128, 
    pad_to_max_length=True
)

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(
            sentence, 
            add_special_tokens=True,   
            max_length=128, 
            pad_to_max_length=True,
            return_attention_mask=True, 
            return_token_type_ids=True
        )

        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])

    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')


# only tokenize the text input, leave the sentiment part of the array alone
train_tokens, train_masks, train_segments = tokenize(sentences=list(x_train[:,0]), tokenizer=tokenizer)
test_tokens, test_masks, test_segments = tokenize(sentences=list(x_test[:,0]), tokenizer=tokenizer)

In [None]:
# model with text and sentiment

sentiment_dims = 1  # a single scalar sentiment score per review
nbr_classes = y_train.shape[1]

distil_bert = 'distilbert-base-uncased'

config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config=config)
   
input_layer_tokens = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
input_layer_masks = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32')
input_layer_sentiment = tf.keras.layers.Input(shape=(sentiment_dims,), name='sentiment', dtype='float32')

embedding_layer = transformer_model(input_layer_tokens, attention_mask=input_layer_masks)[0]
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
x = tf.keras.layers.GlobalMaxPool1D()(x)
x = tf.keras.layers.Dense(50, activation='relu')(x)
x = tf.keras.Model(inputs=[input_layer_tokens, input_layer_masks], outputs=x)

y = tf.keras.layers.Dense(50, activation='relu')(input_layer_sentiment)
y = tf.keras.Model(inputs=input_layer_sentiment, outputs=y)

combined = tf.keras.layers.concatenate([x.output, y.output])
combined = tf.keras.layers.Dense(10, activation='relu')(combined)
#combined = tf.keras.layers.Dropout(0.2)(combined)
combined = tf.keras.layers.Dense(nbr_classes, activation='softmax')(combined)
model = tf.keras.Model(inputs=[x.input, y.input], outputs=combined)

# freeze the first 3 layers of the combined model (the input layers and Bert embeddings)
for layer in model.layers[:3]:
  layer.trainable = False

model.summary()

In [None]:
opt = tf.keras.optimizers.Adam(learning_rate=5e-3)
model.compile(
    optimizer=opt,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model

# reshape sentiment to (n_samples, 1) to match the sentiment input layer
model_inputs = [train_tokens, train_masks, x_train[:,1].astype('float32').reshape(-1, 1)]

train_history = model.fit(
    model_inputs,
    y_train,
    validation_split=TEST_SET_FRAC,
    batch_size=16,
    epochs=NBR_EPOCHS,
)    

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
  """
  This function prints and plots the confusion matrix.
  Normalization can be applied by setting `normalize=True`.
  """
  if normalize:
      cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
      print("Normalized confusion matrix")
  else:
      print('Confusion matrix, without normalization')

  print(cm)

  plt.imshow(cm, interpolation='nearest', cmap=cmap)
  plt.title(title)
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes, rotation=45)
  plt.yticks(tick_marks, classes)

  fmt = '.2f' if normalize else 'd'
  thresh = cm.max() / 2.
  for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
      plt.text(j, i, format(cm[i, j], fmt),
               horizontalalignment="center",
               color="white" if cm[i, j] > thresh else "black")

  plt.tight_layout()
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
  plt.show()


# make predictions (sentiment reshaped to (n_samples, 1) to match the sentiment input layer)
p_test = model.predict([test_tokens, test_masks, x_test[:,1].astype('float32').reshape(-1, 1)]).argmax(axis=1)

# convert the one-hot y labels back to class indices (0 = bad, 1 = neutral, 2 = good)
y_test_clean = y_test.argmax(axis=1)

# evaluate model
cm = confusion_matrix(y_test_clean, p_test)
plot_confusion_matrix(cm, classes=['bad', 'neutral', 'good'])
print(
    "Accuracy:", accuracy_score(y_true=y_test_clean, y_pred=p_test), "\n",
    "F1 Score:", f1_score(y_true=y_test_clean, y_pred=p_test, average='weighted'), "\n",
)

In [None]:
# Show some misclassified examples: predicted bad rating but it was actually good
misclassified_idx = np.where(((p_test == 0) & (y_test_clean == 2)))[0]
i = np.random.choice(misclassified_idx)
x_test[i,0]

In [None]:
# Show some misclassified examples: predicted good rating but it was actually bad
misclassified_idx = np.where(((p_test == 2) & (y_test_clean == 0)))[0]
i = np.random.choice(misclassified_idx)
x_test[i, 0]

The misclassified examples make sense.  They are tricky to classify because they say some good things about the hotel, despite the overall experience being bad.  It's like the reviewer is giving criticism to a friend and wants to soften the blow.  This confuses the model.

### Embedding Visualization

By visualizing the embeddings, it will be easier to determine which features contribute to a rating.  TensorBoard would be perfect for this, but there seems to be a bug with the projector in TF2 (https://github.com/tensorflow/tensorboard/issues/2471).  So instead, I will create the 3D scatterplot manually, using PCA.
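
The review vectors used for the plot are built by averaging token embeddings over the sequence axis, and that average includes the padded positions.  A possible refinement, sketched here under assumptions and not used in this notebook, is to average only over real tokens using the attention mask.  The function name `masked_mean_pool` and the toy values are hypothetical.  It is written in plain Python for clarity; on the real arrays the same idea would be vectorized with NumPy.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors over real tokens only (mask == 1).

    token_vectors:  [batch][seq_len][dim] nested lists of floats.
    attention_mask: [batch][seq_len] with 1 for real tokens, 0 for padding.
    """
    pooled = []
    for vectors, mask in zip(token_vectors, attention_mask):
        kept = [vec for vec, m in zip(vectors, mask) if m == 1]
        if not kept:
            raise ValueError("each sequence needs at least one unmasked token")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# invented example: one review, 3 token positions, 2 dimensions; the last position is padding
toy_vectors = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
toy_pooled = masked_mean_pool(toy_vectors, toy_mask)
print(toy_pooled)
```

For short reviews padded out to 128 tokens, excluding the padding keeps the pooled vector from being dominated by positions that carry no review content.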

In [None]:
def get_embeddings(t_text="hello world", t_inputs=None):
    """
    Tokenizes provided text and returns BERT embeddings.  
    DistilBERT was never fine-tuned here - its weights were frozen, so the pretrained weights 
    can be used to create the embeddings.  Note that the classifier passed an attention mask 
    and this function does not, so padded positions may make the values differ slightly.
    
    Hugging Face's BERT returns a tuple.  The first item contains the embeddings.  The 
    second item contains the transformer's hidden states.  The final hidden state should 
    equal the embeddings: t_model(t_inputs)[0] == t_model(t_inputs)[1][-1]
    
    :params:
        t_text: A string to be embedded.  This argument is ignored if t_inputs is provided.
        t_inputs: A Numpy array of int32 type that contains the tokenized input for BERT.
    """
    t_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    t_config = DistilBertConfig(dropout=0.2, attention_dropout=0.2)
    # be sure to return hidden states
    t_config.output_hidden_states = True
    t_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=t_config)
    if t_inputs is not None:
        return t_model(t_inputs)[0]
    else:
        # wrap the string in a list so it is tokenized as one text, not character by character
        t_inputs = tokenize([t_text], tokenizer=t_tokenizer)[0]  # ignore the masks and segments, only return tokens
        return t_model(t_inputs)[0]


embeds = get_embeddings(t_inputs=test_tokens)

# the returned embeddings are a tensor of shape (batch_size, sequence_length, 768)
# convert the embeds to a numpy array, and average them over the sequence axis (axis 1)
# this will produce 1 vector per review
review_embeds = np.mean(embeds.numpy(), axis=1).squeeze()

# apply PCA to reduce the dimensionality to 3
pca = PCA(n_components=3)
components = pca.fit_transform(review_embeds)
total_var = pca.explained_variance_ratio_.sum() * 100

# convert class labels to readable categories
class_labels = []
for c in y_test_clean.tolist():
    if c == 0:
        class_labels.append("Bad (Rating 1-2)")
    elif c == 1:
        class_labels.append("Neutral (Rating 3)")
    else:
        class_labels.append("Good (Rating 4-5)")

# convert reviews to readable format when they are hovered over
review_texts = x_test[:,0].tolist()
split_after_n_words = 12
review_texts_formatted = []
for r in review_texts:
    words = r.split()
    total_words = len(words)
    nbr_segments = np.ceil(total_words/split_after_n_words)
    # insert line break after every nth word
    words_new = [
        x for y in (words[i:i+split_after_n_words] + ['<br>'] * (i < len(words) - 2) 
                    for i in range(0, len(words), split_after_n_words)) for x in y
    ]
    r_new = ' '.join(words_new)
    review_texts_formatted.append(r_new)

# plot with Plotly
fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=class_labels, 
    title=f'First 3 Principal Components by Class, with Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'},
    hover_name=review_texts_formatted
)
fig.update_traces(marker=dict(size=3))
fig.show()

This plot gives insight into possible areas for a hotel to improve, as well as the most liked features of a hotel.  The color shows the rating.  Reviews that are closer together are more similar.  

Right away, a region of bad reviews becomes visible, where they all comment on the rudeness of the hotel staff.  There are some neutral reviews in that region, but it is clear that rude staff will not lead to a good review.  Another region shows good reviews that all reflect the quality of the room.  Things like "elegant fixtures" and "large clean room" are mentioned.  

There are regions that contain mixed reviews.  One of these concerns the noise level in the hotel.  One reviewer gave a bad review, complaining about the noise.  Another reviewer gave a good review, but said that it was a nice hotel if you don't mind the noise.  This is interesting, because a hotel could look at reviews like these to determine what kind of people would likely enjoy their stay more. 

These are the kinds of insights that one might hope to gain from data science.  Many of them are still obvious (who wouldn't give rude staff a bad rating?), but it is less about the rating and more about discovery.  Perhaps a hotel does not realize its staff are rude, and there are too many reviews to sift through to discover it.  Maybe a guest had a bedbug problem in one of the rooms, and the hotel needs to get on top of it before it spreads.  Or maybe several good reviews mention the restaurant, and the hotel could benefit from marketing that.