# <span style="color:steelblue">Sentiment analysis explained </span>

The goal is to train a multilayer perceptron on the reviews text to perform a sentiment analysis task.
In particular we want to be able to extract a positivity/negativity score expressed in the range from 0 to 1.

0. Extremely negative
1. Extremely positive

To do this, we will take advantage of Google's Word2Vec skip-gram algorithm.
With this simple model we'll pre-compute word embeddings for the entire vocabulary used in the reviews, so that words appearing in similar contexts are mapped close together in the vector space.

The pre-trained embeddings will allow the multilayer perceptron to be trained on a very small amount of data.

Just 20000 reviews out of 515000 will be used for training.

### <span style="color:steelblue">__Importing needed dependencies__ </span>
- The multilayer perceptron will be built with the __tensorflow keras__ libraries.
- Useful functions from __sklearn__ will be used to evaluate the performance.
- Google's Word2Vec skip-gram model will be taken from the __gensim__ library.
- __Numpy__ will be used to create the embeddings matrix of weights out of the skip-gram model.

In [None]:
import pandas as pd
import numpy as np
from tensorflow import keras  # tensorflow.contrib was removed in TensorFlow 2.x
import matplotlib.pyplot as plt
import os
import sys
import pickle
import itertools
import gensim
from sklearn.model_selection import train_test_split
from numpy import zeros
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import GlobalMaxPooling1D

### <span style="color:steelblue">__Importing hotel reviews dataset__ </span>

Not all column types are inferred correctly because of missing values, but we will take care of that later, because the goal here is to train the multilayer perceptron on text only.

In [None]:
# Importing dataset
reviews_df = pd.read_csv('../input/Hotel_Reviews.csv')
print(reviews_df.dtypes)

### <span style="color:steelblue">__Clean text__ </span>

In [None]:
def clean(text):
    '''
    Lowercase a review, expand common English contractions, and strip
    punctuation and digits so that the text can be tokenized consistently.
    '''
    text = text.lower()
    # Expand contractions. Longer forms come first so that, for example,
    # "can't've" is not consumed by the shorter "can't" rule.
    text = text.replace("ain't", "am not")
    text = text.replace("aren't", "are not")
    text = text.replace("can't've", "cannot have")
    text = text.replace("can't", "cannot")
    text = text.replace("'cause", "because")
    text = text.replace("could've", "could have")
    text = text.replace("couldn't've", "could not have")
    text = text.replace("couldn't", "could not")
    text = text.replace("should've", "should have")
    text = text.replace("shouldn't've", "should not have")
    text = text.replace("shouldn't", "should not")
    text = text.replace("would've", "would have")
    text = text.replace("wouldn't've", "would not have")
    text = text.replace("wouldn't", "would not")
    text = text.replace("didn't", "did not")
    text = text.replace("doesn't", "does not")
    text = text.replace("don't", "do not")
    text = text.replace("hadn't've", "had not have")
    text = text.replace("hadn't", "had not")
    text = text.replace("hasn't", "has not")
    text = text.replace("haven't", "have not")
    text = text.replace("he'd've", "he would have")
    text = text.replace("he'd", "he would")
    text = text.replace("'s", "")
    text = text.replace("'t", "")
    text = text.replace("'ve", "")
    text = text.replace(".", " . ")
    text = text.replace("!", " ! ")
    text = text.replace("?", " ? ")
    text = text.replace(";", " ; ")
    text = text.replace(":", " : ")
    text = text.replace(",", " , ")
    text = text.replace("´", "")
    text = text.replace("‘", "")
    text = text.replace("’", "")
    text = text.replace("“", "")
    text = text.replace("”", "")
    text = text.replace("\'", "")
    text = text.replace("\"", "")
    text = text.replace("-", "")
    text = text.replace("–", "")
    text = text.replace("—", "")
    text = text.replace("[", "")
    text = text.replace("]","")
    text = text.replace("{","")
    text = text.replace("}", "")
    text = text.replace("/", "")
    text = text.replace("|", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("$", "")
    text = text.replace("+", "")
    text = text.replace("*", "")
    text = text.replace("%", "")
    text = text.replace("#", "")
    text = text.replace("\n", " \n ")
    text = text.replace("\n", "")
    text = text.replace("_", " _ ")
    text = text.replace("_", "")
    text = ''.join([i for i in text if not i.isdigit()])

    return text

positive_reviews = reviews_df['Positive_Review'].values
negative_reviews = reviews_df['Negative_Review'].values

cleaned_positive_reviews = [clean(r) for r in positive_reviews] 
cleaned_negative_reviews = [clean(r) for r in negative_reviews] 

reviews_df['Positive_Review'] = cleaned_positive_reviews
reviews_df['Negative_Review'] = cleaned_negative_reviews

### <span style="color:steelblue">__Extract truth value__</span>

To train the multilayer perceptron we need to extract the review texts and assign each of them a truth value.
 - *"The hotel was a disaster"* : 0
 - *"We had a lovely holiday"* : 1

Negative and positive reviews are already separated in the given dataset, so we just need to read the data and build the corresponding truth values.

Two datasets will be created:
   1. sentiment_df, containing the training reviews with their truth values
   2. reviews_text_df, containing all reviews (to train the embeddings)
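
The labelling rule can be sketched for a single row as follows. This is a minimal illustration, not the notebook's exact cell: the helper name `label_review_pair` is hypothetical, and the placeholder strings are the lowercased values the dataset uses when one half of a review is empty.

```python
# Placeholder strings that mark an empty half of a review (already lowercased
# by the cleaning step).
PLACEHOLDERS = {'na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative'}


def label_review_pair(positive_text, negative_text):
    """Return (text, label) tuples for one row: 0 = negative, 1 = positive."""
    if positive_text in PLACEHOLDERS:
        # Only the negative half carries real text.
        return [(negative_text, 0)]
    if negative_text in PLACEHOLDERS:
        # Only the positive half carries real text.
        return [(positive_text, 1)]
    # Both halves are real reviews, so the row yields two training samples.
    return [(negative_text, 0), (positive_text, 1)]


print(label_review_pair("we had a lovely holiday", "no negative"))
print(label_review_pair("great staff", "the hotel was a disaster"))
```

A row therefore contributes either one or two labelled samples, which is why the number of training reviews is larger than the number of rows selected.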
   

In [None]:
# Shuffling data
reviews_df = reviews_df.sample(frac=1).reset_index(drop=True)

# Extracting all text
positive_reviews = reviews_df['Positive_Review'].values
negative_reviews = reviews_df['Negative_Review'].values
reviews_text = []

for p,n in zip(positive_reviews, negative_reviews) : 
    if p in ['na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative'] : 
        reviews_text.append(n)
    elif n in ['na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative'] : 
        reviews_text.append(p)
    else : 
        reviews_text.append(n)
        reviews_text.append(p)

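# Preprocessing training data
# Note: .loc slicing is label-based and inclusive, so this keeps the first 1001 rows;
# each row contributes one or two labelled reviews.
training_df = reviews_df.loc[:1000]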
positive_reviews_filtered = training_df['Positive_Review'].values
negative_reviews_filtered = training_df['Negative_Review'].values
training_reviews = []
labels = []

for idx,(p,n) in enumerate(zip(positive_reviews_filtered, negative_reviews_filtered)) : 
    if p in ['na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative'] : 
        training_reviews.append(n)
        labels.append(0)
    elif n in ['na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative'] :
        training_reviews.append(p)
        labels.append(1)
    else :
        training_reviews.append(n)
        labels.append(0)
        training_reviews.append(p)
        labels.append(1)

# Creating datasets
dict1 ={
    'reviews' : training_reviews,
    'labels' : labels
}
sentiment_df = pd.DataFrame.from_dict(dict1)


dict2 ={
    'reviews_text' : reviews_text
}
reviews_text_df = pd.DataFrame.from_dict(dict2)


The preprocessed data are now available as two pandas DataFrames.

### <span style="color:steelblue">__Train embeddings__ </span>

The __Word2Vec skip-gram__ model is a simple neural network that performs the *fake task* of predicting the words that surround a given word in a sentence. The goal of this fake task is to train the network so that we can then extract its weights matrix, in which each row is the vector representation of a word.

We will train the network with 128 neurons in the hidden layer, so each word will have a 128-dimensional representation.
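
To make the fake task concrete, the sketch below lists the (center word, context word) pairs that a skip-gram model would be asked to predict for one tokenized sentence. It is a simplified illustration and not gensim's implementation: the helper name `skipgram_pairs`, the toy sentence, and the window sizes are assumptions chosen for the example, and real implementations typically add refinements such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["the", "room", "was", "clean"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that share many context words end up with similar rows in the weights matrix, which is the property the sentiment model relies on.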

In [None]:
text_reviews = [str(r) for r in reviews_text_df['reviews_text'].values]

sentences = []

for review in text_reviews:
    words = text_to_word_sequence(review)
    sentences.append(words)

# Note: gensim >= 4.0 renames `size` to `vector_size` and `wv.vocab` to `wv.key_to_index`.
embeddings_model = Word2Vec(sentences, min_count=1, sg=1, size=128)
words = list(embeddings_model.wv.vocab)
print('{} WORDS '.format(len(words)))
print('Printing first 100:')
print(words[:100])

### <span style="color:steelblue">Create a vocabulary</span>

A __Keras Tokenizer__ will extract the vocabulary of all words that appear in the 515000 reviews.

In [None]:
# Fitting the Keras Tokenizer on all review texts to build the vocabulary
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(text_reviews)
vocabSize = len(tokenizer.word_index) + 1

### <span style="color:steelblue">__Extract Embeddings matrix__ </span>

Now that the word2vec model is trained, we can combine its vector representations of words to create an embeddings matrix of shape:
- vocabularySize __x__ embeddingDim

These representations will be used to initialize the weights of the embeddings layer in the multilayer perceptron.
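
The construction can be sketched on a toy vocabulary. This is a minimal example under stated assumptions: the three words, the 4-dimensional vectors, and the names `word_index`, `vectors`, and `matrix` are invented for illustration, while the layout (row 0 reserved for padding, rows left at zero for words without a vector) follows the Keras Tokenizer convention of starting word indices at 1.

```python
import numpy as np

embedding_dim = 4  # toy size; the notebook uses 128

# Tokenizer-style vocabulary: indices start at 1, index 0 is reserved for padding.
word_index = {"clean": 1, "dirty": 2, "noisy": 3}

# Toy stand-in for the trained word vectors; "noisy" deliberately has none.
vectors = {
    "clean": np.ones(embedding_dim),
    "dirty": -np.ones(embedding_dim),
}

vocab_size = len(word_index) + 1
matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_index.items():
    vector = vectors.get(word)
    if vector is not None:
        matrix[idx] = vector  # row idx holds the embedding of `word`

print(matrix.shape)
print(matrix)
```

Row `i` of the matrix is what the embedding layer returns when it sees token index `i`, so the row order must match the Tokenizer's `word_index`.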

In [None]:
# Recreating embeddings index based on Tokenizer vocabulary
word2vec_vocabulary = embeddings_model.wv.vocab
embeddingIndex = dict()
counter = 0
for word, i in tokenizer.word_index.items():
    if word in word2vec_vocabulary :
        embeddingIndex[word] = embeddings_model.wv[word]  # vectors live on the .wv attribute
    else:
        counter += 1

print("{} words without pre-trained embedding!".format(counter))
    
# Prepare embeddings matrix
embeddingMatrix = zeros((vocabSize, 128))
for word, i in tokenizer.word_index.items():
    embeddingVector = embeddingIndex.get(word)
    if embeddingVector is not None:
        embeddingMatrix[i] = embeddingVector

### <span style="color:steelblue">__Preprocess training text__ </span>

Thanks to the tokenizer we can map each word in a review to its integer index in the vocabulary.<br/>
Then we pad the reviews with zeros to make them all the same size, which will be __40 words__ because the average length is ~33.<br/>
Finally we split the data into training and test sets.
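
The sketch below imitates the indexing and padding steps in plain Python. It is an illustration under stated assumptions: `texts_to_indices` and `pad_post` are hypothetical helpers, not Keras functions, and the toy vocabulary is invented. It follows the Keras defaults used in this notebook, where unknown words are dropped and, with `padding='post'` and the default `truncating='pre'`, reviews longer than `maxlen` keep their last `maxlen` tokens.

```python
def texts_to_indices(tokens, word_index):
    """Map known words to their integer index; unknown words are dropped."""
    return [word_index[t] for t in tokens if t in word_index]


def pad_post(sequence, maxlen):
    """Zero-pad at the end; over-long sequences keep their last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return truncated + [0] * (maxlen - len(truncated))


word_index = {"great": 1, "location": 2, "staff": 3}
tokens = "great location and friendly staff".split()

indices = texts_to_indices(tokens, word_index)
padded = pad_post(indices, maxlen=5)
print(indices)
print(padded)
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))
```

Note that the scoring cells later in the notebook explicitly keep the first 40 words of a long review, so long reviews are cut at a different end there than during training.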

In [None]:
reviews = [ str(r) for r in sentiment_df['reviews'].values]
labels = sentiment_df['labels'].values

oneHotReviews = tokenizer.texts_to_sequences(reviews)
encodedReviews = keras.preprocessing.sequence.pad_sequences(oneHotReviews, maxlen=40, padding='post')

X_train, X_test, y_train, y_test = train_test_split(encodedReviews, labels, test_size=0.33, random_state=42)

### <span style="color:steelblue">__Train the model__ </span>

5 epochs of training will be enough to get an accuracy over 90%

In [None]:
# define neural network
CNN = keras.models.Sequential()
CNN.add(keras.layers.Embedding(vocabSize, 128,weights=[embeddingMatrix], input_length=40, trainable=True))
CNN.add(Conv1D(128, 2, activation='relu'))
CNN.add(GlobalMaxPooling1D())
CNN.add(Flatten())
CNN.add(Dense(1, activation='sigmoid'))
CNN.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
CNN.fit(X_train, y_train, epochs=5, verbose=1)

### <span style="color:steelblue">__Test the model__ </span>

In [None]:
loss, accuracy = CNN.evaluate(X_test, y_test, verbose=1)
print('Test Loss: {}'.format(loss))
print('Test Accuracy: {}'.format(accuracy))

### <span style="color:steelblue">__Confusion matrix__ </span>

We want a clearer view of the classification results, which matter more than overall accuracy in this particular task.

In [None]:
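# Threshold the sigmoid output at 0.5 (predict_classes is not available in recent Keras releases)
predictions = (CNN.predict(X_test) > 0.5).astype('int32')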

cm = confusion_matrix(y_test, predictions, labels=[0,1])
title = 'Confusion matrix'
cmap = plt.cm.Blues
classes=["negative","positive"]
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

fmt = 'd'  # the confusion matrix holds integer counts
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()

### <span style="color:steelblue">__Classification report__ </span>

The classification report complements the confusion matrix with per-class precision, recall, and F1 score.
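
As a rough guide to reading the report, the per-class figures can be derived directly from a confusion matrix. The sketch below is a plain-Python illustration, not sklearn's implementation: `per_class_metrics` is a hypothetical helper and the counts in `toy_cm` are made up. It assumes the sklearn layout, where rows are true classes and columns are predicted classes.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    metrics = {}
    for k in range(len(cm)):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum
        actual_k = sum(cm[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up counts: 50 negative and 50 positive test reviews.
toy_cm = [[40, 10],
          [5, 45]]

for label, values in per_class_metrics(toy_cm).items():
    print(label, {name: round(v, 3) for name, v in values.items()})
```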

In [None]:
report = classification_report(y_test, predictions, target_names=['0','1'])
print(report)

### <span style="color:steelblue">__Extracting sentiment score__ </span>

Thanks to the multilayer perceptron trained above, we can extract positivity/negativity scores from the text reviews.<br/>
The goal is to use these values as numerical features in subsequent analysis.

Calculate positive scores

In [None]:
positive_reviews = [str(r) for r in reviews_df['Positive_Review'].values]

for idx, review in enumerate(positive_reviews):
    words = text_to_word_sequence(review)     
    if(len(words) > 40): 
        words = words[:40]
        positive_reviews[idx] = ' '.join(words)

oneHotPositiveReviews = tokenizer.texts_to_sequences(positive_reviews)
encodedPositiveReviews = keras.preprocessing.sequence.pad_sequences(oneHotPositiveReviews, maxlen=40, padding='post')


# With a sigmoid output, predict() already returns the probabilities
positivity_predictions = CNN.predict(encodedPositiveReviews)

plt.hist(positivity_predictions, bins=30)
plt.xlabel('Positivity scores');
plt.show() 

Calculate negative scores

In [None]:
negative_reviews = [str(r) for r in reviews_df['Negative_Review'].values]

for idx, review in enumerate(negative_reviews):
    words = text_to_word_sequence(review)      
    if(len(words) > 40): 
        words = words[:40]
        negative_reviews[idx] = ' '.join(words)

oneHotNegativeReviews = tokenizer.texts_to_sequences(negative_reviews)
encodedNegativeReviews = keras.preprocessing.sequence.pad_sequences(oneHotNegativeReviews, maxlen=40, padding='post')


# With a sigmoid output, predict() already returns the probabilities
negativity_predictions = CNN.predict(encodedNegativeReviews)
    
# print(negativity_predictions)
plt.hist(negativity_predictions, bins=30)
plt.xlabel('Negativity scores');
plt.show() 


<span style="color:MediumAquaMarine">*__Observations__*</span><br/>

The first thing to notice is that a small amount of data is misclassified: some negative texts receive a positive score and vice versa. Since we know a priori whether a review is positive or not, we can simply move these values to the middle of the range.

Another problem we have to manage is that a guest can give two reviews, one positive and one negative. In particular, if someone leaves just one of the two, we need to consider the worst case.<br/>
Let's look at an example:

- ("The worst hotel", "No Positive") --> (0.345, ?) 

In this case the value of No Positive has to be zero, because if we consider the average score of the two reviews in subsequent analysis, this is the only way to describe a completely non-positive experience. The same concept is valid for "No Negative" reviews.
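
The two correction rules can be sketched per review as below. This is a minimal sketch with hypothetical helper names (`adjust_positive`, `adjust_negative`); the values 0.501 and 0.499 are the "middle" values used in this notebook, and a missing negative review is mapped to 1.0 by the same worst-case reasoning applied in the other direction.

```python
MISSING = {'na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative'}


def adjust_positive(text, score):
    """Score of the positive half of a review."""
    if text in MISSING:
        return 0.0    # nothing positive was written
    if score < 0.5:
        return 0.501  # known positive text scored as negative: move to the middle
    return score


def adjust_negative(text, score):
    """Score of the negative half of a review."""
    if text in MISSING:
        return 1.0    # nothing negative was written
    if score > 0.5:
        return 0.499  # known negative text scored as positive: move to the middle
    return score


# The example above: ("The worst hotel", "No Positive") with a negative score of 0.345.
negative_score = adjust_negative("the worst hotel", 0.345)
positive_score = adjust_positive("no positive", 0.62)  # raw score is ignored here
average = (positive_score + negative_score) / 2
print(negative_score, positive_score, average)
```

With the missing positive half forced to zero, the average for this guest falls to about 0.17 instead of being pulled up by whatever score the model assigns to the placeholder text.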

In [None]:
negative_reviews = reviews_df['Negative_Review'].values
positive_reviews = reviews_df['Positive_Review'].values

missing = ['na', 'nothing', 'none', 'n a', 'no', 'no positive', 'no negative']

for idx, (text,score) in enumerate(zip(positive_reviews, positivity_predictions)):
    if text in missing : positivity_predictions[idx] = 0.0
    elif score < 0.5 : positivity_predictions[idx] = 0.501

for idx, (text,score) in enumerate(zip(negative_reviews, negativity_predictions)):
    if text in missing : negativity_predictions[idx] = 1.0
    elif score > 0.5 : negativity_predictions[idx] = 0.499

# Printing final distributions
fig, axes = plt.subplots(nrows=1, ncols=2, constrained_layout=True)
axes[0].hist(negativity_predictions, color='red', bins=30)
axes[1].hist(positivity_predictions, color='green', bins=30)
plt.show() 

<span style="color:MediumAquaMarine">*__Observations__*</span><br/>

This final visualization shows that while most positive scores tend towards 1, negative ones are spread more evenly across the [0, 0.5] range. This could mean that a large share of negative reviews are not so negative in the end. Maybe some guests who leave a positive review also leave a piece of "advice" as the negative one.<br/>

Moreover, the number of people who leave only a positive review is far larger than the number who leave just a negative one.

To conclude, we can add these two numerical features to our dataset.<br/>
Since every reviewer leaves a negative review and a positive one, a __final average sentiment score__ is extracted.

This value will be used for further analysis instead of the actual reviewer score.



In [None]:
reviews_df['Negative_Review_Score'] = negativity_predictions
reviews_df['Positive_Review_Score'] = positivity_predictions
reviews_df['Sentiment_Review_Score'] = (positivity_predictions+negativity_predictions)/2

positive_reviews = reviews_df['Positive_Review']
negative_reviews = reviews_df['Negative_Review']
sentiment_scores = reviews_df['Sentiment_Review_Score']
reviewer_score = reviews_df['Reviewer_Score']

print('Positive review: {}'.format(reviews_df['Positive_Review'][100]))
print('Negative review: {}'.format(reviews_df['Negative_Review'][100]))
print('Sentiment score: {}'.format(reviews_df['Sentiment_Review_Score'][100]))
print('Reviewer score: {}'.format(reviews_df['Reviewer_Score'][100]))

### <span style="color:steelblue">__Extracting sentiment classes__ </span>

We're going to divide the reviews into four classes, numbered as in the code below:
1. __worst__ : final_score < 0.3
2. __bad__ :  final_score < 0.5 AND final_score >= 0.3 
3. __good__ : final_score < 0.7 AND final_score >= 0.5 
4. __best__ : final_score >= 0.7 
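
The same thresholds can be expressed compactly with a sorted list of cut points. This is an alternative sketch, not the cell used in this notebook: `sentiment_class` is a hypothetical helper, and the class codes 1 (worst) to 4 (best) are assumed to match the ones assigned in the code cell.

```python
from bisect import bisect_right

THRESHOLDS = [0.3, 0.5, 0.7]  # lower bounds of the bad, good, and best classes
NAMES = {1: "worst", 2: "bad", 3: "good", 4: "best"}


def sentiment_class(score):
    """Return 1 (worst) to 4 (best); a score equal to a threshold goes to the upper class."""
    return bisect_right(THRESHOLDS, score) + 1


for score in [0.1, 0.3, 0.49, 0.5, 0.69, 0.7, 0.95]:
    code = sentiment_class(score)
    print(score, code, NAMES[code])
```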

In [None]:
target = []
final_scores = reviews_df['Sentiment_Review_Score'].values

for f in final_scores : 
    if f >= 0.7 : target.append(4)
    elif f < 0.7 and f >= 0.5 : target.append(3)
    elif f < 0.5 and f >= 0.3 : target.append(2)
    else: target.append(1)

reviews_df['Sentiment_Review_Class'] = target

Let's visualize the number of reviews for each class:

In [None]:
reviews_best = reviews_df[reviews_df['Sentiment_Review_Class'] == 4]['Positive_Review'].values
print('Number of best reviews:  {}'.format(len(reviews_best)))
      
reviews_good = reviews_df[reviews_df['Sentiment_Review_Class'] == 3]['Positive_Review'].values
print('Number of good reviews:  {}'.format(len(reviews_good)))
      
reviews_bad = reviews_df[reviews_df['Sentiment_Review_Class'] == 2]['Negative_Review'].values
print('Number of bad reviews:   {}'.format(len(reviews_bad)))
      
reviews_worst = reviews_df[reviews_df['Sentiment_Review_Class'] == 1]['Negative_Review'].values
print('Number of worst reviews: {}'.format(len(reviews_worst)))

fig, ax = plt.subplots()
x = ['Worst','Bad', 'Good', 'Best']
y = [len(reviews_worst),len(reviews_bad),len(reviews_good),len(reviews_best)]
vert_bars = ax.bar(x, y, color='steelblue', align='center')
plt.show()

### <span style="color:steelblue">__Printing best and worst hotels according to sentiment analysis__ </span>

In [None]:
# Highest average sentiment score first
best_hotels = reviews_df.groupby('Hotel_Name')['Sentiment_Review_Score'].mean().sort_values(ascending=False).head(10)
best_hotels.plot(kind="bar",color="DarkGreen")
_=plt.xlabel('Best Hotels according to Reviews')
_=plt.ylabel('Average Review Score')
plt.show()

reviews_df.groupby('Hotel_Name')['Sentiment_Review_Score'].mean().sort_values(ascending=False).head(10)

In [None]:
worst_hotels = reviews_df.groupby('Hotel_Name')['Sentiment_Review_Score'].mean().sort_values(ascending=True).head(10)
worst_hotels.plot(kind="bar",color="DarkRed")
_=plt.xlabel('Worst Hotels according to Reviews')
_=plt.ylabel('Average Review Score')
plt.show()

reviews_df.groupby('Hotel_Name')['Sentiment_Review_Score'].mean().sort_values(ascending=True).head(10)