### Sentiment Analysis
* Sentiment analysis involves determining the sentiment of text.
* In this lab, you will use a hotel review data set that includes reviews and a rating 
 * There are other features that you can ignore, unless you want to use them to improve results
* Your goal is to train a model that can predict the number of stars based on the text
* This is the last programming assignment. We will use similar cleaning and discovery techniques as other assignments
 * ... except we need to add the fun of stop words, stemming / lemmatizing and similar exciting topics.
* Dont forget to save this as a copy in your Google Colab environment



* **Student Name:** TU HOANG

### Get the data
* Either download the data and store it in your drive or use the Kaggle API to obtain the data from
 * https://www.kaggle.com/datasets/datafiniti/hotel-reviews

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
tf.random.set_seed(1)

### Explore and Clean the Data

In [2]:
# Read and set up the data
df1 = pd.read_csv("Datafiniti_Hotel_Reviews.csv")
df2 = pd.read_csv("Datafiniti_Hotel_Reviews_Jun19.csv").drop('reviews.dateAdded', axis=1)
df_review = pd.concat([df1, df2], axis=0)
df_review = df_review[['reviews.text','reviews.rating']]

In [3]:
# Drop any empty rows
df_review.dropna(inplace=True)

Using only two columns which are "review.text" and "review.rating" as string input and labels

In [4]:
df_review.head()

Unnamed: 0,reviews.text,reviews.rating
0,Our experience at Rancho Valencia was absolute...,5.0
1,Amazing place. Everyone was extremely warm and...,5.0
2,We booked a 3 night stay at Rancho Valencia to...,5.0
3,Currently in bed writing this for the past hr ...,2.0
4,I live in Md and the Aloft is my Home away fro...,5.0


In [5]:
# Cast the reviews.rating column to Integer
df_review = df_review.astype({'reviews.rating': 'int32'})

In [6]:
df_review["rating"] = df_review["reviews.rating"].apply(lambda x: 2 if 4 < x <= 5 else 1 if 2 < x <= 4 else 0)

In [7]:
df_review

Unnamed: 0,reviews.text,reviews.rating,rating
0,Our experience at Rancho Valencia was absolute...,5,2
1,Amazing place. Everyone was extremely warm and...,5,2
2,We booked a 3 night stay at Rancho Valencia to...,5,2
3,Currently in bed writing this for the past hr ...,2,0
4,I live in Md and the Aloft is my Home away fro...,5,2
...,...,...,...
9995,My friends and I took a trip to Hampton for th...,4,1
9996,"from check in to departure, staff is friendly,...",5,2
9997,This Hampton is located on a quiet street acro...,5,2
9998,Awesome wings (my favorite was garlic parmesan...,5,2


- The dataframe that we used to feed the model contains two columns which are text and rating for all the reviews.


- Remove all the rows which are missing.


- Grouping the rating into 3 categories: [1,2) is '0', [2,4) is '1', and [4,5] is '2'.

### Train the Model
* Train the model using 90% of the data
* You may choose whichever model technique you choose

In [9]:
# Split the data into test and train data respectively with 10% for test (labels)
train_sentences, val_sentences, train_labels, val_labels = train_test_split(df_review["reviews.text"].to_numpy(), df_review["rating"].to_numpy(), test_size=0.1, random_state=42)

In [21]:
# Setup text vectorization
text_vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=90)
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [22]:
# Creating an Embedding using an Embedding Layer
embedding = layers.Embedding(input_dim=20001, output_dim=512, input_length=90, mask_zero=True)

In [23]:
# Create the LSTM model
LSTM_model = Sequential()
LSTM_model.add(layers.Input(shape=(1,), dtype="string"))
LSTM_model.add(text_vectorizer)
LSTM_model.add(embedding)
LSTM_model.add(layers.LSTM(256))
LSTM_model.add(layers.Dense(64, activation='relu'))
LSTM_model.add(layers.Dense(3, activation='softmax'))

In [24]:
LSTM_model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 90)               0         
 ectorization)                                                   
                                                                 
 embedding_1 (Embedding)     (None, 90, 512)           10240512  
                                                                 
 lstm_4 (LSTM)               (None, 256)               787456    
                                                                 
 dense_4 (Dense)             (None, 64)                16448     
                                                                 
 dense_5 (Dense)             (None, 3)                 195       
                                                                 
Total params: 11,044,611
Trainable params: 11,044,611
Non-trainable params: 0
__________________________________________

In [43]:
LSTM_model.compile(optimizer=tf.keras.optimizers.Adam(),
                   loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                   metrics=["sparse_categorical_accuracy"])

In [44]:
history = LSTM_model.fit(train_sentences, train_labels, epochs=6, validation_data=(val_sentences, val_labels))

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


### Test the Model 
* Test the model using the remaining 10% of the data
* The testing results will depend on the model you use
 * If the rating is evaluated as a number, you need to look at values such as mean square error
 * If you are using categories, then you can use accuracy, but you may want to collapse the categories from 1 to 5 to 3 categories such as bad, neutral, and good.

In [39]:
from sklearn.metrics import accuracy_score

In [40]:
# Getting predicting labels
y_preds = tf.squeeze(tf.round(LSTM_model.predict(val_sentences)))
y_preds = np.argmax(y_preds, 1)



In [41]:
# Calculating the accuracy with the validation data
accuracy_score(val_labels, y_preds)

0.6305

### Provide an explanation of your model and results

* The data is splitting into 90% train and 10% test.
* The model consists of a vector tonkenizer layer, an embedding layer, a LSTM layer, a Dense layer, and an output layer.
* The maximum vocabulary is 20000 and the maximum tokens is 90.
* Output labels are 0, 1, 2 which represent 0-2 rating, 2-4 rating and 4-5 rating respectively from the original ratings criteria.
* Optimizer function is Adam, and loss function is sparse categorical cross-entropy.
* Metric for accuracy is sparse categorical accuracy.
  
* The result is not as good as I expected - the final accuracy is around 63%.
* The current problem of model is that the better it fits the train data, the higher the loss of validation data.
* It highly suggests over-fitting for this model.

### Discuss techniques you could use to improve your model if you had more time

* To reduce over-fitting, we could implement early stopping and drop out additionally.

* Preprocessing could use more help such as handling stopwords, lemming.
  
* The model itself could be improved upon by using different embedding method such as Word2Vec, transfer learning.
* The layers can be extended further with stacked RNN layers.
* Different techniques also possible such as GRU, bidirectional.