# Assignment 5 - Di Riccio Tommaso - 18/05/2024

## BidirectionalLSTM class
To address the task of predicting ratings from review text, I opted for a network architecture based on LSTM units. LSTMs are particularly well-suited for sequential data processing, allowing them to better capture long-term dependencies present in long reviews.

As the first computational layer in the chosen architecture, I used a bidirectional layer featuring LSTM units. This layer processes the input sequence in both forward and backward directions, enabling the model to capture dependencies from both past and future contexts, thereby enhancing its understanding of the review text.
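
As a rough illustration of what processing "in both directions" means, the sketch below replaces the LSTM cell with a toy running sum (an assumption made purely for readability, not the actual LSTM recurrence). The forward and backward states are paired per time step, which mirrors how a bidirectional wrapper concatenates the two outputs.

```python
def run_direction(sequence, reverse=False):
    """Toy recurrence: the state at each step is the sum of the inputs seen so far."""
    steps = reversed(sequence) if reverse else sequence
    state, states = 0, []
    for value in steps:
        state += value
        states.append(state)
    # Re-align the backward states with the original time order
    return states[::-1] if reverse else states

def bidirectional(sequence):
    forward = run_direction(sequence)
    backward = run_direction(sequence, reverse=True)
    # Each time step sees a summary of its past (forward) and of its future (backward)
    return list(zip(forward, backward))

print(bidirectional([1, 2, 3, 4]))
```

With a real LSTM each state is a vector of `lstm_units` values, so the concatenated output has twice that size.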

The second layer of the model is a standard LSTM layer used to further process the information gathered from the bidirectional layer. I chose a plain LSTM layer instead of a bidirectional one to have a deeper network without introducing excessive computational complexity.

For the output layer, softmax is used, treating the problem as a classification task with 9 classes (i.e., ratings from 1 to 9).
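
To make the classification framing concrete, here is a minimal pure-Python sketch of a softmax output and the categorical cross-entropy loss for a single review. The logits are hypothetical values chosen for illustration; class index 0 corresponds to rating 1.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

logits = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 2.0]  # hypothetical scores for ratings 1..9
probabilities = softmax(logits)
predicted_rating = probabilities.index(max(probabilities)) + 1  # shift index 0..8 to rating 1..9

target = [0] * 9
target[8] = 1  # the true rating is 9
loss = categorical_crossentropy(target, probabilities)
print(predicted_rating, round(loss, 3))
```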

This architecture is encapsulated in the <code>BidirectionalLSTM</code> class. This class allows instantiation of the Keras model by passing the pre-trained embeddings matrix, the number of units to use, and the dropout parameter to control regularization. Additionally, the class wraps the most useful methods of a Keras model.

In [None]:
import keras

class BidirectionalLSTM:
    def __init__(self, num_words, embedding_dim, embedding_matrix, lstm_units=10, dropout=0.0, recurrent_dropout=0.0):
            """
            Constructor for the Bidirectional LSTM class.

            :param num_words: The size of the vocabulary.
            :param embedding_dim: The dimensionality of the embedding vectors.
            :param embedding_matrix: The pretrained embedding matrix.
            :param lstm_units: The number of LSTM units.
            :param dropout: The dropout rate.
            :param recurrent_dropout: The recurrent dropout rate.
            """
            # Define the model architecture
            bidirLSTM = keras.models.Sequential()
            # Add an input layer
            bidirLSTM.add(keras.layers.Input(shape=(None,), dtype="int32"))
            # Add an embedding layer to convert input sequences to dense vectors
            bidirLSTM.add(
                keras.layers.Embedding(
                    input_dim=num_words,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    trainable=False,
                    mask_zero=True,
                )
            )

            # Add a Bidirectional LSTM layer
            bidirLSTM.add(keras.layers.Bidirectional(keras.layers.LSTM(units=lstm_units, return_sequences=True, dropout=dropout, recurrent_dropout=recurrent_dropout)))

            # Add a LSTM layer
            bidirLSTM.add(keras.layers.LSTM(units=lstm_units, return_sequences=False, dropout=dropout, recurrent_dropout=recurrent_dropout))

            # Add a dense output layer
            bidirLSTM.add(keras.layers.Dense(units=9, activation="softmax"))

            # Optimizer
            optimizer = keras.optimizers.Adam(learning_rate=0.001)

            # Compile the model
            bidirLSTM.compile(
                loss="categorical_crossentropy", optimizer=optimizer,
            )

            self._bidirLSTM = bidirLSTM


    def fit(self, inputs, targets, **kwargs):
        """
        Fit the model to the given inputs and targets.

        :param inputs: the input data.
        :param targets: the target data.
        :param kwargs: additional arguments to pass to the fit method.
        """
        self._bidirLSTM.fit(inputs, targets, **kwargs)

    def summary(self):
        """
        Print a summary of the model.
        """
        self._bidirLSTM.summary()

    def evaluate(self, inputs, targets):
        """
        Evaluate the model.

        :param inputs: the data to evaluate the model on
        :param targets: the target values
        :return: the score of the model
        """
        return self._bidirLSTM.evaluate(inputs, targets)

    def predict(self, data):
        """
        Make prediction over data.

        :param data: the data to predict
        :return: the predicted values
        """
        return self._bidirLSTM.predict(data)

## Utils methods

The following code defines all the utility methods used to preprocess the text. In particular:
- **<code>Lowercase</code>**: this method converts all text to lowercase, ensuring uniformity in the text data.

- **<code>Remove Stopwords</code>**: this method removes common stopwords from the text using NLTK's stopwords.

- **<code>Remove Non-Alphanumeric Characters</code>**: this method removes non-alphanumeric characters from the text, retaining only letters, numbers, and spaces.

- **<code>Remove Contractions</code>**: this method expands contractions in the text, replacing them with their full forms. For example, "don't" becomes "do not".

- **<code>Tokenize</code>**: this method tokenizes the text data, converting words into integers, and pads sequences to ensure uniform length for input into neural networks. The vocabulary created is returned. It utilizes TensorFlow's Tokenizer and pad_sequences functions.
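
The order in which these steps are applied matters. The sketch below uses a tiny, hypothetical contraction map in place of the `contractions` package to show why contractions should be expanded before non-alphanumeric characters are removed.

```python
# Hypothetical, tiny contraction map; the notebook relies on the `contractions` package instead
CONTRACTIONS = {"don't": "do not", "wasn't": "was not", "i'm": "i am"}

def expand_contractions(text):
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())

def strip_non_alphanumeric(text):
    return "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)

review = "I'm sure the crew wasn't rude!"

# Lowercase -> expand contractions -> strip punctuation
good = strip_non_alphanumeric(expand_contractions(review.lower()))
# Stripping punctuation first destroys the apostrophes that the expansion step needs
bad = expand_contractions(strip_non_alphanumeric(review.lower()))

print(good.split())
print(bad.split())
```

In the second ordering the negation survives only as the fragments `wasn` and `t`, which are unlikely to have a useful embedding.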


In [None]:
!pip install contractions



In [None]:
import nltk
from nltk.corpus import stopwords
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd
import contractions

nltk.download('stopwords')

def lowercase(text):
    return text.lower()

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = text.split()
    filtered_words = [word for word in words if (word not in stop_words)]
    return ' '.join(filtered_words)

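def remove_non_alphanumeric(text):
    # Replace every character that is not a letter, digit, or whitespace with a space
    return ''.join([char if char.isalnum() or char.isspace() else ' ' for char in text])

def remove_contractions(text):
    # Expand contractions word by word (e.g. "don't" -> "do not")
    words = text.split()
    filtered_words = [contractions.fix(word) for word in words]
    return ' '.join(filtered_words)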


def tokenize(sequences_train, sequences_test, vocab_size, max_sequence_length):
    # Tokenization: only words with an index below vocab_size are kept in the sequences
    tokenizer = Tokenizer(num_words=vocab_size, oov_token=None)
    concat = pd.concat([sequences_train, sequences_test])
    tokenizer.fit_on_texts(concat)
    sequences_as_integers_train = tokenizer.texts_to_sequences(sequences_train)
    sequences_as_integers_test = tokenizer.texts_to_sequences(sequences_test)

    # Padding
    padding_type = 'post'
    truncation_type = 'post'
    x_train = pad_sequences(sequences_as_integers_train, maxlen=max_sequence_length, padding=padding_type, truncating=truncation_type)
    x_test = pad_sequences(sequences_as_integers_test, maxlen=max_sequence_length, padding=padding_type, truncating=truncation_type)

    return x_train, x_test, tokenizer.word_index

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Preprocessing and Test

In [None]:
# DATASET LOAD
import pandas as pd

!gdown 1Y32ppP0cTXEiXejPaIglZaTbUuOf3xyS

dataset = pd.read_csv("/content/Airline_Reviews.csv")

dataset = dataset[['Review', 'Overall_Rating']]

Downloading...
From: https://drive.google.com/uc?id=1Y32ppP0cTXEiXejPaIglZaTbUuOf3xyS
To: /content/Airline_Reviews.csv
  0% 0.00/20.5M [00:00<?, ?B/s] 69% 14.2M/20.5M [00:00<00:00, 140MB/s]100% 20.5M/20.5M [00:00<00:00, 168MB/s]


The first step in the data preprocessing pipeline is to remove all dataset entries whose rating is missing or non-numeric.

Then, the remaining entries undergo further preprocessing steps using the utility methods <code>lowercase</code>, <code>remove_contractions</code> and <code>remove_non_alphanumeric</code>.

After running some tests, I decided to keep the stopwords as they provide important context necessary to maintain an accurate representation of the original text. For instance, stopwords like "not" can completely reverse the polarity of a sentence.
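
A minimal sketch of the problem, using a small hypothetical subset of an English stopword list (NLTK's English list also contains "not"):

```python
# Hypothetical subset of an English stopword list
STOP_WORDS = {"the", "was", "not", "a", "and"}

def drop_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

negative = drop_stopwords("the flight was not comfortable")
positive = drop_stopwords("the flight was comfortable")
print(negative)
print(positive)
```

Both sentences collapse to the same string, so the model could no longer tell them apart.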

Stemming was also tried, but it was later dropped in favor of pretrained embeddings.

In [None]:
# DATA PREPROCESSING

# Discard entries with a missing or non-numeric rating
mask = dataset["Overall_Rating"].apply(lambda x: str(x).isnumeric())
dataset = dataset[mask].copy()

# Lowercasing
dataset["Review"] = dataset["Review"].apply(lowercase)

# Remove contractions
dataset["Review"] = dataset["Review"].apply(remove_contractions)

# Remove non-alphanumeric characters
dataset["Review"] = dataset["Review"].apply(remove_non_alphanumeric)

After preprocessing, the target values are one-hot encoded and the dataset is split into a training set and a test set (90% - 10%).
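
The split in this notebook is stratified (`stratify=y`), so each rating keeps roughly the same share in both sets. The following is a simplified pure-Python sketch of that idea on hypothetical labels; it is not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices), sampling each label separately."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

labels = [1] * 50 + [5] * 30 + [9] * 20  # hypothetical, imbalanced ratings
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(dict(test_counts))
```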

In [None]:
# DATA SPLIT
from sklearn.model_selection import train_test_split

x = dataset["Review"]

# Vote one-hot encoding
y = pd.get_dummies(dataset["Overall_Rating"])

# Split dataset in train and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, stratify=y)

At this point, all the reviews are tokenized and converted into sequences of integers, with each integer representing the index of the corresponding word in the constructed vocabulary. I chose a vocabulary size of 10,000 because it strikes a good balance between capturing sufficient word diversity and ignoring very rare words that are not useful for the task.
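
The sketch below mimics, in plain Python, how a frequency-ranked vocabulary with a size limit behaves. It is a simplified stand-in for the Keras `Tokenizer`, which likewise keeps the full `word_index` and applies the size limit only when converting texts to sequences. The helper names and the alphabetical tie-breaking are assumptions of this sketch.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically here
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; rarer words are dropped."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["good flight good crew", "bad flight", "good food"]
toy_index = build_word_index(texts)
toy_sequences = texts_to_sequences(texts, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```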

The sequences are then padded or truncated to have a fixed length of 200 elements. This choice strikes a good balance between representing the majority of reviews in their entirety and avoiding excessive padding that could lead to increased memory usage.
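
As a small illustration of the `'post'` padding and truncation used here, the hypothetical helper below trims a sequence at the end and then fills it with zeros at the end. Index 0 is reserved for padding.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Truncate at the end, then pad at the end."""
    truncated = sequence[:maxlen]
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([5, 3, 8], maxlen=5)
long = pad_post([5, 3, 8, 2, 9, 4, 7], maxlen=5)
print(short)  # [5, 3, 8, 0, 0]
print(long)   # [5, 3, 8, 2, 9]
```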

In [None]:
# TOKENIZATION AND VOCABULARY CREATION
x_train, x_test, word_index = tokenize(x_train, x_test, vocab_size=10000, max_sequence_length=200)

Once the vocabulary has been built, the matrix of pretrained embeddings can be created. Pretrained embeddings were chosen to greatly reduce the number of trainable parameters and to improve model performance through transfer learning. In particular, GloVe embeddings of size 200 are used.

Pretrained embeddings exist for 78.34% of the tokens found in the preprocessed review text. Although 21.66% of tokens lack a pretrained embedding, this is not a significant concern: with a vocabulary size of 10,000, it is highly likely that these tokens are excluded from the input sequences anyway. On inspection, they mainly consist of misspelled words, non-English terms (mostly Italian and Arabic) and acronyms, so their absence is not a major issue for the model's performance.
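
One way to check this reasoning would be to measure coverage only over the most frequent words, since a frequency-ranked vocabulary assigns low indices to frequent words. The sketch below uses hypothetical toy data and a hypothetical `coverage` helper.

```python
def coverage(word_index, embeddings, max_index=None):
    """Share of vocabulary words (optionally only those with index < max_index) that have a vector."""
    words = [w for w, i in word_index.items() if max_index is None or i < max_index]
    found = sum(1 for w in words if w in embeddings)
    return found / len(words)

# Hypothetical toy data: frequent words have vectors, a rare misspelling does not
toy_word_index = {"flight": 1, "crew": 2, "seat": 3, "klm": 4, "fligth": 5}
toy_embeddings = {"flight": [0.1], "crew": [0.2], "seat": [0.3], "klm": [0.4]}

full_coverage = coverage(toy_word_index, toy_embeddings)
top_coverage = coverage(toy_word_index, toy_embeddings, max_index=4)
print(full_coverage)  # whole vocabulary
print(top_coverage)   # only the 3 most frequent words
```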

In [None]:
# PRE-TRAINED EMBEDDINGS
import numpy as np

!gdown 1Qf6gBW6omI8Fwvew6H1wfvce9qXXuPd-

embeddings_index = {}
# Each line holds a word followed by the components of its vector
with open("/content/glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# EMBEDDING MATRIX
num_words = len(word_index) + 1
embedding_dim = 200

found = 0
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Words not found in the embedding index keep their all-zeros row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        found += 1

print("Percentage of words found in pre-trained embeddings: ", np.around(found/num_words*100,2), "%")

Downloading...
From (original): https://drive.google.com/uc?id=1Qf6gBW6omI8Fwvew6H1wfvce9qXXuPd-
From (redirected): https://drive.google.com/uc?id=1Qf6gBW6omI8Fwvew6H1wfvce9qXXuPd-&confirm=t&uuid=1dedfa84-2e46-4f99-8d73-3463402cb9a5
To: /content/glove.6B.200d.txt
100% 693M/693M [00:05<00:00, 126MB/s]
Percentage of words found in pre-trained embeddings:  78.34 %


One of the disadvantages of bidirectional LSTM is the time needed for training. Given the limited hardware at my disposal (a laptop without a dedicated GPU), I initially attempted networks with a limited number of LSTM units (i.e., the dimension of the inner cells in LSTM). After some trials, I realized that this approach wouldn't be feasible, so I switched to using Google Colab to leverage the provided GPU. The improvement in training performance was significant, allowing me to test networks with many more units.

Unfortunately, it still wasn't possible to conduct a grid search due to the limitation that Colab terminates the session if the user does not interact with the page for a few minutes. Therefore, I started manually testing different configurations varying:
- the BidirectionalLSTM class parameters
- vocabulary size
- pretrained embeddings dimension
- maximum sequence length
- learning rate
- the parameters of the fit function
- the optimizer used (Adam and SGD)
- the preprocessing pipeline

In the end, one of the best-performing networks (although some others were very close) is the one summarized below.

In [None]:
# MODEL TRAINING
bidirLSTM = BidirectionalLSTM(num_words, embedding_dim, embedding_matrix, lstm_units=75, dropout=0.5, recurrent_dropout=0.0)
bidirLSTM.summary()
bidirLSTM.fit(x_train, y_train, batch_size=64, epochs=15, validation_split=0.0, verbose=1)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 200)         6647800   
                                                                 
 bidirectional (Bidirection  (None, None, 150)         165600    
 al)                                                             
                                                                 
 lstm_1 (LSTM)               (None, 75)                67800     
                                                                 
 dense (Dense)               (None, 9)                 684       
                                                                 
Total params: 6881884 (26.25 MB)
Trainable params: 234084 (914.39 KB)
Non-trainable params: 6647800 (25.36 MB)
_________________________________________________________________
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15


This is the output of the final retraining on both the training and the validation data. Early stopping was not used in the final retraining because no validation set remained, but during model selection 15 epochs gave a good training loss without overfitting, thanks to the high dropout rate.

It is important to note that recurrent dropout (i.e., dropout applied to the recurrent state between time steps) must be set to zero for Keras to use its fast cuDNN LSTM implementation on the GPU. Despite this, regular dropout (i.e., dropout on the layer inputs) was enough to avoid overfitting and proved a better choice than weight regularizers, at least in the tested configurations.
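
The parameter counts in the model summary above can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias) together with the values from this notebook; the helper name is my own.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

embedding_dim, units, num_classes = 200, 75, 9

bidirectional = 2 * lstm_params(embedding_dim, units)  # forward + backward LSTM
second_lstm = lstm_params(2 * units, units)            # input is the concatenated 150-dim output
dense = units * num_classes + num_classes              # weights + biases
trainable = bidirectional + second_lstm + dense

print(bidirectional, second_lstm, dense, trainable)
# 165600 67800 684 234084
```

The frozen embedding layer accounts for the remaining 6,647,800 parameters, that is, 33,239 rows (`len(word_index) + 1`) of 200 values each.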

In [None]:
# COMPUTE THE MEAN DIFFERENCE BETWEEN THE PREDICTED RATING AND THE ACTUAL RATING
import numpy as np

# Get the predicted votes from the model
predicted_prob = bidirLSTM.predict(x_test)

# Convert the predicted votes to the actual vote labels
predicted_rating = np.argmax(predicted_prob, axis=1) + 1

actual_rating = np.argmax(y_test.values, axis=1) + 1

# Compute the difference between the predicted labels and the actual labels
difference = np.abs(predicted_rating - actual_rating)

# show vote distribution
pred_rating = np.zeros(9, dtype=int)
for i in predicted_rating:
    pred_rating[i-1] += 1

act_rating = np.zeros(9, dtype=int)
for i in actual_rating:
    act_rating[i-1] += 1

print("Distribution of predicted ratings:", pred_rating)
print("Distribution of actual ratings: ", act_rating)

# Compute the mean difference and accuracy
mean_difference = np.around(np.mean(difference),2)
accuracy = np.around(np.sum(difference == 0) / len(difference), 2)

print("Mean difference between predicted and actual ratings: ", mean_difference)
print("Fraction of ratings predicted exactly: ", accuracy)

Distribution of predicted ratings: [1533   58   31   31   52   82  111  187  148]
Distribution of actual ratings:  [1159  230  136   86   83   67  119  176  177]
Mean difference between predicted and actual ratings:  1.34
Fraction of ratings predicted exactly:  0.56


The final model on the test set is able to exactly predict 56% of ratings. The mean distance between the predicted rating and the actual rating is 1.34.

From the distribution of predicted ratings, it can be seen that the model is biased towards rating 1, which it predicts 1533 times against 1159 actual occurrences. This is probably a consequence of the dataset containing many more entries for this rating.
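
To put these numbers in context, the short computation below uses the two distributions printed above to compare the share of rating 1 among predictions and among actual labels. The latter is also the accuracy a trivial model would reach by always predicting rating 1.

```python
predicted = [1533, 58, 31, 31, 52, 82, 111, 187, 148]  # from the output above
actual = [1159, 230, 136, 86, 83, 67, 119, 176, 177]

total = sum(actual)
share_predicted_1 = predicted[0] / total
share_actual_1 = actual[0] / total  # accuracy of always predicting rating 1

print(total, round(share_predicted_1, 2), round(share_actual_1, 2))
# 2233 0.69 0.52
```

Under this reading, the 56% exact-match rate is only a few points above the majority-class baseline of about 52%.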

In [None]:
# MAX DIFFERENCE BETWEEN PREDICTED AND ACTUAL RATING
max_difference_index = np.argmax(difference)

print("Actual vote: " + str(actual_rating[max_difference_index]))
print("Predicted vote: " + str(predicted_rating[max_difference_index]))

print("Review text: ")
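# Map token indices back to words once, instead of searching the vocabulary for every token
index_word = {index: word for word, index in word_index.items()}
# Print the review with the maximum difference, stopping at the first padding token
for i in x_test[max_difference_index]:
    if i == 0:
        break
    print(index_word[i], end=' ')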

Actual vote: 1
Predicted vote: 9
Review text: 
my wife and i recently may 6 2023 flew klm from split through amsterdam to atlanta to return home my wife had a broken and a full leg cast so she could not fly economy the flights attendants on both flights kl kl worked to accommodate both her situation to be as comfortable as possible as well as to allow me to be fairly close to assist i wish to commend all the crew for the assistance in this difficult situation 

The cell above prints the review with the maximum distance between the predicted rating and the actual rating. The rating is clearly inconsistent with the text: the review is highly positive, but the actual vote is strongly negative. The same observation can be made for other reviews.

## Considerations
In this dataset and task, there are multiple challenges, such as:
- Subjectivity of reviews, especially with ratings in a range from 1 to 9.
- Errors in the dataset, like the one previously shown.
- Reviews in other languages (like Italian and Arabic), which are difficult to detect and remove during preprocessing.

Given these challenges, a more elaborate and specifically engineered preprocessing of the review text would likely improve performance more than a different architecture. Various alternatives were attempted, such as treating the problem as a regression task, using a CNN after the LSTM (as proposed by <a href="https://www.sciencedirect.com/science/article/pii/S187705091830601X">Xiaobin Zhang et al.</a>) or adjusting the number of LSTM layers. However, none of these approaches improved performance on this task.