# Deep Learning - Master's in Artificial Intelligence (UDC)
## Amazon Reviews Sentiment Classification using RNNs
Date: 20/03/2025

Authors:

Paula Biderman Mato

Celia Hermoso Soto

# Introduction
In this part of the assignment, the main objective is to build and evaluate several Recurrent Neural Networks (RNNs) for text classification. The dataset consists of Amazon customer reviews labeled with positive or negative sentiment. The goal is to train deep learning models that accurately predict whether a review is positive (4 or 5 stars) or negative (1 or 2 stars).

As we have seen throughout this course, RNNs are a type of deep learning architecture well suited to sequential data such as text. They capture temporal dependencies by maintaining a hidden state that is updated at each time step and passed on to the next. In this assignment, we use three RNN-based architectures: a simple RNN, an LSTM, and a bidirectional LSTM. These models are trained with Keras on a preprocessed version of the Amazon reviews dataset.
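
To make the hidden-state idea concrete, here is a minimal NumPy sketch of the recurrence a simple RNN applies at each time step, h_t = tanh(x_t W_x + h_{t-1} W_h + b). The weights are random and the names (`rnn_forward`, `W_x`, `W_h`, `b`) are illustrative assumptions; this is not how Keras implements the layer internally, only the same idea in a few lines.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over one sequence and return the final hidden state.

    inputs has shape (timesteps, input_dim).
    """
    h = np.zeros(W_h.shape[0])            # initial hidden state h_0 = 0
    for x_t in inputs:                    # one token vector per time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 8, 4     # toy sizes, assumed for illustration
inputs = rng.normal(size=(timesteps, input_dim))
W_x = rng.normal(size=(input_dim, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

final_state = rnn_forward(inputs, W_x, W_h, b)
print(final_state.shape)  # (4,)
```

The final hidden state summarizes the whole sequence, which is why a single Dense layer can be placed on top of it for classification.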

Unlike image classification, where CNNs exploit spatial structure, text classification requires models that account for the sequential nature of language. RNNs, and LSTMs in particular, are designed to capture such dependencies, which makes them an appropriate choice for this task. In this notebook, we compare the architectures in terms of classification accuracy.

To reuse functions and code outside a Jupyter environment, the notebook can be converted into a Python script using the nbconvert tool. This is particularly useful when we want to organize reusable components, such as data preprocessing or model definitions, into importable modules. The conversion is done with a single command, run from a notebook cell: `!jupyter nbconvert --to script generateAmazonDataset.ipynb`
This command transforms the notebook into a .py file by extracting all code cells and converting markdown cells into Python comments. The resulting script can then be edited, imported into other projects, or executed directly as a standalone Python program.

In [None]:
!jupyter nbconvert --to script generateAmazonDataset.ipynb

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from generateAmazonDataset import readData, transformData
import matplotlib.pyplot as plt

# Load dataset
We use the notebook provided by the instructors to load the dataset: generateAmazonDataset.ipynb. It contains two functions: readData and transformData. The readData function returns the training and test sets in raw text format, along with their binary sentiment labels. `__label__1` corresponds to negative reviews (1 or 2 stars) and `__label__2` to positive reviews (4 or 5 stars). Neutral reviews (3 stars) have been excluded.
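
As an illustration of the label convention above, the sketch below parses one line in the `__label__N text` layout. The internals of readData are not shown in this notebook, so the helper name `parse_line`, the one-review-per-line layout, and the 0/1 encoding are all assumptions made for this example.

```python
def parse_line(line):
    """Split a '__label__N text' line into (text, binary_label)."""
    label, text = line.strip().split(" ", 1)
    # Assumed encoding: __label__1 -> 0 (negative), __label__2 -> 1 (positive)
    return text, int(label == "__label__2")

# Hypothetical sample line, not taken from the dataset
sample = "__label__2 Great headphones: clear sound and a comfortable fit."
text, y = parse_line(sample)
print(y, text)
```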


In [None]:
train_texts, train_labels, test_texts, test_labels = readData()

In [1]:
# We print the number of samples in each set to confirm the data has been correctly loaded.
print(f"Train samples: {len(train_texts)}")
print(f"Test samples: {len(test_texts)}")

NameError: name 'train_texts' is not defined

# Transform data using TextVectorization
To use text as input for a neural network, we must first convert it into numeric form. We use the Keras layer TextVectorization, which converts raw text into sequences of integers, where each integer identifies a word in a vocabulary limited to 'max_features' entries. The parameters are:
* max_features: the number of most frequent words to keep in the vocabulary.
* seq_length: the number of tokens per input review. Shorter reviews are padded with zeros and longer reviews are truncated.
* embedding_dim: the size of the dense vector that each token will be mapped to in the Embedding layer.
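
The role of max_features and seq_length can be sketched in plain Python. This is a simplified imitation of what TextVectorization does, not the Keras implementation: it only lowercases and splits on whitespace, and the helper names `build_vocab` and `vectorize` are assumptions for this example. As in Keras, id 0 is reserved for padding and id 1 for out-of-vocabulary words.

```python
from collections import Counter

def build_vocab(texts, max_features):
    """Keep the most frequent words; ids 0 and 1 are reserved for padding and unknown words."""
    counts = Counter(word for t in texts for word in t.lower().split())
    words = [w for w, _ in counts.most_common(max_features - 2)]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, seq_length):
    """Map words to ids, truncate to seq_length, then pad with zeros on the right."""
    ids = [vocab.get(w, 1) for w in text.lower().split()][:seq_length]
    return ids + [0] * (seq_length - len(ids))

texts = ["great product great price", "great product", "bad"]
vocab = build_vocab(texts, max_features=4)   # room for only two real words
print(vocab)                                               # {'great': 2, 'product': 3}
print(vectorize("great product arrived", vocab, seq_length=5))  # [2, 3, 1, 0, 0]
```

The word "arrived" is not in the vocabulary, so it maps to 1, and the two trailing zeros are padding up to the fixed length.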


In [None]:
max_features = 10000
seq_length = 300
embedding_dim = 64

X_train, y_train, X_test, y_test, vectorizer = transformData(
    train_texts, train_labels, test_texts, test_labels,
    max_features=max_features,
    output_sequence_length=seq_length
)

After transformation, each input review is a sequence of integers of fixed length. These inputs are now ready to be used in neural networks.

# Build Model 1: SimpleRNN
The first model uses a SimpleRNN layer, the most basic type of recurrent layer. It maintains a hidden state that is updated at every time step, allowing the model to capture dependencies in the text. However, it tends to struggle with long-term dependencies, a limitation that the LSTM-based models in the following sections are designed to reduce.

Architecture:
* Embedding layer: transforms integer word indices into dense vectors of fixed size (embedding_dim)
* SimpleRNN: processes the sequence one word at a time and outputs a hidden state
* Dense: final classification layer with sigmoid activation (since this is a binary classification task)
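
As a sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand from the hyperparameters defined earlier (max_features = 10000, embedding_dim = 64, 64 recurrent units). The arithmetic below is a sketch based on the standard layer definitions; the totals should match what `model.summary()` reports for these layers.

```python
max_features, embedding_dim, units = 10000, 64, 64

embedding_params = max_features * embedding_dim       # one vector per vocabulary id
rnn_params = units * (embedding_dim + units + 1)      # input weights + recurrent weights + bias
dense_params = units + 1                              # one weight per hidden unit + bias

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# 640000 8256 65 648321
```

Almost all parameters sit in the Embedding layer, so the vocabulary size dominates the model size far more than the recurrent layer does.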


In [None]:
model_rnn = keras.Sequential([
    layers.Embedding(max_features, embedding_dim, input_length=seq_length),
    layers.SimpleRNN(64),
    layers.Dense(1, activation='sigmoid')
])

model_rnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_rnn = model_rnn.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

# Build Model 2: LSTM + Dropout
This model uses an LSTM layer, a gated type of RNN that is better able to learn long-term dependencies. We add a Dropout layer after the LSTM to reduce overfitting and improve generalization.

Architecture:
* Embedding layer
* LSTM: more robust to vanishing gradients than SimpleRNN, which helps on longer sequences
* Dropout: randomly sets 50% of the LSTM output units to zero during training; it is inactive at inference
* Dense: output layer for binary classification
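
The effect of the Dropout layer during training can be sketched in NumPy. This is a minimal illustration of inverted dropout, assuming a vector of ones as a stand-in for the LSTM output; the function name `dropout` is chosen for this example. Surviving units are rescaled by 1 / (1 - rate) so that the expected activation stays the same.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a random subset of units and rescale the rest."""
    keep = rng.random(x.shape) >= rate    # True for units that survive
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
lstm_output = np.ones(64)                 # stand-in for the 64 LSTM output units
dropped = dropout(lstm_output, rate=0.5, rng=rng)

# Each unit is now either 0.0 (dropped) or 2.0 (kept and rescaled)
print(np.unique(dropped))
```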

In [None]:
model_lstm = keras.Sequential([
    layers.Embedding(max_features, embedding_dim, input_length=seq_length),
    layers.LSTM(64, return_sequences=False),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

model_lstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_lstm = model_lstm.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

# Build Model 3: Bidirectional LSTM
In this model, we wrap the LSTM layer in a Bidirectional wrapper. The model then reads the input sequence both forward and backward, so the representation of a review combines context from both directions. This is useful when the meaning of a word depends on the words that follow it.

Architecture:
* Embedding layer
* Bidirectional LSTM: processes the sequence in both directions
* Dense: final output layer
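
The cost of reading in both directions can be quantified with a short calculation. The sketch below assumes the default behavior of the Bidirectional wrapper, which concatenates the outputs of two independent LSTMs, and uses the standard LSTM parameter formula with the sizes from this notebook.

```python
embedding_dim, units = 64, 64

# An LSTM has four weight sets (three gates plus the candidate state),
# each with input weights, recurrent weights and a bias.
lstm_params = 4 * units * (embedding_dim + units + 1)
bilstm_params = 2 * lstm_params          # independent forward and backward LSTMs
bilstm_output_dim = 2 * units            # the two output vectors are concatenated
dense_params = bilstm_output_dim + 1     # one weight per input unit + bias

print(lstm_params, bilstm_params, bilstm_output_dim, dense_params)
# 33024 66048 128 129
```

The bidirectional model therefore has twice the recurrent parameters of Model 2, and its Dense layer receives a 128-dimensional vector instead of a 64-dimensional one.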

In [None]:
model_bilstm = keras.Sequential([
    layers.Embedding(max_features, embedding_dim, input_length=seq_length),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation='sigmoid')
])

model_bilstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_bilstm = model_bilstm.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

# Visualization of training results
We plot the accuracy of each model over the training epochs to visually compare their performance. This helps to detect underfitting, overfitting, and general learning trends.

In [None]:
def plot_history(history, title):
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='val')
    plt.title(title)
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend()
    plt.grid(True)
    plt.show()

plot_history(history_rnn, 'SimpleRNN Accuracy')
plot_history(history_lstm, 'LSTM Accuracy')
plot_history(history_bilstm, 'Bidirectional LSTM Accuracy')

# Final evaluation and conclusions
In this section we show the final results on the test set. The main metric used is accuracy. We compare the performance of each architecture to assess which one generalizes better.
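
For reference, the accuracy metric reported for a sigmoid output can be reproduced by thresholding the predicted probabilities at 0.5 and comparing with the true labels. The sketch below uses made-up probabilities purely for illustration; they are not results from the models in this notebook, and the function name `binary_accuracy` is chosen for this example.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

# Illustrative values only, not model outputs
y_true = [1, 0, 1, 1, 0]
y_prob = [0.91, 0.20, 0.45, 0.77, 0.63]
print(binary_accuracy(y_true, y_prob))  # 0.6
```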

In [2]:
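# evaluate() returns [loss, accuracy] for each model, matching the loss and metrics passed to compile().
print("Final Evaluation:")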
print("SimpleRNN:", model_rnn.evaluate(X_test, y_test, verbose=0))
print("LSTM:", model_lstm.evaluate(X_test, y_test, verbose=0))
print("Bidirectional LSTM:", model_bilstm.evaluate(X_test, y_test, verbose=0))

Final Evaluation:


NameError: name 'model_rnn' is not defined