
# Yelp Polarity Sentiment Analysis Using Deep Learning Models With Different Embedding Methods

#### Luopeiwen Yi

### 1. Introduction
Sentiment classification is a fundamental task in natural language processing (NLP), with applications in customer feedback analysis, brand reputation monitoring, and automated content moderation. In this study, we implemented and compared multiple deep learning and traditional machine learning models for binary sentiment classification on the Yelp Polarity dataset. The objective was to evaluate how different architectures and embedding strategies impact classification performance. The models tested include Logistic Regression, Vanilla RNN, and LSTM, each paired with two embedding strategies: learnable embeddings (LE) and pretrained Word2Vec embeddings (W2V). Performance was assessed using accuracy, precision, recall, F1-score, and loss.

### 2. Background

#### 2.1 Model Selection
We selected three different model architectures with increasing complexity:
- **Logistic Regression (Baseline):** Serves as a fundamental benchmark. It transforms text into a numerical representation by averaging word embeddings and applies logistic regression for classification.
- **Vanilla RNN:** Processes input text as a sequence of word embeddings and captures temporal dependencies through recurrent connections. However, RNNs struggle with long-term dependencies due to vanishing gradients.
- **LSTM (Long Short-Term Memory):** An advanced form of RNN designed to handle long-range dependencies more effectively through its gating mechanisms. It is expected to outperform Vanilla RNN due to its ability to retain information across longer text sequences.

#### 2.2 Word Embedding Strategies
- **Learnable Embeddings (LE):** The model initializes an embedding layer with random weights and updates them during training. This allows the embeddings to be optimized for the specific dataset, potentially improving classification performance.
- **Pretrained Word2Vec Embeddings (W2V):** The embeddings are trained separately with Word2Vec on the training split of the dataset (not downloaded from an external corpus) before being loaded into the model. This strategy leverages semantic relationships between words and can help reduce overfitting and improve generalization.

### 3. Experiment Setup

#### 3.1 Dataset Preparation
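- The **Yelp Polarity dataset** from Hugging Face was used (560,000 training reviews and 38,000 test reviews).
- The text was lowercased, tokenized on whitespace, and converted into sequences of word indices.
- Sequences were truncated or right-padded to a uniform length of 266 tokens (twice the average review length).
- The original training set was split 90/10 into training (504,000) and validation (56,000) sets; the provided test set was kept separate.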
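
The preprocessing steps above can be sketched on a toy corpus. This is a minimal, dependency-free illustration; the helper names `build_vocab` and `encode` and the example sentences are hypothetical, and the ids 0 and 1 are assumed to be reserved for padding and unknown words.

```python
def build_vocab(tokenized_sentences):
    """Map each distinct training token to an integer id (0 = <PAD>, 1 = <UNK>)."""
    word2idx = {"<PAD>": 0, "<UNK>": 1}
    for sentence in tokenized_sentences:
        for word in sentence:
            # setdefault only assigns a new id the first time a word is seen
            word2idx.setdefault(word, len(word2idx))
    return word2idx


def encode(sentence, word2idx, max_len):
    """Convert tokens to ids, truncate to max_len, then right-pad with 0."""
    ids = [word2idx.get(word, 1) for word in sentence][:max_len]
    return ids + [0] * (max_len - len(ids))


# Toy training corpus: lowercase and split on whitespace
train = [text.lower().split() for text in ["Great food", "Terrible service and cold food"]]
vocab = build_vocab(train)

# "pizza" never appears in the training corpus, so it maps to <UNK> (id 1)
print(encode("great cold pizza".split(), vocab, max_len=5))
```

Words that only occur outside the training corpus fall back to the unknown id, which is what keeps the vocabulary free of test-set information.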

#### 3.2 Model Implementation
Each model was implemented with two embedding versions:
- **Logistic Regression (LogReg_LE, LogReg_W2V)**
- **Vanilla RNN (RNN_LE, RNN_W2V)**
- **LSTM (LSTM_LE, LSTM_W2V)**

All models were trained using binary cross-entropy loss (`nn.BCELoss` applied to sigmoid outputs) and the Adam optimizer. Performance was evaluated using accuracy, precision, recall, and F1-score.
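
To make the loss concrete, here is a small plain-Python sketch of binary cross-entropy averaged over a batch. The function name and the clamping constant `eps` are illustrative assumptions, not the library's exact numerics. A useful reference point is that a model predicting 0.5 for every review scores ln 2 ≈ 0.693.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over the batch."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 (illustrative clamp)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


labels = [1, 0]
confident_right = binary_cross_entropy([0.95, 0.05], labels)
unsure = binary_cross_entropy([0.5, 0.5], labels)
confident_wrong = binary_cross_entropy([0.05, 0.95], labels)

print(f"confident and right: {confident_right:.4f}")
print(f"always 0.5:          {unsure:.4f}")
print(f"confident and wrong: {confident_wrong:.4f}")
```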

### 4. Results and Analysis

#### 4.1 Performance Metrics
| Model        | Loss    | Accuracy | Precision | Recall  | F1-score |
|-------------|--------|----------|-----------|---------|----------|
| LSTM_W2V    | 0.1229 | 0.9521   | 0.9465    | 0.9584  | 0.9524   |
| LSTM_LE     | 0.2591 | 0.9393   | 0.9280    | 0.9526  | 0.9401   |
| LogReg_LE   | 0.2110 | 0.9312   | 0.9297    | 0.9331  | 0.9314   |
| LogReg_W2V  | 0.2816 | 0.8946   | 0.8987    | 0.8893  | 0.8940   |
| RNN_W2V     | 0.6258 | 0.6487   | 0.5992    | 0.8979  | 0.7188   |
| RNN_LE      | 0.6845 | 0.5244   | 0.5266    | 0.4822  | 0.5034   |
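
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below uses the values from the table above; the function name is illustrative, and small differences are expected because the table entries are rounded to four decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "LSTM_W2V": (0.9465, 0.9584, 0.9524),
    "LSTM_LE": (0.9280, 0.9526, 0.9401),
    "LogReg_LE": (0.9297, 0.9331, 0.9314),
    "LogReg_W2V": (0.8987, 0.8893, 0.8940),
    "RNN_W2V": (0.5992, 0.8979, 0.7188),
    "RNN_LE": (0.5266, 0.4822, 0.5034),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, reported = {f1:.4f}")
```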

#### 4.2 Detailed Analysis
- **LSTM_W2V achieved the highest performance** with an accuracy of **95.21%** and an F1-score of **0.9524**. This suggests that the combination of LSTM's ability to retain long-term dependencies and Word2Vec’s pretrained semantic knowledge enhances sentiment classification significantly.
- **LSTM_LE performed slightly worse** but still achieved high accuracy (93.93%) and an F1-score of 0.9401. The difference indicates that pretrained Word2Vec embeddings provided a useful semantic advantage over randomly initialized embeddings.
- **Logistic Regression models performed well**, with **LogReg_LE** (93.12% accuracy) outperforming **LogReg_W2V** (89.46%) by about 3.7 percentage points. In LogReg_W2V the embeddings are frozen, so only the final linear layer is trained. This suggests that for simpler models, allowing embeddings to be trained on the dataset may be more beneficial than using pretrained embeddings.
- **RNN models performed significantly worse**, especially **RNN_LE**, whose accuracy (52.44%) is close to chance for a binary task. This is consistent with simple RNNs struggling on longer sequences: inputs are padded to 266 tokens and the classifier reads the final time step, so without gating mechanisms little of the signal survives.
- **RNN_W2V performed better than RNN_LE**, with a much higher recall (89.79%), but its low precision (59.92%) indicates that it labeled many negative reviews as positive (false positives).
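
The false-positive reading of RNN_W2V can be made concrete by reconstructing approximate confusion counts from its precision and recall. This sketch assumes the 38,000-review test split is balanced (19,000 positive, 19,000 negative), which the notebook does not state explicitly; the function name is hypothetical.

```python
def confusion_from_metrics(precision, recall, n_pos, n_neg):
    """Approximate TP, FP, FN, TN from precision/recall and class sizes."""
    tp = recall * n_pos      # recall = TP / (TP + FN)
    fp = tp / precision - tp  # precision = TP / (TP + FP)
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn


# RNN_W2V row of the results table; class sizes are an assumption
n_pos, n_neg = 19_000, 19_000
tp, fp, fn, tn = confusion_from_metrics(0.5992, 0.8979, n_pos, n_neg)
accuracy = (tp + tn) / (n_pos + n_neg)

print(f"TP ~ {tp:.0f}, FP ~ {fp:.0f}, FN ~ {fn:.0f}, TN ~ {tn:.0f}")
print(f"implied accuracy ~ {accuracy:.4f}")
```

Under that assumption the implied accuracy lands close to the reported 0.6487, and false positives outnumber false negatives several times over, which suggests the model leans heavily toward predicting the positive class.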

### 5. Conclusion

#### 5.1 Model Comparisons
- **LSTM is the best-performing model overall**, demonstrating its strength in capturing long-range dependencies and handling sequential data effectively.
- **Logistic Regression provides a strong baseline** with relatively high accuracy, making it an efficient choice when computational resources are limited.
- **Vanilla RNN is the weakest model**, which is consistent with the difficulty simple RNNs have with long text sequences and vanishing gradients.

#### 5.2 Embedding Strategy Comparisons
- **Word2Vec embeddings improved performance for complex models like LSTM and RNN** due to their ability to retain semantic relationships.
- **Learnable embeddings worked better for Logistic Regression**, suggesting that training embeddings from scratch is beneficial for simpler models without sequential dependencies. However, this approach comes at the cost of significantly increased training time.

#### 5.3 Final Thoughts
- **Best Choice:** LSTM with Word2Vec (LSTM_W2V) due to its superior accuracy, F1-score, and overall robustness.
- **Efficient Alternative:** Logistic Regression with learnable embeddings (LogReg_LE).
- **Least Recommended:** Vanilla RNN, as it struggled significantly, especially without pretrained embeddings.

This study highlights the importance of both model selection and embedding strategy in NLP tasks. Future work could explore Transformer-based models like BERT to further enhance performance.


## Dependencies Setup

In [17]:
!pip install datasets



In [18]:
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import numpy as np
import random
import os
from gensim.models import Word2Vec
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer
import torch.optim as optim
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import time
import torch.nn as nn
import pandas as pd

In [19]:
# Check if using GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [20]:
# Set random seeds for reproducibility
seed = 42
random.seed(seed)  # Python's random module
np.random.seed(seed)  # NumPy's random module
torch.manual_seed(seed)  # PyTorch's random seed for CPU
torch.cuda.manual_seed(seed)  # PyTorch's random seed for the current GPU
torch.cuda.manual_seed_all(seed)  # PyTorch's random seed for all GPUs (if using multi-GPU)

# Ensure deterministic behavior on GPU (optional, may slow down training)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Optional: PYTHONHASHSEED only takes effect when set before the interpreter starts,
# so this assignment affects subprocesses rather than the current session.
os.environ['PYTHONHASHSEED'] = str(seed)

## Data Preprocessing

### Learnable Embeddings Dataset Preprocessing



In [21]:
class SentimentDataset(Dataset):
    """Custom PyTorch Dataset class for sentiment analysis."""
    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

In [22]:
class LearnableEmbeddingDataset:
    """Dataset preprocessing for learnable embeddings (randomly initialized nn.Embedding)."""

    def __init__(self, max_seq_length=None, batch_size=32):
        self.dataset = load_dataset("yelp_polarity")  # Load dataset
        self.tokenized_train_corpus, self.tokenized_test_corpus = self.tokenize_text()
        self.word2idx = self.build_vocab()  # Build vocabulary (train set only)
        self.max_seq_length = max_seq_length or self.compute_max_seq_length()
        self.train_sequences, self.test_sequences = self.text_to_indices()
        self.split_data()  # Train-validation split
        self.batch_size = batch_size
        # Create full train dataset (train + validation)
        self.full_train_data = self.train_data + self.val_data
        self.full_train_labels = self.train_labels + self.val_labels

        # Print dataset statistics
        print(f" Vocabulary size: {len(self.word2idx)}")
        print(f" Max sequence length: {self.max_seq_length}")
        print(f" Training samples: {len(self.train_data)}")
        print(f" Validation samples: {len(self.val_data)}")
        print(f" Full Train Samples: {len(self.full_train_data)}")
        print(f" Test samples: {len(self.test_data)}")

    def tokenize_text(self):
        """Tokenize text by lowercasing and splitting."""
        train_texts = self.dataset["train"]["text"]
        test_texts = self.dataset["test"]["text"]
        return ([text.lower().split() for text in train_texts],
                [text.lower().split() for text in test_texts])

    def build_vocab(self):
        """Build word-to-index mapping from training data only."""
        word2idx = {"<PAD>": 0, "<UNK>": 1}  # Reserved tokens
        idx = 2
        for sentence in self.tokenized_train_corpus:
            for word in sentence:
                if word not in word2idx:
                    word2idx[word] = idx
                    idx += 1
        return word2idx

    def compute_max_seq_length(self):
        """Heuristic max sequence length: twice the average training sentence length."""
        train_lengths = [len(sentence) for sentence in self.tokenized_train_corpus]
        avg_length = np.mean(train_lengths)
        max_length = max(train_lengths)
        computed_max_length = int(avg_length * 2)

        print(f" Average sentence length: {avg_length:.2f} words")
        print(f" Longest sentence length: {max_length} words")
        print(f" Computed max sequence length: {computed_max_length}")

        return computed_max_length

    def text_to_indices(self):
        """Convert tokenized text to sequences of word indices."""
        def encode(sentence):
            indices = [self.word2idx.get(word, 1) for word in sentence]  # 1 = <UNK>
            return indices[:self.max_seq_length] + [0] * max(0, self.max_seq_length - len(indices))

        train_sequences = [encode(sentence) for sentence in self.tokenized_train_corpus]
        test_sequences = [encode(sentence) for sentence in self.tokenized_test_corpus]
        return train_sequences, test_sequences

    def split_data(self):
        """Split data into train, validation, and test sets."""
        train_labels = self.dataset["train"]["label"]
        test_labels = self.dataset["test"]["label"]

        self.train_data, self.val_data, self.train_labels, self.val_labels = train_test_split(
            self.train_sequences, train_labels, test_size=0.1, random_state=42
        )

        self.test_data, self.test_labels = self.test_sequences, test_labels

    def get_dataloaders(self):
        """Create PyTorch DataLoaders for training, validation, and testing."""
        train_dataset = SentimentDataset(self.train_data, self.train_labels)
        val_dataset = SentimentDataset(self.val_data, self.val_labels)
        test_dataset = SentimentDataset(self.test_data, self.test_labels)
        full_train_dataset = SentimentDataset(self.full_train_data, self.full_train_labels)

        train_dataloader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
        val_dataloader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)
        test_dataloader = DataLoader(test_dataset, batch_size=self.batch_size, shuffle=False)
        full_train_dataloader = DataLoader(full_train_dataset, batch_size=self.batch_size, shuffle=True)

        return train_dataloader, val_dataloader, test_dataloader, full_train_dataloader

### Word2Vec Embeddings Dataset Preprocessing

In [23]:
class Word2VecEmbeddingDataset:
    """Dataset preprocessing for pretrained Word2Vec embeddings."""

    def __init__(self, embedding_dim=100, max_seq_length=None, batch_size=32):
        self.dataset = load_dataset("yelp_polarity")  # Load dataset
        self.tokenized_train_corpus, self.tokenized_test_corpus = self.tokenize_text()
        self.word2vec_model = self.train_word2vec(embedding_dim)  # Train Word2Vec
        self.word2idx = self.rebuild_word2idx()  # Rebuild vocabulary to match Word2Vec order
        self.max_seq_length = max_seq_length or self.compute_max_seq_length()
        self.train_sequences, self.test_sequences = self.text_to_indices()
        self.split_data()  # Train-validation split
        self.batch_size = batch_size
        self.embedding_matrix = self.build_embedding_matrix()  # Convert Word2Vec embeddings to PyTorch tensor
        # Create full train dataset (train + validation)
        self.full_train_data = self.train_data + self.val_data
        self.full_train_labels = self.train_labels + self.val_labels

        # Print dataset statistics
        print(f" Vocabulary size: {len(self.word2idx)}")
        print(f" Max sequence length: {self.max_seq_length}")
        print(f" Training samples: {len(self.train_data)}")
        print(f" Validation samples: {len(self.val_data)}")
        print(f" Full Train Samples: {len(self.full_train_data)}")
        print(f" Test samples: {len(self.test_data)}")
        print(f" Embedding matrix shape: {self.embedding_matrix.shape}")

    def tokenize_text(self):
        """Tokenize text by lowercasing and splitting."""
        train_texts = self.dataset["train"]["text"]
        test_texts = self.dataset["test"]["text"]
        return ([text.lower().split() for text in train_texts],
                [text.lower().split() for text in test_texts])

    def train_word2vec(self, embedding_dim):
        """Train Word2Vec on training data only."""
        return Word2Vec(sentences=self.tokenized_train_corpus, vector_size=embedding_dim, window=5, min_count=1, workers=4)

    def rebuild_word2idx(self):
        """Rebuild word2idx to match Word2Vec order."""
        word2idx = {"<PAD>": 0, "<UNK>": 1}
        for idx, word in enumerate(self.word2vec_model.wv.index_to_key, start=2):
            word2idx[word] = idx
        return word2idx

    def compute_max_seq_length(self):
        """Heuristic max sequence length: twice the average training sentence length."""
        train_lengths = [len(sentence) for sentence in self.tokenized_train_corpus]
        avg_length = np.mean(train_lengths)
        max_length = max(train_lengths)
        computed_max_length = int(avg_length * 2)

        print(f" Average sentence length: {avg_length:.2f} words")
        print(f" Longest sentence length: {max_length} words")
        print(f" Computed max sequence length: {computed_max_length}")

        return computed_max_length

    def text_to_indices(self):
        """Convert tokenized text to sequences of word indices."""
        def encode(sentence):
            indices = [self.word2idx.get(word, 1) for word in sentence]  # 1 = <UNK>
            return indices[:self.max_seq_length] + [0] * max(0, self.max_seq_length - len(indices))

        train_sequences = [encode(sentence) for sentence in self.tokenized_train_corpus]
        test_sequences = [encode(sentence) for sentence in self.tokenized_test_corpus]
        return train_sequences, test_sequences

    def build_embedding_matrix(self):
        """Return the PyTorch tensor of the Word2Vec embeddings, including <PAD> and <UNK>."""
        word_embeddings = torch.FloatTensor(self.word2vec_model.wv.vectors)

        # Manually add <PAD> and <UNK> embeddings
        pad_embedding = torch.zeros((1, self.word2vec_model.vector_size))  # Zero vector for padding
        unk_embedding = torch.mean(word_embeddings, dim=0, keepdim=True)  # Average embedding for <UNK>

        # Concatenate <PAD> and <UNK> embeddings at the beginning
        word_embeddings = torch.cat([pad_embedding, unk_embedding, word_embeddings], dim=0)

        return word_embeddings

    def get_embedding_matrix(self):
        """Return the PyTorch tensor of the Word2Vec embeddings, including <PAD> and <UNK>."""
        return self.embedding_matrix

    def split_data(self):
        """Split data into train, validation, and test sets."""
        train_labels = self.dataset["train"]["label"]
        test_labels = self.dataset["test"]["label"]

        self.train_data, self.val_data, self.train_labels, self.val_labels = train_test_split(
            self.train_sequences, train_labels, test_size=0.1, random_state=42
        )

        self.test_data, self.test_labels = self.test_sequences, test_labels

    def get_dataloaders(self):
        """Create PyTorch DataLoaders for training, validation, and testing."""
        train_dataset = SentimentDataset(self.train_data, self.train_labels)
        val_dataset = SentimentDataset(self.val_data, self.val_labels)
        test_dataset = SentimentDataset(self.test_data, self.test_labels)
        full_train_dataset = SentimentDataset(self.full_train_data, self.full_train_labels)

        train_dataloader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
        val_dataloader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)
        test_dataloader = DataLoader(test_dataset, batch_size=self.batch_size, shuffle=False)
        full_train_dataloader = DataLoader(full_train_dataset, batch_size=self.batch_size, shuffle=True)

        return train_dataloader, val_dataloader, test_dataloader, full_train_dataloader
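
The key invariant in `Word2VecEmbeddingDataset` is that the id assigned to a word in `word2idx` must equal its row in the embedding matrix, with both shifted by two to make room for `<PAD>` and `<UNK>`. The following sketch checks that alignment with plain lists; `index_to_key` and `vectors` are toy stand-ins for the corresponding gensim attributes, and the values are made up.

```python
# Toy stand-ins for word2vec_model.wv.index_to_key and word2vec_model.wv.vectors
index_to_key = ["the", "food", "great"]
vectors = [[0.1, 0.3], [0.5, -0.1], [0.9, 0.4]]

# Ids 0 and 1 are reserved, so real words start at 2
word2idx = {"<PAD>": 0, "<UNK>": 1}
for idx, word in enumerate(index_to_key, start=2):
    word2idx[word] = idx

dim = len(vectors[0])
pad_row = [0.0] * dim  # zero vector for padding
unk_row = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]  # mean vector
embedding_matrix = [pad_row, unk_row] + vectors

# Every word id must select that word's own vector
for word, vector in zip(index_to_key, vectors):
    assert embedding_matrix[word2idx[word]] == vector

print(len(embedding_matrix), "rows for", len(word2idx), "vocabulary entries")
```

If the two special rows were appended at the end instead of the beginning, every lookup would silently return the vector of a different word, so a check like this is cheap insurance.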

In [24]:
# Function to test dataset module and verify Word2Vec embeddings
def test_dataset_module(dataset_class, embedding_type, batch_size=32, embedding_dim=None):
    """Tests the given dataset module by printing dataset statistics and a sample batch."""
    print(f" Testing {embedding_type} Dataset with batch size {batch_size}...")

    # Instantiate dataset module (handle cases where embedding_dim is needed)
    dataset = dataset_class(embedding_dim=embedding_dim, max_seq_length=None, batch_size=batch_size) if embedding_dim else dataset_class(max_seq_length=None, batch_size=batch_size)

    # Print dataset statistics
    print(f"   {embedding_type} Dataset Stats:")
    print(f"   - Vocabulary size: {len(dataset.word2idx)}")
    print(f"   - Max sequence length: {dataset.max_seq_length}")
    print(f"   - Training samples: {len(dataset.train_data)}")
    print(f"   - Validation samples: {len(dataset.val_data)}")
    full_train_count = len(dataset.full_train_data) if hasattr(dataset, "full_train_data") else "N/A"
    print(f"   - Full Train Samples: {full_train_count}")
    print(f"   - Test samples: {len(dataset.test_data)}")

    # Sentence length stats (from compute_max_seq_length)
    avg_length = np.mean([len(sentence) for sentence in dataset.tokenized_train_corpus])
    max_length = max(len(sentence) for sentence in dataset.tokenized_train_corpus)
    print(f"   - Average sentence length: {avg_length:.2f} words")
    print(f"   - Longest sentence length: {max_length} words")
    print(f"   - Computed max sequence length: {dataset.compute_max_seq_length()}")
    print("-" * 50)

    # Get DataLoaders
    train_dataloader, val_dataloader, test_dataloader, full_train_dataloader = dataset.get_dataloaders()

    # Retrieve a batch
    train_batch = next(iter(train_dataloader))
    print(f" {embedding_type} Batch Shape: {train_batch[0].shape}, Labels Shape: {train_batch[1].shape}")

    # If testing Word2Vec embeddings, retrieve and verify the embedding matrix
    if isinstance(dataset, Word2VecEmbeddingDataset):
        print(" Testing Word2Vec Embedding Matrix...")
        embedding_matrix = dataset.get_embedding_matrix()

        # Print embedding matrix shape
        print(f" Embedding Matrix Shape: {embedding_matrix.shape}")  # Should match (vocab_size, embedding_dim)

        # Check special token embeddings
        pad_embedding = embedding_matrix[0]  # First row should be all zeros
        unk_embedding = embedding_matrix[1]  # Second row should be the average of all embeddings

        print(f" <PAD> Embedding (should be all zeros): {pad_embedding[:5]} ...")  # Print first few values
        print(f" <UNK> Embedding (should be avg of embeddings): {unk_embedding[:5]} ...")  # Print first few values

    print("=" * 50)


# Run tests for Learnable and Word2Vec Embeddings
test_dataset_module(LearnableEmbeddingDataset, "Learnable Embeddings", batch_size=32)
test_dataset_module(Word2VecEmbeddingDataset, "Word2Vec Embeddings", batch_size=32, embedding_dim=100)

 Testing Learnable Embeddings Dataset with batch size 32...
 Average sentence length: 133.03 words
 Longest sentence length: 1052 words
 Computed max sequence length: 266
 Vocabulary size: 1288540
 Max sequence length: 266
 Training samples: 504000
 Validation samples: 56000
 Full Train Samples: 560000
 Test samples: 38000
   Learnable Embeddings Dataset Stats:
   - Vocabulary size: 1288540
   - Max sequence length: 266
   - Training samples: 504000
   - Validation samples: 56000
   - Full Train Samples: 560000
   - Test samples: 38000
   - Average sentence length: 133.03 words
   - Longest sentence length: 1052 words
 Average sentence length: 133.03 words
 Longest sentence length: 1052 words
 Computed max sequence length: 266
   - Computed max sequence length: 266
--------------------------------------------------
 Learnable Embeddings Batch Shape: torch.Size([32, 266]), Labels Shape: torch.Size([32])
 Testing Word2Vec Embeddings Dataset with batch size 32...
 Average sentence length:

In [25]:
# Instantiate dataset objects
le_dataset = LearnableEmbeddingDataset()
w2v_dataset = Word2VecEmbeddingDataset()

# Get the DataLoaders
train_dataloader_LE, val_dataloader_LE, test_dataloader_LE, full_train_dataloader_LE = le_dataset.get_dataloaders()
train_dataloader_Word2Vec, val_dataloader_Word2Vec, test_dataloader_Word2Vec, full_train_dataloader_Word2Vec= w2v_dataset.get_dataloaders()

# Get the Word2Vec embedding matrix (for models using pretrained embeddings)
word2vec_embeddings = w2v_dataset.get_embedding_matrix()

 Average sentence length: 133.03 words
 Longest sentence length: 1052 words
 Computed max sequence length: 266
 Vocabulary size: 1288540
 Max sequence length: 266
 Training samples: 504000
 Validation samples: 56000
 Full Train Samples: 560000
 Test samples: 38000
 Average sentence length: 133.03 words
 Longest sentence length: 1052 words
 Computed max sequence length: 266
 Vocabulary size: 1288540
 Max sequence length: 266
 Training samples: 504000
 Validation samples: 56000
 Full Train Samples: 560000
 Test samples: 38000
 Embedding matrix shape: torch.Size([1288540, 100])


In [26]:
# Sanity Check to ensure no data leakage
def count_oov_words(dataset):
    """
    Count the number of out-of-vocabulary (OOV) words in the test set.

    Args:
        dataset: The dataset object (LearnableEmbeddingDataset or Word2VecEmbeddingDataset).

    Returns:
        oov_count (int): Total number of OOV words.
        oov_percentage (float): Percentage of OOV words in test set.
    """
    oov_count = 0
    total_words = 0

    for sentence in dataset.tokenized_test_corpus:
        for word in sentence:
            total_words += 1
            if word not in dataset.word2idx:  # Word not in vocabulary
                oov_count += 1

    oov_percentage = (oov_count / total_words) * 100
    print(f"\n OOV Words in Test Set: {oov_count}/{total_words} ({oov_percentage:.2f}%)")
    return oov_count, oov_percentage

In [27]:
# Check OOV rate for Learnable Embeddings
print("\n OOV Words for Learnable Embeddings:")
oov_LE, oov_LE_percentage = count_oov_words(le_dataset)

# Check OOV rate for Word2Vec Embeddings
print("\n OOV Words for Word2Vec Embeddings:")
oov_W2V, oov_W2V_percentage = count_oov_words(w2v_dataset)


 OOV Words for Learnable Embeddings:

 OOV Words in Test Set: 63627/5037228 (1.26%)

 OOV Words for Word2Vec Embeddings:

 OOV Words in Test Set: 63627/5037228 (1.26%)


## Model Implementation



### Logistic Regression

In [28]:
class LogisticRegressionModel(nn.Module):
    """Baseline Logistic Regression model using averaged word embeddings."""

    def __init__(self, vocab_size, embedding_dim, pretrained_embeddings=None, train_embeddings=True):
        super(LogisticRegressionModel, self).__init__()

        if pretrained_embeddings is not None:
            self.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=not train_embeddings)
        else:
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.fc = nn.Linear(embedding_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embedded = self.embeddings(x)  # Shape: (batch_size, seq_len, embedding_dim)
        avg_embedding = embedded.mean(dim=1)  # Average over sequence length (padded positions included)
        output = self.fc(avg_embedding)  # Fully connected layer
        return self.sigmoid(output).squeeze()
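
Note that `embedded.mean(dim=1)` divides by the padded length, so short reviews are averaged together with their padding positions. A masked mean that skips padding is one possible alternative. The sketch below illustrates the difference with plain lists; it is not the notebook's implementation, and the function name and toy vectors are hypothetical.

```python
def mean_pool(embedded, token_ids, pad_id=0, ignore_padding=True):
    """Average token vectors, optionally skipping padded positions."""
    rows = [vec for vec, tok in zip(embedded, token_ids)
            if not (ignore_padding and tok == pad_id)]
    if not rows:  # sequence made only of padding
        return [0.0] * len(embedded[0])
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Two real tokens followed by two padding positions (zero vectors here)
token_ids = [5, 9, 0, 0]
embedded = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]

plain = mean_pool(embedded, token_ids, ignore_padding=False)
masked = mean_pool(embedded, token_ids)
print("mean over all positions: ", plain)
print("mean over real tokens:   ", masked)
```

With the plain mean, the same two words produce a smaller vector the more padding follows them, so the representation depends on review length as well as content.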

### RNN

In [29]:
class RNNModel(nn.Module):
    """RNN-based sentiment classification model."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, pretrained_embeddings=None, train_embeddings=True):
        super(RNNModel, self).__init__()

        if pretrained_embeddings is not None:
            self.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=not train_embeddings)
        else:
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embedded = self.embeddings(x)
        rnn_out, _ = self.rnn(embedded)
        last_hidden_state = rnn_out[:, -1, :]  # Output at the final time step (a padding position for right-padded inputs)
        output = self.fc(last_hidden_state)
        return self.sigmoid(output).squeeze()
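
`RNNModel.forward` reads the output at the final time step, and because sequences are right-padded, that step usually follows a run of padding tokens for shorter reviews. The toy scalar recurrence below illustrates how a state can fade while it is carried through padding. The weights and inputs are hypothetical, and padding is modeled as zero input, which matches the zero `<PAD>` row of the Word2Vec matrix but not necessarily a learnable embedding.

```python
import math


def run_scalar_rnn(inputs, w_in=1.0, w_rec=0.5):
    """Toy recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}) with h_0 = 0."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


real_tokens = [0.8, -0.3, 1.2]  # stand-ins for embedded words
padding = [0.0] * 200           # right-padding modeled as zero input
states = run_scalar_rnn(real_tokens + padding)

state_after_text = states[len(real_tokens) - 1]
state_at_last_step = states[-1]
print(f"state after the last real token: {state_after_text:.4f}")
print(f"state at the final (padded) step: {state_at_last_step:.3e}")
```

In this toy setting the final state carries almost nothing about the text. A trained network can behave differently, but the sketch shows why reading the last non-padding step (for example via packed sequences) is a common design choice.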

### LSTM

In [30]:
class LSTMModel(nn.Module):
    """LSTM-based sentiment classification model."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, pretrained_embeddings=None, train_embeddings=True):
        super(LSTMModel, self).__init__()

        if pretrained_embeddings is not None:
            self.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=not train_embeddings)
        else:
            self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embedded = self.embeddings(x)
        lstm_out, (hidden, _) = self.lstm(embedded)
        last_hidden_state = hidden[-1]  # Take last hidden state
        output = self.fc(last_hidden_state)
        return self.sigmoid(output).squeeze()
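
The gating mechanism that lets an LSTM retain information can be illustrated with the cell-state update `c_t = f * c_{t-1} + i * g`. In a real LSTM the forget gate `f`, input gate `i`, and candidate `g` are learned functions of the input and previous hidden state; the sketch below fixes them to constants purely for illustration, and all names and values are hypothetical.

```python
def carry_cell_state(c0, steps, forget_gate=0.999, input_gate=0.0, candidate=0.0):
    """Apply c_t = f * c_{t-1} + i * g for a fixed number of steps."""
    c = c0
    for _ in range(steps):
        c = forget_gate * c + input_gate * candidate
    return c


# Forget gate near 1: most of the stored value survives 200 steps
kept = carry_cell_state(1.0, steps=200)
# Forget gate of 0.5: the stored value is effectively erased
lost = carry_cell_state(1.0, steps=200, forget_gate=0.5)

print(f"forget gate 0.999 -> {kept:.4f}")
print(f"forget gate 0.5   -> {lost:.3e}")
```

When the network learns to keep the forget gate close to 1 and the input gate close to 0 on uninformative steps, the cell state can pass through long stretches of a sequence largely intact.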

## Training & Evaluation Pipeline

In [31]:
class Trainer:
    """Reusable training and evaluation class for all models."""

    def __init__(self, model, train_dataloader, val_dataloader, test_dataloader,
                 full_train_dataloader=None, learning_rate=0.001, num_epochs=10,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        self.model = model.to(device)
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.test_dataloader = test_dataloader
        self.full_train_dataloader = full_train_dataloader  # Store full dataset loader for retraining
        self.criterion = nn.BCELoss()  # Binary classification loss
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.num_epochs = num_epochs
        self.device = device

    def train_model(self, use_full_data=False):
        """Train the model with either train data or full (train + validation) data."""
        dataloader = self.full_train_dataloader if use_full_data else self.train_dataloader
        data_type = "Full Train (Train+Val)" if use_full_data else "Train Only"

        print(f"\n Training {self.model.__class__.__name__} on {data_type} for {self.num_epochs} epochs...\n")
        start_time = time.time()

        for epoch in range(self.num_epochs):
            self.model.train()  # Set model to training mode
            total_loss = 0
            all_preds, all_labels = [], []

            for inputs, labels in dataloader:
                inputs, labels = inputs.to(self.device), labels.to(self.device)

                self.optimizer.zero_grad()  # Reset gradients
                outputs = self.model(inputs).reshape(-1)  # Shape (batch,), even for a batch of one

                loss = self.criterion(outputs, labels)
                loss.backward()  # Compute gradients
                self.optimizer.step()  # Update weights

                total_loss += loss.item()

                # Convert outputs to binary predictions
                predictions = (outputs >= 0.5).float()
                all_preds.extend(predictions.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

            # Compute Training Accuracy
            train_accuracy = accuracy_score(all_labels, all_preds)
            avg_train_loss = total_loss / len(dataloader)

            # Evaluate on Validation Set (only when not using full data)
            if not use_full_data:
                val_metrics = self.evaluate_model(self.val_dataloader, mode="Validation")
                print(f"Epoch {epoch+1}/{self.num_epochs} | "
                      f"Train Loss: {avg_train_loss:.4f} | Train Acc: {train_accuracy:.4f} | "
                      f"Val Loss: {val_metrics['loss']:.4f} | Val Acc: {val_metrics['accuracy']:.4f} | "
                      f"Val F1: {val_metrics['f1-score']:.4f} | "
                      f"Val Precision: {val_metrics['precision']:.4f} | "
                      f"Val Recall: {val_metrics['recall']:.4f}")
            else:
                print(f"Epoch {epoch+1}/{self.num_epochs} | Train Loss: {avg_train_loss:.4f} | Train Acc: {train_accuracy:.4f}")

        total_time = time.time() - start_time
        print(f"\n Training completed in {total_time:.2f} seconds.")

    def evaluate_model(self, dataloader, mode="Test"):
        """Evaluate the model on a given dataset (validation or test)."""
        self.model.eval()  # Set model to evaluation mode
        all_preds, all_labels = [], []
        total_loss = 0

        with torch.no_grad():  # No gradient computation in evaluation
            for inputs, labels in dataloader:
                inputs, labels = inputs.to(self.device), labels.to(self.device)
                outputs = self.model(inputs).reshape(-1)  # Shape (batch,), even for a batch of one

                loss = self.criterion(outputs, labels)
                total_loss += loss.item()

                predictions = (outputs >= 0.5).float()  # Threshold sigmoid probabilities at 0.5

                all_preds.extend(predictions.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

        acc = accuracy_score(all_labels, all_preds)
        precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary", zero_division=0)
        avg_loss = total_loss / len(dataloader)

        # Print Classification Report
        print(f"\n {mode} Classification Report:\n", classification_report(all_labels, all_preds, digits=4))

        return {"loss": avg_loss, "accuracy": acc, "precision": precision, "recall": recall, "f1-score": f1}

    def retrain_and_test(self):
        """Retrain model using both train & validation sets, then test on test set."""
        if self.full_train_dataloader is None:
            print("Full train dataloader is missing! Ensure it's passed during initialization.")
            return

        print(f"\n Retraining {self.model.__class__.__name__} using Train + Validation Data...\n")
        self.train_model(use_full_data=True)  # Train the model again with full dataset

        print("\n Final Testing on Test Set...")
        test_metrics = self.evaluate_model(self.test_dataloader, mode="Test")

        print(f"Final Test Results: Accuracy: {test_metrics['accuracy']:.4f} | Loss: {test_metrics['loss']:.4f} | "
              f"Precision: {test_metrics['precision']:.4f} | "
              f"Recall: {test_metrics['recall']:.4f} | "
              f"F1-score: {test_metrics['f1-score']:.4f}")
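
The 0.5 cutoff used in `evaluate_model` is only meaningful if the model's outputs are probabilities. Whether the models here end in a sigmoid is not shown in this section, so the following is a minimal sketch under that assumption, in plain Python with hypothetical values. It illustrates that thresholding probabilities at 0.5 is equivalent to thresholding raw logits at 0, and how predictions would shift if raw logits were compared against 0.5 by mistake.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_from_prob(p: float) -> float:
    """Same rule as evaluate_model: label 1.0 when the value is >= 0.5."""
    return 1.0 if p >= 0.5 else 0.0


def predict_from_logit(z: float) -> float:
    """Equivalent rule on raw logits: sigmoid(z) >= 0.5 exactly when z >= 0."""
    return 1.0 if z >= 0.0 else 0.0


# Hypothetical raw model outputs (not taken from the notebook)
logits = [-3.0, -0.2, 0.0, 0.2, 0.4, 3.0]

from_probs = [predict_from_prob(sigmoid(z)) for z in logits]
from_logits = [predict_from_logit(z) for z in logits]

# What happens if raw logits are compared against 0.5 directly:
# small positive logits (0.0, 0.2, 0.4) are pushed to the negative class.
mis_thresholded = [predict_from_prob(z) for z in logits]

print("from probabilities:", from_probs)
print("from logits:       ", from_logits)
print("logits vs. 0.5:    ", mis_thresholded)
```

If the models were instead trained with a logit-based loss, the comparison in `evaluate_model` would need to use 0 rather than 0.5 to give the same decision rule.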

## Experiment

In [32]:
# Define model hyperparameters
embedding_dim = 100  # Same as Word2Vec embedding size
hidden_size = 128  # For RNN/LSTM
num_epochs = 5
learning_rate = 0.001

# Get vocabulary sizes
vocab_size_LE = len(le_dataset.word2idx)  # Learnable embeddings vocabulary size
vocab_size_W2V = len(w2v_dataset.word2idx)  # Word2Vec embeddings vocabulary size

# Define all models with both embedding strategies
models = {
    "LogReg_LE": LogisticRegressionModel(vocab_size_LE, embedding_dim),
    "LogReg_W2V": LogisticRegressionModel(vocab_size_W2V, embedding_dim, pretrained_embeddings=word2vec_embeddings, train_embeddings=False),

    "RNN_LE": RNNModel(vocab_size_LE, embedding_dim, hidden_size),
    "RNN_W2V": RNNModel(vocab_size_W2V, embedding_dim, hidden_size, pretrained_embeddings=word2vec_embeddings, train_embeddings=False),

    "LSTM_LE": LSTMModel(vocab_size_LE, embedding_dim, hidden_size),
    "LSTM_W2V": LSTMModel(vocab_size_W2V, embedding_dim, hidden_size, pretrained_embeddings=word2vec_embeddings, train_embeddings=False)
}

# Define dataloaders for both strategies
dataloaders = {
    "LE": (train_dataloader_LE, val_dataloader_LE, test_dataloader_LE, full_train_dataloader_LE),
    "W2V": (train_dataloader_Word2Vec, val_dataloader_Word2Vec, test_dataloader_Word2Vec, full_train_dataloader_Word2Vec)
}

# Collect final test metrics for each model, keyed by model name
results = {}

# Train, evaluate, and retrain each model
for model_name, model in models.items():

    # Add big separation line
    print("-" * 100)

    print(f"\n Training {model_name}...\n")

    # Determine which dataloaders to use from the model-name suffix
    strategy = "LE" if model_name.endswith("_LE") else "W2V"
    train_dl, val_dl, test_dl, full_train_dl = dataloaders[strategy]

    # Initialize trainer
    trainer = Trainer(model, train_dl, val_dl, test_dl, full_train_dataloader=full_train_dl,
                      learning_rate=learning_rate, num_epochs=num_epochs)

    # Train and evaluate model
    trainer.train_model()

    # Add small separation line
    print("-" * 50)

    # Retrain on full train data and test
    trainer.retrain_and_test()

    # Add separation line
    print("-" * 50)

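    # Store results. retrain_and_test() does not return its metrics, so the test set
    # is evaluated once more here (this prints the test classification report again).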
    test_metrics = trainer.evaluate_model(trainer.test_dataloader, mode="Test")
    results[model_name] = test_metrics

----------------------------------------------------------------------------------------------------

 Training LogReg_LE...


 Training LogisticRegressionModel on Train Only for 5 epochs...


 Validation Classification Report:
               precision    recall  f1-score   support

         0.0     0.9358    0.9104    0.9229     28035
         1.0     0.9126    0.9374    0.9248     27965

    accuracy                         0.9239     56000
   macro avg     0.9242    0.9239    0.9239     56000
weighted avg     0.9242    0.9239    0.9239     56000

Epoch 1/5 | Train Loss: 0.2869 | Train Acc: 0.8880 | Val Loss: 0.2093 | Val Acc: 0.9239 | Val F1: 0.9248 | Val Precision: 0.9126 | Val Recall: 0.9374

 Validation Classification Report:
               precision    recall  f1-score   support

         0.0     0.9415    0.9149    0.9280     28035
         1.0     0.9171    0.9430    0.9299     27965

    accuracy                         0.9290     56000
   macro avg     0.9293    0.9290    0.

## Comparison and Analysis

In [33]:
# Convert results dictionary to DataFrame
results_df = pd.DataFrame.from_dict(results, orient="index")
results_df = results_df.sort_values(by="accuracy", ascending=False)  # Sort by accuracy

# Print the results table
print("\n Final Model Performance Comparison:\n")
print(results_df)


 Final Model Performance Comparison:

                loss  accuracy  precision    recall  f1-score
LSTM_W2V    0.122912  0.952132   0.946515  0.958421  0.952431
LSTM_LE     0.259095  0.939316   0.927963  0.952579  0.940110
LogReg_LE   0.211040  0.931237   0.929676  0.933053  0.931361
LogReg_W2V  0.281639  0.894553   0.898729  0.889316  0.893998
RNN_W2V     0.625798  0.648684   0.599220  0.897947  0.718782
RNN_LE      0.684483  0.524395   0.526646  0.482158  0.503421
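
As a quick sanity check on the table, the F1 column can be recomputed from the precision and recall columns using F1 = 2PR / (P + R). The sketch below copies the values from the table above; the helper name `f1_from` is introduced here for illustration only and is not part of the notebook.

```python
# (precision, recall, reported F1) copied from the comparison table
reported = {
    "LSTM_W2V":   (0.946515, 0.958421, 0.952431),
    "LSTM_LE":    (0.927963, 0.952579, 0.940110),
    "LogReg_LE":  (0.929676, 0.933053, 0.931361),
    "LogReg_W2V": (0.898729, 0.889316, 0.893998),
    "RNN_W2V":    (0.599220, 0.897947, 0.718782),
    "RNN_LE":     (0.526646, 0.482158, 0.503421),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, (p, r, f1_reported) in reported.items():
    f1 = f1_from(p, r)
    print(f"{name:<11} recomputed F1 = {f1:.4f} | reported F1 = {f1_reported:.4f}")
```

The recomputed values agree with the reported column to four decimal places. The check also makes the RNN_W2V row easier to read: its F1 (about 0.72) sits well above its accuracy (about 0.65) because recall is high (about 0.90) while precision is low (about 0.60).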
