# Comparative Analysis: Sentiment Analysis Using BERT, LSTM, GRU, and RNN

## Objective
Perform sentiment analysis on the given dataset using multiple deep learning models: BERT, LSTM, GRU, and RNN, and conduct a comparative analysis based on their performance metrics.
Dataset
## Implementation Plan
1. Data Preprocessing
    ● Load the Dataset: Import the CSV file and extract the relevant columns (polarity, text).
    ● Clean the Text: Remove URLs, special characters, numbers, and extra spaces. Convert text to lowercase.
    ● Tokenization: Split the text into individual tokens using a tokenizer suited to each model (e.g., a word-level tokenizer for the RNN-based models, the BERT WordPiece tokenizer for BERT).
    ● Class Mapping: Map sentiment labels:
        ○ 0 → Negative
        ○ 2 → Neutral
        ○ 4 → Positive
    ● Train-Test Split: Divide the data into training, validation, and test sets.
2. Feature Engineering
    ● BERT Tokenization:
        ○ Use a pre-trained BERT tokenizer to convert the text into input IDs, attention masks, and token type IDs.
    ● Embedding for RNN-based Models:
        ○ Generate word embeddings using GloVe or Word2Vec.
        ○ Pad sequences to a fixed length for uniformity.
3. Model Implementation
    ● BERT:
        ○ Use a pre-trained BERT model from the Hugging Face library.
        ○ Add a classification head (e.g., a dense layer with softmax activation) to fine-tune the model.
    ● LSTM:
        ○ Use a sequential model with embedding, LSTM layers, and a dense output layer.
    ● GRU:
        ○ Similar to LSTM but replace LSTM layers with GRU layers for comparison.
    ● RNN:
        ○ Use simple RNN layers instead of LSTM/GRU for baseline comparison.
4. Evaluation Metrics
    ● Accuracy: Overall percentage of correct predictions.
    ● Precision, Recall, F1-Score: Evaluate per class (negative, neutral, positive).
    ● Confusion Matrix: Show performance across all classes.
    ● ROC-AUC Score: Measure the ability of the model to distinguish between classes.
5. Comparative Analysis
    ● Compare the models on:
        ○ Performance metrics (accuracy, precision, recall, F1-score).
        ○ Computational requirements (training time, memory usage).
        ○ Complexity of implementation.
    ● Generate visualizations:
        ○ Bar chart comparing F1-scores for all models.
        ○ Line plot showing training/validation loss and accuracy over epochs.
        ○ Confusion matrix heatmaps for each model.
6. Expected Outcome
    ● BERT: Likely to outperform RNN-based models due to its pre-trained contextual embeddings and transformer architecture.
    ● LSTM/GRU: Expected to perform better than a simple RNN because their gating mechanisms handle long-term dependencies and mitigate the vanishing gradient problem.
    ● RNN: May provide a baseline but is likely to underperform compared to other models.
7. Deliverables
    ● Code implementation for each model in Python (using libraries like TensorFlow, PyTorch, Hugging Face).
    ● Comparative analysis report with:
        ○ Metric tables
        ○ Charts and graphs
        ○ Insights on model performance.
    ● Recommendations on the best model for deployment based on trade-offs between performance and resource usage.
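
The per-class metrics named in step 4 can be illustrated with a small, dependency-free sketch. The function name `per_class_metrics` and the toy labels are assumptions made for illustration only; in the actual comparison a library routine such as scikit-learn's `classification_report` would likely be used instead.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from paired label lists."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        # Predicted as `label` but actually another class
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        # Actually `label` but predicted as another class
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


# Toy example using the polarity codes from the plan (0, 2, 4).
y_true = [0, 0, 0, 2, 2, 4, 4, 4]
y_pred = [0, 0, 2, 2, 4, 4, 4, 0]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

for label, (precision, recall, f1) in metrics.items():
    print(f"class {label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} macro-F1={macro_f1:.3f}")
```

Macro-averaging gives every class the same weight, which is one reasonable way to summarise F1 in the bar chart from step 5 when classes are imbalanced.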

## Inspect the Dataset

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import re
# Set seed
np.random.seed(42)

test_df = pd.read_csv('sentiment-analysis-dataset/testdata.manual.2009.06.14.csv', encoding='latin1')
train_df = pd.read_csv('sentiment-analysis-dataset/training.1600000.processed.noemoticon.csv', encoding='latin1')
# Strip extra white spaces from train_df column names (tweet data fields)
train_df.columns = train_df.columns.str.strip()
print("Shape: ", train_df.shape)
print("Columns: ", train_df.columns)
print("Tweet polarity distribution (%):\n", train_df['polarity of tweet'].value_counts(normalize=True) * 100)
print()
train_df.head(2)

In [None]:
print(f"Test set columns: {test_df.columns}")
# test_df is not in the expected format, so train_df is used as the entire dataset.


## Clean the Text: Remove URLs, special characters, numbers, and extra spaces. Convert text to lowercase.

In [None]:
## Clean the Text: Remove URLs, special characters, numbers, and extra spaces. Convert text to lowercase.
def clean_tweet(text: str) -> str:
    # Remove URLs ("http\S+" already covers https links)
    text = re.sub(r"http\S+|www\S+", "", text)
    # Remove usernames (@user)
    text = re.sub(r'@\w+', '', text)
    # Remove special characters and numbers (except whitespace)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    # Convert to lowercase and strip leading/trailing whitespace
    return text.lower().strip()

# Apply cleaning function to the tweets
clean_tweets = train_df['text of the tweet'].apply(clean_tweet)
clean_df = train_df.copy()
# Save the cleaned df
clean_df['text of the tweet'] = clean_tweets
clean_df.to_pickle('clean_df.pkl')
clean_df.head()
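
To make the effect of each cleaning step concrete, the sketch below repeats the same regular-expression steps on a few made-up tweets. The sample strings are assumptions for illustration and are not rows from the dataset; the function is restated only so that the snippet runs on its own.

```python
import re


def clean_tweet(text: str) -> str:
    # Same steps as the cell above, restated so this snippet is self-contained
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # URLs
    text = re.sub(r"@\w+", "", text)                      # @usernames
    text = re.sub(r"[^a-zA-Z\s]", "", text)               # digits, punctuation
    text = re.sub(r"\s+", " ", text)                      # repeated whitespace
    return text.lower().strip()


# Hypothetical examples
samples = [
    "@user Check THIS out!!! http://t.co/abc123 I love it 100% :)",
    "I don't know",
    "@user http://x.co 123",
]
cleaned = [clean_tweet(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two side effects are visible here: apostrophes are dropped (`don't` becomes `dont`), and a tweet that contains only a mention, a link, and digits is reduced to an empty string.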


## Splitting into Train, Test, and Val
- 10% test set
- 72% training set
- 18% validation set
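
These percentages follow from splitting off the test set first and then taking the validation set from the remainder. A minimal sketch of the arithmetic, with the helper name `split_fractions` being an assumption for illustration:

```python
def split_fractions(test_size, val_size):
    """Fractions of the full dataset that end up in (train, val, test)
    when the test set is split off first and the validation set is
    then taken from the remaining data."""
    remainder = 1.0 - test_size
    val = remainder * val_size
    train = remainder - val
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions(test_size=0.1, val_size=0.2)
print(f"train={train_frac:.0%}, val={val_frac:.0%}, test={test_frac:.0%}")
# prints: train=72%, val=18%, test=10%
```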



In [None]:
# Oversample class 4 (the minority class).
# Tweet polarity distribution (%) before balancing:
#   0    76.293855
#   4    23.706145
df = clean_df.copy()

# Define features and target
X = df  # keep every column; 'polarity of tweet' is also used as the target y
y = df['polarity of tweet']

# BALANCED SPLIT: Using OVERSAMPLING to preserve more data
# This approach keeps ALL majority class samples and oversamples the minority class
# Separate data by class
class_0_mask = y == 0
class_4_mask = y == 4

X_class_0 = X[class_0_mask]
y_class_0 = y[class_0_mask]
X_class_4 = X[class_4_mask]
y_class_4 = y[class_4_mask]

print(f"Original - Class 0 samples: {len(X_class_0)}")
print(f"Original - Class 4 samples: {len(X_class_4)}")

# Use ALL samples from class 0 (majority class)
# Oversample class 4 (minority class) to match class 0 size
max_class_size = len(X_class_0)  # Use majority class size as target
print(f"\nTarget size per class: {max_class_size} samples")
print(f"Class 4 needs {max_class_size - len(X_class_4)} additional samples (oversampling)\n")


# For class 0: Use all samples (no sampling needed - preserves all data!)
X_class_0_balanced = X_class_0.copy()
y_class_0_balanced = y_class_0.copy()

# For class 4: Oversample to match class 0 size
# Calculate how many additional samples we need
additional_samples_needed = max_class_size - len(X_class_4)

# Randomly sample WITH replacement from class 4 to create additional samples
oversample_indices = np.random.choice(len(X_class_4), additional_samples_needed, replace=True)

# Get the oversampled data
X_class_4_oversampled = X_class_4.iloc[oversample_indices]
y_class_4_oversampled = y_class_4.iloc[oversample_indices]

# Combine original class 4 samples with oversampled ones
X_class_4_balanced = pd.concat([X_class_4, X_class_4_oversampled], ignore_index=True)
y_class_4_balanced = pd.concat([y_class_4, y_class_4_oversampled], ignore_index=True)

print("After oversampling:")
print(f"  Class 0: {len(X_class_0_balanced)} samples (all original samples preserved)")
print(f"  Class 4: {len(X_class_4_balanced)} samples ({len(X_class_4)} original + {additional_samples_needed} oversampled)")
print(f"  Total: {len(X_class_0_balanced) + len(X_class_4_balanced)} samples\n")

# Combine balanced classes
X_balanced = pd.concat([X_class_0_balanced, X_class_4_balanced], ignore_index=True)
y_balanced = pd.concat([y_class_0_balanced, y_class_4_balanced], ignore_index=True)

# Shuffle the combined data
shuffle_indices = np.random.permutation(len(X_balanced))
X_balanced = X_balanced.iloc[shuffle_indices].reset_index(drop=True)
y_balanced = y_balanced.iloc[shuffle_indices].reset_index(drop=True)

# Now split the balanced data: first split into (train+val) and test (90% train+val, 10% test)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.1, random_state=42, stratify=y_balanced
)

# Now split train+val into train and val (80% train, 20% val of the remaining)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42, stratify=y_trainval
)

print(f"Train set shape: {X_train.shape}")
print(f"Val set shape:   {X_val.shape}")
print(f"Test set shape:  {X_test.shape}\n")

def print_class_balance(y, set_name=""):
    value_counts = y.value_counts()
    percentage = value_counts / value_counts.sum() * 100
    balance_df = pd.DataFrame({'count': value_counts, 'percentage': percentage.round(2)})
    print(f"{set_name} class balance:\n{balance_df}\n")

print_class_balance(y_train, "Train")
print_class_balance(y_val, "Val")
print_class_balance(y_test, "Test")

# Save the datasets as .pkl files
# train
train = X_train.copy()
train['polarity of tweet'] = y_train.values  # already present in X; reassigned for clarity
train['label'] = y_train.values              # raw polarity codes (0 or 4)
# val
val = X_val.copy()
val['polarity of tweet'] = y_val.values
val['label'] = y_val.values
# test
test = X_test.copy()
test['polarity of tweet'] = y_test.values
test['label'] = y_test.values

train.to_pickle("train.pkl")
val.to_pickle("val.pkl")
test.to_pickle("test.pkl")
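
One caveat of balancing before splitting, as the cell above does, is that copies of the same class-4 tweet can end up in both the training and the test set. A possible alternative is to split first and oversample only the training portion. The sketch below illustrates that ordering on toy data using the standard library only; the helper names `stratified_split` and `oversample` are hypothetical and are not part of this notebook.

```python
import random
from collections import Counter


def stratified_split(labels, test_frac, rng):
    """Return (train_idx, test_idx), taking test_frac of each class for test."""
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def oversample(indices, labels, rng):
    """Duplicate smaller-class indices (with replacement) until all classes match."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        extra = [rng.choice(idxs) for _ in range(target - len(idxs))]
        balanced.extend(idxs + extra)
    rng.shuffle(balanced)
    return balanced


rng = random.Random(42)
labels = [0] * 80 + [4] * 20  # toy imbalance, not the real distribution

# 1) Hold out the test set from the original, un-duplicated data.
train_idx, test_idx = stratified_split(labels, test_frac=0.1, rng=rng)
# 2) Oversample the minority class inside the training portion only.
train_balanced = oversample(train_idx, labels, rng)

print("train:", Counter(labels[i] for i in train_balanced))
print("test: ", Counter(labels[i] for i in test_idx))
print("shared rows:", len(set(train_balanced) & set(test_idx)))
```

With this ordering the test set keeps the original class distribution and shares no rows with the training set, at the cost of evaluating on imbalanced data.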

In [None]:
X_train

## Read the Pickle Files

In [1]:
import pickle

# Load the pkl files
with open("clean_df.pkl", "rb") as f:
    clean_df_loaded = pickle.load(f)
with open("train.pkl", "rb") as f:
    train_loaded = pickle.load(f)
with open("val.pkl", "rb") as f:
    val_loaded = pickle.load(f)
with open("test.pkl", "rb") as f:
    test_loaded = pickle.load(f)

# Print their shapes
print(f"clean_df shape: {clean_df_loaded.shape}")
print(f"train shape: {train_loaded.shape}")
print(f"val shape: {val_loaded.shape}")
print(f"test shape: {test_loaded.shape}")

# Verify that train, val, and test have the same columns
print(f"train columns: {train_loaded.columns.tolist()}")
print(f"val columns:   {val_loaded.columns.tolist()}")
print(f"test columns:  {test_loaded.columns.tolist()}")

assert set(train_loaded.columns) == set(val_loaded.columns) == set(test_loaded.columns), \
    "train, val, test files do not have the same columns!"
print("✅ train, val, and test have the same columns.")

clean_df shape: (1048572, 6)
train shape: (1151993, 7)
val shape: (287999, 7)
test shape: (160000, 7)
train columns: ['polarity of tweet', 'id of the tweet', 'date of the tweet', 'query', 'user', 'text of the tweet', 'label']
val columns:   ['polarity of tweet', 'id of the tweet', 'date of the tweet', 'query', 'user', 'text of the tweet', 'label']
test columns:  ['polarity of tweet', 'id of the tweet', 'date of the tweet', 'query', 'user', 'text of the tweet', 'label']
✅ train, val, and test have the same columns.
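
The saved `label` column still holds the raw polarity codes. If a model expects contiguous class indices starting at 0, as many classification losses do, a mapping step along the following lines could be applied before training. The names `POLARITY_NAMES` and `build_label_index` are assumptions for illustration, and the input values are toy data.

```python
# Human-readable names, taken from the class mapping in the implementation plan
POLARITY_NAMES = {0: "negative", 2: "neutral", 4: "positive"}


def build_label_index(polarities):
    """Map each distinct polarity code to a contiguous index 0..K-1."""
    return {code: idx for idx, code in enumerate(sorted(set(polarities)))}


# Toy values; the training data inspected above contains only codes 0 and 4
polarities = [0, 4, 4, 0, 4]

label_index = build_label_index(polarities)
encoded = [label_index[p] for p in polarities]
names = [POLARITY_NAMES[p] for p in polarities]

print(label_index)
print(encoded)
print(names)
```

Deriving the index from the codes that are actually present keeps the output layer at two units for this data, while the same helper would yield three classes if neutral tweets were included.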
