**Assignment Topic:** RNN, LSTM

**Date:** 23 March 2025  

**Name:** Ajmal

**Roll Number:** cs22b2046  


# Lab - 08        
## Sentiment Classification using RNN, LSTM  
**Date:** 17-03-2025  

### Objective
- Train an RNN-based sentiment analysis model to classify movie reviews.
- Explore the preprocessing steps commonly used in Natural Language Processing (NLP).
- Apply suitable preprocessing steps for this sentiment analysis assignment.
- Build and train an RNN model using basic layers from the framework.
- Test the model on the test set using suitable evaluation metrics.

### Task 1: Train an RNN-based Model
- Build and train an RNN model using basic layers from the framework.
- Test the model on the test set using suitable evaluation metrics.

### Task 2: Train an LSTM-based Model
- Build and train an LSTM model using basic layers from the framework.
- Test the model on the test set using suitable evaluation metrics.

### Comparison
Compare the two approaches and highlight the improvements.

## Dataset: Stanford Sentiment Treebank 2  
**Original dataset link:** [SST2 Dataset](https://huggingface.co/datasets/stanfordnlp/sst2)  
**Dataset Zip Link:** [Download Here](https://drive.google.com/file/d/1TytoIgt7KI9Ep9bo8bs_X0HSSnBJX0oi/)  

### Data Fields
- **idx**: Monotonically increasing index ID.
- **sentence**: Complete sentence expressing an opinion about a film.
- **label**: Sentiment of the opinion, either "negative" (0) or "positive" (1).

### Data Split
- Split the provided training dataset (67,349 rows) into:
  - **5,000 rows** for testing
  - **Remaining** for training
- Use the separately provided validation dataset (872 rows) for validation.
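
The split sizes described above can be checked with a small standalone sketch. It uses only the row counts stated in the assignment; the helper `split_indices` is a hypothetical stand-in for `train_test_split` and does not reproduce scikit-learn's exact shuffling.

```python
import random

TOTAL_ROWS = 67_349  # rows in the provided training file (from the assignment text)
TEST_ROWS = 5_000    # rows held out for testing

def split_indices(n_rows, n_test, seed=42):
    """Shuffle row indices and hold out n_test of them (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # First n_test shuffled indices become the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(TOTAL_ROWS, TEST_ROWS)
print(len(train_idx), len(test_idx))  # 62349 5000
```

With this split, 62,349 rows remain for training, and the 872-row validation file is left untouched.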


# Part 1 RNN

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Load dataset from Google Drive


In [None]:
# sst2_train.parquet  sst2_valid.parquet
train_path = '/content/drive/MyDrive/sst2/sst2_train.parquet'
val_path = '/content/drive/MyDrive/sst2/sst2_valid.parquet'

In [None]:
train_df = pd.read_parquet(train_path)
val_df = pd.read_parquet(val_path)

## Split training data into training and testing


In [None]:
train_data, test_data = train_test_split(train_df, test_size=5000, random_state=42)

In [None]:
# Extract sentences and labels
X_train, y_train = train_data["sentence"], train_data["label"]
X_test, y_test = test_data["sentence"], test_data["label"]
X_val, y_val = val_df["sentence"], val_df["label"]

## Preprocessing 1 (pre1): Tokenization and Padding Only

Test Accuracy: 0.7112

|               | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| **0**       | 0.81      | 0.43   | 0.56     | 2167    |
| **1**       | 0.68      | 0.92   | 0.78     | 2833    |
| **Accuracy** |           |        | 0.71     | 5000    |
| **Macro Avg** | 0.75     | 0.68   | 0.67     | 5000    |
| **Weighted Avg** | 0.74  | 0.71   | 0.69     | 5000    |


In [None]:

# Tokenization and Padding
max_vocab = 20000  # Limit vocabulary size
max_length = 100  # Max length of a sentence

tokenizer = Tokenizer(num_words=max_vocab, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_length, padding='post')
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_length, padding='post')
X_val_seq = pad_sequences(tokenizer.texts_to_sequences(X_val), maxlen=max_length, padding='post')
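
To make the tokenization and padding step more concrete, here is a simplified pure-Python sketch of the same idea. It is not the Keras implementation: it splits on whitespace only, and the helper names (`fit_word_index`, `encode`, `pad_post`) are hypothetical. Like the cell above, it reserves index 1 for `<OOV>`, ranks words by frequency, and pads with zeros at the end.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is reserved for the out-of-vocabulary token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {"<OOV>": 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words):
    """Map words to indices; unknown or too-rare words become 1 (<OOV>)."""
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = word_index["<OOV>"]
        sequence.append(index)
    return sequence

def pad_post(sequence, maxlen):
    """Keep the last maxlen tokens, then pad with zeros at the end."""
    sequence = sequence[-maxlen:]
    return sequence + [0] * (maxlen - len(sequence))

word_index = fit_word_index(["a great movie", "a dull movie"])
encoded = encode("a great film", word_index, num_words=100)
padded = pad_post(encoded, maxlen=5)
print(padded)  # "film" was never seen during fitting, so it maps to 1
```

Fitting the word index on the training sentences only, as the cell above does, keeps validation and test vocabulary out of the model's inputs.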


## Preprocessing 2 (pre2): Text Cleaning, Tokenization, and Padding
Test Accuracy: 0.8946

|               | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| **0**       | 0.89      | 0.87   | 0.88     | 2167    |
| **1**       | 0.90      | 0.91   | 0.91     | 2833    |
| **Accuracy** |           |        | 0.8946   | 5000    |
| **Macro Avg** | 0.89     | 0.89   | 0.89     | 5000    |
| **Weighted Avg** | 0.89  | 0.89   | 0.89     | 5000    |





In [None]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Text Cleaning Function
stop_words = set(stopwords.words('english'))  # Build the stopword set once, not on every call

def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove punctuation and special characters
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply text cleaning
X_train = X_train.apply(clean_text)
X_test = X_test.apply(clean_text)
X_val = X_val.apply(clean_text)

# Tokenization and Padding
max_vocab = 10000  # Limit vocabulary size
max_length = 100  # Max length of a sentence

tokenizer = Tokenizer(num_words=max_vocab, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_length, padding='post')
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_length, padding='post')
X_val_seq = pad_sequences(tokenizer.texts_to_sequences(X_val), maxlen=max_length, padding='post')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
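
One side effect of stopword removal is worth illustrating: NLTK's English stopword list contains negations such as "not" and "no", so cleaning can erase the difference between a positive and a negative sentence. The sketch below uses a small hard-coded subset of that list (an assumption, so it runs without downloading anything) and a hypothetical variant, `clean_text_keep_negation`, that was not used in this notebook.

```python
import re

# Hypothetical subset of NLTK's English stopword list, hard-coded for an offline demo.
STOP_WORDS = {"this", "is", "a", "the", "was", "not", "no", "nor"}
NEGATIONS = {"not", "no", "nor"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, and drop stopwords (same steps as the cell above)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(word for word in text.split() if word not in stop_words)

def clean_text_keep_negation(text):
    """Variant that keeps negation words by removing them from the stopword set."""
    return clean_text(text, stop_words=STOP_WORDS - NEGATIONS)

print(clean_text("This movie is not good."))                # movie good
print(clean_text("This movie is good."))                    # movie good
print(clean_text_keep_negation("This movie is not good."))  # movie not good
```

Whether keeping negations would change the accuracy reported here was not tested.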


## Build RNN Model

In [None]:
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.optimizers import Adam


def create_rnn_model():
    model = Sequential([
        Embedding(input_dim=max_vocab, output_dim=128, input_length=max_length),
        SimpleRNN(64, return_sequences=False),
        Dropout(0.3),
        Dense(64),
        LeakyReLU(alpha=0.1),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.00005), metrics=['accuracy'])
    return model

model = create_rnn_model()

## Train

In [None]:
model.fit(X_train_seq, y_train, validation_data=(X_val_seq, y_val), epochs=5, batch_size=32)

Epoch 1/5
1949/1949 ━━━━━━━━━━━━━━━━━━━━ 23s 10ms/step - accuracy: 0.6644 - loss: 0.5988 - val_accuracy: 0.7569 - val_loss: 0.5403
Epoch 2/5
1949/1949 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8838 - loss: 0.3063 - val_accuracy: 0.7867 - val_loss: 0.5313
Epoch 3/5
1949/1949 ━━━━━━━━━━━━━━━━━━━━ 18s 9ms/step - accuracy: 0.9099 - loss: 0.2408 - val_accuracy: 0.7936 - val_loss: 0.5375
Epoch 4/5
1949/1949 ━━━━━━━━━━━━━━━━━━━━ 20s 9ms/step - accuracy: 0.9196 - loss: 0.2156 - val_accuracy: 0.7982 - val_loss: 0.5234
Epoch 5/5
1949/1949 ━━━━━━━━━━━━━━━━━━━━ 21s 9ms/step - accuracy: 0.9242 - loss: 0.2004 - val_accuracy: 0.7901 - val_loss: 0.5565



## Evaluate

In [None]:
y_pred = (model.predict(X_test_seq) > 0.5).astype("int32")
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step
Test Accuracy: 0.8946
              precision    recall  f1-score   support

           0       0.89      0.87      0.88      2167
           1       0.90      0.91      0.91      2833

    accuracy                           0.89      5000
   macro avg       0.89      0.89      0.89      5000
weighted avg       0.89      0.89      0.89      5000



# Part 2 LSTM

In [None]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Load dataset from Google Drive
from google.colab import drive
drive.mount('/content/drive')

# File paths
train_path = '/content/drive/MyDrive/sst2/sst2_train.parquet'
val_path = '/content/drive/MyDrive/sst2/sst2_valid.parquet'

# Read datasets
train_df = pd.read_parquet(train_path)
val_df = pd.read_parquet(val_path)

# Split train data into training and testing sets
train_data, test_data = train_test_split(train_df, test_size=5000, random_state=42)  # Integer test_size holds out exactly 5,000 rows, as in Part 1


Mounted at /content/drive


In [None]:
# Text Preprocessing Function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters and numbers
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply cleaning
train_data['sentence'] = train_data['sentence'].apply(clean_text)
val_df['sentence'] = val_df['sentence'].apply(clean_text)
test_data['sentence'] = test_data['sentence'].apply(clean_text)

# Extract sentences and labels
X_train, y_train = train_data['sentence'].values, train_data['label'].values
X_val, y_val = val_df['sentence'].values, val_df['label'].values
X_test, y_test = test_data['sentence'].values, test_data['label'].values

# Tokenization and padding
max_words = 20000  # Vocabulary size limit
max_len = 200

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len, padding='post')
X_val_seq = pad_sequences(tokenizer.texts_to_sequences(X_val), maxlen=max_len, padding='post')
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_len, padding='post')


In [None]:
# Build LSTM model
model = Sequential([
    Embedding(input_dim=max_words, output_dim=128, input_length=max_len, trainable=True),  # Trainable embedding layer
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(32)),
    Dense(16, activation='relu'),
    Dropout(0.5),
    #Dense(16, activation='relu'),
    #Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
# Train the model
epochs = 5
batch_size = 128
history = model.fit(X_train_seq, y_train, validation_data=(X_val_seq, y_val), epochs=epochs, batch_size=batch_size)


Epoch 1/5
488/488 ━━━━━━━━━━━━━━━━━━━━ 26s 41ms/step - accuracy: 0.7116 - loss: 0.5293 - val_accuracy: 0.8050 - val_loss: 0.4813
Epoch 2/5
488/488 ━━━━━━━━━━━━━━━━━━━━ 20s 40ms/step - accuracy: 0.9191 - loss: 0.2367 - val_accuracy: 0.7821 - val_loss: 0.5300
Epoch 3/5
488/488 ━━━━━━━━━━━━━━━━━━━━ 21s 40ms/step - accuracy: 0.9441 - loss: 0.1683 - val_accuracy: 0.7833 - val_loss: 0.5954
Epoch 4/5
488/488 ━━━━━━━━━━━━━━━━━━━━ 21s 40ms/step - accuracy: 0.9543 - loss: 0.1296 - val_accuracy: 0.7626 - val_loss: 0.7874
Epoch 5/5
488/488 ━━━━━━━━━━━━━━━━━━━━ 19s 40ms/step - accuracy: 0.9597 - loss: 0.1086 - val_accuracy: 0.7672 - val_loss: 1.1066
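
The validation loss in the log above rises after the first epoch while training loss keeps falling, which is the usual sign of overfitting. As a rough sketch, the epoch with the lowest validation loss can be read off the logged values for both models; the lists below are copied from the two training logs in this notebook, and `best_epoch` is a hypothetical helper. In Keras, the `EarlyStopping` callback with `monitor='val_loss'` automates this kind of selection, though it was not used here.

```python
# Validation losses copied from the training logs (one value per epoch).
rnn_val_loss = [0.5403, 0.5313, 0.5375, 0.5234, 0.5565]
lstm_val_loss = [0.4813, 0.5300, 0.5954, 0.7874, 1.1066]

def best_epoch(val_losses):
    """Return the 1-based epoch number with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

print("RNN best epoch:", best_epoch(rnn_val_loss))    # 4
print("LSTM best epoch:", best_epoch(lstm_val_loss))  # 1
```

By this measure the LSTM was at its best on the validation set after the first epoch, so the weights evaluated on the test set are not necessarily the ones that generalize best.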


In [None]:
# Evaluate the model on test set
y_pred = (model.predict(X_test_seq) > 0.5).astype("int32")
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.4f}')

[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 17ms/step
Test Accuracy: 0.9190
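
Part 2 reports accuracy only, whereas Part 1 also reports per-class precision, recall, and F1. Calling `classification_report(y_test, y_pred)` as in Part 1 would give the same table for the LSTM. To show what those numbers mean, the sketch below computes them by hand on toy labels; the labels are invented for illustration and are not results from this notebook.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (illustrative only).
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

for label in (0, 1):
    precision, recall, f1 = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```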


# Part 3 Compare the two approaches and highlight the improvements

### Comparison between RNN and LSTM Approaches

#### 1. **Model Architecture**
- **RNN Model (Part 1)**:
  - The RNN model uses a single **SimpleRNN** layer with 64 units on top of a 128-dimensional embedding, followed by dropout (0.3), a dense layer with 64 units and a LeakyReLU activation, and a single sigmoid output neuron for binary classification.
  - The model is relatively simple: it has one recurrent layer and reads each sentence in one direction only.

- **LSTM Model (Part 2)**:
  - The LSTM model stacks two **Bidirectional LSTM** layers (64 units, then 32 units per direction) on top of a 128-dimensional embedding. These are followed by a dense layer with 16 units and a ReLU activation, dropout (0.5), and a single sigmoid output neuron for binary classification.
  - The LSTM model is more complex. Its gated cell state is designed to carry information across many time steps, which is useful for sequential data such as text.
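
The difference in model size can be estimated with the standard parameter-count formulas for these layer types. The sketch below assumes default Keras layer settings (bias terms enabled) and the layer sizes from the two model definitions in this notebook; the helper names are hypothetical. Dropout and LeakyReLU add no trainable parameters.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def dense_params(inputs, units):
    return inputs * units + units  # weights plus biases

def simple_rnn_params(inputs, units):
    return (inputs + units + 1) * units  # input weights, recurrent weights, bias

def lstm_params(inputs, units):
    return 4 * (inputs + units + 1) * units  # four gates, each shaped like a SimpleRNN

# Part 1: Embedding(10000, 128) -> SimpleRNN(64) -> Dense(64) -> Dense(1)
rnn_total = (embedding_params(10_000, 128) + simple_rnn_params(128, 64)
             + dense_params(64, 64) + dense_params(64, 1))

# Part 2: Embedding(20000, 128) -> BiLSTM(64) -> BiLSTM(32) -> Dense(16) -> Dense(1)
# A bidirectional wrapper doubles the parameters and the output width.
lstm_total = (embedding_params(20_000, 128)
              + 2 * lstm_params(128, 64)      # output width 2 * 64 = 128
              + 2 * lstm_params(128, 32)      # output width 2 * 32 = 64
              + dense_params(64, 16) + dense_params(16, 1))

print(f"RNN model:  {rnn_total:,} parameters")
print(f"LSTM model: {lstm_total:,} parameters")
```

Under these assumptions the LSTM model has roughly twice as many parameters, and most of the parameters in both models sit in the embedding layer.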

#### 2. **Preprocessing**
- **RNN Model (Part 1)**:
  - The preprocessing consists of text cleaning (lowercasing, removing punctuation and special characters, removing English stopwords), followed by tokenization and post-padding. The vocabulary is limited to 10,000 words, and sequences are padded or truncated to 100 tokens.
  
- **LSTM Model (Part 2)**:
  - The preprocessing is similar to the RNN model, except that the cleaning step also strips digits. The vocabulary limit is raised to 20,000 words and the sequence length to 200 tokens. This lets the model see rarer words and longer inputs; how much that helps depends on how many cleaned sentences actually exceed 100 tokens, which was not measured here.

#### 3. **Training**
- **RNN Model (Part 1)**:
  - The RNN model is trained for 5 epochs with a batch size of 32 (1,949 steps per epoch), using the Adam optimizer with a low learning rate of 0.00005.
  
- **LSTM Model (Part 2)**:
  - The LSTM model is also trained for 5 epochs, but with a larger batch size of 128 (488 steps per epoch). The learning rate is not set explicitly, so Adam uses its Keras default of 0.001, which is 20 times higher than in Part 1. The larger batch size reduces the number of weight updates per epoch; it does not by itself improve generalization.
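
The step counts shown in the training logs follow directly from the dataset size and the batch size. A quick check, assuming 62,349 training rows (67,349 minus the 5,000 held out) and the batch sizes used in each part:

```python
import math

TRAIN_ROWS = 67_349 - 5_000  # rows left for training after the test split
TEST_ROWS = 5_000

def steps_per_epoch(n_rows, batch_size):
    """Number of batches per epoch; the last batch may be smaller."""
    return math.ceil(n_rows / batch_size)

print(steps_per_epoch(TRAIN_ROWS, 32))   # 1949, matches the RNN log
print(steps_per_epoch(TRAIN_ROWS, 128))  # 488, matches the LSTM log
print(steps_per_epoch(TEST_ROWS, 32))    # 157, matches the predict() progress bar
```

Over 5 epochs the RNN therefore receives about four times as many weight updates as the LSTM, which is one more difference between the two runs besides the architecture.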

#### 4. **Performance**
- **RNN Model (Part 1)**:
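  - The RNN model achieves a **test accuracy of 89.46%**. Precision, recall, and F1-score are balanced across the two classes, with slightly better scores on the positive class (F1 of 0.91 for label 1 versus 0.88 for label 0). Validation accuracy after the final epoch is 79.01%.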
  
- **LSTM Model (Part 2)**:
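  - The LSTM model achieves a **test accuracy of 91.90%**, which is 2.44 percentage points higher than the RNN model. On the separate validation set, however, its accuracy falls from 80.50% after epoch 1 to 76.72% after epoch 5 while validation loss rises from 0.48 to 1.11, which indicates overfitting. No classification report was produced for this model, so per-class precision and recall cannot be compared.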

#### 5. **Improvements with LSTM**
- **Better Handling of Long-Term Dependencies**: LSTM cells use gates to decide what to keep and what to forget, so information from early words can survive to the end of the sentence. This matters in sentiment analysis, where the meaning of a word often depends on its context.
- **Higher Accuracy**: The LSTM model reaches a test accuracy of 91.90% compared with 89.46% for the RNN model, a gain of 2.44 percentage points.
- **Larger Vocabulary and Sequence Length**: The LSTM model is configured with a larger vocabulary (20,000 words) and longer sequences (200 tokens), so fewer words are mapped to `<OOV>` and fewer sentences are truncated. Because these settings differ from Part 1, part of the accuracy gain may come from preprocessing rather than from the LSTM itself.
- **Bidirectional LSTM**: The bidirectional layers read each sentence both left to right and right to left, so the representation of a word can draw on the words that follow it as well as those that precede it.
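
The gating and the bidirectional reading described above can be sketched in a few lines of NumPy. This is a minimal illustration with random weights, not the Keras implementation; the function names, the dictionary layout of the parameters, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_dim, units, rng):
    """Random weights for one LSTM layer; the four gates are stacked side by side."""
    return {
        "W": rng.normal(scale=0.1, size=(input_dim, 4 * units)),  # input weights
        "U": rng.normal(scale=0.1, size=(units, 4 * units)),      # recurrent weights
        "b": np.zeros(4 * units),
    }

def lstm_last_state(inputs, params, units):
    """Run an LSTM over a sequence of vectors and return the final hidden state."""
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state (the long-term memory)
    for x_t in inputs:
        z = x_t @ params["W"] + h @ params["U"] + params["b"]
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate cell values
        c = f * c + i * g       # forget part of the old memory, add new information
        h = o * np.tanh(c)      # expose part of the memory as the hidden state
    return h

def bidirectional_last_state(inputs, forward, backward, units):
    """Concatenate the final states of a forward pass and a reversed pass."""
    h_forward = lstm_last_state(inputs, forward, units)
    h_backward = lstm_last_state(inputs[::-1], backward, units)
    return np.concatenate([h_forward, h_backward])

rng = np.random.default_rng(0)
units, input_dim, seq_len = 4, 8, 7
sequence = rng.normal(size=(seq_len, input_dim))  # stand-in for embedded tokens
output = bidirectional_last_state(
    sequence, init_lstm(input_dim, units, rng), init_lstm(input_dim, units, rng), units
)
print(output.shape)  # (8,) because the two directions are concatenated
```

The doubled output width is why `Bidirectional(LSTM(32))` in Part 2 feeds 64 values into the following dense layer.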

#### 6. **Conclusion**
- The LSTM model outperforms the RNN model on the held-out test set (91.90% versus 89.46%). The likely contributors are the LSTM's gated memory, the bidirectional layers, and the larger vocabulary and sequence length. Because all of these were changed at once, together with the learning rate and batch size, the experiment does not isolate the effect of each. On the validation set the two models are close after epoch 5 (76.72% for the LSTM and 79.01% for the RNN), and the rising validation loss suggests the LSTM would benefit from early stopping or stronger regularization.

In summary, the RNN model performs reasonably well, and the LSTM model improves test accuracy by 2.44 percentage points, making it the stronger choice on this test set for sentiment analysis.