
***Task 3 - Sentiment Analysis using LSTM***

Sentiment analysis is a common natural language processing (NLP) task that involves determining the sentiment or emotional tone behind a body of text. It is widely used in fields such as marketing, customer service, and social media monitoring to gauge public opinion and understand customer feedback.

In this task, you will implement a Long Short-Term Memory (LSTM) network, a type of recurrent neural network (RNN) that is particularly well-suited for analyzing sequential data, such as text. Using the IMDB movie reviews dataset, you will build a model to classify reviews as either positive or negative. This exercise will help you understand how LSTMs can capture the context and sequence of words in a text, making them powerful tools for tasks like sentiment analysis.

By the end of this task, you should be able to implement a basic LSTM model, preprocess text data, and evaluate the model's performance using metrics such as accuracy and F1-score. This hands-on experience will give you a deeper understanding of how deep learning models can be applied to real-world NLP problems.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
import re

The pd.read_csv() function is used to read the CSV file. We specify the engine='python' to handle complex parsing scenarios, such as files with irregular delimiters or quotes. The on_bad_lines='skip' parameter ensures that any problematic rows in the CSV file (e.g., rows with formatting issues) are skipped instead of causing the program to crash. This helps in handling large and potentially messy datasets. After loading the data, the df.dropna(inplace=True) line removes any rows that contain missing values. This is important to ensure that the data fed into the model is complete and does not cause errors during processing.

In [None]:
# 1. Load and Preprocess the Dataset
def load_data(file_path):
    # Load the dataset (e.g., IMDB movie reviews dataset)
    df = pd.read_csv(file_path, engine='python', on_bad_lines='skip')  # Using 'python' engine and skipping bad lines
    df.dropna(inplace=True)  # Drop any rows with missing values
    return df['review'], df['sentiment']  # Assuming 'review' and 'sentiment' columns
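
To see what these two options do in practice, here is a minimal sketch on a tiny in-memory CSV. The sample rows and the names raw_csv and sample_df are made up for illustration and are not part of the IMDB dataset. The sketch assumes a pandas version that supports on_bad_lines (1.3 or later).

```python
import io
import pandas as pd

# Hypothetical sample data: one row has an extra field, one has no label.
raw_csv = io.StringIO(
    'review,sentiment\n'
    '"Great movie",positive\n'
    '"Too many fields",negative,extra\n'
    '"No label",\n'
    '"Terrible",negative\n'
)

# Rows with more fields than the header are skipped instead of raising an error.
sample_df = pd.read_csv(raw_csv, engine='python', on_bad_lines='skip')
rows_after_parsing = len(sample_df)

# Rows with a missing value (here, the empty sentiment) are dropped.
sample_df = sample_df.dropna()
rows_after_dropna = len(sample_df)

print(rows_after_parsing, rows_after_dropna)
```

Because skipped and dropped rows disappear silently, it can be worth comparing the row count before and after loading the real file.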


The clean_text function is designed to clean and preprocess text data by removing unwanted characters, numbers, and symbols, ensuring that the text is ready for tokenization and further processing.

re.sub(r"[^A-Za-z\s]", "", text) removes every character that is not a letter (A-Z, a-z) or whitespace. This includes punctuation, digits, and special symbols.

re.sub(r"\s+", " ", text) replaces each run of whitespace (spaces, tabs, newlines) with a single space.

.strip() removes any leading or trailing spaces from the text.

This cleaning process ensures that the text is standardized, making it easier for the model to learn patterns without being confused by irrelevant characters or inconsistent spacing.

In [None]:
# Clean the text
def clean_text(text):
    # Remove unwanted characters, numbers, and symbols
    text = re.sub(r"[^A-Za-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
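
The short sketch below runs clean_text on a made-up sentence to show the effect of each step. It also shows one side effect: if a review contains an HTML line break such as <br />, the letters "br" survive as a stray token. The strip_html_then_clean helper is a hypothetical variant, not part of the original notebook, that removes tag-like text first.

```python
import re

def clean_text(text):
    # Same two steps as the function above.
    text = re.sub(r"[^A-Za-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def strip_html_then_clean(text):
    # Hypothetical variant: replace anything that looks like an HTML tag with a space first.
    text = re.sub(r"<[^>]+>", " ", text)
    return clean_text(text)

sample = "This movie was GREAT!!! 10/10 <br /> would   watch again."
print(clean_text(sample))             # This movie was GREAT br would watch again
print(strip_html_then_clean(sample))  # This movie was GREAT would watch again
```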

preprocess_text first cleans each review by removing unwanted characters using the clean_text function. It then fits a Tokenizer that keeps only the most frequent words (limited by max_words) and converts each review into a sequence of integers, where each integer represents a word. The sequences are padded or truncated to a uniform length (max_len) so that every input to the model has the same size. Finally, the function returns the padded sequences and the tokenizer for further use.

A Tokenizer in the context of text processing is a tool used to convert text data into a numerical format that machine learning models can understand.

In [None]:
# Tokenize and Pad Sequences
def preprocess_text(reviews, max_words=5000, max_len=200):
    reviews = [clean_text(review) for review in reviews]  # Clean the reviews
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(reviews)
    sequences = tokenizer.texts_to_sequences(reviews)
    padded_sequences = pad_sequences(sequences, maxlen=max_len)
    return padded_sequences, tokenizer
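
To make the tokenize-and-pad step less of a black box, here is a simplified pure-Python imitation of it. This is not the Keras implementation, and the function names (build_word_index, texts_to_sequences, pad_pre) are hypothetical. It assumes the usual Keras defaults: words are indexed by descending frequency starting at 1, only words with an index below num_words are kept, and both padding and truncation happen at the front of the sequence.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by descending frequency; the most frequent word gets index 1.
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    # Keep only words whose index is below num_words; other words are dropped.
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(sequence, max_len):
    # Truncate from the front, then left-pad with zeros up to max_len.
    sequence = sequence[-max_len:]
    return [0] * (max_len - len(sequence)) + sequence

texts = ["good movie good acting", "bad movie"]  # made-up sample reviews
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
padded = [pad_pre(seq, max_len=3) for seq in sequences]
print(word_index)
print(sequences)
print(padded)
```

In this toy run the word "bad" has index 4, so it is dropped when num_words=4, and the longer review loses its first token when truncated to three.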

The encode_labels function converts 'positive' and 'negative' sentiment labels into 1s and 0s, respectively, for numerical processing. It then returns these labels as a NumPy array.

In [None]:
# Encode Sentiments
def encode_labels(sentiments):
    sentiments = sentiments.map({'positive': 1, 'negative': 0}).values
    return sentiments
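
One thing worth checking with this mapping: Series.map returns NaN for any value that is not a key in the dictionary, so a label with different casing or extra spaces would silently become a missing value. The sketch below uses a made-up three-label Series (the names labels and encoded are illustrative) to show the effect.

```python
import pandas as pd

# Hypothetical sample labels; note the capital P in the last one.
labels = pd.Series(['positive', 'negative', 'Positive'])

encoded = labels.map({'positive': 1, 'negative': 0})
unmapped_count = int(encoded.isna().sum())

print(encoded.tolist())
print(unmapped_count)
```

A non-zero unmapped_count on the real data would suggest normalizing the labels (for example with .str.strip().str.lower()) before mapping.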

In [None]:
# Load Data
file_path = 'IMDB Dataset.csv'  # <-- Provide the correct path to the dataset
reviews, sentiments = load_data(file_path)

In [None]:
# Preprocess Text Data
max_words = 5000  # Consider the top 5000 words
max_len = 200  # Pad or truncate reviews to 200 words
X, tokenizer = preprocess_text(reviews, max_words=max_words, max_len=max_len)


In [None]:
# Encode Sentiments (positive -> 1, negative -> 0)
y = encode_labels(sentiments)
# Split into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# 2. Define and Train the Bidirectional LSTM Model
bidirectional_model = Sequential()
bidirectional_model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))  # Modify 'output_dim'
bidirectional_model.add(Bidirectional(LSTM(units=64, return_sequences=True)))  # Experiment with 'units'
bidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
bidirectional_model.add(Bidirectional(LSTM(units=64)))  # Experiment with 'units'
bidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
bidirectional_model.add(Dense(1, activation='sigmoid'))
bidirectional_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])




In [None]:
# Train the Bidirectional LSTM model
bidirectional_history = bidirectional_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)  # Adjust 'epochs' and 'batch_size'


Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 397s 308ms/step - accuracy: 0.7224 - loss: 0.5246 - val_accuracy: 0.8707 - val_loss: 0.3151
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 357s 286ms/step - accuracy: 0.8644 - loss: 0.3347 - val_accuracy: 0.8664 - val_loss: 0.3216
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 364s 291ms/step - accuracy: 0.8875 - loss: 0.2857 - val_accuracy: 0.8877 - val_loss: 0.2710
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 352s 282ms/step - accuracy: 0.9141 - loss: 0.2264 - val_accuracy: 0.8940 - val_loss: 0.2658
Epoch 5/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 384s 307ms/step - accuracy: 0.9349 - loss: 0.1808 - val_accuracy: 0.8838 - val_loss: 0.2898
Epoch 6/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 460s 368ms/step - accuracy: 0.9463 - loss: 0.1538 - val_accuracy: 0.8918 - val_loss:
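
In the log above, training accuracy keeps climbing while validation loss is lowest at an early epoch. The sketch below picks the best epoch from a history-style dictionary, which is the same shape as bidirectional_history.history returned by fit. The values are copied from the first five logged epochs only, since the sixth line is cut off. The best_epoch helper is a hypothetical name introduced for illustration.

```python
# Validation metrics per epoch, copied from the training log above (epochs 1 to 5).
logged_history = {
    'val_accuracy': [0.8707, 0.8664, 0.8877, 0.8940, 0.8838],
    'val_loss': [0.3151, 0.3216, 0.2710, 0.2658, 0.2898],
}

def best_epoch(history_dict, metric='val_loss'):
    """Return the 1-based epoch with the lowest value of the given metric."""
    values = history_dict[metric]
    return min(range(len(values)), key=values.__getitem__) + 1

epoch = best_epoch(logged_history)
print(epoch, logged_history['val_accuracy'][epoch - 1])
```

If stopping at that point automatically is desirable, Keras provides an EarlyStopping callback that monitors val_loss. Whether it helps here would need to be tested.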

In [None]:
# 3. Define and Train the Unidirectional LSTM Model
unidirectional_model = Sequential()
unidirectional_model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))  # Modify 'output_dim'
unidirectional_model.add(LSTM(units=64, return_sequences=True))  # Experiment with 'units'
unidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
unidirectional_model.add(LSTM(units=64))  # Experiment with 'units'
unidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
unidirectional_model.add(Dense(1, activation='sigmoid'))
unidirectional_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
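
To get a feel for how much capacity the Bidirectional wrapper adds, the trainable parameters of both architectures can be estimated by hand. This is a back-of-the-envelope sketch that assumes the standard LSTM formulation with bias terms and the default behaviour of Bidirectional, which concatenates the forward and backward outputs. The helper names are hypothetical, and the totals are worth confirming against model.summary().

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per vocabulary entry.
    return vocab_size * output_dim

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size, embed_dim, units = 5000, 128, 64  # values used in the notebook

bidirectional_total = (
    embedding_params(vocab_size, embed_dim)
    + 2 * lstm_params(embed_dim, units)   # first Bidirectional layer (two directions)
    + 2 * lstm_params(2 * units, units)   # second layer sees concatenated 128-dim outputs
    + dense_params(2 * units, 1)
)
unidirectional_total = (
    embedding_params(vocab_size, embed_dim)
    + lstm_params(embed_dim, units)
    + lstm_params(units, units)
    + dense_params(units, 1)
)
print(bidirectional_total, unidirectional_total)
```

Under these assumptions most parameters sit in the shared Embedding layer, so the bidirectional model is larger but not dramatically so. Dropout layers add no parameters.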


In [None]:
# Train the Unidirectional LSTM model
unidirectional_history = unidirectional_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)  # Adjust 'epochs' and 'batch_size'


Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 204s 159ms/step - accuracy: 0.7620 - loss: 0.4731 - val_accuracy: 0.8727 - val_loss: 0.3015
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 196s 157ms/step - accuracy: 0.8911 - loss: 0.2773 - val_accuracy: 0.8873 - val_loss: 0.2808
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 202s 161ms/step - accuracy: 0.9154 - loss: 0.2199 - val_accuracy: 0.8735 - val_loss: 0.3010
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 200s 160ms/step - accuracy: 0.9294 - loss: 0.1903 - val_accuracy: 0.8827 - val_loss: 0.3195
Epoch 5/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 216s 173ms/step - accuracy: 0.9446 - loss: 0.1592 - val_accuracy: 0.8845 - val_loss: 0.2961
Epoch 6/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 249s 158ms/step - accuracy: 0.9537 - loss: 0.1305 - val_accuracy: 0.8740 - val_loss:

In [None]:
# 4. Evaluate the Bidirectional LSTM Model
y_pred_bidirectional = (bidirectional_model.predict(X_test) > 0.5).astype("int32")


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m39s[0m 107ms/step


In [None]:
# Calculate Accuracy and F1-Score for Bidirectional LSTM
accuracy_bidirectional = accuracy_score(y_test, y_pred_bidirectional)
f1_bidirectional = f1_score(y_test, y_pred_bidirectional)
print(f'Bidirectional LSTM - Accuracy: {accuracy_bidirectional:.4f}')
print(f'Bidirectional LSTM - F1-Score: {f1_bidirectional:.4f}')


Bidirectional LSTM - Accuracy: 0.5132
Bidirectional LSTM - F1-Score: 0.3564


In [None]:
# 5. Evaluate the Unidirectional LSTM Model
y_pred_unidirectional = (unidirectional_model.predict(X_test) > 0.5).astype("int32")


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 73ms/step


In [None]:

# Calculate Accuracy and F1-Score for Unidirectional LSTM
accuracy_unidirectional = accuracy_score(y_test, y_pred_unidirectional)
f1_unidirectional = f1_score(y_test, y_pred_unidirectional)
print(f'Unidirectional LSTM - Accuracy: {accuracy_unidirectional:.4f}')
print(f'Unidirectional LSTM - F1-Score: {f1_unidirectional:.4f}')


Unidirectional LSTM - Accuracy: 0.8861
Unidirectional LSTM - F1-Score: 0.8874
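
To make the two scores concrete, the sketch below computes accuracy and F1 from scratch for a small made-up set of labels, treating 1 (positive) as the class of interest, as f1_score does by default for binary labels. The binary_metrics helper and the toy labels are illustrative only and are not taken from the notebook's predictions.

```python
def binary_metrics(y_true, y_pred):
    # Count the four outcomes of a binary confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(accuracy, precision, recall, f1)
```

Because F1 depends only on precision and recall for the positive class, a model that rarely predicts 1 can have an F1 well below its accuracy.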


### Model Performance Comparison

**Bidirectional LSTM:**
- **Accuracy:** 0.5132
- **F1-Score:** 0.3564

**Unidirectional LSTM:**
- **Accuracy:** 0.8861
- **F1-Score:** 0.8874

#### Analysis:
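On the final evaluation, the Unidirectional LSTM scored far higher than the Bidirectional LSTM on both accuracy and F1-score. This gap should be read with caution, however. The training log for the Bidirectional LSTM reports a validation accuracy of roughly 0.87 to 0.89 on the same split (X_test, y_test), which is on par with the Unidirectional LSTM, while its evaluated accuracy of 0.5132 is close to what random guessing would give on a balanced two-class dataset.

A near-chance score following a validation accuracy of about 0.89 is hard to explain by overfitting or by the added complexity of processing sequences in both directions. A more likely explanation is a problem in the evaluation run itself, for example the model-definition cell being re-executed (which resets the weights) before the prediction cell, or cells being run out of order. The notebook output alone does not show which of these occurred.

Simpler models can outperform more complex ones when the additional complexity adds little to the model's understanding of the data, but these results do not establish that here. A useful next step is to retrain the Bidirectional LSTM and evaluate it in the same session to confirm its test score. In both training logs, training accuracy keeps rising while validation loss stops improving after the first few epochs, so further experiments could also try fewer epochs or stronger regularization for either model.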
