***Task 3 - Sentiment Analysis using LSTM***

Sentiment analysis is a common natural language processing (NLP) task that involves determining the sentiment or emotional tone behind a body of text. It is widely used in fields such as marketing, customer service, and social media monitoring to gauge public opinion and understand customer feedback.

In this task, you will implement a Long Short-Term Memory (LSTM) network, a type of recurrent neural network (RNN) that is particularly well-suited for analyzing sequential data, such as text. Using the IMDB movie reviews dataset, you will build a model to classify reviews as either positive or negative. This exercise will help you understand how LSTMs can capture the context and sequence of words in a text, making them powerful tools for tasks like sentiment analysis.

By the end of this task, you should be able to implement a basic LSTM model, preprocess text data, and evaluate the model's performance using metrics such as accuracy and F1-score. This hands-on experience will give you a deeper understanding of how deep learning models can be applied to real-world NLP problems.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
import re

pd.read_csv() reads the CSV file. Setting engine='python' selects pandas' Python parser, which is slower than the default C parser but more tolerant of irregular quoting and delimiters. on_bad_lines='skip' (available in pandas 1.3 and later) drops malformed rows, such as rows with too many fields, instead of raising an error. Skipped rows are discarded silently, so it is worth checking the row count after loading. df.dropna(inplace=True) then removes any rows that contain missing values, so that only complete review/sentiment pairs reach the model.

In [2]:
# 1. Load and Preprocess the Dataset
def load_data(file_path):
    # Load the dataset (e.g., IMDB movie reviews dataset)
    df = pd.read_csv(file_path, engine='python', on_bad_lines='skip')  # Using 'python' engine and skipping bad lines
    df.dropna(inplace=True)  # Drop any rows with missing values
    return df['review'], df['sentiment']  # Assuming 'review' and 'sentiment' columns
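
The loading behaviour described above can be checked on a tiny in-memory file. This is a minimal sketch, assuming pandas 1.3 or later; the CSV content below is made up for illustration and is not taken from the IMDB dataset. It relies on the fact that pd.read_csv() also accepts file-like objects.

```python
import io
import pandas as pd

def load_data(file_path):
    # Same logic as the notebook cell: tolerant parser, skip malformed rows, drop incomplete rows
    df = pd.read_csv(file_path, engine='python', on_bad_lines='skip')
    df.dropna(inplace=True)
    return df['review'], df['sentiment']

# Hypothetical sample: one row has an extra field, one row has no label
sample_csv = io.StringIO(
    'review,sentiment\n'
    '"Good film",positive\n'
    '"Too, many",negative,extra\n'
    '"No label",\n'
    '"Bad film",negative\n'
)

reviews, sentiments = load_data(sample_csv)
print(reviews.tolist())
print(sentiments.tolist())
```

The row with three fields is removed by on_bad_lines='skip', and the row with an empty sentiment is removed by dropna(), so only the two complete rows remain.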


The clean_text function is designed to clean and preprocess text data by removing unwanted characters, numbers, and symbols, ensuring that the text is ready for tokenization and further processing.

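re.sub(r"[^A-Za-z\s]", "", text) deletes every character that is not an ASCII letter (A-Z, a-z) or whitespace, which covers punctuation, digits, and special symbols. Because these characters are deleted rather than replaced with a space, words separated only by such characters are joined together.

re.sub(r"\s+", " ", text) collapses each run of whitespace (spaces, tabs, newlines) into a single space.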

.strip() removes any leading or trailing spaces from the text.

This cleaning process ensures that the text is standardized, making it easier for the model to learn patterns without being confused by irrelevant characters or inconsistent spacing.

In [3]:
# Clean the text
def clean_text(text):
    # Remove unwanted characters, numbers, and symbols
    text = re.sub(r"[^A-Za-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
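
To see what the two substitutions do in practice, the function can be run on a couple of sample strings. The strings below are invented for illustration. The second one shows a side effect worth knowing about: because characters are deleted rather than replaced by a space, an HTML line break such as `<br />` leaves the letters "br" glued to the preceding word.

```python
import re

def clean_text(text):
    # Delete everything except ASCII letters and whitespace
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # Collapse whitespace runs and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text("I LOVED it!!!   10/10 :)"))
# -> I LOVED it

print(clean_text("Great movie!<br />10/10, would watch again."))
# -> Great moviebr would watch again
```

Note that clean_text does not change letter case. Lowercasing happens later, since the Keras Tokenizer lowercases text by default.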

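preprocess_text first cleans each review with the clean_text function. It then fits a Tokenizer on the cleaned reviews and converts each review into a sequence of integers, where each integer identifies a word; with num_words=max_words, only the most frequent words are kept and rarer words are dropped from the sequences. The sequences are then padded or truncated to a uniform length (max_len) so that every input has the same size. With the default pad_sequences settings, both padding and truncation happen at the start of the sequence, so long reviews keep their last max_len tokens. Finally, the function returns the padded sequences and the fitted tokenizer for further use.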

A Tokenizer in the context of text processing is a tool used to convert text data into a numerical format that machine learning models can understand.

In [4]:
# Tokenize and Pad Sequences
def preprocess_text(reviews, max_words=5000, max_len=200):
    reviews = [clean_text(review) for review in reviews]  # Clean the reviews
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(reviews)
    sequences = tokenizer.texts_to_sequences(reviews)
    padded_sequences = pad_sequences(sequences, maxlen=max_len)
    return padded_sequences, tokenizer
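
The tokenize-and-pad step can be illustrated without Keras. The following is a pure-Python sketch that mirrors the default behaviour as described above (frequency-ranked indices starting at 1, index 0 reserved for padding, pre-padding and pre-truncation). The helper names fit_word_index, texts_to_sequences, and pad_pre are hypothetical and written only for this illustration; they are not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    # Most frequent word gets index 1; ties keep first-seen order
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            # Keep only indices below num_words, i.e. the most frequent words
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_pre(sequences, max_len):
    # Keep the last max_len tokens, then left-pad with zeros
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

texts = ["good movie", "bad movie", "good good film"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)

print(word_index)              # {'good': 1, 'movie': 2, 'bad': 3, 'film': 4}
print(pad_pre(sequences, 4))   # [[0, 0, 1, 2], [0, 0, 3, 2], [0, 1, 1, 4]]
print(pad_pre(sequences, 2))   # [[1, 2], [3, 2], [1, 4]]
```

In the last line the third review loses its first token, which is what pre-truncation means for reviews longer than max_len.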

The encode_labels function converts 'positive' and 'negative' sentiment labels into 1s and 0s, respectively, for numerical processing. It then returns these labels as a NumPy array.

In [5]:
# Encode Sentiments
def encode_labels(sentiments):
    sentiments = sentiments.map({'positive': 1, 'negative': 0}).values
    return sentiments
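
One detail of this mapping is easy to miss: Series.map returns NaN for any value that is not a key of the dictionary, so a label with different casing or an unexpected class is not rejected but silently becomes NaN. A minimal sketch with made-up labels (the 'Positive' and 'neutral' entries are hypothetical and only serve to trigger the case):

```python
import pandas as pd

def encode_labels(sentiments):
    # Same mapping as the notebook cell
    return sentiments.map({'positive': 1, 'negative': 0}).values

labels = pd.Series(['positive', 'negative', 'Positive', 'neutral'])
encoded = encode_labels(labels)

# Labels that were not covered by the mapping come back as NaN
unmapped = labels[pd.isna(encoded)]
print(encoded)
print(unmapped.tolist())
```

Checking that pd.isna(encoded).any() is False before training is a cheap way to confirm that every row received a valid 0 or 1.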

In [6]:
# Load Data
file_path = 'IMDB Dataset.csv'  # <-- Provide the correct path to the dataset
reviews, sentiments = load_data(file_path)

In [7]:
# Preprocess Text Data
max_words = 5000  # Consider the top 5000 words
max_len = 200  # Pad or truncate reviews to 200 words
X, tokenizer = preprocess_text(reviews, max_words=max_words, max_len=max_len)


In [8]:
# Encode Sentiments (positive -> 1, negative -> 0)
y = encode_labels(sentiments)
# Split into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [9]:
# 2. Define and Train the Bidirectional LSTM Model
bidirectional_model = Sequential()
bidirectional_model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))  # Modify 'output_dim'
bidirectional_model.add(Bidirectional(LSTM(units=64, return_sequences=True)))  # Experiment with 'units'
bidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
bidirectional_model.add(Bidirectional(LSTM(units=64)))  # Experiment with 'units'
bidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
bidirectional_model.add(Dense(1, activation='sigmoid'))
bidirectional_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])




In [10]:
# Train the Bidirectional LSTM model
bidirectional_history = bidirectional_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)  # Adjust 'epochs' and 'batch_size'


Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 572s 452ms/step - accuracy: 0.7575 - loss: 0.4819 - val_accuracy: 0.8489 - val_loss: 0.3426
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 616s 448ms/step - accuracy: 0.8859 - loss: 0.2839 - val_accuracy: 0.8859 - val_loss: 0.2809
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 562s 447ms/step - accuracy: 0.9146 - loss: 0.2236 - val_accuracy: 0.8798 - val_loss: 0.2902
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 567s 451ms/step - accuracy: 0.9357 - loss: 0.1783 - val_accuracy: 0.8872 - val_loss: 0.3083
Epoch 5/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 614s 445ms/step - accuracy: 0.9467 - loss: 0.1489 - val_accuracy: 0.8826 - val_loss: 0.3150
Epoch 6/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 564s 447ms/step - accuracy: 0.9552 - loss: 0.1276 - val_accuracy: 0.8839 - val_loss:

In [11]:
# 3. Define and Train the Unidirectional LSTM Model
unidirectional_model = Sequential()
unidirectional_model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))  # Modify 'output_dim'
unidirectional_model.add(LSTM(units=64, return_sequences=True))  # Experiment with 'units'
unidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
unidirectional_model.add(LSTM(units=64))  # Experiment with 'units'
unidirectional_model.add(Dropout(0.5))  # Add Dropout for regularization
unidirectional_model.add(Dense(1, activation='sigmoid'))
unidirectional_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
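
The size difference between the two architectures can be worked out by hand. The sketch below applies the standard parameter-count formulas for Embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and Dense layers to the hyperparameters used in the notebook. It assumes Keras defaults, in particular that Bidirectional concatenates the forward and backward outputs, and it is an arithmetic check rather than output from model.summary().

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

max_words, embed_dim, units = 5000, 128, 64

# Unidirectional: Embedding -> LSTM(64) -> LSTM(64) -> Dense(1); Dropout has no weights
uni_total = (
    embedding_params(max_words, embed_dim)
    + lstm_params(embed_dim, units)
    + lstm_params(units, units)
    + dense_params(units, 1)
)

# Bidirectional: each wrapper holds two LSTMs and outputs 2 * units features
bi_total = (
    embedding_params(max_words, embed_dim)
    + 2 * lstm_params(embed_dim, units)
    + 2 * lstm_params(2 * units, units)
    + dense_params(2 * units, 1)
)

print(f"Unidirectional: {uni_total:,} parameters")  # 722,497
print(f"Bidirectional:  {bi_total:,} parameters")   # 837,761
```

In both models the embedding layer (640,000 weights) dominates; the bidirectional variant roughly doubles the recurrent weights, which is consistent with its longer time per training step.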


In [12]:
# Train the Unidirectional LSTM model
unidirectional_history = unidirectional_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test), verbose=1)  # Adjust 'epochs' and 'batch_size'


Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 273s 215ms/step - accuracy: 0.7649 - loss: 0.4783 - val_accuracy: 0.8647 - val_loss: 0.3333
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 318s 212ms/step - accuracy: 0.8890 - loss: 0.2801 - val_accuracy: 0.8895 - val_loss: 0.2733
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 322s 212ms/step - accuracy: 0.9153 - loss: 0.2236 - val_accuracy: 0.8855 - val_loss: 0.3016
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 266s 213ms/step - accuracy: 0.9263 - loss: 0.1975 - val_accuracy: 0.8871 - val_loss: 0.2842
Epoch 5/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 266s 213ms/step - accuracy: 0.9382 - loss: 0.1684 - val_accuracy: 0.8765 - val_loss: 0.3078
Epoch 6/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 265s 212ms/step - accuracy: 0.9504 - loss: 0.1384 - val_accuracy: 0.8818 - val_loss:

In [None]:
# 4. Evaluate the Bidirectional LSTM Model
y_pred_bidirectional = (bidirectional_model.predict(X_test) > 0.5).astype("int32")


In [None]:
# Calculate Accuracy and F1-Score for Bidirectional LSTM
accuracy_bidirectional = accuracy_score(y_test, y_pred_bidirectional)
f1_bidirectional = f1_score(y_test, y_pred_bidirectional)
print(f'Bidirectional LSTM - Accuracy: {accuracy_bidirectional:.4f}')
print(f'Bidirectional LSTM - F1-Score: {f1_bidirectional:.4f}')
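
For reference, this is how accuracy and F1-score follow from the confusion counts. The sketch below computes them by hand on a small set of made-up labels; binary_metrics is a hypothetical helper written for this illustration, and the scikit-learn functions used in the notebook remain the appropriate choice for the real evaluation.

```python
def binary_metrics(y_true, y_pred):
    # Confusion counts for the positive class (label 1)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 2 true negatives, 2 false positives, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.625, precision 0.6, and recall 0.75, which gives an F1-score of about 0.667. Accuracy and F1 can therefore differ even on the same predictions, which is why the task reports both.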


In [15]:
# 5. Evaluate the Unidirectional LSTM Model
y_pred_unidirectional = (unidirectional_model.predict(X_test) > 0.5).astype("int32")


313/313 ━━━━━━━━━━━━━━━━━━━━ 20s 62ms/step


In [16]:

# Calculate Accuracy and F1-Score for Unidirectional LSTM
accuracy_unidirectional = accuracy_score(y_test, y_pred_unidirectional)
f1_unidirectional = f1_score(y_test, y_pred_unidirectional)
print(f'Unidirectional LSTM - Accuracy: {accuracy_unidirectional:.4f}')
print(f'Unidirectional LSTM - F1-Score: {f1_unidirectional:.4f}')


Unidirectional LSTM - Accuracy: 0.8840
Unidirectional LSTM - F1-Score: 0.8848


### Model Performance Comparison

**Bidirectional LSTM:**
- **Accuracy:** 0.5132
- **F1-Score:** 0.3564

**Unidirectional LSTM:**
- **Accuracy:** 0.8840
- **F1-Score:** 0.8848


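1. Compare the performance of the bidirectional LSTM with a unidirectional LSTM using the same dataset.

  On the reported test metrics, the unidirectional LSTM scores far higher: accuracy 0.8840 and F1-score 0.8848, against 0.5132 and 0.3564 for the bidirectional LSTM. The bidirectional figures should be treated with caution, however. Both models were validated on the same test split during training (validation_data=(X_test, y_test)), and the bidirectional model logged a validation accuracy of 0.8839 at epoch 6, almost identical to the unidirectional model's 0.8818. An accuracy of 0.5132 is close to chance for a balanced binary task and does not match that log, and the bidirectional evaluation cells (In [None]) show no recorded output. These figures were therefore probably not computed from the trained model, and the bidirectional evaluation should be re-run before drawing conclusions.

2. Analyze the impact of each architecture on model accuracy and F1-score.

  For the unidirectional LSTM, accuracy (0.8840) and F1-score (0.8848) are both high; since F1 is the harmonic mean of precision and recall, neither of the two can be far below that level. Judging by the training logs, adding bidirectionality did not improve validation accuracy (peak 0.8872 versus 0.8895 for the unidirectional model), while each training step took roughly twice as long (about 450 ms versus 212 ms). Both models also show signs of overfitting: training accuracy keeps rising after epoch 2, while validation loss reaches its lowest logged value at epoch 2 and does not improve afterwards.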
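
The training logs above give a second view of the two architectures. The short script below is a sketch that uses the validation values copied from those logs: six epochs of val_accuracy, but only five of val_loss, because the epoch 6 val_loss value is cut off in the recorded output. The helper best_epoch is a hypothetical name introduced for this illustration.

```python
# Validation metrics copied from the training logs above
logs = {
    "bidirectional": {
        "val_accuracy": [0.8489, 0.8859, 0.8798, 0.8872, 0.8826, 0.8839],
        "val_loss": [0.3426, 0.2809, 0.2902, 0.3083, 0.3150],
    },
    "unidirectional": {
        "val_accuracy": [0.8647, 0.8895, 0.8855, 0.8871, 0.8765, 0.8818],
        "val_loss": [0.3333, 0.2733, 0.3016, 0.2842, 0.3078],
    },
}

def best_epoch(values, mode):
    # Return (1-based epoch, value) for the lowest ('min') or highest ('max') entry
    pick = min if mode == "min" else max
    idx = pick(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

summary = {}
for name, metrics in logs.items():
    summary[name] = {
        "best_val_loss": best_epoch(metrics["val_loss"], "min"),
        "best_val_accuracy": best_epoch(metrics["val_accuracy"], "max"),
    }
    print(name, summary[name])
```

Both models reach their lowest logged validation loss at epoch 2 (0.2809 and 0.2733), and their best validation accuracies differ by less than half a percentage point (0.8872 versus 0.8895).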