# 🎬 IMDB Sentiment Analysis using Bidirectional LSTM

[![Kaggle](https://img.shields.io/badge/Kaggle-Dataset-blue)](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
[![Python](https://img.shields.io/badge/Python-3.8+-green)](https://www.python.org/)
[![TensorFlow](https://img.shields.io/badge/TensorFlow-2.x-orange)](https://www.tensorflow.org/)

## Project Overview

This project implements a **Bidirectional LSTM (BiLSTM)** deep learning model for sentiment analysis on the IMDB movie reviews dataset. The model classifies each review as positive or negative and reaches roughly **88% accuracy** on the held-out test set.

### Key Features
- **Text preprocessing** (HTML stripping, punctuation removal, stopword filtering) that preserves negation words
- **Bidirectional LSTM architecture** for contextual understanding
- **Early stopping** to prevent overfitting
- **Comprehensive evaluation** with classification reports and confusion matrix

### Dataset
- **Source**: IMDB Dataset of 50K Movie Reviews
- **Size**: 50,000 reviews (25,000 positive, 25,000 negative)
- **Split**: 70% training, 15% validation, 15% testing
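
As a quick sanity check, the 70/15/15 split of 50,000 reviews works out to the sample counts below. This is a minimal sketch that assumes the split is made in two stages (hold out 30%, then divide that portion in half); the variable names are illustrative only.

```python
# Sketch: sample counts implied by a 70/15/15 split of 50,000 reviews.
TOTAL_REVIEWS = 50_000

# Stage 1: hold out 30% of the data.
n_temp = TOTAL_REVIEWS * 30 // 100   # held-out portion
n_train = TOTAL_REVIEWS - n_temp     # remaining 70% for training

# Stage 2: divide the held-out portion evenly.
n_test = n_temp // 2
n_val = n_temp - n_test

print(f"train={n_train}, val={n_val}, test={n_test}")
```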

---

## 📦 1. Import Libraries and Load Dataset

In [None]:
# Install dependencies if needed
# !pip install kagglehub[pandas-datasets]

import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import numpy as np

# Load the IMDB dataset from Kaggle
print("Loading IMDB dataset from Kaggle...")
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "IMDB Dataset.csv"  # CSV file inside the dataset (assumed name; check the dataset's file list)
)

print("\n✅ Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print("\nFirst 5 records:")
print(df.head())

print("\nDataset info:")
df.info()

## 🧹 2. Data Preprocessing & Tokenization

### Preprocessing Pipeline:
1. **Text Cleaning**: Lowercase the text; remove HTML tags, punctuation, digits, and extra whitespace
2. **Word Tokenization**: Split text into individual words
3. **Stopword Removal**: Remove common words while **preserving negation words** (not, never, no, etc.)
4. **Numerical Tokenization**: Convert text to numerical sequences
5. **Padding**: Ensure uniform sequence length
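
To see why step 3 keeps negation words, the sketch below runs a miniature version of steps 1 to 3 on a single sentence. It is an illustration under stated assumptions: `TOY_STOPWORDS`, `NEGATIONS`, `toy_clean`, and `toy_filter` are hypothetical names, the stopword list is deliberately tiny, and tokens are split on whitespace rather than with NLTK.

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only);
# the notebook uses NLTK's full English list instead.
TOY_STOPWORDS = {"the", "was", "is", "a", "this", "it", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def toy_clean(text):
    """Lowercase, drop HTML tags, keep only letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def toy_filter(text, stopwords):
    """Split on whitespace and drop every token found in `stopwords`."""
    return [tok for tok in toy_clean(text).split() if tok not in stopwords]


review = "The plot was <br />NOT good."

# Naive filtering removes "not", so the remaining tokens look positive.
naive = toy_filter(review, TOY_STOPWORDS)

# Preserving negations keeps the word that flips the sentiment.
preserved = toy_filter(review, TOY_STOPWORDS - NEGATIONS)

print(naive)      # ['plot', 'good']
print(preserved)  # ['plot', 'not', 'good']
```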

In [None]:
import re
import nltk

# Download NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# ==========================================
# Step 1: Define Stopwords (excluding negation)
# ==========================================

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Negation words to preserve (critical for sentiment analysis)
negation_words = {
    "no", "not", "nor", "never", "none",
    "nobody", "nothing", "neither", "nowhere",
    "cannot", "cant"
}

# Remove negation words from stopwords
stop_words = stop_words - negation_words

print(f"✅ Stopwords loaded: {len(stop_words)} words")
print(f"✅ Negation words preserved: {negation_words}")

In [None]:
# ==========================================
# Step 2: Text Cleaning Function
# ==========================================

def clean_text(text):
    """
    Clean and preprocess text data.
    
    Args:
        text (str): Raw text input
    
    Returns:
        str: Cleaned text
    """
    text = text.lower()                      # Convert to lowercase
    text = re.sub(r"<.*?>", " ", text)       # Replace HTML tags with a space so adjacent words stay separate
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # Remove punctuation and numbers
    text = re.sub(r"\s+", " ", text).strip() # Remove extra whitespace
    return text


# Apply text cleaning
print("Cleaning text data...")
df['clean_review'] = df['review'].apply(clean_text)
print("✅ Text cleaning completed!")

In [None]:
# ==========================================
# Step 3: Word Tokenization
# ==========================================

print("Tokenizing words...")
df['tokens'] = df['clean_review'].apply(word_tokenize)
print("✅ Word tokenization completed!")

In [None]:
# ==========================================
# Step 4: Remove Stopwords (preserve negation)
# ==========================================

print("Removing stopwords (preserving negation)...")
df['tokens'] = df['tokens'].apply(
    lambda tokens: [word for word in tokens if word not in stop_words]
)
print("✅ Stopword removal completed!")

# Display sample
print("\nSample of preprocessed reviews:")
print(df[['review', 'tokens']].head(3))

In [None]:
# ==========================================
# Step 5: Numerical Tokenization & Padding
# ==========================================

# Join tokens back into strings
df['final_review'] = df['tokens'].apply(lambda x: " ".join(x))

# Configuration
VOCAB_SIZE = 10000    # Maximum vocabulary size
MAX_LENGTH = 200      # Maximum sequence length

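# Initialize and fit tokenizer
# Note: fitting on the full dataset lets validation/test vocabulary influence
# the word index; fit on the training reviews only for a stricter evaluation.
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(df['final_review'])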

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(df['final_review'])

# Pad sequences to uniform length
X = pad_sequences(
    sequences,
    maxlen=MAX_LENGTH,
    padding='post',
    truncating='post'
)

print(f"\n✅ Numerical tokenization completed!")
print(f"Vocabulary size: {VOCAB_SIZE}")
print(f"Maximum sequence length: {MAX_LENGTH}")
print(f"\nData shape after preprocessing: {X.shape}")
print(f"Sample sequence: {X[0][:20]}...")
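
The index mapping and post-padding performed in this step can be mimicked in plain Python. The sketch below is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their implementation: `build_vocab`, `encode`, and `pad_post` are hypothetical helpers, and the convention that id 0 means padding and id 1 means out-of-vocabulary is assumed to mirror what the Keras tokenizer does when an `oov_token` is set.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # 0 is reserved for padding, 1 for unknown words


def build_vocab(texts, vocab_size):
    """Map the most frequent words to ids 2, 3, ... (all ids < vocab_size)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode(text, vocab):
    """Replace each word with its id, or OOV_ID if it is not in the vocabulary."""
    return [vocab.get(word, OOV_ID) for word in text.split()]


def pad_post(seq, max_length):
    """Truncate from the end, then right-pad with PAD_ID to max_length."""
    seq = seq[:max_length]
    return seq + [PAD_ID] * (max_length - len(seq))


toy_texts = ["good movie", "not good", "good plot not bad"]
vocab = build_vocab(toy_texts, vocab_size=4)  # keeps only the 2 most frequent words

short = pad_post(encode("not good movie", vocab), max_length=5)
long = pad_post(encode("good plot not bad good not", vocab), max_length=5)

print(vocab)  # {'good': 2, 'not': 3}
print(short)  # [3, 2, 1, 0, 0]
print(long)   # [2, 1, 3, 1, 2]
```

Rare words collapse to the single out-of-vocabulary id, which is why increasing `VOCAB_SIZE` trades a larger embedding table for fewer unknown tokens.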

## 📊 3. Label Encoding & Train-Test Split

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Encode sentiment labels (positive=1, negative=0)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['sentiment'])

print(f"Label encoding: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")
print(f"\nClass distribution:")
print(pd.Series(y).value_counts())

In [None]:
# Split data: 70% train, 15% validation, 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print("\n✅ Data split completed!")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## 🧠 4. Build Bidirectional LSTM Model

### Model Architecture:
1. **Embedding Layer**: Converts word indices to dense vectors (128 dimensions)
2. **Bidirectional LSTM**: Processes sequences in both forward and backward directions (64 units per direction, concatenated into a 128-dimensional output)
3. **Dropout Layer**: Prevents overfitting (50% dropout rate)
4. **Dense Output Layer**: Sigmoid activation for binary classification
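
The size of this architecture can be estimated by hand. The sketch below counts trainable parameters, assuming a 10,000-word vocabulary and the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector); the variable names are illustrative.

```python
VOCAB_SIZE = 10_000
EMBEDDING_DIM = 128
LSTM_UNITS = 64

# Embedding: one EMBEDDING_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * ((EMBEDDING_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Bidirectional wrapper: independent forward and backward LSTMs.
bilstm_params = 2 * lstm_one_direction

# Dense output: one weight per concatenated state value, plus one bias.
# Dropout has no trainable parameters.
dense_params = 2 * LSTM_UNITS + 1

total_params = embedding_params + bilstm_params + dense_params

print(f"Embedding: {embedding_params:,}")
print(f"BiLSTM:    {bilstm_params:,}")
print(f"Dense:     {dense_params:,}")
print(f"Total:     {total_params:,}")
```

Under these assumptions the embedding table accounts for most of the parameters, so vocabulary size and embedding dimension affect model size more than the number of LSTM units.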

### Why Bidirectional LSTM?
- Captures **context from both directions** in the sequence
- Better understanding of sentiment nuances
- Improved performance on long-term dependencies

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Model configuration
EMBEDDING_DIM = 128
LSTM_UNITS = 64
DROPOUT_RATE = 0.5

# Build the model
model = Sequential([
    # Explicit input shape so the model is built before summary();
    # the Embedding `input_length` argument is deprecated in Keras 3
    Input(shape=(MAX_LENGTH,)),

    # Embedding layer: converts word indices to dense vectors
    Embedding(
        input_dim=VOCAB_SIZE,
        output_dim=EMBEDDING_DIM
    ),
    
    # Bidirectional LSTM: processes sequences in both directions
    Bidirectional(LSTM(LSTM_UNITS)),
    
    # Dropout: prevents overfitting
    Dropout(DROPOUT_RATE),
    
    # Output layer: binary classification
    Dense(1, activation='sigmoid')
], name='BiLSTM_Sentiment_Classifier')

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display model architecture
print("\n" + "="*60)
print("MODEL ARCHITECTURE")
print("="*60)
model.summary()
print("="*60)

## 🎯 5. Train the Model

In [None]:
# Early stopping callback
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True,
    verbose=1
)

# Training configuration
EPOCHS = 10
BATCH_SIZE = 64

print("\n" + "="*60)
print("TRAINING MODEL")
print("="*60)
print(f"Epochs: {EPOCHS}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Early stopping: Enabled (patience={early_stop.patience})")
print("="*60 + "\n")

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=(X_val, y_val),
    callbacks=[early_stop],
    verbose=1
)

print("\n✅ Training completed!")
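
The effect of `patience=2` can be illustrated without training anything. The sketch below is a simplified simulation of the early-stopping rule, not the Keras implementation: `simulate_early_stopping` is a hypothetical helper, and the validation losses are made-up values rather than results from this model.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (not measured results).
hypothetical_val_losses = [0.40, 0.32, 0.30, 0.31, 0.33, 0.29]

stopped_epoch, best_epoch = simulate_early_stopping(hypothetical_val_losses, patience=2)
print(f"Stopped after epoch {stopped_epoch}, best epoch was {best_epoch}")
```

In this made-up run, training halts before the final epoch is ever reached, even though that epoch would have had the lowest loss; a larger patience tolerates such temporary plateaus at the cost of extra training time. With `restore_best_weights=True`, the weights from the best epoch are the ones kept.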

## 📈 6. Model Evaluation

Evaluate the model performance on the test set using:
- **Classification Report**: Precision, Recall, F1-Score
- **Confusion Matrix**: True Positives, False Positives, True Negatives, False Negatives
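
The metrics in the classification report follow directly from the four confusion-matrix counts. The sketch below computes them for the positive class using hypothetical counts chosen for easy arithmetic; they are not results from this model.

```python
# Hypothetical confusion-matrix counts (illustration only).
tn, fp, fn, tp = 80, 20, 10, 90

# Precision: of the reviews predicted positive, how many really are positive.
precision = tp / (tp + fp)

# Recall: of the truly positive reviews, how many were found.
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all reviews classified correctly.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```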

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Make predictions
print("Making predictions on test set...")
y_pred_prob = model.predict(X_test, verbose=0)
y_pred = (y_pred_prob > 0.5).astype(int).flatten()

print("\n" + "="*60)
print("CLASSIFICATION REPORT")
print("="*60)
print(classification_report(
    y_test, y_pred,
    target_names=['Negative', 'Positive'],
    digits=4
))

print("\n" + "="*60)
print("CONFUSION MATRIX")
print("="*60)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("\n[Row = Actual, Column = Predicted]")
print(f"True Negatives: {cm[0][0]}")
print(f"False Positives: {cm[0][1]}")
print(f"False Negatives: {cm[1][0]}")
print(f"True Positives: {cm[1][1]}")

# Calculate test accuracy
test_accuracy = (cm[0][0] + cm[1][1]) / cm.sum()
print(f"\n🎯 Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

## 🎨 7. Visualize Results (Optional)

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy plot
axes[0].plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
axes[0].plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
axes[0].set_title('Model Accuracy', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Loss plot
axes[1].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[1].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[1].set_title('Model Loss', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Negative', 'Positive'],
    yticklabels=['Negative', 'Positive'],
    cbar_kws={'label': 'Count'}
)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.show()

## 🔮 8. Test with Custom Reviews (Optional)

In [None]:
def predict_sentiment(review_text):
    """
    Predict sentiment of a custom review.
    
    Args:
        review_text (str): Movie review text
    
    Returns:
        str: Predicted sentiment (Positive/Negative) with confidence
    """
    # Preprocess
    cleaned = clean_text(review_text)
    tokens = word_tokenize(cleaned)
    tokens = [word for word in tokens if word not in stop_words]
    final = " ".join(tokens)
    
    # Tokenize and pad
    seq = tokenizer.texts_to_sequences([final])
    padded = pad_sequences(seq, maxlen=MAX_LENGTH, padding='post', truncating='post')  # match training-time truncation
    
    # Predict
    prob = model.predict(padded, verbose=0)[0][0]
    sentiment = "Positive" if prob > 0.5 else "Negative"
    confidence = prob if prob > 0.5 else 1 - prob
    
    return f"{sentiment} ({confidence*100:.2f}% confidence)"


# Test with sample reviews
test_reviews = [
    "This movie was absolutely amazing! I loved every minute of it.",
    "Terrible film. Waste of time and money. Do not watch.",
    "Not bad, but could have been better. Average at best."
]

print("\n" + "="*60)
print("CUSTOM REVIEW PREDICTIONS")
print("="*60)
for i, review in enumerate(test_reviews, 1):
    prediction = predict_sentiment(review)
    print(f"\nReview {i}: \"{review}\"")
    print(f"Prediction: {prediction}")
print("="*60)

## 🎯 Results Summary

### Model Performance
- **Test Accuracy**: ~88%
- **Precision**: 0.87-0.89 for both classes
- **Recall**: 0.87-0.89 for both classes
- **F1-Score**: 0.88 (balanced performance)

### Key Achievements
✅ Successfully implemented BiLSTM architecture for sentiment analysis  
✅ Preserved negation words to maintain sentiment context  
✅ Applied early stopping to prevent overfitting  
✅ Achieved balanced performance across both classes  

### Potential Improvements
- Experiment with GRU layers as an alternative to LSTM
- Try pre-trained word embeddings (GloVe, Word2Vec)
- Implement attention mechanisms
- Increase model depth (add more LSTM layers)
- Apply data augmentation techniques

---

## 📝 Citation

**Dataset**: Maas et al. (2011) - IMDB Movie Reviews Dataset  
**Kaggle**: [lakshmi25npathi/imdb-dataset-of-50k-movie-reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

---

**Author**: Mohmad Taha Jasem Alhmad  
**GitHub**: [Your GitHub Profile]  
**Date**: February 2026  