# IMDB movie review sentiment classification using LSTM

This is a binary classification problem with two labels: positive and negative. This example uses an LSTM, but a simple RNN or another variant (such as a GRU) would also work.

## Goal:

We’ll build an LSTM-based model that classifies IMDB movie reviews as positive or negative.

## Concepts Covered:

- Tokenization and numericalization of text

- Padding and batching using PyTorch DataLoader

- Building and training an LSTM-based classifier

- Evaluating accuracy

- Saving and loading trained models

- Making predictions on new reviews

## Load and Explore IMDB Dataset

We'll use the Hugging Face Datasets library for convenient data loading. To install it, run `pip install datasets` in your virtual environment.

[Hugging Face imdb dataset.](https://huggingface.co/datasets/stanfordnlp/imdb)

In [1]:
from datasets import load_dataset

In [2]:
# Load the dataset
dataset = load_dataset("imdb")

# Check the splits
print(dataset)

# View a sample
print(dataset['train'][0])

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and

In [3]:
print(dataset['test'][0])

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

## Split data and Inspect Samples

In [4]:
train_data = dataset['train']
test_data = dataset['test']

print(f"Training samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")

# Example text and label
print("Example review:", train_data[0]['text'])
print("Label (0=negative, 1=positive):", train_data[0]['label'])

Training samples: 25000
Test samples: 25000
Example review: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Re

## Text pre-processing

In [5]:
import re
import torch
from torch.utils.data import DataLoader, Dataset

In [6]:
def clean_text(text):
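    text = text.lower()                        # lowercase the text
    # keep only letters and whitespace (drops punctuation, digits, and symbols);
    # HTML tags are not removed as a unit, so "<br />" leaves the token "br"
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace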
    return text

train_texts = [clean_text(x) for x in train_data['text']]
train_labels = train_data['label']

test_texts = [clean_text(x) for x in test_data['text']]
test_labels = test_data['label']
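
To see what this cleaning step does in practice, the sketch below applies the same `clean_text` to a made-up review string. Because the `<` and `>` characters are deleted while the letters between them are kept, `<br />` tags survive as a `br` token (which is why `br` shows up among the most frequent words). The `clean_text_strip_html` function is a hypothetical variant, not part of this notebook, that drops tags first; switching to it would change the vocabulary and the recorded outputs.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def clean_text_strip_html(text):
    # Hypothetical variant: replace HTML tags with a space before cleaning
    text = re.sub(r"<[^>]+>", " ", text)
    return clean_text(text)

sample = "Great movie!<br /><br />It's GOOD."   # made-up example string
cleaned = clean_text(sample)
cleaned_no_html = clean_text_strip_html(sample)

print(cleaned)          # great moviebr br its good
print(cleaned_no_html)  # great movie its good
```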

## Tokenization and Vocabulary building

We’ll use a simple tokenizer and build a vocabulary of the most frequent words.

In [7]:
from collections import Counter

def tokenize(text):
    return text.split()

# Build vocabulary
counter = Counter()
for text in train_texts:
    counter.update(tokenize(text))

vocab_size = 20000  # keep top 20k words
most_common = counter.most_common(vocab_size - 2)
word_to_idx = {w: i+2 for i, (w, _) in enumerate(most_common)}
word_to_idx["<PAD>"] = 0
word_to_idx["<UNK>"] = 1

idx_to_word = {i: w for w, i in word_to_idx.items()}
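
The same vocabulary logic can be followed by hand on a tiny corpus. The sketch below uses three made-up sentences and a vocabulary size of 4 (two special tokens plus the two most frequent words); the `toy_` names are illustrative only.

```python
from collections import Counter

toy_texts = ["good movie", "bad movie", "good good acting"]  # made-up corpus

toy_counter = Counter()
for text in toy_texts:
    toy_counter.update(text.split())

toy_vocab_size = 4  # <PAD>, <UNK>, and the 2 most frequent words
toy_most_common = toy_counter.most_common(toy_vocab_size - 2)
toy_word_to_idx = {w: i + 2 for i, (w, _) in enumerate(toy_most_common)}
toy_word_to_idx["<PAD>"] = 0
toy_word_to_idx["<UNK>"] = 1

# Words outside the vocabulary fall back to the <UNK> index (1)
encoded = [toy_word_to_idx.get(w, 1) for w in "bad movie".split()]

print(toy_word_to_idx)  # {'good': 2, 'movie': 3, '<PAD>': 0, '<UNK>': 1}
print(encoded)          # [1, 3]
```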

In [9]:
counter.most_common(10)  # Display 10 most common words

[('the', 334760),
 ('and', 162243),
 ('a', 161962),
 ('of', 145332),
 ('to', 135047),
 ('is', 106859),
 ('in', 93038),
 ('it', 77110),
 ('i', 75738),
 ('this', 75196)]

In [10]:
idx_to_word

{2: 'the',
 3: 'and',
 4: 'a',
 5: 'of',
 6: 'to',
 7: 'is',
 8: 'in',
 9: 'it',
 10: 'i',
 11: 'this',
 12: 'that',
 13: 'br',
 14: 'was',
 15: 'as',
 16: 'for',
 17: 'with',
 18: 'movie',
 19: 'but',
 20: 'film',
 21: 'on',
 22: 'not',
 23: 'you',
 24: 'are',
 25: 'his',
 26: 'have',
 27: 'he',
 28: 'be',
 29: 'one',
 30: 'its',
 31: 'at',
 32: 'all',
 33: 'by',
 34: 'an',
 35: 'they',
 36: 'from',
 37: 'who',
 38: 'so',
 39: 'like',
 40: 'her',
 41: 'just',
 42: 'or',
 43: 'about',
 44: 'has',
 45: 'if',
 46: 'out',
 47: 'some',
 48: 'there',
 49: 'what',
 50: 'good',
 51: 'when',
 52: 'more',
 53: 'very',
 54: 'even',
 55: 'she',
 56: 'my',
 57: 'no',
 58: 'up',
 59: 'would',
 60: 'which',
 61: 'only',
 62: 'time',
 63: 'really',
 64: 'story',
 65: 'their',
 66: 'were',
 67: 'had',
 68: 'see',
 69: 'can',
 70: 'me',
 71: 'than',
 72: 'we',
 73: 'much',
 74: 'well',
 75: 'been',
 76: 'get',
 77: 'will',
 78: 'into',
 79: 'also',
 80: 'because',
 81: 'other',
 82: 'do',
 83: 'people'

## Encode and Pad Sequence

We’ll convert tokens to integers and pad them to a fixed length for batching.

In [11]:
def encode(text):
    return [word_to_idx.get(word, 1) for word in tokenize(text)]

max_len = 200  # maximum sequence length (a user-chosen hyperparameter)

def pad_sequence(seq):
    if len(seq) < max_len:
        seq = seq + [0] * (max_len - len(seq))
    else:
        seq = seq[:max_len]
    return seq

X_train = [pad_sequence(encode(t)) for t in train_texts]
X_test = [pad_sequence(encode(t)) for t in test_texts]

y_train = torch.tensor(train_labels)
y_test = torch.tensor(test_labels)

X_train = torch.tensor(X_train)
X_test = torch.tensor(X_test)

print(X_train.shape, y_train.shape)


torch.Size([25000, 200]) torch.Size([25000])
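
As a quick illustration of the two branches in `pad_sequence`, here is a small standalone version that takes the length as an argument. The name `pad_or_truncate` and the toy inputs are assumptions made for this sketch.

```python
def pad_or_truncate(seq, max_len, pad_idx=0):
    """Right-pad with pad_idx or cut off the tail so the result has max_len items."""
    if len(seq) < max_len:
        return seq + [pad_idx] * (max_len - len(seq))
    return seq[:max_len]

short = pad_or_truncate([5, 6, 7], max_len=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)

print(short)  # [5, 6, 7, 0, 0]
print(long)   # [1, 2, 3, 4, 5]
```

Note that truncation keeps the beginning of a review and discards the rest, so anything said after the first `max_len` tokens is invisible to the model.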


In [14]:
X_train[50]

tensor([   10,   209,    11,    20,   611,  2324,     8,  3816, 12972,    17,
           34,   311,   178,     5,  8576,     1,  6335, 12558,     3,  3239,
           12,     2,  6177,    64,     5,  2747,  1421,    59,  5290,   156,
           70,   456,   448,    34,  1290,   237,    33,  7095,  8576,     8,
            2,   476,   216,     2,   114,    14,  2468,     1,   145,  5803,
           70,     6,   320,   255,  3708,    48,    66,    47,    81,   919,
          744,     2,  1351,   112,    36,     1,     8,   812,  6246,   460,
           36,  6083,     1,    25,   674,    21,   120,     8,  5128,  1243,
           15,    74,     3,   360,  1576,    12,   495,     4,   146,   234,
            5, 16245,     6,     2,  2344,  3497,    73,    15,     2,  2484,
         8924,     5,     2,   393,   821,  1163,   219,    26,   108,     9,
           19,    10,   176,   328,   528,    12,   154,  1362,    44,    75,
         1020,     6,   359,     2,   146,    64,     5,    34, 

## Create a Dataset and DataLoader

In [18]:
class IMDBDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

train_dataset = IMDBDataset(X_train, y_train)
test_dataset = IMDBDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)
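
To check what a `DataLoader` yields, the sketch below batches random stand-in data of the same per-sample shape. It uses PyTorch's built-in `TensorDataset` in place of `IMDBDataset` (for tensors the two behave the same way), and the sample count of 130 is arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
fake_X = torch.randint(0, 100, (130, 200))   # 130 fake reviews, 200 token ids each
fake_y = torch.randint(0, 2, (130,))         # fake binary labels

loader = DataLoader(TensorDataset(fake_X, fake_y), batch_size=64, shuffle=True)
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]

print(len(loader))     # number of batches per epoch
print(batch_shapes)    # the last batch is smaller when 64 does not divide the dataset size
```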


## Build an LSTM model

The data flows through the model as follows:

Input sequence -> Embedding -> LSTM -> Fully connected layer -> Output

In [15]:
import torch.nn as nn

class SentimentLSTM(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, num_layers=1, dropout=0.0):

        super(SentimentLSTM, self).__init__()
        
        # Creating an embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        # Creating LSTM layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        
        # Optionally, an RNN layer can be used instead of the LSTM
        # self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=num_layers,
        #                   batch_first=True, dropout=dropout)
        
        # Creating a fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        embeds = self.embedding(x)
        # embeds shape: [batch size, sentence length, embedding dim] (batch_first=True)

        lstm_out, (hidden, cell) = self.lstm(embeds)
        # lstm_out shape: [batch size, sentence length, hidden dim]
        # hidden shape: [num_layers, batch size, hidden dim]

        # rnn_out, h = self.rnn(embeds)   # if using RNN layer instead of LSTM

        h = hidden[-1]  # get the last layer's hidden state
        # h dim: [batch size, hidden dim]

        out = self.fc(h)
        return out
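
The tensor shapes inside `forward` can be verified on a small random batch. The sketch below uses arbitrary small sizes (not the notebook's hyperparameters) and also checks that, for a unidirectional LSTM on unpacked input, the last time step of `lstm_out` equals `hidden[-1]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len = 4, 10                      # arbitrary toy sizes
embedding = nn.Embedding(100, 8, padding_idx=0)
lstm = nn.LSTM(8, 16, num_layers=2, batch_first=True)

x = torch.randint(1, 100, (batch_size, seq_len))
embeds = embedding(x)
lstm_out, (hidden, cell) = lstm(embeds)

print(embeds.shape)    # [batch size, sentence length, embedding dim]
print(lstm_out.shape)  # [batch size, sentence length, hidden dim]
print(hidden.shape)    # [num_layers, batch size, hidden dim]

# Top layer's final hidden state matches the output at the last time step
same = torch.allclose(lstm_out[:, -1, :], hidden[-1])
print(same)
```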

Suggested read: [What's the difference between "hidden" and "output" in PyTorch LSTM?](https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm)
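
One consequence of right-padding is worth noting: since `forward` reads `hidden[-1]` after the whole padded sequence, the LSTM has also stepped through the trailing `<PAD>` positions of short reviews. A common alternative, which this notebook does not use, is to pack the batch with `torch.nn.utils.rnn.pack_padded_sequence` so the final state is taken at each review's true last token. The sketch below is a minimal illustration under the assumption that index 0 appears only as padding; the layer sizes and inputs are made up.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embedding = nn.Embedding(50, 8, padding_idx=0)
lstm = nn.LSTM(8, 16, batch_first=True)

# Two right-padded sequences with true lengths 3 and 2
x = torch.tensor([[5, 6, 7, 0, 0],
                  [8, 9, 0, 0, 0]])
# Count non-pad tokens; clamp guards against an all-padding row (length 0 is not allowed)
lengths = (x != 0).sum(dim=1).clamp(min=1)

packed = pack_padded_sequence(embedding(x), lengths.cpu(),
                              batch_first=True, enforce_sorted=False)
_, (h_packed, _) = lstm(packed)

# Reference: run the first sequence alone, without any padding
_, (h_ref, _) = lstm(embedding(x[:1, :3]))

print(h_packed.shape)  # [num_layers, batch size, hidden dim]
print(torch.allclose(h_packed[-1, 0], h_ref[-1, 0], atol=1e-5))
```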

## Initialize Model, Loss, and Optimizer

In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

model = SentimentLSTM(vocab_size=vocab_size, embed_dim=128, hidden_dim=256,
                      output_dim=2, num_layers=2, dropout=0.5).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Using device: cuda
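
`nn.CrossEntropyLoss` takes raw logits of shape `[batch size, num classes]` and integer class labels, which is why the model returns the output of `fc` without a softmax. The sketch below, on made-up logits, shows that it matches the mean negative log-softmax of the correct class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5],    # made-up scores for [negative, positive]
                       [0.1, 1.2]])
labels = torch.tensor([0, 1])         # integer class ids, not one-hot vectors

loss = nn.CrossEntropyLoss()(logits, labels)

# Manual equivalent: log-softmax, pick the true class, negate, average
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual.item())
```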


## Training the Model

In [21]:
def train_model(model, dataloader, optimizer, criterion, epochs=5):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for texts, labels in dataloader:
            texts, labels = texts.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(texts)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {total_loss/len(dataloader):.4f}")

train_model(model, train_loader, optimizer, criterion, epochs=25)

Epoch [1/25], Loss: 0.4049
Epoch [2/25], Loss: 0.4036
Epoch [3/25], Loss: 0.3008
Epoch [4/25], Loss: 0.2111
Epoch [5/25], Loss: 0.1690
Epoch [6/25], Loss: 0.1257
Epoch [7/25], Loss: 0.0930
Epoch [8/25], Loss: 0.0696
Epoch [9/25], Loss: 0.0602
Epoch [10/25], Loss: 0.0474
Epoch [11/25], Loss: 0.0394
Epoch [12/25], Loss: 0.0366
Epoch [13/25], Loss: 0.0329
Epoch [14/25], Loss: 0.0310
Epoch [15/25], Loss: 0.0301
Epoch [16/25], Loss: 0.0254
Epoch [17/25], Loss: 0.0203
Epoch [18/25], Loss: 0.0181
Epoch [19/25], Loss: 0.0153
Epoch [20/25], Loss: 0.0161
Epoch [21/25], Loss: 0.0152
Epoch [22/25], Loss: 0.0125
Epoch [23/25], Loss: 0.0109
Epoch [24/25], Loss: 0.0087
Epoch [25/25], Loss: 0.0078


## Evaluate the model

In [22]:
from sklearn.metrics import accuracy_score

def evaluate(model, dataloader):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for texts, labels in dataloader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    acc = accuracy_score(all_labels, all_preds)
    print(f"Test Accuracy: {acc*100:.2f}%")

evaluate(model, test_loader)

Test Accuracy: 82.12%
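
`accuracy_score` comes from scikit-learn, which has to be installed separately. If that dependency is unwanted, the same number can be computed with PyTorch alone, as in this sketch on made-up logits and labels.

```python
import torch

logits = torch.tensor([[2.0, 1.0],
                       [0.0, 3.0],
                       [1.0, 0.0],
                       [0.2, 0.1]])   # made-up model outputs
labels = torch.tensor([0, 1, 1, 0])   # made-up ground truth

preds = torch.argmax(logits, dim=1)               # predicted class per row
acc = (preds == labels).float().mean().item()     # fraction of correct predictions

print(preds.tolist())            # [0, 1, 0, 0]
print(f"Accuracy: {acc*100:.2f}%")   # Accuracy: 75.00%
```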


## Save and Load model

In [23]:
torch.save(model.state_dict(), "lstm_imdb.pth")
print("Model saved successfully!")

# To load later:
# model.load_state_dict(torch.load("lstm_imdb.pth"))
# model.eval()

Model saved successfully!
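
Loading a `state_dict` requires first constructing a model with the same architecture, then copying the saved weights into it. The round trip below is a sketch on a small stand-in `nn.LSTM` written to a temporary file (the file name is hypothetical); `map_location="cpu"` lets weights saved on a GPU be loaded on a machine without one.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model_a = nn.LSTM(4, 8, batch_first=True)          # stand-in for the trained model
path = os.path.join(tempfile.mkdtemp(), "toy_lstm.pth")
torch.save(model_a.state_dict(), path)

model_b = nn.LSTM(4, 8, batch_first=True)          # fresh model, different random weights
state = torch.load(path, map_location="cpu")
model_b.load_state_dict(state)
model_b.eval()

# After loading, both models should produce identical outputs
x = torch.randn(2, 5, 4)
with torch.no_grad():
    out_a, _ = model_a(x)
    out_b, _ = model_b(x)
match = torch.allclose(out_a, out_b)
print(match)
```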


## Predict on custom reviews

In [24]:
def predict_sentiment(model, text):
    model.eval()
    text = clean_text(text)
    encoded = pad_sequence(encode(text))
    tensor = torch.tensor(encoded).unsqueeze(0).to(device)
    
    with torch.no_grad():
        output = model(tensor)

        # Get probabilities using softmax [negative_prob, positive_prob]
        probability = torch.softmax(output, dim=1).cpu().numpy()[0].tolist()
        pred = "Positive" if probability[1] > 0.5 else "Negative"

    output = {"sentiment": pred, 
              "positive-probability": probability[1],
              "negative-probability": probability[0]}

    return output
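
With two classes, thresholding the positive probability at 0.5 picks the same label as taking the `argmax` of the logits (ties aside), so `predict_sentiment` and `evaluate` agree on the predicted class. A small check on made-up logits:

```python
import torch

logits = torch.tensor([[-1.0, 2.0],
                       [3.0, 0.5]])        # made-up [negative, positive] scores

probs = torch.softmax(logits, dim=1)       # each row sums to 1
by_threshold = (probs[:, 1] > 0.5).long()  # 1 = positive, 0 = negative
by_argmax = torch.argmax(logits, dim=1)

print(probs)
print(by_threshold.tolist(), by_argmax.tolist())
```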

In [25]:
# Try with your own input
sample_review = "The movie was absolutely wonderful, with great acting!"
print(predict_sentiment(model, sample_review))

{'sentiment': 'Positive', 'positive-probability': 0.9998984336853027, 'negative-probability': 0.00010152783943340182}


In [26]:
sample_review = "I really hate this movie. The acting is really bad and sucks!"

print(predict_sentiment(model, sample_review))

{'sentiment': 'Negative', 'positive-probability': 0.003941753879189491, 'negative-probability': 0.9960582256317139}


In [27]:
sample_review = "The movie was okay, not the best I've seen but not the worst either."

print(predict_sentiment(model, sample_review))

{'sentiment': 'Negative', 'positive-probability': 0.002007979666814208, 'negative-probability': 0.9979920387268066}
