# Sentiment Analysis of IMDB Movie Reviews

This notebook demonstrates how to build a recurrent neural network (RNN) for sentiment analysis.

**Use Case:** Classify movie reviews from the IMDB dataset as either **Positive** or **Negative**.

**Model:** We will use a `Sequential` model with an `Embedding` layer and a `Bidirectional LSTM` layer to process the text sequences.

--- 
*Copyright (c) 2026 Shrikara Kaudambady. All rights reserved.*

## 1. Setup and Imports

Import the necessary libraries: TensorFlow/Keras, NumPy, and Matplotlib.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
import matplotlib.pyplot as plt

## 2. Load and Prepare the Dataset

We'll load the IMDB movie reviews dataset directly from `tf.keras.datasets`. The dataset is already preprocessed: words are encoded as integers. We'll limit our vocabulary to the top 10,000 most frequent words.

In [None]:
VOCAB_SIZE = 10000
MAX_LEN = 256
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

print(f"Training entries: {len(train_data)}, labels: {len(train_labels)}")
print(f"Test entries: {len(test_data)}, labels: {len(test_labels)}")
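
To make the integer encoding concrete, here is a minimal sketch of how a tokenized review could be laid out under the default settings of `imdb.load_data` (id 0 for padding, 1 for the start marker, 2 for out-of-vocabulary words, and every real word stored as its frequency rank plus 3). The toy vocabulary and the helper name `encode_like_imdb` are hypothetical and only serve the illustration.

```python
# Toy vocabulary (hypothetical): word -> frequency rank, 1 = most frequent.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4, "dull": 5}

PAD_ID, START_ID, OOV_ID = 0, 1, 2  # ids reserved by the imdb.load_data defaults
INDEX_FROM = 3                      # every real word id is its rank + 3


def encode_like_imdb(words, word_index, num_words):
    """Encode a token list the way the preloaded IMDB sequences are laid out."""
    ids = [START_ID]
    for word in words:
        rank = word_index.get(word)
        word_id = rank + INDEX_FROM if rank is not None else OOV_ID
        # Ids at or above num_words fall outside the kept vocabulary.
        ids.append(word_id if word_id < num_words else OOV_ID)
    return ids


encoded = encode_like_imdb(["the", "movie", "was", "dull"], toy_word_index, num_words=7)
print(encoded)  # [1, 4, 5, 6, 2]
```

With `num_words=7`, the word "dull" (rank 5, id 8) is outside the kept vocabulary and is replaced by the unknown id 2, which mirrors what `num_words=VOCAB_SIZE` does to rare words in the real dataset.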

### Pad Sequences

Since reviews can have different lengths, we need to standardize them. We'll pad shorter reviews with zeros at the end, and truncate longer ones from the end, so that every review has a fixed length of `MAX_LEN`.

In [None]:
train_data = pad_sequences(train_data, maxlen=MAX_LEN, padding='post', truncating='post')
test_data = pad_sequences(test_data, maxlen=MAX_LEN, padding='post', truncating='post')

print("Shape of a padded training sample:", train_data[0].shape)
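
As a rough illustration of what `padding='post'` and `truncating='post'` do to a single review, the following plain-Python sketch reproduces the behavior on short lists. The helper `pad_post` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Cut a sequence to maxlen from the end, then fill the tail with pad_value."""
    trimmed = sequence[:maxlen]
    return trimmed + [pad_value] * (maxlen - len(trimmed))


short_review = [1, 14, 22, 16]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_post(short_review, maxlen=6))  # [1, 14, 22, 16, 0, 0]
print(pad_post(long_review, maxlen=6))   # [1, 14, 22, 16, 43, 530]
```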

## 3. Build the Neural Network

Our model consists of three main parts:
1.  **Embedding Layer**: Maps each integer word index to a dense vector of size `EMBEDDING_DIM`. These vectors are learned as the model trains.
2.  **Bidirectional LSTM Layer**: An LSTM is a type of RNN that carries a hidden state from one time step to the next, so earlier words can influence how later ones are interpreted. 'Bidirectional' means one LSTM reads the review from start to finish and a second reads it from finish to start; their final outputs are concatenated, so each review is summarized using context from both directions.
3.  **Dense Layers**: A 64-unit ReLU layer followed by the final classification layer, a single sigmoid unit whose output between 0 and 1 represents the sentiment prediction.
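
Conceptually, the embedding step described above is a table lookup: row `i` of a weight matrix is the vector for word id `i`. A minimal sketch, assuming a toy table with hypothetical sizes and random placeholder values rather than learned weights:

```python
import random

random.seed(0)
TOY_VOCAB_SIZE, TOY_EMBEDDING_DIM = 6, 4  # hypothetical sizes for illustration

# One row per word id; a real Embedding layer learns these values during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(TOY_EMBEDDING_DIM)]
    for _ in range(TOY_VOCAB_SIZE)
]


def embed(token_ids, table):
    """Replace each integer id with its row from the embedding table."""
    return [table[token_id] for token_id in token_ids]


vectors = embed([1, 4, 5, 0], embedding_table)
print(len(vectors), len(vectors[0]))  # 4 4
```

A review of `MAX_LEN` ids therefore becomes a `MAX_LEN` by `EMBEDDING_DIM` matrix, which is what the LSTM consumes one row at a time.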

In [None]:
EMBEDDING_DIM = 16

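model = Sequential([
    # Declare the input shape up front so model.summary() can report output shapes.
    # (The Embedding `input_length` argument is deprecated in recent Keras releases.)
    tf.keras.Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    Bidirectional(LSTM(64)),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid') # Sigmoid for binary classification (Positive/Negative)
])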

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()
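
As a cross-check on the parameter counts that `model.summary()` reports, the arithmetic can be reproduced by hand. This sketch assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes used above; the variable names are for illustration only.

```python
VOCAB_SIZE, EMBEDDING_DIM = 10000, 16
LSTM_UNITS, DENSE_UNITS = 64, 64

# Embedding: one EMBEDDING_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (LSTM_UNITS * (EMBEDDING_DIM + LSTM_UNITS) + LSTM_UNITS)
bilstm_params = 2 * lstm_params  # forward and backward copies

# The hidden Dense layer sees the concatenated 2 * LSTM_UNITS outputs.
dense_hidden_params = (2 * LSTM_UNITS) * DENSE_UNITS + DENSE_UNITS
output_params = DENSE_UNITS * 1 + 1

total_params = embedding_params + bilstm_params + dense_hidden_params + output_params
print(embedding_params, bilstm_params, dense_hidden_params, output_params)  # 160000 41472 8256 65
print(total_params)  # 209793
```

Most of the parameters sit in the embedding table, so `VOCAB_SIZE` and `EMBEDDING_DIM` are the first knobs to consider when changing model size.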

## 4. Train the Model

We'll now train the model for 10 epochs. Setting `validation_split=0.2` holds out the last 20% of the training data for validation, so we can monitor for overfitting.

In [None]:
history = model.fit(train_data,
                    train_labels,
                    epochs=10,
                    batch_size=BATCH_SIZE,
                    validation_split=0.2)
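
To see what these settings mean in terms of sample counts, here is a small arithmetic sketch. It assumes the 25,000 training reviews of the IMDB dataset; the helper `split_sizes` is hypothetical and only approximates how Keras divides the data.

```python
import math


def split_sizes(n_samples, validation_split, batch_size):
    """Return (training samples, validation samples, batches per epoch)."""
    n_val = round(n_samples * validation_split)
    n_train = n_samples - n_val
    steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch may be partial
    return n_train, n_val, steps_per_epoch


print(split_sizes(25000, 0.2, 64))  # (20000, 5000, 313)
```

So each epoch would make 313 weight updates on 20,000 reviews, and the reported `val_` metrics would come from the remaining 5,000.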

## 5. Evaluate the Model

Let's check the model's performance on the unseen test data and visualize the training process.

In [None]:
# return_dict=True labels each value with its metric name (e.g. 'loss', 'accuracy')
results = model.evaluate(test_data, test_labels, verbose=2, return_dict=True)

for name, value in results.items():
    print(f"{name}: {value:.3f}")

### Plot Training History

In [None]:
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_'+metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])
    plt.show()

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')
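
One way to read the loss plot is to find the epoch where validation loss is lowest; if it keeps rising after that point while training loss keeps falling, the model is likely overfitting. A minimal sketch, where `best_epoch` is a hypothetical helper and the numbers are made-up placeholders rather than results from this model:

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Placeholder values for illustration only; in the notebook, pass
# history.history['val_loss'] instead.
placeholder_val_loss = [0.60, 0.45, 0.40, 0.42, 0.47]
print(best_epoch(placeholder_val_loss))  # 3
```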

## 6. Make Predictions on New Sentences

Finally, let's create a small utility to preprocess new sentences and use our trained model to predict their sentiment.

In [None]:
# Get the word index from the IMDB dataset
word_index = imdb.get_word_index()

# Invert the word index to map integers back to words.
# load_data() shifts every word index by 3, so apply the same offset here.
reverse_word_index = {index + 3: word for word, index in word_index.items()}

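# Create the encoder function
def encode_text(text):
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    # load_data() reserves 0 (padding), 1 (start), and 2 (unknown),
    # so every word index is shifted by 3 to match the training data.
    encoded = [1]  # start-of-sequence token, as in the training reviews
    for word in tokens:
        index = word_index.get(word, -1) + 3
        # Words that are missing or outside the top VOCAB_SIZE map to unknown (2)
        encoded.append(index if 3 <= index < VOCAB_SIZE else 2)
    return pad_sequences([encoded], maxlen=MAX_LEN, padding='post', truncating='post')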

def predict_sentiment(text):
    encoded_text = encode_text(text)
    prediction = model.predict(encoded_text)[0][0]
    sentiment = "Positive" if prediction >= 0.5 else "Negative"
    print(f"Text: '{text}'")
    print(f"Prediction Score: {prediction:.4f}")
    print(f"Sentiment: {sentiment}\n")

# Test with some new sentences
predict_sentiment("this movie was absolutely fantastic, the best I have ever seen!")
predict_sentiment("what a terrible and boring film, I would not recommend it to anyone.")
predict_sentiment("it was an okay movie, not great but not bad either.")
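
To sanity-check an encoder like the one above, it can help to decode the ids back into words and confirm the text survives the round trip. A minimal sketch using a toy vocabulary; `build_decoder`, `decode`, and the placeholder strings for the reserved ids are all hypothetical names, and the offset of 3 assumes the `imdb.load_data` defaults.

```python
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}  # hypothetical ranks
INDEX_FROM = 3  # offset applied to every word rank in the encoded data
RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}


def build_decoder(word_index):
    """Map encoded ids (rank + offset) back to words, including reserved ids."""
    decoder = dict(RESERVED)
    decoder.update({rank + INDEX_FROM: word for word, rank in word_index.items()})
    return decoder


def decode(ids, decoder):
    """Turn a list of ids back into a readable string."""
    return " ".join(decoder.get(token_id, "<unk>") for token_id in ids)


decoder = build_decoder(toy_word_index)
print(decode([1, 4, 5, 6, 2, 0, 0], decoder))  # <start> the movie was <unk> <pad> <pad>
```

If decoded text from a training review comes out scrambled, the offset is usually the first thing to check.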