# Problem Statement:

Build a sentiment analysis model using Recurrent Neural Networks (RNNs) to classify movie reviews from the IMDB dataset into positive or negative sentiments.

# Dataset:
The dataset comprises 50,000 movie reviews from IMDB, split into 25,000 for training and 25,000 for testing, each labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indices (integers). Words are indexed by their overall frequency in the dataset, which allows quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
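
Because indices are ranked by frequency, this kind of filtering reduces to a range check on each integer. The following is a minimal sketch of the idea, assuming that filtered-out words are replaced by a single out-of-vocabulary index. The helper name `filter_by_rank` and the default `oov_index=2` are illustrative assumptions, not something the dataset description specifies.

```python
def filter_by_rank(sequence, num_words, skip_top=0, oov_index=2):
    """Keep indices in [skip_top, num_words); replace all others with oov_index."""
    return [w if skip_top <= w < num_words else oov_index for w in sequence]


# A made-up review encoded as frequency-ranked word indices
review = [5, 25, 9999, 10000, 14, 31000]

# "Top 10,000 most common words, but eliminate the top 20 most common words"
filtered = filter_by_rank(review, num_words=10000, skip_top=20)
print(filtered)  # [2, 25, 9999, 2, 2, 2]
```

Indices below 20 (very common words) and indices of 10,000 or more (rare words) both collapse to the same placeholder, so the sequence length is unchanged.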

# Tasks to be Performed:

### Data Preprocessing:

● Load the IMDB dataset, keeping only the top 10,000 most frequently occurring words.

● Pad the sequences so that they all have the same length.

### Model Building:

● Create a Sequential RNN model using TensorFlow and Keras.

● The model should consist of an Embedding layer, a SimpleRNN layer, and a Dense output layer.

● Compile the model, specifying the appropriate optimizer, loss function, and metrics.
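
For a binary task with a single sigmoid output, binary cross-entropy is the usual loss choice. The sketch below is a plain-Python illustration of what that loss measures, not the Keras implementation; the function name, the clipping constant `eps`, and the toy labels are assumptions made for the example.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.9, 0.1])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.1, 0.9])

print(f"{uninformed:.4f}")       # 0.6931
print(f"{confident_right:.4f}")  # 0.1054
print(f"{confident_wrong:.4f}")  # 2.3026
```

A model that always predicts 0.5 scores ln 2 ≈ 0.693, which is a useful reference point when reading the loss values in a training log: a loss above that level means the model is doing worse than an uninformed guess on that data.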

### Training:

● Train the model on the preprocessed movie reviews, using a batch size of 128 and validating on 20% of the training data.

● Run the training for 10 epochs.
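
These settings determine how many batches Keras processes per epoch. The arithmetic below is a small sketch that reproduces the step counts shown in the progress bars of the training and evaluation output (`157/157` and `782/782`). It assumes the default evaluation batch size of 32, since the notebook does not pass `batch_size` to `model.evaluate`.

```python
import math

num_train_reviews = 25000   # size of the IMDB training split
num_test_reviews = 25000    # size of the IMDB test split
validation_split = 0.2
train_batch_size = 128
eval_batch_size = 32        # assumed: Keras default when batch_size is not given

# validation_split holds out the last 20% of the training arrays
num_val = round(num_train_reviews * validation_split)
num_fit = num_train_reviews - num_val

train_steps = math.ceil(num_fit / train_batch_size)
last_batch_size = num_fit - (train_steps - 1) * train_batch_size
eval_steps = math.ceil(num_test_reviews / eval_batch_size)

print(num_fit, num_val)              # 20000 5000
print(train_steps, last_batch_size)  # 157 32
print(eval_steps)                    # 782
```

The final training batch in each epoch holds only 32 reviews because 20,000 is not a multiple of 128.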

### Evaluation:

● Evaluate the model on the test set and report the accuracy. 

### Expected Outcome:
● A trained RNN model that can classify movie reviews into positive or negative sentiments, with an accuracy metric reported at the end of the training process.

# 1. Load Libraries

In [1]:
import os
import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

import warnings
warnings.filterwarnings("ignore")


# 2. Load Dataset

In [2]:
# Parameters
vocab_size = 10000
max_length = 150

# Load IMDB dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

In [3]:
print(f"Training samples: {len(x_train)}")
print(f"Test samples: {len(x_test)}")
print(f"First review length before padding: {len(x_train[0])}")

Training samples: 25000
Test samples: 25000
First review length before padding: 218


### Pad Sequences

In [4]:
x_train = pad_sequences(x_train, maxlen=max_length, padding='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post')
print(f"Shape after padding: {x_train.shape}")

Shape after padding: (25000, 150)
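
To make the effect of the padding call concrete, here is a pure-Python sketch of fixed-length padding for a single sequence. It is an illustration, not the Keras implementation: the helper name `pad_or_truncate` is hypothetical, and its defaults are chosen to match the call above (`padding='post'`, with truncation left at the Keras default of `'pre'`).

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Return a copy of seq with exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops items from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return seq + fill if padding == "post" else fill + seq


short = [11, 12, 13]
long = [21, 22, 23, 24, 25, 26, 27]

print(pad_or_truncate(short, 5))                 # [11, 12, 13, 0, 0]
print(pad_or_truncate(short, 5, padding="pre"))  # [0, 0, 11, 12, 13]
print(pad_or_truncate(long, 5))                  # [23, 24, 25, 26, 27]
```

With these settings, a review longer than 150 tokens keeps its last 150 tokens, and a shorter review gets zeros appended. Since the recurrent layer reads left to right, post-padding means it processes the zeros after the last real word, which may make the final hidden state less informative; `padding='pre'` is a common alternative worth comparing.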


# 3. Build the RNN Model

In [5]:
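model = Sequential([
    tf.keras.Input(shape=(max_length,)),  # fixed-length sequences of word indices
    Embedding(vocab_size, 128),           # 128-dimensional vector per word index
    SimpleRNN(64),                        # final hidden state summarizes the review
    Dense(1, activation='sigmoid')        # probability that the review is positive
])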

# 4. Compile the Model

In [6]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()
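
Once the model is built, the parameter counts that `model.summary()` reports can be checked by hand from the layer sizes. The sketch below does that arithmetic for the three layers defined above; the variable names are chosen for the example.

```python
vocab_size = 10000
embedding_dim = 128
rnn_units = 64

# Embedding: one 128-dimensional vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input weights + recurrent weights + bias for each unit
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense(1): one weight per RNN unit plus one bias
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params)  # 1280000
print(rnn_params)        # 12352
print(dense_params)      # 65
print(total_params)      # 1292417
```

Almost all of the roughly 1.29 million parameters sit in the embedding table, which is worth keeping in mind given that only 20,000 reviews are used for fitting.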

# 5. Training

In [7]:
history = model.fit(x_train, y_train, batch_size=128, epochs=10, validation_split=0.2)

Epoch 1/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 19s 108ms/step - accuracy: 0.5391 - loss: 0.6852 - val_accuracy: 0.5540 - val_loss: 0.6728
Epoch 2/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 17s 108ms/step - accuracy: 0.6977 - loss: 0.5686 - val_accuracy: 0.5786 - val_loss: 0.6686
Epoch 3/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 17s 105ms/step - accuracy: 0.8034 - loss: 0.3963 - val_accuracy: 0.5616 - val_loss: 0.7883
Epoch 4/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 17s 105ms/step - accuracy: 0.8847 - loss: 0.2393 - val_accuracy: 0.5738 - val_loss: 0.8968
Epoch 5/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 16s 103ms/step - accuracy: 0.9235 - loss: 0.1579 - val_accuracy: 0.5760 - val_loss: 0.9998
Epoch 6/10
157/157 ━━━━━━━━━━━━━━━━━━━━ 16s 101ms/step - accuracy: 0.9279 - loss: 0.1501 - val_accuracy: 0.5878 - val_loss: 1.0298
Epoch 7/10

# 6. Evaluation

In [8]:
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

782/782 ━━━━━━━━━━━━━━━━━━━━ 8s 10ms/step - accuracy: 0.6016 - loss: 1.1065
Test Accuracy: 60.16%
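
The reported accuracy comes from thresholding the sigmoid output at 0.5 and comparing the result with the labels. The following is a minimal sketch of that calculation on made-up values; the function name and the toy probabilities are illustrative and are not outputs of the trained model.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)


labels = [1, 0, 1, 1, 0]
probabilities = [0.92, 0.40, 0.35, 0.77, 0.61]  # hypothetical sigmoid outputs

print(binary_accuracy(labels, probabilities))  # 0.6
```

Because the IMDB test set is balanced between positive and negative reviews, always guessing one class would score 50%, so the 60.16% above is only a modest improvement over that baseline.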
