NLP Challenge: Sentiment analysis on the IMDB Dataset of 50K Movie Reviews
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Imports

In [3]:
!pip install tensorflow
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, LearningRateScheduler, ModelCheckpoint, ReduceLROnPlateau
import numpy as np
import pandas as pd
import io



Load data set

In [10]:
data_file = './imdb50kreviews/IMDB Dataset.csv'
df = pd.read_csv(data_file)
df

,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


Preprocess data

In [13]:
print(df.columns)

Index(['review', 'sentiment'], dtype='object')


In [15]:
#Create the dataset, encoding labels as 0 for negative and 1 for positive
sentences = df['review'].to_numpy()
labels = df['sentiment'].apply(lambda x: 0 if x == 'negative' else 1).to_numpy()
dataset = tf.data.Dataset.from_tensor_slices((sentences, labels))

#Take a look at the first 5 examples
examples = list(dataset.take(5))

print(f"dataset contains {len(dataset)} examples\n")

print(f"Text of the second example looks like this: {examples[1][0].numpy().decode('utf-8')}\n")
print(f"Labels of the first 5 examples look like this: {[x[1].numpy() for x in examples]}")

dataset contains 50000 examples

Text of the second example looks like this: A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat wit
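
The lambda used for the labels maps every value other than 'negative' to 1, so a typo or a missing value in the sentiment column would silently be counted as positive. A minimal sketch of a stricter encoding, assuming the column only ever holds the strings 'positive' and 'negative'. LABEL_MAP and encode_label are hypothetical names, not part of the notebook.

```python
# Hypothetical helper (not part of the original notebook): an explicit
# mapping that fails loudly on unexpected sentiment values.
LABEL_MAP = {"negative": 0, "positive": 1}

def encode_label(sentiment):
    """Return 0 for 'negative' and 1 for 'positive'; raise on anything else."""
    try:
        return LABEL_MAP[sentiment]
    except KeyError:
        raise ValueError(f"unexpected sentiment value: {sentiment!r}") from None

print([encode_label(s) for s in ["positive", "negative", "positive"]])
```

With pandas, df['sentiment'].map(LABEL_MAP) gives the same encoding and leaves NaN for any value that is not in the mapping, which makes bad rows easy to spot.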

Split into training and validation sets

In [51]:
TRAINING_SPLIT = 0.9

In [53]:
train_size = int(len(dataset) * TRAINING_SPLIT)

train_dataset = dataset.take(train_size)
validation_dataset = dataset.skip(train_size)

print(f"There are {len(train_dataset)} elements for training.\n")
print(f"There are {len(validation_dataset)} elements for validation.\n")

There are 45000 elements for training.

There are 5000 elements for validation.
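
Note that take and skip split the data in file order, so the validation set is simply the last 10% of the CSV. If the file happened to be ordered in some way, a seeded shuffle before the split would be safer. The sketch below shows the idea on plain indices; split_indices and the seed value are assumptions for illustration, not something the notebook does.

```python
import random

def split_indices(n_examples, training_split=0.9, seed=42):
    """Shuffle the indices 0..n-1 with a fixed seed, then split them."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # reproducible shuffle
    train_size = int(n_examples * training_split)
    return indices[:train_size], indices[train_size:]

train_idx, val_idx = split_indices(50000)
print(len(train_idx), len(val_idx))
```

The resulting index lists could then be used to select rows of sentences and labels before building the two tf.data datasets.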



Vectorization and padding

In [23]:
MAX_LENGTH = 120

In [55]:
#Create and adapt the vectorizer
vectorizer = tf.keras.layers.TextVectorization(
    standardize='lower_and_strip_punctuation',
    output_sequence_length=MAX_LENGTH,
    output_mode='int'
)

vectorizer.adapt(train_dataset.map(lambda x, y: x))

#Get the vocabulary size after adaptation
vocab_size = len(vectorizer.get_vocabulary())

print(f"Vocabulary contains {vocab_size} words\n")

Vocabulary contains 172364 words
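
To make the vectorizer settings more concrete, here is a rough plain-Python approximation of what they do: lowercase the text, strip punctuation, split on whitespace, assign integer ids by descending frequency (0 reserved for padding, 1 for unknown tokens), then truncate or right-pad to a fixed length. This is a sketch under those assumptions, not the Keras implementation, and the function names are made up for illustration.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation, roughly like 'lower_and_strip_punctuation'."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts):
    """Map tokens to ids by descending frequency; 0 = padding, 1 = unknown."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}

def encode(text, vocab, max_length):
    """Convert text to ids, truncate to max_length, right-pad with 0."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:max_length]
    return ids + [0] * (max_length - len(ids))

corpus = ["A wonderful little production. <br /><br />The filming is wonderful!"]
vocab = build_vocab(corpus)
print(standardize(corpus[0]))
print(encode("Wonderful acting, wonderful film", vocab, max_length=6))
```

The first print shows that the HTML tag <br /> survives as the token br once punctuation is stripped, so removing such tags before vectorizing may be worthwhile. The 172364-entry vocabulary also contains many rare tokens; TextVectorization accepts a max_tokens argument that would cap it and shrink the embedding table.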



In [57]:
#Apply vectorization and padding to train and validation datasets

def vectorize_and_pad(x, y):
    #Vectorize the input; output_sequence_length already pads or truncates to MAX_LENGTH
    vectorized = vectorizer(x)
    #Pad up to MAX_LENGTH (a no-op here, kept as a safeguard)
    padded = tf.pad(vectorized, [[0, MAX_LENGTH - tf.shape(vectorized)[0]]], constant_values=0)
    #Pin the static shape so downstream layers see [MAX_LENGTH]
    padded = tf.ensure_shape(padded, [MAX_LENGTH])
    return padded, y

#Apply vectorization and padding
train_dataset_vectorized = train_dataset.map(vectorize_and_pad)
val_dataset_vectorized = validation_dataset.map(vectorize_and_pad)

In [59]:
#View 2 training sequences and their labels
for example in train_dataset_vectorized.take(2):
    print(example)
    print()

(<tf.Tensor: shape=(120,), dtype=int64, numpy=
array([   29,     5,     2,    78,  1948,    45,  1060,    12,   100,
         146,    41,   482,  3199,   397,   457,    27,  3220,    35,
          24,   204,    15,    11,     7,   600,    49,   591,    16,
        2112,    13,     2,    88,   148,    12,  3288,    70,    43,
        3199,    14,    30,  5687,     3, 14712,   135,     5,   593,
          61,   281,     8,   204,    36,     2,   680,   139,  1688,
          70,    11,     7,    22,     4,   119,    17,     2,  8756,
        5821,    40, 11585,    11,   119,  2413,    56,  5961,    16,
        5557,     6,  1465,   384,    40,   593,    30,     7,  3460,
           8,     2,   352,   342,     5,     2, 22149,    13,     9,
           7,   469,  3199,    15,    12,     7,     2, 11311,   344,
           6,     2, 15503,  6852,  2569,  1074, 65347,     9,  2626,
        1386,    21, 25866,   536,    34,  4883,  2469,     5,     2,
        1185,   114,    32], dtype=int64)>,

In [31]:
#Optimize and batch datasets for training
SHUFFLE_BUFFER_SIZE = 1000
PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE
BATCH_SIZE = 128

In [61]:
train_dataset_final = (train_dataset_vectorized
                       .cache()
                       .shuffle(SHUFFLE_BUFFER_SIZE)
                       .batch(BATCH_SIZE)
                       #Prefetch last so whole batches are prepared ahead of the training step
                       .prefetch(PREFETCH_BUFFER_SIZE)
                      )

val_dataset_final = (val_dataset_vectorized
                     .cache()
                     .batch(BATCH_SIZE)
                     .prefetch(PREFETCH_BUFFER_SIZE)
                    )

for batch in train_dataset_final.take(1):
    print("Input shape:", batch[0].shape)
    print("Label shape:", batch[1].shape)

Input shape: (128, 120)
Label shape: (128,)
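
A quick check of how many batches one epoch contains. batch is called without drop_remainder, so the last batch is smaller than the rest. The arithmetic below uses only the numbers already shown in this notebook.

```python
import math

train_examples = 45000   # size of the training split
batch_size = 128         # BATCH_SIZE in the notebook

steps_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch_size)  # 352 72
```

The 352 steps agree with the 352/352 progress counter printed during training.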


LSTM model

In [36]:
EMBEDDING_DIM = 16
LSTM_DIM = 32
DENSE_DIM = 6

In [63]:
model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LENGTH,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=EMBEDDING_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_DIM, kernel_regularizer=tf.keras.regularizers.L2(0.01))),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(DENSE_DIM, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

#Print the model summary
model.summary()
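
The summary output is not reproduced here, but the parameter counts can be worked out by hand. The sketch below assumes the standard Keras parameterization (four gates per LSTM direction, each with input weights, recurrent weights and a bias, and a Bidirectional wrapper that concatenates both directions), so the totals should agree with what model.summary() reports.

```python
vocab_size = 172364      # reported by the vectorizer
embedding_dim = 16       # EMBEDDING_DIM
lstm_dim = 32            # LSTM_DIM
dense_dim = 6            # DENSE_DIM

embedding = vocab_size * embedding_dim
# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * ((embedding_dim + lstm_dim) * lstm_dim + lstm_dim)
bidirectional_lstm = 2 * lstm_one_direction
# Both directions are concatenated, so the Dense layer sees 2 * lstm_dim inputs.
dense_hidden = (2 * lstm_dim) * dense_dim + dense_dim
dense_output = dense_dim * 1 + 1

total = embedding + bidirectional_lstm + dense_hidden + dense_output
print(embedding, bidirectional_lstm, dense_hidden, dense_output, total)
# 2757824 12544 390 7 2770765
```

Almost all of the roughly 2.77 million parameters sit in the embedding table, which is a direct consequence of the uncapped vocabulary.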

In [42]:
NUM_EPOCHS = 20

In [65]:
#Train the model
history = model.fit(train_dataset_final,
                    epochs=NUM_EPOCHS,
                    validation_data=val_dataset_final,
                    callbacks=[ReduceLROnPlateau(monitor='val_loss',
                                                 factor=0.2, verbose=1,
                                                 patience=1, min_lr=0.00001)
                              ])

Epoch 1/20
352/352 ━━━━━━━━━━━━━━━━━━━━ 35s 85ms/step - accuracy: 0.6354 - loss: 0.8503 - val_accuracy: 0.8434 - val_loss: 0.3792 - learning_rate: 0.0010
Epoch 2/20
352/352 ━━━━━━━━━━━━━━━━━━━━ 0s 76ms/step - accuracy: 0.8762 - loss: 0.3119
Epoch 2: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
352/352 ━━━━━━━━━━━━━━━━━━━━ 30s 84ms/step - accuracy: 0.8762 - loss: 0.3118 - val_accuracy: 0.8510 - val_loss: 0.3805 - learning_rate: 0.0010
Epoch 3/20
352/352 ━━━━━━━━━━━━━━━━━━━━ 0s 78ms/step - accuracy: 0.9391 - loss: 0.1759
Epoch 3: ReduceLROnPlateau reducing learning rate to 4.0000001899898055e-05.
352/352 ━━━━━━━━━━━━━━━━━━━━ 30s 86ms/step - accuracy: 0.9391 - loss: 0.1758 - val_accuracy: 0.8580 - val_loss: 0.3799 - learning_rate: 2.0000e-04
Epoch 4/20
352/352 ━━━━━━━━━━━━━━━━━━━━
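
The learning-rate drops in the log can be reproduced with a simplified model of the plateau rule: with patience=1, any epoch whose val_loss does not beat the best value so far by at least min_delta triggers a multiplication by factor, down to min_lr. The function below is a plain-Python sketch of that rule, not the Keras callback itself. It ignores cooldown and assumes the Keras defaults of min_delta=1e-4 and an initial Adam learning rate of 0.001.

```python
def simulate_reduce_lr(val_losses, initial_lr=0.001, factor=0.2,
                       patience=1, min_lr=0.00001, min_delta=1e-4):
    """Return the learning rate in effect after each epoch (simplified, no cooldown)."""
    lr, best, wait = initial_lr, float("inf"), 0
    schedule = []
    for loss in val_losses:
        if loss < best - min_delta:      # counted as an improvement
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# val_loss values from epochs 1-3 of the log
print(simulate_reduce_lr([0.3792, 0.3805, 0.3799]))
```

This yields approximately 0.001, 0.0002 and 0.00004, matching the two reductions announced in the log. The learning_rate column in each log line appears to show the rate used during that epoch, which is why the new value only shows up one epoch later. Since validation loss already stalls at epoch 2 while training accuracy keeps rising, the model starts overfitting early.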

Test the Predictions

In [67]:
#Save the model
model.save('sentiment_model.keras')

def predict_sentiment(sentence):
    #Preprocess the sentence
    vectorized_sentence = vectorizer([sentence])
    #Make prediction
    prediction = model.predict(vectorized_sentence)
    #Interpret the result
    sentiment = "Positive" if prediction[0][0] > 0.5 else "Negative"
    confidence = prediction[0][0] if sentiment == "Positive" else 1-prediction[0][0]
    return sentiment, confidence

#Test the function
test_sentence = "This movie was absolutely fantastic! I loved every minute of it."
sentiment, confidence = predict_sentiment(test_sentence)
print(f"Sentence: {test_sentence}")
print(f"Predicted sentiment: {sentiment}")
print(f"Confidence: {confidence:.2f}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 412ms/step
Sentence: This movie was absolutely fantastic! I loved every minute of it.
Predicted sentiment: Positive
Confidence: 0.89


In [69]:
#Test more sentences
test_sentences = [
    "I hated this film, it was a complete waste of time.",
    "The actors were marvelous, but the plot was a disaster.",
    "A masterpiece! One of the most memorable film I've ever seen.",
    "The movie was well-received by critics but I didn't find it very interesting.",
    "It was okay, I didn't think much about. I forgot about it pretty quickly."
]

for sentence in test_sentences:
    sentiment, confidence = predict_sentiment(sentence)
    print(f"Sentence: {sentence}")
    print(f"Predicted sentiment: {sentiment}")
    print(f"Confidence: {confidence:.2f}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step
Sentence: I hated this film, it was a complete waste of time.
Predicted sentiment: Negative
Confidence: 0.91
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
Sentence: The actors were marvelous, but the plot was a disaster.
Predicted sentiment: Positive
Confidence: 0.59
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
Sentence: A masterpiece! One of the most memorable film I've ever seen.
Predicted sentiment: Positive
Confidence: 0.94
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
Sentence: The movie was well-received by critics but I didn't find it very interesting.
Predicted sentiment: Positive
Confidence: 0.79
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
Sentence: It was okay, I didn't think much about. I forgot about it pretty quickly.
Predicted sentiment: Positive
Confidence: 0.60
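
Two of the mixed or lukewarm sentences come out as Positive with confidence close to 0.5. One possible way to surface this, sketched below, is to flag predictions that fall within a margin of the decision threshold instead of forcing a label. The interpret function and the margin of 0.15 are hypothetical choices for illustration, and the probabilities are the rounded values implied by the printed results, not fresh model outputs.

```python
def interpret(probability, margin=0.15):
    """Label a sigmoid output, flagging values within `margin` of 0.5 as uncertain."""
    label = "Positive" if probability > 0.5 else "Negative"
    confidence = max(probability, 1 - probability)
    if confidence < 0.5 + margin:
        label = "Uncertain"
    return label, confidence

# Sigmoid outputs implied by the five results (Negative 0.91 means p = 0.09).
for p in [0.09, 0.59, 0.94, 0.79, 0.60]:
    label, confidence = interpret(p)
    print(f"{label}: {confidence:.2f}")
```

With this margin the second and fifth sentences are reported as Uncertain, while the fourth sentence (0.79) is still classified as Positive even though a human reader would likely call it negative.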
