# **Sentiment Analysis of IMDB Movie Reviews**

</br>

**Dataset**
</br>

The IMDb Dataset of 50K Movie Reviews is a popular dataset for sentiment analysis and other natural language processing tasks. It consists of 50,000 movie reviews: 25,000 labeled positive and 25,000 labeled negative.
</br>

Dataset Source: [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?datasetId=134715&searchQuery=pytor)
</br>

**The Problem Statement**
</br>

Predict whether each review is positive or negative, based on its sentiment, using deep learning techniques.

**To approach this problem, we followed the outline below:**

- **Data preprocessing:** applied in the notebook called _"Data_preprocessing_notebook"_
</br>

- **Word embedding:** We convert the preprocessed text into a numerical representation that deep learning models can process. Word embeddings, such as Word2Vec or GloVe, represent words as dense vectors in a continuous vector space.
</br>

- **Model selection:** Choose a suitable deep learning architecture. Candidates include recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and convolutional neural networks (CNNs).
</br>

- **Model training:** Split the dataset into training and validation sets, then train the model.
</br>
- **Model evaluation**
</br>
- **Model refinement**
</br>

**(Initial) Attributes**:

* Review
* Sentiment
 

## All the imports

In [130]:
import gc
gc.collect()


# import to "ignore" warnings

import warnings
warnings.filterwarnings('ignore')

# imports for data manipulation

import pandas as pd
import numpy as np

# imports for data visualization

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud # must be installed locally


# import PyTorch (framework for building deep learning models) || must be installed locally

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# import Keras via TensorFlow (framework for building deep learning models) || must be installed locally
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, GRU, Conv1D, GlobalMaxPooling1D, Dense, Dropout, LSTM
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping


# imports from sklearn

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

import gensim # need local import
from gensim.models import Word2Vec
import random
import nltk
from nltk import word_tokenize


## Load the csv file 

In [131]:
# read data

data = pd.read_csv('imdb_clean_dataset.csv')
data.head()

| | review | sentiment |
|---|---|---|
| 0 | one review mention watch oz episod hook right ... | 1 |
| 1 | wonder littl product film techniqu unassum old... | 1 |
| 2 | thought wonder way spend time hot summer weeke... | 1 |
| 3 | basic famili littl boy jake think zombi closet... | 0 |
| 4 | petter mattei love time money visual stun film... | 1 |


## Split Dataset

In [132]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)

print(f'Shape of train data: {X_train.shape}')
print(f'Shape of test data: {X_test.shape}')

Shape of train data: (39665,)
Shape of test data: (9917,)
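
The split above shuffles the reviews without checking that both parts keep the same share of positive and negative labels. The following is a minimal sketch of a stratified split, written in plain Python so it runs without scikit-learn. `stratified_split` and the toy data are hypothetical and not part of the notebook. With scikit-learn, passing `stratify=data['sentiment']` to `train_test_split` serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split texts/labels so each class keeps roughly the same share in both parts.

    Hypothetical helper for illustration; not part of the notebook.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are interleaved rather than grouped.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    def pick(idx):
        return [texts[i] for i in idx], [labels[i] for i in idx]

    train_texts, train_labels = pick(train_idx)
    test_texts, test_labels = pick(test_idx)
    return train_texts, test_texts, train_labels, test_labels


# Toy data standing in for the cleaned reviews (made up for this example).
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1, 0] * 5
X_tr, X_te, y_tr, y_te = stratified_split(toy_texts, toy_labels)
print(len(X_tr), len(X_te), sorted(y_te))
```

Each class contributes the same fraction of its reviews to the test part, so the label balance of the full dataset carries over to both splits.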


In [133]:
# Tokenize the reviews
X_train_tokenized = [word_tokenize(review) for review in X_train]
X_test_tokenized = [word_tokenize(review) for review in X_test]

print('--------------X_train_tokenized: \n')
print(X_train_tokenized[:1])
print('\n')
print('--------------X_test_tokenized: \n')
print(X_test_tokenized[:1])

--------------X_train_tokenized: 

[['realli', 'like', 'movi', 'empor', 'new', 'groov', 'watch', 'like', 'come', 'home', 'see', 'wife', 'relat', 'llama', 'serious', 'movi', 'bad', 'like', 'club', 'dread', 'super', 'trooper', 'suppos', 'write', 'line', 'even', 'know', 'els', 'say', 'laugh', 'coupl', 'time', 'drink', 'movi', 'like', 'least', 'funni', 'drunk', 'mayb', 'llama', 'funni', 'regular', 'cartoon', 'peopl', 'either', 'way', 'stick', 'empor', 'new', 'groov', 'want', 'funni', 'cartoon', 'llama', 'theme', 'movi', 'line', 'line', 'right']]


--------------X_test_tokenized: 

[['soul', 'plane', 'horribl', 'attempt', 'comedi', 'appeal', 'peopl', 'thick', 'skull', 'bloodshot', 'eye', 'furri', 'pawn', 'plot', 'incoher', 'also', 'non', 'exist', 'act', 'mostli', 'sub', 'sub', 'par', 'gang', 'highli', 'moron', 'dread', 'charact', 'thrown', 'bad', 'measur', 'joke', 'often', 'spot', 'mile', 'ahead', 'almost', 'never', 'even', 'bit', 'amus', 'movi', 'lack', 'structur', 'full', 'racial', 'stere

## Word embedding using Word2Vec model

In [134]:
# Train Word2Vec on the tokenized training reviews (X_train_tokenized) only,
# so that no information from the test set leaks into the embeddings.
# The test set is treated as unseen data, which gives a fairer estimate of
# the model's performance on new reviews.
# Embeddings for the test reviews are later looked up in this trained model.

model = Word2Vec(sentences=X_train_tokenized, vector_size=100, window=5, min_count=1, workers=4)

In [135]:
# Get the vocabulary size
vocab_size = len(model.wv)
print(f"vocab_size: {vocab_size}")

# Get the dimensionality of each word vector (despite the variable name, this is not an average)
avg_vector_size = model.vector_size
print(f"avg_vector_size: {avg_vector_size}")

# Get the total number of reviews in the training set
num_reviews = len(X_train_tokenized)
print(f"num_reviews: {num_reviews}")

# Get the maximum number of words in a review
max_review_length = max(len(review) for review in X_train_tokenized)
print(f"max_review_length: {max_review_length}")


vocab_size: 63780
avg_vector_size: 100
num_reviews: 39665
max_review_length: 1135


In [136]:
# Generate word embeddings for training data
X_train_word_embeddings = []
for review in X_train_tokenized:
    review_embedding = []
    for word in review:
        if word in model.wv:  # Check if the word has a word vector in the Word2Vec model's vocabulary
            word_embedding = model.wv[word]  # Retrieve the word vector for the word
            review_embedding.append(word_embedding)  # Add the word vector to the review_embedding list
    if review_embedding: #check if the review_embedding list is not empty.
        review_embedding_avg = sum(review_embedding) / len(review_embedding)  # Calculate the average embedding
        X_train_word_embeddings.append(review_embedding_avg)  # Append the average embedding to X_train_word_embeddings
    else:
        X_train_word_embeddings.append([])  # Append an empty list if no word vectors were found for the review

# Generate word embeddings for testing data
X_test_word_embeddings = []
for review in X_test_tokenized:
    review_embedding = []
    for word in review:
        if word in model.wv:  # Check if the word has a word vector in the Word2Vec model's vocabulary
            word_embedding = model.wv[word]  # Retrieve the word vector for the word
            review_embedding.append(word_embedding)  # Add the word vector to the review_embedding list
    if review_embedding: #check if the review_embedding list is not empty.
        review_embedding_avg = sum(review_embedding) / len(review_embedding)  # Calculate the average embedding
        X_test_word_embeddings.append(review_embedding_avg)  # Append the average embedding to X_test_word_embeddings
    else:
        X_test_word_embeddings.append([])  # Append an empty list if no word vectors were found for the review
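
The two loops above are identical except for their input, so they can be folded into one function. The sketch below is one possible way to do that, assuming a plain dict as a stand-in for `model.wv`. `average_embedding`, `embed_reviews`, and the toy vectors are hypothetical. Unlike the cell above, it returns a zero vector for a review with no known words, which keeps the features aligned with the labels.

```python
import numpy as np


def average_embedding(tokens, vectors, dim):
    """Return the mean vector of the tokens found in `vectors`.

    `vectors` is any mapping from word to 1-D array; a dict is used here as
    a stand-in for the trained Word2Vec vectors (an assumption).
    Reviews with no known words get a zero vector.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)


def embed_reviews(tokenized_reviews, vectors, dim):
    """Stack one averaged vector per review into an array of shape (n, dim)."""
    return np.stack([average_embedding(r, vectors, dim) for r in tokenized_reviews])


# Toy vocabulary with made-up 3-dimensional vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movi": np.array([3.0, 2.0, 0.0]),
}
toy_reviews = [["good", "movi"], ["unseen"], ["movi", "unseen"]]
features = embed_reviews(toy_reviews, toy_vectors, dim=3)
print(features.shape)
```

Averaging discards word order, so every review becomes a single fixed-length vector regardless of how many tokens it has.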


In [137]:
# Remove empty embeddings (if any) from training data
num_removed_train = 0
X_train_word_embeddings_filtered = []
for embedding in X_train_word_embeddings:
    if len(embedding) > 0:
        X_train_word_embeddings_filtered.append(embedding)
    else:
        num_removed_train += 1

X_train_word_embeddings = X_train_word_embeddings_filtered

# Remove empty embeddings (if any) from testing data
num_removed_test = 0
X_test_word_embeddings_filtered = []
for embedding in X_test_word_embeddings:
    if len(embedding) > 0:
        X_test_word_embeddings_filtered.append(embedding)
    else:
        num_removed_test += 1

X_test_word_embeddings = X_test_word_embeddings_filtered

# Print the number of removed embeddings
print("Number of removed embeddings (training data):", num_removed_train)
print("Number of removed embeddings (testing data):", num_removed_test)
print('\n')



print('--------------X_train_word_embeddings: \n')
print(X_train_word_embeddings[:1])

Number of removed embeddings (training data): 0
Number of removed embeddings (testing data): 0


--------------X_train_word_embeddings: 

[array([ 0.11843939,  0.40737468, -0.31806594, -0.413338  , -0.32696098,
       -0.81214046,  0.42219087,  0.4818255 , -0.01088629, -0.4130508 ,
        0.94538206,  0.373542  ,  0.7279831 , -0.53524673, -0.4170448 ,
       -0.02434385,  0.3920842 , -0.5743434 , -0.15548031, -0.2712898 ,
        0.14474407, -0.51100916,  0.01840197, -0.1433615 , -0.7358261 ,
       -0.06844264,  0.01321736,  0.34234086, -0.06639291,  0.01833295,
        0.4464348 , -0.51792115,  0.46142694, -0.50904745,  0.35335052,
        1.0143116 ,  0.56808984, -0.07859233, -0.46526942, -0.51201195,
       -0.03492953,  0.5546405 ,  0.28440964,  0.0626305 ,  1.1908377 ,
        0.32133946,  0.3726599 ,  0.530576  , -0.09281716, -0.19503234,
        0.08419336,  0.4327711 , -0.407829  ,  0.10671559, -0.33138898,
       -0.38468325,  0.26527426,  0.9788681 , -0.55170894, -0.6697967
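
No reviews were removed in this run, so the features still line up with `y_train` and `y_test`. If any review had been dropped, its label would have to be dropped too. A minimal sketch of filtering both together, using a hypothetical helper `drop_empty` and made-up data:

```python
def drop_empty(embeddings, labels):
    """Remove reviews whose embedding is empty, along with their labels.

    Hypothetical helper for illustration. Returns the kept embeddings,
    the kept labels, and the number of removed reviews.
    """
    kept_embeddings, kept_labels = [], []
    for embedding, label in zip(embeddings, labels):
        if len(embedding) > 0:
            kept_embeddings.append(embedding)
            kept_labels.append(label)
    removed = len(embeddings) - len(kept_embeddings)
    return kept_embeddings, kept_labels, removed


# Made-up data: the second review has no embedding.
toy_embeddings = [[0.1, 0.2], [], [0.3, 0.4]]
toy_labels = [1, 0, 1]
kept_x, kept_y, n_removed = drop_empty(toy_embeddings, toy_labels)
print(kept_x, kept_y, n_removed)
```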

In [138]:
print('--------------X_test_word_embeddings: \n')
print(X_test_word_embeddings[:1])


--------------X_test_word_embeddings: 

[array([ 0.26774868,  0.45513487, -0.1283196 , -0.6166905 , -0.4742356 ,
       -0.45482776,  0.3290812 ,  0.32271448, -0.3311331 , -0.43133664,
        0.6026132 ,  0.36143342,  0.5340214 , -0.08719439, -0.35256717,
       -0.0625425 ,  0.287423  , -0.36039275, -0.36010265, -0.43359667,
        0.18733892, -0.13334942,  0.03829688, -0.32266244, -0.910652  ,
        0.2164441 , -0.10206286,  0.03575484, -0.02808089,  0.10100013,
        0.21810633, -0.36831638,  0.35649237, -0.78378   ,  0.19911186,
        0.6383603 ,  0.37079847, -0.10995406, -0.32926652, -0.30819955,
        0.19594997,  0.15089722, -0.08151383, -0.03778863,  0.885269  ,
        0.09953274,  0.26436773,  0.05248537, -0.04592742, -0.14945155,
        0.24132542,  0.21955483, -0.12329152,  0.16020009, -0.60332793,
       -0.42134184,  0.35439676,  0.78711534, -0.44186175, -0.6303511 ,
       -0.23340324, -0.5009944 ,  0.48402983,  0.15878691, -0.25607815,
        0.71669984,  0.

## Pad sequences to ensure equal length

In [139]:
# Convert word embeddings to numpy arrays
X_train_word_embeddings = np.array(X_train_word_embeddings)
X_test_word_embeddings = np.array(X_test_word_embeddings)

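# Pad sequences to ensure equal length.
# Note: each review is already a single averaged 100-dimensional vector, and
# X_train_padded / X_test_padded are not used by the models below.
max_sequence_length = max_review_length  # Use the maximum length of a review as the sequence length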

X_train_padded = tf.keras.preprocessing.sequence.pad_sequences(
    X_train_word_embeddings,
    maxlen=max_sequence_length,
    dtype='float32',
    padding='post',
    truncating='post'
)

X_test_padded = tf.keras.preprocessing.sequence.pad_sequences(
    X_test_word_embeddings,
    maxlen=max_sequence_length,
    dtype='float32',
    padding='post',
    truncating='post'
)

# Reshape word embeddings to the 3-D input (samples, time steps, features) expected by the LSTM
embedding_size = 100  # Dimensionality of each word vector
max_sequence_length = 100  # Number of time steps fed to the LSTM

def reshape_embeddings(embeddings):
    reshaped_embeddings = np.zeros((len(embeddings), max_sequence_length, embedding_size))
    for i, embedding in enumerate(embeddings):
        # Each embedding is one averaged vector of length embedding_size.
        # Assigning it to a (length, embedding_size) slice broadcasts the same
        # vector across the first `length` time steps.
        length = min(len(embedding), max_sequence_length)
        reshaped_embeddings[i, :length] = embedding[:length]
    return reshaped_embeddings

X_train_reshaped = reshape_embeddings(X_train_word_embeddings)
X_test_reshaped = reshape_embeddings(X_test_word_embeddings)
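
Because each review was averaged into a single vector earlier, the reshape above gives the LSTM the same vector at every time step. The sketch below shows a per-token alternative, in which each time step holds one word vector and short reviews are zero-padded. `pad_token_vectors` and the toy vectors are assumptions for illustration, not the notebook's method.

```python
import numpy as np


def pad_token_vectors(reviews, max_len, dim):
    """Build an array of shape (n_reviews, max_len, dim) from per-token vectors.

    Each review is a list of 1-D word vectors. Longer reviews are truncated
    at the end and shorter ones are zero-padded at the end ('post' padding).
    """
    batch = np.zeros((len(reviews), max_len, dim), dtype=np.float32)
    for i, review in enumerate(reviews):
        length = min(len(review), max_len)
        if length > 0:
            batch[i, :length] = np.asarray(review[:length], dtype=np.float32)
    return batch


# Toy reviews with 1, 3, and 0 tokens; the vectors are made up.
toy_reviews = [
    [np.ones(4)],
    [np.full(4, 2.0), np.full(4, 3.0), np.full(4, 4.0)],
    [],
]
padded = pad_token_vectors(toy_reviews, max_len=2, dim=4)
print(padded.shape)
```

With this layout the recurrent layer sees the words in order, which is the information an LSTM is designed to use.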



## Model Selection

### LSTM model

LSTM (Long Short-Term Memory) networks are a type of recurrent neural network (RNN) architecture commonly used for sentiment analysis.
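
To make the recurrence concrete, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. The weights are random and the sizes are arbitrary (both are assumptions for illustration); a trained Keras layer learns these weights instead.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W has shape (input_dim, 4 * units), U has shape (units, 4 * units) and
    b has shape (4 * units,). The four blocks are the input gate, forget
    gate, candidate cell state, and output gate.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:units])                 # input gate
    f = sigmoid(z[units:2 * units])        # forget gate
    g = np.tanh(z[2 * units:3 * units])    # candidate cell state
    o = sigmoid(z[3 * units:])             # output gate
    c_t = f * c_prev + i * g               # new cell state
    h_t = o * np.tanh(c_t)                 # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units, steps = 5, 3, 4
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

# Run the cell over a short random sequence, carrying the state forward.
h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(steps, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)
```

The final hidden state `h` is what a layer such as `LSTM(100)` passes on to the next layer when `return_sequences` is left at its default.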

In [140]:
# Define a list of model names and their corresponding test loss and test accuracy
model_names = ['Simple Model', 'Increased the number of LSTM layers', 
               'With dropout regularization', 'Bidirectional LSTMs', 
               'Ensemble methods']
test_loss = np.zeros(5)
test_accuracy = np.zeros(5)

# Create a DataFrame with the model names, test loss, and test accuracy
results_df = pd.DataFrame({'Model': model_names, 'Test_Loss': test_loss, 'Test_Accuracy': test_accuracy})

# Display the results table
print(results_df)

                                 Model  Test_Loss  Test_Accuracy
0                         Simple Model        0.0            0.0
1  Increased the number of LSTM layers        0.0            0.0
2          With dropout regularization        0.0            0.0
3                  Bidirectional LSTMs        0.0            0.0
4                     Ensemble methods        0.0            0.0


#### Simple Model

In [141]:
# Create the model
model = Sequential()
model.add(LSTM(100, input_shape=(max_sequence_length, embedding_size)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the ReduceLROnPlateau and EarlyStopping callbacks
# the ReduceLROnPlateau callback is used to reduce the learning rate when the validation loss stops improving
# the EarlyStopping callback is used to stop the training process 
# when the validation accuracy does not improve within a certain number of epochs. 
reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0) 
early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=1e-4, patience=5) 

# Train the model with the callbacks
model.fit(X_train_reshaped, y_train, validation_data=(X_test_reshaped, y_test),
          epochs=10, batch_size=64, callbacks=[reduce_lr, early_stopping])

# Evaluate the model
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Test loss: 0.33980506658554077
Test accuracy: 0.8492487668991089
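
Training stopped before the tenth epoch because of the `EarlyStopping` callback. The sketch below is a simplified model of patience-based stopping on a metric that should increase. `early_stopping_epoch` is a hypothetical helper and the accuracy values are made up; the Keras callback has further options that are not modeled here.

```python
def early_stopping_epoch(val_accuracy, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if it beats the best value so
    far by more than `min_delta`. Training stops once `patience` epochs
    in a row fail to improve.
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value - min_delta > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies for illustration, not the notebook's values.
history = [0.80, 0.83, 0.85, 0.85, 0.84, 0.85, 0.849, 0.85, 0.848, 0.85]
stop_patience_5 = early_stopping_epoch(history, patience=5, min_delta=1e-4)
stop_patience_3 = early_stopping_epoch(history, patience=3, min_delta=1e-4)
print(stop_patience_5, stop_patience_3)
```

A smaller `patience` ends training sooner after a plateau, which is the trade-off behind the different patience values used in the cells of this notebook.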


In [142]:
# Use .loc to avoid chained assignment, which can silently fail to update results_df
results_df.loc[results_df['Model'] == 'Simple Model', 'Test_Loss'] = loss
results_df.loc[results_df['Model'] == 'Simple Model', 'Test_Accuracy'] = accuracy
results_df

| | Model | Test_Loss | Test_Accuracy |
|---|---|---|---|
| 0 | Simple Model | 0.339805 | 0.849249 |
| 1 | Increased the number of LSTM layers | 0.0 | 0.0 |
| 2 | With dropout regularization | 0.0 | 0.0 |
| 3 | Bidirectional LSTMs | 0.0 | 0.0 |
| 4 | Ensemble methods | 0.0 | 0.0 |


#### Increased the number of LSTM layers

In [143]:
# Create the model
model = Sequential()
model.add(LSTM(100, input_shape=(max_sequence_length, embedding_size), return_sequences=True))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the ReduceLROnPlateau and EarlyStopping callbacks
# the ReduceLROnPlateau callback is used to reduce the learning rate when the validation loss stops improving
# the EarlyStopping callback is used to stop the training process 
# when the validation accuracy does not improve within a certain number of epochs. 
reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0) 
early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=1e-4, patience=5) 

# Train the model with the callbacks
model.fit(X_train_reshaped, y_train, validation_data=(X_test_reshaped, y_test),
          epochs=10, batch_size=64, callbacks=[reduce_lr, early_stopping])

# Evaluate the model
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Test loss: 0.3335508108139038
Test accuracy: 0.8576182126998901


In [144]:
# Use .loc to avoid chained assignment, which can silently fail to update results_df
results_df.loc[results_df['Model'] == 'Increased the number of LSTM layers', 'Test_Loss'] = loss
results_df.loc[results_df['Model'] == 'Increased the number of LSTM layers', 'Test_Accuracy'] = accuracy
results_df

| | Model | Test_Loss | Test_Accuracy |
|---|---|---|---|
| 0 | Simple Model | 0.339805 | 0.849249 |
| 1 | Increased the number of LSTM layers | 0.333551 | 0.857618 |
| 2 | With dropout regularization | 0.0 | 0.0 |
| 3 | Bidirectional LSTMs | 0.0 | 0.0 |
| 4 | Ensemble methods | 0.0 | 0.0 |


#### With dropout regularization

In [145]:
# Create the model
model = Sequential()
model.add(LSTM(100, input_shape=(max_sequence_length, embedding_size)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))


# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the ReduceLROnPlateau and EarlyStopping callbacks
# the ReduceLROnPlateau callback is used to reduce the learning rate when the validation loss stops improving
# the EarlyStopping callback is used to stop the training process 
# when the validation accuracy does not improve within a certain number of epochs. 
reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=3, cooldown=0) 
early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=1e-4, patience=3) 

# Train the model with the callbacks
model.fit(X_train_reshaped, y_train, validation_data=(X_test_reshaped, y_test),
          epochs=10, batch_size=64, callbacks=[reduce_lr, early_stopping])

# Evaluate the model
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Test loss: 0.3314211964607239
Test accuracy: 0.8581224083900452
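
The `Dropout(0.2)` layer randomly zeroes 20% of its inputs during training and scales the remaining ones by 1 / (1 - rate), while leaving inputs unchanged at inference time. The NumPy sketch below illustrates that behavior. The `dropout` function and the all-ones input are assumptions for illustration, not the Keras implementation.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest.

    At inference time (`training=False`) the input passes through unchanged.
    """
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
activations = np.ones(1000)
train_out = dropout(activations, rate=0.2, training=True, rng=rng)
eval_out = dropout(activations, rate=0.2, training=False, rng=rng)
print(train_out.mean(), eval_out.mean())
```

Rescaling the kept units keeps the expected activation the same in both modes, so no adjustment is needed when the model is evaluated.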


In [146]:
# Use .loc to avoid chained assignment, which can silently fail to update results_df
results_df.loc[results_df['Model'] == 'With dropout regularization', 'Test_Loss'] = loss
results_df.loc[results_df['Model'] == 'With dropout regularization', 'Test_Accuracy'] = accuracy
results_df

| | Model | Test_Loss | Test_Accuracy |
|---|---|---|---|
| 0 | Simple Model | 0.339805 | 0.849249 |
| 1 | Increased the number of LSTM layers | 0.333551 | 0.857618 |
| 2 | With dropout regularization | 0.331421 | 0.858122 |
| 3 | Bidirectional LSTMs | 0.0 | 0.0 |
| 4 | Ensemble methods | 0.0 | 0.0 |


#### Bidirectional LSTMs

In [147]:
# Create the model
model = Sequential()
# input_shape is passed to the Bidirectional wrapper, which is the model's first layer
model.add(Bidirectional(LSTM(100, return_sequences=True), input_shape=(max_sequence_length, embedding_size)))
model.add(Bidirectional(LSTM(100)))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define the ReduceLROnPlateau and EarlyStopping callbacks
# the ReduceLROnPlateau callback is used to reduce the learning rate when the validation loss stops improving
# the EarlyStopping callback is used to stop the training process 
# when the validation accuracy does not improve within a certain number of epochs. 
reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=3, cooldown=0) 
early_stopping = EarlyStopping(monitor='val_accuracy', min_delta=1e-4, patience=3) 

# Train the model with the callbacks
model.fit(X_train_reshaped, y_train, validation_data=(X_test_reshaped, y_test),
          epochs=10, batch_size=64, callbacks=[reduce_lr, early_stopping])

# Evaluate the model
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
print(f"Test loss: {loss}")
print(f"Test accuracy: {accuracy}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Test loss: 0.33250999450683594
Test accuracy: 0.85792076587677


In [148]:
# Use .loc to avoid chained assignment, which can silently fail to update results_df
results_df.loc[results_df['Model'] == 'Bidirectional LSTMs', 'Test_Loss'] = loss
results_df.loc[results_df['Model'] == 'Bidirectional LSTMs', 'Test_Accuracy'] = accuracy
results_df

| | Model | Test_Loss | Test_Accuracy |
|---|---|---|---|
| 0 | Simple Model | 0.339805 | 0.849249 |
| 1 | Increased the number of LSTM layers | 0.333551 | 0.857618 |
| 2 | With dropout regularization | 0.331421 | 0.858122 |
| 3 | Bidirectional LSTMs | 0.33251 | 0.857921 |
| 4 | Ensemble methods | 0.0 | 0.0 |


#### Ensembles

In [149]:
# results_df.loc[results_df['Model'] == 'Ensemble methods', 'Test_Loss'] = loss
# results_df.loc[results_df['Model'] == 'Ensemble methods', 'Test_Accuracy'] = accuracy
# results_df
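
The ensemble row is left unfilled in this notebook. One common approach, sketched here purely as an illustration, is soft voting: average the positive-class probabilities predicted by several models and threshold the result. `soft_vote`, `accuracy`, and all numbers below are hypothetical and do not come from the models trained above.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average per-model probabilities and threshold them into 0/1 labels.

    `probabilities_per_model` holds one list of positive-class probabilities
    per model, all in the same review order.
    """
    n_models = len(probabilities_per_model)
    averaged = [sum(column) / n_models for column in zip(*probabilities_per_model)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Made-up probabilities from three hypothetical models on four reviews.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.7, 0.3, 0.3],
]
true_labels = [1, 1, 0, 0]
averaged, predicted = soft_vote(model_probs)
ensemble_accuracy = accuracy(predicted, true_labels)
print(predicted, ensemble_accuracy)
```

In this toy case each individual model misclassifies at least one review at the 0.5 threshold, while the averaged prediction gets all four right.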

In [152]:
# save the results_df into new csv file
results_df[['Model', 'Test_Loss', 'Test_Accuracy']].to_csv('LSTM_model_results.csv', index=False, header=True)

##### CNN

In [150]:
# # Create a neural network model
# model = Sequential()

# # Add a fully connected layer with 64 units and ReLU activation
# model.add(Dense(64, activation='relu', input_shape=(avg_vector_size,)))

# # Add another fully connected layer with 32 units and ReLU activation
# model.add(Dense(32, activation='relu'))

# # Add a final output layer with 1 unit and sigmoid activation for binary classification
# model.add(Dense(1, activation='sigmoid'))

# # Compile the model
# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

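# # Train the model on the 2-D averaged embeddings, which match input_shape=(avg_vector_size,)
# model.fit(X_train_word_embeddings, y_train, validation_data=(X_test_word_embeddings, y_test), epochs=10, batch_size=64)

# # Evaluate the model
# loss, accuracy = model.evaluate(X_test_word_embeddings, y_test)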
# print(f'Test loss: {loss:.2f}')
# print(f'Test accuracy: {accuracy:.2f}')