# **Sentiment Analysis of IMDB Movie Reviews**

</br>

**Dataset**
</br>

The IMDb Dataset of 50K Movie Reviews is a popular dataset for sentiment analysis and other natural language processing tasks. It consists of 50,000 movie reviews: 25,000 labeled as positive and 25,000 labeled as negative.
</br>

Dataset Source: [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?datasetId=134715&searchQuery=pytor)
</br>

**The Problem Statement**
</br>

Predict whether each review expresses a positive or a negative sentiment, using deep learning techniques.

**To approach this problem, we followed the outline below:**

- **Data preprocessing:** applied in the notebook called _"Data_preprocessing_notebook"_
</br>

- **Word embedding:** Convert the preprocessed text into a numerical representation that deep learning models can consume. Word embeddings such as Word2Vec or GloVe represent words as dense vectors in a continuous vector space; this notebook trains a Word2Vec model.
</br>

- **Model selection:** Choose a suitable deep learning architecture. Candidates include recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and convolutional neural networks (CNNs).
</br>

- **Model training:** Split our dataset into training and validation sets.
</br>
- **Model evaluation**
</br>
- **Model refinement**
</br>
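
As a rough illustration of the model selection and training steps in the outline above, the sketch below defines a small LSTM classifier and runs a single optimisation step on random data. The class name `SentimentLSTM`, the hidden size of 64, the learning rate, and the random stand-in batch are all assumptions made for illustration; this is not necessarily the architecture the notebook ends up with.

```python
import torch
from torch import nn
from torch.optim import Adam


class SentimentLSTM(nn.Module):
    """Hypothetical binary classifier over sequences of word vectors."""

    def __init__(self, embedding_dim=25, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x has shape (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(x)
        # h_n[-1] is the final hidden state of the last layer: (batch, hidden_dim)
        return self.fc(h_n[-1]).squeeze(1)  # raw logits, shape (batch,)


torch.manual_seed(0)
net = SentimentLSTM()
optimizer = Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Random stand-in data: 4 "reviews" of 10 word vectors each, with 0/1 labels
features = torch.randn(4, 10, 25)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

# One optimisation step
optimizer.zero_grad()
logits = net(features)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()

# Sigmoid turns logits into probabilities of the positive class
probabilities = torch.sigmoid(logits)
print(probabilities)
```

`BCEWithLogitsLoss` is used here so that the model can return raw logits; the sigmoid is applied only when probabilities are needed.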

**(Initial) Attributes**:

* Review
* Sentiment
 

## All the imports

In [1]:
# import to "ignore" warnings

import warnings
warnings.filterwarnings('ignore')

# imports for data manipulation

import pandas as pd
import numpy as np

# imports for data visualization

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud  # third-party package; install it locally first


# import PyTorch (framework for building deep learning models); install it locally first

import torch 
from torch import nn
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader

# imports from sklearn

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

import gensim  # third-party package; install it locally first
from gensim.models import Word2Vec
import random
import nltk

#### Load the CSV file

In [2]:
# read data

data = pd.read_csv('imdb_clean_dataset.csv')
data.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | one review mention watch oz episod hook right ... | 1 |
| 1 | wonder littl product film techniqu unassum old... | 1 |
| 2 | thought wonder way spend time hot summer weeke... | 1 |
| 3 | basic famili littl boy jake think zombi closet... | 0 |
| 4 | petter mattei love time money visual stun film... | 1 |


In [4]:
data.shape

(49582, 2)

### Word embedding using Word2Vec model

In [16]:
nltk.download('punkt') 
sentences = [nltk.word_tokenize(review) for review in data['review']]
model = Word2Vec(sentences, vector_size=25, window=5, min_count=1, workers=2)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\stavp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
sentences[:1]

[['one',
  'review',
  'mention',
  'watch',
  'oz',
  'episod',
  'hook',
  'right',
  'exactli',
  'happen',
  'first',
  'thing',
  'struck',
  'oz',
  'brutal',
  'unflinch',
  'scene',
  'violenc',
  'set',
  'right',
  'word',
  'go',
  'trust',
  'show',
  'faint',
  'heart',
  'timid',
  'show',
  'pull',
  'punch',
  'regard',
  'drug',
  'sex',
  'violenc',
  'hardcor',
  'classic',
  'use',
  'word',
  'call',
  'oz',
  'nicknam',
  'given',
  'oswald',
  'maximum',
  'secur',
  'state',
  'penitentari',
  'focus',
  'mainli',
  'emerald',
  'citi',
  'experiment',
  'section',
  'prison',
  'cell',
  'glass',
  'front',
  'face',
  'inward',
  'privaci',
  'high',
  'agenda',
  'em',
  'citi',
  'home',
  'mani',
  'aryan',
  'muslim',
  'gangsta',
  'latino',
  'christian',
  'italian',
  'irish',
  'scuffl',
  'death',
  'stare',
  'dodgi',
  'deal',
  'shadi',
  'agreement',
  'never',
  'far',
  'away',
  'would',
  'say',
  'main',
  'appeal',
  'show',
  'due',
  'fac

#### Split into train/test sets

In [18]:
# Split the dataset into training and test sets
X = data['review']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'Shape of train data: {X_train.shape}')
print(f'Shape of test data: {X_test.shape}')

Shape of train data: (39665,)
Shape of test data: (9917,)
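
The shapes printed above follow from the dataset size and `test_size=0.2`. The short calculation below reproduces them, assuming (as scikit-learn does, to the best of our knowledge) that the test count is rounded up and the training set receives the remainder.

```python
import math

n_samples = 49582   # rows in the cleaned dataset (from data.shape)
test_size = 0.2     # fraction passed to train_test_split

# Assumed rounding rule: ceil for the test set, remainder for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```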


In [19]:
X_train[:5]

7827     realli like movi empor new groov watch like co...
4811     decid watch movi note scariest movi ever expec...
35252    hard say go ladi shanghai film could without s...
3446     sci fi adventur best mean worst agre statement...
24377    around late anim bluth frustrat output compani...
Name: review, dtype: object

#### Tokenize each review into a list of words

In [20]:
X_train_sentences = [nltk.word_tokenize(text) for text in X_train]
X_test_sentences = [nltk.word_tokenize(text) for text in X_test]

In [21]:
X_train_sentences[:1]

[['realli',
  'like',
  'movi',
  'empor',
  'new',
  'groov',
  'watch',
  'like',
  'come',
  'home',
  'see',
  'wife',
  'relat',
  'llama',
  'serious',
  'movi',
  'bad',
  'like',
  'club',
  'dread',
  'super',
  'trooper',
  'suppos',
  'write',
  'line',
  'even',
  'know',
  'els',
  'say',
  'laugh',
  'coupl',
  'time',
  'drink',
  'movi',
  'like',
  'least',
  'funni',
  'drunk',
  'mayb',
  'llama',
  'funni',
  'regular',
  'cartoon',
  'peopl',
  'either',
  'way',
  'stick',
  'empor',
  'new',
  'groov',
  'want',
  'funni',
  'cartoon',
  'llama',
  'theme',
  'movi',
  'line',
  'line',
  'right']]

#### Convert each word in the sentences to its corresponding word vector representation

In [22]:
def to_word_vectors(tokenized_reviews):
    """Map each token to its Word2Vec vector, skipping tokens missing from the vocabulary."""
    return [
        [model.wv[word] for word in sentence if word in model.wv]
        for sentence in tokenized_reviews
    ]

X_train_vectors = to_word_vectors(X_train_sentences)
X_test_vectors = to_word_vectors(X_test_sentences)
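
The lookup above can be illustrated on a toy example. In the sketch below a plain dictionary (`toy_vectors`, a hypothetical stand-in for `model.wv`) holds 3-dimensional vectors, and a token without a vector is silently dropped. In this notebook the Word2Vec model was trained with `min_count=1` on all reviews, so every token should have a vector and the membership check acts mainly as a safeguard.

```python
import numpy as np

# Hypothetical stand-in for model.wv: token -> vector
toy_vectors = {
    "movi": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "funni": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}


def to_vectors(tokens, lookup):
    """Return the vectors of the tokens that exist in the lookup, in order."""
    return [lookup[token] for token in tokens if token in lookup]


review = ["movi", "llama", "funni"]   # "llama" has no vector in the toy lookup
vectors = to_vectors(review, toy_vectors)

# The vector sequence can be shorter than the token sequence
print(len(review), len(vectors))
```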

In [23]:
X_train_vectors[:1]

[[array([-0.8158005 , -0.79729337,  0.01080699, -0.95394844,  1.1936789 ,
          3.4601262 , -1.7047168 ,  0.3056661 ,  0.23701242, -0.3253516 ,
         -0.96359634,  1.233798  ,  5.477555  , -2.7805386 ,  1.908187  ,
          1.7135054 ,  1.2637771 ,  1.6957569 , -0.25309414, -0.01486314,
          2.5091515 ,  1.5817406 , -0.01520674,  1.9691343 , -0.4267806 ],
        dtype=float32),
  array([ 2.0072021 , -2.341946  , -0.8904394 , -1.3117406 , -0.19989231,
          3.262966  , -0.23999795, -0.27094847,  0.34196025, -1.4057124 ,
         -0.15793666,  1.7720054 ,  2.3195863 ,  0.811979  ,  2.392187  ,
          1.3614914 ,  1.5390148 , -0.6711238 , -1.1657475 ,  0.90940505,
          1.4119602 ,  0.77606237, -2.61128   ,  1.2451581 ,  0.25014243],
        dtype=float32),
  array([ 0.33962864, -1.062212  ,  0.75977075,  1.7319666 ,  2.2754123 ,
          2.654217  , -0.12741539,  0.40317035,  1.2745634 ,  0.34986964,
         -2.624419  ,  0.55494684,  5.008631  , -1.1508145 ,  

#### Convert the word vector sequences into fixed-length vectors

In [24]:
# Calculate the maximum sequence length across training and test sets
train_max_sequence_length = max(len(seq) for seq in X_train_vectors)
test_max_sequence_length = max(len(seq) for seq in X_test_vectors)
max_sequence_length = max(train_max_sequence_length, test_max_sequence_length)

print(train_max_sequence_length)
print(test_max_sequence_length)
print(max_sequence_length)

1135
1422
1422
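
Padding every review to the longest one has a memory cost worth estimating before running the next cell. The back-of-the-envelope calculation below uses the numbers printed above and assumes the padded array is stored as a dense NumPy array, either as `float64` (the default dtype of `np.zeros`) or as `float32`.

```python
n_train = 39665      # training reviews
max_len = 1422       # longest review, in tokens
vector_size = 25     # Word2Vec dimensionality

n_elements = n_train * max_len * vector_size
bytes_float64 = n_elements * 8   # 8 bytes per float64 value
bytes_float32 = n_elements * 4   # 4 bytes per float32 value

print(f"{n_elements:,} elements")
print(f"float64: {bytes_float64 / 1e9:.2f} GB")
print(f"float32: {bytes_float32 / 1e9:.2f} GB")
```

If this does not fit in memory, one option is to cap the sequence length below the maximum, at the cost of truncating the longest reviews.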


In [25]:
# Define a function to pad (or truncate) sequences to a fixed length
def pad_sequences(sequences, max_length):
    vector_size = model.vector_size
    padded_sequences = []
    for sequence in sequences:
        # reshape keeps a 2-D (n_words, vector_size) array even for an empty review,
        # which would otherwise make np.concatenate fail
        sequence = np.asarray(sequence, dtype=np.float64).reshape(-1, vector_size)
        if len(sequence) < max_length:
            # Pad the end of the sequence with zero vectors
            padding = np.zeros((max_length - len(sequence), vector_size))
            padded_sequence = np.concatenate((sequence, padding))
        else:
            # Truncate the sequence if it exceeds the maximum length
            padded_sequence = sequence[:max_length]
        padded_sequences.append(padded_sequence)
    return np.array(padded_sequences)

# Pad the training and test sequences
X_train_padded = pad_sequences(X_train_vectors, max_sequence_length)
X_test_padded = pad_sequences(X_test_vectors, max_sequence_length)
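
As a possible alternative to building a list of padded arrays, the sketch below writes each sequence into one pre-allocated `float32` array. The function name `pad_preallocated` and the toy input are assumptions for illustration; the padding and truncation behaviour is meant to match the function above, only with a smaller dtype and no intermediate copies.

```python
import numpy as np


def pad_preallocated(sequences, max_length, vector_size, dtype=np.float32):
    """Zero-pad or truncate each sequence into a (n, max_length, vector_size) array."""
    out = np.zeros((len(sequences), max_length, vector_size), dtype=dtype)
    for i, sequence in enumerate(sequences):
        if len(sequence) == 0:
            continue  # an empty review stays all zeros
        trimmed = np.asarray(sequence[:max_length], dtype=dtype)
        out[i, :len(trimmed)] = trimmed
    return out


# Toy input: sequences of 2-dimensional vectors with lengths 2, 5 and 0
toy = [
    [np.ones(2), np.ones(2)],
    [np.ones(2)] * 5,
    [],
]
padded = pad_preallocated(toy, max_length=3, vector_size=2)
print(padded.shape, padded.dtype)
```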


In [26]:
X_train_padded[:5]

array([[[-0.81580049, -0.79729337,  0.01080699, ..., -0.01520674,
          1.96913433, -0.42678061],
        [ 2.00720215, -2.34194589, -0.89043939, ..., -2.61127996,
          1.24515808,  0.25014243],
        [ 0.33962864, -1.06221199,  0.75977075, ...,  0.68648297,
          1.73341846,  0.01167157],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]],

       [[ 0.81318736, -0.54572284,  1.11665988, ...,  0.23371355,
         -1.78173149, -0.21972075],
        [ 1.00708973, -0.55764121, -1.64212048, ...,  0.78962874,
         -0.79215157, -2.07423043],
        [ 0.33962864, -1.06221199,  0.75977075, ...,  0.68648297,
          1.73341846,  0.01167157],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  