# Sentiment Analysis with IMDB movie reviews

In this notebook we perform sentiment analysis on IMDB movie reviews. The dataset was downloaded from Kaggle and is available at the following link: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews. It contains 50k movie reviews, all of them labeled as either negative or positive.

### Importing needed libraries

In [35]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import nltk
nltk.download('stopwords')
import numpy as np

from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\teddy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Reading data & Pre-processing data

The dataset has 50k rows and two columns, review and sentiment. The sentiment column is our output. The reviews contain special characters and HTML tags, so we perform a quick cleaning step before continuing with the implementation.

In [3]:
df = pd.read_csv("IMDB Dataset.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [6]:
# Separating data into independent and dependent variables
x_data = df['review']
y_data = df['sentiment']

# Cleaning X data
stop_words = set(stopwords.words('english'))    # English stop words, removed so the model focuses on the informative words
x_data = x_data.replace({'<.*?>': ''}, regex=True)    # Removing all the HTML tags
x_data = x_data.replace({'[^A-Za-z]': ' '}, regex=True)    # Replacing every non-alphabetic character with a space
# Removing stop words from the reviews. NLTK's list is lowercase and this step runs
# before lowercasing, so capitalized stop words such as "I" or "A" are kept.
x_data = x_data.apply(lambda review: [w for w in review.split() if w not in stop_words])
x_data = x_data.apply(lambda review: " ".join([w.lower() for w in review]))   # Lowercasing all words in the review

# Transforming Y data
y_data = y_data.replace('positive', 1)  # 1 represents positive reviews and 0 represents negative reviews
y_data = y_data.replace('negative', 0)

# Showing transformed data
print('X Data:\n', x_data)
print('Y Data:\n', y_data)

X Data:
 0        one reviewers mentioned watching oz episode ho...
1        a wonderful little production the filming tech...
2        i thought wonderful way spend time hot summer ...
3        basically family little boy jake thinks zombie...
4        petter mattei love time money visually stunnin...
                               ...                        
49995    i thought movie right good job it creative ori...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i catholic taught parochial elementary schools...
49998    i going disagree previous comment side maltin ...
49999    no one expects star trek movies high art fans ...
Name: review, Length: 50000, dtype: object
Y Data:
 0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64
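In the output above, words such as "a", "i" and "no" are still present, because NLTK's stop-word list is lowercase while the reviews are lowercased only after filtering. A minimal sketch of a cleaning function that lowercases first is shown below. The function name `clean_review` and the tiny `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses the full NLTK English list. Unlike the notebook, the sketch replaces HTML tags with a space so that adjacent words cannot be glued together.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch);
# the notebook uses set(stopwords.words('english')) from NLTK instead.
STOP_WORDS = {"a", "i", "the", "is", "it", "no", "this", "was"}

def clean_review(text, stop_words=STOP_WORDS):
    """Strip HTML tags and non-letters, lowercase, then drop stop words."""
    text = re.sub(r"<.*?>", " ", text)       # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)   # keep letters only
    words = text.lower().split()             # lowercase BEFORE filtering
    return " ".join(w for w in words if w not in stop_words)

print(clean_review("A wonderful little production. <br /><br />The filming was great!"))
# wonderful little production filming great
```

Applied to the notebook's data, this would be used as `x_data.apply(clean_review)`, and the recorded outputs would change accordingly.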


### Train test split

We need to split the data into training and testing sets. For this we use the train_test_split function provided by scikit-learn, keeping 30% of the data for testing and the remaining 70% for training. The split is stratified on the labels, so both sets keep the same class balance.

In [30]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=123, stratify=y_data)
x_train.shape,y_train.shape

((35000,), (35000,))

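### Getting the mean review length for padding

In [13]:
# Counting the words in each review (words are separated by whitespace)
review_length = []
for review in x_train:
    review_length.append(len(review.split()))

# Despite its name, median_length holds the mean word count, rounded up
median_length = int(np.ceil(np.mean(review_length)))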
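The mean and the median can differ noticeably when a few reviews are much longer than the rest, so the choice affects how many reviews get truncated. A minimal sketch with made-up reviews (not taken from the dataset), counting words rather than characters:

```python
import math
import statistics

# Made-up reviews, for illustration only
reviews = [
    "great movie",
    "bad plot bad acting",
    "loved it",
    "one of the longest and most detailed reviews anyone ever wrote about film",
]

word_counts = [len(r.split()) for r in reviews]           # words per review
mean_len = math.ceil(statistics.mean(word_counts))        # pulled up by the long review
median_len = math.ceil(statistics.median(word_counts))    # robust to the long review

print(word_counts, mean_len, median_len)
# [2, 4, 2, 13] 6 3
```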

### Padding and tokenization

In [31]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # lower=False because the reviews were already lowercased
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = median_length 

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

x_train.shape

Encoded X Train
 [[  320     5  7607 ...     0     0     0]
 [ 2240  2736   202 ...     0     0     0]
 [    7    94  2200 ...     0     0     0]
 ...
 [ 1811 14915     2 ...     0     0     0]
 [    2    13   255 ...     0     0     0]
 [ 1306  6441   149 ...   162   539   330]] 

Encoded X Test
 [[  868  3516  1310 ...  2752   266 42604]
 [  106  1558  6561 ...     0     0     0]
 [  680 15392    78 ...     0     0     0]
 ...
 [  106     1  1720 ...   412  1261  4410]
 [    8  3054   830 ...  2188  2215   418]
 [ 2848   392    18 ...     0     0     0]] 

Maximum review length:  130


(35000, 130)
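To make the effect of padding='post' and truncating='post' concrete, here is a pure-Python sketch of the same behavior. The helper name `pad_post` is hypothetical and is not part of Keras; it only mimics what pad_sequences does with these two settings.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then fill with `value` at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # 'post' truncation keeps the first maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq)))   # 'post' padding appends zeros
    return padded

print(pad_post([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4))
# [[5, 8, 0, 0], [3, 1, 4, 1]]
```

This is why the short reviews in the encoded output end in runs of zeros, while the long ones are cut off after the first 130 word ids.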

### Model Architecture

Our model has an **embedding layer**, an **LSTM layer** and a **Dense layer**. The embedding layer has an output size of **16** and the LSTM layer has **32** units.

In [23]:
embedded_size = 16
lstm_size = 32
word_size = len(token.word_index) + 1       # +1 due to padding

model = Sequential()
model.add(Embedding(word_size, embedded_size, input_length = median_length))
model.add(LSTM(lstm_size))
model.add(Dense(1, activation= 'sigmoid'))
model.compile(optimizer= 'adam', loss= 'binary_crossentropy', metrics= ['accuracy'])

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 130, 16)           1399408   
                                                                 
 lstm_1 (LSTM)               (None, 32)                6272      
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,405,713
Trainable params: 1,405,713
Non-trainable params: 0
_________________________________________________________________
None
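The parameter counts in the summary can be checked by hand. A minimal sketch, assuming the standard Keras LSTM layout (four gates, each with input weights, recurrent weights and a bias) and a vocabulary size of 87,463, which is not printed in the notebook but follows from dividing the embedding parameter count by 16:

```python
embedded_size = 16
lstm_size = 32
vocab_size = 87463   # assumption: derived as 1,399,408 / 16 from the summary

# Embedding: one vector of size 16 per word id
embedding_params = vocab_size * embedded_size

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embedded_size + lstm_size) * lstm_size + lstm_size)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_size * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1399408 6272 33 1405713
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.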


### Training model

We need to fit the model with our training data. For this training session we use mini-batches of size 128 and train for 5 epochs.

In [19]:
checkpoint = ModelCheckpoint(filepath= 'models/LSTM.h5', monitor= 'accuracy', save_best_only= True, verbose= 1)

In [32]:
model.fit(x_train, y_train, batch_size= 128, epochs= 5, callbacks=[checkpoint])

(35000, 130) (35000,)
Epoch 1/5
Epoch 1: accuracy improved from -inf to 0.67117, saving model to models\LSTM.h5
Epoch 2/5
Epoch 2: accuracy improved from 0.67117 to 0.90311, saving model to models\LSTM.h5
Epoch 3/5
Epoch 3: accuracy improved from 0.90311 to 0.95666, saving model to models\LSTM.h5
Epoch 4/5
Epoch 4: accuracy improved from 0.95666 to 0.97606, saving model to models\LSTM.h5
Epoch 5/5
Epoch 5: accuracy improved from 0.97606 to 0.98594, saving model to models\LSTM.h5


<keras.callbacks.History at 0x13e11ec7130>
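As a quick check on what a batch size of 128 means here, the number of weight updates per epoch follows from the training set size. A small sketch, using the 35,000 training rows reported earlier in the notebook:

```python
import math

n_samples = 35000    # rows in x_train, from the shape printed in the notebook
batch_size = 128
epochs = 5

steps_per_epoch = math.ceil(n_samples / batch_size)            # the last batch may be smaller
last_batch = n_samples - (steps_per_epoch - 1) * batch_size    # size of that final batch
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_steps)
# 274 56 1370
```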

### Testing

We need to evaluate the model's performance on unseen data by comparing its predictions to the true labels.

In [44]:
y_pred = (model.predict(x_test) > 0.5).astype('int32')

true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true+=1

accuracy = (true/len(y_pred)) * 100
print(f'Accuracy: {accuracy}')

Accuracy: 86.66
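The same accuracy can be computed without an explicit loop. A minimal sketch with made-up probabilities and labels (not the model's actual predictions), assuming the predictions come as a column of shape (n, 1) as a Dense(1) output layer would produce:

```python
import numpy as np

# Made-up sigmoid outputs and labels, for illustration only
probs = np.array([[0.91], [0.12], [0.55], [0.40]])
y_true = np.array([1, 0, 0, 0])

# Threshold at 0.5 and flatten (n, 1) to (n,) so it lines up with the labels
y_pred = (probs > 0.5).astype("int32").ravel()

accuracy = (y_pred == y_true).mean() * 100
print(f"Accuracy: {accuracy:.2f}")
# Accuracy: 75.00
```

In the notebook this would correspond to comparing `y_pred.ravel()` with `y_test.to_numpy()`.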
