## IMDB Reviews Sentiment Analysis with RNN

The goal is to train an RNN model on the IMDB Dataset of 50K Movie Reviews, a dataset for binary sentiment classification that contains substantially more data than previous benchmark datasets. The original release provides 25,000 highly polar movie reviews for training and 25,000 for testing; here all 50,000 reviews are loaded from a single CSV file and split with cross-validation.

In [1]:
import numpy as np
import pandas as pd
import io
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer 
import os, re, csv, math, codecs
from sklearn import model_selection
from sklearn import metrics
import torch
import torch.nn as nn
import torch.multiprocessing as mp
from torch.optim.lr_scheduler import StepLR
import tensorflow as tf  # PyTorch is used for the model; TensorFlow only for the Keras tokenizer and padding

torch.manual_seed(1337)

try:
    mp.set_start_method('spawn')
except RuntimeError:
    pass

df = pd.read_csv('./IMDB Dataset.csv', 
                 encoding='utf-8')

df.head(10)



Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


The text contains HTML markup and other peculiarities, so let's clean it using Beautiful Soup and regular expressions.

In [2]:
from bs4 import BeautifulSoup
import re
import warnings
warnings.filterwarnings('ignore')

#Getting rid of  html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)   # raw string avoids invalid-escape warnings

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column
df['review']=df['review'].apply(denoise_text)
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive
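
To make the two cleaning steps concrete, here is a small standard-library sketch applied to a made-up review string. The sample text is hypothetical, and `strip_tags_simple` is only a rough regex stand-in for the tag stripping done by Beautiful Soup (an assumption for illustration; the notebook itself uses `BeautifulSoup(...).get_text()`).

```python
import re

# Hypothetical sample, not taken from the dataset.
sample = "A wonderful film.<br /><br />The acting [spoiler: everyone survives] was superb."

def strip_tags_simple(text):
    # Rough stand-in for BeautifulSoup(...).get_text(): drop anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)

def remove_between_square_brackets(text):
    # Same pattern as in the notebook: remove "[...]" including the brackets.
    return re.sub(r"\[[^]]*\]", "", text)

cleaned = remove_between_square_brackets(strip_tags_simple(sample))
print(cleaned)
```

Note that deleting `<br />` tags without inserting a space can glue two sentences together, which may matter for a whitespace-based tokenizer.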


Let's convert the sentiment column to a binary label and add a 5-fold cross-validation group column.

In [3]:
df.sentiment = df.sentiment.apply(lambda x: 1 if x=='positive' else 0)

df['kfold'] = -1
df = df.sample(frac=1).reset_index(drop=True)

y = df.sentiment.values

kf = model_selection.StratifiedKFold(n_splits=5)
for fold, (train_, valid_) in enumerate(kf.split(X=df, y=y)):
    df.loc[valid_, 'kfold'] = fold
    
df.head(10)

Unnamed: 0,review,sentiment,kfold
0,I watched this movie only because I didn't wan...,1,0
1,What an embarassment...This doesnt do justice ...,0,0
2,I can not believe the positive reaction to thi...,0,0
3,"Like many western Pennsylvania history buffs, ...",0,0
4,SLASHERS (2 outta 5 stars)Not really a very go...,0,0
5,"i think dirty dancing was a great movie, they ...",1,0
6,My Super Ex Girlfriend turned out to be a plea...,1,0
7,Billy Chung Siu Hung's (the bloody swordplay f...,0,0
8,What was always missing with the Matrix story ...,1,0
9,"What a truly moronic movie, all I can say is t...",0,0
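
The idea behind stratified folds is that every fold keeps roughly the same share of positive and negative reviews. The sketch below illustrates that property on toy labels with a simple round-robin assignment per class. It is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm, and the function name `stratified_fold_ids` is made up for this example.

```python
from collections import defaultdict

def stratified_fold_ids(labels, n_splits=5):
    """Assign a fold id to each sample so every fold keeps the class ratio.

    Simplified illustration: round-robin assignment within each class.
    """
    fold_ids = [-1] * len(labels)
    seen_per_class = defaultdict(int)
    for i, label in enumerate(labels):
        fold_ids[i] = seen_per_class[label] % n_splits
        seen_per_class[label] += 1
    return fold_ids

labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1] * 10   # toy labels: 50 positive, 50 negative
folds = stratified_fold_ids(labels, n_splits=5)

for fold in range(5):
    members = [labels[i] for i in range(len(labels)) if folds[i] == fold]
    print(fold, len(members), sum(members) / len(members))
```

Each fold ends up with the same positive rate as the full toy set, which is the property the kfold column above is meant to guarantee.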


# Word embeddings

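First, I use Facebook's FastText (2016) vectors, downloaded from: https://fasttext.cc/docs/en/english-vectors.html
FastText extends Word2Vec by taking word parts (character n-grams) into account, which enables training embeddings on smaller datasets and generalizing to unknown words. Note that when the vectors are loaded from a .vec file into a dictionary, as below, only words present in the file get a vector.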

In [4]:
fasttext_embedding = {}
f = codecs.open('./wiki-news-300d-1M.vec', encoding='utf-8')
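for line in tqdm(f):
    values = line.rstrip().rsplit(' ')
    if len(values) != 301:          # skip the "count dim" header line (a word plus 300 values is expected)
        continue
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    fasttext_embedding[word] = coefs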
f.close()

999995it [01:20, 12443.47it/s]
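
For reference, a .vec file is plain text: a header line with the vocabulary size and the vector dimension, followed by one word and its coefficients per line. Here is a minimal sketch of parsing that layout, run on a tiny made-up in-memory file rather than the real download. The helper name `load_vec` and the toy vectors are assumptions for illustration.

```python
import io
import numpy as np

# Tiny made-up file in the .vec text layout: a "count dim" header, then one word per line.
toy_vec = io.StringIO(
    "3 4\n"
    "good 0.1 0.2 0.3 0.4\n"
    "bad -0.1 -0.2 -0.3 -0.4\n"
    "film 0.0 0.5 0.0 0.5\n"
)

def load_vec(handle):
    # Read the header first so it is not mistaken for a word vector.
    _, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        values = line.rstrip().split(" ")
        vectors[values[0]] = np.asarray(values[1:], dtype="float32")
    return vectors, dim

toy_embedding, dim = load_vec(toy_vec)
print(len(toy_embedding), dim, toy_embedding["good"].shape)
```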


I also use Stanford's GloVe (2014) vectors: 6B tokens, 400K vocab, uncased, 300d. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

In [5]:
glove = pd.read_csv('./glove/glove.6B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove_embedding = {key: val.values for key, val in glove.T.items()}

glove_embedding['hello'].shape

(300,)

Next, I create an IMDBDataset class that takes the padded token-id sequences and their targets and returns each sample as torch tensors.

In [6]:
class IMDBDataset(torch.utils.data.Dataset):
    def __init__(self, reviews, targets):
        self.reviews = reviews
        self.target = targets
    
    def __len__(self):
        return len(self.reviews)
    
    def __getitem__(self, index):
        review = torch.tensor(self.reviews[index,:], dtype = torch.long)
        target = torch.tensor(self.target[index], dtype = torch.float)
        
        return {'review': review,
                'target': target}

Let's build a bidirectional LSTM model class

In [7]:
class LSTM(nn.Module):
    def __init__(self, embedding_matrix):
        super(LSTM, self).__init__()
        
        num_words = embedding_matrix.shape[0]           # Number of words - num of rows
        embedding_dim = embedding_matrix.shape[1]       # Embedding Dimension - num of columns
        self.embedding = nn.Embedding(
                                      num_embeddings=num_words,
                                      embedding_dim=embedding_dim)
        
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype = torch.float32))
        self.embedding.weight.requires_grad = False     # not training gradient on embedding weight since I use pretrained embedding
        
        self.lstm = nn.LSTM(
                            embedding_dim, 
                            128,
                            bidirectional=True,
                            batch_first=True,
                             )                          # hidden_size is 128 per direction
        self.out = nn.Linear(512, 1)                    # input: avg-pool (256) + max-pool (256), where 256 = 2 * 128 (bidirectional)
        
    def forward(self, x):
        x = self.embedding(x)
        hidden, _ = self.lstm(x)
        avg_pool = torch.mean(hidden, 1)                    # mean pooling over time steps: (batch, 256)
        max_pool, index_max_pool = torch.max(hidden, 1)     # max pooling over time steps: (batch, 256)
        out = torch.cat((avg_pool, max_pool), 1)            # concatenated features: 256 + 256 = 512
        out = self.out(out)                                 # linear layer maps 512 features to 1 logit
        return out
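
The 512 input features of the final linear layer come from the pooling arithmetic. Below is a small NumPy sketch of just the shapes, using a random array as a stand-in for the LSTM output. The batch size and sequence length are toy values chosen for illustration, and NumPy is used here instead of PyTorch only to keep the example light.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 5, 128   # toy sizes; hidden_size matches the model above

# Stand-in for the bidirectional LSTM output: forward and backward states concatenated.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((batch_size, seq_len, 2 * hidden_size))

avg_pool = hidden.mean(axis=1)   # (batch, 256): average over time steps
max_pool = hidden.max(axis=1)    # (batch, 256): element-wise max over time steps
features = np.concatenate([avg_pool, max_pool], axis=1)   # (batch, 512)

print(hidden.shape, avg_pool.shape, max_pool.shape, features.shape)
```

Pooling over the time axis makes the feature size independent of the sequence length, so the same linear layer works for any MAX_LEN.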

If there is a GPU available, let's use it

In [8]:
is_cuda = torch.cuda.is_available()

if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

GPU not available, CPU used


Now, I construct the model training function.

In [9]:
def train(data_loader, model, optimizer, device):
    model.train()
    for data in data_loader:
        reviews = data['review']
        targets = data['target']
        
        reviews = reviews.to(device, dtype = torch.long)
        targets = targets.to(device, dtype = torch.float)
    
        optimizer.zero_grad()
        predictions = model(reviews)
        
        loss = nn.BCEWithLogitsLoss()(predictions, targets.view(-1,1))
        loss.backward()
        optimizer.step()

Next, the model evaluation function is constructed.

In [10]:
def evaluate(data_loader, model, device):
    
    final_predictions = []
    final_targets = []
    model.eval()
   
    with torch.no_grad():
        for data in data_loader:
            reviews = data['review']
            targets = data['target']
            reviews = reviews.to(device, dtype = torch.long)
            targets = targets.to(device, dtype=torch.float)
            
            predictions = model(reviews)
            
            predictions = predictions.cpu().numpy().tolist()
            targets = data['target'].cpu().numpy().tolist()
            
            final_predictions.extend(predictions)
            final_targets.extend(targets)
    return final_predictions, final_targets

Parameter configuration and a helper that builds the embedding matrix as a NumPy array

In [22]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 64
VALID_BATCH_SIZE = 32
EPOCHS = 8

def create_embedding_matrix(word_index, embedding_dict=None, d_model=300):
    
    embedding_matrix = np.zeros((len(word_index) + 1, d_model))
    
    for word, index in word_index.items():
        if word in embedding_dict:
            embedding_matrix[index] = embedding_dict[word]
    return embedding_matrix
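
To see how the embedding matrix behaves for words that have no pretrained vector, here is a toy run with a hypothetical three-word vocabulary and 3-dimensional vectors. The function body mirrors the one above. The vocabulary, the vectors, and the coverage count are made up for illustration.

```python
import numpy as np

def create_embedding_matrix(word_index, embedding_dict, d_model):
    # Row 0 stays all zeros because word indices start at 1 (0 is used for padding).
    matrix = np.zeros((len(word_index) + 1, d_model))
    for word, index in word_index.items():
        if word in embedding_dict:
            matrix[index] = embedding_dict[word]
    return matrix

# Hypothetical toy vocabulary and 3-dimensional vectors.
word_index = {"good": 1, "film": 2, "zzyzx": 3}
toy_vectors = {"good": np.array([1.0, 0.0, 0.0]), "film": np.array([0.0, 1.0, 0.0])}

matrix = create_embedding_matrix(word_index, toy_vectors, d_model=3)
covered = sum(word in toy_vectors for word in word_index)
print(matrix.shape, f"coverage: {covered}/{len(word_index)}")
```

Words missing from the pretrained vocabulary keep an all-zero row, so checking the coverage ratio on the real vocabulary is a cheap sanity check before training.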

Here, I tokenize the reviews with the Keras Tokenizer, which builds a word-to-index vocabulary from the whole corpus.

In [23]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(df.review.values.tolist())
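
As a rough picture of what the tokenizer and the later padding step produce, here is a pure-Python imitation on a toy corpus. It is a simplified sketch under stated assumptions, not the Keras implementation: it splits on whitespace only (Keras also filters punctuation), and the helper names are made up. It does follow two Keras defaults that matter here: word indices start at 1 with 0 reserved for padding, and `pad_sequences` pads and truncates at the front of the sequence.

```python
from collections import Counter

def fit_word_index(texts):
    # The most frequent word gets index 1; index 0 is reserved for padding.
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    # Words not seen during fitting are dropped.
    return [[word_index[w] for w in text.lower().split() if w in word_index] for text in texts]

def pad_pre(sequences, maxlen):
    # Keep the last `maxlen` tokens and left-pad shorter sequences with zeros.
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

texts = ["a great great film", "a dull film"]   # toy corpus
word_index = fit_word_index(texts)
padded = pad_pre(texts_to_sequences(texts + ["an unseen great film"], word_index), maxlen=3)
print(word_index)
print(padded)
```

With MAX_LEN = 128 and front truncation, only the last 128 tokens of a long review reach the model.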

Let's train the model with FastText embedding.

In [25]:
if __name__ == '__main__':
    
    embedding_matrix = create_embedding_matrix(tokenizer.word_index, embedding_dict=fasttext_embedding, d_model=300)

    for fold in range(2):
   
        train_df = df[df.kfold != fold].reset_index(drop=True)
        valid_df = df[df.kfold == fold].reset_index(drop=True)
    
        xtrain = tokenizer.texts_to_sequences(train_df.review.values)
        xtest = tokenizer.texts_to_sequences(valid_df.review.values)
    
        xtrain = tf.keras.preprocessing.sequence.pad_sequences(xtrain, maxlen=MAX_LEN)     
        xtest = tf.keras.preprocessing.sequence.pad_sequences(xtest, maxlen=MAX_LEN)
    
        train_dataset = IMDBDataset(reviews=xtrain, targets=train_df.sentiment.values)
        train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers=0)
    
        valid_dataset = IMDBDataset(reviews=xtest, targets=valid_df.sentiment.values)
        valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = VALID_BATCH_SIZE, num_workers=0)
    
        model_fasttext = LSTM(embedding_matrix)
        model_fasttext.to(device)
    
        optimizer = torch.optim.Adam(model_fasttext.parameters(), lr=1e-3)
        scheduler = StepLR(optimizer, step_size=1, gamma=0.5)
        print('training model')
   
        for epoch in range(EPOCHS):
            train(train_data_loader, model_fasttext, optimizer, device)
            outputs, targets = evaluate(valid_data_loader, model_fasttext, device)
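            # The model returns raw logits, so this cutoff of 0.5 is applied to logits
            # (a probability of about 0.62); a cutoff of 0.0 would correspond to probability 0.5.
            outputs = np.array(outputs) >= 0.5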
            scheduler.step()
            accuracy = metrics.accuracy_score(targets, outputs)
            print(f'FOLD:{fold}, epoch: {epoch}, accuracy: {accuracy}')

training model
FOLD:0, epoch: 0, accuracy: 0.8315
FOLD:0, epoch: 1, accuracy: 0.8678
FOLD:0, epoch: 2, accuracy: 0.8658
FOLD:0, epoch: 3, accuracy: 0.8718
FOLD:0, epoch: 4, accuracy: 0.876
FOLD:0, epoch: 5, accuracy: 0.8743
FOLD:0, epoch: 6, accuracy: 0.8737
FOLD:0, epoch: 7, accuracy: 0.874
training model
FOLD:1, epoch: 0, accuracy: 0.837
FOLD:1, epoch: 1, accuracy: 0.8608
FOLD:1, epoch: 2, accuracy: 0.8559
FOLD:1, epoch: 3, accuracy: 0.8599
FOLD:1, epoch: 4, accuracy: 0.8675
FOLD:1, epoch: 5, accuracy: 0.8676
FOLD:1, epoch: 6, accuracy: 0.8671
FOLD:1, epoch: 7, accuracy: 0.8685
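
Because the training function uses BCEWithLogitsLoss, the model outputs raw logits rather than probabilities. The short sketch below, on made-up logit values, shows how a cutoff on logits relates to a cutoff on probabilities: a probability of 0.5 corresponds to a logit of 0, while a logit of 0.5 corresponds to a probability of roughly 0.62.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [-2.0, -0.1, 0.0, 0.3, 0.5, 2.0]   # toy raw model outputs

for z in logits:
    p = sigmoid(z)
    print(f"logit={z:+.1f}  prob={p:.3f}  prob>=0.5: {p >= 0.5}  logit>=0.5: {z >= 0.5}")

# Probability that corresponds to a logit of exactly 0.5.
prob_at_half = sigmoid(0.5)
```

In the toy list, the logit 0.3 is the case where the two cutoffs disagree: its probability is above 0.5, but the logit itself is below 0.5.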


Next, I train the same model with the 300-dimensional GloVe embeddings.

In [26]:
if __name__ == '__main__':
    
    embedding_matrix = create_embedding_matrix(tokenizer.word_index, embedding_dict=glove_embedding, d_model=300)

    for fold in range(2):
    
        train_df = df[df.kfold != fold].reset_index(drop=True)
        valid_df = df[df.kfold == fold].reset_index(drop=True)
    
        xtrain = tokenizer.texts_to_sequences(train_df.review.values)
        xtest = tokenizer.texts_to_sequences(valid_df.review.values)
    
        xtrain = tf.keras.preprocessing.sequence.pad_sequences(xtrain, maxlen=MAX_LEN)
        xtest = tf.keras.preprocessing.sequence.pad_sequences(xtest, maxlen=MAX_LEN)
    
        train_dataset = IMDBDataset(reviews=xtrain, targets=train_df.sentiment.values)
    
        train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size = TRAIN_BATCH_SIZE, num_workers=0)
        valid_dataset = IMDBDataset(reviews=xtest, targets=valid_df.sentiment.values)
        valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = VALID_BATCH_SIZE, num_workers=0)
    
        model_glove = LSTM(embedding_matrix)
        model_glove.to(device)
    
        optimizer = torch.optim.Adam(model_glove.parameters(), lr=1e-3)
        scheduler = StepLR(optimizer, step_size=1, gamma=0.5)
        print('training model')
   
        for epoch in range(EPOCHS):
        
            train(train_data_loader, model_glove, optimizer, device)
            outputs, targets = evaluate(valid_data_loader, model_glove, device)
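            # The model returns raw logits, so this cutoff of 0.5 is applied to logits
            # (a probability of about 0.62); a cutoff of 0.0 would correspond to probability 0.5.
            outputs = np.array(outputs) >= 0.5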
            scheduler.step()
            accuracy = metrics.accuracy_score(targets, outputs)
            print(f'FOLD:{fold}, epoch: {epoch}, accuracy: {accuracy}')

training model
FOLD:0, epoch: 0, accuracy: 0.8555
FOLD:0, epoch: 1, accuracy: 0.874
FOLD:0, epoch: 2, accuracy: 0.8755
FOLD:0, epoch: 3, accuracy: 0.8784
FOLD:0, epoch: 4, accuracy: 0.8841
FOLD:0, epoch: 5, accuracy: 0.884
FOLD:0, epoch: 6, accuracy: 0.8844
FOLD:0, epoch: 7, accuracy: 0.8839
training model
FOLD:1, epoch: 0, accuracy: 0.8517
FOLD:1, epoch: 1, accuracy: 0.8721
FOLD:1, epoch: 2, accuracy: 0.8731
FOLD:1, epoch: 3, accuracy: 0.8729
FOLD:1, epoch: 4, accuracy: 0.8816
FOLD:1, epoch: 5, accuracy: 0.8802
FOLD:1, epoch: 6, accuracy: 0.8802
FOLD:1, epoch: 7, accuracy: 0.8803
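
Both training runs halve the learning rate after every epoch (StepLR with step_size=1 and gamma=0.5, stepped once per epoch). Here is a small sketch of the resulting schedule written in closed form. The helper `step_lr` is a hypothetical function for illustration, not a PyTorch API.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    # Closed form of step decay: multiply by gamma once every `step_size` epochs.
    return initial_lr * gamma ** (epoch // step_size)

EPOCHS = 8
schedule = [step_lr(1e-3, gamma=0.5, step_size=1, epoch=e) for e in range(EPOCHS)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

By the last epoch the learning rate is 128 times smaller than at the start, which may partly explain why the accuracy barely moves after epoch 4 in both runs.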


Let's check how the trained GloVe model, which scored best above, classifies sample reviews typed in interactively (enter q or quit to stop).

In [27]:
def Interact_user_input(model):

    model.eval()
    
    while True:
        try:
            sentence = input('Review: ')
            if sentence in ['q','quit']: 
                break
            sentence = np.array([sentence])
            sentence_token = tokenizer.texts_to_sequences(sentence)
            sentence_token = tf.keras.preprocessing.sequence.pad_sequences(sentence_token, maxlen = MAX_LEN)
            sentence_train = torch.tensor(sentence_token, dtype = torch.long).to(device, dtype = torch.long)
            with torch.no_grad():               # inference only, no gradients needed
                predict = model(sentence_train)
            # predict is a raw logit, so 0.5 here is a logit cutoff (a probability of about 0.62)
            if predict.item() > 0.5:
                print('------> Positive')
            else:
                print('------> Negative')
        except KeyError:
            print('please enter again')
    
Interact_user_input(model_glove)

------> Positive
------> Positive
------> Positive
------> Positive
------> Positive
------> Positive
------> Positive
------> Positive
