
In [None]:
import numpy as np 
import pandas as pd 

In [None]:
!ls /kaggle/usr/lib/

In [None]:
import model_utilities as util

This notebook is an introduction to some NLP models that can be used for text classification.  
Three models are used here:  

**Gaussian Naive Bayes model  
One-dimensional convolutional model  
BiLSTM model**

My understanding of these models is also described below.

In [None]:
train_data_path='../input/movie-review-data-set-from-rotten-tomatoes/train.csv'
test_data_path='../input/movie-review-data-set-from-rotten-tomatoes/test.csv'
validation_data_path='../input/movie-review-data-set-from-rotten-tomatoes/validation.csv'

In [None]:
train_data=pd.read_csv(train_data_path)
test_data=pd.read_csv(test_data_path)
val_data=pd.read_csv(validation_data_path)

In [None]:
train_data.head()

In [None]:
train_data.label.value_counts().plot.barh()

In [None]:
import pprint
printer = pprint.PrettyPrinter(width =120,compact=False)
sample_text_list=list(train_data['text'])[0:10]
printer.pprint(sample_text_list)

*The data appears to be pre-processed already and shows no class imbalance, so only minimal preprocessing is applied and the focus stays on the models.*

In [None]:
%%capture
#Installations
!pip install contractions

In [None]:
import string
import contractions
import spacy
import matplotlib.pyplot as plt

In [None]:
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words
punct_string=string.punctuation.replace('\'','')

In [None]:
punct_dict={}
for punct in punct_string:
    punct_dict[ord(punct[0])]=' ' 

def remove_punct(punct_dict,text):
    return text.translate(punct_dict)  
    
    
def preprocess(text,punct_dict=punct_dict):
    text=" ".join([contractions.fix(token.strip()) for token in text.split()])
    text=remove_punct(punct_dict,text).replace("  "," ")   
    text=" ".join([token for token in text.split() if (token not in stopwords)])
    return text.strip() 
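
A minimal sketch of the punctuation step on its own, using only the standard library. The sample sentence is made up, and `str.maketrans` is used here as a shorter way to build the same kind of table as `punct_dict`:

```python
import string

# Keep apostrophes so words such as "doesn't" stay intact, as in the cell above.
punct_string = string.punctuation.replace("'", "")

# Map every remaining punctuation character to a single space.
table = str.maketrans(punct_string, " " * len(punct_string))

sample = "well-made, but (sadly) it doesn't stick!"  # hypothetical review text

# translate() swaps the characters; split/join collapses the repeated spaces.
cleaned = " ".join(sample.translate(table).split())
print(cleaned)  # well made but sadly it doesn't stick
```

Replacing punctuation with a space, rather than deleting it, keeps hyphenated words such as "well-made" from being fused into one token.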

In [None]:
train_data["text"]=train_data["text"].apply(preprocess)
test_data["text"]=test_data["text"].apply(preprocess)
val_data["text"]=val_data["text"].apply(preprocess)

In [None]:
get_len=lambda x: len(x.split())  #number of tokens in a review
len_frame=pd.DataFrame()
len_frame['len']=train_data['text'].apply(get_len)

In [None]:
len_frame['len'].quantile(q=[0.1,0.9])

In [None]:
train_data=train_data[len_frame['len']>0]

## WordCloud

In [None]:
from wordcloud import WordCloud
from wordcloud import STOPWORDS

In [None]:
#Positive and negative reviews
#Use a separate name so the spaCy `stopwords` set used by preprocess() is not overwritten
wc_stopwords=list(STOPWORDS)+["movie","film","make","one","makes","story","time","character"]

pos_index=train_data[train_data['label']==1]
neg_index=train_data[train_data['label']==0]

pos_reviews=list(pos_index['text'])
neg_reviews=list(neg_index['text'])

#Join the reviews into one string; str(list) would also pass brackets and quotes to WordCloud
pos_wordcloud = WordCloud(width=800, height=500,stopwords=wc_stopwords, background_color="white").generate(" ".join(pos_reviews))
neg_wordcloud=WordCloud(width=800, height=500,stopwords=wc_stopwords, background_color="black").generate(" ".join(neg_reviews))

In [None]:
plt.figure(figsize=[5,5])
plt.title("Positive Reviews")
plt.imshow(pos_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
plt.figure(figsize=[5,5])
plt.title("Negative Reviews")
plt.imshow(neg_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

## Vocabulary, Dataset, DataLoader

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
seed=0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

In [None]:
if torch.cuda.is_available():
    device=torch.device("cuda")
else:
    device=torch.device("cpu")
    
print(device)

In [None]:
train_text_list=list(train_data['text'])
vocab_obj=util.Vocabulary(train_text_list)
vocab_obj.make_token_dicts()

# **Models**

# Gaussian Naive Bayes Model

Gaussian Naive Bayes is the variant of the Naive Bayes classifier in which each feature, within each class, is assumed to be drawn from its own Gaussian distribution. The likelihood term used to compute the final conditional probabilities is calculated from these per-class distributions. The TF-IDF vectoriser converts the text into numerical feature values before they are passed to the Gaussian Naive Bayes classifier.
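
To make the idea concrete, here is a small from-scratch sketch in plain Python. The toy samples, the helper names (`fit`, `log_gaussian`, `predict`) and the `eps` variance floor are my own assumptions for illustration; scikit-learn's `GaussianNB` handles variance smoothing in its own way.

```python
import math

# Hypothetical toy data: two features per sample, two classes.
samples = {
    0: [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    1: [[3.0, 0.5], [3.2, 0.7], [2.8, 0.3]],
}

def fit(samples, eps=1e-9):
    """Estimate the prior and the per-feature mean/variance of every class."""
    total = sum(len(rows) for rows in samples.values())
    params = {}
    for label, rows in samples.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps is an assumed small floor that avoids a zero variance.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        params[label] = (n / total, means, variances)
    return params

def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(params, x):
    """Pick the class with the largest log prior + summed log likelihood."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        likelihood = sum(log_gaussian(xi, m, v)
                         for xi, m, v in zip(x, means, variances))
        scores[label] = math.log(prior) + likelihood
    return max(scores, key=scores.get)

params = fit(samples)
print(predict(params, [1.1, 2.0]), predict(params, [3.1, 0.4]))
```

The "naive" part is the sum over features in `predict`: each feature contributes its own one-dimensional Gaussian log likelihood, independently of the others.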

In [None]:
train_data.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

In [None]:
train_x=list(train_data['text'])
train_y=list(train_data['label'])

val_x=list(val_data['text'])
val_y=list(val_data['label'])

test_x=list(test_data['text'])
test_y=list(test_data['label'])

In [None]:
tfidf_vectoriser = TfidfVectorizer(max_features = len(vocab_obj)) 

train_vectors = tfidf_vectoriser.fit_transform(train_x).toarray()
train_y=np.array(train_y)

val_vectors = tfidf_vectoriser.transform(val_x).toarray()
val_y=np.array(val_y)

test_vectors = tfidf_vectoriser.transform(test_x).toarray()
test_y=np.array(test_y)
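
For intuition about the numbers in these vectors, the sketch below recomputes TF-IDF by hand on a made-up three-document corpus. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf, L2 normalisation) and whitespace tokenisation; the corpus and the `tfidf` helper are illustrative only.

```python
import math
from collections import Counter

docs = ["good film", "bad film", "good good story"]  # hypothetical corpus
tokenised = [d.split() for d in docs]
vocab = sorted({t for doc in tokenised for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenised for t in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf(doc_tokens):
    """Term count times idf, scaled to unit L2 norm."""
    counts = Counter(doc_tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf(doc) for doc in tokenised]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Terms that occur in fewer documents ("bad", "story") get a larger idf than terms shared across documents ("film", "good"), which is what lets the classifier weight distinctive words more heavily.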

In [None]:
nb_model=GaussianNB()
nb_model.fit(train_vectors,train_y)

In [None]:
train_score=nb_model.score(train_vectors,train_y)
validation_score=nb_model.score(val_vectors,val_y)
test_score=nb_model.score(test_vectors,test_y)

In [None]:
print(f"Train Score: {train_score*100.0}, Validation Score: {validation_score*100.0}, Test Score: {test_score*100.0}")

# 1D Convolution Network

Here, `output_size` filters, each of shape (feature_size, kernel_size), are created for each convolution layer. The convolution slides in one direction only (along the sequence length) instead of the more common 2-D way. The shapes of the layer's parameter tensors are printed below.

In [None]:
#Sample
input_feature_size=128
output_size=32
kernel_size=7

conv_filter = nn.Conv1d(input_feature_size, output_size, kernel_size=kernel_size, padding="same")

for param in conv_filter.parameters():
    print(param.shape)  
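
For the sample layer above, the weight tensor has shape (32, 128, 7) and the bias has shape (32,). The sketch below shows, in plain Python, what a single one of those filters computes. It is a toy version with two input features and an odd kernel size; `conv1d_same` is a hypothetical helper, not part of PyTorch.

```python
def conv1d_same(x, weight, bias=0.0):
    """Slide one filter along the sequence with zero 'same' padding.

    x:      input of shape [in_features][length]
    weight: one filter of shape [in_features][kernel_size], odd kernel_size
    """
    in_features, length = len(x), len(x[0])
    k = len(weight[0])
    pad = (k - 1) // 2
    padded = [[0.0] * pad + row + [0.0] * pad for row in x]
    out = []
    for t in range(length):            # the only direction of movement
        acc = bias
        for c in range(in_features):   # the filter covers every feature
            for j in range(k):
                acc += weight[c][j] * padded[c][t + j]
        out.append(acc)
    return out

x = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [0.0, 1.0, 0.0, 1.0, 0.0]]
weight = [[1.0, 0.0, -1.0],
          [0.5, 0.5, 0.5]]
y = conv1d_same(x, weight)
print(y)  # [-1.5, -1.5, -1.0, -1.5, 4.5]
```

Each filter produces one output channel of the same length as the input, so `output_size` filters give an output of shape [output_size, length] per sequence.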


In [None]:
class ConvolutionModel(nn.Module):
    def __init__(self,vocab_obj,embed_dim):
        super(ConvolutionModel,self).__init__()
        self.token_count=len(vocab_obj)
        self.embed_dim=embed_dim       
        self.embedding=nn.Embedding(len(vocab_obj),embed_dim)
        self.conv1=nn.Sequential(nn.Conv1d(embed_dim,512,kernel_size=4,padding='same'),nn.BatchNorm1d(512),nn.ReLU(),nn.Dropout1d(0.5))
        self.conv2=nn.Sequential(nn.Conv1d(512,256,kernel_size=8,padding='same'),nn.BatchNorm1d(256),nn.ReLU(),nn.Dropout1d(0.3))
        self.conv3=nn.Sequential(nn.Conv1d(256,64,kernel_size=8,padding='same'),nn.BatchNorm1d(64),nn.ReLU(),nn.Dropout1d(0.1))
        self.linear=nn.Sequential(nn.Linear(64,16),nn.ReLU(),nn.Linear(16,2))
        
        
    def forward(self,input):  #input shape: [batch_size,token_length]
        x=self.embedding(input)    #shape: [batch_size,token_length,embedding_length]
        x=x.permute(0,2,1)  #swap axes (a reshape would scramble values); shape: [batch_size,embedding_length,token_length]
        x=self.conv1(x)
        x=self.conv2(x)
        x=self.conv3(x)
        x,max_indices=x.max(dim=-1)
        x=self.linear(x)
        return x
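
The convolution layers expect input of shape [batch_size, features, tokens], so the last two axes of the embedding output have to be swapped. Below is a small sketch with plain lists (toy values, not the model's tensors) of how swapping axes differs from regrouping the same values with a reshape, followed by the max over the token axis that the model applies after the last convolution:

```python
# One sequence of 3 tokens with 2-dimensional embeddings: [tokens][features].
embedded = [[1, 10],
            [2, 20],
            [3, 30]]

# Transpose: row i now holds feature i across all tokens.
transposed = [list(col) for col in zip(*embedded)]

# Reshape: keeps the flat order of the values and only regroups them.
flat = [v for row in embedded for v in row]
reshaped = [flat[0:3], flat[3:6]]

# Max over the token axis gives one value per feature channel.
pooled = [max(channel) for channel in transposed]

print(transposed)  # [[1, 2, 3], [10, 20, 30]]
print(reshaped)    # [[1, 10, 2], [20, 3, 30]]
print(pooled)      # [3, 30]
```

Both results have shape [2][3], but only the transposed version keeps each feature in its own row, which is what a convolution along the sequence needs.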


In [None]:
def InitialliseModel(vocab_obj,embed_dim,device,init_func="xavier"):
    model=ConvolutionModel(vocab_obj,embed_dim).to(device)
    if init_func=="xavier":
        for param in model.parameters():
            if(len(param.shape)>=2):
                nn.init.xavier_uniform_(param, gain=nn.init.calculate_gain('relu'))                
    elif init_func=="kaiming":
        for param in model.parameters():
            if(len(param.shape)>=2):
                nn.init.kaiming_normal_(param, mode='fan_out', nonlinearity='relu')         
    return model
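
As a rough check of what the two initialisers produce, the arithmetic below works out the sampling range for the first convolution weight, taking `embed_dim=512` as used in this notebook. The helper `conv1d_fans` is hypothetical; the formulas are the ones documented for `nn.init.xavier_uniform_` and `nn.init.kaiming_normal_`.

```python
import math

def conv1d_fans(out_channels, in_channels, kernel_size):
    """fan_in and fan_out of a Conv1d weight of shape (out, in, kernel)."""
    return in_channels * kernel_size, out_channels * kernel_size

relu_gain = math.sqrt(2.0)  # the gain calculate_gain('relu') returns

# Conv1d(512, 512, kernel_size=4), the first convolution of the model.
fan_in, fan_out = conv1d_fans(out_channels=512, in_channels=512, kernel_size=4)

# Xavier uniform: values drawn from U(-bound, bound).
xavier_bound = relu_gain * math.sqrt(6.0 / (fan_in + fan_out))

# Kaiming normal with mode='fan_out': values drawn from N(0, std^2).
kaiming_std = relu_gain / math.sqrt(fan_out)

print(fan_in, fan_out)            # 2048 2048
print(round(xavier_bound, 5))     # 0.05413
print(kaiming_std)                # 0.03125
```

Both schemes shrink the initial weights as the layer gets wider, which keeps the scale of the activations roughly stable from layer to layer.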

In [None]:
def print_model(model):
    for child in model.children():
        print(f"Module Name: {child}")
    return   

In [None]:
embed_dim=512  #300
init_func="kaiming"
model=InitialliseModel(vocab_obj,embed_dim,device,init_func)
print_model(model)

In [None]:
batch_size=32
epochs=10

cnn_train_dataset,cnn_train_loader=util.get_loader(train_data,vocab_obj,batch_size,max_len=100)
cnn_val_dataset,cnn_val_loader=util.get_loader(val_data,vocab_obj,batch_size,max_len=100)
cnn_test_dataset,cnn_test_loader=util.get_loader(test_data,vocab_obj,batch_size,max_len=100)

loss_function=nn.CrossEntropyLoss()
optimiser=torch.optim.Adam(model.parameters())

train_batches=len(cnn_train_dataset)//batch_size 
val_batches=len(cnn_val_dataset)//batch_size
test_batches=len(cnn_test_dataset)//batch_size

print(f"Train Batchcount: {train_batches} ,Test Batchcount: {test_batches} , Validation Batchcount: {val_batches} ")
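
For reference, this is what `nn.CrossEntropyLoss` computes on the two raw output scores (logits) of the model, written out in plain Python. The logits and targets are made-up values, and `cross_entropy` is an illustrative helper; the averaging over the batch matches the default `reduction='mean'`.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

batch_logits = [[2.0, 0.0], [0.5, 0.5]]  # hypothetical model outputs
batch_targets = [0, 1]

losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
mean_loss = sum(losses) / len(losses)
print([round(v, 4) for v in losses], round(mean_loss, 4))
```

A confident correct prediction (first sample) gives a small loss, while equal scores for both classes (second sample) give ln 2, about 0.693.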

In [None]:
util.training(model,cnn_train_loader,train_batches,cnn_val_loader,val_batches,loss_function,optimiser,epochs,device,if_clip=True)

In [None]:
util.testing(model,cnn_test_loader,test_batches,loss_function,device)

# Bi LSTM

The outputs of the multi-layer LSTM from both directions, forward and backward, are used here (since the complete sentence is available), and only the output at the last word is used for the final result.
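
To see what the output at one position of a bidirectional layer contains, here is a toy sketch that swaps the LSTM cell for a very simple recurrence (`h_t = decay * h_{t-1} + x_t`). The function `run`, the decay value and the token values are assumptions for illustration only.

```python
def run(xs, decay=0.5):
    """Toy recurrence h_t = decay * h_(t-1) + x_t; returns every state."""
    h, states = 0.0, []
    for x in xs:
        h = decay * h + x
        states.append(h)
    return states

tokens = [1.0, 2.0, 4.0]             # stand-ins for embedded tokens

forward = run(tokens)                 # reads left to right
backward = run(tokens[::-1])[::-1]    # reads right to left, re-aligned

# Output at each position: forward and backward states side by side.
outputs = list(zip(forward, backward))
last_step = outputs[-1]

print(outputs)    # [(1.0, 3.0), (2.5, 4.0), (5.25, 4.0)]
print(last_step)  # (5.25, 4.0)
```

At the last position the forward half has read the whole sentence, while the backward half has only read the last token (4.0 here). The backward direction's summary of the full sentence sits at the first position, which is worth keeping in mind when only the last time step is passed on.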

In [None]:
class BiLSTMModel(nn.Module):
    def __init__(self,input_size,hidden_size,num_layers,batch_first,dropout,bi_directional,proj_size,n_classes,vocab_obj,device):
        super(BiLSTMModel,self).__init__()
        
        self.num_layers=num_layers
        self.proj_size=proj_size
        self.hidden_size=hidden_size 
        self.device=device
        self.d=2 if bi_directional else 1        
        self.embedding=nn.Embedding(len(vocab_obj),input_size)
        self.bi_lstm=nn.LSTM(input_size=input_size,hidden_size=hidden_size,num_layers=num_layers,batch_first=batch_first,bidirectional=bi_directional,proj_size=proj_size,dropout=dropout)
        self.linear=nn.Linear(self.d*proj_size,n_classes)
        
        
    def forward(self,input):        
        batch_size,seq_length=input.shape
        x=self.embedding(input)        
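        #Use the device stored on the module instead of relying on a global variable
        hidden_init = torch.zeros(self.d*self.num_layers,batch_size,self.proj_size,device=self.device)  #hidden state uses proj_size (proj_size>0)
        cell_init = torch.zeros(self.d*self.num_layers,batch_size,self.hidden_size,device=self.device)  #cell state keeps hidden_size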
        x, (hidden_final, cell_final) = self.bi_lstm(x, (hidden_init, cell_init))
        x=x[:,-1,:]
        output=self.linear(x)
        return output
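
The tensor shapes in this model are easy to mix up once `proj_size` is involved, so here is the bookkeeping as plain arithmetic. `lstm_shapes` is a hypothetical helper that follows the shapes documented for `nn.LSTM` with `batch_first=True`; the values plugged in are the ones used in this notebook (batch size 32, sequences of length 100).

```python
def lstm_shapes(batch_size, seq_len, hidden_size, num_layers,
                bidirectional, proj_size=0):
    """Shapes of the output, final hidden state and final cell state."""
    d = 2 if bidirectional else 1
    # With a projection, the hidden state is projected down to proj_size.
    h_out = proj_size if proj_size > 0 else hidden_size
    return {
        "output": (batch_size, seq_len, d * h_out),
        "h_n": (d * num_layers, batch_size, h_out),
        "c_n": (d * num_layers, batch_size, hidden_size),
    }

shapes = lstm_shapes(batch_size=32, seq_len=100, hidden_size=256,
                     num_layers=2, bidirectional=True, proj_size=2)
for name, shape in shapes.items():
    print(name, shape)
# output (32, 100, 4)
# h_n (4, 32, 2)
# c_n (4, 32, 256)
```

This is why the final linear layer takes `d*proj_size` inputs: with `proj_size=2` and two directions, each time step of the output carries only 4 values.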

In [None]:
rnn_params={
    "input_size":128,
    "hidden_size":256,
    "num_layers":2,    
    "batch_first":True,
    "dropout":0.1,
    "bi_directional":True,
    "proj_size":2,
    "n_classes":2,
    "vocab_obj":vocab_obj,
    "device":device
            }

In [None]:
model=BiLSTMModel(**rnn_params).to(device)
print_model(model)

In [None]:
batch_size=32
epochs=6

rnn_train_dataset,rnn_train_loader=util.get_loader(train_data,vocab_obj,batch_size,max_len=100)
rnn_val_dataset,rnn_val_loader=util.get_loader(val_data,vocab_obj,batch_size,max_len=100)
rnn_test_dataset,rnn_test_loader=util.get_loader(test_data,vocab_obj,batch_size,max_len=100)

loss_function=nn.CrossEntropyLoss()
optimiser=torch.optim.Adam(model.parameters())

train_batches=len(rnn_train_dataset)//batch_size 
val_batches=len(rnn_val_dataset)//batch_size
test_batches=len(rnn_test_dataset)//batch_size

print(f"Train Batchcount: {train_batches} ,Test Batchcount: {test_batches} , Validation Batchcount: {val_batches} ")

In [None]:
util.training(model,rnn_train_loader,train_batches,rnn_val_loader,val_batches,loss_function,optimiser,epochs,device,if_clip=True)

In [None]:
util.testing(model,rnn_test_loader,test_batches,loss_function,device)