# LSTM-Based Network Architecture for Binary Classification
(IMDB Movie Dataset)

![image.png](attachment:039fc0be-d353-4c3e-9475-f1e9b2234ae1.png)

### The steps involved are:
1. Input Data
1. Tokenization
1. Sequence Batching and Data Loaders
1. Embedding Layer
1. LSTM Layer
1. Fully Connected (FC) layer
1. Sigmoid
1. Final Output

In [None]:
# Extra libraries used in this project that require installation

# 1. Beautiful Soup
!pip install bs4

In [None]:
import torch

import numpy as np 
import pandas as pd 
import re
from collections import Counter

from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

df.shape

In [None]:
df['review_len'] = df['review'].str.len()
df['review_num_words'] = df['review'].str.split().str.len()
df

![image.png](attachment:3967484c-39fd-4fe0-856f-66f3cb58d67c.png)

## Look at a few review samples to understand what kind of data is present

In [None]:
df['review'][0]

In [None]:
df['review'][1]

In [None]:
df['review'][2]

## We will set up the pipeline for only 1000 reviews. 

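Tip: It is generally a good idea to work with a smaller dataset first, because the code runs faster and is easier to debug. However, some issues may only surface when you use the complete data, so you have to be mindful. The subsetting lines in the cell below are commented out; uncomment them to restrict the run to the first 1000 reviews.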

In [None]:
#df = df.iloc[0:1000,:]
#df.shape

## We observe that the review text requires some cleaning before we use it further.
### The cleaning steps required are:
1. Removing HTML tags
1. Converting to lower case
1. Removing punctuation and other special characters

In [None]:
def pre_process_text(text):
    
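    # Remove HTML tags; the space separator keeps words on either side of a tag
    # (e.g. "<br />") from being glued together
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text(separator=" ")
    
    # Convert to lower case
    text = text.lower()
    
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)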
    return text
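
As a quick check of what the regular expression above keeps and drops, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption, not taken from the dataset, and the HTML-stripping step is left out). Since `\w` matches letters, digits, and underscores, those survive, while apostrophes and other punctuation are removed.

```python
import re

# Hypothetical sample text, not from the IMDB data
sample = "I didn't LOVE it... 10/10 for the sound_track!"

# Same lower-casing and regex steps as pre_process_text, minus the HTML stripping
lowered = sample.lower()
cleaned = re.sub(r'[^\w\s]', '', lowered)

print(cleaned)
```

Note that "didn't" becomes "didnt" and "10/10" becomes "1010", which may or may not be what you want for a given task.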

In [None]:
df['review'] = df['review'].apply(pre_process_text)

In [None]:
df['review'][1]

In [None]:
df['review_len'] = df['review'].str.len()
df['review_num_words'] = df['review'].str.split().str.len()
df

### Analyze Review Lengths

In [None]:
fig, ax = plt.subplots(figsize = (14,6))
plt.hist(df['review_len'], bins=40 )
plt.show()

In [None]:
df['review_len'].describe(percentiles = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100])/100)

### Analyze Number of words in reviews

In [None]:
fig, ax = plt.subplots(figsize = (14,6))
plt.hist(df['review_num_words'], bins=40 )
plt.show()

## This is what our data looks like

![image.png](attachment:38c5265a-9a1f-4676-9bfa-18eca3e7007f.png)

## Word Embeddings:

* In NLP, your features are words. But how should you represent a word in a computer?
    - ASCII values (these tell you what the word is, but nothing about what the word means)
    - One-hot vector (one dimension per vocabulary word, so the space is massive)
* Fundamental linguistic assumption: words appearing in similar contexts are related to each other semantically.
* How could we actually encode semantic similarity in words?
    - There can be thousands of semantic attributes that might be relevant. How on earth can you set the values of these attributes by hand?

### Here comes the central idea of deep learning
* You let the neural network learn these feature representations, i.e. the word embeddings are parameters of the model and are updated during training.
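
To make the contrast between one-hot vectors and dense embeddings concrete, the sketch below uses NumPy with a toy vocabulary of 5 words and 3 embedding dimensions (both sizes are assumptions chosen for illustration). It shows that multiplying a one-hot vector by an embedding matrix is the same as looking up one row of that matrix. In a real model the matrix would hold trainable parameters rather than fixed random numbers.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)

# Embedding matrix: one dense row per word in the vocabulary
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ embedding_matrix     # one-hot vector times matrix
via_lookup = embedding_matrix[word_index]   # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

This is why an embedding layer can be implemented as a lookup table: it never needs to build the one-hot vectors explicitly.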


Reading reference:
1. [PyTorch Documentation](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html)
1. [What the heck is word embedding?](https://towardsdatascience.com/what-the-heck-is-word-embedding-b30f67f01c81)

## Tokenize the reviews

### Create word to index mapping dictionary

We will create an index mapping dictionary such that frequently occurring words are assigned lower indexes. One of the most common ways of doing this is to use the `Counter` class from the `collections` module.
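
Before running this on the full dataset, the idea can be sketched on a toy corpus. The two reviews below are made up, and starting the indexes at 1 (so that 0 stays free for padding) is an assumption of this sketch.

```python
from collections import Counter

# Toy corpus, made up for illustration
toy_reviews = ["good movie good cast", "bad movie"]

words = " ".join(toy_reviews).split()
counts = Counter(words)

# most_common() with no argument returns every word, most frequent first
sorted_counts = counts.most_common()

# Start at 1 so that index 0 stays free for padding (an assumption of this sketch)
toy_vocab_to_int = {word: i for i, (word, _) in enumerate(sorted_counts, start=1)}

# Replace every word by its index
encoded = [[toy_vocab_to_int[w] for w in review.split()] for review in toy_reviews]

print(toy_vocab_to_int)
print(encoded)
```

Words with equal counts keep the order in which they were first seen, so "good" (seen first) gets a lower index than "movie" here.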


In [None]:
# Create List of reviews
reviews_list = list(df['review'])
len(reviews_list)

# From list of reviews, we need to get to list of words
all_words = ' '.join(reviews_list)
list_words = all_words.split()

# Count all the words using Counter Method
count_words = Counter(list_words)
total_words = len(list_words)
sorted_words = count_words.most_common(total_words)

# Create vocab_to_int dictionary; indexes start at 1 because 0 is used for padding later
vocab_to_int = {w:i for i, (w,c) in enumerate(sorted_words, 1)}

# tokenize reviews
reviews_int = []
for review in reviews_list:
    r = [vocab_to_int[w] for w in review.split()]
    reviews_int.append(r)
    
# Encode Labels
labels_list = list(df['sentiment'])
labels_int = [1 if label =='positive' else 0 for label in labels_list]
labels_int = np.array(labels_int)

In [None]:
print(total_words)

In [None]:
sorted_words[0:10]

In [None]:
dict(list(vocab_to_int.items())[0:10])

In [None]:
print (reviews_int[0:3])

In [None]:
print(labels_int[0:3])

# After tokenization, this is what our data looks like
![image.png](attachment:0f30d1f3-6e0f-42c3-afa1-c976bcfa17a0.png)

# Define the sequence length

### The sequence length is the same as the number of time steps for the LSTM layer.

* Too long reviews --> Truncate
* Too short reviews --> Delete
* Remaining reviews --> Padding
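
The truncate and pad rules above can be sketched on plain Python lists. The helper name `pad_or_truncate` and the token values are hypothetical; padding on the left with 0 is assumed, and the "delete" rule is not shown.

```python
def pad_or_truncate(tokens, seq_length, pad_value=0):
    """Left-pad with pad_value, or keep only the first seq_length tokens."""
    if len(tokens) >= seq_length:
        return tokens[:seq_length]
    return [pad_value] * (seq_length - len(tokens)) + tokens

short_review = [7, 3, 9]              # shorter than seq_length: gets padded
long_review = [5, 1, 8, 2, 6, 4, 3]   # longer than seq_length: gets truncated

padded = pad_or_truncate(short_review, 5)
truncated = pad_or_truncate(long_review, 5)

print(padded)
print(truncated)
```

Padding on the left places the real tokens at the end of the sequence, so the last time step of the LSTM always sees actual review content.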

In [None]:
def pad_features(reviews_int, seq_length):
    ''' Return features of reviews_int, where each review is left-padded with 0's or truncated to the input seq_length.
    '''
    features = np.zeros((len(reviews_int), seq_length), dtype = int)
    
    for i, review in enumerate(reviews_int):
        review_len = len(review)
        
        if review_len <= seq_length:
            # Left-pad short reviews with integer zeros
            zeroes = [0] * (seq_length - review_len)
            new = zeroes + review
        else:
            # Keep only the first seq_length tokens of long reviews
            new = review[0:seq_length]
        
        features[i,:] = np.array(new)
    
    return features

In [None]:
seq_length = 2500
features = pad_features(reviews_int, seq_length)

In [None]:
features[0:3]

# Now the data looks like this
![image.png](attachment:789ad25d-73f6-4ff0-81a5-16778a3d2c9d.png)

# Training / Validation / Test Split

train = 80% | valid = 10% | test = 10%
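
As a sanity check on these fractions, the sketch below computes the split sizes from the sample count. The helper `split_sizes` is hypothetical, and the figure of 50,000 reviews is an assumption based on the dataset's file name.

```python
def split_sizes(n_samples, train_frac=0.8, valid_share_of_rest=0.5):
    """Return (train, valid, test) sizes for an 80/10/10 style split."""
    n_train = int(train_frac * n_samples)
    n_rest = n_samples - n_train
    n_valid = int(n_rest * valid_share_of_rest)
    n_test = n_rest - n_valid
    return n_train, n_valid, n_test

full_sizes = split_sizes(50_000)   # assumed size of the full dataset
small_sizes = split_sizes(1_000)   # the 1000-review subset mentioned earlier

print(full_sizes)
print(small_sizes)
```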


In [None]:
len_feat = len(features)
split_frac = 0.8
train_x = features[0:int(split_frac*len_feat)]
train_y = labels_int[0:int(split_frac*len_feat)]

remaining_x = features[int(split_frac*len_feat):]
remaining_y = labels_int[int(split_frac*len_feat):]

valid_x = remaining_x[0:int(len(remaining_x)*0.5)]
valid_y = remaining_y[0:int(len(remaining_y)*0.5)]

test_x = remaining_x[int(len(remaining_x)*0.5):]
test_y = remaining_y[int(len(remaining_y)*0.5):]

In [None]:
print(f'Shape of Train X, {len(train_x)}, Shape of Train y, {len(train_y)}')
print(f'Shape of Valid X, {len(valid_x)}, Shape of Valid y, {len(valid_y)}')
print(f'Shape of Test X, {len(test_x)}, Shape of Test y, {len(test_y)}')

# Data Loaders and Batching

In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

# Visualize one batch of data 

In [None]:
# Obtain one batch of training data
dataiter = iter(train_loader)
x, y = next(dataiter)  # iterators are advanced with the built-in next()
print('Sample input size: ', x.size()) # batch_size, seq_length
print('Sample input: \n', x)


In [None]:
print('Sample label size: ', y.size()) # batch_size
print('Sample label: \n', y)

![image.png](attachment:3769ed4b-38e0-4b71-912b-14918a5971d0.png)

# Add Embedding Layer

In [None]:
from torch import nn

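# Vocabulary size: number of distinct words, plus 1 for the padding index 0
# (len(all_words) would count the characters of the joined review string instead)
vocab_size = len(vocab_to_int) + 1
embedding_dim = 30

embeds = nn.Embedding(vocab_size, embedding_dim)
print ('Embedding layer is ', embeds)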


In [None]:
print ('Embedding layer weights ', embeds.weight.shape)

In [None]:
embeds_out = embeds(x)

print ('Embedding layer output shape', embeds_out.shape)
print ('Embedding layer output ', embeds_out)

![image.png](attachment:47ab4e66-68b5-4ab9-ad4a-8e90f504c514.png)
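
The shape change through the embedding layer can be checked on a toy batch. The sizes and token values below are made up, and `padding_idx=0` is an optional choice that this notebook does not use: it keeps the padding vector at zero and excludes it from gradient updates.

```python
import torch
from torch import nn

torch.manual_seed(0)

toy_vocab_size = 10      # hypothetical: 9 real words plus index 0 for padding
toy_embedding_dim = 4

# padding_idx=0 is an assumption of this sketch, not part of the notebook's model
toy_embeds = nn.Embedding(toy_vocab_size, toy_embedding_dim, padding_idx=0)

# Two left-padded "reviews" of length 5
toy_batch = torch.tensor([[0, 0, 3, 1, 7],
                          [0, 4, 2, 2, 9]])

toy_out = toy_embeds(toy_batch)

print(toy_out.shape)   # (batch_size, seq_length, embedding_dim)
print(toy_out[0, 0])   # vector at a padding position
```

Each integer token is replaced by a vector of length `embedding_dim`, so a `(batch_size, seq_length)` input becomes a `(batch_size, seq_length, embedding_dim)` output.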

# Add LSTM Layer

In [None]:
# Passing hidden=None lets PyTorch initialize the hidden and cell states to zeros
hidden=None
hidden_units = 512

lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_units, num_layers=1, batch_first=True)
lstm_out, h = lstm(embeds_out, hidden)
print ('LSTM layer output shape', lstm_out.shape)
print ('LSTM layer output ', lstm_out)

![image.png](attachment:46363838-8f2f-40fa-a70a-ce1b5cecf030.png)
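
A small sketch of the shapes an LSTM returns, using toy sizes that are assumptions chosen so the tensors stay readable. For a single-layer, unidirectional LSTM, the output at the last time step should equal the final hidden state, which the last line checks.

```python
import torch
from torch import nn

torch.manual_seed(0)

toy_batch_size, toy_seq_length = 2, 5
toy_embedding_dim, toy_hidden_units = 4, 3

toy_lstm = nn.LSTM(input_size=toy_embedding_dim, hidden_size=toy_hidden_units,
                   num_layers=1, batch_first=True)

# Stand-in for the embedding layer output
toy_input = torch.randn(toy_batch_size, toy_seq_length, toy_embedding_dim)

# Omitting the initial state is equivalent to starting from all-zero states
toy_lstm_out, (h_n, c_n) = toy_lstm(toy_input)

print(toy_lstm_out.shape)  # (batch_size, seq_length, hidden_units)
print(h_n.shape)           # (num_layers, batch_size, hidden_units)

# Output at the last time step versus the final hidden state
last_step = toy_lstm_out[:, -1, :]
print(torch.allclose(last_step, h_n[-1]))
```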


# Fully Connected Layer

In [None]:
fc = nn.Linear(in_features=hidden_units, out_features=1)

fc_out = fc(lstm_out.contiguous().view(-1, hidden_units))

print ('FC layer output shape', fc_out.shape)
print ('FC layer output ', fc_out)

In [None]:
# Expected number of rows in fc_out: batch_size * seq_length
50*2500

![image.png](attachment:06607270-93d6-4964-b3a5-b9ad77d7ff4d.png)
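
Flattening the LSTM output with `view(-1, hidden_units)` stacks all time steps of all reviews into rows. The sketch below uses NumPy's `reshape` as a stand-in for `view` on a contiguous tensor (both use row-major order), with toy sizes that are assumptions, to show which row belongs to which review and time step.

```python
import numpy as np

toy_batch_size, toy_seq_length, toy_hidden_units = 2, 3, 4

# Stand-in for lstm_out, filled with increasing integers so rows are easy to trace
toy_lstm_out = np.arange(
    toy_batch_size * toy_seq_length * toy_hidden_units
).reshape(toy_batch_size, toy_seq_length, toy_hidden_units)

# Same flattening as lstm_out.contiguous().view(-1, hidden_units)
flat = toy_lstm_out.reshape(-1, toy_hidden_units)
print(flat.shape)  # (batch_size * seq_length, hidden_units)

# Row b * seq_length + t holds review b at time step t
b, t = 1, 2
row = b * toy_seq_length + t
print(np.array_equal(flat[row], toy_lstm_out[b, t]))

# With the notebook's batch_size=50 and seq_length=2500, the row count is:
n_rows = 50 * 2500
print(n_rows)
```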

# Sigmoid Activation Layer

In [None]:
sigm = nn.Sigmoid()
sigm_out = sigm(fc_out)
print ('Sigmoid layer output shape', sigm_out.shape)
print ('Sigmoid layer output ', sigm_out)

![image.png](attachment:1f795684-7084-46c1-8c91-dfb23ec10e99.png)
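
The sigmoid squashes any real-valued score into the range (0, 1), so the output can be read as the probability of the positive class. A minimal sketch in plain Python follows; the 0.5 decision threshold is a common convention assumed here, not something this notebook specifies.

```python
import math

def sigmoid(z):
    """Logistic function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    # Assumed convention: probabilities of at least 0.5 count as positive
    label = "positive" if p >= 0.5 else "negative"
    print(f"z={z:+.1f}  p={p:.3f}  -> {label}")
```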

# Final Output
### This involves two steps

### Step 1) Reshape the output so that rows = batch_size

In [None]:
out = sigm_out.view(batch_size, -1)
print ('Output layer output shape', out.shape)
print ('Output layer output ', out)

![image.png](attachment:e8e64b7b-f958-4815-a1b4-7fabd4a51534.png)

### Step 2) Output from the last timestep

In [None]:
final_output = out[:,-1]
print ('Final Output Shape , ', final_output.shape)
print ('Final sentiment prediction, ', final_output)

![image.png](attachment:bb050196-dcc5-4be3-8ec2-b41131673028.png)
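
Since only the last time step is kept, the same prediction can also be obtained by selecting the last time step first and applying the FC and sigmoid layers to it alone. The sketch below compares the two routes on toy tensors; the sizes are assumptions, and a random tensor stands in for the LSTM output.

```python
import torch
from torch import nn

torch.manual_seed(0)

toy_batch_size, toy_seq_length, toy_hidden_units = 2, 5, 3

# Random stand-in for the LSTM output
toy_lstm_out = torch.randn(toy_batch_size, toy_seq_length, toy_hidden_units)
toy_fc = nn.Linear(in_features=toy_hidden_units, out_features=1)

# Route 1 (as in this notebook): score every time step, reshape, keep the last column
all_steps = torch.sigmoid(toy_fc(toy_lstm_out.reshape(-1, toy_hidden_units)))
route_1 = all_steps.view(toy_batch_size, -1)[:, -1]

# Route 2: keep only the last time step, then score it
route_2 = torch.sigmoid(toy_fc(toy_lstm_out[:, -1, :])).squeeze(1)

print(route_1.shape)  # one prediction per review
print(torch.allclose(route_1, route_2, atol=1e-6))
```

Route 2 avoids computing scores for time steps that are thrown away, which may matter with long sequences such as the 2500 steps used here.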