# NLP - Binary Text Classification using CNN+RNN

By [Akshaj Verma](https://akshajverma.com)  

This notebook takes you through the implementation of binary text classification in the form of sentiment analysis on yelp reviews using CNN+RNN in PyTorch.

In [1]:
import re
import numpy as np
import pandas as pd
from pprint import pprint
from collections import Counter 
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

%matplotlib inline

torch.manual_seed(1)

<torch._C.Generator at 0x7f50b23ccbf0>

## Prepare Data

In [2]:
df = pd.read_csv("../../../data/nlp/text_classification/yelp_labelled.txt", sep="\t", header=None, names=['text', 'tag'])
df.head()

Unnamed: 0,text,tag
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [3]:
df = df[-df['text'].str.split().str.len().lt(6)]

## Convert from dataframe to list

In [4]:
sentence_list = [t for t in df['text'].to_list()]
tag_list = [t for t in df['tag'].to_list()]

#### The input sentences.

In [5]:
sentence_list[1:10]

['Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.',
 'The selection on the menu was great and so were the prices.',
 'Now I am getting angry and I want my damn pho.',
 "Honeslty it didn't taste THAT fresh.)",
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.',
 'The cashier had no care what so ever on what I had to say it still ended up being wayyy overpriced.',
 'I tried the Cape Cod ravoli, chicken,with cranberry...mmmm!',
 'I was disgusted because I was pretty sure that was human hair.',
 'I was shocked because no signs indicate cash only.']

#### The output tags.

In [6]:
tag_list[1:10]

[1, 1, 0, 0, 0, 0, 1, 0, 0]

### Clean the input data.

In [7]:
# Convert to lowercase
sentence_list = [s.lower() for s in sentence_list]

# Remove non alphavets
regex_remove_nonalphabets = re.compile('[^a-zA-Z]')
sentence_list = [regex_remove_nonalphabets.sub(' ', s) for s in sentence_list]

# Remove words with less than 2 letters
regex_remove_shortwords = re.compile(r'\b\w{1,2}\b')
sentence_list = [regex_remove_shortwords.sub("", s) for s in sentence_list]

# Remove words that appear only once
c = Counter(w for s in sentence_list for w in s.split())
sentence_list = [' '.join(y for y in x.split() if c[y] > 1) for x in sentence_list]

# Strip extra whitespaces
sentence_list = [" ".join(s.split()) for s in sentence_list]

In [8]:
sentence_list[0:10]

['not tasty and the texture was just nasty',
 'stopped during the late may off recommendation and loved',
 'the selection the menu was great and were the prices',
 'now getting and want damn pho',
 'didn taste that fresh',
 'the potatoes were like and you could tell they had been made time being kept under',
 'the cashier had care what ever what had say still ended being overpriced',
 'tried the chicken with mmmm',
 'was because was pretty sure that was human hair',
 'was because only']

### Create a vocab and dictionary for input.

#### Vocab for input.

In [9]:
words = []
for sentence in sentence_list:
    for w in sentence.split():
        words.append(w)
    
words = list(set(words))
print(f"Size of word-vocablury: {len(words)}\n")

Size of word-vocablury: 797



#### Input <=> ID.

In [10]:
word2idx = {word: i for i, word in enumerate(words)}

### Create a vocab and dictionary for output.

#### Vocab for output.

In [11]:
tags = []
for tag in tag_list:
    tags.append(tag)
tags = list(set(tags))
print(f"Size of tag-vocab: {len(tags)}\n")
print(tags)

Size of tag-vocab: 2

[0, 1]


#### Output <=> ID.

In [12]:
tag2idx = {word: i for i, word in enumerate(tags)}
print(tag2idx)

{0: 0, 1: 1}


### Encode the input and output to numbers.

#### Input

In [13]:
X = [[word2idx[w] for w in s.split()] for s in sentence_list]
X[:3]

[[674, 3, 148, 741, 98, 305, 751, 86],
 [382, 242, 741, 584, 158, 133, 366, 148, 651],
 [741, 413, 741, 784, 305, 406, 148, 697, 741, 47]]

#### Output

In [14]:
y = [tag2idx[t] for t in tag_list]
y[:3]

[0, 1, 1]

### Train-Test Split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [16]:
print("X_train size: ", len(X_train))
print("X_test size: ", len(X_test))

X_train size:  540
X_test size:  232


## Sample Neural Network

### Sample Parameters.

In [17]:
BATCH_SIZE_SAMPLE = 2
EMBEDDING_SIZE_SAMPLE = 5
VOCAB_SIZE = len(word2idx)
TARGET_SIZE = len(tag2idx)
HIDDEN_SIZE_SAMPLE = 3
STACKED_LAYERS_SAMPLE = 4

### Sample Dataloader.

In [18]:
class SampleData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [19]:
sample_data = SampleData(X_train, y_train)
sample_loader = DataLoader(sample_data, batch_size=BATCH_SIZE_SAMPLE, collate_fn=lambda x:x)

In [20]:
tl = iter(sample_loader)

i,j = map(list, zip(*next(tl)))

print(i,"\n\n", j, "\n")

[[674, 597, 432, 741], [408, 338, 280, 55, 74, 741, 148, 741, 59, 429, 666, 338, 406]] 

 [0, 1] 



### Sample CNN+RNN class.

In [21]:
class ModelCnnRnnSample(nn.Module):
    
    def __init__(self, embedding_size, vocab_size, hidden_size, target_size, stacked_layers):
        super(ModelCnnRnnSample, self).__init__()
        
        self.word_embeddings = nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_size)
        self.conv1 = nn.Conv1d(in_channels=embedding_size, out_channels=100, kernel_size=3, stride=1, padding = 1)
        self.conv2 = nn.Conv1d(in_channels=100, out_channels=7, kernel_size=3, stride=1, padding = 1)
        self.maxpool = nn.MaxPool1d(kernel_size=3)
        self.gru = nn.GRU(input_size = 7, hidden_size=hidden_size, batch_first=True)
        self.linear = nn.Linear(in_features = hidden_size, out_features=1)
        
    def forward(self, x_batch):        
        padded_batch = pad_sequence(x_batch, batch_first=True)
        print("\nPadded X_batch: ", padded_batch.size(), "\n", padded_batch, "\n")

        embeds = self.word_embeddings(padded_batch)
        print("\nEmbeddings: ", embeds.size(), "\n", embeds, "\n")
    
        embeds_t = embeds.transpose(1, 2)
        print("\nEmbeddings transposed for CNN: ", embeds_t.size(), "\n", embeds_t, "\n")

        cnn1 = torch.relu(self.conv1(embeds_t))
        cnn2 = torch.relu(self.conv2(cnn1))
        print("\nCNN output: ", cnn2.size(), "\n", cnn2)
        
        maxpool1 = self.maxpool(cnn2)
        print("\nMaxpool output: ", maxpool1.size(), "\n", maxpool1)
        
        gru_input = maxpool1.transpose(1, 2)
        print("\nRNN Input: ", gru_input.size(), "\n", gru_input)
        
        _, gru_hidden = self.gru(gru_input)
        print("\nRNN Last Hidden: ", gru_hidden.size(), "\n", gru_hidden)
        
        
#         linear_in, _ = torch.max(gru_hidden, dim = 2)
#         print("\nLinear input: ", linear_in.size(), "\n", linear_in)

        linear_out = self.linear(gru_hidden)
        print("\nLinear Output: ", linear_out.size(), "\n", linear_out)
        
        y_out = torch.sigmoid(linear_out)
#         print("\nSigmoid:\n", y_out)

        
        return y_out

In [22]:
cnn_rnn_model_sample = ModelCnnRnnSample(embedding_size=EMBEDDING_SIZE_SAMPLE, vocab_size=len(word2idx), hidden_size = HIDDEN_SIZE_SAMPLE, target_size=len(tag2idx), stacked_layers=STACKED_LAYERS_SAMPLE)
print(cnn_rnn_model_sample)

ModelCnnRnnSample(
  (word_embeddings): Embedding(797, 5)
  (conv1): Conv1d(5, 100, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv2): Conv1d(100, 7, kernel_size=(3,), stride=(1,), padding=(1,))
  (maxpool): MaxPool1d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (gru): GRU(7, 3, batch_first=True)
  (linear): Linear(in_features=3, out_features=1, bias=True)
)


### Sample Output.

output = [batch size, sent len, hid dim]  
hidden = [batch size, 1, hid dim]

In [23]:
with torch.no_grad():
    for batch in sample_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i) for i in x_batch]
        y_batch = [torch.tensor(i) for i in y_batch]
        
        
        print("X batch: ")
        pprint(x_batch)
        print("\ny batch: ")
        pprint(y_batch)
        
        y_out = cnn_rnn_model_sample(x_batch)
        print("\nModel Output: ", y_out.size())
        print(y_out)
                        
        y_out_tag = torch.round(y_out)
        print("\nY Output Tag: \n", y_out_tag)
        
        
        print("\nActual Output: ")
        print(y_batch)

        break

X batch: 
[tensor([674, 597, 432, 741]),
 tensor([408, 338, 280,  55,  74, 741, 148, 741,  59, 429, 666, 338, 406])]

y batch: 
[tensor(0), tensor(1)]

Padded X_batch:  torch.Size([2, 13]) 
 tensor([[674, 597, 432, 741,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [408, 338, 280,  55,  74, 741, 148, 741,  59, 429, 666, 338, 406]]) 


Embeddings:  torch.Size([2, 13, 5]) 
 tensor([[[ 1.2912,  0.3553, -0.5949,  0.7913,  0.2434],
         [ 1.0208, -0.1065,  0.2071,  0.5192,  0.1796],
         [-0.6696, -0.7714, -0.2665,  0.0449, -0.7013],
         [-1.8827, -0.3300,  0.8413, -1.2723, -0.1413],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
         [-0.6540, -1.6095, -0.1002, -0.6092, -0.9798],
 

## Actual Neural Network.

### Model parameters.

In [24]:
EPOCHS = 20
BATCH_SIZE = 32
EMBEDDING_SIZE = 512
VOCAB_SIZE = len(word2idx)
TARGET_SIZE = len(tag2idx)
HIDDEN_SIZE = 64
LEARNING_RATE = 0.005
STACKED_LAYERS = 2

### Data Loader.

#### Train Loader.

In [25]:
class TrainData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [26]:
train_data = TrainData(X_train, y_train)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=lambda x:x)

#### Test Loader

In [27]:
class TestData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [28]:
test_data = TestData(X_test, y_test)
test_loader = DataLoader(test_data, batch_size=1, collate_fn=lambda x:x)

### CNN+RNN Model Class.

In [29]:
class ModelCnnRnn(nn.Module):
    
    def __init__(self, embedding_size, vocab_size, hidden_size, target_size, stacked_layers):
        super(ModelCnnRnn, self).__init__()
        
        self.word_embeddings = nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_size)
        self.conv1 = nn.Conv1d(in_channels=embedding_size, out_channels=64, kernel_size=3, stride=1, padding = 1)
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding = 1)
        self.conv3 = nn.Conv1d(in_channels=32, out_channels=16, kernel_size=3, stride=1, padding = 1)
        self.maxpool = nn.MaxPool1d(kernel_size=3)
        self.dropout = nn.Dropout(p=0.2)
        self.gru = nn.GRU(input_size = 16, hidden_size=hidden_size, batch_first=True)
        self.linear = nn.Linear(in_features = hidden_size, out_features=1)
        self.relu = nn.ReLU()
        
    def forward(self, x_batch):        
        padded_batch = pad_sequence(x_batch, batch_first=True)
        
        embeds = self.word_embeddings(padded_batch)
        embeds_t = embeds.transpose(1, 2)
        
        cnn1 = self.relu(self.conv1(embeds_t))
        cnn1 = self.dropout(cnn1)
        cnn2 = self.relu(self.conv2(cnn1))
        cnn2 = self.dropout(cnn2)
        cnn3 = self.relu(self.conv3(cnn2))
        cnn3 = self.dropout(cnn3)
        
        gru_input = cnn3.transpose(1, 2)
        _, gru_hidden = self.gru(gru_input)
        
        linear_out = self.linear(gru_hidden)
        
        return linear_out

In [30]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [31]:
cnn_rnn_model = ModelCnnRnn(embedding_size=EMBEDDING_SIZE, vocab_size=len(word2idx), hidden_size=HIDDEN_SIZE, target_size=len(tag2idx), stacked_layers=STACKED_LAYERS)

cnn_rnn_model.to(device)
print(cnn_rnn_model)

criterion = nn.BCEWithLogitsLoss()

optimizer =  optim.Adam(cnn_rnn_model.parameters())

ModelCnnRnn(
  (word_embeddings): Embedding(797, 512)
  (conv1): Conv1d(512, 64, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv2): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv3): Conv1d(32, 16, kernel_size=(3,), stride=(1,), padding=(1,))
  (maxpool): MaxPool1d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (dropout): Dropout(p=0.2, inplace=False)
  (gru): GRU(16, 64, batch_first=True)
  (linear): Linear(in_features=64, out_features=1, bias=True)
  (relu): ReLU()
)


## Train model.

In [32]:
def binary_acc(y_pred, y_test):
    y_pred_tag = torch.round(torch.sigmoid(y_pred))

    correct_results_sum = (y_pred_tag == y_test).sum().float()
    acc = correct_results_sum/y_test.shape[0]
    acc = torch.round(acc * 100)
    
    return acc

In [33]:
cnn_rnn_model.train()
for e in range(1, EPOCHS+1):
    epoch_loss = 0
    epoch_acc = 0
    for batch in train_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i).to(device) for i in x_batch]
        y_batch = torch.tensor(y_batch).long().to(device)
                
        optimizer.zero_grad()
        
        y_pred = cnn_rnn_model(x_batch)
                
        loss = criterion(y_pred.squeeze(), y_batch.float())
        acc = binary_acc(y_pred.squeeze(), y_batch.float())
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    
    print(f'Epoch {e+0:03}: | Loss: {epoch_loss/len(train_loader):.9f} | Acc: {acc}')

Epoch 001: | Loss: 0.693399717 | Acc: 54.0
Epoch 002: | Loss: 0.691596887 | Acc: 64.0
Epoch 003: | Loss: 0.690825199 | Acc: 54.0
Epoch 004: | Loss: 0.682170871 | Acc: 64.0
Epoch 005: | Loss: 0.678511017 | Acc: 50.0
Epoch 006: | Loss: 0.663026319 | Acc: 54.0
Epoch 007: | Loss: 0.645953964 | Acc: 57.0
Epoch 008: | Loss: 0.596029194 | Acc: 54.0
Epoch 009: | Loss: 0.464356652 | Acc: 86.0
Epoch 010: | Loss: 0.338667913 | Acc: 93.0
Epoch 011: | Loss: 0.229146009 | Acc: 96.0
Epoch 012: | Loss: 0.276522907 | Acc: 89.0
Epoch 013: | Loss: 0.225944416 | Acc: 93.0
Epoch 014: | Loss: 0.156504884 | Acc: 89.0
Epoch 015: | Loss: 0.142707937 | Acc: 96.0
Epoch 016: | Loss: 0.168017417 | Acc: 93.0
Epoch 017: | Loss: 0.122333537 | Acc: 96.0
Epoch 018: | Loss: 0.107507899 | Acc: 96.0
Epoch 019: | Loss: 0.072540649 | Acc: 96.0
Epoch 020: | Loss: 0.056513863 | Acc: 100.0


## Test Model.

In [34]:
y_out_tags_list = []
with torch.no_grad():
    for batch in test_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i).to(device) for i in x_batch]
        y_batch = torch.tensor(y_batch).long().to(device)
        
        y_pred = cnn_rnn_model(x_batch)
        y_pred = torch.sigmoid(y_pred)
        y_pred_tag = torch.round(y_pred)

        y_out_tags_list.append(y_pred_tag.squeeze(0).cpu().numpy())

## Confusion Matrix.

In [35]:
y_out_tags_list = [a.squeeze().tolist() for a in y_out_tags_list]

In [36]:
print(confusion_matrix(y_test, y_out_tags_list))

[[79 38]
 [18 97]]


## Classification Report.

In [37]:
print(classification_report(y_test, y_out_tags_list))

              precision    recall  f1-score   support

           0       0.81      0.68      0.74       117
           1       0.72      0.84      0.78       115

    accuracy                           0.76       232
   macro avg       0.77      0.76      0.76       232
weighted avg       0.77      0.76      0.76       232



## View model output.

In [38]:
idx2word = {v: k for k, v in word2idx.items()}
idx2tag = {v: k for k, v in tag2idx.items()}

In [39]:
print('{:80}: {:15}\n'.format("Sentence", "Sentiment"))
for sentence, tag in zip(X_test[:10], y_out_tags_list[:10]):
    s = " ".join([idx2word[w] for w in sentence])
    print('{:80}: {:5}\n'.format(s, tag))


Sentence                                                                        : Sentiment      

this place has lot but                                                          :   1.0

believe that this place great stop for those with huge belly and for sushi      :   1.0

guess maybe went off night but was                                              :   0.0

the meat was pretty dry had the sliced and pulled pork                          :   0.0

this place one star and has with the food                                       :   1.0

service perfect and the family atmosphere nice see                              :   1.0

had tea which was good                                                          :   1.0

great food and great service clean and friendly setting                         :   1.0

great time family dinner night                                                  :   1.0

had fantastic service and were pleased the atmosphere                           :   1.0

