# NLP - Multi-Class Text Classification using CNNs

By [Akshaj Verma](https://akshajverma.com)  

This notebook takes you through the implementation of binary text classification in the form of sentiment analysis on yelp reviews using CNNs in PyTorch.

In [1]:
import re
import numpy as np
import pandas as pd
from pprint import pprint
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

%matplotlib inline

torch.manual_seed(1)

<torch._C.Generator at 0x7fd8c5f96550>

## Prepare Data

In [2]:
df = pd.read_csv("../../../data/nlp/text_classification/bbc-text.csv")
df = df.rename(columns = {'category':'tag'})
df.head()

Unnamed: 0,tag,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


### Convert from dataframe to list

In [3]:
sentence_list = [t for t in df['text'].to_list()]
tag_list = [t for t in df['tag'].to_list()]

#### The input sentences.

In [4]:
sentence_list[:2]

['tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to hig

#### The output tags.

In [5]:
tag_list[:2]

['tech', 'business']

### Clean the input data.

In [6]:
# Convert to lowercase
sentence_list = [s.lower() for s in sentence_list]

# Remove non alphavets
regex_remove_nonalphabets = re.compile('[^a-zA-Z]')
sentence_list = [regex_remove_nonalphabets.sub(' ', s) for s in sentence_list]

# Remove words with less than 2 letters
# regex_remove_shortwords = re.compile(r'\b\w{1,2}\b')
# sentence_list = [regex_remove_shortwords.sub("", s) for s in sentence_list]

# Remove words that appear only once
c = Counter(w for s in sentence_list for w in s.split())
sentence_list = [' '.join(y for y in x.split() if c[y] > 1) for x in sentence_list]

# Strip extra whitespaces
sentence_list = [" ".join(s.split()) for s in sentence_list]

In [7]:
sentence_list[:2]

['tv future in the hands of viewers with home theatre systems plasma high definition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices one of the most talked about technologies of ces has been digital and personal video recorders dvr and pvr these set top boxes like the us s tivo and the uk s sky system allow people to record store play pause and forward wind tv programmes when they want essentially the technology allows for much more personalised tv they are also being built in to high definition tv sets which are big b

### Create a vocab and dictionary for input.

#### Vocab for input.

In [8]:
words = []
for sentence in sentence_list:
    for w in sentence.split():
        words.append(w)
    
words = list(set(words))
print(f"Size of word-vocablury: {len(words)}\n")

Size of word-vocablury: 18636



#### Input <=> ID.

In [9]:
word2idx = {word: i for i, word in enumerate(words)}

### Create a vocab and dictionary for output.

#### Vocab for output.

In [10]:
tags = []
for tag in tag_list:
    tags.append(tag)
tags = list(set(tags))
print(f"Size of tag-vocab: {len(tags)}\n")
print(tags)

Size of tag-vocab: 5

['tech', 'sport', 'politics', 'business', 'entertainment']


#### Output <=> ID.

In [11]:
tag2idx = {word: i for i, word in enumerate(tags)}
print(tag2idx)

{'tech': 0, 'sport': 1, 'politics': 2, 'business': 3, 'entertainment': 4}


### Encode the input and output to numbers.

#### Input

In [12]:
X = [[word2idx[w] for w in s.split()] for s in sentence_list]

#### Output

In [13]:
y = [tag2idx[t] for t in tag_list]
y[:3]

[0, 3, 1]

### Train-Test Split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [15]:
print("X_train size: ", len(X_train))
print("X_test size: ", len(X_test))

X_train size:  1557
X_test size:  668


## Sample Neural Network

### Sample Parameters.

In [16]:
BATCH_SIZE_SAMPLE = 2
EMBEDDING_SIZE_SAMPLE = 5
VOCAB_SIZE = len(word2idx)
TARGET_SIZE = len(tag2idx)
HIDDEN_SIZE_SAMPLE = 3
STACKED_LAYERS_SAMPLE = 4

### Sample Dataloader.

In [17]:
class SampleData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [18]:
sample_data = SampleData(X_train, y_train)
sample_loader = DataLoader(sample_data, batch_size=BATCH_SIZE_SAMPLE, collate_fn=lambda x:x)

In [19]:
tl = iter(sample_loader)

i,j = map(list, zip(*next(tl)))

print(i,"\n\n", j, "\n")

[[2946, 1927, 13292, 1281, 13491, 8469, 2946, 16737, 11041, 17222, 13551, 4228, 105, 17222, 13292, 2027, 7294, 11800, 8271, 10190, 105, 5319, 17058, 17041, 1543, 9154, 1454, 6371, 2027, 5226, 1454, 1701, 5417, 13289, 5034, 10729, 2483, 5319, 9868, 4300, 2198, 11366, 17151, 15158, 15356, 8705, 17058, 13196, 5207, 12825, 8732, 8705, 16778, 10527, 9868, 657, 2936, 7670, 9822, 2027, 5226, 5207, 2552, 16726, 7294, 12929, 6923, 16478, 2218, 2959, 14148, 18024, 8705, 1025, 17041, 17253, 16429, 15059, 17489, 5226, 13212, 1869, 105, 15249, 14950, 4041, 13279, 4371, 18185, 10998, 105, 2226, 5207, 1025, 17253, 14510, 8185, 9352, 8533, 7234, 3, 14950, 16778, 10527, 9868, 4274, 14206, 17222, 8345, 13551, 2388, 4654, 12305, 9629, 17017, 7164, 105, 5126, 8883, 1551, 11939, 105, 2748, 17058, 14819, 14950, 5461, 1505, 18600, 2958, 16674, 12929, 13437, 2916, 1869, 8705, 2050, 6813, 12362, 11101, 13411, 8635, 18438, 1927, 12929, 6991, 10440, 16127, 560, 17058, 16922, 560, 5794, 9885, 11101, 15266, 15266,

### Sample CNN class.

In [20]:
class ModelCNNSample(nn.Module):
    
    def __init__(self, embedding_size, vocab_size, target_size):
        super(ModelCNNSample, self).__init__()
        
        self.word_embeddings = nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_size)
        self.conv1 = nn.Conv1d(in_channels=embedding_size, out_channels=100, kernel_size=3, stride=1, padding = 1)
        self.conv2 = nn.Conv1d(in_channels=100, out_channels=10, kernel_size=3, stride=1, padding = 1)
        self.maxpool = nn.MaxPool1d(kernel_size=3)
        self.linear = nn.Linear(in_features = 10, out_features=target_size)
        
        
    def forward(self, x_batch):        
        padded_batch = pad_sequence(x_batch, batch_first=True)
        print("\nPadded X_batch: ", padded_batch.size(), "\n", padded_batch, "\n")

        
        embeds = self.word_embeddings(padded_batch)
        print("\nEmbeddings: ", embeds.size(), "\n", embeds, "\n")
    
        embeds_t = embeds.transpose(1, 2)
        print("\nEmbeddings transposed for CNN: ", embeds_t.size(), "\n", embeds_t, "\n")

        cnn1 = torch.relu(self.conv1(embeds_t))
        cnn2 = torch.relu(self.conv2(cnn1))
        print("\nCNN output: ", cnn2.size(), "\n", cnn2)
        
        maxpool1 = self.maxpool(cnn2)
        print("\nMaxpool output: ", maxpool1.size(), "\n", maxpool1)
        
        linear_in, _ = torch.max(maxpool1, dim = 2)
        print("\nLinear input: ", linear_in.size(), "\n", linear_in)


        linear_out = self.linear(linear_in)
        print("\nLinear Output:\n", linear_out)
        
        y_out = torch.log_softmax(linear_out, dim = 1)
        print("\nLog Softmax:\n", y_out)

        
        return y_out

In [21]:
cnn_model_sample = ModelCNNSample(embedding_size=EMBEDDING_SIZE_SAMPLE, vocab_size=len(word2idx), target_size=len(tag2idx))
print(cnn_model_sample)

ModelCNNSample(
  (word_embeddings): Embedding(18636, 5)
  (conv1): Conv1d(5, 100, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv2): Conv1d(100, 10, kernel_size=(3,), stride=(1,), padding=(1,))
  (maxpool): MaxPool1d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
  (linear): Linear(in_features=10, out_features=5, bias=True)
)


### Sample Output.

output = [batch size, sent len, hid dim]  
hidden = [batch size, 1, hid dim]

In [22]:
with torch.no_grad():
    for batch in sample_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i) for i in x_batch]
        y_batch = [torch.tensor(i) for i in y_batch]
        
        
        print("X batch: ")
        pprint(x_batch)
        print("\ny batch: ")
        pprint(y_batch)
        
        y_out = cnn_model_sample(x_batch)
                        
        _, y_out_tag = torch.max(y_out, dim = 1)
        print("\nY Output Tag: \n", y_out_tag)
        
        print("\nActual Output: ")
        print(y_batch)

        break

X batch: 
[tensor([ 2946,  1927, 13292,  1281, 13491,  8469,  2946, 16737, 11041, 17222,
        13551,  4228,   105, 17222, 13292,  2027,  7294, 11800,  8271, 10190,
          105,  5319, 17058, 17041,  1543,  9154,  1454,  6371,  2027,  5226,
         1454,  1701,  5417, 13289,  5034, 10729,  2483,  5319,  9868,  4300,
         2198, 11366, 17151, 15158, 15356,  8705, 17058, 13196,  5207, 12825,
         8732,  8705, 16778, 10527,  9868,   657,  2936,  7670,  9822,  2027,
         5226,  5207,  2552, 16726,  7294, 12929,  6923, 16478,  2218,  2959,
        14148, 18024,  8705,  1025, 17041, 17253, 16429, 15059, 17489,  5226,
        13212,  1869,   105, 15249, 14950,  4041, 13279,  4371, 18185, 10998,
          105,  2226,  5207,  1025, 17253, 14510,  8185,  9352,  8533,  7234,
            3, 14950, 16778, 10527,  9868,  4274, 14206, 17222,  8345, 13551,
         2388,  4654, 12305,  9629, 17017,  7164,   105,  5126,  8883,  1551,
        11939,   105,  2748, 17058, 14819, 14950,  54

## Acutal Neural Network.

### Model parameters.

In [23]:
EPOCHS = 10
BATCH_SIZE = 32
EMBEDDING_SIZE = 512
VOCAB_SIZE = len(word2idx)
TARGET_SIZE = len(tag2idx)
HIDDEN_SIZE = 64
LEARNING_RATE = 0.001

### Data Loader.

#### Train Loader.

In [24]:
class TrainData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [25]:
train_data = TrainData(X_train, y_train)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=lambda x:x)

#### Test Loader

In [26]:
class TestData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [27]:
test_data = TestData(X_test, y_test)
test_loader = DataLoader(test_data, batch_size=1, collate_fn=lambda x:x)

### CNN Model Class.

In [28]:
class ModelCNN(nn.Module):
    
    def __init__(self, embedding_size, vocab_size, target_size):
        super(ModelCNN, self).__init__()
        
        self.word_embeddings = nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_size)
        self.conv1 = nn.Conv1d(in_channels=embedding_size, out_channels=128, kernel_size=3, stride=1, padding = 1)
        self.conv2 = nn.Conv1d(in_channels=128, out_channels=10, kernel_size=3, stride=1, padding = 1)
        self.maxpool = nn.MaxPool1d(kernel_size=2)
        self.linear = nn.Linear(in_features = 10, out_features=target_size)
        
        
    def forward(self, x_batch):        
        padded_batch = pad_sequence(x_batch, batch_first=True)
        embeds = self.word_embeddings(padded_batch)
        embeds_t = embeds.transpose(1, 2)
        
        cnn1 = torch.relu(self.conv1(embeds_t))
        cnn2 = torch.relu(self.conv2(cnn1))
        maxpool1 = self.maxpool(cnn2)
        linear_in, _ = torch.max(maxpool1, dim = 2)
        
        linear_out = self.linear(linear_in)
        
        return linear_out

In [29]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [30]:
cnn_model = ModelCNN(embedding_size=EMBEDDING_SIZE, vocab_size=len(word2idx), target_size=len(tag2idx))

cnn_model.to(device)
print(cnn_model)

criterion = nn.CrossEntropyLoss()

optimizer =  optim.Adam(cnn_model.parameters())

ModelCNN(
  (word_embeddings): Embedding(18636, 512)
  (conv1): Conv1d(512, 128, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv2): Conv1d(128, 10, kernel_size=(3,), stride=(1,), padding=(1,))
  (maxpool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (linear): Linear(in_features=10, out_features=5, bias=True)
)


## Train model.

In [31]:
def multi_acc(y_pred, y_test):
    y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
    _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)    
    
    correct_pred = (y_pred_tags == y_test).float()
    acc = correct_pred.sum() / len(correct_pred)
    
    acc = torch.round(acc) * 100
    
    return acc

In [32]:
cnn_model.train()
for e in range(1, EPOCHS+1):
    epoch_loss = 0
    epoch_acc = 0
    for batch in train_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i).to(device) for i in x_batch]
        y_batch = torch.tensor(y_batch).long().to(device)
                
        optimizer.zero_grad()
        
        y_pred = cnn_model(x_batch)        
        
        loss = criterion(y_pred.squeeze(0), y_batch)
        acc = multi_acc(y_pred.squeeze(0), y_batch)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    print(f'Epoch {e+0:03}: | Loss: {epoch_loss/len(train_loader):.5f} | Acc: {acc}')

Epoch 001: | Loss: 1.40272 | Acc: 0.0
Epoch 002: | Loss: 0.79136 | Acc: 100.0
Epoch 003: | Loss: 0.38017 | Acc: 100.0
Epoch 004: | Loss: 0.16631 | Acc: 100.0
Epoch 005: | Loss: 0.07012 | Acc: 100.0
Epoch 006: | Loss: 0.03393 | Acc: 100.0
Epoch 007: | Loss: 0.01774 | Acc: 100.0
Epoch 008: | Loss: 0.01027 | Acc: 100.0
Epoch 009: | Loss: 0.00709 | Acc: 100.0
Epoch 010: | Loss: 0.00522 | Acc: 100.0


## Test Model.

In [33]:
y_out_tags_list = []
with torch.no_grad():
    for batch in test_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i).to(device) for i in x_batch]
        y_batch = torch.tensor(y_batch).long().to(device)
        
        y_pred = cnn_model(x_batch)
        _, y_pred_tag = torch.max(y_pred, dim = 1)

        y_out_tags_list.append(y_pred_tag.squeeze(0).cpu().numpy())

## Confusion Matrix.

In [34]:
print(confusion_matrix(y_test, y_out_tags_list))

[[ 87   2   6   9   9]
 [  5 130   5   1  12]
 [ 10   5  96  12   5]
 [ 18   1  13 123   5]
 [  8   4   7   6  89]]


## Classification Report.

In [35]:
y_out_tags_list = [a.squeeze().tolist() for a in y_out_tags_list]

In [36]:
print(classification_report(y_test, y_out_tags_list))

              precision    recall  f1-score   support

           0       0.68      0.77      0.72       113
           1       0.92      0.85      0.88       153
           2       0.76      0.75      0.75       128
           3       0.81      0.77      0.79       160
           4       0.74      0.78      0.76       114

    accuracy                           0.79       668
   macro avg       0.78      0.78      0.78       668
weighted avg       0.79      0.79      0.79       668



## View model output.

In [37]:
idx2word = {v: k for k, v in word2idx.items()}
idx2tag = {v: k for k, v in tag2idx.items()}

In [38]:
print('{:80}: {:15}\n'.format("Word", "Sentiment"))
for sentence, tag in zip(X_test[:10], y_out_tags_list[:10]):
    s = " ".join([idx2word[w] for w in sentence])
    print('{:80}: {:5}\n'.format(s, tag))


Word                                                                            : Sentiment      

stam spices up man utd encounter ac milan defender stam says manchester united know they made a mistake by selling him in the sides meet at old trafford in the champions league game on wednesday and the year old s dutchman s presence is sure to add spice to the fixture united made a mistake in selling me stam told uefa s champions magazine i was settled at manchester united but they wanted to sell me if a club want to sell you there is nothing you can do you can be sold like cattle sir alex ferguson surprised the football world and stam by selling the dutchman to lazio for m in august the decision came shortly after stam claimed in his autobiography that ferguson had tapped him up when he was at psv eindhoven but ferguson insisted he sold the defender because the transfer fee was too good to refuse for a player past his prime the affair still with the dutchman i was settled at manchester 