# Text Classification Using RNN

The aim of this notebook is to classify texts into five categories using a recurrent neural network (RNN). The categories are politics, sports, technology, entertainment, and business. The dataset is available at: https://www.kaggle.com/datasets/tanishqdublish/text-classification-documentation.

The RNN is built with PyTorch. The structure of the script is based on the following tutorial: https://www.geeksforgeeks.org/deep-learning/implementing-recurrent-neural-networks-in-pytorch/

## Import packages

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader

## Read and preprocess data

* Lowercasing text, dividing into words
* Building vocabulary from the words
* Encoding text to numbers

Even though the average sequence length is almost 400 words, with a simple and quite small model the best results were achieved using only the first 50 words. If a sequence is shorter than 50 words, zeros are appended to the end of its encoded version. Word indices start at 1, so 0 is reserved for padding.
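As a rough illustration of the padding and truncation rule, the sketch below applies it to a made-up four-word vocabulary with a maximum length of 6 instead of 50. The names `toy_vocab` and `MAX_LEN` are hypothetical and not part of the notebook.

```python
# Toy vocabulary (hypothetical); index 0 is reserved for padding
toy_vocab = {"the": 1, "match": 2, "was": 3, "great": 4}
MAX_LEN = 6  # the notebook itself uses 50

def encode_and_pad(tokens, word_to_idx, max_len):
    """Map tokens to indices, then truncate or right-pad with 0 to max_len."""
    encoded = [word_to_idx[t] for t in tokens][:max_len]
    return encoded + [0] * (max_len - len(encoded))

# A short text is padded with zeros at the end
short = encode_and_pad("The match was great".lower().split(), toy_vocab, MAX_LEN)
# A long text (10 tokens here) is cut to its first MAX_LEN tokens
long_ = encode_and_pad(("the match " * 5).split(), toy_vocab, MAX_LEN)

print(short)  # [1, 2, 3, 4, 0, 0]
print(long_)  # [1, 2, 1, 2, 1, 2]
```

Every encoded text ends up with the same length, which is what allows the texts to be stacked into one batch tensor later.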

In [None]:
# Read data
df = pd.read_csv('df_file.csv')

# Text preprocessing: lowercasing and tokenization
df['Text'] = df['Text'].str.lower().str.split()

# Ensure correct label encoding
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Label'])

trainData, testData = train_test_split(df, test_size=0.2, random_state=42)

# Build vocabulary (sorted so that word indices are the same on every run)
vocab = sorted({word for phrase in df['Text'] for word in phrase})
wordToIdx = {word: idx for idx, word in enumerate(vocab, start=1)}  # Start indexing from 1, 0 is kept for padding

# Calculate average sequence length
length = 0
for text in df['Text']:
    length += len(text)
avgLength = length / len(df['Text'])
print(f"Average sequence length: {avgLength}")

maxSeqLength = 50

# Encode and pad sequences
def encode_and_pad(phrase, wordToIdx, maxSeqLength):
    encoded = [wordToIdx[word] for word in phrase]
    if len(encoded) <= maxSeqLength:
        return encoded + [0] * (maxSeqLength - len(encoded))
    else:
        return encoded[:maxSeqLength]

trainData['Text'] = trainData['Text'].apply(lambda x: encode_and_pad(x, wordToIdx, maxSeqLength))
testData['Text'] = testData['Text'].apply(lambda x: encode_and_pad(x, wordToIdx, maxSeqLength))

Average sequence length: 384.04044943820224


## Create dataset and dataloader

In [79]:
# Create PyTorch Dataset and DataLoader for our data

class TextDataset(Dataset):
    def __init__(self, data):
        self.texts = data['Text'].values
        self.labels = data['Label'].values

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return torch.tensor(self.texts[idx], dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.long)
    
trainData = TextDataset(trainData)
testData = TextDataset(testData)

batchsize = 32

trainLoader = DataLoader(trainData, batch_size=batchsize, shuffle=True)
testLoader = DataLoader(testData, batch_size=batchsize, shuffle=False)
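To see what the loaders hand to the model, here is a small self-contained sketch. It uses random stand-in data (70 hypothetical samples wrapped in a `TensorDataset`) rather than the notebook's `TextDataset`, so the numbers only illustrate the shapes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 70 already-encoded sequences of length 50, labels 0..4
texts = torch.randint(0, 100, (70, 50))
labels = torch.randint(0, 5, (70,))
loader = DataLoader(TensorDataset(texts, labels), batch_size=32, shuffle=False)

batch_sizes = [batch_texts.shape[0] for batch_texts, _ in loader]
first_texts, first_labels = next(iter(loader))

print(batch_sizes)         # [32, 32, 6]
print(first_texts.shape)   # torch.Size([32, 50])
print(first_labels.shape)  # torch.Size([32])
```

Note that the last batch can be smaller than `batch_size` when the number of samples is not a multiple of it.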

## Creating RNN model

The model is a simple RNN consisting of an embedding layer, a single RNN layer, and a linear output layer. The embedding size is 64 and the hidden state has 64 features. The initial hidden state is all zeros.

In [80]:
class RNNModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size, n_layers=1):
        super().__init__()
        # keep the sizes on the instance so forward() does not rely on global variables
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.embedding = nn.Embedding(vocab_size, embed_size) # store word embeddings
        self.rnn = nn.RNN(embed_size, hidden_size, num_layers=n_layers, batch_first=True) # RNN layer, where the recurrent connections happen
        self.fc = nn.Linear(hidden_size, output_size) # output layer

    def forward(self, x):
        x = self.embedding(x) # convert word indices to embeddings
        # initial hidden state of zeros, shape (n_layers, batch_size, hidden_size)
        h0 = torch.zeros(self.n_layers, x.size(0), self.hidden_size, device=x.device)
        output, _ = self.rnn(x, h0) # RNN forward pass
        output = self.fc(output[:, -1, :]) # use the last time step's output for classification
        return output
    
vocab_size = len(wordToIdx) + 1  # +1 because index 0 is reserved for padding
embed_size = 64
hidden_size = 64
output_size = 5
model = RNNModel(vocab_size, embed_size, hidden_size, output_size)
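The following sketch traces tensor shapes through the same three layers, using standalone layers and random token indices. The vocabulary size of 100 and the batch size of 4 are arbitrary assumptions made for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_size, hidden_size, n_classes = 100, 64, 64, 5  # vocab_size is made up
batch_size, seq_len = 4, 50

embedding = nn.Embedding(vocab_size, embed_size)
rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, n_classes)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(tokens)      # (batch, seq_len, embed_size)
outputs, h_n = rnn(embedded)      # h0 omitted: PyTorch then starts from zeros
logits = fc(outputs[:, -1, :])    # one score per class for each text

print(embedded.shape)  # torch.Size([4, 50, 64])
print(outputs.shape)   # torch.Size([4, 50, 64])
print(h_n.shape)       # torch.Size([1, 4, 64])
print(logits.shape)    # torch.Size([4, 5])

# For a single-direction RNN, the last time step of `outputs`
# is the final hidden state of the last layer
same = torch.allclose(outputs[:, -1, :], h_n[-1])
```

Because `same` holds, `fc(h_n[-1])` would be an equivalent way to write the classification step for this architecture.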

## Training the model

We train the model for 50 epochs using the cross-entropy loss function and the Adam optimizer with a learning rate of 0.001.

In [81]:
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 50

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for texts, labels in trainLoader:
        outputs = model(texts)
        loss = loss_fn(outputs, labels)
        epoch_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(trainLoader)}')

Epoch 1/50, Loss: 1.6299247933285577
Epoch 2/50, Loss: 1.5395584042583192
Epoch 3/50, Loss: 1.4711667725018092
Epoch 4/50, Loss: 1.374806489263262
Epoch 5/50, Loss: 1.2411382900817054
Epoch 6/50, Loss: 1.0497327936547143
Epoch 7/50, Loss: 0.8562618046998978
Epoch 8/50, Loss: 0.6639349763946873
Epoch 9/50, Loss: 0.5045583301356861
Epoch 10/50, Loss: 0.34392036284719196
Epoch 11/50, Loss: 0.2504749002733401
Epoch 12/50, Loss: 0.17141771675752743
Epoch 13/50, Loss: 0.15898248200703943
Epoch 14/50, Loss: 0.09946302969806961
Epoch 15/50, Loss: 0.06025222432799637
Epoch 16/50, Loss: 0.04152095021813044
Epoch 17/50, Loss: 0.031076891281242882
Epoch 18/50, Loss: 0.024062717449851334
Epoch 19/50, Loss: 0.020134851469525268
Epoch 20/50, Loss: 0.016411327124972428
Epoch 21/50, Loss: 0.013360466342419386
Epoch 22/50, Loss: 0.011416962112499667
Epoch 23/50, Loss: 0.009803800269894834
Epoch 24/50, Loss: 0.008472006784618966
Epoch 25/50, Loss: 0.007443025334006441
Epoch 26/50, Loss: 0.006625627653972
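A useful reference point for reading these numbers: a model that guesses uniformly among five classes has a cross-entropy loss of ln 5 ≈ 1.609, which is close to the first-epoch loss of about 1.63. The small check below confirms that value with equal logits; it is a side calculation, not part of the training run.

```python
import math
import torch
import torch.nn.functional as F

# Equal logits for all 5 classes correspond to a uniform prediction
uniform_logits = torch.zeros(1, 5)
target = torch.tensor([0])  # any class gives the same loss here
chance_loss = F.cross_entropy(uniform_logits, target).item()

print(f"{chance_loss:.4f}", f"{math.log(5):.4f}")  # 1.6094 1.6094
```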

## Evaluating the model

After training, we evaluate the model's performance on unseen test data. The predicted class is the index of the largest output value. With this setup the model reached 53.93% accuracy on the test set. Since the training loss drops close to zero while the test accuracy stays much lower, the model appears to overfit the training data. Performance could likely be improved with a bigger model, more layers, and more data. An LSTM would also be expected to outperform a plain RNN on this task.

In [82]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for texts, labels in testLoader:
        outputs = model(texts)
        predicted = outputs.argmax(dim=1)  # index of the largest output = predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy: {100 * correct / total:.2f}%')

Accuracy: 53.93%
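The LSTM variant suggested above could be sketched as follows. This is an assumption about how one might swap the layer, not something that was run on this dataset, so no accuracy is claimed for it. The class name `LSTMModel` is hypothetical, and `padding_idx=0` is an added choice that keeps the embedding of the padding index at zero.

```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    """Same layout as the RNN model, with nn.LSTM in place of nn.RNN (sketch)."""

    def __init__(self, vocab_size, embed_size, hidden_size, output_size, n_layers=1):
        super().__init__()
        # padding_idx=0 is an assumption: index 0 is the padding value in this notebook
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)
        # h_n has shape (n_layers, batch, hidden_size); initial states default to zeros
        _, (h_n, _) = self.lstm(x)
        return self.fc(h_n[-1])  # classify from the last layer's final hidden state

# Shape check with random token indices (vocabulary size 100 is made up)
lstm_model = LSTMModel(vocab_size=100, embed_size=64, hidden_size=64, output_size=5)
dummy_batch = torch.randint(0, 100, (3, 50))
dummy_logits = lstm_model(dummy_batch)
print(dummy_logits.shape)  # torch.Size([3, 5])
```

The training and evaluation loops would stay the same, since the model still maps a batch of index sequences to five class scores.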


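## Checking the predictions

After the evaluation loop, `predicted` and `labels` hold only the last test batch, so the printout below covers that batch rather than the whole test set.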

In [83]:
print(f"Predicted: {predicted}, Actual: {labels}")

Predicted: tensor([4, 2, 4, 2, 0, 1, 1, 2, 1, 0, 1, 1, 4, 2, 2, 4, 2, 4, 4, 4, 2, 4, 3, 2,
        1, 0, 4, 1, 3]), Actual: tensor([3, 2, 4, 2, 1, 2, 1, 2, 3, 0, 1, 2, 2, 2, 3, 2, 1, 4, 3, 4, 4, 4, 3, 2,
        1, 0, 2, 1, 1])
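One way to read the two tensors side by side is a small confusion matrix. The sketch below copies the printed values and counts, for each actual class, how often each class was predicted. It covers only these 29 samples.

```python
import torch

# Values copied from the printout above
predicted = torch.tensor([4, 2, 4, 2, 0, 1, 1, 2, 1, 0, 1, 1, 4, 2, 2, 4, 2, 4, 4, 4,
                          2, 4, 3, 2, 1, 0, 4, 1, 3])
actual = torch.tensor([3, 2, 4, 2, 1, 2, 1, 2, 3, 0, 1, 2, 2, 2, 3, 2, 1, 4, 3, 4,
                       4, 4, 3, 2, 1, 0, 2, 1, 1])

n_classes = 5
confusion = torch.zeros(n_classes, n_classes, dtype=torch.long)
for a, p in zip(actual.tolist(), predicted.tolist()):
    confusion[a, p] += 1  # rows: actual class, columns: predicted class

correct = confusion.diag().sum().item()
print(confusion)
print(f"{correct} of {len(actual)} correct")  # 16 of 29 correct
```

In this batch, 16 of 29 predictions match, and class 3 is recognized only once out of five times. A single batch is too small to draw conclusions from; accumulating the matrix inside the evaluation loop would give the picture for the whole test set.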
