Credit: this notebook is based on the tutorial at https://www.geeksforgeeks.org/deep-learning/implementing-recurrent-neural-networks-in-pytorch/

The dataset used can be found at: https://www.kaggle.com/datasets/tanishqdublish/text-classification-documentation.

The goal is to train an RNN model to classify text phrases, using the PyTorch library.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader

Load the dataset

In [2]:
df = pd.read_csv('df_file.csv')

In [3]:
# Display data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    2225 non-null   object
 1   Label   2225 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 34.9+ KB


The data has two columns, the first containing text phrases and the second containing the corresponding category (class/label).

During preprocessing, the data is checked for missing values, the text is converted to lowercase, and each text is split into words on whitespace.

In [4]:
# Missing values count
print(df.isna().sum())

Text     0
Label    0
dtype: int64


In [5]:
df['Text'] = df['Text'].str.lower().str.split()

In [6]:
df.head()

   Text                                               Label
0  [budget, to, set, scene, for, election, gordon...  0
1  [army, chiefs, in, regiments, decision, milita...  0
2  [howard, denies, split, over, id, cards, micha...  0
3  [observers, to, monitor, uk, election, ministe...  0
4  [kilroy, names, election, seat, target, ex-cha...  0
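
Note that `str.split()` splits on whitespace only, so any punctuation stays attached to the neighbouring word, and `cards.` and `cards` would become separate vocabulary entries. Whether this matters depends on how clean the raw text is, which is not checked in this notebook. A minimal sketch of a regex-based alternative follows; the `tokenize` helper and the sample sentence are hypothetical and not part of the original notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, apostrophes and hyphens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())

# Made-up sentence for illustration only.
sample = "Howard denies split over ID cards."

whitespace_tokens = sample.lower().split()
regex_tokens = tokenize(sample)

print(whitespace_tokens)  # last token keeps the trailing full stop
print(regex_tokens)       # punctuation is dropped
```

Hyphenated words such as `ex-chat` stay in one piece with this pattern; a different pattern would be needed if they should be split.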


Random train-test split (80% training, 20% test)

In [7]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42) 
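
A purely random split does not guarantee that each class keeps the same proportion in the training and test sets. A stratified split samples each class separately to preserve those proportions. The sketch below illustrates the idea in plain Python; the `stratified_split` helper and the toy labels are assumptions for illustration, not the real dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class sends about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 50 rows of class 0, 30 of class 1, 20 of class 2 (made-up data).
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels)

print(Counter(toy_labels[i] for i in train_idx))
print(Counter(toy_labels[i] for i in test_idx))
```

In the notebook itself, the equivalent change would be to pass `stratify=df['Label']` to scikit-learn's `train_test_split`, which supports this argument directly.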

The text is encoded as integer sequences by building a vocabulary and assigning each token a unique index. Batches must contain sequences of equal length, so each text is truncated or padded to a fixed maximum length, and several values were tested. Although the average text is about 384 tokens long, the best results were achieved using only the first 50 tokens; shorter texts are padded with zeros.

In [8]:
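# Build the vocabulary from every token in the dataset; index 0 is reserved for padding
vocab = {word for phrase in df["Text"] for word in phrase}
word_to_idx = {word: idx for idx, word in enumerate(vocab, start=1)}

# Longest and average text length, in tokens
print(df["Text"].str.len().max())
print(df["Text"].str.len().mean())

# Keep only the first 50 tokens of each text
max_length = 50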

def encode_and_pad(text):
    encoded = [word_to_idx[word] for word in text]
    if len(encoded) >= max_length:
        encoded = encoded[:max_length]
    else:
        encoded = encoded + [0] * (max_length - len(encoded))
    return encoded


4432
384.04044943820224
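
The vocabulary above is built from the full DataFrame, so every test word has an index and `word_to_idx[word]` never fails. If the vocabulary were built from the training split only, unseen words would raise a `KeyError`. A minimal sketch of one common way to handle this reserves index 0 for padding and index 1 for unknown words. The names `build_vocab`, `encode_and_pad_unk`, `PAD_IDX`, and `UNK_IDX` and the toy sentences are assumptions for illustration.

```python
PAD_IDX = 0  # reserved for padding
UNK_IDX = 1  # reserved for words not seen when the vocabulary was built

def build_vocab(token_lists):
    """Map each distinct token to an index starting at 2, in sorted order for reproducibility."""
    words = sorted({word for tokens in token_lists for word in tokens})
    return {word: idx for idx, word in enumerate(words, start=2)}

def encode_and_pad_unk(tokens, word_to_idx, max_length):
    """Encode tokens, truncate to max_length, and right-pad with PAD_IDX."""
    encoded = [word_to_idx.get(word, UNK_IDX) for word in tokens[:max_length]]
    return encoded + [PAD_IDX] * (max_length - len(encoded))

# Made-up training tokens for illustration only.
toy_train_tokens = [["budget", "to", "set", "scene"], ["army", "chiefs"]]
toy_word_to_idx = build_vocab(toy_train_tokens)

# "resign" is not in the toy vocabulary, so it maps to UNK_IDX.
short_example = encode_and_pad_unk(["budget", "chiefs", "resign"], toy_word_to_idx, max_length=5)
# Longer inputs are truncated to the first max_length tokens.
long_example = encode_and_pad_unk(
    ["to", "set", "scene", "army", "budget", "chiefs"], toy_word_to_idx, max_length=4
)

print(short_example)
print(long_example)
```

With this scheme the embedding layer needs `len(toy_word_to_idx) + 2` rows, since indices 0 and 1 are reserved.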


In [9]:
train_data['Text'] = train_data['Text'].apply(encode_and_pad)
test_data['Text'] = test_data['Text'].apply(encode_and_pad)

Create dataset and dataloader

In [10]:
class TextDataset(Dataset):
    def __init__(self, data):
        self.texts = data['Text'].values
        self.labels = data['Label'].values
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return torch.tensor(text, dtype=torch.long), torch.tensor(label, dtype=torch.long)

train_dataset = TextDataset(train_data)
test_dataset = TextDataset(test_data)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

The RNN model has an embedding layer, an RNN layer, and a linear output layer.

In [11]:
class TextRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super(TextRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.embedding(x)
        # Use the layer's own hidden size rather than the global variable
        h0 = torch.zeros(1, x.size(0), self.rnn.hidden_size).to(x.device)
        out, _ = self.rnn(x, h0)
        out = self.fc(out[:, -1, :])
        return out

vocab_size = len(vocab) + 1
embed_size = 60
hidden_size = 60
output_size = 5  # Number of classes 
model = TextRNN(vocab_size, embed_size, hidden_size, output_size)
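
Because short texts are right-padded with zeros, `out[:, -1, :]` reads the hidden state after the RNN has also processed the padding tokens. With `max_length = 50` and an average length near 384 tokens, most texts here are truncated rather than padded, so the effect is probably small for this dataset. A sketch of one way to avoid it uses `pack_padded_sequence`, so that the final hidden state comes from each sequence's last real token. The class name `PackedTextRNN`, the toy sizes, and the convention that index 0 means padding are assumptions carried over from the encoding step.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedTextRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, output_size):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zeros
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # True length of each sequence = number of non-padding tokens (at least 1)
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        # h_n has shape (num_layers, batch, hidden_size); use the last (only) layer
        _, h_n = self.rnn(packed)
        return self.fc(h_n[-1])

# Toy sizes and token indices, for illustration only.
toy_model = PackedTextRNN(vocab_size=20, embed_size=8, hidden_size=8, output_size=5)
toy_batch = torch.tensor([[4, 7, 2, 0, 0], [3, 9, 5, 6, 1]])
logits = toy_model(toy_batch)
print(logits.shape)
```

This sketch assumes padding appears only at the end of a sequence, which holds for the `encode_and_pad` function above.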

The model is trained with the cross-entropy loss function and the Adam optimizer, using a learning rate of 0.001.

In [12]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for texts, labels in train_loader:
        outputs = model(texts)
        loss = criterion(outputs, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
    
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss / len(train_loader):.4f}')

Epoch [1/50], Loss: 1.6270
Epoch [2/50], Loss: 1.5479
Epoch [3/50], Loss: 1.4791
Epoch [4/50], Loss: 1.3814
Epoch [5/50], Loss: 1.2506
Epoch [6/50], Loss: 1.0887
Epoch [7/50], Loss: 0.8852
Epoch [8/50], Loss: 0.7070
Epoch [9/50], Loss: 0.5671
Epoch [10/50], Loss: 0.4264
Epoch [11/50], Loss: 0.2953
Epoch [12/50], Loss: 0.2091
Epoch [13/50], Loss: 0.1489
Epoch [14/50], Loss: 0.1118
Epoch [15/50], Loss: 0.1021
Epoch [16/50], Loss: 0.1075
Epoch [17/50], Loss: 0.0628
Epoch [18/50], Loss: 0.0391
Epoch [19/50], Loss: 0.0267
Epoch [20/50], Loss: 0.0200
Epoch [21/50], Loss: 0.0390
Epoch [22/50], Loss: 0.0836
Epoch [23/50], Loss: 0.0284
Epoch [24/50], Loss: 0.0176
Epoch [25/50], Loss: 0.0112
Epoch [26/50], Loss: 0.0090
Epoch [27/50], Loss: 0.0078
Epoch [28/50], Loss: 0.0068
Epoch [29/50], Loss: 0.0061
Epoch [30/50], Loss: 0.0055
Epoch [31/50], Loss: 0.0050
Epoch [32/50], Loss: 0.0045
Epoch [33/50], Loss: 0.0041
Epoch [34/50], Loss: 0.0038
Epoch [35/50], Loss: 0.0035
Epoch [36/50], Loss: 0.0032
...
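
The training loss keeps falling toward zero, which by itself says little about how the model generalizes. One common remedy is to hold out a validation set and stop training when the validation loss stops improving. A minimal sketch of that bookkeeping in plain Python follows; the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are hypothetical, not results from this notebook.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without a new best validation loss."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses that improve and then rise again.
toy_val_losses = [1.60, 1.45, 1.30, 1.28, 1.35, 1.41, 1.50, 1.62]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(toy_val_losses, start=1):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        print(f"Stopping at epoch {epoch}; best epoch was {stopper.best_epoch}")
        break
```

In a real training loop, `val_loss` would be computed on a held-out split after each epoch, and the model weights from the best epoch would typically be saved and restored.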

In [14]:
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for texts, labels in test_loader:
        outputs = model(texts)
        # Predicted class = index of the largest logit
        predicted = outputs.argmax(dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy: {accuracy:.2f}%')

Accuracy: 58.43%


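After 50 epochs, the model reaches an accuracy of 58.43% on the test set. The training loss falls to 0.0032 by epoch 36, so the gap between training and test performance suggests that the model is overfitting.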
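
A single accuracy figure is easier to interpret next to a baseline and a per-class breakdown. With five classes, always predicting the most frequent class gives a floor that depends on the class balance, which is not reported here. A small sketch in plain Python follows; the helper names and the toy label lists are assumptions, and in the notebook the inputs would be the `labels` and `predicted` values collected over `test_loader`.

```python
from collections import Counter, defaultdict

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent true label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted examples for each true class."""
    correct, total = defaultdict(int), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in sorted(total)}

# Made-up labels and predictions for illustration only.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

baseline = majority_baseline(toy_true)
class_acc = per_class_accuracy(toy_true, toy_pred)

print(f"Majority baseline: {baseline:.2f}")
print(class_acc)
```

A per-class view can show whether errors are spread evenly or concentrated in a few categories, which the overall accuracy alone cannot reveal.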