# **Toxic Comment Classification**



## **Introduction**

In this educational project, we fine-tune a transformer model for a **multilabel text classification** problem.

<br>

---

<br>

#### **Abstract**

This project is an educational endeavor that guides you through the process of fine-tuning a transformer model for multilabel text classification. The project is organized into distinct steps, including importing necessary libraries, preprocessing domain-specific data, setting up datasets and dataloaders, creating a neural network for fine-tuning, executing the fine-tuning process, evaluating model performance, and finally, saving the model and relevant artifacts for future inference.

To facilitate this, the Jigsaw toxic comment dataset from Kaggle is employed. The dataset consists of comment texts marked with labels like toxic, severe_toxic, obscene, threat, insult, and identity_hate. The model architecture chosen for fine-tuning is BERT, a powerful transformer model developed by Google AI.

This practical project equips learners with hands-on experience in the realms of multilabel text classification and transformer-based NLP models.

<a id='section01'></a>
### **1. Importing Libraries**

At this step we will be importing the libraries and modules needed to run our script.

In [None]:
# Installing the transformers library and any additional libraries it needs

!pip install -q transformers


In [None]:
# Importing stock ml libraries
import numpy as np
import pandas as pd
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertModel, BertConfig


In [None]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

<a id='section02'></a>
### **2. Importing and Pre-Processing the domain data**

In this step we load the data and prepare it for fine-tuning.

*  Load `train.csv` into a dataframe; the file's header row supplies the column names.
*  Collect the label columns (every column after the first two) into a single list per comment.
*  Store that list in a new `list` column and keep only `comment_text` and `list`.

In [None]:
df = pd.read_csv("train.csv")
df['list'] = df[df.columns[2:]].values.tolist()
new_df = df[['comment_text', 'list']].copy()
new_df.head()

|   | comment_text | list |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | [0, 0, 0, 0, 0, 0] |
| 1 | D'aww! He matches this background colour I'm s... | [0, 0, 0, 0, 0, 0] |
| 2 | Hey man, I'm really not trying to edit war. It... | [0, 0, 0, 0, 0, 0] |
| 3 | "\nMore\nI can't make any real suggestions on ... | [0, 0, 0, 0, 0, 0] |
| 4 | You, sir, are my hero. Any chance you remember... | [0, 0, 0, 0, 0, 0] |
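The label-collapsing step can be tried on a tiny hand-made dataframe. This is a minimal sketch: the two rows are invented for illustration, and the column names follow the six labels listed in the abstract. It selects the label columns by name instead of by position, which is an alternative to `df.columns[2:]` that does not depend on column order.

```python
import pandas as pd

# Invented two-row stand-in for train.csv (illustration only).
toy = pd.DataFrame({
    "id": ["a1", "b2"],
    "comment_text": ["first comment", "second comment"],
    "toxic": [0, 1],
    "severe_toxic": [0, 0],
    "obscene": [0, 1],
    "threat": [0, 0],
    "insult": [0, 1],
    "identity_hate": [0, 0],
})

label_cols = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]

# One list of six 0/1 flags per comment.
toy["list"] = toy[label_cols].values.tolist()

# Keep only the text and the collapsed labels.
toy_small = toy[["comment_text", "list"]].copy()
print(toy_small)
```

Each entry of `list` is a multi-hot vector, so a comment can carry several labels at once or none at all.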


<a id='section03'></a>
### **3. Preparing the Dataset and Dataloader**

Let's begin by defining the key variables used in the subsequent fine-tuning phase. Next, we write the `CustomDataset` class, which defines how each comment is preprocessed before it enters the neural network. Finally, we set up the `DataLoader`, which feeds data to the neural network in batches.

`Dataset` and `DataLoader` are PyTorch classes that handle data preprocessing and the efficient, batched transfer of that data to the neural network.
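To make the fixed-length preprocessing concrete, here is a minimal pure-Python sketch of what padding and truncating to a maximum length amount to. The helper `encode_fixed_length` is hypothetical and only mimics the shape of the tokenizer's output; it is not the tokenizer's implementation. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids in the `bert-base-uncased` vocabulary, while the other token ids are arbitrary placeholders.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def encode_fixed_length(token_ids, max_len):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    # Leave room for [CLS] and [SEP], truncating the comment if needed.
    body = token_ids[: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)          # 1 marks real tokens

    # Pad on the right; 0 in the mask marks padding to be ignored.
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad


short_ids, short_mask = encode_fixed_length([2001, 2002], max_len=6)
long_ids, long_mask = encode_fixed_length(list(range(1000, 1010)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the examples in a batch can be stacked into a single tensor.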


In [None]:
# Sections of config

# Defining some key variables that will be used later on in the training
MAX_LEN = 200
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [None]:
class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.comment_text = dataframe.comment_text
        self.targets = self.data.list
        self.max_len = max_len

    def __len__(self):
        return len(self.comment_text)

    def __getitem__(self, index):
        comment_text = str(self.comment_text[index])
        comment_text = " ".join(comment_text.split())

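        inputs = self.tokenizer.encode_plus(
            comment_text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,       # cut comments longer than max_len tokens
            return_token_type_ids=True
        )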
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [None]:
# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_dataset=new_df.sample(frac=train_size,random_state=200)
test_dataset=new_df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = CustomDataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (159571, 2)
TRAIN Dataset: (127657, 2)
TEST Dataset: (31914, 2)


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,  # evaluation does not need shuffling
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section04'></a>
### **4. Creating the Neural Network for Fine Tuning**

Our neural network, defined by the `BERTClass` class, consists of a `BertModel` followed by a `Dropout` layer for regularization and a `Linear` layer for classification. During the forward pass, `BertModel` returns two outputs; the second one (the pooled output) passes through Dropout before reaching the Linear layer.

The Linear layer has **6** output dimensions, one per category we classify. Each batch from the DataLoader flows through `BERTClass`, and the final layer's output is used to compute the loss and, later, to measure prediction quality. We create an instance called `model`, which is trained here and can be saved afterwards for future use.
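The tensor shapes in this design can be checked without downloading any pretrained weights. The sketch below is an illustration under stated assumptions: `ToyHead` is a hypothetical stand-in for the Dropout and Linear layers only, and a random tensor takes the place of BERT's pooled output (hidden size 768).

```python
import torch


class ToyHead(torch.nn.Module):
    """Dropout + Linear head with the same shapes as the layers on top of BERT."""

    def __init__(self, hidden_size=768, num_labels=6, dropout=0.3):
        super().__init__()
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        return self.classifier(self.dropout(pooled_output))


torch.manual_seed(0)
pooled = torch.randn(4, 768)      # stand-in for the pooled output of 4 comments
head = ToyHead().eval()           # eval() disables dropout
logits = head(pooled)             # one raw score per label
probs = torch.sigmoid(logits)     # independent probability per label

print(logits.shape)               # torch.Size([4, 6])
```

The six outputs are raw logits. Each label gets its own sigmoid rather than a shared softmax, because a comment may belong to several categories at once.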

In [None]:
# Creating the customized model by adding a dropout and a dense layer on top of BERT to get the final output for the model.

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased', return_dict=False)
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 6)

    def forward(self, ids, mask, token_type_ids):
        _, output_1= self.l1(ids, attention_mask = mask, token_type_ids = token_type_ids)
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

model = BERTClass()
model.to(device)

BERTClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

In [None]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)
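To show what this loss computes, here is a pure-Python sketch of binary cross-entropy on logits, averaged over every label of every example (the default `mean` reduction). The function `bce_with_logits` is a hypothetical re-implementation written for illustration; the notebook itself relies on `torch.nn.BCEWithLogitsLoss`.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (example, label) pairs."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            # Numerically stable form of
            # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# All-zero logits mean "50% for every label", whatever the targets are.
uninformed = bce_with_logits([[0.0] * 6], [[0, 1, 0, 0, 0, 1]])

# Confident and correct logits give a loss close to zero.
confident = bce_with_logits([[-8.0, 8.0, -8.0, -8.0, -8.0, 8.0]],
                            [[0, 1, 0, 0, 0, 1]])
print(uninformed, confident)
```

A model that outputs zero logits for every label scores ln 2 ≈ 0.693, which is a useful reference point when reading the first loss value printed during training.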

In [None]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section05'></a>
### **5. Fine Tuning the Model**

With data loading, preparation, dataset creation, model definition, and the loss and optimizer in place, the remaining steps are comparatively simple.

In this phase, we define a training function that fine-tunes the model. Fine-tuning runs for a specified number of passes called epochs (the `EPOCHS` variable). Each epoch is one complete pass of the training dataset through the network.
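As a quick worked calculation, the number of optimizer steps in one epoch follows from the training set size and the batch size reported earlier. The sketch assumes the `DataLoader` keeps the final partial batch, which is its default behaviour, and uses the 5000-batch logging interval of the training loop in this section.

```python
import math

train_rows = 127657   # TRAIN Dataset size printed earlier
batch_size = 8        # TRAIN_BATCH_SIZE
log_every = 5000      # logging interval used in the training loop

# The last batch is smaller than 8 but still counts as a step.
steps_per_epoch = math.ceil(train_rows / batch_size)

# Batch indices at which the loss is printed.
logged_steps = list(range(0, steps_per_epoch, log_every))

print(steps_per_epoch)  # 15958
print(logged_steps)     # [0, 5000, 10000, 15000]
```

Four logging points per epoch is consistent with the four loss lines printed for the single epoch run here.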

In [None]:
def train(epoch):
    model.train()
    for step, data in enumerate(training_loader):
        # Move the batch to the selected device
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)

        # Log the loss every 5000 batches
        if step % 5000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        # Clear old gradients, backpropagate, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [None]:
for epoch in range(EPOCHS):
    train(epoch)

Epoch: 0, Loss:  0.7814452648162842
Epoch: 0, Loss:  0.0026844427920877934
Epoch: 0, Loss:  0.001222499879077077
Epoch: 0, Loss:  0.03572620078921318


<a id='section06'></a>
### **6. Validating the Model**

During the validation stage we pass unseen data (the testing dataset) to the model. This step shows how well the model performs on data it has not seen during training.

This unseen data is the 20% of `train.csv` that was separated during the dataset creation stage.
During validation the weights of the model are not updated. The final outputs are only compared with the actual labels, and this comparison is used to calculate the model's performance metrics.
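Before outputs can be compared with labels, each raw score has to become a yes/no decision. The sketch below shows that step in pure Python, assuming a sigmoid followed by a 0.5 cut-off as used in this notebook's evaluation; `predict_labels` is a hypothetical helper and the logits are invented.

```python
import math


def predict_labels(logits, threshold=0.5):
    """Turn one comment's raw logits into independent 0/1 label decisions."""
    decisions = []
    for z in logits:
        prob = 1.0 / (1.0 + math.exp(-z))   # sigmoid
        decisions.append(1 if prob >= threshold else 0)
    return decisions


example_logits = [2.0, -1.0, 0.0, -3.5, 0.1, -0.1]
predicted = predict_labels(example_logits)
print(predicted)  # [1, 0, 1, 0, 1, 0]
```

With a 0.5 threshold this is equivalent to checking whether each logit is non-negative. A different threshold trades precision against recall for every label.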

To measure the model's performance, we use the following metrics:
- ***Accuracy Score***
- ***F1 Micro***
- ***F1 Macro***
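The three metrics can disagree noticeably on multilabel data. The following pure-Python sketch re-implements them on an invented toy example with four comments and three labels; the helper functions are hypothetical and written only to expose the definitions, whereas the notebook uses `sklearn.metrics`.

```python
def subset_accuracy(y_true, y_pred):
    """Share of examples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_counts(y_true, y_pred, j):
    """True positives, false positives, false negatives for label j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    # Macro: average the per-label F1 scores, every label weighted equally.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    return micro, macro


y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]  # the rare third label is missed

acc = subset_accuracy(y_true, y_pred)
micro, macro = micro_macro_f1(y_true, y_pred)
print(acc, micro, macro)
```

Here the single missed rare label costs one exact match, lowers micro F1 only slightly, and pulls macro F1 down the most, because that label's F1 of zero counts as a full third of the macro average.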


In [None]:
def validation(epoch):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [None]:
for epoch in range(EPOCHS):
    outputs, targets = validation(epoch)
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    print(f"Accuracy Score = {accuracy}")
    print(f"F1 Score (Micro) = {f1_score_micro}")
    print(f"F1 Score (Macro) = {f1_score_macro}")

Accuracy Score = 0.9276806417246349
F1 Score (Micro) = 0.7693155863589823
F1 Score (Macro) = 0.5599649075606182


---

<br>

## **Conclusion**

Conclusions from One Epoch of Fine-Tuning BERT Transformer for Multilabel Classification:

***Accuracy Assessment:*** After a single epoch of fine-tuning the BERT transformer model, an accuracy score of approximately 92.77% was achieved. For multilabel targets, scikit-learn's `accuracy_score` is the subset accuracy: the proportion of comments for which all six labels were predicted correctly.

***F1 Score (Micro) Observation:*** The micro-averaged F1 score, which balances precision and recall, stood at around 0.77. It pools true positives, false positives, and false negatives across all categories before computing F1, so the most frequent labels weigh most heavily.

***F1 Score (Macro) Insight:*** The macro-averaged F1 score, approximately 0.56, is the unweighted mean of the per-category F1 scores, so rare categories count as much as common ones. That it is well below the micro score suggests the model does worse on at least some of the less frequent categories.

---

<br>

These initial results from one epoch of fine-tuning indicate promising performance, showcasing the BERT transformer's capability to effectively handle multilabel classification tasks. However, further epochs and evaluations will be essential to solidify the model's performance trends and potential refinements.