<a href="https://colab.research.google.com/github/soutrik71/MInMaxBERT/blob/main/DistilBertClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we experiment with DistilBERT on a text classification problem and see how it performs.
We draw mainly on ideas from:

https://github.com/Yorko/bert-finetuning-catalyst
https://www.kaggle.com/code/kashnitsky/distillbert-catalyst-amazon-product-reviews
https://huggingface.co/docs/transformers/model_doc/distilbert#distilbert
https://huggingface.co/blog/sentiment-analysis-python
https://medium.com/huggingface/distilbert-8cf3380435b5
https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb
https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb
https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb#scrollTo=I4R39UTxNKTk
https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb#scrollTo=8KIK6iRYOWr5

In [1]:
!pip install transformers



In [2]:
# Importing stock ml libraries
import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.model_selection import train_test_split
import logging
import os
import random
from typing import List, Mapping, Dict
from transformers import AutoConfig, AutoModel
import torch.nn as nn
logging.basicConfig(level=logging.DEBUG)

In [3]:
def set_seed(seed: int = 42) -> None:
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"Random seed set as {seed}")

In [4]:
# Set manual seed since nn.Parameter values are randomly initialized
set_seed(42)
# Set device cuda for GPU if it's available otherwise run on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
MAX_LEN = 512
BATCH_SIZE = 10
EPOCHS = 10
LEARNING_RATE = 1e-05

Random seed set as 42
cuda


In [5]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')


In [6]:
from transformers import AutoTokenizer
tokenizer_cp = AutoTokenizer.from_pretrained("distilbert-base-cased")

In [7]:
tokenizer

DistilBertTokenizer(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [8]:
tokenizer_cp

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

## Basic Data Preprocessing

In [9]:
full_df = pd.read_csv("https://raw.githubusercontent.com/Yorko/bert-finetuning-catalyst/main/data/sdg_classification/train_set_sdg_1_7_8_12_13_toy.csv")

In [10]:
full_df.head()

Unnamed: 0,eid,sdg_id,title,keywords,abstract,title_keywords_abstract
0,84895022699,13,GIS-based risk assessment for the Nile Delta c...,GIS Inundation Sea level rise,Sea level changes are caused by several natura...,[TITLE] gis-based risk assessment for the nile...
1,84978997581,1,Ritual well-being: toward a social signaling m...,Costly signaling religion and mental health re...,Religion is positively correlated with subject...,[TITLE] ritual well-being: toward a social sig...
2,61949197853,8,Calculation method of eco-environmental water ...,Dongchang lake Eco-environmental water demand ...,Quantity and quality are inseparable propertie...,[TITLE] calculation method of eco-environmenta...
3,84866626961,8,Labour market and human resources development:...,Challenges Employment Enterprises Human resour...,"The Human Resources Development Survey (HRDS),...",[TITLE] labour market and human resources deve...
4,85072723300,13,Spinel oxides as coke-resistant supports for N...,Carbon capture Chemical looping Coke inhibitio...,Due to their high activity for methane convers...,[TITLE] spinel oxides as coke-resistant suppor...


In [11]:
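# .copy() so that adding the `target` column later does not write into a view of the original frame
full_df = full_df[['title_keywords_abstract','sdg_id']].copy()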

In [12]:
full_df.shape

(150, 2)

In [13]:
full_df['sdg_id'].value_counts()

13    33
8     31
12    31
1     29
7     26
Name: sdg_id, dtype: int64

In [14]:
target_dict = dict(zip(sorted(full_df['sdg_id'].unique().tolist()), range(len(full_df['sdg_id'].unique()))))
print(target_dict)

{1: 0, 7: 1, 8: 2, 12: 3, 13: 4}
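
When reading predictions back, the class indices need to be mapped to the original SDG ids again. A minimal sketch, assuming the same mapping as `target_dict` above; the name `inverse_target_dict` and the example predictions are illustrative and not part of the original notebook:

```python
# Same mapping as printed above: SDG id -> contiguous class index
target_dict = {1: 0, 7: 1, 8: 2, 12: 3, 13: 4}

# Invert it so that a predicted class index can be turned back into an SDG id
inverse_target_dict = {idx: sdg_id for sdg_id, idx in target_dict.items()}

predicted_indices = [4, 0, 2]  # hypothetical model predictions
predicted_sdg_ids = [inverse_target_dict[i] for i in predicted_indices]
print(predicted_sdg_ids)  # [13, 1, 8]
```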


In [15]:
full_df['target'] = full_df['sdg_id'].map(target_dict)

In [16]:
full_df.head()

Unnamed: 0,title_keywords_abstract,sdg_id,target
0,[TITLE] gis-based risk assessment for the nile...,13,4
1,[TITLE] ritual well-being: toward a social sig...,1,0
2,[TITLE] calculation method of eco-environmenta...,8,2
3,[TITLE] labour market and human resources deve...,8,2
4,[TITLE] spinel oxides as coke-resistant suppor...,13,4


In [17]:
# train validation split
train_df, val_df = train_test_split(full_df, test_size=0.2, random_state=42, stratify=full_df['target'])

In [18]:
train_df.shape, val_df.shape

((120, 3), (30, 3))
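
To make the effect of `stratify` concrete, here is a plain-Python sketch of a stratified split. It is a simplified stand-in and not scikit-learn's actual algorithm; `stratified_split` is a hypothetical helper, and the class counts are the ones printed by `value_counts()` above:

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Split row indices so that each class contributes roughly test_fraction of its rows."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # per-class share of the test set
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Class counts taken from the value_counts() output above
counts = {13: 33, 8: 31, 12: 31, 1: 29, 7: 26}
labels = [sdg for sdg, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 120 30
```

Every class ends up in both parts, which is what keeps the label mapping identical for the training and validation data.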

In [19]:
NUM_CLASSES = len(target_dict)
print(NUM_CLASSES)

5


## Custom Torch Dataset Class

In [20]:
idx = np.random.randint(0, len(train_df))
sample_text = train_df.iloc[idx]['title_keywords_abstract']
sample_label = train_df.iloc[idx]['target']
print(sample_text)
print(sample_label)

[TITLE] multi-point, high-speed passive ion velocity distribution diagnostic on the pegasus toroidal experiment [KEYWORDS] missing [ABSTRACT] a passive ion temperature polychromator has been deployed on pegasus to study power balance and non-thermal ion distributions that arise during point source helicity injection. spectra are recorded from a 1 m f8.6 czerny-turner polychromator whose output is recorded by an intensified high-speed camera. the use of high orders allows for a dispersion of 0.02 åmm in 4th order and a bandpass of 0.14 å (∼13 kms) at 3131 å in 4th order with 100 μm entrance slit. the instrument temperature of the spectrometer is 15 ev. light from the output of an image intensifier in the spectrometer focal plane is coupled to a high-speed cmos camera. the system can accommodate up to 20 spatial points recorded at 0.5 ms time resolution. during helicity injection, stochastic magnetic fields keep t e low (∼100 ev) and thus low ionization impurities penetrate to the core. 

In [21]:
outputs = tokenizer.encode_plus(
    text = sample_text,
    add_special_tokens=True,
    padding="max_length",
    max_length=MAX_LEN,
    return_tensors="pt",
    truncation=True,
    return_attention_mask=True,
    return_token_type_ids=True
)

In [22]:
outputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
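
As a rough illustration of what `padding="max_length"` and `truncation=True` do to `input_ids` and `attention_mask`, here is a plain-Python sketch. `pad_and_mask` is a hypothetical helper, not a tokenizer API, and it ignores that the real tokenizer keeps `[SEP]` at the end when truncating. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids from the tokenizer repr above; the ids in between are made up:

```python
from typing import List, Tuple


def pad_and_mask(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad token ids to max_length and build the matching mask."""
    ids = token_ids[:max_length]             # truncation
    n_pad = max_length - len(ids)
    mask = [1] * len(ids) + [0] * n_pad      # 1 = real token, 0 = padding
    ids = ids + [pad_id] * n_pad             # right padding
    return ids, mask


ids, mask = pad_and_mask([101, 1188, 1110, 102], max_length=8)
print(ids)   # [101, 1188, 1110, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```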

In [23]:
ids = outputs['input_ids']

In [24]:
ids.squeeze(0).shape

torch.Size([512])

In [25]:
masks = outputs['attention_mask']
print(masks.shape)

torch.Size([1, 512])


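**With `return_tensors="pt"`, both `input_ids` and `attention_mask` come back with shape [1, n]. In the dataset class below, `input_ids` is squeezed to [n] but the mask keeps its extra dimension, so a batch of masks has shape [batch, 1, n]. The model used here handles that shape internally; if it did not, the extra dimension would have to be squeezed out manually.**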

In [26]:
class BertClassificationDataset(Dataset):
    """
    Wrapper around Torch Dataset to perform text classification
    """

    def __init__(
        self,
        texts: List[str],
        labels: List[str] = None,
        label_dict: Mapping[str, int] = None,
        max_seq_length: int = 512,
        model_name: str = "distilbert-base-uncased",
    ):
        """
        Args:
            texts (List[str]): a list with texts to classify or to train the
                classifier on
            labels (List[str]): a list with classification labels (optional)
            label_dict (dict): a dictionary mapping class names to class ids,
                to be passed to the validation data (optional)
            max_seq_length (int): maximal sequence length in tokens,
                texts will be truncated to this length
            model_name (str): transformer model name, needed to perform
                appropriate tokenization

        """

        self.texts = texts
        self.labels = labels
        self.label_dict = label_dict
        self.max_seq_length = max_seq_length

        if self.label_dict is None and labels is not None:
            self.label_dict = dict(zip(sorted(set(labels)), range(len(set(labels)))))

        self.tokenizer =  DistilBertTokenizer.from_pretrained(model_name)
        # suppresses tokenizer warnings
        logging.getLogger("transformers.tokenization_utils").setLevel(logging.FATAL)

    def __len__(self) -> int:
        """
        Returns:
            int: length of the dataset
        """
        return len(self.texts)

    def __getitem__(self, index) -> Mapping[str, torch.Tensor]:
        """Gets element of the dataset

        Args:
            index (int): index of the element in the dataset
        Returns:
            Single element by index
        """

        # encoding the text
        input_text = self.texts[index]

        # a dictionary with `input_ids` and `attention_mask` as keys
        output_dict  = self.tokenizer.encode_plus(
              text = input_text,
              add_special_tokens=True,
              padding="max_length",
              max_length=self.max_seq_length,
              return_tensors="pt",
              truncation=True,
              return_attention_mask=True,
              return_token_type_ids=True
          )

        # dealing with attention masks - there's a 1 for each input token and
        # if the sequence is shorter than `max_seq_length` then the rest is
        # padded with zeroes. Attention mask will be passed to the model in
        # order to compute attention scores only with input data

        # `input_ids` loses its leading dim; the mask keeps shape (1, seq_len)
        ids = output_dict['input_ids'].squeeze(0)
        mask = output_dict['attention_mask']

        item = {
            'ids': ids.long(),
            'mask': mask.long(),
        }

        # encoding target (skipped when the dataset is built without labels)
        if self.labels is not None:
            y = self.labels[index]
            item['targets'] = torch.tensor(self.label_dict[y], dtype=torch.long)

        return item


In [27]:
train_dataset = BertClassificationDataset(
        texts=train_df["title_keywords_abstract"].values.tolist(),
        labels=train_df["sdg_id"].values,
        max_seq_length=MAX_LEN,
        model_name="distilbert-base-cased",
)

In [28]:
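valid_dataset = BertClassificationDataset(
        texts=val_df["title_keywords_abstract"].values.tolist(),
        labels=val_df["sdg_id"].values,
        # reuse the training mapping so class ids cannot differ between the two datasets
        label_dict=train_dataset.label_dict,
        max_seq_length=MAX_LEN,
        model_name="distilbert-base-cased",
)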

In [29]:
# data loaders for the training and validation datasets
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

valid_loader = DataLoader(
    valid_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

In [30]:
for batch in train_loader:
    ids = batch['ids']
    mask = batch['mask']
    targets = batch['targets']
    print(ids.shape)
    print(mask.shape)
    print(targets.shape)
    break

torch.Size([10, 512])
torch.Size([10, 1, 512])
torch.Size([10])


## Model Building

In [31]:
class BertForSequenceClassification_A(nn.Module):
    """
    Simplified version of the same class by HuggingFace.
    See transformers/modeling_distilbert.py in the transformers repository.
    """

    def __init__(self, pretrained_model_name: str, num_classes: int = None, dropout: float = 0.1):
        super(BertForSequenceClassification_A, self).__init__()

        config = AutoConfig.from_pretrained(pretrained_model_name, num_labels=num_classes)
        print(config.hidden_size)

        self.model = AutoModel.from_pretrained(pretrained_model_name, config=config) # alternate DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.classifier = nn.Linear(config.hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, features, attention_mask=None, head_mask=None):

        assert attention_mask is not None, "attention mask is none"

        bert_output = self.model(input_ids=features, attention_mask=attention_mask)
        # index 0 of the model output is the last hidden state; nothing else is needed here
        seq_output = bert_output[0]  # (bs, seq_len, dim)
        # mean pooling, i.e. averaging the representations of all positions (padding included)
        pooled_output = seq_output.mean(axis=1)  # (bs, dim)
        pooled_output = self.dropout(pooled_output)  # (bs, dim)
        scores = self.classifier(pooled_output)  # (bs, num_classes)

        return scores
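
The mean in the forward pass averages over every position, including padded ones. One common alternative is to average only over positions where the attention mask is 1. The following plain-Python sketch contrasts the two on a single toy example; it is not what the notebook's model does, and `mean_pool` and `masked_mean_pool` are hypothetical helpers working on nested lists rather than tensors:

```python
from typing import List


def mean_pool(hidden: List[List[float]]) -> List[float]:
    """Average over all positions, padding included."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average only over positions whose mask value is 1."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no positions")
    return [sum(col) / len(kept) for col in zip(*kept)]


# One sequence with three positions and hidden size 2; the last position is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [1, 1, 0]

print(mean_pool(hidden))               # [3.0, 4.0]
print(masked_mean_pool(hidden, mask))  # [2.0, 3.0]
```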

In [55]:
baseline_model1 = BertForSequenceClassification_A("distilbert-base-cased",NUM_CLASSES).to(device)
baseline_model1

768


BertForSequenceClassification_A(
  (model): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
          

In [56]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  baseline_model1.parameters(), lr=LEARNING_RATE)
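
`CrossEntropyLoss` takes raw scores, applies a log-softmax, and returns the negative log-probability of the target class (averaged over the batch by default). A small worked example in plain Python, assuming a single example with five classes; `cross_entropy` here is a hypothetical helper that mirrors the formula, not the PyTorch implementation:

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Equal scores for all 5 classes, i.e. a classifier that has learned nothing yet
uniform_loss = cross_entropy([0.0] * 5, target=2)
print(round(uniform_loss, 4))  # 1.6094
```

This value is ln 5, so a loss of about 1.61 in the first training steps is consistent with a freshly initialized 5-class head.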

In [57]:
# Function to count the correct predictions in a batch (used to compute the accuracy of the model)

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct
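
The helper above returns a count, and the percentage is computed later by dividing by the number of examples seen. A plain-Python sketch of the same arithmetic on made-up scores; `argmax` and `accuracy_percent` are hypothetical names used only for illustration:

```python
from typing import List


def argmax(row: List[float]) -> int:
    """Index of the largest score in a row."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_percent(scores: List[List[float]], targets: List[int]) -> float:
    """Share of rows whose highest-scoring class equals the target, in percent."""
    n_correct = sum(int(argmax(row) == t) for row, t in zip(scores, targets))
    return 100.0 * n_correct / len(targets)


# Three examples, three classes; the first two predictions match their targets
scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
targets = [1, 0, 1]
acc = accuracy_percent(scores, targets)
print(acc)
```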

In [58]:
# Defining the training function on the 80% of the dataset for tuning the distilbert model

def train(epoch, model):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for step, data in enumerate(train_loader):
        ids = data['ids'].to(device)
        mask = data['mask'].to(device)
        targets = data['targets'].to(device)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        # predicted class = index of the highest score
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)

        # running averages, printed every 5000 steps (including step 0)
        if step%5000==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples
            print(f"Training Loss per steps: {loss_step}")
            print(f"Training Accuracy 5000 steps: {accu_step}")

        # backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return
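
With 120 training rows and `BATCH_SIZE = 10`, the `% 5000` check in the loop is only true at the first step of each epoch, so the "per steps" lines report the first batch alone. A short sketch of that arithmetic; `logged_steps` is a hypothetical helper and assumes the `DataLoader` default of keeping the last partial batch:

```python
import math
from typing import List, Tuple


def logged_steps(n_examples: int, batch_size: int, log_every: int) -> Tuple[int, List[int]]:
    """Return the number of steps per epoch and the step indices that trigger logging."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    logged = [s for s in range(steps_per_epoch) if s % log_every == 0]
    return steps_per_epoch, logged


steps, logged = logged_steps(n_examples=120, batch_size=10, log_every=5000)
print(steps, logged)  # 12 [0]
```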

In [59]:
def valid(model):
    model.eval()
    n_correct = 0
    tr_loss = 0
    nb_tr_examples=0
    nb_tr_steps=0
    with torch.no_grad():
        for step, data in enumerate(valid_loader):
            ids = data['ids'].to(device)
            mask = data['mask'].to(device)
            targets = data['targets'].to(device)
            # no .squeeze() here: it would drop the batch dim for a batch of size 1
            outputs = model(ids, mask)
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)

            if step%5000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per steps: {loss_step}")
                print(f"Validation Accuracy per steps: {accu_step}")
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")

    return epoch_accu


In [60]:
for epoch in range(EPOCHS):
    print(f'\n Epoch {epoch + 1} / {EPOCHS}')
    train(epoch, baseline_model1)
    valid(baseline_model1)


 Epoch 1 / 10
Training Loss per steps: 1.6087017059326172
Training Accuracy 5000 steps: 0.0
The Total Accuracy for Epoch 0: 20.0
Training Loss Epoch: 1.6189558605353038
Training Accuracy Epoch: 20.0
Validation Loss per steps: 1.6111980676651
Validation Accuracy per steps: 20.0
Validation Loss Epoch: 1.5884328285853069
Validation Accuracy Epoch: 23.333333333333332

 Epoch 2 / 10
Training Loss per steps: 1.6124054193496704
Training Accuracy 5000 steps: 10.0
The Total Accuracy for Epoch 1: 26.666666666666668
Training Loss Epoch: 1.5617998242378235
Training Accuracy Epoch: 26.666666666666668
Validation Loss per steps: 1.5373189449310303
Validation Accuracy per steps: 30.0
Validation Loss Epoch: 1.5555882851282756
Validation Accuracy Epoch: 26.666666666666668

 Epoch 3 / 10
Training Loss per steps: 1.5430573225021362
Training Accuracy 5000 steps: 30.0
The Total Accuracy for Epoch 2: 44.166666666666664
Training Loss Epoch: 1.4891491134961445
Training Accuracy Epoch: 44.166666666666664
Valid