<a href="https://colab.research.google.com/github/soutrik71/MInMaxBERT/blob/main/notebook/DistilBertClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we experiment with DistilBERT on classification problems and see how it performs.
We mainly draw on ideas from:

https://github.com/Yorko/bert-finetuning-catalyst
https://www.kaggle.com/code/kashnitsky/distillbert-catalyst-amazon-product-reviews
https://huggingface.co/docs/transformers/model_doc/distilbert#distilbert
https://huggingface.co/blog/sentiment-analysis-python
https://medium.com/huggingface/distilbert-8cf3380435b5
https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb
https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb
https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb#scrollTo=I4R39UTxNKTk
https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb#scrollTo=8KIK6iRYOWr5

In [1]:
!pip install transformers
!pip install torcheval

Collecting torcheval
  Downloading torcheval-0.0.7-py3-none-any.whl (179 kB)
Installing collected packages: torcheval
Successfully installed torcheval-0.0.7


In [2]:
# Importing stock ml libraries
import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.model_selection import train_test_split
import logging
import os
import random
from typing import List, Mapping, Dict
from transformers import AutoConfig, AutoModel
import torch.nn as nn
from torcheval.metrics import MulticlassAccuracy,BinaryAccuracy
logging.basicConfig(level=logging.DEBUG)

In [3]:
def set_seed(seed: int = 42) -> None:
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"Random seed set as {seed}")

In [4]:
# Set manual seed since nn.Parameter values are randomly initialized
set_seed(42)
# Set device cuda for GPU if it's available otherwise run on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
MAX_LEN = 512
BATCH_SIZE = 10
EPOCHS = 10
LEARNING_RATE = 1e-05

Random seed set as 42
cuda


In [5]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')


In [6]:
from transformers import AutoTokenizer
tokenizer_cp = AutoTokenizer.from_pretrained("distilbert-base-cased")

In [7]:
tokenizer

DistilBertTokenizer(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [8]:
tokenizer_cp

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

## Basic Data Preprocessing

In [20]:
full_df = pd.read_csv("https://raw.githubusercontent.com/Yorko/bert-finetuning-catalyst/main/data/sdg_classification/train_set_sdg_1_7_8_12_13_toy.csv")

In [21]:
full_df.head()

Unnamed: 0,eid,sdg_id,title,keywords,abstract,title_keywords_abstract
0,84895022699,13,GIS-based risk assessment for the Nile Delta c...,GIS Inundation Sea level rise,Sea level changes are caused by several natura...,[TITLE] gis-based risk assessment for the nile...
1,84978997581,1,Ritual well-being: toward a social signaling m...,Costly signaling religion and mental health re...,Religion is positively correlated with subject...,[TITLE] ritual well-being: toward a social sig...
2,61949197853,8,Calculation method of eco-environmental water ...,Dongchang lake Eco-environmental water demand ...,Quantity and quality are inseparable propertie...,[TITLE] calculation method of eco-environmenta...
3,84866626961,8,Labour market and human resources development:...,Challenges Employment Enterprises Human resour...,"The Human Resources Development Survey (HRDS),...",[TITLE] labour market and human resources deve...
4,85072723300,13,Spinel oxides as coke-resistant supports for N...,Carbon capture Chemical looping Coke inhibitio...,Due to their high activity for methane convers...,[TITLE] spinel oxides as coke-resistant suppor...


In [22]:
full_df.shape

(150, 6)

In [27]:
target_dict = dict(zip(sorted(set(full_df['sdg_id'])), range(len(set(full_df['sdg_id'])))))

In [29]:
full_df['target'] = full_df['sdg_id'].map(target_dict)
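
The mapping above turns the sparse SDG ids into contiguous class indices, which is the form `CrossEntropyLoss` expects. As a minimal sketch of the same idea, the snippet below builds the forward mapping and an inverse mapping for turning predicted indices back into SDG ids. The `sdg_ids` list is a hypothetical stand-in for the `sdg_id` column, and `index_to_label` is a name introduced here for illustration only.

```python
# Hypothetical sample of SDG ids standing in for full_df['sdg_id']
sdg_ids = [13, 1, 8, 8, 13, 7, 12]

# Sorted unique labels -> contiguous indices 0..n-1
label_to_index = {label: i for i, label in enumerate(sorted(set(sdg_ids)))}
# Inverse mapping, useful for reporting predictions as SDG ids
index_to_label = {i: label for label, i in label_to_index.items()}

encoded = [label_to_index[s] for s in sdg_ids]
decoded = [index_to_label[i] for i in encoded]

print(label_to_index)
print(encoded)
```

Keeping the inverse mapping next to the forward one avoids having to rebuild it when inspecting predictions later.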

In [30]:
full_df.head()

Unnamed: 0,eid,sdg_id,title,keywords,abstract,title_keywords_abstract,target
0,84895022699,13,GIS-based risk assessment for the Nile Delta c...,GIS Inundation Sea level rise,Sea level changes are caused by several natura...,[TITLE] gis-based risk assessment for the nile...,4
1,84978997581,1,Ritual well-being: toward a social signaling m...,Costly signaling religion and mental health re...,Religion is positively correlated with subject...,[TITLE] ritual well-being: toward a social sig...,0
2,61949197853,8,Calculation method of eco-environmental water ...,Dongchang lake Eco-environmental water demand ...,Quantity and quality are inseparable propertie...,[TITLE] calculation method of eco-environmenta...,2
3,84866626961,8,Labour market and human resources development:...,Challenges Employment Enterprises Human resour...,"The Human Resources Development Survey (HRDS),...",[TITLE] labour market and human resources deve...,2
4,85072723300,13,Spinel oxides as coke-resistant supports for N...,Carbon capture Chemical looping Coke inhibitio...,Due to their high activity for methane convers...,[TITLE] spinel oxides as coke-resistant suppor...,4


In [31]:
# train validation split
train_df, val_df = train_test_split(full_df, test_size=0.2, random_state=42, stratify=full_df['target'])

In [32]:
train_df.shape, val_df.shape

((120, 7), (30, 7))

In [33]:
NUM_CLASSES = len(target_dict)
print(NUM_CLASSES)

5


## Custom Torch Dataset Class

In [37]:
idx = np.random.randint(0, len(train_df))
sample_text = train_df.iloc[idx]['title_keywords_abstract']
sample_label = train_df.iloc[idx]['target']
print(sample_text)
print(sample_label)

[TITLE] the pre-aksumite and aksumite settlement of ne tigrai, ethiopia [KEYWORDS] missing [ABSTRACT] the first systematic archaeological survey conducted in ne tigrai has produced new insights into the settlement history of the pre-aksumite and aksumite kingdoms (800 b.c.-a.d. 700) of northern ethiopia. results of settlement data and ceramic and lithic artifact analyses from gulo-makeda indicate that the region experienced marked continuity in site occupations through time, suggesting a degree of political and economic stability that contrasts to the aksum-yeha regions. cultural links to eritrea including matara and the ancient ona culture are evident in ceramics dating to pre-aksumite and later middle to late aksumite times. sites in gulo-makeda are strategically located along historically known trade routes in areas with moderate to high water flow potential, suggesting that control of trade and high agricultural productivity were factors in the development of elite groups in the re

In [38]:
outputs = tokenizer.encode_plus(
    text = sample_text,
    add_special_tokens=True,
    padding="max_length",
    max_length=MAX_LEN,
    return_tensors="pt",
    truncation=True,
    return_attention_mask=True,
    return_token_type_ids=True
)

In [39]:
outputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [40]:
ids = outputs['input_ids']

In [41]:
ids.squeeze(0).shape

torch.Size([512])

In [42]:
masks = outputs['attention_mask']
print(masks.shape)

torch.Size([1, 512])


**The tokenizer returns both the input IDs and the attention mask with shape [1, n]. The mask's extra leading dimension is left in place on the assumption that it is handled internally; if it is not, it has to be squeezed out manually.**

In [43]:
class BertClassificationDataset(Dataset):
    """
    Wrapper around Torch Dataset to perform text classification
    """

    def __init__(
        self,
        texts: List[str],
        labels: List[str] = None,
        label_dict: Mapping[str, int] = None,
        max_seq_length: int = 512,
        model_name: str = "distilbert-base-uncased",
    ):
        """
        Args:
            texts (List[str]): a list with texts to classify or to train the
                classifier on
            labels (List[str]): a list with classification labels (optional)
            label_dict (dict): a dictionary mapping class names to class ids,
                to be passed to the validation data (optional)
            max_seq_length (int): maximal sequence length in tokens,
                texts will be truncated to this length
            model_name (str): transformer model name, needed to perform
                appropriate tokenization

        """

        self.texts = texts
        self.labels = labels
        self.label_dict = label_dict
        self.max_seq_length = max_seq_length

        if self.label_dict is None and labels is not None:
            self.label_dict = dict(zip(sorted(set(labels)), range(len(set(labels)))))

        self.tokenizer =  DistilBertTokenizer.from_pretrained(model_name)
        # suppresses tokenizer warnings
        logging.getLogger("transformers.tokenization_utils").setLevel(logging.FATAL)

    def __len__(self) -> int:
        """
        Returns:
            int: length of the dataset
        """
        return len(self.texts)

    def __getitem__(self, index) -> Mapping[str, torch.Tensor]:
        """Gets element of the dataset

        Args:
            index (int): index of the element in the dataset
        Returns:
            Single element by index
        """

        # encoding the text
        input_text = self.texts[index]

        # a dictionary with `input_ids` and `attention_mask` as keys
        output_dict  = self.tokenizer.encode_plus(
              text = input_text,
              add_special_tokens=True,
              padding="max_length",
              max_length=self.max_seq_length,
              return_tensors="pt",
              truncation=True,
              return_attention_mask=True,
              return_token_type_ids=True
          )

        # dealing with attention masks - there's a 1 for each input token and
        # if the sequence is shorter than `max_seq_length` then the rest is
        # padded with zeroes. Attention mask will be passed to the model in
        # order to compute attention scores only with input data

        # `return_tensors="pt"` already yields tensors, so cast them instead
        # of re-wrapping them with `torch.tensor`
        ids = output_dict['input_ids'].squeeze(0).long()
        mask = output_dict['attention_mask'].long()  # shape (1, max_seq_length)

        item = {'ids': ids, 'mask': mask}

        # encoding target (skipped when no labels are given, e.g. at inference)
        if self.labels is not None:
            y = self.labels[index]
            item['targets'] = torch.tensor(self.label_dict[y], dtype=torch.long)

        return item
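
To make the padding and attention-mask comment in `__getitem__` concrete, here is a dependency-free sketch of what `padding="max_length"` with `truncation=True` produces conceptually. `pad_and_mask` is a hypothetical helper, not a tokenizer API, and the token ids in the example are made up apart from `101` (`[CLS]`), `102` (`[SEP]`) and `0` (`[PAD]`), which match the tokenizer printout above. It simplifies truncation: the real tokenizer keeps the closing `[SEP]` when it truncates, while this sketch just cuts the list.

```python
from typing import List, Tuple


def pad_and_mask(token_ids: List[int], max_len: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Cut or pad a token id list to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]            # simplified truncation
    n_pad = max_len - len(ids)           # number of padding positions to add
    mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask


# Hypothetical short sequence: [CLS] tok tok [SEP]
ids, mask = pad_and_mask([101, 1188, 1110, 102], max_len=6)
print(ids)
print(mask)
```

The mask is what lets the model ignore the trailing `[PAD]` positions when computing attention.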


In [44]:
train_dataset = BertClassificationDataset(
        texts=train_df["title_keywords_abstract"].values.tolist(),
        labels=train_df["sdg_id"].values,
        max_seq_length=MAX_LEN,
        model_name="distilbert-base-cased",
        label_dict=target_dict
)

In [45]:
valid_dataset = BertClassificationDataset(
        texts=val_df["title_keywords_abstract"].values.tolist(),
        labels=val_df["sdg_id"].values,
        max_seq_length=MAX_LEN,
        model_name="distilbert-base-cased",
        label_dict=target_dict
)

In [46]:
# data loaders
train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

valid_loader = DataLoader(
    valid_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
)

In [47]:
for batch in train_loader:
    ids = batch['ids']
    mask = batch['mask']
    targets = batch['targets']
    print(ids.shape)
    print(mask.shape)
    print(targets.shape)
    break

torch.Size([10, 512])
torch.Size([10, 1, 512])
torch.Size([10])


## Model Building

### Baseline Models using Distil Bert Classifier

In [54]:
class BertForSequenceClassification_A(nn.Module):
    """
    Simplified version of the same class by HuggingFace.
    See transformers/modeling_distilbert.py in the transformers repository.
    """

    def __init__(self, pretrained_model_name: str, num_classes: int = None, dropout: float = 0.1):
        super(BertForSequenceClassification_A, self).__init__()

        config = AutoConfig.from_pretrained(pretrained_model_name, num_labels=num_classes)
        print(config.hidden_size)

        self.model = AutoModel.from_pretrained(pretrained_model_name, config=config) # alternate DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.classifier = nn.Linear(config.hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, features, attention_mask=None, head_mask=None):

        assert attention_mask is not None, "attention mask is none"

        bert_output = self.model(input_ids=features, attention_mask=attention_mask)
        # we only need the hidden state here and don't need transformer output, so index 0
        seq_output = bert_output[0]  # (bs, seq_len, dim)
        # mean pooling, i.e. getting average representation of all tokens
        pooled_output = seq_output.mean(axis=1)  # (bs, dim)
        pooled_output = self.dropout(pooled_output)  # (bs, dim)
        scores = self.classifier(pooled_output)  # (bs, num_classes)
        return scores
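
The mean pooling in `forward` averages over all `seq_len` positions, so padded positions contribute to the average as well. One possible alternative, shown here only as a sketch and not as what this notebook trains with, is to average over the real tokens alone by using the attention mask. `masked_mean_pool` is a hypothetical helper; it accepts masks shaped `(bs, seq_len)` or `(bs, 1, seq_len)`, the latter being the shape the data loader above yields.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over non-padding positions only."""
    bs, seq_len, _ = hidden.shape
    # Bring the mask to (bs, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask.reshape(bs, seq_len, 1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)      # (bs, dim), padding zeroed out
    counts = mask.sum(dim=1).clamp(min=1.0)  # (bs, 1), avoids division by zero
    return summed / counts


# Toy batch: two real tokens followed by one padding position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # (1, 3, 2)
mask = torch.tensor([[[1, 1, 0]]])                                  # (1, 1, 3)

pooled = masked_mean_pool(hidden, mask)  # ignores the padding vector
plain = hidden.mean(dim=1)               # includes the padding vector
print(pooled, plain)
```

Whether this helps on the dataset used here is untested; with `MAX_LEN = 512` and abstracts of varying length, the share of padding per sample can differ a lot, which is the situation where the two poolings diverge most.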

In [55]:
baseline_model1 = BertForSequenceClassification_A("distilbert-base-cased",NUM_CLASSES).to(device)
baseline_model1

768


BertForSequenceClassification_A(
  (model): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
          

In [56]:
# Creating the loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  baseline_model1.parameters(), lr=LEARNING_RATE)
metric = MulticlassAccuracy(device = device, num_classes=NUM_CLASSES)

In [57]:
def train_module(model:torch.nn.Module,
                 device:torch.device,
                 train_dataloader:torch.utils.data.DataLoader ,
                 optimizer:torch.optim.Optimizer,
                 criterion:torch.nn.Module,
                 metric,
                 train_losses:list,
                 train_metrics:list):

  # setting model to train mode
  model.train()
  pbar = tqdm(train_dataloader)

  # batch metrics
  train_loss = 0
  train_metric = 0
  processed_batch = 0

  for _,data in enumerate(pbar):
    ids = data['ids'].to(device)
    mask = data['mask'].to(device)
    targets = data['targets'].to(device)

    outputs = model(ids, mask).squeeze()
    # calc loss
    loss = criterion(outputs, targets)
    train_loss += loss.item()
    # print(f"training loss for batch {idx} is {loss}")

    # backpropagation
    optimizer.zero_grad() # clear gradients left over from the previous step
    loss.backward() # compute gradients of the loss w.r.t. the weights
    optimizer.step() # update the weights using those gradients

    # metric calc
    preds = torch.argmax(outputs,dim=-1)
    # print(f"preds:: {preds}")
    metric.update(preds,targets)
    train_metric += metric.compute().detach().item()

    #updating batch count
    processed_batch += 1

    pbar.set_description(f"Avg Train Loss: {train_loss/processed_batch} Avg Train Metric: {train_metric/processed_batch}")

  # reset the accumulated metric state once the epoch completes
  metric.reset()
  # updating epoch metrics
  train_losses.append(train_loss/processed_batch)
  train_metrics.append(train_metric/processed_batch)

  return train_losses, train_metrics
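
A note on how to read the "Avg Train Metric" value: the metric object accumulates state across `update` calls until `reset`, so `metric.compute()` inside the loop returns the accuracy over all batches seen so far in the epoch. Summing those values and dividing by the batch count therefore gives a mean of running accuracies, which is generally not the same as the accuracy over the whole epoch. The sketch below reproduces that arithmetic with plain Python and hypothetical per-batch counts; it does not use torcheval itself.

```python
# Hypothetical (correct, total) counts for three batches of one epoch
batches = [(2, 10), (6, 10), (10, 10)]

num_correct = 0
num_total = 0
running = []  # what compute() would return after each update
for correct, total in batches:
    num_correct += correct
    num_total += total
    running.append(num_correct / num_total)

mean_of_running = sum(running) / len(running)  # what the loop above reports
epoch_accuracy = num_correct / num_total       # accuracy over the full epoch

print(running)
print(mean_of_running, epoch_accuracy)
```

If the epoch-level accuracy is the number of interest, calling `metric.compute()` once after the loop and before `metric.reset()` would give it directly.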


In [58]:
def test_module(model:torch.nn.Module,
                device:torch.device,
                test_dataloader:torch.utils.data.DataLoader,
                criterion:torch.nn.Module,
                metric,
                test_losses,
                test_metrics):
  # setting model to eval mode
  model.eval()
  pbar = tqdm(test_dataloader)

  # batch metrics
  test_loss = 0
  test_metric = 0
  processed_batch = 0

  with torch.inference_mode():
    for _, data in enumerate(pbar, 0):
      ids = data['ids'].to(device)
      mask = data['mask'].to(device)
      targets = data['targets'].to(device)
      outputs = model(ids, mask).squeeze()
      # print(preds.shape)
      # print(label.shape)

      # calc loss
      loss = criterion(outputs, targets)
      test_loss += loss.item()

      # metric calc
      preds = torch.argmax(outputs,dim=-1)
      metric.update(preds, targets)
      test_metric += metric.compute().detach().item()

      #updating batch count
      processed_batch += 1

      pbar.set_description(f"Avg Test Loss: {test_loss/processed_batch} Avg Test Metric: {test_metric/processed_batch}")

    # reset the accumulated metric state once the epoch completes
    metric.reset()
    # updating epoch metrics
    test_losses.append(test_loss/processed_batch)
    test_metrics.append(test_metric/processed_batch)

  return test_losses, test_metrics

In [59]:
# Placeholders for per-epoch losses and metrics
train_losses = []
train_metrics = []
test_losses = []
test_metrics = []

for epoch in range(0,EPOCHS):
  print(f'Epoch {epoch}')
  train_losses, train_metrics = train_module(baseline_model1, device, train_loader, optimizer, criterion, metric, train_losses, train_metrics)
  test_losses , test_metrics = test_module(baseline_model1, device, valid_loader, criterion, metric, test_losses, test_metrics)

Epoch 0


Avg Train Loss: 1.6286382873853047 Avg Train Metric: 0.208241643384099: 100%|██████████| 12/12 [00:06<00:00,  1.90it/s]
Avg Test Loss: 1.5758082071940105 Avg Test Metric: 0.2833333412806193: 100%|██████████| 3/3 [00:00<00:00,  3.36it/s]


Epoch 1


Avg Train Loss: 1.5096753239631653 Avg Train Metric: 0.35144931823015213: 100%|██████████| 12/12 [00:05<00:00,  2.09it/s]
Avg Test Loss: 1.5170602798461914 Avg Test Metric: 0.2666666756073634: 100%|██████████| 3/3 [00:00<00:00,  4.20it/s]


Epoch 2


Avg Train Loss: 1.4122887353102367 Avg Train Metric: 0.43963293731212616: 100%|██████████| 12/12 [00:05<00:00,  2.14it/s]
Avg Test Loss: 1.439679225285848 Avg Test Metric: 0.3333333432674408: 100%|██████████| 3/3 [00:00<00:00,  3.36it/s]


Epoch 3


Avg Train Loss: 1.245659331480662 Avg Train Metric: 0.5373208274443945: 100%|██████████| 12/12 [00:05<00:00,  2.03it/s]
Avg Test Loss: 1.3222206036249797 Avg Test Metric: 0.4333333373069763: 100%|██████████| 3/3 [00:00<00:00,  4.14it/s]


Epoch 4


Avg Train Loss: 1.0530317773421605 Avg Train Metric: 0.6342201828956604: 100%|██████████| 12/12 [00:05<00:00,  2.08it/s]
Avg Test Loss: 1.2479929526646931 Avg Test Metric: 0.42777777711550397: 100%|██████████| 3/3 [00:00<00:00,  3.70it/s]


Epoch 5


Avg Train Loss: 0.8450708389282227 Avg Train Metric: 0.7117312997579575: 100%|██████████| 12/12 [00:05<00:00,  2.02it/s]
Avg Test Loss: 1.1812528173128765 Avg Test Metric: 0.6500000158945719: 100%|██████████| 3/3 [00:00<00:00,  4.38it/s]


Epoch 6


Avg Train Loss: 0.6497031996647517 Avg Train Metric: 0.8111940821011862: 100%|██████████| 12/12 [00:05<00:00,  2.11it/s]
Avg Test Loss: 1.1803967356681824 Avg Test Metric: 0.5333333412806193: 100%|██████████| 3/3 [00:00<00:00,  3.89it/s]


Epoch 7


Avg Train Loss: 0.4834926202893257 Avg Train Metric: 0.9308468550443649: 100%|██████████| 12/12 [00:05<00:00,  2.07it/s]
Avg Test Loss: 1.146159589290619 Avg Test Metric: 0.5777777830759684: 100%|██████████| 3/3 [00:00<00:00,  4.04it/s]


Epoch 8


Avg Train Loss: 0.33699866011738777 Avg Train Metric: 0.9678553392489752: 100%|██████████| 12/12 [00:05<00:00,  2.09it/s]
Avg Test Loss: 1.1750960946083069 Avg Test Metric: 0.6166666746139526: 100%|██████████| 3/3 [00:00<00:00,  4.26it/s]


Epoch 9


Avg Train Loss: 0.22416570906837782 Avg Train Metric: 0.9734454651673635: 100%|██████████| 12/12 [00:05<00:00,  2.05it/s]
Avg Test Loss: 1.1261778871218364 Avg Test Metric: 0.7055555582046509: 100%|██████████| 3/3 [00:00<00:00,  4.12it/s]


### Extended Model

In [None]:
class BertForSequenceClassification_B(nn.Module):
    """
    Simplified version of the same class by HuggingFace.
    See transformers/modeling_distilbert.py in the transformers repository.
    """

    def __init__(self, pretrained_model_name: str, num_classes: int = None, dropout: float = 0.1):
        super(BertForSequenceClassification_B, self).__init__()

        config = AutoConfig.from_pretrained(pretrained_model_name, num_labels=num_classes)
        print(config.hidden_size)

        self.model = AutoModel.from_pretrained(pretrained_model_name, config=config) # alternate DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.fc1 = nn.Linear(config.hidden_size, config.hidden_size)
        self.classifier = nn.Linear(config.hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, features, attention_mask=None, head_mask=None):

        assert attention_mask is not None, "attention mask is none"

        bert_output = self.model(input_ids=features, attention_mask=attention_mask)
        # we only need the hidden state here and don't need transformer output, so index 0
        seq_output = bert_output[0]  # (bs, seq_len, dim)
        pooler = seq_output[:, 0] # take the first hidden state (the [CLS] token)
        pooler = self.fc1(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        scores = self.classifier(pooler)

        return scores

In [None]:
baseline_model2 = BertForSequenceClassification_B("distilbert-base-cased",NUM_CLASSES).to(device)
baseline_model2

768


BertForSequenceClassification_B(
  (model): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
          

In [60]:
# Creating the loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  baseline_model2.parameters(), lr=LEARNING_RATE)
metric = MulticlassAccuracy(device = device, num_classes=NUM_CLASSES)


In [None]:
# Placeholders for per-epoch losses and metrics
train_losses = []
train_metrics = []
test_losses = []
test_metrics = []

for epoch in range(0,EPOCHS):
  print(f'Epoch {epoch}')
  train_losses, train_metrics = train_module(baseline_model2, device, train_loader, optimizer, criterion, metric, train_losses, train_metrics)
  test_losses , test_metrics = test_module(baseline_model2, device, valid_loader, criterion, metric, test_losses, test_metrics)

Epoch 0


Avg Train Loss: 1.6219481428464253 Avg Train Metric: 0.13370220238963762: 100%|██████████| 12/12 [00:06<00:00,  1.90it/s]
Avg Test Loss: 1.5897475481033325 Avg Test Metric: 0.2500000049670537: 100%|██████████| 3/3 [00:01<00:00,  2.77it/s]


Epoch 1


Avg Train Loss: 1.5739543338616688 Avg Train Metric: 0.3113918999830882: 100%|██████████| 12/12 [00:05<00:00,  2.03it/s]
Avg Test Loss: 1.5641411542892456 Avg Test Metric: 0.35555556416511536: 100%|██████████| 3/3 [00:00<00:00,  3.98it/s]


Epoch 2


Avg Train Loss: 1.5412176549434662 Avg Train Metric: 0.3609316398700078: 100%|██████████| 12/12 [00:05<00:00,  2.03it/s]
Avg Test Loss: 1.5088713963826497 Avg Test Metric: 0.3222222328186035: 100%|██████████| 3/3 [00:00<00:00,  3.95it/s]


Epoch 3


Avg Train Loss: 1.4314716855684917 Avg Train Metric: 0.4886420766512553: 100%|██████████| 12/12 [00:05<00:00,  2.03it/s]
Avg Test Loss: 1.3869462410608928 Avg Test Metric: 0.5833333532015482: 100%|██████████| 3/3 [00:00<00:00,  3.88it/s]


Epoch 4


Avg Train Loss: 1.2618279258410137 Avg Train Metric: 0.5882774268587431: 100%|██████████| 12/12 [00:06<00:00,  1.98it/s]
Avg Test Loss: 1.274840513865153 Avg Test Metric: 0.42222222685813904: 100%|██████████| 3/3 [00:00<00:00,  4.10it/s]


Epoch 5


Avg Train Loss: 1.0611073126395543 Avg Train Metric: 0.7082395305236181: 100%|██████████| 12/12 [00:06<00:00,  1.98it/s]
Avg Test Loss: 1.2173935572306316 Avg Test Metric: 0.6166666746139526: 100%|██████████| 3/3 [00:01<00:00,  2.99it/s]


Epoch 6


Avg Train Loss: 0.9086427291234335 Avg Train Metric: 0.7958143949508667: 100%|██████████| 12/12 [00:06<00:00,  1.89it/s]
Avg Test Loss: 1.1134816606839497 Avg Test Metric: 0.5833333532015482: 100%|██████████| 3/3 [00:00<00:00,  3.85it/s]


Epoch 7


Avg Train Loss: 0.7532484531402588 Avg Train Metric: 0.8749669243892034: 100%|██████████| 12/12 [00:06<00:00,  1.95it/s]
Avg Test Loss: 1.112821916739146 Avg Test Metric: 0.6277777751286825: 100%|██████████| 3/3 [00:00<00:00,  4.15it/s]


Epoch 8


Avg Train Loss: 0.605549452205499 Avg Train Metric: 0.8970475594202677: 100%|██████████| 12/12 [00:06<00:00,  1.98it/s]
Avg Test Loss: 1.0952296058336894 Avg Test Metric: 0.6777777870496114: 100%|██████████| 3/3 [00:00<00:00,  3.95it/s]


Epoch 9


Avg Train Loss: 0.48933863639831543 Avg Train Metric: 0.9801016201575597: 100%|██████████| 12/12 [00:06<00:00,  1.99it/s]
Avg Test Loss: 0.9921278556187948 Avg Test Metric: 0.6777777870496114: 100%|██████████| 3/3 [00:00<00:00,  3.85it/s]
