In [29]:
!nvidia-smi

Sun Apr 14 09:15:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla P100-PCIE-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0              25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                         

### Importing Libraries

In [30]:
import os
import pandas as pd
import numpy as np
import shutil
import sys
import tqdm.notebook as tq
from collections import defaultdict
from tqdm import tqdm 
import torch
import torch.nn as nn

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

In [31]:
print(device)

cuda


### Loading in data

In [32]:
# data_dir = "/content/drive/MyDrive/Notebooks_BERT/data"
df_data = pd.read_csv("Multi-Label Text Classification Dataset.csv")

In [34]:
df_data.head()

Unnamed: 0,Title,abstractText,meshMajor,pmid,meshid,meshroot,A,B,C,D,E,F,G,H,I,J,L,M,N,Z
0,Expression of p53 and coexistence of HPV in pr...,Fifty-four paraffin embedded tissue sections f...,"['DNA Probes, HPV', 'DNA, Viral', 'Female', 'H...",8549602,"[['D13.444.600.223.555', 'D27.505.259.750.600....","['Chemicals and Drugs [D]', 'Organisms [B]', '...",0,1,1,1,1,0,0,1,0,0,0,0,0,0
1,Vitamin D status in pregnant Indian women acro...,The present cross-sectional study was conducte...,"['Adult', 'Alkaline Phosphatase', 'Breast Feed...",21736816,"[['M01.060.116'], ['D08.811.277.352.650.035'],...","['Named Groups [M]', 'Chemicals and Drugs [D]'...",0,1,1,1,1,1,1,0,1,1,0,1,1,1
2,[Identification of a functionally important di...,The occurrence of individual amino acids and d...,"['Amino Acid Sequence', 'Analgesics, Opioid', ...",19060934,"[['G02.111.570.060', 'L01.453.245.667.060'], [...","['Phenomena and Processes [G]', 'Information S...",1,1,0,1,1,0,1,0,0,0,1,0,0,0
3,Multilayer capsules: a promising microencapsul...,"In 1980, Lim and Sun introduced a microcapsule...","['Acrylic Resins', 'Alginates', 'Animals', 'Bi...",11426874,"[['D05.750.716.822.111', 'D25.720.716.822.111'...","['Chemicals and Drugs [D]', 'Technology, Indus...",1,1,1,1,1,0,1,0,0,1,0,0,0,0
4,"Nanohydrogel with N,N'-bis(acryloyl)cystine cr...",Substantially improved hydrogel particles base...,"['Antineoplastic Agents', 'Cell Proliferation'...",28323099,"[['D27.505.954.248'], ['G04.161.750', 'G07.345...","['Chemicals and Drugs [D]', 'Phenomena and Pro...",1,1,0,1,1,0,1,0,0,1,0,0,0,0


In [35]:

# Merge title and abstract into a single text field, then drop the originals
df_data["combined"] = df_data["Title"] + ". " + df_data["abstractText"]
df_data.drop(columns=["abstractText", "Title"], inplace=True)

In [36]:
from sklearn.model_selection import train_test_split
# split into train and test
df_train, df_test = train_test_split(df_data, random_state=77, test_size=0.30, shuffle=True)
# split test into test and validation datasets
df_test, df_valid = train_test_split(df_test, random_state=88, test_size=0.50, shuffle=True)

In [37]:
print(f"Train: {df_train.shape}, Test: {df_test.shape}, Valid: {df_valid.shape}")

Train: (35000, 19), Test: (7500, 19), Valid: (7500, 19)


In [38]:
# Hyperparameters
MAX_LEN = 512
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16
EPOCHS = 10
LEARNING_RATE = 1e-05
THRESHOLD = 0.5 

In [39]:
from transformers import RobertaTokenizer, RobertaModel

### RoBERTa Base

RoBERTa (Robustly optimized BERT approach) is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model introduced by Facebook AI in 2019. It is designed to improve upon BERT's pretraining process by modifying key hyperparameters, training data, and training objectives.

#### Architecture

RoBERTa base uses the same architecture as BERT base: a stack of transformer encoder layers with self-attention. The differences lie in the pretraining recipe:

- **Larger Training Data**: RoBERTa is pretrained on substantially more unlabelled text than BERT, adding corpora such as CC-News and OpenWebText to BERT's BookCorpus and English Wikipedia.
- **Dynamic Masking**: The masked positions are re-sampled each time a sequence is fed to the model, rather than being fixed once during preprocessing.
- **No Next Sentence Prediction (NSP)**: Unlike BERT, RoBERTa does not use the NSP task during pretraining, relying solely on the masked language model (MLM) objective.
- **Hyperparameter Tuning**: RoBERTa is pretrained with larger batches, an adjusted learning rate, and a longer training schedule, which improves downstream performance.
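
To make the dynamic-masking idea concrete, here is a minimal sketch that draws a fresh set of masked positions every time a sentence is requested. The `dynamic_mask` helper and the 15% masking rate are assumptions chosen for illustration; BERT-style pretraining also sometimes substitutes a random token or leaves the selected token unchanged, which this sketch omits.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token string


def dynamic_mask(tokens, mask_prob=0.15, rng=random):
    """Return a copy of `tokens` with a freshly sampled subset replaced by MASK_TOKEN."""
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


tokens = "we are testing the roberta tokenizer on a short sentence".split()
rng = random.Random(0)  # seeded only so the demo is repeatable

# Every pass over the data draws a new mask pattern for the same sentence.
for pass_idx in range(1, 4):
    print(pass_idx, dynamic_mask(tokens, rng=rng))
```

With static masking, the list comprehension would run once during preprocessing and the same masked copy would be reused on every pass.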



#### Performance

The RoBERTa paper reported state-of-the-art results at the time of publication (2019) on natural language understanding (NLU) benchmarks including GLUE (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset). The base variant used here trades some accuracy for lower compute cost and is a widely used starting point for fine-tuning on NLP tasks.

For more details, refer to the original paper: [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692).


In [40]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


In [41]:
# Test the tokenizer
test_text = "We are testing RoBERTa tokenizer."
# generate encodings
encodings = tokenizer.encode_plus(test_text, 
                                  add_special_tokens = True,
                                  max_length = 512,
                                  truncation = True,
                                  padding = "max_length", 
                                  return_attention_mask = True, 
                                  return_tensors = "pt")

In [42]:
print(encodings)

{'input_ids': tensor([[    0,   170,    32,  3044,  3830, 11126, 38495, 19233,  6315,     4,
             2,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,  

In [43]:
df_train['combined']

12608    A seven-day Helicobacter pylori treatment regi...
16180    Choanal atresia: an unusual serious complicati...
44364    Can Preoperative Sex-Related Differences in He...
362      Use of a First Large-Sized Coil Versus Convent...
25827    A Kaposi's sarcoma-associated herpesvirus prot...
                               ...                        
7832     Making health care more sustainable: the case ...
42277    Does co-infection with multiple viruses advers...
18667    Differential effects of L-NAME on rat venular ...
8799     [Dietary changes in Mexico].. Although the Mex...
47831    Managing variations from surgical care plans: ...
Name: combined, Length: 35000, dtype: object

### PubMedDataSet Class

This class is designed to create a PyTorch dataset for training and evaluation using data from a PubMed dataset. It prepares the data by tokenizing the text using a given tokenizer, encoding it into input tensors, and including the corresponding targets. Here's a breakdown of its components:

#### Initialization

- `df`: The DataFrame containing the PubMed dataset, with a column named 'combined' containing the combined text of title and abstract.
- `tokenizer`: A tokenizer object from the Hugging Face `transformers` library for tokenizing the text data.
- `max_len`: The maximum sequence length to which the input sequences will be padded or truncated.
- `target_list`: A list of target labels for the classification task.

#### Methods

##### `__init__()`

- Initializes the PubMedDataSet object with the provided DataFrame, tokenizer, maximum sequence length, and target list.

##### `__len__()`

- Returns the length of the dataset, which is the number of samples in the PubMed dataset.

##### `__getitem__()`

- Retrieves a single sample from the dataset at the specified index.
- Tokenizes the combined title-and-abstract text using the tokenizer and encodes it into input tensors.
- Pads or truncates the sequence so that its length is exactly `max_len`.
- Converts the target labels into a PyTorch `FloatTensor`.
- Returns a dictionary containing the input tensors ('input_ids', 'attention_mask', 'token_type_ids'), the target labels ('targets'), and the whitespace-normalized combined text (stored under the key 'title').

This class facilitates the preprocessing of PubMed dataset for training and evaluation in PyTorch, making it easier to work with text data in machine learning pipelines.
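
The padding and truncation step can be sketched in plain Python. This is an illustration of what `padding='max_length'` with `truncation=True` produces, not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper, and the padding id of 1 is taken from the tokenizer output shown earlier.

```python
PAD_ID = 1  # RoBERTa's padding id, as seen in the encodings printed above


def pad_or_truncate(input_ids, max_len, pad_id=PAD_ID):
    """Return (ids, attention_mask), both exactly `max_len` long."""
    if len(input_ids) > max_len:
        # Keep the closing special token so the sequence still ends with </s>.
        input_ids = input_ids[: max_len - 1] + [input_ids[-1]]
    n_real = len(input_ids)
    n_pad = max_len - n_real
    return input_ids + [pad_id] * n_pad, [1] * n_real + [0] * n_pad


# Token ids of the test sentence from the tokenizer check: <s> ... </s>
ids = [0, 170, 32, 3044, 3830, 11126, 38495, 19233, 6315, 4, 2]

padded, mask = pad_or_truncate(ids, max_len=16)
clipped, clipped_mask = pad_or_truncate(ids, max_len=8)
print(padded, mask)
print(clipped, clipped_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the model ignores the padded positions.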


In [44]:
class PubMedDataSet(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len, target_list):
        self.tokenizer = tokenizer
        self.df = df
        self.title = list(df['combined'])
        self.targets = self.df[target_list].values
        self.max_len = max_len

    def __len__(self):
        return len(self.title)

    def __getitem__(self, index):
        title = str(self.title[index])
        title = " ".join(title.split())
        inputs = self.tokenizer.encode_plus(
            title,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        targets = torch.FloatTensor(self.targets[index])
        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'token_type_ids': inputs["token_type_ids"].flatten(),
            'targets': targets,
            'title': title
        }


In [45]:
# Label columns sit between the four metadata columns and the trailing 'combined' column
target_list = list(df_data.columns[4:-1])
target_list

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'Z']

In [46]:
train_dataset = PubMedDataSet(df_train, tokenizer, MAX_LEN, target_list)
valid_dataset = PubMedDataSet(df_valid, tokenizer, MAX_LEN, target_list)
test_dataset = PubMedDataSet(df_test, tokenizer, MAX_LEN, target_list)

In [47]:
# testing the dataset
next(iter(train_dataset))

{'input_ids': tensor([    0,   250,   707,    12,  1208, 31141,  2413, 35995,   181,  4360,
          6249,  1416, 30174,   634, 17691,  3432, 41643, 28366,     6, 16780,
          2462,  9919,  4104,     8,  1805,  1242, 45736,   385, 17022,   338,
          3938,   741,  1809,  5914,   877,  7586,    83,  3755,    35,   598,
         10516, 17691,  3432, 41643, 28366,  1764, 17844,   326,     4,   417,
             4,    29,   482,  1805,  1242, 45736,   385, 17022,   338,  3938,
           741,  1809,  5914,   877, 15452, 17844,   741,     4,   417,     4,
             8, 16780,  2462,  9919,  4104,   291, 17844,   741,     4,   417,
             4,    13,   262,   360,    25,    10, 31141,  2413, 35995,   181,
          4360,  6249,  1416, 30174,     4, 49767,   104,    35,    20,   289,
             4,   181,  4360,  6249,  2194,     9, 25599,  2379, 24617,  1484,
         11793,   253, 17591, 16572,    21, 11852,    30, 33945,  4383,     6,
          2040,     8,  6379,  1717,   

In [48]:
# Data loaders
train_data_loader = torch.utils.data.DataLoader(train_dataset, 
    batch_size=TRAIN_BATCH_SIZE,
    shuffle=True,
    num_workers=0
)

val_data_loader = torch.utils.data.DataLoader(valid_dataset, 
    batch_size=VALID_BATCH_SIZE,
    shuffle=False,
    num_workers=0
)

test_data_loader = torch.utils.data.DataLoader(test_dataset, 
    batch_size=TEST_BATCH_SIZE,
    shuffle=False,
    num_workers=0
)
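
As a quick sanity check on the loaders, the number of batches per epoch follows directly from the split sizes and the batch size of 16. The `num_batches` helper below is a hypothetical name used only for this arithmetic; it assumes the default `drop_last=False` behaviour, where the final partial batch is kept.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for `num_samples` rows."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


train_batches = num_batches(35_000, 16)  # training split
eval_batches = num_batches(7_500, 16)    # validation and test splits
print(train_batches, eval_batches)
```

These counts should match the totals shown by the progress bars during training and evaluation.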

### RoBERTaPubMed Class

This class defines a PyTorch model that places a classification head on top of RoBERTa for multi-label classification on the PubMed dataset. Here's a breakdown of its components:

#### Initialization

- `target_classes`: The number of target classes for classification (default is set to 14).

#### Attributes

- `roberta_model`: A pre-trained RoBERTa model loaded from Hugging Face's `transformers` library. It utilizes the 'roberta-base' variant and returns the model output as a dictionary.
- `dropout`: A dropout layer with a dropout probability of 0.3.
- `linear`: A linear layer that maps the 768-dimensional pooled output of RoBERTa to one logit per target class.

#### Methods

##### `__init__()`

- Initializes the RoBERTaPubMed object, setting up the RoBERTa model, dropout layer, and linear layer for classification.

##### `forward()`

- Defines the forward pass of the neural network model.
- Accepts input tensors (`input_ids`, `attn_mask`, `token_type_ids`) representing tokenized text data.
- Passes the input tensors through the RoBERTa model to obtain the model output.
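- Applies dropout to the pooled output (`pooler_output`) obtained from the RoBERTa model. The `roberta-base` checkpoint does not include pooler weights, so this layer is newly initialized and learned during fine-tuning.
- Passes the dropout output through the linear layer to obtain one raw logit per target class. No sigmoid is applied here because the loss function expects logits.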
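
The shapes involved in the classification head can be traced with a small NumPy stand-in. This is a sketch under stated assumptions: the pooled vector is replaced by random numbers, the weights are randomly initialized, and `dropout` is a hand-written inverted-dropout helper rather than the PyTorch layer.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, hidden_size, num_labels = 16, 768, 14
# Stand-in for the pooled RoBERTa output: one 768-dim vector per sample.
pooled = rng.standard_normal((batch_size, hidden_size))


def dropout(x, p, training, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)


# Randomly initialized head weights, shaped like torch.nn.Linear(768, 14).
W = rng.standard_normal((num_labels, hidden_size)) * 0.02
b = np.zeros(num_labels)

logits = dropout(pooled, p=0.3, training=True, rng=rng) @ W.T + b
print(logits.shape)  # one raw score (logit) per sample and label
```

In evaluation mode the dropout step is the identity, which is why `model.eval()` matters before computing validation metrics.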

#### Usage

To use this model for classification on the PubMed dataset, instantiate the `RoBERTaPubMed` class. By default it loads the 'roberta-base' pre-trained weights and adds a 14-way classification head.

```python
model = RoBERTaPubMed()
```


In [49]:
class RoBERTaPubMed(torch.nn.Module):
    def __init__(self, target_classes=14):
        super(RoBERTaPubMed, self).__init__()
        self.roberta_model = RobertaModel.from_pretrained('roberta-base', return_dict=True)
        self.dropout = torch.nn.Dropout(0.3)
        self.linear = torch.nn.Linear(768, target_classes)

    def forward(self, input_ids, attn_mask, token_type_ids):
        output = self.roberta_model(
            input_ids, 
            attention_mask=attn_mask, 
            token_type_ids=token_type_ids
        )
        output_dropout = self.dropout(output.pooler_output)
        output = self.linear(output_dropout)
        return output

model = RoBERTaPubMed()



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [50]:
model.to(device)

RoBERTaPubMed(
  (roberta_model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm):

In [51]:

def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)
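
For intuition about what this loss computes, the sketch below reproduces binary cross-entropy on logits in NumPy and checks it against the naive sigmoid-then-log formula. It assumes the default mean reduction over all samples and labels; the `bce_with_logits` helper and the toy numbers are illustrative, not taken from the notebook.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits in a stable form."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # max(x, 0) - x*y + log(1 + exp(-|x|)) equals
    # -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x).
    per_element = (
        np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_element.mean()


logits = np.array([[2.0, -1.0, 0.0], [-3.0, 0.5, 1.5]])  # 2 samples, 3 labels
targets = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

stable = bce_with_logits(logits, targets)
probs = sigmoid(logits)
naive = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)).mean()
print(stable, naive)
```

Each label is treated as an independent yes/no decision, which is what makes this loss suitable for multi-label targets.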

### Optimizer Initialization

In this code snippet, we initialize an AdamW optimizer for training the model. Here's a breakdown:

#### AdamW Optimizer

AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient-based update: the decay is applied directly to the weights instead of being added to the loss as an L2 penalty. It is commonly used to fine-tune transformer-based models like RoBERTa.

#### Initialization

We instantiate the AdamW optimizer by passing the model parameters (`model.parameters()`) and the learning rate (`lr=1e-5`). The `no_deprecation_warning=True` argument silences the warning that the `transformers` implementation of AdamW is deprecated in favor of `torch.optim.AdamW`.

#### Usage

The initialized optimizer (`optimizer`) can then be used during the training loop to update the model parameters based on computed gradients.
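
A single AdamW update for one scalar parameter can be written out by hand to show where the decoupled decay enters. This is a sketch of the textbook update rule; the `adamw_step` helper and its default hyperparameters are illustrative assumptions, not values read from the notebook or from any library.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; returns (param, m, v)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    adam_term = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: weight_decay * param is not fed through m or v.
    param = param - lr * (adam_term + weight_decay * param)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * param  # gradient of the toy loss param**2
    param, m, v = adamw_step(param, grad, m, v, t)
    print(t, param)
```

With plain Adam plus an L2 penalty, the decay term would instead be added to `grad` and rescaled by the adaptive denominator.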




In [52]:
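# transformers.AdamW is deprecated in favor of torch.optim.AdamW;
# no_deprecation_warning=True silences the resulting warning.
from transformers import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-5, no_deprecation_warning=True)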

In [53]:
# Define your training function
def train_model(training_loader, model, optimizer, loss_fn, device):
    model.train()
    loop = tqdm(enumerate(training_loader), total=len(training_loader), leave=True)
    
    total_loss = 0.0  # Initialize total_loss
    for batch_idx, data in loop:
        ids = data['input_ids'].to(device, dtype=torch.long)
        mask = data['attention_mask'].to(device, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.float)
        
        optimizer.zero_grad()
        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

        # Note: `epoch` and `EPOCHS` are read from the notebook's global scope.
        loop.set_description(f"Training Epoch [{epoch}/{EPOCHS}]")
        loop.set_postfix(loss=loss.item())
        
    return total_loss / len(training_loader)

In [54]:
# Define your evaluation function
def eval_model(validation_loader, model, loss_fn, device, action):
    model.eval()
    total_loss = 0
    all_targets = []
    all_predictions = []
    loop = tqdm(enumerate(validation_loader), total=len(validation_loader), leave=True)
    with torch.no_grad():
        for batch_idx, data in loop:
            ids = data['input_ids'].to(device, dtype=torch.long)
            mask = data['attention_mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype=torch.float)

            outputs = model(ids, mask, token_type_ids)
            loss = loss_fn(outputs, targets)
            total_loss += loss.item()

            # The model returns logits, so apply a sigmoid before thresholding
            # the resulting probabilities at THRESHOLD (0.5).
            predictions = (torch.sigmoid(outputs) > THRESHOLD).float()
            all_targets.extend(targets.cpu().numpy())
            all_predictions.extend(predictions.cpu().numpy())

            loop.set_description(f"{action} Epoch [{epoch}/{EPOCHS}]")
            loop.set_postfix(loss=total_loss / (batch_idx + 1))

    return total_loss / len(validation_loader), np.array(all_targets), np.array(all_predictions)
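
Because the model emits logits rather than probabilities, it matters which scale the cutoff is applied on. The short sketch below, using only the standard library, shows how a cutoff on one scale translates to the other; the `scores` list is made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the sigmoid: the logit whose probability is p."""
    return math.log(p / (1.0 - p))


print(sigmoid(0.5))  # probability implied by a logit of 0.5 (about 0.62)
print(logit(0.5))    # logit equivalent to a probability cutoff of 0.5

# Thresholding probabilities at p is the same decision rule as
# thresholding logits at logit(p).
scores = [-2.0, -0.1, 0.3, 0.6, 4.0]
by_prob = [sigmoid(s) > 0.5 for s in scores]
by_logit = [s > logit(0.5) for s in scores]
print(by_prob == by_logit)
```

A cutoff of 0.5 applied directly to logits is therefore stricter than a 0.5 probability cutoff, which tends to lower recall.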

In [55]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [57]:
EPOCHS = 6  # overrides the EPOCHS = 10 set in the hyperparameters cell

In [58]:
from sklearn.metrics import precision_score, recall_score, f1_score
from tqdm import tqdm


def compute_metrics(targets, outputs):
    precision = precision_score(targets, outputs, average=None)
    recall = recall_score(targets, outputs, average=None)
    f1 = f1_score(targets, outputs, average=None)
    return precision, recall, f1


history = defaultdict(list)
best_val_loss = float('inf')

for epoch in range(1, EPOCHS + 1):
    print(f'Epoch {epoch}/{EPOCHS}')
    
    train_loss = train_model(train_data_loader, model, optimizer, loss_fn, device)
    val_loss, val_targets, val_predictions = eval_model(val_data_loader, model, loss_fn, device, action='Validation')
    test_loss, test_targets, test_predictions = eval_model(test_data_loader, model, loss_fn, device, action='Testing')
    
    
    val_precision, val_recall, val_f1 = compute_metrics(val_targets, val_predictions)
    test_precision, test_recall, test_f1 = compute_metrics(test_targets, test_predictions)
    
    
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['val_precision'].append(val_precision)
    history['val_recall'].append(val_recall)
    history['val_f1'].append(val_f1)
    history['test_loss'].append(test_loss)
    history['test_precision'].append(test_precision)
    history['test_recall'].append(test_recall)
    history['test_f1'].append(test_f1)
    
    print(f'Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}, Test Loss: {test_loss:.4f}')
    
    # Save best model based on validation loss
    if val_loss < best_val_loss:
        torch.save(model.state_dict(), 'best_model.pt')
        best_val_loss = val_loss
    
    # Save metrics to CSV after each epoch
    df_metrics = pd.DataFrame(history)
    df_metrics.to_csv('metrics.csv', index=False)

Epoch 1/6


Training Epoch [1/6]: 100%|██████████| 2188/2188 [32:32<00:00,  1.12it/s, loss=0.242]
Validation Epoch [1/6]: 100%|██████████| 469/469 [02:35<00:00,  3.01it/s, loss=0.28]
Testing Epoch [1/6]: 100%|██████████| 469/469 [02:36<00:00,  3.01it/s, loss=0.278]


Train Loss: 0.3277, Validation Loss: 0.2797, Test Loss: 0.2777
Epoch 2/6


Training Epoch [2/6]: 100%|██████████| 2188/2188 [32:22<00:00,  1.13it/s, loss=0.489]
Validation Epoch [2/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.267]
Testing Epoch [2/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.265]


Train Loss: 0.2689, Validation Loss: 0.2672, Test Loss: 0.2647
Epoch 3/6


Training Epoch [3/6]: 100%|██████████| 2188/2188 [32:23<00:00,  1.13it/s, loss=0.265]
Validation Epoch [3/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.26]
Testing Epoch [3/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.258]


Train Loss: 0.2478, Validation Loss: 0.2597, Test Loss: 0.2579
Epoch 4/6


Training Epoch [4/6]: 100%|██████████| 2188/2188 [32:25<00:00,  1.12it/s, loss=0.281]
Validation Epoch [4/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.263]
Testing Epoch [4/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.261]


Train Loss: 0.2295, Validation Loss: 0.2632, Test Loss: 0.2606
Epoch 5/6


Training Epoch [5/6]: 100%|██████████| 2188/2188 [32:24<00:00,  1.13it/s, loss=0.216]
Validation Epoch [5/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.265]
Testing Epoch [5/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.262]


Train Loss: 0.2126, Validation Loss: 0.2654, Test Loss: 0.2620
Epoch 6/6


Training Epoch [6/6]: 100%|██████████| 2188/2188 [32:20<00:00,  1.13it/s, loss=0.26]
Validation Epoch [6/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.273]
Testing Epoch [6/6]: 100%|██████████| 469/469 [02:34<00:00,  3.03it/s, loss=0.269]

Train Loss: 0.1949, Validation Loss: 0.2728, Test Loss: 0.2693
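
The log above can be read programmatically. The sketch below copies the per-epoch losses from the printed output and finds the epoch with the lowest validation loss, which under the save-on-improvement rule is the epoch whose weights end up in `best_model.pt`. The numbers are the rounded values from the log, so treat the result as a reading aid rather than a new measurement.

```python
# Losses copied from the training log above (epochs 1 to 6).
train_losses = [0.3277, 0.2689, 0.2478, 0.2295, 0.2126, 0.1949]
val_losses = [0.2797, 0.2672, 0.2597, 0.2632, 0.2654, 0.2728]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
print(best_epoch, val_losses[best_epoch - 1])

# Gap between validation and training loss per epoch;
# a steadily widening gap is a common sign of overfitting.
gaps = [round(v - t, 4) for v, t in zip(val_losses, train_losses)]
print(gaps)
```

Training loss keeps falling after the best epoch while validation loss rises, so additional epochs at this learning rate are unlikely to help without further regularization or early stopping.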



