# Benchmarking models against SoDa-RoBERTa

In this document we benchmark the performance of different models against [SoDa-RoBERTa](https://github.com/source-data/soda-roberta) on several `token classification` tasks.

The goal of this experiment is not to find the best overall model, but the models that most accurately classify text entities of interest in the [SourceData](https://sourcedata.embo.org/) context, that is, in the field of molecular and cell biology. With this goal in mind, we will not use typical benchmarking datasets, but the [`sd-nlp`](https://huggingface.co/datasets/EMBO/sd-nlp) and [`sd-nlp-non-tokenized`](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized) datasets. These datasets contain annotated image captions extracted from papers on molecular biology. The data has been curated and annotated by professionals in the field.

This notebook is intended to be used with the [🤗 Datasets](https://huggingface.co/) library. 

## Table of contents

* [Chapter 1 - SoDa-RoBERTa](#chapter1)
    * [Section 1.1 - NER task for SoDa-RoBERTa](#section_1_1)
    * [Section 1.2 - SMALL_MOL_ROLES task for SoDa-RoBERTa](#section_1_2)
    * [Section 1.3 - GENEPROD_ROLES task for SoDa-RoBERTa](#section_1_3)
    * [Section 1.4 - PANELIZATION task for SoDa-RoBERTa](#section_1_4)
    * [Section 1.5 - BORING task for SoDa-RoBERTa](#section_1_5)
* [Chapter 2 - RoBERTa](#chapter2)
    * [Section 2.1 - NER task for RoBERTa](#section_2_1)
    * [Section 2.2 - SMALL_MOL_ROLES task for RoBERTa](#section_2_2)
    * [Section 2.3 - GENEPROD_ROLES task for RoBERTa](#section_2_3)
    * [Section 2.4 - PANELIZATION task for RoBERTa](#section_2_4)
    * [Section 2.5 - BORING task for RoBERTa](#section_2_5)


In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from smtag.config import Config


In [2]:
from transformers import __version__
__version__

'4.15.0'

In [3]:
# from huggingface_hub import notebook_login
# notebook_login()

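import os

# Read the Hub token from the environment instead of hard-coding a secret in the notebook.
# The variable name HUB_TOKEN is an assumption; use whatever name your environment defines.
HUB_TOKEN = os.getenv("HUB_TOKEN")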

# Chapter 1 - SoDa RoBERTa <a class="anchor" id="chapter1"></a>

[SoDa-RoBERTa](https://github.com/source-data/soda-roberta) [(Liechti et al. 2017)](https://doi.org/10.1038/nmeth.4471) is a package developed by the [SourceData](https://sourcedata.embo.org/) team to improve the data curation of biomedical papers in the field of molecular and cell biology.

This is the first model that we will use in our benchmarking. The data available in [`sd-nlp`](https://huggingface.co/datasets/EMBO/sd-nlp) has already been tokenized using the 🤗 `roberta-base` tokenizer. This tokenizer has been pre-trained with the `roberta-base` model, which is the base model on top of which SoDa-RoBERTa has been built. 

Since the model is already pre-trained, we just need to fine-tune it. The basic idea is that the pre-trained model will generate a series of outputs that are token embeddings. For fine-tuning, a feed-forward neural network (FFNN) is added on top of these embeddings and connected to a `softmax` layer to classify tokens.

This process is mostly automated for us by the [🤗 `AutoModelForTokenClassification`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification) class.
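To make the idea of a classification head concrete, here is a minimal pure-Python sketch of a single linear layer followed by `softmax`, applied to each token embedding. It is an illustration only, not the code inside `AutoModelForTokenClassification`: the sizes, the random weights, and the helper names `softmax` and `classify_tokens` are all made up for this example, and a real head may also include dropout.

```python
import math
import random


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(embeddings, weights, bias):
    """Apply one linear layer plus softmax to every token embedding.

    embeddings: list of token vectors, each of length hidden_size
    weights:    num_labels rows, each of length hidden_size
    bias:       list of length num_labels
    Returns one (predicted_label_id, probabilities) pair per token.
    """
    predictions = []
    for vector in embeddings:
        logits = [
            sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)
        ]
        probs = softmax(logits)
        predictions.append((probs.index(max(probs)), probs))
    return predictions


# Toy sizes (hypothetical): 3 tokens, hidden size 4, 5 labels.
rng = random.Random(0)
hidden_size, num_labels = 4, 5
embeddings = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(3)]
weights = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

predictions = classify_tokens(embeddings, weights, bias)
for label_id, probs in predictions:
    print(label_id, [round(p, 3) for p in probs])
```

During fine-tuning, the weights of such a head (and usually of the transformer underneath) are updated so that the highest probability falls on the correct label for each token.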

# Section 1.1 - NER task for SoDa-RoBERTa <a class="anchor" id="section_1_1"></a>


In [4]:
from datasets import load_dataset

In [5]:
data = load_dataset("drAbreu/sd-nlp-2", "NER")
train_dataset, eval_dataset, test_dataset = data["train"], data['validation'], data['test']
train_dataset[0:2]

Reusing dataset source_data_nlp (/root/.cache/huggingface/datasets/drAbreu___source_data_nlp/NER/0.0.1/bc570bbe08a861e7e058193ee9cf347dbe9f132e96d86d20817bc45c49737226)



{'input_ids': [[0,
   1640,
   347,
   43,
   4052,
   847,
   33101,
   43916,
   14868,
   303,
   129,
   15,
   15145,
   36,
   571,
   35,
   361,
   25610,
   8,
   1368,
   35,
   545,
   25610,
   43,
   15,
   36475,
   36,
   118,
   35,
   361,
   25610,
   43,
   9,
   3186,
   3082,
   4,
   306,
   4,
   735,
   2549,
   58,
   1455,
   129,
   15,
   5,
   5856,
   9,
   5,
   155,
   1484,
   25,
   22827,
   13,
   3186,
   3082,
   4,
   306,
   36,
   267,
   35,
   545,
   25610,
   322,
   1437,
   2],
  [0,
   28588,
   3693,
   3041,
   44193,
   40899,
   16007,
   21258,
   2018,
   5,
   127,
   523,
   12572,
   3551,
   5252,
   11,
   364,
   771,
   2571,
   9,
   545,
   12,
   3583,
   12,
   279,
   15540,
   9789,
   15,
   10,
   239,
   12,
   19987,
   5626,
   36,
   725,
   24667,
   43,
   36,
   4070,
   43,
   8,
   5,
   12337,
   11257,
   5656,
   36,
   6960,
   322,
   20,
   2853,
   2798,
   924,
   5,
   274,
   306,
   73,
   2940,
  

Each model we benchmark uses a different training configuration. We generate these configurations using the `config_dict` variable in the module `config.py`.

The configuration includes important settings such as the model checkpoint to be used.

The SoDa-RoBERTa project has produced a language model, [`bio-lm`](https://huggingface.co/EMBO/bio-lm), which was initialized from the `roberta-base` checkpoint. For this reason the tokenizer to use is that of `roberta-base`.

In the next line we will load the `bio-lm` checkpoint and the `roberta-base` tokenizer.

The `sd-nlp` dataset is ready to be processed. The next step is to organize the data so that it can be loaded in batches.

This is done with data collators. 🤗 provides a generic data collator, `DataCollatorForTokenClassification`, that does what we need. However, we also have a custom `DataCollatorForMaskedTokenClassification` that uses the `tag_mask` column to randomly mask values. This was done by Thomas. I assume the reason was to improve generalization on the task, but this needs to be checked with him.

After checking, it looks like for some reason only the `DataCollatorForMaskedTokenClassification` works here, so let's keep it as it is and move forward.
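The random masking described above can be sketched in plain Python. This is a guess at the general idea, not the `smtag` implementation: it assumes that only positions flagged in `tag_mask` are candidates, that `masking_probability` is the chance of swapping a token for `<mask>`, and that `replacement_probability` is the chance of swapping it for a random token. The function name `mask_tagged_tokens` and the `tag_mask` values are hypothetical.

```python
import random

MASK_TOKEN_ID = 50264  # id of <mask> in the roberta-base vocabulary
VOCAB_SIZE = 50265     # size of the roberta-base vocabulary


def mask_tagged_tokens(input_ids, tag_mask, masking_probability=0.1,
                       replacement_probability=0.1, rng=None):
    """Return a copy of input_ids with some tagged tokens masked or replaced.

    Only positions where tag_mask is 1 are candidates (assumed behaviour).
    """
    rng = rng or random.Random()
    corrupted = list(input_ids)
    for position, is_tagged in enumerate(tag_mask):
        if not is_tagged:
            continue
        draw = rng.random()
        if draw < masking_probability:
            corrupted[position] = MASK_TOKEN_ID  # hide the token entirely
        elif draw < masking_probability + replacement_probability:
            corrupted[position] = rng.randrange(VOCAB_SIZE)  # swap in a random token
    return corrupted


# First tokens of the first training example; the tag_mask values are made up.
input_ids = [0, 1640, 347, 43, 4052, 847, 33101, 43916, 14868, 303, 129, 15, 15145]
tag_mask = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(mask_tagged_tokens(input_ids, tag_mask, rng=random.Random(0)))
```

The labels are left untouched in this sketch, so the model would have to predict the entity type of a hidden token from its context alone, which is one plausible reason for masking in the first place.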

In [12]:
from transformers import DataCollatorForTokenClassification
from smtag.data_collator import DataCollatorForMaskedTokenClassification
from transformers import AutoTokenizer
# Check the case of the MaskedTokenClassification
config = Config(model_type = "Autoencoder", from_pretrained = "EMBO/bio-lm", tokenizer = 'roberta-base')
# data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True,return_tensors='pt')
data_collator_mask = DataCollatorForMaskedTokenClassification(tokenizer=AutoTokenizer.from_pretrained(config.tokenizer), 
                                                              padding=True,
                                                              max_length=512,
                                                              pad_to_multiple_of=None,
                                                              return_tensors='pt',
                                                              masking_probability=0.1,
                                                              replacement_probability=0.1,
                                                              select_labels=False)

In [13]:
batch_size=1
batch = data_collator_mask([data["train"][i] for i in range(batch_size)])
batch



{'input_ids': tensor([[    0,  1640,   347,    43,  4052,   847, 33101, 43916, 14868,   303,
            129,    15,  5507,    36,   571,    35,   361, 25610,     8,  1368,
             35,   545, 25610,    43,    15, 36475,    36,   118,    35,   361,
          25610,    43,     9, 32367,  3082,     4,   306,     4,   735, 19990,
             58,  1455,   129,    15,     5,  5856,     9,     5,   155, 50264,
             25, 22827,    13,  3186,  3082,     4,   306,    36,   267,    35,
            545, 25610,   322,  1437,     2]]),
 'labels': tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,
           0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0, 12,  0,  0,
           0,  0, 10,  9,  0,  0,  0,  0,  0, 10,  0,  0,  0, 12,  0,  0,  0, 12,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

After the data collator, it is time to define the hyperparameters needed by the `Trainer` class of 🤗.

In [29]:
import os
from dotenv import load_dotenv

load_dotenv()
# LM_MODEL_PATH = os.getenv('LM_MODEL_PATH')
# TOKENIZER_PATH = os.getenv('TOKENIZER_PATH')
# TOKCL_MODEL_PATH = os.getenv('TOKCL_MODEL_PATH')
# CACHE = os.getenv('CACHE')
# RUNS_DIR = os.getenv('RUNS_DIR')
LM_MODEL_PATH="./.lm_models"
TOKCL_MODEL_PATH="./.tokcl_models"
CACHE="./.cache"
RUNS_DIR="./.runs"
DUMMY_DIR="./.dummy"
RUNTIME=""


At this point we also need to define the number of labels and the mappings between label names and class ids.

In [30]:
num_labels = train_dataset.info.features['labels'].feature.num_classes
label_list = train_dataset.info.features['labels'].feature.names
id2label, label2id = {}, {}
for class_, label in zip(range(num_labels), label_list):
    id2label[class_] = label 
    label2id[label] = class_ 
print(f"\nTraining on {num_labels} features:")
print(", ".join(label_list))
id2label
label2id


Training on 15 features:
O, I-SMALL_MOLECULE, B-SMALL_MOLECULE, I-GENEPROD, B-GENEPROD, I-SUBCELLULAR, B-SUBCELLULAR, I-CELL, B-CELL, I-TISSUE, B-TISSUE, I-ORGANISM, B-ORGANISM, I-EXP_ASSAY, B-EXP_ASSAY


{'O': 0,
 'I-SMALL_MOLECULE': 1,
 'B-SMALL_MOLECULE': 2,
 'I-GENEPROD': 3,
 'B-GENEPROD': 4,
 'I-SUBCELLULAR': 5,
 'B-SUBCELLULAR': 6,
 'I-CELL': 7,
 'B-CELL': 8,
 'I-TISSUE': 9,
 'B-TISSUE': 10,
 'I-ORGANISM': 11,
 'B-ORGANISM': 12,
 'I-EXP_ASSAY': 13,
 'B-EXP_ASSAY': 14}
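With these mappings, a sequence of predicted class ids can be turned back into entity spans. The sketch below assumes the usual reading of the `B-`/`I-` prefixes (a `B-` tag opens an entity, matching `I-` tags extend it); the helper `extract_spans` and the example ids are hypothetical and not part of `smtag`.

```python
label_list = [
    "O", "I-SMALL_MOLECULE", "B-SMALL_MOLECULE", "I-GENEPROD", "B-GENEPROD",
    "I-SUBCELLULAR", "B-SUBCELLULAR", "I-CELL", "B-CELL", "I-TISSUE",
    "B-TISSUE", "I-ORGANISM", "B-ORGANISM", "I-EXP_ASSAY", "B-EXP_ASSAY",
]
id2label = dict(enumerate(label_list))


def extract_spans(label_ids):
    """Group B-/I- tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current = None  # [entity_type, start, end] of the span being built
    for position, label_id in enumerate(label_ids):
        tag = id2label[label_id]
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], position, position + 1]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = position + 1  # extend the open span
        else:
            # "O", or an I- tag with no matching open span: close whatever is open.
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


# Made-up label ids for a 10-token sequence.
example = [0, 0, 10, 9, 0, 12, 0, 4, 3, 3]
print(extract_spans(example))
# → [('TISSUE', 2, 4), ('ORGANISM', 5, 6), ('GENEPROD', 7, 10)]
```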

Let us now define the metrics that will be used to evaluate the performance of the model.

In [31]:
from smtag.metrics import MetricsTOKCL
compute_metrics = MetricsTOKCL(label_list=label_list)
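To give a feel for what a token classification metric measures, here is a minimal sketch of token-level precision, recall and F1 in plain Python. It is not the `MetricsTOKCL` implementation, whose details are not shown in this notebook. It assumes that class id `0` is the `O` label, that `-100` marks positions to ignore, and the function name `token_prf` is made up.

```python
def token_prf(predictions, references, ignore_index=-100, outside_id=0):
    """Micro-averaged token-level precision, recall and F1 over entity tags."""
    tp = fp = fn = 0
    for pred_seq, ref_seq in zip(predictions, references):
        for pred, ref in zip(pred_seq, ref_seq):
            if ref == ignore_index:
                continue  # padding or special tokens do not count
            if pred == ref:
                if ref != outside_id:
                    tp += 1  # correctly predicted entity tag
                continue
            if pred != outside_id:
                fp += 1  # predicted an entity tag that is wrong
            if ref != outside_id:
                fn += 1  # missed or mislabelled a true entity tag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up ids: one sequence of five positions, the last one ignored.
references = [[0, 4, 3, 0, -100]]
predictions = [[0, 4, 0, 12, 0]]
print(token_prf(predictions, references))
# → {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```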

In [32]:
from smtag.train.train_tokcl import TrainingArgumentsTOKCL

training_args = TrainingArgumentsTOKCL(
    output_dir = TOKCL_MODEL_PATH,
    overwrite_output_dir = True,
    logging_steps = 1000,
    evaluation_strategy = 'epoch',
    prediction_loss_only = True,  # crucial to avoid OOM at evaluation stage!
    learning_rate = 1e-4,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    num_train_epochs = 10,
    masking_probability = None,
    replacement_probability = None,
    select_labels = False,
    per_gpu_train_batch_size=None, 
    per_gpu_eval_batch_size=None, 
    gradient_accumulation_steps=1, 
    eval_accumulation_steps=None, 
    weight_decay=0.0, 
    adam_beta1=0.9, 
    adam_beta2=0.999, 
    adam_epsilon=1e-08, 
    max_grad_norm=1.0, 
    max_steps=-1, 
    lr_scheduler_type='linear', 
    warmup_ratio=0.0, 
    warmup_steps=0, 
    save_strategy='epoch', 
    save_steps=1, 
    save_total_limit=5, 
    save_on_each_node=False, 
    no_cuda=False, 
    seed=42, 
    bf16=False, 
    fp16=False, 
    fp16_opt_level='O1', 
    half_precision_backend='auto', 
    bf16_full_eval=False, 
    fp16_full_eval=False, 
    tf32=None, 
    local_rank=-1, 
    xpu_backend=None, 
    tpu_num_cores=None, 
    tpu_metrics_debug=False, 
    debug=[], 
    dataloader_drop_last=False, 
    eval_steps=1000, 
    dataloader_num_workers=0, 
    past_index=-1, 
    run_name=TOKCL_MODEL_PATH, 
    disable_tqdm=False, 
    remove_unused_columns=True, 
    label_names=None, 
    load_best_model_at_end=False, 
    metric_for_best_model=None, 
    greater_is_better=None, 
    ignore_data_skip=False, 
    sharded_ddp=[], 
    deepspeed=None, 
    label_smoothing_factor=0.0, 
    adafactor=False, 
    group_by_length=False, 
    length_column_name='length', 
    report_to=['tensorboard'], 
    ddp_find_unused_parameters=None, 
    ddp_bucket_cap_mb=None, 
    dataloader_pin_memory=True, 
    skip_memory_metrics=True, 
    use_legacy_prediction_loop=False, 
    push_to_hub=True, 
    resume_from_checkpoint=None, 
    hub_model_id="EMBO/SourceData-NER", 
    hub_strategy='every_save', 
    hub_token=HUB_TOKEN, 
    gradient_checkpointing=False, 
    fp16_backend='auto', 
    mp_parameters=''
    )

training_args

PyTorch: setting up devices


TrainingArgumentsTOKCL(output_dir='./.tokcl_models', overwrite_output_dir=True, do_train=False, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.EPOCH: 'epoch'>, prediction_loss_only=True, per_device_train_batch_size=16, per_device_eval_batch_size=16, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=0.0001, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=10, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir='./.tokcl_models/runs/May19_08-48-32_21fe288ec657', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1000, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.EPOCH: 'epoch'>, save_steps=1, save_total_limit=5, save_on_each_node=False, no_cuda=False
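The arguments above select `lr_scheduler_type='linear'` with `warmup_steps=0` and `learning_rate=1e-4`. The sketch below shows how such a schedule behaves, assuming the standard linear warmup followed by linear decay to zero. The function `linear_schedule_lr` is written for this example, and the total of 1000 steps is a placeholder, since the real number depends on the dataset size, batch size and number of epochs.

```python
def linear_schedule_lr(step, total_steps, base_lr=1e-4, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # placeholder value, not taken from the actual run
for step in (0, 250, 500, 750, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With no warmup, the learning rate starts at its maximum and shrinks steadily, so the largest updates happen in the first epochs.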

Up to now, we have pre-processed the data and loaded a model. The loaded model ends at the transformer network. For the model to be able to perform a task, we need to provide it with a model head.

Model heads are usually fully connected layers on top of the transformer network. Although at this point we could use `torch` to build our own head on the outputs of the transformer, fully connected layers have been shown to perform well enough for several NLP tasks, including NER.

The reason is that transformer models already encode a great deal of contextual information in their output embeddings. We would therefore gain little from adding a second RNN or a conditional random field on top, as was usually done for NER. We will keep it simple and use the fully connected network provided by 🤗.

The way to do so is to load our model with a different class: `AutoModelForTokenClassification`. We use token classification because NER is a token classification task.

We need to pass the number of labels. To avoid doing this by hand for every checkpoint, we do it programmatically, using the `num_labels` value read from the features of the training dataset.

In [33]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
            config.from_pretrained,
            num_labels=num_labels,
            max_position_embeddings=config.max_length + 2,  # max_length + 2 for start/end token
            id2label = id2label,
            label2id = label2id
        )
model_config = model.config

loading configuration file https://huggingface.co/EMBO/bio-lm/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/09fed88b4a07fe6baced126e3cdb14f2764c1bc57f62d1026a75b3ffdb3ec5f8.c781727f43e25ac5b298f775b2dd4f32f53c9890a2367bbd99ffdbd856251b85
Model config RobertaConfig {
  "_name_or_path": "EMBO/bio-lm",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "I-SMALL_MOLECULE",
    "2": "B-SMALL_MOLECULE",
    "3": "I-GENEPROD",
    "4": "B-GENEPROD",
    "5": "I-SUBCELLULAR",
    "6": "B-SUBCELLULAR",
    "7": "I-CELL",
    "8": "B-CELL",
    "9": "I-TISSUE",
    "10": "B-TISSUE",
    "11": "I-ORGANISM",
    "12": "B-ORGANISM",
    "13": "I-EXP_ASSAY",
    "14": "B-EXP_ASSAY"
  },
  "initializer_range": 0.0

In [34]:
print(f"\nTraining arguments for model type {config.model_type}:")
print(model_config)
print(training_args)


Training arguments for model type Autoencoder:
RobertaConfig {
  "_name_or_path": "EMBO/bio-lm",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "I-SMALL_MOLECULE",
    "2": "B-SMALL_MOLECULE",
    "3": "I-GENEPROD",
    "4": "B-GENEPROD",
    "5": "I-SUBCELLULAR",
    "6": "B-SUBCELLULAR",
    "7": "I-CELL",
    "8": "B-CELL",
    "9": "I-TISSUE",
    "10": "B-TISSUE",
    "11": "I-ORGANISM",
    "12": "B-ORGANISM",
    "13": "I-EXP_ASSAY",
    "14": "B-EXP_ASSAY"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-CELL": 8,
    "B-EXP_ASSAY": 14,
    "B-GENEPROD": 4,
    "B-ORGANISM": 12,
    "B-SMALL_MOLECULE": 2,
    "B-SUBCELLULAR": 6,
    "B-TISSUE": 10,
    "I-CELL": 7,
    "I-EXP_ASSA

#### Training Step 3: Define the `Trainer`

We are now ready to define the [`Trainer` class](https://huggingface.co/docs/transformers/main_classes/trainer). This class provides a basic training loop supporting the features described in the documentation, and it can be customized further. We encourage you to take a look at the documentation and try it.

As it is, `trainer.train` would already train our model. However, it would only report the loss during the process. We want the loss to decrease over time, ideally for both the training and validation datasets. If the training loss keeps falling while the validation loss rises, the model is overfitting.

What if we want to see other information during training, such as accuracy or F1 score? `Trainer` provides a `compute_metrics` argument that helps with this.
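The overfitting check mentioned above can be expressed as a simple comparison of the two loss curves. This is a rough heuristic sketch, not something `Trainer` does for you: the function `first_overfitting_epoch` is hypothetical and the loss values are invented purely for illustration.

```python
def first_overfitting_epoch(train_losses, eval_losses):
    """Return the first epoch index where training loss falls but validation loss rises."""
    for epoch in range(1, min(len(train_losses), len(eval_losses))):
        train_improved = train_losses[epoch] < train_losses[epoch - 1]
        eval_worsened = eval_losses[epoch] > eval_losses[epoch - 1]
        if train_improved and eval_worsened:
            return epoch
    return None  # no sign of overfitting in the recorded epochs


# Invented per-epoch losses, not results from this notebook.
train_losses = [0.90, 0.55, 0.40, 0.31, 0.25]
eval_losses = [0.95, 0.60, 0.52, 0.56, 0.61]
print(first_overfitting_epoch(train_losses, eval_losses))
# → 3
```

A single noisy epoch can trigger this check, so in practice one would look at the trend over several epochs before stopping training.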

In [35]:
from transformers import Trainer
from smtag.tb_callback import MyTensorBoardCallback
import torch
from smtag.show import ShowExampleTOKCL
from transformers.integrations import TensorBoardCallback

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator_mask,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[ShowExampleTOKCL(AutoTokenizer.from_pretrained(config.tokenizer))]
)

# switch the Tensorboard callback to plot losses on same plot
trainer.remove_callback(TensorBoardCallback)  # remove default Tensorboard callback
trainer.add_callback(MyTensorBoardCallback)  # replace with customized callback

print(f"CUDA available: {torch.cuda.is_available()}")


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.15.0",
  "type_vocab_size": 1,
  "use_cach

OSError: Looks like you do not have git-lfs installed, please install. You can install from https://git-lfs.github.com/. Then run `git lfs install` (you only have to do this once).

In [None]:
print(trainer.args)

In [38]:
!git lfs install

git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log
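The `OSError` above appears because the training arguments set `push_to_hub=True` and `git-lfs` is not installed in this environment. One way to fail earlier, sketched below, is to check for the executable before building the `Trainer`. This assumes `git-lfs` is installed as a separate `git-lfs` binary on `PATH`, which is the usual setup; the helper `git_lfs_available` is made up for this example.

```python
import shutil


def git_lfs_available():
    """Return True if a `git-lfs` executable is found on PATH."""
    return shutil.which("git-lfs") is not None


# Decide whether pushing to the Hub is possible in this environment.
push_to_hub = git_lfs_available()
if not push_to_hub:
    print("git-lfs not found: install it, or train with push_to_hub=False.")
```

The resulting `push_to_hub` value could then be passed to the training arguments instead of the hard-coded `True`.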
