# Benchmarking models against SoDa-RoBERTa

In this document we benchmark the performance of different models against [SoDa-RoBERTa](https://github.com/source-data/soda-roberta) on different `token classification` tasks.

The goal of this experiment is not to find the best overall models, but those that most accurately classify the text entities of interest in the [SourceData](https://sourcedata.embo.org/) context, that is, in the field of molecular and cell biology. With this goal in mind, we do not use typical benchmarking datasets, but the [`sd-nlp`](https://huggingface.co/datasets/EMBO/sd-nlp) and [`sd-nlp-non-tokenized`](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized) datasets. These datasets contain annotated image captions extracted from papers on molecular biology. The data has been curated and annotated by professionals in the field.

This notebook is intended to be used with the [🤗 Datasets](https://huggingface.co/) library. 

## Table of contents

* [Chapter 1 - SoDa-RoBERTa](#chapter1)
    * [Section 1.1 - NER task for SoDa-RoBERTa](#section_1_1)
    * [Section 1.2 - SMALL_MOL_ROLES task for SoDa-RoBERTa](#section_1_2)
    * [Section 1.3 - GENEPROD_ROLES task for SoDa-RoBERTa](#section_1_3)
    * [Section 1.4 - PANELIZATION task for SoDa-RoBERTa](#section_1_4)
    * [Section 1.5 - BORING task for SoDa-RoBERTa](#section_1_5)
* [Chapter 2 - RoBERTa](#chapter2)
    * [Section 2.1 - NER task for RoBERTa](#section_2_1)
    * [Section 2.2 - SMALL_MOL_ROLES task for RoBERTa](#section_2_2)
    * [Section 2.3 - GENEPROD_ROLES task for RoBERTa](#section_2_3)
    * [Section 2.4 - PANELIZATION task for RoBERTa](#section_2_4)
    * [Section 2.5 - BORING task for RoBERTa](#section_2_5)


In [27]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from smtag.config import Config


In [28]:
from transformers import __version__
__version__

'4.15.0'

In [29]:
# from huggingface_hub import notebook_login
# notebook_login()



# Chapter 1 - SoDa-RoBERTa <a class="anchor" id="chapter1"></a>

[SoDa-RoBERTa](https://github.com/source-data/soda-roberta) [(Liechti et al. 2017)](https://doi.org/10.1038/nmeth.4471) is a package developed by the [SourceData](https://sourcedata.embo.org/) team to improve the data curation of biomedical papers in the field of molecular and cell biology.

This is the first model that we will use in our benchmarking. The data available in [`sd-nlp`](https://huggingface.co/datasets/EMBO/sd-nlp) has already been tokenized using the 🤗 `roberta-base` tokenizer. This tokenizer has been pre-trained with the `roberta-base` model, which is the base model on top of which SoDa-RoBERTa has been built. 

Since the model is already pre-trained, we just need to fine-tune it. The basic idea is that the pre-trained model will generate a series of outputs that are contextual token embeddings. For fine-tuning, a feed-forward neural network (FFNN) is added on top of these embeddings and connected to a `softmax` layer to classify the tokens.

This process is mostly automated for us by the [🤗 `AutoModelForTokenClassification`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForTokenClassification) class.
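
To make the idea of a classification head concrete, here is a minimal pure-Python sketch of a single linear layer followed by a `softmax`, applied independently to each token embedding. It is only an illustration: the sizes (4 tokens, hidden size 8, 3 labels), the random embeddings and the random weights are hypothetical, and the real head lives inside `AutoModelForTokenClassification`.

```python
import math
import random


def linear(vector, weights, bias):
    """Affine map producing one score (logit) per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_tokens(embeddings, weights, bias):
    """For each token embedding return (label probabilities, argmax label id)."""
    results = []
    for embedding in embeddings:
        probs = softmax(linear(embedding, weights, bias))
        results.append((probs, probs.index(max(probs))))
    return results


# Hypothetical toy sizes; a real roberta-base model has hidden size 768.
random.seed(0)
hidden_size, num_labels, num_tokens = 8, 3, 4
embeddings = [[random.gauss(0, 1) for _ in range(hidden_size)]
              for _ in range(num_tokens)]
weights = [[random.gauss(0, 0.1) for _ in range(hidden_size)]
           for _ in range(num_labels)]
bias = [0.0] * num_labels

predictions = classify_tokens(embeddings, weights, bias)
for probs, label_id in predictions:
    print(label_id, [round(p, 3) for p in probs])
```

Fine-tuning then amounts to adjusting `weights`, `bias` and the transformer parameters so that the probability of the correct label increases for every token.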

## Section 1.1 - NER task for SoDa-RoBERTa <a class="anchor" id="section_1_1"></a>


In [30]:
from datasets import load_dataset

In [31]:
data = load_dataset("drAbreu/sd-nlp-2", "NER")
train_dataset, eval_dataset, test_dataset = data["train"], data['validation'], data['test']
train_dataset[0:2]

Reusing dataset source_data_nlp (/root/.cache/huggingface/datasets/drAbreu___source_data_nlp/NER/0.0.1/440dcf19a03697fc2ce9c579ac33eca032235705ae974982f23b0275b37d3660)


  0%|          | 0/3 [00:00<?, ?it/s]

{'input_ids': [[0, 1640, 347, 43, 4052, 847, 33101, 43916, 14868, 303,
   129, 15, 15145, 36, 571, 35, 361, 25610, 8, 1368,
   35, 545, 25610, 43, 15, 36475, 36, 118, 35, 361,
   25610, 43, 9, 3186, 3082, 4, 306, 4, 735, 2549,
   58, 1455, 129, 15, 5, 5856, 9, 5, 155, 1484,
   25, 22827, 13, 3186, 3082, 4, 306, 36, 267, 35,
   545, 25610, 322, 1437, 2],
  [0, 28588, 3693, 3041, 44193, 40899, 16007, 21258, 2018, 5,
   127, 523, 12572, 3551, 5252, 11, 364, 771, 2571, 9,
   545, 12, 3583, 12, 279, 15540, 9789, 15, 10, 239,
   12, 19987, 5626, 36, 725, 24667, 43, 36, 4070, 43,
   8, 5, 12337, 11257, 5656, 36, 6960, 322, 20, 2853,
   2798, 924, 5, 274, 306, 73, 2940,
  

In [32]:
type(train_dataset)

datasets.arrow_dataset.Dataset

Each of the models we benchmark uses a different training configuration. We generate these configurations using the `config_dict` variable in the module `config.py`.

This configuration includes settings as important as the model checkpoint to be used.

SoDa-RoBERTa provides a language model, [`bio-lm`](https://huggingface.co/EMBO/bio-lm). This model was initialized from the `roberta-base` checkpoint, which is why the tokenizer to use is that of `roberta-base`.

In the next cells we load the `bio-lm` checkpoint and the `roberta-base` tokenizer.

The `sd-nlp` dataset is already tokenized, so the data is ready to be processed. The next step is to organize the data so that it can be loaded in batches.

This is done with data collators. 🤗 has a generic data collator, `DataCollatorForTokenClassification`, that does what we need. However, `smtag` also has its own `DataCollatorForMaskedTokenClassification`, which uses the `tag_mask` column to randomly mask the values. This was done by Thomas. I am assuming the reason was to improve the generalization of the task, but this needs to be checked with him.

After checking, it looks like for some reason only the `DataCollatorForMaskedTokenClassification` works here, so let's keep it as it is and move forward.
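
As a rough illustration of what a token classification collator does when it builds a batch, the sketch below pads every example to the length of the longest one. It is a simplified stand-in, not the 🤗 or `smtag` implementation: the function name `pad_batch` and the toy examples are hypothetical, the padding id `1` is the `<pad>` id of `roberta-base`, and `-100` is the label value that the loss ignores. The random masking based on `tag_mask` is deliberately left out.

```python
def pad_batch(examples, pad_token_id=1, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        length = len(ex["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded positions get label_pad_id so they do not contribute to the loss.
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_pad)
    return batch


# Hypothetical toy examples of different lengths.
examples = [
    {"input_ids": [0, 1640, 347, 2], "labels": [0, 0, 10, 0]},
    {"input_ids": [0, 28588, 2], "labels": [0, 12, 0]},
]
batch = pad_batch(examples)
for key, rows in batch.items():
    print(key, rows)
```

The real collators additionally convert these lists to tensors (`return_tensors='pt'`).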

In [39]:
from transformers import DataCollatorForTokenClassification
from smtag.data_collator import DataCollatorForMaskedTokenClassification
from transformers import AutoTokenizer
# Check the case of the MaskedTokenClassification
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
config = Config(model_type = "Autoencoder", from_pretrained = "EMBO/bio-lm", tokenizer = 'roberta-base')
data_collator_mask = DataCollatorForMaskedTokenClassification(tokenizer=AutoTokenizer.from_pretrained(config.tokenizer), 
                                                              padding=True,
                                                              max_length=512,
                                                              pad_to_multiple_of=None,
                                                              return_tensors='pt',
                                                              masking_probability=0.0,
                                                              replacement_probability=0.0,
                                                              select_labels=False)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, 
                                                   padding=True,
                                                   return_tensors='pt')


In [40]:
batch_size=1
batch = data_collator([data["train"][i] for i in range(batch_size)])
batch

{'input_ids': tensor([[    0,  1640,   347,    43,  4052,   847, 33101, 43916, 14868,   303,
            129,    15, 15145,    36,   571,    35,   361, 25610,     8,  1368,
             35,   545, 25610,    43,    15, 36475,    36,   118,    35,   361,
          25610,    43,     9,  3186,  3082,     4,   306,     4,   735,  2549,
             58,  1455,   129,    15,     5,  5856,     9,     5,   155,  1484,
             25, 22827,    13,  3186,  3082,     4,   306,    36,   267,    35,
            545, 25610,   322,  1437,     2]]),
 'labels': tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,
           0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0, 12,  0,  0,
           0,  0, 10,  9,  0,  0,  0,  0,  0, 10,  0,  0,  0, 12,  0,  0,  0, 12,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]]),
 'tag_mask': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 

In [41]:
batch_size=1
batch = data_collator_mask([data["train"][i] for i in range(batch_size)])
batch



{'input_ids': tensor([[    0,  1640,   347,    43,  4052,   847, 33101, 43916, 14868,   303,
            129,    15, 15145,    36,   571,    35,   361, 25610,     8,  1368,
             35,   545, 25610,    43,    15, 36475,    36,   118,    35,   361,
          25610,    43,     9,  3186,  3082,     4,   306,     4,   735,  2549,
             58,  1455,   129,    15,     5,  5856,     9,     5,   155,  1484,
             25, 22827,    13,  3186,  3082,     4,   306,    36,   267,    35,
            545, 25610,   322,  1437,     2]]),
 'labels': tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,
           0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0, 12,  0,  0,
           0,  0, 10,  9,  0,  0,  0,  0,  0, 10,  0,  0,  0, 12,  0,  0,  0, 12,
           0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

After the data collator, it is time to define the hyperparameters needed by the 🤗 `Trainer` class.

In [42]:
import os
from dotenv import load_dotenv

load_dotenv()
LM_MODEL_PATH = os.getenv('LM_MODEL_PATH')
TOKENIZER_PATH = os.getenv('TOKENIZER_PATH')
TOKCL_MODEL_PATH = os.getenv('TOKCL_MODEL_PATH')
CACHE = os.getenv('CACHE')
RUNS_DIR = os.getenv('RUNS_DIR')


At this point we also need a few important definitions: the number of labels and the mappings between label names and label ids.

In [43]:
num_labels = train_dataset.info.features['labels'].feature.num_classes
label_list = train_dataset.info.features['labels'].feature.names
id2label, label2id = {}, {}
for class_, label in zip(range(num_labels), label_list):
    id2label[class_] = label 
    label2id[label] = class_ 
print(f"\nTraining on {num_labels} features:")
print(", ".join(label_list))
id2label
label2id


Training on 15 features:
O, I-SMALL_MOLECULE, B-SMALL_MOLECULE, I-GENEPROD, B-GENEPROD, I-SUBCELLULAR, B-SUBCELLULAR, I-CELL, B-CELL, I-TISSUE, B-TISSUE, I-ORGANISM, B-ORGANISM, I-EXP_ASSAY, B-EXP_ASSAY


{'O': 0,
 'I-SMALL_MOLECULE': 1,
 'B-SMALL_MOLECULE': 2,
 'I-GENEPROD': 3,
 'B-GENEPROD': 4,
 'I-SUBCELLULAR': 5,
 'B-SUBCELLULAR': 6,
 'I-CELL': 7,
 'B-CELL': 8,
 'I-TISSUE': 9,
 'B-TISSUE': 10,
 'I-ORGANISM': 11,
 'B-ORGANISM': 12,
 'I-EXP_ASSAY': 13,
 'B-EXP_ASSAY': 14}
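
With these mappings, the integer labels seen in the batches above can be read back as tags. The short sketch below rebuilds `id2label` from the label list printed above and decodes a hypothetical sequence of label ids; the helper `decode` and the sample sequence are illustrative only.

```python
label_list = [
    "O", "I-SMALL_MOLECULE", "B-SMALL_MOLECULE", "I-GENEPROD", "B-GENEPROD",
    "I-SUBCELLULAR", "B-SUBCELLULAR", "I-CELL", "B-CELL", "I-TISSUE",
    "B-TISSUE", "I-ORGANISM", "B-ORGANISM", "I-EXP_ASSAY", "B-EXP_ASSAY",
]
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}


def decode(label_ids, ignore_index=-100):
    """Map label ids to tag names; ignored positions become None."""
    return [None if i == ignore_index else id2label[i] for i in label_ids]


# Hypothetical sequence using ids that also occur in the batch shown earlier.
sample_ids = [0, 10, 9, 0, 12, -100]
decoded = decode(sample_ids)
print(decoded)
# ['O', 'B-TISSUE', 'I-TISSUE', 'O', 'B-ORGANISM', None]
```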

Let us now define the metrics that will be used to evaluate the performance of the model.

In [44]:
from smtag.metrics import MetricsTOKCL
compute_metrics = MetricsTOKCL(label_list=label_list)

In [46]:
from smtag.train.train_tokcl import TrainingArgumentsTOKCL

training_args = TrainingArgumentsTOKCL(
    output_dir = TOKCL_MODEL_PATH,
    overwrite_output_dir = True,
    logging_steps = 1000,
    evaluation_strategy = 'epoch',
    prediction_loss_only = True,  # crucial to avoid OOM at evaluation stage!
    learning_rate = 1e-4,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    num_train_epochs = 10,
    masking_probability = None,
    replacement_probability = None,
    select_labels = False,
    per_gpu_train_batch_size=None, 
    per_gpu_eval_batch_size=None, 
    gradient_accumulation_steps=1, 
    eval_accumulation_steps=None, 
    weight_decay=0.0, 
    adam_beta1=0.9, 
    adam_beta2=0.999, 
    adam_epsilon=1e-08, 
    max_grad_norm=1.0, 
    max_steps=-1, 
    lr_scheduler_type='linear', 
    warmup_ratio=0.0, 
    warmup_steps=0, 
    save_strategy='epoch', 
    save_steps=1, 
    save_total_limit=5, 
    save_on_each_node=False, 
    no_cuda=False, 
    seed=42, 
    bf16=False, 
    fp16=False, 
    fp16_opt_level='O1', 
    half_precision_backend='auto', 
    bf16_full_eval=False, 
    fp16_full_eval=False, 
    tf32=None, 
    local_rank=-1, 
    xpu_backend=None, 
    tpu_num_cores=None, 
    tpu_metrics_debug=False, 
    debug=[], 
    dataloader_drop_last=False, 
    eval_steps=1000, 
    dataloader_num_workers=0, 
    past_index=-1, 
    run_name=TOKCL_MODEL_PATH, 
    disable_tqdm=False, 
    remove_unused_columns=True, 
    label_names=None, 
    load_best_model_at_end=False, 
    metric_for_best_model=None, 
    greater_is_better=None, 
    ignore_data_skip=False, 
    sharded_ddp=[], 
    deepspeed=None, 
    label_smoothing_factor=0.0, 
    adafactor=False, 
    group_by_length=False, 
    length_column_name='length', 
    report_to=['tensorboard'], 
    ddp_find_unused_parameters=None, 
    ddp_bucket_cap_mb=None, 
    dataloader_pin_memory=True, 
    skip_memory_metrics=True, 
    use_legacy_prediction_loop=False, 
    push_to_hub=True, 
    resume_from_checkpoint=None, 
    hub_model_id="EMBO/SourceData-NER", 
    hub_strategy='every_save', 
    hub_token=False, 
    gradient_checkpointing=False, 
    fp16_backend='auto', 
    mp_parameters=''
    )

training_args

TrainingArgumentsTOKCL(output_dir='/tokcl_models', overwrite_output_dir=True, do_train=False, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.EPOCH: 'epoch'>, prediction_loss_only=True, per_device_train_batch_size=16, per_device_eval_batch_size=16, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=0.0001, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=10, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir='/tokcl_models/runs/May20_08-45-53_5719849be9d3', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1000, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.EPOCH: 'epoch'>, save_steps=1, save_total_limit=5, save_on_each_node=False, no_cuda=False, se
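
To get a feeling for what `lr_scheduler_type='linear'` with `warmup_steps=0` and `learning_rate = 1e-4` means in practice, the following sketch computes the learning rate at a few steps of a linear warmup and decay schedule. The batch size (16) and number of epochs (10) come from the arguments above; the number of training examples (48,000) is a hypothetical value chosen for illustration, and the helper names are ours, not part of 🤗.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


num_examples = 48_000  # hypothetical dataset size
total_steps = total_training_steps(num_examples, batch_size=16, num_epochs=10)
for step in (0, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, base_lr=1e-4, total_steps=total_steps)
    print(step, lr)
```

Under these assumptions the learning rate starts at its full value and shrinks linearly until it reaches zero at the last step.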

Up to now, we have pre-processed the data and selected a pre-trained model. This model stops at the transformer network; for it to perform a task, we need to provide it with a model head.

Model heads are usually fully connected layers on top of the transformer network. Although at this point we could perfectly well use `torch` to build our own head from the output of the transformer, fully connected layers have been shown to perform well enough on several NLP tasks, including NER.

The reason is that transformer models already encode a great deal of contextual information in their output embeddings. We would therefore gain little from adding a second RNN or a conditional random field on top, as was usually done for NER. We will therefore keep it simple and use the fully connected head provided by 🤗.

The way to do so is to load our model with a different class: `AutoModelForTokenClassification`. We use token classification because NER is a token classification task.

We need to pass the number of labels. To avoid hard-coding it for every single checkpoint, we obtain it programmatically: `num_labels` is read from the features of the training dataset, as done earlier.

In [47]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
            config.from_pretrained,
            num_labels=num_labels,
            max_position_embeddings=config.max_length + 2,  # RoBERTa position ids start at padding_idx + 1 = 2, hence max_length + 2
            id2label = id2label,
            label2id = label2id
        )
model_config = model.config

Some weights of the model checkpoint at EMBO/bio-lm were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at EMBO/bio-lm and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-
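
The `+ 2` in `max_position_embeddings` can be understood from the way RoBERTa numbers positions. As far as we understand the 🤗 implementation, padding tokens keep the position `padding_idx` (1 for `roberta-base`) and real tokens are numbered from `padding_idx + 1` onwards. The pure-Python sketch below mimics that rule; the function name and the toy inputs are hypothetical.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Position ids: padding keeps padding_idx, real tokens count from padding_idx + 1."""
    position_ids = []
    seen = 0
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(seen + padding_idx)
    return position_ids


# A short padded sequence: <s> tok tok </s> <pad> <pad>
short_positions = roberta_position_ids([0, 1640, 347, 2, 1, 1])
print(short_positions)
# [2, 3, 4, 5, 1, 1]

# A full sequence of 512 real tokens needs position ids up to 513,
# so the embedding table must hold 514 = 512 + 2 rows.
full_positions = roberta_position_ids([0] + [100] * 510 + [2])
print(max(full_positions) + 1)
```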

In [48]:
print(f"\nTraining arguments for model type {config.model_type}:")
print(model_config)
print(training_args)


Training arguments for model type Autoencoder:
RobertaConfig {
  "_name_or_path": "EMBO/bio-lm",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "I-SMALL_MOLECULE",
    "2": "B-SMALL_MOLECULE",
    "3": "I-GENEPROD",
    "4": "B-GENEPROD",
    "5": "I-SUBCELLULAR",
    "6": "B-SUBCELLULAR",
    "7": "I-CELL",
    "8": "B-CELL",
    "9": "I-TISSUE",
    "10": "B-TISSUE",
    "11": "I-ORGANISM",
    "12": "B-ORGANISM",
    "13": "I-EXP_ASSAY",
    "14": "B-EXP_ASSAY"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-CELL": 8,
    "B-EXP_ASSAY": 14,
    "B-GENEPROD": 4,
    "B-ORGANISM": 12,
    "B-SMALL_MOLECULE": 2,
    "B-SUBCELLULAR": 6,
    "B-TISSUE": 10,
    "I-CELL": 7,
    "I-EXP_ASSA

#### Training Step 3: Define the `Trainer`

We are now ready to define the [`Trainer` class](https://huggingface.co/docs/transformers/main_classes/trainer). This class is a basic training loop that supports a series of features described in the documentation. It can, however, be further customized. We encourage you to take a look at the documentation and try it.

As it is, `trainer.train` would already train our model. However, it would report only the loss during the process. We want the loss to decrease over time, ideally for both the training and the validation datasets. If the training loss keeps decreasing while the validation loss does not, we are overfitting.

What if we want to see other information during training, such as the accuracy or the F1 score? `Trainer` provides a `compute_metrics` argument that helps us with this.
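
To illustrate the kind of computation a `compute_metrics` function performs, here is a minimal pure-Python sketch of token-level accuracy, precision, recall and F1. It is not the `MetricsTOKCL` implementation from `smtag`: the function name `token_metrics`, the choice of token-level (rather than entity-level) scoring and the toy inputs are all assumptions. It expects predicted label ids (the argmax of the logits) and gold label ids, and skips positions labelled `-100`.

```python
def token_metrics(pred_ids, label_ids, outside_id=0, ignore_index=-100):
    """Token-level accuracy plus micro precision/recall/F1 over non-"O" labels."""
    correct = total = tp = fp = fn = 0
    for pred_seq, gold_seq in zip(pred_ids, label_ids):
        for pred, gold in zip(pred_seq, gold_seq):
            if gold == ignore_index:
                continue  # special or padded position
            total += 1
            correct += int(pred == gold)
            if pred == gold and pred != outside_id:
                tp += 1
            else:
                if pred != outside_id:
                    fp += 1  # predicted an entity tag that is not the gold one
                if gold != outside_id:
                    fn += 1  # gold entity tag missed or mislabelled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical predictions and gold labels for one sequence.
metrics = token_metrics(
    pred_ids=[[0, 10, 9, 0, 4, 0]],
    label_ids=[[-100, 10, 9, 12, 0, -100]],
)
print(metrics)
```

Note that, as far as we understand the `Trainer`, setting `prediction_loss_only = True` (as in the training arguments above) makes evaluation return only the loss, so metrics of this kind would not be reported unless that flag is switched off.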

In [49]:
from transformers import Trainer
from smtag.tb_callback import MyTensorBoardCallback
import torch
from smtag.show import ShowExampleTOKCL
from transformers.integrations import TensorBoardCallback

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[ShowExampleTOKCL(AutoTokenizer.from_pretrained(config.tokenizer))]
)

# switch the Tensorboard callback to plot losses on same plot
trainer.remove_callback(TensorBoardCallback)  # remove default Tensorboard callback
trainer.add_callback(MyTensorBoardCallback)  # replace with customized callback

print(f"CUDA available: {torch.cuda.is_available()}")


Cloning https://huggingface.co/EMBO/SourceData-NER into local empty directory.


OSError: [Errno 16] Device or resource busy: '/tokcl_models'

In [None]:
print(trainer.args)

In [None]:
#trainer.train()


## Train the models using the general tokenizers

In [1]:
from datasets import load_dataset
import os
from dotenv import load_dotenv
from transformers import DataCollatorForTokenClassification
from smtag.data_collator import DataCollatorForMaskedTokenClassification
from transformers import AutoTokenizer
from smtag.metrics import MetricsTOKCL
from transformers import AutoModelForTokenClassification
from transformers import Trainer
from smtag.tb_callback import MyTensorBoardCallback
import torch
from smtag.show import ShowExampleTOKCL
from transformers.integrations import TensorBoardCallback
from smtag.config import Config
from smtag.train.train_tokcl import TrainingArgumentsTOKCL
from transformers import TrainingArguments



In [2]:
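def shift_label(label):
    # If the label is B-XXX we change it to I-XXX.
    # Assumes the label ids printed earlier: B- tags are even (2, 4, ...)
    # and the matching I- tag is the odd id just below (1, 3, ...).
    if label != 0 and label % 2 == 0:
        label -= 1
    return label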


def align_labels_with_tokens(labels, word_ids):
    """
    Expands the NER tags once the sub-word tokenization is added.
    Arguments
    ---------
    labels list[int]:
    word_ids list[int]
    """
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)
        elif word_id != current_word:
            # Start of a new word!
            current_word = word_id
            # As far as word_id matches the index of the current word
            # We append the same label
            new_labels.append(labels[word_id])
        else:
            new_labels.append(shift_label(labels[word_id]))

    return new_labels

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['words'], 
                       truncation=True,
                       is_split_into_words=True)
    
    all_labels = examples['labels']
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
        
    tokenized_inputs['labels'] = new_labels
    return tokenized_inputs
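
A small worked example may help to see what the label alignment does. The sketch below is self-contained and assumes the label ids printed earlier in this notebook, where `B-` tags are even and the matching `I-` tag is the odd id just below (for instance `B-GENEPROD` = 4 and `I-GENEPROD` = 3). The word labels and `word_ids` are hypothetical; in practice `word_ids` comes from the tokenizer output.

```python
def shift_label(label):
    """Turn a B-XXX id into the matching I-XXX id (B is even, I is B - 1)."""
    if label != 0 and label % 2 == 0:
        label -= 1
    return label


def align_labels_with_tokens(labels, word_ids):
    """Expand word-level labels to sub-word tokens; special tokens get -100."""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)  # special token such as [CLS] or [SEP]
        elif word_id != current_word:
            current_word = word_id  # first sub-token keeps the word label
            new_labels.append(labels[word_id])
        else:
            # continuation sub-token: B-XXX becomes I-XXX
            new_labels.append(shift_label(labels[word_id]))
    return new_labels


# Three words labelled B-GENEPROD, O, B-TISSUE; words 0 and 2 split in two.
word_labels = [4, 0, 10]
word_ids = [None, 0, 0, 1, 2, 2, None]
aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)
# [-100, 4, 3, 0, 10, 9, -100]
```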


In [8]:
checkpoints = {
    "bert-cased": "bert-base-cased",  # working
    "bert-uncased": "bert-base-uncased",  # working
    "biobert": "dmis-lab/biobert-base-cased-v1.1",  # working
    "pubmedbert": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",  # working on Colab
}

In [9]:
checkpoint = checkpoints["biobert"]
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

data = load_dataset("EMBO/sd-nlp-non-tokenized", "NER")
train_dataset, eval_dataset, test_dataset = data["train"], data['validation'], data['test']
train_dataset[0:2]

tokenized_data = data.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=data['train'].column_names)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, 
                                        return_tensors='pt')

Downloading and preparing dataset source_data_nlp/NER to /root/.cache/huggingface/datasets/EMBO___source_data_nlp/NER/0.0.1/b15357fa238e627492e02f6ada34ffe2637a00d9bf27a2404602d3f052a46581...


Downloading:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

 This line is taking place


0 examples [00:00, ? examples/s]

 This line is taking place


0 examples [00:00, ? examples/s]

 This line is taking place
Dataset source_data_nlp downloaded and prepared to /root/.cache/huggingface/datasets/EMBO___source_data_nlp/NER/0.0.1/b15357fa238e627492e02f6ada34ffe2637a00d9bf27a2404602d3f052a46581. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/49 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/14 [00:00<?, ?ba/s]

In [8]:
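# The first argument of TrainingArguments is output_dir; here the checkpoint name is reused for it.
training_args = TrainingArguments(output_dir=checkpoint)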

In [None]:
training_args

In [None]:
def define_labels(dataset):
    num_labels = dataset.info.features['labels'].feature.num_classes
    label_list = dataset.info.features['labels'].feature.names
    id2label, label2id = {}, {}
    print(num_labels, label_list)
    for class_, label in zip(range(num_labels), label_list):
        id2label[class_] = label 
        label2id[label] = class_ 
    return id2label, label2id
id2label, label2id = define_labels(tokenized_data['train'])
id2label

In [None]:
max_length = 512
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(list(id2label.keys())),
    max_position_embeddings=max_length,  
    id2label = id2label,
    label2id = label2id
)
model_config = model.config
compute_metrics = MetricsTOKCL(label_list=list(label2id.keys()))
       
# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    tokenizer=tokenizer,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['validation'],
    compute_metrics=compute_metrics,
)

# switch the Tensorboard callback to plot losses on same plot
trainer.remove_callback(TensorBoardCallback)  # remove default Tensorboard callback
trainer.add_callback(MyTensorBoardCallback)  # replace with customized callback

trainer.train()
    


# Loading BioMegatron

In [23]:
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "data/models/biomegatron/",
    from_tf=False,
    from_flax=False
#     num_labels=len(list(id2label.keys())),
#     max_position_embeddings=max_length,  
#     id2label = id2label,
#     label2id = label2id
)


OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index', 'flax_model.msgpack'] found in directory data/models/biomegatron/ or `from_tf` and `from_flax` set to False.

In [1]:
from transformers import AutoModel, AutoConfig, MegatronBertConfig
import torch
model = AutoModel.from_config('data/models/biomegatron/config.json')
weights = torch.load('data/models/biomegatron/MegatronBERT.pt', 
                                 map_location=torch.device('cpu'))
model.load_state_dict(weights['model'])

ValueError: Unrecognized configuration class <class 'str'> for this kind of AutoModel: AutoModel.
Model type should be one of ImageGPTConfig, QDQBertConfig, FNetConfig, SegformerConfig, VisionTextDualEncoderConfig, PerceiverConfig, GPTJConfig, LayoutLMv2Config, BeitConfig, RemBertConfig, VisualBertConfig, CanineConfig, RoFormerConfig, CLIPConfig, BigBirdPegasusConfig, DeiTConfig, LukeConfig, DetrConfig, GPTNeoConfig, BigBirdConfig, Speech2TextConfig, ViTConfig, Wav2Vec2Config, M2M100Config, ConvBertConfig, LEDConfig, BlenderbotSmallConfig, RetriBertConfig, IBertConfig, MT5Config, T5Config, MobileBertConfig, DistilBertConfig, AlbertConfig, BertGenerationConfig, CamembertConfig, XLMRobertaConfig, PegasusConfig, MarianConfig, MBartConfig, MegatronBertConfig, MPNetConfig, BartConfig, BlenderbotConfig, ReformerConfig, LongformerConfig, RobertaConfig, DebertaV2Config, DebertaConfig, FlaubertConfig, FSMTConfig, SqueezeBertConfig, HubertConfig, BertConfig, OpenAIGPTConfig, GPT2Config, TransfoXLConfig, XLNetConfig, XLMProphetNetConfig, ProphetNetConfig, XLMConfig, CTRLConfig, ElectraConfig, FunnelConfig, LxmertConfig, DPRConfig, LayoutLMConfig, TapasConfig, SplinterConfig, SEWDConfig, SEWConfig, UniSpeechSatConfig, UniSpeechConfig, WavLMConfig.

In [22]:
from transformers import MegatronBertModel, MegatronBertConfig
model = MegatronBertModel(MegatronBertConfig()) 
model.load_state_dict(torch.load('data/models/biomegatron/MegatronBERT.pt', 
                                 map_location=torch.device('cpu')))

# (torch.load('data/models/biomegatron/MegatronBERT.pt', 
#                                  map_location=torch.device('cpu')))

RuntimeError: Error(s) in loading state_dict for MegatronBertModel:
	Missing key(s) in state_dict: "embeddings.position_ids", "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "encoder.layer.0.attention.ln.weight", "encoder.layer.0.attention.ln.bias", "encoder.layer.0.attention.self.query.weight", "encoder.layer.0.attention.self.query.bias", "encoder.layer.0.attention.self.key.weight", "encoder.layer.0.attention.self.key.bias", "encoder.layer.0.attention.self.value.weight", "encoder.layer.0.attention.self.value.bias", "encoder.layer.0.attention.output.dense.weight", "encoder.layer.0.attention.output.dense.bias", "encoder.layer.0.ln.weight", "encoder.layer.0.ln.bias", "encoder.layer.0.intermediate.dense.weight", "encoder.layer.0.intermediate.dense.bias", "encoder.layer.0.output.dense.weight", "encoder.layer.0.output.dense.bias", "encoder.layer.1.attention.ln.weight", "encoder.layer.1.attention.ln.bias", "encoder.layer.1.attention.self.query.weight", "encoder.layer.1.attention.self.query.bias", "encoder.layer.1.attention.self.key.weight", "encoder.layer.1.attention.self.key.bias", "encoder.layer.1.attention.self.value.weight", "encoder.layer.1.attention.self.value.bias", "encoder.layer.1.attention.output.dense.weight", "encoder.layer.1.attention.output.dense.bias", "encoder.layer.1.ln.weight", "encoder.layer.1.ln.bias", "encoder.layer.1.intermediate.dense.weight", "encoder.layer.1.intermediate.dense.bias", "encoder.layer.1.output.dense.weight", "encoder.layer.1.output.dense.bias", "encoder.layer.2.attention.ln.weight", "encoder.layer.2.attention.ln.bias", "encoder.layer.2.attention.self.query.weight", "encoder.layer.2.attention.self.query.bias", "encoder.layer.2.attention.self.key.weight", "encoder.layer.2.attention.self.key.bias", "encoder.layer.2.attention.self.value.weight", "encoder.layer.2.attention.self.value.bias", "encoder.layer.2.attention.output.dense.weight", "encoder.layer.2.attention.output.dense.bias", "encoder.layer.2.ln.weight", 
"encoder.layer.2.ln.bias", "encoder.layer.2.intermediate.dense.weight", "encoder.layer.2.intermediate.dense.bias", "encoder.layer.2.output.dense.weight", "encoder.layer.2.output.dense.bias", "encoder.layer.3.attention.ln.weight", "encoder.layer.3.attention.ln.bias", "encoder.layer.3.attention.self.query.weight", "encoder.layer.3.attention.self.query.bias", "encoder.layer.3.attention.self.key.weight", "encoder.layer.3.attention.self.key.bias", "encoder.layer.3.attention.self.value.weight", "encoder.layer.3.attention.self.value.bias", "encoder.layer.3.attention.output.dense.weight", "encoder.layer.3.attention.output.dense.bias", "encoder.layer.3.ln.weight", "encoder.layer.3.ln.bias", "encoder.layer.3.intermediate.dense.weight", "encoder.layer.3.intermediate.dense.bias", "encoder.layer.3.output.dense.weight", "encoder.layer.3.output.dense.bias", "encoder.layer.4.attention.ln.weight", "encoder.layer.4.attention.ln.bias", "encoder.layer.4.attention.self.query.weight", "encoder.layer.4.attention.self.query.bias", "encoder.layer.4.attention.self.key.weight", "encoder.layer.4.attention.self.key.bias", "encoder.layer.4.attention.self.value.weight", "encoder.layer.4.attention.self.value.bias", "encoder.layer.4.attention.output.dense.weight", "encoder.layer.4.attention.output.dense.bias", "encoder.layer.4.ln.weight", "encoder.layer.4.ln.bias", "encoder.layer.4.intermediate.dense.weight", "encoder.layer.4.intermediate.dense.bias", "encoder.layer.4.output.dense.weight", "encoder.layer.4.output.dense.bias", "encoder.layer.5.attention.ln.weight", "encoder.layer.5.attention.ln.bias", "encoder.layer.5.attention.self.query.weight", "encoder.layer.5.attention.self.query.bias", "encoder.layer.5.attention.self.key.weight", "encoder.layer.5.attention.self.key.bias", "encoder.layer.5.attention.self.value.weight", "encoder.layer.5.attention.self.value.bias", "encoder.layer.5.attention.output.dense.weight", "encoder.layer.5.attention.output.dense.bias", "encoder.layer.5.ln.weight", 
"encoder.layer.5.ln.bias", "encoder.layer.5.intermediate.dense.weight", "encoder.layer.5.intermediate.dense.bias", "encoder.layer.5.output.dense.weight", "encoder.layer.5.output.dense.bias", "encoder.layer.6.attention.ln.weight", "encoder.layer.6.attention.ln.bias", "encoder.layer.6.attention.self.query.weight", "encoder.layer.6.attention.self.query.bias", "encoder.layer.6.attention.self.key.weight", "encoder.layer.6.attention.self.key.bias", "encoder.layer.6.attention.self.value.weight", "encoder.layer.6.attention.self.value.bias", "encoder.layer.6.attention.output.dense.weight", "encoder.layer.6.attention.output.dense.bias", "encoder.layer.6.ln.weight", "encoder.layer.6.ln.bias", "encoder.layer.6.intermediate.dense.weight", "encoder.layer.6.intermediate.dense.bias", "encoder.layer.6.output.dense.weight", "encoder.layer.6.output.dense.bias", "encoder.layer.7.attention.ln.weight", "encoder.layer.7.attention.ln.bias", "encoder.layer.7.attention.self.query.weight", "encoder.layer.7.attention.self.query.bias", "encoder.layer.7.attention.self.key.weight", "encoder.layer.7.attention.self.key.bias", "encoder.layer.7.attention.self.value.weight", "encoder.layer.7.attention.self.value.bias", "encoder.layer.7.attention.output.dense.weight", "encoder.layer.7.attention.output.dense.bias", "encoder.layer.7.ln.weight", "encoder.layer.7.ln.bias", "encoder.layer.7.intermediate.dense.weight", "encoder.layer.7.intermediate.dense.bias", "encoder.layer.7.output.dense.weight", "encoder.layer.7.output.dense.bias", "encoder.layer.8.attention.ln.weight", "encoder.layer.8.attention.ln.bias", "encoder.layer.8.attention.self.query.weight", "encoder.layer.8.attention.self.query.bias", "encoder.layer.8.attention.self.key.weight", "encoder.layer.8.attention.self.key.bias", "encoder.layer.8.attention.self.value.weight", "encoder.layer.8.attention.self.value.bias", "encoder.layer.8.attention.output.dense.weight", "encoder.layer.8.attention.output.dense.bias", "encoder.layer.8.ln.weight", 
"encoder.layer.8.ln.bias", "encoder.layer.8.intermediate.dense.weight", "encoder.layer.8.intermediate.dense.bias", "encoder.layer.8.output.dense.weight", "encoder.layer.8.output.dense.bias", "encoder.layer.9.attention.ln.weight", "encoder.layer.9.attention.ln.bias", "encoder.layer.9.attention.self.query.weight", "encoder.layer.9.attention.self.query.bias", "encoder.layer.9.attention.self.key.weight", "encoder.layer.9.attention.self.key.bias", "encoder.layer.9.attention.self.value.weight", "encoder.layer.9.attention.self.value.bias", "encoder.layer.9.attention.output.dense.weight", "encoder.layer.9.attention.output.dense.bias", "encoder.layer.9.ln.weight", "encoder.layer.9.ln.bias", "encoder.layer.9.intermediate.dense.weight", "encoder.layer.9.intermediate.dense.bias", "encoder.layer.9.output.dense.weight", "encoder.layer.9.output.dense.bias", "encoder.layer.10.attention.ln.weight", "encoder.layer.10.attention.ln.bias", "encoder.layer.10.attention.self.query.weight", "encoder.layer.10.attention.self.query.bias", "encoder.layer.10.attention.self.key.weight", "encoder.layer.10.attention.self.key.bias", "encoder.layer.10.attention.self.value.weight", "encoder.layer.10.attention.self.value.bias", "encoder.layer.10.attention.output.dense.weight", "encoder.layer.10.attention.output.dense.bias", "encoder.layer.10.ln.weight", "encoder.layer.10.ln.bias", "encoder.layer.10.intermediate.dense.weight", "encoder.layer.10.intermediate.dense.bias", "encoder.layer.10.output.dense.weight", "encoder.layer.10.output.dense.bias", "encoder.layer.11.attention.ln.weight", "encoder.layer.11.attention.ln.bias", "encoder.layer.11.attention.self.query.weight", "encoder.layer.11.attention.self.query.bias", "encoder.layer.11.attention.self.key.weight", "encoder.layer.11.attention.self.key.bias", "encoder.layer.11.attention.self.value.weight", "encoder.layer.11.attention.self.value.bias", "encoder.layer.11.attention.output.dense.weight", "encoder.layer.11.attention.output.dense.bias", 
"encoder.layer.11.ln.weight", "encoder.layer.11.ln.bias", "encoder.layer.11.intermediate.dense.weight", "encoder.layer.11.intermediate.dense.bias", "encoder.layer.11.output.dense.weight", "encoder.layer.11.output.dense.bias", "encoder.layer.12.attention.ln.weight", "encoder.layer.12.attention.ln.bias", "encoder.layer.12.attention.self.query.weight", "encoder.layer.12.attention.self.query.bias", "encoder.layer.12.attention.self.key.weight", "encoder.layer.12.attention.self.key.bias", "encoder.layer.12.attention.self.value.weight", "encoder.layer.12.attention.self.value.bias", "encoder.layer.12.attention.output.dense.weight", "encoder.layer.12.attention.output.dense.bias", "encoder.layer.12.ln.weight", "encoder.layer.12.ln.bias", "encoder.layer.12.intermediate.dense.weight", "encoder.layer.12.intermediate.dense.bias", "encoder.layer.12.output.dense.weight", "encoder.layer.12.output.dense.bias", "encoder.layer.13.attention.ln.weight", "encoder.layer.13.attention.ln.bias", "encoder.layer.13.attention.self.query.weight", "encoder.layer.13.attention.self.query.bias", "encoder.layer.13.attention.self.key.weight", "encoder.layer.13.attention.self.key.bias", "encoder.layer.13.attention.self.value.weight", "encoder.layer.13.attention.self.value.bias", "encoder.layer.13.attention.output.dense.weight", "encoder.layer.13.attention.output.dense.bias", "encoder.layer.13.ln.weight", "encoder.layer.13.ln.bias", "encoder.layer.13.intermediate.dense.weight", "encoder.layer.13.intermediate.dense.bias", "encoder.layer.13.output.dense.weight", "encoder.layer.13.output.dense.bias", "encoder.layer.14.attention.ln.weight", "encoder.layer.14.attention.ln.bias", "encoder.layer.14.attention.self.query.weight", "encoder.layer.14.attention.self.query.bias", "encoder.layer.14.attention.self.key.weight", "encoder.layer.14.attention.self.key.bias", "encoder.layer.14.attention.self.value.weight", "encoder.layer.14.attention.self.value.bias", "encoder.layer.14.attention.output.dense.weight", 
"encoder.layer.14.attention.output.dense.bias", "encoder.layer.14.ln.weight", "encoder.layer.14.ln.bias", "encoder.layer.14.intermediate.dense.weight", "encoder.layer.14.intermediate.dense.bias", "encoder.layer.14.output.dense.weight", "encoder.layer.14.output.dense.bias", "encoder.layer.15.attention.ln.weight", "encoder.layer.15.attention.ln.bias", "encoder.layer.15.attention.self.query.weight", "encoder.layer.15.attention.self.query.bias", "encoder.layer.15.attention.self.key.weight", "encoder.layer.15.attention.self.key.bias", "encoder.layer.15.attention.self.value.weight", "encoder.layer.15.attention.self.value.bias", "encoder.layer.15.attention.output.dense.weight", "encoder.layer.15.attention.output.dense.bias", "encoder.layer.15.ln.weight", "encoder.layer.15.ln.bias", "encoder.layer.15.intermediate.dense.weight", "encoder.layer.15.intermediate.dense.bias", "encoder.layer.15.output.dense.weight", "encoder.layer.15.output.dense.bias", "encoder.layer.16.attention.ln.weight", "encoder.layer.16.attention.ln.bias", "encoder.layer.16.attention.self.query.weight", "encoder.layer.16.attention.self.query.bias", "encoder.layer.16.attention.self.key.weight", "encoder.layer.16.attention.self.key.bias", "encoder.layer.16.attention.self.value.weight", "encoder.layer.16.attention.self.value.bias", "encoder.layer.16.attention.output.dense.weight", "encoder.layer.16.attention.output.dense.bias", "encoder.layer.16.ln.weight", "encoder.layer.16.ln.bias", "encoder.layer.16.intermediate.dense.weight", "encoder.layer.16.intermediate.dense.bias", "encoder.layer.16.output.dense.weight", "encoder.layer.16.output.dense.bias", "encoder.layer.17.attention.ln.weight", "encoder.layer.17.attention.ln.bias", "encoder.layer.17.attention.self.query.weight", "encoder.layer.17.attention.self.query.bias", "encoder.layer.17.attention.self.key.weight", "encoder.layer.17.attention.self.key.bias", "encoder.layer.17.attention.self.value.weight", "encoder.layer.17.attention.self.value.bias", 
"encoder.layer.17.attention.output.dense.weight", "encoder.layer.17.attention.output.dense.bias", "encoder.layer.17.ln.weight", "encoder.layer.17.ln.bias", "encoder.layer.17.intermediate.dense.weight", "encoder.layer.17.intermediate.dense.bias", "encoder.layer.17.output.dense.weight", "encoder.layer.17.output.dense.bias", "encoder.layer.18.attention.ln.weight", "encoder.layer.18.attention.ln.bias", "encoder.layer.18.attention.self.query.weight", "encoder.layer.18.attention.self.query.bias", "encoder.layer.18.attention.self.key.weight", "encoder.layer.18.attention.self.key.bias", "encoder.layer.18.attention.self.value.weight", "encoder.layer.18.attention.self.value.bias", "encoder.layer.18.attention.output.dense.weight", "encoder.layer.18.attention.output.dense.bias", "encoder.layer.18.ln.weight", "encoder.layer.18.ln.bias", "encoder.layer.18.intermediate.dense.weight", "encoder.layer.18.intermediate.dense.bias", "encoder.layer.18.output.dense.weight", "encoder.layer.18.output.dense.bias", "encoder.layer.19.attention.ln.weight", "encoder.layer.19.attention.ln.bias", "encoder.layer.19.attention.self.query.weight", "encoder.layer.19.attention.self.query.bias", "encoder.layer.19.attention.self.key.weight", "encoder.layer.19.attention.self.key.bias", "encoder.layer.19.attention.self.value.weight", "encoder.layer.19.attention.self.value.bias", "encoder.layer.19.attention.output.dense.weight", "encoder.layer.19.attention.output.dense.bias", "encoder.layer.19.ln.weight", "encoder.layer.19.ln.bias", "encoder.layer.19.intermediate.dense.weight", "encoder.layer.19.intermediate.dense.bias", "encoder.layer.19.output.dense.weight", "encoder.layer.19.output.dense.bias", "encoder.layer.20.attention.ln.weight", "encoder.layer.20.attention.ln.bias", "encoder.layer.20.attention.self.query.weight", "encoder.layer.20.attention.self.query.bias", "encoder.layer.20.attention.self.key.weight", "encoder.layer.20.attention.self.key.bias", "encoder.layer.20.attention.self.value.weight", 
"encoder.layer.20.attention.self.value.bias", "encoder.layer.20.attention.output.dense.weight", "encoder.layer.20.attention.output.dense.bias", "encoder.layer.20.ln.weight", "encoder.layer.20.ln.bias", "encoder.layer.20.intermediate.dense.weight", "encoder.layer.20.intermediate.dense.bias", "encoder.layer.20.output.dense.weight", "encoder.layer.20.output.dense.bias", "encoder.layer.21.attention.ln.weight", "encoder.layer.21.attention.ln.bias", "encoder.layer.21.attention.self.query.weight", "encoder.layer.21.attention.self.query.bias", "encoder.layer.21.attention.self.key.weight", "encoder.layer.21.attention.self.key.bias", "encoder.layer.21.attention.self.value.weight", "encoder.layer.21.attention.self.value.bias", "encoder.layer.21.attention.output.dense.weight", "encoder.layer.21.attention.output.dense.bias", "encoder.layer.21.ln.weight", "encoder.layer.21.ln.bias", "encoder.layer.21.intermediate.dense.weight", "encoder.layer.21.intermediate.dense.bias", "encoder.layer.21.output.dense.weight", "encoder.layer.21.output.dense.bias", "encoder.layer.22.attention.ln.weight", "encoder.layer.22.attention.ln.bias", "encoder.layer.22.attention.self.query.weight", "encoder.layer.22.attention.self.query.bias", "encoder.layer.22.attention.self.key.weight", "encoder.layer.22.attention.self.key.bias", "encoder.layer.22.attention.self.value.weight", "encoder.layer.22.attention.self.value.bias", "encoder.layer.22.attention.output.dense.weight", "encoder.layer.22.attention.output.dense.bias", "encoder.layer.22.ln.weight", "encoder.layer.22.ln.bias", "encoder.layer.22.intermediate.dense.weight", "encoder.layer.22.intermediate.dense.bias", "encoder.layer.22.output.dense.weight", "encoder.layer.22.output.dense.bias", "encoder.layer.23.attention.ln.weight", "encoder.layer.23.attention.ln.bias", "encoder.layer.23.attention.self.query.weight", "encoder.layer.23.attention.self.query.bias", "encoder.layer.23.attention.self.key.weight", "encoder.layer.23.attention.self.key.bias", 
"encoder.layer.23.attention.self.value.weight", "encoder.layer.23.attention.self.value.bias", "encoder.layer.23.attention.output.dense.weight", "encoder.layer.23.attention.output.dense.bias", "encoder.layer.23.ln.weight", "encoder.layer.23.ln.bias", "encoder.layer.23.intermediate.dense.weight", "encoder.layer.23.intermediate.dense.bias", "encoder.layer.23.output.dense.weight", "encoder.layer.23.output.dense.bias", "encoder.ln.weight", "encoder.ln.bias", "pooler.dense.weight", "pooler.dense.bias". 
	Unexpected key(s) in state_dict: "iteration", "model". 

In [23]:
model = AutoModel.from_pretrained("nvidia/megatron-bert-cased-345m")

404 Client Error: Not Found for url: https://huggingface.co/nvidia/megatron-bert-cased-345m/resolve/main/config.json


OSError: Can't load config for 'nvidia/megatron-bert-cased-345m'. Make sure that:

- 'nvidia/megatron-bert-cased-345m' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure 'nvidia/megatron-bert-cased-345m' is not a path to a local directory with something else, in that case)

- or 'nvidia/megatron-bert-cased-345m' is the correct path to a directory containing a config.json file



In [9]:
weights['model'].keys()

dict_keys(['language_model', 'qa_head'])

In [12]:
weights['model']['language_model'].keys()

dict_keys(['embedding', 'transformer'])
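The outputs above suggest that the file is a nested checkpoint (top-level keys `iteration` and `model`) and not a flat `state_dict`, which would explain why every expected parameter name is reported as missing. The sketch below shows one way to inspect such a structure with plain Python. The helper names `flatten_keys` and `diff_keys` are hypothetical, and the toy dictionary only imitates the key layout seen above; the real checkpoint holds tensors.

```python
def flatten_keys(tree, prefix=""):
    """Return dotted key paths for every leaf of a nested dict."""
    paths = []
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            paths.extend(flatten_keys(value, path))
        else:
            paths.append(path)
    return paths


def diff_keys(expected, found):
    """Split key names into (missing, unexpected) lists."""
    expected, found = set(expected), set(found)
    return sorted(expected - found), sorted(found - expected)


# Toy stand-in for the checkpoint (assumption: integers replace the real tensors).
toy_checkpoint = {
    "iteration": 1000,
    "model": {"language_model": {"embedding": {"weight": 0}, "transformer": {"weight": 0}}},
}

# Comparing only the top level reproduces the kind of mismatch reported above.
top_missing, top_unexpected = diff_keys(
    ["embeddings.word_embeddings.weight"], toy_checkpoint.keys()
)
nested_paths = flatten_keys(toy_checkpoint)
print(top_unexpected)
print(nested_paths)
```

Listing the flattened paths makes it easier to see which sub-dictionary would need to be renamed or converted before the weights can be loaded into a Hugging Face model.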

## Testing a training loop

In [None]:
from datasets import load_dataset
import os
from dotenv import load_dotenv
from transformers import DataCollatorForTokenClassification
from smtag.data_collator import DataCollatorForMaskedTokenClassification
from transformers import AutoTokenizer
from smtag.metrics import MetricsTOKCL
from transformers import AutoModelForTokenClassification
from transformers import Trainer
from smtag.tb_callback import MyTensorBoardCallback
import torch
from smtag.show import ShowExampleTOKCL
from transformers.integrations import TensorBoardCallback
from smtag.config import Config
from smtag.train.train_tokcl import TrainingArgumentsTOKCL

In [None]:
TASKS = ["NER", "GENEPROD_ROLES", "SMALL_MOL_ROLES", "BORING", "PANELIZATION"]
MODELS = ["EMBO/bio-lm", "roberta-base"]

ROBERTA_DATASET = "drAbreu/sd-nlp-2"
GENERAL_DATASET = "EMBO/sd-nlp-non-tokenized"

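# Load the .env file first so that every os.getenv call below can see its values
load_dotenv()

# Read the Hub token from the environment instead of hard-coding a secret in the notebook.
# The variable name HUB_TOKEN is an assumption; use whatever name your .env file defines.
HUB_TOKEN = os.getenv('HUB_TOKEN')

HUB_USER = "EMBO"
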
LM_MODEL_PATH = os.getenv('LM_MODEL_PATH')
TOKENIZER_PATH = os.getenv('TOKENIZER_PATH')
TOKCL_MODEL_PATH = os.getenv('TOKCL_MODEL_PATH')
CACHE = os.getenv('CACHE')
RUNS_DIR = os.getenv('RUNS_DIR')

TRAINING_ARGS_DICT = {
    "output_dir": TOKCL_MODEL_PATH,
    "overwrite_output_dir": True,
    "logging_steps": 1000,
    "evaluation_strategy": 'epoch',
    "lr_scheduler_type": 'linear',
    "save_strategy": 'epoch',
    "save_steps": 1,
    # "eval_strategy": 'epoch',
    "save_total_limit": 5,
    "seed": 42,
    "eval_steps": 1,
    "past_index": -1,
    "run_name": TOKCL_MODEL_PATH,
    "disable_tqdm": False,
    "metric_for_best_model": 'overall_f1',
    "load_best_model_at_end": True,
    "greater_is_better": True,
    "length_column_name": 'length',
    "report_to": ['tensorboard'],
    "push_to_hub": False,
    "resume_from_checkpoint": None,
    "hub_strategy": 'every_save',
    "hub_token": HUB_TOKEN,
}
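
Because `os.getenv` returns `None` for a variable that is not set, a missing entry in the `.env` file would silently put `None` into values such as `output_dir`. A minimal sketch of an early check follows. The helper `missing_env_vars` is hypothetical (not part of `smtag` or `python-dotenv`), and the toy environment only illustrates the behaviour.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


REQUIRED = ["LM_MODEL_PATH", "TOKENIZER_PATH", "TOKCL_MODEL_PATH", "CACHE", "RUNS_DIR"]

# Toy environment for illustration only; the paths are made up.
toy_env = {"LM_MODEL_PATH": "/lm_models", "TOKCL_MODEL_PATH": "", "CACHE": "/cache"}
missing = missing_env_vars(REQUIRED, toy_env)
print(missing)
```

In the notebook, one option would be to call `missing_env_vars(REQUIRED)` right after `load_dotenv()` and raise an error if the returned list is not empty.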


In [None]:
data = load_dataset(ROBERTA_DATASET, "NER")
train_dataset, eval_dataset, test_dataset = data["train"], data['validation'], data['test']


In [None]:
def define_labels(dataset):
    """Build the id -> label and label -> id mappings from the dataset features."""
    num_labels = dataset.info.features['labels'].feature.num_classes
    label_list = dataset.info.features['labels'].feature.names
    id2label, label2id = {}, {}
    for class_, label in zip(range(num_labels), label_list):
        id2label[class_] = label
        label2id[label] = class_
    # Return only after the loop has visited every label
    return id2label, label2id
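
The two mappings can be checked on a toy label list without loading a dataset. This is a minimal sketch: `build_label_maps` is a hypothetical helper, and the three label names are illustrative and not taken from the dataset features.

```python
def build_label_maps(label_list):
    """Return (id2label, label2id) for a list of label names."""
    id2label = dict(enumerate(label_list))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


# Illustrative labels only; the real list comes from the dataset features.
toy_labels = ["O", "B-GENEPROD", "I-GENEPROD"]
id2label, label2id = build_label_maps(toy_labels)
print(id2label)
print(label2id)
```

Both dictionaries should contain one entry per label, and mapping an id to its label and back should return the same id.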


In [None]:
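def shift_label(label):
    # Assumes B-XXX labels have odd ids and the matching I-XXX label is the next id,
    # so a B-XXX label becomes I-XXX and any other label is returned unchanged
    if label % 2 == 1:
        label += 1
    return label


def align_labels_with_tokens(labels, word_ids):
    """
    Expands the word-level tags to token level after sub-word tokenization.
    Arguments
    ---------
    labels list[int]: one label id per word
    word_ids list[int]: for each token, the index of the word it belongs to,
        or None for special tokens
    """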
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)
        elif word_id != current_word:
            # Start of a new word!
            current_word = word_id
            # As far as word_id matches the index of the current word
            # We append the same label
            new_labels.append(labels[word_id])
        else:
            new_labels.append(shift_label(labels[word_id]))

    return new_labels

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['words'], 
                       truncation=True,
                       is_split_into_words=True)
    
    all_labels = examples['labels']
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
        
    tokenized_inputs['labels'] = new_labels
    return tokenized_inputs
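
To see what the alignment produces, the sketch below repeats the same logic on a hand-written example, so no tokenizer is needed. It assumes a toy label scheme (`0 = O`, `1 = B-X`, `2 = I-X`) and a made-up `word_ids` list in which the second and third words are each split into two sub-word tokens.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def shift_label(label):
    # Toy scheme assumption: odd ids are B- labels and id + 1 is the matching I- label
    return label + 1 if label % 2 == 1 else label


def align_labels_with_tokens(labels, word_ids):
    new_labels, current_word = [], None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens do not belong to any word
            new_labels.append(IGNORE_INDEX)
        elif word_id != current_word:
            # First token of a word keeps the word's label
            current_word = word_id
            new_labels.append(labels[word_id])
        else:
            # Later tokens of the same word get the I- version of the label
            new_labels.append(shift_label(labels[word_id]))
    return new_labels


word_labels = [0, 1, 2]                    # O, B-X, I-X
word_ids = [None, 0, 1, 1, 2, 2, None]     # hand-written, not produced by a tokenizer
aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)  # [-100, 0, 1, 2, 2, 2, -100]
```

The second token of the `B-X` word becomes `I-X`, so an entity never contains two `B-` tags in a row after tokenization.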


In [None]:
list_repo_names = []

for model_name in MODELS:
    for task in TASKS:
        TRAINING_ARGS_DICT["hub_model_id"] = f"{HUB_USER}/{model_name.replace('/','_')}_{task}"
        if model_name in ["EMBO/bio-lm", "roberta-base"]:
            
            config = Config(model_type = "Autoencoder", 
                            from_pretrained = model_name, 
                            tokenizer = "roberta-base")
            
            data = load_dataset(ROBERTA_DATASET, task)
            train_dataset, eval_dataset, test_dataset = data["train"], data['validation'], data['test']
            id2label, label2id = define_labels(train_dataset)
            tokenizer = AutoTokenizer.from_pretrained(config.tokenizer)
            data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, 
                                                               padding=True,
                                                               return_tensors='pt')
            
            compute_metrics = MetricsTOKCL(label_list=list(label2id.keys()))
            
            
            # Get the training arguments
            training_args = TrainingArgumentsTOKCL(**TRAINING_ARGS_DICT)

            # Select the model (This is for Token Classification)
            
            model = AutoModelForTokenClassification.from_pretrained(
                config.from_pretrained,
                num_labels=len(list(id2label.keys())),
                max_position_embeddings=config.max_length + 2,  
                id2label = id2label,
                label2id = label2id
            )
            model_config = model.config
            
            # Define the trainer
            trainer = Trainer(
                model=model,
                args=training_args,
                data_collator=data_collator,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
                compute_metrics=compute_metrics,
                callbacks=[ShowExampleTOKCL(tokenizer)]
            )

            # switch the Tensorboard callback to plot losses on same plot
            trainer.remove_callback(TensorBoardCallback)  # remove default Tensorboard callback
            trainer.add_callback(MyTensorBoardCallback)  # replace with customized callback

#             train()
            list_repo_names.append(TRAINING_ARGS_DICT["hub_model_id"])
    
        elif model_name in []:
            data = load_dataset(GENERAL_DATASET, task)
            list_repo_names.append(TRAINING_ARGS_DICT["hub_model_id"])
        else:
            raise ValueError(f"The selected model ({model_name}) is not contained in our Benchmark list. Please add it.")

In [None]:
list_repo_names
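
The repository names collected in the loop follow the pattern `user/model_task`, with any `/` in the model name replaced by `_`. A minimal sketch that reproduces the naming for a subset of models and tasks, without loading models or data, could look like this; `repo_id` is a hypothetical helper.

```python
from itertools import product


def repo_id(user, model_name, task):
    """Build the Hub repository name used for one (model, task) pair."""
    return f"{user}/{model_name.replace('/', '_')}_{task}"


models = ["EMBO/bio-lm", "roberta-base"]
tasks = ["NER", "PANELIZATION"]
repo_ids = [repo_id("EMBO", model, task) for model, task in product(models, tasks)]
print(repo_ids)
```

Replacing the slash keeps the model's own namespace out of the repository path, so each name contains exactly one `/` between user and repository.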