<a href="https://colab.research.google.com/github/zhangxl2002/ORL/blob/main/T5_Ner_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!nvidia-smi

Tue Apr  9 18:19:01 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10          Off  | 00000000:00:08.0 Off |                    0 |
|  0%   27C    P8    16W / 150W |      2MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Named Entity Recognition with T5

This notebook shows how to fine-tune the [T5 model](https://huggingface.co/docs/transformers/model_doc/t5) for token classification (named entity recognition) with PyTorch Lightning. In this demo, I use T5-Base (`t5-base`) and cast the entities as text, following the text-to-text framework from the T5 paper. During evaluation, the generated tokens are split and classified into their specific classes.
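
To make the text-to-text casting concrete, here is a minimal sketch of how one record could be turned into a source/target string pair and how a generated target could be split back into `(role, phrase)` pairs. It assumes that each entry in `spans` is a string of the form `role: phrase`, which matches the decoded targets shown later in this notebook; the helper names `build_example` and `parse_target` and the sample record are hypothetical.

```python
def build_example(words, spans, dse):
    """Build lowercase source and target strings for one record."""
    # Source: the sentence followed by the DSE marker.
    source = (" ".join(words) + " DSE:" + dse).lower()
    # Target: all "role: phrase" spans joined by "; ".
    target = "; ".join(spans).lower()
    return source, target


def parse_target(target):
    """Split a generated target back into (role, phrase) pairs."""
    pairs = []
    for item in target.split("; "):
        role, sep, phrase = item.partition(": ")
        if sep:  # ignore items that lack the "role: phrase" form
            pairs.append((role, phrase))
    return pairs


# Hypothetical record, for illustration only.
record = {
    "words": ["I", "love", "the", "person", "who", "likes", "me", "."],
    "spans": ["agent: I", "target: the person who likes me"],
    "dse": "love",
}
source, target = build_example(**record)
print(source)
print(target)
print(parse_target(target))
```

Keeping the separator (`"; "`) and the role delimiter (`": "`) fixed is what lets the evaluation step recover the classes from free-form generated text.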

In [4]:
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
# import pytorch_lightning as pl


from transformers import (
    AdamW,
    MT5ForConditionalGeneration,
    T5ForConditionalGeneration,
    T5Tokenizer,
    AutoTokenizer,
    get_linear_schedule_with_warmup
)

from datasets import load_dataset

In [3]:
def set_seed(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

set_seed(42)

In [5]:
dataset = load_dataset("zhangxl2002/mpqa_ORL")

In [9]:
print(dataset)
cnt = 0
for i in range(len(dataset['train'])):
    if len(dataset['train'][i]['spans']) > 3:
        print(dataset['train'][i])
        cnt+=1
        if cnt>5: break

print(" ".join(dataset['train'][0]['words']))

DatasetDict({
    train: Dataset({
        features: ['words', 'label_ids', 'labels', 'spans', 'dse'],
        num_rows: 3549
    })
    validation: Dataset({
        features: ['words', 'label_ids', 'labels', 'spans', 'dse'],
        num_rows: 893
    })
    test: Dataset({
        features: ['words', 'label_ids', 'labels', 'spans', 'dse'],
        num_rows: 1509
    })
})
{'words': ['He', 'argued', 'that', 'had', 'the', 'West', 'not', 'continued', 'to', 'keep', 'alive', 'during', 'the', 'past', 'several', 'years', 'the', 'Cold', 'War', 'stereotype', 'of', 'a', 'threat', 'from', 'the', 'East', ',', 'but', 'would', 'have', 'concentrated', 'instead', 'on', 'terrorism', ',', 'the', 'common', 'enemy', ',', 'the', 'twin', 'towers', 'of', 'New', 'York', 'may', 'not', 'have', 'collapsed', '.'], 'label_ids': [1, 3, 0, 0, 5, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0], 'labels': ['B-AGENT', 'B-DSE', 'O',

In [10]:
class MPQADataset(Dataset):
  def __init__(self, tokenizer, dataset, type_path, max_len=512):

    self.data = dataset[type_path]
    self.max_len = max_len
    self.tokenizer = tokenizer
    self.tokenizer.max_length = max_len
    self.tokenizer.model_max_length = max_len
    self.inputs = []
    self.targets = []

    self._build()

  def __len__(self):
    return len(self.inputs)

  def __getitem__(self, index):
    source_ids = self.inputs[index]["input_ids"].squeeze()
    target_ids = self.targets[index]["input_ids"].squeeze()

    src_mask    = self.inputs[index]["attention_mask"].squeeze()  # might need to squeeze
    target_mask = self.targets[index]["attention_mask"].squeeze()  # might need to squeeze

    return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids, "target_mask": target_mask}

  def _build(self):
    for idx in range(len(self.data)):
      input_, target = " ".join(self.data[idx]["words"]), "; ".join(self.data[idx]["spans"])
      input_ = input_ + " DSE:" + self.data[idx]["dse"]

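      # Note: the tokenizer appends its own </s>, so adding it here as well yields
      # a doubled EOS token (visible as "</s></s>" in the decoded sample).
      input_ = input_.lower() + ' </s>'
      target = target.lower() + " </s>"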

       # tokenize inputs
      tokenized_inputs = self.tokenizer.batch_encode_plus(
          [input_], max_length=self.max_len, padding="max_length", truncation=True, return_tensors="pt"
      )
       # tokenize targets
      tokenized_targets = self.tokenizer.batch_encode_plus(
          [target],max_length=self.max_len, padding="max_length", truncation=True, return_tensors="pt"
      )

      self.inputs.append(tokenized_inputs)
      self.targets.append(tokenized_targets)

In [11]:
tokenizer = AutoTokenizer.from_pretrained("../T5-base")

print(tokenizer)

input_dataset = MPQADataset(tokenizer=tokenizer, dataset=dataset, type_path='train')

T5TokenizerFast(name_or_path='../T5-base', vocab_size=32100, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42

In [13]:
data = input_dataset[0]
# print(data)
print(tokenizer.decode(data["source_ids"][400]))
print((tokenizer.decode(data["source_ids"][0:5])))
print(tokenizer.encode("<unk>"))
print(data["source_ids"][0:2])
print(tokenizer.pad_token_id)
print(tokenizer.decode(data["source_ids"], skip_special_tokens=False))
print(tokenizer.decode(data["target_ids"], skip_special_tokens=False))

<pad>
the kimberley
[2, 1]
tensor([8, 3])
0
the kimberley provincial hospital said it would probably know by tuesday whether one of its patients had congo fever. dse:would probably know</s></s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><p

In [14]:
args_dict = dict(
    data_dir="zhangxl2002/mpqa_ORL", # path for data files
    output_dir="", # path to save the checkpoints
    model_name_or_path='t5-base',
    tokenizer_name_or_path='t5-base',
    max_seq_length=256,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=8,
    eval_batch_size=8,
    num_train_epochs=10,
    gradient_accumulation_steps=16,
    n_gpu=1,
    early_stop_callback=False,
    fp_16=True, # if you want to enable 16-bit training then install apex and set this to true
    opt_level='O1', # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1, # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)

In [15]:
args = argparse.Namespace(**args_dict)

In [34]:
class T5FineTuner():
    def __init__(self, hparam):
        self.hparam = hparam
        self.model = T5ForConditionalGeneration.from_pretrained(
            hparam.model_name_or_path)
        self.tokenizer = AutoTokenizer.from_pretrained(
            hparam.model_name_or_path
        )
    def is_logger(self):
        return True

    def forward(
        self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
    ):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            labels=lm_labels,
        )

    def _step(self, batch):
        lm_labels = batch["target_ids"]
        lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

        outputs = self.model(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            labels=lm_labels,
            decoder_attention_mask=batch['target_mask']
        )

        loss = outputs[0]

        return loss

    def training_step(self, batch, batch_idx):
        loss = self._step(batch)

        # tensorboard_logs = {"train_loss": loss}
        # return {"loss": loss, "log": tensorboard_logs}
        return loss

    def training_epoch_end(self, outputs):
        avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
        tensorboard_logs = {"avg_train_loss": avg_train_loss}

    def validation_step(self, batch, batch_idx):
        loss = self._step(batch)
        return {"val_loss": loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        tensorboard_logs = {"val_loss": avg_loss}

    def configure_optimizers(self):
        "Prepare optimizer and schedule (linear warmup and decay)"

        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparam.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters,
                          lr=self.hparam.learning_rate, eps=self.hparam.adam_epsilon)
        self.opt = optimizer
        return [optimizer]

    def optimizer_step(self,
                       epoch=None,
                       batch_idx=None,
                       optimizer=None,
                       optimizer_idx=None,
                       optimizer_closure=None,
                       on_tpu=None,
                       using_native_amp=None,
                       using_lbfgs=None
                       ):

        # optimizer.step(closure=optimizer_closure)
        # optimizer.zero_grad()
        self.opt.step(closure=optimizer_closure)
        self.opt.zero_grad()
        self.lr_scheduler.step()

    def get_tqdm_dict(self):
        tqdm_dict = {"loss": "{:.3f}".format(
            self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}

        return tqdm_dict

    def train_dataloader(self):
        train_dataset = get_dataset(
            tokenizer=self.tokenizer, type_path="train", args=self.hparam)
        dataloader = DataLoader(train_dataset, batch_size=self.hparam.train_batch_size,
                                drop_last=True, shuffle=True, num_workers=2)
        t_total = (
            (len(dataloader.dataset) //
             (self.hparam.train_batch_size * max(1, self.hparam.n_gpu)))
            // self.hparam.gradient_accumulation_steps
            * float(self.hparam.num_train_epochs)
        )
        scheduler = get_linear_schedule_with_warmup(
            self.opt, num_warmup_steps=self.hparam.warmup_steps, num_training_steps=t_total
        )
        self.lr_scheduler = scheduler
        return dataloader

    def val_dataloader(self):
        val_dataset = get_dataset(
            tokenizer=self.tokenizer, type_path="validation", args=self.hparam)
        return DataLoader(val_dataset, batch_size=self.hparam.eval_batch_size, num_workers=2)
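
The scheduler length computed in `train_dataloader` is easier to check with the numbers from this notebook plugged in. The sketch below repeats the same integer arithmetic in a standalone function (`total_training_steps` is a hypothetical name), using the 3549 training rows and the values from `args_dict`.

```python
def total_training_steps(num_examples, batch_size, n_gpu, accum_steps, epochs):
    """Mirror the t_total arithmetic used for the linear schedule."""
    # Batches seen per epoch (integer division drops the incomplete last batch).
    batches_per_epoch = num_examples // (batch_size * max(1, n_gpu))
    # Optimizer updates per epoch when gradients are accumulated.
    updates_per_epoch = batches_per_epoch // accum_steps
    return updates_per_epoch * float(epochs)


steps = total_training_steps(
    num_examples=3549, batch_size=8, n_gpu=1, accum_steps=16, epochs=10
)
print(steps)  # 443 batches per epoch -> 27 updates per epoch -> 270.0 steps
```

With these settings the effective batch size is 8 × 16 = 128 examples per optimizer update, so the linear decay is spread over only 270 updates.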

In [35]:
model = T5FineTuner(args)

In [21]:
checkpointPath = "saveCheckpointPath/checkpoint.pth"

In [25]:
def get_dataset(tokenizer, type_path, args):
    tokenizer.max_length = args.max_seq_length
    tokenizer.model_max_length = args.max_seq_length
    # dataset = load_dataset(args.data_dir)
    dataset = load_dataset("zhangxl2002/mpqa_ORL")
    return MPQADataset(tokenizer=tokenizer, dataset=dataset, type_path=type_path)

In [36]:
model.configure_optimizers()
# for i, batch in enumerate(dataloader):
#     # x, y = batch                      moved to training_step
#     # y_hat = model(x)                  moved to training_step
#     # loss = loss_function(y_hat, y)    moved to training_step
#     loss = lightning_module.training_step(batch, i)

#     # Lighting handles automatically:
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
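max_epochs = 10
# Build the dataloader once: train_dataloader() reloads and re-tokenizes
# the dataset (and recreates the scheduler) on every call.
train_loader = model.train_dataloader()
for epoch in range(max_epochs):
    total_loss = 0.0
    num_batches = 0
    for i, batch in enumerate(train_loader):
        loss = model.training_step(batch, i)
        loss.backward()  # compute gradients before the optimizer step
        total_loss += loss.item()
        num_batches += 1

        # Print the loss for each batch
        print(f"Epoch [{epoch + 1}/{max_epochs}], Batch [{i + 1}/{len(train_loader)}], Loss: {loss.item():.4f}")

        # Run the optimization step
        model.optimizer_step()

    # Compute and print the average loss
    avg_loss = total_loss / num_batches
    print(f"Epoch [{epoch + 1}/{max_epochs}], Average Loss: {avg_loss:.4f}")
    # model.training_epoch_end(outputs)  # what is `outputs`?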



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


RuntimeError: DataLoader worker (pid 13404) is killed by signal: Killed. 
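
`args_dict` sets `gradient_accumulation_steps=16`, and the scheduler length is divided by it, but the manual loop above calls `optimizer_step()` after every batch. A minimal sketch of the missing bookkeeping, assuming the usual convention of stepping on every N-th batch (`accumulation_schedule` is a hypothetical helper, not part of any library):

```python
def accumulation_schedule(num_batches, accum_steps):
    """Yield (batch_idx, do_step); do_step is True on every accum_steps-th batch."""
    for batch_idx in range(num_batches):
        yield batch_idx, (batch_idx + 1) % accum_steps == 0


# Toy run: 10 batches, stepping the optimizer every 4 batches.
step_points = [idx for idx, do_step in accumulation_schedule(10, 4) if do_step]
print(step_points)  # [3, 7]
```

In the real loop, `loss.backward()` would still run on every batch (typically on the loss divided by the accumulation factor), while `model.optimizer_step()` would run only when `do_step` is true.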

In [16]:
logger = logging.getLogger(__name__)

class LoggingCallback():
  def on_validation_end(self, trainer, pl_module):
    logger.info("***** Validation results *****")
    if pl_module.is_logger():
      metrics = trainer.callback_metrics
      # Log results
      for key in sorted(metrics):
        if key not in ["log", "progress_bar"]:
          logger.info("{} = {}\n".format(key, str(metrics[key])))

  def on_test_end(self, trainer, pl_module):
    logger.info("***** Test results *****")

    if pl_module.is_logger():
      metrics = trainer.callback_metrics

      # Log and save results to file
      output_test_results_file = os.path.join(pl_module.hparams.output_dir, "test_results.txt")
      with open(output_test_results_file, "w") as writer:
        for key in sorted(metrics):
          if key not in ["log", "progress_bar"]:
            logger.info("{} = {}\n".format(key, str(metrics[key])))
            writer.write("{} = {}\n".format(key, str(metrics[key])))

In [17]:
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filename=args.output_dir+"/checkpoint.pth", monitor="val_loss", mode="min", save_top_k=5
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    #early_stop_callback=False,
    precision= 16 if args.fp_16 else 32,
    #amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
)

# train_params = dict(
#     accumulate_grad_batches=args.gradient_accumulation_steps,
#     ## gpus=args.n_gpu,
#     max_epochs=args.num_train_epochs,
#     #early_stop_callback=False,
#     precision= 16 if args.fp_16 else 32,
#     #amp_level=args.opt_level,
#     gradient_clip_val=args.max_grad_norm,
#     # callbacks=[LoggingCallback()],
# )

In [2]:
from pytorch_lightning.loops import FitLoop
# https://pytorch-lightning.readthedocs.io/en/1.5.10/extensions/loops.html
# Override FitLoop's advance method
# Example: https://github.com/Lightning-Universe/lightning-flash/blob/try/icevision-data/flash/image/classification/integrations/baal/loop.py
# Example: https://github.com/Lightning-Universe/lightning-flash/blob/try/icevision-data/flash_examples/integrations/baal/image_classification_active_learning.py
class CustomFitLoop(FitLoop):
    def advance(self):
        """Advance from one iteration to the next."""

    def on_advance_end(self):
        """Do something at the end of an iteration."""

    def on_run_end(self):
        """Do something when the loop ends."""

In [22]:
trainer = pl.Trainer(**train_params)

Using 16bit native Automatic Mixed Precision (AMP)
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [None]:
trainer.fit(model)

In [19]:
model = model.load_from_checkpoint("/mnt/workspace/ORL/lightning_logs/version_4/checkpoints/epoch=9-step=279.ckpt")

In [None]:
import textwrap

dataloader = DataLoader(input_dataset, batch_size=32, num_workers=2, shuffle=True)
model.model.eval()
model = model.to("cpu")

In [27]:
outputs = []
targets = []
texts = []
cnt = 0
for batch in dataloader:

    outs = model.model.generate(input_ids=batch['source_ids'],
                                attention_mask=batch['source_mask'])
    dec = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip() for ids in outs]
    target = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()
                for ids in batch["target_ids"]]
    text = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()
                for ids in batch["source_ids"]]
    texts.extend(text)
    outputs.extend(dec)
    targets.extend(target)
    break


for i in range(10):
    c = texts[i]
    lines = textwrap.wrap("text:\n%s\n" % c, width=100)
    print("\n".join(lines))
    print("\nActual Entities: %s" % targets[i])
    print("Predicted Entities: %s" % outputs[i])
    print("=====================================================================\n")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




text: at the state house , where the ceremony was held , thousands of zanu-pf supporters in clothing
emblazoned with mugabe 's portrait , were singing and dancing to celebrate his victory , with a hope
that mugabe will finally deliver on his promises to give them the land currently owned by the
country 's white minority . dse:with a hope

Actual Entities: agent: thousands of zanu-pf supporters
Predicted Entities: agent: thousands of zanu-pf supporters; target: mugabe will finally deliver

text: nobody who has a live conscience and human feelings , whether he is palestinian or otherwise ,
could not have been moved by these events and expressed sympathy for the families of the us victims
, regardless of the us political stances that are totally biased to israel and israel 's use of the
most advanced us weapons to curb the palestinian intifadah . dse:stances

Actual Entities: agent: the us
Predicted Entities: agent: the us; target: israel

text: however , chatterbox notes that nbc , like 

In [None]:
dir(model.model)

In [118]:
# search:how to get the probability huggingface
# https://discuss.huggingface.co/t/announcement-generation-get-probabilities-for-generated-output/30075

# Set the input sentences and related parameters
input_text = "I love the person who likes me. dse:love"
input_texts = ["I love the person who likes me. dse:love","He loves his wife. dse:loves"]
max_length = 32
temperature = 1.0
num_samples = 1

# Inference
# inputs = tokenizer([input_text], return_tensors="pt")

inputs = tokenizer.batch_encode_plus(
    input_texts, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
)
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.model.generate(**inputs,return_dict_in_generate=True,output_scores=True)
# print(outputs)
print("outputs.keys(): ",outputs.keys())
print("outputs.scores[0].shape:",outputs.scores[0].shape)
print("len(outputs.scores):",len(outputs.scores))
# outputs.scores is a tuple with one entry per generated token; each entry is a tensor of shape (batch_size, vocab_size)
print("outputs.scores[0].shape:",outputs.scores[0].shape) 
print("ddddddddddddddddddddddddddddddddddddd",outputs.scores[0][0][0])

transition_scores = model.model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
print("len(transition_scores[0]): ", len(transition_scores[0]))
print("transition_scores: ", transition_scores)

# input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
# T5 is an encoder-decoder model, so input_length is 1 here
generated_tokens = outputs.sequences[:, 1:] 
print("outputs.sequences: ", outputs.sequences)
print("len(generated_tokens[0]):",len(generated_tokens[0]))
print("generated_tokens: ",generated_tokens)
for i in range(generated_tokens.shape[0]):
    for tok, score in zip(generated_tokens[i], transition_scores[i]):
        # | token | token string | logits | probability
        print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.4f} |{np.exp(score.numpy()):.2%}")
    print("----------------------------------------------")

print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))





outputs.keys():  odict_keys(['sequences', 'scores', 'past_key_values'])
outputs.scores[0].shape: torch.Size([2, 32128])
len(outputs.scores): 13
outputs.scores: torch.Size([2, 32128])
ddddddddddddddddddddddddddddddddddddd tensor(-22.2044)
len(transition_scores[0]):  13
transition_scores:  tensor([[-7.3570e-02, -1.5182e-04, -1.8435e-02, -1.8597e-01, -1.7682e-03,
         -9.1794e-05, -1.1499e-02, -1.4008e-04, -2.0066e-01, -2.2774e-02,
         -1.1470e-03, -4.0414e-04, -5.3514e-03],
        [-9.9976e-02, -7.1911e-05, -1.6243e-02, -1.4897e-01, -1.2826e-04,
         -1.0216e-04, -2.3009e-03, -3.8949e-05, -8.3902e-03, -5.1038e+01,
         -2.2564e+01, -2.2113e+01, -2.1953e+01]])
outputs.sequences:  tensor([[   0, 3102,   10,   27,  117, 2387,   10,    8,  568,  113,  114,    7,
          140,    1],
        [   0, 3102,   10,  216,  117, 2387,   10,  112, 2512,    1,    0,    0,
            0,    0]])
len(generated_tokens[0]): 13
generated_tokens:  tensor([[3102,   10,   27,  117, 2387,   
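
The per-token log-probabilities can be combined into a sequence-level probability by summing them and exponentiating. In the output above, the second sequence ends early and is padded with token id 0, and the scores at those padded positions are very negative (around -51 and -22), so they should probably be excluded from the sum. The sketch below does this in plain Python with the values copied from the printed tensors; `sequence_probability` is a hypothetical helper.

```python
import math

PAD_ID = 0  # tokenizer.pad_token_id, as printed earlier in this notebook


def sequence_probability(token_ids, log_probs, pad_id=PAD_ID):
    """Return exp(sum of log-probs) over the non-padding tokens."""
    total = sum(lp for tok, lp in zip(token_ids, log_probs) if tok != pad_id)
    return math.exp(total)


# Values copied from the printed sequences (without the leading decoder start token)
# and from transition_scores.
tokens_a = [3102, 10, 27, 117, 2387, 10, 8, 568, 113, 114, 7, 140, 1]
scores_a = [-7.3570e-02, -1.5182e-04, -1.8435e-02, -1.8597e-01, -1.7682e-03,
            -9.1794e-05, -1.1499e-02, -1.4008e-04, -2.0066e-01, -2.2774e-02,
            -1.1470e-03, -4.0414e-04, -5.3514e-03]
tokens_b = [3102, 10, 216, 117, 2387, 10, 112, 2512, 1, 0, 0, 0, 0]
scores_b = [-9.9976e-02, -7.1911e-05, -1.6243e-02, -1.4897e-01, -1.2826e-04,
            -1.0216e-04, -2.3009e-03, -3.8949e-05, -8.3902e-03, -5.1038e+01,
            -2.2564e+01, -2.2113e+01, -2.1953e+01]

prob_a = sequence_probability(tokens_a, scores_a)
prob_b = sequence_probability(tokens_b, scores_b)
prob_b_unmasked = math.exp(sum(scores_b))  # what happens without masking

print(f"sequence A: {prob_a:.4f}")
print(f"sequence B (padding masked): {prob_b:.4f}")
print(f"sequence B (padding included): {prob_b_unmasked:.3e}")
```

Without the mask, the padded positions dominate the sum and the shorter sequence looks almost impossible, even though its real tokens are all high-confidence.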

In [33]:
for batch in dataloader:

    outs = model.model.generate(input_ids=batch['source_ids'],
                                attention_mask=batch['source_mask'])
    # dec = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip() for ids in outs]
    # target = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()
    #             for ids in batch["target_ids"]]
    # text = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()
    #             for ids in batch["source_ids"]]
    # texts.extend(text)
    # outputs.extend(dec)
    # targets.extend(target)
    # cnt += 1
    # if cnt > 10:
    #     break
    print(outs)
    print(tokenizer.decode(outs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False).strip())
    break

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




tensor([[    0,  2387,    10,     8, 23997,  3368,   628,  2478,     1,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,  3102,    10,     3,    23,   117,  2387,    10,     3,  1258,
           189,    63,   410,    48, 24522,     1,     0,     0,     0,     0],
        [    0,  3102,    10,     8,   126, 25453,   648,     3,     6, 10381,
         11831,    15,     7,   648,    11,  6179,  6029,   442,   117,  2387],
        [    0,  2387,    10,     8,     3,   102,    29,   102,     1,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [    0,  3102,    10,  7609,   261,    12,     8,  1075,    13,  1291,
          4028,   127,     7,   117,  2387,    10,  4297,  3101,    84,  4840],
        [    0,  3102,    10,   112,   117,  2387,    10, 26504,   323,    30,
             8,   962,     1,     0,     0,     0,     0,     0,     0,     0],
        [    0,  3102,    10,     3,     9,  1

In [41]:
print(outs[0])
print(tokenizer.decode(outs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False).strip())

print(outs[1])
print(tokenizer.decode(outs[1], skip_special_tokens=True, clean_up_tokenization_spaces=False).strip())
for i in range(15):
    print(outs[1][i])
    print(tokenizer.decode(outs[1][i], skip_special_tokens=True, clean_up_tokenization_spaces=False).strip())

tensor([    0,  2387,    10,     8, 23997,  3368,   628,  2478,     1,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])
target: the christmas island space station
tensor([    0,  3102,    10,     3,    23,   117,  2387,    10,     3,  1258,
          189,    63,   410,    48, 24522,     1,     0,     0,     0,     0])
agent: i; target: kathy did this intentionally
tensor(0)

tensor(3102)
agent
tensor(10)
:
tensor(3)

tensor(23)
i
tensor(117)
;
tensor(2387)
target
tensor(10)
:
tensor(3)

tensor(1258)
ka
tensor(189)
th
tensor(63)
y
tensor(410)
did
tensor(48)
this
tensor(24522)
intentionally


In [21]:
def find_sub_list(sl, l):
    results = []
    sll = len(sl)
    for ind in (i for i, e in enumerate(l) if e == sl[0]):
        if l[ind:ind+sll] == sl:
            results.append((ind, ind+sll-1))
    return results

def generate_label(input: str, target: str):
    mapper = {
        "O": 0,
        "B-AGENT": 1,
        "I-AGENT": 2,
        "B-DSE": 3,
        "I-DSE": 4,
        "B-TARGET": 5,
        "I-TARGET": 6
    }
    inv_mapper = {v: k for k, v in mapper.items()}

    input = input.split(" ")
    target = target.split("; ")

    init_target_label = [mapper['O']]*len(input)

    for ent in target:
        ent = ent.split(": ")
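        try:
            sent_end = ent[1].split(" ")
            index = find_sub_list(sent_end, input)
        except IndexError:
            # Skip items that do not have the "role: phrase" form
            continue
        # print(index)
        try:
            init_target_label[index[0][0]] = mapper[f"B-{ent[0].upper()}"]
            for i in range(index[0][0]+1, index[0][1]+1):
                init_target_label[i] = mapper[f"I-{ent[0].upper()}"]
        except (IndexError, KeyError):
            # Skip phrases not found in the input and roles outside the label set
            continue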
    init_target_label = [inv_mapper[j] for j in init_target_label]
    return init_target_label
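
A small worked example may help show what this conversion produces. The sketch below is a compact, standalone variant of the same idea (`find_span` and `spans_to_bio` are hypothetical names): it labels the first occurrence of each phrase with `B-`/`I-` tags and leaves everything else as `O`. Unlike `generate_label`, it does not restrict roles to a fixed label set.

```python
def find_span(needle, haystack):
    """Return (start, end) of the first occurrence of needle in haystack, or None."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start, start + n - 1
    return None


def spans_to_bio(text, target):
    """Convert a 'role: phrase; role: phrase' string into BIO tags over text."""
    words = text.split(" ")
    labels = ["O"] * len(words)
    for item in target.split("; "):
        role, sep, phrase = item.partition(": ")
        if not sep:
            continue  # malformed item, e.g. "none"
        match = find_span(phrase.split(" "), words)
        if match is None:
            continue  # generated phrase does not occur in the input
        start, end = match
        labels[start] = f"B-{role.upper()}"
        for i in range(start + 1, end + 1):
            labels[i] = f"I-{role.upper()}"
    return labels


labels = spans_to_bio("he loves his wife .", "agent: he; target: his wife")
print(labels)  # ['B-AGENT', 'O', 'B-TARGET', 'I-TARGET', 'O']
```

Because only the first match is labelled, a phrase that occurs more than once in the sentence is always tagged at its earliest position, which can differ from the gold annotation.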

In [22]:
from tqdm import tqdm

test_dataset = MPQADataset(tokenizer=tokenizer, dataset=dataset, type_path='test')
test_loader = DataLoader(test_dataset, batch_size=32,
                             num_workers=2, shuffle=True)
model.model.eval()
model = model.to("cuda")
outputs = []
targets = []
all_text = []
true_labels = []
pred_labels = []
for batch in tqdm(test_loader):
    input_ids = batch['source_ids'].to("cuda")
    attention_mask = batch['source_mask'].to("cuda")
    outs = model.model.generate(input_ids=input_ids,
                                attention_mask=attention_mask)
    dec = [tokenizer.decode(ids, skip_special_tokens=True,
                            clean_up_tokenization_spaces=False).strip() for ids in outs]
    target = [tokenizer.decode(ids, skip_special_tokens=True,  clean_up_tokenization_spaces=False).strip()
                for ids in batch["target_ids"]]
    texts = [tokenizer.decode(ids, skip_special_tokens=True,  clean_up_tokenization_spaces=False).strip()
                for ids in batch["source_ids"]]
    true_label = [generate_label(texts[i].strip(), target[i].strip()) if target[i].strip() != 'none' else [
        "O"]*len(texts[i].strip().split()) for i in range(len(texts))]
    pred_label = [generate_label(texts[i].strip(), dec[i].strip()) if dec[i].strip() != 'none' else [
        "O"]*len(texts[i].strip().split()) for i in range(len(texts))]

    outputs.extend(dec)
    targets.extend(target)
    true_labels.extend(true_label)
    pred_labels.extend(pred_label)
    all_text.extend(texts)

  0%|          | 0/48 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|██████████| 48/48 [00:35<00:00,  1.37it/s]


In [23]:
from datasets import load_metric

metric = load_metric("seqeval")

for i in range(10):
    print(f"Text:  {all_text[i]}")
    print(f"Predicted Token Class:  {pred_labels[i]}")
    print(f"True Token Class:  {true_labels[i]}")
    print("=====================================================================\n")

print(metric.compute(predictions=pred_labels, references=true_labels))

  metric = load_metric("seqeval")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Using the latest cached version of the module from /root/.cache/huggingface/modules/datasets_modules/metrics/seqeval/c8563af43bdce095d0f9e8b8b79c9c96d5ea5499b3bf66f90301c9cb82910f11 (last modified on Tue Feb 27 21:06:26 2024) since it couldn't be found locally at seqeval, or remotely on the Hugging Face Hub.


Text:  garcia ponce , who said he was speaking on behalf of the entire cpr , pointed out that it would be  truly inconceivable '' that the interference of a state to approve or disapprove the decisions of another or other states became a trait of foreign policy . dse:pointed out
Predicted Token Class:  ['B-AGENT', 'I-AGENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
True Token Class:  ['O', 'O', 'O', 'O', 'O', 'B-AGENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Text:  in the eyes of japan and many us allies , the bush administration 's decision not to ratify the kyoto protocol is one of many signals that 

# Model

Most of the code here is adapted from [this notebook](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb), which uses the PyTorch Lightning framework for training neural networks. The T5 paper showed that the model can reach state-of-the-art results on many tasks, as long as each task can be cast as a text-to-text problem.

In [4]:
class T5FineTuner(pl.LightningModule):
    def __init__(self, hparam):
        super(T5FineTuner, self).__init__()
        self.hparam = hparam

        self.model = T5ForConditionalGeneration.from_pretrained(
            hparam.model_name_or_path)
        self.tokenizer = AutoTokenizer.from_pretrained(
            hparam.model_name_or_path
        )
        self.save_hyperparameters()

    def is_logger(self):
        return True

    def forward(
        self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
    ):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            labels=lm_labels,
        )

    def _step(self, batch):
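        # Clone so the in-place masking below does not modify the batch itself.
        lm_labels = batch["target_ids"].clone()
        # Padding positions are set to -100 so that the loss ignores them.
        lm_labels[lm_labels == self.tokenizer.pad_token_id] = -100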

        outputs = self(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            lm_labels=lm_labels,
            decoder_attention_mask=batch['target_mask']
        )

        loss = outputs[0]

        return loss

    def training_step(self, batch, batch_idx):
        loss = self._step(batch)

        tensorboard_logs = {"train_loss": loss}
        return {"loss": loss, "log": tensorboard_logs}

    def training_epoch_end(self, outputs):
        # Average the per-step losses and log the epoch-level value.
        avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log("avg_train_loss", avg_train_loss)

    def validation_step(self, batch, batch_idx):
        loss = self._step(batch)
        return {"val_loss": loss}

    def validation_epoch_end(self, outputs):
        # Log val_loss so that callbacks monitoring "val_loss" can find it.
        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        self.log("val_loss", avg_loss)

    def configure_optimizers(self):
        "Prepare optimizer and schedule (linear warmup and decay)"

        model = self.model
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparam.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters,
                          lr=self.hparam.learning_rate, eps=self.hparam.adam_epsilon)
        self.opt = optimizer
        return [optimizer]

    def optimizer_step(self,
                       epoch=None,
                       batch_idx=None,
                       optimizer=None,
                       optimizer_idx=None,
                       optimizer_closure=None,
                       on_tpu=None,
                       using_native_amp=None,
                       using_lbfgs=None
                       ):

        optimizer.step(closure=optimizer_closure)
        optimizer.zero_grad()
        self.lr_scheduler.step()

    def get_tqdm_dict(self):
        tqdm_dict = {"loss": "{:.3f}".format(
            self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}

        return tqdm_dict

    def train_dataloader(self):
        train_dataset = get_dataset(
            tokenizer=self.tokenizer, type_path="train", args=self.hparam)
        dataloader = DataLoader(train_dataset, batch_size=self.hparam.train_batch_size,
                                drop_last=True, shuffle=True, num_workers=2)
        t_total = (
            (len(dataloader.dataset) //
             (self.hparam.train_batch_size * max(1, self.hparam.n_gpu)))
            // self.hparam.gradient_accumulation_steps
            * float(self.hparam.num_train_epochs)
        )
        scheduler = get_linear_schedule_with_warmup(
            self.opt, num_warmup_steps=self.hparam.warmup_steps, num_training_steps=t_total
        )
        self.lr_scheduler = scheduler
        return dataloader

    def val_dataloader(self):
        val_dataset = get_dataset(
            tokenizer=self.tokenizer, type_path="validation", args=self.hparam)
        return DataLoader(val_dataset, batch_size=self.hparam.eval_batch_size, num_workers=2)
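
The docstring in `configure_optimizers` mentions linear warmup and decay, which the class obtains from `get_linear_schedule_with_warmup`. As a rough sketch of what such a schedule computes, the plain-Python function below returns the factor that multiplies the base learning rate at a given optimizer step. The function name and the warmup and total step counts are hypothetical, chosen only for illustration; this is not the library's own code.

```python
def linear_warmup_decay_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 at the end of warmup to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-4  # learning_rate used in this notebook
# Hypothetical schedule: 100 warmup steps out of 500 total steps.
for step in (0, 50, 100, 300, 500):
    factor = linear_warmup_decay_factor(step, num_warmup_steps=100, num_training_steps=500)
    print(step, base_lr * factor)
```

With `warmup_steps=0`, as configured in this notebook, the factor starts at 1 and only the linear decay applies.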

In [5]:
logger = logging.getLogger(__name__)

class LoggingCallback(pl.Callback):
  def on_validation_end(self, trainer, pl_module):
    logger.info("***** Validation results *****")
    if pl_module.is_logger():
      metrics = trainer.callback_metrics
      # Log results
      for key in sorted(metrics):
        if key not in ["log", "progress_bar"]:
          logger.info("{} = {}\n".format(key, str(metrics[key])))

  def on_test_end(self, trainer, pl_module):
    logger.info("***** Test results *****")

    if pl_module.is_logger():
      metrics = trainer.callback_metrics

      # Log and save results to file
      output_test_results_file = os.path.join(pl_module.hparams.output_dir, "test_results.txt")
      with open(output_test_results_file, "w") as writer:
        for key in sorted(metrics):
          if key not in ["log", "progress_bar"]:
            logger.info("{} = {}\n".format(key, str(metrics[key])))
            writer.write("{} = {}\n".format(key, str(metrics[key])))

In [6]:
args_dict = dict(
    data_dir="wikiann", # dataset name passed to load_dataset
    output_dir="", # path to save the checkpoints
    model_name_or_path='t5-small',
    tokenizer_name_or_path='t5-small',
    max_seq_length=256,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=8,
    eval_batch_size=8,
    num_train_epochs=3,
    gradient_accumulation_steps=16,
    n_gpu=1,
    early_stop_callback=False,
    fp_16=True, # enable 16-bit mixed-precision training (the run below uses native AMP, so apex is not needed)
    opt_level='O1', # apex optimisation level, unused here because amp_level is commented out in train_params; see https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1, # gradient clipping value; if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)
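
To make the interaction between batch size, gradient accumulation and epochs concrete, here is a small worked calculation. It plugs the values from `args_dict` into the same formula that `T5FineTuner.train_dataloader` uses for `t_total`. The training-set size of 20,000 rows is taken from the WikiANN English split printed later in this notebook; the variable names are only for illustration.

```python
# Values copied from args_dict; the training-set size is the WikiANN "en"
# train split reported later in the notebook (20,000 rows).
num_train_examples = 20000
train_batch_size = 8
n_gpu = 1
gradient_accumulation_steps = 16
num_train_epochs = 3

# Number of examples that contribute to one optimizer update.
effective_batch_size = train_batch_size * max(1, n_gpu) * gradient_accumulation_steps
# Number of mini-batches the dataloader yields per epoch.
batches_per_epoch = num_train_examples // (train_batch_size * max(1, n_gpu))
# Same expression as t_total in T5FineTuner.train_dataloader.
t_total = batches_per_epoch // gradient_accumulation_steps * float(num_train_epochs)

print(effective_batch_size)  # 128
print(batches_per_epoch)     # 2500
print(t_total)               # 468.0
```

So although each forward pass sees 8 examples, every optimizer step averages gradients over 128 examples, and the scheduler is set up for 468 steps in total.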

# Dataset

Here, I used the popular [WikiANN](https://huggingface.co/datasets/wikiann) dataset, a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.

In [7]:
from datasets import load_dataset

dataset = load_dataset("wikiann", "en")

In [8]:
print(dataset)

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})


In [9]:
" ".join(dataset['train'][0]['tokens'])

'R.H. Saunders ( St. Lawrence River ) ( 968 MW )'

In [10]:
dataset['train'][0]

{'tokens': ['R.H.',
  'Saunders',
  '(',
  'St.',
  'Lawrence',
  'River',
  ')',
  '(',
  '968',
  'MW',
  ')'],
 'ner_tags': [3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0],
 'langs': ['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en'],
 'spans': ['ORG: R.H. Saunders', 'ORG: St. Lawrence River']}
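
The record above stores the entities twice: as integer IOB2 tags in `ner_tags` and as readable strings in `spans`. The sketch below shows one way the two could be related. The `TAG_NAMES` order is an assumption that is consistent with this sample (3 maps to `B-ORG`, 4 to `I-ORG`), and `tags_to_spans` is a hypothetical helper, not part of the dataset library.

```python
# Assumed tag order; it matches the sample above (3 -> B-ORG, 4 -> I-ORG).
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def tags_to_spans(tokens, ner_tags):
    """Group IOB2-tagged tokens into 'TYPE: entity text' strings."""
    entities = []   # each item is (entity_type, [tokens])
    inside = False  # True while the previous token belonged to an entity
    for token, tag_id in zip(tokens, ner_tags):
        tag = TAG_NAMES[tag_id]
        if tag == "O":
            inside = False
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and inside and entities[-1][0] == entity_type:
            # I- tag of the same type continues the open entity.
            entities[-1][1].append(token)
        else:
            # B- tag (or a stray I- tag) opens a new entity.
            entities.append((entity_type, [token]))
        inside = True
    return [f"{etype}: {' '.join(toks)}" for etype, toks in entities]


tokens = ["R.H.", "Saunders", "(", "St.", "Lawrence", "River", ")", "(", "968", "MW", ")"]
ner_tags = [3, 4, 0, 3, 4, 4, 0, 0, 0, 0, 0]
print(tags_to_spans(tokens, ner_tags))
# ['ORG: R.H. Saunders', 'ORG: St. Lawrence River']
```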

In this section, we create a custom dataset class that casts the NER task as a text-to-text problem. The target is built by joining the spans of each example into a single string, separated by semicolons (;). For example:

*   **Input**: R.H. Saunders ( St. Lawrence River ) ( 968 MW )
*   **Target**: ORG: R.H. Saunders; ORG: St. Lawrence River
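
The input and target strings above can be produced with two `join` calls. The following is a minimal sketch of that step, mirroring the lowercasing done in the dataset class below but leaving out tokenization; `to_text_pair` is a hypothetical helper name used only here.

```python
example = {
    "tokens": ["R.H.", "Saunders", "(", "St.", "Lawrence", "River", ")", "(", "968", "MW", ")"],
    "spans": ["ORG: R.H. Saunders", "ORG: St. Lawrence River"],
}


def to_text_pair(example):
    """Build the (input, target) strings for one record."""
    # Tokens are joined with spaces; spans are joined with "; ".
    source = " ".join(example["tokens"]).lower()
    target = "; ".join(example["spans"]).lower()
    return source, target


source, target = to_text_pair(example)
print(source)  # r.h. saunders ( st. lawrence river ) ( 968 mw )
print(target)  # org: r.h. saunders; org: st. lawrence river
```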




In [11]:
class WikiAnnDataset(Dataset):
  def __init__(self, tokenizer, dataset, type_path, max_len=512):

    self.data = dataset[type_path]
    self.max_len = max_len
    self.tokenizer = tokenizer
    self.tokenizer.max_length = max_len
    self.tokenizer.model_max_length = max_len
    self.inputs = []
    self.targets = []

    self._build()

  def __len__(self):
    return len(self.inputs)

  def __getitem__(self, index):
    source_ids = self.inputs[index]["input_ids"].squeeze()
    target_ids = self.targets[index]["input_ids"].squeeze()

    src_mask    = self.inputs[index]["attention_mask"].squeeze()  # might need to squeeze
    target_mask = self.targets[index]["attention_mask"].squeeze()  # might need to squeeze

    return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids, "target_mask": target_mask}

  def _build(self):
    for idx in range(len(self.data)):
      input_, target = " ".join(self.data[idx]["tokens"]), "; ".join(self.data[idx]["spans"])

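      # Note: the tokenizer also appends </s> by itself, which is why the
      # decoded example further down ends with two </s> tokens.
      input_ = input_.lower() + ' </s>'
      target = target.lower() + " </s>"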

       # tokenize inputs
      tokenized_inputs = self.tokenizer.batch_encode_plus(
          [input_], max_length=self.max_len, padding="max_length", truncation=True, return_tensors="pt"
      )
       # tokenize targets
      tokenized_targets = self.tokenizer.batch_encode_plus(
          [target],max_length=self.max_len, padding="max_length", truncation=True, return_tensors="pt"
      )

      self.inputs.append(tokenized_inputs)
      self.targets.append(tokenized_targets)

In [12]:
tokenizer = AutoTokenizer.from_pretrained("../T5-base")

print(tokenizer)

input_dataset = WikiAnnDataset(tokenizer=tokenizer, dataset=dataset, type_path='train')

T5TokenizerFast(name_or_path='../T5-base', vocab_size=32100, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42

In [13]:
for i in range(len(input_dataset)):
    _ = input_dataset[i]

In [14]:
data = input_dataset[0]

print(tokenizer.decode(data["source_ids"], skip_special_tokens=False))
print(tokenizer.decode(data["target_ids"], skip_special_tokens=False))

r.h. saunders ( st. lawrence river ) ( 968 mw )</s></s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

In [15]:
!mkdir -p t5_ner



In [16]:
args = argparse.Namespace(**args_dict)
model = T5FineTuner(args)

In [17]:
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    filename=args.output_dir+"/checkpoint.pth", monitor="val_loss", mode="min", save_top_k=5
)

train_params = dict(
    accumulate_grad_batches=args.gradient_accumulation_steps,
    gpus=args.n_gpu,
    max_epochs=args.num_train_epochs,
    #early_stop_callback=False,
    precision= 16 if args.fp_16 else 32,
    #amp_level=args.opt_level,
    gradient_clip_val=args.max_grad_norm,
    checkpoint_callback=checkpoint_callback,
    callbacks=[LoggingCallback()],
)

# train_params = dict(
#     accumulate_grad_batches=args.gradient_accumulation_steps,
#     ## gpus=args.n_gpu,
#     max_epochs=args.num_train_epochs,
#     #early_stop_callback=False,
#     precision= 16 if args.fp_16 else 32,
#     #amp_level=args.opt_level,
#     gradient_clip_val=args.max_grad_norm,
#     # callbacks=[LoggingCallback()],
# )

In [18]:
def get_dataset(tokenizer, type_path, args):
    tokenizer.max_length = args.max_seq_length
    tokenizer.model_max_length = args.max_seq_length
    dataset = load_dataset(args.data_dir, "en")
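    # Pass max_seq_length through; otherwise the dataset falls back to its default max_len of 512.
    return WikiAnnDataset(tokenizer=tokenizer, dataset=dataset, type_path=type_path, max_len=args.max_seq_length)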

In [19]:
trainer = pl.Trainer(**train_params)

Using 16bit native Automatic Mixed Precision (AMP)
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [20]:
trainer.fit(model)

  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M
-----------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
121.013   Total estimated model params size (MB)


Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]


Epoch 0:   0%|          | 0/3750 [00:00<?, ?it/s] 


Epoch 0:  67%|██████▋   | 2500/3750 [06:36<03:18,  6.30it/s, loss=0.138, v_num=0] 

Epoch 1:  67%|██████▋   | 2500/3750 [06:38<03:19,  6.28it/s, loss=0.125, v_num=0] 

Epoch 2:  67%|██████▋   | 2500/3750 [06:38<03:19,  6.28it/s, loss=0.0849, v_num=0]

## Load the Stored Model and Evaluate

In [23]:
# load_from_checkpoint is a classmethod, so call it on the class.
model = T5FineTuner.load_from_checkpoint("/mnt/workspace/ORL/lightning_logs/version_0/checkpoints/epoch=2-step=470.ckpt")


In [24]:
import textwrap

dataloader = DataLoader(input_dataset, batch_size=32, num_workers=2, shuffle=True)
model.model.eval()
model = model.to("cpu")
outputs = []
targets = []
texts = []
for batch in dataloader:

    outs = model.model.generate(input_ids=batch['source_ids'],
                                attention_mask=batch['source_mask'])
    dec = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip() for ids in outs]
    target = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()
                for ids in batch["target_ids"]]
    text = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip()
                for ids in batch["source_ids"]]
    texts.extend(text)
    outputs.extend(dec)
    targets.extend(target)
    break

for i in range(10):
    c = texts[i]
    lines = textwrap.wrap("text:\n%s\n" % c, width=100)
    print("\n".join(lines))
    print("\nActual Entities: %s" % targets[i])
    print("Predicted Entities: %s" % outputs[i])
    print("=====================================================================\n")




text: calhoun city , mississippi

Actual Entities: loc: calhoun city , mississippi
Predicted Entities: loc: calhoun city , mississippi

text: honorary fellow of the american institute of architects ( 1993 ) and of the royal institute of
british architects ( 1995 ) .

Actual Entities: org: american institute of architects; org: royal institute of british architects
Predicted Entities: org: american institute of architects; org: royal institute of british architects

text: museo di storia naturale di venezia , venice

Actual Entities: org: museo di storia naturale di venezia; loc: venice
Predicted Entities: loc: museo di storia naturale di venezia

text: edgar allan poe in popular culture

Actual Entities: org: edgar allan poe in popular culture
Predicted Entities: per: edgar allan poe in popular culture

text: '' shine my shoes ''

Actual Entities: org: shine my shoes
Predicted Entities: org: shine my shoes

text: värmlands län , seat no .

Actual Entities: org: värmlands län
Predicted 

In [25]:
def find_sub_list(sl, l):
    results = []
    sll = len(sl)
    for ind in (i for i, e in enumerate(l) if e == sl[0]):
        if l[ind:ind+sll] == sl:
            results.append((ind, ind+sll-1))
    return results

def generate_label(text: str, target: str):
    """Convert a 'type: entity; type: entity' string into IOB2 labels for `text`."""
    mapper = {'O': 0, 'B-DATE': 1, 'I-DATE': 2, 'B-PER': 3,
              'I-PER': 4, 'B-ORG': 5, 'I-ORG': 6, 'B-LOC': 7, 'I-LOC': 8}
    inv_mapper = {v: k for k, v in mapper.items()}

    tokens = text.split(" ")
    entities = target.split("; ")

    # Start with every token labelled as outside any entity.
    labels = [mapper['O']] * len(tokens)

    for ent in entities:
        ent = ent.split(": ")
        try:
            ent_tokens = ent[1].split(" ")
            index = find_sub_list(ent_tokens, tokens)
        except IndexError:
            # No ": " separator, so there is no entity text to look up.
            continue
        try:
            # Only the first occurrence of the entity in the text is labelled.
            start, end = index[0]
            labels[start] = mapper[f"B-{ent[0].upper()}"]
            for i in range(start + 1, end + 1):
                labels[i] = mapper[f"I-{ent[0].upper()}"]
        except (IndexError, KeyError):
            # Entity not found in the text, or an unknown entity type.
            continue
    return [inv_mapper[j] for j in labels]
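
To see what this alignment step yields, here is a self-contained, simplified re-implementation applied to one of the examples printed earlier. `align_labels` is a hypothetical stand-in for `generate_label`: it follows the same idea (find the entity's tokens in the text, then write `B-`/`I-` tags) but, unlike the function above, it does not restrict the entity types to a fixed mapping.

```python
def align_labels(text, entities):
    """Turn 'type: entity; type: entity' into one IOB2 label per token of text."""
    tokens = text.split(" ")
    labels = ["O"] * len(tokens)
    for item in entities.split("; "):
        if ": " not in item:
            continue  # malformed item, nothing to align
        entity_type, entity_text = item.split(": ", 1)
        words = entity_text.split(" ")
        # Label the first position where the entity's words match the tokens.
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] == words:
                labels[start] = f"B-{entity_type.upper()}"
                for i in range(start + 1, start + len(words)):
                    labels[i] = f"I-{entity_type.upper()}"
                break
    return labels


text = "museo di storia naturale di venezia , venice"
gold = "org: museo di storia naturale di venezia; loc: venice"
pred = "loc: museo di storia naturale di venezia"

print(align_labels(text, gold))
# ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-LOC']
print(align_labels(text, pred))
# ['B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'O']
```

Generated entities that cannot be found in the input text are simply skipped, so a hallucinated entity leaves the labels unchanged instead of raising an error.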

In [26]:
from tqdm import tqdm

test_dataset = WikiAnnDataset(tokenizer=tokenizer, dataset=dataset, type_path='test')
test_loader = DataLoader(test_dataset, batch_size=32,
                             num_workers=2, shuffle=True)
model.model.eval()
model = model.to("cuda")
outputs = []
targets = []
all_text = []
true_labels = []
pred_labels = []
for batch in tqdm(test_loader):
    input_ids = batch['source_ids'].to("cuda")
    attention_mask = batch['source_mask'].to("cuda")
    outs = model.model.generate(input_ids=input_ids,
                                attention_mask=attention_mask)
    dec = [tokenizer.decode(ids, skip_special_tokens=True,
                            clean_up_tokenization_spaces=False).strip() for ids in outs]
    target = [tokenizer.decode(ids, skip_special_tokens=True,  clean_up_tokenization_spaces=False).strip()
                for ids in batch["target_ids"]]
    texts = [tokenizer.decode(ids, skip_special_tokens=True,  clean_up_tokenization_spaces=False).strip()
                for ids in batch["source_ids"]]
    true_label = [generate_label(texts[i].strip(), target[i].strip()) if target[i].strip() != 'none' else [
        "O"]*len(texts[i].strip().split()) for i in range(len(texts))]
    pred_label = [generate_label(texts[i].strip(), dec[i].strip()) if dec[i].strip() != 'none' else [
        "O"]*len(texts[i].strip().split()) for i in range(len(texts))]

    outputs.extend(dec)
    targets.extend(target)
    true_labels.extend(true_label)
    pred_labels.extend(pred_label)
    all_text.extend(texts)

100%|██████████| 313/313 [01:18<00:00,  3.97it/s]
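
The label lists collected above are scored with the `seqeval` metric in a later cell. As a rough illustration of entity-level scoring, where a prediction counts only if its type and both boundaries match, the sketch below computes precision, recall and F1 in plain Python. It is a simplified stand-in under that assumption, not seqeval's implementation, and both function names are hypothetical. The label lists are the first example printed by the evaluation cell further down.

```python
def extract_entities(labels):
    """Return the set of (type, start, end) spans encoded by an IOB2 label list."""
    entities, start, etype = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != etype)
        if start is not None and (label == "O" or starts_new):
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if starts_new:
            start, etype = i, label[2:]
    return entities


def entity_prf(pred_labels, true_labels):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    tp = n_pred = n_true = 0
    for pred, true in zip(pred_labels, true_labels):
        p, t = extract_entities(pred), extract_entities(true)
        tp += len(p & t)
        n_pred += len(p)
        n_true += len(t)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


pred = [['B-PER', 'I-PER', 'O', 'B-PER', 'O']]
true = [['B-PER', 'I-PER', 'O', 'O', 'O']]
print(entity_prf(pred, true))
```

Here the model finds the one true entity but also predicts a spurious one, so recall is 1.0 while precision is 0.5.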


In [27]:
all_text[1]

"orguss '' - additional voices"

In [29]:
from datasets import load_metric

metric = load_metric("seqeval")

for i in range(10):
    print(f"Text:  {all_text[i]}")
    print(f"Predicted Token Class:  {pred_labels[i]}")
    print(f"True Token Class:  {true_labels[i]}")
    print("=====================================================================\n")

print(metric.compute(predictions=pred_labels, references=true_labels))



Text:  reginald beckwith as kenniston .
Predicted Token Class:  ['B-PER', 'I-PER', 'O', 'B-PER', 'O']
True Token Class:  ['B-PER', 'I-PER', 'O', 'O', 'O']

Text:  orguss '' - additional voices
Predicted Token Class:  ['B-ORG', 'O', 'O', 'O', 'O']
True Token Class:  ['B-ORG', 'O', 'O', 'O', 'O']

Text:  dihedral symmetry groups with even-orders have a number of subgroups .
Predicted Token Class:  ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O']
True Token Class:  ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Text:  council of southern africa football associations
Predicted Token Class:  ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG']
True Token Class:  ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG']

Text:  he died in 1924 and was buried in the hôtel des invalides .
Predicted Token Class:  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O']
True Token Class:  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O