This notebook is a tutorial on fine-tuning language models.

Decoder and encoder-decoder models are supported; both take text as input and produce text as output. Encoder models, which take text as input and usually output a numeric vector or a number (embeddings or class probabilities), do not generate text, so they follow a separate classification path in this notebook.

```
Decoder models: GPT
Encoder-decoder models: T5
Encoder models: BERT, DeBERTa
```

# Variable initialization

In [1]:
RESUME_FROM_CHECKPOINT = False

inserir_beginoftext_token = True # Insert a '<|target_bos|>' token separating the prompt from the answer in decoder models (GPT, llama)
MAX_TOKEN_GENERATION_LENGTH=60 # Maximum number of tokens generated in the validation step

output_dir="/content/fine-tuned-model"

## These values must be set for each model
## If a CUDA out of memory error occurs, reduce the batch size
# model_type='encoder'
# model_type='encoder-decoder'
# model_type='decoder'
# BATCH_SIZE = 16
# EVAL_BATCH_SIZE=16
# dropout_rate=0.1
# fp16=True # Train the model in fp16. Roughly halves training time, but may reduce the model's precision

# transformer_model_name='thacio/ult5-pt-small'; model_type='encoder-decoder'; dropout_rate=0.0; BATCH_SIZE = 16; EVAL_BATCH_SIZE=16; fp16=True; #prefix_input='<|NLU|>' # '<|NLG|>'
transformer_model_name='tgsc/debertina-base'; model_type='encoder'; BATCH_SIZE = 8; EVAL_BATCH_SIZE=8; fp16=False

gradient_accumulation_steps = 128 // BATCH_SIZE # accumulate gradients to reach an effective batch size of about 128
epochs = 10

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

import multiprocessing

num_proc = multiprocessing.cpu_count()
print('cpu_count:',num_proc)

Wed Jul 19 19:35:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8    13W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [3]:
!pip install datasets
!pip install transformers==4.30.2 accelerate
!pip install sentencepiece
!pip install evaluate
!pip install rouge_score


# Load the tokenizer

In [4]:
import os

import transformers
from transformers import AutoTokenizer, AutoConfig

# Read the Hugging Face access token from the environment instead of hard-coding it
# (assumes it is stored in an environment variable named HF_TOKEN; None is passed if it is unset)
tokenizer = AutoTokenizer.from_pretrained(transformer_model_name,use_auth_token=os.environ.get('HF_TOKEN'))

# In decoder models we add a token separating the input from the answer, so that the string can be identified and split in the validation metric
if model_type=='decoder':
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'}) # Add a pad token in case the model does not have one (does not affect the result)

    if inserir_beginoftext_token:
        target_bos_token='<|target_bos|>'
        tokenizer.add_special_tokens({ "additional_special_tokens": [target_bos_token] })

if model_type=='encoder' and tokenizer.sep_token is None:
    tokenizer.add_special_tokens({'sep_token': tokenizer.eos_token})

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Load the model

In [5]:
import transformers
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoModelForSequenceClassification
import torch

if model_type=='encoder-decoder':
    # model = AutoModelForSeq2SeqLM.from_pretrained(transformer_model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(transformer_model_name,dropout_rate=dropout_rate)
elif model_type=='decoder':
    model = AutoModelForCausalLM.from_pretrained(transformer_model_name)
elif model_type=='encoder':
    model = AutoModelForSequenceClassification.from_pretrained(transformer_model_name)
else:
    raise ValueError('architecture type must be "encoder", "encoder-decoder" or "decoder"')

model.resize_token_embeddings(len(tokenizer))
model.max_length=MAX_TOKEN_GENERATION_LENGTH

# context_length is the model's maximum sequence length
try:
    context_length=model.config.n_positions
except AttributeError:
    context_length=model.config.max_position_embeddings

print(model.config)
model_size = sum(t.numel() for t in model.parameters())
print(f"Model size: {model_size/1000**2:.1f}M parameters")


Some weights of the model checkpoint at tgsc/debertina-base were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.dense.weight', 'mask_predictions.classifier.bias', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.dense.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at tgsc/debertina-base and are newly initialized: ['classifi

DebertaV2Config {
  "_name_or_path": "tgsc/debertina-base",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "torch_dtype": "float16",
  "transformers_version": "4.30.2",
  "type_vocab_size": 0,
  "vocab_size": 32001
}

Model size: 110.6M parameters


# Create and process the dataset

## Download the datasets

A TSV file is a CSV file that uses a tab (\t) as the separator instead of a comma.
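As a small illustration of the format, the sketch below parses tab-separated text with Python's standard `csv` module. The sample rows and the `parse_tsv` helper are hypothetical, made up for this example; the notebook itself loads the files with `pandas.read_csv` and `sep='\t'`. Skipping rows with the wrong number of fields is a simplification of what `on_bad_lines='skip'` does in pandas.

```python
import csv
import io

def parse_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row.

    Rows whose field count differs from the header are skipped.
    """
    # QUOTE_NONE keeps stray quote characters as ordinary text
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = []
    for fields in reader:
        if len(fields) != len(header):
            continue  # malformed line: wrong number of columns
        rows.append(dict(zip(header, fields)))
    return rows

# Hypothetical sample: the second data row has an extra column and is dropped.
sample = "Quality\t#1 String\t#2 String\n1\tfoo, bar\tbaz\n0\ta\tb\tc\n"
rows = parse_tsv(sample)
print(rows)
```

Note that the comma inside `foo, bar` does not split the field, since only tabs act as separators.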

In [6]:
import os

!wget 'https://github.com/ju-resplande/PLUE/raw/master/datasets/MRPC/train.tsv' -O train.tsv
!wget 'https://github.com/ju-resplande/PLUE/raw/master/datasets/MRPC/dev.tsv' -O validation.tsv

--2023-07-19 19:36:42--  https://github.com/ju-resplande/PLUE/raw/master/datasets/MRPC/train.tsv
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ju-resplande/PLUE/master/datasets/MRPC/train.tsv [following]
--2023-07-19 19:36:42--  https://raw.githubusercontent.com/ju-resplande/PLUE/master/datasets/MRPC/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1011446 (988K) [text/plain]
Saving to: ‘train.tsv’


2023-07-19 19:36:42 (28.4 MB/s) - ‘train.tsv’ saved [1011446/1011446]

--2023-07-19 19:36:43--  https://github.com/ju-resplande/PLUE/raw/master/datasets/MRPC/dev.tsv
Resolving github.

## Load the dataset with pandas

In [7]:
import pandas as pd

num_labels = 2 # Number of classes in the dataset

# Some lines fail to parse when the file is loaded, so we use on_bad_lines='skip' to skip them
df_train = pd.read_csv('train.tsv', sep='\t', header=0,  on_bad_lines='skip')
df_validation = pd.read_csv('validation.tsv', sep='\t', header=0, on_bad_lines='skip')
df_train.head()

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String
0,1,702876,702977,"Amrozi acusou seu irmão, a quem chamou de ""tes...","Referindo-se a ele como apenas ""a testemunha"",..."
1,0,2108705,2108831,Yucaipa possuía a Dominick 's antes de vender ...,Yucaipa comprou a Dominick em 1995 por US $ 69...
2,1,1330381,1330521,Eles publicaram um anúncio na Internet em 10 d...,"Em 10 de junho, os proprietários do navio havi..."
3,0,3344667,3344648,"Por volta de 0335 GMT, as ações da Tab subiram...","As ações da Tab saltaram 20 centavos, ou 4,6%,..."
4,1,1236820,1236712,"As ações subiram US $ 2,11, ou cerca de 11%, p...","As ações da PG & E Corp subiram US $ 1,63 ou 8..."


## Load the dataset into the Hugging Face datasets library

https://huggingface.co/docs/datasets/loading

In [8]:
import datasets
from datasets import load_dataset

ds_train = datasets.Dataset.from_pandas(df_train)
ds_validation = datasets.Dataset.from_pandas(df_validation)

ds = datasets.DatasetDict({
    'train' : ds_train,
    'validation' : ds_validation
    })

ds

DatasetDict({
    train: Dataset({
        features: ['Quality', '#1 ID', '#2 ID', '#1 String', '#2 String'],
        num_rows: 3549
    })
    validation: Dataset({
        features: ['Quality', '#1 ID', '#2 ID', '#1 String', '#2 String'],
        num_rows: 388
    })
})

In [9]:
print('exemplo do dataset')
ds['train'][0]

exemplo do dataset


{'Quality': 1,
 '#1 ID': 702876,
 '#2 ID': 702977,
 '#1 String': 'Amrozi acusou seu irmão, a quem chamou de "testemunha", de distorcer deliberadamente suas evidências.',
 '#2 String': 'Referindo-se a ele como apenas "a testemunha", Amrozi acusou seu irmão de distorcer deliberadamente suas evidências.'}

## Convert the dataset into input texts and labels

Decoder (gpt2) and encoder-decoder (t5) models generate text, so everything must be converted to text. If the labels are numeric (0, 1, ...), they must also be converted to text. Descriptive label texts usually give better results than the string labels '0' and '1'.

Encoder models (BERT), in contrast, usually output numeric classes. (There are exceptions, and BERT can be used as a decoder, but that is unusual.)

---

The MRPC dataset consists of sentence 1 and sentence 2, and the label indicates whether the two sentences are paraphrases of each other.

We therefore transform the example:

```
{
'#1 String': 'Amrozi acusou seu irmão, a quem chamou de "testemunha", de distorcer deliberadamente suas evidências.',
'#2 String': 'Referindo-se a ele como apenas "a testemunha", Amrozi acusou seu irmão de distorcer deliberadamente suas evidências.',
'Quality': 1
}
```

into

```
{
'text' : 'mrpc sentença 1: Amrozi acusou seu irmão, a quem chamou de "testemunha", de distorcer deliberadamente suas evidências. sentença 2: Referindo-se a ele como apenas "a testemunha", Amrozi acusou seu irmão de distorcer deliberadamente suas evidências.',
'label': 'equivalentes'
}
```
plus the required tokens (eos_token, and target_bos_token for decoders).

To do this conversion, we use the map function from Hugging Face datasets.

https://huggingface.co/docs/datasets/process
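In batched mode, the function passed to `map` receives a dict that maps each column name to a list of values and returns a dict of lists, which may hold fewer rows than it received. The plain-Python sketch below shows that shape without depending on the `datasets` library; `LABEL_TEXT` and `to_text_batch` are hypothetical names used only for this illustration.

```python
# Hypothetical mapping from numeric MRPC labels to label texts.
LABEL_TEXT = {0: "diferentes", 1: "equivalentes"}

def to_text_batch(batch):
    """Turn a batch (dict of column -> list) into 'text' and 'labels' lists.

    Rows with a missing sentence or an unknown label are dropped, so the
    output may hold fewer rows than the input.
    """
    out = {"text": [], "labels": []}
    columns = zip(batch["#1 String"], batch["#2 String"], batch["Quality"])
    for s1, s2, quality in columns:
        if s1 is None or s2 is None or quality not in LABEL_TEXT:
            continue
        out["text"].append(f"mrpc sentença 1: {s1} sentença 2: {s2}")
        out["labels"].append(LABEL_TEXT[quality])
    return out

# Toy batch: the second row has a missing sentence and is dropped.
batch = {
    "#1 String": ["A", None, "C"],
    "#2 String": ["B", "x", "D"],
    "Quality": [1, 0, 0],
}
result = to_text_batch(batch)
print(result)
```

Because the output drops the original columns, the real `map` call also needs `remove_columns`, as the cells in this section do.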

### Map function for decoders (gpt2, llama)

In [10]:
def mrpc_map_dec_function(examples):
    new_examples = { 'text':[], 'labels':[]}

    first_key=list(examples.keys())[0]
    for i in range(0,len(examples[first_key])):
        # skip rows with missing values
        if examples["#1 String"][i] is None or examples["#2 String"][i] is None or examples['Quality'][i] is None:
            continue

        text=f'mrpc sentença 1: {examples["#1 String"][i]}'
        text+=f' sentença 2: {examples["#2 String"][i]}'
        if examples['Quality'][i] == 0:
            label = 'diferentes'
        elif examples['Quality'][i] == 1:
            label = 'equivalentes'
        else:
            continue # skip rows with an unexpected label value

        if inserir_beginoftext_token:
            text += target_bos_token

        # append the end-of-text token to the label
        label += tokenizer.eos_token

        new_examples['text'].append(text)
        new_examples['labels'].append(label)

    return new_examples

if model_type=='decoder':
    ds_processado = ds.map(
          mrpc_map_dec_function,
          batched=True,
          batch_size=1_000,
          remove_columns=ds['train'].column_names,
          num_proc=2
      )
    print(ds_processado)
    print(ds_processado['train'][0])

### Map function for encoder-decoder models (t5, ul2)

In [11]:
# define the map function to be applied to the dataset
def mrpc_map_enc_dec_function(examples):
    new_examples = { 'text':[], 'labels':[]}

    first_key=list(examples.keys())[0]
    for i in range(0,len(examples[first_key])):
        # skip rows with missing values
        if examples["#1 String"][i] is None or examples["#2 String"][i] is None or examples['Quality'][i] is None:
            continue

        text=f'mrpc sentença 1: {examples["#1 String"][i]}'
        text+=f' sentença 2: {examples["#2 String"][i]}'
        text+=' As duas sentenças são equivalentes ou diferentes?'
        if examples['Quality'][i] == 0:
            label = 'diferentes'
        elif examples['Quality'][i] == 1:
            label = 'equivalentes'
        else:
            continue # skip rows with an unexpected label value

        label += tokenizer.eos_token

        # prepend the optional task prefix when prefix_input is defined
        if 'prefix_input' in globals() and prefix_input is not None and len(prefix_input)>0:
            text = prefix_input + text

        new_examples['text'].append(text)
        new_examples['labels'].append(label)

    return new_examples

# apply the function
if model_type=='encoder-decoder':
    ds_processado = ds.map(
          mrpc_map_enc_dec_function,
          batched=True,
          batch_size=1_000,
          remove_columns=ds['train'].column_names,
          num_proc=2
      )
    print(ds_processado)
    print(ds_processado['train'][0])

### Map function for encoder models

In [12]:
# define the map function to be applied to the dataset
def mrpc_map_enc_function(examples):
    new_examples = { 'text':[], 'labels':[]}

    first_key=list(examples.keys())[0]
    for i in range(0,len(examples[first_key])):

        # skip rows with missing values
        if examples["#1 String"][i] is None or examples["#2 String"][i] is None or examples['Quality'][i] is None:
            continue

        # join the two sentences with the separator token; the label stays numeric
        text = examples["#1 String"][i] + tokenizer.sep_token + examples["#2 String"][i]
        label = examples['Quality'][i]

        new_examples['text'].append(text)
        new_examples['labels'].append(label)

    return new_examples

# apply the function
if model_type=='encoder':
    ds_processado = ds.map(
          mrpc_map_enc_function,
          batched=True,
          batch_size=1_000,
          remove_columns=ds['train'].column_names,
          num_proc=2
      )
    print(ds_processado)
    print(ds_processado['train'][0])

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 3532
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 387
    })
})
{'text': 'Amrozi acusou seu irmão, a quem chamou de "testemunha", de distorcer deliberadamente suas evidências.[SEP]Referindo-se a ele como apenas "a testemunha", Amrozi acusou seu irmão de distorcer deliberadamente suas evidências.', 'labels': 1}


## Tokenize the dataset

In [13]:
is_validation_ds = False
def tokenize_dataset(examples):

    examples['input_ids']=tokenizer(examples['text'],
                      return_attention_mask=False,
                      truncation=True,
                      max_length=context_length,
                      )['input_ids']

    if model_type!='encoder':
        examples['labels']=tokenizer(examples['labels'],
                          return_attention_mask=False,
                          truncation=True,
                          max_length=context_length,
                          )['input_ids']

        # Append eos_token_id if it was not added earlier
        for i, label in enumerate(examples['labels']):
            if len(label)==0:
                # In case a dataset error left the example without a label
                examples['labels'][i] = [tokenizer.eos_token_id]
            elif label[-1]!=tokenizer.eos_token_id:
                examples['labels'][i] = label + [tokenizer.eos_token_id]
            # For decoders, validation examples get a sentinel id (500_000, larger than the vocabulary) prepended
            # so that the data collator can tell them apart from training examples
            if model_type=='decoder' and is_validation_ds:
                examples['labels'][i] = [500_000] + examples['labels'][i]

    return examples

ds_tokenizado = datasets.DatasetDict({'train':None, 'validation': None})

ds_tokenizado['train'] = ds_processado['train'].map(
    tokenize_dataset,
    batched=True,
    batch_size=1_000,
    num_proc=2,
    remove_columns=['text']
)

is_validation_ds = True
ds_tokenizado['validation'] = ds_processado['validation'].map(
    tokenize_dataset,
    batched=True,
    batch_size=1_000,
    num_proc=2,
    remove_columns=['text']
)
is_validation_ds = False

print(ds_tokenizado)
print(ds_tokenizado['train'][0])

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids'],
        num_rows: 3532
    })
    validation: Dataset({
        features: ['labels', 'input_ids'],
        num_rows: 387
    })
})
{'labels': 1, 'input_ids': [1, 2743, 573, 1584, 16305, 316, 1131, 260, 268, 1042, 3478, 261, 300, 1185, 5099, 337, 2279, 382, 261, 9144, 304, 1998, 23310, 410, 5371, 263, 2, 512, 7495, 344, 275, 302, 268, 346, 297, 442, 300, 306, 8837, 382, 2743, 573, 1584, 16305, 316, 1131, 261, 9144, 304, 1998, 23310, 410, 5371, 263, 2]}


## Create the dataset validation metric

### Dataset evaluation metric

For the MRPC dataset we use accuracy, that is, an exact match on the label.
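As a plain-Python illustration of what exact-match accuracy measures (a sketch of the idea, not the implementation inside the `evaluate` library), the hypothetical helper below counts how many predictions are identical to their reference:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that are identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["equivalentes", "diferentes", "equivalentes", "diferentes"]
refs = ["equivalentes", "equivalentes", "equivalentes", "diferentes"]
print(exact_match_accuracy(preds, refs))  # 0.75
```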

In [14]:
from evaluate import load

def mrpc_metric(predictions,labels):
    exact_match_metric = load("exact_match")
    result = exact_match_metric.compute(predictions=predictions,references=labels)
    return {'mrpc_acc': result['exact_match']}

### Metric computation function with text generation

For decoders, the evaluation metric function receives the full text 'input + labels', so the received sequence must be processed to separate the input from the label before the metric is computed.
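The separation step can be sketched on plain lists of token ids. The helper name `split_generated` and the ids below are hypothetical; returning an empty list when the separator is missing is an assumption of this sketch, not something the notebook's own function does.

```python
def split_generated(token_ids, separator_id):
    """Return the tokens that follow the first occurrence of separator_id.

    Returns an empty list when the separator is absent, which this sketch
    treats as "no answer generated".
    """
    try:
        index = token_ids.index(separator_id)
    except ValueError:
        return []
    return token_ids[index + 1:]

# Hypothetical ids: 50 is the separator, 7 and 8 the prompt, 21 and 22 the answer.
generated = [7, 8, 50, 21, 22]
print(split_generated(generated, separator_id=50))  # [21, 22]
print(split_generated([7, 8], separator_id=50))     # []
```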

In [15]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    result = {}

    if model_type=='encoder':
        return mrpc_metric(predictions,labels)

    predictions=list(predictions)
    labels=list(labels)

    if model_type=='decoder':
        target_bos_token_id = tokenizer.convert_tokens_to_ids(target_bos_token)

        for i in range(len(predictions)):
            # Split the generated tokens at target_bos_token to separate the input from the label
            index = np.where(predictions[i] == target_bos_token_id)[0][0]
            predictions[i] = predictions[i][index+1:] # text generated after the input

    for i in range(0,len(labels)):
        # remove the ids that cannot be decoded
        labels[i] = list(filter(lambda x: x!= -100, labels[i]))
        predictions[i] = list(filter(lambda x: x!= -100, predictions[i]))

        # remove the pad tokens
        labels[i] = list(filter(lambda x: x!= tokenizer.pad_token_id, labels[i]))
        predictions[i] = list(filter(lambda x: x!= tokenizer.pad_token_id, predictions[i]))

        # remove the eos tokens
        labels[i] = list(filter(lambda x: x!= tokenizer.eos_token_id, labels[i]))
        predictions[i] = list(filter(lambda x: x!= tokenizer.eos_token_id, predictions[i]))


    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # print('Decoded labels')
    # print(decoded_labels)
    # print('Decoded predictions')
    # print(decoded_preds)

    result = mrpc_metric(decoded_preds,decoded_labels)

    return result

### Metric function for encoders

For encoders, the metric function is simpler.
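To make the argmax step concrete, here is a dependency-free sketch on toy logits. The helper names and the numbers are hypothetical, chosen only for illustration:

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def accuracy_from_logits(logits, labels):
    """Accuracy after picking the highest-scoring class for each example."""
    predicted = [argmax(row) for row in logits]
    hits = sum(p == y for p, y in zip(predicted, labels))
    return hits / len(labels)

# Hypothetical logits for three examples and two classes.
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]]
labels = [1, 0, 0]
print(accuracy_from_logits(logits, labels))
```

Here the predicted classes are 1, 0 and 1, so two of the three examples are correct.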

In [16]:
import numpy as np
from evaluate import load

if model_type=='encoder':
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        acc_metric = load("accuracy")
        result = acc_metric.compute(predictions=predictions,references=labels)
        return {'mrpc_acc': result['accuracy']}

# DataCollator for Encoder-Decoder (Sequence-to-Sequence)

The only change from the original code is silencing the tokenizer during padding
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/data/data_collator.py
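The label-padding step at the heart of this collator can be sketched in a few lines of plain Python. `pad_labels` is a hypothetical helper written for this illustration, assuming labels are plain lists of token ids; it pads with -100, the value the collator uses so that padded positions are ignored by the loss.

```python
def pad_labels(label_lists, pad_id=-100, padding_side="right"):
    """Pad every label list to the length of the longest one with pad_id."""
    max_len = max(len(labels) for labels in label_lists)
    padded = []
    for labels in label_lists:
        remainder = [pad_id] * (max_len - len(labels))
        if padding_side == "right":
            padded.append(labels + remainder)
        else:
            padded.append(remainder + labels)
    return padded

print(pad_labels([[5, 6, 1], [9, 1]]))                       # [[5, 6, 1], [9, 1, -100]]
print(pad_labels([[5, 6, 1], [9, 1]], padding_side="left"))  # [[5, 6, 1], [-100, 9, 1]]
```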

In [17]:
import random
import warnings
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union

from transformers.models.bert import BertTokenizer, BertTokenizerFast
from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy

@dataclass
class DataCollatorForSeq2SeqModified:
    """
    Data collator that will dynamically pad the inputs received, as well as the labels.
    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        model ([`PreTrainedModel`]):
            The model that is being trained. If set and has the *prepare_decoder_input_ids_from_labels*, use it to
            prepare the *decoder_input_ids*
            This is useful when using *label_smoothing* to avoid calculating loss twice.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        label_pad_token_id (`int`, *optional*, defaults to -100):
            The id to use when padding the labels (-100 will be automatically ignored by PyTorch loss functions).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    model: Optional[Any] = None
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    label_pad_token_id: int = -100
    return_tensors: str = "pt"

    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors
        labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None

        previous_level = transformers.logging.get_verbosity()
        transformers.logging.set_verbosity_error()

        # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the
        # same length to return tensors.
        if labels is not None:
            max_label_length = max(len(l) for l in labels)
            if self.pad_to_multiple_of is not None:
                max_label_length = (
                    (max_label_length + self.pad_to_multiple_of - 1)
                    // self.pad_to_multiple_of
                    * self.pad_to_multiple_of
                )

            padding_side = self.tokenizer.padding_side
            for feature in features:
                remainder = [self.label_pad_token_id] * (max_label_length - len(feature["labels"]))
                if isinstance(feature["labels"], list):
                    feature["labels"] = (
                        feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]
                    )
                elif padding_side == "right":
                    feature["labels"] = np.concatenate([feature["labels"], remainder]).astype(np.int64)
                else:
                    feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)

        features = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=return_tensors,
        )


        # prepare decoder_input_ids
        if (
            labels is not None
            and self.model is not None
            and hasattr(self.model, "prepare_decoder_input_ids_from_labels")
        ):
            decoder_input_ids = self.model.prepare_decoder_input_ids_from_labels(labels=features["labels"])
            features["decoder_input_ids"] = decoder_input_ids

        transformers.logging.set_verbosity(previous_level) ####
        return features

# DataCollator for Decoders (Causal Language Modeling)

While encoder-decoder models keep the input and the output separate, decoder models have a single text vector.

When fine-tuning decoder models for classification, we make the model predict only the label text. The input portion is therefore assigned a label value of -100, so that no *loss* is computed for those tokens.
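A minimal sketch of this masking on plain lists of token ids follows. The helper name `build_causal_example` and the ids are hypothetical; -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_causal_example(prompt_ids, answer_ids):
    """Concatenate prompt and answer, masking the prompt positions in the labels."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels

# Hypothetical ids: a three-token prompt followed by a two-token answer.
input_ids, labels = build_causal_example([11, 12, 13], [21, 2])
print(input_ids)  # [11, 12, 13, 21, 2]
print(labels)     # [-100, -100, -100, 21, 2]
```

Both lists have the same length, so each position of `input_ids` has a matching entry in `labels`, and only the answer positions contribute to the loss.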

In [18]:
import random
import warnings
from collections.abc import Mapping
from dataclasses import dataclass
from random import randint
from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union

import numpy as np

from transformers.models.bert import BertTokenizer, BertTokenizerFast
from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy

import torch
import transformers.data.data_collator
from transformers.data.data_collator import _torch_collate_batch


class DataCollatorMixin:
    def __call__(self, features, return_tensors=None):
        if return_tensors is None:
            return_tensors = self.return_tensors
        if return_tensors == "tf":
            return self.tf_call(features)
        elif return_tensors == "pt":
            return self.torch_call(features)
        elif return_tensors == "np":
            return self.numpy_call(features)
        else:
            raise ValueError(f"Framework '{return_tensors}' not recognized!")

@dataclass
class DataCollatorWithPaddingModified:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        inputs=[]
        labels=[]
        attention_mask=[]

        # Validation examples are marked by a sentinel id, larger than the vocabulary, as the first label
        is_validation_dataset = (features[0]['labels'][0] > len(self.tokenizer))
        if is_validation_dataset:
            for feat in features:
                labels.append(feat['labels'][1:])
                inputs.append(feat['input_ids'])
        else: # training dataset
            for feat in features:
                labels.append([-100] * len(feat['input_ids']) + feat['labels'])
                inputs.append(feat['input_ids'] + feat['labels'])

        # trick to pad the inputs and the labels in a single call
        inputs = {'input_ids' : inputs + labels}

        previous_level = transformers.logging.get_verbosity()
        transformers.logging.set_verbosity_error()
        #####
        batch = self.tokenizer.pad(
            inputs,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        transformers.logging.set_verbosity(previous_level) ####

        half_idx = len(labels)

        batch['labels'] = batch['input_ids'][half_idx:len(batch['input_ids'])]
        batch['input_ids'] = batch['input_ids'][0:half_idx]
        batch['attention_mask'] = batch['attention_mask'][0:half_idx]


        batch['labels'][batch['labels'] == self.tokenizer.pad_token_id] = -100

        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]

        return batch
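
To make the training branch of the collator above concrete, here is a minimal sketch of how a decoder-only example is assembled: prompt and target are concatenated into one input sequence, and the prompt positions of the labels are set to -100 so the loss ignores them. The token ids, the helper names `build_decoder_example` and `pad_batch`, and the `pad_id` value are made up for illustration. Padding is done by hand on the right instead of through `tokenizer.pad`, whose padding side depends on the tokenizer.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_decoder_example(prompt_ids: List[int], target_ids: List[int]) -> Dict[str, List[int]]:
    """Concatenate prompt and target; mask the prompt part of the labels."""
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,
    }


def pad_batch(examples: List[Dict[str, List[int]]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence in the batch to the length of the longest one."""
    width = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = width - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Padded label positions are masked too, so they never contribute to the loss.
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return batch


examples = [
    build_decoder_example([11, 12, 13], [21, 22]),
    build_decoder_example([14], [23]),
]
batch = pad_batch(examples, pad_id=0)
for key, rows in batch.items():
    print(key, rows)
```

Only the target tokens (21, 22 and 23 here) keep their real ids in `labels`; every other position is -100.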

# Train the model

## Adjust the Hugging Face Trainer class

Two things were changed in the Trainer class: the text-generation settings used during validation, and the warnings emitted during generation, which are now silenced.
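
The warning silencing follows a save, set, restore pattern around the generation call. A minimal sketch of that pattern as a context manager is shown here; it uses the standard-library `logging` module so it runs without extra dependencies, and the name `temporary_level` is hypothetical, not part of any library.

```python
import logging
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def temporary_level(get_level: Callable[[], int],
                    set_level: Callable[[int], None],
                    level: int) -> Iterator[None]:
    """Switch to `level` inside the block and restore the old level afterwards."""
    previous = get_level()
    set_level(level)
    try:
        yield
    finally:
        # Runs on normal exit, early return, and exceptions alike.
        set_level(previous)


demo_logger = logging.getLogger("demo")
demo_logger.setLevel(logging.WARNING)

with temporary_level(lambda: demo_logger.level, demo_logger.setLevel, logging.ERROR):
    inside = demo_logger.level  # only errors would be shown here

after = demo_logger.level  # back to the original level
print(inside == logging.ERROR, after == logging.WARNING)
```

With transformers, the same helper could presumably be called with `transformers.logging.get_verbosity`, `transformers.logging.set_verbosity` and `logging.ERROR`, which would also restore the verbosity when the wrapped code returns early.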

In [19]:
# https://github.com/huggingface/transformers/blob/v4.26.1/src/transformers/trainer_seq2seq.py
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Any, Dict, List, Optional, Tuple, Union

import torch
from torch import nn
from torch.utils.data import Dataset

from transformers.deepspeed import is_deepspeed_zero3_enabled
from transformers.trainer import Trainer
from transformers.trainer_utils import PredictionOutput
from transformers.utils import logging
import transformers


logger = logging.get_logger(__name__)

class Seq2SeqTrainerModified(Trainer):
    def evaluate(
        self,
        eval_dataset: Optional[Dataset] = None,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "eval",
        **gen_kwargs
    ) -> Dict[str, float]:
        """
        Run evaluation and returns metrics.
        The calling script will be responsible for providing a method to compute metrics, as they are task-dependent
        (pass it to the init `compute_metrics` argument).
        You can also subclass and override this method to inject custom behavior.
        Args:
            eval_dataset (`Dataset`, *optional*):
                Pass a dataset if you wish to override `self.eval_dataset`. If it is an [`~datasets.Dataset`], columns
                not accepted by the `model.forward()` method are automatically removed. It must implement the `__len__`
                method.
            ignore_keys (`List[str]`, *optional*):
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when
                gathering predictions.
            metric_key_prefix (`str`, *optional*, defaults to `"eval"`):
                An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
                "eval_bleu" if the prefix is `"eval"` (default)
            max_length (`int`, *optional*):
                The maximum target length to use when predicting with the generate method.
            num_beams (`int`, *optional*):
                Number of beams for beam search that will be used when predicting with the generate method. 1 means no
                beam search.
            gen_kwargs:
                Additional `generate` specific kwargs.
        Returns:
            A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The
            dictionary also contains the epoch number which comes from the training state.
        """

        gen_kwargs = gen_kwargs.copy()
        if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
            gen_kwargs["max_length"] = self.args.generation_max_length
        gen_kwargs["num_beams"] = (
            gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams
        )
        self._gen_kwargs = gen_kwargs

        return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)

    def predict(
        self,
        test_dataset: Dataset,
        ignore_keys: Optional[List[str]] = None,
        metric_key_prefix: str = "test",
        **gen_kwargs
    ) -> PredictionOutput:
        """
        Run prediction and returns predictions and potential metrics.
        Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method
        will also return metrics, like in `evaluate()`.
        Args:
            test_dataset (`Dataset`):
                Dataset to run the predictions on. If it is a [`~datasets.Dataset`], columns not accepted by the
                `model.forward()` method are automatically removed. Has to implement the method `__len__`
            ignore_keys (`List[str]`, *optional*):
                A list of keys in the output of your model (if it is a dictionary) that should be ignored when
                gathering predictions.
            metric_key_prefix (`str`, *optional*, defaults to `"eval"`):
                An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
                "eval_bleu" if the prefix is `"eval"` (default)
            max_length (`int`, *optional*):
                The maximum target length to use when predicting with the generate method.
            num_beams (`int`, *optional*):
                Number of beams for beam search that will be used when predicting with the generate method. 1 means no
                beam search.
            gen_kwargs:
                Additional `generate` specific kwargs.
        <Tip>
        If your predictions or labels have different sequence lengths (for instance because you're doing dynamic
        padding in a token classification task) the predictions will be padded (on the right) to allow for
        concatenation into one array. The padding index is -100.
        </Tip>
        Returns: *NamedTuple* A namedtuple with the following keys:
            - predictions (`np.ndarray`): The predictions on `test_dataset`.
            - label_ids (`np.ndarray`, *optional*): The labels (if the dataset contained some).
            - metrics (`Dict[str, float]`, *optional*): The potential dictionary of metrics (if the dataset contained
              labels).
        """

        gen_kwargs = gen_kwargs.copy()
        if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
            gen_kwargs["max_length"] = self.args.generation_max_length
        gen_kwargs["num_beams"] = (
            gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams
        )
        self._gen_kwargs = gen_kwargs

        return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)

    def prediction_step(
        self,
        model: nn.Module,
        inputs: Dict[str, Union[torch.Tensor, Any]],
        prediction_loss_only: bool,
        ignore_keys: Optional[List[str]] = None,
    ) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
        """
        Perform an evaluation step on `model` using `inputs`.
        Subclass and override to inject custom behavior.
        Args:
            model (`nn.Module`):
                The model to evaluate.
            inputs (`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.
                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument `labels`. Check your model's documentation for all accepted arguments.
            prediction_loss_only (`bool`):
                Whether or not to return the loss only.
        Return:
            Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and
            labels (each being optional).
        """

        if not self.args.predict_with_generate or prediction_loss_only:
            return super().prediction_step(
                model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
            )

        has_labels = "labels" in inputs
        inputs = self._prepare_inputs(inputs)

        # XXX: adapt synced_gpus for fairscale as well
        gen_kwargs = self._gen_kwargs.copy()
        if gen_kwargs.get("max_length") is None and gen_kwargs.get("max_new_tokens") is None:
            gen_kwargs["max_length"] = self.model.config.max_length
        gen_kwargs["num_beams"] = (
            gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.model.config.num_beams
        )
        default_synced_gpus = True if is_deepspeed_zero3_enabled() else False
        gen_kwargs["synced_gpus"] = (
            gen_kwargs["synced_gpus"] if gen_kwargs.get("synced_gpus") is not None else default_synced_gpus
        )

        if "attention_mask" in inputs:
            gen_kwargs["attention_mask"] = inputs.get("attention_mask", None)
        if "global_attention_mask" in inputs:
            gen_kwargs["global_attention_mask"] = inputs.get("global_attention_mask", None)

        # prepare generation inputs
        # some encoder-decoder models can have varying encoder's and thus
        # varying model input names
        if hasattr(self.model, "encoder") and self.model.encoder.main_input_name != self.model.main_input_name:
            generation_inputs = inputs[self.model.encoder.main_input_name]
        else:
            generation_inputs = inputs[self.model.main_input_name]

        ##### Modification: cap generation with max_new_tokens and silence generation warnings
        gen_kwargs["max_new_tokens"] = MAX_TOKEN_GENERATION_LENGTH
        gen_kwargs.pop("max_length", None)  # may be absent if the caller passed max_new_tokens
        gen_kwargs["eos_token_id"]=self.tokenizer.eos_token_id
        previous_level = transformers.logging.get_verbosity()
        transformers.logging.set_verbosity_error()
        #####
        generated_tokens = self.model.generate(
            generation_inputs,
            **gen_kwargs
        )



        # in case the batch is shorter than max length, the output should be padded
        if gen_kwargs.get("max_length") is not None and generated_tokens.shape[-1] < gen_kwargs["max_length"]:
            generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"])
        elif gen_kwargs.get("max_new_tokens") is not None and generated_tokens.shape[-1] < (
            gen_kwargs["max_new_tokens"] + 1
        ):
            generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_new_tokens"] + 1)

        with torch.no_grad():
            if has_labels:
                with self.compute_loss_context_manager():
                    outputs = model(**inputs)
                if self.label_smoother is not None:
                    loss = self.label_smoother(outputs, inputs["labels"]).mean().detach()
                else:
                    loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[0]).mean().detach()
            else:
                loss = None

        if self.args.prediction_loss_only:
            transformers.logging.set_verbosity(previous_level)  # restore verbosity before the early return
            return (loss, None, None)

        if has_labels:
            labels = inputs["labels"]
            if gen_kwargs.get("max_length") is not None and labels.shape[-1] < gen_kwargs["max_length"]:
                labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"])
            elif gen_kwargs.get("max_new_tokens") is not None and labels.shape[-1] < (
                gen_kwargs["max_new_tokens"] + 1
            ):
                labels = self._pad_tensors_to_max_len(labels, (gen_kwargs["max_new_tokens"] + 1))
        else:
            labels = None
        transformers.logging.set_verbosity(previous_level) ####
        return (loss, generated_tokens, labels)

    def _pad_tensors_to_max_len(self, tensor, max_length):
        if self.tokenizer is not None and hasattr(self.tokenizer, "pad_token_id"):
            # If PAD token is not defined at least EOS token has to be defined
            pad_token_id = (
                self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id
            )
        else:
            if self.model.config.pad_token_id is not None:
                pad_token_id = self.model.config.pad_token_id
            else:
                raise ValueError("Pad_token_id must be set in the configuration of the model, in order to pad tensors")

        padded_tensor = pad_token_id * torch.ones(
            (tensor.shape[0], max_length), dtype=tensor.dtype, device=tensor.device
        )
        padded_tensor[:, : tensor.shape[-1]] = tensor
        return padded_tensor

## Train the model

Training
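
As a rough guide to the arguments in the cell below: with `BATCH_SIZE = 8` and `gradient_accumulation_steps = 128 // BATCH_SIZE`, each optimizer step sees 128 examples, and `warmup_ratio=0.1` with `lr_scheduler_type="linear"` ramps the learning rate up over the first 10% of the steps and then decays it linearly to zero. The sketch below reproduces that arithmetic in plain Python. The dataset size `num_examples = 3000` is a made-up value, and the step counts only approximate what the Trainer computes internally.

```python
import math

# Values taken from the notebook configuration for the encoder run.
batch_size = 8
grad_accum = 128 // batch_size
epochs = 10
peak_lr = 5e-5
warmup_ratio = 0.1

num_examples = 3000  # hypothetical dataset size, for illustration only

effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print("effective batch size:", effective_batch)
print("optimizer steps:", total_steps, "warmup steps:", warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```

Lowering `BATCH_SIZE` after a CUDA out-of-memory error keeps the effective batch size at 128 as long as it divides 128 evenly, because the accumulation steps grow to compensate.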

In [None]:
from transformers import TrainingArguments, Trainer, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorWithPadding
import torch
import os


if model_type=='decoder':
    data_collator = DataCollatorWithPaddingModified(tokenizer,max_length=model.config.n_positions,pad_to_multiple_of=8,return_tensors='pt')
    compute_metrics=compute_metrics
    learning_rate=1e-4
    predict_with_generate=True
elif model_type=='encoder-decoder':
    data_collator = DataCollatorForSeq2SeqModified(tokenizer,model=model,max_length=context_length,pad_to_multiple_of=8,return_tensors='pt')
    compute_metrics=compute_metrics
    learning_rate=1e-4
    predict_with_generate=True
elif model_type=='encoder':
    data_collator = DataCollatorWithPadding(tokenizer,max_length=context_length,pad_to_multiple_of=8,return_tensors='pt')
    compute_metrics=compute_metrics
    learning_rate=5e-5
    predict_with_generate=False

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,

    evaluation_strategy="epoch",
    eval_steps=1,

    logging_strategy="epoch",
    logging_steps=1,
    predict_with_generate=predict_with_generate,
    # resume_from_checkpoint=RESUME_FROM_CHECKPOINT,

    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    gradient_accumulation_steps = gradient_accumulation_steps,

    num_train_epochs=epochs,

    lr_scheduler_type="linear",
    warmup_ratio=0.1, # warm up over the first 10% of training
    learning_rate=learning_rate,
    weight_decay=0.1,

    fp16=fp16,
    fp16_full_eval=fp16,
    dataloader_num_workers=1,
    # push_to_hub=True,
    # hub_token='your_huggingface_token',
    # hub_strategy="checkpoint",
    # hub_model_id="username/model_name",
)


trainer = Seq2SeqTrainerModified(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=ds_tokenizado["train"],
    eval_dataset=ds_tokenizado["validation"],
    compute_metrics=compute_metrics,
)

trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Mrpc Acc
0,0.6479,0.598592,0.682171
1,0.5145,0.373698,0.850129
2,0.3474,0.314132,0.863049
3,0.2166,0.306313,0.878553
4,0.1571,0.323758,0.883721
5,0.0907,0.372456,0.901809
6,0.062,0.44841,0.883721
7,0.0427,0.432733,0.881137
8,0.036,0.447747,0.886305
9,0.0205,0.441889,0.888889
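
Since `load_best_model_at_end=True` is set without a `metric_for_best_model`, the Trainer should by default keep the checkpoint with the lowest validation loss rather than the one with the highest accuracy. The sketch below copies the two metric columns from the table above, in row order, and shows that the two criteria select different rows. Row positions are 0-based and are used instead of the printed epoch labels.

```python
# Validation loss and MRPC accuracy, copied row by row from the training log.
val_loss = [0.598592, 0.373698, 0.314132, 0.306313, 0.323758,
            0.372456, 0.44841, 0.432733, 0.447747, 0.441889]
accuracy = [0.682171, 0.850129, 0.863049, 0.878553, 0.883721,
            0.901809, 0.883721, 0.881137, 0.886305, 0.888889]

# Index of the row that wins under each selection criterion.
best_by_loss = min(range(len(val_loss)), key=val_loss.__getitem__)
best_by_acc = max(range(len(accuracy)), key=accuracy.__getitem__)

print("lowest validation loss: row", best_by_loss, val_loss[best_by_loss])
print("highest accuracy: row", best_by_acc, accuracy[best_by_acc])
```

The training loss keeps falling while the validation loss rises after its minimum, which suggests overfitting in the later epochs. If accuracy is the quantity of interest, `metric_for_best_model` could be pointed at the accuracy key returned by `compute_metrics` (the exact key name is an assumption here) together with `greater_is_better=True`.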



# Generate text with the fine-tuned model

In [None]:
import torch
import transformers  # needed for transformers.logging below
from transformers import pipeline
import pandas as pd


if model_type!='encoder':
    texts=[]

    for key in ds_processado.keys():
        texts.append(ds_processado[key][0]['text'])
        texts.append(ds_processado[key][1]['text'])
    model.to('cpu')
    pred=[]

    previous_level = transformers.logging.get_verbosity()
    transformers.logging.set_verbosity_error()

    for text in texts:
        pred.append(tokenizer.batch_decode(model.generate(tokenizer.encode(text,return_tensors='pt'),max_new_tokens=20,eos_token_id=tokenizer.eos_token_id)))

    transformers.logging.set_verbosity(previous_level) ####

    for i in range(0,len(texts)):
        print('input:',texts[i])
        print('generated:',pred[i])
        print('')

# Disconnect from Colab

In [None]:
from google.colab import runtime

runtime.unassign()