## **config**

In [1]:
! pip install datasets transformers seqeval



In [2]:
! pip install accelerate -U



In [3]:
# from huggingface_hub import notebook_login

# notebook_login()

In [4]:
# git large file storage (git-lfs)
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [5]:
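import transformers
from packaging import version

# compare parsed versions: as plain strings, "4.9.0" would wrongly rank above "4.11.0"
assert version.parse(transformers.__version__) >= version.parse("4.11.0"), "Please update your transformers version by running 'pip install -U transformers'"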

## **fine-tuning workflow and preprocessing**

depending on the model and the GPU you use, you may have to adjust the batch size. the same goes for the dataset: it might need some small adjustments. apart from that, the rest of the notebook should be straightforward.

In [6]:
task = "ner"
model_checkpoint = "distilbert-base-uncased" # https://huggingface.co/distilbert/distilbert-base-uncased

### loading the dataset

In [7]:
from datasets import load_dataset, load_metric

i'm using the [CoNLL 2003 dataset](https://www.aclweb.org/anthology/W03-0419.pdf) because it is a standard, well-known benchmark for NER. check the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load your own dataset. the rest of the notebook should work with any token-classification dataset from the hugging face library that has a similar format.

In [8]:
datasets = load_dataset("conll2003") # https://huggingface.co/datasets/eriktks/conll2003

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [9]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [10]:
datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

labels are already encoded as integer ids; the id-to-name mapping is stored in the dataset's `features` attribute

In [11]:
datasets["train"].features[f"{task}_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [12]:
label_list = datasets["train"].features[f"{task}_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

so, we have:
- `O` for tokens that are not part of any entity
- `PER` for person
- `ORG` for organization
- `LOC` for location
- `MISC` for miscellaneous

as we can see above, each entity tag also has a `B-` or `I-` prefix: `B-` marks the first token of an entity (beginning) and `I-` marks a token inside an entity that has already started.
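to make the encoding concrete, here is a minimal sketch that turns the integer `ner_tags` of the first training example back into tag names. the names `id2label` and `label2id` are introduced here only for illustration; the label order is copied from the output above.

```python
# tag names in the order reported by the dataset's ClassLabel feature
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# lookup tables in both directions (illustrative names, not used elsewhere in the notebook)
id2label = dict(enumerate(label_list))
label2id = {name: i for i, name in id2label.items()}

# tokens and ner_tags of datasets["train"][0], copied from the output above
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

decoded = [id2label[i] for i in ner_tags]
for token, tag in zip(tokens, decoded):
    print(f"{token:10s} {tag}")
```

here `EU` comes out as `B-ORG` and `German` / `British` as `B-MISC`; every other token is `O`.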

In [13]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    """
    Displays a random selection of elements from the dataset in order to understand what the data looks like.

    Args:
        dataset (Dataset): The dataset to select elements from.
        num_examples (int): The number of random examples to display. Defaults to 10.

    Raises:
        AssertionError: If `num_examples` is greater than the length of the dataset.

    Returns:
        None
    """
    assert num_examples <= len(dataset)
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [14]:
show_random_elements(datasets["train"])

Unnamed: 0,id,tokens,pos_tags,chunk_tags,ner_tags
0,5340,"[Major, League, Baseball]","[NNP, NNP, NNP]","[B-NP, I-NP, I-NP]","[B-MISC, I-MISC, I-MISC]"
1,8333,"[Lewis, 71, 10, 264, 1, 264.00]","[NNP, CD, CD, CD, CD, CD]","[B-NP, I-NP, I-NP, I-NP, I-NP, I-NP]","[B-PER, O, O, O, O, O]"
2,1197,"[Guinea, 's, president, ,, Lansana, Conte, ,, vice-president, of, the, Organisation, of, the, Islamic, Conference, ,, left, for, Kuwait, on, August, 16, to, prepare, the, next, OIC, summit, in, Pakistan, in, 1997, .]","[NNP, POS, NN, ,, NNP, NNP, ,, NN, IN, DT, NNP, IN, DT, NNP, NNP, ,, VBD, IN, NNP, IN, NNP, CD, TO, VB, DT, JJ, NNP, NN, IN, NNP, IN, CD, .]","[B-NP, B-NP, I-NP, O, B-NP, I-NP, O, B-NP, B-PP, B-NP, I-NP, B-PP, B-NP, I-NP, I-NP, O, B-VP, B-PP, B-NP, B-PP, B-NP, I-NP, B-VP, I-VP, B-NP, I-NP, I-NP, I-NP, B-PP, B-NP, B-PP, B-NP, O]","[B-LOC, O, O, O, B-PER, I-PER, O, O, O, O, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, O, O, O, B-LOC, O, O, O, O, O, O, O, B-ORG, O, O, B-LOC, O, O, O]"
3,1053,"[Women, who, get, measles, while, pregnant, may, have, babies, at, higher, risk, of, Crohn, 's, disease, ,, a, debilitating, bowel, disorder, ,, researchers, said, on, Friday, .]","[NNS, WP, VBP, FW, IN, JJ, MD, VB, NNS, IN, JJR, NN, IN, NNP, POS, NN, ,, DT, JJ, NN, NN, ,, NNS, VBD, IN, NNP, .]","[B-NP, B-NP, B-VP, B-NP, B-SBAR, B-NP, B-VP, I-VP, B-NP, B-PP, B-NP, I-NP, B-PP, B-NP, B-NP, I-NP, O, B-NP, I-NP, I-NP, I-NP, O, B-NP, B-VP, B-PP, B-NP, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-PER, O, O, O, O, O, O, O, O, O, O, O, O, O]"
4,12900,"[Over, the, seven, months, ,, the, private, sector, accounted, for, 76, percent, of, total, exports, ,, with, "", common, metals, "", the, strongest, export, sector, accounting, for, $, 951, million, ,, or, 42.5, percent, of, total, exports, .]","[IN, DT, CD, NNS, ,, DT, JJ, NN, VBD, IN, CD, NN, IN, JJ, NNS, ,, IN, "", JJ, NNS, "", DT, JJS, NN, NN, NN, IN, $, CD, CD, ,, CC, CD, NN, IN, JJ, NNS, .]","[B-PP, B-NP, I-NP, I-NP, O, B-NP, I-NP, I-NP, B-VP, B-PP, B-NP, I-NP, B-PP, B-NP, I-NP, O, B-PP, O, B-NP, I-NP, O, B-NP, I-NP, I-NP, I-NP, I-NP, B-PP, B-NP, I-NP, I-NP, O, O, B-NP, I-NP, B-PP, B-NP, I-NP, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
5,3080,"[The, United, States, said, on, Thursday, it, remained, committed, to, migration, accords, with, Cuba, and, would, continue, to, repatriate, intercepted, Cuban, migrants, who, attempted, to, enter, U.S., territory, illegally, .]","[DT, NNP, NNP, VBD, IN, NNP, PRP, VBD, VBN, TO, NN, NNS, IN, NNP, CC, MD, VB, TO, VB, VBN, JJ, NNS, WP, VBD, TO, VB, NNP, NN, RB, .]","[B-NP, I-NP, I-NP, B-VP, B-PP, B-NP, B-NP, B-VP, B-ADJP, B-PP, B-NP, I-NP, B-PP, B-NP, O, B-VP, I-VP, I-VP, I-VP, I-VP, B-NP, I-NP, B-NP, B-VP, I-VP, I-VP, B-NP, I-NP, B-ADVP, O]","[O, B-LOC, I-LOC, O, O, O, O, O, O, O, O, O, O, B-LOC, O, O, O, O, O, O, B-MISC, O, O, O, O, O, B-LOC, O, O, O]"
6,11074,"[He, declined, to, say, how, many, people, were, being, considered, for, asylum, .]","[PRP, VBD, TO, VB, WRB, JJ, NNS, VBD, VBG, VBN, IN, NN, .]","[B-NP, B-VP, I-VP, I-VP, B-ADVP, B-NP, I-NP, B-VP, I-VP, I-VP, B-PP, B-NP, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O]"
7,11413,"[WATER, DISTRICT, 1, OF, JOHNSON, CO, .]","[NNP, NNP, CD, IN, NNP, NNP, .]","[B-NP, I-NP, I-NP, B-PP, B-NP, I-NP, O]","[B-MISC, I-MISC, I-MISC, O, B-ORG, I-ORG, O]"
8,365,"[HORSE, RACING, -, NUNTHORPE, STAKES, RESULTS, .]","[NNP, NNP, :, NNP, NNP, NNS, .]","[B-NP, I-NP, O, B-NP, I-NP, I-NP, O]","[O, O, O, O, O, O, O]"
9,2165,"[Bench, coach, Andy, Etchebarren, took, his, place, .]","[NN, NN, NNP, NNP, VBD, PRP$, NN, .]","[B-NP, I-NP, I-NP, I-NP, B-VP, B-NP, I-NP, O]","[O, O, B-PER, I-PER, O, O, O, O]"


### preprocessing data (tokenizing)

In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [16]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast) # check that the tokenizer is a fast one (backed by the hugging face tokenizers library); word_ids() used below needs it

a table listing each model type and whether it has a fast tokenizer is available [here](https://huggingface.co/transformers/index.html#bigtable).

In [17]:
tokenizer("Hello, this is one sentence!") # for sentences

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True) # when the sentence is already split into words

{'input_ids': [101, 7592, 1010, 2023, 2003, 2028, 6251, 3975, 2046, 2616, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

**important:** transformer models are pretrained with subword tokenizers.

even if your inputs are already split into words, the tokenizer may split each of those words again into subwords, and it also adds some special tokens, as you can see in the following example.



In [19]:
example = datasets["train"][4]
print(example["tokens"])

['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']


In [20]:
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', 'germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.', '[SEP]']
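the split of `zwingmann` into `z`, `##wing`, `##mann` above is typical of WordPiece-style tokenizers, which greedily match the longest vocabulary entry from left to right. the sketch below is a toy version of that idea, not the tokenizer's actual implementation: `toy_wordpiece` and the tiny `VOCAB` are made up for illustration.

```python
# toy vocabulary (an assumption for this sketch; the real vocabulary has ~30k entries)
VOCAB = {"z", "##wing", "##mann", "sheep", "##me", "##at", "said"}

def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

for w in ["zwingmann", "sheepmeat", "said"]:
    print(w, "->", toy_wordpiece(w, VOCAB))
```

one word can therefore become several tokens, which is why the number of tokens no longer matches the number of word-level labels.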


so, what happens? there is a mismatch between the word-level `list of labels` and the `input_ids` that the tokenizer produces for training. let's check it and then realign the labels using `word_ids()` from the tokenizer output

In [21]:
len(example[f"{task}_tags"]), len(tokenized_input["input_ids"])

(31, 39)

In [22]:
print(tokenized_input.word_ids())

[None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11, 11, 11, 12, 13, 14, 15, 16, 17, 18, 18, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, None]


In [23]:
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))

39 39


In [24]:
label_all_tokens = True

In [25]:
def tokenize_and_align_labels(examples):
    """
    Tokenizes the input examples and aligns the labels with the tokenized inputs.

    Args:
        examples (dict): A dictionary containing the input examples.

    Returns:
        dict: A dictionary containing the tokenized inputs with aligned labels.
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # special tokens have word_idx None; give them label -100 so the pytorch loss ignores them
            if word_idx is None:
                label_ids.append(-100)
            # the first token of each word gets the word's label
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # the remaining tokens of a word get the same label if label_all_tokens is set, otherwise -100
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
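the effect of `label_all_tokens` is easiest to see on a tiny hand-written case. the sketch below repeats the inner loop of the function above as a standalone helper (`align_labels` is a hypothetical name) and uses made-up `word_ids` for a three-word fragment, so it runs without a tokenizer.

```python
def align_labels(word_ids, word_labels, label_all_tokens):
    """Map word-level labels onto tokens; -100 marks positions the loss should skip."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:                # special token such as [CLS] / [SEP]
            aligned.append(-100)
        elif word_id != previous:          # first token of a word
            aligned.append(word_labels[word_id])
        else:                              # later subword of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous = word_id
    return aligned

# hand-written fragment: [CLS] werner z ##wing ##mann said [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

print(align_labels(word_ids, word_labels, label_all_tokens=True))
print(align_labels(word_ids, word_labels, label_all_tokens=False))
```

with the flag on, all three pieces of the second word carry label `2`; with it off, only the first piece does and the other two are masked with `-100`.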

In [26]:
tokenize_and_align_labels(datasets['train'][:5]) # could be just one sample

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], [101, 2848, 13934, 102], [101, 9371, 2727, 1011, 5511, 1011, 2570, 102], [101, 1996, 2647, 3222, 2056, 2006, 9432, 2009, 18335, 2007, 2446, 6040, 2000, 10390, 2000, 18454, 2078, 2329, 12559, 2127, 6529, 5646, 3251, 5506, 11190, 4295, 2064, 2022, 11860, 2000, 8351, 1012, 102], [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100], [-100, 1, 2, -100], [-100, 5, 0, 

In [27]:
# applying to the whole dataset (train, test, val)
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True) # batched=True passes batches of examples to the function, which lets the fast tokenizer process them together

### fine-tuning the model

In [28]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list)) # one output class per tag in label_list

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
! pip install accelerate -U



In [30]:
batch_size = 16

# to save the checkpoints
model_name = model_checkpoint.split("/")[-1]

# training args
args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-{task}",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False, # change this to upload the model to hugging face hub
)



In [31]:
import os
os.makedirs(args.output_dir, exist_ok=True)

In [32]:
from transformers import DataCollatorForTokenClassification

# this batches the processed examples together and pads them to the same length (both inputs and labels)
data_collator = DataCollatorForTokenClassification(tokenizer)
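what the collator does per batch can be sketched in plain python. this is only an approximation under stated assumptions: `pad_batch` is a hypothetical helper, padding goes on the right, the pad token id is `0` and the label pad value is `-100`. the two examples are the first two tokenized training sentences shown earlier.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every field to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        missing = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * missing)
        batch["attention_mask"].append(f["attention_mask"] + [0] * missing)
        # padded positions get -100 so they are ignored by the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * missing)
    return batch

features = [
    {"input_ids": [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102],
     "attention_mask": [1] * 11,
     "labels": [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100]},
    {"input_ids": [101, 2848, 13934, 102],
     "attention_mask": [1] * 4,
     "labels": [-100, 1, 2, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["labels"][1])
```

the real collator additionally converts the padded lists to tensors; the point here is only that labels are padded with the same ignore index used for special tokens.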

to compute metrics for the predictions we'll use [`seqeval`](https://github.com/chakki-works/seqeval), a framework for sequence labeling evaluation. it computes precision, recall and F1 over whole entities rather than over individual tokens

In [33]:
metric = load_metric("seqeval", trust_remote_code=True)



this metric takes lists of label strings (one list per sentence) for the predictions and the references:

example:
```python
true_labels = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["B-ORG", "O", "B-LOC", "I-LOC", "O"]
]

predicted_labels = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["B-ORG", "O", "B-LOC", "O", "O"]
]
```
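to see what entity-level scoring means for the two lists above, here is a simplified re-implementation in plain python. it is a sketch, not `seqeval`'s code: `extract_entities` and `entity_scores` are hypothetical helpers, and a stray `I-` tag is treated as the start of a new entity, which is a choice made for this sketch.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing entity
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if ent_type is not None and not continues:
            entities.add((ent_type, start, i - 1))
            ent_type = None
        if tag != "O" and not continues:
            start, ent_type = i, tag[2:]
    return entities

def entity_scores(true_labels, predicted_labels):
    """Precision, recall and F1 over exact-match entity spans."""
    true_entities, pred_entities = set(), set()
    for idx, (t, p) in enumerate(zip(true_labels, predicted_labels)):
        true_entities |= {(idx, *e) for e in extract_entities(t)}
        pred_entities |= {(idx, *e) for e in extract_entities(p)}
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_labels = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["B-ORG", "O", "B-LOC", "I-LOC", "O"],
]
predicted_labels = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["B-ORG", "O", "B-LOC", "O", "O"],
]

print(entity_scores(true_labels, predicted_labels))
```

only one token differs between the two lists, but it truncates the two-token `LOC` span, so that entity counts as wrong: 3 of 4 predicted entities match and precision, recall and F1 all come out at 0.75.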

In [34]:
# the metric takes predictions and references, each a list of label lists
labels = [label_list[i] for i in example[f"{task}_tags"]]
metric.compute(predictions=[labels], references=[labels])

{'LOC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

In [35]:
import numpy as np

def compute_metrics(p):
    """
    Computes evaluation metrics for token classification, post-processing the predictions into label names from label_list.

    Args:
        p (tuple): A tuple containing predictions and labels.

    Returns:
        dict: A dictionary containing the computed evaluation metrics.
            - precision: The overall precision.
            - recall: The overall recall.
            - f1: The overall F1 score.
            - accuracy: The overall accuracy.
    """
    # getting predicted index
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)

    # dropping the per-label metrics and keeping only the overall ones (the per-label metrics can still be computed later)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
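the post-processing inside `compute_metrics` can be followed on a single made-up sentence. the logits and the shortened `label_list` below are invented for illustration; only the argmax-then-filter logic mirrors the function above.

```python
import numpy as np

label_list = ["O", "B-PER", "I-PER"]  # shortened label set, for illustration only

# logits with shape (batch=1, seq_len=4, num_labels=3); values are made up
logits = np.array([[[2.0, 0.1, 0.1],    # [CLS]
                    [0.2, 3.0, 0.1],    # first token of a name
                    [0.1, 0.3, 2.5],    # second token of the name
                    [1.5, 0.2, 0.2]]])  # [SEP]
labels = np.array([[-100, 1, 2, -100]])

# highest-scoring class per token
pred_ids = np.argmax(logits, axis=2)

# keep only positions whose reference label is not the ignore index
true_predictions = [
    [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(pred_ids, labels)
]
true_labels = [
    [label_list[l] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(pred_ids, labels)
]

print(pred_ids.tolist())
print(true_predictions, true_labels)
```

the predictions at the `[CLS]` and `[SEP]` positions are discarded because their reference label is `-100`, so they never reach the metric.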

In [36]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [37]:
# run the training
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.2536,0.069434,0.900186,0.919118,0.909554,0.979904
2,0.0517,0.060372,0.921778,0.934668,0.928179,0.983033
3,0.0312,0.061092,0.929205,0.938248,0.933704,0.983796


TrainOutput(global_step=2634, training_loss=0.08775087089002721, metrics={'train_runtime': 292.3435, 'train_samples_per_second': 144.087, 'train_steps_per_second': 9.01, 'total_flos': 510122266253334.0, 'train_loss': 0.08775087089002721, 'epoch': 3.0})
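the `global_step` reported above can be checked by hand from the dataset size and the training arguments. this sketch assumes a single device, no gradient accumulation and that the last partial batch of each epoch is kept.

```python
import math

num_train_examples = 14041  # rows in the train split
batch_size = 16             # per_device_train_batch_size
num_epochs = 3              # num_train_epochs

# one optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

this gives 878 steps per epoch and 2634 steps in total, matching `global_step=2634`.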

In [38]:
# and we evaluate on this cell
trainer.evaluate()

{'eval_loss': 0.06109246239066124,
 'eval_precision': 0.9292045202747617,
 'eval_recall': 0.9382481261886118,
 'eval_f1': 0.933704425271361,
 'eval_accuracy': 0.9837958917819754,
 'eval_runtime': 5.937,
 'eval_samples_per_second': 547.411,
 'eval_steps_per_second': 34.361,
 'epoch': 3.0}

to get the metrics for each category, we can apply the same post-processing as in `compute_metrics` to the output of the `predict` method and keep the full `seqeval` result

In [39]:
predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions, axis=2)

true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

{'LOC': {'precision': 0.9548729616989002,
  'recall': 0.9618029029793735,
  'f1': 0.9583254043767839,
  'number': 2618},
 'MISC': {'precision': 0.8256658595641646,
  'recall': 0.8310316815597075,
  'f1': 0.8283400809716599,
  'number': 1231},
 'ORG': {'precision': 0.8928909952606635,
  'recall': 0.9163424124513618,
  'f1': 0.9044647143542968,
  'number': 2056},
 'PER': {'precision': 0.9743421052631579,
  'recall': 0.976268951878708,
  'f1': 0.975304576885084,
  'number': 3034},
 'overall_precision': 0.9292045202747617,
 'overall_recall': 0.9382481261886118,
 'overall_f1': 0.933704425271361,
 'overall_accuracy': 0.9837958917819754}

In [40]:
# uncomment to push the model to hf hub
# trainer.push_to_hub()

# afterwards, the pushed model can be loaded with from_pretrained, the same way the base checkpoint was loaded above