Extracted from: https://www.kaggle.com/code/eriknovak/pytorch-roberta-named-entity-recognition

# Named Entity Recognition using RoBERTa

In this notebook we will train a Named Entity Recognition (NER) model using RoBERTa.

The Python libraries that we will use in this notebook are:

- [transformers][transformers]. The Hugging Face library containing the general-purpose architectures for NLP.
- [datasets][datasets]. The Hugging Face library containing datasets and evaluation metrics for NLP.
- [torch][torch]. An open source machine learning framework used to create and train models on the GPU.
- [matplotlib][matplotlib]. A comprehensive library for creating static, animated and interactive visualizations in Python.

[transformers]: https://huggingface.co/transformers/index.html
[datasets]: https://huggingface.co/docs/datasets/
[torch]: https://pytorch.org/
[matplotlib]: https://matplotlib.org/

**NOTE:** The notebook uses `transformers-3.5.1` and `torch-1.7.0` versions.

# Import the Libraries

In [1]:
import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = "1"  # Workaround for M1 GPUs
# Previous code must be executed before importing torch


# visualization libraries
import matplotlib.pyplot as plt
import numpy as np

# pytorch libraries
import torch # the main pytorch library
import torch.nn as nn # the sub-library containing Softmax, Module and other useful functions
import torch.optim as optim # the sub-library containing the common optimizers (SGD, Adam, etc.)

# huggingface's transformers library
from transformers import RobertaForTokenClassification, RobertaTokenizer

# huggingface's datasets library
from datasets import load_dataset

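# the tqdm library used to show the iteration progress
# (the notebook submodule is imported explicitly; `import tqdm` alone does not guarantee it is loaded)
from tqdm.notebook import tqdm as tqdmn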



# Import RoBERTa Tokenizer

Now we will import a pretrained RoBERTa model from Hugging Face. There are many models that were trained with different hyperparameters and for different tasks. A full list of official models is available [here][pretrained], and many more have been uploaded by the [community][community].

In this notebook we will use the `PlanTL-GOB-ES/roberta-large-bne` checkpoint, a Spanish RoBERTa-large model. To use it we need both the `tokenizer` and the `model`: the `tokenizer` transforms strings into tensors that can then be sent to the `model`, which in turn gives us the embeddings.

At this point we will only import the tokenizer, since we need it when manipulating the data. The model will be defined later.

[pretrained]: https://huggingface.co/transformers/pretrained_models.html
[community]: https://huggingface.co/models

In [2]:
roberta_version = 'PlanTL-GOB-ES/roberta-large-bne'
tokenizer = RobertaTokenizer.from_pretrained(roberta_version)

# Import Dataset

In this section we will use the `datasets` library to import a dataset that follows the layout of [CoNLL-2003][conll2003], one of the NER benchmark datasets. CoNLL-2003 contains tokenized sentences and their corresponding named entities, part-of-speech tags and syntactic chunk tags. The dataset loaded here keeps only the `id`, `tokens` and `ner_tags` features and uses its own set of entity labels, since we will train the NER model using only the named entities.

But first we need to import the dataset. This is done with the `load_dataset` function, which allows us to download and prepare a dataset from Hugging Face's [dataset index][datasets] or, as in this notebook, to load it from local files.


[conll2003]: https://www.aclweb.org/anthology/W03-0419/
[datasets]: https://huggingface.co/datasets

In [3]:
original_conll_features_dict = {  # This is the original features dict from the CoNLL dataset (reverse-engineered as reference)
    "features": {
        "id": {
            "dtype": "string", 
            "_type": "Value"
        }, 
        "tokens": {
            "feature": {
                "dtype": "string", 
                "_type": "Value"
            }, 
            "_type": "Sequence"
        }, 
        "pos_tags": {
            "feature": {
                "names": ["\"", "''", "#", "$", "(", ")", ",", ".", ":", "``", "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNP", "NNPS", "NNS", "NN|SYM", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB"], 
                "_type": "ClassLabel"
            }, 
            "_type": "Sequence"
        }, 
        "chunk_tags": {
            "feature": {
                "names": ["O", "B-ADJP", "I-ADJP", "B-ADVP", "I-ADVP", "B-CONJP", "I-CONJP", "B-INTJ", "I-INTJ", "B-LST", "I-LST", "B-NP", "I-NP", "B-PP", "I-PP", "B-PRT", "I-PRT", "B-SBAR", "I-SBAR", "B-UCP", "I-UCP", "B-VP", "I-VP"], 
                "_type": "ClassLabel"
            }, 
            "_type": "Sequence"
        }, 
        "ner_tags": {
            "feature": {
                "names": ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"], 
                "_type": "ClassLabel"
            }, 
            "_type": "Sequence"
        }
    }
}


In [4]:
ner_categories = ['O', 'B-PIN_CLAVE', 'I-PIN_CLAVE', 'B-CVV', 'I-CVV', 'B-PAN', 'I-PAN', 'B-TELEFONO', 'I-TELEFONO', 'B-EMAIL', 'I-EMAIL', 'B-DOCUMENTO_IDENTIDAD', 'I-DOCUMENTO_IDENTIDAD', 'B-NOMBRE', 'I-NOMBRE', 'B-FECHA_NACIMIENTO', 'I-FECHA_NACIMIENTO', 'B-DIRECCION', 'I-DIRECCION', 'B-CONTRAPARTE', 'I-CONTRAPARTE', 'B-SS', 'I-SS', 'B-FUC_EMPRESA', 'I-FUC_EMPRESA', 'B-ACCOUNT_NUMBER', 'I-ACCOUNT_NUMBER', 'B-REFERENCIA_CATASTRAL', 'I-REFERENCIA_CATASTRAL', 'B-DISPOSITIVO', 'I-DISPOSITIVO', 'B-MATRICULA_VEHICULO', 'I-MATRICULA_VEHICULO', 'B-FIRMA_ELECTRÓNICA', 'I-FIRMA_ELECTRÓNICA', 'B-ID_RRSS', 'I-ID_RRSS', 'B-GEOLOCALIZACIÓN', 'I-GEOLOCALIZACIÓN']

In [5]:
from datasets import ClassLabel, Value, Features, Sequence

conll_features = Features(
    {
        'id': Value(dtype='string', id=None),
        'tokens': Sequence(feature=Value(dtype='string')),
        'ner_tags': Sequence(ClassLabel(num_classes=len(ner_categories), names=ner_categories))
    }
)


In [6]:
data_files = {'train': 'traindataset.jsonl', 'validation': 'validationdataset.jsonl' ,'test': 'testdataset.jsonl'}

dataset = load_dataset(
    "./datasets/conll_fake_test",
    "testdataset",
    features=conll_features,
    data_files=data_files
)

Downloading and preparing dataset json/conll_fake_test to /Users/u093715/.cache/huggingface/datasets/json/conll_fake_test-6c67fc4105484cc1/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /Users/u093715/.cache/huggingface/datasets/json/conll_fake_test-6c67fc4105484cc1/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


The imported dataset is a `datasets.DatasetDict`, a dictionary that contains three keys:

- `train`. Used to train the model.
- `validation`. Used to evaluate the model during training.
- `test`. Used for evaluating the trained model.

Each subset is a `datasets.Dataset` object which has a variety of attributes and methods. 

**NOTE:** Not all imported datasets are a `datasets.DatasetDict`; some are already a `datasets.Dataset`. It depends on the dataset being loaded.


In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 16
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 16
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 16
    })
})

To get the shape (number of examples and features) one can use the `.shape` attribute.

In [8]:
dataset.shape

{'train': (16, 3), 'validation': (16, 3), 'test': (16, 3)}

To access an example in a subset one can 'pretend' that the dataset is a list and access it as such.

In [9]:
dataset['train'][0]

{'id': '0',
 'tokens': ['He',
  'recibido',
  'su',
  'mensaje',
  'y',
  'para',
  'poder',
  'ayudarle',
  'necesito',
  'que',
  'me',
  'confirme',
  'su',
  'clave',
  'de',
  'acceso',
  '.',
  '¿Podría',
  'proporcionármela',
  'por',
  'favor',
  '?',
  'Gracias',
  '.',
  'La',
  'clave',
  'es',
  '#',
  '5M4lRz6z',
  '*'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  2,
  2]}

Since the dataset contains the named entity tags, we want to know the number of such tags. This number is found in the `ClassLabel` object and can be accessed in the following way.

In [10]:
num_labels = dataset['train'].features['ner_tags'].feature.num_classes

# Dataset Metadata

In addition to the actual data, the imported datasets can also have various metadata attached:

- `description`. The description of the dataset.
- `citation`. The citation used if one is submitting the results in a paper.
- `homepage`. The homepage of the dataset.
- `license`. The license under which the dataset can be used.

In [11]:
print(dataset['train'].description)




In [12]:
print(dataset['train'].citation)




In [13]:
print(dataset['train'].homepage)




In [14]:
print(dataset['train'].license)




# Dataset Format

The dataset is going to be used to train and evaluate the NER model. To do so, we must first modify the examples so that they can be sent to the model. With `datasets` this can be done using the `.map` method, which applies a function to every example; the function accepts a `dict` and returns a `dict`.

If the returned dictionary contains new keys, these are added to the example. If it contains keys that are already in the example, the new values override the old ones.
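
The merge behaviour described above can be pictured with plain dictionaries. The sketch below is only an analogy, not the library's implementation; `apply_map` and `add_length` are hypothetical names introduced for illustration.

```python
def apply_map(example, fn):
    """Merge the dict returned by `fn` into a copy of `example` (hypothetical helper)."""
    updates = fn(example)
    # keys in `updates` are added to the example; keys that already exist are overridden
    return {**example, **updates}


def add_length(example):
    # returns one new key ('n_tokens') and one key that already exists ('id')
    return {"n_tokens": len(example["tokens"]), "id": example["id"] + "-seen"}


example = {"id": "0", "tokens": ["La", "clave", "es"]}
mapped = apply_map(example, add_length)
print(mapped)
```

The original `example` is left untouched here; only the returned copy carries the new and overridden keys.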

With the `add_encodings` function we want to add the following values to our dataset examples:

- `input_ids`. The input ids that will be sent to the model. This is generated by the `tokenizer`.
- `attention_mask`. The attention_mask that will be sent to the model. This is generated by the `tokenizer`.
- `labels`. The labels that will be sent to the model to calculate the loss value. This will be created from the `ner_tags` attribute of the examples.


In [15]:
def add_encodings(example):
    """Processing the example
    
    Args:
        example (dict): The dataset example.
    
    Returns:
        dict: The dictionary containing the following updates:
            - input_ids: The list of input ids of the tokens.
            - attention_mask: The attention mask list.
            - labels: The ner_tags padded to the length of the input_ids.
    
    """
    max_length = tokenizer.model_max_length
    # get the encodings of the tokens. The tokens are already split, that is why we must add is_split_into_words=True
    encodings = tokenizer(example['tokens'], truncation=True, padding='max_length', is_split_into_words=True)
    # cut the ner_tags at max_length and pad them with 0 ('O') so that they match the length of the input_ids
    # NOTE: the tags stay word-level; they are not realigned to sub-word tokens or to the special tokens
    labels = example['ner_tags'][:max_length]
    labels = labels + [0] * (max_length - len(labels))
    # return the encodings and the padded ner_tags
    return { **encodings, 'labels': labels }
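
Because the tokenizer may split one word into several sub-word tokens and adds special tokens, word-level tags do not necessarily line up with `input_ids`. A common alternative is to expand the tags using per-token word indices, such as those returned by the `word_ids()` method of fast tokenizers (an assumption here: to my understanding this would require `RobertaTokenizerFast` rather than the `RobertaTokenizer` used in this notebook). The sketch below is pure Python; `align_labels` is a hypothetical helper and the word indices are written by hand. The value `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default `ignore_index` of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Expand word-level labels to token level (hypothetical helper).

    word_ids holds, for each token, the index of the word it came from,
    or None for special and padding tokens.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # special or padding token: excluded from the loss
            aligned.append(ignore_index)
        elif word_id != previous:
            # first sub-word token of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # remaining sub-word tokens of the same word are ignored
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# hand-written word indices: one special token, five words (the last one split
# into three sub-word tokens), then a closing special token and one padding token
word_ids = [None, 0, 1, 2, 3, 4, 4, 4, None, None]
word_labels = [0, 0, 0, 1, 2]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Labelling only the first sub-word token of each word is one convention among several; another is to repeat the word's label on every sub-word token.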

Simply run the `.map` method with the `add_encodings` to format the dataset.

In [16]:
# modify/format all datasets so that they include the 'input_ids', 'attention_mask' 
# and 'labels' used to train and evaluate the model
dataset = dataset.map(add_encodings)

In addition to adding new values to the dataset examples, we also want to modify what values are returned by the dataset by default and in which format. To do this, we will use the `.set_format` method and provide the following inputs:

- `columns`. This will specify which attributes in the dataset we want to return by default.
- `type`. This will transform the attributes specified in `columns` to the appropriate type. In our example we will transform them into `torch` tensors.

In [17]:
# format the datasets so that we return only 'input_ids', 'attention_mask' and 'labels' 
# making it easier to train and validate the model
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Also, we want to have an easy way to map the NER tags to indices and vice-versa. We will do this with the following lines.

In [18]:
# get the NER labels and create two dictionaries for accessing their ids
labels = dataset['train'].features['ner_tags'].feature
category_texts = labels.names
label2id = { k: labels.str2int(k) for k in category_texts }
id2label = { v: k for k, v in label2id.items() }

In [19]:
category_texts

['O',
 'B-PIN_CLAVE',
 'I-PIN_CLAVE',
 'B-CVV',
 'I-CVV',
 'B-PAN',
 'I-PAN',
 'B-TELEFONO',
 'I-TELEFONO',
 'B-EMAIL',
 'I-EMAIL',
 'B-DOCUMENTO_IDENTIDAD',
 'I-DOCUMENTO_IDENTIDAD',
 'B-NOMBRE',
 'I-NOMBRE',
 'B-FECHA_NACIMIENTO',
 'I-FECHA_NACIMIENTO',
 'B-DIRECCION',
 'I-DIRECCION',
 'B-CONTRAPARTE',
 'I-CONTRAPARTE',
 'B-SS',
 'I-SS',
 'B-FUC_EMPRESA',
 'I-FUC_EMPRESA',
 'B-ACCOUNT_NUMBER',
 'I-ACCOUNT_NUMBER',
 'B-REFERENCIA_CATASTRAL',
 'I-REFERENCIA_CATASTRAL',
 'B-DISPOSITIVO',
 'I-DISPOSITIVO',
 'B-MATRICULA_VEHICULO',
 'I-MATRICULA_VEHICULO',
 'B-FIRMA_ELECTRÓNICA',
 'I-FIRMA_ELECTRÓNICA',
 'B-ID_RRSS',
 'I-ID_RRSS',
 'B-GEOLOCALIZACIÓN',
 'I-GEOLOCALIZACIÓN']

In [20]:
id2label

{0: 'O',
 1: 'B-PIN_CLAVE',
 2: 'I-PIN_CLAVE',
 3: 'B-CVV',
 4: 'I-CVV',
 5: 'B-PAN',
 6: 'I-PAN',
 7: 'B-TELEFONO',
 8: 'I-TELEFONO',
 9: 'B-EMAIL',
 10: 'I-EMAIL',
 11: 'B-DOCUMENTO_IDENTIDAD',
 12: 'I-DOCUMENTO_IDENTIDAD',
 13: 'B-NOMBRE',
 14: 'I-NOMBRE',
 15: 'B-FECHA_NACIMIENTO',
 16: 'I-FECHA_NACIMIENTO',
 17: 'B-DIRECCION',
 18: 'I-DIRECCION',
 19: 'B-CONTRAPARTE',
 20: 'I-CONTRAPARTE',
 21: 'B-SS',
 22: 'I-SS',
 23: 'B-FUC_EMPRESA',
 24: 'I-FUC_EMPRESA',
 25: 'B-ACCOUNT_NUMBER',
 26: 'I-ACCOUNT_NUMBER',
 27: 'B-REFERENCIA_CATASTRAL',
 28: 'I-REFERENCIA_CATASTRAL',
 29: 'B-DISPOSITIVO',
 30: 'I-DISPOSITIVO',
 31: 'B-MATRICULA_VEHICULO',
 32: 'I-MATRICULA_VEHICULO',
 33: 'B-FIRMA_ELECTRÓNICA',
 34: 'I-FIRMA_ELECTRÓNICA',
 35: 'B-ID_RRSS',
 36: 'I-ID_RRSS',
 37: 'B-GEOLOCALIZACIÓN',
 38: 'I-GEOLOCALIZACIÓN'}

In [21]:
label2id

{'O': 0,
 'B-PIN_CLAVE': 1,
 'I-PIN_CLAVE': 2,
 'B-CVV': 3,
 'I-CVV': 4,
 'B-PAN': 5,
 'I-PAN': 6,
 'B-TELEFONO': 7,
 'I-TELEFONO': 8,
 'B-EMAIL': 9,
 'I-EMAIL': 10,
 'B-DOCUMENTO_IDENTIDAD': 11,
 'I-DOCUMENTO_IDENTIDAD': 12,
 'B-NOMBRE': 13,
 'I-NOMBRE': 14,
 'B-FECHA_NACIMIENTO': 15,
 'I-FECHA_NACIMIENTO': 16,
 'B-DIRECCION': 17,
 'I-DIRECCION': 18,
 'B-CONTRAPARTE': 19,
 'I-CONTRAPARTE': 20,
 'B-SS': 21,
 'I-SS': 22,
 'B-FUC_EMPRESA': 23,
 'I-FUC_EMPRESA': 24,
 'B-ACCOUNT_NUMBER': 25,
 'I-ACCOUNT_NUMBER': 26,
 'B-REFERENCIA_CATASTRAL': 27,
 'I-REFERENCIA_CATASTRAL': 28,
 'B-DISPOSITIVO': 29,
 'I-DISPOSITIVO': 30,
 'B-MATRICULA_VEHICULO': 31,
 'I-MATRICULA_VEHICULO': 32,
 'B-FIRMA_ELECTRÓNICA': 33,
 'I-FIRMA_ELECTRÓNICA': 34,
 'B-ID_RRSS': 35,
 'I-ID_RRSS': 36,
 'B-GEOLOCALIZACIÓN': 37,
 'I-GEOLOCALIZACIÓN': 38}
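
With an id-to-label mapping in hand, the tag ids can be decoded back into entities. The sketch below assumes the usual reading of the `B-`/`I-` prefixes (`B-` opens an entity, `I-` continues it, `O` is outside any entity); `extract_entities` is a hypothetical helper, and the demo uses a shortened label map together with the tail of the first training example shown earlier.

```python
def extract_entities(tokens, tag_ids, id2label):
    """Group BIO-tagged tokens into (entity_type, text) pairs (hypothetical helper)."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = id2label[tag_id]
        if tag.startswith("B-"):
            # a B- tag closes any open entity and opens a new one
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open entity, closes it
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


demo_id2label = {0: "O", 1: "B-PIN_CLAVE", 2: "I-PIN_CLAVE"}
demo_tokens = ["La", "clave", "es", "#", "5M4lRz6z", "*"]
demo_tags = [0, 0, 0, 1, 2, 2]
entities = extract_entities(demo_tokens, demo_tags, demo_id2label)
print(entities)
```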

# RoBERTa Model for Token Classification

Since Named Entity Recognition can be seen as token classification, we will use the `RobertaForTokenClassification` model provided by Hugging Face.

We will import the pretrained `RoBERTa` model and provide it with the number of labels we will have. In addition, we will set the model configuration so that it is easy for us to map between the labels and the label ids.

In [22]:
# initialize the model and provide the 'num_labels' used to create the classification layer
model = RobertaForTokenClassification.from_pretrained(roberta_version, num_labels=num_labels)
# assign the 'id2label' and 'label2id' model configs
model.config.id2label = id2label
model.config.label2id = label2id

Some weights of the model checkpoint at PlanTL-GOB-ES/roberta-large-bne were not used when initializing RobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-large-bne and are newly initialized: ['classifier.bias', 'classifier.weight']
You sho

Outputting the model will show us its architecture. As we can see, the first part is the architecture of the `RoBERTa` model, with added `dropout` and `classifier` layers at the end. When we load a pretrained model, the `RoBERTa` weights are already set, but the model requires fine-tuning for specific tasks.

In [23]:
model

RobertaForTokenClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50262, 1024, padding_idx=1)
      (position_embeddings): Embedding(514, 1024, padding_idx=1)
      (token_type_embeddings): Embedding(1, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
            

# Training the Model

In this notebook, we will fine-tune the model to correctly identify named entities. During training we will update both the `RoBERTa` model weights as well as the `classifier`.

With `pytorch` we are able to move the calculations to the GPU. To do this we define the `device` on which we wish to run the calculations. We use `cuda` (the NVIDIA platform that enables running calculations on the graphics card) if it is available, then fall back to `mps` (the Apple Silicon GPU backend), and finally to the CPU:

In [24]:
device = torch.device(
    'cuda' if torch.cuda.is_available() else 
    'mps' if torch.backends.mps.is_available() else 
    'cpu'
)

Each model has two states:

- `train`. If the model is in this state, it is configured for training. Components such as Dropout, which have been found to improve the model's performance, are active in this mode.
- `eval`. If the model is in this state, it is configured for production. Components such as Dropout are deactivated, since they are used only when training the model.

If the model is only going to be used in production, set it to `eval`. If the model is first going to be fine-tuned, set it to `train`; after training, set it to `eval`. **NOTE:** It is good practice to specify the model's state before using it.

Since we will train the model, we will set it to `train`. In addition, we will run the model on the `device` defined above (the GPU if one is available, otherwise the CPU).

We will also define an `optimizer`. An optimizer is an object that will help us update the model. There is a whole [list of optimizers][optim], but in this example we will use the [AdamW optimizer][adam] which is currently one of the most popular optimizers used in deep learning. 

Every optimizer accepts the model parameters (`params`), which can be acquired with `model.parameters()`. In addition, the optimizer accepts other inputs such as the learning rate `lr`, which specifies how far we move the model parameters against the direction of the gradient in each update.

[optim]: https://pytorch.org/docs/stable/optim.html
[adam]: https://pytorch.org/docs/stable/optim.html#torch.optim.AdamW
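
To give a feel for what a single AdamW update does, here is a simplified sketch for one scalar parameter. It follows the usual decoupled weight decay formulation and is an illustration under that assumption, not PyTorch's implementation; `adamw_step` is a hypothetical helper and the default values in its signature are chosen for the example.

```python
import math


def adamw_step(w, grad, m, v, t, lr=5e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    # decoupled weight decay: shrink the parameter directly
    w = w * (1 - lr * weight_decay)
    # running averages of the gradient and of the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # bias correction for the zero-initialised averages (t counts steps from 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # move against the gradient, scaled by the running gradient magnitude
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.5, m=m, v=v, t=1, lr=0.1, weight_decay=0.0)
print(w)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's size.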

In [25]:
# set the model in 'train' mode and send it to the device
model.train().to(device)
# initialize the AdamW optimizer (used for training/updating the model)
# optimizer = optim.AdamW(params=model.parameters(), lr=1e-5)
optimizer = optim.AdamW(params=model.parameters(), lr=5e-5)

Another thing to define for training a model is the number of epochs `n_epochs`. This determines the number of times we will go through the whole dataset during training.

To train the model we will use the `dataset['train']` data. We also want to batch the examples, so that we can process multiple examples at once. Our batch size will be 4, meaning that there will be 4 examples processed in parallel.

**NOTE:** Generally, when training a model it is advised to use a bigger batch. Here is why:

- **Averaging gradients.** Training a model means calculating the gradient and changing the model parameters in the direction opposite to that gradient, i.e. $\mathrm{params} = \mathrm{params} - \mathrm{learning\_rate} \cdot \mathrm{gradient}$. If we updated the model after every single example, some gradients could point in an odd direction (e.g. away from the optimal point). When using batches, the model first gets the gradient of every example and then averages them. The average is then used to update the model. With this we can reduce the effect of the odd gradients that we might get during training.

- **Faster training.** Training the model on the GPU allows us to process data faster. This is also due to the parallel processing capabilities of the GPU, meaning that it can process multiple examples at once. Increasing the batch size will usually make the training time shorter.
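
The averaging argument can be illustrated with a toy problem. The sketch below assumes a single parameter `w` and a per-example loss `(w - x)**2`; `gradient` and `sgd_step` are hypothetical helpers, and the numbers are made up for the illustration.

```python
def gradient(w, x):
    # gradient of the per-example loss (w - x)**2 with respect to w
    return 2 * (w - x)


def sgd_step(w, batch, learning_rate):
    """Average the per-example gradients of a batch, then update w once."""
    grads = [gradient(w, x) for x in batch]
    mean_grad = sum(grads) / len(grads)
    return w - learning_rate * mean_grad


w = 0.0
batch = [1.0, 1.2, 0.8, -1.0]  # the last example pulls in the opposite direction

# updating on the odd example alone moves w away from the other three examples
w_single = sgd_step(w, [batch[-1]], learning_rate=0.1)
# averaging over the whole batch keeps the update pointing towards the majority
w_batch = sgd_step(w, batch, learning_rate=0.1)
print(w_single, w_batch)
```

Here the single-example update moves `w` to a negative value, while the batched update moves it to a positive one, towards the three examples that agree.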

Although there are benefits to using bigger batches, there are also some drawbacks:

- **Bigger batches take more space.** When training on the GPU, one of the main problems is running out of GPU memory. That usually happens for one of two reasons: 1) the model is too big to be trained on the GPU, or 2) the batch size is too big. The solution to the first is to use a GPU with more memory. The solution to the second is to make the batch smaller. If there are still problems with GPU memory, try the [solutions provided here][cuda].

[cuda]: https://pytorch.org/docs/stable/notes/faq.html


In [31]:
# set the number of epochs and batch size
n_epochs = 15
batch_size = 4


# batch the train data so that each batch contains 4 examples (using 'batch_size')
train_data = torch.utils.data.DataLoader(dataset['train'], batch_size=batch_size, shuffle=True)
validation_data = torch.utils.data.DataLoader(dataset['validation'], batch_size=batch_size, shuffle=False)

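# a small subset of the validation data (at most 10 batches) used for quick checks
mini_size = min(10 * batch_size, len(dataset['validation']))
mini_subset = torch.utils.data.Subset(dataset['validation'], list(range(mini_size)))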
mini_data = torch.utils.data.DataLoader(mini_subset, batch_size=batch_size, shuffle=False)


In [32]:
for i, d in enumerate(validation_data):
    if i == 0:
        print(d)
        print(d['labels'].shape)
        print(d['labels'].view(-1).shape)
        print(d['labels'].view(-1).cpu().numpy())

{'input_ids': tensor([[    0,  4865,  4159,  ...,     1,     1,     1],
        [    0,   765,  5631,  ...,     1,     1,     1],
        [    0, 10383,  1249,  ...,     1,     1,     1],
        [    0,  5726, 10632,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])}
torch.Size([4, 512])
torch.Size([2048])
[0 0 0 ... 0 0 0]


In [33]:
from sklearn.metrics import precision_recall_fscore_support

# The next function validates the model on the validation dataset,
# returning the average loss and four lists with the precision,
# recall, f1 and support scores of each class in a multiclass
# classification problem.
def validate(model, validation_data, device):
    """Validates the model on the validation dataset.
    
    Args:
        model (torch.nn.Module): The model to be validated.
        validation_data (torch.utils.data.DataLoader): The validation dataset.
        device (torch.device): The device where the model is stored.
    
    Returns:
        float: The average loss.
        list: The list of precision scores.
        list: The list of recall scores.
        list: The list of f1 scores.
        list: The list of support values (number of category occurrences).

    """
    # store the model training state
    model_was_training = model.training
    
    # initialize the lists of losses, true labels and predicted labels
    losses = []
    y_true_list = []
    y_pred_list = []
    # iterate over the batches in the validation_data
    for batch in tqdmn(validation_data, desc='Validating', leave=False):
        # send the batch to the device
        batch_input = { k: v.to(device) for k, v in batch.items() }
        with torch.no_grad():
            # set the model in 'eval' mode
            model.eval()
            # get the model output
            output = model(**batch_input)
        # get the loss and logits from the output
        loss = output.loss
        logits = output.logits
        # get the predictions by getting the maximum logits value for each token
        predictions = torch.argmax(logits, dim=-1)
        # calculate the loss and append it to the losses list
        losses.append(loss.item())
        y_true_list.extend(batch['labels'].view(-1).cpu().numpy())
        y_pred_list.extend(predictions.view(-1).cpu().numpy())

    # get the precision, recall and f1 scores over the whole validation data
    prec, rec, f1, support = precision_recall_fscore_support(
        y_true=y_true_list, 
        y_pred=y_pred_list, 
        labels=list(range(1, num_labels)),  # skip label 0 ('O'), which marks non-entity and padding positions
        average=None, 
        zero_division=0)

    if model_was_training:
        # set the model in 'train' mode
        model.train()

    # return the average loss and the precision, recall, f1 and support scores
    return np.mean(losses), prec, rec, f1, support
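
To make the per-class scores returned by `validate` concrete, the sketch below computes precision, recall, F1 and support by hand for a tiny made-up example. It is meant to mirror the idea of `average=None` with `zero_division=0` and is not scikit-learn's implementation; `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1, support)}, using 0.0 when a ratio is undefined."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
        fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
        fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # support is the number of times the label occurs in y_true
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


y_true = [0, 1, 2, 2, 0, 0]
y_pred = [0, 1, 2, 0, 1, 0]
# label 0 is skipped, as in the validate function above
scores = per_class_scores(y_true, y_pred, labels=range(1, 3))
for label, values in scores.items():
    print(label, values)
```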


In [34]:
# display the results graphically, one line per label.


def draw_values(title, rows, columns, order, values, labels=()):
    plt.subplot(rows, columns, order)
    if len(labels) > 1:
        for i in range(len(values[0])):
            plt.plot([p[i] for p in values], label=labels[i+1])
        plt.legend()
    else:
        plt.plot(values)
    # set the x axis labels as integers
    plt.xticks(ticks=range(len(values)))
    plt.title(title)
    # draw internal grid in light grey
    plt.grid(color='lightgrey', linewidth=0.5)

In [36]:
# Do the actual training and validation

update_block = 4

train_loss = []
train_batch_loss = []
validation_loss = []
validation_prec = []
validation_rec = []
validation_f1 = []
# iterate through the data 'n_epochs' times
for epoch in tqdmn(range(n_epochs), desc='Epoch'):
    current_loss = 0
    # iterate through each batch of the train data
    for i, batch in enumerate(tqdmn(train_data, desc='Iteration', leave=False)):
        # move the batch tensors to the same device as the model
        batch = { k: v.to(device) for k, v in batch.items() }
        # send 'input_ids', 'attention_mask' and 'labels' to the model
        outputs = model(**batch)
        # the outputs are of shape (loss, logits)
        loss = outputs[0]
        # with the .backward method it calculates all 
        # of  the gradients used for autograd
        loss.backward()
        # NOTE: if we append `loss` (a tensor) we will force the GPU to save
        # the loss into its memory, potentially filling it up. To avoid this
        # we rather store its float value, which can be accessed through the
        # `.item` method
        current_loss += loss.item()

        # (i + 1) so that a step happens after exactly `update_block` batches,
        # also when an epoch contains only `update_block` batches
        if (i + 1) % update_block == 0:
            # update the model using the optimizer
            optimizer.step()
            # once we update the model we set the gradients to zero
            optimizer.zero_grad()
            # store the loss value for visualization
            train_loss.append(current_loss/update_block)

            current_loss = 0
    # update the model one last time for this epoch if some batches are still pending
    pending = (i + 1) % update_block
    if pending != 0:
        optimizer.step()
        optimizer.zero_grad()
        # store the average loss of the pending batches as well
        train_loss.append(current_loss/pending)
    
    train_batch_loss.append(train_loss[-1])

    # calculate the loss, precision, recall and f1 scores on the validation data
    val_loss, val_prec, val_rec, val_f1, support = validate(model, validation_data, device)
    print(f'Val Loss: {val_loss:.4f}\nVal Prec: {val_prec}\nVal Rec: {val_rec}\nVal F1: {val_f1}\nsupport: {support}')
    validation_loss.append(val_loss)
    validation_prec.append(val_prec)
    validation_rec.append(val_rec)
    validation_f1.append(val_f1)
    
    # display the results graphically
    plt.figure(figsize=(12, 8))
    draw_values('Precision', 2, 2, 1, validation_prec, category_texts)
    draw_values('Recall', 2, 2, 2, validation_rec, category_texts)
    draw_values('F1', 2, 2, 3, validation_f1, category_texts)
    draw_values('Loss', 2, 2, 4, list(zip(train_batch_loss, validation_loss)), ['O', 'Train', 'Val'])
    plt.show()

    # print the average results
    print(f'Epoch {epoch + 1} - Loss: {train_loss[-1]:.4f} - Val Loss: {np.mean(val_loss):.4f} - Val Prec: {np.mean(val_prec):.4f} - Val Rec: {np.mean(val_rec):.4f} - Val F1: {np.mean(val_f1):.4f}')
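
One way to reason about gradient accumulation is to list the batch indices after which the optimizer takes a step. The sketch below uses a hypothetical helper, `update_points`, and assumes a step after every `update_block`-th batch plus one final step for any leftover batches at the end of the epoch. With 16 training examples and a batch size of 4 there are 4 batches per epoch, which is the first case printed.

```python
def update_points(n_batches, update_block):
    """Return the batch indices after which an optimizer step happens (hypothetical helper)."""
    # a step after every `update_block`-th batch (indices are zero-based)
    points = [i for i in range(n_batches) if (i + 1) % update_block == 0]
    # leftover batches at the end of the epoch get one final step
    if n_batches % update_block != 0:
        points.append(n_batches - 1)
    return points


print(update_points(n_batches=4, update_block=4))
print(update_points(n_batches=10, update_block=4))
```

A schedule that returns an empty list for some epoch length means the model would never be updated in that epoch, which is worth checking whenever the number of batches is close to `update_block`.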

## Visualize the Training Loss

Just visualize the training loss through the iterations.

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
# visualize the loss values
ax.plot(train_loss)
# set the labels
ax.set_ylabel('Loss')
ax.set_xlabel(f'Optimizer updates (up to {update_block * batch_size} examples each)')
fig.tight_layout()
plt.show()

# Evaluate the Model

To evaluate the model we must first put it in the `eval` mode. In addition, we must prepare the test dataset in a similar fashion as we did for training.

In [None]:
model = model.eval()
# batch the test data so that each batch contains 4 examples (using 'batch_size')
test_data = torch.utils.data.DataLoader(dataset['test'], batch_size=batch_size)

The model will be evaluated by creating the confusion matrix. The confusion matrix will show the mismatch between the true labels and the predicted ones.

First we create a tensor of size (`num_labels`, `num_labels`) in which we will store the results. Then, for each batch in the test dataset we will compare the true label with the predicted one and store it in the confusion matrix.

In [None]:
# create the confusion matrix
confusion = torch.zeros(num_labels, num_labels)

# iterate through each batch of the test data
for i, batch in enumerate(tqdmn(test_data)):
    # do not calculate the gradients
    with torch.no_grad():
        # set the model in 'eval' mode
        model.eval()
        # move the batch tensors to the same device as the model
        batch = { k: v.to(device) for k, v in batch.items() }
        # send 'input_ids', 'attention_mask' and 'labels' to the model
        outputs = model(**batch)
    # get the sentence lengths
    s_lengths = batch['attention_mask'].sum(dim=1)
    # iterate through the examples
    for idx, length in enumerate(s_lengths):
        # get the true values
        true_values = batch['labels'][idx][:length]
        # get the predicted values
        pred_values = torch.argmax(outputs[1], dim=2)[idx][:length]
        # go through all true and predicted values and store them in the confusion matrix
        for true, pred in zip(true_values, pred_values):
            confusion[true.item()][pred.item()] += 1

Now we will normalize the confusion matrix by its rows so that we can see how the model performed.

In [None]:
# Normalize by dividing every row by its sum (rows without examples are left as zeros)
for i in range(num_labels):
    row_sum = confusion[i].sum()
    if row_sum > 0:
        confusion[i] = confusion[i] / row_sum
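
The counting and row normalization can also be followed on a tiny example with plain lists. This is a sketch with made-up labels; `confusion_matrix` and `normalize_rows` are hypothetical helpers, and rows for labels that never occur are left as zeros instead of being divided by zero.

```python
def confusion_matrix(y_true, y_pred, num_labels):
    """Count how often each true label (row) is predicted as each label (column)."""
    matrix = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def normalize_rows(matrix):
    """Divide every row by its sum; rows without examples stay all zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# label 2 never occurs as a true label, so its row has no examples
counts = confusion_matrix([0, 0, 0, 1, 1], [0, 0, 1, 1, 1], num_labels=3)
normalized = normalize_rows(counts)
print(counts)
print(normalized)
```

After normalization, each non-empty row sums to 1 and its diagonal entry is the recall of that label.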

Visualize the confusion matrix.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
# visualize the confusion matrix
ax.matshow(confusion.numpy())

# get the labels
labels = list(label2id.keys())
ids = np.arange(len(labels))

ax.set_ylabel('True Labels', fontsize='x-large')
ax.set_xlabel('Pred Labels', fontsize='x-large')

# set the x ticks
ax.set_xticks(ids)
ax.set_xticklabels(labels)

# set the y ticks
ax.set_yticks(ids)
ax.set_yticklabels(labels)

# plot figure
fig.tight_layout()
plt.show()

In [None]:
test_loss, test_prec, test_rec, test_f1, test_support = validate(model, test_data, device)
print(f'Test Loss: {test_loss:.4f}\nTest Prec: {test_prec}\nTest Rec: {test_rec}\nTest F1: {test_f1}\nsupport: {test_support}')


In [None]:
losses = [validation_loss[-1], test_loss, train_batch_loss[-1]]
precissions = [validation_prec[-1], test_prec]
recalls = [validation_rec[-1], test_rec]
f1s = [validation_f1[-1], test_f1]
support = [support, test_support]

In [None]:
plt.figure(figsize=(12, 8))

draw_values('Precision', 2, 2, 1, precissions, category_texts)
draw_values('Recall', 2, 2, 2, recalls, category_texts)
draw_values('F1', 2, 2, 3, f1s, category_texts)
draw_values('Loss', 2, 2, 4, losses)

plt.show()


In [None]:
model.save_pretrained('models/RobertaNER_connl2003')

In [None]:
# Verify validate() function is consistent with metric function "recall" in evaluate.py

import evaluate

# metric = evaluate.combine(["f1", "precision", "recall"])
metric = evaluate.load("recall")

for batch in tqdmn(mini_data):
    # move the batch tensors to the same device as the model
    batch_input = { k: v.to(device) for k, v in batch.items() }
    # send 'input_ids', 'attention_mask' and 'labels' to the model
    with torch.no_grad():
        outputs = model(**batch_input)
    # get the loss and logits from the output
    loss = outputs.loss
    logits = outputs.logits
    
    # get the predictions by getting the maximum logits value for each token
    predictions = torch.argmax(logits, dim=-1).flatten()
    ground_truth = batch_input['labels'].flatten()
    metric.add_batch(references=ground_truth, predictions=predictions)
results = metric.compute(labels=range(1, num_labels), average=None, zero_division=0)

print(results)

val_loss, val_prec, val_rec, val_f1, support = validate(model, mini_data, device)
print(f'Val Loss: {val_loss:.4f}\nVal Prec: {val_prec}\nVal Rec: {val_rec}\nVal F1: {val_f1}\nsupport: {support}')
