In [1]:
import warnings

warnings.filterwarnings("ignore")

# Multilingual Named Entity Recognition

This notebook uses XLM-RoBERTa (XLM-R for short) to perform multilingual named entity recognition (NER) on a subset of the Cross-Lingual Transfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN, also known as PAN-X. This dataset consists of texts from Wikipedia articles in many languages. Each article is annotated with `LOC` (location), `PER` (person) and `ORG` (organization) tags in the IOB2 format. In this format, a `B-` prefix marks the first token of an entity, and the following tokens that belong to the same entity are given an `I-` prefix. An `O` tag indicates that the token does not belong to any entity.
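
To make the IOB2 convention concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `iob2_to_spans` and the example sentence are made up for illustration and are not part of the dataset or of any library.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a B- tag always opens a new entity, closing any open one
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an I- tag continues the entity opened by the preceding B- tag
            current_tokens.append(token)
        else:
            # an O tag (or an I- tag with no matching open entity) closes the span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# hypothetical example sentence, not taken from PAN-X
tokens = ["Jeff", "Dean", "works", "at", "Google", "in", "California"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
example_spans = iob2_to_spans(tokens, tags)
print(example_spans)
```

Because every entity starts with its own `B-` tag, two adjacent entities of the same type remain distinguishable, which is the main reason for using IOB2 rather than a plain per-token type label.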

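XLM-RoBERTa belongs to a class of multilingual transformers that use masked language modeling as a pretraining objective and are trained jointly on texts in many languages. This enables *zero-shot cross-lingual transfer*: a model that is fine-tuned on one language can be applied to other languages without any further training.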

# 1. Dataset

In [2]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("xtreme")
print(f"XTREME has {len(xtreme_subsets)} configurations")

XTREME has 183 configurations


In [3]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:5]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg', 'PAN-X.bn', 'PAN-X.de']

In [4]:
# build an imbalanced multilingual dataset from the XTREME PANX subsets
from collections import defaultdict
from datasets import DatasetDict, load_dataset

# return a DatasetDict if a key is not found
panx_ch = defaultdict(DatasetDict)

langs = ["de", "fr", "it", "en"]
fracs = [0.6, 0.2, 0.1, 0.1]

for lang, frac in zip(langs, fracs):
    # load monolingual corpus
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    # shuffle and downsample each split according to fracs proportions
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle()
            .select(range(int(frac * ds[split].num_rows))))

In [5]:
import pandas as pd

pd.DataFrame(
  { lang: [panx_ch[lang]["train"].num_rows] for lang in langs },
  index=["num_training_examples"]
)

| | de | fr | it | en |
|---|---|---|---|---|
| num_training_examples | 12000 | 4000 | 2000 | 2000 |


In [6]:
element = panx_ch["de"]["train"][0]
for key, value in element.items():
    print(f"{key}: {value}")

tokens: ['Die', 'Zahnplatte', 'am', 'Zwischenkieferbein', 'ist', 'bei', 'geschlossenem', 'Maul', 'sichtbar', '.']
ner_tags: [0, 0, 0, 5, 0, 0, 0, 0, 0, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


In [7]:
for key, value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
ner_tags: Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None), length=-1, id=None)
langs: Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)


In [8]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
panx_de = panx_ch["de"].map(lambda batch: {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]})
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]], index=["tokens", "tags"])

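| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| tokens | Die | Zahnplatte | am | Zwischenkieferbein | ist | bei | geschlossenem | Maul | sichtbar | . |
| tags | O | O | O | B-LOC | O | O | O | O | O | O |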


In [9]:
from collections import Counter

split2freqs = defaultdict(Counter)
for split, dataset in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] += 1

pd.DataFrame.from_dict(split2freqs, orient="index")

| | LOC | ORG | PER |
|---|---|---|---|
| train | 5879 | 5114 | 5550 |
| validation | 2947 | 2580 | 2765 |
| test | 2921 | 2505 | 2855 |


# 2. Multilingual Transformers

XLM-R is a strong choice for multilingual NLU tasks because it was pretrained on text in 100 languages. The RoBERTa part of the model's name refers to the fact that its pretraining approach is the same as for the monolingual RoBERTa models. RoBERTa improves on several aspects of BERT, in particular by removing the next sentence prediction task. XLM-R also drops the WordPiece tokenizer in favor of the SentencePiece tokenizer, which is trained on the raw text of all one hundred languages and supports a much larger vocabulary (250,000 tokens versus 55,000 tokens).

## 2.1. A closer look at tokenization

A `tokenizer` is actually a full processing pipeline that usually consists of four steps:

<img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/884293a3ab50071aa2afd1329ecf4a24f0793333//images/chapter04_tokenizer-pipeline.png" width="600" alt="The four steps of the tokenization pipeline">

1. `Normalization`: The text is normalized, for example by removing accents and converting the text to lowercase. This technique can reduce the size of the vocabulary by merging tokens that differ only by case or accent.
2. `Pretokenization`: The text is split into words, punctuation marks and other tokens, such as numbers or, in the case of Chinese, characters. This step is language dependent and is usually performed by a rule-based tokenizer. The words are then split into subwords with Byte-Pair Encoding (BPE) or Unigram algorithms in the next step of the pipeline.
3. `Tokenizer model`: A subword splitting model is applied to the words to reduce both the size of the vocabulary and the number of out-of-vocabulary tokens. This is the part of the pipeline that needs to be trained on a large corpus. At this point, the strings are converted to integers (`input IDs`). Several subword tokenization algorithms exist, including BPE, Unigram, and WordPiece.
4. `Postprocessing`: Additional transformations are applied to the list of tokens, such as adding special tokens or truncating the sequence. This sequence of integers is then fed to the model.
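
The four steps can be sketched end to end with a toy tokenizer written in plain Python. Everything here is an assumption made for illustration: `TOY_VOCAB` is a hand-picked vocabulary rather than a trained one, and the subword step uses a greedy longest-match split in the style of WordPiece inference rather than a real tokenizer model. For brevity, the conversion to input IDs is done once at the end, after the special tokens have been added.

```python
import re
import unicodedata

# hypothetical toy vocabulary, chosen by hand for this illustration only
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "jack": 3, "spa": 4,
             "##rrow": 5, "loves": 6, "new": 7, "york": 8, "!": 9}


def normalize(text):
    """Step 1: strip accents and convert to lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()


def pretokenize(text):
    """Step 2: split the text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def subword_split(word, vocab):
    """Step 3: greedy longest-match-first split into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # pieces that do not start a word carry the ## continuation prefix
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # nothing matched at this position: treat the whole word as unknown
            return ["[UNK]"]
        start = end
    return pieces


def postprocess(pieces):
    """Step 4: add the special tokens."""
    return ["[CLS]"] + pieces + ["[SEP]"]


def toy_tokenize(text, vocab=TOY_VOCAB):
    words = pretokenize(normalize(text))
    pieces = [piece for word in words for piece in subword_split(word, vocab)]
    tokens = postprocess(pieces)
    return tokens, [vocab[token] for token in tokens]


toy_tokens, toy_ids = toy_tokenize("Jáck Sparrow loves New York!")
print(toy_tokens)
print(toy_ids)
```

The accented input is there to exercise the normalization step: after accent stripping and lowercasing, "Jáck" maps to the same vocabulary entry as "jack".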

## 2.2. The SentencePiece Tokenizer

The SentencePiece tokenizer is a subword tokenizer that uses the unigram algorithm and encodes each input text as a sequence of Unicode characters, allowing it to be agnostic about language. It also assigns the Unicode symbol U+2581 (or ▁) to whitespace characters, which allows the model to distinguish between words and subwords and convert back to raw text without any ambiguity.
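
The role of the ▁ marker can be illustrated with a small sketch that does not use the SentencePiece library at all. The functions `mark_whitespace`, `chunk`, and `detokenize` are hypothetical helpers, and the fixed-length chunking is only a stand-in for a real Unigram segmentation; real SentencePiece also normalizes the text, which this sketch skips. The point is that once whitespace is stored inside the pieces, any segmentation can be reversed exactly.

```python
WS = "\u2581"  # the ▁ symbol (U+2581)


def mark_whitespace(text):
    """Prefix the text with ▁ and replace every space with ▁."""
    return WS + text.replace(" ", WS)


def chunk(marked, size=4):
    """Cut the marked string into arbitrary pieces (stand-in for a subword model)."""
    return [marked[i:i + size] for i in range(0, len(marked), size)]


def detokenize(pieces):
    """Join the pieces, restore the spaces, and drop the leading prefix space."""
    return "".join(pieces).replace(WS, " ")[1:]


# the two inputs differ only in the space before the exclamation mark
for text in ["New York!", "New York !"]:
    pieces = chunk(mark_whitespace(text))
    print(pieces, "->", repr(detokenize(pieces)))
```

A tokenizer that discards whitespace during pretokenization cannot tell these two inputs apart after tokenization, whereas here both are recovered unchanged.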

In [10]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

In [11]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

print(f"BERT: {bert_tokens}")
print(f"XLMR: {xlmr_tokens}")

BERT: ['[CLS]', 'Jack', 'Spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]']
XLMR: ['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>']


In [12]:
"".join(xlmr_tokens).replace(u"\u2581", " ")

'<s> Jack Sparrow loves New York!</s>'

# 3. Transformers for Named Entity Recognition

For text classification, an encoder-only model such as BERT uses the special `[CLS]` token to represent an entire sequence of text. This representation is then fed through a fully connected (dense) layer to output a distribution over the discrete label values.

<img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/884293a3ab50071aa2afd1329ecf4a24f0793333//images/chapter04_clf-architecture.png" width="600" alt="Text classification with BERT">

For named entity recognition (NER), these models take a similar approach, except that the representation of each individual input token is fed into the same fully connected layer to output the entity of the token. For this reason, NER can be seen as a *token classification task*.

<img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/884293a3ab50071aa2afd1329ecf4a24f0793333//images/chapter04_ner-architecture.png" width="600" alt="Named entity recognition with BERT">
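
The difference between the two setups comes down to which hidden states are passed to the dense layer. The sketch below uses plain Python lists and hand-picked numbers (all hypothetical) instead of a real encoder, so that the shapes are easy to follow: sequence classification applies the layer to the first token only, token classification applies the same layer to every token.

```python
def linear(vector, weights, bias):
    """Dense layer: returns one score per label for a single hidden state."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# hypothetical encoder output: 3 tokens, hidden size 2
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# hypothetical classifier parameters: 2 labels, hidden size 2
weights = [[1.0, -1.0], [-1.0, 1.0]]
bias = [0.0, 0.0]

# text classification: only the first ([CLS]) representation is classified
sequence_logits = linear(hidden_states[0], weights, bias)

# token classification: the same layer is applied to every token
token_logits = [linear(h, weights, bias) for h in hidden_states]

print(sequence_logits)  # one row of label scores for the whole sequence
print(token_logits)     # one row of label scores per token
```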

# 4. The Anatomy of the 🤗 Transformers Model Class

🤗 Transformers is organized around dedicated classes for each architecture and task. The model classes associated with different tasks are named according to a `<ModelName>For<Task>` or `AutoModelFor<Task>` convention. In the case that there is no dedicated model class for a given task, existing models can be extended by using helper functions that add a task-specific head to the base model.

## 4.1. Bodies and Heads

The main concept that makes 🤗 Transformers so versatile is the split of the architecture into a *body* and *head*. When we switch from the pretraining task to the downstream task, we need to replace the last layer of the model with one that is suitable for the task. This last layer is called the model head and is *task-specific*. The rest of the model is called the body and is *task-agnostic*: it includes the token embeddings and transformer layers.

<img src="https://raw.githubusercontent.com/nlp-with-transformers/notebooks/884293a3ab50071aa2afd1329ecf4a24f0793333//images/chapter04_bert-body-head.png" width="400" alt="The body and head of a transformer model">

## 4.2. Creating a custom model **head** for token classification

In [40]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel, RobertaPreTrainedModel

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    """
    XLM-RoBERTa model for token-level classification.

    Inherits from `RobertaPreTrainedModel` and adds a token-level classifier on top of the RoBERTa model.

    Args:
        config (:class:`~transformers.XLMRobertaConfig`):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.

    `config_class` ensures that the standard XLM-R settings are used when a new model is initialized. The default parameters can be overwritten in the configuration object.

    `self.roberta` loads the XLM-R model body. The `add_pooling_layer` argument is set to `False` to ensure all hidden states are returned and not only the one associated with the [CLS] token.

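    `init_weights()`, inherited from `RobertaPreTrainedModel`, initializes the model's weights. When the model is created with `from_pretrained()`, the pretrained weights of the XLM-R model body are then loaded from the checkpoint, while the classification head keeps its random initialization.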
    """

    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.roberta = RobertaModel(config, add_pooling_layer=False)           
        self.dropout = nn.Dropout(config.hidden_dropout_prob)                # add dropout layer 
        self.classifier = nn.Linear(config.hidden_size, self.num_labels)     # add standard feed-forward layer
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        # use model body to get encoder representations
        outputs = self.roberta(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, **kwargs)
        print("outputs[0].shape: ", outputs[0].shape)
        print("outputs[0]: ", outputs[0])

        # apply classifier to encoder representations
        sequence_output = self.dropout(outputs[0])
        print("sequence_output: ", sequence_output.shape)
        logits = self.classifier(sequence_output)
        print("logits: ", logits.shape)

        # calculate losses
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        # return model output object
        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
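
The loss computation in `forward` flattens the batch and sequence dimensions so that every token becomes one classification example. The following is a minimal pure-Python sketch of that idea, not the PyTorch implementation. The function name `token_cross_entropy` is hypothetical. It also skips labels equal to -100, which is the default `ignore_index` of `nn.CrossEntropyLoss`; whether such labels occur depends on how the labels are prepared, which this section does not cover.

```python
import math


def token_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over all tokens in a batch, skipping ignored labels.

    logits: nested list of shape [batch][seq_len][num_labels]
    labels: nested list of shape [batch][seq_len]
    """
    # flatten batch and sequence dimensions, like logits.view(-1, num_labels)
    flat_logits = [scores for sequence in logits for scores in sequence]
    flat_labels = [label for sequence in labels for label in sequence]

    losses = []
    for scores, label in zip(flat_logits, flat_labels):
        if label == ignore_index:
            continue  # this token does not contribute to the loss
        log_normalizer = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_normalizer - scores[label])
    # assumes at least one label is not ignored
    return sum(losses) / len(losses)


# one sequence of three tokens with uniform scores over 7 labels
uniform_logits = [[[0.0] * 7, [0.0] * 7, [0.0] * 7]]
example_labels = [[-100, 1, 5]]
example_loss = token_cross_entropy(uniform_logits, example_labels)
print(example_loss)
```

With uniform scores, every label has probability 1/7, so the loss equals log(7) regardless of which labels are ignored.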

## 4.3. Loading a custom model

The `AutoConfig` class contains the blueprint of a model's architecture. It is used to instantiate a model with the correct hyperparameters, such as the number of layers, the size of the hidden states, and the number of attention heads.

When a model is loaded with `AutoModel.from_pretrained(model_ckpt)`, the configuration file associated with that model is downloaded automatically. However, if we want to modify the configuration, we can load the configuration file separately with the customized parameters and pass it to the model.

In [28]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
tags

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], id=None)

In [29]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

print(f"index2tag: {index2tag}")
print(f"tag2index: {tag2index}")

index2tag: {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC'}
tag2index: {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6}


In [30]:
from transformers import AutoConfig

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, num_labels=len(tags.names), id2label=index2tag, label2id=tag2index)

In [41]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

xlmr_model = XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device)

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [32]:
# prepare data for testing the model initialization
input_ids = xlmr_tokenizer("Jack Sparrow loves New York!", return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids["input_ids"].squeeze().numpy()], index=["Tokens", "Input IDs"])

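| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | `<s>` | `▁Jack` | `▁Spar` | `row` | `▁love` | `s` | `▁New` | `▁York` | `!` | `</s>` |
| Input IDs | 0 | 21763 | 37456 | 15555 | 5161 | 7 | 2356 | 5753 | 38 | 2 |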


In [42]:
# quick check that the model and tokenizer were initialized correctly
outputs = xlmr_model(**input_ids.to(device)).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of logits: {outputs.shape}")
print(f"Predictions: {predictions}")
print(f"Predicted tags: {[index2tag[prediction.item()] for prediction in predictions[0]]}")

outputs[0].shape:  torch.Size([1, 10, 768])
outputs[0]:  tensor([[[ 0.1883,  0.1270,  0.0512,  ..., -0.1185,  0.1011, -0.0155],
         [ 0.0004,  0.0413, -0.0279,  ...,  0.0284,  0.0168,  0.0749],
         [ 0.0485,  0.0551,  0.0085,  ...,  0.0673, -0.0997,  0.1789],
         ...,
         [-0.0049,  0.0067,  0.0732,  ...,  0.2713, -0.0115,  0.2732],
         [ 0.0185,  0.0094,  0.0372,  ..., -0.1094,  0.0249,  0.1612],
         [ 0.1977,  0.1146, -0.0694,  ..., -0.3103, -0.0388,  0.0689]]],
       grad_fn=<NativeLayerNormBackward0>)
sequence_output:  torch.Size([1, 10, 768])
logits:  torch.Size([1, 10, 7])
Number of tokens in sequence: 10
Shape of logits: torch.Size([1, 10, 7])
Predictions: tensor([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]])
Predicted tags: ['I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER']
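
The same tag is predicted for every token because the classification head is still randomly initialized, so these predictions carry no meaning yet. The decoding step itself (an argmax over the label dimension followed by a lookup in `index2tag`) can be sketched in plain Python as follows. The helper `decode_tags` and the toy scores are hypothetical and only mirror what the cell above does with `torch.argmax`.

```python
def decode_tags(logits, index2tag):
    """Pick the highest-scoring label for each token and map it to its tag name."""
    tags = []
    for scores in logits:
        best_index = max(range(len(scores)), key=scores.__getitem__)
        tags.append(index2tag[best_index])
    return tags


index2tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

# hypothetical scores for a sequence of three tokens
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 1.5, 0.0, 0.0, 0.0, 0.0],
    [3.0, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0],
]
decoded = decode_tags(toy_logits, index2tag)
print(decoded)
```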


# 6. Tokenizing Texts for NER

# 7. Performance Measures

# 8. Fine-Tuning XLM-RoBERTa

# 9. Error Analysis

# 10. Cross-Lingual Transfer