<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/04-multilingual-ner/multilingual_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Multilingual Named Entity Recognition

In this notebook we will explore how a single Transformer model called XLM-RoBERTa can be fine-tuned to
perform named entity recognition (NER) across several languages. NER is a common NLP task that identifies
entities like people, organizations, or locations in text. These entities can be used for various applications such as
gaining insights from company documents, augmenting the quality of search engines, or simply building a
structured database from a corpus.

##Setup

In [None]:
%%shell

pip -q install transformers
pip -q install datasets

In [2]:
import pandas as pd
import numpy as np

import torch
import torch.nn as nn

from transformers import AutoTokenizer
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel
from transformers import AutoConfig
from transformers import TrainingArguments

from datasets import get_dataset_config_names
from datasets import load_dataset
from datasets import DatasetDict

from itertools import chain
from collections import defaultdict
from collections import Counter

from IPython.display import HTML, display, set_matplotlib_formats

In [3]:
def display_df(df, max_cols=15, header=True, index=True):
    # 15 columns seems to be the limit for O'Reilly
    return display(HTML(df.to_html(header=header, index=index, max_cols=max_cols)))

##The Dataset

We will be using a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders
(XTREME) benchmark called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many
languages, including the four most commonly spoken languages in Switzerland: German (62.9%), French (22.9%),
Italian (8.4%), and English (5.9%).

Each article is annotated with LOC (location), PER (person) and ORG
(organization) tags in the “inside-outside-beginning” (IOB2) format, where a B-prefix indicates the beginning of
an entity, and consecutive positions of the same entity are given an I- prefix. An O tag indicates that the token does
not belong to any entity. 

For example, the following sentence is tagged in IOB2 format:



In [4]:
tokens = "Jeff Dean is a computer scientist at Google in California".split()
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]

df = pd.DataFrame(data=[tokens, labels], index=["Tokens", "Tags"])
display_df(df, header=None)

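| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | Jeff | Dean | is | a | computer | scientist | at | Google | in | California |
| Tags | B-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC |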
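
To make the IOB2 convention concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper `iob2_to_spans` is a hypothetical name introduced for illustration and is not part of any library. It assumes well-formed tags and simply closes an open entity when it meets an `O` tag or an `I-` tag of a different type.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # An O tag (or a mismatched I- tag) closes any open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


example_tokens = "Jeff Dean is a computer scientist at Google in California".split()
example_tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]

spans = iob2_to_spans(example_tokens, example_tags)
print(spans)
```

For the sentence above this yields one `PER` span ("Jeff Dean"), one `ORG` span ("Google"), and one `LOC` span ("California").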


To load PAN-X with HuggingFace Datasets we first need to manually download the file AmazonPhotos.zip from
XTREME’s [Amazon Cloud Drive](https://www.amazon.com/clouddrive/share/d3KGCRCIYwhKJF0H3eWA26hjg2ZCRhjpEQtDL70FSBN), and place it in a local directory (data in our example).

For example, to load the
German corpus we use the “de” code as follows:

In [23]:
load_dataset("xtreme", "PAN-X.de", data_dir="data")

Using custom data configuration PAN-X.de-data_dir=data
Reusing dataset xtreme (/root/.cache/huggingface/datasets/xtreme/PAN-X.de-data_dir=data/1.0.0/2fc6b63c5326cc0d1f73060649612889b3a7ed8a6605c91cecdbd228a7158b17)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
})

In this case, `load_dataset` returns a `DatasetDict` where each key corresponds to one of the splits, and each
value is a `Dataset` object with `features` and `num_rows` attributes.

To keep track of each language, let’s create a Python `defaultdict` that stores the language code as the key and
a PAN-X corpus of type `DatasetDict` as the value:

In [None]:
languages = ["de", "fr", "it", "en"]
fractions = [0.629, 0.229, 0.084, 0.059]

# return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(languages, fractions):
  # load monolingual corpus
  ds = load_dataset("xtreme", f"PAN-X.{lang}", data_dir="data")
  # shuffle and downsample each split according to spoken proportion
  for split in ds.keys():
    panx_ch[lang][split] = (ds[split].shuffle(seed=0).select(range(int(frac * ds[split].num_rows))))

Here we’ve used the `Dataset.shuffle` method to make sure we don’t accidentally bias our dataset splits,
while `Dataset.select` allows us to downsample each corpus according to the values in `fractions`.

Let’s have a
look at how many examples we have per language in the training sets by accessing the `Dataset.num_rows`
attribute:

In [7]:
pd.DataFrame({lang:[panx_ch[lang]["train"].num_rows] for lang in languages}, index=["Number of training examples"])

| | de | fr | it | en |
| --- | --- | --- | --- | --- |
| Number of training examples | 12580 | 4580 | 1680 | 1180 |
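
The shuffle-then-select pattern can be sketched without the Datasets library, using plain Python lists as a stand-in for a corpus. This is only an illustration: the `downsample` helper is hypothetical, and the sketch assumes every training split has 20,000 rows, as the German split shown earlier does.

```python
import random

fractions = {"de": 0.629, "fr": 0.229, "it": 0.084, "en": 0.059}
num_rows = 20000  # assumption: size of each monolingual training split


def downsample(rows, frac, seed=0):
    """Shuffle a copy of rows, then keep the first int(frac * len(rows)) items."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[: int(frac * len(rows))]


# Row indices stand in for the actual examples
sizes = {
    lang: len(downsample(list(range(num_rows)), frac))
    for lang, frac in fractions.items()
}
print(sizes)
```

Shuffling before selecting matters: taking the first rows of an unshuffled corpus could over-represent whatever ordering the source files happen to have.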


In [8]:
# Let’s inspect one of the examples in the German corpus
panx_ch["de"]["train"][0]

{'langs': ['de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'tokens': ['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.']}

In particular, we see that the `ner_tags` column stores each entity tag as an integer class ID. This is a bit cryptic to the human eye,
so let’s create a new column with the familiar `LOC`, `PER`, and `ORG` tags.

To do this, the first thing to notice is that
our `Dataset` object has a `features` attribute that specifies the underlying data types associated with each
column:

In [9]:
panx_ch["de"]["train"].features

{'langs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None), length=-1, id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

The `Sequence` class specifies that the field contains a list of features, which in the case of `ner_tags`
corresponds to a list of `ClassLabel` features.

Let’s pick out this feature from the training set as follows:

In [10]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
tags

ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None)

In [11]:
langs = panx_ch["de"]["train"].features["langs"].feature
langs

Value(dtype='string', id=None)

In [12]:
tokens = panx_ch["de"]["train"].features["tokens"].feature
tokens

Value(dtype='string', id=None)

One handy property of the `ClassLabel` feature is that it has methods to convert from the class name
to an integer and vice versa.

For example, we can find the integer associated with the `B-PER` tag by using the `ClassLabel.str2int` function as follows:

In [13]:
tags.str2int("B-PER")

1

In [14]:
tags.str2int("I-PER")

2

Similarly, we can map back from an integer to the corresponding class name:

In [15]:
tags.int2str(1)

'B-PER'

In [16]:
tags.int2str(3)

'B-ORG'
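
Conceptually, these two conversions are just a pair of lookup tables built from the ordered list of class names. The following is a plain-Python stand-in, not the library's implementation; the names `index2tag` and `tag2index` are chosen here for illustration.

```python
# Class names in the same order as reported by the ClassLabel feature
names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Integer -> name and name -> integer lookups
index2tag = dict(enumerate(names))
tag2index = {tag: idx for idx, tag in index2tag.items()}

# Decode the integer tags of the first German training example
ner_tags = [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
decoded = [index2tag[idx] for idx in ner_tags]
print(decoded)
```

The position of each name in the list determines its integer ID, so the order of `names` must match the dataset's.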

Let’s use the `ClassLabel.int2str` function to create a new column in our training set with class names for
each tag. 

We’ll pass `map` a function that returns a dict with the key corresponding to the new column
name and the value as a list of class names. Since we call `map` without `batched=True`, the function receives one example at a time:

In [17]:
def create_tag_names(example):
  # map is called without batched=True, so this receives a single example
  return {"ner_tags_str": [tags.int2str(idx) for idx in example["ner_tags"]]}

In [None]:
panx_de = panx_ch["de"].map(create_tag_names)

In [19]:
panx_de["train"][0]

{'langs': ['de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de',
  'de'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'ner_tags_str': ['O',
  'O',
  'O',
  'O',
  'B-LOC',
  'I-LOC',
  'O',
  'O',
  'B-LOC',
  'B-LOC',
  'I-LOC',
  'O'],
 'tokens': ['2.000',
  'Einwohnern',
  'an',
  'der',
  'Danziger',
  'Bucht',
  'in',
  'der',
  'polnischen',
  'Woiwodschaft',
  'Pommern',
  '.']}

Now that we have our tags in human-readable format, let’s see how the tokens and tags align for the first example in the training set:

In [20]:
de_example = panx_de["train"][0]
df = pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]], ["Tokens", "Tags"])
display_df(df, header=None)

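| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |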


As a sanity check that we don’t have any unusual imbalance in the tags, let’s calculate the frequencies of each
entity across each split:

In [21]:
split2freqs = {}

for split in panx_de.keys():
  tag_names = []
  for row in panx_de[split]["ner_tags_str"]:
    tag_names.append([t.split("-")[1] for t in row if t.startswith("B")])
  
  split2freqs[split] = Counter(chain.from_iterable(tag_names))

pd.DataFrame.from_dict(split2freqs, orient="index")

| | ORG | LOC | PER |
| --- | --- | --- | --- |
| validation | 2683 | 3172 | 2893 |
| test | 2573 | 3180 | 3071 |
| train | 5366 | 6186 | 5810 |


This looks good: the distributions of the `PER`, `LOC`, and `ORG` frequencies are roughly the same for each split, so
the validation and test sets should provide a good measure of our NER tagger’s ability to generalize.
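
Because the splits differ in size, raw counts are easier to compare once converted to shares. The sketch below does this for the counts reported above; the helper `relative_frequencies` is a hypothetical name used only for this illustration.

```python
# Entity counts per split, copied from the table above
counts = {
    "validation": {"ORG": 2683, "LOC": 3172, "PER": 2893},
    "test": {"ORG": 2573, "LOC": 3180, "PER": 3071},
    "train": {"ORG": 5366, "LOC": 6186, "PER": 5810},
}


def relative_frequencies(split_counts):
    """Convert raw entity counts into shares that sum to 1."""
    total = sum(split_counts.values())
    return {tag: n / total for tag, n in split_counts.items()}


shares = {split: relative_frequencies(c) for split, c in counts.items()}

# Largest difference in share across splits, per entity type
max_gap = {
    tag: max(s[tag] for s in shares.values()) - min(s[tag] for s in shares.values())
    for tag in ["ORG", "LOC", "PER"]
}

for split, s in shares.items():
    print(split, {tag: round(v, 3) for tag, v in s.items()})
print({tag: round(gap, 3) for tag, gap in max_gap.items()})
```

With these counts, the share of each entity type differs by less than two percentage points between any two splits.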

##Training a Named Entity Recognition Tagger

We know that for text classification, BERT uses the special `[CLS]` token to represent an entire sequence of text.

This representation is then fed through a fully connected
or dense layer to output the distribution of all the discrete label values.

BERT and other encoder
Transformers take a similar approach for NER, except that the representation of every input token is fed into the
same fully-connected layer to output the entity of the token.

For this reason, NER is often framed as a token
classification task.

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/04-multilingual-ner/images/1.png?raw=1' width='600'/>

So far, so good, but how should we handle subwords in a token classification task?

For example, the last name `Sparrow` is tokenized by WordPiece into the subwords `Spa` and `##rrow`, so which one (or both)
should be assigned the `I-PER` label?

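A common convention, following the BERT paper, is to assign the label to the first subword (`Spa` in our example) and ignore the following subword (`##rrow`).

Although we could have chosen to include the representation from the `##rrow`
subword by assigning it a copy of the `I-PER` label, this introduces extra complexity when subwords are associated
with a `B-` tag, because then we need to copy these tags across subwords, and this violates the IOB2 format.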
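
One common way to handle subwords, described in the BERT paper, is to keep the label on the first subword of each word and mask out the rest so they do not contribute to the loss. The sketch below illustrates that idea under a few assumptions: the helper `align_labels` is hypothetical, the `word_ids` list is hard-coded to mimic the subword-to-word mapping a tokenizer would provide, and `-100` is used as the mask value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumption: default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_labels, word_ids):
    """Keep each word's label on its first subword; mask everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special tokens and continuation subwords are masked
            aligned.append(IGNORE_INDEX)
        else:
            # First subword of a new word inherits the word's label
            aligned.append(word_labels[word_id])
        previous_word = word_id
    return aligned


# "Jack Sparrow" tokenized as: [CLS] Jack Spa ##rrow [SEP]
word_labels = [1, 2]  # B-PER, I-PER
word_ids = [None, 0, 1, 1, None]  # hard-coded for illustration

aligned = align_labels(word_labels, word_ids)
print(aligned)
```

Here `Spa` keeps the `I-PER` label, while `##rrow` and the two special tokens are masked.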

Fortunately, all this intuition from BERT carries over to XLM-R, since its architecture is based on RoBERTa,
whose architecture is identical to BERT’s. However, there are some slight differences, especially around the choice of tokenizer.

###SentencePiece Tokenization

Instead of using a WordPiece tokenizer, XLM-R uses a tokenizer called SentencePiece that is trained on the raw
text of all 100 languages in its pretraining corpus. The SentencePiece tokenizer is based on a type of subword segmentation called Unigram
and encodes input text as a sequence of Unicode characters.

This last feature is especially useful for multilingual
corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages
like Japanese do not have whitespace characters.

To get a feel for how `SentencePiece` compares to `WordPiece`, let’s load the BERT and `XLM-R` tokenizers in the
usual way with `Transformers`:

In [None]:
bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"

bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)

By encoding a small sequence of text we can also retrieve the special tokens that each model used during
pretraining:

In [24]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xmlr_tokens = xlmr_tokenizer(text).tokens()

In [27]:
df = pd.DataFrame([bert_tokens, xmlr_tokens], ["BERT", "XLM-R"])
display_df(df, header=None)

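| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | `[CLS]` | `Jack` | `Spa` | `##rrow` | `loves` | `New` | `York` | `!` | `[SEP]` | |
| XLM-R | `<s>` | `▁Jack` | `▁Spar` | `row` | `▁love` | `s` | `▁New` | `▁York` | `!` | `</s>` |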


Here we see that instead of the `[CLS]` and `[SEP]` tokens that BERT uses for sentence classification tasks, XLM-R
uses `<s>` and `</s>` to denote the start and end of a sequence.

Another special feature of SentencePiece is that it
treats raw text as a sequence of Unicode characters, with whitespace represented by the Unicode symbol `U+2581`, the `▁`
character (also called the lower one eighth block). By assigning a special symbol for whitespace, SentencePiece is able to detokenize a sequence without
ambiguities.

We can see that WordPiece has lost the information that there is no whitespace
between `York` and `!`. 

By contrast, SentencePiece preserves the whitespace in the tokenized text so we can
convert back to the raw text without ambiguity:

In [31]:
"".join(xmlr_tokens).replace("▁", " ")

'<s> Jack Sparrow loves New York!</s>'
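
To make the contrast concrete, the sketch below detokenizes both token sequences, with special tokens removed and the tokens copied from the output above. The WordPiece joining rule used here (join with spaces, merge `##` continuations) is a simple heuristic written for illustration, not the library's decoder; both helper names are hypothetical.

```python
bert_tokens = ["Jack", "Spa", "##rrow", "loves", "New", "York", "!"]
xlmr_tokens = ["\u2581Jack", "\u2581Spar", "row", "\u2581love", "s", "\u2581New", "\u2581York", "!"]


def naive_wordpiece_join(tokens):
    """Merge ## continuations into the previous word, then join with spaces."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


def sentencepiece_join(tokens):
    """Concatenate the pieces and turn the U+2581 marker back into spaces."""
    return "".join(tokens).replace("\u2581", " ").strip()


bert_text = naive_wordpiece_join(bert_tokens)
xlmr_text = sentencepiece_join(xlmr_tokens)
print(repr(bert_text))
print(repr(xlmr_text))
```

The WordPiece tokens alone cannot tell us whether a space preceded `!`, so the naive join inserts one, whereas the SentencePiece tokens reproduce the original string exactly.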

##Transformers Model Class Anatomy