# NER 

In [1]:
!pip install datasets



In [2]:
from datasets import get_dataset_config_names

xtreme_subsets = get_dataset_config_names("google/xtreme")
print(f"XTREME has {len(xtreme_subsets)} dataset subsets")


XTREME has 183 dataset subsets


In [3]:
panx_subsets = [s for s in xtreme_subsets if s.startswith("PAN")]
panx_subsets[:3]

['PAN-X.af', 'PAN-X.ar', 'PAN-X.bg']

### Each subset has a two-letter suffix that appears to be a language code
### To load the German corpus, we pass the subset name with the `de` suffix (`PAN-X.de`) as the `name` argument of `load_dataset()`

In [4]:
from datasets import load_dataset
panx = load_dataset("google/xtreme", "PAN-X.de")




Since our problem is NER in Switzerland, we sample the German, French, Italian, and English corpora according to the proportions in which those languages are spoken there.
The result is an imbalanced dataset. This simulates a common situation in multilingual applications and lets us see how to build a model that works across all of these languages.

### Let's create a Python defaultdict that stores the language code as the key and a PAN-X corpus of type DatasetDict as the value

In [5]:
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]

# Returns a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # Load the monolingual corpus
    ds = load_dataset("google/xtreme", name=f"PAN-X.{lang}")
    # Shuffle and downsample each split according to the spoken proportion
    for split in ds:
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))


In [6]:
import pandas as pd
pd.DataFrame(
    {lang: panx_ch[lang]["train"].num_rows for lang in langs},
    index=["# Train examples"]
)

| | de | fr | it | en |
|---|---|---|---|---|
| # Train examples | 12580 | 4580 | 1680 | 1180 |
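
The row counts above can be reproduced from the sampling rule alone. The sketch below repeats the `int(frac * num_rows)` arithmetic from the sampling loop, assuming each PAN-X training split holds 20,000 examples (the size reported when the splits were generated). `full_train_rows` and `sampled_rows` are names introduced here for illustration.

```python
# Language codes and fractions from the sampling cell above
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]

# Assumption: every PAN-X training split has 20,000 rows
full_train_rows = 20_000

# Same truncating arithmetic as .select(range(int(frac * num_rows)))
sampled_rows = {
    lang: int(frac * full_train_rows) for lang, frac in zip(langs, fracs)
}
print(sampled_rows)
```

Because `int()` truncates, a split is never sampled above its target fraction.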


We have far more German examples than examples in any other language, so we use German as the starting point for *zero-shot cross-lingual transfer* to French, Italian, and English.

Let's inspect one of the examples in the German corpus.


In [7]:
element = panx_ch["de"]["train"][0]

for key,value in element.items():
    print(f"{key}: {value}")

tokens: ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der', 'polnischen', 'Woiwodschaft', 'Pommern', '.']
ner_tags: [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0]
langs: ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de']


The integer `ner_tags` are cryptic, so let's create a new column with the readable LOC, PER, and ORG tag names.

A Dataset object has a `features` attribute that specifies the underlying data types associated with each column.

In [8]:
for key,value in panx_ch["de"]["train"].features.items():
    print(f"{key}: {value}")

tokens: List(Value('string'))
ner_tags: List(ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']))
langs: List(Value('string'))


In [9]:
tags = panx_ch["de"]["train"].features["ner_tags"].feature
print(tags)

ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'])


We can use the `ClassLabel.int2str()` method to create a new column in our training set with the class name for each tag.

We will use `map()` with a function that returns a dict whose key is the new column name and whose value is the list of class names.

In [10]:
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)


In [11]:
de_example = panx_de["train"][0]
pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],["Tokens", "Tags"])

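| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |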
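
To make the tagging scheme concrete, the sketch below groups the example's tokens into entity spans: a `B-` tag opens an entity and the following `I-` tags of the same type extend it. `bio_to_spans` is a hypothetical helper written for this note, not part of `datasets`. It simply drops an `I-` tag that does not continue an open entity, which is one possible convention among several.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the open one, if any
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open entity
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht", "in",
          "der", "polnischen", "Woiwodschaft", "Pommern", "."]
tag_names = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O",
             "O", "B-LOC", "B-LOC", "I-LOC", "O"]
spans = bio_to_spans(tokens, tag_names)
print(spans)
```

Note that "polnischen" and "Woiwodschaft Pommern" come out as two separate LOC entities, because the second `B-LOC` starts a new span.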


We should check for any unusual imbalance in the tags.

Let's calculate the frequency of each entity type in each split.

In [12]:
from collections import Counter

split2freqs = defaultdict(Counter)
for split,dataset  in panx_de.items():
    for row in dataset["ner_tags_str"]:
        for tag in row:
            if tag.startswith("B"):
                tag_type = tag.split("-")[1]
                split2freqs[split][tag_type] +=1

pd.DataFrame.from_dict(split2freqs, orient= "index") 

| | LOC | ORG | PER |
|---|---|---|---|
| train | 6186 | 5366 | 5810 |
| validation | 3172 | 2683 | 2893 |
| test | 3180 | 2573 | 3071 |


Now we can turn to a multilingual transformer.

XLM-R is a strong choice for multilingual NLU tasks.

XLM-R uses a SentencePiece tokenizer trained on the raw text of 100 languages.



## Comparison between SentencePiece and WordPiece

In [13]:
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)



In [14]:
text = "Jack sparrow loves New York!"

bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()
print("BERT tokens:", bert_tokens)
print("XLM-R tokens:", xlmr_tokens)


BERT tokens: ['[CLS]', 'Jack', 'spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]']
XLM-R tokens: ['<s>', '▁Jack', '▁spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>']
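
One practical difference visible in this output is that SentencePiece marks word starts with the `▁` character (U+2581) instead of discarding whitespace, so the original string can be recovered from the pieces. Here is a minimal sketch using the XLM-R tokens printed above with the `<s>` and `</s>` markers dropped. `detokenize` is a hypothetical helper, not a tokenizer method.

```python
def detokenize(pieces):
    """Rebuild text from SentencePiece pieces by turning U+2581 back into spaces."""
    return "".join(pieces).replace("\u2581", " ").strip()


# XLM-R pieces from the output above, without the special tokens
xlmr_pieces = ["\u2581Jack", "\u2581spar", "row", "\u2581love", "s",
               "\u2581New", "\u2581York", "!"]
restored = detokenize(xlmr_pieces)
print(restored)
```

The WordPiece output cannot be inverted the same way in general, because `##` only marks continuation pieces and the whitespace around punctuation is not recorded.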


## Creating a custom model for Token Classification



In [15]:
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel, RobertaModel

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        # Load model body
        self.roberta = RobertaModel(config, add_pooling_layer=False) 
        # Token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)   
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Initialize weights
        self.init_weights()
    
    def forward(self,input_ids=None, attention_mask=None, token_type_ids=None, labels=None, **kwargs):
        # Use model body to get hidden states
        outputs = self.roberta(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, **kwargs)
        # Apply dropout to hidden states
        sequence_output = self.dropout(outputs[0])
        # Get logits from classifier
        logits = self.classifier(sequence_output)
        loss = None
        if labels is not None:
            # Define loss function
            loss_fct = nn.CrossEntropyLoss()
            # Compute loss, ignoring padded tokens (label = -100)
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
    

With `super().__init__(config)` we call the initialization function of the `RobertaPreTrainedModel` class.

During the forward pass, the data is first fed through the model body. The body accepts a number of input variables, but the only ones we need at this stage are `input_ids` and `attention_mask`.

The hidden state, which is part of the model body's output, is then fed through the dropout and classification layers.

If we also pass labels, we can calculate the loss.

The loss should be computed only on tokens that carry a real label. Here that is handled through the labels rather than the attention mask: positions labeled -100 are skipped because -100 is the default `ignore_index` of `nn.CrossEntropyLoss`.

Finally, we wrap the outputs in a `TokenClassifierOutput` object.
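
To see what skipping label -100 means for the loss, here is a pure-Python sketch of a mean cross-entropy that leaves out ignored positions. It is meant to mirror the default behavior of `nn.CrossEntropyLoss` (mean reduction, `ignore_index=-100`) on a toy input. The helper names are made up for this note, and the snippet is an illustration rather than the PyTorch implementation.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def token_cross_entropy(logits, label):
    """Negative log-softmax of the true class, via a stable log-sum-exp."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def masked_mean_loss(all_logits, labels):
    """Average the per-token loss over positions whose label is not ignored."""
    losses = [
        token_cross_entropy(logits, label)
        for logits, label in zip(all_logits, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


# Three token positions, seven classes; the middle position is masked out
toy_logits = [
    [0.0] * 7,
    [9.0, -3.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # ignored, so its values do not matter
    [0.0] * 7,
]
toy_labels = [0, IGNORE_INDEX, 5]
loss = masked_mean_loss(toy_logits, toy_labels)
print(loss, math.log(7))
```

With uniform logits at the two kept positions, the loss equals log(7), which is roughly what an untrained seven-class head should produce.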

In [16]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
print(index2tag)
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
print(tag2index)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC'}
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6}


We will store these mappings and `tags.num_classes` in an `AutoConfig` object.

Passing keyword arguments to the `from_pretrained()` method overrides the default values.

In [17]:
from transformers import AutoConfig
xlmr_config = AutoConfig.from_pretrained(xlmr_model_name, num_labels=tags.num_classes,id2label=index2tag,label2id=tag2index)

In [18]:
import torch
from transformers import XLMRobertaForTokenClassification

print("module:", XLMRobertaForTokenClassification.__module__)
# The import above rebinds the name to the library class, so the module should be
# transformers.models.xlm_roberta.modeling_xlm_roberta rather than the notebook's custom class

module: transformers.models.xlm_roberta.modeling_xlm_roberta


In [19]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
xlmr_model =( XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device))

Using device: cpu



XLMRobertaForTokenClassification LOAD REPORT from: xlm-roberta-base
Key                         | Status     | 
----------------------------+------------+-
lm_head.layer_norm.bias     | UNEXPECTED | 
lm_head.bias                | UNEXPECTED | 
roberta.pooler.dense.bias   | UNEXPECTED | 
lm_head.dense.weight        | UNEXPECTED | 
lm_head.dense.bias          | UNEXPECTED | 
lm_head.layer_norm.weight   | UNEXPECTED | 
roberta.pooler.dense.weight | UNEXPECTED | 
classifier.bias             | MISSING    | 
classifier.weight           | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


In [20]:
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt")
pd.DataFrame([xlmr_tokens, input_ids[0].numpy()], index=["Tokens", "Input IDs"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tokens,<s>,▁Jack,▁spar,row,▁love,s,▁New,▁York,!,</s>
Input IDs,0,21763,27148,15555,5161,7,2356,5753,38,2


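We pass the inputs to the model and take the argmax over the logits to get the most likely class for each token. The classification head is still randomly initialized at this point (see the MISSING entries in the load report), so the predicted tags are not meaningful yet.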

In [21]:
outputs = xlmr_model(input_ids.to(device)).logits
print("Logits shape:", outputs.shape)  # (1, sequence_length, num_labels)
# Get predicted class indices
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"shape of outputs: {outputs.shape}")


Logits shape: torch.Size([1, 10, 7])
Number of tokens in sequence: 10
shape of outputs: torch.Size([1, 10, 7])


In [22]:
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Predicted Tags"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tokens,<s>,▁Jack,▁spar,row,▁love,s,▁New,▁York,!,</s>
Predicted Tags,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG,I-ORG


Let's wrap the preceding steps in a helper function for later use, once the model is fine-tuned.

In [23]:
def tag_text(text, tags, model, tokenizer):
    # Get tokens with special characters
    tokens = tokenizer(text).tokens()
    # Encode the sequence into ids with the tokenizer passed in (not the global one)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    # Get logits over the 7 possible classes
    outputs = model(input_ids).logits
    # Take argmax to get predicted class indices
    predictions = torch.argmax(outputs, dim=2)
    # Convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    return pd.DataFrame([tokens, preds], index=["Tokens", "Predicted Tags"])

Now that we have established that the tokenizer and model can encode a single example, the next step is to tokenize the whole dataset and pass it to XLM-R for fine-tuning.

The best way to tokenize a Dataset object is with the `map()` operation.

In [24]:

# Collecting words and tags in a list
words, labels = de_example["tokens"], de_example["ner_tags"]
tokenized_inputs = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"])
pd.DataFrame([tokens] , index=["Tokens"]) 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁2.000,▁Einwohner,n,▁an,▁der,▁Dan,zi,ger,▁Buch,...,▁Wo,i,wod,schaft,▁Po,mmer,n,▁,.,</s>


In the example above, the tokenizer splits "Einwohnern" into two subwords, "▁Einwohner" and "n".

By convention, only the first subword ("▁Einwohner") keeps the word's label (O in this sentence), so we need a way to mask the subword representations that follow the first subword.

Fortunately, `tokenized_inputs` is a `BatchEncoding` object with a `word_ids()` method that helps us achieve this.

In [25]:
word_ids = tokenized_inputs.word_ids()
pd.DataFrame([tokens, word_ids], index=["Tokens", "Word IDs"])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁2.000,▁Einwohner,n,▁an,▁der,▁Dan,zi,ger,▁Buch,...,▁Wo,i,wod,schaft,▁Po,mmer,n,▁,.,</s>
Word IDs,,0,1,1,2,3,4,4,4,5,...,9,9,9,9,10,10,10,11,11,


Now we set -100 as the label for the special tokens and for the subwords we wish to mask during training.


In [26]:
previous_word_idx = None
labels_ids = []

for word_idx in word_ids:
    if word_idx is None or word_idx == previous_word_idx:
        labels_ids.append(-100)
    elif word_idx != previous_word_idx:
        labels_ids.append(labels[word_idx])
    previous_word_idx = word_idx


labels = [index2tag[l] if l != -100 else "IGN" for l in labels_ids]
index = ["Tokens", "Word IDs", "Label_IDs","Labels"]
pd.DataFrame([tokens, word_ids, labels_ids, labels], index=index)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
Tokens,<s>,▁2.000,▁Einwohner,n,▁an,▁der,▁Dan,zi,ger,▁Buch,...,▁Wo,i,wod,schaft,▁Po,mmer,n,▁,.,</s>
Word IDs,,0,1,1,2,3,4,4,4,5,...,9,9,9,9,10,10,10,11,11,
Label_IDs,-100,0,0,-100,0,0,5,-100,-100,6,...,5,-100,-100,-100,6,-100,-100,0,-100,-100
Labels,IGN,O,O,IGN,O,O,B-LOC,IGN,IGN,I-LOC,...,B-LOC,IGN,IGN,IGN,I-LOC,IGN,IGN,O,IGN,IGN
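
The alignment rule can be checked without loading a tokenizer. The sketch below applies the same loop to the first ten word IDs shown in the table, typed in by hand; `align_labels` is a name chosen for this note. Only the visible prefix of the table is used, so the elided columns are not reconstructed.

```python
IGNORE_INDEX = -100


def align_labels(word_ids, word_labels):
    """Keep the label on the first subword of each word and mask everything else."""
    label_ids, previous_word_idx = [], None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special token or a later subword of the same word
            label_ids.append(IGNORE_INDEX)
        else:
            label_ids.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return label_ids


# First ten word IDs from the table above; None marks the <s> token
word_ids_prefix = [None, 0, 1, 1, 2, 3, 4, 4, 4, 5]
# ner_tags of the first six words of the German example
word_labels_prefix = [0, 0, 0, 0, 5, 6]
aligned = align_labels(word_ids_prefix, word_labels_prefix)
print(aligned)
```

The result matches the first ten entries of the `Label_IDs` row.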


In [27]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [28]:
# Function to iterate over every split and encode it

def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True,
                      remove_columns=["langs", "ner_tags", "tokens"])

By applying this function to a DatasetDict object, we get an encoded Dataset object per split. Let's use it to encode our German corpus.

In [29]:
panx_de_encoded = encode_panx_dataset(panx_ch["de"])


In [30]:
!pip install seqeval



In [31]:
from seqeval.metrics import classification_report

y_true = [["O", "O","O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER","I-PER", "O"]]
y_pred = [["O", "O","B-MISC","I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER","I-PER", "O"]]
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

        MISC       0.00      0.00      0.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2
weighted avg       0.50      0.50      0.50         2
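
The MISC row scores zero even though three of the four predicted MISC tokens overlap the true entity, because the report counts an entity as correct only when its type and its exact boundaries match. The sketch below reproduces the micro-averaged numbers under that assumption. `extract_entities` is a simplified helper written for this note and is not seqeval's implementation.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO-tagged sentence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]

# Prefix each span with its sentence index so spans from different sentences stay distinct
true_spans = {(i, *e) for i, tags in enumerate(y_true) for e in extract_entities(tags)}
pred_spans = {(i, *e) for i, tags in enumerate(y_pred) for e in extract_entities(tags)}

correct = true_spans & pred_spans
precision = len(correct) / len(pred_spans)
recall = len(correct) / len(true_spans)
print(precision, recall)
```

The predicted MISC span starts one token too early, so only the PER entity counts as correct and both precision and recall come out at 0.5.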



In [34]:
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)

    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []

        for seq_idx in range(seq_len):
            # Ignore label id = -100 (special tokens)
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])
        labels_list.append(example_labels)
        preds_list.append(example_preds)
    return labels_list, preds_list



## Fine-Tuning XLM-RoBERTa

Our first strategy is to fine-tune the model on the German dataset and then evaluate its zero-shot cross-lingual performance on French, Italian, and English.

We will use the Transformers `Trainer` to handle the training loop.

First we need to define the training attributes using the `TrainingArguments` class.

In [33]:
from transformers import  TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size
model_name = f"{xlmr_model_name}-finetuned-panx-de"
training_args = TrainingArguments(
    output_dir=model_name,
    log_level="error",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    eval_strategy="epoch",
    save_steps=1e6,
    weight_decay=0.01,
    disable_tqdm=False,
    logging_steps=logging_steps,
    push_to_hub=True)


Here we evaluate the model's predictions on the validation set at the end of every epoch, apply a small weight decay, and set `save_steps` to a large number to disable checkpointing and thus speed up training.
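
As a quick check of what these settings imply, the arithmetic below works out the logging interval and the number of optimizer steps. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `steps_per_epoch` and `total_steps` are names introduced here.

```python
import math

train_rows = 12_580   # German training examples after downsampling
batch_size = 24
num_epochs = 3

# Same expression as in the training cell: floor division
logging_steps = train_rows // batch_size

# Optimizer steps per epoch when the final partial batch is kept
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(logging_steps, steps_per_epoch, total_steps)
```

Under these assumptions the training loss is logged about once per epoch, which lines up with the per-epoch evaluation.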

In [33]:
from huggingface_hub import notebook_login
notebook_login()


In [35]:
from seqeval.metrics import f1_score

def compute_metrics(eval_pred):
    # align_predictions returns (labels, predictions) in that order
    y_true, y_pred = align_predictions(eval_pred.predictions, eval_pred.label_ids)
    return {"f1": f1_score(y_true, y_pred)}

The next step is to define a data collator so we can pad each input sequence to the longest sequence in the batch. Transformers provides a data collator for token classification that pads the labels along with the inputs.

In [36]:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)
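
To illustrate what padding inputs and labels together means, here is a plain-Python sketch of the idea. It assumes right-side padding, a pad token ID of 1 (the value XLM-R uses), and -100 as the label padding value. `pad_batch` is a hypothetical stand-in for this note, not the library collator, and the second feature is a made-up short sequence.

```python
IGNORE_INDEX = -100
PAD_TOKEN_ID = 1  # assumption: XLM-R's padding token ID


def pad_batch(features, pad_token_id=PAD_TOKEN_ID, label_pad_id=IGNORE_INDEX):
    """Right-pad input_ids, attention_mask and labels to the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [0, 10699, 11, 15, 16104, 1388, 2],
     "labels": [-100, 3, -100, 4, 4, 4, -100]},
    {"input_ids": [0, 16459, 2],          # illustrative short sequence
     "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["labels"][1])
```

Padding the labels with -100 matters because the loss then skips the padded positions in the same way it skips special tokens and trailing subwords.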


In [37]:
def model_init():
    return (XLMRobertaForTokenClassification.from_pretrained(xlmr_model_name, config=xlmr_config).to(device))

In [39]:
from transformers import Trainer

trainer = Trainer(model_init=model_init, args=training_args, data_collator=data_collator, compute_metrics=compute_metrics, train_dataset=panx_de_encoded["train"], eval_dataset=panx_de_encoded["validation"])


In [None]:
trainer.train()
trainer.push_to_hub(commit_message="Finetuned XLM-R on PAN-X German subset")



# Error Analysis

In [37]:
# Loading the fine-tuned model from Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name ="thalhathai/xlm-roberta-base-finetuned-panx-de"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)




XLMRobertaForTokenClassification(
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=7, bias=True)
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
 

### Look at the validation examples with highest loss

In [38]:
from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
    # convert the dict of lists to list of dicts suitable for Datacollator
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # Pad inputs and labels and put all tensors on the same device as the model
    batch = data_collator(features)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    with torch.no_grad():
        # pass data through the model
        output = model(input_ids, attention_mask)
        # logit.size: [batch_size, seq_len, num_labels]
        # predict class with largest logit value on classes axis
        predicted_label = torch.argmax(output.logits, axis=-1).cpu().numpy()
    # compute loss per token after flattening the batch dimensions with view
    loss = cross_entropy(output.logits.view(-1,7),
                            labels.view(-1), reduction="none")
    # unflatten the batch dimensions and move to cpu for analysis
    loss = loss.view(len(input_ids), -1).cpu().numpy()

    return {"loss": loss, "predicted_label": predicted_label}

We can apply this function to the whole validation set using `map()`.

In [40]:
valid_set = panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
df = valid_set.to_pandas()


The tokens and labels are still encoded as IDs, so we map them back to strings.

We also add an "IGN" entry to `index2tag` so that label ID -100 has a readable name.



In [41]:
index2tag[-100] = "IGN"
df["input_tokens"] = df["input_ids"].apply(
    lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
df["predicted_label"] = df["predicted_label"].apply(
    lambda x: [index2tag[i] for i in x])
df["labels"] = df["labels"].apply(
    lambda x: [index2tag[i] for i in x])
df["loss"] = df.apply(
    lambda x: x["loss"][:len(x["input_ids"])], axis=1)
df["predicted_label"] = df.apply(
    lambda x: x["predicted_label"][:len(x["input_ids"])], axis=1)
df.head()

Unnamed: 0,input_ids,attention_mask,labels,loss,predicted_label,input_tokens
0,"[0, 10699, 11, 15, 16104, 1388, 2]","[1, 1, 1, 1, 1, 1, 1]","[IGN, B-ORG, IGN, I-ORG, I-ORG, I-ORG, IGN]","[0.0, 0.009578697, 0.0, 0.016933734, 0.0114379...","[I-ORG, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-ORG]","[<s>, ▁Ham, a, ▁(, ▁Unternehmen, ▁), </s>]"
1,"[0, 56530, 25216, 30121, 152385, 19229, 83982,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[IGN, O, IGN, IGN, IGN, IGN, B-ORG, IGN, IGN, ...","[0.0, 0.00019333878, 0.0, 0.0, 0.0, 0.0, 1.006...","[O, O, O, O, O, O, B-LOC, I-LOC, I-LOC, I-ORG,...","[<s>, ▁WE, ITE, RL, EIT, UNG, ▁Luz, ky, j, ▁a,..."
2,"[0, 159093, 165, 38506, 122, 153080, 29088, 57...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[IGN, O, O, O, O, B-ORG, IGN, IGN, O, IGN, O, ...","[0.0, 0.00020430385, 9.2263734e-05, 0.00011812...","[O, O, O, O, O, B-ORG, I-ORG, I-ORG, O, O, O, ...","[<s>, ▁entdeckt, ▁und, ▁gehört, ▁der, ▁Spek, t..."
3,"[0, 16459, 242, 5106, 6, 198715, 5106, 242, 2]","[1, 1, 1, 1, 1, 1, 1, 1, 1]","[IGN, O, O, O, B-LOC, IGN, O, O, IGN]","[0.0, 0.0001753415, 0.00014304092, 0.000211216...","[O, O, O, O, B-LOC, I-LOC, O, O, O]","[<s>, ▁**, ▁', ▁'', ▁, Bretagne, ▁'', ▁', </s>]"
4,"[0, 11022, 2315, 7418, 1079, 8186, 57242, 97, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[IGN, O, O, O, O, O, O, O, IGN, O, O, O, B-ORG...","[0.0, 0.00011443437, 0.00011526874, 0.00014780...","[O, O, O, O, O, O, O, O, O, O, O, O, B-ORG, I-...","[<s>, ▁Nach, ▁einem, ▁Jahr, ▁bei, ▁diesem, ▁Ve..."


Let's look at the tokens individually by unpacking these lists.

In [42]:
df_tokens = df.apply(pd.Series.explode)
# Copy the filtered frame so the column assignment below does not act on a view
df_tokens = df_tokens.query("labels != 'IGN'").copy()
df_tokens["loss"] = df_tokens["loss"].astype(float).round(2)
df_tokens.head(10)

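| | input_ids | attention_mask | labels | loss | predicted_label | input_tokens |
|---|---|---|---|---|---|---|
| 0 | 10699 | 1 | B-ORG | 0.01 | B-ORG | ▁Ham |
| 0 | 15 | 1 | I-ORG | 0.02 | I-ORG | ▁( |
| 0 | 16104 | 1 | I-ORG | 0.01 | I-ORG | ▁Unternehmen |
| 0 | 1388 | 1 | I-ORG | 0.01 | I-ORG | ▁) |
| 1 | 56530 | 1 | O | 0.0 | O | ▁WE |
| 1 | 83982 | 1 | B-ORG | 1.01 | B-LOC | ▁Luz |
| 1 | 10 | 1 | I-ORG | 0.5 | I-ORG | ▁a |
| 1 | 57 | 1 | I-ORG | 0.77 | I-LOC | ▁sa |
| 2 | 159093 | 1 | O | 0.0 | O | ▁entdeckt |
| 2 | 165 | 1 | O | 0.0 | O | ▁und |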


Now that the data is in this shape, we can group it by input token and aggregate the losses for each token with count, mean, and sum. Finally, we sort the aggregated data by the sum of the losses to see which tokens have accumulated the most loss in the validation set.

In [43]:
(
    df_tokens.groupby("input_tokens")[["loss"]]
    .agg(["count","mean","sum"])
    .droplevel(level=0, axis=1) # Get rid of multi-level column index
    .sort_values("sum", ascending=False)
    .reset_index()
    .round(2)
    .head(10)
    .T
)

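| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| input_tokens | ▁ | ▁in | ▁der | ▁von | ▁und | ▁/ | ▁( | ▁) | ▁'' | ▁die |
| count | 6072 | 989 | 1388 | 808 | 1171 | 163 | 246 | 246 | 2898 | 860 |
| mean | 0.03 | 0.14 | 0.09 | 0.15 | 0.09 | 0.63 | 0.27 | 0.27 | 0.02 | 0.06 |
| sum | 210.85 | 139.54 | 127.27 | 124.1 | 105.47 | 103.28 | 66.82 | 66.33 | 66.25 | 52.82 |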


In [46]:
(
        df_tokens.groupby("labels")[["loss"]]
        .agg(["count","mean","sum"])
        .droplevel(level=0, axis=1) # Get rid of multi-level column index
        .sort_values("sum", ascending=False)
        .reset_index()
        .round(2)
        .T
)

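| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| labels | I-ORG | B-ORG | O | B-LOC | I-LOC | B-PER | I-PER |
| count | 3820 | 2683 | 43648 | 3172 | 1462 | 2893 | 4139 |
| mean | 0.5 | 0.59 | 0.03 | 0.34 | 0.62 | 0.26 | 0.18 |
| sum | 1923.91 | 1595.01 | 1408.87 | 1080.86 | 912.66 | 761.18 | 744.22 |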
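
The group-and-aggregate step used in the last two cells can be sketched without pandas. The snippet below applies count, mean, and sum per label to the ten (label, loss) pairs shown in the earlier `df_tokens.head(10)` output, then ranks the labels by summed loss. It is a toy illustration on those ten rows only, so its numbers are not comparable to the full-validation table.

```python
from collections import defaultdict

# (label, loss) pairs copied from the first ten rows of df_tokens
rows = [("B-ORG", 0.01), ("I-ORG", 0.02), ("I-ORG", 0.01), ("I-ORG", 0.01),
        ("O", 0.0), ("B-ORG", 1.01), ("I-ORG", 0.5), ("I-ORG", 0.77),
        ("O", 0.0), ("O", 0.0)]

# Group the losses by label
grouped = defaultdict(list)
for label, loss in rows:
    grouped[label].append(loss)

# Aggregate each group with count, mean and sum
summary = {
    label: {
        "count": len(losses),
        "mean": round(sum(losses) / len(losses), 2),
        "sum": round(sum(losses), 2),
    }
    for label, losses in grouped.items()
}

# Sort by summed loss, largest first, like sort_values("sum", ascending=False)
ranked = sorted(summary.items(), key=lambda item: item[1]["sum"], reverse=True)
for label, stats in ranked:
    print(label, stats)
```

Sorting by the sum highlights labels that are both frequent and hard, while the mean shows which labels are hard per token regardless of how often they occur.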
