# Assignment: Research Engineer in Natural Language Processing

In this assignment, I work with the MultiNERD Named Entity Recognition (NER) dataset to train and evaluate an NER model for English.

To download the dataset, generate an access token at *huggingface.co/settings/tokens* and use it to log in with `huggingface-cli login`.

After uploading the requirements.txt file provided in the GitHub repository, run the following command. It installs all the dependencies and packages used in the notebook.

In [None]:
!pip install -r "/content/requirements.txt"

After installation, restart the runtime so that the newly installed packages are picked up.

### Imported Packages

In [1]:
import pandas as pd
from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import load_dataset, load_dataset_builder, get_dataset_split_names,  get_dataset_config_names
import numpy as np
import evaluate
import torch
from huggingface_hub import notebook_login




We can check whether a GPU is available with the following command:

In [2]:

torch.cuda.is_available()

True

In [3]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(device)

cuda


### Table of Contents

* [1-Loading MultiNERD Named Entity Recognition (NER) dataset](#loading)
* [2-Analysing MultiNERD Named Entity Recognition (NER) dataset](#analysis)
    * [2.1-Non-English Example Filtering](#nonenglish)
    * [2.2-System B labels](#system_b)
* [3-MODEL FINE-TUNING](#modeltuning)
    * [3.1-Tokenization](#tokenization)
    * [3.2-Defining the model to train](#model_define)
    * [3.3-System A Training](#system_a_train)
    * [3.4-System B Training](#system_b_train)  
* [4-Evaluation on test set](#evaluation)


## 1-Loading MultiNERD Named Entity Recognition (NER) dataset <a class="anchor" id="loading"></a>

We can inspect the MultiNERD dataset and see what it contains with the following code:

In [4]:
ds_builder = load_dataset_builder("Babelscape/multinerd")

In [5]:
ds_builder.info.features

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'lang': Value(dtype='string', id=None)}

In [6]:
get_dataset_split_names("Babelscape/multinerd")

['train', 'validation', 'test']

As can be seen, the dataset consists of train, validation, and test splits.

Now, we can load the dataset.

In [7]:
dataset = load_dataset("Babelscape/multinerd")

In the following steps, we load the train, validation, and test splits separately.

In [8]:
train_dataset = load_dataset("Babelscape/multinerd", split="train")

In [9]:
train_dataset

Dataset({
    features: ['tokens', 'ner_tags', 'lang'],
    num_rows: 2678400
})

In [10]:
validation_dataset = load_dataset("Babelscape/multinerd", split="validation")

In [11]:
validation_dataset

Dataset({
    features: ['tokens', 'ner_tags', 'lang'],
    num_rows: 334800
})

In [12]:
test_dataset = load_dataset("Babelscape/multinerd", split="test")

In [13]:
test_dataset

Dataset({
    features: ['tokens', 'ner_tags', 'lang'],
    num_rows: 335986
})

## 2-Analysing MultiNERD Named Entity Recognition (NER) dataset <a class="anchor" id="analysis"></a>

On the dataset page (https://huggingface.co/datasets/Babelscape/multinerd?row=17), the tagset is provided as follows:

In [14]:
label2id = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-ANIM": 7,
    "I-ANIM": 8,
    "B-BIO": 9,
    "I-BIO": 10,
    "B-CEL": 11,
    "I-CEL": 12,
    "B-DIS": 13,
    "I-DIS": 14,
    "B-EVE": 15,
    "I-EVE": 16,
    "B-FOOD": 17,
    "I-FOOD": 18,
    "B-INST": 19,
    "I-INST": 20,
    "B-MEDIA": 21,
    "I-MEDIA": 22,
    "B-MYTH": 23,
    "I-MYTH": 24,
    "B-PLANT": 25,
    "I-PLANT": 26,
    "B-TIME": 27,
    "I-TIME": 28,
    "B-VEHI": 29,
    "I-VEHI": 30,
  }


### 2.1-Non-English Example Filtering <a class="anchor" id="nonenglish"></a>

We are going to work only on the English subset of the data. Hence, we first check the language distribution of each split and then filter.

In [15]:
print("Languages in Train Dataset:")
print(pd.DataFrame(train_dataset["lang"]).value_counts())

Languages in Train Dataset:
zh    312320
pl    311840
it    291040
pt    284000
fr    281760
es    276960
nl    274720
en    262560
de    250720
ru    132480
dtype: int64


In [16]:
print("Languages in Validation Dataset:")
print(pd.DataFrame(validation_dataset["lang"]).value_counts())

Languages in Validation Dataset:
zh    39040
pl    38980
it    36380
pt    35500
fr    35220
es    34620
nl    34340
en    32820
de    31340
ru    16560
dtype: int64


In [17]:
print("Languages in Test Dataset:")
print(pd.DataFrame(test_dataset["lang"]).value_counts())

Languages in Test Dataset:
zh    39154
pl    39110
it    36434
pt    35630
fr    35390
es    34798
nl    34362
en    32908
de    31524
ru    16676
dtype: int64


As we can see, most examples are not in English. We can remove them as follows:

In [18]:
en_train = train_dataset.filter(lambda x: x["lang"] =="en")

In [19]:
print("Languages in Train Dataset after filtering:")
print(pd.DataFrame(en_train["lang"]).value_counts())

Languages in Train Dataset after filtering:
en    262560
dtype: int64


In [20]:
en_validation = validation_dataset.filter(lambda x: x["lang"] =="en")

In [21]:
print("Languages in Validation Dataset after filtering:")
print(pd.DataFrame(en_validation["lang"]).value_counts())

Languages in Validation Dataset after filtering:
en    32820
dtype: int64


In [22]:
en_test = test_dataset.filter(lambda x: x["lang"] =="en")

In [23]:
print("Languages in Test Dataset after filtering:")
print(pd.DataFrame(en_test["lang"]).value_counts())

Languages in Test Dataset after filtering:
en    32908
dtype: int64


After filtering, the train, validation, and test data contain only the English subset of the dataset.

### 2.2-System B labels <a class="anchor" id="system_b"></a>

In the assignment, System B is trained to recognize only the following tags:

**[O, PER, ORG, LOC, DIS, ANIM]**. Tokens with any other tag are set to 0 (the O tag). To do that, we can use the map function to convert the labels of tokens not belonging to these tags to 0.

In [24]:
label2id

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-ANIM': 7,
 'I-ANIM': 8,
 'B-BIO': 9,
 'I-BIO': 10,
 'B-CEL': 11,
 'I-CEL': 12,
 'B-DIS': 13,
 'I-DIS': 14,
 'B-EVE': 15,
 'I-EVE': 16,
 'B-FOOD': 17,
 'I-FOOD': 18,
 'B-INST': 19,
 'I-INST': 20,
 'B-MEDIA': 21,
 'I-MEDIA': 22,
 'B-MYTH': 23,
 'I-MYTH': 24,
 'B-PLANT': 25,
 'I-PLANT': 26,
 'B-TIME': 27,
 'I-TIME': 28,
 'B-VEHI': 29,
 'I-VEHI': 30}

In [25]:
id2label = {value:key for (key,value) in label2id.items()}
id2label

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-ANIM',
 8: 'I-ANIM',
 9: 'B-BIO',
 10: 'I-BIO',
 11: 'B-CEL',
 12: 'I-CEL',
 13: 'B-DIS',
 14: 'I-DIS',
 15: 'B-EVE',
 16: 'I-EVE',
 17: 'B-FOOD',
 18: 'I-FOOD',
 19: 'B-INST',
 20: 'I-INST',
 21: 'B-MEDIA',
 22: 'I-MEDIA',
 23: 'B-MYTH',
 24: 'I-MYTH',
 25: 'B-PLANT',
 26: 'I-PLANT',
 27: 'B-TIME',
 28: 'I-TIME',
 29: 'B-VEHI',
 30: 'I-VEHI'}

Above, we built a lookup from label ids to label names, which will be used later in the metric calculations.
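As a small illustration of how such an id-to-name lookup is used, the sketch below decodes a short tag sequence into tag names. The names `entity_types`, `id2label_demo`, and `tag_ids` are hypothetical and only serve this example; the tag order is assumed to be the one listed in the tagset above.

```python
# Rebuild the tag list in the order of the tagset: "O", then B-/I- per entity type.
entity_types = [
    "PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
    "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI",
]
label_names = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]

# Hypothetical demo lookup, equivalent in content to id2label.
id2label_demo = dict(enumerate(label_names))

# A shortened tag sequence containing one two-token MEDIA entity.
tag_ids = [0, 0, 21, 22, 0]
decoded = [id2label_demo[i] for i in tag_ids]
print(decoded)  # ['O', 'O', 'B-MEDIA', 'I-MEDIA', 'O']
```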

Now we can create a new dictionary that maps the old label ids to the new ids used by System B.

In [26]:
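# Maps the original MultiNERD tag ids to the System B ids defined in label2id_b.
# Every tag outside System B is mapped to 0 (O).
system_b_mapping = {
    0: 0,    # O
    1: 1,    # B-PER
    2: 2,    # I-PER
    3: 3,    # B-ORG
    4: 4,    # I-ORG
    5: 5,    # B-LOC
    6: 6,    # I-LOC
    7: 9,    # B-ANIM (id 9 in label2id_b)
    8: 10,   # I-ANIM (id 10 in label2id_b)
    9: 0,    # B-BIO
    10: 0,   # I-BIO
    11: 0,   # B-CEL
    12: 0,   # I-CEL
    13: 7,   # B-DIS (id 7 in label2id_b)
    14: 8,   # I-DIS (id 8 in label2id_b)
    15: 0,   # B-EVE
    16: 0,   # I-EVE
    17: 0,   # B-FOOD
    18: 0,   # I-FOOD
    19: 0,   # B-INST
    20: 0,   # I-INST
    21: 0,   # B-MEDIA
    22: 0,   # I-MEDIA
    23: 0,   # B-MYTH
    24: 0,   # I-MYTH
    25: 0,   # B-PLANT
    26: 0,   # I-PLANT
    27: 0,   # B-TIME
    28: 0,   # I-TIME
    29: 0,   # B-VEHI
    30: 0,   # I-VEHI
}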

In [27]:

label2id_b = {"O":0, "B-PER":1, "I-PER":2,"B-ORG":3,"I-ORG":4, "B-LOC":5,"I-LOC":6, "B-DIS":7,"I-DIS":8, "B-ANIM":9,"I-ANIM":10}
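The id mapping and `label2id_b` must agree on the integer id of every kept tag, otherwise the DIS and ANIM names end up attached to each other's ids. One way to keep the two in sync is to derive the mapping from the tag names instead of writing it by hand. The following is a minimal sketch under that assumption; the names `all_types`, `kept_types`, `label2id_full`, `label2id_kept`, and `derived_mapping` are hypothetical and not part of the original notebook.

```python
# Entity types in the order of the full MultiNERD tagset.
all_types = [
    "PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
    "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI",
]
# Entity types kept in System B, in the order used by label2id_b.
kept_types = ["PER", "ORG", "LOC", "DIS", "ANIM"]


def build_label2id(types):
    """Return {"O": 0, "B-X": 1, "I-X": 2, ...} for the given entity types."""
    names = ["O"] + [f"{prefix}-{t}" for t in types for prefix in ("B", "I")]
    return {name: i for i, name in enumerate(names)}


label2id_full = build_label2id(all_types)
label2id_kept = build_label2id(kept_types)

# Old id -> new id; tags that are not kept fall back to 0 (O).
derived_mapping = {
    old_id: label2id_kept.get(name, 0) for name, old_id in label2id_full.items()
}

print(derived_mapping[label2id_full["B-DIS"]])    # 7
print(derived_mapping[label2id_full["B-ANIM"]])   # 9
print(derived_mapping[label2id_full["B-MEDIA"]])  # 0
```

Comparing `derived_mapping` with a hand-written mapping is a cheap sanity check before training.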




The following functions apply this mapping to the tags of each example, so that all excluded labels become 0.

In [28]:
def set_system_b(data):
    return [*map(system_b_mapping.get, data)]

In [29]:
def other_entity_converter(example):
    return {"ner_tags": set_system_b(example["ner_tags"])}


en_train_b = en_train.map(other_entity_converter)

We can check if it worked by comparing system A dataset (en_train) and system B dataset (en_train_b):

In [30]:
en_train[95]["ner_tags"]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 22, 0, 0]

In [31]:
en_train_b[95]["ner_tags"]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [32]:
en_train[5]["ner_tags"]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]

In [33]:
en_train_b[5]["ner_tags"]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]

As can be seen from the above comparison, the labels of tokens that don't belong to the entities included in System B are set to 0, while the included labels are kept. We can apply the same approach to create validation and test data for System B.

In [34]:
en_validation_b = en_validation.map(other_entity_converter)
en_test_b = en_test.map(other_entity_converter)

Since we named our system B data as *en_train_b*, *en_validation_b*, and *en_test_b*, we can do the same for system A data.

In [35]:
en_train_a = en_train
en_validation_a = en_validation
en_test_a = en_test

Lastly, we can drop the language column from the datasets and rename the remaining columns:

In [36]:
def column_formatting(dataset):
  dataset = dataset.remove_columns(["lang"])
  dataset = dataset.rename_column("ner_tags","labels")
  dataset = dataset.rename_column("tokens", "words")
  return dataset

In [37]:
en_train_a = column_formatting(en_train_a)
en_validation_a = column_formatting(en_validation_a)
en_test_a = column_formatting(en_test_a)

en_train_b = column_formatting(en_train_b)
en_validation_b = column_formatting(en_validation_b)
en_test_b = column_formatting(en_test_b)

## 3-MODEL FINE-TUNING <a class="anchor" id="modeltuning"></a>

### 3.1-Tokenization <a class="anchor" id="tokenization"></a>

In this training process, I fine-tune the *bert-base-cased* model (https://huggingface.co/bert-base-cased) on the MultiNERD dataset.



In [38]:
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

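The functions below align the word-level labels with the sub-word tokens, based on the word ids returned by the tokenizer. Only the first sub-token of each word keeps the word's label; special tokens and the remaining sub-tokens get -100 so that they are ignored in the loss.
Ref: https://huggingface.co/learn/nlp-course/chapter7/2?fw=pt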

In [39]:
def align_labels_with_tokens(labels, word_ids):
    """Label the first sub-token of each word and mask everything else with -100."""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First sub-token of a new word, or a special token following a word
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        else:
            # Leading special token or a continuation sub-token of the same word:
            # -100 is not included in the loss calculation
            new_labels.append(-100)

    return new_labels
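To make the alignment rule concrete, here is a small worked example. The function is repeated so that the snippet runs on its own, and the `word_ids` list is written by hand as an assumption about how a tokenizer might split four words; the real split depends on the tokenizer's vocabulary.

```python
def align_labels_with_tokens(labels, word_ids):
    """Label the first sub-token of each word and mask everything else with -100."""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        else:
            new_labels.append(-100)
    return new_labels


# Word-level labels for four words: B-PER, I-PER, O, B-LOC.
word_labels = [1, 2, 0, 5]

# Hypothetical sub-word split: [CLS], word 0, word 1 (2 pieces), word 2,
# word 3 (3 pieces), [SEP]. None marks the special tokens.
word_ids = [None, 0, 1, 1, 2, 3, 3, 3, None]

aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)  # [-100, 1, 2, -100, 0, 5, -100, -100, -100]
```

Each word contributes exactly one labelled position, so the number of evaluated tokens equals the number of words in the sentence.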

In [40]:
def tokenize_and_align_labels(data):
    tokenized_inputs = tokenizer(
        data["words"], truncation=True, is_split_into_words=True
    )
    all_labels = data["labels"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

Now, we can use the prepared functions to tokenize the dataset properly.

In [41]:
tokenized_en_train_a = en_train_a.map(tokenize_and_align_labels, batched=True, remove_columns=en_train_a.column_names)
tokenized_en_validation_a = en_validation_a.map(tokenize_and_align_labels, batched=True, remove_columns=en_validation_a.column_names)
tokenized_en_test_a = en_test_a.map(tokenize_and_align_labels, batched=True, remove_columns=en_test_a.column_names)

tokenized_en_train_b = en_train_b.map(tokenize_and_align_labels, batched=True, remove_columns=en_train_b.column_names)
tokenized_en_validation_b = en_validation_b.map(tokenize_and_align_labels, batched=True, remove_columns=en_validation_b.column_names)
tokenized_en_test_b = en_test_b.map(tokenize_and_align_labels, batched=True, remove_columns=en_test_b.column_names)

Since the examples differ in length, they have to be padded to the same size within each batch. For that purpose, we can define a data collator for token classification.

In [42]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
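Conceptually, the collator pads the inputs with the tokenizer's padding token and pads the labels with -100, so that padded positions do not contribute to the loss. The sketch below imitates that behaviour in plain Python. It is a simplified illustration and not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of 0 is an assumption. The real collator also returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every example in a batch to the length of the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded label positions are masked out of the loss with -100.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy examples of different lengths (token ids are illustrative only).
toy_features = [
    {"input_ids": [101, 7, 8, 9, 102], "labels": [-100, 1, 2, 0, -100]},
    {"input_ids": [101, 7, 102], "labels": [-100, 5, -100]},
]

toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"][1])       # [101, 7, 102, 0, 0]
print(toy_batch["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(toy_batch["labels"][1])          # [-100, 5, -100, -100, -100]
```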

In [43]:
metric = evaluate.load("seqeval")

Here, we can define the metrics we would like to calculate during training and also in the evaluation step on test data.

In [44]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Getting label names from the label dictionary
    label_names = list(label2id.keys())
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [45]:
def compute_metrics_b(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Getting label names from the label dictionary of system B
    label_names = list(label2id_b.keys())
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
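The precision, recall, and F1 values returned above are computed over whole entities, while accuracy is computed over individual tokens. The following sketch illustrates the difference with a simplified span-matching routine in plain Python. It is an approximation for illustration and may differ from seqeval in edge cases; `extract_spans` and `entity_prf` are hypothetical helper names.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        # Close the open span unless this tag continues it (I- of the same type).
        if start is not None and (prefix != "I" or name != ent_type):
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        # Open a new span on B-, or on an I- that has no open span to continue.
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, name
    return spans


def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one sentence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    tp = len(gold & pred)  # a span counts only if type and boundaries both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]

precision, recall, f1 = entity_prf(gold, pred)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(precision, recall, f1)  # 0.5 0.5 0.5
print(token_accuracy)         # 0.75
```

In this toy case one wrong entity type costs half of the entity-level score but only a quarter of the token accuracy. This is why accuracy tends to look much higher than F1 on NER data, where most tokens are O.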

### 3.2-Defining the model to train <a class="anchor" id="model_define"></a>

Before defining our models, we need separate id2label and label2id dictionaries for System B that contain only the tags included in System B.

In [46]:
label2id_a = label2id
id2label_a = id2label

In [47]:
id2label_b = {value:key for (key,value) in label2id_b.items()}
id2label_b

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-DIS',
 8: 'I-DIS',
 9: 'B-ANIM',
 10: 'I-ANIM'}

In [48]:
model_a = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label_a,
    label2id=label2id_a,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [49]:
model_a.config.num_labels

31

In [50]:
model_b = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label_b,
    label2id=label2id_b,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [51]:
model_b.config.num_labels

11

In [52]:
notebook_login()

### 3.3-System A Training <a class="anchor" id="system_a_train"></a>

Here, we fine-tune the chosen model on the dataset we prepared based on the definitions of System A.

In [53]:
args_a = TrainingArguments(
    "bert-finetuned-ner_a",
    evaluation_strategy="epoch",
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32,
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
)

In [None]:
# !pip install -U accelerate
# !pip install -U transformers

In [54]:
trainer_a = Trainer(
    model=model_a,
    args=args_a,
    train_dataset=tokenized_en_train_a,
    eval_dataset=tokenized_en_validation_a,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

In [59]:
# trainer_a.train(): results from an earlier run in which I managed to train for 2 epochs

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0229 | 0.039835 | 0.906075 | 0.913521 | 0.909783 | 0.985824 |
| 2 | 0.0126 | 0.046127 | 0.907368 | 0.923083 | 0.915158 | 0.985987 |


TrainOutput(global_step=16410, training_loss=0.02275809637298677, metrics={'train_runtime': 6294.8915, 'train_samples_per_second': 83.42, 'train_steps_per_second': 2.607, 'total_flos': 1.784275325624064e+16, 'train_loss': 0.02275809637298677, 'epoch': 2.0})

In [55]:
trainer_a.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0257 | 0.04022 | 0.905759 | 0.913828 | 0.909776 | 0.985417 |


TrainOutput(global_step=8205, training_loss=0.043587589031454266, metrics={'train_runtime': 3338.1608, 'train_samples_per_second': 78.654, 'train_steps_per_second': 2.458, 'total_flos': 8921556316740288.0, 'train_loss': 0.043587589031454266, 'epoch': 1.0})
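The reported `global_step` values can be checked against the dataset size and the batch size. The short calculation below assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly one batch of 32 examples.

```python
import math

num_train_examples = 262_560  # English training examples after filtering
batch_size = 32               # per_device_train_batch_size

# Number of optimizer steps needed to see every example once.
steps_per_epoch = math.ceil(num_train_examples / batch_size)

print(steps_per_epoch)      # 8205  (global_step of the 1-epoch run)
print(2 * steps_per_epoch)  # 16410 (global_step of the 2-epoch run)
```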

### 3.4-System B Training <a class="anchor" id="system_b_train"></a>

Now that System A is trained, we can train a second model on the dataset prepared under the System B conditions.

In [58]:
args_b = TrainingArguments(
    "bert-finetuned-ner_b",
    evaluation_strategy="epoch",
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32,
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
)

In [59]:
trainer_b = Trainer(
    model=model_b,
    args=args_b,
    train_dataset=tokenized_en_train_b,
    eval_dataset=tokenized_en_validation_b,
    data_collator=data_collator,
    compute_metrics=compute_metrics_b,
    tokenizer=tokenizer,
)

In [60]:
trainer_b.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0122 | 0.018261 | 0.951384 | 0.959614 | 0.955481 | 0.993665 |


TrainOutput(global_step=8205, training_loss=0.022165504742656664, metrics={'train_runtime': 3341.9926, 'train_samples_per_second': 78.564, 'train_steps_per_second': 2.455, 'total_flos': 8919943554683328.0, 'train_loss': 0.022165504742656664, 'epoch': 1.0})

## 4-Evaluation on test set <a class="anchor" id="evaluation"></a>

As the last step, we evaluate the corresponding fine-tuned models on the test data of System A and System B.

In [56]:
trainer_a.evaluate(tokenized_en_test_a)

{'eval_loss': 0.027162281796336174,
 'eval_precision': 0.9366507177033493,
 'eval_recall': 0.954182101774225,
 'eval_f1': 0.9453351361792545,
 'eval_accuracy': 0.9904877170232692,
 'eval_runtime': 147.9741,
 'eval_samples_per_second': 222.39,
 'eval_steps_per_second': 6.954,
 'epoch': 1.0}

In [61]:
trainer_b.evaluate(tokenized_en_test_b)

{'eval_loss': 0.016513075679540634,
 'eval_precision': 0.9616130283055447,
 'eval_recall': 0.9720830974260702,
 'eval_f1': 0.9668197175777528,
 'eval_accuracy': 0.9942540162812478,
 'eval_runtime': 145.2432,
 'eval_samples_per_second': 226.572,
 'eval_steps_per_second': 7.085,
 'epoch': 1.0}
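As a quick consistency check on the test results, the reported F1 values can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The numbers below are copied from the two evaluation outputs; `harmonic_f1` is a hypothetical helper name.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation outputs.
results = {
    "System A": (0.9366507177033493, 0.954182101774225, 0.9453351361792545),
    "System B": (0.9616130283055447, 0.9720830974260702, 0.9668197175777528),
}

for name, (p, r, reported_f1) in results.items():
    recomputed = harmonic_f1(p, r)
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported F1 = {reported_f1:.4f}")
```

The two systems are scored on different tag sets (31 versus 11 labels), so the gap between their F1 values reflects the easier System B task as well as model quality.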

In [None]:
!pip freeze > requirements.txt