# Token classification

Following https://huggingface.co/learn/nlp-course/chapter7/2

## Loading & inspecting data

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [4]:
raw_datasets["train"][0]["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [5]:
raw_datasets["train"][0]["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [6]:
ner_feature = raw_datasets["train"].features["ner_tags"]
ner_feature

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [7]:
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [8]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 


## Prepare a tokenizer

In [9]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [10]:
tokenizer.is_fast

True

In [11]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [13]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

### Map labels to tokenised words

In [14]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX.
            # In label_names every B-XXX has an odd ID and its I-XXX is the next ID.
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [15]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]


In [16]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

### Prepare a tokenised dataset

In [17]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map: 100%|██████████| 3250/3250 [00:00<00:00, 7484.06 examples/s]


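## Prepare a data collator

A data collator is not a data loader: it is the function the data loader calls to turn a list of examples into one padded batch. Here the labels are padded with -100 so that padding is ignored by the loss.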

In [18]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [19]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

## Prepare metrics calculator

In [20]:
import evaluate

metric = evaluate.load("seqeval")
metric

EvaluationModule(name: "seqeval", module_type: "metric", features: {'predictions': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence')}, usage: """
Produces labelling scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: List of List of predicted labels (Estimated targets as returned by a tagger)
    references: List of List of reference labels (Ground truth (correct) target values)
    suffix: True if the IOB prefix is after type, False otherwise. default: False
    scheme: Specify target tagging scheme. Should be one of ["IOB1", "IOB2", "IOE1", "IOE2", "IOBES", "BILOU"].
        default: None
    mode: Whether to count correct entity labels with incorrect I/B tags as true positives or not.
        If you want to only count exact matches, pass mode="strict". default: None.
    sample_weight: Array-like of sha

In [21]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

In [22]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}
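
The numbers above can be reproduced by hand. The sketch below is a simplified, hypothetical re-implementation of entity-level scoring for IOB2 tags. It is not seqeval's actual code, and `extract_entities` is a name made up for illustration. One of the two MISC entities is missed, so recall drops to 2/3 while precision stays at 1.0; accuracy is counted per token (8 of 9 tags match).

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a list of IOB2 tags."""
    entities = set()
    current, start = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing entity
        if tag.startswith("I-") and tag[2:] == current:
            continue  # still inside the current entity
        if current is not None:
            entities.add((current, start, i))
            current, start = None, None
        if tag.startswith("B-"):
            current, start = tag[2:], i
    return entities


references = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
predictions = references.copy()
predictions[2] = "O"  # same change as in the cell above

true_entities = extract_entities(references)
pred_entities = extract_entities(predictions)
correct = len(true_entities & pred_entities)

precision = correct / len(pred_entities)  # 2 predicted entities, both correct
recall = correct / len(true_entities)     # 2 of the 3 true entities found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)

print(precision, recall, round(f1, 4), round(accuracy, 4))
```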

In [23]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

## Map labels to numeric IDs

https://huggingface.co/learn/nlp-course/chapter7/2#defining-the-model

> Since we are working on a token classification problem, we will use
> the AutoModelForTokenClassification class. The main thing to remember
> when defining this model is to pass along some information on the number
> of labels we have. The easiest way to do this is to pass that number
> with the num_labels argument, but if we want a nice inference widget
> working like the one we saw at the beginning of this section, it’s
> better to set the correct label correspondences instead.

In [24]:
id2label = {i: label for i, label in enumerate(label_names)}
id2label

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

In [25]:
label2id = {v: k for k, v in id2label.items()}
label2id

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

In [26]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
model.config.num_labels

9

### Fine-tuning the model

In [32]:
# Check MPS support
# https://developer.apple.com/metal/pytorch/
import torch

if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)
    model.to(device=mps_device)
else:
    print("MPS device not found.")

tensor([1.], device='mps:0')


In [51]:
MODEL_DIR="../models/bert-finetuned-ner"

In [41]:
%env PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7

from transformers import Trainer, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir=MODEL_DIR,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    dataloader_num_workers=1,
    dataloader_pin_memory=True,
    use_cpu=True,   # To avoid MPS out-of-memory error on my mac.
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()

env: PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/5268 [03:21<?, ?it/s]
  0%|          | 0/5268 [00:00<?, ?it/s]
  9%|▉         | 500/5268 [25:30<3:33:56,  2.69s/it]

{'loss': 0.2678, 'learning_rate': 1.810174639331815e-05, 'epoch': 0.28}


 19%|█▉        | 1000/5268 [51:31<3:41:14,  3.11s/it]

{'loss': 0.1045, 'learning_rate': 1.6203492786636296e-05, 'epoch': 0.57}


 28%|██▊       | 1500/5268 [1:17:38<2:54:29,  2.78s/it]

{'loss': 0.0774, 'learning_rate': 1.4305239179954442e-05, 'epoch': 0.85}


 33%|███▎      | 1756/5268 [1:31:08<3:52:01,  3.96s/it]

 33%|███▎      | 1756/5268 [1:36:03<3:52:01,  3.96s/it]

{'eval_loss': 0.07170938700437546, 'eval_precision': 0.8937550689375506, 'eval_recall': 0.9272972063278357, 'eval_f1': 0.9102172297018254, 'eval_accuracy': 0.9804114911402837, 'eval_runtime': 294.7229, 'eval_samples_per_second': 11.027, 'eval_steps_per_second': 1.381, 'epoch': 1.0}


 38%|███▊      | 2000/5268 [1:49:07<2:31:48,  2.79s/it] 

{'loss': 0.0616, 'learning_rate': 1.240698557327259e-05, 'epoch': 1.14}


 47%|████▋     | 2500/5268 [2:15:01<2:08:24,  2.78s/it]

{'loss': 0.0428, 'learning_rate': 1.0508731966590738e-05, 'epoch': 1.42}


 57%|█████▋    | 3000/5268 [2:41:40<1:57:57,  3.12s/it]

{'loss': 0.0428, 'learning_rate': 8.610478359908885e-06, 'epoch': 1.71}


 66%|██████▋   | 3500/5268 [3:07:03<1:32:13,  3.13s/it]

{'loss': 0.0356, 'learning_rate': 6.712224753227031e-06, 'epoch': 1.99}


 67%|██████▋   | 3512/5268 [3:07:42<1:50:21,  3.77s/it]

 67%|██████▋   | 3512/5268 [3:12:18<1:50:21,  3.77s/it]

{'eval_loss': 0.06768015027046204, 'eval_precision': 0.9272847463229218, 'eval_recall': 0.9442948502187816, 'eval_f1': 0.9357124989577253, 'eval_accuracy': 0.9850473891799612, 'eval_runtime': 276.4468, 'eval_samples_per_second': 11.756, 'eval_steps_per_second': 1.472, 'epoch': 2.0}


 76%|███████▌  | 4000/5268 [3:37:43<1:06:54,  3.17s/it] 

{'loss': 0.0229, 'learning_rate': 4.8139711465451785e-06, 'epoch': 2.28}


 85%|████████▌ | 4500/5268 [6:46:24<2:20:02, 10.94s/it]   

{'loss': 0.0219, 'learning_rate': 2.9157175398633257e-06, 'epoch': 2.56}


 95%|█████████▍| 5000/5268 [9:02:42<26:50,  6.01s/it]    

{'loss': 0.0229, 'learning_rate': 1.0174639331814731e-06, 'epoch': 2.85}


100%|██████████| 5268/5268 [11:03:51<00:00,  6.99s/it]

100%|██████████| 5268/5268 [11:44:45<00:00,  6.99s/it]

{'eval_loss': 0.06072593480348587, 'eval_precision': 0.9331789612967251, 'eval_recall': 0.9495119488387749, 'eval_f1': 0.9412746079412746, 'eval_accuracy': 0.985783246011656, 'eval_runtime': 2453.792, 'eval_samples_per_second': 1.324, 'eval_steps_per_second': 0.166, 'epoch': 3.0}


100%|██████████| 5268/5268 [11:44:51<00:00,  8.03s/it]

{'train_runtime': 42291.0704, 'train_samples_per_second': 0.996, 'train_steps_per_second': 0.125, 'train_loss': 0.06733651892561394, 'epoch': 3.0}





TrainOutput(global_step=5268, training_loss=0.06733651892561394, metrics={'train_runtime': 42291.0704, 'train_samples_per_second': 0.996, 'train_steps_per_second': 0.125, 'train_loss': 0.06733651892561394, 'epoch': 3.0})

#### Use CPU instead of GPU \[Succeeded. Took ~12 hours\]

The training took ~12 hours, but it slowed down significantly after 4000 steps.
The machine might have gone into low-power mode.

At the pace of the first 4000 steps (13063 sec / 4000 steps = 3.27 sec/step),
this 5268-step training could have finished in ~5 hours.
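
The estimate can be checked with a few lines of arithmetic. This is only a sketch using figures read from the training log above. The elapsed time at step 4000 includes the two evaluation passes, so it is a wall-clock pace rather than a pure training pace.

```python
# Figures read from the training log above.
seconds_at_4000_steps = 3 * 3600 + 37 * 60 + 43  # progress bar showed 3:37:43
total_steps = 5268

seconds_per_step = seconds_at_4000_steps / 4000
estimated_hours = total_steps * seconds_per_step / 3600
actual_hours = 42291.0704 / 3600  # train_runtime reported at the end of the log

print(f"{seconds_per_step:.2f} sec/step")
print(f"estimated {estimated_hours:.1f} h, actual {actual_hours:.1f} h")
```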

Other ideas, suggested by GPT-4, follow:

#### Reducing batch size \[Failed\]

> Reducing the batch size can significantly decrease the amount of GPU memory required for training. However, this might increase the total training time as more iterations would be needed to go through the whole dataset.

Removing `use_cpu=True` and setting `per_device_train_batch_size=4` instead
raises the total to 10533 steps. The training was estimated to complete in 6+ hours;
however, it crashed after 2%.
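
The step counts follow from the dataset size. As a sketch, assuming the earlier run used the `TrainingArguments` default of `per_device_train_batch_size=8` on a single device with no gradient accumulation (an assumption, since the value is not set explicitly in the training cell; `total_steps` is a helper name made up for illustration):

```python
import math

train_examples = 14041  # rows in the CoNLL-2003 train split
epochs = 3


def total_steps(batch_size):
    # The last, smaller batch of each epoch still counts as one step.
    return math.ceil(train_examples / batch_size) * epochs


print(total_steps(8))  # 5268, matches the progress bar of the CPU run
print(total_steps(4))  # 10533, matches the run with batch size 4
```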

#### Gradient accumulation

> If reducing the batch size is not an option or doesn’t help, you can use gradient accumulation. This technique involves updating model parameters after several batches instead of after every batch. This allows you to effectively train with a larger batch size while using less memory.
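
I did not try this one. In `TrainingArguments` it is controlled by `gradient_accumulation_steps`. The toy example below is plain Python with a one-parameter linear model and made-up data, not the BERT model above. It is only a sketch of the idea: averaging the gradients of two micro-batches of 4 gives the same gradient as one batch of 8, while only 4 samples need to be processed at a time.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# One large batch of 8 samples.
full_grad = grad_mse(w, xs, ys)

# Two micro-batches of 4 samples; each gradient is scaled by
# 1 / accumulation_steps and summed before the (single) parameter update.
accumulation_steps = 2
micro_batch_size = 4
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated)
```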

#### Gradient checkpointing

> Gradient checkpointing is a technique that reduces GPU memory usage during backpropagation at the cost of longer training times. It can be helpful when training large models that don’t fit into GPU memory.
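
I did not try this one either. In `TrainingArguments` it is enabled with `gradient_checkpointing=True`. The toy example below is plain Python with scalar tanh "layers" and made-up coefficients, not the BERT model above. It is only a sketch of the trade-off: the checkpointed version keeps the input of every third layer and recomputes the others during the backward pass, and it arrives at the same gradient.

```python
import math


def make_layer(a):
    """A toy scalar 'layer' y = tanh(a * x) together with its derivative dy/dx."""
    def f(x):
        return math.tanh(a * x)

    def df(x):
        return a * (1.0 - math.tanh(a * x) ** 2)

    return f, df


layers = [make_layer(a) for a in (0.5, 1.5, 0.8, 1.2, 0.9, 1.1)]


def grad_store_all(x):
    """Ordinary backprop: keep the input of every layer from the forward pass."""
    inputs = []
    h = x
    for f, _ in layers:
        inputs.append(h)
        h = f(h)
    grad = 1.0
    for (_, df), inp in zip(reversed(layers), reversed(inputs)):
        grad *= df(inp)
    return grad


def grad_checkpointed(x, segment=3):
    """Keep only the input of every `segment`-th layer; recompute the rest."""
    checkpoints = {}
    h = x
    for i, (f, _) in enumerate(layers):
        if i % segment == 0:
            checkpoints[i] = h
        h = f(h)
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        block = layers[start:start + segment]
        # Recompute the inputs inside this block from its checkpoint.
        inputs = [checkpoints[start]]
        for f, _ in block[:-1]:
            inputs.append(f(inputs[-1]))
        for (_, df), inp in zip(reversed(block), reversed(inputs)):
            grad *= df(inp)
    return grad


g_all = grad_store_all(0.3)
g_ckpt = grad_checkpointed(0.3)
print(g_all, g_ckpt)
```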

#### Increase the memory limit \[Failed\]

> The error message suggests setting the environment variable PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable the upper limit for memory allocations. However, this may cause system failure if not enough memory is available.

#### Use mixed precision training (with fp16) \[Failed\]

> Mixed precision training can reduce memory usage and increase the training speed by using a mix of float16 and float32 data types during training.

Got an error:

```
ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA or NPU devices or certain XPU devices (with IPEX).
```

#### Use mixed precision training (with bf16 + CPU) \[Cancelled. Estimated 9+ days\]

Other ideas:

#### DeepSpeed (ZeRO 2) \[Failed\]

Couldn't get it working.

https://huggingface.co/docs/transformers/main_classes/deepspeed#deployment-in-notebooks

GPT-4 says that "It is primarily designed to work with Linux-based systems and relies
on CUDA for GPU acceleration."



## Using the fine-tuned model

https://huggingface.co/learn/nlp-course/chapter7/2#using-the-fine-tuned-model

In [52]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the final checkpoint (step 5268) saved by the Trainer.
checkpoint_dir = f"{MODEL_DIR}/checkpoint-5268"
model = AutoModelForTokenClassification.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

token_classifier = pipeline(
    "token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple"
)

In [53]:
token_classifier("My name is Ryuichi and I work at Foobar in Melbourne.")

[{'entity_group': 'PER',
  'score': 0.99590284,
  'word': 'Ryuichi',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.99173456,
  'word': 'Foobar',
  'start': 33,
  'end': 39},
 {'entity_group': 'LOC',
  'score': 0.9972023,
  'word': 'Melbourne',
  'start': 43,
  'end': 52}]