# Token Classification
This covers any problem that can be formulated as "attributing a label to each token in a sentence", e.g.:
- Named Entity Recognition (NER)
- Part-of-speech tagging (POS)
- Chunking: finding the tokens that belong to the same entity

In [1]:
# import os
# import sys


# # Connect to google drive
# from google.colab import drive
# os.chdir("/content")
# drive.mount("/content/gdrive")

# # Load colab_utils functions
# sys.path.append(f"/content/gdrive/MyDrive/repos/colab-utils")
# import colab_utils

# colab_utils.load_env_vars()
# colab_utils.git_set_config()

# PARENT_FOLDER = "/content/gdrive/MyDrive/repos"
# os.chdir(PARENT_FOLDER)

# git_repo = 'trevorki/huggingface-nlp' # replace with actual values
# colab_utils.git_clone_repo(git_repo)

# REPO_FOLDER = f"{PARENT_FOLDER}/{git_repo.split('/')[1]}"
# os.chdir(REPO_FOLDER)

# !pip install -r requirements.txt

In [2]:
## NOT NEEDED, I THINK
# # To use Huggingface hub on google colab
# !git lfs install
# !git lfs track "*.psd"
# !git add .gitattributes

## Dataset
We use the CoNLL-2003 dataset (`conll2003`), which contains news stories from Reuters. It has labels for the three tasks mentioned above: NER, POS tagging, and chunking.

A big difference from other datasets is that the input texts are not presented as sentences or documents, but as lists of words. The column is called `tokens`, but it really contains words: these are pre-tokenized inputs that still need to go through the tokenizer for subword tokenization.



In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [4]:
# these are not tokens, but lists of words
raw_datasets["train"][0]["tokens"]

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

### NER labels

In [5]:
raw_datasets["train"][0]["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [6]:
ner_feature = raw_datasets["train"].features["ner_tags"]
label_names = ner_feature.feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [7]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 
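
The `B-`/`I-` prefixes follow the BIO convention: `B-XXX` marks the first word of an entity of type `XXX`, `I-XXX` marks a word that continues it, and `O` marks a word outside any entity. As a minimal sketch of how such tags can be read back into entities, the words and labels printed above can be grouped as follows. The helper `bio_to_spans` is a hypothetical name introduced for illustration; it is not part of `datasets` or `transformers`.

```python
def bio_to_spans(words, tags):
    """Group BIO-tagged words into (entity_type, text) pairs."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        continues = tag.startswith("I-") and tag[2:] == current_type
        if continues:
            # Same entity as the previous word
            current_words.append(word)
            continue
        # Anything else closes the entity in progress, if there is one
        if current_words:
            spans.append((current_type, " ".join(current_words)))
        if tag == "O":
            current_type, current_words = None, []
        else:
            # B-XXX (or a stray I-XXX, treated here as a start) opens an entity
            current_type, current_words = tag[2:], [word]
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
spans = bio_to_spans(words, tags)
print(spans)
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```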


### POS Labels

In [8]:
# raw_datasets["train"][0]["pos_tags"]

In [9]:
# pos_feature = raw_datasets["train"].features["pos_tags"]
# pos_names = pos_feature.feature.names
# print(pos_names)

In [10]:
# words = raw_datasets["train"][0]["tokens"]
# labels = raw_datasets["train"][0]["pos_tags"]
# line1 = ""
# line2 = ""
# for word, label in zip(words, labels):
#     full_label = pos_names[label]
#     max_length = max(len(word), len(full_label))
#     line1 += word + " " * (max_length - len(word) + 1)
#     line2 += full_label + " " * (max_length - len(full_label) + 1)

# print(line1)
# print(line2)

### Chunking labels

In [11]:
# raw_datasets["train"][0]["chunk_tags"]

In [12]:
# chunk_feature = raw_datasets["train"].features["chunk_tags"]
# chunk_names = chunk_feature.feature.names
# print(chunk_names)

In [13]:
# words = raw_datasets["train"][0]["tokens"]
# labels = raw_datasets["train"][0]["chunk_tags"]
# line1 = ""
# line2 = ""
# for word, label in zip(words, labels):
#     full_label = chunk_names[label]
#     max_length = max(len(word), len(full_label))
#     line1 += word + " " * (max_length - len(word) + 1)
#     line2 += full_label + " " * (max_length - len(full_label) + 1)

# print(line1)
# print(line2)

# Tokenize
Since the tokenizer splits some words into multiple tokens, we have to keep track of the `word_ids` so that we can map each token back to its word later.

In [14]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [15]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
print(f"tokens: {inputs.tokens()}")
print(f"word_ids: {inputs.word_ids()}")

tokens: ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
word_ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
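
To illustrate what the `word_ids` give us, here is a small sketch that regroups the tokens printed above into their original words. It hardcodes the output of the previous cell instead of calling the tokenizer, and it assumes WordPiece-style `##` continuation markers, as seen in the `'la', '##mb'` split above.

```python
from itertools import groupby

# Copied from the output of the previous cell
tokens = ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott',
          'British', 'la', '##mb', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

rebuilt_words = []
for word_id, group in groupby(zip(word_ids, tokens), key=lambda pair: pair[0]):
    if word_id is None:
        # Special tokens such as [CLS] and [SEP] belong to no word
        continue
    pieces = [token for _, token in group]
    # Strip the "##" marker from continuation pieces before joining
    rebuilt_words.append(
        pieces[0] + "".join(p[2:] if p.startswith("##") else p for p in pieces[1:])
    )

print(rebuilt_words)
# ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
```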


Since the labels don't account for the special tokens or for the words split into multiple tokens, we must align the labels with the tokenized inputs.

In [16]:
print(f"{len(raw_datasets['train'][0]['ner_tags'])} dataset labels:\n\t{raw_datasets['train'][0]['ner_tags']}")
print(f"{len(inputs.tokens())} tokenized inputs:\n\t{inputs.tokens()}")

9 dataset labels:
	[3, 0, 7, 0, 0, 0, 7, 0, 0]
12 tokenized inputs:
	['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']


In [17]:
def align_labels_with_tokens(labels, word_ids):
    """Adds items to labels to make them match the length of word_ids, by
    - repeating the label for tokens that were split from the same word (a B-XXX label becomes I-XXX on continuation tokens)
    - giving a label of -100 to all special tokens (so that they are ignored by the cross-entropy loss)
    RETURNS:
        list: the new labels"""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1 # this changes it from B to I
            new_labels.append(label)

    return new_labels

In [18]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
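
In the example above the only split word (`lamb`) has label `O`, so the B-to-I step is never exercised. The sketch below uses a condensed restatement of `align_labels_with_tokens` (intended to behave the same, written compactly so the block runs on its own) together with a made-up `word_ids` list in which the first word is split into two tokens. That split is hypothetical and was not produced by the real tokenizer. The `label % 2 == 1` test relies on the label ordering shown earlier, where every `B-XXX` id is odd and the matching `I-XXX` id is the next integer.

```python
def align_labels_with_tokens(labels, word_ids):
    """Condensed restatement of the alignment function defined above."""
    new_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: ignored by the loss
            new_labels.append(-100)
        elif word_id != previous_word:
            # First token of a word keeps the word's label
            new_labels.append(labels[word_id])
        else:
            # Continuation token: B-XXX (odd id) becomes I-XXX (next id)
            label = labels[word_id]
            new_labels.append(label + 1 if label % 2 == 1 else label)
        previous_word = word_id
    return new_labels


# Three words labelled B-PER, I-PER, O
example_labels = [1, 2, 0]
# Hypothetical tokenization: word 0 is split into two tokens
example_word_ids = [None, 0, 0, 1, 2, None]

aligned = align_labels_with_tokens(example_labels, example_word_ids)
print(aligned)
# [-100, 1, 2, 2, 0, -100]
```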


In [19]:
# Tokenize and align labels for whole dataset (We will pad it later)
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"],
                                 truncation=True,
                                 is_split_into_words=True)
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [20]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Collating data
We can't use `DataCollatorWithPadding` because it only pads the inputs, and here the labels must be padded too. Instead we use `DataCollatorForTokenClassification`, which takes the tokenizer as a parameter and pads the labels with -100 so that padded positions are ignored by the loss.

In [21]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer,
                                                   return_tensors="tf")




In [22]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"].numpy()

array([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0,
        -100],
       [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100,
        -100]], dtype=int64)

In [23]:
# Compare to just the labels
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]
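
The effect on the labels can be reproduced with a few lines of plain Python. This is only a sketch of the padding idea, not the collator's actual implementation, and `pad_labels` is a hypothetical helper name. It pads every label list on the right to the length of the longest one in the batch.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad each label list to the longest length in the batch."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


# The two unpadded label lists printed above
unpadded = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
]
padded = pad_labels(unpadded)
for row in padded:
    print(row)
```

The second row comes out as `[-100, 1, 2]` followed by nine `-100` values, matching the array returned by the collator.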


# Build TF Dataset

In [24]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=16,
)

tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels", "token_type_ids"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

# The Model

In [25]:
model_name = "bert-finetuned-ner"
hf_user = "Roverto"
model_folder = f"models/{model_name}"
hf_repo = f"{hf_user}/{model_name}"

In [27]:
from transformers import TFAutoModelForTokenClassification

# translation dictionaries
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

id2label

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

## Train model
Do this in Colab, then save the result to the HF Hub.

In [28]:
# from transformers import create_optimizer
# import tensorflow as tf

# model = TFAutoModelForTokenClassification.from_pretrained(
#     model_checkpoint,
#     id2label=id2label,
#     label2id=label2id,
# )

# # The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# # by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# # not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
# num_epochs = 3
# num_train_steps = len(tf_train_dataset) * num_epochs

# optimizer, schedule = create_optimizer(
#     init_lr=2e-5,
#     num_warmup_steps=0,
#     num_train_steps=num_train_steps,
#     weight_decay_rate=0.01,
# )
# model.compile(optimizer=optimizer)
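
As a quick check on the step arithmetic, the sketch below uses the 14,041 training rows shown earlier with the batch size of 16 and 3 epochs. Whether the final partial batch is kept depends on the `drop_remainder` setting of `to_tf_dataset`, which this notebook does not set explicitly, so both cases are shown.

```python
import math

num_samples = 14041  # rows in the train split, from the DatasetDict output above
batch_size = 16
num_epochs = 3

# Batches per epoch if the last partial batch is dropped vs. kept
batches_dropped = num_samples // batch_size
batches_kept = math.ceil(num_samples / batch_size)

steps_dropped = batches_dropped * num_epochs
steps_kept = batches_kept * num_epochs

print(batches_dropped, steps_dropped)  # 877 2631
print(batches_kept, steps_kept)        # 878 2634
```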

Note that we don't supply a `loss` argument to `compile()`. This is because the models can compute the loss internally: if you compile without a loss and supply your labels in the input dictionary (as we do in our datasets), the model will train using that internal loss, which is appropriate for the task and model type you have chosen.
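
To show why the -100 labels matter for the loss, here is a rough pure-Python sketch of a token-level cross-entropy that skips ignored positions. It illustrates the general idea only. The exact computation inside `transformers` is not shown in this notebook and may differ in details such as the reduction, and `masked_cross_entropy` is a hypothetical name.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over tokens whose label is not ignore_index."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            # Special or padding token: contributes nothing to the loss
            continue
        # Numerically stable log-sum-exp of the logits
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        # Negative log-probability of the correct class
        total += log_norm - token_logits[label]
        count += 1
    return total / count if count else 0.0


# Three tokens, two classes; only the middle token has a real label
toy_logits = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
toy_labels = [-100, 0, -100]
loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(loss, 4))
```

Changing the logits of the first or last token leaves the result unchanged, because those positions are masked out.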

In [29]:
# ## If you want to push after every epoch
# # from transformers.keras_callbacks import PushToHubCallback
# # callback = PushToHubCallback(output_dir="bert-finetuned-ner", tokenizer=tokenizer)

# model.fit(
#     tf_train_dataset,
#     validation_data=tf_eval_dataset,
#     # callbacks=[callback],
#     epochs=num_epochs,
#     verbose=1
# )

#### Save to Hugging Face Hub

In [30]:
# from huggingface_hub import HfApi

# # Create repo if it doesn't exist
# from huggingface_hub import create_repo
# create_repo(f"{model_name}", token = os.environ["HF_TOKEN"])

# # save model and tokenizer to local folder
# model.save_pretrained(model_folder)
# tokenizer.save_pretrained(model_folder)

# upload folder to Huggingface Hub
# api = HfApi()
# api.upload_folder(
#     folder_path=model_folder,
#     repo_id=hf_repo,
#     repo_type="model"
# )

## Download trained model from HF Hub (if already trained)

In [35]:
hf_repo

'Roverto/bert-finetuned-ner'

In [39]:
# # ############### download folder from Huggingface Hub to cache
# from huggingface_hub import snapshot_download
# snapshot_download(repo_id=hf_repo)

# Load model from cache
from transformers import AutoTokenizer, TFAutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained(hf_repo)
# The label mappings belong to the model config, not the tokenizer
model = TFAutoModelForTokenClassification.from_pretrained(
    hf_repo,
    id2label=id2label,
    label2id=label2id,
)

Some layers from the model checkpoint at Roverto/bert-finetuned-ner were not used when initializing TFBertForTokenClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at Roverto/bert-finetuned-ner.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


# Evaluate the model

In [40]:
import evaluate
metric = evaluate.load("seqeval")

In [41]:
# demo on training data item
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]

predictions = labels.copy()
predictions[2] = "O"

print(f"labels     : {labels}\npredictions: {predictions}")
metric.compute(predictions=[predictions], references=[labels])

labels     : ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predictions: ['B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


{'MISC': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}
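
These numbers are entity-level, not token-level: an entity counts as correct only if its type and boundaries both match. The sketch below recomputes the overall scores for this demo in plain Python to show where 1.0, 0.667 and 0.8 come from. It is an illustration under simple assumptions (stray `I-` tags are treated as entity starts), not a reimplementation of `seqeval`, and `extract_entities` is a hypothetical helper name.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from BIO tags (end exclusive)."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last entity
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if inside:
            continue
        if ent_type is not None:
            entities.add((ent_type, start, i))
        if tag == "O":
            start, ent_type = None, None
        else:
            start, ent_type = i, tag[2:]
    return entities


labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
predictions = ['B-ORG', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

true_entities = extract_entities(labels)
pred_entities = extract_entities(predictions)
correct = len(true_entities & pred_entities)

precision = correct / len(pred_entities)  # 2 of 2 predicted entities are right
recall = correct / len(true_entities)     # 2 of 3 true entities were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
# 1.0 0.667 0.8 0.889
```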

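**Problem**: TensorFlow's `model.predict()` isn't going to work, because the inputs have varying lengths: each batch is padded only to the length of its longest sequence, so the per-batch outputs cannot be concatenated into a single array.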

**Solution**: get predictions one batch at a time and concatenate them into one big long list as we go, dropping the -100 tokens that indicate masking/padding, then compute metrics on the list at the end.

In [42]:
import numpy as np

all_predictions = []
all_labels = []
for batch in tf_eval_dataset:
    logits = model.predict_on_batch(batch)["logits"]
    labels = batch["labels"]
    predictions = np.argmax(logits, axis=-1)
    for prediction, label in zip(predictions, labels):
        for predicted_idx, label_idx in zip(prediction, label):
            if label_idx == -100:
                continue
            all_predictions.append(label_names[predicted_idx])
            all_labels.append(label_names[label_idx])
metric.compute(predictions=[all_predictions], references=[all_labels])

{'LOC': {'precision': 0.9562162162162162,
  'recall': 0.9629831246597713,
  'f1': 0.9595877407106047,
  'number': 1837},
 'MISC': {'precision': 0.8470111448834853,
  'recall': 0.9067245119305857,
  'f1': 0.8758512310110006,
  'number': 922},
 'ORG': {'precision': 0.8948137326515705,
  'recall': 0.9134973900074571,
  'f1': 0.9040590405904059,
  'number': 1341},
 'PER': {'precision': 0.9517753047164812,
  'recall': 0.9750271444082519,
  'f1': 0.9632609278626978,
  'number': 1842},
 'overall_precision': 0.9233546692926309,
 'overall_recall': 0.9468192527768429,
 'overall_f1': 0.9349397590361446,
 'overall_accuracy': 0.986210042974039}

# Use the model
The easiest way is to use it in a `pipeline`.

In [46]:
from transformers import pipeline

token_classifier = pipeline(
    "token-classification", 
    model=hf_repo, 
    aggregation_strategy="simple"
)
result = token_classifier("My name is Trevor and I work at MyButt in New Westminster.")
print(result)

Some layers from the model checkpoint at Roverto/bert-finetuned-ner were not used when initializing TFBertForTokenClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at Roverto/bert-finetuned-ner.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


[{'entity_group': 'PER', 'score': 0.98792493, 'word': 'Trevor', 'start': 11, 'end': 17}, {'entity_group': 'ORG', 'score': 0.967181, 'word': 'MyButt', 'start': 32, 'end': 38}, {'entity_group': 'LOC', 'score': 0.99493295, 'word': 'New Westminster', 'start': 42, 'end': 57}]
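
The `start` and `end` values in this output are character offsets into the input string, so each entity can be recovered by slicing the original text. The sketch below hardcodes the relevant fields from the result printed above (scores omitted) rather than re-running the pipeline.

```python
text = "My name is Trevor and I work at MyButt in New Westminster."

# Fields copied from the pipeline output above
entities = [
    {"entity_group": "PER", "word": "Trevor", "start": 11, "end": 17},
    {"entity_group": "ORG", "word": "MyButt", "start": 32, "end": 38},
    {"entity_group": "LOC", "word": "New Westminster", "start": 42, "end": 57},
]

# Slice the original text with each entity's character offsets
sliced = [text[entity["start"]:entity["end"]] for entity in entities]
for entity, span in zip(entities, sliced):
    print(f"{entity['entity_group']}: {span!r}")
```

Slicing by offsets is handy when the original casing and spacing of the text must be preserved, since the `word` field is reconstructed from tokens.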
