# Introduction

A peaceful greeting to you. Welcome to Honest Qiam. Today, Qiam will show you how to train a Named Entity Recognition (NER) pipeline to extract street names and points of interest (POI, e.g. a building's name, a nearby construction...) from raw Indonesian addresses. This was originally the data science challenge of Shopee Code League 2021, which took place in April and May 2021. Qiam and his friends took part in the competition, and this post is an effort to replicate the result they achieved.

Huge disclaimer: this is not the code that was used in the competition. At that time, this one was not familiar with NLP, so the code was extremely messy and hard to interpret. Now, Qiam has decided to re-challenge himself by working on the full pipeline using HuggingFace's `transformers` library.

To begin, let's install the dependencies.

In [None]:
!pip install transformers datasets sentencepiece seqeval

In [4]:
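from google.colab import drive
import os

# Mount Google Drive and create a shorter alias for the dataset folder
drive.mount('/content/drive')
if not os.path.lexists('/content/data'):  # avoid FileExistsError when the cell is re-run
    os.symlink('/content/drive/My Drive/khang/datasets', '/content/data')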

Mounted at /content/drive


# Data preprocessing

Alright, first things first: let's take a look at the dataset we are given. We will use the `DatasetDict` class of the `datasets` library. Since we only have 2 files, one for training and one for submission, we will split the training data into a train set and a validation set with a ratio of 90/10.


In [5]:
from datasets import load_dataset

raw_data = load_dataset('csv', data_files='data/shopee_train.csv')
raw_data = raw_data['train'].train_test_split(test_size=.1)
raw_data

DatasetDict({
    train: Dataset({
        features: ['id', 'raw_address', 'POI/street'],
        num_rows: 270000
    })
    test: Dataset({
        features: ['id', 'raw_address', 'POI/street'],
        num_rows: 30000
    })
})

In [None]:
raw_train = raw_data['train']
raw_test = raw_data['test']

The dataset contains 3 columns: **id**, **raw_address** and **POI/street**. To make it suitable for our training pipeline, we need to do the following:

1. Clean the **raw_address** field (strip it and normalise the spacing around punctuation) and split it into tokens.

2. Split the **POI/street** field into 2 separate columns: **POI** and **STR**.

3. Tag the corresponding tokens as POI and STR using the IOB format and save them as **labels**.
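To make the target of these three steps concrete, here is a minimal sketch on a made-up record. The address, the `find_span` and `iob_tags` helpers, and the contiguous-match strategy are all assumptions for illustration; the notebook's own `label_ner` uses a simpler per-token membership test.

```python
def find_span(tokens, target):
    """Return the start index of `target` inside `tokens`, or -1 if absent."""
    n = len(target)
    if n == 0:
        return -1
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target:
            return start
    return -1

def iob_tags(tokens, poi, street):
    """Tag the first contiguous occurrence of each entity with B-/I- labels."""
    tags = ['O'] * len(tokens)
    for name, span in (('POI', poi), ('STR', street)):
        start = find_span(tokens, span)
        if start < 0:
            continue  # entity text not found among the tokens
        tags[start] = f'B-{name}'
        for i in range(start + 1, start + len(span)):
            tags[i] = f'I-{name}'
    return tags

# Hypothetical record in the same shape as the training CSV
record = {'raw_address': 'toko indah jaya raya kupang no 32',
          'POI/street': 'toko indah jaya/raya kupang'}
tokens = record['raw_address'].split()
poi, street = (part.split() for part in record['POI/street'].split('/'))
tags = iob_tags(tokens, poi, street)
for token, tag in zip(tokens, tags):
    print(f'{token:8} {tag}')
```

The first token of each entity gets a `B-` tag, the remaining ones get `I-`, and everything else stays `O`.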

In [8]:
import re

def process(s):
    res = re.sub(r'(\w)(\()(\w)', r'\g<1> \g<2>\g<3>', s)
    res = re.sub(r'(\w)([),.:;]+)(\w)', r'\g<1>\g<2> \g<3>', res)
    res = re.sub(r'(\w)(\.\()(\w)', r'\g<1>. (\g<3>', res)
    res = re.sub(r'\s+', ' ', res)
    res = res.strip()
    return res

def add_token_column(example):
    return {
        'raw_address': [item.strip() for item in example['raw_address']],
        'tokens': [process(item).split() for item in example['raw_address']]
        }

def clean(example):
    return {
        'POI': [process(item.split('/')[0]).split() for item in example['POI/street']],
        'STR': [process(item.split('/')[1]).split() for item in example['POI/street']],
        'labels': [['O']*len(item) for item in example['tokens']]
        }
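As a quick illustration of what the cleaning step does, the sketch below repeats the same substitutions (written here with raw strings) and runs them on two made-up addresses. The sample strings are assumptions, not rows from the dataset.

```python
import re

def process(s):
    # Insert a space before "(" when it is glued between two word characters
    res = re.sub(r'(\w)(\()(\w)', r'\g<1> \g<2>\g<3>', s)
    # Insert a space after punctuation glued between two word characters
    res = re.sub(r'(\w)([),.:;]+)(\w)', r'\g<1>\g<2> \g<3>', res)
    # Separate ".(" glued between two word characters
    res = re.sub(r'(\w)(\.\()(\w)', r'\g<1>. (\g<3>', res)
    # Collapse runs of whitespace and trim both ends
    res = re.sub(r'\s+', ' ', res)
    return res.strip()

samples = ['jl.sudirman no.5', 'toko(baru)  indah ']   # made-up inputs
cleaned = [process(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', after.split())
```

Note that punctuation is kept and only the spacing changes, so a token such as `jl.` still carries its dot.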

In [None]:
raw_train = raw_train.map(add_token_column, batched=True)
raw_train = raw_train.map(clean, batched=True)

raw_test = raw_test.map(add_token_column, batched=True)
raw_test = raw_test.map(clean, batched=True)

To use our custom labels with our tokenizer and model, we need to define the following two dicts. Yes, we need **both** of them. They will come in handy later on, Qiam promises you.

In [None]:
label_list = ['O', 'B-POI', 'I-POI', 'B-STR', 'I-STR']
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {v: k for k, v in id2label.items()}
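As a small illustration of why both directions are useful: `label2id` encodes tag names into the integers the model trains on, and `id2label` decodes predictions back into readable tags. The tag sequence below is made up.

```python
label_list = ['O', 'B-POI', 'I-POI', 'B-STR', 'I-STR']
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}

tags = ['B-POI', 'I-POI', 'O', 'B-STR', 'I-STR']   # made-up example
encoded = [label2id[tag] for tag in tags]          # names -> ids
decoded = [id2label[i] for i in encoded]           # ids -> names
print(encoded)
print(decoded == tags)

# With this ordering every B- tag has an odd id and its I- tag is the next id
b_ids_are_odd = all(label2id[l] % 2 == 1 for l in label_list if l.startswith('B-'))
print(b_ids_are_odd)
```

The last check matters because the alignment function further down turns a `B-` id into the matching `I-` id by adding 1.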

Next, we tag each token and convert the tag names to their integer ids, which we store in a new column named `ner_tags`.

In [None]:
def label_ner(example):
    tokens = example['tokens']
    labels = example['labels']
    found_poi, found_str = False, False
    for idx in range(len(tokens)):
        if tokens[idx] in example['POI']:
            if not found_poi:
                labels[idx] = 'B-POI'
                found_poi = True
            else:
                labels[idx] = 'I-POI'
        if tokens[idx] in example['STR']:
            if not found_str:
                labels[idx] = 'B-STR'
                found_str = True
            else:
                labels[idx] = 'I-STR'
    return {
        'labels': labels,
        'ner_tags': [label2id[label] for label in labels]
    }

In [None]:
raw_train = raw_train.map(label_ner)
raw_test = raw_test.map(label_ner)

Let's print out an example to see our results.

In [None]:
words = raw_train[0]["tokens"]
labels = raw_train[0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_list[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

trust jaya, raya  kup   indah, no 32 rw 6 
O     O     B-STR I-STR O      O  O  O  O 


Phew, so we have done the first part of the preprocessing. Wait what? First part? We are not done yet? That's right. Though our data looks pretty neat for now, it is not yet suitable for our tokenizer. There is a tiny step to do before we proceed to the next part.

First, let's load the pre-trained tokenizer from the [cahya/xlm-roberta-base-indonesian-NER](https://huggingface.co/cahya/xlm-roberta-base-indonesian-NER) checkpoint.

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = 'cahya/xlm-roberta-base-indonesian-NER'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
# sanity check
tokenizer.is_fast

True

Now here is the problem. Our tokenizer will split our tokens into subwords (you can learn more about subword embedding [here](https://d2l.ai/chapter_natural-language-processing-pretraining/subword-embedding.html)). Thus, our tokenized input will be longer than our list of labels. That's why we have to write a function to align the labels with the tokenized address.

Another thing to note is that the tokenizer will automatically add 2 special tokens to the beginning and the end of the input sentence: `<s>` and `</s>`. We need to mask them with label = -100 so the trainer will skip them in the training process.
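The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why positions labelled -100 do not contribute to training. The plain-Python sketch below mimics that behaviour on made-up probabilities; the `masked_nll` helper is hypothetical and only illustrates the idea, it is not the library's implementation.

```python
import math

IGNORE_INDEX = -100

def masked_nll(probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0

# Made-up class probabilities for 4 positions: <s>, two real tokens, </s>
probs = [
    [0.2, 0.8],   # <s>   (ignored)
    [0.9, 0.1],   # real token, label 0
    [0.3, 0.7],   # real token, label 1
    [0.5, 0.5],   # </s>  (ignored)
]
labels = [-100, 0, 1, -100]

loss = masked_nll(probs, labels)
print(round(loss, 4))
```

Only the two real tokens enter the average, so changing the probabilities at the `<s>` or `</s>` positions leaves the loss untouched.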

In [None]:
inputs = tokenizer(raw_train[15]['tokens'], is_split_into_words=True)
print(raw_train[15]['raw_address'])
print(inputs.tokens())

jl. tem pelab marta baru 25 mantuil rt 17 banjarmasin selatan
['<s>', '▁', 'jl', '.', '▁tem', '▁pela', 'b', '▁marta', '▁baru', '▁25', '▁man', 'tu', 'il', '▁', 'rt', '▁17', '▁ban', 'jar', 'mas', 'in', '▁selatan', '</s>']


In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

Let's check our new alignment function. Looks neat enough for Qiam!

In [None]:
labels = raw_train[15]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0]
[-100, 3, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
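The same behaviour can be reproduced without a tokenizer. The sketch below is a compact restatement of the function above (named `align` here for illustration), run on hand-written `word_ids` for a made-up three-word input. It assumes the label ids defined earlier, where odd ids are `B-` tags and the following even id is the matching `I-` tag.

```python
def align(labels, word_ids):
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)              # special token, ignored by the loss
        elif word_id != previous:
            aligned.append(labels[word_id])   # first subword keeps the word's label
        else:
            label = labels[word_id]
            # continuation subword: B-XXX (odd id) becomes I-XXX (next even id)
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return aligned

word_labels = [3, 4, 0]                       # B-STR, I-STR, O
word_ids = [None, 0, 0, 1, 2, 2, None]        # hand-written: word 0 and word 2 are split in two
aligned = align(word_labels, word_ids)
print(aligned)   # [-100, 3, 4, 4, 0, 0, -100]
```

The second subword of the first word becomes `I-STR` (4) instead of repeating `B-STR` (3), so each entity still starts with exactly one `B-` tag.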


It's time to bring the alignment function above to the `map` method via another function. This function will align every element in the dataset we are calling it on.

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_train = raw_train.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_train.column_names
)

tokenized_test = raw_test.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_test.column_names
)

# Fine-tuning the XLM Roberta model

Finally, it's time to put our preprocessed data to use. We will fine-tune the pre-trained model from the same checkpoint as the tokenizer above.

## Data collator and metrics

First, let's define the data collator to pass to HuggingFace's `Trainer` API. We also define the metric using the [Seqeval](https://github.com/chakki-works/seqeval) framework. Seqeval provides a nice evaluation method (precision, recall, F1 score and accuracy) for sequence labelling tasks (e.g. NER, POS tagging...).

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
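Because addresses tokenize to different lengths, each batch has to be padded. `DataCollatorForTokenClassification` pads the inputs with the tokenizer's padding token and pads `labels` with -100 so that the padding is ignored by the loss. The plain-Python sketch below mimics that on made-up features; the `pad_batch` helper, the token ids and the pad id of 1 are assumptions for illustration.

```python
def pad_batch(features, pad_token_id=1, label_pad_id=-100):
    """Right-pad every feature to the longest sequence in the batch."""
    max_len = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for f in features:
        n_pad = max_len - len(f['input_ids'])
        batch['input_ids'].append(f['input_ids'] + [pad_token_id] * n_pad)
        batch['attention_mask'].append([1] * len(f['input_ids']) + [0] * n_pad)
        batch['labels'].append(f['labels'] + [label_pad_id] * n_pad)
    return batch

# Two made-up tokenized addresses of different lengths
features = [
    {'input_ids': [0, 11, 12, 13, 2], 'labels': [-100, 3, 4, 0, -100]},
    {'input_ids': [0, 21, 2], 'labels': [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch['input_ids'])
print(batch['attention_mask'])
print(batch['labels'])
```

Padding per batch, rather than to one global maximum length, keeps short batches small.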

In [None]:
from datasets import load_metric
import numpy as np

metric = load_metric("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

Downloading:   0%|          | 0.00/2.48k [00:00<?, ?B/s]
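Seqeval computes precision, recall and F1 on whole entities rather than on single tokens, which is why F1 can sit well below token accuracy. The simplified sketch below is not seqeval's implementation: it extracts `(type, start, end)` spans from IOB tags and counts a prediction as correct only when the whole span matches. The helpers and the tag sequences are made up for illustration.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a list of IOB tags."""
    entities, start, kind = [], None, None
    for i, tag in enumerate(tags + ['O']):      # sentinel closes a trailing span
        continues = tag.startswith('I-') and kind == tag[2:]
        if kind is not None and not continues:
            entities.append((kind, start, i))   # close the open span
            start, kind = None, None
        if tag.startswith('B-'):
            start, kind = i, tag[2:]            # open a new span
    return set(entities)

def entity_f1(true_tags, pred_tags):
    true, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(true & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

true_tags = ['B-POI', 'I-POI', 'O', 'B-STR', 'I-STR']
pred_tags = ['B-POI', 'I-POI', 'O', 'B-STR', 'O']     # street span cut short
token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
precision, recall, f1 = entity_f1(true_tags, pred_tags)
print(token_accuracy, precision, recall, f1)
```

One wrong token out of five still gives 80% token accuracy, yet it invalidates the whole street entity, so the entity-level scores drop to 0.5.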

## Training

Now, all we need to do is load the pre-trained model and specify some training arguments, such as the number of epochs and the initial learning rate. Then, we simply call the `train` method on the `Trainer` and the rest will be taken care of for us.

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True  # set to True to use custom labels
)

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at cahya/xlm-roberta-base-indonesian-NER and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([39, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([39]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# double check the number of labels
model.config.num_labels

5

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [None]:
!sudo apt-get install git-lfs

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "shopee-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


/content/shopee-ner is already a clone of https://huggingface.co/vkhangpham/shopee-ner. Make sure you pull the latest changes with `repo.git_pull()`.
***** Running training *****
  Num examples = 270000
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 67500


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.2282,0.217398,0.744304,0.850633,0.793924,0.92528
2,0.1983,0.20462,0.766561,0.866585,0.81351,0.931987


***** Running Evaluation *****
  Num examples = 30000
  Batch size = 8
Saving model checkpoint to shopee-ner/checkpoint-33750
Configuration saved in shopee-ner/checkpoint-33750/config.json
Model weights saved in shopee-ner/checkpoint-33750/pytorch_model.bin
tokenizer config file saved in shopee-ner/checkpoint-33750/tokenizer_config.json
Special tokens file saved in shopee-ner/checkpoint-33750/special_tokens_map.json
tokenizer config file saved in shopee-ner/tokenizer_config.json
Special tokens file saved in shopee-ner/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 30000
  Batch size = 8
Saving model checkpoint to shopee-ner/checkpoint-67500
Configuration saved in shopee-ner/checkpoint-67500/config.json
Model weights saved in shopee-ner/checkpoint-67500/pytorch_model.bin
tokenizer config file saved in shopee-ner/checkpoint-67500/tokenizer_config.json
Special tokens file saved in shopee-ner/checkpoint-67500/special_tokens_map.json


Training completed. Do not for

TrainOutput(global_step=67500, training_loss=0.22687468013057002, metrics={'train_runtime': 7092.0113, 'train_samples_per_second': 76.142, 'train_steps_per_second': 9.518, 'total_flos': 6258802744998960.0, 'train_loss': 0.22687468013057002, 'epoch': 2.0})

Hooray, the training has been completed. It took this one about 2 hours to train for 2 epochs on a Tesla P100 on Google Colab. Let's look at the performance of our model. It achieves an accuracy of 93% with an F1 score of 0.81. This is not too bad, since the dataset we began with is obviously quite "raw" and needs more cleaning steps. According to the host of the competition, some of the labels overlap between POI and street, and some are even abbreviated (meaning that some label words do not appear among the address tokens).
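The numbers in the training log can be cross-checked with a little arithmetic. This sketch only reuses values printed above and assumes a single device with no gradient accumulation, as the log reports.

```python
import math

num_examples = 270_000        # 'Num examples' in the training log
batch_size = 8                # 'Instantaneous batch size per device'
num_epochs = 2
train_runtime_s = 7092.0113   # 'train_runtime' from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
hours = train_runtime_s / 3600
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(hours, 2), round(steps_per_second, 3))
```

The step counts line up with the two saved checkpoints (`checkpoint-33750` and `checkpoint-67500`) and with the `Total optimization steps` line in the log.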

Qiam has pushed the fine-tuned model to the HuggingFace Hub [here](https://huggingface.co/vkhangpham/shopee-ner). Feel free to use it as you like.

In [None]:
trainer.push_to_hub(commit_message="Training complete")

Saving model checkpoint to shopee-ner
Configuration saved in shopee-ner/config.json
Model weights saved in shopee-ner/pytorch_model.bin
tokenizer config file saved in shopee-ner/tokenizer_config.json
Special tokens file saved in shopee-ner/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


To https://huggingface.co/vkhangpham/shopee-ner
   6cdfdb4..1d7f72d  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Token Classification', 'type': 'token-classification'}, 'metrics': [{'name': 'Precision', 'type': 'precision', 'value': 0.7665611869066763}, {'name': 'Recall', 'type': 'recall', 'value': 0.8665845648604269}, {'name': 'F1', 'type': 'f1', 'value': 0.8135098681494123}, {'name': 'Accuracy', 'type': 'accuracy', 'value': 0.9319873796583431}]}
To https://huggingface.co/vkhangpham/shopee-ner
   1d7f72d..18627bf  main -> main



'https://huggingface.co/vkhangpham/shopee-ner/commit/1d7f72d0843bd8955aa5e47b26e4020e453993d5'

# Conclusion

In this post, you and Qiam have walked through how to build a custom NER model with HuggingFace. Qiam chose this problem from Shopee Code League 2021 as an example because he had so much fun during the one week of competing in the challenge. If you are curious about the result, our team ranked 93rd among over a thousand competitors. Not a superior result, but since that was his first time touching NLP, this one would consider that a win 😉

# Further reading

This work is inspired by the public [solution](https://www.kaggle.com/baohiep/1-place-scl-ds-2021-voidandtwotsts) of the winning team and HuggingFace's [tutorial](https://huggingface.co/course/chapter7/2?fw=pt). Qiam really recommends checking out these two posts, since he has learned a lot from them.

Thank you so much for reading this. Goodbye traveler, may your road lead you to warm sands.

# Testing

In [6]:
from datasets import load_dataset

final_test = load_dataset('csv', data_files='data/shopee_test.csv')
final_test = final_test['train']
final_test

Dataset({
    features: ['id', 'raw_address'],
    num_rows: 50000
})

In [9]:
final_test = final_test.map(add_token_column, batched=True)
final_test[0]

  0%|          | 0/50 [00:00<?, ?ba/s]

{'id': 0,
 'raw_address': 's. par 53 sidanegara 4 cilacap tengah',
 'tokens': ['s.', 'par', '53', 'sidanegara', '4', 'cilacap', 'tengah']}

In [10]:
from transformers import pipeline

model_checkpoint = "vkhangpham/shopee-ner"
token_classifier = pipeline(
    "ner", model=model_checkpoint, aggregation_strategy="simple"
)
preds = token_classifier(final_test[0]['tokens'])
preds

[[{'end': 2,
   'entity_group': 'STR',
   'score': 0.8356286,
   'start': 0,
   'word': 's.'}],
 [{'end': 3,
   'entity_group': 'STR',
   'score': 0.70869064,
   'start': 0,
   'word': 'par'}],
 [],
 [],
 [{'end': 1,
   'entity_group': 'STR',
   'score': 0.755323,
   'start': 0,
   'word': '4'}],
 [],
 []]

In [None]:
from tqdm.auto import tqdm

result = []
for row in tqdm(final_test):
    predictions = token_classifier(row['tokens'])
    poi, street = "", ""
    for pred in predictions:
        if len(pred) > 0:
            if pred[0]['entity_group'] == 'POI':
                poi += pred[0]['word'] + " "
            else:
                street += pred[0]['word'] + " "
    result.append(poi.strip() + '/' + street.strip())

  0%|          | 0/50000 [00:00<?, ?it/s]
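For reference, when the pipeline is called on a single address string with `aggregation_strategy="simple"`, it returns one flat list of dicts with `entity_group` and `word` keys. The sketch below shows how such a list could be turned into the `POI/street` submission format. The `format_prediction` helper and the sample predictions are made up for illustration, and no model is loaded.

```python
def format_prediction(entities):
    """Join aggregated entities into the 'POI/street' submission string."""
    parts = {'POI': [], 'STR': []}
    for entity in entities:
        group = entity['entity_group']
        if group in parts:                       # skip any unexpected group
            parts[group].append(entity['word'].strip())
    return ' '.join(parts['POI']) + '/' + ' '.join(parts['STR'])

# Made-up predictions in the shape produced by aggregation_strategy="simple"
sample = [
    {'entity_group': 'POI', 'word': 'toko indah', 'score': 0.91, 'start': 0, 'end': 10},
    {'entity_group': 'STR', 'word': 'raya kupang', 'score': 0.88, 'start': 11, 'end': 22},
]
formatted = format_prediction(sample)
empty = format_prediction([])
print(formatted)   # toko indah/raya kupang
print(empty)       # /
```

An address with no detected entity yields a bare `/`, which is the same fallback the loop above produces.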

In [14]:
import csv

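# newline='' is what the csv module recommends to avoid blank lines on some platforms
with open('/content/data/submission.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['id', 'POI/street'])
    # One row per test address, in dataset order
    for i, prediction in enumerate(result):
        writer.writerow([i, prediction])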