In [1]:
import pandas as pd
from transformers import pipeline

We will base our analysis on XLM-RoBERTa, a multilingual model pre-trained on 100 languages (https://huggingface.co/docs/transformers/model_doc/xlmroberta).

Specifically, we will use a version of XLM-RoBERTa that has been fine-tuned on CoNLL-2003, a benchmark English NER dataset (https://huggingface.co/xlm-roberta-large-finetuned-conll03-english).

Although the model has not been trained on any Indonesian NER dataset, its multilingual pre-training lets it transfer the "knowledge" it gains from recognising English named entities to Indonesian named entities.

In [2]:
ner_tagger = pipeline('ner', model="xlm-roberta-large-finetuned-conll03-english")

Let's have a quick peek at the CoNLL-2003 dataset.

In [3]:
from datasets import load_dataset

conll2003 = load_dataset("conll2003")

display(conll2003["train"][0])
display(conll2003["train"].features["ner_tags"])

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

Sequence(feature=ClassLabel(num_classes=9, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], names_file=None, id=None), length=-1, id=None)
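
The integers in `ner_tags` are positions in the `names` list printed above. As a minimal sketch of that lookup, the snippet below copies the label list and the first training example by hand (so it runs without downloading anything) and pairs each token with its label name. The helper name `decode_tags` is made up for this illustration.

```python
# Label names copied from the ClassLabel output above; list index == integer tag.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training example, copied from the output above.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]


def decode_tags(tokens, tag_ids, label_names=LABEL_NAMES):
    """Pair each token with the string label its integer tag refers to."""
    return [(token, label_names[tag_id]) for token, tag_id in zip(tokens, tag_ids)]


decoded = decode_tags(tokens, ner_tags)
for token, label in decoded:
    print(f"{token:<10}{label}")
```

So "EU" is tagged B-ORG, "German" and "British" are tagged B-MISC, and everything else is O (outside any entity). Inside the notebook, the same lookup should be available directly through the `int2str` method of the `ClassLabel` feature.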

We first test the model on an English sentence.

In [4]:
sample_text_en = "Donald Trump went to the White House last week."
display(pd.DataFrame(ner_tagger(sample_text_en)))

sample_text_en = "Thor is a member of the Avengers."
display(pd.DataFrame(ner_tagger(sample_text_en)))

        word     score  entity  index  start  end
0    ▁Donald   0.99999   I-PER      1      0    6
1     ▁Trump  0.999988   I-PER      2      7   12
2     ▁White  0.999996   I-LOC      6     25   30
3     ▁House  0.999997   I-LOC      7     31   36


        word     score  entity  index  start  end
0      ▁Thor  0.999725   I-PER      1      0    4
1  ▁Avengers  0.999931   I-ORG      7     24   32


Let's move on to test it on some Indonesian sentences.

In [5]:
sample_text_id = "Nama saya Budi. Saya lahir di tahun 1981. Sekarang saya tinggal di Ubud, Bali."
# My name is Budi. I was born in 1981. I am currently living in Ubud, Bali.
display(pd.DataFrame(ner_tagger(sample_text_id)))

sample_text_id = "Budi sudah sampai di depan Menara BCA."
# Budi has arrived in front of the BCA Tower.
display(pd.DataFrame(ner_tagger(sample_text_id)))

        word     score  entity  index  start  end
0      ▁Budi  0.999916   I-PER      3     10   14
1         ▁U  0.999982   I-LOC     15     67   68
2        bud  0.999979   I-LOC     16     68   71
3      ▁Bali  0.999992   I-LOC     18     73   77


        word     score  entity  index  start  end
0      ▁Budi  0.999983   I-PER      1      0    4
1       ▁Men  0.884495   I-LOC      6     27   30
2        ara  0.825791   I-LOC      7     30   33
3         ▁B  0.999868   I-ORG      8     34   35
4         CA  0.999681   I-ORG      9     35   37


The results are not bad. For the first sentence, the model gives a perfect prediction: it tags Ubud and Bali as LOCATION and Budi as PERSON.

For the second sentence, it correctly tags Budi as a PERSON and recognises that BCA is an ORGANISATION. In this context, however, Menara BCA should be treated as a single entity: the location where Budi currently is.
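
To make the split visible, here is a minimal sketch that merges the subword rows of the second output into entity spans. The rows are copied by hand from the table above, and the function name `group_entities` and the `max_gap` rule (merge pieces of the same type that are at most one character apart) are assumptions made for this illustration, not part of the pipeline.

```python
def group_entities(text, rows, max_gap=1):
    """Merge neighbouring token predictions into (surface text, label) spans.

    A row extends the previous span when it has the same entity type, does
    not carry a "B-" prefix, and starts at most `max_gap` characters after
    the previous span ends (0 for subword pieces, 1 for a single space).
    """
    spans = []
    for row in rows:
        prefix, _, label = row["entity"].partition("-")  # "I-LOC" -> ("I", "-", "LOC")
        last = spans[-1] if spans else None
        if (last is not None and prefix != "B" and last["label"] == label
                and row["start"] - last["end"] <= max_gap):
            last["end"] = row["end"]
        else:
            spans.append({"label": label, "start": row["start"], "end": row["end"]})
    return [(text[s["start"]:s["end"]], s["label"]) for s in spans]


text = "Budi sudah sampai di depan Menara BCA."
# entity/start/end columns copied from the second output table above.
rows = [
    {"entity": "I-PER", "start": 0, "end": 4},
    {"entity": "I-LOC", "start": 27, "end": 30},
    {"entity": "I-LOC", "start": 30, "end": 33},
    {"entity": "I-ORG", "start": 34, "end": 35},
    {"entity": "I-ORG", "start": 35, "end": 37},
]

entities = group_entities(text, rows)
print(entities)  # [('Budi', 'PER'), ('Menara', 'LOC'), ('BCA', 'ORG')]
```

The model's output therefore amounts to two separate entities, "Menara" (LOC) and "BCA" (ORG), rather than the single location "Menara BCA". Recent versions of transformers also accept an `aggregation_strategy` argument on the NER pipeline that performs a similar grouping; the hand-written version is shown only to keep the logic explicit.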

In general, a multilingual model does not capture the "local context" as well as a native monolingual model, in this case one trained on Indonesian. Recent studies report that monolingual models are more accurate than multilingual models in most cases (see, for example, https://arxiv.org/abs/2009.05387 and https://arxiv.org/abs/2011.00677).

There are some practical challenges in building monolingual models for low-resource languages, though. Chief among them is the small size of the native corpora (relative to English) that can be used to pre-train the models.

An interim way to resolve this problem is to fine-tune a multilingual model on a native annotated dataset, e.g. an Indonesian version of CoNLL-2003. This is exactly what we did in SEA CoreNLP. The Indonesian NER model there was trained by fine-tuning XLM-RoBERTa on NER-Grit (an Indonesian NER corpus, https://github.com/grit-id/nergrit-corpus).
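
One preprocessing step that fine-tuning a token classifier on word-level annotations commonly needs is aligning word labels with subword tokens, since a word such as "Ubud" is split into several pieces. The sketch below illustrates that step in general and is not taken from the SEA CoreNLP training code. The function name `align_labels`, the example `word_ids` list, and the reuse of the CoNLL-2003 label ids are all assumptions for illustration. In practice the `word_ids` would come from a fast tokenizer's `word_ids()` method.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level labels to one label per subword token.

    `word_ids` maps each token to the index of the word it came from, with
    None for special tokens. Only the first piece of each word keeps the
    word's label unless `label_all_subwords` is True.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)            # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])    # first piece of a word
        else:
            # continuation piece of the same word
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hypothetical example: words "Budi tinggal di Ubud", labelled with the
# CoNLL-2003 ids from earlier (1 = B-PER, 0 = O, 5 = B-LOC).
word_labels = [1, 0, 0, 5]
# Assumed tokenisation: <s> Budi tinggal di U bud </s>
word_ids = [None, 0, 1, 2, 3, 3, None]

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 0, 0, 5, -100, -100]
```

Positions marked with -100 are skipped when the loss is computed, so each word contributes exactly one training signal regardless of how many pieces the tokenizer produces.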