## A more accurate Indonesian NER using IndoBERT - a monolingual model

IndoNLU (https://github.com/indobenchmark/indonlu) is a recent collaborative effort between Gojek, ITB, Airy and others. It provides a benchmark of Indonesian NLP tasks, including NER and sentiment analysis, and builds Indo4B, a large Indonesian text dataset. Indo4B is also used to develop an Indonesian version of BERT, called IndoBERT.

We use IndoBERT here to demonstrate NER based on a monolingual model. We fine-tuned the base IndoBERT model for NER on the NER-Grit dataset, using the fine-tuning script provided by IndoNLU (https://github.com/indobenchmark/indonlu/blob/master/examples/finetune_ner_grit.ipynb), and saved the fine-tuned model locally in the `models` folder.

**Note:** this demo requires `transformers` version `2.9.0`. To run it, first exit your current environment, create a new one, and execute `pip install -r requirements_IndoBERT.txt` before running this notebook. We also need a copy of `word_classification.py` from the IndoNLU repo (https://github.com/indobenchmark/indonlu/blob/master/modules/word_classification.py) in the current directory. If you cloned the SEA_NLP_workshop repository, this file is already in place.
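Before running the cells, it can help to confirm that the new environment really has the pinned version. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper names `installed_version` and `version_matches` are our own and are not part of IndoNLU or `transformers`.

```python
from importlib import metadata

REQUIRED_TRANSFORMERS = "2.9.0"  # version stated in the note above


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed, required=REQUIRED_TRANSFORMERS):
    """True only when the installed version string equals the required one."""
    return installed is not None and installed == required


found = installed_version("transformers")
if version_matches(found):
    print("transformers", found, "is installed: OK")
else:
    print("expected transformers", REQUIRED_TRANSFORMERS, "but found", found)
```

If the second message appears, the notebook is probably running in the wrong environment.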

In [1]:
import numpy as np
import pandas as pd
import torch

from transformers import BertConfig, BertTokenizer  # requires transformers==2.9.0
from nltk.tokenize import word_tokenize

from word_classification import BertForWordClassification

Loading the IndoBERT tokenizer and configuration

In [2]:
tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
config = BertConfig.from_pretrained('indobenchmark/indobert-base-p1')

Mapping the model's output indices to the named entity labels

In [3]:
config.num_labels = 7
i2w = {0: 'I-PERSON', 1: 'B-ORGANISATION', 2: 'I-ORGANISATION', 3: 'B-PLACE', 4: 'I-PLACE', 5: 'O', 6: 'B-PERSON'}
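The mapping only makes sense if it matches the label order used during fine-tuning, so a few cheap consistency checks can catch a mistyped entry. The sketch below repeats `i2w` from the cell above so that it runs on its own; the inverse mapping `w2i` and the checks are our own additions for illustration.

```python
# Index-to-label mapping copied from the cell above.
i2w = {0: 'I-PERSON', 1: 'B-ORGANISATION', 2: 'I-ORGANISATION',
       3: 'B-PLACE', 4: 'I-PLACE', 5: 'O', 6: 'B-PERSON'}

# Inverse mapping (label -> index); the name `w2i` is our own choice.
w2i = {label: index for index, label in i2w.items()}

# The indices should be exactly 0..num_labels-1, one per output unit of the classifier.
num_labels = len(i2w)
assert sorted(i2w) == list(range(num_labels))

# Every entity type should have both a B- (begin) and an I- (inside) tag.
entity_types = sorted({label.split('-', 1)[1] for label in i2w.values() if label != 'O'})
for entity_type in entity_types:
    assert 'B-' + entity_type in w2i and 'I-' + entity_type in w2i

print(num_labels)      # 7
print(entity_types)    # ['ORGANISATION', 'PERSON', 'PLACE']
```

These checks cannot tell whether the order matches the fine-tuned model; they only confirm that the mapping is internally consistent with `config.num_labels = 7`.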

Loading the fine-tuned model that was saved locally earlier. You can also do your own fine-tuning by running the notebook https://github.com/indobenchmark/indonlu/blob/master/examples/finetune_ner_grit.ipynb yourself. Note that it requires `transformers` version `2.9.0`.

We use `cpu` as the device because we are only performing inference here.

In [4]:
model = BertForWordClassification.from_pretrained('models/', config=config).to("cpu")

Defining a function that converts a list of words into IndoBERT subword token IDs, together with a list that records which word each subword came from

In [5]:
def word_subword_tokenize(sentence, tokenizer):
    # Add CLS token
    subwords = [tokenizer.cls_token_id]
    subword_to_word_indices = [-1] # For CLS

    # Add subwords
    for word_idx, word in enumerate(sentence):
        subword_list = tokenizer.encode(word, add_special_tokens=False)
        subword_to_word_indices += [word_idx] * len(subword_list)
        subwords += subword_list

    # Add last SEP token
    subwords += [tokenizer.sep_token_id]
    subword_to_word_indices += [-1]

    return subwords, subword_to_word_indices
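To see what the two returned lists look like, here is a small self-contained sketch. `ToyTokenizer` is a hypothetical stand-in for `BertTokenizer` with made-up token IDs, so the numbers below are not real IndoBERT vocabulary entries; the function body is repeated from above so the example runs without `transformers`.

```python
class ToyTokenizer:
    """Hypothetical stand-in for BertTokenizer; NOT the real IndoBERT vocabulary."""
    cls_token_id = 101
    sep_token_id = 102

    def __init__(self, pieces):
        # pieces: dict mapping a word to the list of subword ids it splits into
        self.pieces = pieces

    def encode(self, word, add_special_tokens=False):
        return list(self.pieces[word])


def word_subword_tokenize(sentence, tokenizer):
    # Same logic as the function above, repeated so this sketch runs on its own.
    subwords = [tokenizer.cls_token_id]
    subword_to_word_indices = [-1]  # CLS belongs to no word
    for word_idx, word in enumerate(sentence):
        subword_list = tokenizer.encode(word, add_special_tokens=False)
        subword_to_word_indices += [word_idx] * len(subword_list)
        subwords += subword_list
    subwords += [tokenizer.sep_token_id]
    subword_to_word_indices += [-1]  # SEP belongs to no word
    return subwords, subword_to_word_indices


# Pretend 'menara' is a single piece and 'bca' splits into two pieces.
toy = ToyTokenizer({'menara': [7], 'bca': [8, 9]})
subwords, subword_to_word_indices = word_subword_tokenize(['menara', 'bca'], toy)

print(subwords)                  # [101, 7, 8, 9, 102]
print(subword_to_word_indices)   # [-1, 0, 1, 1, -1]
```

Both pieces of `bca` carry word index `1`, and the special tokens carry `-1`. This alignment is what allows a word-level classifier to return one label per word rather than one per subword.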

Testing the model on a few Indonesian sentences

In [6]:
text = word_tokenize('Nama saya Budi. Saya lahir di tahun 1981. Sekarang saya tinggal di Ubud, Bali.')
# English: My name is Budi. I was born in 1981. I am currently living in Ubud, Bali.
subwords, subword_to_word_indices = word_subword_tokenize(text, tokenizer)

# Add a batch dimension of 1 and move the inputs to the model's device
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)
subword_to_word_indices = torch.LongTensor(subword_to_word_indices).view(1, -1).to(model.device)
logits = model(subwords, subword_to_word_indices)[0]

# Pick the highest-scoring label index for each word, then map it to its label name
preds = torch.topk(logits, k=1, dim=-1)[1].squeeze().cpu().numpy()
labels = [i2w[p] for p in preds]

pd.DataFrame({'words': text, 'label': labels})

| | words | label |
|---|---|---|
| 0 | Nama | O |
| 1 | saya | O |
| 2 | Budi | B-PERSON |
| 3 | . | O |
| 4 | Saya | O |
| 5 | lahir | O |
| 6 | di | O |
| 7 | tahun | O |
| 8 | 1981 | O |
| 9 | . | O |


Contrast the following result with the output of the multilingual NER based on XLM-RoBERTa (see the notebook `Multilingual_NER`). Here, IndoBERT correctly uses the context to tag *menara bca* as a single place. XLM-RoBERTa did not pick up on that context and tagged *menara* as a location and *bca* as an organisation independently.

In [7]:
text = word_tokenize('Budi sudah sampai di depan menara bca')
# English: Budi has arrived in front of the BCA Tower.
subwords, subword_to_word_indices = word_subword_tokenize(text, tokenizer)

# Add a batch dimension of 1 and move the inputs to the model's device
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)
subword_to_word_indices = torch.LongTensor(subword_to_word_indices).view(1, -1).to(model.device)
logits = model(subwords, subword_to_word_indices)[0]

# Pick the highest-scoring label index for each word, then map it to its label name
preds = torch.topk(logits, k=1, dim=-1)[1].squeeze().cpu().numpy()
labels = [i2w[p] for p in preds]

pd.DataFrame({'words': text, 'label': labels})

| | words | label |
|---|---|---|
| 0 | Budi | B-PERSON |
| 1 | sudah | O |
| 2 | sampai | O |
| 3 | di | O |
| 4 | depan | O |
| 5 | menara | B-PLACE |
| 6 | bca | I-PLACE |


Similarly, IndoBERT uses the context here to tag *mall taman anggrek* as a single place.

In [15]:
text = word_tokenize('Yani pergi ke mall taman anggrek membeli kue bantal')
# English: Yani went to Taman Anggrek Mall to buy a Pillow Cake.
subwords, subword_to_word_indices = word_subword_tokenize(text, tokenizer)

# Add a batch dimension of 1 and move the inputs to the model's device
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)
subword_to_word_indices = torch.LongTensor(subword_to_word_indices).view(1, -1).to(model.device)
logits = model(subwords, subword_to_word_indices)[0]

# Pick the highest-scoring label index for each word, then map it to its label name
preds = torch.topk(logits, k=1, dim=-1)[1].squeeze().cpu().numpy()
labels = [i2w[p] for p in preds]

pd.DataFrame({'words': text, 'label': labels})

| | words | label |
|---|---|---|
| 0 | Yani | B-PERSON |
| 1 | pergi | O |
| 2 | ke | O |
| 3 | mall | B-PLACE |
| 4 | taman | I-PLACE |
| 5 | anggrek | I-PLACE |
| 6 | membeli | O |
| 7 | kue | O |
| 8 | bantal | O |
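The per-word labels follow the usual B-/I-/O convention: `B-` opens an entity, `I-` continues it, and `O` marks a word outside any entity. To read off whole entities such as *mall taman anggrek*, the tagged words can be merged into spans. The helper below is a minimal sketch of that grouping; `group_entities` is our own illustrative function and is not part of IndoNLU. It treats a stray `I-` tag with no open span of the same type as the start of a new span, which is one possible design choice.

```python
def group_entities(words, labels):
    """Merge BIO-tagged words into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []

    def flush():
        # Close the span being built, if there is one.
        nonlocal current_type, current_words
        if current_words:
            entities.append((current_type, ' '.join(current_words)))
        current_type, current_words = None, []

    for word, label in zip(words, labels):
        if label == 'O':
            flush()
            continue
        prefix, entity_type = label.split('-', 1)
        if prefix == 'B' or entity_type != current_type:
            # A B- tag, or an I- tag that does not continue the open span, starts a new span.
            flush()
            current_type = entity_type
        current_words.append(word)
    flush()
    return entities


# Words and labels copied from the last result table above.
words = ['Yani', 'pergi', 'ke', 'mall', 'taman', 'anggrek', 'membeli', 'kue', 'bantal']
labels = ['B-PERSON', 'O', 'O', 'B-PLACE', 'I-PLACE', 'I-PLACE', 'O', 'O', 'O']

print(group_entities(words, labels))
# [('PERSON', 'Yani'), ('PLACE', 'mall taman anggrek')]
```

In the notebook, the same function could be called directly on `text` and `labels` from any of the cells above.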
