**"Huggingface is the Open AI we deserve"**

When I read that tweet, I could not have agreed more. 

Huggingface (🤗) has democratized large-scale neural network models like no one else. We can easily build on top of them and follow the fast-paced development of NLP neural networks. 

While there are plenty of tutorials already, I want to show an example of fine-tuning the multilingual BERT model on an Indonesian dataset. In particular, I am using the [AiryRooms Hotel Review](https://github.com/jordhy97/final_project) dataset for Aspect-Based Sentiment Analysis (ABSA). With just 2 fine-tuning epochs, the model achieves strong aspect and sentiment extraction (a micro-averaged F1 of about 0.94 on the test set).

This notebook aims to answer the following questions:
1. How do we prepare a custom ABSA dataset?
2. How do we fine-tune a multilingual BERT model?
3. How do we calculate the F1 score as a custom metric?
4. How do we use TensorBoard to display the training results?
5. How do we run inference on a sample sentence?

Compared to my [previous post](https://medium.com/@yoseflaw/step-by-step-ner-model-for-bahasa-indonesia-with-pytorch-and-torchtext-6f94fca08406?source=friends_link&sk=c15c89082c00c8785577e1cebb77c9c2), this post is more concise. The difference in length shows how powerful fine-tuning is for building NLP models. Although I am using a model pretrained on 104 languages, the fine-tuned model performs well for Bahasa Indonesia. The result removed my doubts about using a pretrained multilingual model for a single language. The hotel reviews are also written informally, and the model still predicts aspects and sentiments accurately!




In [1]:
from google.colab import drive
drive.mount("/content/gdrive")

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Mounted at /content/gdrive


In [None]:
!pip install transformers==3.0.2

import torch
import nltk
from nltk.tokenize import word_tokenize
from torch.utils.data import Dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          Trainer, TrainingArguments)
import numpy as np
from sklearn.metrics import f1_score, classification_report
from pathlib import Path
import os
import csv
import re

nltk.download("punkt")


In [None]:
import sys
DRIVE_ROOT = "/content/gdrive/MyDrive/Colab Notebooks/Thesis/Absa"
if DRIVE_ROOT not in sys.path:
    sys.path.append(DRIVE_ROOT)

In [None]:
available_gpu = torch.cuda.is_available()
if available_gpu:
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
    use_device = torch.device("cuda")
else:
    use_device = torch.device("cpu")

# Dataset

I begin by defining a custom PyTorch `Dataset` class. There are two main concepts here: how to use the `tokenizer` and the `encode_tags` function. The tokenizer is provided by 🤗. Since we have to use the same tokenizer for all datasets, I leave the tokenizer instantiation to `Main`. We just need to provide tokenization parameters that match the custom dataset. In this case, the dataset is already tokenized into words, so I set `is_pretokenized` to `True`.
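As a quick illustration of what "pretokenized" means here, the sketch below parses the layout that `read_input` (in the class further down) expects: one token and its tag per line, separated by a tab, with a blank line between reviews. The two sample reviews and the helper name `parse_reviews` are made up for illustration and are not taken from the dataset files.

```python
# Hypothetical sample in the token<TAB>tag layout, one blank line between reviews
sample = "lokasi\tB-ASPECT\nbagus\tB-SENTIMENT\n\nkamar\tB-ASPECT\nmandi\tI-ASPECT\nbau\tB-SENTIMENT"

def parse_reviews(raw_text):
    """Split raw text into parallel lists of tokens and tags, one list per review."""
    token_docs, tag_docs = [], []
    for doc in raw_text.strip().split("\n\n"):
        pairs = [line.split("\t") for line in doc.split("\n")]
        token_docs.append([token for token, _ in pairs])
        tag_docs.append([tag for _, tag in pairs])
    return token_docs, tag_docs

token_docs, tag_docs = parse_reviews(sample)
print(token_docs)  # words are already split, so the tokenizer only adds subwords
print(tag_docs)
```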

The second concept is handling subwords with `encode_tags()`. The implementation is based on the 🤗 tutorial on named entity recognition with BERT. Why do subwords matter? The challenge is: which label should we assign to each subword? In this implementation, only the first subword of every word keeps the word's label. All other positions (remaining subwords, special tokens, and padding) get `-100`, the value that PyTorch's cross-entropy loss ignores by default, so their predictions do not affect training.
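The alignment can be sketched without any library. The offsets below are hypothetical and assume what the fast tokenizer returns for pretokenized input: character offsets relative to each word, with `(0, 0)` for special tokens and padding. The helper name `align_labels` is my own.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def align_labels(word_labels, offsets):
    """Give each first subword its word's label and every other position IGNORE_INDEX."""
    remaining = iter(word_labels)
    aligned = []
    for start, end in offsets:
        is_first_subword = start == 0 and end != 0
        aligned.append(next(remaining) if is_first_subword else IGNORE_INDEX)
    return aligned

# Hypothetical offsets for "[CLS] kamar ber ##sih [SEP]", where "bersih" splits in two
offsets = [(0, 0), (0, 5), (0, 3), (3, 6), (0, 0)]
word_labels = [1, 2]  # e.g. indices for B-ASPECT and B-SENTIMENT
aligned = align_labels(word_labels, offsets)
print(aligned)
```

Only positions 1 and 2 carry real labels here; the trailing subword and both special tokens are masked out.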

In [None]:
class ReviewDataset(Dataset):
    def __init__(self, filepath, tokenizer, tag2idx=None):
        self.texts, self.tags = ReviewDataset.read_input(filepath)
        self.encodings = tokenizer(
            self.texts,
            is_pretokenized=True,  # skip word tokenization
            return_offsets_mapping=True,  # offsets are used in tag encoding
            padding=True,  # pad to the longest sequence in this dataset
            truncation=True  # if longer than max length, truncate sentence
        )
        # make sure that the tag-to-idx dictionary is the same between train, val, and test
        if tag2idx is None:
            # sort the tags so the mapping is reproducible across runs
            unique_tags = sorted(set(tag for doc in self.tags for tag in doc))
            self.tag2idx = {tag: idx for idx, tag in enumerate(unique_tags)}
        else:
            self.tag2idx = tag2idx
        self.num_labels = len(self.tag2idx)  # defined for train, val, and test alike
        self.idx2tag = {idx: tag for tag, idx in self.tag2idx.items()}
        # tag encoding to handle subwords
        self.labels = self.encode_tags()
        self.encodings.pop("offset_mapping")

    def __getitem__(self, idx):
        item = {key: torch.tensor(value[idx]) for key, value in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

    def encode_tags(self):
        # put -100 as labels for subwords
        # these instances will be ignored during loss calculation
        labels = [[self.tag2idx[tag] for tag in doc] for doc in self.tags]
        encoded_labels = []
        for doc_labels, doc_offset in zip(labels, self.encodings.offset_mapping):
            doc_enc_labels = np.full(len(doc_offset), -100, dtype=int)  # start with -100 everywhere
            # replace labels of the first subwords with the actual labels
            arr_offset = np.array(doc_offset)
            doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
            encoded_labels.append(doc_enc_labels.tolist())
        return encoded_labels

    @staticmethod
    def read_input(filepath, tag_only=False):
        # tag_only determines whether the labels should include location (B, I)
        # or just the tag names (ASPECT, SENTIMENT)
        path = Path(filepath)
        raw_text = path.read_text().strip()
        raw_docs = re.split(r"\n\n", raw_text)
        token_docs = []
        tag_docs = []
        for doc in raw_docs:
            tokens = []
            tags = []
            for line in doc.split("\n"):
                token, tag = line.strip().split("\t")
                tokens.append(token)
                if tag_only:
                    tag = tag.split("-")[1] if tag != "O" else tag
                tags.append(tag)
            token_docs.append(tokens)
            tag_docs.append(tags)
        return token_docs, tag_docs
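The reason `tag2idx` is passed from the training set to the validation and test sets can be seen in a tiny example. The tag lists are made up, and sorting the tags in the hypothetical helper `build_tag2idx` is my own choice to keep the output reproducible.

```python
train_tags = [["B-ASPECT", "B-SENTIMENT", "O"], ["O", "I-ASPECT"]]
val_tags = [["O", "B-SENTIMENT"]]  # fewer distinct tags than the training split

def build_tag2idx(tag_docs):
    """Map each distinct tag to an index; sorting keeps the mapping reproducible."""
    unique_tags = sorted({tag for doc in tag_docs for tag in doc})
    return {tag: idx for idx, tag in enumerate(unique_tags)}

shared = build_tag2idx(train_tags)
own = build_tag2idx(val_tags)
print(shared["O"], own["O"])  # the same tag gets different indices in the two mappings
```

If the validation set built its own mapping, the model's output index for `O` would be compared against a different label index, and every metric would be meaningless.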

# Model

The `AspectModel` is a wrapper class for the `BertForTokenClassification` model class from 🤗. There is nothing spectacular happening. And that is a good thing. Fine-tuning a pretrained model is as easy as specifying the model class and the model name.

Additionally, I do not need to write my own training loop. 🤗 also provides the `Trainer` class, which covers all the behaviors that I intend to use. Finally, I add an F1 score calculation at every evaluation through the `compute_metrics` function.
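To make the metric concrete, here is a minimal pure-Python sketch of a micro-averaged F1, assuming every token has exactly one label. The notebook itself relies on scikit-learn's `f1_score`; the function name `micro_f1` and the sample values are made up. In this single-label setting every mistake counts as one false positive and one false negative, so micro-F1 equals accuracy over the positions that are not ignored.

```python
def micro_f1(labels, preds, ignore_index=-100):
    """Micro-averaged F1 over positions whose gold label is not ignore_index."""
    pairs = [(gold, pred) for gold, pred in zip(labels, preds) if gold != ignore_index]
    tp = sum(gold == pred for gold, pred in pairs)
    fp = fn = len(pairs) - tp  # a wrong token is a FP for one class and a FN for another
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = [-100, 1, 2, -100, 0, 0]  # -100 marks special tokens and trailing subwords
preds = [0, 1, 0, 3, 0, 0]
score = micro_f1(labels, preds)
print(score)
```

Three of the four scored positions are correct, so the score is 0.75. The predictions at the two ignored positions have no influence.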

In [None]:
class AspectModel(object):

    def __init__(self, model_reference, tokenizer, device, num_labels=None, cache_dir=None):
        model_args = {"pretrained_model_name_or_path": model_reference}
        if num_labels is not None:
            model_args["num_labels"] = num_labels
        if cache_dir is not None:
            model_args["cache_dir"] = cache_dir
        self.model = BertForTokenClassification.from_pretrained(**model_args)
        self.tokenizer = tokenizer
        self.device = device
        self.trainer = None
        self.idx2tag = {}

    def train(self,
              train_dataset,
              val_dataset,
              logging_dir,
              num_train_epochs,
              logging_steps,
              lr,
              weight_decay,
              warmup_steps,
              output_dir,
              save_model=False
              ):
        args = TrainingArguments(
            output_dir=output_dir,
            logging_dir=logging_dir,
            num_train_epochs=num_train_epochs,
            eval_steps=logging_steps,  # eval and log at the same step frequency
            logging_steps=logging_steps,
            learning_rate=lr,
            weight_decay=weight_decay,
            warmup_steps=warmup_steps,  # learning rate keeps increasing up to this point
            evaluate_during_training=True,
            per_device_train_batch_size=32,
            per_device_eval_batch_size=64,
        )
        self.trainer = Trainer(
            model=self.model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            compute_metrics=AspectModel.compute_metrics(train_dataset.idx2tag)  # F1 score
        )
        train_result = self.trainer.train()
        if save_model: 
            self.trainer.save_model(output_dir)
            with open(os.path.join(output_dir, "idx2tag.csv"), "w") as idx2tag_f:
                w = csv.writer(idx2tag_f)
                w.writerows(train_dataset.idx2tag.items())
        return train_result

    def predict(self, predict_dataset):
        if self.trainer is None:
            print("Run train() before making prediction.")
            return None
        return self.trainer.predict(predict_dataset)

    def infer(self, sentence):
        tokens = [word_tokenize(sentence)]
        encoding = self.tokenizer(
            tokens,
            return_tensors="pt",
            is_pretokenized=True,
            padding=True,
            truncation=True
        )
        input_ids = encoding["input_ids"]
        attention_mask = encoding["attention_mask"]
        subwords = self.tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
        input_ids = input_ids.to(self.device)
        attention_mask = attention_mask.to(self.device)
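        self.model.to(self.device)  # make sure model and inputs share a device
        self.model.eval()  # turn off dropout for inference
        with torch.no_grad():  # gradients are not needed for prediction
            outputs = self.model(input_ids, attention_mask=attention_mask)
        tags = outputs[0][0].argmax(-1).tolist()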
        # join subwords
        words = []
        valid_tags = []
        buffer_word = None
        buffer_tag = None
        for i, (subword, tag) in enumerate(zip(subwords, tags)):
            if buffer_word is None:
                buffer_word = subword
                buffer_tag = tag
            elif subword.startswith("##"):
                buffer_word += subword.replace("##", "")
                if i == len(subwords) - 1:
                    words.append(buffer_word)
            else:
                words.append(buffer_word)
                valid_tags.append(buffer_tag)
                buffer_word = subword
                buffer_tag = tag
        return words, valid_tags

    @staticmethod
    def compute_metrics(idx_to_tag):
        def _compute_metrics(pred):
            valid = pred.label_ids != -100
            labels = pred.label_ids[valid].flatten()
            preds = pred.predictions.argmax(-1)[valid].flatten()
            f1 = f1_score(labels, preds, average="micro", zero_division=0)
            report = classification_report(labels, preds, output_dict=True, zero_division=0)
            metrics = {"f1": f1}
            for label in report:
                try:
                    int_label = int(label)
                    if int_label in idx_to_tag:
                        metrics[f"f1_{idx_to_tag[int_label]}"] = report[label]["f1-score"]
                except ValueError as _:
                    pass
            return metrics
        return _compute_metrics
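The subword-joining step inside `infer` can be viewed in isolation. The sketch below is a simplified variant rather than a copy of the method: it assumes WordPiece-style `##` continuation markers, keeps the tag of the first subword of each word, and, unlike `infer`, keeps the final `[SEP]` token. The function name `merge_subwords` and the sample values are made up.

```python
def merge_subwords(subwords, tags):
    """Glue '##' pieces onto the previous word and keep the first subword's tag."""
    words, word_tags = [], []
    for subword, tag in zip(subwords, tags):
        if subword.startswith("##") and words:
            words[-1] += subword[2:]  # continuation piece: extend the current word
        else:
            words.append(subword)  # first piece of a new word
            word_tags.append(tag)
    return words, word_tags

# Hypothetical subwords and predicted tag indices
subwords = ["[CLS]", "mem", "##bantu", "staf", "[SEP]"]
tags = [0, 3, 4, 3, 0]
words, word_tags = merge_subwords(subwords, tags)
print(words, word_tags)
```

The prediction for `##bantu` (index 4) is discarded, which mirrors how those positions were ignored during training.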

# Main

After defining the custom dataset and model wrapper classes, we can now write the main function. First, we load the pretrained tokenizer `BertTokenizerFast`. Here, I use the `Fast` version to obtain the offset mapping required for tag encoding. The train, validation, and test splits were prepared beforehand and contain 3000, 1000, and 1000 sentences, respectively.

In [None]:
model_name = "bert-base-multilingual-uncased"
tokenizer = BertTokenizerFast.from_pretrained(model_name, cache_dir=f"{DRIVE_ROOT}/pt_model/")
train_dataset = ReviewDataset(f"{DRIVE_ROOT}/data/input/train.tsv", tokenizer)
val_dataset = ReviewDataset(f"{DRIVE_ROOT}/data/input/val.tsv", tokenizer, tag2idx=train_dataset.tag2idx)
test_dataset = ReviewDataset(f"{DRIVE_ROOT}/data/input/test.tsv", tokenizer, tag2idx=train_dataset.tag2idx)
aspect_model = AspectModel(
    model_reference=model_name,
    tokenizer=tokenizer,
    device=use_device,
    num_labels=train_dataset.num_labels,
    cache_dir=f"{DRIVE_ROOT}/pt_model/"
)
aspect_model.train(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    logging_dir=f"{DRIVE_ROOT}/logs/indo-absa-hotel",
    num_train_epochs=2,
    logging_steps=24,
    lr=1e-4,
    weight_decay=1e-2,
    warmup_steps=94,
    output_dir=f"{DRIVE_ROOT}/results",
    save_model=True  # saves the final model and idx2tag mapping; set to False to save space
)
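To see what `warmup_steps=94` means here, the sketch below reproduces a linear warmup followed by a linear decay, which, as far as I know, is the schedule `Trainer` applies by default in this version of the library. It assumes a single device and no gradient accumulation, so 3000 training sentences at batch size 32 give 94 steps per epoch and 188 steps over 2 epochs. The helper name `linear_schedule_lr` is made up for this illustration.

```python
import math

def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0, total_steps - step) / max(1, total_steps - warmup_steps)

steps_per_epoch = math.ceil(3000 / 32)  # 94 batches per epoch (assumes one device)
total_steps = steps_per_epoch * 2       # 2 epochs
for step in (0, 47, 94, 141, 188):
    lr = linear_schedule_lr(step, peak_lr=1e-4, warmup_steps=94, total_steps=total_steps)
    print(step, lr)
```

With these settings the learning rate rises during the whole first epoch and falls back to zero during the second.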

## Results

The advantages do not stop at training. After training is done, we can use TensorBoard to view the results because the 🤗 `Trainer` writes logs in a TensorBoard-readable format by default. At the very least, this approach helps standardize model performance graphs (bye-bye, manual plotting). 

You can interact with the results yourself too. If you cannot see the TensorBoard below, try a different browser (I cannot see the board in Firefox, but Safari displays it with no problem).

In [None]:
%load_ext tensorboard
%tensorboard --logdir "/content/gdrive/MyDrive/Colab Notebooks/Thesis/Absa/logs/indo-absa-hotel"


In [None]:
from pprint import pprint
# test set final model
test_pred = aspect_model.predict(test_dataset)
pprint(test_pred[2])


{'eval_f1': 0.9369596076913151,
 'eval_f1_B-ASPECT': 0.9129464285714286,
 'eval_f1_B-SENTIMENT': 0.9305123418377931,
 'eval_f1_I-ASPECT': 0.8743633276740238,
 'eval_f1_I-SENTIMENT': 0.878186968838527,
 'eval_f1_O': 0.9533039647577094,
 'eval_loss': 0.17906426778063178}


In [None]:
import pandas as pd

df = pd.read_excel('/content/gdrive/MyDrive/Colab Notebooks/Thesis/Centroid/1.xlsx')

In [None]:
# text = "tempatnya sempurna, pas didepan pantai losari, namun sayang acnya kurang dingin padahal sdh lapor namun tetep tidak bisa dingin."
tokens, pred_tags = aspect_model.infer(df['ulasan'][50])
max_len = max([len(token) for token in tokens])
for token, pred_tag in zip(tokens, pred_tags):
    print(f"{token.ljust(max_len)}\t{train_dataset.idx2tag[pred_tag]}")

[CLS]   	O
lokasi  	B-ASPECT
bagus   	B-SENTIMENT
ada     	B-SENTIMENT
makanan 	B-ASPECT
enak    	B-SENTIMENT
di      	O
sekitar 	O
.       	O
hangat  	O
&       	O
amp     	O
;       	O
membantu	O
staf    	O
.       	O


In [None]:
df['ulasan'][50]

'Lokasi bagus ada makanan enak di sekitar. Hangat &amp; membantu staf.'
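The review above still contains the HTML entity `&amp;`, which is why `&`, `amp`, and `;` show up as three separate tokens in the prediction. A possible cleanup step, which is not part of the pipeline in this notebook, is to unescape such entities with Python's standard `html` module before calling `infer`:

```python
import html

review = "Lokasi bagus ada makanan enak di sekitar. Hangat &amp; membantu staf."
clean_review = html.unescape(review)  # turns "&amp;" back into "&"
print(clean_review)
```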

In [None]:
def extract_as(text):
    """Group each predicted aspect with the aspect/sentiment tokens that follow it."""
    tokens, pred_tags = aspect_model.infer(text)
    phrases = []
    for token, pred_tag in zip(tokens, pred_tags):
        tag = train_dataset.idx2tag[pred_tag]
        if tag == "B-ASPECT":
            # a new aspect starts a new phrase
            phrases.append(token)
        elif tag != "O" and phrases:
            # I-ASPECT and B-/I-SENTIMENT tokens extend the latest phrase;
            # they are skipped when no aspect has been seen yet
            phrases[-1] += " " + token
    return phrases

In [None]:
df_as = []
for i in df['ulasan']:
  df_as.append(extract_as(i))

In [None]:
df['frasa'] = df_as

In [None]:
df

| Unnamed: 0 | nama | rating | tanggal | ulasan | ringkasan | frasa |
|---|---|---|---|---|---|---|
| 0 | Aan Z. | 8.5 | 2018-09-05 | Lumayan menyenangkan, ambil pool access, kolam... | Kolam renang kecil, kunci pintu sempat tidak b... | [ambil pool access, kolam renang kecil, kunci ... |
| 1 | Abby P. | 8.5 | 2018-10-29 | Saya memiliki momen paling menyenangkan mengin... | Hotel nyaman, sangat dekat ke pantai dan makanan | [] |
| 2 | Abdul H. R. | 6.1 | 2020-03-10 | Kamar berbau kamar mandi bau air kamar mandi k... | Kamar bau dan berdebu | [kamar berbau, kamar mandi bau, air kamar mand... |
| 3 | Abigail C. | 9.4 | 2020-03-16 | Layanan hebat, kamar sepadan dengan harga. | Layanan hebat dan kamar sesuai harga | [layanan hebat, kamar sepadan] |
| 4 | ABIRA M. G. | 8.7 | 2018-07-25 | Kotor, saya menghabiskan 4 malam di alea dan m... | Sprei dan bantal kotor. Koneksi wifi buruk | [seprai, bantal sangat mengecewakan, koneksi, ... |
| ... | ... | ... | ... | ... | ... | ... |
| 510 | Grieshelda N. | 5.6 | 2019-06-09 | Kamar gelap tidak seperti di gambar, sarung da... | Kamar gelap dan fasilitas kamar kotor untuk sa... | [kamar gelap, sarung, sprei bau tidak bersih, ... |
| 511 | Grissela | 9.0 | 2017-09-24 | Secara keseluruhan bagus. Ragam makanan dan ra... | Makanan dibuat beragam dan peningkatan rasa | [secara keseluruhan bagus ragam, makanan, rasa... |
| 512 | Guest-8sg3uu | 6.3 | 2020-02-12 | Kamar kurang bersih, AC di kamar tidak terasa ... | Kamar kurang bersih dan AC tidak terasa dingin | [kamar kurang bersih, ac tidak terasa dingin] |
| 513 | Guest-cvsznm | 8.5 | 2019-03-04 | Sangat nyaman menginap di hotel ini karena pel... | Hotel sangat nyaman dengan pelayanan dan fasil... | [pelayanan, kamar sangat bagus] |


In [None]:
df.to_csv(f"{DRIVE_ROOT}/preprocessing/train.csv", index=False, encoding='utf-8')

One interesting detail from the inference example: do you notice an invalid tag sequence in the prediction? 

And the follow-up questions: How could that happen? Is there any way to prevent that? 

For now, I will leave those questions unanswered. If you have any ideas, [let me know](https://twitter.com/yoseflaw)!
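Without answering those questions, a small checker at least makes such sequences easy to find. The sketch below assumes the usual BIO convention, where an `I-` tag is only valid directly after a `B-` or `I-` tag of the same type. The function name `invalid_positions` and the tag sequence are made up.

```python
def invalid_positions(tags):
    """Return indices where an I- tag does not continue a B-/I- tag of the same type."""
    bad = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            span_type = tag[2:]
            if previous not in (f"B-{span_type}", f"I-{span_type}"):
                bad.append(i)
        previous = tag
    return bad

# Hypothetical prediction: the I-SENTIMENT at index 2 follows an O tag
tags = ["B-ASPECT", "O", "I-SENTIMENT", "B-SENTIMENT", "I-SENTIMENT"]
print(invalid_positions(tags))
```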

# Conclusion

Fine-tuning rocks! We do not have to train from scratch (and potentially waste so many resources). The multilingual BERT models from 🤗 are a good starting point for those who have limited resources and work with non-English languages. Those models have been trained on a (very) large corpus. Even the multilingual tokenizer works better than expected, which did not make sense to me at first (every language has its own tokenization method).

You can see the complete list of available models [here](https://huggingface.co/transformers/pretrained_models.html). Spoiler alert: it contains more models than just BERT!

Happy fun-tuning!