# TOPIC CLASSIFICATION WITH TRANSFORMERS
*VietAI Advanced NLP*

# 1. Introduction

*In this lab, VietAI introduces a tutorial on how to use Hugging Face's transformers library for the topic classification task.*

**Topic classification** is an NLP task in which we assign a document or text to a specific topic.

This notebook shows you how to classify Vietnamese news articles by topic using the transformers library. After completing it, you will know:


1.   Simple preprocessing for Vietnamese text
2.   Using some Hugging Face components (from *transformers* and *datasets*):

      *   Datasets
      *   Tokenizer
      *   Trainer
      *   Pretrained language model










You can find some of the resources used in this tutorial at:


*   dataset: https://github.com/duyvuleo/VNTC/tree/master/Data/10Topics/Ver1.1
*   underthesea: https://underthesea.readthedocs.io/en/latest/readme.html
*   references: 

> https://huggingface.co/docs/transformers/training

> https://huggingface.co/








# 2. Prepare your dataset

For convenience, you can download the data from my drive [here](https://drive.google.com/drive/folders/1uqXk9qiXqJ7xd8P3khvlOawdi2tzBTem?usp=sharing).

*Let's take a look at the dataset.*

The dataset contains 10 topics:

*   Chinh tri Xa hoi
*   Doi song
*   Khoa hoc
*   Kinh doanh
*   Phap luat
*   Suc khoe
*   The gioi
*   The thao
*   Van hoa
*   Vi tinh



Example of a document

```
Tạo bộ kit phát hiện vi khuẩn bệnh thương hàn (NLĐ)- Thạc sĩ Phạm Thái Bình, Trung tâm Phát triển khoa học và công nghệ trẻ, vừa nghiên cứu chế tạo thành công bộ kit phát hiện vi khuẩn bệnh thương hàn Salmonella typhi.
Dựa trên nguyên tắc kháng thể đặc hiệu của vi khuẩn thương hàn Salmonella typhi có trong huyết thanh người, nếu huyết thanh của người có chứa vi khuẩn bệnh thương hàn thì men gắn trên bộ kit sẽ chuyển từ không màu sang có màu (xanh hoặc vàng) 
Với bộ kit ELISA, chỉ cần lấy của bệnh nhân khoảng 1cc máu, ủ trong 2 giờ, có thể phát hiện bệnh thương hàn chính xác đến 98%. Bộ kit này đã được dùng trên 100 mẫu bệnh phẩm tại Bệnh viện Đa khoa Trung tâm An Giang, TP Long Xuyên, tỉnh An Giang. Theo chủ nhiệm đề tài, nếu bộ kit này được sản xuất tại Việt Nam, giá thành sẽ giảm hơn 50% so với giá nhập khẩu từ Mỹ. Đề tài đã được hội đồng nghiệm thu do Sở Khoa học và Công nghệ TPHCM lập đánh giá vào loại xuất sắc. Công ty TNHH SX-TM-DV Nam Khoa (Q.7 – TPHCM) cũng vừa đầu tư cho nhóm nghiên cứu đề tài 100 triệu đồng để đưa vào sản xuất thử (ảnh). 
```

> Label: Khoa hoc





As you can see, a single document spans several lines and is not tokenized. Since a pretrained language model relies on context, we do not remove stopwords or numbers as some traditional pipelines do. Therefore, we only need two steps:

*   Combine all lines in a document
*   Apply word tokenization

In [None]:
!pip install underthesea

In [None]:
import json
import os

from underthesea import word_tokenize

def clean_document(document):
    """
        return cleaned document (str)
          - Combine lines
          - Word tokenize using underthesea library
    """
    # combine lines: replace the newline symbols with space
    combine_lines = document.replace("\n", " ") 

    # optional step: fix the whitespace between words (maybe duplicated)
    whitespace_fix = " ".join(combine_lines.split())

    # word tokenize
    tokenized_doc = word_tokenize(whitespace_fix)
    tokenized_doc = " ".join([w.strip().replace(" ", "_") for w in tokenized_doc])
    return tokenized_doc

def process_data(datasrc, label2id, prefix):
    """
    return a dictionary mapping a sample id to a dict with 2 keys:
      - text: preprocessed text
      - labels: id corresponding to the label
    """
    processed_data = {}

    dp_id = 0
    for label in os.listdir(datasrc):
        label_id = label2id[label]
        for filename in os.listdir(os.path.join(datasrc, label)):
            try:
                # the raw files are read as UTF-16
                with open(os.path.join(datasrc, label, filename), "r", encoding="utf-16") as f:
                    document = f.read()
                cleaned_doc = clean_document(document)

                sample_id = prefix + str(dp_id)
                processed_data[sample_id] = {"text": cleaned_doc, "labels": label_id}
                dp_id += 1
            except Exception:
                # report files that could not be read or tokenized, then continue
                print(os.path.join(label, filename))

    return processed_data


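def save_data(data, split):
    # save your preprocessed data to data/clean/<split>.json
    os.makedirs("data/clean", exist_ok=True)  # create the output folder if it is missing
    with open(os.path.join("data/clean", split + ".json"), "w") as f:
        json.dump(data, f, indent=4)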


In [None]:
import os
import json

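# map topic to label id
datasrc = "data/raw/train/"

# sort the folder names so the mapping is the same on every run
label2id = {label: label_id for label_id, label in enumerate(sorted(os.listdir(datasrc)))}

# save the label dict to use later
os.makedirs("data/clean", exist_ok=True)  # create the output folder if it is missing
with open("data/clean/label2id.json", "w") as f:
    json.dump(label2id, f, indent=4)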

In [None]:
# Processing a large corpus may take a long time. Please wait.

train_data = process_data("data/raw/train", label2id, "train")
# Let's make a validation set from the training set
from sklearn.model_selection import train_test_split
train_keys = list(train_data.keys())
train_ids, valid_ids = train_test_split(train_keys, test_size= 0.1)

valid_data = {id:train_data[id] for id in valid_ids}
train_data = {id:train_data[id] for id in train_ids}

test_data = process_data("data/raw/test", label2id, "test")

In [None]:
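# take a look at a processed data sample
# (a fixed key such as "train1" may have been moved to the validation set by the split)
first_key = next(iter(train_data))
print(json.dumps(train_data[first_key], indent=4, ensure_ascii=False))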

In [None]:
# save preprocess data
save_data(train_data, "train")
save_data(valid_data, "valid")
save_data(test_data, "test")

The test set in this corpus is quite large (even larger than the training set), so I sample a small part of it to reduce processing and testing time before applying the pretrained language model.

In [None]:
import json
with open("data/clean/test.json", "r") as f:
    test_data = json.load(f)

from sklearn.model_selection import train_test_split
test_keys = list(test_data.keys())
test, sample_test_ids = train_test_split(test_keys, test_size= 3000)
print(len(sample_test_ids))

sample_test = {id:test_data[id] for id in sample_test_ids}
save_data(sample_test, "sample_test")

In [None]:
# Visualize class distribution in training set
### YOUR CODE HERE ###




# 3. Training model with transformers

In [None]:
# install required packages
!pip install datasets transformers

## 3.1. Loading dataset

To load your dataset from a file or download it from a URL, it is recommended to write a loading script.

The script loads your dataset and returns a DatasetDict object, which makes it convenient to use functions provided by Hugging Face (such as applying a tokenizer).

You can find an example of a script in _topic_data.py_. It contains 3 main methods:
*   ```_info()```: specifies the dataset metadata as a `datasets.DatasetInfo` dataclass, in particular the `datasets.Features` that define the name and type of each column in the dataset,

*   ```_split_generators()```: downloads or retrieves the data files, organizes them by split, and defines specific arguments for the generation process if needed,

*   ```_generate_examples()```: loads the files for a split and yields examples in the format specified in the features.

For more details: https://huggingface.co/docs/datasets/dataset_script
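
The contents of _topic_data.py_ are not shown in this notebook, so the following is only a minimal sketch of what the example-generation step could look like. It assumes the JSON layout written by `save_data()` above (a dictionary from sample id to `text` and `labels`); the standalone function name `generate_examples` and the sample rows are made up for illustration. In a real loading script this logic would live in the `_generate_examples()` method of a `datasets.GeneratorBasedBuilder` subclass.

```python
import json
import os
import tempfile


def generate_examples(filepath):
    """Yield (key, example) pairs from one JSON split file.

    Assumes the layout written by save_data():
    {"train0": {"text": "...", "labels": 3}, ...}
    """
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    for key, row in data.items():
        yield key, {"text": row["text"], "labels": row["labels"]}


# Tiny self-check with a temporary file (the sample rows are made up).
sample = {
    "train0": {"text": "bộ kit phát_hiện vi_khuẩn", "labels": 2},
    "train1": {"text": "đội_tuyển thắng trận", "labels": 7},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    examples = list(generate_examples(path))

print(examples[0])
```

Each yielded key must be unique within a split, which the `prefix + str(dp_id)` ids created during preprocessing already guarantee.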




In [None]:
from datasets import load_dataset

cleaned_data = load_dataset("topic_data.py")

In [None]:
print(cleaned_data)

## 3.2. Tokenizer

To use a pretrained language model such as BERT, GPT, or RoBERTa, we need to tokenize the text with the same subword strategy the model was trained with (BPE, WordPiece, ...).


Hugging Face lets us load the tokenizer that matches a model by passing the model name from their [hub](https://huggingface.co/models). In this notebook, we use PhoBERT-base (vinai/phobert-base).

Apply the tokenizer to the dataset with the ```map()``` function. This returns a dataset with additional columns that are required by the model's ```forward``` function.
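
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: the token ids, the `pad_id` value, and the helper name `pad_or_truncate` are made up, and a real tokenizer also keeps the special start and end tokens when it truncates, which this sketch skips.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop tokens past the limit
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # how many pad tokens are needed
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# A short document gets padded, a long one gets cut (ids are made up).
short_ids, short_mask = pad_or_truncate([0, 11, 12, 2], max_length=6, pad_id=1)
long_ids, long_mask = pad_or_truncate([0, 11, 12, 13, 14, 15, 16, 2], max_length=6, pad_id=1)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed to the model together with `input_ids`.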

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")  # model name on the Hugging Face hub

def tokenize_function(examples):
    # pad or truncate every document to 256 tokens, the longest input PhoBERT accepts
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

tokenized_datasets = cleaned_data.map(tokenize_function, batched=True)  # apply the tokenizer to the dataset

In [None]:
"""
additional elements: 'input_ids', 'token_type_ids', 'attention_mask'
"""

print(tokenized_datasets)

## 3.3. Pretrained language model

In this notebook, we use [PhoBERT](https://arxiv.org/abs/2003.00744), a popular pretrained language model for Vietnamese. Its architecture is similar to BERT.

For fine-tuning on the classification task, we use the ```AutoModelForSequenceClassification``` class, which automatically adds a classification head (classifier) on top of the pretrained model.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("vinai/phobert-base", num_labels=len(label2id))

## 3.4. Trainer

The Trainer object helps us train and evaluate the model on the dataset. It takes the model and the tokenized data as input parameters.

It also requires a TrainingArguments object that defines hyperparameters such as:


*   num_train_epochs
*   per_device_train_batch_size
*   learning_rate
*   ...

For more details: [Huggingface Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)




In [None]:
import numpy as np
from datasets import load_metric

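# define metric for evaluation (accuracy)
# note: load_metric is deprecated in recent versions of datasets;
# the separate evaluate library (evaluate.load("accuracy")) replaces it
metric = load_metric("accuracy")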
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
from transformers import Trainer, TrainingArguments

epochs = 2
batch_size = 8
lr = 2e-5

training_args = TrainingArguments(
    output_dir = "topic_classification_result",
    evaluation_strategy = "steps",  # evaluate every eval_steps optimizer steps
    num_train_epochs=epochs,
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=1,
    save_steps=2000,
    eval_steps=2000,
    gradient_accumulation_steps=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics
)
trainer.train()
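
The interplay between the batch size, `gradient_accumulation_steps`, and `eval_steps` in the cell above can be worked out with a little arithmetic. The sketch below assumes a single device and a hypothetical training set of 30,000 examples (replace it with `len(tokenized_datasets["train"])`); the exact step count reported by the Trainer may differ slightly depending on the library version.

```python
import math

num_train_examples = 30_000        # hypothetical size, not the real corpus size
per_device_batch_size = 8          # batch_size in the training cell
gradient_accumulation_steps = 2    # same value as in the training arguments
num_devices = 1                    # assumption: training on a single GPU
epochs = 2
eval_steps = 2000

# Number of examples that contribute to one optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# Optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Evaluation runs once every eval_steps optimizer updates.
num_evaluations = total_steps // eval_steps

print(effective_batch, steps_per_epoch, total_steps, num_evaluations)
```

Under these assumptions the run performs a single intermediate evaluation, so a smaller `eval_steps` may be worth considering if you want to follow the validation accuracy more closely.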

In [None]:
# Let's evaluate on our test set
test_eval = trainer.predict(tokenized_datasets["test"])
print(json.dumps(test_eval.metrics, indent= 4))

In [None]:
# Save the final model. It is saved in the output_dir defined in the training arguments
trainer.save_model()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

with open("data/clean/label2id.json" ,"r") as f:
    label2id = json.load(f)
id2label = {value: key for key, value in label2id.items()}
label_tags = [id2label[i] for i in range(len(id2label))]

# predicted class = index of the largest logit for each sample
predictions = np.argmax(test_eval.predictions, axis=-1)

cfs = confusion_matrix(tokenized_datasets["test"]['labels'], predictions)
fig, ax1 = plt.subplots(1,1, figsize=(10,8))
sns.heatmap(cfs, annot=True, fmt='d',
            xticklabels=label_tags, yticklabels=label_tags, ax= ax1)
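
When reading the heatmap, recall that scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, so the diagonal holds the correctly classified documents. As a small illustration, the sketch below derives per-class recall from a matrix; the 3-class matrix is made up and is not a result from this notebook, and `per_class_recall` is a hypothetical helper.

```python
def per_class_recall(matrix):
    """Recall per class: the diagonal entry divided by the sum of its row."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)                              # documents whose true class is i
        recalls.append(row[i] / total if total else 0.0)
    return recalls


# Made-up 3-class confusion matrix (rows: true class, columns: predicted class).
toy_matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
recalls = per_class_recall(toy_matrix)
print(recalls)
```

Applying the same function to `cfs.tolist()` would show which topics the model confuses most often.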

### Custom loss function

Let's try to customize the loss function. Use the class distribution computed above to define a WeightedCrossEntropyLoss that deals with imbalanced data. For instance, the weights can be calculated as:

$w_i = \log\left(\frac{N}{N_i}\right)$

where $N$ is the total number of data points and $N_i$ is the number of data points in the $i$-th class.
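
As a minimal sketch of this formula, the snippet below computes the weights from a list of label ids using only the standard library. The label list is made up for illustration, and `class_weights` is a hypothetical helper name; in the notebook you would pass the `labels` column of the training set instead.

```python
import math
from collections import Counter


def class_weights(labels, num_classes):
    """Return [w_0, ..., w_{K-1}] with w_i = log(N / N_i)."""
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for i in range(num_classes):
        if counts[i] == 0:
            # log(N / 0) is undefined, so an empty class is reported explicitly
            raise ValueError(f"class {i} has no examples")
        weights.append(math.log(total / counts[i]))
    return weights


# Made-up label ids: class 0 is common, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = class_weights(toy_labels, num_classes=3)
print([round(w, 3) for w in weights])
```

Rare classes receive larger weights. One possible way to use the result is to convert the list to a tensor and pass it as the `weight` argument of `torch.nn.CrossEntropyLoss` inside your custom loss.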

Pretrained language models are sometimes claimed to be less sensitive to imbalanced data. Do the results support that claim?

In [None]:
from transformers import Trainer, TrainingArguments

epochs = 2
batch_size = 8
lr = 2e-5

training_args = TrainingArguments(
    output_dir = "topic_classification_result",
    evaluation_strategy = "steps",  # evaluate every eval_steps optimizer steps
    num_train_epochs=epochs,
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_total_limit=1,
    save_steps=10000,
    eval_steps=1000,
    gradient_accumulation_steps=2,
)

In [None]:
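# Customize a trainer to define the loss function
# hint: read the doc https://huggingface.co/docs/transformers/main_classes/trainer
# note: `model` was already fine-tuned above; reload it from "vinai/phobert-base" for a fair comparison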

### YOUR CODE HERE ###


#####################

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
test_eval = trainer.predict(tokenized_datasets["test"])
print(json.dumps(test_eval.metrics, indent= 4))

# 4. Demo

Load our fine-tuned model and make a simple demo 🤗

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("topic_classification_result")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

In [None]:
import json

from transformers import TextClassificationPipeline  # pipeline for text classification

# load label2id
with open("data/clean/label2id.json", "r") as f:
    label2id = json.load(f)
id2label = {value: key for key, value in label2id.items()}

# build the pipeline once instead of on every loop iteration
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

while True:
    document = input("Input text here: ")
    if document == "exit":
        break
    # apply the same preprocessing as for the training data
    document = clean_document(document)
    pred_topic_id = model.config.label2id[pipe(document)[0]['label']]
    print("Topic: ", id2label[pred_topic_id])
    print("="*100)