## Indonesian DistilBERT finetuning with ArcMargin

In this notebook we first download a DistilBERT model and tokenizer from HuggingFace that were pre-trained on Indonesian Wikipedia. Then we fine-tune the model on the titles in this dataset with the help of ArcMarginProduct to build more useful embeddings. After that, we can use the model to obtain embeddings for titles in the test set, in the hope that they are representative enough to tell similar products from dissimilar ones.

If you are not familiar with HuggingFace or BERT models, I've written a tutorial on them on Kaggle. In addition to explaining how to work with HuggingFace models, it points to resources for learning more about NLP and Transformers in general. You can find the notebook [here](https://www.kaggle.com/moeinshariatnia/simple-distilbert-fine-tuning-0-84-lb).

In [None]:
import os
import copy
import math
import pandas as pd
import numpy as np
from tqdm.autonotebook import tqdm
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import transformers
from transformers import (BertTokenizer, BertModel,
                          DistilBertTokenizer, DistilBertModel)

In [None]:
train = pd.read_csv("../input/rosaccred-dataset/result_file_for_training.csv")

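# The codes are stored as floats; round before casting so that floating-point
# error in x * 100 cannot truncate a code by one.
train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'] =\
                train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'].apply(lambda x: int(round(x * 100)))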

train = train[train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)']\
    .isin(train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'].value_counts().index.tolist()[:50])]

# train = pd.concat([train[train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'] == 930010].sample(4000),\
#                   train[~(train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'] == 930010)]])
# train = train.sample(train.shape[0]).reset_index(drop=True)

display(train.head(), train.shape[0])

The following histogram gives a rough idea of how many words each title contains. It is not a precise count of the tokens fed to the model, because the DistilBERT tokenizer splits text into sub-word pieces rather than simply splitting the sentence on whitespace.

In [None]:
# str() guards against non-string titles; split() without arguments ignores repeated whitespace
title_lengths = train['Общее наименование продукции'].apply(lambda x: len(str(x).split())).to_numpy()
print(f"MIN words: {title_lengths.min()}, MAX words: {title_lengths.max()}")
plt.hist(title_lengths);
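
To make the gap between word counts and token counts concrete, here is a minimal sketch of greedy longest-match sub-word splitting, the general idea behind WordPiece tokenizers. The vocabulary and the title are toy values invented for illustration, and `wordpiece` is a hypothetical helper, not the HuggingFace implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary and title (assumptions for illustration only)
toy_vocab = {"steel", "pipe", "##s", "##line", "for", "gas"}
title = "steel pipes for gas pipeline"

words = title.split()
tokens = [piece for word in words for piece in wordpiece(word, toy_vocab)]

print(len(words), words)
print(len(tokens), tokens)
```

Five whitespace-separated words become seven sub-word tokens here. A real BERT-style tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, so the word histogram tends to underestimate the sequence length.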

`max_length` is set to 30 based on the histogram, but you can safely change it.
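
One way to pick such a cutoff more systematically is to ask which length covers a given share of the titles. The sketch below assumes a toy list of word counts, and `length_for_coverage` is a hypothetical helper; since it counts words rather than tokens, the result is best read as a lower bound.

```python
import math


def length_for_coverage(lengths, coverage=0.95):
    """Smallest length L such that at least `coverage` of the items have length <= L."""
    ordered = sorted(lengths)
    k = math.ceil(coverage * len(ordered))  # how many items must be covered
    return ordered[max(k, 1) - 1]


# Toy word counts (assumption for illustration only)
toy_lengths = [3, 5, 5, 7, 8, 9, 12, 14, 20, 58]

print(length_for_coverage(toy_lengths, coverage=0.5))
print(length_for_coverage(toy_lengths, coverage=0.9))
print(length_for_coverage(toy_lengths, coverage=1.0))
```

With a long-tailed distribution like this one, covering the last few outliers costs far more sequence length than covering the bulk of the titles, which is why a cutoff well below the maximum is usually acceptable.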

In [None]:
class CFG:
    DistilBERT = True # if set to False, BERT model will be used
    bert_hidden_size = 768
    
    batch_size = 64
    epochs = 30
    num_workers = 4
    learning_rate = 1e-5 #3e-5
    scheduler = "ReduceLROnPlateau"
    step = 'epoch'
    patience = 2
    factor = 0.8
    dropout = 0.5
    model_path = "/kaggle/working"
    max_length = 30
    model_save_name = "model.pt"
    device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')

Loading the model and its tokenizer from the HuggingFace model hub. As mentioned before, this model has been pre-trained on Indonesian Wikipedia.

In [None]:
if CFG.DistilBERT:
    model_name='cahya/distilbert-base-indonesian'
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    bert_model = DistilBertModel.from_pretrained(model_name)
else:
    model_name='cahya/bert-base-indonesian-522M'
    tokenizer = BertTokenizer.from_pretrained(model_name)
    bert_model = BertModel.from_pretrained(model_name)

An example of tokenizing one title and running it through the model:

In [None]:
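# the titles live in the 'Общее наименование продукции' column; there is no 'title' column in this dataset
text = str(train['Общее наименование продукции'].values[np.random.randint(0, len(train))])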
print(f"Text of the title: {text}")
encoded_input = tokenizer(text, return_tensors='pt')
print(f"Input tokens: {encoded_input['input_ids']}")
decoded_input = tokenizer.decode(encoded_input['input_ids'][0])
print(f"Decoded tokens: {decoded_input}")
output = bert_model(**encoded_input)
print(f"last layer's output shape: {output.last_hidden_state.shape}")

## Dataset

Encoding the category-code column as numeric labels so we can feed them to the model and the loss function.
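
As a rough picture of what the encoder does: it maps the sorted unique values to consecutive integers starting at 0. The sketch below re-creates that mapping in plain Python; `fit_label_encoder` is a hypothetical helper, and most of the codes are made-up examples.

```python
def fit_label_encoder(values):
    """Map each distinct value to an index in sorted order, as LabelEncoder does."""
    classes = sorted(set(values))
    to_index = {value: index for index, value in enumerate(classes)}
    return classes, to_index


# Toy category codes (assumption for illustration only)
codes = [930010, 262011, 930010, 141900, 262011]

classes, to_index = fit_label_encoder(codes)
encoded = [to_index[code] for code in codes]
decoded = [classes[index] for index in encoded]  # inverse transform

print(classes)
print(encoded)
print(decoded == codes)
```

The encoded labels always lie in `0 .. NUM_CLASSES - 1`, which is the range `nn.CrossEntropyLoss` expects for class indices.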

In [None]:
lbl_encoder = LabelEncoder()
train['label_code'] = lbl_encoder.fit_transform(train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'])
NUM_CLASSES = train['label_code'].nunique()

In [None]:
NUM_CLASSES

In [None]:
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, dataframe, tokenizer, mode="train", max_length=None):
        self.dataframe = dataframe
        if mode != "test":
            self.targets = dataframe['label_code'].values
        texts = list(dataframe['Общее наименование продукции'].apply(lambda o: str(o)).values)
        self.encodings = tokenizer(texts, 
                                   padding=True, 
                                   truncation=True, 
                                   max_length=max_length)
        self.mode = mode
        
        
    def __getitem__(self, idx):
        # putting each tensor in front of the corresponding key from the tokenizer
        # HuggingFace tokenizers give you whatever you need to feed to the corresponding model
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        # when testing, there are no targets so we won't do the following
        if self.mode != "test":
            item['labels'] = torch.tensor(self.targets[idx]).long()
        return item
    
    def __len__(self):
        return len(self.dataframe)
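
The `padding`, `truncation` and `max_length` arguments above control the shape of each encoded title. Below is a simplified sketch of the idea with toy token ids. `pad_or_truncate` is a hypothetical helper that pads to a fixed length, whereas `padding=True` pads only up to the longest sequence that was encoded, and the real tokenizer keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it; also build the attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Toy token ids (assumption for illustration only)
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which is why the dataset passes it along with `input_ids`.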

In [None]:
dataset = TextDataset(train.sample(100), tokenizer, max_length=CFG.max_length)
dataloader = torch.utils.data.DataLoader(dataset, 
#                                          batch_size=CFG.batch_size, 
                                         num_workers=CFG.num_workers, 
                                         shuffle=True)
batch = next(iter(dataloader))
print(batch['input_ids'].shape, batch['labels'].shape)

In [None]:
dataset

In [None]:
# code from https://github.com/ronghuaiyang/arcface-pytorch/blob/47ace80b128042cd8d2efd408f55c5a3e156b032/models/metrics.py#L10

class ArcMarginProduct(nn.Module):
    r"""Large-margin arc distance: the target class logit uses cos(theta + m).
        Args:
            in_features: size of each input sample
            out_features: size of each output sample (number of classes)
            s: scale applied to the output (norm of the re-scaled feature)
            m: additive angular margin, in radians
        """
    def __init__(self, in_features, out_features, s=30.0, m=0.50, easy_margin=False):
        super(ArcMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s
        self.m = m
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

        self.easy_margin = easy_margin
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------------
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
        phi = cosine * self.cos_m - sine * self.sin_m
        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        # --------------------------- convert label to one-hot ---------------------------
        # allocate on the same device as the cosine tensor, wherever the model lives
        one_hot = torch.zeros(cosine.size(), device=cosine.device)
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        # -------------torch.where(out_i = {x_i if condition_i else y_i) -------------
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)  # you can use torch.where if your torch.__version__ is 0.4
        output *= self.s
        # print(output)

        return output
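
To see what the margin does numerically, here is a small NumPy sketch that mirrors the non-easy-margin branch of the module above. The embedding and class weights are toy values, and `arc_margin_logits` is a hypothetical helper written for illustration, not a replacement for the PyTorch module.

```python
import numpy as np


def arc_margin_logits(embeddings, weights, labels, s=30.0, m=0.50):
    """Scaled cosine logits with an additive angular margin on the target class."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cosine = x @ w.T                                   # cos(theta) per class
    theta = np.arccos(np.clip(cosine, -1.0, 1.0))
    th = np.cos(np.pi - m)
    mm = np.sin(np.pi - m) * m
    # cos(theta + m) where theta + m stays below pi, a linear fallback otherwise
    phi = np.where(cosine > th, np.cos(theta + m), cosine - mm)
    one_hot = np.eye(weights.shape[0])[labels]
    return s * (one_hot * phi + (1.0 - one_hot) * cosine)


# Toy inputs (assumption): one embedding perfectly aligned with class 0
embeddings = np.array([[1.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0])

with_margin = arc_margin_logits(embeddings, weights, labels)
without_margin = arc_margin_logits(embeddings, weights, labels, m=0.0)
print(with_margin)
print(without_margin)
```

Even a perfectly aligned embedding gets a lower target logit once the margin is applied, so the loss keeps pushing embeddings of the same class closer together in angle.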

In [None]:
class Model(nn.Module):
    def __init__(self, 
                 bert_model, 
                 num_classes=NUM_CLASSES, 
                 last_hidden_size=CFG.bert_hidden_size):
        
        super().__init__()
        self.bert_model = bert_model
        self.arc_margin = ArcMarginProduct(last_hidden_size, 
                                           num_classes,
                                           s=30.0, 
                                           m=0.50, 
                                           easy_margin=False)
    
    def get_bert_features(self, batch):
        output = self.bert_model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
        last_hidden_state = output.last_hidden_state # shape: (batch_size, seq_length, bert_hidden_dim)
        CLS_token_state = last_hidden_state[:, 0, :] # obtaining CLS token state which is the first token.
        return CLS_token_state
    
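    def forward(self, batch):
        CLS_hidden_state = self.get_bert_features(batch)
        # NOTE: with the ArcMargin call commented out, forward returns the raw CLS
        # embeddings, which is what the inference cells below expect. Uncomment the
        # next line and return `output` to train against ArcMargin logits.
#         output = self.arc_margin(CLS_hidden_state, batch['labels'])
        return CLS_hidden_state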

In [None]:
class AvgMeter:
    def __init__(self, name="Metric"):
        self.name = name
        self.reset()
    
    def reset(self):
        self.avg, self.sum, self.count = [0]*3
    
    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count
    
    def __repr__(self):
        text = f"{self.name}: {self.avg:.4f}"
        return text

def one_epoch(model, 
              criterion, 
              loader,
              optimizer=None, 
              lr_scheduler=None, 
              mode="train", 
              step="batch"):
    
    loss_meter = AvgMeter()
    acc_meter = AvgMeter()
    
    tqdm_object = tqdm(loader, total=len(loader))
    for batch in tqdm_object:
        batch = {k: v.to(CFG.device) for k, v in batch.items()}
        preds = model(batch)
        loss = criterion(preds, batch['labels'])
        if mode == "train":
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step == "batch":
                lr_scheduler.step()
                
        count = batch['input_ids'].size(0)
        loss_meter.update(loss.item(), count)
        
        accuracy = get_accuracy(preds.detach(), batch['labels'])
        acc_meter.update(accuracy.item(), count)
        if mode == "train":
            tqdm_object.set_postfix(train_loss=loss_meter.avg, accuracy=acc_meter.avg, lr=get_lr(optimizer))
        else:
            tqdm_object.set_postfix(valid_loss=loss_meter.avg, accuracy=acc_meter.avg)
    
    return loss_meter, acc_meter

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group["lr"]

def get_accuracy(preds, targets):
    """
    preds shape: (batch_size, num_labels)
    targets shape: (batch_size)
    """
    preds = preds.argmax(dim=1)
    acc = (preds == targets).float().mean()
    return acc
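
`AvgMeter.update` weights each value by the batch size. The sketch below uses toy numbers to show why that matters when the last batch is smaller; `RunningMean` is a hypothetical stand-in with the same arithmetic.

```python
class RunningMean:
    """Sample-weighted running average, same arithmetic as AvgMeter."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.total += value * n
        self.count += n

    @property
    def avg(self):
        return self.total / self.count


# Toy per-batch losses and batch sizes (assumption for illustration only)
batches = [(1.0, 64), (3.0, 16)]

meter = RunningMean()
for loss, size in batches:
    meter.update(loss, size)

unweighted = sum(loss for loss, _ in batches) / len(batches)
print(meter.avg, unweighted)
```

The weighted value is the true per-sample average, while the unweighted one lets a small final batch count as much as a full one.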

In [None]:
def train_eval(epochs, model, train_loader, valid_loader, 
               criterion, optimizer, lr_scheduler=None):
    
    best_loss = float('inf')
    best_model_weights = copy.deepcopy(model.state_dict())
    
    for epoch in range(epochs):
        print("*" * 30)
        print(f"Epoch {epoch + 1}")
        current_lr = get_lr(optimizer)
        
        model.train()
        train_loss, train_acc = one_epoch(model, 
                                          criterion, 
                                          train_loader, 
                                          optimizer=optimizer,
                                          lr_scheduler=lr_scheduler,
                                          mode="train",
                                          step=CFG.step)                     
        model.eval()
        with torch.no_grad():
            valid_loss, valid_acc = one_epoch(model, 
                                              criterion, 
                                              valid_loader, 
                                              optimizer=None,
                                              lr_scheduler=None,
                                              mode="valid")
        
        if valid_loss.avg < best_loss:
            best_loss = valid_loss.avg
            best_model_weights = copy.deepcopy(model.state_dict())
            torch.save(model.state_dict(), f'{CFG.model_path}/{CFG.model_save_name}')
            print("Saved best model!")
            # no `break` here: it would stop training after the first epoch,
            # because the first validation loss is always below float('inf')
        
        if isinstance(lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            lr_scheduler.step(valid_loss.avg)
            if current_lr != get_lr(optimizer):
                print("Loading best model weights!")
                model.load_state_dict(torch.load(f'{CFG.model_path}/{CFG.model_save_name}', 
                                                 map_location=CFG.device))
        
        print("*" * 30)
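
With `patience = 2` and `factor = 0.8`, the learning rate drops only after the validation loss has failed to improve for more than two consecutive epochs. The following is a simplified simulation of that rule, not PyTorch's implementation: `simulate_plateau_schedule` is a hypothetical helper that ignores the scheduler's `threshold` and `cooldown` options, and the losses are made up.

```python
def simulate_plateau_schedule(losses, lr, factor=0.8, patience=2):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:   # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Toy validation losses (assumption for illustration only)
valid_losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.5]
history = simulate_plateau_schedule(valid_losses, lr=1e-5)
print(history)
```

In this toy run the rate stays at `1e-5` through three non-improving epochs and is multiplied by 0.8 only on the third one, which is the point at which `train_eval` also reloads the best saved weights.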

In [None]:
train_df, valid_df = train_test_split(train, 
                                      test_size=0.33, 
                                      shuffle=True, 
                                      random_state=42,
                                      stratify=train['label_code'])

train_dataset = TextDataset(train_df, tokenizer, max_length=CFG.max_length)
train_loader = torch.utils.data.DataLoader(train_dataset, 
                                           batch_size=CFG.batch_size, 
                                           num_workers=CFG.num_workers, 
                                           shuffle=True)

valid_dataset = TextDataset(valid_df, tokenizer, max_length=CFG.max_length)
valid_loader = torch.utils.data.DataLoader(valid_dataset, 
                                           batch_size=CFG.batch_size, 
                                           num_workers=CFG.num_workers, 
                                           shuffle=False)

In [None]:
model = Model(bert_model).to(CFG.device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=CFG.learning_rate)
if CFG.scheduler == "ReduceLROnPlateau":
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 
                                                              mode="min", 
                                                              factor=CFG.factor, 
                                                              patience=CFG.patience)

train_eval(CFG.epochs, model, train_loader, valid_loader,
           criterion, optimizer, lr_scheduler=lr_scheduler)

In [None]:
# !mkdir tokenizer
# tokenizer.save_pretrained("./tokenizer")
torch.save(model.state_dict(), "final.pt")

In [None]:
model = Model(bert_model)
model.load_state_dict(torch.load('../input/zaebalomenya-eto-vse/model(1).pt', map_location=torch.device('cpu')))
model.cpu()

In [None]:
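def get_predicts(model, dataloader):
    # run on whatever device the model currently lives on (CPU or GPU)
    device = next(model.parameters()).device
    preds = []
    model.eval()
    with torch.no_grad():  # inference only, so no autograd graph is needed
        for batch in tqdm(dataloader, total=len(dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            preds.append(model(batch))
    return preds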

In [None]:
model = Model(bert_model)
model.load_state_dict(torch.load('../input/zaebalomenya-eto-vse/model(1).pt'))
model.eval()
model.cuda()

categories = train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'].unique()
base_vectors_for_unique_categories = {}
for subCategory in tqdm(categories):
    dataframe_categories = train[train['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'] == subCategory]
    # use at most 500 titles per category to build its base vector
    if dataframe_categories.shape[0] > 500:
        corpus = dataframe_categories.sample(500).reset_index(drop=True)
    else:
        corpus = dataframe_categories.reset_index(drop=True)

    dataset = TextDataset(corpus, tokenizer, max_length=CFG.max_length)
    dataloader = torch.utils.data.DataLoader(dataset,
                                             batch_size=32,
                                             num_workers=CFG.num_workers,
                                             shuffle=False)
    embeddings = get_predicts(model, dataloader)

    # average over all samples at once; a mean of per-batch means would
    # over-weight the (usually smaller) last batch
    mean_emb = torch.cat(embeddings).mean(dim=0).detach().cpu().numpy()

    base_vectors_for_unique_categories.update({str(subCategory): mean_emb})
{'base_vectors': base_vectors_for_unique_categories}
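
A category's base vector is meant to be the mean over all of its sampled embeddings. The toy NumPy sketch below (with made-up batches) shows how that pooled mean can differ from an average of per-batch means when the last batch is smaller.

```python
import numpy as np

# Toy "embeddings" (assumption): one full batch of 4 rows and a last batch of 1 row
batches = [np.ones((4, 2)), np.full((1, 2), 5.0)]

# averaging the per-batch means gives every batch the same weight
mean_of_batch_means = np.mean([b.mean(axis=0) for b in batches], axis=0)

# pooling all rows first gives every sample the same weight
pooled_mean = np.concatenate(batches).mean(axis=0)

print(mean_of_batch_means)
print(pooled_mean)
```

With 500 samples and a batch size of 32 the difference is small, since only the final batch of 20 is affected, but the pooled version is the exact centroid.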

In [None]:
base_vectors_for_unique_categories

In [None]:
json_with_emb = pd.DataFrame(base_vectors_for_unique_categories)
json_with_emb.to_csv('df_with_embs.csv', index=False)


In [None]:
json_with_emb = pd.read_csv('df_with_embs.csv')
test_df = train.sample(50).reset_index(drop=True)
test_df['predict'] = None
for i in range(test_df.shape[0]):
    dataset = TextDataset(test_df.iloc[i:i+1, :], tokenizer, max_length=CFG.max_length)
    dataloader = torch.utils.data.DataLoader(dataset,
                                         batch_size=1,
                                         num_workers=CFG.num_workers, 
                                         shuffle=False)
    predicts = get_predicts(model, dataloader)[0][0].cpu().detach().numpy()

    # json_with_emb has one column per category, so transpose to (n_categories, hidden_size)
    dists = np.sum((np.square(predicts - json_with_emb.values.T)), axis=1)
    indices = np.argsort(dists)[:5]
    predict_category = json_with_emb.columns[indices[0]]
    test_df.loc[i, 'predict'] = predict_category
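
The prediction step above is a nearest-centroid lookup: compute the squared Euclidean distance from the title's embedding to every category base vector and take the closest. Here is a self-contained sketch with 2-dimensional toy vectors; `nearest_centroids` is a hypothetical helper and the category keys are example values.

```python
import numpy as np


def nearest_centroids(embedding, centroids, top_k=5):
    """Rank category keys by squared Euclidean distance to the embedding."""
    keys = list(centroids.keys())
    matrix = np.stack([centroids[key] for key in keys])   # (n_categories, dim)
    dists = np.sum(np.square(matrix - embedding), axis=1)
    order = np.argsort(dists)[:top_k]
    return [(keys[i], float(dists[i])) for i in order]


# Toy base vectors and query embedding (assumptions for illustration only)
toy_centroids = {
    "141900": np.array([0.0, 0.0]),
    "262011": np.array([1.0, 1.0]),
    "930010": np.array([4.0, 0.0]),
}
query = np.array([0.9, 0.8])

ranked = nearest_centroids(query, toy_centroids)
print(ranked)
```

Returning the top few candidates instead of only the best one makes it easier to inspect near misses between similar categories.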


In [None]:
from sklearn.metrics import accuracy_score
test_df = test_df[~(test_df['predict'].isna())]
test_df['predict'] = test_df['predict'].apply(lambda x: int(x))

accuracy_score(test_df['predict'].values,\
               test_df['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'].values)

# For CPU, without labels

In [None]:
model = Model(bert_model)
model.load_state_dict(torch.load('../input/zaebalomenya-eto-vse/model(1).pt', map_location=torch.device('cpu')))
model.cpu()

In [None]:
test_df.loc[10, 'Общее наименование продукции']

In [None]:
test_df = train.sample(50).reset_index(drop=True)
test_df['predict'] = None
model.cpu()
model.eval()
for i in range(test_df.shape[0]):
    # encode row i (not a fixed row) so each prediction matches the row it is written to
    dataset = TextDataset(test_df.iloc[i:i+1, :], tokenizer, max_length=CFG.max_length, mode='test')
    dataloader = torch.utils.data.DataLoader(dataset,
                                             batch_size=1,
                                             num_workers=CFG.num_workers, 
                                             shuffle=False)
    preds = []
    for batch in tqdm(dataloader, total=len(dataloader)):
        with torch.no_grad():
            batch = {k: v.cpu() for k, v in batch.items()}
            preds.append(model(batch))

    dists = np.sum((np.square(preds[0][0].cpu().detach().numpy() - np.array(list(base_vectors_for_unique_categories.values())))), axis=1)
    indices = np.argsort(dists)[:5]
    predict_category = list(base_vectors_for_unique_categories.keys())[indices[0]]
    test_df.loc[i, 'predict'] = predict_category
predict_category

In [None]:
predict_category

In [None]:
test_df = test_df[~(test_df['predict'].isna())]
test_df['predict'] = test_df['predict'].apply(lambda x: int(x))

accuracy_score(test_df['predict'].values,\
               test_df['Раздел ЕП РФ (Код из ФГИС ФСА для подкатегории продукции)'].values)

# predict with tensor and label

In [None]:
dists = np.sum((np.square(predicts - np.array(list(base_vectors_for_unique_categories.values())))), axis=1)
indices = np.argsort(dists)[:5]
predict_category = list(base_vectors_for_unique_categories.keys())[indices[0]]
predict_category

In [None]:
indices = np.argsort(dists)[:5]
list(base_vectors_for_unique_categories.keys())[indices[0]]

In [None]:
list(base_vectors_for_unique_categories.keys())[10]

In [None]:
dataset = TextDataset(train.sample(500), tokenizer, max_length=CFG.max_length)
dataloader = torch.utils.data.DataLoader(dataset,
                                         batch_size=64,
                                         num_workers=CFG.num_workers, 
                                         shuffle=True)

In [None]:
train

In [None]:
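def get_predicts(model, dataloader):
    # run on whatever device the model currently lives on (CPU or GPU)
    device = next(model.parameters()).device
    preds = []
    model.eval()
    with torch.no_grad():  # inference only, so no autograd graph is needed
        for batch in tqdm(dataloader, total=len(dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            preds.append(model(batch))
    return preds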

predicts = get_predicts(model, dataloader)

In [None]:
predicts[0][0]

In [None]:
mean_embs = np.zeros(predicts[0].shape[1])
for predict in predicts:
    mean_embs += predict.mean(axis=0).cpu().detach().numpy()
mean_embs /= len(predicts)  # turn the sum of per-batch means into an average
mean_embs.shape

In [None]:
predicts[0].shape[1]

In [None]:
np.zeros(5)

In [None]:
len(predicts)

In [None]:
mean_embs[0]

In [None]:
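text = 'max_length=maxl, pad_to_max_length=True, truncation=True'
from torch.utils.data import TensorDataset, DataLoader
# padding='max_length' replaces the deprecated pad_to_max_length=True;
# return_tensors='pt' yields tensors of shape (1, 30), i.e. a dataset with one sample
encoded = tokenizer(text, max_length=30, padding='max_length', truncation=True, return_tensors='pt')
test_data = TensorDataset(encoded['input_ids'], encoded['attention_mask'])
test_dataloader = DataLoader(
    test_data,
    batch_size=1,
    num_workers=4,
    pin_memory=True
)

In [None]:
device = next(model.parameters()).device
for input_ids, attention_mask in test_dataloader:
    # Model.forward expects a dict with 'input_ids' and 'attention_mask'
    batch = {'input_ids': input_ids.to(device), 'attention_mask': attention_mask.to(device)}
    with torch.no_grad():
        logits = model(batch)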

In [None]:
batch['input_ids'][0].shape

In [None]:
train.loc[0, 'Общее наименование продукции']

In [None]:
def model_vector(text, tokenizer_model):
    # tokenize one title into the dict of tensors that Model.forward expects
    encoded = tokenizer_model(text, max_length=30, padding='max_length',
                              truncation=True, return_tensors='pt')
    device = next(model.parameters()).device  # follow the model instead of hard-coding 'cuda'
    return {k: v.to(device) for k, v in encoded.items()}

model(model_vector(train.loc[0, 'Общее наименование продукции'], tokenizer))

In [None]:
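model.eval()
# Model.forward expects the whole batch dict ('input_ids' and 'attention_mask'), not a bare tensor
with torch.no_grad():
    output = model(batch)
output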

In [None]:
# load the saved state dict first, then inspect its tensors
model_ = torch.load('../input/zaebalomenya-eto-vse/model(1).pt', map_location=torch.device('cpu'))
model_['arc_margin.weight'].shape

In [None]:
model_['bert_model.embeddings.word_embeddings.weight'].shape