<a href="https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/MarkupLM/Fine_tune_MarkupLMForTokenClassification_on_a_custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set-up environment

First, we install 🤗 Transformers.

We also install 🤗 Evaluate and Seqeval, for computing metrics like F1, recall and precision.

In [20]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '1'

In [21]:
import torch
print(torch.cuda.device_count())

1


## Prepare dataset

Next, let's load a toy dataset which we'll use to fine-tune MarkupLM on.

The goal for the model is to label nodes of HTML strings with the appropriate class.

In [22]:
import numpy as np

def sample_websites(dataset, k, seed):
    all_websites = list(dataset.keys())
    training_websites = np.random.choice(all_websites, size=k, replace=False)
    print(training_websites)

    train = []
    valid = []
    for website in all_websites:
        if website in training_websites:
            train += dataset[website]
        else:
            valid += dataset[website]
    
    return train, valid

In [23]:
import json

with open("/home/savkin/vera/all_camera_datasets_merged.json", "r") as file:
    data = json.load(file)

train_data, valid_data = sample_websites(data, k=5, seed=0)

['ecost' 'amazon' 'thenerds' 'beachaudio' 'jr']


In [38]:
print(train_data.keys)

AttributeError: 'list' object has no attribute 'keys'

## Create PyTorch Dataset

Next, we'll create a regular PyTorch dataset. Each item of the dataset is an HTML string, encoded using MarkupLMProcessor. Note that we initialize the processor with parse_html = False, as we have already parsed the HTML ourselves and we're providing the nodes, xpaths and node labels.

Note that by default, the processor will only label the first token of a given node and label the remaining tokens with -100. you can change this by setting the `only_label_first_subword` attribute of the processor's tokenizer to `False`.

In [24]:
from torch.utils.data import Dataset

max_length=384

class MarkupLMDataset(Dataset):
    """Dataset for token classification with MarkupLM."""

    def __init__(self, data, processor=None, max_length=max_length):
        self.data = data
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # first, get nodes, xpaths and node labels
        item = self.data[idx]
        nodes, xpaths, node_labels = item['nodes'], item['xpaths'], item['node_labels']

        # provide to processor
        encoding = self.processor(nodes=nodes, 
                                  xpaths=xpaths, 
                                  node_labels=node_labels, 
                                  padding=True, 
                                  truncation=True,
                                  max_length=self.max_length, 
                                  return_tensors="pt")

        # remove batch dimension
        encoding = {k: v.squeeze() for k, v in encoding.items()}

        return encoding

In [25]:
from transformers import MarkupLMProcessor

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False

train_dataset = MarkupLMDataset(data=train_data, processor=processor, max_length=max_length)
valid_dataset = MarkupLMDataset(data=valid_data, processor=processor, max_length=max_length)

In [26]:
# dataset[0]['labels']

Let's check an example:

In [27]:
# example = dataset[0]
# for k,v in example.items():
#   print(k,v.shape)

Let's decode the input_ids back to text:

In [28]:
# processor.decode(example['input_ids'])

Let's verify the correspondence between input_ids and labels. -100 means that those tokens will be ignored by the loss function, hence these won't contribute to the final loss. 

In [29]:
# for id, label in zip(example['input_ids'].tolist(), example['labels'].tolist()):
#   if label != -100:
#     print(processor.decode([id]), id2label[label])
#   else:
#     print(processor.decode([id]), label)

## Create PyTorch Dataloaders

The next step is to create a PyTorch DataLoader, which allows us to get batches from the dataset.

In [30]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
eval_dataloader = DataLoader(valid_dataset, batch_size=64, shuffle=True)

## Define model

We define the model here, which is a MarkupLM-base Transformer, with a token classifier head on top. The token classifier will have randomly initialized weights, while the base Transformer has pre-trained weights.



In [31]:
from transformers import MarkupLMForTokenClassification

id2label = {0: "model", 1: "price", 2: "manufacturer", 3: "other"}
label2id = {label:id for id, label in id2label.items()}

model = MarkupLMForTokenClassification.from_pretrained("microsoft/markuplm-base", id2label=id2label, label2id=label2id)

Some weights of MarkupLMForTokenClassification were not initialized from the model checkpoint at microsoft/markuplm-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We also create a label_list, where each tag starts with a B (as seqeval expects the labels to be in IOB format).

In [32]:
label_list = ["B-" + x for x in list(id2label.values())]
label_list

['B-model', 'B-price', 'B-manufacturer', 'B-other']

We also define metric calculations (as we'd like to know the F1 score etc. during training). We'll use 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) for that, which is a library containing many tools for evaluating ML models.

In [33]:
# import evaluate

# # Metric
# metric = evaluate.load("seqeval")

# def get_labels(y_pred, y_true):
#     # Transform predictions and references tensos to numpy arrays

#     # y_pred = predictions.detach().cpu().clone().numpy()
#     # y_true = references.detach().cpu().clone().numpy()

#     # Remove ignored index (special tokens)
#     true_predictions = [
#         [label_list[p] for (p, l) in zip(pred, gold_label) if l != -100]
#         for pred, gold_label in zip(y_pred, y_true)
#     ]
#     true_labels = [
#         [label_list[l] for (p, l) in zip(pred, gold_label) if l != -100]
#         for pred, gold_label in zip(y_pred, y_true)
#     ]
#     return true_predictions, true_labels


# def compute_metrics(eval_pred):
#     logits, labels = eval_pred
    

#     predictions = np.argmax(logits, axis=-1)
#     # predictions = logits.argmax(dim=-1)
#     print(labels)
#     print("________________")
    
#     preds, refs = get_labels(predictions, labels)
#     print(preds)
#     print("________________")
#     print(refs)

#     return metric.compute(predictions=predictions, references=labels)



In [34]:
# from transformers import Trainer, TrainingArguments

# train_args = TrainingArguments(
#     output_dir="train_logs",
#     evaluation_strategy="epoch",
#     learning_rate=2e-05,
#     num_train_epochs=10,
#     per_device_train_batch_size=24,
#     per_device_eval_batch_size=24,
#     warmup_ratio=0.1
# )

# trainer = Trainer(
#     model,
#     args=train_args,
#     train_dataset=train_dataset,
#     eval_dataset=valid_dataset,
#     compute_metrics=compute_metrics
# )

# trainer.evaluate()

In [35]:
import evaluate

# Metric
metric = evaluate.load("seqeval")

def get_labels(predictions, references):
    # Transform predictions and references tensos to numpy arrays
    if device.type == "cpu":
        y_pred = predictions.detach().clone().numpy()
        y_true = references.detach().clone().numpy()
    else:
        y_pred = predictions.detach().cpu().clone().numpy()
        y_true = references.detach().cpu().clone().numpy()


    # Remove ignored index (special tokens)
    true_predictions = [
        [p for (p, l) in zip(pred, gold_label) if l != -100]
        for pred, gold_label in zip(y_pred, y_true)
    ]
    true_labels = [
        [l for (p, l) in zip(pred, gold_label) if l != -100]
        for pred, gold_label in zip(y_pred, y_true)
    ]

    # print("Before: ")
    # print(y_true)
    # print("___________")
    # print(y_pred)
    # print("After: ")
    # print(true_labels)
    # print("___________")
    # print(true_predictions)

    return true_predictions, true_labels

def compute_metrics(return_entity_level_metrics=True):
    results = metric.compute()
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }

## Train

Alright, let's train! Here we're training the model in native PyTorch, but of course you could also opt for things like 🤗 Accelerate, 🤗 Trainer, PyTorch Lightning,...

In [36]:
import torch
from torch.optim import AdamW
from tqdm.auto import tqdm

optimizer = AdamW(model.parameters(), lr=2e-5)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

model.train()
for epoch in range(10):  # loop over the dataset multiple times
    for batch in tqdm(train_dataloader):
        # get the inputs;
        inputs = {k:v.to(device) for k,v in batch.items()}

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(**inputs)

        loss = outputs.loss
        loss.backward()
        optimizer.step()

        print("Loss:", loss.item())

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]
        preds, refs = get_labels(predictions, labels)
        metric.add_batch(
            predictions=preds,
            references=refs,
        )

    eval_metric = compute_metrics()
    print(f"Epoch {epoch}:", eval_metric)

  0%|          | 0/57 [00:00<?, ?it/s]

Loss: 1.4032880067825317


  0%|          | 0/57 [00:07<?, ?it/s]


ValueError: Predictions and/or references don't match the expected format.
Expected format: {'predictions': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='label'), length=-1, id='sequence')},
Input predictions: [[2], [2, 3, 3, 3, 3, 2], [2], ..., [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [2], [2]],
Input references: [[3], [3, 3, 3, 3, 3, 3], [3], ..., [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [3], [3]]

In [None]:
metric = evaluate.load("seqeval")

all_preds = []
all_refs = []

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    # with torch.no_grad():
    #     outputs = model(**batch)


    # predictions = outputs.logits.argmax(dim=-1)
    labels = batch["labels"]
    # print(len(labels))

    # DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME DELETE ME 
    # predictions = labels
    # preds, refs = get_labels(predictions, labels)
    # print(len(refs))

    # all_preds += preds
    all_refs += labels

    # print(compute_metrics())
# metric.add_batch(
#         predictions=all_preds,
#         references=all_refs,
#     )
# eval_metrics = metric.compute()

In [None]:
res = {}

for x in all_refs:
    for y in x:
        y = int(y)
        if y in res:
            res[y] += 1
        else:
            res[y] = 1
print(res)

In [None]:
print(all_refs[0])

In [None]:
eval_metrics

In [None]:
from collections import Counter

# count = {
#     "model": 0,
#     "manufacturer": 0,
#     "other": 0,
#     "price": 0
# }

count = {
    0: 0,
    1: 0,
    2: 0,
    3: 0
}

for example in valid_data:
    example_cnt = Counter(example["node_labels"][0])
    # print(example_cnt)
    for key in count.keys():
        count[key] += example_cnt[key]

print(count)

## Inference

Let's try out the model on a new web page for which we have the nodes and xpaths. Here we'll just use one of our training set.

In [None]:
nodes = data[0]['nodes']
xpaths = data[0]['xpaths']
node_labels = data[0]['node_labels']
print("Nodes:", nodes)
print("Xpaths:", xpaths)

We'll prepare the example for the model using the processor. Note that we're passing `return_offsets_mapping=True`, as the offsets allow us to determine which tokens are at the start of a given word at which aren't.

In [None]:
# prepare for model
# note that you don't need to prepare node_labels, we just have them available here so we'll compare to the ground truth
# print(processor.max_length)
encoding = processor(nodes=nodes, xpaths=xpaths, node_labels=node_labels, truncation=True, max_length=max_length, return_offsets_mapping=True, return_tensors="pt").to(device)
for k,v in encoding.items():
  print(k,v.shape)

Let's perform a forward pass:

In [None]:
# we don't need the offset mapping and labels for the forward pass
offset_mapping = encoding.pop("offset_mapping")
labels = encoding.pop("labels")

# forward pass
with torch.no_grad():
  outputs = model(**encoding)

The model outputs logits of shape (batch_size, seq_len, num_labels). We just take the highest logit (score) per token as prediction:

In [None]:
predictions = outputs.logits.argmax(-1)
print(predictions)

The model makes predictions at the token level, however we're only interested in the predicted label for the first token of each node.

This can be achieved by accessing the word_ids (to know whether or not the token is a special token or not) and the offset_mapping (to know whether or not the token is the first of a particular node).

In [None]:
results = {"Node": [], "Predicted": [], "Ground truth": []}

for pred_id, word_id, offset, label_id in zip(predictions[0].tolist(), encoding.word_ids(0), offset_mapping[0].tolist(), labels[0].tolist()):
  if word_id is not None and offset[0] == 0:
    # print(f"Node: {nodes[0][word_id]}")
    # print(f"Predicted: {id2label[pred_id]}")
    # print(f"Ground truth: {id2label[label_id]}")
    # print("----------")
    results["Node"].append(nodes[0][word_id])
    results["Predicted"].append(id2label[pred_id])
    results["Ground truth"].append(id2label[label_id])

Let's pretty print the results as a Pandas dataframe:

In [None]:
import pandas as pd

pd.DataFrame.from_dict(results).head()