# LayoutLMv3


## Overview
The LayoutLMv3 model was proposed in [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. LayoutLMv3 simplifies [LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2) by using patch embeddings (as in [ViT](https://huggingface.co/docs/transformers/model_doc/vit)) instead of a CNN backbone, and it is pre-trained on three objectives: masked language modeling (MLM), masked image modeling (MIM), and word-patch alignment (WPA).

The abstract from the paper is the following:

*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*

![LayoutLMv3 architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png)
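
To make the patch-embedding idea from the overview concrete, the sketch below counts how many non-overlapping patches a ViT-style encoder cuts from a resized page image. The 224-pixel input and 16-pixel patch are assumed values chosen for illustration (check the checkpoint's configuration for the real ones), and `patch_grid` is a hypothetical helper, not a Transformers function.

```python
def patch_grid(height, width, patch_size):
    """Return (rows, cols, total) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


# Assumed illustrative values: a 224x224 input cut into 16x16 patches.
rows, cols, total = patch_grid(224, 224, 16)
print(f"{rows} x {cols} grid -> {total} patch tokens")
```

Each patch becomes one token in the image half of the input sequence, which is why no CNN feature map is needed.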

## Imports and NVIDIA GPU Device Assignment

In [None]:
import torch
import base64
from io import BytesIO
import numpy as np
import json
from PIL import Image, ImageDraw, ImageFont
from transformers import (
    LayoutLMv3Processor,
    LayoutLMv3ForTokenClassification,
    LayoutLMv3Model,
)

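# Note: these helpers are not part of the PyPI `pytesseract` package; this
# import assumes a container-local module of the same name that provides them.
from pytesseract import apply_tesseract, iob_to_label, unnormalize_box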

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Torch v{} running on {}".format(torch.__version__, device))
model_path = "nielsr/layoutlmv3-finetuned-funsd"
preprocessor_path = "microsoft/layoutlmv3-base"

## Sample Image
The image is also base64-encoded and saved to `layoutlmv3_inputs.json` for use as the default `perf_analyzer` input.

In [None]:
image_path = "sample.png"
image = Image.open(image_path).convert("RGB")
width, height = image.size

buffered = BytesIO()
image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue())
data_dict = {
    "data": [
        {
            "raw_image_array": {
                "content": {"b64": "{}".format(img_str.decode("utf-8"))},
                "shape": [len(buffered.getvalue())],
            }
        }
    ]
}
with open(
    "/root/.cache/huggingface/triton-models/layoutlmv3_1_preprocess/layoutlmv3_inputs.json",
    "w",
) as f:
    json.dump(data_dict, f)
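
As a quick sanity check on the payload format above, the following sketch round-trips a byte string through the same base64 and JSON steps and confirms that the decoded bytes match and that `shape` equals the byte length. It uses placeholder bytes rather than a real PNG and writes nothing to disk; `build_payload` is a hypothetical helper name.

```python
import base64
import json


def build_payload(raw_bytes):
    """Wrap raw bytes in the same structure as the cell above."""
    return {
        "data": [
            {
                "raw_image_array": {
                    "content": {"b64": base64.b64encode(raw_bytes).decode("utf-8")},
                    "shape": [len(raw_bytes)],
                }
            }
        ]
    }


# Placeholder bytes standing in for the PNG file contents.
raw = b"\x89PNG\r\n\x1a\n-not-a-real-image"
payload = json.loads(json.dumps(build_payload(raw)))  # simulate write + read

entry = payload["data"][0]["raw_image_array"]
decoded = base64.b64decode(entry["content"]["b64"])
print(decoded == raw, entry["shape"] == [len(decoded)])
```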

## Tesseract Optical Character Recognition

The implementation of Tesseract-OCR within this container was [compiled](https://tesseract-ocr.github.io/tessdoc/TesseractOpenCL.html) to leverage OpenCL devices. It is decoupled from the tokenizer so that other OCR methods may be used if available.

In [None]:
processor = LayoutLMv3Processor.from_pretrained(
    preprocessor_path, torchscript=True, apply_ocr=False
)

text, boxes = apply_tesseract(image, lang="eng", tesseract_config="--oem 1")

encoding = processor(
    image,
    text=text,
    boxes=boxes,
    return_offsets_mapping=True,
    return_tensors="pt",
)
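
If you swap in a different OCR engine, the processor still expects one bounding box per word in `(x0, y0, x1, y1)` order, scaled to a 0-1000 range relative to the page size. A minimal sketch of that scaling follows; `normalize_box` is a hypothetical helper, and the word and pixel coordinates are made up for illustration.

```python
def normalize_box(box, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        1000 * x0 // width,
        1000 * y0 // height,
        1000 * x1 // width,
        1000 * y1 // height,
    ]


# Made-up OCR output for a 500x1000 pixel page.
words = ["Invoice"]
pixel_boxes = [(50, 100, 150, 200)]
boxes = [normalize_box(b, width=500, height=1000) for b in pixel_boxes]
print(boxes)
```

Integer arithmetic is used here so the result does not depend on floating-point rounding.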

## Inspect the Encoding and Move It to the Device

In [None]:
for i in encoding.keys():
    print("{} shape: {}".format(i, encoding[i].shape))
    print("{} dtype: {}".format(i, encoding[i].dtype))
    print("")

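# Move every tensor in the encoding to the target device.
for k, v in encoding.items():
    if isinstance(v, torch.Tensor):
        encoding[k] = v.to(device)
# The model does not accept offset_mapping, so remove it before inference.
offset_mapping = encoding.pop("offset_mapping")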

## Instantiate a Trained LayoutLMv3 Model

In [None]:
model = LayoutLMv3ForTokenClassification.from_pretrained(
    model_path, torchscript=True
).to(device)
id2label = model.config.id2label
id2label

## Run inference to retrieve logits

In [None]:
# Gradients are not needed for inference.
with torch.no_grad():
    outputs = model(**encoding)

In [None]:
data = {"data": []}
for i in encoding:
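    # Flatten each tensor and store its shape next to its content,
    # matching the structure used for the preprocessing input file.
    a = np.array(encoding.data[i].cpu())
    sub_data = {i: {"content": a.flatten().tolist(), "shape": list(a.shape)}}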
    data["data"].append(sub_data)
    print(" - {}:{}".format(i, list(a.shape)))

with open(
    "/root/.cache/huggingface/triton-models/layoutlmv3_2_inference/layoutlmv3_inputs.json",
    "w",
) as fp:
    json.dump(data, fp)
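
Because each tensor is flattened before it is written, the `shape` entry is what lets a consumer rebuild it. The sketch below uses a small nested list in place of a real tensor to show the flatten step and a consistency check that the number of values equals the product of the shape. `to_entry` is a hypothetical helper.

```python
import math


def to_entry(nested, shape):
    """Flatten a nested list and pair it with its shape."""
    flat = nested
    for _ in range(len(shape) - 1):
        flat = [x for row in flat for x in row]
    if len(flat) != math.prod(shape):
        raise ValueError("content length does not match shape")
    return {"content": flat, "shape": list(shape)}


# Stand-in for a (1, 2, 4) bbox tensor: one batch item, two tokens.
bbox = [[[0, 0, 0, 0], [100, 100, 300, 200]]]
entry = to_entry(bbox, (1, 2, 4))
print(entry)
```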

## Parse Predictions to Labels

In [None]:
predictions = outputs[0].argmax(-1).squeeze().tolist()
token_boxes = encoding.bbox.squeeze().tolist()
is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0
true_predictions = [
    id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]
]
true_boxes = [
    unnormalize_box(box, width, height)
    for idx, box in enumerate(token_boxes)
    if not is_subword[idx]
]
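
The filtering above relies on the offset mapping: a token whose character offset starts at 0 begins a word, while a non-zero start marks a continuation piece. The sketch below replays that logic on made-up values and includes one plausible definition of `unnormalize_box` that maps 0-1000 coordinates back to pixels. The helper body is an assumption about what the imported function does, not its confirmed source.

```python
def unnormalize_box(box, width, height):
    """Map a 0-1000 box back to pixel coordinates (assumed behavior)."""
    return [
        width * box[0] / 1000,
        height * box[1] / 1000,
        width * box[2] / 1000,
        height * box[3] / 1000,
    ]


# Made-up values for four tokens; the third continues the second word.
offsets = [(0, 0), (0, 3), (3, 7), (0, 0)]
labels = ["O", "B-HEADER", "I-HEADER", "O"]
token_boxes = [[0, 0, 0, 0], [100, 100, 300, 200], [100, 100, 300, 200], [0, 0, 0, 0]]

# Special tokens have offset (0, 0), so this test keeps them as well.
is_subword = [start != 0 for start, _ in offsets]
kept_labels = [lab for lab, sub in zip(labels, is_subword) if not sub]
kept_boxes = [
    unnormalize_box(b, 500, 1000) for b, sub in zip(token_boxes, is_subword) if not sub
]
print(kept_labels)
print(kept_boxes[1])
```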

## Draw Bounding Boxes with Labels

In [None]:
draw = ImageDraw.Draw(image)

font = ImageFont.load_default()

label2color = {
    "question": "blue",
    "answer": "green",
    "header": "orange",
    "other": "violet",
}

for prediction, box in zip(true_predictions, true_boxes):
    predicted_label = iob_to_label(prediction).lower()
    draw.rectangle(box, outline=label2color[predicted_label])
    draw.text(
        (box[0] + 10, box[1] - 10),
        text=predicted_label,
        fill=label2color[predicted_label],
        font=font,
    )

image
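
The drawing loop depends on `iob_to_label` to turn IOB tags such as `B-QUESTION` into the plain class names used as keys in `label2color`. One plausible definition is sketched below. It is an assumption about the imported helper: it strips the two-character `B-`/`I-` prefix and maps the bare `O` tag to `other`.

```python
def iob_to_label(label):
    """Strip the IOB prefix; the bare "O" tag becomes "other" (assumed)."""
    stripped = label[2:]
    return stripped if stripped else "other"


label2color = {
    "question": "blue",
    "answer": "green",
    "header": "orange",
    "other": "violet",
}

for tag in ["B-QUESTION", "I-ANSWER", "B-HEADER", "O"]:
    name = iob_to_label(tag).lower()
    print(tag, "->", name, label2color[name])
```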

### Encoding Output

In [None]:
# encoding

### Offset Mapping Output

In [None]:
# offset_mapping

### Model Output (Prediction)

In [None]:
# outputs

## Convert to ONNX

In [None]:
!cp /root/.cache/huggingface/5a806f3a6ea0fadc67e6a7c9b86ee34d20cfb694fef7f5cf61e3e442aa87bf01.b3f43b13348b0046ddf48e57ee1f9bf6f5445d452d4401b53e88ec565b2e03d3 \
    /root/.cache/huggingface/layoutlmv3/pytorch_model.bin \
&& cp /root/.cache/huggingface/6a1296143bda78d4e76520a102633eab7d8c7a7436e0240740d583b13e513634.85a59041585b9df84cb2409000e75ec862472acc9fd2753e360422482468cdb3 \
    /root/.cache/huggingface/layoutlmv3/tokenizer_config.json \
&& cp /root/.cache/huggingface/12e3fbf8d2bc2a2331583c2b01603725959b563e70dac5da35e57975788ff9b9.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05 \
    /root/.cache/huggingface/layoutlmv3/vocab.json \
&& cp /root/.cache/huggingface/93b2ea2c7da83bab15f33c9981644685853156b746135bc522c550c812d68b93.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b \
    /root/.cache/huggingface/layoutlmv3/merges.txt \
&& cp /root/.cache/huggingface/a8f2f8aefcea7536ff4a117fdedb45594a098cf9613afed040d5713f8c422150.ed72adea09fba297feb464926d4fd9dc8c8cd9fad692961cc38123e3316598e2 \
    /root/.cache/huggingface/layoutlmv3/config.json \
&& cp /root/.cache/huggingface/e90f549bb33101e8141284f850ad2b907e919417da746aca394c6a403dc4151f.4f4fbbd7db79618fdf8c9a37cf26bd2881f493d22820d058af4c37bb42d657ba \
    /root/.cache/huggingface/layoutlmv3/preprocessor_config.json

In [None]:
# !python -m transformers.onnx --help

In [None]:
!python -m transformers.onnx \
    --model=/root/.cache/huggingface/layoutlmv3/ \
    --atol=2e-4 \
    --opset=13 \
    --feature={"token-classification"} \
    --framework={"pt"} \
    /root/.cache/huggingface/triton-models/layoutlmv3_2_inference/1/
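
The `--atol` flag sets the absolute tolerance used when the exporter validates the ONNX model's outputs against the reference PyTorch outputs. The sketch below is a simplified, pure-Python illustration of such a check on made-up logits; `within_atol` is a hypothetical helper and not the exporter's actual validation code.

```python
def within_atol(reference, candidate, atol):
    """Return (ok, max_abs_diff) for two equally sized lists of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    max_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    return max_diff <= atol, max_diff


# Made-up logits: the second list differs by about 1e-4 in one position.
pytorch_logits = [2.5, -1.0, 0.25]
onnx_logits = [2.5001, -1.0, 0.25]
ok, diff = within_atol(pytorch_logits, onnx_logits, atol=2e-4)
print(ok)
```

If validation fails at export time, a larger discrepancy than the tolerance was observed, which usually deserves investigation before loosening `--atol`.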

In [None]:
!cp /root/.cache/huggingface/6a1296143bda78d4e76520a102633eab7d8c7a7436e0240740d583b13e513634.85a59041585b9df84cb2409000e75ec862472acc9fd2753e360422482468cdb3 \
    /root/.cache/huggingface/triton-models/layoutlmv3_1_preprocess/1/preprocessing_config/tokenizer_config.json \
&& cp /root/.cache/huggingface/12e3fbf8d2bc2a2331583c2b01603725959b563e70dac5da35e57975788ff9b9.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05 \
    /root/.cache/huggingface/triton-models/layoutlmv3_1_preprocess/1/preprocessing_config/vocab.json \
&& cp /root/.cache/huggingface/93b2ea2c7da83bab15f33c9981644685853156b746135bc522c550c812d68b93.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b \
    /root/.cache/huggingface/triton-models/layoutlmv3_1_preprocess/1/preprocessing_config/merges.txt \
&& cp /root/.cache/huggingface/a8f2f8aefcea7536ff4a117fdedb45594a098cf9613afed040d5713f8c422150.ed72adea09fba297feb464926d4fd9dc8c8cd9fad692961cc38123e3316598e2 \
    /root/.cache/huggingface/triton-models/layoutlmv3_1_preprocess/1/preprocessing_config/config.json \
&& cp /root/.cache/huggingface/e90f549bb33101e8141284f850ad2b907e919417da746aca394c6a403dc4151f.4f4fbbd7db79618fdf8c9a37cf26bd2881f493d22820d058af4c37bb42d657ba \
    /root/.cache/huggingface/triton-models/layoutlmv3_1_preprocess/1/preprocessing_config/preprocessor_config.json

___
*You can close this container and run `docker compose up layoutlmv3-triton-server`*