LayoutLM on SROIE

This code fine-tunes LayoutLM on the SROIE scanned receipts dataset. It uses Weights & Biases to log losses and metrics during training, along with annotated images showing the predicted bounding boxes. Here is the accompanying Report.

Example annotated receipt

Plots of training metrics

Getting started

First, install the pipenv environment by running pipenv install. This requires pipenv to have access to Python 3.9; to install and manage different Python versions, try pyenv. All instructions below assume the pipenv environment is activated; to activate it, run pipenv shell.

Preprocessing

The preprocessing for this dataset is slightly nonstandard, since the OCR and labels are given in a format that is not consistent with the per-token classification setup that LayoutLM requires. More details are given in this section of the report.

To run the preprocessing step, run the following command from the base directory:

python -m scripts.preprocess

Training

To train, run the following command from the base directory:

python -m scripts.train
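
Training logs annotated receipt images with bounding box predictions to Weights & Biases. As a rough sketch of what that involves (not the code in trainer.py), the helper below converts pixel-space boxes and predicted class ids into the dictionary layout that wandb.Image accepts for its boxes argument. The function name to_wandb_boxes and the label map are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def to_wandb_boxes(
    boxes: List[List[int]],
    class_ids: List[int],
    class_labels: Dict[int, str],
) -> dict:
    """Build a `boxes` dictionary for wandb.Image from pixel-space predictions."""
    box_data = []
    for (x1, y1, x2, y2), class_id in zip(boxes, class_ids):
        box_data.append(
            {
                "position": {"minX": x1, "maxX": x2, "minY": y1, "maxY": y2},
                "class_id": class_id,
                "box_caption": class_labels[class_id],
                "domain": "pixel",  # coordinates are pixels, not fractions of the image
            }
        )
    return {"predictions": {"box_data": box_data, "class_labels": class_labels}}


# Illustrative label map (an assumption, not the repository's actual mapping).
class_labels = {0: "other", 1: "total"}
wandb_boxes = to_wandb_boxes([[100, 100, 120, 150]], [1], class_labels)

# With wandb installed and a run started via wandb.init(), the image could be logged as:
#   wandb.log({"receipt": wandb.Image(image, boxes=wandb_boxes)})
```

The dictionary is built in plain Python so that it can be inspected or unit-tested without an active W&B run.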

Objects

The objects used in preprocessing the data and training the model are contained in the objects directory. Below is a rough listing of its files and the objects they define:

objects
- constants.py
  - config
  - task_1_dir
- dataset.py
  - SROIE(Dataset)
- model.py
  - tokenizer
  - model
- trainer.py
  - Trainer
- transforms.py
  - GetTokenBoxesLabels

GetTokenBoxesLabels

Special attention should be paid to the callable class GetTokenBoxesLabels defined in transforms.py. It does three main things:

1. Tracks tokenization of words and appropriately duplicates the bounding boxes to accommodate the tokenized sequence.
2. Pads the input sequence to the max length allowable by the tokenizer (here it is BERTTokenizer, so 256).
3. Normalizes coordinates to be between 0 and 1000. This is required by LayoutLM.
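
The padding and normalization steps can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation in transforms.py: the names normalize_box and pad_sequence are hypothetical, the padding token "[PAD]" and the all-zero padding box are assumed conventions, and the image width and height are assumed to be known in pixels.

```python
from typing import List, Optional, Tuple

Box = List[int]  # [x1, y1, x2, y2]


def normalize_box(box: Box, width: int, height: int) -> Box:
    """Scale pixel coordinates into the 0-1000 range."""
    x1, y1, x2, y2 = box
    return [
        int(1000 * x1 / width),
        int(1000 * y1 / height),
        int(1000 * x2 / width),
        int(1000 * y2 / height),
    ]


def pad_sequence(
    tokens: List[str],
    boxes: List[Box],
    max_length: int = 256,
    pad_token: str = "[PAD]",      # assumed padding token
    pad_box: Optional[Box] = None,  # assumed to default to [0, 0, 0, 0]
) -> Tuple[List[str], List[Box], List[int]]:
    """Truncate or pad tokens and boxes to max_length and build an attention mask."""
    if pad_box is None:
        pad_box = [0, 0, 0, 0]
    tokens = tokens[:max_length]
    boxes = boxes[:max_length]
    n_pad = max_length - len(tokens)
    mask = [1] * len(tokens) + [0] * n_pad
    padded_tokens = tokens + [pad_token] * n_pad
    padded_boxes = boxes + [list(pad_box) for _ in range(n_pad)]
    return padded_tokens, padded_boxes, mask


# A box on a hypothetical 400 x 500 pixel receipt image.
normalized = normalize_box([100, 100, 120, 150], width=400, height=500)
tokens, boxes, mask = pad_sequence(["I", "am"], [normalized, normalized], max_length=4)
```

The attention mask marks which positions hold real tokens, so padded positions can be ignored by the model and by the loss.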

To see why #1 is necessary, suppose the sequence of (word, bbox) pairs corresponding to a segment of text on a document is

[("I", [100, 100, 120, 150]), ("am", [130, 100, 160, 150]), ("sleeping", [140, 100, 280, 150])]

Here the bounding box coordinates are in the format [x1, y1, x2, y2], where x1 and x2 are the left-most and right-most coordinates of the bounding box, and y1 and y2 are the top-most and bottom-most coordinates. The tokenizer itself operates only on the sequence of words

I am sleeping

and returns the sequence of tokens

I am sleep ##ing

However, the tokenizer does not operate on the bounding boxes. For the purposes of LayoutLM, we want the (token, bbox) sequence to be

[("I", [100, 100, 120, 150]), ("am", [130, 100, 160, 150]), ("sleep", [140, 100, 280, 150]), ("##ing", [140, 100, 280, 150])]

GetTokenBoxesLabels takes care of this.
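
The duplication step can be sketched as follows. This is a minimal illustration, not the code in transforms.py: align_boxes_to_tokens and toy_tokenize are hypothetical names, and the toy tokenizer only stands in for a real WordPiece tokenizer so that the example runs without downloading a vocabulary.

```python
from typing import Callable, List, Tuple

Box = List[int]  # [x1, y1, x2, y2]


def toy_tokenize(word: str) -> List[str]:
    """Stand-in for a WordPiece tokenizer (an assumption, for illustration only).

    Splits a trailing "ing" off longer words; every other word is one token.
    """
    if word.endswith("ing") and len(word) > 3:
        return [word[:-3], "##ing"]
    return [word]


def align_boxes_to_tokens(
    pairs: List[Tuple[str, Box]],
    tokenize: Callable[[str], List[str]],
) -> List[Tuple[str, Box]]:
    """Tokenize each word and repeat its bounding box once per resulting token."""
    aligned = []
    for word, box in pairs:
        for token in tokenize(word):
            aligned.append((token, list(box)))  # copy so tokens do not share one list
    return aligned


pairs = [
    ("I", [100, 100, 120, 150]),
    ("am", [130, 100, 160, 150]),
    ("sleeping", [140, 100, 280, 150]),
]
aligned = align_boxes_to_tokens(pairs, toy_tokenize)
for token, box in aligned:
    print(token, box)
```

Tokenizing word by word, rather than the whole sentence at once, is what keeps track of which tokens came from which word. A real tokenizer's tokenize method could be passed in place of toy_tokenize, and per-word labels could presumably be repeated in the same loop.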
