# Quickstart: From dataset to predictions and evaluation

This tutorial uses a sample dataset containing a single document that ships with the repository, so nothing needs to be downloaded. To learn how to work with larger datasets, see the [Load and Sample Dataset](load_and_sample_dataset.md) tutorial.

## Load sample dataset

In [None]:
from pathlib import Path
from docile.dataset import CachingConfig, Dataset

DATASET_PATH = Path("/app/tests/data/sample-dataset/")
sample_dataset = Dataset("dev", DATASET_PATH)

In [None]:
print(f"{sample_dataset} with {len(sample_dataset)} docs and {sample_dataset.total_page_count()} pages")

In [None]:
sample_doc = sample_dataset[0]
print(sample_doc)

## Browse available document resources

**Read the PDF and convert it to an image**

In [None]:
pdf_bytes = sample_doc.data_paths.pdf_path(sample_doc.docid).read_bytes()

In [None]:
pdf_bytes[:20]

Convert the first page to an image with the width set to 500 pixels.

In [None]:
sample_doc.page_image(page=0, image_size=(500, None))

**Access & visualize annotations**

In [None]:
print(f"{sample_doc.annotation.fields[0]=}\n")
print(f"{sample_doc.annotation.li_fields[0]=}\n")
print(f"{sample_doc.annotation.li_headers[0]=}\n")
print(f"{sample_doc.annotation.cluster_id=}")
print(f"{sample_doc.annotation.document_type=}")
print(f"{len(sample_doc.annotation.get_table_grid(page=0).rows_bbox_with_type)=}")
print()
print("Access raw annotations dictionary:")
print(f"{sample_doc.annotation.content.keys()=}")

Show annotations in the dataset browser

In [None]:
from docile.tools.dataset_browser import DatasetBrowser

browser = DatasetBrowser(sample_dataset)

**Access Pre-computed OCR**

Word tokens of the pre-computed OCR are available in two variants: `snapped=False` and `snapped=True`. The first is the output computed by DocTR; the second applies heuristics that remove whitespace around the edges of the predicted boxes. The snapped OCR word boxes are also used to generate the Pseudo-Character-Centers used in evaluation (see the dataset paper or the code for details).

In [None]:
words = sample_doc.ocr.get_all_words(page=0)
snapped_words = sample_doc.ocr.get_all_words(page=0, snapped=True)
print(words[0])
print(snapped_words[0])
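
To give a rough intuition for Pseudo-Character-Centers, here is a minimal sketch that assumes each word box is split into as many equal-width slices as the word has characters and that the center of each slice is taken as that character's pseudo-center. The helper name `pseudo_character_centers` and the equal-width assumption are illustrative only; this is not the library's implementation, so consult the dataset paper or the code for the actual procedure.

```python
from typing import List, Tuple


def pseudo_character_centers(
    text: str, bbox: Tuple[float, float, float, float]
) -> List[Tuple[float, float]]:
    """Hypothetical helper: one (x, y) center per character of `text`.

    `bbox` is (left, top, right, bottom) in relative page coordinates.
    Assumes all characters have equal width, which is only an approximation.
    """
    left, top, right, bottom = bbox
    n_chars = len(text)
    if n_chars == 0:
        return []
    char_width = (right - left) / n_chars
    center_y = (top + bottom) / 2
    # The i-th character occupies [left + i * w, left + (i + 1) * w].
    return [(left + (i + 0.5) * char_width, center_y) for i in range(n_chars)]


# Made-up word box, not taken from the sample document.
centers = pseudo_character_centers("Total", (0.10, 0.20, 0.20, 0.24))
for x, y in centers:
    print(f"({x:.3f}, {y:.3f})")
```

Under this assumption, a tighter (snapped) box moves the centers closer to where the characters actually are, which is presumably why the snapped boxes are the ones used for this purpose.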

Show a crop of the document page with the pre-computed OCR words. Blue boxes are the original boxes; red boxes are the snapped boxes.

In [None]:
from PIL import ImageDraw

page_img = sample_doc.page_image(0, image_size=(1600, None))

draw_img = page_img.copy()
draw = ImageDraw.Draw(draw_img, "RGB")
for word in sample_doc.ocr.get_all_words(page=0, snapped=False):
    scaled_bbox = word.bbox.to_absolute_coords(*draw_img.size)
    draw.rectangle(scaled_bbox.to_tuple(), outline="blue")
for word in sample_doc.ocr.get_all_words(page=0, snapped=True):
    scaled_bbox = word.bbox.to_absolute_coords(*draw_img.size)
    draw.rectangle(scaled_bbox.to_tuple(), outline="red")
draw_img.crop((680, 480, 950, 580))
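
The drawing code scales each word box to the image size before passing it to `draw.rectangle`. The sketch below shows the idea behind such a conversion with plain tuples, assuming boxes are stored as `(left, top, right, bottom)` fractions of the page width and height. The function `to_absolute` is a hypothetical stand-in, not `BBox.to_absolute_coords` itself, and whether the library rounds to integers is an assumption.

```python
from typing import Tuple

RelBox = Tuple[float, float, float, float]
AbsBox = Tuple[int, int, int, int]


def to_absolute(bbox: RelBox, width: int, height: int) -> AbsBox:
    """Scale a relative (0..1) box to pixel coordinates of a width x height image."""
    left, top, right, bottom = bbox
    # Horizontal coordinates scale with the width, vertical ones with the height.
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Made-up box covering the lower-middle part of a 1600 x 2000 px page image.
print(to_absolute((0.25, 0.5, 0.75, 1.0), width=1600, height=2000))
```

Keeping boxes in relative coordinates means the same annotation can be drawn on an image rendered at any resolution.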

Access raw OCR content

In [None]:
ocr_dict = sample_doc.ocr.content
ocr_dict["pages"][0]["blocks"][4]

## Create dummy predictions

Create predictions as perturbations of the gold labels (just as an example). Some labels are thrown away, and for some labels two predictions are created instead of one.

In [None]:
from dataclasses import replace
from random import Random
from typing import List, Sequence, Tuple

from docile.dataset import BBox, Document, Field

def fields_perturbation(rng: Random, fields: Sequence[Field], max_shift: Tuple[float, float]) -> List[Field]:
    new_fields = []
    for field in fields:
        p = rng.random()
        generate_fields = 1
        if p < 0.2:
            generate_fields = 0
        elif p > 0.9:
            generate_fields = 2
        for _ in range(generate_fields):
            max_shift_horizontal, max_shift_vertical = max_shift
            left = field.bbox.left + (rng.random() * 2 - 1) * max_shift_horizontal
            right = field.bbox.right + (rng.random() * 2 - 1) * max_shift_horizontal
            if right < left:
                left, right = right, left
            top = field.bbox.top + (rng.random() * 2 - 1) * max_shift_vertical
            bottom = field.bbox.bottom + (rng.random() * 2 - 1) * max_shift_vertical
            if bottom < top:
                top, bottom = bottom, top
            new_field = replace(field, bbox=BBox(left, top, right, bottom))
            new_fields.append(new_field)
    return new_fields


def get_max_shift_in_relative_coords(doc: Document, max_shift_px_at_200dpi: Tuple[int, int]) -> Tuple[float, float]:
    size_at_200dpi = doc.page_image_size(page=0, dpi=200)
    return (max_shift_px_at_200dpi[0] / size_at_200dpi[0], max_shift_px_at_200dpi[1] / size_at_200dpi[1])
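
As a worked example of the pixel-to-relative conversion, the sketch below repeats the same division with a hard-coded page size instead of a `Document`. The page size of 1700 x 2200 px (US Letter at 200 DPI) is an assumption for illustration and not the size of the sample document; `shift_px_to_relative` is a hypothetical name.

```python
from typing import Tuple


def shift_px_to_relative(
    shift_px: Tuple[int, int], page_size_px: Tuple[int, int]
) -> Tuple[float, float]:
    """Express a (horizontal, vertical) pixel shift as a fraction of the page size."""
    return (shift_px[0] / page_size_px[0], shift_px[1] / page_size_px[1])


# Assumed page: 8.5 x 11 inches rendered at 200 DPI gives 1700 x 2200 pixels.
relative_shift = shift_px_to_relative((15, 5), (1700, 2200))
print(f"horizontal: {relative_shift[0]:.5f}, vertical: {relative_shift[1]:.5f}")
```

On such a page, a 15 px horizontal shift is a bit under 1% of the page width, so the perturbed boxes stay close to the gold ones.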

In [None]:
rng = Random(42)

max_shift = get_max_shift_in_relative_coords(sample_doc, max_shift_px_at_200dpi=(15, 5))
kile_predictions = {sample_doc.docid: fields_perturbation(rng, sample_doc.annotation.fields, max_shift)}
lir_predictions = {sample_doc.docid: fields_perturbation(rng, sample_doc.annotation.li_fields, max_shift)}
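
To see how many predictions the perturbation produces on average, the thresholds used in `fields_perturbation` can be examined in isolation. The sketch below copies only the drop/duplicate rule (drop when `p < 0.2`, duplicate when `p > 0.9`) and compares the analytic expectation with a simulation; `copies_for_draw`, the seed, and the sample count are illustrative choices.

```python
from random import Random


def copies_for_draw(p: float) -> int:
    """Number of predictions generated for one gold field, given a uniform draw p."""
    if p < 0.2:
        return 0  # the field is thrown away
    if p > 0.9:
        return 2  # two perturbed copies are created
    return 1


# Analytic expectation: 20% dropped, 70% kept once, 10% duplicated.
expected_copies = 0.2 * 0 + 0.7 * 1 + 0.1 * 2

# Monte Carlo check with an arbitrary seed.
rng = Random(0)
n_draws = 100_000
simulated_copies = sum(copies_for_draw(rng.random()) for _ in range(n_draws)) / n_draws

print(f"expected copies per gold field: {expected_copies:.2f}")
print(f"simulated copies per gold field: {simulated_copies:.3f}")
```

With roughly 0.9 predictions per gold field, some of them duplicates, the dummy predictions contain both missed fields and extra fields, which makes the evaluation output more informative than a perfect copy of the annotations would be.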

**Store predictions to JSON**

This is the format in which predictions are submitted to the benchmark. With the predictions stored on disk, you can also run the evaluation from the command line with the `docile_evaluate` command.

In [None]:
from docile.dataset import store_predictions

store_predictions(Path("/tmp/kile-perturbations.json"), kile_predictions)

**Run evaluation**

In [None]:
from docile.evaluation import evaluate_dataset

evaluation_result = evaluate_dataset(sample_dataset, kile_predictions, lir_predictions)

In [None]:
print(evaluation_result.print_report())

In [None]:
evaluation_result.get_primary_metric("kile")

**Visualize matching**

In [None]:
kile_matching = evaluation_result.task_to_docid_to_matching["kile"]
lir_matching = evaluation_result.task_to_docid_to_matching["lir"]
DatasetBrowser(sample_dataset, kile_matching=kile_matching, lir_matching=lir_matching)
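
A matching pairs predictions with gold fields and leaves the rest unmatched. One common way to summarize such a matching is through counts of true positives (matched pairs), false positives (unmatched predictions), and false negatives (unmatched gold fields). The sketch below uses the textbook definitions of precision, recall, and F1 with made-up counts; it is not the code inside `docile.evaluation` and does not claim to reproduce the metrics in the report.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Textbook precision, recall, and F1 from match counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up counts: 6 matched pairs, 2 extra predictions, 4 missed gold fields.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the perturbed predictions from this tutorial, dropped fields would count as false negatives, while duplicated fields would presumably add false positives, since a gold field can be matched at most once under a one-to-one matching.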