# Tutorial: Load and Sample Datasets

In this tutorial you will learn the different ways to load datasets and how to subsample them.

To run all cells, you need the `annotated-trainval` and `synthetic` datasets unzipped in `data/docile` and the `annotated-trainval.zip` file in `data/`.

In [None]:
from pathlib import Path
from docile.dataset import CachingConfig, Dataset

DATASET_PATH = Path("/app/data/docile/")
DATASET_PATH_ZIP = Path("/app/data/annotated-trainval.zip")

## Load from folder with unzipped dataset

In [None]:
val = Dataset("val", DATASET_PATH)

## Load from zip

A dataset can also be loaded directly from a zip file, but image caching to disk must be turned off.

In [None]:
val = Dataset("val", DATASET_PATH_ZIP, cache_images=CachingConfig.OFF)

## Preloading document resources to memory and image caching

By default, a dataset is loaded with these settings:

* annotations and pre-computed OCR are loaded to memory
* images generated from PDFs are cached to disk (for faster access of the images in future iterations)

The options below change this default behaviour, which is especially useful for large datasets.

**Only load annotations, not pre-computed OCR**

In [None]:
train = Dataset("train", DATASET_PATH, load_ocr=False)

**Postpone loading of document resources**

In [None]:
# Do not preload document resources
synthetic = Dataset("synthetic", DATASET_PATH, load_annotations=False, load_ocr=False, cache_images=CachingConfig.OFF)
# You can load part of the dataset later
synthetic_sample = synthetic[:100].load()
# And release it from memory later
synthetic_sample.release()

**Cache images to both disk and memory**

In [None]:
# Also cache images in memory. Make sure you have enough RAM to do this!
# Images are not loaded to memory right away but only after they are first accessed.
train = Dataset("train", DATASET_PATH, cache_images=CachingConfig.DISK_AND_MEMORY)
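
The idea of filling a memory cache only on first access can be illustrated with a minimal sketch. This is not how docile implements `CachingConfig.DISK_AND_MEMORY`; the `page_image` function and the `render_calls` counter below are hypothetical stand-ins built on the standard library's `functools.lru_cache`.

```python
from functools import lru_cache

# Counts how often the (pretend) expensive rendering step actually runs.
render_calls = 0


@lru_cache(maxsize=None)
def page_image(docid: str, page: int) -> bytes:
    """Hypothetical stand-in for an expensive PDF-to-image rendering step."""
    global render_calls
    render_calls += 1
    return f"{docid}:{page}".encode()


page_image("doc-a", 0)  # first access: rendered and stored in memory
page_image("doc-a", 0)  # second access: served from the in-memory cache
print(render_calls)
```

Nothing is held in memory until a page is requested, so memory use grows with the number of distinct pages accessed rather than with the size of the dataset.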

## Sample and chunk documents

For experiments and to work with large datasets, it can be useful to take samples of the datasets.

For this, you can use slicing `[start:end:step]` or the `.sample()`, `.get_cluster()` and `.from_documents()` methods.

In [None]:
synthetic = Dataset("synthetic", DATASET_PATH, load_annotations=False, load_ocr=False, cache_images=CachingConfig.OFF)
trainval = Dataset("trainval", DATASET_PATH, load_annotations=False, load_ocr=False, cache_images=CachingConfig.OFF)

**Slicing**

In [None]:
# The synthetic dataset has 100 chunks of 1000 documents, each chunk generated from the same
# template document, so the following line selects 1 document for each template document:
synthetic_slice = synthetic[8::1000]
print(synthetic_slice.docids[:5])
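
To see why a step of 1000 picks one document per template, the same slice can be tried on a plain Python list. This is only a sketch: the `template…-doc…` identifiers are made up for illustration and assume the documents of each template are stored contiguously, as described above.

```python
# Stand-in for the synthetic dataset: 100 templates x 1000 documents each,
# stored contiguously. The identifiers are illustrative, not real docids.
NUM_TEMPLATES = 100
DOCS_PER_TEMPLATE = 1000

docids = [
    f"template{t:03d}-doc{d:04d}"
    for t in range(NUM_TEMPLATES)
    for d in range(DOCS_PER_TEMPLATE)
]

# Start at offset 8 within the first chunk, then jump one full chunk at a time.
picked = docids[8::DOCS_PER_TEMPLATE]

print(len(picked))  # 100
print(picked[:3])   # ['template000-doc0008', 'template001-doc0008', 'template002-doc0008']
```

Any start offset between 0 and 999 gives one document per template; the offset only decides which document of each chunk is picked.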

**Random sample**

In [None]:
synthetic_sample = synthetic.sample(5)
print(synthetic_sample)
print(synthetic_sample.docids)

**Documents belonging to the same cluster**

In [None]:
trainval_cluster = trainval.get_cluster(synthetic[0].annotation.cluster_id)
print(f"Found {len(trainval_cluster)} documents in {trainval_cluster}.")

from PIL import Image

print("Showing 10 images from the cluster:")
imgs = [doc.page_image(page=0, image_size=(None, 100)) for doc in trainval_cluster[:10]]
concat_img = Image.new("RGB", (sum(img.width for img in imgs), 100))
start_from = 0
for img in imgs:
    concat_img.paste(img, (start_from, 0))
    start_from += img.width
concat_img

**Using custom filter**

In [None]:
trainval_ucsf = Dataset.from_documents("trainval-ucsf", [doc for doc in trainval if doc.annotation.source == "ucsf"])
trainval_pif = Dataset.from_documents("trainval-pif", [doc for doc in trainval if doc.annotation.source == "pif"])
print(f"{trainval_ucsf} with {len(trainval_ucsf)} documents")
print(f"{trainval_pif} with {len(trainval_pif)} documents")
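
When several filtered subsets are needed, grouping the documents in a single pass avoids iterating over the dataset once per filter. The sketch below uses hypothetical stand-in objects that expose only `.docid` and `.annotation.source`; with real documents, each resulting group could then be passed to `Dataset.from_documents()` as in the cell above.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-in documents exposing only `.docid` and `.annotation.source`.
docs = [
    SimpleNamespace(docid="a", annotation=SimpleNamespace(source="ucsf")),
    SimpleNamespace(docid="b", annotation=SimpleNamespace(source="pif")),
    SimpleNamespace(docid="c", annotation=SimpleNamespace(source="ucsf")),
]

# Group documents by source in one pass over the collection.
by_source = defaultdict(list)
for doc in docs:
    by_source[doc.annotation.source].append(doc)

print({source: [d.docid for d in group] for source, group in by_source.items()})
# {'ucsf': ['a', 'c'], 'pif': ['b']}
```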

## Chunk dataset into parts with a limited number of pages

Create dataset chunks that have a limited number of pages. This can be especially useful for large datasets, such as the unlabeled dataset.

In [None]:
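from typing import Iterator


def chunk_dataset(dataset: Dataset, max_pages_per_chunk: int) -> Iterator[Dataset]:
    """Yield consecutive slices of `dataset`, closing a chunk before it exceeds the page limit.

    A single document with more pages than the limit still forms its own chunk.
    """
    start_doc_i = 0
    pages = 0
    for doc_i, document in enumerate(dataset.documents):
        pages += document.page_count
        # Close the current chunk before the document that would push it over the limit.
        if doc_i > start_doc_i and pages > max_pages_per_chunk:
            yield dataset[start_doc_i:doc_i]
            start_doc_i = doc_i
            pages = document.page_count
    # The remaining documents form the last chunk.
    yield dataset[start_doc_i:]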
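
The greedy grouping used by `chunk_dataset` can be checked without loading any data by running the same logic on a plain list of page counts. The helper name `chunk_page_counts` and the example numbers are illustrative assumptions, not part of docile.

```python
from typing import Iterator, List


def chunk_page_counts(page_counts: List[int], max_pages_per_chunk: int) -> Iterator[List[int]]:
    """Greedily group consecutive page counts so a chunk closes before exceeding the limit."""
    start = 0
    pages = 0
    for i, count in enumerate(page_counts):
        pages += count
        # Close the current chunk before the document that would exceed the limit.
        if i > start and pages > max_pages_per_chunk:
            yield page_counts[start:i]
            start = i
            pages = count
    # Whatever is left forms the last chunk.
    yield page_counts[start:]


chunks = list(chunk_page_counts([3, 4, 2, 10, 1, 1], max_pages_per_chunk=7))
print(chunks)  # [[3, 4], [2], [10], [1, 1]]
```

Note that the 10-page document ends up alone in a chunk that exceeds the limit of 7: the limit bounds chunks of several documents, but a single oversized document is never split.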

In [None]:
trainval = Dataset("trainval", DATASET_PATH, load_annotations=False, load_ocr=False, cache_images=CachingConfig.OFF)

max_pages_per_chunk = 2000

for chunk in chunk_dataset(trainval, max_pages_per_chunk):
    print(f"{chunk}, pages: {chunk.total_page_count()}")
    chunk.load(annotations=True, ocr=False)
    # ... work with the chunk here ...
    chunk.release()  # don't forget to free up the memory