<img src="https://fsdl.me/logo-720-dark-horizontal">

# Arcane Information about the IAM Dataset

This notebook walks through the code for handling the `IAM` dataset
that underlies our line- and paragraph-level text recognition datasets.

It's intended to write down and make visible the fiddly details of data processing that are otherwise easily lost when code is handed off between engineers.

It runs against the source repo, rather than the labs repo,
so we include an environment variable to change the `bootstrap` behavior.

In [None]:
%env FSDL_REPO=fsdl-text-recognizer-2022

In [None]:
lab_idx = None

if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    import bootstrap

    # change into the lab directory
    bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    # allow "hot-reloading" of modules
    %load_ext autoreload
    %autoreload 2
    # needed for inline plots in some contexts
    %matplotlib inline

    bootstrap.run = False  # change to True to re-run setup
    
!pwd
%ls

# Wipe the Slate Clean

In [None]:
from text_recognizer.metadata.iam import DL_DATA_DIRNAME

In [None]:
starting_fresh = False

if starting_fresh:
    !rm -rf {DL_DATA_DIRNAME}

The `IAM` class downloads the data --
we'll talk more about it later,
but we want to have the data present for the first part of the discussion.

In [None]:
from text_recognizer.data.iam import IAM

iam = IAM()
iam.prepare_data()

# Reviewing the Structure of the Data on Disk


The `IAM` dataset is downloaded as a zip file:

In [None]:
iam_dir = DL_DATA_DIRNAME
!ls {iam_dir}

Inside that zip file are the following files:

In [None]:
iamdb = iam_dir / "iamdb"

!du -h {iamdb}

## Where are the "inputs" and "targets"?

There are >3000 files, almost all of which are `.xml` or `.jpg`:

In [None]:
!find {iamdb} | grep "\.jpg$\|\.xml$" | wc -l

And they are equal in number:

In [None]:
!find {iamdb}/xml | grep "\.xml$" | wc -l

In [None]:
!find {iamdb}/forms | grep "\.jpg$" | wc -l
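
Equal counts suggest a one-to-one pairing, and comparing file stems checks it more directly. The sketch below runs on a throwaway directory tree that mimics the `forms`/`xml` layout rather than on the real download. The helper name `unpaired_stems` and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def unpaired_stems(forms_dir: Path, xml_dir: Path) -> set:
    """Return the stems that appear in only one of the two directories."""
    form_stems = {p.stem for p in forms_dir.rglob("*.jpg")}
    xml_stems = {p.stem for p in xml_dir.rglob("*.xml")}
    return form_stems ^ xml_stems  # symmetric difference


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "forms").mkdir()
    (root / "xml").mkdir()
    for stem in ["form-a", "form-b"]:  # made-up stand-ins for form IDs
        (root / "forms" / f"{stem}.jpg").touch()
        (root / "xml" / f"{stem}.xml").touch()
    (root / "xml" / "form-c.xml").touch()  # an annotation without an image
    leftovers = unpaired_stems(root / "forms", root / "xml")

print(leftovers)  # prints {'form-c'}
```

An empty result on the real `forms` and `xml` directories would mean every image has exactly one annotation file with the same stem.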

Where there are many small files in equal number, there are inputs and targets.

And indeed, an individual "datapoint" in `IAM` is a "form", because the humans whose hands wrote the data were writing on "forms", as below:

In [None]:
import text_recognizer.util as util


file, = !find {iamdb}/forms | grep "\.jpg$" | head -n 1

print(file)
util.read_image_pil(file)

And the `xml` files indeed contain the targets:

In [None]:
file, = !find {iamdb}/xml | grep "\.xml$" | head -n 1

!cat {file} | grep -A 100 "handwritten-part" | grep "<word"

But they also contain the metadata required to convert images of entire forms into more useful images, e.g. of lines or paragraphs of handwritten text:

In [None]:
file, = !find {iamdb}/xml | grep "\.xml$" | head -n 1

!cat {file} | grep -A 25 "handwritten-part" | grep -A 5 "<word"
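
As a rough illustration of how that metadata can become a crop region, the sketch below parses a tiny hand-written XML snippet and takes the bounding box of one line from the boxes of its word components. The element and attribute names (`line`, `word`, `cmp`, `x`, `y`, `width`, `height`) are assumptions modeled on the output above, and the IDs and values are made up. The repo's own parsing lives in `text_recognizer.data.iam` and may differ, for instance by rescaling the coordinates.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a fragment of an annotation file (not real data).
SAMPLE_XML = """
<form id="demo-000">
  <handwritten-part>
    <line id="demo-000-00" text="A MOVE">
      <word id="demo-000-00-00" text="A">
        <cmp x="408" y="768" width="27" height="51"/>
      </word>
      <word id="demo-000-00-01" text="MOVE">
        <cmp x="507" y="766" width="213" height="48"/>
      </word>
    </line>
  </handwritten-part>
</form>
"""


def line_region(line_elem: ET.Element) -> dict:
    """Bounding box around every word component of one line element."""
    cmps = line_elem.findall("word/cmp")
    return {
        "x1": min(int(c.attrib["x"]) for c in cmps),
        "y1": min(int(c.attrib["y"]) for c in cmps),
        "x2": max(int(c.attrib["x"]) + int(c.attrib["width"]) for c in cmps),
        "y2": max(int(c.attrib["y"]) + int(c.attrib["height"]) for c in cmps),
    }


demo_root = ET.fromstring(SAMPLE_XML.strip())
demo_line = next(demo_root.iter("line"))
demo_region = line_region(demo_line)
print(demo_line.attrib["text"], demo_region)
# prints: A MOVE {'x1': 408, 'y1': 766, 'x2': 720, 'y2': 819}
```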

There's a handful of other files full of metadata -- e.g. the training, validation, and test splits:

In [None]:
!find {iamdb} | grep "\\.txt$"

The `ascii` folder has metadata in `.txt` files in the ASCII format.

In [None]:
!ls -lh {iamdb}/ascii

# `IAM`

The `data.iam` module and `IAM` class
have a bunch of useful utilities for managing this data,
plus a `prepare_data` method that just downloads and unzips.

In [None]:
iam = IAM()
iam.prepare_data()

In [None]:
iam.metadata

In [None]:
iam

## `IAMLines`

We start from the raw forms and need to get to lines.

In [None]:
import text_recognizer.util as util

fn = iam.form_filenames[0]

print(fn)
img = util.read_image_pil(fn)
print(img.size)
img

There's a high-level method that returns the lines:

In [None]:
from text_recognizer.data.iam_lines import generate_line_crops_and_labels

In [None]:
crops, labels = generate_line_crops_and_labels(iam, "test")

In [None]:
type(crops[0]), type(labels[0])

In [None]:
print(labels[0])
print(crops[0].size)
crops[0]

But the details matter here.

So let's look at the code, first all at once, then step by step.

In [None]:
generate_line_crops_and_labels??

And we'll apply the procedure to this image:

In [None]:
img = util.read_image_pil(fn)
img

We iterate over the `ids` from the dataset `split` of interest, here `test`.

We first pull the labels, using the form ID,
which we can pull here easily
because it's also the filename:

In [None]:
print(fn.stem)
print("", *iam.line_strings_by_id[fn.stem], sep="\n\t")

We operate on these files using `PIL` utilities -- inside of `iam.load_image`.

First, from RGB to `grayscale`:

In [None]:
from PIL import ImageOps

In [None]:
img_g = ImageOps.grayscale(img)
img_g

then -- importantly! -- we invert it:

In [None]:
gmi = ImageOps.invert(img_g)
gmi

We then pull out the `coord`inate`s` of the line crops using the `line_regions_by_id` property:

In [None]:
iam.line_regions_by_id[fn.stem]

Quick sanity check with direct array manipulation:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

idx = 0
line_coords = iam.line_regions_by_id[fn.stem][idx]
im_arr = np.array(gmi)
im_arr[line_coords["y1"]:line_coords["y2"], line_coords["x1"]:line_coords["x2"]] += 100

plt.matshow(im_arr, cmap="Greys_r");

And we pull them out with a `crop` operation:

In [None]:
line = gmi.crop([line_coords[point] for point in ["x1", "y1", "x2", "y2"]])
line

And resize:

In [None]:
smaller_line = line.resize((line.width // 2, line.height // 2))
print(smaller_line.size)
smaller_line

# `IAMParagraphs`

We again use a high-level method to get the `crops_and_labels`.

In [None]:
from text_recognizer.data.iam_paragraphs import get_paragraph_crops_and_labels

p_crops, p_labels = get_paragraph_crops_and_labels(iam, split="val")

But now we have `dict`s as outputs.

Let's get the data for the form we were just working with. Note that we had to pick a different split above!

In [None]:
p_crop, p_label = p_crops[fn.stem], p_labels[fn.stem]

In [None]:
p_crop

We read the image,

In [None]:
img = util.read_image_pil(fn)
img

Then we grayscale it and invert:

In [None]:
img_g = ImageOps.grayscale(img)
gmi = ImageOps.invert(img_g)
gmi

We need to go from "line regions" to "paragraph regions", which means "concatenating" the lines -- by joining with min/max.

That data is inside `iam.paragraph_region_by_id`, which is computed from the line regions and exposed as a plain `dict`.
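
Joining with min/max can be sketched in a few lines. The region dictionaries use the same `x1`/`y1`/`x2`/`y2` keys as the line regions shown earlier, but the helper name `union_region` and the numbers are made up for illustration, so treat this as a sketch rather than the repo's implementation.

```python
from typing import Dict, List

Region = Dict[str, int]


def union_region(regions: List[Region]) -> Region:
    """Smallest box that contains every region in the list."""
    return {
        "x1": min(r["x1"] for r in regions),
        "y1": min(r["y1"] for r in regions),
        "x2": max(r["x2"] for r in regions),
        "y2": max(r["y2"] for r in regions),
    }


# Three made-up line regions, listed top to bottom.
demo_line_regions = [
    {"x1": 190, "y1": 370, "x2": 1020, "y2": 430},
    {"x1": 185, "y1": 460, "x2": 1100, "y2": 525},
    {"x1": 200, "y1": 550, "x2": 760, "y2": 610},
]

demo_paragraph_region = union_region(demo_line_regions)
print(demo_paragraph_region)
# prints: {'x1': 185, 'y1': 370, 'x2': 1100, 'y2': 610}
```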

Quick sanity check with direct array manipulation:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

idx = 0
para_coords = iam.paragraph_region_by_id[fn.stem]
im_arr = np.array(gmi)
im_arr[para_coords["y1"]:para_coords["y2"], para_coords["x1"]:para_coords["x2"]] += 100

plt.matshow(im_arr, cmap="Greys_r");

In [None]:
p_text = iam.paragraph_string_by_id[fn.stem]
print(fn.stem)
print(p_text)

and again, we crop and scale:

In [None]:
p_crop = gmi.crop([para_coords[point] for point in ["x1", "y1", "x2", "y2"]])
p_crop

In [None]:
smaller_p_crop = p_crop.resize((p_crop.size[0] // 2, p_crop.size[1] // 2))

In [None]:
smaller_p_crop

In [None]:
smaller_p_crop.size

This size information percolates into the definitions of models --
we use it to set the target input shapes,
even though the models can handle different sizes.

So it's important to make sure the `metadata` files for
`iam`, `iam_lines`, and `iam_paragraphs` are changed together
and that the values are compatible with the assumptions of the relevant models.

# `IAMSyntheticParagraphs`

We use data synthesis to bootstrap our data -- better models on a budget.

## High-Level Methods

In the `prepare_data` method of this class,
we again pull out _line_ crops and save them to disk.

In [None]:
s_line_crops, s_line_labels = generate_line_crops_and_labels(iam, "val")

Then, we generate the synthetic _paragraph_ crops and labels during `setup`,
using `generate_synthetic_paragraphs`:

In [None]:
from text_recognizer.data.iam_synthetic_paragraphs import (
    generate_synthetic_paragraphs
)

In [None]:
X, para_labels = generate_synthetic_paragraphs(s_line_crops, s_line_labels)
print(len(X))

In [None]:
for idx, (crop, label) in enumerate(zip(X, para_labels)):
    if "\n" in label:
        first_paragraph, first_label = crop, label
        break
        
print(str(idx) + ":\n", label)
first_paragraph

## Arcane Details

We want to build fake paragraphs (with labels!) out of real lines with labels.

So first let's make that possible:

In [None]:
from text_recognizer.data import paragraph_synthesis as psyn

In [None]:
psyn.build_paragraph_from_indices??

The meat of this function is `join_line_crops_to_form_paragraph`.

In [None]:
psyn.join_line_crops_to_form_paragraph??
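
The general idea can be sketched without the dataset by letting nested lists stand in for images: lines are stacked top to bottom, narrower lines are padded on the right, and labels are joined with newlines. The name `stack_lines` and the choice of zero padding are assumptions for illustration. The repo's function works on `PIL` images and may align or space lines differently.

```python
from typing import List, Tuple

Grid = List[List[int]]  # rows of pixel values; 0 is assumed to be background


def stack_lines(crops: List[Grid], labels: List[str]) -> Tuple[Grid, str]:
    """Stack line crops vertically and join their labels with newlines."""
    width = max(len(crop[0]) for crop in crops)
    rows: Grid = []
    for crop in crops:
        for row in crop:
            # pad narrower lines on the right with background pixels
            rows.append(row + [0] * (width - len(row)))
    return rows, "\n".join(labels)


short_line = [[255, 255], [255, 0]]  # 2 rows x 2 columns
long_line = [[0, 255, 255, 0]]       # 1 row x 4 columns

demo_paragraph, demo_label = stack_lines([short_line, long_line], ["Hi", "there"])
for row in demo_paragraph:
    print(row)
print(repr(demo_label))
```

Every row of the result has the width of the widest line, and the label keeps one text line per image line.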

And if we grab the first few crops and labels, we can generate a new labeled paragraph!

In [None]:
sample_synth_crop, sample_synth_label = psyn.build_paragraph_from_indices(s_line_crops[:3], s_line_labels[:3])

In [None]:
print(sample_synth_label)
sample_synth_crop

But this is just a snippet of an existing paragraph!

We need to be able to create _random combinations_ of indices.

In [None]:
import random

In [None]:
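# draw indices once so that each crop stays paired with its own label
sample_indices = random.choices(range(len(s_line_crops)), k=3)
sample_synth_crop, sample_synth_label = psyn.build_paragraph_from_indices(
    [s_line_crops[i] for i in sample_indices], [s_line_labels[i] for i in sample_indices])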

In [None]:
print(sample_synth_label)
sample_synth_crop

That's more like it!

From here, we want to get a method that works on a whole dataset at once.

In [None]:
psyn._build_paragraphs_from_indices??

In [None]:
indices = list(range(len(s_line_crops)))
shuffled_indices = indices.copy()
random.shuffle(shuffled_indices)

cur_idx, shorter_paragraph_indices = 0, []
while cur_idx < len(shuffled_indices) // 5:
    shorter_paragraph_indices.append(shuffled_indices[cur_idx : cur_idx + 3])
    cur_idx += 3

In [None]:
short_para_crops, short_para_labels = psyn._build_paragraphs_from_indices(
    shorter_paragraph_indices, s_line_crops, s_line_labels)

In [None]:
print(len(short_para_crops))  # now we've got more paragraphs

In [None]:
idx = random.choice(range(len(short_para_crops)))

print(short_para_labels[idx])
short_para_crops[idx]

But we want varying lengths of paragraphs.

We _could_ have just done that directly, by picking some distribution over lengths,
but instead we've got a slightly baroque mechanism for taking the existing lines
and partitioning them into new paragraphs based on this function:

In [None]:
psyn.generate_random_partition??

This allows us to have multiple instances of each line from the real data in the synthetic dataset.

In [None]:
short_paragraph_indices = psyn.generate_random_partition(
    indices, min_size=2, max_size=4)
long_paragraph_indices = psyn.generate_random_partition(
    indices, min_size=5, max_size=9)
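
For intuition, here is a minimal sketch of what a random partition with bounded group sizes could look like. It is not the code behind `psyn.generate_random_partition`: the name `random_partition`, the `seed` argument, and the handling of the final group are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def random_partition(
    values: Sequence[int], min_size: int, max_size: int, seed: Optional[int] = None
) -> List[List[int]]:
    """Shuffle values, then cut them into groups of random size.

    Each group has between min_size and max_size items, except possibly
    the last one, which takes whatever is left over.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)

    groups, start = [], 0
    while start < len(shuffled):
        size = rng.randint(min_size, max_size)  # both bounds inclusive
        groups.append(shuffled[start : start + size])
        start += size
    return groups


demo_indices = list(range(20))
demo_short = random_partition(demo_indices, min_size=2, max_size=4, seed=0)
demo_long = random_partition(demo_indices, min_size=5, max_size=9, seed=1)
print(len(demo_short), len(demo_long))
```

Each call uses every index exactly once, so building paragraphs from both the short and the long partition places each line in two different synthetic paragraphs.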

In [None]:
long_para_crops, long_para_labels = psyn._build_paragraphs_from_indices(long_paragraph_indices, s_line_crops, s_line_labels)

In [None]:
idx = random.choice(range(len(long_para_crops)))

print(long_para_labels[idx])
long_para_crops[idx]

We wrap all this up in a big ol' granddaddy function:

In [None]:
psyn.generate_synthetic_paragraphs??

Which provides our abstracted high-level interface for paragraph generation.