## The problem

I have images of document pages and a ground-truth (GT) transcription of each page. But what I really need is the ground truth for individual lines.

How do I match lines from the page image to the corresponding line in the GT text?

## A solution

I run page segmentation and OCR on the document image and match the segmented lines to the GT lines by exploiting:
1. reading order (recognized by the page segmenter for the image and given by the GT text)
2. similarity of the recognized text in the image and the GT text line

In this experiment I just use an extracted OCR text to see if this approach works:

In [1]:
!head ocr.txt

Achtung! Telefonische Bestellungen werden angenommen, jedoch
nur nech schriftlicher Bestätigung ausgeführt.
Bestellungen erbeten en
Zentralantiqueriat der Deutschen Demokre
DDR –-_701_Leipzig, Talstr. 29, Postfach _1080
oder an folgende Vertragapartner:
Helios-Buchhandlung u. Antiausriat GmbH
1 Berlin 52, Eichborndamm 141-167
Heidelberger Humani tas
Literetur-Vertriebs-GmbH


However, this would also work for, say, a sequence of text lines from a PAGE-XML document (each with its corresponding line image!).

The GT text:

In [2]:
!head gt.txt

Achtung! Telefonische Bestellungen werden angenommen, jedoch
nur nach schriftlicher Bestätigung ausgeführt.

Bestellungen erbeten an
Zentralantiquariat der Deutschen Demokratischen Republik,
DDR - 701 Leipzig, Talstr. 29, Postfach 1080
oder an folgende Vertragspartner:
Helios-Buchhandlung u. Antiquariat GmbH
1 Berlin 52, Eichborndamm 141-167



I'll use the sequence alignment functions of dinglehopper to match the sequence of lines in the OCR text to the sequence of lines in the GT text. For this to work properly, I need a class that treats similar strings as equal, based on some kind of distance. Here I use a kind of normalized Levenshtein distance:

In [3]:
from qurator.dinglehopper import distance

class SimilarString:
    def __init__(self, string):
        self._string = string

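    def __eq__(self, other):
        # Just an example! Two strings count as equal if their edit distance,
        # normalized by the length of the shorter string, is below 0.1.
        min_len = min(len(self._string), len(other._string))
        if min_len > 0:
            normalized_distance = distance(self._string, other._string) / min_len
            similar = normalized_distance < 0.1
        else:
            # An empty string never matches anything, not even another empty string.
            similar = False
        return similar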

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "SimilarString('%s')" % self._string

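    def __hash__(self):
        # Caveat: similar but non-identical strings compare equal yet hash
        # differently, so instances should not be relied on in sets or dicts.
        return hash(self._string)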

A few tests:

In [4]:
assert(SimilarString("the same") == SimilarString("the same"))

In [5]:
assert(SimilarString("not at all") != SimilarString("the same"))

In [6]:
assert(SimilarString("abcdefghijk") == SimilarString("abcdefghjjk"))  # Note the double jj
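
To make the 0.1 threshold more concrete, here is a minimal, dependency-free sketch of the same idea. The names `levenshtein` and `normalized_distance` are my own for illustration and are not part of dinglehopper; this version also compares plain code points, so it may differ from dinglehopper's `distance` for text that needs Unicode normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the shorter string."""
    min_len = min(len(a), len(b))
    if min_len == 0:
        return float("inf")  # assumption: empty lines never count as similar
    return levenshtein(a, b) / min_len


ocr_line = "nur nech schriftlicher Bestätigung ausgeführt."
gt_line = "nur nach schriftlicher Bestätigung ausgeführt."
print(levenshtein(ocr_line, gt_line))
print(normalized_distance(ocr_line, gt_line) < 0.1)
```

Dividing by the shorter length makes the criterion stricter for short lines: a single wrong character is tolerated in an 11-character line, but not in a 10-character one.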

I read the texts as sequences of this class:

In [7]:
with open("ocr.txt", "r") as f:
    ocr = [SimilarString(line.rstrip()) for line in f]
with open("gt.txt", "r") as f:
    gt = [SimilarString(line.rstrip()) for line in f]

And align (match) these sequences:

In [8]:
from qurator.dinglehopper import seq_align
result = list(seq_align(ocr, gt))
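
For intuition, an alignment of this kind can be sketched in plain Python. This is not dinglehopper's implementation, just a small edit-distance aligner under the assumption that the result is a list of pairs in which a gap on either side is represented by `None`. The function name `align` and the sample lines are hypothetical.

```python
def align(seq1, seq2):
    """Align two sequences by edit distance; gaps are represented by None."""
    n, m = len(seq1), len(seq2)
    # cost[i][j] = edit distance between seq1[:i] and seq2[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = seq1[i - 1] == seq2[j - 1]  # uses the elements' own __eq__
            cost[i][j] = min(cost[i - 1][j - 1] + (0 if same else 1),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to recover the pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = seq1[i - 1] == seq2[j - 1]
            if cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
                pairs.append((seq1[i - 1], seq2[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((seq1[i - 1], None))  # element only in seq1
            i -= 1
        else:
            pairs.append((None, seq2[j - 1]))  # element only in seq2
            j -= 1
    pairs.reverse()
    return pairs


ocr_lines = ["first line", "second line", "fourth line"]
gt_lines = ["first line", "second line", "third line", "fourth line"]
for o, g in align(ocr_lines, gt_lines):
    print(repr(o), "<->", repr(g))
```

Because the aligner only relies on `==`, passing in objects with a fuzzy `__eq__` (such as `SimilarString`) makes it tolerate OCR errors within a line while still respecting reading order.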

### Matched lines


Because the alignment pairs up elements from both sequences, I just need to check where the two elements of a pair match to find matching lines:

In [9]:
for o, g in result:
    if o and g and o == g:
        print(o)
        print(g)
        print()

SimilarString('Achtung! Telefonische Bestellungen werden angenommen, jedoch')
SimilarString('Achtung! Telefonische Bestellungen werden angenommen, jedoch')

SimilarString('nur nech schriftlicher Bestätigung ausgeführt.')
SimilarString('nur nach schriftlicher Bestätigung ausgeführt.')

SimilarString('Bestellungen erbeten en')
SimilarString('Bestellungen erbeten an')

SimilarString('DDR –-_701_Leipzig, Talstr. 29, Postfach _1080')
SimilarString('DDR - 701 Leipzig, Talstr. 29, Postfach 1080')

SimilarString('oder an folgende Vertragapartner:')
SimilarString('oder an folgende Vertragspartner:')

SimilarString('Helios-Buchhandlung u. Antiausriat GmbH')
SimilarString('Helios-Buchhandlung u. Antiquariat GmbH')

SimilarString('1 Berlin 52, Eichborndamm 141-167')
SimilarString('1 Berlin 52, Eichborndamm 141-167')

SimilarString('Heidelberger Humani tas')
SimilarString('Heidelberger Humanitas')

SimilarString('Literetur-Vertriebs-GmbH')
SimilarString('Literatur-Vertriebs-GmbH')

SimilarString('6

### Unmatched lines

Similarly, I can find the unmatched lines from the GT page text:

In [10]:
for o, g in result:
    if not g or g._string == '':
        continue
    if (not o) or (o != g):
        print(g)
        print()

SimilarString('Zentralantiquariat der Deutschen Demokratischen Republik,')

SimilarString('----------------------------------------')

SimilarString('Preise in Mark der Deutschen')
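
To put a number on how many GT lines go unmatched, the same loop can be turned into a small counter. The sketch below is self-contained and works on plain strings. The helper names `gt_coverage` and `roughly_equal` and the sample pairs are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for the normalized Levenshtein criterion, so its threshold is not equivalent.

```python
from difflib import SequenceMatcher


def roughly_equal(a: str, b: str) -> bool:
    """Stand-in similarity test (not the notebook's Levenshtein criterion)."""
    return SequenceMatcher(None, a, b).ratio() >= 0.9


def gt_coverage(pairs, similar):
    """Return (matched, total) counts over the non-empty GT lines in pairs."""
    matched = total = 0
    for o, g in pairs:
        if g is None or g.strip() == "":
            continue  # gap on the GT side or an empty GT line
        total += 1
        if o is not None and similar(o, g):
            matched += 1
    return matched, total


# Hypothetical alignment result: (OCR line, GT line), None marks a gap.
pairs = [
    ("Bestellungen erbeten en", "Bestellungen erbeten an"),
    ("Zentralantiqueriat der Deutschen Demokre",
     "Zentralantiquariat der Deutschen Demokratischen Republik,"),
    (None, ""),
    ("1 Berlin 52, Eichborndamm 141-167", "1 Berlin 52, Eichborndamm 141-167"),
]
matched, total = gt_coverage(pairs, roughly_equal)
print(f"{matched} of {total} GT lines matched")
```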



## Discussion

This approach has some obvious flaws: It will only work if segmentation (including reading order) and OCR yield results good enough to produce useful line images and corresponding matches in the GT text. I also see unmatched lines above, so there may even be some "wasted" lines.

However, if I have a large corpus of transcriptions (like the Deutsches Textarchiv), this can yield a large amount of line GT without a huge investment in manual labeling.