# Re-order Transkribus line segments

Line segments in the output of the handwritten text recognition software [Transkribus](https://readcoop.eu/transkribus/) are not always in the right reading order [1,2]. A typical problem is that Transkribus splits a line in two and then puts the right part before the left part, because the right part starts a few pixels higher than the left part. This notebook reorders the line segments by also taking the height of the line into account: a right segment is only placed before a left segment if its position on the page is above the left segment's highest point. The notebook reads a Transkribus XML file and outputs a modified version of the file.
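The reordering rule described above can be sketched as a pairwise comparison of bounding boxes. This is a minimal illustration, not the notebook's actual implementation: the names `vertical_center`, `compare_segments` and `reorder_segments` are hypothetical, the example rectangles are made up, and it assumes that the "position" of a segment is the vertical center of its bounding box `(left, right, top, bottom)`.

```python
from functools import cmp_to_key


def vertical_center(rect):
    """Vertical midpoint of a (left, right, top, bottom) rectangle."""
    _, _, top, bottom = rect
    return (top + bottom) / 2


def compare_segments(a, b):
    """Return -1 if segment a should precede segment b, 1 if b precedes a, else 0."""
    # Assumption: a segment counts as "above" another one only if its
    # vertical center lies above the other segment's top edge (smaller y).
    if vertical_center(a) < b[2]:
        return -1
    if vertical_center(b) < a[2]:
        return 1
    # Otherwise both segments are treated as parts of the same line:
    # order them from left to right.
    return (a[0] > b[0]) - (a[0] < b[0])


def reorder_segments(rects):
    """Sort rectangles into reading order using the pairwise rule above."""
    return sorted(rects, key=cmp_to_key(compare_segments))


# Made-up example: the right part of line 1 starts 5 pixels higher than the
# left part, so a purely top-based ordering would put it first.
left_part = (100, 500, 205, 250)
right_part = (520, 900, 200, 245)
next_line = (100, 900, 300, 345)

print(reorder_segments([right_part, left_part, next_line]))
```

With these rectangles the left part is placed before the right part, because neither segment's center lies above the other segment's top edge. A pairwise rule like this is not guaranteed to be a consistent ordering for strongly skewed or overlapping lines, so it should be checked against real pages.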

**References**

1. Lisa Hoek, [Extracting Entities from Handwritten Civil Records using HTR and RegExes](https://www.ru.nl/publish/pages/769526/lisa_hoek.pdf). Master’s thesis, Radboud University Nijmegen, 2023, sections 5.3, 5.4.4 and 9.1.
2. Erik Tjong Kim Sang, [REE-HDSC: Recognizing Extracted Entities for the Historical Database Suriname Curacao](https://ifarm.nl/erikt/papers/ree-hdsc-2023.pdf). Technical Report, Netherlands eScience Center, 2023, section 7.

## 1. Read files

In [None]:
import os
import sys
import xml.etree.ElementTree as ET

In [None]:
data_dir = "tmp/1609526/Training_set_2/page"

In [None]:
def coordinates2rectangle(coords):
    """Return the bounding box (left, right, top, bottom) of a points string."""
    # coords looks like "x1,y1 x2,y2 ..."; y grows downward, so top is the smallest y
    left, right, top, bottom = sys.maxsize, 0, sys.maxsize, 0
    for pair in coords.split():
        x, y = (int(value) for value in pair.split(","))
        left, right = min(left, x), max(right, x)
        top, bottom = min(top, y), max(bottom, y)
    return left, right, top, bottom
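To make the input and output of the bounding-box step concrete, here is a compact sketch of the same computation applied to a made-up `points` string. The name `points_to_rectangle` and the sample coordinates are hypothetical; unlike `coordinates2rectangle`, this version raises a `ValueError` on an empty string instead of returning sentinel values.

```python
def points_to_rectangle(points):
    """Bounding box (left, right, top, bottom) of an "x1,y1 x2,y2 ..." string."""
    # Split each "x,y" pair into integers, then regroup all x and all y values.
    xs, ys = zip(*(map(int, pair.split(",")) for pair in points.split()))
    return min(xs), max(xs), min(ys), max(ys)


# Made-up polygon of a slightly slanted text line
example_points = "120,340 980,332 982,401 118,409"
print(points_to_rectangle(example_points))  # (118, 982, 332, 409)
```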

In [None]:
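file_name = "p001.xml"
tree = ET.parse(os.path.join(data_dir, file_name))
root = tree.getroot()
textregions = []
# "{*}" matches any XML namespace (requires Python 3.8 or later)
for textregion in root.findall(".//{*}TextRegion"):
    for textline in textregion.findall("./{*}TextLine"):
        # only the Coords that are direct children of the line, not those of the region
        for coords in textline.findall("./{*}Coords"):
            print(coordinates2rectangle(coords.attrib["points"]))
    # blank line between text regions
    print()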
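The traversal above can be tried without a Transkribus export by parsing a small inline document. The XML below is a made-up, heavily reduced sample in the style of a PAGE XML file; the ids, coordinates, and the helper name `line_points_per_region` are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal sample; real exports contain many more elements and attributes.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.jpg" imageWidth="1000" imageHeight="1400">
    <TextRegion id="r1">
      <Coords points="90,190 910,190 910,350 90,350"/>
      <TextLine id="r1l1">
        <Coords points="520,200 900,200 900,245 520,245"/>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,205 500,205 500,250 100,250"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def line_points_per_region(root):
    """Return, per text region, a list of (line id, points string) tuples."""
    regions = []
    for textregion in root.findall(".//{*}TextRegion"):
        lines = []
        for textline in textregion.findall("./{*}TextLine"):
            # "./{*}Coords" selects the line's own outline, not the region's
            for coords in textline.findall("./{*}Coords"):
                lines.append((textline.attrib["id"], coords.attrib["points"]))
        regions.append(lines)
    return regions


sample_root = ET.fromstring(SAMPLE_PAGE_XML)
for region in line_points_per_region(sample_root):
    for line_id, points in region:
        print(line_id, points)
```

In this sample the right-hand segment (`r1l1`) is listed before the left-hand one (`r1l2`), which mirrors the ordering problem the notebook addresses.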