# split Transkribus metadata

Transkribus provide alot of useful data during text recognition process.
We have correspondence printed in a book.
Letter end can be determined by checking baseline coordinates.
Being able to split text and also merge letters spanning serveral pages
into one letter object, which has a format ready to extract letter properties
like location, date and author.

For further refining still properly messy data, a json file is written.

## Released under MIT License

Copyright (c) 2020 Walter Obweger (twitter:`@wobweger`).

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## background

created during [ACDH and CH](https://www.oeaw.ac.at/acdh/) [internship](https://www.oeaw.ac.at/acdh/education/acdh-ch-internships/) 2020

big thanks to the whole ACDH and CH team (twitter: `@ACDH_OeAW`),
special thanks to my instructor [Martin Anton Mueller](https://www.oeaw.ac.at/acdh/team/current-team/martin-anton-mueller/) (twitter:@f46906169)

twitter `#wroACDHitp202003`

## documentation

documentation is a love letter to your future self

+ [github repo](https://github.com/wobweger/wroACDHitp202003) source, manual, documenation source
+ [google drive](https://drive.google.com/drive/folders/1qOIcFVc9RVIO3tkh3m2P_z7vdrsKHNTU) documentation and jupyter notebooks

released under MIT license, feel free to use it

In [669]:
import os
import os.path
import datetime
import json
import traceback
from xml.dom.minidom import parse

higher values on `iVerbose` produce more logs at stdout

In [670]:
iVerbose=0

prepare data container

In [671]:
dDat={
    'lLetter':[],'oLetter':None,
    'org':{'sDN':None,'sFN':None,'idTextLine':''}
}

parse coordinate string and convert to numbers

In [672]:
def parseCoor(sCoor):
    lPos=sCoor.split(',')
    try:
        iX=int(lPos[0])
        iY=int(lPos[1])
        return iX,iY
    except:
        return None

write json output

In [673]:
def writeJson(sOutFN,lDat):
    with open(sOutFN,'w') as fOut:
        fOut.write(json.dumps(lDat))

class Letter hold text lines parsed.
letter start is detected by method `isNewLetter`.
constants `MIN_LINE_DELTA_Y` and `MIN_X_DATE_TAB_STOP` make use of printed format.

stored text is transform into letter properties by `procDat`
hardest part is to extract date, serveral different formats are used and
location is not always at the same spot, `calcDate` does the trick

`getDict` ease json output

In [674]:
class Letter:
    VERBOSE_LV=5                     # when to print information
    MIN_LINE_DELTA_Y=30              # minimum line distance in y position, used to detect new letter
    MIN_X_DATE_TAB_STOP=500          # min x-coordinate of location / date tabulator stop
    MONTH={'Januar':1,'Jänner':1,'Februar':2,'Feber':2,'März':3,'April':4,'Mai':5,'Juni':6,
           'Juli':7,'August':8,'September':9,'Sept':9,'Oktober':10,'Okteber':10,'November':11,'Dezember':12,}
    def __init__(self):
        self.lLine=[]
        self.iLen=0
        # +++++ beg:origin
        self.dOrg={
            'sDN':None,
            'sFN':None,
            'idLineFirst':None,
        }
        self.iLogged=0
        # ----- end:origin
        # +++++ beg:letter data
        #self.sLocation=''
        #self.sSnd=sSnd
        #self.iUnCertain=0
        # ----- end:letter data
    def printOrg(self):
        if self.iLogged<1:
            print(self.dOrg)
        self.iLogged+=1
    def setOrgByDat(self):
        try:
            self.dOrg['sDN']=dDat['org']['sDN']
            self.dOrg['sFN']=dDat['org']['sFN']
        except:
            self.dOrg['sDN']=None
            self.dOrg['sFN']=None
        try:
            self.dOrg['idLineFirst']=dDat['org']['idTextLine']
        except:
            self.idLineFirst=None
    def addLine(self,x,y,sText,idLine=None):
        dLine={'iX':x,'iY':y,'sTxt':sText}
        if idLine is not None:
            dLine['id']=idLine
        self.lLine.append(dLine)
        self.iLen+=1
    def isNewLetter(self,x,y):
        if self.iLen>0:
            lLast=self.lLine[-1]
            iX=lLast['iX']
            iY=lLast['iY']
            if iVerbose>self.VERBOSE_LV:
                self.printOrg()
                print('y',y,'iY',iY,'delta',abs(y-iY))
            if abs(y-iY)<self.MIN_LINE_DELTA_Y:
                return 1
        return 0
    def getNew(self):
        oNew=Letter()
        oNew.lLine.append(self.lLine[-1])
        del self.lLine[-1]
        return oNew
    def calcDate(self,sLine,lStatErr):
        #print(sLine)
        iPosLoc=sLine.find(',')
        if iPosLoc>0:
            iDateFound=0
            self.sLocation=sLine[:iPosLoc]
            sDate=sLine[iPosLoc+1:].strip()
        else:
            sDate=sLine
        if 1:
            # +++++ beg:get date part
            iDateFound=0
            lDate=sDate.split(' ')
            if len(lDate)<2:
                if iVerbose>self.VERBOSE_LV:
                    self.printOrg()
                    print('sDate',sDate,'lDate',lDate,'uups')
                iPosDay=0
            else:
                #print('sDate',sDate,'lDate',lDate)
                lD=[s for s in lDate if len(s)>0]
                if lD[1][-1]=='.':
                    pass
                else:
                    lD[1]=lD[1]+'.'
                sDateFind=''.join(lD)
                #print(sDateFind)
                iPosDay=sDateFind.find('.')
            # ----- end:get date part
            if iPosDay>0:
                # +++++ beg:find possible month, 2 variant used
                iPosMon=sDateFind.find(' ',iPosDay+1)
                if iPosMon>0:
                    iDateFound=1
                else:
                    iPosMon=sDateFind.find('.',iPosDay+1)
                    if iPosMon>0:
                        iDateFound=1
                # ----- end:find possible month, 2 variant used
                if iDateFound>0:
                    # +++++ beg:extract date strings
                    sDay=sDateFind[:iPosDay]
                    sMon=sDateFind[iPosDay+1:iPosMon]
                    sYear=sDateFind[iPosMon+1:]
                    # ----- end:extract date strings
                    # +++++ beg:convert day into number
                    try:
                        iDay=int(sDay)
                    except:
                        iDay=0
                        self.printOrg()
                        print('day is strange',sDay)
                        lStatErr.append('day is strange %s'%sDay)
                    # ----- end:convert day into number
                    # +++++ beg:convert month into number
                    sMon=sMon.strip()
                    if sMon in self.MONTH:
                        iMon=self.MONTH[sMon]
                    else:
                        try:
                            iMon=int(sMon)
                        except:
                            iMon=0
                    # ----- end:convert month into number
                    # +++++ beg:convert year into number
                    try:
                        if sYear[-1]=='.':
                            sYear=sYear[:-1]
                        # +++++ beg:year 2000 problem again, but different
                        iYear=int(sYear)
                        if iYear>50:
                            if iYear<1800:
                                iYear+=1800   
                        else:
                            iYear+=1900
                        # ----- end:year 2000 problem again, but different
                    except:
                        iYear=0
                        self.printOrg()
                        print('year is strange',sYear)
                        lStatErr.append('year is strange %s'%sYear)
                    # ----- end:convert year into number
                    # +++++ beg:check month value
                    if iMon<1:
                        self.printOrg()
                        print('month not recognized',sLine)
                        print('  ','year',sYear,'mon',sMon,'day',sDay)
                        lStatErr.append('month not recognized %s'%sLine)
                    # ----- end:check month value
                    # +++++ beg:conver into datetime, any invalid part will trigger exception
                    iUnCertain=0
                    try:
                        zSnd=datetime.date(iYear,iMon,iDay)
                        sSnd=zSnd.__str__()
                    except:
                        iUnCertain=1
                        sSnd=sLine[iPosLoc+1].strip()
                    self.sSnd=sSnd
                    self.iUnCertain=iUnCertain
                    # ----- end:conver into datetime, any invalid part will trigger exception
                    return 1
        return 0
    def procDat(self):
        try:
            if self.iLen>1:
                lLineLf=[]
                lLineRg=[]
                dStatErr={}
                iOfsFound=-1
                # +++++ beg:find date and location in first max 7 lines
                for iOfs in range(min(7,self.iLen)):
                    dLine=self.lLine[iOfs]
                    if dLine['iX']>self.MIN_X_DATE_TAB_STOP:
                        lStatErr=[]
                        if iOfsFound<0:
                            iRet=self.calcDate(dLine['sTxt'],lStatErr)
                            if iVerbose>4:
                                print('iOfs',iOfs,'iRet',iRet,lStatErr)
                            if len(lStatErr)>0:
                                sKey='%d'%iOfs
                                if 'id' in dLine:
                                    sKey+='.%s'%dLine['id']
                                dStatErr[sKey]=lStatErr
                            if iRet>0:
                                iOfsFound=iOfs
                            else:
                                lLineRg.append(dLine['sTxt'])
                    elif dLine['iX']<self.MIN_X_DATE_TAB_STOP:
                        lLineLf.append(dLine['sTxt'])
                    else:
                        # you are not supposed to be here
                        pass
                # ----- end:find date and location in first max 7 lines
                # +++++ beg:prepare status error list
                #print(dStatErr)
                lStatErr=[]
                if iOfsFound==-1:
                    lStatErr.append({'category':'letter date','msg':'calcuation unsuccessful','dat':dStatErr})
                # ----- end:prepare status error list
                # +++++ beg:letter reference is left
                iLenLf=len(lLineLf)
                if iLenLf>0:
                    self.sRef=lLineLf[0]
                elif iLenLf>1:
                    sMsg='more lines than expected to form letter reference, problem?'
                    self.printOrg()
                    print(sMsg)
                    lStatErr.append({'category':'letter reference','msg':sMsg,'dat':lLineLf})
                # ----- end:letter reference is left
                # +++++ beg:location an date are on the other side
                self.lLetterData='\n'.join(lLineRg)
                # ----- end:location an date are on the other side
                # +++++ beg:remeber status error information
                if len(lStatErr)>0:
                    try:
                        print(self.dOrg)
                    except:
                        pass
                    self.lStatErr=lStatErr
                # ----- end:remeber status error information
            # +++++ beg:trivial attempt
            if 0:
                iRet=self.calcDate(self.lLine[0]['sTxt'])
                if iRet>0:
                    self.sRef=self.lLine[1]['sTxt']
                else:
                    iRet=self.calcDate(self.lLine[1]['sTxt'])
                    if iRet>0:
                        self.sRef=self.lLine[0]['sTxt']
                    else:
                        iRet=self.calcDate(self.lLine[2]['sTxt'])
                        if iRet>0:
                            if len(self.lLine[0]['sTxt'])<10:
                                self.sRef=self.lLine[0]['sTxt']
                            elif len(self.lLine[1]['sTxt'])<10:
                                self.sRef=self.lLine[1]['sTxt']
            # ----- end:trivial attempt
        except:
            traceback.print_exc()
    def getDict(self,iInclLine=0,iInclLetter=0):
        d={}
        self.procDat()
        d['ref']=self.__dict__.get('sRef','---')
        d['sLocation']=self.__dict__.get('sLocation','---')
        d['sSnd']=self.__dict__.get('sSnd','---')
        d['iUnCertain']=self.__dict__.get('iUnCertain','---')
        if iInclLine>0:
            d['lines']=self.lLine
        if iInclLetter>0:
            d['letter']='\n'.join([dLn['sTxt'] for dLn in self.lLine])
        lStatErr=self.__dict__.get('lStatErr',None)
        if lStatErr is not None:
            d['lStatErr']=lStatErr
        lTeaser=[dLn['sTxt'] for dLn in self.lLine if dLn['iX']<self.MIN_X_DATE_TAB_STOP]
        d['letterTeaser']='\n'.join(lTeaser[:5])
        lLetterData=[dLn['sTxt'] for dLn in self.lLine if dLn['iX']>self.MIN_X_DATE_TAB_STOP]
        d['letterData']='\n'.join(lLetterData[:5])
        #lLetterData=self.__dict__.get('lLetterData',None)
        #if lLetterData is not None:
        #    d['lLetterData']=lLetterData
        d['org']=self.dOrg
        return d

handle transkribus metadata xml tag `TextRegion`

In [675]:
def handleTextRegion(oNode,dDat):
    sLv='  '*2
    if iVerbose>1:
        print(sLv,'handleTextRegion')
    sLv='  '*3
    oChild=oNode.firstChild
    while oChild is not None:
        if oChild.nodeType==oChild.ELEMENT_NODE:
            sTag=oChild.tagName
            if sTag=='Coords':
                sPoints=oChild.getAttribute('points')
                lCoorRegion=[parseCoor(s) for s in sPoints.split(' ')]
                if iVerbose>3:
                    print(sLv,'region extends:',lCoorRegion)
                dDat['region']=lCoorRegion
                dDat['xBeg']=lCoorRegion[0][0]
                dDat['xEnd']=lCoorRegion[3][0]
                dDat['dX']=float(lCoorRegion[3][0]-lCoorRegion[0][0])
                return 1
        oChild=oChild.nextSibling
    return 0

In [676]:
handle transkribus metadata xml tag `TextLine`

SyntaxError: invalid syntax (<ipython-input-676-51742c2f402e>, line 1)

In [677]:
def handleTextLine(oNode,dDat):
    idTextLine=oNode.getAttribute('id')
    dDat['org']['idTextLine']=idTextLine
    sLv='  '*2
    if iVerbose>1:
        print(sLv,'handleTextLine','id',idTextLine)
    sLv='  '*3
    lBase=oNode.getElementsByTagName('Baseline')
    for oBase in lBase:
        sPoints=oBase.getAttribute('points')
        if 0:
            print(sLv,sPoints)
        lCoord=sPoints.split(' ')
        sCoordBeg=lCoord[0]
        tCoorBeg=parseCoor(sCoordBeg)
        if iVerbose>3:
            print(sLv,'coord beg:',sCoordBeg,'parsed:',tCoorBeg)
        xRel=(tCoorBeg[0]-dDat['xBeg'])/dDat['xEnd']
        if iVerbose>3:
            print(sLv,'xRel',xRel)
        lTxt=oNode.getElementsByTagName('Unicode')
        if len(lTxt)>0:
            #print('lTxt',len(lTxt))
            if lTxt[0].firstChild is None:
                if dDat['org']['iLogged']<1:
                    print(dDat['org'])
                    dDat['org']['iLogged']=1
                print(sLv,'handleTextLine','id',idTextLine,'problem')
                sTxt=''
            else:
                sTxt=lTxt[0].firstChild.data
            #print(sTxt)
        else:
            sTxt=''
        oLetter=dDat.get('letter',None)
        if oLetter is None:
            oLetter=Letter()
            oLetter.setOrgByDat()
            dDat['letter']=oLetter
        iR=oLetter.isNewLetter(tCoorBeg[0],tCoorBeg[1])
        if iR>0:
            if iVerbose>3:
                print(sLv,'new letter detected')
            dDat['lLetter'].append(oLetter)
            oLetter=oLetter.getNew()
            oLetter.setOrgByDat()
            dDat['letter']=oLetter
        oLetter.addLine(tCoorBeg[0],tCoorBeg[1],sTxt,idLine=idTextLine)
    return 0

define transkribus metadata xml parse instructions

In [678]:
dDef={
    'Metadata':None,
    'Page.ReadingOrder':None,
    'Page.TextRegion':(handleTextRegion,(dDat,),{}),
    'Page.TextRegion.TextLine':(handleTextLine,(dDat,),{}),
}

walk through xml dom

In [679]:
def parseChild(oNode,lHier):
    oChild=oNode.firstChild
    while oChild is not None:
        if oChild.nodeType==oChild.ELEMENT_NODE:
            sTag=oChild.tagName
            if iVerbose>5:
                print('  ',sTag)
            lHierNxt=lHier+[sTag]
            sHier='.'.join(lHierNxt)
            if sHier in dDef:
                # +++++ beg:handle xml tag
                if dDef[sHier] is not None:
                    if iVerbose>2:
                        print('  ','found',sHier)
                    func,args,kwargs=dDef[sHier]
                    try:
                        iR=func(oChild,*args,**kwargs)
                    except:
                        traceback.print_exc()
                        iR=0
                    # +++++ beg:dive deeper
                    if iR>0:
                        parseChild(oChild,lHierNxt)
                    # ----- end:dive deeper
                # ----- end:handle xml tag
            else:
                parseChild(oChild,lHierNxt)
        oChild=oChild.nextSibling

parse xml file

In [680]:
def parseXML(sFN,dDat):
    if iVerbose>0:
        print('parseXML',sFN)
    # +++++ beg:remember filename
    sTmpDN,sTmpFN=os.path.split(sFN)
    dDat['org']={
        'sDN':sTmpDN,
        'sFN':sTmpFN,
        'iLogged':0,
    }
    # ----- end:remember filename
    # +++++ beg:parse xml file
    try:
        oDom=parse(sFN)
        oRoot=oDom.documentElement
        lHier=[]
        parseChild(oRoot,lHier)
    except:
        traceback.print_exc()
    # ----- end:parse xml file

## parse transkribus metadata

define folder to process

set folder where transkribus metadata is stored (pXXX.xml, where 001 <= XXX <= max page number of your document)
you find file in transkribus document export (<job>/<docName>/page)

all xml files are process in alphabetical order

In [681]:
sDN='../1975_Brahm_Seidlin_OCR'

clear data container

In [682]:
dDat['lLetter']=[]
dDat['oLetter']=None

walk through directory defined by `sDN`  
pick xml files and parse transkribus metadata structure

In [683]:
lXml=[]
for sRoot,lDN,lFN in os.walk(sDN):
    for sFN in lFN:
        if sFN.endswith('xml'):
            sFullFN=os.path.join(sRoot,sFN)
            lXml.append(sFullFN)
print('#file:',len(lXml),'found to parse')
lXml.sort()
for sFullFN in lXml:
    parseXML(sFullFN,dDat)

dDat['lLetter'].append(oLetter)  # don't forget of current one

#file: 363 found to parse
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p081.xml', 'iLogged': 0, 'idTextLine': 'r1l45'}
       handleTextLine id r1l45 problem
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p148.xml', 'iLogged': 0, 'idTextLine': 'r1l15'}
       handleTextLine id r1l15 problem
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p305.xml', 'iLogged': 0, 'idTextLine': 'r1l1'}
       handleTextLine id r1l1 problem


## json output  

all detected letters are in container


convert into dictionary

In [684]:
lDat=[]
for oLetter in dDat['lLetter']:
    d=oLetter.getDict(iInclLetter=0)
    lDat.append(d)
print('#letter',len(lDat),'found')

{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p044.xml', 'idLineFirst': 'r1l24'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p047.xml', 'idLineFirst': 'r2l13'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p050.xml', 'idLineFirst': 'r1l24'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p051.xml', 'idLineFirst': 'r2l6'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p051.xml', 'idLineFirst': 'r2l11'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p051.xml', 'idLineFirst': 'r2l16'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p053.xml', 'idLineFirst': 'r2l13'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p055.xml', 'idLineFirst': 'r1l31'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p057.xml', 'idLineFirst': 'r1l38'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p062.xml', 'idLineFirst': 'r1l35'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p063.xml', 'idLineFirst': 'r1l37'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p067.xml', 'idLineFirst': 'r1l30'}
{'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN

Traceback (most recent call last):
  File "<ipython-input-674-5ccab6181549>", line 171, in procDat
    dLine=self.lLine[iOfs]
IndexError: list index out of range


write json

In [685]:
writeJson('../1975_Brahm_Seidl_02.json',lDat)

done

+ check output
+ dive into it
+ edit and refine it, [OpenRefine](https://openrefine.org/) is highly recommended

have a nice day

## playground

In [686]:
print(lDat)

[{'ref': '---', 'sLocation': '---', 'sSnd': '---', 'iUnCertain': '---', 'letterTeaser': 'Die Briefe', 'letterData': '', 'org': {'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p042.xml', 'idLineFirst': 'r1l1'}}, {'ref': '[S1]', 'sLocation': 'Wien IX', 'sSnd': '1894-05-20', 'iUnCertain': 0, 'letterTeaser': '[S1]\nSehr geehrter Herr Doktor,\nich erlaube mir, Ihnen mein Schauspiel Das Märchen zur even-\ntuellen Aufführung am Deutschen Theater zu überreichen. Das\nStück ist im verflossenen Winter im Deutschen Volkstheater zu', 'letterData': 'Wien IX, Frankgasse 1\n20. Mai 94\nHochachtungsvoll ergebenst\nDr. Arthur Schnitzler', 'org': {'sDN': '../1975_Brahm_Seidlin_OCR', 'sFN': 'p042.xml', 'idLineFirst': 'r1l3'}}, {'ref': '[S2]', 'sLocation': 'Wien IX', 'sSnd': '1894-05-30', 'iUnCertain': 0, 'letterTeaser': '[S2]\nSehr geehrter Herr Doktor,\nich habe Ihnen mein dreiaktiges Schauspiel Das Märchen zur\nAufführung am Deutschen Theater überreicht und habe mich na-\ntürlich verpflichtet gefühlt, Ihn

In [183]:
os.getcwd()

'/wrk/dat/src/gitlab/wroacdhitp20203/D_dbs/3_py'

In [125]:
abs(-1)

1

In [154]:
s=' Nov '
print(s.strip(),len(s.strip()))

Nov 3


In [157]:
zD=datetime.date(1898,2,28)

In [158]:
print(zD)

1898-02-28


In [159]:
zD.__str__()

'1898-02-28'

In [160]:
zD=datetime.date(1898,0,28)

ValueError: month must be in 1..12

In [235]:
sDate='25. Februar 96'

In [236]:
lDate=sDate.split(' ')

In [237]:
print(lDate)

['25.', 'Februar', '96']


In [239]:
sDate='25.    Februar   96'
lDate=sDate.split(' ')
print(lDate)

['25.', '', '', '', 'Februar', '', '', '96']


In [242]:
lD=[s for s in lDate if len(s)>0]
print(lD)

['25.', 'Februar', '96']


In [243]:
int(lD[0])

ValueError: invalid literal for int() with base 10: '25.'

In [416]:
min(3,5)

3