Good segmentation results but bad OCR when using sbb-textline-detection with OCR-D #42

Open
maxnth opened this issue Oct 2, 2020 · 11 comments
Labels
bug Something isn't working

Comments

@maxnth

maxnth commented Oct 2, 2020

I'm not sure whether this is the right place to ask, since sbb-textline-detector itself worked perfectly in our OCR-D workflows and the segmentation it produced looks good. However, running any recognition afterwards (calamari-recognize as well as tesserocr-recognize) yields weird text output that seems much worse than it should be, given the good segmentation results.

I basically used the (formerly) recommended workflow and substituted everything starting from the region segmentation up to the line segmentation with sbb-textline-detector.

The region segmentation produced by this looks pretty good, and this impression is confirmed by the pixel accuracy evaluation we ran for several segmentation workflows (with cis-ocropy-segment, tesserocr-segment-region, …). The line segmentation looks pretty good as well and should be a good basis for running OCR, but as stated above the results are surprisingly bad. I also tried running the recognition directly on the produced segmentation (OCR-D-SEG-LINE) without dewarping first, but the results are even worse that way.

Am I missing something obvious (e.g. adding a certain step after running sbb-textline-detector)?

Workflow steps
"olena-binarize -I input -O OCR-D-BIN -P impl sauvola"
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
"olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim"
"cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page"
"cis-ocropy-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P level-of-operation page"
"sbb-textline-detector -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-LINE -P model /home/mn/Desktop/sbbmodels/mixed"
"cis-ocropy-dewarp -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-RESEG-DEWARP"
"calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /home/mn/Desktop/ocrd_calamari/gt4histocr-calamari/*.ckpt.json"
    
Region segmentation output
Line segmentation output
Text output:

) weglaſſen. — Und — — großen, oder bald einen groͤßern, kleinern Raum dazwiſchen laſſen. ..C 2 ..H. A.. .. ſ. A
rts, ſondern gerade auf das Papier
rr uue— rſ — ewoͤhnlichſte Schrift iſt die Current— n .nehr von der Rechten zur Linken g me o ene i die unter oder uͤber die Linie hervor— uchſtaben alle gleich weit hervorragen. roßer Fehler, wenn die Buchſtaben zu br ao ſ — ſ o i ui — in Wort ausmachen, einzeln zu ſchrei— ern ſie muͤſſen, ſo viel moͤglich iſt, ſo en vorhergehenden, als mit den fol— — . ſRa o. ſ X2 itt — — —

The input image for the example page and the produced PAGE XML can be found here in case it helps.

@mikegerber
Member

mikegerber commented Oct 2, 2020

Could you upload the result before the dewarping step, too? My hunch is that the dewarping produces too thick lines. Ideally, upload the whole workspace contents for this page.

@mikegerber
Member

mikegerber commented Oct 2, 2020

There is also an issue that the line texts aren't matching the line images, but this could just be an issue with PAGEViewer:

[Screenshot: line image and text box in PAGEViewer]

The text box should give the result for the first line, but gives some text from the second line.
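One way to rule out a viewer problem is to read the line texts straight from the PAGE XML. The following is only a minimal sketch: `SAMPLE` is a made-up two-line page (not the file from this issue), `line_summary` is a hypothetical helper, and the `{*}` namespace wildcard assumes Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

# Made-up PAGE XML with two text lines; stands in for a real output file.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0005.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,190 100,190"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="100,200 900,200 900,290 100,290"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def line_summary(page_xml):
    """Return (line id, top y of the line polygon, line text) for every TextLine."""
    root = ET.fromstring(page_xml)
    rows = []
    for line in root.findall(".//{*}TextLine"):
        coords = line.find("{*}Coords")
        # "x1,y1 x2,y2 ..." -> smallest y value, i.e. the top edge of the polygon
        top_y = min(int(p.split(",")[1]) for p in coords.get("points").split())
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        rows.append((line.get("id"), top_y, text))
    return rows


for row in line_summary(SAMPLE):
    print(row)
```

If the texts listed this way follow the vertical order of the polygons, the mismatch is more likely on the viewer side; if they are shifted by one line here too, the problem is in the PAGE XML itself.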

mikegerber self-assigned this Oct 2, 2020
mikegerber added the bug label Oct 2, 2020
@maxnth
Author

maxnth commented Oct 2, 2020

> Could you upload the result before the dewarping step, too?

OCR-D-SEG-LINE_0005.xml (sbb-seg.zip) is the PAGE XML output by sbb-textline-detector; OCR-D-OCR2_0005.xml and OCR-TXT2_0005.txt are the OCR output from running calamari-recognize directly on OCR-D-SEG-LINE_0005.xml.

> There is also an issue that the line texts aren't matching the line images

That bug (?) appears in LAREX as well, but my first thought was that it's a problem caused by LAREX, as it's not really 100% compatible with OCR-D yet.

@mikegerber
Member

mikegerber commented Oct 2, 2020

Here is the result with my (a lot simpler) my_ocrd_workflow.

2020-10-sbb_textline_detection-issue-42.zip

The result is fine (paragraphs 2+3):

Die gewoͤhnlichſte Schrift iſt die Current⸗
ſchrift, deren Buchſtaben nicht zu gerade herun—
ter, ſondern mehr von der Rechten zur Linken
herabliegend geſchrieben werden muͤſſen — Es iſt
gut, wenn die unter oder uͤber die Linie hervor⸗
ragende Buchſtaben alle gleich weit hervorragen.
Es iſt ein großer Fehler, wenn die Buchſtaben zu
gedraͤngt ſtehen, oder zu weit gedehnt ſind. Auch
muß man ſich huͤten, die Buchſtaben, die zu—
ſammen Ein Wort ausmachen, einzeln zu ſchrei—
ben, ſondern ſie muͤſſen, ſo viel moͤglich iſt, ſo
wohl mit den vorhergehenden, als mit den fol—
genden zuſammenhaͤngen. —
Man muß den Currentbuchſtaben nicht un—
nuͤtze Zierrathen anhaͤngen, oder ihre Schwei⸗
fungen zu ſehr vergroͤßern.

So the problem is somewhere in all the cropping/dewarping/deskewing or the handling thereof. This is going to take some time to debug. But I wanted to check out the dewarping anyway ;-)

@maxnth
Author

maxnth commented Oct 2, 2020

Using your more minimal workflow with sbb-textline-detector gave me the same results (which look a bit more like the result I expected :D )

@mikegerber
Member

Yeah, superficially I only see problems with the hyphens.

@maxnth
Author

maxnth commented Oct 2, 2020

I tried switching off different pre-processing steps before segmenting (seeing that minimal pre-processing seems to work just fine in this case), and it seems that cropping is responsible for the bad results.
The above workflow without anybaseocr-crop yields good results, whereas turning off the other pre-processing steps but leaving cropping in the workflow always yields bad results for this page.
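Removing one step from such a workflow means the following step has to read from the removed step's input file group. A minimal sketch of that rewiring is below. `drop_step` is a hypothetical helper, not part of OCR-D, and it assumes every step string has exactly one `-I` and one `-O` option, as in the workflow listed at the top of this issue.

```python
import shlex

# First steps of the workflow from the top of this issue.
STEPS = [
    "olena-binarize -I input -O OCR-D-BIN -P impl sauvola",
    "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP",
    "olena-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P impl kim",
    "cis-ocropy-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page",
]


def drop_step(steps, processor):
    """Return the steps without `processor`, rewiring later input file groups."""
    kept = []
    rewired = {}  # output group of a dropped step -> the group it read from
    for step in steps:
        tokens = shlex.split(step)
        i_idx = tokens.index("-I") + 1
        o_idx = tokens.index("-O") + 1
        # If this step reads from a dropped step, read from that step's input instead.
        tokens[i_idx] = rewired.get(tokens[i_idx], tokens[i_idx])
        if tokens[0] == processor:
            rewired[tokens[o_idx]] = tokens[i_idx]
            continue
        kept.append(shlex.join(tokens))
    return kept


for step in drop_step(STEPS, "anybaseocr-crop"):
    print(step)
```

The second binarization then reads from OCR-D-BIN instead of OCR-D-CROP, and all other steps stay unchanged.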

@mikegerber
Member

mikegerber commented Oct 2, 2020

Thanks for the analysis. I'll look into the problem; it could be an API problem in ocrd-sbb-textline-detector.

@cneud
Member

cneud commented Oct 2, 2020

Other @OCR-D users also reported issues with anybaseocr-crop. But if the expected results can in fact be achieved with https://github.com/mikegerber/my_ocrd_workflow/, this rather hints at a problem in the OCR-D workflow or in the way ocrd-sbb-textline-detector writes its output PAGE-XML (cc @kba).

Btw there is also this nice fork https://github.com/sulzbals/gbn which provides a more granular API that is @OCR-D compliant, in case this may be useful for testing/debugging.

Regarding the way cropping and line-deskewing/dewarping are applied by sbb-textline-detector, @vahidrezanezhad can fill in the details much better than me.

@mikegerber
Member

mikegerber commented Oct 15, 2020

I changed the code that retrieves the image and calculates the coordinates. Could you try again with current master / 020ffbc? (I don't have a setup of anybaseocr + cis-ocropy yet, so it would help if you could test it.)
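The change itself is not shown in this thread. As a rough illustration of why coordinate handling matters after a cropping step, the sketch below assumes (this is not confirmed above) that the crop is a plain axis-aligned offset. In that case, line polygons detected on the cropped image must be shifted by that offset before they are written as page coordinates. All names and numbers here are hypothetical.

```python
def to_page_coords(polygon, crop_offset):
    """Shift a polygon from cropped-image coordinates to full-page coordinates."""
    dx, dy = crop_offset
    return [(x + dx, y + dy) for x, y in polygon]


def points_attr(polygon):
    """Format a polygon as a PAGE XML `points` attribute value."""
    return " ".join(f"{x},{y}" for x, y in polygon)


# Hypothetical text line found on the cropped image.
line_in_crop = [(10, 20), (510, 20), (510, 60), (10, 60)]
# Hypothetical position of the crop's top-left corner on the full page.
crop_offset = (120, 80)

print(points_attr(to_page_coords(line_in_crop, crop_offset)))
# → 130,100 630,100 630,140 130,140
```

If the offset were skipped, every line polygon would sit too far up and to the left on the page, and a recognizer would be given image regions that only partly cover the intended line.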

mikegerber removed their assignment Oct 15, 2020
@cneud
Member

cneud commented Oct 31, 2020

Possibly relates to #48

3 participants