Documentation Enhancement #26

GrazingScientist · 2020-09-15T10:05:00Z

Please consider explaining some details in your documentation:

What are WER and CER? If you are not familiar with these terms, you don't grasp them immediately, although the concept is easy.
Does dinglehopper automatically recognize the import format?
How are text files and XML files compared? Are the XML files simply stripped down to their text representation? How do you ensure that no additional (or missing) empty paragraph skews the evaluation?

Also, please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what the issue was internally... (Moved to #30)

And, despite this critique, thank you for providing such a handy tool! :)

Edit:
I found even more:

How can I process a bunch of ground truth files that are not part of the OCR-D mets.xml? Or, how can I assign them to their corresponding page in the mets.xml? There should be some way!

mikegerber · 2020-09-24T18:31:44Z

What are WER and CER? If you are not familiar with these terms, you don't grasp them immediately, although the concept is easy.

Does dinglehopper automatically recognize the import format?

How are text files and XML files compared? Are the XML files simply stripped down to their text representation? How do you ensure that no additional (or missing) empty paragraph skews the evaluation?

I'll address this in the documentation. Answering briefly here:

WER and CER are the word and character error rates. For example, if 1 in 10 characters is wrong (due to insertion, deletion, or substitution), then CER = 1/10 = 0.1.
Yes, it detects whether a file is ALTO or PAGE and falls back to plain text otherwise.
It's always the extracted text that is compared. You can (and should) have a look at the visual comparison of the text. (If you have any specific data, please send it and I'll have a look.)
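
To make the CER arithmetic above concrete, here is a minimal sketch in Python. It is not dinglehopper's implementation: it assumes the error count is a plain Levenshtein distance over Python characters (for CER) or whitespace-separated words (for WER), divided by the length of the ground truth. The function names and the handling of an empty reference are choices made for this example only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: element of ref missing in hyp
                curr[j - 1] + 1,     # insertion: extra element in hyp
                prev[j - 1] + cost,  # substitution, or match if cost == 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the length of the reference."""
    if len(ref) == 0:
        # Assumption for this sketch: empty vs. empty counts as error-free.
        return 0.0 if len(hyp) == 0 else float("inf")
    return edit_distance(ref, hyp) / len(ref)


def character_error_rate(reference, hypothesis):
    return error_rate(reference, hypothesis)


def word_error_rate(reference, hypothesis):
    return error_rate(reference.split(), hypothesis.split())
```

With a 10-character ground truth and one substituted character, `character_error_rate("abcdefghij", "abcdefghiX")` gives 0.1, matching the example above. A real tool would also have to decide how to treat Unicode normalisation and combining characters, which this sketch ignores.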

Also, please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what the issue was internally...

Heh, yeah I'll consider a progress bar. Could you please open a second issue for this?

And, despite this critique, thank you for providing such a handy tool! :)

Thanks 🥇 ;) Consider giving the project a star here on GitHub!

How can I process a bunch of ground truth files that are not part of the OCR-D mets.xml? Or, how can I assign them to their corresponding page in the mets.xml? There should be some way!

This is out of scope for this tool, but this workshop material by @kba should help: https://ocr-d.de/slides/2019-03-25-dhd/praxis-new-mets – You need to add the GT to the METS with the right page ID (matching the OCR and image files). So for a GT XML with page ID P0015, it should be (untested):
ocrd workspace add -g P0015 -G OCR-D-GT-PAGE -m application/vnd.prima.page+xml -i OCR-D-GT-PAGE_0015 OCR-D-GT/OCR-D-GT-PAGE_0015.xml
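
For a whole directory of GT files, the same (untested) command could be generated per file. The following Python sketch only builds the argument lists and does not run anything. It assumes a hypothetical naming convention in which each file is called `<file group>_<number>.xml` and the page ID is `P<number>`, as in the single example above; `build_add_command` is a name invented for this sketch, and real workspaces may use different page IDs.

```python
from pathlib import Path


def build_add_command(gt_file, file_grp="OCR-D-GT-PAGE"):
    """Build the argument list for one `ocrd workspace add` call.

    Assumes (hypothetically) that the file is named like
    OCR-D-GT-PAGE_0015.xml and belongs to page ID P0015.
    """
    gt_file = Path(gt_file)
    file_id = gt_file.stem                      # e.g. OCR-D-GT-PAGE_0015
    page_id = "P" + file_id.rsplit("_", 1)[-1]  # e.g. P0015
    return [
        "ocrd", "workspace", "add",
        "-g", page_id,
        "-G", file_grp,
        "-m", "application/vnd.prima.page+xml",
        "-i", file_id,
        gt_file.as_posix(),
    ]
```

Calling this for each path in `sorted(Path("OCR-D-GT").glob("*.xml"))` and printing the result with `shlex.join` gives a list of commands to review before running them, which seems prudent given that the underlying command is untested.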

cneud · 2020-09-25T23:05:10Z

What are WER and CER

This is fairly extensive and clear: https://sites.google.com/site/textdigitisation/qualitymeasures

it detects if a file is ALTO or PAGE and falls back to text

I agree this would be nice to have covered in the README ;)
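
As an illustration of what "detects ALTO or PAGE and falls back to text" could look like, here is a rough Python sketch. It is an assumption about the general approach, not dinglehopper's actual code: it parses the input as XML, looks at the local name of the root element (`PcGts` for PAGE, `alto` for ALTO), and treats everything else, including input that is not XML at all, as plain text. The function name `detect_format` is made up for this example.

```python
import xml.etree.ElementTree as ET


def detect_format(data):
    """Return 'page', 'alto' or 'text' for the given bytes.

    Sketch only: decides by the root element's local name and
    falls back to 'text' for anything that is not PAGE or ALTO XML.
    """
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return "text"  # not well-formed XML, treat as plain text

    # ElementTree tags look like '{namespace}localname'.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "PcGts":
        return "page"
    if local_name == "alto":
        return "alto"
    return "text"
```

For a file on disk, `detect_format(Path(filename).read_bytes())` would apply the same check. Treating unknown XML as text is a choice made here for simplicity; a tool could just as well report an error in that case.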

how can I assign them their corresponding page in the mets.xml?

See also the up-to-date docs: https://ocr-d.de/en/user_guide#non-existing-mets

mikegerber self-assigned this Sep 24, 2020

mikegerber added the documentation label Sep 24, 2020

mikegerber added this to the 1.0 milestone Sep 24, 2020

GrazingScientist mentioned this issue Sep 25, 2020

Add --progress parameter #30

Closed

mikegerber closed this as completed in d706ef4 Sep 30, 2020
