Documentation Enhancement #26

Closed
GrazingScientist opened this issue Sep 15, 2020 · 2 comments

GrazingScientist commented Sep 15, 2020

Please consider explaining some details in your documentation:

  • What are WER and CER? If you are not familiar with these terms, you don't grasp them immediately, although it is an easy concept.
  • Does dinglehopper automatically recognize the import format?
  • How are text files and XML files compared? Are the XML files simply stripped down to their text representation? How do you ensure that there is no additional (or missing) empty paragraph skewing the evaluation?

Also, please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what the issue was internally... (Moved to #30)

And, despite this critique, thank you for providing such a handy tool! :)

Edit:
I found even more:

  • How can I process a bunch of ground truth files that are not part of the OCR-D mets.xml? Or, how can I assign them their corresponding page in the mets.xml? There should be some way!
@mikegerber mikegerber self-assigned this Sep 24, 2020
@mikegerber mikegerber added the documentation Improvements or additions to documentation label Sep 24, 2020
@mikegerber mikegerber added this to the 1.0 milestone Sep 24, 2020
mikegerber (Member) commented Sep 24, 2020

  • What are WER and CER? If you are not familiar with these terms, you don't grasp them immediately, although it is an easy concept.
  • Does dinglehopper automatically recognize the import format?
  • How are text files and XML files compared? Are the XML files simply stripped down to their text representation? How do you ensure that there is no additional (or missing) empty paragraph skewing the evaluation?

I'll address this in the documentation. In short:

  • Word and character error rates. For example, if 1 in 10 characters is wrong (due to insertion, deletion, or substitution), then CER = 1/10 = 0.1.
  • Yes, it detects whether a file is ALTO or PAGE and falls back to plain text.
  • It's always the extracted text that is compared. You can (and should) have a look at the visual comparison of the text. (If you have any specific data, please send it and I'll have a look.)
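
To make the CER arithmetic above concrete, here is a minimal sketch in Python. It is not dinglehopper's implementation: it assumes a plain Levenshtein distance over Python characters (or whitespace-separated words for WER), divided by the length of the ground truth, and the function names are made up for illustration. A real tool may normalize the text before comparing.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (no cost if equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference sequence."""
    if len(reference) == 0:
        # Assumption of this sketch: an empty reference gives 0 or infinity.
        return 0.0 if len(hypothesis) == 0 else float("inf")
    return levenshtein(reference, hypothesis) / len(reference)


def cer(gt_text, ocr_text):
    """Character error rate: the sequence items are single characters."""
    return error_rate(gt_text, ocr_text)


def wer(gt_text, ocr_text):
    """Word error rate: the sequence items are whitespace-separated words."""
    return error_rate(gt_text.split(), ocr_text.split())


if __name__ == "__main__":
    # One wrong character out of ten, as in the example above.
    print(cer("abcdefghij", "abcdefghiX"))
    # One wrong word out of four.
    print(wer("the quick brown fox", "the quick brown fax"))
```

Because the distance counts insertions as well, an error rate computed this way can exceed 1 when the OCR output is much longer than the ground truth.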

Also, please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what the issue was internally...

Heh, yeah I'll consider a progress bar. Could you please open a second issue for this?

And, despite this critique, thank you for providing such a handy tool! :)

Thanks 🥇 ;) Consider giving the project a star here on GitHub!

  • How can I process a bunch of ground truth files that are not part of the OCR-D mets.xml? Or, how can I assign them their corresponding page in the mets.xml? There should be some way!

This is out of scope for this tool, but this workshop material by @kba should help: https://ocr-d.de/slides/2019-03-25-dhd/praxis-new-mets

You need to add the GT to the METS with the right page ID (matching the OCR and image files). For a GT XML with page ID P0015, the command should be (untested):

    ocrd workspace add -g P0015 -G OCR-D-GT-PAGE -m application/vnd.prima.page+xml -i OCR-D-GT-PAGE_0015 OCR-D-GT/OCR-D-GT-PAGE_0015.xml
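
For a whole batch of ground truth files, the command above could be generated once per file. The following Python sketch only builds and prints the commands and does not run them. It assumes that every GT file is named like the example (a trailing `_0015` in the file name maps to page ID `P0015`), which is a guess based on that single example. The helper name `gt_add_command` is hypothetical, and the flags are exactly those of the untested command above.

```python
import re
import shlex
from pathlib import Path


def gt_add_command(gt_file, file_grp="OCR-D-GT-PAGE"):
    """Build the 'ocrd workspace add' argument list for one GT PAGE file.

    Assumption: the file stem ends in '_<digits>' and the page ID in the
    METS is 'P<digits>'. Check this against your own mets.xml.
    """
    path = Path(gt_file)
    match = re.search(r"_(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"cannot derive a page number from {gt_file!r}")
    number = match.group(1)
    return [
        "ocrd", "workspace", "add",
        "-g", f"P{number}",                      # page ID
        "-G", file_grp,                          # file group
        "-m", "application/vnd.prima.page+xml",  # MIME type of PAGE XML
        "-i", f"{file_grp}_{number}",            # file ID
        path.as_posix(),
    ]


if __name__ == "__main__":
    # Print one command per GT file found in the (assumed) OCR-D-GT directory.
    for gt_file in sorted(Path("OCR-D-GT").glob("*.xml")):
        print(shlex.join(gt_add_command(gt_file)))
```

Printing the commands first makes it possible to review the derived page IDs before anything in the workspace is changed.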

cneud (Member) commented Sep 25, 2020

What is WER and CER

This explanation is fairly extensive and clear: https://sites.google.com/site/textdigitisation/qualitymeasures

it detects if a file is ALTO or PAGE and falls back to text

I agree this would be nice to have covered in the README ;)
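
Until the README covers it, the detection described above could look roughly like the following Python sketch. This is an illustration and not dinglehopper's actual code: it assumes the decision can be made from the XML root element alone (`alto` for ALTO, `PcGts` for PAGE), and it treats anything that is not well-formed XML, or has another root element, as plain text. The function name `detect_format` is made up.

```python
import xml.etree.ElementTree as ET


def detect_format(path):
    """Return 'alto', 'page' or 'text' for the file at path.

    Assumption of this sketch: only the root element is inspected.
    """
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        # Not well-formed XML, so fall back to plain text.
        return "text"

    # ElementTree writes namespaced tags as '{namespace}localname'.
    local_name = root.tag.split("}")[-1]
    if local_name == "alto":
        return "alto"
    if local_name == "PcGts":
        return "page"
    # Unknown XML vocabulary: this sketch also falls back to text.
    return "text"
```

Checking the root element rather than the file extension means that a PAGE file named `.xml` and an ALTO file named `.xml` are still told apart.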

how can I assign them their corresponding page in the mets.xml?

See also the up-to-date docs: https://ocr-d.de/en/user_guide#non-existing-mets
