
COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE can be ignored in compare #28

Open · stweil opened this issue Sep 23, 2020 · 10 comments


stweil commented Sep 23, 2020

Unicode allows different representations of the same character. Dinglehopper currently does not detect that such different representations are the same character and instead treats them as a recognition error.

This can be fixed by normalizing the text before comparing.

Example: We just had a case where the GT transcription used zuͦſein (u + COMBINING RING ABOVE) while the OCR detected zůſein (LATIN SMALL LETTER U WITH RING ABOVE). See https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/dh_055.html.
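For illustration, a minimal standalone sketch (plain Python standard library, not dinglehopper code) of how NFC normalization unifies canonically equivalent encodings of ů:

```python
import unicodedata

# Two encodings of "zůſein":
decomposed  = "zu\u030a\u017fein"  # u + COMBINING RING ABOVE (U+030A)
precomposed = "z\u016f\u017fein"   # LATIN SMALL LETTER U WITH RING ABOVE (U+016F)

assert decomposed != precomposed  # the raw code point sequences differ
# NFC composes u + COMBINING RING ABOVE into U+016F, so both compare equal:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```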


stweil commented Sep 24, 2020

Dinglehopper already uses normalization (NFC). The difference in the example is caused by distinct characters that merely look the same: u + COMBINING LATIN SMALL LETTER O != u + COMBINING RING ABOVE.

So COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE should be treated as equivalent when comparing.
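To illustrate (standalone Python, not dinglehopper internals): the two sequences are not canonically equivalent, so NFC keeps them distinct:

```python
import unicodedata

gt  = "u\u0366"  # u + COMBINING LATIN SMALL LETTER O (uͦ)
ocr = "u\u030a"  # u + COMBINING RING ABOVE

# There is no precomposed code point for u + COMBINING LATIN SMALL LETTER O,
# so NFC leaves gt unchanged, while ocr composes to U+016F (ů):
assert unicodedata.normalize("NFC", gt) == "u\u0366"
assert unicodedata.normalize("NFC", ocr) == "\u016f"
assert unicodedata.normalize("NFC", gt) != unicodedata.normalize("NFC", ocr)
```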

stweil changed the title from "Normalize text before comparing" to "COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE can be ignored in compare" Sep 24, 2020
mikegerber added the enhancement label Sep 24, 2020
mikegerber commented Sep 24, 2020

I do think that COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE are entirely different characters and not equivalent.

However, I do think that the normalization behaviour should be configurable by the user. So if you choose to consider these characters to be the same for your use case, then you should be able to configure that. I have some WIP on this that I need to merge and then work on further.

mikegerber self-assigned this Sep 24, 2020
mikegerber added this to the 1.0 milestone Sep 24, 2020

cneud commented Sep 25, 2020

Duplicate of #11?


stweil commented Sep 26, 2020

@cneud, yes, the issue can be solved with substitutions which can be configured by the users.

@mikegerber, sure, uͦ and ů are different characters, much like some Cyrillic characters that look like Latin ones. But which of them is the right one for historic German texts? Do you agree that it does not make sense to train OCR models on both in that context?


mikegerber commented Sep 30, 2020

> @cneud, yes, the issue can be solved with substitutions which can be configured by the users.

Exactly.

I aim to support (in Unicode terms) canonical equivalence (= using NFC consistently) and maybe the same idea for MUFI characters, and to make all further equivalence considerations user-configurable. The latter also means that some of the hardcoded substitutions will be moved to some kind of configuration (#11), so @stweil can make uͦ and ů equivalent.
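A sketch of how such user-configurable equivalences could work, applied after NFC. The mapping format and function name below are hypothetical, not dinglehopper's actual configuration:

```python
import unicodedata

# Hypothetical user configuration: character sequences mapped to a
# representative used for comparison purposes only.
EQUIVALENCES = {
    "u\u0366": "\u016f",  # uͦ (u + COMBINING LATIN SMALL LETTER O) -> ů
}

def normalize_for_comparison(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical equivalence first
    for sequence, representative in EQUIVALENCES.items():
        text = text.replace(sequence, representative)
    return text

# With this configuration, the GT and OCR strings from the example compare equal:
assert normalize_for_comparison("zu\u0366\u017fein") == normalize_for_comparison("z\u016f\u017fein")
```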


bertsky commented Oct 30, 2020

> @cneud, yes, the issue can be solved with substitutions which can be configured by the users.

Exactly.

I would like to point out here that allowing arbitrary equivalences also makes comparing results much more difficult. Ideally, there should be sensible sets of transformations (like the OCR-D GT levels) that many researchers and practitioners can agree on. And then, to facilitate commensurability, the evaluation should ideally produce multiple metrics next to each other in the report: always the chosen metric, plus a maximum-normalization (GT level 1) metric, plus a minimum-normalization (GT level 3 / plain Levenshtein) metric.

Also, cf. existing metrics in my module.
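A sketch of what reporting several metrics side by side could look like. The normalization "levels" below are placeholders for illustration only; as discussed further down, the OCR-D GT levels are not formally defined as transformation sets:

```python
import unicodedata

def levenshtein(a: str, b: str) -> int:
    """Plain code-point-level edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def cer(gt: str, ocr: str, normalize) -> float:
    gt, ocr = normalize(gt), normalize(ocr)
    return levenshtein(gt, ocr) / len(gt)

# Illustrative normalization levels, from none to aggressive:
LEVELS = {
    "raw Levenshtein": lambda s: s,
    "NFC":             lambda s: unicodedata.normalize("NFC", s),
    "NFC + uͦ -> ů":    lambda s: unicodedata.normalize("NFC", s).replace("u\u0366", "\u016f"),
}

gt, ocr = "zu\u0366\u017fein", "z\u016f\u017fein"
for name, norm in LEVELS.items():
    print(f"{name}: CER = {cer(gt, ocr, norm):.3f}")
```

Running this prints a nonzero CER for the first two levels and 0.0 for the last one, which is exactly the kind of spread that side-by-side reporting would make visible.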


cneud commented Oct 30, 2020

Basically this comes down to a number of pre-defined common use cases or scenarios, plus the possibility for users to create their own. This is also the approach that was followed in the IMPACT project, and I believe it would be a sound trade-off between comparable results and flexibility with regard to differing applications.

mikegerber commented Oct 30, 2020

I mostly agree with what @bertsky and @cneud said. I just want to throw in some doubt on the belief that CERs are somehow comparable when produced by different tools: do they count whitespace the same way? Grapheme clusters? Punctuation?

Side note: is there really a set of transformations defined for OCR-D's GT level 1?
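One concrete instance of the grapheme cluster question: a CER that counts code points and one that counts grapheme clusters already disagree about the length of a string. A sketch using the third-party regex package, which supports the \X pattern for extended grapheme clusters:

```python
import regex  # third-party package; the stdlib re module has no \X support

s = "zu\u0366\u017fein"  # zuͦſein, with a combining character
print(len(s))                        # 7 code points
print(len(regex.findall(r"\X", s)))  # 6 grapheme clusters: uͦ counts as one
```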


cneud commented Oct 30, 2020

> the belief that CERs are somehow comparable when produced by different tools

I too strongly doubt they are! Looking at results and metrics from ICDAR papers, for example, many authors resort to their own evaluation implementation, which obviously creates considerable blur around any exact performance comparison.


bertsky commented Oct 30, 2020

> I just want to throw in some doubt on the belief that CERs are somehow comparable when produced by different tools: do they count whitespace the same way? Grapheme clusters? Punctuation?

We have to get there! As a community. Otherwise, where's the objectivity?

White space should be easy, given the strictly implicit PAGE-XML whitespace model. Grapheme clusters are something we agreed on earlier; only our implementations differ (so it should be interesting to compare them to find edge cases). Punctuation: I am not sure what you mean – punctuation normalization for CER, or tokenization for WER? (The latter, I agree, is a hard one to find any single standard for...)

> Side note: is there really a set of transformations defined for OCR-D's GT level 1?

No, unfortunately not. That is one of the things I have been adamant about in phase 2, but I never got @tboenig or @kba to implement a runnable definition 😟

@mikegerber mikegerber removed this from the 1.0 milestone Mar 2, 2023