
Dealing with marks in the data #28

Closed
MrBrezina opened this issue Mar 26, 2021 · 6 comments
Labels: enhancement New feature or request
@MrBrezina (Member)

MrBrezina commented Mar 26, 2021

The intention is to include combining marks that are used in canonical decomposition defined by Unicode for the characters used by an orthography. The current plan (not implemented yet) would work like this:

  • for example, if base contains š this will imply inclusion of ◌̌ (combining caron) in base_marks
  • for example, if auxiliary contains á this will imply inclusion of ◌́ (combining acute) in auxiliary_marks
  • a flag will be provided for the CLI tool and the methods in the Python package, and a toggle will be added to the web app, to skip checking for marks when detecting language support
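The derivation described above can be sketched with Python's standard `unicodedata` module (`derive_marks` is an illustrative name, not hyperglot's actual API):

```python
import unicodedata

def derive_marks(characters):
    """Collect the combining marks implied by canonical (NFD) decomposition.

    For each character, decompose it canonically and keep the code points
    whose Unicode category is Mn (nonspacing mark). This is a sketch; it
    ignores spacing (Mc) and enclosing (Me) marks.
    """
    marks = set()
    for char in characters:
        for cp in unicodedata.normalize("NFD", char):
            if unicodedata.category(cp) == "Mn":
                marks.add(cp)
    return marks

derive_marks(["š"])  # š decomposes to s + U+030C COMBINING CARON
derive_marks(["á"])  # á decomposes to a + U+0301 COMBINING ACUTE ACCENT
```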

The reason for this is that some fonts that do not cover code points for combining marks might not be recognized as supporting a language even though the font could be used for the language without any problem thanks to the characters for precomposed combinations. We are not able to evaluate the quality of the mark positioning in a font.

Proposition: in situations where all combinations in base are covered without marks, i.e. with precomposed characters only, we could include the combining marks in auxiliary_marks rather than in base_marks. Detection would, by default, check only base and base_marks. This way languages would still be detected (without any flag or toggle) and the marks would still be recorded.
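A minimal sketch of this proposition, assuming the rule "file the derived marks under auxiliary_marks whenever every base character recomposes to a single precomposed code point" (the function and key names are hypothetical, not hyperglot's implementation):

```python
import unicodedata

def classify_marks(base_chars):
    """Derive combining marks from the base characters; if every base
    character has a fully precomposed (single NFC code point) form, the
    marks are only auxiliary for support detection, otherwise required."""
    marks = set()
    all_precomposed = True
    for char in base_chars:
        decomposed = unicodedata.normalize("NFD", char)
        marks.update(cp for cp in decomposed if unicodedata.category(cp) == "Mn")
        # A character is "covered without marks" only if NFC recombines
        # its decomposition back into a single code point.
        if len(unicodedata.normalize("NFC", decomposed)) > 1:
            all_precomposed = False
    if all_precomposed:
        return {"base_marks": set(), "auxiliary_marks": marks}
    return {"base_marks": marks, "auxiliary_marks": set()}

# š and á both have precomposed forms, so caron and acute would land in
# auxiliary_marks and detection would ignore them by default.
classify_marks(["š", "á"])
```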

We think that including combining marks in fonts is: a) a more future-proof and better design strategy, and b) useful for technical reasons, since precomposed characters get decomposed automatically in some situations.

@moyogo (Contributor)

moyogo commented Mar 26, 2021

One of the issues with both the current approach and the proposed one is that the presence of a combining mark does not guarantee that graphemes of an orthography that lack precomposed character forms are supported.
The data model does not store which characters the combining marks are meant to combine with, which prevents testing them in hyperglot or in tools that use hyperglot.

For example, Guarani uses ã ẽ g̃ ĩ õ ũ ỹ, where all but g̃ have precomposed characters encoded. A font that has combining tilde does not necessarily support Guarani properly unless it positions the combining tilde on g or substitutes the sequence for a single glyph.
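This asymmetry can be checked directly with Unicode normalization: NFC recomposes a + combining tilde into the precomposed ã, but leaves g + combining tilde as two code points, because no precomposed form exists:

```python
import unicodedata

for seq in ["a\u0303", "g\u0303"]:  # base letter + U+0303 COMBINING TILDE
    composed = unicodedata.normalize("NFC", seq)
    # a + tilde recombines to the single precomposed U+00E3 (ã);
    # g + tilde has no precomposed form and stays two code points.
    print("U+%04X + U+0303 -> %d code point(s)" % (ord(seq[0]), len(composed)))
```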

A first step to resolve this would be to store these graphemes in base or auxiliary.
Then a check that verifies these sequences are modified, either by positioning or by substitution through default features in the relevant language system, could confirm that the font has some support for those graphemes.
That said, there may be cases where positioning or substitution is not required for support.

The quality of support is probably out of scope, just as it is for simple graphemes: a mark that is positioned, but positioned incorrectly, is no different from a wrong mark position in a precomposed character. That would have to be assessed differently.

@MrBrezina (Member, Author)

MrBrezina commented Mar 26, 2021

Originally, we wanted to include the combinations and I wanted to check for the presence of relevant OpenType features and lookups. I think we only postponed the combinations for later. I am definitely for their inclusion in base and auxiliary, or possibly in a separate combinations entry, or even a separate database if it gets too large (e.g. for Indian languages). What do you think @kontur ?

Regarding OT feature checks, I am not convinced that it makes sense to do an elaborate check that would not completely tackle the issue (be it OT feature analysis or ML-driven OCR on test documents). The ultimate check has to be visual and according to the objectives of the designer and the purpose of what they are trying to do.

For example:

  1. If we attempted to analyse the existence of mark positioning, we would not know if it is any good. And the mark could be well positioned, but poorly designed.
  2. If we attempted to analyse the quality of mark positioning, we would be tackling an artificial design problem. We would need to know what constitutes a good positioning. That is not far from making good positioning and comparing with the font. And that way we would be imposing our design standards (embedded in the judgement of what constitutes a “good positioning”) on others.
  3. If we compared a combination before and after activating OT features we would not be much wiser. I know people who design their marks to overhang above characters so they do not need to rely on mark positioning. But maybe that is a rare case.

I think I drew the line around design differently from you and included the quality of OpenType features in it. Thus, my current inclination is to trust (!) the users and provide general (format agnostic :) ) notes and perhaps point them to design guidelines where these are available.

[edited for clarity]

@MrBrezina MrBrezina added the enhancement New feature or request label Mar 26, 2021
@moyogo (Contributor)

moyogo commented Mar 26, 2021

I agree, OT feature check is a difficult one.
Knowing which combinations occur is the minimum that's currently missing.

@kontur (Contributor)

kontur commented Mar 26, 2021

Yes, I think this is one of the edge cases where conceptually the decomposition approach does not work well. Even more so if the unencoded glyph, let's say /gtilde/, is in a character set in which no other encoded glyphs that decompose to base + /tildecomb/ are present. I think we could simply tolerate such unencoded base + mark combinations in the data, but it gets a little tricky with the parsing and saving of that data. Normally we split everything into the code points that can be split, whereas here we would somehow have to keep those combinations joined.
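Keeping such combinations joined while splitting a character list could be sketched by attaching combining code points to the preceding base character (`split_characters` is a hypothetical helper, not hyperglot's parser):

```python
import unicodedata

def split_characters(text):
    """Split a string of characters into grapheme-like units, keeping
    unencoded base + combining-mark sequences (like g̃) joined instead of
    splitting everything into single code points."""
    graphemes = []
    for cp in text:
        if graphemes and unicodedata.category(cp) == "Mn":
            graphemes[-1] += cp  # attach the mark to the preceding base
        else:
            graphemes.append(cp)
    return graphemes

# g̃ (g + U+0303) stays one item even though it is two code points
split_characters("ãg̃")
```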

I think a better implementation for support checking would indeed be to check the mark features and make sure the glyphs involved are listed. This would be a solution that sidesteps design considerations, but also catches false positives of the current approach, where the involved glyphs just happen to be in the font but are not linked via the mark feature at all.

For now, a good step will be to split into base, base_marks, auxiliary, and auxiliary_marks, and to implement testing for those in the CLI.
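A set-based sketch of such a check, assuming the font's covered code points come from its cmap and that marks are only checked when explicitly requested (names and the flag are illustrative, not the actual CLI implementation):

```python
def supports(codepoints, base, base_marks, check_marks=False):
    """Return True if the font's covered code points include all required
    base characters and, when check_marks is set, all required combining
    marks. `codepoints` would come from the font's cmap in practice."""
    required = set(base)
    if check_marks:
        required |= set(base_marks)
    return required <= codepoints

covered = {"s", "š", "c"}  # pretend cmap coverage, no combining caron
supports(covered, base="sš", base_marks="\u030c")                    # True
supports(covered, base="sš", base_marks="\u030c", check_marks=True)  # False
```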

@meehkal (Contributor)

meehkal commented Apr 2, 2021

Just a note that this is also relevant for Kildin Sami (sjd), where some of the following letters of the alphabet are precomposed, but not all of them. (The list is not complete, but just for illustration.)

А̄ Е̄ Ӣ Э̄ Я̄ Ӯ Ё̄

@kontur kontur mentioned this issue Apr 6, 2021
@kontur (Contributor)

kontur commented Apr 9, 2021

For the record in this discussion: we've just released 0.3.0, which addresses these issues with marks in a different way. Characters in the lists are no longer required to be encoded Unicode characters; unencoded base + mark combinations can be stored as well.

marks will still be extracted from base and auxiliary and saved in marks, but marks can additionally list other marks the language requires that are not part of the character lists. The CLI now requires the --marks option to explicitly check for the presence of combining marks. --decompose works as before.

More details in the changelog

We've done some review of orthographies where the data suggested unencoded combinations had previously been dropped, but @meehkal and @moyogo, if you have languages/orthographies in mind where this was an issue, feel free to point them out to us or submit a fix. I hope the character list storage is now less Unicode-centric and closer to how linguists think about orthographies, which should ease the readability and input of language data.
