Dealing with marks in the data #28
One of the issues with the current approach (and the proposed one) is that the presence of a combining mark does not guarantee that graphemes without precomposed character forms are supported. For example, Guarani uses ã ẽ g̃ ĩ õ ũ ỹ, where all but g̃ have precomposed characters encoded. A font that has a combining tilde does not necessarily support Guarani properly unless it positions the combining tilde on g or substitutes the sequence with a single glyph. A first step to resolve this would be to store these graphemes in `base` or `auxiliary`. The quality of support is probably out of scope, the same way it is for simple graphemes: a mark that is positioned, but positioned incorrectly, is no different from a wrong precomposed character. This would have to be assessed differently.
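A quick way to see which of these graphemes lack a precomposed code point is to round-trip them through Unicode normalization. A minimal sketch using Python's `unicodedata`, with the Guarani graphemes from above written as explicit code points:

```python
import unicodedata

# Guarani nasal graphemes, written as explicit code points
graphemes = {
    "a-tilde": "\u00e3",   # precomposed ã
    "e-tilde": "\u1ebd",   # precomposed ẽ
    "g-tilde": "g\u0303",  # g + combining tilde: no precomposed form exists
    "i-tilde": "\u0129",   # precomposed ĩ
    "o-tilde": "\u00f5",   # precomposed õ
    "u-tilde": "\u0169",   # precomposed ũ
    "y-tilde": "\u1ef9",   # precomposed ỹ
}

for name, g in graphemes.items():
    # NFC recomposes to a single code point wherever Unicode defines one
    nfc = unicodedata.normalize("NFC", g)
    status = "precomposed" if len(nfc) == 1 else "base + mark only"
    print(name, status)
```

Only `g-tilde` stays a two-code-point sequence after NFC, which is exactly the case a precomposed-only coverage check misses.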
Originally, we wanted to include the combinations, and I wanted to check for the presence of relevant OpenType features and lookups. I think we only postponed the combinations for later; I am definitely for their inclusion. Regarding OT feature checks, I am not convinced that it makes sense to do an elaborate check that would not completely tackle the issue (be it OT feature analysis or ML-driven OCR on test documents). The ultimate check has to be visual, judged against the objectives of the designer and the purpose of what they are trying to do. For example:
I think I drew the line around design differently from you and included the quality of OpenType features in it. Thus, my current inclination is to trust (!) the users and provide general (format-agnostic :) ) notes, and perhaps point them to design guidelines where these are available. [edited for better clarity]
I agree, the OT feature check is a difficult one.
Yes, I think this is one of the edge cases where the decomposition approach conceptually does not work well. Even more so if the unencoded glyph, let's say /gtilde/, is in a charset in which no other encoded glyphs that can be decomposed to base + /tildecomb/ are present. I think we could simply tolerate such unencoded base + mark combinations in the data, but it gets a little tricky with the parsing and saving of that data. Normally we split everything into the unicodes that can be split, so we would somehow have to implement "keeping" those combinations joined. A better implementation of the support check would indeed be to inspect the mark features and make sure the glyphs involved are listed. That would side-step design considerations, but would also catch false positives of the current approach, where the involved glyphs happen to be in the font but are not linked via the mark feature at all. For now a good step will be to split to
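The "keep unencoded combinations joined" idea can be sketched roughly like this (a hypothetical helper, not the project's actual parsing code):

```python
import unicodedata

def split_characters(graphemes):
    """Split orthography graphemes into single code points where possible,
    but keep base + mark sequences joined when Unicode defines no
    precomposed character for them (hypothetical sketch)."""
    result = []
    for g in graphemes:
        nfc = unicodedata.normalize("NFC", g)
        if len(nfc) == 1:
            result.append(nfc)  # encoded: store as the precomposed code point
        else:
            result.append(g)    # unencoded combination: keep it joined
    return result

print(split_characters(["a\u0303", "g\u0303"]))  # ['ã', 'g̃']
```

Here a + combining tilde collapses to the precomposed ã, while g + combining tilde is stored as the joined two-code-point sequence.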
Just a note that this is also relevant for Kildin Sami (sjd), where some of the following letters of the alphabet are precomposed, but not all of them. (The list is not complete, just for illustration.) А̄ Е̄ Ӣ Э̄ Я̄ Ӯ Ё̄
For the record in this discussion, we've just released
More details in the changelog. We've done some review of orthographies where the data suggested some unencoded combinations had previously gotten dropped, but @meehkal and @moyogo, if you have languages/orthographies in mind where this was an issue, feel free to point them out to us or submit a fix — I hope that the character lists are now stored in a way that is less Unicode-centric and closer to how linguists think about orthographies, which should ease the readability and input of language data.
The intention is to include the combining marks used in the canonical decomposition, as defined by Unicode, of the characters used by an orthography. The current plan (not implemented yet) would work like this:

- `base` contains `š` → this will imply inclusion of `◌̌` (combining caron) in `base_marks`
- `auxiliary` contains `á` → this will imply inclusion of `◌́` (combining acute) in `auxiliary_marks`
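This implied inclusion can be derived mechanically from the canonical decomposition. A minimal sketch (the function name is an assumption, not the project's actual API):

```python
import unicodedata

def implied_marks(characters):
    """Collect the combining marks implied by the canonical (NFD)
    decomposition of the given characters (illustrative sketch)."""
    marks = set()
    for char in characters:
        for cp in unicodedata.normalize("NFD", char):
            if unicodedata.combining(cp):
                marks.add(cp)
    return marks

base = ["\u0161", "a"]  # š and plain a
print(implied_marks(base) == {"\u030c"})  # True: combining caron is implied
```

The same function applied to `auxiliary` would yield the marks for `auxiliary_marks`.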
The reason for this is that some fonts that do not cover code points for combining marks might not be recognized as supporting a language even though the font could be used for the language without any problem thanks to the characters for precomposed combinations. We are not able to evaluate the quality of the mark positioning in a font.
Proposition: in situations where all combinations in `base` are covered without marks, i.e. with precomposed characters only, we could include the combining marks in `auxiliary_marks` only, rather than in `base_marks`. The detection would check, by default, only `base` and `base_marks`. This way, languages would still be detected (without any flag or toggle), and the marks would still be noted.

We think that the inclusion of combining marks in fonts is: a) a more future-proof and better design strategy, and b) useful for technical reasons when precomposed characters get decomposed automatically in some situations.