Does the ilreq ABNF work for consonant clusters that don't form conjuncts? #18

Open
r12a opened this issue Aug 10, 2018 · 4 comments
Labels
i:segmentation Grapheme/word segmentation & selection l:bn Bengali language & script l:hi Hindi, Devanagari script l:ta Tamil language & script question Further information is requested

Comments

@r12a
Contributor

r12a commented Aug 10, 2018

This issue is carried over from an unanswered issue at w3c/ilreq#31

In the following i distinguish between consonant clusters and conjuncts, where the latter involve special shaping and the hiding of the VIRAMA, because that's where the difference lies afaict.

Section 2 of the ilreq doc, Indic orthographic syllable boundaries, contains a set of ABNF rules for indicating syllable boundaries, which are referred to for many applications, such as vertical text, line wrapping, initial-letter styling, etc. The examples include Tamil; however, with the exception of க்ஷ, ஶ்ரீ, and ஸ்ரீ, modern consonant clusters in Tamil don't form conjuncts in the same way as, say, Devanagari or Bengali. Instead, Tamil simply applies a pulli (virama) dot above a consonant that has no following vowel, eg. கேட்டுக்.

Given a Tamil word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the cluster appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this:

[Image A: screenshot, 2017-12-06 at 09:21:15]

or this?

[Image B: screenshot, 2017-12-06 at 09:21:32]

The latter is what the ilreq document currently suggests.

A similar question arises when fonts don't produce certain conjuncts in other scripts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:

[Image C: screenshot, 2017-12-06 at 09:50:50]

or

[Image D: screenshot, 2017-12-06 at 09:51:06]

Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:

[Image E: screenshot, 2017-12-06 at 09:51:18]

UAX#29 currently doesn't produce E for Devanagari (which is what the ilreq doc requires). It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (ie. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or Devanagari when the virama is showing. I'm looking for someone to provide expert advice for what would be expected in those situations.

A reliance on the shape of the text is not described in the ilreq document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should count as one unit the whole of a conjunct sequence such as ஶ்ரீ but not a Tamil consonant cluster such as த்தை.)

@r12a r12a added l:hi Hindi, Devanagari script l:bn Bengali language & script l:ta Tamil language & script labels Aug 10, 2018
@akshatsj
Contributor

akshatsj commented Aug 23, 2018

I am not a native Tamil speaker; however, I have worked on this issue at ICANN's Neo-Brahmi Generation Panel, where similar work was undertaken to identify validation rules for prospective domain name labels. Those validation rules were, in a way, doing syllable boundary identification and enforcing proper akshar formation for Indian-language domain names.
As I see it, a native Tamil speaker would say that there are only two conjuncts in Tamil, ksha and shree, and none apart from these. What they mean by this is that their interpretation of a valid conjunct cluster is limited to these two conjuncts. Any other CHC combination (apart from ksha and shree) is probably CH | C (two separate akshars), and the user expects a cursor to stop, a line to break, and a drop cap to end after the H. My thoughts.

So, the ILreq ABNF may need to be changed to accommodate this requirement.
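
The CH | C rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ilreq ABNF or the UAX#29 algorithm: the function name `segment_tamil` and the `CONJUNCTS` table are hypothetical, the table lists only the three sequences named in this thread, and a "unit" is simply a base character plus any combining marks that follow it (so the pulli stays with the consonant before it).

```python
import unicodedata

# Assumption: only the sequences named in this thread are treated as conjuncts.
CONJUNCTS = (
    "\u0B95\u0BCD\u0BB7",        # ksha: KA + pulli + SSA
    "\u0BB6\u0BCD\u0BB0\u0BC0",  # shri: SHA + pulli + RA + vowel sign II
    "\u0BB8\u0BCD\u0BB0\u0BC0",  # sri:  SA + pulli + RA + vowel sign II
)


def is_mark(ch):
    """True for combining marks (vowel signs, pulli, anusvara, ...)."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def segment_tamil(text):
    """Split text into units, breaking after a pulli (CH | C)
    unless the sequence is one of the listed conjuncts."""
    units = []
    i, n = 0, len(text)
    while i < n:
        # Take a whole conjunct if one starts here, else a single character.
        conj = next((c for c in CONJUNCTS if text.startswith(c, i)), None)
        j = i + len(conj) if conj else i + 1
        # Attach any following combining marks to the same unit.
        while j < n and is_mark(text[j]):
            j += 1
        units.append(text[i:j])
        i = j
    return units


# The word yāvaṟṟaiyum from the first comment, written as code points.
word = "\u0BAF\u0BBE\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8\u0BAF\u0BC1\u0BAE\u0BCD"
print(segment_tamil(word))
```

For yāvaṟṟaiyum this gives six units, with a break directly after the pulli inside the ṟṟ cluster, while shri stays in one piece. Whether a line break is actually allowed at each of these boundaries is a separate question that the sketch does not address.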

@r12a
Contributor Author

r12a commented Aug 23, 2018

And would you say that the same rule applies for Devanagari text such as the examples above, where the virama is explicitly shown? (This is where things become difficult, because the explicit virama may be a side-effect of the font, rather than an encoding difference, but i'd like to see if we can at least first clarify what the user would expect.)

@r12a
Contributor Author

r12a commented Aug 24, 2018

This issue was discussed in a meeting.

  • ACTION: Document some requirements for segmentation using the examples in issue #18 [on Akshat Joshi - due 2018-08-31].
Transcript:
muthu: the segmentation of tamil is per example A in the issue
ri: this means that the ilreq ABNF is not valid for Tamil other than the two conjuncts
... for devanagari we suspect that it may be necessary to segment inside an 'orthographic syllable' if the virama is shown
... this means that the capability of the font is the deciding factor, rather than the encoded text
in summary: for Tamil we're reasonably sure what to do, but not totally sure for devanagari
issue #25 just contains information we should put in the requirements doc
#24
muthu: default should be ascii numerals
https://w3c.github.io/predefined-counter-styles/#tamil-styles
http://r12a.github.io/apps/counterconverter/
1௧ 2௨ 3௩ 4௪ 5௫ 6௬ 7௭ 8௮ 9௯ 10௰ ... 22௨௰௨ ... 222௨௱௨௰௨ ...
https://w3c.github.io/iip/templates/lreq_doc/index.html
#18
akshat: only break deva inside orthographic syllable when the user explicitly breaks the conjunct using ZWNJ
... otherwise continue to keep whole orth syllable together
... can we capture it if the font fails to render a conjunct?
muthu: cursor handling is not related to the font
... if you want to show that this should be a conjunct, but not in font, it's usually handled by the script processor (HarfBuzz etc)
akshat: so the situation with the font cannot be communicated - can only base decisions on text
... this is also relevant for colouring part of a conjunct - currently not possible to tell what the font is doing
... so if the font breaks the conjunct, segment handling goes across cluster
ri: so the segmentation behaviour depends on the script, because tamil is different from devanagari
<scribe> ACTION: Akshat to document some requirements for segmentation using the examples in issue #18
<trackbot> Created ACTION-17 - Document some requirements for segmentation using the examples in issue #18 [on Akshat Joshi - due 2018-08-31].
க்ஷ, ஶ்ரீ , and ஸ்ரீ
śrī , srī
<Akshat> ksha, śrī , srī
<Akshat> ksha, shri, sri
https://r12a.github.io/pickers/tamil/?text=%E0%AE%95%E0%AF%8D%E0%AE%B7%2C%20%E0%AE%B6%E0%AF%8D%E0%AE%B0%E0%AF%80%20%2C%20%E0%AE%B8%E0%AF%8D%E0%AE%B0%E0%AF%80%20
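
Akshat's suggestion in the minutes above (keep a Devanagari orthographic syllable whole unless the user breaks the conjunct with ZWNJ) depends only on the encoded text, so it can be sketched without any knowledge of the font. The Python below is a rough illustration, not the ilreq ABNF: the name `segment_devanagari` is hypothetical, and it makes the simplifying assumption that any letter directly following a virama (or virama + ZWJ) joins the current unit.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def is_mark(ch):
    """True for combining marks (vowel signs, virama, nukta, ...)."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def segment_devanagari(text):
    """Keep consonant + virama + consonant together, but break
    after an explicit virama + ZWNJ. Font behaviour is ignored."""
    units = []
    current = ""
    for ch in text:
        if not current:
            current = ch
        elif is_mark(ch) or ch in (ZWNJ, ZWJ):
            # Marks and joiners stay with the unit they follow.
            current += ch
        elif (current.endswith((VIRAMA, VIRAMA + ZWJ))
              and unicodedata.category(ch) == "Lo"):
            # Assumption: a letter right after a virama continues the cluster.
            current += ch
        else:
            units.append(current)
            current = ch
    if current:
        units.append(current)
    return units


print(segment_devanagari("\u0915\u094D\u0937"))        # KA + virama + SSA
print(segment_devanagari("\u0915\u094D\u200C\u0937"))  # same, with ZWNJ
```

With this rule the first call keeps the cluster as one unit and the second splits it after the ZWNJ. The case the meeting left open, where a font shows the virama without any ZWNJ in the text, is deliberately treated like the first call, because the text alone cannot reveal what the font did.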

@r12a
Contributor Author

r12a commented May 14, 2024

See also w3c/ilreq#31
