Does the ilreq ABNF work for consonant clusters that don't form conjuncts? #18

r12a · 2018-08-10T11:02:14Z

This issue is carried over from an unanswered issue at w3c/ilreq#31

In the following i distinguish between consonant clusters and conjuncts, where the latter involves special shaping and the hiding of the VIRAMA, because that's where the difference lies afaict.

Section 2 of the ilreq doc, Indic orthographic syllable boundaries, contains a set of ABNF rules for indicating syllable boundaries, which are referred to for many applications, such as vertical text, line wrapping, initial-letter styling, etc. The examples include Tamil. However, with the exception of க்ஷ, ஶ்ரீ, and ஸ்ரீ, modern consonant clusters in Tamil don't form conjuncts in the same way as, say, Devanagari or Bengali. Instead, Tamil simply applies a pulli (virama) dot above a consonant that has no following vowel, eg. கேட்டுக்.

Given a Tamil word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the cluster appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this:

A

or this?

B

The latter is what the ilreq document currently suggests.
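To make the cluster-preserving reading concrete, here is a minimal Python sketch that keeps every consonant + pulli + consonant sequence together as one unit. It is a simplified stand-in written for illustration, not the ilreq ABNF itself: the pattern, the name `segment_whole_clusters`, and the choice of character ranges are all assumptions, and anything that is not a Tamil consonant cluster simply falls through as a single character.

```python
import re

# Simplified stand-in for an orthographic-syllable rule (an assumption,
# not the actual ilreq ABNF): keep consonant + pulli + consonant together.
CLUSTER = re.compile(
    "(?:[\u0B95-\u0BB9]\u0BCD)*"  # zero or more consonant + pulli pairs
    "[\u0B95-\u0BB9]"             # the final consonant of the cluster
    "[\u0BBE-\u0BCD]*"            # vowel signs, or a trailing pulli
    "|.",                         # any other character is its own unit
    re.DOTALL,
)

def segment_whole_clusters(text):
    """Split text into units, never breaking after a pulli inside a cluster."""
    return CLUSTER.findall(text)

# yāvaṟṟaiyum, written with escapes so the code points are unambiguous
word = "\u0BAF\u0BBE\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8\u0BAF\u0BC1\u0BAE\u0BCD"
print(" | ".join(segment_whole_clusters(word)))
```

With this rule the geminate ṟṟai stays in one unit, giving five units for the example word.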

A similar question arises when fonts don't produce certain conjuncts in other scripts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:

C

or

D

Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:

E

UAX#29 currently doesn't produce E for Devanagari (which is what the ilreq doc requires). It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (ie. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or Devanagari when the virama is showing. I'm looking for someone to provide expert advice for what would be expected in those situations.

A reliance on the shape of the text is not described in the ilreq document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should count as one unit the whole of a conjunct sequence such as ஶ்ரீ but not a Tamil consonant cluster such as த்தை.)

akshatsj · 2018-08-23T06:18:33Z

I am not a native Tamil speaker, but I have worked on this issue at ICANN's Neo-Brahmi Generation Panel, where similar work was undertaken to identify validation rules for prospective domain name labels. Those validation rules were, in effect, doing syllable boundary identification and enforcing proper akshar formation in Indian-language domain names.
As I see it, a native Tamil speaker would say that there are only two conjuncts in Tamil, ksha and shree, and no others. By this they mean that their interpretation of a valid conjunct cluster is limited to these two. Any other CHC combination (apart from ksha and shree) is probably CH | C (two separate akshars), and they would expect a cursor to stop, a line to break, and a drop cap to end after the H. My thoughts.

So, the ILreq ABNF may need to be changed to accommodate this requirement.
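As a rough illustration of the CH | C behaviour described above, here is a minimal Python sketch. It assumes that only the sequences named in this thread (ksha, and shri/sri with either SHA or SA) stay together across a pulli, and that every other pulli ends a unit. The function name `segment_tamil` and the `CONJUNCTS` table are hypothetical, not taken from the ilreq document or UAX #29.

```python
import unicodedata

# Sequences that are kept whole across a pulli (U+0BCD). This list is an
# assumption based on the conjuncts named in this thread.
CONJUNCTS = (
    "\u0B95\u0BCD\u0BB7",  # KA + pulli + SSA (ksha)
    "\u0BB6\u0BCD\u0BB0",  # SHA + pulli + RA (start of shri)
    "\u0BB8\u0BCD\u0BB0",  # SA + pulli + RA (start of sri)
)

def segment_tamil(text):
    """Split text into units, breaking after a pulli except inside CONJUNCTS."""
    units = []
    i = 0
    while i < len(text):
        for conjunct in CONJUNCTS:
            if text.startswith(conjunct, i):
                j = i + len(conjunct)
                break
        else:
            j = i + 1
        # Attach following combining marks (vowel signs, pulli) to the unit.
        while j < len(text) and unicodedata.category(text[j]) in ("Mn", "Mc"):
            j += 1
        units.append(text[i:j])
        i = j
    return units

# yāvaṟṟaiyum, written with escapes so the code points are unambiguous
word = "\u0BAF\u0BBE\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8\u0BAF\u0BC1\u0BAE\u0BCD"
print(" | ".join(segment_tamil(word)))
```

Here the geminate ṟṟai is split after the pulli, giving six units for the example word, while shri remains a single unit.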

r12a · 2018-08-23T14:19:18Z

And would you say that the same rule applies for devanagari text such as the examples above, where the virama is explicitly shown? (This is where things become difficult, because the explicit virama may be a side-effect of the font, rather than an encoding difference, but i'd like to see if we can at least first clarify what the user would expect.)

r12a · 2018-08-24T07:12:20Z

This issue was discussed in a meeting.

ACTION: Document some requirements for segmentation using the examples in issue #18 [on Akshat Joshi - due 2018-08-31].

View the transcript

muthu: the segmentation of tamil is per example a in the issue
ri: this means that the ilreq ABNF is not valid for Tamil other than the two conjuncts
... for devanagari we suspect that it may be necessary to segment inside an 'orthographic syllable' if the virama is shown
... this means that the capability of the font is the deciding factor, rather than the encoded text
in summary: for Tamil we're reasonably sure what to do, but not totally sure for devanagari
issue #25 just contains information we should put in the requirements doc
#24
muthu: default should be ascii numerals
https://w3c.github.io/predefined-counter-styles/#tamil-styles
http://r12a.github.io/apps/counterconverter/
1௧ 2௨ 3௩ 4௪ 5௫ 6௬ 7௭ 8௮ 9௯ 10௰ ... 22௨௰௨ ... 222௨௱௨௰௨ ...
https://w3c.github.io/iip/templates/lreq_doc/index.html
#18
akshat: only break deva inside orthographic syllable when the user explicitly breaks the conjunct using ZWNJ
... otherwise continue to keep whole orth syllable together
... can we capture it if the font fails to render a conjunct?
muthu: cursor handling is not related to the font
... if you want to show that this should be a conjunct, but not in font, it's usually handled by the script processor (HarfBuzz etc)
akshat: so the situation with the font cannot be communicated - can only base decisions on text
... this is also relevant for colouring part of a conjunct - currently not possible to tell what the font is doing
... so if the font breaks the conjunct, segment handling goes across cluster
ri: so the segmentation behaviour depends on the script, because tamil is different from devanagari
<scribe> ACTION: Akshat to document some requirements for segmentation using the examples in issue #18
<trackbot> Created ACTION-17 - Document some requirements for segmentation using the examples in issue #18 [on Akshat Joshi - due 2018-08-31].
க்ஷ, ஶ்ரீ , and ஸ்ரீ
śrī , srī
<Akshat> ksha, śrī , srī
<Akshat> ksha, shri, sri
https://r12a.github.io/pickers/tamil/?text=%E0%AE%95%E0%AF%8D%E0%AE%B7%2C%20%E0%AE%B6%E0%AF%8D%E0%AE%B0%E0%AF%80%20%2C%20%E0%AE%B8%E0%AF%8D%E0%AE%B0%E0%AF%80%20
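
akshat's suggestion in the transcript (keep the whole Devanagari orthographic syllable together unless the user explicitly breaks the conjunct with ZWNJ) can be sketched in Python as below. This is only an illustration under stated assumptions: the pattern is a simplified approximation rather than the ilreq ABNF, the name `segment_devanagari` is hypothetical, and attaching the ZWNJ to the unit it follows is a choice made here for convenience. Because it looks only at the encoded text, it cannot detect a font that fails to render a conjunct.

```python
import re

# Simplified approximation of a Devanagari orthographic syllable
# (an assumption, not the ilreq ABNF).
SYLLABLE = re.compile(
    "(?:[\u0915-\u0939]\u093C?\u094D)*"  # consonant (+ nukta) + virama, repeated
    "[\u0915-\u0939]\u093C?"             # final consonant (+ nukta)
    "[\u0900-\u0903\u093E-\u094D]*"      # vowel signs, modifiers, or a trailing virama
    "\u200C?"                            # an explicit ZWNJ stays with the unit it follows
    "|.",                                # any other character is its own unit
    re.DOTALL,
)

def segment_devanagari(text):
    """Split text into units; a ZWNJ after a virama ends the unit there."""
    return SYLLABLE.findall(text)

joined = "\u0915\u094D\u0937"        # KA + virama + SSA
broken = "\u0915\u094D\u200C\u0937"  # KA + virama + ZWNJ + SSA
print(len(segment_devanagari(joined)), len(segment_devanagari(broken)))
```

The first sequence comes back as one unit and the second as two, with the break falling after the ZWNJ.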

r12a · 2024-05-14T12:39:39Z
