When does the ABNF work for Tamil consonant clusters? #31

r12a · 2017-04-12T19:43:22Z

The document largely gives the impression that the ABNF rules indicate what must be kept together for "text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation".

However, is that true for Tamil? Consonant clusters in Tamil don't interact with left-positioned vowel signs in the same way as Devanagari or Bengali conjuncts. Here are some examples i took from the UDHR.

In these words the left-positioned vowel sign appears between the two consonants in a cluster:
யாவற்றையும்
yāvaṟṟaiyum

கௌரவத்தையும்
kauravattaiyum

அசிரத்தையும் அவற்றை
acirattaiyum avaṟṟai

ஏற்கப்பெற்று
ēṟkappeṟṟu

எல்லோரும்
ellōrum

In these the vowel shaping interacts only with the final consonant:
செயல்களுக்கு
ceyalkaḷukku

கேட்டுக்
kēṭṭuk

The table of examples of the ABNF doesn't include this type of cluster, only conjuncts such as க்ஷ, ஶ்ரீ, and ஸ்ரீ, which are special because they ligate.

So, given examples such as those in the list above, is it or is it not normal to keep consonant clusters together in Tamil for text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation?

miloush · 2017-04-24T16:51:45Z

What do you mean by "normal" (and is "normal" relevant)? Also, what do you mean by "must" - that rendering would be broken, or that the text would feel strange to the reader?
How far back in time are we considering?

I have seen drop caps made of syllables, or of just the consonant of a syllable. I have seen vertical text set by syllables. I don't think I have seen a drop cap made of only the vowel mark, but I wouldn't be all that shocked by one.

Let me point out that there was a script reform in the late seventies, prior to which consonants interacted with left-positioned vowel signs, mostly with AI. Some fonts still use those ligatures, or offer them as historic ligatures.

Richard57 · 2017-04-26T22:48:20Z

As far as I can tell, the ABNF works for Tamil in Tamil script when there is no pulli (U+0BCD TAMIL SIGN VIRAMA) in sight. You can get a flavour of how Tamils feel about their script from TACE16 (a.k.a. TUNE). See the invective at https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding. The only conjuncts I am aware of are those involving <kṣ> க்ஷ <U+0B95, U+0BCD, U+0BB7> and 'shri' ஸ்ரீ <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Otherwise, U+0BCD terminates an orthographic syllable.
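For reference, the code point sequences quoted above can be inspected with a few lines of Python. This is only an illustrative sketch: the dictionary keys and the `describe` helper are names made up here, not anything taken from the ilreq document.

```python
import unicodedata

# U+0BCD is the pulli, named TAMIL SIGN VIRAMA in Unicode.
PULLI = "\u0BCD"

# The two conjunct sequences quoted in the comment above.
# The keys are hypothetical labels chosen for this sketch.
conjuncts = {
    "kssa": "\u0B95\u0BCD\u0BB7",
    "shri": "\u0BB8\u0BCD\u0BB0\u0BC0",
}

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for label, seq in conjuncts.items():
    print(label, seq, describe(seq))
```

In both sequences the pulli sits between the consonants in the encoded text, even though it is not visible once the conjunct is rendered.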

Tamil seems to be a good example of an abugida as a neosyllabary.

The ABNF, which doesn't even work for Sanskrit in Devanagari, also fails massively for varga-distinguishing Sanskrit in Tamil script. Subscript or superscript numbers are used to distinguish the 4 plosive vargas, for which there is mostly only a single letter in Tamil. For examples of this scheme, one can look at http://sanskritdocuments.org/tamil/by-category/krishna.php.

r12a · 2017-04-27T14:35:32Z

> As far as I can tell, the ABNF works for Tamil in Tamil script when there is no pulli (U+0BCD TAMIL SIGN VIRAMA) in sight.

Yes. In my comment i tried to distinguish between consonant clusters and conjuncts, where the latter involves special shaping and the hiding of the VIRAMA, because i assumed that that's where the difference lies.

If this is an appropriate distinction for application of the ABNF rules, however, there is presumably a problem, since if one were to apply a font to Tamil that contains shaping based on the older forms of the script (mentioned by @miloush), the ABNF would be relevant for sequences of characters for which it wasn't relevant before.

Such a reliance on the shape of the text is not described in the document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should count as one unit the whole of a conjunct sequence such as ஶ்ரீ but not a Tamil consonant cluster such as த்தை.)

Richard57 · 2017-04-27T18:23:26Z

@miloush only mentions ligatures of vowels and consonants. The reason that they might be relevant is a natural reluctance to break a ligature.

I believe the potential problem with fonts is more likely to apply to Devanagari, where the deliberate appearance of a halant should normally signal the end of an orthographic syllable, than to Tamil. It is not for nothing that UAX#29 cautions that the tailoring of grapheme clusters may be font dependent. Malayalam may be an interesting study in this regard.

r12a · 2017-12-06T09:54:27Z

Let me try to make my question clearer. It is only about situations where the pulli is visible.

Given a word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the conjunct appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this?

A

or this?

B

The latter is what the ilreq document currently suggests.

A similar question arises when fonts don't produce certain conjuncts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:

C

or

D

Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:

E

miloush · 2017-12-07T11:30:28Z

@r12a your Tamil example is only interesting because the doubled consonant results in a phonetic change, but I don't see any reason why B should be preferred over A. Even the script supports A, as otherwise you would expect the ai sign to be in front of the first ṟ.

Note that you can really find pretty much any breaking for vertical text around:

In my experience, A is consistent with caret stops when editing documents. Either way, is there a reason not to just follow/refer to UAX#29 Unicode Text Segmentation?

I don't have enough experience with Devanagari, but from a technical point of view C makes more sense to me, especially if there is a ZWNJ.

r12a · 2017-12-07T12:38:21Z

> the doubled consonant results in a phonetic change, but I don't see any reason why B should be preferred over A

Well, yes, that's exactly my point. :) The ilreq document currently suggests that only B is correct, and i'm asking whether that is true.

> is there a reason to not just follow/refer UAX#29 Unicode Text Segmentation?

UAX#29 currently doesn't produce E for Devanagari (which is what the ilreq doc requires). It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (i.e. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or Devanagari when the virama is showing. I'm looking for someone to provide expert advice on what would be expected in those situations.
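To make the two behaviours concrete, here is a rough Python sketch applied to யாவற்றையும். It is not an implementation of UAX#29 or of the ilreq ABNF: the `segment` function and its `join_after_pulli` flag are hypothetical names for this illustration, and the rule is deliberately simplified (combining marks attach to the preceding base, and the flag decides whether a consonant following a pulli stays in the same unit).

```python
import unicodedata

PULLI = "\u0BCD"  # U+0BCD TAMIL SIGN VIRAMA

def is_mark(ch):
    # Vowel signs and the pulli have General_Category Mn or Mc.
    return unicodedata.category(ch) in ("Mn", "Mc")

def segment(text, join_after_pulli):
    """Split text into units under a simplified, assumed rule set."""
    units = []
    for ch in text:
        if units and is_mark(ch):
            # A combining mark always stays with its base.
            units[-1] += ch
        elif units and join_after_pulli and units[-1].endswith(PULLI):
            # Treat the pulli as linking to the next consonant.
            units[-1] += ch
        else:
            units.append(ch)
    return units

# yāvaṟṟaiyum, written with escapes so the code points are explicit.
word = "\u0BAF\u0BBE\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8\u0BAF\u0BC1\u0BAE\u0BCD"

print(segment(word, join_after_pulli=False))  # pulli ends a unit
print(segment(word, join_after_pulli=True))   # whole cluster is one unit
```

With the flag off, the visible pulli ends a unit and ṟ + ṟai fall into separate units; with the flag on, ṟṟai becomes a single unit, which is the cluster-as-a-unit behaviour whose suitability for Tamil is being questioned here.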

Richard57 · 2017-12-08T00:43:14Z

If the visible viramas in C are all produced by ZWNJ, then the grapheme cluster boundaries will remain as the breaks in C.
However, CLDR is not the right place to preserve A; Tamil pulli should be removed from the category of virama. I believe A is also appropriate for Sanskrit in Tamil script, but do we expect browsers to look up a Sanskrit locale for the rendering of shlokas? Tamil K.SSA and SH.RII are problems, for their consonants do belong in the same grapheme cluster.
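The ZWNJ point can be sketched in the same simplified way. This is an assumption-laden illustration, not the UAX#29 algorithm: the `segment` helper is hypothetical, it treats the virama as linking to a following consonant, and it lets ZWNJ attach to the preceding unit so that the unit no longer ends in a virama.

```python
import unicodedata

VIRAMA = "\u094D"  # U+094D DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # U+200C ZERO WIDTH NON-JOINER

def segment(text):
    """Simplified clustering: virama links consonants unless ZWNJ intervenes."""
    units = []
    for ch in text:
        # Combining marks and ZWNJ extend the preceding unit.
        attaches = ch == ZWNJ or unicodedata.category(ch) in ("Mn", "Mc")
        if units and (attaches or units[-1].endswith(VIRAMA)):
            units[-1] += ch
        else:
            units.append(ch)
    return units

plain = "\u0915\u094D\u0937"          # KA, VIRAMA, SSA
blocked = "\u0915\u094D\u200C\u0937"  # KA, VIRAMA, ZWNJ, SSA

print(segment(plain))    # one unit
print(segment(blocked))  # two units: the ZWNJ keeps SSA separate
```

Under these assumptions the ZWNJ sequence keeps a boundary after the visible virama, while the plain sequence is a single unit.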

r12a · 2018-08-10T11:03:12Z

I am closing this issue in favour of w3c/iip#18

r12a added the question label Apr 12, 2017

r12a changed the title ~~When does the ABNF work for Tamil conjuncts?~~ When does the ABNF work for Tamil consonant clusters? Apr 12, 2017

r12a mentioned this issue Jul 24, 2017

When does the ABNF work for Tamil consonant clusters? w3c/i18n-activity#466

Closed

r12a mentioned this issue Aug 10, 2018

Does the ilreq ABNF work for consonant clusters that don't form conjuncts? w3c/iip#18

Open

r12a closed this as completed Aug 10, 2018

r12a mentioned this issue Aug 10, 2018

Tamil: 2.8 Text boundaries & selection w3c/iip#20

Open
