When does the ABNF work for Tamil consonant clusters? #31

Closed
r12a opened this issue Apr 12, 2017 · 9 comments

@r12a
Contributor

r12a commented Apr 12, 2017

The document largely gives the impression that the ABNF rules indicate what must be kept together for "text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation".

However, is that true for Tamil? Consonant clusters in Tamil don't interact with left-positioned vowel signs in the same way as Devanagari or Bengali conjuncts. Here are some examples i took from the UDHR.

  1. in these words the left-positioned vowel appears between the two consonants in a cluster:

     யாவற்றையும்
     yāvaṟṟaiyum

     கௌரவத்தையும்
     kauravattaiyum

     அசிரத்தையும் அவற்றை
     acirattaiyum avaṟṟai

     ஏற்கப்பெற்று
     ēṟkappeṟṟu

     எல்லோரும்
     ellōrum

  2. in these the vowel shaping interacts only with the final consonant:

     செயல்களுக்கு
     ceyalkaḷukku

     கேட்டுக்
     kēṭṭuk

The table of examples of the ABNF doesn't include this type of cluster, only conjuncts such as க்ஷ, ஶ்ரீ, and ஸ்ரீ, which are special because they ligate.

So, given examples such as those in the list above, is it or is it not normal to keep consonant clusters together in Tamil for text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation?

r12a added the question label Apr 12, 2017
r12a changed the title from "When does the ABNF work for Tamil conjuncts?" to "When does the ABNF work for Tamil consonant clusters?" Apr 12, 2017
@miloush

miloush commented Apr 24, 2017

What do you mean by "normal" (and is "normal" relevant)? Also what do you mean by "must" - that rendering would be broken, or that the text would feel strange to the reader?
What timeframe are we considering to the past?

I have seen drop caps made of syllables, or just the consonant of a syllable. I have seen vertical text by syllables. I think I haven't seen a drop cap of the vowel mark only, but wouldn't be that much shocked.

Let me point out that there was a script reform in the late seventies prior to which consonants were interacting with left-positioned vowel signs, mostly with AI. Some fonts are still using those ligatures, or offer them as historic ligatures.

@Richard57

As far as I can tell, the ABNF works for Tamil in Tamil script when there is no pulli (U+0BCD TAMIL SIGN VIRAMA) in sight. You can get a flavour of how Tamils feel about their script from TACE16 (a.k.a. TUNE). See the invective at https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding. The only conjuncts I am aware of are those involving <kṣ> க்ஷ <U+0B95, U+0BCD, U+0BB7> and 'shri' ஸ்ரீ <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Otherwise, U+0BCD terminates an orthographic syllable.
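As a quick check of the two sequences cited above, the following Python sketch lists the code points and character names of each conjunct using only the standard `unicodedata` module. The constant and function names are illustrative and are not taken from any document under discussion.

```python
import unicodedata

# The two conjunct sequences cited in the comment above.
KSSA = "\u0B95\u0BCD\u0BB7"         # க்ஷ
SHRI = "\u0BB8\u0BCD\u0BB0\u0BC0"   # ஸ்ரீ


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, seq in (("kssa", KSSA), ("shri", SHRI)):
    print(label, describe(seq))
```

In both sequences the second character is U+0BCD TAMIL SIGN VIRAMA, the pulli, even though it is not visible when the conjunct ligates.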

Tamil seems to be a good example of an abugida as a neosyllabary.

The ABNF, which doesn't even work for Sanskrit in Devanagari, also fails massively for varga-distinguishing Sanskrit in Tamil script. Subscript or superscript numbers are used to distinguish the 4 plosive vargas, for which there is mostly only a single letter in Tamil. For examples of this scheme, one can look at http://sanskritdocuments.org/tamil/by-category/krishna.php.

@r12a
Contributor Author

r12a commented Apr 27, 2017

> As far as I can tell, the ABNF works for Tamil in Tamil script when there is no pulli (U+0BCD TAMIL SIGN VIRAMA) in sight.

Yes. In my comment i tried to distinguish between consonant clusters and conjuncts, where the latter involves special shaping and the hiding of the VIRAMA, because i assumed that that's where the difference lies.

If this is an appropriate distinction for application of the ABNF rules, however, there is presumably a problem, since if one were to apply a font to Tamil that contains shaping based on the older forms of the script (mentioned by @miloush), the ABNF would be relevant for sequences of characters for which it wasn't relevant before.

Such a reliance on the shape of the text is not described in the document, which i think is problematic. (It's also problematic for the general concept of grapheme clusters in Unicode, which should count as one unit the whole of a conjunct sequence such as ஶ்ரீ but not a Tamil consonant cluster such as த்தை.)

@Richard57

@miloush only mentions ligatures of vowels and consonants. The reason that they might be relevant is a natural reluctance to break a ligature.

I believe the potential problem on fonts is more likely to apply to Devanagari, where the deliberate appearance of a halant should normally signal the end of an orthographic syllable, than to Tamil. It is not for nothing that UAX#29 cautions that the tailoring of grapheme clusters may be font dependent. Malayalam may be an interesting study in this regard.

@r12a
Contributor Author

r12a commented Dec 6, 2017

Let me try to make my question clearer. It is only about situations where the pulli is visible.

Given a word such as யாவற்றையும் (yāvaṟṟaiyum), should the break points for text segmentation, line breaking, drop letter (if the conjunct appeared at the start of the text), letter spacing in horizontal text, and vertical text representation conform to this?

[Image A: screenshot of candidate break points]

or this?

[Image B: screenshot of candidate break points]

The latter is what the ilreq document currently suggests.
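The two candidate behaviours can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the `segment` function and its `join_after_pulli` flag are hypothetical names, the consonant range check is a simplification, and this is neither the ilreq ABNF nor the UAX#29 algorithm. With the flag off, a visible pulli ends a unit; with the flag on, a consonant that follows a pulli is kept with the preceding unit, which is the cluster-preserving behaviour attributed to the ilreq document in this thread.

```python
import unicodedata

PULLI = "\u0BCD"  # TAMIL SIGN VIRAMA


def is_tamil_consonant(ch):
    # Simplified assumption: Tamil consonant letters lie in U+0B95..U+0BB9.
    return "\u0B95" <= ch <= "\u0BB9"


def segment(text, join_after_pulli):
    """Split text into units; combining marks always attach to their base.

    If join_after_pulli is True, a consonant that directly follows a pulli
    is also attached, so a whole consonant cluster stays in one unit.
    """
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins = (
            join_after_pulli
            and bool(units)
            and units[-1].endswith(PULLI)
            and is_tamil_consonant(ch)
        )
        if units and (is_mark or joins):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# யாவற்றையும் (yāvaṟṟaiyum), written with escapes to make the order explicit.
word = "\u0BAF\u0BBE\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8\u0BAF\u0BC1\u0BAE\u0BCD"

print(" | ".join(segment(word, join_after_pulli=False)))
print(" | ".join(segment(word, join_after_pulli=True)))
```

For யாவற்றையும் the first call yields six units, with ற் and றை separate, and the second yields five, with ற்றை kept together.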

A similar question arises when fonts don't produce certain conjuncts, for one reason or another, or where a ZWNJ is added to prevent a conjunct forming. Where are the break points for the following? Are they:

[Image C: screenshot of candidate break points]

or

[Image D: screenshot of candidate break points]

Given that for a more typical rendering of the text the break points, as described in the ilreq doc, would be:

[Image E: screenshot of break points for a more typical rendering]

@miloush

miloush commented Dec 7, 2017

@r12a your Tamil example is only interesting because the doubled consonant results in a phonetic change, but I don't see any reason why B should be preferred over A. Even the script supports A, as otherwise you would expect the ai sign to be in front of the first .

Note that you can really find pretty much any breaking for vertical text around:
த்தகங்கள்

A is consistent with caret stops when editing documents from my experience. Either way, is there a reason to not just follow/refer to UAX#29 Unicode Text Segmentation?

I don't have enough experience with Devanagari, but from a technical point of view C makes more sense to me, especially if there is a ZWNJ.

@r12a
Contributor Author

r12a commented Dec 7, 2017

> the doubled consonant results in a phonetic change, but I don't see any reason why B should be preferred over A

Well, yes, that's exactly my point. :) The ilreq document currently suggests that only B is correct, and i'm asking whether that is true.

> is there a reason to not just follow/refer UAX#29 Unicode Text Segmentation?

UAX#29 currently doesn't produce E for Devanagari (which is what the ilreq doc requires). It produces something more like C. But UAX#29 is about to change, so that by default a whole consonant cluster will be seen as a unit (ie. E). The effect of that upcoming change is not completely clear, however, for scripts like Tamil, or Devanagari when the virama is showing. I'm looking for someone to provide expert advice for what would be expected in those situations.

@Richard57

If the visible viramas in C are all produced by ZWNJ, then the grapheme cluster boundaries will remain as the breaks in C.
However, CLDR is not the right place to preserve A; Tamil pulli should be removed from the category of virama. I believe A is also appropriate for Sanskrit in Tamil script, but do we expect browsers to look up a Sanskrit locale for the rendering of shlokas? Tamil K.SSA and SH.RII are problems, for their consonants do belong in the same grapheme cluster.
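The point about ZWNJ can be sketched in Python for a Devanagari sequence. This is a simplified illustration, not the UAX#29 rules: the `clusters` function is a hypothetical name, the consonant range is an approximation, and the only behaviour modelled is that a consonant joins the previous unit when it directly follows a virama, so an intervening ZWNJ leaves the boundary in place.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def is_devanagari_consonant(ch):
    # Simplified assumption: consonant letters lie in U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"


def clusters(text):
    """Group text into units, keeping virama-linked consonants together.

    Marks and ZWNJ attach to the preceding unit. A consonant attaches only
    if the unit so far ends in a virama, so a ZWNJ after the virama
    prevents the join.
    """
    units = []
    for ch in text:
        attaches = ch == ZWNJ or unicodedata.category(ch).startswith("M")
        conjoins = (
            bool(units)
            and units[-1].endswith(VIRAMA)
            and is_devanagari_consonant(ch)
        )
        if units and (attaches or conjoins):
            units[-1] += ch
        else:
            units.append(ch)
    return units


plain = "\u0915\u094D\u0937"            # KA, VIRAMA, SSA
with_zwnj = "\u0915\u094D\u200C\u0937"  # KA, VIRAMA, ZWNJ, SSA

print(len(clusters(plain)), len(clusters(with_zwnj)))
```

Under these assumptions the plain sequence stays in one unit, while the ZWNJ version splits into two, with the break falling after the ZWNJ.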

@r12a
Contributor Author

r12a commented Aug 10, 2018

I am closing this issue in favour of w3c/iip#18
