[css-text] Should zero width space break Arabic shaping? #3861

frivoal · 2019-04-22T02:56:23Z

This is probably more of a Unicode issue than a CSS issue, but we have a fair number of people involved with text layout and i18n over here, so I'm filing it here first to figure out whether we should take it to Unicode or not.

When writing web-platform-tests/wpt#14673, I had misread the Unicode standard, and thought that ZERO WIDTH SPACE was supposed to break Arabic shaping, based on a table that said "all spacing characters" do so. But there's a distinction between "spacing characters" and "space characters", and ZERO WIDTH SPACE is part of the latter, not the former.

https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt gives further details about which character does what to shaping, and classifies ZERO WIDTH SPACE as T (transparent), which neither forces nor breaks shaping, and just behaves as if it wasn't there for shaping purposes.
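To make "transparent" concrete, here is a minimal Python sketch. It is a deliberately simplified model, not the full joining algorithm from the standard: it only asks whether two dual-joining letters still connect given the joining types of whatever sits between them, and it only understands types T and U. The function name `connects_across` is made up for illustration.

```python
def connects_across(between):
    """Return True if two dual-joining letters still connect when the
    characters between them have the given joining types.

    Simplified model (an assumption of this sketch): only T and U are
    handled. T (transparent) is skipped as if absent; U (non-joining)
    interrupts the connection.
    """
    for joining_type in between:
        if joining_type == "T":
            continue          # transparent: ignored for joining purposes
        if joining_type == "U":
            return False      # non-joining: the connection is interrupted
        raise ValueError(f"joining type not modelled here: {joining_type!r}")
    return True


# With ZWSP classified as T, the letters on either side still join.
print(connects_across(["T"]))
# A character of type U in the same position would interrupt joining.
print(connects_across(["U"]))
```

Under this model, the whole question of this issue reduces to whether ZWSP contributes a "T" or a "U" to that list.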

So Unicode has a definite answer as to what's supposed to happen, but several people in the thread about my tests were surprised by that answer (including @behdad, @r12a, and myself), because ZERO WIDTH SPACE is used as a word divider, and that suggests it ought to be breaking shaping. @r12a brought up nastaliq as a reasonable use case, because:

when using nastaliq script, esp. in Urdu, inter-word spaces are often not applied, because words are separated enough by the arrangement of glyphs along the sloping baselines. If you do, however, want to indicate word boundaries in those situations without unsightly spacing, using ZWSP seems to be an obvious way of doing so.

So, what do we collectively think? Is Unicode likely enough to be mistaken that we should raise this issue with them? Is there a known good reason for why things are the way they are?

behdad · 2019-04-22T04:09:18Z

It feels to me that the intention / rationale behind this was to be a character that adds a word- (and hence possible line-) break without changing any other behavior.

frivoal · 2019-04-22T23:27:48Z

Does it make sense to add a word-break without affecting shaping? That sounds counter intuitive to me.

frivoal · 2019-04-23T05:15:39Z

@fantasai I've seen you make comments in various places about ZWSP that implied that you thought it ought to break shaping (here's just one), so I'd be interested in your feedback on this.

ntounsi · 2019-04-23T12:45:11Z

https://unicode.org/reports/tr44/#Release_Stability (Date 2019-02-27)

says §2.3.1 "Updates to character properties [...] may be required [...] to change the assigned values for a property".
[...]
"For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking"

It follows that, as a Format character, ZWS can also serve "to indicate word boundaries" as raised by @r12a.
Is ZWS still a "risky" character for a stable implementation?

@frivoal, I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?

behdad · 2019-04-23T12:50:42Z

@frivoal, I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?

Read the file header re missing values.

frivoal · 2019-04-23T12:50:44Z

I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?

It's not listed explicitly, but it is covered by the generic rule:

Note: Code points that are not explicitly listed in this file are
either of joining type T or U:

Those that are not explicitly listed and that are of General Category Mn, Me, or Cf
have joining type T.

ZWS is in category Cf.
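For anyone who wants to check this quickly, the fallback rule quoted above can be sketched in Python with the standard `unicodedata` module. This is only an approximation: the `EXPLICIT` table below is a tiny hand-copied stand-in for ArabicShaping.txt (just the ZWNJ and ZWJ entries), so the function will give wrong answers for any other explicitly listed character, such as Arabic letters. The names `EXPLICIT` and `joining_type` are made up for this example.

```python
import unicodedata

# Stand-in for the explicit entries in ArabicShaping.txt. Only two entries
# are reproduced here; the real file lists many more (assumption: anything
# not in this dict is treated as "not explicitly listed").
EXPLICIT = {
    0x200C: "U",  # ZERO WIDTH NON-JOINER
    0x200D: "C",  # ZERO WIDTH JOINER
}


def joining_type(ch):
    """Joining type of ch: explicit entry if known, else the default rule."""
    code_point = ord(ch)
    if code_point in EXPLICIT:
        return EXPLICIT[code_point]
    # Default rule from the file header: unlisted Mn, Me, or Cf characters
    # are T (transparent); every other unlisted code point is U.
    if unicodedata.category(ch) in {"Mn", "Me", "Cf"}:
        return "T"
    return "U"


for ch in ("\u200b", "\u00ad", "\u200c", " "):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), joining_type(ch))
```

ZWSP and SOFT HYPHEN both fall through to the default rule, while ZWNJ gets its type from an explicit entry even though it is also in category Cf.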

behdad · 2019-04-23T12:56:31Z

cc @roozbehp again, since I think he can articulate the reasoning behind this best.

Does it make sense to add a word-break without affecting shaping? That sounds counter intuitive to me.

My understanding is that ZWSP is inherently a line-break control character. So it's misleadingly named at best. It's closer to SOFT HYPHEN, than to a space. The difference from SOFT HYPHEN seems to be that this one is not expected to turn into a hyphen if line break does happen. I think the original use case was to be used with scripts that don't use inter-word spaces, to mark line break opportunities. It feels to me that ZWSP and SOFT HYPHEN should have been one character, to mark "line break allowed".

Another use of ZWSP is to mark break opportunities / word boundaries in concatenated words like long URLs or hashtags like "justanotherawesomelylongurl". The idea is that whether or not you mark a location as a line-break opportunity should be separate from whether Arabic shaping happens. So, if Arabic shaping is not desired, one should use a ZWNJ to control that, separately from ZWSP.

Another explanation is that many characters, like ZWSP, only affect one aspect of Unicode processing and are ignored for all other processes. This is done to manage complexity: instead of having to specify the behavior of each control / format character on every process on every script, this can be specified independent of scripts for the most part.

Anyway. Just my understanding. As I said, I was also surprised by this, but I understand what the rationale / thinking has been.

behdad · 2019-04-23T12:56:55Z

I'm also not convinced this ZWSP should be used with Nastaliq...

r12a · 2019-04-23T17:03:09Z

I'm also not convinced this ZWSP should be used with Nastaliq...

I could be convinced that you're right there. It certainly doesn't seem a good idea given what TUS expects wrt joining behaviour.

I note, however, that Firefox and Edge both break the cursive joining, although Chrome and Safari don't. See https://r12a.github.io/pickers/urdu/?text=%DB%81%D8%B1%E2%80%8B%D8%B4%D8%AE%D8%B5%E2%80%8B%DA%A9%D9%88%E2%80%8B%D8%A7%D8%B3%E2%80%8B%D8%A8%D8%A7%D8%AA%E2%80%8B%DA%A9%D8%A7%E2%80%8B%D8%AD%D9%82 for an example.

Note further, however, that a difference between ZWSP and soft hyphen is that ZWSP is expected to expand when letter spacing or justification is applied to a line. This makes it not quite a simple line-break opportunity indicator. (TUS describes it as "indicates a word break or line break opportunity" and goes on to describe how justification algorithms are likely to add space on p872 of TUS version 12.) I'm wondering now how that fits with justification of cursive scripts.... Would it be expected to create kashida-like behaviour when justified, but shrink back to nothing otherwise (unlike tatweel)?

behdad · 2019-04-23T17:06:24Z

I note, however, that Firefox and Edge both break the cursive joining, although Chrome and Safari don't.

I'm fairly sure Firefox breaks it just because the font doesn't have ZWSP. If you use an Arabic font that does have ZWSP I expect Firefox to NOT break shaping. Quite possibly the same about Safari.

You're right re justification. At this point I can't make up my mind about the exact intended use of ZWSP anymore.

Richard57 · 2019-04-24T01:25:26Z

The text on justification and ZWSP in TUS is on p872, in Volume 23. I see no difference in behaviour between ZWSP and soft hyphen for justification; if the line-break opportunity is not taken, then both characters are simply ignored; justification behaves as though they were not present. The issue seems to be mentioned for ZWSP because someone once thought that ZWSP suppressed the expansion of inter-character spacing.

A rendering difference between ZWSP and soft hyphen is that when the line break opportunity is taken, a soft hyphen has hyphenation effects, such as the appearance of a hyphen and changes in spelling. (Some varieties of Thai use hyphenation with hyphens.) Another difference is that ZWSP marks a word boundary whereas a soft hyphen has no such effect; this matters for spell checkers.

jfkthame · 2019-05-02T17:41:57Z

I'm also not convinced this ZWSP should be used with Nastaliq...

Agreed. Nastaliq text (just like the same text if it were written in a "simple" Arabic font) would use U+0020 for inter-word spaces. A Nastaliq font might give its space glyph a very narrow (or even zero) default advance width, but it's still U+0020.

r12a · 2019-05-02T17:50:17Z

So does anyone know of any circumstances in which it would make sense to use ZWSP inside a word or sequence of characters in a script that is cursive?

Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)

For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?

behdad · 2019-05-02T17:57:01Z

Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)

As someone pointed out, ZWSP doesn't have any justification behavior either. It's as if it wasn't there. The clarification in Unicode saying that it might stretch simply means that it might stretch the same way as it might without ZWSP.

For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?

I suppose not, because cursive scripts use hyphenation and as such should use soft-hyphen instead.

Richard57 · 2019-05-02T21:19:38Z

I for one regularly use ZWSP to provide line-breaking opportunities for long ASCII file paths, as breaks at soft hyphens could be mistaken for breaks at hyphens.
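As a rough illustration of that use, here is a small Python sketch that adds a ZWSP after each path separator so a renderer may wrap there. The helper name `add_break_opportunities` and the choice to break after `/` and `\` are assumptions of this example, and the result is meant for display only, not for use as an actual path.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def add_break_opportunities(path, separators="/\\"):
    """Return path with a ZWSP inserted after every separator character."""
    pieces = []
    for ch in path:
        pieces.append(ch)
        if ch in separators:
            pieces.append(ZWSP)  # invisible line-break opportunity
    return "".join(pieces)


display_text = add_break_opportunities("/usr/local/share/fonts")
# repr() makes the otherwise invisible characters visible.
print(repr(display_text))
```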

behdad · 2019-05-02T21:58:15Z

I for one regularly use ZWSP to provide line-breaking opportunities for long ASCII file paths, as breaks at soft hyphens could be mistaken for breaks at hyphens.

Right. That's another legit use case for them.

r12a · 2019-05-13T12:16:02Z

Right. That's another legit use case for them.

Watch out though, because https://r12a.github.io/scripts/punctuation/block.html#char200B works, whereas https://r12a.github.io/scripts/punctuation/block.html#char200B fails. And it's not at all clear why, to the user who may have copy/pasted the URL into github, email (or many other applications). Sounds like a dangerous thing to recommend generally.
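One way to guard against that, sketched in Python: check for and strip ZWSP before treating copied text as a URL. The helper names and the `example.org` URL are made up for illustration, and real input may contain other invisible characters that this does not handle.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def contains_zwsp(text):
    """True if text carries at least one invisible ZWSP."""
    return ZWSP in text


def strip_zwsp(text):
    """Return text with every ZWSP removed."""
    return text.replace(ZWSP, "")


# Hypothetical pasted URL with a ZWSP hidden after "long".
pasted = "https://example.org/long\u200bpath"
if contains_zwsp(pasted):
    pasted = strip_zwsp(pasted)
print(pasted)
```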

frivoal · 2019-09-09T08:07:21Z

Ok, so the conclusion seems to be that this might be surprising if you don't think about it too much, but that people have thought this through, and that ZWS is more like a soft hyphen, and that not breaking shaping is very much intentional, and documented in unicode.

There might be problems in browsers about failing to do the correct shaping if Zero Width Space is missing from the Arabic font, but that's an implementation matter, not a question about whether ZWS is supposed to break shaping.

So, I was feeling ready to close this with no action, but…

It seems that none of MS Word, InDesign, Firefox, or EdgeHTML respect that: regardless of font, they always break shaping on Zero Width Space. Additionally, whether Chrome breaks shaping or not depends on the font: it keeps shaping across Zero Width Space with most good Unicode/Arabic fonts, but not with (some?) default system fonts.

Fonts tested: Arial Unicode MS, Segoe UI, Noto Naskh Arabic, Code2000, Adobe Arabic, Calibri.

So, is this a situation where all/most implementations need to get their acts together and get fixed to match the standard, or is the standard fiction that needs to get fixed?

skynavga · 2019-09-16T00:19:16Z

@r12a re: #3861 (comment)

So does anyone know of any circumstances in which it would make sense to use ZWSP inside a word or sequence of characters in a script that is cursive?

If the sequence does not otherwise contain any spacing or spaces characters, then no.

Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)

Justification should entirely ignore ZWSP, in particular, it should not assign a non-zero width to it.

For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?

Yes, as has been mentioned above.

fantasai · 2020-04-01T22:30:59Z

Posted as an issue to the UTC. I guess we wait for their response.

I can't link to it because they (still!) don't use a public bug tracker. :/

frivoal · 2020-04-06T07:44:42Z

I'm the original commenter, so I guess "Commenter Response Pending" is for me. I agree this is out of scope for the css-text-3 spec, so I'm OK with closing it.

But even if it isn't in scope for the spec, it is relevant for linebreaking, affects tests and implementations, and it would be good to put this question to rest. UTC not having a public tracker is unfortunate. Is there a way (not necessarily a URL or an API, a human contact is OK as well) to track the status?

r12a · 2020-04-06T10:32:25Z

@fantasai what did you say to the UTC? (and where did you post it?)

r12a · 2020-04-06T14:05:03Z

fwiw, I created some interactive tests. See https://w3c.github.io/i18n-tests/results/int-cursive for a summary, and click on #26 to see the detailed results, listed by font, at https://github.com/w3c/character_phrase_tests/issues/26

fantasai · 2020-04-08T06:46:10Z

@r12a Basically posed the question @frivoal asked in #3861 (comment) - should implementations be changed to match the Unicode spec, or should Unicode be adjusted to match implementations?

Leaving it open; waiting on UTC reply. The comment was posted through their official feedback channel which, as I mentioned, does not have any public-facing tracking.

xfq · 2020-04-27T00:46:37Z

FYI - it is tracked in F5 of https://www.unicode.org/L2/L2020/20108-properties-feedback.pdf and will probably be discussed in UTC #163 this week.

frivoal · 2020-04-27T09:34:43Z

Admittedly, I don't know all that much about Unicode, but I'm a little puzzled about the (planned) response documented in the file @xfq linked to. The bug report by @fantasai roughly says "the spec doesn't match implementations, maybe the spec should change to match implementations", and a part of the response boils down to "no, that would break implementations". The other part of the response, that setting ZWSP's General_Category to Cf rather than Zs was very deliberate and justified, seems perfectly reasonable, and even might possibly be enough to go against compatibility concerns (though I'm not the right person to make this call). But citing compatibility concerns in support of keeping the specification unchanged when faced with a claim that implementations do not follow the specification is perplexing.

It seems to me that a little bit more testing would be appropriate. I did not do extensive testing, but the implementations I did find to be in violation of the spec are pretty major ones. InDesign, MS Word, LibreOffice, Firefox, EdgeHTML, Chrome (depending on the font), Apple Mail & Apple TextEdit (so presumably the built-in text component of macOS)…

jfkthame · 2020-04-27T12:02:39Z

The (proposed) response suggests that they "Respond to Elika, informing them that the UTC declines to change the General_Category of U+200B Zero Width Space". But AIUI, Elika's report did not specifically ask for the General_Category to be changed; it only queried the Arabic shaping behavior.

I believe our concern here would be adequately addressed by just adding an entry for ZWSP to ArabicShaping.txt, assigning it joining type U (rather than the default T for characters of General Category Cf). At the point (16 years ago) when ZWSP was changed from GC=Zs to GC=Cf, it doesn't look like Arabic joining behavior was considered.

The primary use for ZWSP, I think, is to control the provision of potentialLineBreakPositions within long strings of otherwise-unbreakable text (e.g. in paths, or in scriptio continua writing systems), where it means, more or less, "word boundary with no visible space". As such, I think it is correct for it to interrupt cursive joining: if I write Arabic words without spaces between them, I'd still expect to interrupt joining at the word boundaries (it's somewhat analogous to the use of camelCase when writing a multiWordEnglishLanguageIdentifier).

So I think the proposed response is answering the wrong question. We're not requesting a change of General Category but a change of Arabic Joining Type.

xfq · 2020-04-27T23:56:20Z

+1 to @jfkthame.

Here's a link to a revised recommended UTC action: https://www.unicode.org/L2/L2020/20108r-properties-feedback.pdf (not discussed in UTC yet)

jfkthame · 2020-04-28T14:09:47Z

FTR, I sent a note to the UTC with essentially the same content as my comment here yesterday, which has led to the revised feedback doc linked above. If (as I expect) the UTC accepts that recommendation, the next step will be to follow up with additional documentation (assuming we want to pursue this).

xfq · 2020-05-15T00:39:04Z

Link to the draft minutes of UTC: https://www.unicode.org/L2/L2020/20102.htm#163-A74

frivoal · 2020-06-05T00:51:07Z

Unicode's response is that they decline to change. I find the rationale a bit light, but nevertheless, we seem to have reached the end of this. Time to close?

Richard57 · 2020-06-05T11:10:29Z

Actually, the UTC response reads to me as "No change yet. Write a formal request, giving a full justification and an assessment of the impact of the change."

On the camelCase analogy, ZWNJ will break the cursive connections quite nicely. If one wants line break opportunities between the elements, then an additional ZWSP offers the scope, as an additional effect.

I had a look at some well-written handwritten Tai Tham. There's scope for the subscript consonants and vowels of adjacent clusters to clash. I noticed that clashes below were avoided, and that the presence of zero width word boundaries didn't seem to affect the avoidance strategies. So shaping in general carries on across ZWSP. I note further that U+00AD SOFT HYPHEN is also transparent, and have been told that the Arabic joining rules are continued across line breaks. That argues that joining should continue across ZWSP. Of course, it doesn't say whether complex ligatures (e.g. vertical stacking) should be allowed across ZWSP.

The Unicode standard offers the following instructions:

1. Break joining, no line break: ZWNJ
2. Break joining, allow line break with hyphen: ZWNJ, 00AD or 00AD, ZWNJ
3. Break joining, allow line break without hyphen: ZWNJ, ZWSP or ZWSP, ZWNJ
4. Continue joining, no line break: (default)
5. Continue joining, allow line break with hyphen: 00AD (TBC)
6. Continue joining, allow line break without hyphen: ZWSP

The proposed change would remove the ability to command no. 6.

The proposed change would disallow option 5.
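The instructions listed above can also be written as a small lookup, sketched here in Python under the current property values (ZWSP and SOFT HYPHEN transparent). The function name `controls` and the `line_break` keywords are hypothetical; where the list allows either order, this sketch arbitrarily puts ZWNJ first, and the soft-hyphen case carries the same "TBC" caveat as in the list.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWSP = "\u200b"  # ZERO WIDTH SPACE
SHY = "\u00ad"   # SOFT HYPHEN


def controls(break_joining, line_break):
    """Return the control characters to insert between two letters.

    break_joining: True to interrupt cursive joining at this point.
    line_break: "none", "hyphen" (break shown with a hyphen), or
                "no-hyphen" (break without a hyphen).
    """
    break_controls = {"none": "", "hyphen": SHY, "no-hyphen": ZWSP}
    if line_break not in break_controls:
        raise ValueError(f"unknown line_break value: {line_break!r}")
    # Joining and line breaking are controlled independently, so the
    # result is simply the concatenation of the two choices.
    joining_control = ZWNJ if break_joining else ""
    return joining_control + break_controls[line_break]


# Continue joining, allow a line break without a hyphen: ZWSP alone.
print(repr(controls(False, "no-hyphen")))
# Break joining, allow a line break without a hyphen: ZWNJ then ZWSP.
print(repr(controls(True, "no-hyphen")))
```

If ZWSP itself interrupted joining, the first call would have no character sequence left to return.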

jfkthame · 2020-06-05T11:40:20Z

shaping in general carries on across ZWSP

I don't think this is particularly relevant. Shaping in general carries on across everything (including non-zero-width spaces as well). For a nice example, see the Awami Nastaliq font (check the examples of "Diagonal Cluster Fitting" at http://software.sil.org/awami/what-is-special/).

The question here is only about whether ZWSP should be class T or U for the purposes of the Arabic Shaping property. This is unrelated to whether shaping (as a general mechanism) may take effect across it.

fantasai · 2020-09-03T16:13:38Z

Point that was brought up in a discussion with @frivoal: Arabic letters should not change shape depending on whether there was a line break or not. That's the correct behavior for line-break: anywhere and the correct behavior in the presence of hyphenation. So breaks caused by ZWSP should have consistent shaping behavior whether the line breaks at that point or not.

frivoal added css-text-3 Current Work Needs i18n feedback i18n-alreq Arabic language enablement labels Apr 22, 2019

frivoal self-assigned this Apr 22, 2019

This was referenced Apr 22, 2019

[css-text] Arabic cursive joining and ZWJ web-platform-tests/wpt#14673

Merged

[css-text] Verifying the effect of ZERO WIDTH SPACE on Arabic shaping web-platform-tests/wpt#16427

Open

r12a added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-mlreq Mongolian language enablement i18n-afrlreq African language enablement labels Apr 23, 2019

himorin mentioned this issue Apr 25, 2019

[css-text] Should zero width space break Arabic shaping? w3c/i18n-activity#686

Open

tabatkins added this to Needs triage in Blink Issue Triage via automation Apr 26, 2019

tabatkins moved this from Needs triage to No Action Yet in Blink Issue Triage Apr 26, 2019

r12a mentioned this issue Jul 10, 2019

Is zero-width space problematic for cursive African scripts? w3c/afrlreq#1

Open

plehegar removed the Needs i18n feedback label Dec 5, 2019

frivoal added the Closed Rejected as OutOfScope label Jan 22, 2020

fantasai added the Commenter Response Pending label Apr 1, 2020

frivoal closed this as completed Apr 6, 2020

frivoal added the Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. label Apr 6, 2020

fantasai reopened this Apr 8, 2020

Blink Issue Triage automation moved this from No Action Yet to Needs triage Apr 8, 2020

frivoal removed the Commenter Response Pending label Jul 2, 2020

simonbuchan mentioned this issue Jul 23, 2020

Design requirements kas-gui/kas-text#1

Open

9 tasks

frivoal added css-text-4 and removed css-text-3 Current Work labels Sep 3, 2020

frivoal mentioned this issue Sep 3, 2020

[css-text] Reconsider the resolution on #855 #5410

Closed
