[css-text] Should zero width space break Arabic shaping? #3861

Open
frivoal opened this issue Apr 22, 2019 · 34 comments
Labels
Closed Rejected as OutOfScope Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. css-text-4 i18n-afrlreq African language enablement i18n-alreq Arabic language enablement i18n-mlreq Mongolian language enablement i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@frivoal
Collaborator

frivoal commented Apr 22, 2019

This is probably more of a Unicode issue than a CSS issue, but we have a fair number of people involved with text layout and i18n over here, so I'm filing it here first to figure out whether we should take it to Unicode or not.

When writing web-platform-tests/wpt#14673, I had misread the Unicode standard, and thought that ZERO WIDTH SPACE was supposed to break Arabic shaping, based on a table that said "all spacing characters" do so. But there's a distinction between "spacing characters" and "spaces characters", and ZERO WIDTH SPACE is part of the latter, not the former.

https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt gives further details about which character does what to shaping, and classifies ZERO WIDTH SPACE as T (transparent), which neither forces nor breaks shaping, and just behaves as if it wasn't there for shaping purposes.

So Unicode has a definite answer as to what's supposed to happen, but several people in the thread about my tests were surprised by that answer (including @behdad, @r12a, and myself), because ZERO WIDTH SPACE is used as a word divider, and that suggests it ought to be breaking shaping. @r12a brought up nastaliq as a reasonable use case, because:

when using nastaliq script, esp. in Urdu, inter-word spaces are often not applied, because words are separated enough by the arrangement of glyphs along the sloping baselines. If you do, however, want to indicate word boundaries in those situations without unsightly spacing, using ZWSP seems to be an obvious way of doing so.

So, what do we collectively think? Is Unicode likely enough to be mistaken that we should raise this issue with them? Is there a known good reason for why things are the way they are?

@behdad

behdad commented Apr 22, 2019

It feels to me that the intention / rationale behind this was to be a character that adds a word- (and hence possible line-) break without changing any other behavior.

@frivoal
Collaborator Author

frivoal commented Apr 22, 2019

Does it make sense to add a word-break without affecting shaping? That sounds counterintuitive to me.

@frivoal
Collaborator Author

frivoal commented Apr 23, 2019

@fantasai I've seen you make comments in various places about ZWSP that implied that you thought it ought to break shaping (here's just one), so I'd be interested in your feedback on this.

@ntounsi

ntounsi commented Apr 23, 2019

https://unicode.org/reports/tr44/#Release_Stability (Date 2019-02-27)

says §2.3.1 "Updates to character properties [...] may be required [...] to change the assigned values for a property".
[...]
"For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking"

  1. It follows that, as a Format character, ZWS can also serve "to indicate word boundaries" as raised by @r12a.

  2. Is ZWS still a "risky" character for a stable implementation?

@frivoal, I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?

@behdad

behdad commented Apr 23, 2019

@frivoal, I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?

Read the file header re missing values.

@frivoal
Collaborator Author

frivoal commented Apr 23, 2019

I didn't find ZWS as "Transparent" in the link "https://www.unicode.org/Public/UCD/latest/ucd/ArabicShaping.txt". Am I missing something?

It's not listed explicitly, but it is covered by the generic rule:

Note: Code points that are not explicitly listed in this file are
either of joining type T or U:

  • Those that are not explicitly listed and that are of General Category Mn, Me, or Cf
    have joining type T.

ZWS is in category Cf.
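
The fallback rule quoted above can be checked with a minimal sketch. It assumes Python's standard `unicodedata` module for the General Category lookup; `default_joining_type` is a hypothetical helper name, and it is only meaningful for code points that ArabicShaping.txt does not list explicitly.

```python
import unicodedata

def default_joining_type(ch: str) -> str:
    """Joining type for a code point NOT explicitly listed in ArabicShaping.txt.

    Hypothetical helper: it applies only the generic rule from the file
    header (Mn, Me and Cf default to T; everything else defaults to U).
    It does not consult the file itself, so it must not be used for
    explicitly listed code points such as Arabic letters.
    """
    return "T" if unicodedata.category(ch) in {"Mn", "Me", "Cf"} else "U"

zwsp = "\u200b"  # ZERO WIDTH SPACE
print(unicodedata.category(zwsp))   # General Category of ZWSP
print(default_joining_type(zwsp))   # default joining type under the generic rule
```

Under this rule an ordinary U+0020 SPACE (General Category Zs) falls into the U bucket, which is the distinction the rest of this thread turns on.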

@behdad

behdad commented Apr 23, 2019

cc @roozbehp again, since I think he can articulate the reasoning behind this best.

Does it make sense to add a word-break without affecting shaping? That sounds counterintuitive to me.

My understanding is that ZWSP is inherently a line-break control character. So it's misleadingly named at best. It's closer to SOFT HYPHEN, than to a space. The difference from SOFT HYPHEN seems to be that this one is not expected to turn into a hyphen if line break does happen. I think the original use case was to be used with scripts that don't use inter-word spaces, to mark line break opportunities. It feels to me that ZWSP and SOFT HYPHEN should have been one character, to mark "line break allowed".

Another use of ZWSP is to mark break opportunities / word boundaries in concatenated words like long URLs or hash tags like "justanotherawesomelylongurl". The idea is that whether or not you mark a location as a line-break opportunity should be separate from whether Arabic shaping happens. So, if Arabic shaping is not desired, one should use a ZWNJ to control that, separately from ZWSP.

Another explanation is that many characters, like ZWSP, only affect one aspect of Unicode processing and are ignored by all other processes. This is done to manage complexity: instead of having to specify the behavior of each control / format character for every process on every script, the behavior can be specified independently of scripts for the most part.

Anyway. Just my understanding. As I said, I was also surprised by this, but I understand what the rationale / thinking has been.

@behdad

behdad commented Apr 23, 2019

I'm also not convinced this ZWSP should be used with Nastaliq...

@r12a r12a added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-mlreq Mongolian language enablement i18n-afrlreq African language enablement labels Apr 23, 2019
@r12a
Contributor

r12a commented Apr 23, 2019

I'm also not convinced this ZWSP should be used with Nastaliq...

I could be convinced that you're right there. It certainly doesn't seem a good idea given what TUS expects wrt joining behaviour.

I note, however, that Firefox and Edge both break the cursive joining, although Chrome and Safari don't. See https://r12a.github.io/pickers/urdu/?text=%DB%81%D8%B1%E2%80%8B%D8%B4%D8%AE%D8%B5%E2%80%8B%DA%A9%D9%88%E2%80%8B%D8%A7%D8%B3%E2%80%8B%D8%A8%D8%A7%D8%AA%E2%80%8B%DA%A9%D8%A7%E2%80%8B%D8%AD%D9%82 for an example.

Note further, however, that a difference between ZWSP and soft hyphen is that ZWSP is expected to expand when letter spacing or justification is applied to a line. This makes it not quite a simple line-break opportunity indicator. (TUS describes it as "indicates a word break or line break opportunity" and goes on to describe how justification algorithms are likely to add space on p872 of TUS version 12.) I'm wondering now how that fits with justification of cursive scripts.... Would it be expected to create kashida-like behaviour when justified, but shrink back to nothing otherwise (unlike tatweel)?

@behdad

behdad commented Apr 23, 2019

I note, however, that Firefox and Edge both break the cursive joining, although Chrome and Safari don't.

I'm fairly sure Firefox breaks it just because the font doesn't have ZWSP. If you use an Arabic font that does have ZWSP I expect Firefox to NOT break shaping. Quite possibly the same about Safari.

You're right re justification. At this point I can't make up my mind about the exact intended use of ZWSP anymore.

@Richard57

The text on justification and ZWSP in TUS is on p872, in Chapter 23. I see no difference in behaviour between ZWSP and soft hyphen for justification; if the line-break opportunity is not taken, then both characters are simply ignored; justification behaves as though they were not present. The issue seems to be mentioned for ZWSP because someone once thought that ZWSP suppressed the expansion of inter-character spacing.

A rendering difference between ZWSP and soft hyphen is that when the line break opportunity is taken, a soft hyphen has hyphenation effects, such as the appearance of a hyphen and changes in spelling. (Some varieties of Thai use hyphenation with hyphens.) Another difference is that ZWSP marks a word boundary whereas a soft hyphen has no such effect; this matters for spell checkers.

@tabatkins tabatkins added this to Needs triage in Blink Issue Triage via automation Apr 26, 2019
@tabatkins tabatkins moved this from Needs triage to No Action Yet in Blink Issue Triage Apr 26, 2019
@jfkthame
Contributor

jfkthame commented May 2, 2019

I'm also not convinced this ZWSP should be used with Nastaliq...

Agreed. Nastaliq text (just like the same text if it were written in a "simple" Arabic font) would use U+0020 for inter-word spaces. A Nastaliq font might give its space glyph a very narrow (or even zero) default advance width, but it's still U+0020.

@r12a
Contributor

r12a commented May 2, 2019

So does anyone know of any circumstances in which it would make sense to use ZWSP inside a word or sequence of characters in a script that is cursive?

Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)

For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?

@behdad

behdad commented May 2, 2019

Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)

As someone pointed out, ZWSP doesn't have any justification behavior either. It's as if it wasn't there. The clarification in Unicode saying that it might stretch simply means that it might stretch the same way as it might without ZWSP.

For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?

I suppose not, because cursive scripts use hyphenation and as such should use soft-hyphen instead.

@Richard57

I for one regularly use ZWSP to provide line-breaking opportunities for long ASCII file paths, as breaks at soft hyphens could be mistaken for breaks at hyphens.

@behdad

behdad commented May 2, 2019

I for one regularly use ZWSP to provide line-breaking opportunities for long ASCII file paths, as breaks at soft hyphens could be mistaken for breaks at hyphens.

Right. That's another legit use case for them.

@r12a
Contributor

r12a commented May 13, 2019

Right. That's another legit use case for them.

Watch out though, because https://r12a.github.io/scripts/punctuation/block.html#char200B works, whereas https://r12a.github.io/scripts/punctuation​/block.html#char200B fails. And it's not at all clear why, to the user who may have copy/pasted the URL into github, email (or many other applications). Sounds like a dangerous thing to recommend generally.
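
The failure mode described above can be reproduced with a small sketch. It assumes the pasted URL differs from the working one only by a U+200B after "punctuation"; `find_invisible_zwsp` is a hypothetical helper name.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

working = "https://r12a.github.io/scripts/punctuation/block.html#char200B"
# Same URL with an invisible ZWSP after "punctuation" (written as an escape here).
pasted = "https://r12a.github.io/scripts/punctuation\u200b/block.html#char200B"

def find_invisible_zwsp(url: str) -> list:
    """Return the indices of every ZWSP in `url` (hypothetical helper)."""
    return [i for i, ch in enumerate(url) if ch == ZWSP]

print(working == pasted)                    # the two strings are not equal
print(find_invisible_zwsp(pasted))          # where the hidden character sits
print(pasted.replace(ZWSP, "") == working)  # stripping ZWSP recovers the URL
```

A check like this only detects the problem after the fact; it does not make the copied text any less confusing for the person who pasted it.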

@frivoal
Collaborator Author

frivoal commented Sep 9, 2019

Ok, so the conclusion seems to be that this might be surprising if you don't think about it too much, but that people have thought this through, and that ZWS is more like a soft hyphen, and that not breaking shaping is very much intentional, and documented in Unicode.

There might be problems in browsers about failing to do the correct shaping if Zero Width Space is missing from the Arabic font, but that's an implementation matter, not a question about whether ZWS is supposed to break shaping.

So, I was feeling ready to close this with no action, but…

It seems that neither MS Word (regardless of font), nor InDesign (regardless of font) nor Firefox (regardless of font) or EdgeHTML (regardless of font) respect that, and they do break shaping on Zero Width Space always. Additionally, whether Chrome breaks shaping or not depends on the font: it shapes with most good Unicode/Arabic fonts, but not with (some?) default system fonts.

Fonts tested: Arial Unicode MS, Segoe UI, Noto Naskh Arabic, Code2000, Adobe Arabic, Calibri.

So, is this a situation where all/most implementations need to get their acts together and get fixed to match the standard, or is the standard fiction that needs to get fixed?

@skynavga
Contributor

@r12a re: #3861 (comment)

So does anyone know of any circumstances in which it would make sense to use ZWSP inside a word or sequence of characters in a script that is cursive?

If the sequence does not otherwise contain any spacing or spaces characters, then no.

Would control of justification behaviour be such a circumstance? (Can't see it really, but if so, we need to ask further questions about how that would work.)

Justification should entirely ignore ZWSP, in particular, it should not assign a non-zero width to it.

For that matter, does ZWSP have a useful application in any script outside those that don't separate words with spaces (such as Khmer, Japanese, etc.)?

Yes, as has been mentioned above.

@fantasai
Collaborator

fantasai commented Apr 1, 2020

Posted as an issue to the UTC. I guess we wait for their response.

I can't link to it because they (still!) don't use a public bug tracker. :/

@frivoal
Collaborator Author

frivoal commented Apr 6, 2020

I'm the original commenter, so I guess "Commenter Response Pending" is for me. I agree this is out of scope for the css-text-3 spec, so I'm OK with closing it.

But even if it isn't in scope for the spec, it is relevant for linebreaking, affects tests and implementations, and it would be good to put this question to rest. UTC not having a public tracker is unfortunate. Is there a way (not necessarily a URL or an API, a human contact is OK as well) to track the status?

@frivoal frivoal closed this as completed Apr 6, 2020
@frivoal frivoal added the Commenter Satisfied Commenter has indicated satisfaction with the resolution / edits. label Apr 6, 2020
@r12a
Contributor

r12a commented Apr 6, 2020

@fantasai what did you say to the UTC? (and where did you post it?)

@r12a
Contributor

r12a commented Apr 6, 2020

fwiw, I created some interactive tests. See https://w3c.github.io/i18n-tests/results/int-cursive for a summary, and click on #26 to see the detailed results, listed by font, at https://github.com/w3c/character_phrase_tests/issues/26

@fantasai fantasai reopened this Apr 8, 2020
Blink Issue Triage automation moved this from No Action Yet to Needs triage Apr 8, 2020
@fantasai
Collaborator

fantasai commented Apr 8, 2020

@r12a Basically posed the question @frivoal asked in #3861 (comment) - should implementations be changed to match the Unicode spec, or should Unicode be adjusted to match implementations?

Leaving it open; waiting on UTC reply. The comment was posted through their official feedback channel which, as I mentioned, does not have any public-facing tracking.

@xfq
Member

xfq commented Apr 27, 2020

FYI - it is tracked in F5 of https://www.unicode.org/L2/L2020/20108-properties-feedback.pdf and will probably be discussed in UTC #163 this week.

@frivoal
Collaborator Author

frivoal commented Apr 27, 2020

Admittedly, I don't know all that much about Unicode, but I'm a little puzzled about the (planned) response documented in the file @xfq linked to. The bug report by @fantasai roughly says "the spec doesn't match implementations, maybe the spec should change to match implementations", and a part of the response boils down to "no, that would break implementations". The other part of the response, that setting ZWSP's General_Category to Cf rather than Zs was very deliberate and justified, seems perfectly reasonable, and might even be enough to outweigh compatibility concerns (though I'm not the right person to make this call). But citing compatibility concerns in support of keeping the specification unchanged when faced with a claim that implementations do not follow the specification is perplexing.

It seems to me that a little bit more testing would be appropriate. I did not do extensive testing, but the implementations I did find to be in violation of the spec are pretty major ones. InDesign, MS Word, LibreOffice, Firefox, EdgeHTML, Chrome (depending on the font), Apple Mail & Apple TextEdit (so presumably the built-in text component of macOS)…

@jfkthame
Contributor

The (proposed) response suggests that they "Respond to Elika, informing them that the UTC declines to change the General_Category of U+200B Zero Width Space". But AIUI, Elika's report did not specifically ask for the General_Category to be changed; it only queried the Arabic shaping behavior.

I believe our concern here would be adequately addressed by just adding an entry for ZWSP to ArabicShaping.txt, assigning it joining type U (rather than the default T for characters of General Category Cf). At the point (16 years ago) when ZWSP was changed from GC=Zs to GC=Cf, it doesn't look like Arabic joining behavior was considered.

The primary use for ZWSP, I think, is to control the provision of potentialLineBreakPositions within long strings of otherwise-unbreakable text (e.g. in paths, or in scriptio continua writing systems), where it means, more or less, "word boundary with no visible space". As such, I think it is correct for it to interrupt cursive joining: if I write Arabic words without spaces between them, I'd still expect to interrupt joining at the word boundaries (it's somewhat analogous to the use of camelCase when writing a multiWordEnglishLanguageIdentifier).

So I think the proposed response is answering the wrong question. We're not requesting a change of General Category but a change of Arabic Joining Type.

@xfq
Member

xfq commented Apr 27, 2020

+1 to @jfkthame.

Here's a link to a revised recommended UTC action: https://www.unicode.org/L2/L2020/20108r-properties-feedback.pdf (not discussed in UTC yet)

@jfkthame
Contributor

jfkthame commented Apr 28, 2020

FTR, I sent a note to the UTC with essentially the same content as my comment here yesterday, which has led to the revised feedback doc linked above. If (as I expect) the UTC accepts that recommendation, the next step will be to follow up with additional documentation (assuming we want to pursue this).

@xfq
Member

xfq commented May 15, 2020

Link to the draft minutes of UTC: https://www.unicode.org/L2/L2020/20102.htm#163-A74

@frivoal
Collaborator Author

frivoal commented Jun 5, 2020

Unicode's response is that they decline to change. I find the rationale a bit light, but nevertheless, we seem to have reached the end of this. Time to close?

@Richard57

Actually, the UTC response reads to me as "No change yet. Write a formal request, giving a full justification and an assessment of the impact of the change."

On the camelCase analogy, ZWNJ will break the cursive connections quite nicely. If one wants line break opportunities between the elements, then an additional ZWSP offers the scope, as an additional effect.

I had a look at some well-written handwritten Tai Tham. There's scope for the subscript consonants and vowels of adjacent clusters to clash. I noticed that clashes below were avoided, and that the presence of zero width word boundaries didn't seem to affect the avoidance strategies. So shaping in general carries on across ZWSP. I note further that U+00AD SOFT HYPHEN is also transparent, and have been told that the Arabic joining rules are continued across line breaks. That argues that joining should continue across ZWSP. Of course, it doesn't say whether complex ligatures (e.g. vertical stacking) should be allowed across ZWSP.

The Unicode standard offers the following instructions:

Instruction: character sequence

  1. Break joining, no line break: ZWNJ
  2. Break joining, allow line break with hyphen: ZWNJ, 00AD or 00AD, ZWNJ
  3. Break joining, allow line break without hyphen: ZWNJ, ZWSP or ZWSP, ZWNJ
  4. Continue joining, no line break: (default)
  5. Continue joining, allow line break with hyphen: 00AD (TBC)
  6. Continue joining, allow line break without hyphen: ZWSP

The proposed change would remove the ability to command no. 6.

The proposed change would disallow option 5.
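
The six instructions listed above can be read as a function of two independent choices: whether to break joining, and which kind of line-break opportunity to allow. A minimal sketch under that reading follows; `control_sequence` and its parameter values are hypothetical names, and where the list allows either order the sketch arbitrarily puts ZWNJ first.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
SHY = "\u00ad"   # SOFT HYPHEN
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def control_sequence(break_joining: bool, line_break: str = "none") -> str:
    """Return the format characters to insert between two letters.

    line_break is one of "none", "hyphen" or "no-hyphen" (names assumed
    for this sketch). The mapping mirrors the six-row list in the comment
    above, i.e. it assumes ZWSP and SOFT HYPHEN are transparent to joining.
    """
    breaks = {"none": "", "hyphen": SHY, "no-hyphen": ZWSP}
    if line_break not in breaks:
        raise ValueError(f"unknown line_break value: {line_break!r}")
    return (ZWNJ if break_joining else "") + breaks[line_break]

# Row 6 of the list: continue joining, allow a line break without a hyphen.
print([f"U+{ord(c):04X}" for c in control_sequence(False, "no-hyphen")])
# Row 3 of the list: break joining, allow a line break without a hyphen.
print([f"U+{ord(c):04X}" for c in control_sequence(True, "no-hyphen")])
```

Keeping the two choices orthogonal is the design point being argued here: if ZWSP itself broke joining, the two calls above would no longer produce different rendering behaviour.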

@jfkthame
Contributor

jfkthame commented Jun 5, 2020

shaping in general carries on across ZWSP

I don't think this is particularly relevant. Shaping in general carries on across everything (including non-zero-width spaces as well). For a nice example, see the Awami Nastaliq font (check the examples of "Diagonal Cluster Fitting" at http://software.sil.org/awami/what-is-special/).

The question here is only about whether ZWSP should be class T or U for the purposes of the Arabic Shaping property. This is unrelated to whether shaping (as a general mechanism) may take effect across it.
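
To make the T versus U distinction concrete, here is a simplified sketch of cursive joining between two letters with other characters between them. It works on joining-type codes only (D, R, L, C, U, T), ignores fonts and ligatures entirely, and `joins_directly` is a hypothetical name; it is an illustration of the property's effect, not a shaping engine.

```python
from typing import Sequence

# Types that can connect to the FOLLOWING character in logical order
# (dual-joining, left-joining, join-causing).
JOINS_TO_FOLLOWING = {"D", "L", "C"}
# Types that can connect to the PRECEDING character in logical order
# (dual-joining, right-joining, join-causing).
JOINS_TO_PRECEDING = {"D", "R", "C"}

def joins_directly(types: Sequence[str]) -> bool:
    """True if the first and last characters of the run join each other.

    `types` holds the joining types of a logical-order run of at least two
    characters. Transparent (T) characters in between are skipped; any
    other type in between stops the first and last from joining directly.
    """
    first, *middle, last = types
    if any(t != "T" for t in middle):
        return False
    return first in JOINS_TO_FOLLOWING and last in JOINS_TO_PRECEDING

# Two dual-joining letters with ZWSP between them:
print(joins_directly(["D", "T", "D"]))  # ZWSP as T (current Unicode data)
print(joins_directly(["D", "U", "D"]))  # ZWSP as U (the proposed change)
```

The only difference between the two calls is the joining type assigned to the middle character, which is exactly the property under discussion.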

@fantasai
Collaborator

fantasai commented Sep 3, 2020

Point that was brought up in a discussion with @frivoal: Arabic letters should not change shape depending on whether there was a line break or not. That's the correct behavior for line-break: anywhere and the correct behavior in the presence of hyphenation. So breaks caused by ZWSP should have consistent shaping behavior whether the line breaks at that point or not.
