[css-text-3] line-break, word-break: language unclear, and a new testcase. #2559

Closed
faceless2 opened this issue Apr 13, 2018 · 3 comments
Labels
Closed Accepted as Editorial; css-text-3 (Current Work); i18n-tracker (Group bringing to attention of Internationalization, or tracked by i18n but not needing response.); Tested (Memory aid - issue has WPT tests); Tracked in DoC

Comments

@faceless2

The language for line-break and (in particular) word-break, is unclear with regard to what changes are required to the UAX14 algorithm.

I've made a pull request for a new testcase we've been working up at web-platform-tests/wpt#10420. This testcase is complete but will require review due to the ambiguities described below.

While developing this it became apparent that some of the language in the spec was a bit unclear - certainly to me, and, as I'm seeing different results with this testcase in different browsers, maybe to others.

First, I expect I am not the first to point out that "word-break" and "line-break" have some considerable overlap. As described, breaks within words like ちょっと (UAX14 classes ID CJ CJ ID) are covered by the line-break rule, although this is a single word. And of course, "line-break: anywhere" will break words. Some sort of clarifying note as to the interaction of these two features might help.

Specific areas of the text that are a bit confusing or incomplete:

  • word-break states it "controls whether a soft wrap opportunity exists between adjacent typographic letter units (or other typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes)" - although the note at the bottom of "keep-all" explicitly mentions Korean, the classes H2, H3, JL, JT and JV are excluded from this list. I don't know Korean so I'm unsure if that is a deliberate omission. It also doesn't mention classes CJ or NS, and again I'm not sure if this is a deliberate omission. Given the overlap with line-break it may be better to dump this descriptive paragraph completely in favour of exact descriptions of the behaviour of each property with regard to UAX14, as I've added below.

  • The language of "word-break: keep-all" is still a bit unclear with regard to the changes it mandates to UAX14. For example, "Breaking is forbidden within “words”: implicit soft wrap opportunities between typographic letter units are suppressed" makes no mention of character class, so isn't much help if you're implementing this. UAX14 describes this same customization as used for "ragged" Korean text, and specifies "... breaking after spaces (as in Latin text)". I believe the intention here is to treat all ideographic characters as if they were Latin text.

  • line-break: anywhere is described as providing "a soft wrap opportunity around every typographic character unit, including around any punctuation character or preserved spaces, or in the middle of words, disregarding any prohibition against line breaks introduced by characters with the GL, JW, or ZJW character class". It then states in the note that "This value triggers the line breaking rules typically seen in terminals.". If that's the intention then the mention of GL, JW and ZJW (which should be WJ and ZWJ by the way) is superfluous and confusing. And also superfluous. The final sentence should be "disregarding any prohibition", full-stop end of. Literally anywhere in the text is a valid break-point, even before U+20.

  • What happens if I specify "word-break: keep-all; line-break: anywhere"? The two rules contradict each other; which one wins?

  • Using the language of the text as an input to the algorithm seems a bit odd to me. Is there any reason "loose-cj" and "normal-cj" values for line-break could not be used to achieve the same thing? Not really a serious issue and I can't think of a specific reason why it's a problem, it just feels out of character with the rest of the spec so thought I'd raise it while I'm typing.

We've interpreted the various property values as having the following meaning. Whether they're correct or not is almost a secondary issue at this stage; what I'm getting at is that these definitions are exact enough to work from, so I think it would be great if the descriptions for these property values were rewritten in this form, i.e. detailing exactly what changes need to be made to UAX14.

  • "word-break: normal" controls breakpoints between AI, AL, CJ, H2, H3, HL, ID, JL, JT and JV exactly as defined in UAX14. This allows breakpoints in the middle of CJK words, and denies them in non-CJK words. (note: existing description states "customary rules as described above", which is nowhere near exact enough)

  • "word-break: break-all" treats any glyphs of class AI, AL, HL, NU and SA as class ID for the purposes of UAX14. (note: class AI is not listed in the current description; it probably should be, as UAX14 LB1 suggests that class AI is resolved to another class. HL was also missing, I think it should be treated as for AL)

  • "word-break: keep-all" treats any glyphs of class AI, CJ, H2, H3, ID, JL, JT and JV as if they were class AL for the purposes of UAX14. In other words, CJK text will be broken exactly as if it were Latin text, i.e. with spaces.

  • "line-break: anywhere" allows a breakpoint between any two typographic character units. The restrictions defined in UAX14 do not apply, and the value of "word-break" is ignored.
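
As a rough illustration, the interpretation listed above can be written as a small class-remapping lookup that runs before the UAX14 pair rules. This is only a sketch under stated assumptions: `effective_class` and `remap` are hypothetical names, the class sets are copied from the bullets above rather than from css-text-3, and returning `None` for "line-break: anywhere" is just one way to signal that UAX14 is bypassed.

```python
# Hypothetical sketch of the remappings proposed in this issue (not spec text).
# Class names are UAX14 line breaking classes.

BREAK_ALL_AS_ID = frozenset({"AI", "AL", "HL", "NU", "SA"})
KEEP_ALL_AS_AL = frozenset({"AI", "CJ", "H2", "H3", "ID", "JL", "JT", "JV"})


def effective_class(lb_class, word_break="normal"):
    """Return the class to feed into UAX14 for a given word-break value."""
    if word_break == "normal":
        return lb_class  # UAX14 applies unchanged
    if word_break == "break-all":
        return "ID" if lb_class in BREAK_ALL_AS_ID else lb_class
    if word_break == "keep-all":
        return "AL" if lb_class in KEEP_ALL_AS_AL else lb_class
    raise ValueError(f"unsupported word-break value: {word_break!r}")


def remap(classes, word_break="normal", line_break="auto"):
    """Remap a sequence of classes; None means UAX14 is bypassed entirely."""
    if line_break == "anywhere":
        # Every boundary is a break opportunity and word-break is ignored.
        return None
    return [effective_class(c, word_break) for c in classes]
```

For example, the ちょっと sequence mentioned earlier (ID CJ CJ ID) would be handed to UAX14 as four AL characters under "keep-all", so no break would be allowed inside it.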

(note: this issue originally posted against the wrong repository at web-platform-tests/wpt#10423)

@frivoal frivoal added the css-text-3 Current Work label Apr 13, 2018
@frivoal frivoal changed the title from "line-break, word-break: language unclear, and a new testcase." to "[css-text-3] line-break, word-break: language unclear, and a new testcase." Apr 13, 2018
@frivoal frivoal self-assigned this Apr 13, 2018
@xfq xfq added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label May 7, 2018
@astearns
Member

astearns commented Jul 2, 2018

(removed agenda+ for now on @fantasai's recommendation)

@fantasai
Collaborator

fantasai commented Dec 6, 2018

> First, I expect I am not the first to point out that "word-break" and "line-break" have some considerable overlap. As described, breaks within words like ちょっと (UAX14 classes ID CJ CJ ID) are covered by the line-break rule, although this is a single word. And of course, "line-break: anywhere" will break words. Some sort of clarifying note as to the interaction of these two features might help.

I've tried to clarify the specific interactions. Not sure exactly how to explain the interactions at a high level other than what's there, but I'll give it a try later.

> word-break states it "controls whether a soft wrap opportunity exists between adjacent typographic letter units (or other typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes)" - although the note at the bottom of "keep-all" explicitly mentions Korean, the classes H2, H3, JL, JT and JV are excluded from this list. I don't know Korean so I'm unsure if that is a deliberate omission. It also doesn't mention classes CJ or NS, and again I'm not sure if this is a deliberate omission. Given the overlap with line-break it may be better to dump this descriptive paragraph completely in favour of exact descriptions of the behaviour of each property with regard to UAX14, as I've added below.

H2, H3, JL, JT, JV, and CJ are excluded from that list because they are all letters, so they're included in “typographic letter units” already. Line breaking around NS is controlled by line-break; word-break is not able to influence it. I've changed “other” to “non-letter” here to clarify. The sentence is a bit awkward because I don't know how to construct it grammatically so that the “belonging to” phrase clearly attaches only to “typographic character units” and not to “typographic letter units”, hence the parentheses. Anyway, the sentence now looks like:

“Specifically it controls whether a soft wrap opportunity exists between adjacent typographic letter units (and/or non-letter typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes [UAX14]).”
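
To make the scope of that sentence concrete, here is a minimal Python sketch of the test it implies. It rests on two assumptions that are not confirmed in this thread: that a “typographic letter unit” (as defined at the css-text-3 link cited below) amounts to a character in a Letter (L*) or Number (N*) general category, and that a single code point can stand in for a typographic character unit. `governed_by_word_break` is a hypothetical name, and the UAX14 class must be supplied by the caller because Python's standard library does not expose it.

```python
import unicodedata

# Classes that word-break also governs for non-letters, taken from the
# revised sentence quoted above.
NON_LETTER_CLASSES = frozenset({"NU", "AL", "AI", "ID"})


def is_typographic_letter_unit(ch):
    """Assumed definition: general category Letter (L*) or Number (N*)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def governed_by_word_break(ch, lb_class):
    """True if word-break controls wrap opportunities adjacent to ch.

    lb_class is the caller-supplied UAX14 class of ch (an assumption of
    this sketch; it is not looked up here).
    """
    return is_typographic_letter_unit(ch) or lb_class in NON_LETTER_CLASSES
```

Under these assumptions a Hangul syllable such as U+D55C (passed with class H3) is covered because it is a letter, without H3 appearing in the class list, while a punctuation character passed with class NS, such as U+30FB, is not.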

> The language of "word-break: keep-all" is still a bit unclear with regard to the changes it mandates to UAX14. For example, "Breaking is forbidden within “words”: implicit soft wrap opportunities between typographic letter units are suppressed" makes no mention of character class, so isn't much help if you're implementing this.

“typographic letter unit” is very specifically defined in https://www.w3.org/TR/css-text-3/#typographic-letter-unit so I don't know why you think there's “no mention of character class”.

> I believe the intention here is to treat all ideographic characters as if they were Latin text.

Yes.

> line-break: anywhere is described as providing "a soft wrap opportunity around every typographic character unit, including around any punctuation character or preserved spaces, or in the middle of words, disregarding any prohibition against line breaks introduced by characters with the GL, JW, or ZJW character class". It then states in the note that "This value triggers the line breaking rules typically seen in terminals.". If that's the intention then the mention of GL, JW and ZJW (which should be WJ and ZWJ by the way) is superfluous and confusing. And also superfluous. The final sentence should be "disregarding any prohibition", full-stop end of. Literally anywhere in the text is a valid break-point, even before U+20.

Edited as “any prohibition against line breaks, even those introduced by characters ...”. I want it to be clear that explicit wrapping controls are also ignored.

> What happens if I specify "word-break: keep-all; line-break: anywhere"? The two rules contradict each other; which one wins?

line-break: anywhere wins. I'll clarify that point.
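
A toy Python sketch of that precedence, for the two values in the question only. `break_offsets` is a hypothetical helper, one code point stands in for one typographic character unit, and the keep-all branch uses the simplified "break only after spaces" reading from earlier in this thread rather than the full UAX14 rules.

```python
def break_offsets(text, word_break, line_break):
    """Offsets (between code points) where a soft wrap may occur.

    Simplification: one code point stands in for one typographic
    character unit; a real implementation would segment grapheme clusters.
    """
    interior = range(1, len(text))
    if line_break == "anywhere":
        # Takes precedence over any word-break value, including keep-all.
        return list(interior)
    if word_break == "keep-all":
        # Toy rule from the discussion: break only after a space.
        return [i for i in interior if text[i - 1] == " "]
    raise NotImplementedError("only the two cases discussed are sketched")
```

With the contradictory pair, `break_offsets("ab c", "keep-all", "anywhere")` yields every interior offset, whereas the same text with `line_break="auto"` yields only the offset after the space.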

> Using the language of the text as an input to the algorithm seems a bit odd to me. Is there any reason "loose-cj" and "normal-cj" values for line-break could not be used to achieve the same thing? Not really a serious issue and I can't think of a specific reason why it's a problem, it just feels out of character with the rest of the spec so thought I'd raise it while I'm typing.

There's a lot of stuff in the spec that is language- or writing-system-dependent. Much of it is not called out in such explicit terms as these rules, but line-breaking, justification, white-space collapsing, and text transforms are all language-dependent. We do this because a) we want things to work optimally by default, without the author having to think about every single CSS property that does or will exist, and b) we want to keep the number of values limited to the switches that are useful for an author to think about, rather than overloading everyone in the world with more values than they can easily reason about or even need to know about.

> (note: existing description states "customary rules as described above", which is nowhere near exact enough)

UAX14 is a starting point for universal line breaking, not the ultimate authority on quality typesetting. We are intentionally not requiring it.

@fantasai
Collaborator

> I've tried to clarify the specific interactions. Not sure exactly how to explain the interactions at a high level other than what's there, but I'll give it a try later.

OK, did a bunch of editorial work to try to clean up overview sections and interactions. :) I think this should be fixed now, let me know if you have further suggestions.

@frivoal frivoal added Tested Memory aid - issue has WPT tests and removed Needs Review of Test Case(s) labels Dec 30, 2022