[css-text] Render U+2028 LINE SEPARATOR as a forced line break #6992

Closed
tabatkins opened this issue Jan 26, 2022 · 15 comments
Labels
Closed Accepted as Editorial css-text-3 Current Work Tested Memory aid - issue has WPT tests

Comments

@tabatkins
Member

tabatkins commented Jan 26, 2022

Originally posted by Ka-Ping Yee

I'd like to propose that U+2028 be rendered as a forced line break.

The changes to the CSS Text Module Level 3 draft would be minimal; for example:

  • In Section 3, append the sentence "U+2028 LINE SEPARATOR is always a forced line break."
  • In Section 4.1, exclude U+2028 from the definition of "other space separators."
  • Optionally, add a "U+2028" column to the table in Section 3, with "Forced line break" in every row.

The rationale is straightforward:

  • Unicode is very clear about the purpose of U+2028.
  • There are many circumstances in which it is useful to represent visible line breaks in text strings without additional markup.
  • There is solid precedent for a character with whitespace behaviour that supersedes all the CSS white-space options, U+00A0 NO-BREAK SPACE.
  • The essential layout functionality needed to implement U+2028 as a forced line break is not new; browsers already have it if they support "white-space: pre-line".
  • Current browsers typically render U+2028 as a visible glyph, such as an empty black box. Many developers find this surprising; most likely, it would be less surprising for U+2028 LINE SEPARATOR to be rendered as a line separator, as befits its name.

For reference, the Unicode Standard 14.0 defines U+2028 LINE SEPARATOR as an "unambiguous separator character". By my reading, it could hardly be more clear as to what U+2028 is intended to represent, and what the most sensible rendering should be:

5.8 Newline Guidelines

[...]

Line Separator and Paragraph Separator

A paragraph separator—independent of how it is encoded—is used to indicate a separation between paragraphs. A line separator indicates where a line break alone should occur, typically within a paragraph. [...] For comparison, line separators basically correspond to HTML <BR>, and paragraph separators to older usage of HTML <P> (modern HTML delimits paragraphs by enclosing them in <P>...</P>).

[...]

Recommendations

The Unicode Standard defines two unambiguous separator characters: U+2029 paragraph separator (PS) and U+2028 line separator (LS). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous.
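The correspondence drawn in the quoted passage (LS roughly to HTML <BR>, PS roughly to <P>...</P>) can be illustrated with a small Python sketch. The helper name `to_html` is hypothetical and not part of any spec or library; it only shows one way the two separators might be mapped to markup, assuming the input is plain text that needs escaping.

```python
import html

LS = "\u2028"  # U+2028 LINE SEPARATOR
PS = "\u2029"  # U+2029 PARAGRAPH SEPARATOR


def to_html(text: str) -> str:
    """Hypothetical helper: wrap PS-delimited runs in <p>, join LS-delimited lines with <br>."""
    rendered = []
    for paragraph in text.split(PS):
        # Escape each line so that characters such as "<" survive as text.
        lines = [html.escape(line) for line in paragraph.split(LS)]
        rendered.append("<p>" + "<br>".join(lines) + "</p>")
    return "".join(rendered)


if __name__ == "__main__":
    print(to_html("first line\u2028second line\u2029next paragraph"))
```

This is only an analogy in code; the proposal itself is about how the character renders, not about converting it to markup.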

I'd appreciate hearing your thoughts and suggested next steps on this.

Thanks very much!

@zestyping

zestyping commented Jan 26, 2022

Thanks, @tabatkins!

I can't edit the issue description directly, but here it is with the markup fixed up to render correctly on GitHub: [Copied into OP]

@xfq
Member

xfq commented Jan 27, 2022

I tested the rendering of this character in various browsers and editors, for your reference.


In Chromium it is rendered as a box with a cross (font is Hiragino Kaku Gothic ProN).

In Firefox, Safari, and iCab, it doesn't display at all.


In Visual Studio Code, the editor will emit a warning when it detects this character. See microsoft/vscode#96142


In Atom, it is not rendered. See atom/atom#12157


In Sublime Text 4, it is rendered as <0x2028>.


In TextEdit it is rendered as a forced line break.


In GNU Emacs (27.2) it is rendered as horizontal whitespace instead of a line break, even after enabling whitespace-mode.

In Vim (8.2) it is the same.


For the applications I tested, only TextEdit renders this character as a newline.

See also:

@zestyping

Thank you for doing this research, @xfq !

@fantasai
Collaborator

fantasai commented Jan 27, 2022

I think this issue is filed on the basis of some misunderstandings.

  • Sections 3 and 4 are concerned with document white space characters, specifically U+0020, U+0009, and segment breaks. U+2028 LINE SEPARATOR is not included explicitly, and unless the host language defines it as a segment break, it is not affected by any of the rules therein.
  • U+2028 is not an "other space separator". U+2028 belongs to category Zl, not Zs.
  • CSS Text 3 already normatively references UAX14's forced break behavior for U+2028, see section 5.1 list item 2.
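The general-category point in the second bullet can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the chosen comparison characters (SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE) are just illustrative examples of category Zs, and the reported values depend on the Unicode data bundled with the Python build.

```python
import unicodedata

# Code points to compare: three space separators plus LS and PS.
CODE_POINTS = (0x0020, 0x00A0, 0x3000, 0x2028, 0x2029)

# Map each code point to its Unicode general category (e.g. "Zs", "Zl", "Zp").
categories = {cp: unicodedata.category(chr(cp)) for cp in CODE_POINTS}

if __name__ == "__main__":
    for cp, category in categories.items():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {category}")
```

U+2028 reports Zl (line separator) and U+2029 reports Zp (paragraph separator), while the three space characters report Zs.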

CSS3 Text has, technically, required LS to be treated as a forced break for at least a decade. If browsers are not treating it as such, that should be considered a bug against them. Closing as invalid (not a spec issue).

@zestyping Copied your fixed markup into the OP! Thanks for caring about this issue, I hope your concern can motivate the browsers to fix this longstanding problem.

@xfq
Member

xfq commented Jan 28, 2022

Browser bug reports: Gecko, Blink, WebKit

Since this code point isn't directly mentioned in css-text, I'm not quite sure if we need to add a relevant test in WPT.

@fantasai
Collaborator

@xfq Tests for any behavior specced in css-text-3, even if indirectly, are welcome in WPT. :) Probably best to do it as a test for all BK/NL characters.

@zestyping

zestyping commented Jan 28, 2022

@fantasai Thank you for clarifying this! I do see now that Section 4.1 did not mean to refer to U+2028 when defining "other space separators".

CSS3 Text has, technically, required LS to be treated as a forced break for at least a decade. If browsers are not treating it as such, that should be considered a bug against them.

Can this be taken as an official statement on the WG's intended interpretation of LS? I would be delighted to know that treating U+2028 as a forced line break is already the behaviour that CSS Text 3 intends to specify!

I can imagine browser developers not finding this to be obvious from the spec. If this interpretation is not clear to them, would it be appropriate for me to point them at this comment thread as an authoritative ruling?

Here is why I suspect they might find it rather subtle. CSS Text 3 mentions many other relevant characters by code point (such as U+000A, U+0020, etc.) and name (CARRIAGE RETURN, IDEOGRAPHIC SPACE, etc.). Yet U+2028 is never mentioned anywhere in the entire spec. Neither LINE SEPARATOR nor its abbreviation LSEP is mentioned anywhere. Neither the "Line Separator" category nor its abbreviation "Zl" is mentioned anywhere. An ordinary person can wonder "I wonder why U+2028 doesn't render as a line break", search for the spec, arrive at CSS Text 3, search the entire document for every imaginable term related to U+2028, and find nothing — indeed, that was my experience, and what led me to file this issue. And, of course, we have the empirical evidence of a decade of browser development oblivious to this rule.

Would the CSS editors be willing to consider making this a little more explicit? I can think of one small change that would clear this all up.

As you pointed out, Section 5.1, bullet point 2 says "lines always break at each preserved forced break character".

Regardless of the 'white-space' value, lines always break at each preserved forced break character: thus for all values, line-breaking behavior defined for the BK and NL Unicode line breaking classes must be honored. [UAX14]

But there is no definition for the term "forced break character" in the spec. If you assume that a "forced break character" has something to do with a "forced line break", then the term "preserved forced break character" is nonsensical: "forced line break" is defined in terms of preserved characters, so there can be no such thing as a non-preserved forced break character. If you instead start by trying to understand the term "preserved", you find that it is defined only as part of the term "preserved white space", wherein the default meaning of "white space" is "document white space characters", which consists of U+0020, U+0009, and segment breaks; so "preserved" has no meaning when applied to other characters like U+2028.

Fixing this is easy; delete the confusing term and simplify the bullet point to:

Regardless of the white-space value, Unicode characters with the mandatory break property (BK) must be treated as forced line breaks. This includes U+000C, U+2028, and U+2029 [UAX14].

(I am omitting VT and NEL here because UAX#14 says "implementations are not required to support the VT character" and "implementations are not required to support the NEL character".)

@zestyping

zestyping commented Jan 29, 2022

@xfq Thank you for filing https://bugs.webkit.org/show_bug.cgi?id=235753 !

@frivoal
Collaborator

frivoal commented Jan 31, 2022

Can this be taken as an official statement on the WG's intended interpretation of LS? I would be delighted to know that treating U+2028 as a forced line break is already the behaviour that CSS Text 3 intends to specify!

I'd agree with that interpretation. css-text-3 states that:

[…] for the BK and NL Unicode line breaking classes must be honored. [UAX14]

UAX14 states that U+2028 has the non-tailorable BK class, and that “The text after [it] starts at the beginning of the line”.

There's a level of indirection, which may make it non-obvious on a casual read, but I think it's unambiguous that this is the expected behavior.

CSS Text 3 mentions many other relevant characters by code point (such as U+000A, U+0020, etc.) and name (CARRIAGE RETURN, IDEOGRAPHIC SPACE, etc.). Yet U+2028 is never mentioned anywhere in the entire spec

css-text-3 mentions those characters where special css-specific processing going beyond (or against) Unicode is needed. For the rest, as stated in 1.5, “CSS is built on Unicode. UAs […] must adhere to all normative requirements of the Unicode Core Standard, except where explicitly overridden by CSS.” So css-text-3 cannot be implemented correctly without referencing Unicode (and in particular UAX14), which in the case of U+2028, gives us a definitive normative answer.

That said, if an editorial change can make this clearer, I'd be happy to take that on.

Fixing this is easy; delete the confusing term and simplify the bullet point to:

Regardless of the white-space value, Unicode characters with the mandatory break property (BK) must be treated as forced line breaks. This includes U+000C, U+2028, and U+2029. [UAX14]

I don't think this quite works. That covers the BK class, but leaves off preserved segment breaks (U+000A).

Also

I am omitting VT and NEL here because UAX#14 says "implementations are not required to support…

I am interpreting css-text-3 to be going beyond Unicode here, removing the optionality, and adding a requirement that this be supported for the sake of interoperability, so I'd rather keep it.

How about

Preserved segment breaks, and—regardless of the white-space value—any Unicode character with the BK or NL line breaking class, must be treated as forced line breaks. [UAX14]
Note: As of Unicode 14, the BK and NL classes include U+000B, U+000C, U+0085, U+2028, and U+2029.
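As a rough illustration of the proposed wording, the following Python sketch splits a string at the five code points listed in the note. It is not how any browser implements line breaking: the hard-coded set is an assumption taken from the note (BK and NL classes as of Unicode 14), the function name `split_at_forced_breaks` is hypothetical, and preserved segment breaks such as U+000A are deliberately left out because they depend on the white-space value.

```python
import re
from typing import List

# Assumption: BK and NL class members as listed in the note (Unicode 14).
FORCED_BREAK_CHARS = "\u000B\u000C\u0085\u2028\u2029"

# Character class that matches any single forced break character.
_FORCED_BREAK_RE = re.compile("[" + FORCED_BREAK_CHARS + "]")


def split_at_forced_breaks(text: str) -> List[str]:
    """Hypothetical helper: return the lines obtained by breaking at every BK/NL character."""
    return _FORCED_BREAK_RE.split(text)


if __name__ == "__main__":
    for line in split_at_forced_breaks("one\u2028two\u0085three"):
        print(line)
```

A real implementation would look the line breaking class up in the Unicode data instead of hard-coding the list, so that it tracks later Unicode versions.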

@fantasai fantasai reopened this Jan 31, 2022
@zestyping

zestyping commented Feb 4, 2022

@frivoal That looks great! I agree with your reasoning. Thank you for the careful review and clarification.

@frivoal
Collaborator

frivoal commented Feb 7, 2022

@fantasai does the proposal at the bottom of #6992 (comment) look reasonable to you, or do you think I missed something?

@zestyping

@fantasai I see that the first sentence of @frivoal's suggestion made it into https://www.w3.org/TR/css-text-4/:

Regardless of the white-space value, lines always break at each preserved forced break character: thus for all values, line-breaking behavior defined for the BK and NL Unicode line breaking classes must be honored. [UAX14]

but not the second sentence:

Note: As of Unicode 14, the BK and NL classes include U+000B, U+000C, U+0085, U+2028, and U+2029.

Any particular reason why this should not be included? I realize these code points are implied by reference to UAX14, but it seems nice to be explicit, especially given that plenty of other code points are mentioned by number in this draft.

@fantasai
Collaborator

@zestyping As noted in #6992 (comment), that sentence was always there: https://www.w3.org/TR/css-text-3/#line-break-details

@fantasai
Collaborator

Updated the specs to use Florian's rephrasing. As for a note listing all the individual codepoints... I think it's better to just make sure there's testcases in WPT.

@frivoal frivoal added the Tested Memory aid - issue has WPT tests label Dec 29, 2022
frivoal added a commit to web-platform-tests/wpt that referenced this issue Dec 29, 2022
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Jan 5, 2023
…rs creating line breaks, a=testonly

Automatic update from web-platform-tests
Add tests for BK and NL Unicode characters creating line breaks

See w3c/csswg-drafts#6992

--

wpt-commits: a8ee96901b9eabf3876d38d3328bf1320b115ca6
wpt-pr: 37696
jamienicol pushed a commit to jamienicol/gecko that referenced this issue Jan 13, 2023
…rs creating line breaks, a=testonly