[css-text] Line breaking for ambiguous characters; e.g., U+2010, U+2013 #4419

Closed
kojiishi opened this issue Oct 15, 2019 · 12 comments
Labels
Closed Accepted by CSSWG Resolution css-text-3 Current Work i18n-clreq Chinese language enablement i18n-jlreq Japanese language enablement i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. Tested Memory aid - issue has WPT tests Tracked in DoC

Comments

@kojiishi
Contributor

@litherum found that Gecko handles U+2010 very nicely, and I'd like to consider using their idea.

Currently, the line-break property requires:

The following breaks are allowed for normal and loose line breaking if the writing system is Chinese or Japanese, and are otherwise forbidden:
breaks before hyphens:
‐ U+2010, – U+2013, 〜 U+301C, ゠ U+30A0

U+2010 and U+2013 are unified code points, so this rule may affect English words in an undesired way. I'm not sure whether this is intentional, but Gecko allows these breaks only when the hyphens follow Japanese characters, and not when they follow Latin letters, regardless of the content language.

jsbin test

It looks to me that this is a very good idea. Maybe not applicable to all cases, but at least these two code points a) are unified and ambiguous, and b) prohibit break before, so looking at the previous character makes sense to me.
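
One way to see why U+2010 and U+2013 stand apart from the other two hyphens is to compare their East Asian Width property. The following is a minimal sketch using Python's standard `unicodedata` module; the `HYPHENS` list and the `describe` helper are illustrative names, not anything taken from the spec or from Gecko.

```python
import unicodedata

# The four code points listed in the spec text quoted above.
HYPHENS = ["\u2010", "\u2013", "\u301C", "\u30A0"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, East Asian Width) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in HYPHENS:
    print(*describe(ch))
```

With the Unicode data shipped in current Python releases, U+2010 and U+2013 report `A` (ambiguous) while U+301C and U+30A0 report `W` (wide), which fits the observation that fonts disagree about the glyphs for the first two.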

Note that the jsbin test renders U+2010 and U+2013 in common CJK fonts. It looks like fonts disagree on which code points have a full-width CJK glyph and which have a Latin glyph.

Thoughts?

/cc @fantasai @frivoal @emilio @jfkthame @drott

@kojiishi kojiishi changed the title [css-text] [css-text] Line breaking for ambiguous characters; e.g., U+2010, U+2013 Oct 18, 2019
@kojiishi kojiishi added css-text-3 Current Work i18n-jlreq Japanese language enablement i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Oct 18, 2019
@kojiishi
Contributor Author

@himorin can JLTF discuss this?

Maybe another wild idea is to just remove these rules. U+2010 and U+2013 are not commonly used code points IIUC. I'm wondering whether having different behavior for these code points for lang=ja is a net plus or not.

@hftf

hftf commented Oct 24, 2019

U+2010 and U+2013 are not commonly used code points IIUC.

Just to clarify, are you stating that U+2013 EN DASH is not commonly used, or only not commonly used in a specific context (Japanese text)? Because it is certainly one of the most frequently used characters in the General Punctuation block (cf. https://stackoverflow.com/a/5575000/1057509).

Marginally related issue about U+2010 HYPHEN and line breaking: #3434

@kojiishi
Contributor Author

Thank you for pointing that out, I meant the latter, in Japanese context.

@fantasai
Collaborator

fantasai commented Nov 6, 2019

@kojiishi EN DASH is frequently used for numeric ranges, e.g. 7–11. I think numbers are often used in Japanese, what would you expect to happen there?

@kojiishi
Contributor Author

kojiishi commented Nov 6, 2019

@fantasai After discussing with Kobayashi-sensei, I need your help. Do you remember why we allow breaking before these code points for normal?

Kobayashi-sensei thinks these are rarely used code points in Japanese, but he feels it is more natural to prohibit breaking before them. Even if there were cases or reasons where we want to break before them, changing the rule would be fine if allowing the break has side effects.

I checked the JLREQ line break table and found that it prohibits breaks before cl-03. I thought normal was a copy of the JLREQ rules, but maybe we tweaked it for reasons I don't remember.

I think we can remove normal from this rule. Actually, other than this one, normal matches UAX#14. How about removing all rules for normal and deferring to UAX#14? UAX#14 doesn't have strict / loose, so we can keep those, but the rules for normal do not seem to be necessary if we can match these 4 code points to UAX#14.

@kojiishi
Contributor Author

kojiishi commented Nov 6, 2019

@kojiishi EN DASH is frequently used for numeric ranges, e.g. 7–11. I think numbers are often used in Japanese, what would you expect to happen there?

U+2013 was added to JIS in 2013, it wasn't used before. I think most Japanese use U+002D HYPHEN-MINUS for numeric ranges, or its full-width counterpart if digits are in full-width.
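
The "full-width counterpart" mentioned here is U+FF0D FULLWIDTH HYPHEN-MINUS. As a small illustration, assuming NFKC compatibility normalization is an acceptable way to show the relationship, the sketch below folds a full-width range back to ASCII. The `to_halfwidth` helper and the sample string are hypothetical, chosen only for this example.

```python
import unicodedata

HYPHEN_MINUS = "\u002D"
FULLWIDTH_HYPHEN_MINUS = "\uFF0D"


def to_halfwidth(text: str) -> str:
    """Fold full-width compatibility forms to their ordinary counterparts via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Full-width digits 7, 1, 1 around U+FF0D, as in a range written with full-width digits.
fullwidth_range = "\uFF17\uFF0D\uFF11\uFF11"
print(to_halfwidth(fullwidth_range))
```

Note that NFKC folds many other compatibility characters as well, so this is a way to inspect the mapping rather than a recommendation to normalize content.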

@frivoal
Collaborator

frivoal commented Nov 7, 2019

I'd say that for ranges, Japanese people would often write 7〜11, rather than use a dash or hyphen of some kind.

@xfq xfq added the i18n-clreq Chinese language enablement label Nov 7, 2019
@himorin
Contributor

himorin commented Nov 13, 2019

Just for the record:
this section was initially introduced by commit 0b1a55a, to have a developed list of line-breaking rules.

Also, in commit 8d2b106, this text changed from

Following breaks be forbidden in strict line breaking and allowed in normal:

  • breaks before the hyphens (U+2010, U+2013, U+301C, U+30A0)

to

Additionally, if the language is known to be Chinese or Japanese, breaks
before hyphens (U+2010, U+2013, U+301C, U+30A0) may be allowed in
normal’.

@himorin
Contributor

himorin commented Nov 13, 2019

@kojiishi
Contributor Author

Just for the record:
this section was initially introduced by commit 0b1a55a, to have a developed list of line-breaking rules.

Oh, thank you. It looks like this was just an error. I guess the intention was to make sure they are prohibited for strict, not to allow them for normal.

@fantasai
Collaborator

Just discussed with @frivoal @kojiishi @jfkthame, conclusion is:

  • Disallow before hyphen in normal and strict.
  • Allow break between ID and hyphen in loose. This means Kanji+Hyphen breaks; and Alphabetic+Hyphen doesn't break, unless word-break: break-all makes Alphabetic behave like ID.
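
The conclusion above can be sketched as a small decision function. This is only an illustration under stated assumptions: `is_id_like` is a rough, hypothetical stand-in for the UAX #14 ID class (it covers kana letters and CJK Unified Ideographs only), `str.isalpha()` approximates "Alphabetic", and the function and parameter names are invented for this example rather than taken from any browser engine.

```python
# The four hyphens named in the spec rule under discussion.
HYPHENS = frozenset("\u2010\u2013\u301C\u30A0")


def is_id_like(ch: str) -> bool:
    """Rough stand-in for UAX #14 class ID: kana letters and CJK unified ideographs."""
    cp = ord(ch)
    return (
        0x3041 <= cp <= 0x3096      # Hiragana letters
        or 0x30A1 <= cp <= 0x30FA   # Katakana letters
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
    )


def break_before_hyphen(prev: str, hyphen: str,
                        line_break: str = "normal",
                        word_break: str = "normal") -> bool:
    """Return True if a break is allowed between `prev` and `hyphen`."""
    if hyphen not in HYPHENS:
        raise ValueError("not one of the four hyphens discussed here")
    if line_break not in ("normal", "strict", "loose"):
        raise ValueError("only normal, strict and loose are modelled")
    if line_break != "loose":
        # normal and strict: never break before these hyphens.
        return False
    if is_id_like(prev):
        # loose: Kanji+Hyphen (ID followed by hyphen) may break.
        return True
    # loose: Alphabetic+Hyphen breaks only if break-all makes it behave like ID.
    return word_break == "break-all" and prev.isalpha()
```

For instance, `break_before_hyphen("漢", "\u2010", "loose")` allows the break, while `break_before_hyphen("a", "\u2013", "loose")` does not unless `word_break="break-all"` is passed.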

@css-meeting-bot
Member

The CSS Working Group just discussed [css-text] Line breaking for ambiguous characters; e.g., U+2010, U+2013, and agreed to the following:

  • RESOLVED: Adopt the suggestion in https://github.com/w3c/csswg-drafts/issues/4419#issuecomment-577700150
  • RESOLVED: Disallow before hyphen in normal and strict. Allow break between ID and hyphen in loose. This means Kanji+Hyphen breaks; and Alphabetic+Hyphen doesn't break, unless word-break: break-all makes Alphabetic behave like ID.
The full IRC log of that discussion
<Rossen__> Topic: [css-text] Line breaking for ambiguous characters; e.g., U+2010, U+2013
<Rossen__> github: https://github.com/w3c/csswg-drafts/issues/4419
<fantasai> ScribeNick: fantasai
<fantasai> ScribeNick: emilio
<emilio> koji: the current CSS spec says that if the language is japanese and line-break: normal there should be a break opportunity before 2010 and 2013
<emilio> ... it can break strangely for english words within japanese text
<emilio> ... gecko fixed it by not breaking if the previous character is a latin character
<emilio> ... but I want to fix this in the spec
<emilio> ... and make sure all browsers agree
<emilio> fantasai: we got together yesterday and concluded in all langs you want to disallow breaks before hyphens in normal breaking mode but japanese wants to allow it in loose mode
<fantasai> https://github.com/w3c/csswg-drafts/issues/4419#issuecomment-577700150
<emilio> ... so word-break break-all would allow between the latin letter and the hyphen
<Rossen__> q?
<emilio> ... so that's the solution outlined in the last comment (above)
<emilio> myles: are we going to contact ICU
<emilio> koji: if we agree I'll do
<emilio> florian: I support this
<myles> s/ICU/ICU and CLDR?
<myles> s/ICU/ICU and CLDR?/
<emilio> Rossen__: objections?
<emilio> RESOLVED: Adopt the suggestion in https://github.com/w3c/csswg-drafts/issues/4419#issuecomment-577700150
<emilio> RESOLVED: Disallow before hyphen in normal and strict. Allow break between ID and hyphen in loose. This means Kanji+Hyphen breaks; and Alphabetic+Hyphen doesn't break, unless word-break: break-all makes Alphabetic behave like ID.

@frivoal frivoal closed this as completed in 2cfcf43 Feb 6, 2020
frivoal added a commit to web-platform-tests/wpt that referenced this issue Feb 6, 2020
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Feb 15, 2020
…s-wg resolution, a=testonly

Automatic update from web-platform-tests
[css-text] Adjust test cases to match css-wg resolution

See w3c/csswg-drafts#4419

--

wpt-commits: d2767c04559c016e04ad43fcc07f63f1153d18bf
wpt-pr: 21626
@frivoal frivoal added Tested Memory aid - issue has WPT tests and removed Needs Edits Needs Testcase (WPT) labels Apr 2, 2020