[css-text-4] Add support for content-detection, Bunsetsu- (the smallest unit of words that sounds natural) or phrases-based line breaking #6730
Comments
I think that https://www.w3.org/TR/css-text-4/#example-avoid If it is possible to automate finding phrase boundaries, perhaps the right layer to apply it is to semantic markup. |
Hi Alan, I'm not sure what you're suggesting here. Are you suggesting that the UA can simulate semantic markup changes in the HTML via an automated process? |
No, I was thinking of an authoring step that would do that. Do you have a particular “library that tries to determine phrase boundaries in text” in mind? |
Yes. There are some libraries under development that can do semantic detection like this. |
@chrishtr Can you have a look at https://drafts.csswg.org/css-text-4/#word-boundaries, and more specifically at https://drafts.csswg.org/css-text-4/#word-boundary-detection Even though it uses the notion of "word" while you're speaking about phrases, it seems to me that these are very closely related, and aiming for the same (or at least overlapping) use cases. |
Oops, sorry, I just saw that you did mention it. It seems that the main restriction is that you don't want to specify the language. If that can be made to work reliably, a language-agnostic value could be added to that property. |
Anyway, to me, it seems we need more iteration on https://drafts.csswg.org/css-text-4/#word-boundaries rather than a completely separate thing, as there's significant overlap between what that's trying to achieve and what you're proposing. |
@frivoal iteration on an existing property would be fine. As you mentioned, I think it's necessary not to specify the language (plus the additional implications I mentioned in my original comment way above), and also likely to have the fallback semantics I mentioned. |
JFYI, in the case of Korean, 'word-break: keep-all' almost works for the use cases described in the proposal, because Korean does use spaces between words/phrases, unlike Chinese and Japanese. 'Almost' (not completely) because compound nouns ("concatenation" of multiple nouns without inter-word space) wouldn't have line-breaking opportunities with 'word-break: keep-all', since 'word-break: keep-all' is (almost) entirely space-based. In addition to the use cases enumerated in the proposal, web page authors may want to use this type of line breaking for ragged paragraphs as opposed to justified ones. |
First thoughts, a small correction and then a question or two.
The table should read:
Note that, linguistically, the topic particle actually describes the whole phrase '私の名前', not just 名前. So we should probably define clearly what we mean by 'phrase'. My initial suspicion is that this is actually only relevant to Japanese, and aims to prevent particles from wrapping without the preceding word. I think that in most languages attached suffixes are not separated from the word, and spaces are used around both, as mentioned for Korean. (Mongolian has gaps between some words and suffixes, but these are created by dedicated characters such as NNBSP or MVS.) I'm curious to understand the application for Thai, which i thought doesn't have particles of this kind, and where line break opportunities are generally indicated by heuristics that divide words, or by use of ZWSP. Do you have examples of where Thai needs help? Do you also have examples of Chinese needing to keep together things that are associated with an adjoining word in this way? Since this mentions non-CJK languages, is there an idea that languages that separate words with spaces will also need this option? I find myself wondering whether the issue at hand is rather how word-boundary detection works, and whether instead we should define a property for that. Note, for example, that if you double-click on 名前は the browser usually highlights the compound noun and the particle separately. However, perhaps one could define a property that tells the browser to keep nouns and particles together as a single 'word' unit. That kind of instruction may be more widely useful than just for line-breaking, eg. it may change the 'word' selection behaviour too. |
@r12a: Thanks for the feedback.
If we were to define it, I'd like to suggest that the definition of the "Bunsetsu" ("phrase" in Japanese) from Wikipedia+Google Translate:
Since it's about "natural" line breaking, which is ambiguous, I think both are correct. Multiple Japanese organizations publish different guidelines, and they may produce different results for the same text. I think it is similar to how different organizations may disagree on whether a word is a noun or a compound noun.
We're still learning them, sorry, not enough details yet, but we hear Thai and Chinese think the current line breaking is sometimes "unnatural" and want a more natural one. For example, if you look at the source HTML of the Apple Thai page, you will find <span class="nowrap">ในคอลเลกชั่น</span> Black Unity ใหม่ <wbr><span class="nowrap">ได้รับแรงบันดาลใจ</span>
We're still learning them too, but examples from English Apple Card page: For Apple Card eligibility requirements
Get started<br /> with Apple Card. It looks like the page author thinks not breaking after "For" or "with", and before product name, is more natural.
Do you want to select "with Apple Card" as one word? |
For Russian would you want to keep certain prepositions like «с» (with) together with the word following using this setting? Also +1 to @r12a ’s comment about there being different ways to break Japanese phrases than just linguistic analysis of the parts. It would seem there should be levels of grouping allowed similar to Kinsoku levels of “weak” and “strong” to cover this. |
Thank you for the feedback. I know nothing about Russian, but if it looks more "natural" in Russian not to break there, I think it should not break. The intention of the feature is "natural" line breaking, so I think the results are likely to vary by language, or by engine.
That's an interesting idea, thank you. We can allow the phrase-based line breaking engine to use the |
(Disclaimer: although I speak Chinese, I am not familiar with Chinese grammar or the style rules of Chinese publishers. These are just my personal experiences and thoughts.) @r12a said:
One such example would be the classifier and the preceding numeral, e.g., in 三双筷子 there should be no line break between these 三 and 双. I think this is true even when the classifier is Western text (like "m" instead of 米 for metre). Although there may be spacing between Chinese and Western text, there should be no line breaks. There are similar examples in Japanese, like シャツ三枚. Another example is the perfective-aspect le (了), which immediately follows the verb. IMHO when used as an aspect marker, le and the preceding verb should not be broken into two lines. Back to the original discussion, I think whether the result of line breaking is satisfactory is a subjective question, and it is difficult to have a method that works in all cases. We'd better provide a not-too-bad default value and allow developers to customize it (by switching strictness profiles/levels or changing phrase-based line breaking to word-based and modifying it by adding things like |
In case it produces a faster response than the email i sent to the CSS WG list, let me mention here that the link to https://drafts.csswg.org/css-text-4/#word-boundary-detection has not been working for some time, and i'm unable to read the examples in the text by looking directly at the .bs file. This is making it difficult for me to formulate suggestions for this issue. Can someone reading this fix the link? |
I'm not quite sure whether there is a similar processor or not, but for Japanese, MeCab or some similar processor will mark |
Here are some more questions that occurred to me while thinking this through.
(Btw, fwiw, no-one has mentioned it yet, and i don't remember seeing it in the css-text-4 spec, but if you want to do this kind of thing manually then U+2060 WORD JOINER is your friend. (Does the opposite of ZWSP/ |
I think the title and the original description were misleading, sorry about that. The actual intention of this issue is to support "natural" line breaking. From our point of view, handling particles as part of a phrase is a Japanese example that explains what the feature wants to achieve. It may also include handling compound nouns for Japanese/Chinese, or handling "ใหม่" (new) as part of a phrase in Thai.
One possible way to implement this is to use ML, as done in BudouX (you can play with it via an extension). Currently it supports Japanese only, but its basic idea is applicable to any language.
Thanks, it is indeed helpful. I hope CSS can support better ways than inserting WORD JOINER at every break opportunity, but until then, we can use the workaround. |
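To make the WORD JOINER workaround concrete, here is a minimal Python sketch. It assumes the phrase segmentation is already available as a list of strings; the `glue_phrases` name and its `phrases` argument are hypothetical and not part of any library mentioned in this thread. It inserts U+2060 between the characters of each phrase, so the only remaining break opportunities fall between phrases.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: prohibits a line break at its position


def glue_phrases(phrases):
    """Join phrases into one string, gluing the characters inside each phrase.

    `phrases` is assumed to come from some external segmenter; this
    function does no linguistic analysis itself. It works per code point,
    so it does not try to keep multi-code-point grapheme clusters intact.
    """
    return "".join(WORD_JOINER.join(phrase) for phrase in phrases)


# Three phrases: no joiner is placed between them, so a line may still
# break after "私の" and after "名前は".
glued = glue_phrases(["私の", "名前は", "中野です"])
```

The joiners become part of the text itself (they are copied and searched along with it), which is one reason a CSS-level switch would be preferable to this kind of preprocessing.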
Yes, i get that, thanks. Perhaps we should change the issue title ?
Unless there are user preferences for whether or not Thai compounds are split/broken as a general rule, i worry that that is in the territory of in/correct segmentation, rather than natural segmentation, if you see what i mean. [I added clreq and sealreq (SE Asia) labels to the issue, so that those folks will see it.] |
Is the intention to treat natural line breaking as a separate set of controls from those used for kinsoku-like rules (punctuation wrapping) and the strict|normal|loose controls for controlling line-breaking around small kana? |
My personal read on this issue is that there is a lot more research and development to be done here, and that it's premature to build this into CSS. If we just added a value to "turn this on", each implementation will break substantially differently as it tries to find what "natural line breaking" is for any given language, and we'd end up locked into one particular algorithm as Web compat builds on whatever implementations came first, regardless of what is actually more "natural". And minority languages will suffer the worst mistakes. Line-breaking has significant impact on layout, especially if we're working with higher-level, and therefore larger, constructs. Non-interop across implementations or across time can create real breakage. Instead, I'd like to see the Yes, this is more heavyweight on a given page than a native browser implementation. But it avoids locking us into compat restrictions that prevent any improvement in the feature once deployed. For something that depends on linguistic analysis, which is full of constantly improving heuristics, I think it's important not to get locked in. It's OK if a page chooses a library that it deems good enough. It's a problem if the browser chooses a library that, in the context of all the world's content, is not good enough. CC @litherum |
A few thoughts:
|
Thanks for the feedback again and sorry for my belated replies.
Yes. Authors want to control the strength of Kinsoku-rules separately from this feature.
Fully agree with you.
Not only for performance, but this should be authors' choice. For example, the default line breaking of Japanese TeX is "balanced" with normal break opportunities (every character except where the Kinsoku rules apply.) This is because ragged right is rather a large penalty for CJK line breaking. Authors normally prefer less ragged-right lines over phrase-based break opportunities for body text, but may prefer phrase-based line breaking for display text. They may want to use "balanced" line breaking for both cases.
ICU 71 supports Japanese phrase-based line breaking with a new value for the The original BudouX supports Python and JavaScript only, but it was ported to Swift, Go, and Rust. Android 13 supports wrapping text by Bunsetsu (the smallest unit of words that sounds natural) or phrases.
Greedy vs paragraph-level-balanced is a related topic, but they should be set separately, at least for some languages such as Japanese. I'm not sure other languages, such as English, always want to turn on/off both switches together.
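One way to see why these are separate switches: the set of break opportunities (for example, phrase boundaries) is one input, and the strategy for choosing among them is another. The Python sketch below shows a "balanced" strategy over a given list of unbreakable units. It is only an illustration under simplifying assumptions: width is counted in characters, every line (including the last) is penalized by its squared slack, and the `balanced_lines` name is hypothetical.

```python
def balanced_lines(units, max_chars):
    """Choose breaks between units that minimize the total squared slack.

    `units` are unbreakable chunks (for example phrases) and `max_chars`
    is the line width in characters; both are illustrative assumptions.
    """
    n = len(units)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of laying out units[i:]
    next_break = [n] * (n + 1)       # next_break[i]: end of the line that starts at i
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        length = 0
        for j in range(i, n):
            length += len(units[j])
            if length > max_chars:
                break  # units[i..j] no longer fit on one line
            cost = (max_chars - length) ** 2 + best[j + 1]
            if cost < best[i]:
                best[i], next_break[i] = cost, j + 1
    if best[0] == float("inf"):
        raise ValueError("some unit is wider than the line")
    lines, i = [], 0
    while i < n:
        lines.append("".join(units[i:next_break[i]]))
        i = next_break[i]
    return lines


lines = balanced_lines(["aaa", "bb", "cc", "ddddd"], 6)
```

With these inputs the balanced result is `aaa`, `bbcc`, `ddddd`, whereas a greedy first-fit pass over the same units and width would give `aaabb`, `cc`, `ddddd`. The break opportunities are identical in both cases; only the selection strategy differs.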
Excellent point, thank you for pointing this out. I believe it should apply too, as the Apple web site does and as I replied to @r12a above, but we are not sure how exactly it should work yet. We know it's applicable for Japanese. We have some good ideas for Chinese, and some rough ideas for Thai and English. I think, at this moment, "it might apply to other languages" is fine for defining a property in CSS. It's similar to how CSS defines a CJK "word" today; sometimes a compound noun is a word, sometimes it's multiple words, and this varies depending on the dictionaries, the era, or what authors feel is more "natural".
Agreed. Also, for pre-processors like BudouX, wrapping each phrase in a span is complicated work. For example: <div>Phra<span style="border: 1px solid blue">se1 Phra</span>se2</div> It's not easy to wrap "Phrase1" and "Phrase2" each in a span. Maybe one can adjust borders, but there is more: background-image, filter, etc. It'd be great if this became easier for pre-processors. |
Done, happy to hear if there are any better suggestions.
Thai is still at an early stage; we may conclude that it's not possible to create general rules for Thai. But we hear desires to improve line breaking for display text from multiple Thai authors, so I think it's worth investigating further. Does this match what you meant? |
From our point of view, what this issue is asking for is:
Can I ask whether you're against the current Word Boundaries section in CSS Text, or whether you feel the current "word boundaries (or phrase boundaries)" defined in Word Boundaries are solid but the phrase boundaries in this issue are premature? If the latter, can I ask how you see these two differently? |
FYI, BudouX now supports Simplified Chinese. @r12a |
This from the ICU 71 release note:
|
🆒 |
It is extremely difficult to precisely define what Bunsetsu is in the Japanese language. Moreover, I know that a major Japanese publisher failed to apply its in-house definition consistently when preparing its textbooks. (And nobody cares.) I also think that no matter what definition we provide, different dictionaries will lead to different results. I hope that a rough consensus of Bunsetsu is achieved and applied to every relevant feature of CSS. |
There are two major styles of space-separated Japanese writing. One is based on bunsetsu. The other is based on words (this is not actually correct, but I do not want to go into details here). For example, the bunsetsu style gives 私の 名前は 中野です while the other gives 私 の 名前 は 中野 です. One of my dyslexic colleagues found it difficult to read the former when he was an elementary school student. However, is this a common problem? Based on a Japanese government research fund, Okumura-sensei of Osaka Medical and Pharmaceutical University and I have conducted a series of experiments for three years. More than ten students in the Learning Disability Centre of this university participated in these experiments. Up to now, we have no reason to believe that the second style is significantly better than the first style for any of the students. |
Some practitioners in the Japan DAISY Consortium have tried BudouX. They welcome it wholeheartedly and are trying to use it in Japanese DAISY textbooks. |
I don't speak Chinese, but I know a tiny bit of classical Chinese (think of it as the Latin of East Asia) and can come up with a few examples.
|
There is also a need for grammar-based line breaking in English - see for example the BBC Subtitle Guidelines' section on breaking at natural points, which requires manual line breaks to be inserted based on grammatical rules. This has been a subtitle/caption authoring practice in the UK for decades. If there were a good way to move this to the rendering domain that would have accessibility benefits, for example by avoiding the need for authors to insert explicit line breaks, so that the text is easy to read regardless of how many lines it flows onto. This is not the same thing as |
Closing, as everything this issue needed was resolved at #7193 (comment) |
Proposal
Add a CSS property that provides a way for developers to specify that they would like to use a phrase-based, content detection-based algorithm for line breaking. Implementations would use this CSS property to trigger use of a library that tries to determine phrase boundaries in text and break lines accordingly.
Example (hopefully I got this right, I don't speak Japanese):
A phrase often consists of multiple words. The following Japanese example consists of 6 words, but has 3 phrases.
Phrase-based line breaking is often desired for headline-type text: text in a graphic display context, usually at large sizes, such as titles, headings, billboards, or advertisement graphics, especially in languages such as CJK or Thai.
In some use cases such as accessibility content for children, phrase-based line breaking is also useful in a reading context at regular body text sizes.
Design constraints I know of:
(Note: There may be a use case for a developer overriding this fallback path, e.g. by specifying “word phrase none” as the mode, meaning word line breaking is preferred, falling back to phrase, and then to character.)
When the CSS property puts the UA into a mode that allows phrase-based line breaking, the UA may ignore keep-all and break-all.
The CSS property to enable phrase-based line breaking must not specify a specific language. (+)
(+) Compare with the word-boundary-detection property, which currently requires a language when using the auto keyword. word-boundary-detection has this restriction because it is paired tightly with keep-all, whereas the phrase-based feature is not.
Existing support for word-based line breaking does not quite meet these requirements.
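The fallback path mentioned in the note above (phrase first, then character) can be sketched roughly as follows in Python. This is an illustration only, not a proposed algorithm: it assumes the phrases are already known, measures width in characters, fills lines greedily, and uses the hypothetical name `break_lines`.

```python
def break_lines(phrases, max_chars):
    """Greedy line filling that prefers phrase boundaries.

    A line breaks between phrases whenever possible. Only when a single
    phrase is wider than the line does it fall back to breaking between
    characters. `phrases` and `max_chars` are illustrative assumptions.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    lines = []
    current = ""
    for phrase in phrases:
        if len(current) + len(phrase) <= max_chars:
            current += phrase  # the whole phrase fits on the current line
            continue
        if current:
            lines.append(current)  # break at the phrase boundary
            current = ""
        # Fallback: the phrase alone is too wide, so break inside it.
        while len(phrase) > max_chars:
            lines.append(phrase[:max_chars])
            phrase = phrase[max_chars:]
        current = phrase
    if current:
        lines.append(current)
    return lines


wide = break_lines(["私の", "名前は", "中野です"], 5)
narrow = break_lines(["私の", "名前は", "中野です"], 3)
```

At a width of 5 every break lands on a phrase boundary; at a width of 3 the four-character phrase 中野です cannot fit, so the character-level fallback splits it.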