[css-text] Need additional value of word-break for Korean #4285
Comments
Sounds like |
Well, we already have an "auto" value, which is actually called |
Does it work for Hangul/Hanja mixed-content? cc @jungshik |
Is there a reason to give Korean a special value here, versus just changing the behavior of "normal" to better reflect current Korean writing practice? |
Depends what you mean by work. It would break between the Hanja, and not between the Hangul. This is not the ideal behavior, which would also keep Hanja of a single word together, but:
Korean writing/typography culture is undergoing a transition. The Also, it is likely (I don't have data, but it is logically probable) that there are websites out there that would break if we changed the default:
|
I'm curious about the reason for separating Korean from Chinese and Japanese. In my experience, in the text editor or web page for Korean |
@jihyerish The reason why some people want that behavior, is that Korean (nowadays) uses spaces between words, but Chinese and Japanese don't. Breaking words the same in all 3 languages is the traditional way to do things, and should continue to exist (and to be the default). However, since Korean does have spaces, doing line breaking in Korean the same as in English is also something (some) people want. |
@frivoal I don't think that 'keep-all' is increasingly common. Nor do I buy your reasoning that the use of inter-word spaces is the cause for preferring 'keep-all'. The vast majority of Korean text in books, newspapers, and magazines (when the correct typographic standard is adopted) uses 'break-all', period. Some ill-typeset documents (especially those made in the 1990s by poorly i18n'ized DTP software) may use 'keep-all', but that's an aberration! Modern Korean orthography has always dictated the use of inter-word spaces (for over 80 years at minimum). Yet breaking at the syllable boundary has been the norm for paragraphs. Let me tell you what Korean web authors did in the mid-1990s when Netscape 1.x didn't do the right thing with Korean line breaking. They wrote a script to insert keep-all does have its use. keep-all is preferred for Korean when the corresponding English text does NOT want hyphenation. That is, multi-line titles (song, movie, book, article), multi-line ad copies, etc. However, they're exceptions rather than the norm. Changing 'word-break: normal' to behave like 'word-break: keep-all' for Korean is akin to winding the clock back to 1994 (Netscape 1.x behavior). |
One more reason 'keep-all' does not work for Korean is that some Koreans tend to be very fond of German style mega-compound words. So, instead of writing 'Korea University College of Natural Science Department of Physics' (한국대학교 자연 과학 대학 물리학과), they write 'KoreaUniversityCollegeOfNaturalScienceDepartmentOfPhysics' (한국대학교자연과학대학물리학과). I am not a fan of these mega-compound words at all, but a lot of Koreans do use them to my chagrin. What would happen to those mega-compound words with 'keep-all'? |
Note also that Chinese and Japanese do NOT want line-breaking at arbitrary character boundaries in the above cases either. They also want line-breaking at word boundaries 'plus alpha'. 'Plus alpha' is for keeping 'particles' and 'non-content-bearing words' together with their content-bearing counterparts. For instance, even though 'わさだ だいがく の がくせい' can be broken into 3 words, the 2nd word (の ; 'of') has to be kept together with the first word in 'titles', 'ad copies', etc. わさだ だいがく Because CSS does not support this use case (it requires PoS tagging), Google has a library for this use case. See Note that this is not for regular paragraphs but for multi-line titles, etc. |
Another way of saying what I wrote above is that 'justified paragraph alignment' has been the norm in Korean typesetting. Justified alignment works best with 'break-all' (break at syllable boundaries). This is similar to English typesetting, where 'justified on both edges' works best with hyphenation (at syllable boundaries) enabled. To have both 'keep-all' (the English equivalent of NO hyphenation) and 'justified alignment', inter-word spacing has to be adjusted (some gaps can be rather large). In CSS, 'text-align: justify' and 'word-break: keep-all' can be used together. There are cases where 'ragged alignment on the right' is preferred and 'keep-all' is necessary. However, they're not for regular paragraphs but for multi-line titles, ad copies, etc. |
@jungshik I want to clarify one thing: I am not proposing to change the default behavior of 'word-break: normal' to behave like 'word-break: keep-all'. You are right that "break-all" is and needs to remain the default. I am proposing to add a value, not change the behavior of one. |
Sorry for misunderstanding your proposal. However, I don't see a strong need for that. 'keep-all' behavior is not preferred by the majority of Korean speakers in those cases (UGC where the language of the content is not known in advance) and most other cases (exceptions were noted above). Think about why |
The CSS Working Group just discussed this issue.
The full IRC log of that discussion:
<dael> Topic: Need additional value of word-break for Korean
<dael> github: https://github.com//issues/4285
<dael> florian: Reminding people: Korean traditionally written like Japanese without spaces. Now use spaces, but line-breaking has not changed where you can break like Japanese
<dael> florian: Some typographers agree in many contexts it's nice to line-break Korean like English. not everyone agrees with that. Discussion in GH shows that.
<dael> florian: We need another value because the existing 'keep-all' only works if you can lang-tag. Do we care about allowing this behavior for Korean that can't be lang tagged? I think we do.
<dael> florian: If you're writing Korean in a text editor or from a database where you don't have language tags it's tricky to tag on the fly. Amount of magic you have to do is really obnoxious.
<dael> florian: Either we say when editing this behavior is impossible or we say for the Korean alphabet you get the normal or we add keep-all-hangul
<dael> myles: putting hangul in the value doesn't make sense when you use lang
<dael> florian: But you can't put lang on contenteditable section because you don't know what will go in there. If they do a mix of languages you can't language tag. Adding spans on the fly depending on what user types is performance-wise terrible.
<dael> myles: Seems wrong level of abstraction. Wish it could be generalized. Worried eventually have 100 different lang specific values.
<dael> florian: Possibly. It's really that there are two normal behaviors so normal can't do the right thing. We need two normals. I don't think there are that many languages that need two normals
<dael> AmeliaBR: That's my concern too. Before settling on language specific keyword do more research to see if more languages have this issue. How much input have you had from general i18n experts beyond Korean use case
<dael> florian: Have not heard of any language. People who would probably know have been involved
<dael> fantasai: I think if there were other languages they would need a keyword. It's separating Hangul from Chinese and Japanese. Most other writing systems don't mix in the same way and not that many that break everywhere like this. I'm not aware of any others that alternate in the same way as Korean
<fantasai> s/keyword/separate keyword/
<tantek> q+ to suggest raising this to i18n here to get broader input from experts in more languages: https://github.com/w3c/i18n-discuss/issues
<dael> jensimmons: I like what florian is proposing. I understand concern on break from purity, but I feel like one thing web didn't do well was support international languages. This is a way web can keep up with evolving graphic design changes. Feels like a way to make sure web supports a culture and its ability to evolve instead of saying it's complicated and we don't know where it's going to go
<dael> chris: Good idea as long as clearly defined what this value does when don't meet Korean text. I think it's a thing we need. If web started in Korea we would have had this from the start
<Rossen_> q?
<dael> florian: If you're not in Hangul you do the same as keep-all
<Rossen_> ack tantek
<Zakim> tantek, you wanted to suggest raising this to i18n here to get broader input from experts in more languages: https://github.com/w3c/i18n-discuss/issues
<dael> tantek: I want to re-raise something from AmeliaBR. AmeliaBR asked how much input we had from general i18n experts. I want to raise that and propose before we resolve we file an issue on i18n discuss to get input from broader experts.
<dael> tantek: florian saying you haven't heard of other languages isn't quite sufficient
<dael> florian: Reaching out to i18n, yes. But to have a modern language that has the exact same behavior so we can name the keyword something else we need a language who by default breaks between every language and want to move away from that and there aren't that many.
<myles> q+
<dael> tantek: I'm saying it shouldn't be dependent on just your expertise. You may be correct, but worth getting that group to take a look.
<dael> myles: Wanted to ask if any thought given on how to impl? Like are there line breaking libraries that impl this behavior?
<dael> florian: This would need to be impl in ICU. ICU seems amenable to this but if we expand ICU would have to expand as well.
<dael> Rossen_: I hear requests to get more from i18n. florian are you okay to do that this week? To get traction or a checkmark to say it's good?
<dael> florian: Yes, I can look into this |
Implementation Concern
The fact that this kind of line-breaking isn't implemented anywhere in any line-breaking utility library is concerning. It's difficult to believe that CSS is the first place where software engineers have ever wanted this line breaking behavior. I'd like to discuss this with the ICU maintainers to get their thoughts about this.

Proposal
Adding a new Hangul-specific keyword seems like the wrong design to solve this problem because the values don't stack. It's unlikely that Korean is the only language with two
Perhaps something like:
word-break: normal customization(Kore, keep-all)
which would mean "
This way, the customizations are stackable: for languages with multiple

Limiting Expressiveness
The intent for this proposal is only to select which of the

Alternative Considered
This proposal could also use unicode blocks instead of unicode scripts, though that would require something like |
+1 to Myles suggestion of considering a more extensible syntax. Maybe worth considering whether an extensible syntax could handle different typographic patterns when it comes to breaks around punctuation, as well. |
This issue is interesting because it only makes sense on content that isn't perfectly language tagged. In the general case, we can't always guarantee that all content will be language tagged perfectly. This has implications for many CSS specs. |
I was under an assumption that CSS WG has a general policy to recommend language tagging. While tagging every word looks a bit too much to me (e.g., English words appearing in Arabic/Japanese text), tagging each document should be reasonable. |
I know of two places that have that. InDesign has the two behaviors, although they're entangled with something else: depending on whether you justify on or, you get something like It also exists in the Bloomberg Terminal. That previously ran on proprietary software, and now runs on a modified browser engine, which has this ability. They use the |
As far as author-supplied text goes, yes. For text that comes from users of the site, I don't think it's practical. See my first comment in this issue for a number of reasons why #4285 (comment) |
I think it's practical for the site to add a few lines of code to scan the text and emit I'm afraid we will lose the reasons to recommend language tagging, because there are cases where it is not possible to add languages without scanning the content. Why do we recommend a page to be tagged as |
@litherum Why introduce something new when we can already use wildcards in the language pseudo-class and rely on “the
:lang(ko,
  und-Hang, mul-Hang, "*-Hang",
  und-Kore, mul-Kore, "*-Kore"
)
{word-break: keep-all;}
If the pseudo-class should only rely on author-supplied, explicit metadata, CSS could introduce a dedicated (highlighting) pseudo-element for writing systems or scripts:
::char(Hang, Kore)
{word-break: keep-all;} |
While this would not be hard to write if you put keep-all on the whole text, it would also not serve the need: multilingual text exists, even in user generated content, and would be broken by this rule. Let's say I want to tweet / email / blog / write into a word-365 document / ... the first sentence of the Japanese wikipedia page about Seoul (https://ja.wikipedia.org/wiki/%E3%82%BD%E3%82%A6%E3%83%AB%E7%89%B9%E5%88%A5%E5%B8%82). This is mostly Japanese text, which would be broken by applying keep-all, but it contains 4 syllables of hangul, which would trigger the kind of script you mentioned. Sure, it isn't appropriate to use this script if you know the content is going to be in Japanese, but the author of the twitter / gmail / wordpress / office 365 / ... doesn't know what the content is going to be when it is content from users. That doesn't mean they can't have an opinion on how text ought to be typeset if it is or contains hangul. On the other hand, it would be hard to write the kind of script that generates a span around the bits that are in Hangul to put keep-all on, and leaves the rest alone. |
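To make the failure mode described above concrete, here is a minimal sketch of the "switch the whole element" approach. The helper names `containsHangul` and `wordBreakForWholeElement` are hypothetical, and the sample sentence is an illustrative string, not the Wikipedia sentence itself. It assumes a JavaScript engine with Unicode property escapes in regular expressions.

```javascript
// Hypothetical helper: true if the text contains at least one Hangul character.
// \p{Script=Hangul} is a Unicode property escape and requires the "u" flag.
function containsHangul(text) {
  return /\p{Script=Hangul}/u.test(text);
}

// Naive policy discussed in the thread: pick ONE word-break value for the
// whole element, based only on whether any Hangul is present.
function wordBreakForWholeElement(text) {
  return containsHangul(text) ? "keep-all" : "normal";
}

// Illustrative sample (an assumption, not a quotation): mostly Japanese text
// that happens to contain one Hangul word.
const mostlyJapanese = "ソウル（서울）は韓国の首都です。";

console.log(wordBreakForWholeElement(mostlyJapanese));        // "keep-all"
console.log(wordBreakForWholeElement("東京は日本の首都です。")); // "normal"
```

The first result is the problem: a couple of Hangul syllables are enough to put keep-all on the entire element, which removes the wrapping opportunities the surrounding Japanese text relies on.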
That is exactly why I recommend scripts to do this kind of work. With script, authors have full control for how I'm not happy but ok to add the value to the spec, but as @litherum said, if we were changing the policy and start saying we don't recommend language tagging for twitter/gmail/wordpress/Office/etc., I think we should apply that policy to all other CSS properties too, such as |
The kind of script that would reliably do what |
Inserting spans is a very common technique on the web. I don't understand why it would be so hard, impossible, or impractical. As I said, I'll not make a formal objection if other browsers want to implement this, but it doesn't look like a good primitive to add to the platform, and I would be opposed to implementing it in Blink. |
Inserting spans in static content is easy. Inserting spans in (rich text) content while it is being edited is not. |
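For the static case, a rough sketch of what such a script might look like follows. All function names here are hypothetical, and tagging every Hangul run as lang="ko" purely from its script is an assumption made for illustration. The sketch works on a plain string only; it does not touch the DOM, selections, or undo history, which is where the difficulty with content being edited comes from.

```javascript
// Split text into maximal runs of Hangul and non-Hangul characters.
// Spaces are not in the Hangul script, so "한국 대학" yields three runs.
function splitByHangul(text) {
  const runs = [];
  const pattern = /\p{Script=Hangul}+|\P{Script=Hangul}+/gu;
  for (const match of text.matchAll(pattern)) {
    const value = match[0];
    runs.push({ hangul: /^\p{Script=Hangul}/u.test(value), text: value });
  }
  return runs;
}

// Minimal HTML escaping so the result can be used as markup.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wrap each Hangul run in a span so that a rule such as
// :lang(ko) { word-break: keep-all; } applies to it and to nothing else.
function wrapHangulRuns(text) {
  return splitByHangul(text)
    .map(run => run.hangul
      ? `<span lang="ko">${escapeHtml(run.text)}</span>`
      : escapeHtml(run.text))
    .join("");
}

console.log(wrapHangulRuns("ソウル（서울）は韓国の首都です。"));
// ソウル（<span lang="ko">서울</span>）は韓国の首都です。
```

Running this once over stored text is cheap. Keeping the spans correct on every keystroke in a contenteditable region, without disturbing the caret or the author's own markup, is a separate and harder problem that this sketch does not attempt.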
I was actioned to ask i18n for investigation of other languages where something like this is relevant. This is happening over there: w3c/i18n-discuss#11 |
Pending the i18n investigation, I'm seeing three ways forward here:
Recap: The fundamental behavior of IMHO: If there aren't use cases for more than one or two values in |
InDesign conflates the hyphenation setting with the choice of breaking anywhere or on space with Korean, which we have heard feedback about from users requesting its own distinct setting. Choosing to break on space or break anywhere does relate to other user choices, such as full-justifying the text or creating legal documents where meaning is critical and line breaks can change meaning. We have also heard that the default should move away from break anywhere to breaking on space with some user control to break where they want, for what it's worth. |
In regular situations, word-break: normal is expected to pick the right kind of word breaking for various scripts, keeping letters of a word together in languages that have word-based line breaking, while allowing wraps between the letters of a word in languages where that's the normal behavior.

However, Korean typography has been evolving, and while the normal value corresponds to what used to be normal (allowing wraps in the middle of words), and needs to continue to have this behavior for compat reasons, the preferred behavior is increasingly the one achieved by keep-all.

In a document where all parts are properly language tagged,
* { word-break: normal; } :lang(ko) { word-break: keep-all; }
achieves the desired behavior.

However, this is not quite enough to solve the problem in the case of documents with user-generated content: when a user types content in a textarea, or a contenteditable (or if user generated content is retrieved from a database), the author of the page does not generally know what the language is, and cannot tag it in the markup. The following options are available to them, none of them great:

word-break: normal on elements accepting user input: this will do "the right thing" for all languages, except for that style of Korean, which will break too often.

word-break: keep-all on elements accepting user input: this will do "the right thing" for space separated languages, including that style of Korean, but will badly break languages like Japanese or Chinese, by disabling wrapping opportunities and causing potential overflow.

word-break: normal on elements accepting user input, but also add a piece of JavaScript that monitors the content for changes, and switches the whole element to word-break: keep-all if any hangul text is detected:
keep-all to them as well.

* { word-break: normal; } :lang(ko) { word-break: keep-all; } together with a piece of JavaScript that adds the lang=ko attribute (and creates spans/divs as necessary to apply it) on the parts of the text input by the user that contain hangul, and lang="" (or lang=somethingelse, if the somethingelse can be detected reliably) on parts that don't:
contenteditable element? etc

So, to solve this, I propose that we add a keep-all-hangul value (or just keep-hangul), that behaves the same as keep-all for the unicode characters that correspond to hangul, and normal for everything else.
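The difference between the three behaviors can be illustrated with a toy model of break opportunities. This is only a sketch under stated assumptions, not how any browser or ICU implements line breaking: real line breaking follows the much larger Unicode line breaking rules, while this model only knows about spaces and four scripts. The function `breakOffsets` is hypothetical, and so is the choice that keep-all-hangul suppresses a break only when both neighboring characters are Hangul; the proposal text does not spell out what happens at a Hangul/Hanja boundary.

```javascript
const isHangul = ch => /\p{Script=Hangul}/u.test(ch);
// Scripts that, in this toy model, allow a break between any two letters
// under word-break: normal.
const breaksAnywhere = ch =>
  /[\p{Script=Hangul}\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u.test(ch);

// Returns the code point offsets at which a line break would be allowed.
// mode is "normal", "keep-all", or the proposed "keep-all-hangul".
function breakOffsets(text, mode) {
  const chars = Array.from(text); // iterate by code point
  const offsets = [];
  for (let i = 1; i < chars.length; i++) {
    const prev = chars[i - 1];
    const next = chars[i];
    // A break is always allowed after a space, in every mode.
    if (prev === " " && next !== " ") { offsets.push(i); continue; }
    if (!(breaksAnywhere(prev) && breaksAnywhere(next))) continue;
    if (mode === "normal") {
      offsets.push(i);
    } else if (mode === "keep-all-hangul") {
      // ASSUMPTION: only Hangul-Hangul pairs are kept together.
      if (!(isHangul(prev) && isHangul(next))) offsets.push(i);
    }
    // mode === "keep-all": no break between letters of these scripts.
  }
  return offsets;
}

console.log(breakOffsets("한국 대학", "normal"));          // [1, 3, 4]
console.log(breakOffsets("한국 대학", "keep-all"));        // [3]
console.log(breakOffsets("한국 대학", "keep-all-hangul")); // [3]
console.log(breakOffsets("日本語", "keep-all"));           // []
console.log(breakOffsets("日本語", "keep-all-hangul"));    // [1, 2]
```

In this model the proposed value matches keep-all on the Korean sample and matches normal on the Japanese sample, which is the behavior the proposal asks for on untagged, mixed-language input.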