-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Remove spaces from newlines between CJK characters #7350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thank you for implementing this! As for space-collapsing-cjk-strong ( Besides, the cjk-unbreak package has accumulated a few test cases in https://github.com/KZNS/cjk-unbreak/blob/2ea9b0ce3654ab537116499f63aa4077165192bc/test.typ. They might be relevant. |
crates/typst-syntax/src/lexer.rs
Outdated
| pub fn is_cjk(c: char) -> bool { | ||
| matches!( | ||
| c.script(), | ||
| Script::Han | Script::Hiragana | Script::Katakana | Script::Hangul |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we also include punctuation marks, for the following use case.
Typst source:
我就站住,
豫备她来讨钱。
“你回来了?”
她先这样问。
Result:
我就站住,豫备她来讨钱。“你回来了?”她先这样问。
The list of CJK punctuation marks can be found in clreq and jlreq. (I don't know if K should be included here.)
This might be more complicated than it seems to be. For example, the following three characters are widely used in Chinese documents, but their categories are different. See regex.pdf for further comparisons.
-
U+3001
、顿号 (secondary comma) matches\p{Script_Extensions=Han}. -
U+FF0C
,逗号 (regular full-width comma) matches\p{Script_Extensions=Common}. -
U+201C
“上双引号 (left double quotation mark) matches\p{Script_Extensions=Common}, and it is also used in Latin documents.
They all match \p{General_Category=Punctuation} and \p{Script=Common}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Korean, they should be included, along with its half-width counter parts of the Latin punctuations where possible---both of which are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't include Hangul here. Modern Korean uses spaces to separate words, so it's not relavant here. The function is better named as is_cj.
I used 'CJK' in the title of #792, that was a mistake. This issue only affects Chinese and Japanese.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the links, they're very helpful!
I can change to just CJ removing Hangul. I'll plan to move the function out of the lexer and leave the lexer behavior untouched (unless that should change too?)
However, I'd appreciate more thoughts on what to do for punctuation. It seems we have two categories of codepoints: non-ambiguous CJ punctuation and ambiguous CJ punctuation, such as left/right quotes. For non-ambiguous, I guess we can just treat them like all other CJ codepoints for collapsing, but I'm less sure what to do for ambiguous punctuation.
I presume it would be a good behavior for ambiguous punctuation followed by a CJ character (or vice-versa) to collapse a newline space, but what about an ambiguous punctuation next to another ambiguous punctuation? Are there any characters that would be likely to be split across lines? Should we look at the text language to determine this?
Also, some of the non-ambiguous CJ punctuation overlap in usage with Hangul. Should we be taking that into account?
This is also relevant for #5858, which has some related discussion around quotation marks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach would be to just set the space collapsing behavior based on the text language, or an explicit property of say, par() (similar to the request in #710). This is more coarse-grained, but would simplify many of these considerations. Another tradeoff 😮💨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YDX-2147483647 According to https://www.unicode.org/Public/17.0.0/ucd/EastAsianWidth.txt, The EAW of U+17A4 is N, not W or F. Did you confuse it with something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Script property of U+115F is Hangul and EAW of it is W.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EAW of U+17A4 is N, not W or F. Did you confuse it with something different?
Hi! I didn't make it clear. What I mean is as follows.
c.width()uses a complex rule to determine the width, and EAW is one of the factors.- The full rule is documented on https://docs.rs/unicode-width, and it says that U+17A4 and U+115F will give width 2. (For these two specific characters, EAW does not contribute to
c.width().) - Therefore, I don't think
c.width() == Some(2)is a good criterion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your opinion that unicode_with is not suitable for determining whether the character is CJ(K).
We should use the EAW property directly. Also, U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP is a (legacy) Japanese character but whose Script is not Katakana but Common and whose EAW is H. All non-Hangul/Korean characters whose EAW is H must be treated as Chinese/Japanese.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have issues with determining whether characters are Korean, we should test their test cases in FIrefox and report them in https://bugzilla.mozilla.org/ (its bug tracker):
<div lang=ja><!-- ← or zh -->
Test
Case
Here
</div>Removing a newline around a punctuation is enabled only in Chinese and Japanese.
|
Note: I personally prefer the |
be0c698 to
dd58d2b
Compare
dd58d2b to
a70304f
Compare
|
@tats-u thanks! The link to the CSS WG Draft is also helpful. |
It should be just a link to tracker. No concrete specification is stipulated in CSS WG Draft (https://drafts.csswg.org/css-text-4/#line-break-transform) now. Prettier's issue (fixed): "K" should be excluded from the title, too. FYI, the following JS expression returns Iterator.from((function*() {for (let i = 0; i <= 0x10ffff; i++) yield i;})()).filter(cp => /[\p{P}&&\p{sc=Hang}]/v.test(String.fromCodePoint(cp))).toArray() |
This PR makes spaces due to newlines between CJK characters collapse to avoid creating a space when rendered. I've done so by splitting the
Spacesyntax kind into two kinds to determine whether a space had a newline, and then by modifying the space-collapsing algorithm during Typst's realization step.This is the behavior I mentioned in #792 (comment) (I have since changed my username from
wrziantoisuffix).Closes #792
Besides the below tradeoffs, I also have a few
TODOcomments that I would like input on.Tradeoffs
This is a robust solution, but it makes multiple choices with tradeoffs that I'm not sure are desireable. I do not speak any CJK languages, so I'd appreciate feedback from the community about what is/isn't desired :)
CC @peng1999, @YDX-2147483647, @account-login
The main alternative design would be to solely resolve this in the parser or the AST, like YDX mentions in #792 (comment). That may improve or harm any of the tradeoffs below based on your opinion.
Each of the tradeoffs is exemplified in one or more new test cases:
Tradeoff: Collapsing happens during realization
Because collapsing happens after realization, a single newline space in the document may or may not collapse if its neighbors evaluate to CJK/non-CJK text dynamically.
Additionally, since a space element can itself be stored in a variable, static CJK characters can have different spacing due to stored space variables.
This is obviously one of the more contentious tradeoffs, but I think it's mostly fine. The first case is reasonable, and I doubt the second case is likely to affect real documents. But I am ok with changing either.
Test and output:
space-collapsing-cjk-dynamicTradeoff: Normal spaces collapse only if they are adjacent to a newline space
Spaces that are not from newlines are kept, as in
空 格(see the test case dropdown), but when adjacent to a newline space they will collapse.While the basic behavior of collapsing a newline space or a newline space followed by a comment are straightforward, it's less clear whether a normal space followed by a comment and a newline space should combine as one space and collapse together, or if they should act separately. In addition, it's unclear if spaces with different styling should be able to combine and collapse together.
Currently, all three of the cases in this codeblock will combine and collapse their spaces. For me, I think the first is good, and the second is probably desired, but I'm less sure about the third case. (these are rendered in the dropdown below)
Tests and output:
issue-792-space-collapsing-cjkandspace-collapsing-cjk-strongTradeoff: Treating space values as equal
This is the one I'm least certain about, and would be improved by ignoring spaces in the parser/AST.
There are a few other test cases (
list-indent-trivia-nesting,list-indent-bracket-nesting) that implicitly expect space elements to be equal to each other regardless of newlines. I feel like I also generally expect this behavior (I wrote those tetsts), so breaking them feels odd. I added a customPartialEqimplementation toSpaceElemthat always returns true to make this work. However, I'm not sure if this is sound with the way Comemo caches data.I'm totally ok with removing this if we stay with space-collapsing during realization, but it will require modifying the other test cases if we do, and we would probably also want to modify the
reprofSpaceElem.Test:
space-eq-newline