Unicode categories are missing some CJK characters that are formatted differently in the Unicode data tables #1432

Open
razzeee opened this issue Oct 9, 2021 · 10 comments

Comments

@razzeee
Contributor

razzeee commented Oct 9, 2021

Hey everybody,

I've been using \p{Lu}[_\d\p{L}]* since #906 landed.
Now a user came up with this issue: elm-tooling/tree-sitter-elm#105

I'm not too sure whether that should work or not. Testing with https://regex101.com/ gives different results depending on the engine used.

And since tree-sitter is its own engine, I'm unsure what behavior is expected.

@maxbrunsfeld
Contributor

That's interesting.

This is the script that we use to build the mappings from those character categories into sets of concrete code points. If you're interested, you could consult the Unicode sources that the script pulls from, and see whether that Japanese character is intended to belong to that category.

Maybe there is some tweak that we could make to the way that we interpret the categories.

@razzeee
Contributor Author

razzeee commented Oct 9, 2021

Playing around with regex101.com again shows that the result doesn't depend on the engine; I just failed to set the u flag. The pattern needs to be /\p{Lu}[_\d\p{L}]*/gu to match anything, even without the Japanese character.

I think it might make sense to confirm this with a unit test first?
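
As a rough reference for what such a test would expect, here is a minimal sketch of the pattern's intended semantics written with Python's `unicodedata` module (the standard `re` module has no `\p{...}` support). The helper name `matches_upper_word` is hypothetical, the check is a whole-string match rather than a search, and the result depends on the Unicode version bundled with the Python interpreter, not on tree-sitter's own tables.

```python
import unicodedata


def matches_upper_word(word: str) -> bool:
    # Hypothetical helper approximating \p{Lu}[_\d\p{L}]* as a full match.
    # First character must be in general category Lu (uppercase letter).
    if not word or unicodedata.category(word[0]) != "Lu":
        return False
    # Remaining characters: underscore, ASCII digit, or any letter category
    # (Lu, Ll, Lt, Lm, Lo all start with "L").
    return all(
        ch == "_"
        or ch in "0123456789"
        or unicodedata.category(ch).startswith("L")
        for ch in word[1:]
    )


if __name__ == "__main__":
    for word in ["Color青", "color青", "Śřř"]:
        print(word, matches_upper_word(word))
```

Under this reading, a word that starts with an uppercase letter and ends with a CJK ideograph should be accepted as a single token, since the ideograph is a letter in category Lo.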

@razzeee
Contributor Author

razzeee commented Oct 10, 2021

This causes tests::corpus_test::test_feature_corpus_files to fail.

diff --git a/test/fixtures/test_grammars/unicode_classes/corpus.txt b/test/fixtures/test_grammars/unicode_classes/corpus.txt
index 9c35be27..7b655973 100644
--- a/test/fixtures/test_grammars/unicode_classes/corpus.txt
+++ b/test/fixtures/test_grammars/unicode_classes/corpus.txt
@@ -2,23 +2,23 @@
 Uppercase words
 ===============
 
-Δბㄱ  Ψ  Ɓƀ  Ƒ  Ɣ  Śřř
+Δბㄱ  Ψ  Ɓƀ  Ƒ  Ɣ  Śřř Color青
 
 ---
 
 (program
-  (upper) (upper) (upper) (upper) (upper) (upper))
+  (upper) (upper) (upper) (upper) (upper) (upper) (upper))
 
 ================
 Lowercase words
 ================
 
-śś  ťť  ßß
+śś  ťť  ßß color青
 
 ---
 
 (program
-  (lower) (lower) (lower))
+  (lower) (lower) (lower) (lower))
 
 ================
 Math symbols

@razzeee
Contributor Author

razzeee commented Oct 11, 2021

Here's an example:

https://codepoints.net/U+9752?lang=en

According to that website, it should be part of Lo. But 38738 (the decimal value of U+9752) can't be found in the file where we define Lo (or, for that matter, in the Lo part of the array).
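
The category and the decimal value can be cross-checked independently of that website. This is a small sketch using Python's standard `unicodedata` module; it reflects the Unicode version shipped with the Python interpreter, which is an assumption separate from the data tree-sitter generates.

```python
import unicodedata

ch = "青"
code_point = ord(ch)

# Print the code point in hex and decimal along with its general category.
print(f"U+{code_point:04X}", code_point, unicodedata.category(ch))
```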

@maxbrunsfeld
Contributor

maxbrunsfeld commented Oct 11, 2021

Oh, looking at the Unicode data table (version 14), I think that they don't enumerate all of the CJK characters. They use a separate syntax that I previously did not notice.

Most of the table uses one row per code point:

4DFB;HEXAGRAM FOR LIMITATION;So;0;ON;;;;;N;;;;;
4DFC;HEXAGRAM FOR INNER TRUTH;So;0;ON;;;;;N;;;;;
4DFD;HEXAGRAM FOR SMALL PREPONDERANCE;So;0;ON;;;;;N;;;;;
4DFE;HEXAGRAM FOR AFTER COMPLETION;So;0;ON;;;;;N;;;;;
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;

But then, for these CJK ideographs, they have a pair of rows that marks the first and last code points of a whole range:

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

We need to adjust our parsing scripts to account for these ranges. Also, we may need to change these JSON files to store ranges instead of individual code points, or else they might get too large.
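
A minimal sketch of that adjustment, in Python rather than the language of the actual generator script: it reads rows in the format quoted above, treats a `<..., First>` / `<..., Last>` pair as one inclusive range, and stores every category as a list of ranges instead of individual code points. The function names `parse_category_ranges` and `in_category` and the `SAMPLE` excerpt are assumptions made for illustration only.

```python
from collections import defaultdict

# Excerpt built from the rows quoted in this thread.
SAMPLE = """\
4DFE;HEXAGRAM FOR AFTER COMPLETION;So;0;ON;;;;;N;;;;;
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_category_ranges(text):
    """Map each general category to a list of inclusive (start, end) ranges."""
    ranges = defaultdict(list)
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the range begins; the matching Last row closes it.
            range_start = code_point
            continue
        if name.endswith(", Last>"):
            if range_start is None:
                raise ValueError(f"Last row without First row: {line}")
            start, range_start = range_start, None
        else:
            start = code_point
        existing = ranges[category]
        if existing and existing[-1][1] + 1 == start:
            # Extend the previous range when code points are contiguous.
            existing[-1] = (existing[-1][0], code_point)
        else:
            existing.append((start, code_point))
    return dict(ranges)


def in_category(ranges, category, code_point):
    return any(lo <= code_point <= hi for lo, hi in ranges.get(category, []))


if __name__ == "__main__":
    parsed = parse_category_ranges(SAMPLE)
    print(parsed)
    print(in_category(parsed, "Lo", 0x9752))
```

Storing ranges keeps the output small: the whole CJK block becomes a single pair of numbers rather than roughly twenty thousand entries.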

@razzeee
Contributor Author

razzeee commented Oct 11, 2021

Glad you spotted that, I've been missing that all evening.

@maxbrunsfeld maxbrunsfeld changed the title Supporting japanese in regex Unicode categories are missing some CJK characters that are formatted differently in the Unicode data tables Oct 12, 2021
@razzeee
Contributor Author

razzeee commented Dec 9, 2021

Are there any plans to work on this? I have more and more users finding this in different languages.

@razzeee
Contributor Author

razzeee commented Dec 27, 2021

FYI, I'll just revert this for the time being: elm-tooling/tree-sitter-elm#112

@alemuller
Contributor

By the way, there is a Unicode Technical Standard for regular expressions: UTS #18.

@invrainbow

invrainbow commented Apr 30, 2022

I got this working in tree-sitter-go with an external scanner: #1726 (comment)
