Unicode categories are missing some CJK characters that are formatted differently in the Unicode data tables #1432

razzeee · 2021-10-09T20:16:19Z

Hey everybody,

I've been using \p{Lu}[_\d\p{L}]* since #906 landed.
Now a user came up with this issue: elm-tooling/tree-sitter-elm#105

I'm not too sure whether that should work or not. Testing with https://regex101.com/ gives different results depending on the engine used.

And since tree-sitter is its own engine, I'm unsure what's expected.


maxbrunsfeld · 2021-10-09T21:04:44Z

That's interesting.

This is the script that we use to build the mappings from those character categories into sets of concrete code points. If you're interested, you could consult the Unicode sources that the script pulls from, and see whether that Japanese character is intended to belong to that category.

Maybe there is some tweak that we could make to the way that we interpret the categories.

razzeee · 2021-10-09T22:51:33Z

Playing around with regex101.com again shows that the result doesn't depend on the engine; I had just failed to set the u flag. The pattern needs to be /\p{Lu}[_\d\p{L}]*/gu to match anything, even without the Japanese character.

I think it might make sense to confirm this with a unit test first?

razzeee · 2021-10-10T13:03:08Z

This causes tests::corpus_test::test_feature_corpus_files to fail.

diff --git a/test/fixtures/test_grammars/unicode_classes/corpus.txt b/test/fixtures/test_grammars/unicode_classes/corpus.txt
index 9c35be27..7b655973 100644
--- a/test/fixtures/test_grammars/unicode_classes/corpus.txt
+++ b/test/fixtures/test_grammars/unicode_classes/corpus.txt
@@ -2,23 +2,23 @@
 Uppercase words
 ===============
 
-Δბㄱ  Ψ  Ɓƀ  Ƒ  Ɣ  Śřř
+Δბㄱ  Ψ  Ɓƀ  Ƒ  Ɣ  Śřř Color青
 
 ---
 
 (program
-  (upper) (upper) (upper) (upper) (upper) (upper))
+  (upper) (upper) (upper) (upper) (upper) (upper) (upper))
 
 ================
 Lowercase words
 ================
 
-śś  ťť  ßß
+śś  ťť  ßß color青
 
 ---
 
 (program
-  (lower) (lower) (lower))
+  (lower) (lower) (lower) (lower))
 
 ================
 Math symbols

razzeee · 2021-10-11T20:51:14Z

Here's an example:

https://codepoints.net/U+9752?lang=en

According to that website it should be part of Lo. But 38738 (U+9752 in decimal) can't be found in the file where we define Lo (or, for that matter, in the Lo part of the array).
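As a quick sanity check of those numbers, here is a small Python sketch. It is not part of tree-sitter and does not read the generated JSON files; it only asks Python's bundled `unicodedata` tables (an independent source) for the decimal code point and general category of the character in question.

```python
import unicodedata

# U+9752 is the character from the codepoints.net link above.
ch = "\u9752"

# Decimal value of the code point, i.e. the number one would search for
# in a list of individually enumerated code points.
decimal_code_point = ord(ch)

# Two-letter Unicode general category as reported by Python's own tables.
category = unicodedata.category(ch)

print(hex(decimal_code_point), decimal_code_point, category)
```

If an independent table agrees that the character is Lo, the gap is more likely in how our category lists are generated than in the Unicode data itself.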

maxbrunsfeld · 2021-10-11T21:05:50Z

Oh, looking at the Unicode data table (version 14), I think that they don't enumerate all of the CJK characters. They use a separate syntax that I previously did not notice.

Most code points get a row of their own:

4DFB;HEXAGRAM FOR LIMITATION;So;0;ON;;;;;N;;;;;
4DFC;HEXAGRAM FOR INNER TRUTH;So;0;ON;;;;;N;;;;;
4DFD;HEXAGRAM FOR SMALL PREPONDERANCE;So;0;ON;;;;;N;;;;;
4DFE;HEXAGRAM FOR AFTER COMPLETION;So;0;ON;;;;;N;;;;;
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;

But then, for these CJK ideographs, they have one row that represents many code points:

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

We need to adjust our parsing scripts to account for these ranges. Also, we may need to change these JSON files to store ranges instead of individual code points, or else they might get too large.
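A minimal sketch of what range-aware parsing could look like, written in Python for illustration rather than in the language of the actual script. The function names `parse_category_ranges` and `in_category` are hypothetical, the sample rows are the ones quoted above, and the sketch assumes every `First` row is immediately followed by its `Last` row. It stores inclusive ranges per category instead of individual code points.

```python
from collections import defaultdict

# Rows copied from the excerpts above (UnicodeData.txt format).
SAMPLE = """\
4DFE;HEXAGRAM FOR AFTER COMPLETION;So;0;ON;;;;;N;;;;;
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_category_ranges(text):
    """Return {category: [(first, last), ...]} with inclusive ranges."""
    ranges = defaultdict(list)
    pending_first = None  # code point of an open "<..., First>" row
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]

        if name.endswith(", First>"):
            # Remember the start; the matching "Last" row closes the range.
            pending_first = code_point
            continue
        if name.endswith(", Last>"):
            if pending_first is None:
                raise ValueError("Last row without a First row: " + line)
            first, pending_first = pending_first, None
        else:
            first = code_point
        last = code_point

        cat_ranges = ranges[category]
        # Merge with the previous range when the two are adjacent.
        if cat_ranges and cat_ranges[-1][1] + 1 == first:
            cat_ranges[-1] = (cat_ranges[-1][0], last)
        else:
            cat_ranges.append((first, last))
    return dict(ranges)


def in_category(ranges, category, code_point):
    """Check membership by scanning the category's ranges."""
    return any(lo <= code_point <= hi for lo, hi in ranges.get(category, []))


ranges = parse_category_ranges(SAMPLE)
print(ranges)
```

With this representation the whole CJK block costs one pair of numbers in the output, which is why storing ranges should keep the JSON files from growing too large.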

razzeee · 2021-10-11T21:30:43Z

Glad you spotted that, I've been missing that all evening.

razzeee · 2021-12-09T08:17:21Z

Are there any plans to work on this? I have more and more users finding this in different languages.

razzeee · 2021-12-27T20:47:45Z

FYI, I'll just revert this for the time being: elm-tooling/tree-sitter-elm#112

alemuller · 2021-12-28T08:55:26Z

By the way, there is a Unicode Technical Standard for regular expressions: UTS #18.

invrainbow · 2022-04-30T06:17:49Z

I got this working in tree-sitter-go with an external scanner: #1726 (comment)

maxbrunsfeld changed the title ~~Supporting japanese in regex~~ Unicode categories are missing some CJK characters that are formatted differently in the Unicode data tables Oct 12, 2021

maxbrunsfeld added bug parser-generation labels Oct 12, 2021

razzeee mentioned this issue Apr 26, 2022

Identifiers not working with Chinese characters #1726

Closed
