Unicode categories are missing some CJK characters that are formatted differently in the Unicode data tables #1432
Comments
That's interesting. This is the script that we use to build the mappings from those character categories into sets of concrete code points. If you're interested, you could consult the Unicode sources that the script pulls from and see whether that Japanese character is intended to belong to that category. Maybe there is some tweak that we could make to the way that we interpret the categories. |
So playing around with regex101.com again shows that it does not depend on the engine. I just failed to set the [...]. I think it might make sense to confirm this with a unit test first? |
This causes
diff --git a/test/fixtures/test_grammars/unicode_classes/corpus.txt b/test/fixtures/test_grammars/unicode_classes/corpus.txt
index 9c35be27..7b655973 100644
--- a/test/fixtures/test_grammars/unicode_classes/corpus.txt
+++ b/test/fixtures/test_grammars/unicode_classes/corpus.txt
@@ -2,23 +2,23 @@
Uppercase words
===============
-Δბㄱ Ψ Ɓƀ Ƒ Ɣ Śřř
+Δბㄱ Ψ Ɓƀ Ƒ Ɣ Śřř Color青
---
(program
- (upper) (upper) (upper) (upper) (upper) (upper))
+ (upper) (upper) (upper) (upper) (upper) (upper) (upper))
================
Lowercase words
================
-śś ťť ßß
+śś ťť ßß color青
---
(program
- (lower) (lower) (lower))
+ (lower) (lower) (lower) (lower))
================
Math symbols
|
Here's an example: https://codepoints.net/U+9752?lang=en According to that website it should be part of |
Oh, looking at the Unicode data table (version 14), I think that they don't enumerate all of the CJK characters. They use a separate syntax that I had not noticed before. Most rows describe a single code point:
But then, for these CJK ideographs, a pair of rows marks the first and last code point of a whole range:
We need to adjust our parsing scripts to account for these ranges. Also, we may need to change these JSON files to store ranges instead of individual code points, or else they might get too large. |
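For readers following along, here is a minimal sketch of what range-aware parsing could look like. It is not the project's actual script: the function names `parse_categories` and `contains` are hypothetical, and the sample rows are only illustrative of the semicolon-separated `UnicodeData.txt` layout, with the end point of the CJK range assumed (it varies by Unicode version). The sketch treats `<..., First>` and `<..., Last>` rows as the two ends of one inclusive range and stores ranges instead of individual code points.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end)

# Illustrative rows in the UnicodeData.txt layout: field 0 is the code point
# (hex), field 1 the name, field 2 the general category.
# The range end point below is an assumption and varies by Unicode version.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_categories(text: str) -> Dict[str, List[Range]]:
    """Map each general category to a list of inclusive code point ranges.

    Assumes the rows are sorted by code point, as in UnicodeData.txt.
    """
    ranges: Dict[str, List[Range]] = {}
    pending_start = None  # set while we are between a First and a Last row
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the range begins; the Last row closes it.
            pending_start = code
            continue
        if name.endswith(", Last>"):
            if pending_start is None:
                raise ValueError("Last row without a matching First row")
            start, pending_start = pending_start, None
        else:
            start = code
        bucket = ranges.setdefault(category, [])
        if bucket and bucket[-1][1] + 1 == start:
            # Merge with the previous range when the code points are adjacent.
            bucket[-1] = (bucket[-1][0], code)
        else:
            bucket.append((start, code))
    return ranges


def contains(category_ranges: List[Range], code_point: int) -> bool:
    """Check whether a code point falls inside any of the ranges."""
    return any(lo <= code_point <= hi for lo, hi in category_ranges)
```

With the sample above, `contains(parse_categories(SAMPLE)["Lo"], 0x9752)` is true, whereas a parser that only reads one code point per row would record just the two end points and miss U+9752. Storing `(start, end)` pairs also keeps the generated data small, which addresses the file size concern.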
Glad you spotted that, I've been missing that all evening. |
Are there any plans to work on this? I have more and more users finding this in different languages. |
FYI, I'll just revert this for the time being: elm-tooling/tree-sitter-elm#112 |
By the way, there is a Unicode Technical Standard for regular expressions: UTS #18. |
I got this working in tree-sitter-go with an external scanner: #1726 (comment) |
Hey everybody,
I've been using
\p{Lu}[_\d\p{L}]*
since #906 landed. Now a user came up with this issue: elm-tooling/tree-sitter-elm#105
I'm not too sure whether that should work or not. Testing with https://regex101.com/ gives various results, depending on the engine used.
And with tree-sitter being its own engine, I'm unsure what's expected.
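One way to pin down the expected behaviour independently of any regex engine is to check the general categories directly. The sketch below uses Python's standard `unicodedata` module to emulate the pattern by hand; `match_length` is a hypothetical helper, and treating `\d` as the `Nd` category is an assumption, since engines differ on that point.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # \p{L}: any general category starting with "L" (Lu, Ll, Lt, Lm, Lo)
    return unicodedata.category(ch).startswith("L")


def is_continue(ch: str) -> bool:
    # [_\d\p{L}], with \d assumed to mean the Nd (decimal digit) category
    return ch == "_" or unicodedata.category(ch) == "Nd" or is_letter(ch)


def match_length(text: str) -> int:
    # Length of the longest prefix matching \p{Lu}[_\d\p{L}]*, or 0 if none.
    if not text or unicodedata.category(text[0]) != "Lu":
        return 0
    length = 1
    while length < len(text) and is_continue(text[length]):
        length += 1
    return length
```

Under these assumptions `unicodedata.category("青")` reports `Lo`, so `match_length("Color青")` covers the whole word, while `match_length("color青")` is 0 because the first character is not `Lu`.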