Unicode characters not working properly in regex #700

Closed
asoltysik opened this issue May 5, 2017 · 2 comments

@asoltysik
Contributor

Trying to compile this code:

val other = "^[^\u0000-\u00a0\u1680\u2000-\u200a\u202f\u205f\u3000\u2028\u2029]$"

Pattern.compile(other, Pattern.CASE_INSENSITIVE)

Gives me this error:
java.util.regex.PatternSyntaxException: Unclosed character class near index 1 ^[^-   -    

]$
^
at java.lang.Throwable::fillInStackTrace_class.java.lang.Throwable
at java.lang.Throwable::init_class.java.lang.String_class.java.lang.Throwable
at java.lang.Exception::init_class.java.lang.String_class.java.lang.Throwable
at java.lang.RuntimeException::init_class.java.lang.String_class.java.lang.Throwable
at java.lang.IllegalArgumentException::init_class.java.lang.String_class.java.lang.Throwable
at java.lang.IllegalArgumentException::init
at java.util.regex.PatternSyntaxException::init_class.java.lang.String_class.java.lang.String_i32
at java.util.regex.Pattern$::compile_class.java.lang.String_i32_bool_class.java.util.regex.Pattern
at java.util.regex.Pattern$::compile_class.java.lang.String_i32_class.java.util.regex.Pattern
at example.Main$::main_class.ssnr.ObjectArray_unit
at main
at __libc_start_main
at _start
at java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)

The same pattern compiles on the JVM: http://ideone.com/r2I2gm
It also compiles in Go, which uses re2: http://ideone.com/CGrZar

@densh
Member

densh commented May 18, 2017

Thanks for the bug report!

@densh densh added this to the Backlog milestone May 18, 2017
@densh densh modified the milestones: 0.4, Backlog May 20, 2017
@densh
Member

densh commented May 25, 2017

At first glance, it looks like our current regex implementation completely ignores the fact that UTF-8 and the UTF-16 that we use for the in-memory representation of strings are two completely separate beasts.
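
To illustrate why the two encodings cannot be treated interchangeably, here is a minimal sketch in plain Scala. It does not touch the regex implementation itself and does not confirm the root cause. The object name `EncodingDemo` and its helper methods are hypothetical, introduced only for this example. It takes a few of the characters from the pattern in the report and compares their size in UTF-16 code units with their size in UTF-8 bytes.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper object, used only for this illustration.
object EncodingDemo {
  // Number of UTF-16 code units, which is what String.length reports.
  def utf16Units(s: String): Int = s.length

  // Number of bytes once the same text is encoded as UTF-8.
  def utf8Bytes(s: String): Int = s.getBytes(StandardCharsets.UTF_8).length

  def main(args: Array[String]): Unit = {
    // An ASCII letter plus three characters from the reported character class:
    // U+00A0, U+1680 and U+3000.
    val samples = Seq("a", "\u00a0", "\u1680", "\u3000")
    for (s <- samples) {
      val codePoint = s.codePointAt(0).toHexString
      println(s"U+$codePoint: ${utf16Units(s)} UTF-16 unit(s), ${utf8Bytes(s)} UTF-8 byte(s)")
    }
  }
}
```

Each sample is a single UTF-16 code unit, but the non-ASCII ones take two or three bytes in UTF-8. Code that hands string data to a UTF-8 based engine without converting it, or that mixes up code-unit offsets and byte offsets, could therefore plausibly misread a pattern like the one above.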
