Regex syntax parsing of Unicode code points is incorrect #854

#### What version of regex are you using?
Latest

#### Describe the bug at a high level.
Because `regex_syntax` simply relies on `char::from_u32`, not all valid Unicode code points are parsed, and this prevents valid regexes from compiling.

![image](https://user-images.githubusercontent.com/22282241/163461605-1a13d4ed-aafa-4292-afa7-5c508afeecd4.png)

Rust defines `char` as a "Unicode scalar value" and explicitly states that this is similar to, but not the same as, a Unicode code point.
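To make the distinction concrete, here is a minimal sketch using only the standard library. The helper name `is_scalar_value` is made up for illustration and is not part of `regex_syntax`.

```rust
/// Hypothetical helper: true when `cp` is a Unicode scalar value,
/// i.e. when `char::from_u32` accepts it.
fn is_scalar_value(cp: u32) -> bool {
    // `char::from_u32` returns `None` for surrogate code points
    // (U+D800..=U+DFFF) and for anything above U+10FFFF.
    char::from_u32(cp).is_some()
}
```

With this helper, `is_scalar_value(0x41)` is `true`, while `is_scalar_value(0xD800)` is `false` even though U+D800 is an assigned code point in the surrogate range.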

The parser is supposed to extract all code points as documented above the function:
https://github.com/rust-lang/regex/blob/master/regex-syntax/src/ast/parse.rs#L1611

#### What is the expected behavior?

I expect this crate to include custom logic for validating code points, instead of relying on `char::from_u32`, which omits valid code points (surrogate values) because they aren't considered scalar values.
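As a rough sketch of what such custom validation could look like, the parser could classify a parsed value instead of collapsing every failure into one error. The `CodePoint` enum and `classify_code_point` function below are hypothetical names for illustration, not an existing `regex_syntax` API, and the sketch says nothing about how a surrogate would then be matched.

```rust
/// Hypothetical classification of a parsed hexadecimal escape value.
#[derive(Debug, PartialEq, Eq)]
enum CodePoint {
    /// A Unicode scalar value, representable as a Rust `char`.
    Scalar(char),
    /// A surrogate code point (U+D800..=U+DFFF): a code point, but not a scalar value.
    Surrogate(u32),
    /// A value above U+10FFFF, which is not a code point at all.
    OutOfRange(u32),
}

/// Assumption: the caller has already parsed the hex digits into a `u32`.
fn classify_code_point(cp: u32) -> CodePoint {
    match char::from_u32(cp) {
        Some(c) => CodePoint::Scalar(c),
        // `char::from_u32` rejects surrogates, so detect them separately.
        None if (0xD800..=0xDFFF).contains(&cp) => CodePoint::Surrogate(cp),
        None => CodePoint::OutOfRange(cp),
    }
}
```

Separating the surrogate case from the out-of-range case would at least let the parser report the two situations differently.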

JavaScript and several other regex engines can handle these fine.

