RegexSet misbehave with unicode #353

Closed
constituent opened this issue Apr 4, 2017 · 3 comments · Fixed by #369

@constituent

Tested with regex 0.2.1

println!("{:?}", RegexSet::new(&["a", "b",]).unwrap().is_match("b"));
println!("{:?}", RegexSet::new(&["b", "a",]).unwrap().is_match("b"));
println!("{:?}", RegexSet::new(&["a", "β",]).unwrap().is_match("β"));
println!("{:?}", RegexSet::new(&["β", "a",]).unwrap().is_match("β"));

gives

true
true
false
true

The third should also be true. The only difference between the first and third cases is b versus β, yet the results differ.

@BurntSushi BurntSushi added the bug label Apr 4, 2017
@BurntSushi
Member

Interestingly, these work fine:

println!("{:?}", Regex::new("a|β").unwrap().is_match("β"));
println!("{:?}", Regex::new("β|a").unwrap().is_match("β"));

@BurntSushi
Member

Found the problem. There appears to be a bug in the compiler that's producing incorrect bytecode specifically for RegexSet:

0000 Split(1, 3) (start)
0001 Bytes(a, a)
0002 Match(0)
0003 Bytes(\xb2, \xb2) (goto: 5)
0004 Bytes(\xce, \xce) (goto: 3)
0005 Match(1)

The correct program should be:

0000 Split(1, 4) (start)
0001 Bytes(a, a)
0002 Match(0)
0003 Bytes(\xb2, \xb2) (goto: 5)
0004 Bytes(\xce, \xce) (goto: 3)
0005 Match(1)

My guess is that the extra Match instructions are somehow throwing things off, because a|β produces the correct program.
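To see why the Split target matters, the two listings above can be run through a toy interpreter. This is only a sketch: the `Inst` enum, `run`, and `program` are made up for illustration and are not the regex crate's internals, and matching is anchored at the start of the input for simplicity. A `Bytes` instruction with no explicit goto in the listing is assumed to fall through to the next instruction.

```rust
// Toy instruction set mirroring the listings in this thread (hypothetical, not regex internals).
#[derive(Clone, Copy)]
enum Inst {
    Split(usize, usize),
    Bytes { lo: u8, hi: u8, goto: usize },
    Match(usize),
}

// Backtracking interpreter: returns the index of the first regex that matches,
// anchored at the start of `input`. The programs here contain no loops, so the
// recursion always terminates.
fn run(prog: &[Inst], pc: usize, input: &[u8], at: usize) -> Option<usize> {
    match prog[pc] {
        Inst::Split(x, y) => run(prog, x, input, at).or_else(|| run(prog, y, input, at)),
        Inst::Bytes { lo, hi, goto } => match input.get(at) {
            Some(&b) if lo <= b && b <= hi => run(prog, goto, input, at + 1),
            _ => None,
        },
        Inst::Match(i) => Some(i),
    }
}

// Builds the six-instruction program from the listings; only the second
// Split target differs between the buggy (3) and correct (4) versions.
fn program(second_branch: usize) -> Vec<Inst> {
    vec![
        Inst::Split(1, second_branch),
        Inst::Bytes { lo: b'a', hi: b'a', goto: 2 },
        Inst::Match(0),
        Inst::Bytes { lo: 0xB2, hi: 0xB2, goto: 5 },
        Inst::Bytes { lo: 0xCE, hi: 0xCE, goto: 3 },
        Inst::Match(1),
    ]
}

fn main() {
    let input = "β".as_bytes(); // UTF-8 bytes: 0xCE 0xB2
    println!("Split(1, 3): {:?}", run(&program(3), 0, input, 0));
    println!("Split(1, 4): {:?}", run(&program(4), 0, input, 0));
}
```

With Split(1, 3), the second branch starts at the trailing byte \xb2, so the leading byte \xce of β never matches. With Split(1, 4), the branch starts at \xce and then jumps back to \xb2.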

BurntSushi added a commit that referenced this issue May 20, 2017
When compiling a RegexSet, it was possible for the jump locations to
become incorrect if the last regex in the set had a starting location
that didn't correspond to the beginning of its program. This can happen
in simple cases like when your set consists of the regexes `a` and `β`.
In particular, the program for `β` is:

    0: Bytes(\xB2) (goto 2)
    1: Bytes(\xCE) (goto 0)
    2: MATCH

Where the entry point is `1` instead of `0`. To fix this, we compile a
set of regexes similarly to how we compile `a|β`, where we handle the
holes produced by sub-expressions correctly.

Fixes #353
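The entry-point problem described in the commit message can be illustrated with a small sketch. This is not the regex crate's actual compiler: `Fragment`, `compile_pair`, and the `honor_entry` flag are hypothetical names, used only to show how assuming that every sub-program starts at its first instruction yields Split(1, 3), while using the recorded entry point yields Split(1, 4).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Inst {
    Split(usize, usize),
    Bytes { lo: u8, hi: u8, goto: usize },
    Match(usize),
}

// Hypothetical compiled fragment: jumps are relative to the fragment, a jump
// to `insts.len()` means "fall out of the fragment", and `entry` is where
// execution must begin.
struct Fragment {
    insts: Vec<Inst>,
    entry: usize,
}

fn frag_a() -> Fragment {
    Fragment { insts: vec![Inst::Bytes { lo: b'a', hi: b'a', goto: 1 }], entry: 0 }
}

// Mirrors the program for β in the commit message: the entry point is 1, not 0.
fn frag_beta() -> Fragment {
    Fragment {
        insts: vec![
            Inst::Bytes { lo: 0xB2, hi: 0xB2, goto: 2 },
            Inst::Bytes { lo: 0xCE, hi: 0xCE, goto: 0 },
        ],
        entry: 1,
    }
}

// Shifts fragment-relative jump targets by the fragment's position in the program.
fn relocate(inst: Inst, offset: usize) -> Inst {
    match inst {
        Inst::Bytes { lo, hi, goto } => Inst::Bytes { lo, hi, goto: goto + offset },
        Inst::Split(x, y) => Inst::Split(x + offset, y + offset),
        Inst::Match(i) => Inst::Match(i),
    }
}

// Lays out two fragments behind a leading Split, each followed by its Match.
// With `honor_entry == false` the Split points at each fragment's first
// instruction, which reproduces the bad jump target from this issue.
fn compile_pair(first: &Fragment, second: &Fragment, honor_entry: bool) -> Vec<Inst> {
    let mut prog = vec![Inst::Split(0, 0)]; // placeholder, patched below
    let mut starts = [0usize; 2];
    for (i, frag) in [first, second].iter().enumerate() {
        let offset = prog.len();
        starts[i] = offset + if honor_entry { frag.entry } else { 0 };
        prog.extend(frag.insts.iter().map(|&inst| relocate(inst, offset)));
        prog.push(Inst::Match(i));
    }
    prog[0] = Inst::Split(starts[0], starts[1]);
    prog
}

fn main() {
    println!("{:?}", compile_pair(&frag_a(), &frag_beta(), false)[0]);
    println!("{:?}", compile_pair(&frag_a(), &frag_beta(), true)[0]);
}
```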
bors added a commit that referenced this issue May 20, 2017
compiler: fix RegexSet bug

@bors bors closed this as completed in #369 May 20, 2017
@constituent
Author

I've tested with my original issue and it also works fine now. Thanks!
