Fix panics parsing regex with whitespace in extended mode #349

Closed
robinst wants to merge 1 commit into master

@robinst
Contributor

robinst commented Mar 20, 2017

The added tests fail without the fix like this:

---- parser::tests::ignore_space_escape_hex2 stdout ----
	thread 'parser::tests::ignore_space_escape_hex2' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 10, surround: "x 5 3", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_escape_hex stdout ----
	thread 'parser::tests::ignore_space_escape_hex' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 12, surround: "{ 5 3 }", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_ascii_classes stdout ----
	thread 'parser::tests::ignore_space_ascii_classes' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 5, surround: "(?x)[ [ : ", kind: UnsupportedClassChar('[') }', src/libcore/result.rs:860
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- parser::tests::ignore_space_escape_octal stdout ----
	thread 'parser::tests::ignore_space_escape_octal' panicked at 'valid octal number', src/libcore/option.rs:785

---- parser::tests::ignore_space_escape_unicode_name stdout ----
	thread 'parser::tests::ignore_space_escape_unicode_name' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: "Y i }", kind: UnrecognizedUnicodeClass(" Y i") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_repeat_counted stdout ----
	thread 'parser::tests::ignore_space_repeat_counted' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: ", 1 0 }", kind: InvalidBase10("1 0") }', src/libcore/result.rs:860

The panics occur because `bump_get` skips whitespace while walking the
characters, but then keeps those spaces in the String it returns.
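
A minimal sketch of that failure mode, not the actual regex-syntax code: `bump_get_buggy` and `bump_get_fixed` are hypothetical free functions standing in for the parser method, and their signatures are assumptions made for illustration. The buggy version skips whitespace when deciding where to stop but returns the raw slice, so a later numeric conversion fails.

```rust
/// Hypothetical stand-in for the parser's `bump_get`: consume characters
/// while `pred` holds, treating whitespace as ignorable (extended mode).
fn bump_get_buggy<'a>(input: &'a str, pred: impl Fn(char) -> bool) -> &'a str {
    // Whitespace is skipped while walking the characters...
    let end = input
        .char_indices()
        .find(|&(_, c)| !c.is_whitespace() && !pred(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    // ...but the returned text still contains it.
    &input[..end]
}

/// Same walk, but the skipped whitespace is also dropped from the result.
fn bump_get_fixed(input: &str, pred: impl Fn(char) -> bool) -> String {
    bump_get_buggy(input, pred)
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect()
}

fn main() {
    let is_hex = |c: char| c.is_ascii_hexdigit();

    // The raw slice keeps the spaces, so base-16 parsing is rejected.
    let raw = bump_get_buggy(" 5 3}", is_hex);
    assert_eq!(raw, " 5 3");
    assert!(u32::from_str_radix(raw, 16).is_err());

    // With the spaces removed, the digits parse as 0x53, which is 'S'.
    let clean = bump_get_fixed(" 5 3}", is_hex);
    assert_eq!(u32::from_str_radix(&clean, 16), Ok(0x53));
    assert_eq!(char::from_u32(0x53), Some('S'));
}
```

Calling `unwrap()` on the failed conversion is what turns the bad input into a panic like the ones in the test output above.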

Found using cargo-fuzz.

@robinst

Contributor

robinst commented Mar 20, 2017

The fuzz script is here (not sure if you would want to merge that or not): master...robinst:add-cargo-fuzz-script

You can run it using `cargo install cargo-fuzz && cargo fuzz run fuzzer_script_parse`.

The artifact that it returned was this: m:(?xxxxxxxxxxxxxms)mmm\x00\x01\x00.\x00+@-\x00\x0a\x00\x10\x10\x10\x10\x10\x10\x10\x10'\x10\x02[--\x0a[\x00\x00$\\\x0a3\x03[\x00:6\x03D\x00. Reduced that and added tests for other cases.


    #[test]
    fn ignore_space_escape_octal() {
        assert_eq!(p(r"(?x)\ 1 2 3"), lit('S'));
    }

@robinst

robinst Mar 20, 2017

Contributor

Seems a bit weird that it's allowed to add space between digits of a number, but that seems to be the closest to the current behavior.
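
To make the expected value in that test concrete, here is a small sketch of the arithmetic only. `octal_escape_value` is a hypothetical helper, not part of regex-syntax; it assumes the lenient behavior under discussion, where whitespace between the digits is dropped before conversion.

```rust
/// Hypothetical helper: interpret the digits of an octal escape after
/// discarding any whitespace between them.
fn octal_escape_value(digits: &str) -> Option<char> {
    let compact: String = digits.chars().filter(|c| !c.is_whitespace()).collect();
    // Octal 123 = 1*64 + 2*8 + 3 = 83, the code point of 'S'.
    u32::from_str_radix(&compact, 8).ok().and_then(char::from_u32)
}

fn main() {
    assert_eq!(octal_escape_value(" 1 2 3"), Some('S'));
    // Digits outside 0-7 are not valid octal.
    assert_eq!(octal_escape_value("9"), None);
}
```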

@BurntSushi

Member

BurntSushi commented Mar 30, 2017

@robinst Thanks for finding this! Sorry it slipped out of my queue, but your blog post caught my attention. :-) Nice work!

I'm not sure the fix is right either. Does this also apply to things like `\p{G r e ek}`? We should probably look at how other regex engines handle verbose mode.

@robinst

Contributor

robinst commented Mar 30, 2017

> I'm not sure the fix is right either. Does this also apply to things like `\p{G r e ek}`?

Yes, and to things like `[ [ : u p p e r : ] ]`; see the test cases. I'm not sure about it either.

Maybe whitespace should only be allowed between logical groups of characters. For example, it should not be allowed within a number or within a text identifier. Here's what other engines do:

Oniguruma: `(?x) \ x53`, `(?x) \x 53`, and `(?x) \x5 3` all compile but don't match `S`.

Perl behaves the same way, checked with `perl -e 'print "matches" if "S" =~ /(?x) \ x53/'`.

So at least for `\x`, neither engine seems to allow space anywhere in the sequence. Even `\ d` doesn't match a digit in those engines, whereas in regex-syntax it does.

@BurntSushi

Member

BurntSushi commented Apr 1, 2017

Thinking about this a bit more, it feels like we shouldn't allow arbitrary whitespace in arbitrary syntax. Maybe things like \p{G r e e k} should produce a syntax error?

robinst added a commit to robinst/regex that referenced this pull request Apr 7, 2017

Fix panics with whitespace in extended mode by being more strict
Instead of ignoring space in all the bump/peek methods (as proposed in
pull request rust-lang#349), have an explicit `ignore_space` method that can be
used in places where space/comments should be allowed.

This makes parsing a bit stricter than before as well.
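
A rough sketch of what that stricter approach could look like; this is not the code from the referenced commit. Only the name `ignore_space` comes from the commit message. The `Parser` struct, `parse_decimal`, `expect`, `parse_counted`, and the choice of where whitespace is permitted are all assumptions made for illustration.

```rust
/// Hypothetical toy parser for a counted repetition such as `{1,10}`.
struct Parser {
    chars: Vec<char>,
    pos: usize,
}

impl Parser {
    fn new(s: &str) -> Self {
        Parser { chars: s.chars().collect(), pos: 0 }
    }

    /// Skip whitespace. Called explicitly, only where space is allowed.
    fn ignore_space(&mut self) {
        while self.pos < self.chars.len() && self.chars[self.pos].is_whitespace() {
            self.pos += 1;
        }
    }

    /// Read a run of decimal digits; whitespace ends the number.
    fn parse_decimal(&mut self) -> Result<u32, String> {
        let start = self.pos;
        while self.pos < self.chars.len() && self.chars[self.pos].is_ascii_digit() {
            self.pos += 1;
        }
        let digits: String = self.chars[start..self.pos].iter().collect();
        digits
            .parse::<u32>()
            .map_err(|e| format!("invalid number {:?}: {}", digits, e))
    }

    /// Consume exactly the character `c`, or report a syntax error.
    fn expect(&mut self, c: char) -> Result<(), String> {
        if self.chars.get(self.pos) == Some(&c) {
            self.pos += 1;
            Ok(())
        } else {
            Err(format!("expected {:?} at position {}", c, self.pos))
        }
    }

    /// Space is allowed between tokens, never inside a number.
    fn parse_counted(&mut self) -> Result<(u32, u32), String> {
        self.expect('{')?;
        self.ignore_space();
        let min = self.parse_decimal()?;
        self.ignore_space();
        self.expect(',')?;
        self.ignore_space();
        let max = self.parse_decimal()?;
        self.ignore_space();
        self.expect('}')?;
        Ok((min, max))
    }
}

fn main() {
    assert_eq!(Parser::new("{ 1 , 10 }").parse_counted(), Ok((1, 10)));
    // `1 0` is two tokens, so this is an error instead of a panic.
    assert!(Parser::new("{1, 1 0 }").parse_counted().is_err());
}
```

Because malformed input surfaces as an `Err` from the parsing function itself, there is no later `unwrap()` on a string that unexpectedly contains spaces.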
@robinst

Contributor

robinst commented Apr 7, 2017

> Thinking about this a bit more, it feels like we shouldn't allow arbitrary whitespace in arbitrary syntax.

Agreed. I've prepared a different pull request here: #354

bors added a commit that referenced this pull request May 20, 2017

Auto merge of #354 - robinst:fix-panics-in-extended-mode-by-being-more-strict, r=BurntSushi

Fix panics with whitespace in extended mode by being more strict

Instead of ignoring space in all the bump/peek methods (as proposed in
pull request #349), have an explicit `ignore_space` method that can be
used in places where space/comments should be allowed.

This makes parsing a bit stricter than before as well.
@BurntSushi

Member

BurntSushi commented May 20, 2017

I decided to go with #354 over this one. Thanks so much!

@BurntSushi BurntSushi closed this May 20, 2017

@robinst robinst deleted the robinst:fix-panics-parsing-regex-with-extended-mode branch May 22, 2017
