Fix panics parsing regex with whitespace in extended mode #349

robinst · 2017-03-20T01:49:34Z

The added tests fail without the fix like this:

---- parser::tests::ignore_space_escape_hex2 stdout ----
	thread 'parser::tests::ignore_space_escape_hex2' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 10, surround: "x 5 3", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_escape_hex stdout ----
	thread 'parser::tests::ignore_space_escape_hex' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 12, surround: "{ 5 3 }", kind: InvalidBase16(" 5 3") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_ascii_classes stdout ----
	thread 'parser::tests::ignore_space_ascii_classes' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 5, surround: "(?x)[ [ : ", kind: UnsupportedClassChar('[') }', src/libcore/result.rs:860
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- parser::tests::ignore_space_escape_octal stdout ----
	thread 'parser::tests::ignore_space_escape_octal' panicked at 'valid octal number', src/libcore/option.rs:785

---- parser::tests::ignore_space_escape_unicode_name stdout ----
	thread 'parser::tests::ignore_space_escape_unicode_name' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: "Y i }", kind: UnrecognizedUnicodeClass(" Y i") }', src/libcore/result.rs:860

---- parser::tests::ignore_space_repeat_counted stdout ----
	thread 'parser::tests::ignore_space_repeat_counted' panicked at 'called `Result::unwrap()` on an `Err` value: Error { pos: 15, surround: ", 1 0 }", kind: InvalidBase10("1 0") }', src/libcore/result.rs:860

The reason for the panics is that `bump_get` would ignore space when
walking the characters, but then keep the spaces in the returned String.
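
To make the failure mode concrete, here is a minimal sketch of the mismatch described above. It is not the actual `bump_get` code: `take_raw` and `take_stripped` are hypothetical helpers written only for illustration, assuming the parser counts significant characters but slices the raw input.

```rust
/// Hypothetical stand-in for the buggy behavior: count `n` significant
/// (non-whitespace) characters, but return the raw slice, spaces included.
fn take_raw(input: &str, n: usize) -> &str {
    let mut seen = 0;
    let mut end = 0;
    for (i, c) in input.char_indices() {
        if seen == n {
            break;
        }
        if !c.is_whitespace() {
            seen += 1;
        }
        end = i + c.len_utf8();
    }
    &input[..end]
}

/// Hypothetical stand-in for the approach in this pull request: drop the
/// whitespace from the returned text as well.
fn take_stripped(input: &str, n: usize) -> String {
    input
        .chars()
        .filter(|c| !c.is_whitespace())
        .take(n)
        .collect()
}
```

With the input `" 5 3"`, `take_raw` hands `" 5 3"` to the number parser, and `u32::from_str_radix(" 5 3", 16)` returns an error, which matches the `InvalidBase16(" 5 3")` failures above. `take_stripped` yields `"53"`, which parses as `0x53`.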

Found using cargo-fuzz.


robinst · 2017-03-20T01:51:58Z

The fuzz script is here (not sure if you would want to merge that or not): master...robinst:add-cargo-fuzz-script

You can run it using `cargo install cargo-fuzz && cargo fuzz run fuzzer_script_parse`.

The artifact that it returned was this: m:(?xxxxxxxxxxxxxms)mmm\x00\x01\x00.\x00+@-\x00\x0a\x00\x10\x10\x10\x10\x10\x10\x10\x10'\x10\x02[--\x0a[\x00\x00$\\\x0a3\x03[\x00:6\x03D\x00. Reduced that and added tests for other cases.

robinst · 2017-03-20T02:32:17Z

regex-syntax/src/parser.rs

+
+    #[test]
+    fn ignore_space_escape_octal() {
+        assert_eq!(p(r"(?x)\ 1 2 3"), lit('S'));


Seems a bit weird that it's allowed to add space between digits of a number, but that seems to be the closest to the current behavior.
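
For reference, the expected literal `'S'` in that test follows from reading `1 2 3` as the octal number 123, which is 83 in decimal. A minimal sketch of that arithmetic, using a hypothetical helper `octal_ignoring_space` that is not part of regex-syntax:

```rust
/// Hypothetical helper: strip whitespace from the digits, then read the
/// remainder as an octal code point.
fn octal_ignoring_space(digits: &str) -> Option<char> {
    let compact: String = digits.chars().filter(|c| !c.is_whitespace()).collect();
    u32::from_str_radix(&compact, 8)
        .ok()
        .and_then(char::from_u32)
}
```

Here `octal_ignoring_space(" 1 2 3")` gives `Some('S')`, while a non-octal digit such as `8` gives `None`.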

BurntSushi · 2017-03-30T00:53:59Z

@robinst Thanks for finding this! Sorry it slipped out of my queue, but your blog post caught my attention. :-) Nice work!

I'm not sure the fix is right either. Does this also apply to things like \p{G r e ek}? We should probably look at how other regex engines handle verbose mode.

robinst · 2017-03-30T02:05:35Z

> I'm not sure the fix is right either. Does this also apply to things like \p{G r e ek}?

Yes, and things like [ [ : u p p e r : ] ], see the test cases. I'm not sure about it either.

Maybe whitespace should only be allowed between logical groups of characters. For example, it should not be allowed within a number or within a text identifier. Here's what other engines do:

Oniguruma: `(?x) \ x53`, `(?x) \x 53`, and `(?x) \x5 3` all compile but don't match `S`.

Perl behaves the same way, checked with `perl -e 'print "matches" if "S" =~ /(?x) \ x53/'`.

So at least for `\x`, they don't seem to allow any space within the sequence. Even `\ d` doesn't match a digit in those engines, whereas it does in regex-syntax.

BurntSushi · 2017-04-01T18:41:55Z

Thinking about this a bit more, it feels like we shouldn't allow arbitrary whitespace in arbitrary syntax. Maybe things like \p{G r e e k} should produce a syntax error?

Instead of ignoring space in all the bump/peek methods (as proposed in pull request rust-lang#349), have an explicit `ignore_space` method that can be used in places where space/comments should be allowed. This makes parsing a bit stricter than before as well.
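
A minimal sketch of that idea follows, assuming a toy parser rather than the real regex-syntax one. The `Parser` type, its fields, and `parse_hex_escape` are hypothetical; only the name `ignore_space` comes from the description above, and comment skipping is left out for brevity.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Hypothetical toy parser, not the regex-syntax implementation.
struct Parser<'a> {
    chars: Peekable<Chars<'a>>,
    extended: bool,
}

impl<'a> Parser<'a> {
    fn new(pattern: &'a str, extended: bool) -> Self {
        Parser { chars: pattern.chars().peekable(), extended }
    }

    /// Skip whitespace, but only in extended mode. Callers invoke this
    /// explicitly at points where space is allowed.
    fn ignore_space(&mut self) {
        while self.extended
            && self.chars.peek().map_or(false, |c| c.is_whitespace())
        {
            self.chars.next();
        }
    }

    /// Parse `\xHH`. Space may precede the escape, but not appear inside it.
    fn parse_hex_escape(&mut self) -> Result<char, String> {
        self.ignore_space();
        if self.chars.next() != Some('\\') || self.chars.next() != Some('x') {
            return Err("expected \\x".to_string());
        }
        let mut value = 0u32;
        for _ in 0..2 {
            let c = self
                .chars
                .next()
                .ok_or_else(|| "unexpected end of pattern".to_string())?;
            let digit = c
                .to_digit(16)
                .ok_or_else(|| format!("invalid hex digit {:?}", c))?;
            value = value * 16 + digit;
        }
        char::from_u32(value).ok_or_else(|| "invalid code point".to_string())
    }
}
```

Under these assumptions, `Parser::new(r"  \x53", true).parse_hex_escape()` returns `Ok('S')`, while `\x5 3` and `\ x53` are rejected as errors instead of having their spaces silently dropped.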

robinst · 2017-04-07T09:14:41Z

> Thinking about this a bit more, it feels like we shouldn't allow arbitrary whitespace in arbitrary syntax.

Agreed. I've prepared a different pull request here: #354

…e-strict, r=BurntSushi

Fix panics with whitespace in extended mode by being more strict

Instead of ignoring space in all the bump/peek methods (as proposed in pull request #349), have an explicit `ignore_space` method that can be used in places where space/comments should be allowed. This makes parsing a bit stricter than before as well.

BurntSushi · 2017-05-20T14:14:35Z

I decided to go with #354 over this one. Thanks so much!

robinst mentioned this pull request Apr 7, 2017

Fix panics with whitespace in extended mode by being more strict #354

Merged

BurntSushi closed this May 20, 2017

robinst deleted the fix-panics-parsing-regex-with-extended-mode branch May 22, 2017 08:17
