fix several small bugs found from fuzzing #262

BurntSushi · 2016-07-10T02:25:18Z

Fixes #255, #257, #246, #251, #250 and #241. Many of these were found by @lukaslueg applying AFL to this library. Many thanks!

The commit messages contain the gory details.

If we ignore the start offset, then we may report a match where none exists. This can in particular lead to a match loop that never terminates. Fixes #255.
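To make the failure mode concrete, here is a minimal std-only sketch. It is not the library's code: `find_anchored_literal` and `find_all` are hypothetical names, and the "anchored literal" is assumed to be a pattern such as `^abc` that can only match at offset 0.

```rust
/// Hypothetical helper: search for a literal that must match at the very
/// beginning of `text` (as in `^abc`), with the search starting at `start`.
fn find_anchored_literal(text: &str, literal: &str, start: usize) -> Option<(usize, usize)> {
    // An anchored literal can only match at offset 0, so a search that
    // begins later must fail. Dropping this check reintroduces the bug.
    if start > 0 {
        return None;
    }
    if text.starts_with(literal) {
        Some((0, literal.len()))
    } else {
        None
    }
}

/// Collect successive matches the way a match iterator would.
fn find_all(text: &str, literal: &str) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    let mut start = 0;
    while start <= text.len() {
        match find_anchored_literal(text, literal, start) {
            Some((s, e)) => {
                matches.push((s, e));
                // Resume after the match; step one byte past an empty match.
                start = if e == s { e + 1 } else { e };
            }
            None => break,
        }
    }
    matches
}
```

With the `start > 0` check in place, `find_all("abcabc", "abc")` yields the single match `(0, 3)`. Without it, every search would report `(0, 3)` again, `start` would stay at 3, and the loop would never terminate.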

The compiler in particular assumes that it never gets an empty character class. The current parser is pretty paranoid about rejecting empty classes, but a few tricky cases made it through. In particular, one can write `[^\d\D]` to correspond to "match nothing." This commit now looks for empty classes explicitly, and if one is found, returns an error.

Interestingly, other regex engines allow this particular idiosyncrasy and interpret it as "never match." Even more interesting, expressions like `a{0}` are also allowed (including by this regex library) and are interpreted as "always match the empty string." Both seem semantically the same. In any case, we forbid empty character classes, primarily because that seems like the sensible thing to do but secondarily because it's the conservative choice.

It seems plausible that such a construct could be occasionally useful if one were machine generating regexes, because it could be used to indicate "never match." If we do want to support that use case, we'll need to add a new opcode to the regex matching engines. One can still achieve that today using something like `(a|[^a])`.

Fixes #257, where using such a form caused an assert to trip in the compiler. A new, more explicit assert has been added.
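As an illustration of why `[^\d\D]` ends up empty, the following sketch models a class as a list of inclusive byte ranges and negates it. It is a simplification under stated assumptions: it works over bytes `0x00..=0xFF` rather than Unicode scalar values, and `ByteRange`, `negate`, and `check_class` are hypothetical names, not the parser's actual types or functions.

```rust
/// Inclusive byte range, e.g. (b'0', b'9') for ASCII digits.
type ByteRange = (u8, u8);

/// Return the complement of `ranges` over 0x00..=0xFF.
/// Ranges may be unsorted or overlapping, but each must satisfy lo <= hi.
fn negate(ranges: &[ByteRange]) -> Vec<ByteRange> {
    let mut sorted = ranges.to_vec();
    sorted.sort();
    let mut out = Vec::new();
    // `next` is the smallest byte not yet covered; u16 avoids overflow at 0xFF.
    let mut next: u16 = 0;
    for &(lo, hi) in &sorted {
        if (lo as u16) > next {
            out.push((next as u8, lo - 1));
        }
        next = next.max(hi as u16 + 1);
    }
    if next <= 0xFF {
        out.push((next as u8, 0xFF));
    }
    out
}

/// Hypothetical parser-side check: refuse to hand an empty class to the compiler.
fn check_class(ranges: Vec<ByteRange>) -> Result<Vec<ByteRange>, String> {
    if ranges.is_empty() {
        Err("empty character class".to_string())
    } else {
        Ok(ranges)
    }
}
```

In this model, the union of `\d` and `\D` covers every byte, so negating it leaves no ranges and `check_class` returns an error instead of passing an empty class along.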

Fixes #246.

The bug shown in #251 has the same underlying cause as the bug in #255, which has been fixed in a previous commit. This commit just adds a more specific regression test for #251. Fixes #251.

When Unicode mode is disabled, we also disable the use of Unicode literals in the regular expression, since it can lead to unintuitive behavior. In this case, Unicode literals in character classes were not disallowed, and subsequent code filtered them out, which resulted in an empty character class. The compiler assumes that empty character classes are not allowed, and so this causes an assert to trigger. Fixes #250.
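The difference between filtering and rejecting can be sketched as follows. This is only an illustration: `class_literals` is a hypothetical function, and the real parser works on its own class representation rather than a slice of `char`.

```rust
/// Hypothetical check for the literals of a character class.
/// With `unicode` disabled, a non-ASCII literal is reported as an error
/// instead of being filtered out silently, which could leave the class empty.
fn class_literals(literals: &[char], unicode: bool) -> Result<Vec<char>, String> {
    for &c in literals {
        if !unicode && !c.is_ascii() {
            return Err(format!(
                "Unicode literal {:?} is not allowed when Unicode mode is disabled",
                c
            ));
        }
    }
    Ok(literals.to_vec())
}
```

Under these assumptions, a class containing only `☃` is rejected up front when Unicode mode is off, so no empty class ever reaches the compiler.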

…red. This commit fixes a bug where matching `(?-u:\B)` (that is, "not an ASCII word boundary") in the NFA engines could produce match positions at invalid UTF-8 sequence boundaries. The specific problem is that determining whether `(?-u:\B)` matches relies on knowing whether we must report matches only at UTF-8 boundaries, and this wasn't being taken into account. (Normally, we prefer to enforce this invariant in the compiler, so that the matching engines mostly don't have to care about it.) But the zero-width assertions are a special case all around, so we need to handle ASCII word boundaries differently depending on whether we require valid UTF-8. This bug was noticed because the DFA handles this correctly (by encoding ASCII word boundaries into the state machine itself, which in turn guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an inconsistency. Fixes #241.
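The interaction between `(?-u:\B)` and UTF-8 boundaries can be sketched with a small std-only example. This is one possible way to express the check, not the code in the NFA engines; the function names and the `only_utf8` parameter are assumptions made for illustration.

```rust
/// ASCII word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if `at` does not fall inside a multi-byte UTF-8 sequence.
/// Assumes `text` is valid UTF-8 and `at <= text.len()`.
fn is_utf8_boundary(text: &[u8], at: usize) -> bool {
    // Continuation bytes have the bit pattern 10xxxxxx.
    at == text.len() || (text[at] & 0xC0) != 0x80
}

/// Hypothetical check for `(?-u:\B)` at offset `at`.
/// When `only_utf8` is set, positions inside a codepoint never match.
fn not_ascii_word_boundary(text: &[u8], at: usize, only_utf8: bool) -> bool {
    if only_utf8 && !is_utf8_boundary(text, at) {
        return false;
    }
    let before = at > 0 && is_ascii_word_byte(text[at - 1]);
    let after = at < text.len() && is_ascii_word_byte(text[at]);
    // "Not a word boundary" holds when both sides agree.
    before == after
}
```

For the two-byte encoding of `δ` (`0xCE 0xB4`), neither byte is an ASCII word byte, so a purely byte-oriented check reports a match at offset 1, in the middle of the codepoint. With `only_utf8` set, that position is rejected while offsets 0 and 2 still match.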

BurntSushi · 2016-07-10T04:45:02Z

These fixes are in regex 0.1.72 on crates.io.

BurntSushi added 6 commits July 9, 2016 15:48

Don't ignore the start offset when searching for an anchored literal.

cd85664

If we ignore the start offset, then we may report a match where none exists. This can in particular lead to a match loop that never terminates. Fixes #255.

Explicitly state that flags aren't incorporated into the pattern string.

e55c7ed

Fixes #246.

Add a regression test.

f07b83d

The bug shown in #251 has the same underlying cause as the bug in #255, which has been fixed in a previous commit. This commit just adds a more specific regression test for #251. Fixes #251.

BurntSushi force-pushed the fix-bugs branch from cd99d4d to 84a2bf5 July 10, 2016 02:45

BurntSushi merged commit 01c92c8 into master Jul 10, 2016

BurntSushi deleted the fix-bugs branch July 10, 2016 04:41
