Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix several small bugs found from fuzzing #262

Merged
merged 6 commits into from
Jul 10, 2016
Merged

fix several small bugs found from fuzzing #262

merged 6 commits into from
Jul 10, 2016

Conversation

BurntSushi
Copy link
Member

Fixes #255, #257, #246, #251, #250 and #241 (many of which were found by @lukaslueg applying AFL to this library---many thanks!)

The commit messages contain the gory details.

If we ignore the start offset, then we may report a match where none
exists. This can in particular lead to a match loop that never terminates.

Fixes #255.
The compiler in particular assumes that it never gets an empty character
class. The current parser is pretty paranoid about rejecting empty classes,
but a few tricky cases made it through. In particular, one can write
`[^\d\D]` to correspond to "match nothing." This commit now looks for
empty classes explicitly, and if one is found, returns an error.

Interestingly, other regex engines allow this particular idiosyncrasy and
interpret it as "never match." Even more interesting, expressions like
`a{0}` are also allowed (including by this regex library) and are
interpreted as "always match the empty string." Both seem semantically the
same. In any case, we forbid empty character classes, primarily because
that seems like the sensible thing to do but secondarily because it's the
conservative choice.

It seems plausible that such a construct could be occasionally useful if
one were machine generating regexes, because it could be used to indicate
"never match." If we do want to support that use case, we'll need to add
a new opcode to the regex matching engines. One can still achieve
that today using something like `(a|[^a])`.

Fixes #257, where using such a form caused an assert to trip in the
compiler. A new, more explicit assert has been added.
The bug shown in #251 has the same underlying cause as the bug in
#255, which has been fixed in a previous commit. This commit just adds
a more specific regression test for #251.

Fixes #251.
When Unicode mode is disabled, we also disable the use of Unicode literals
in the regular expression, since it can lead to unintuitive behavior. In
this case, Unicode literals in character classes were not disallowed, and
subsequent code filtered them out, which resulted in an empty character
class. The compiler assumes that empty character classes are not allowed,
and so this causes an assert to trigger.

Fixes #250.
…red.

This commit fixes a bug where matching (?-u:\B) (that is, "not an ASCII
word boundary") in the NFA engines could produce match positions at invalid
UTF-8 sequence boundaries. The specific problem is that determining whether
(?-u:\B) matches or not relies on knowing whether we must report matches
only at UTF-8 boundaries, and this wasn't actually being taken into
account. (Instead, we prefer to enforce this invariant in the compiler, so
that the matching engines mostly don't have to care about it.) But of
course, the zero-width assertions are kind of a special case all around,
so we need to handle ASCII word boundaries differently depending on
whether we require valid UTF-8.

This bug was noticed because the DFA actually handles this correctly (by
encoding ASCII word boundaries into the state machine itself, which in turn
guarantees the valid UTF-8 invariant) while the NFAs don't, leading to an
inconsistency.

Fix #241.
@BurntSushi
Copy link
Member Author

These fixes are in regex 0.1.72 on crates.io.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant