Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply AFL to regex #203

Closed
BurntSushi opened this issue Apr 13, 2016 · 47 comments
Closed

apply AFL to regex #203

BurntSushi opened this issue Apr 13, 2016 · 47 comments

Comments

@BurntSushi
Copy link
Member

Currently, the parser is "fuzzed" to some extent with quickcheck, which ensures that a randomly generated AST can be roundtripped to concrete syntax and back. However, we should also try to apply more generic fuzzing techniques not only to the parser, but to regex searching itself. AFL seems like an ideal approach.

@SeanRBurton
Copy link
Contributor

I'm willing to do this if you want.

@BurntSushi
Copy link
Member Author

@SeanRBurton Yes, that'd be lovely, thank you!

@lukaslueg
Copy link
Contributor

I'm currently hitting regex with AFL, calling RegEx::new on a AFL-generated pattern and ::is_match(), ::find() and ::captures() on a static test-string. So far I've only seen some "crashes" because regex exceeds the memory limit set by AFL - this could probably be circumvented by using ::with_size_limit().
I wonder what a more thorough strategy would look like.

@BurntSushi
Copy link
Member Author

@lukaslueg Could you please open issues with test cases that reproduce your problem? What is the memory limit?

@lukaslueg
Copy link
Contributor

The memory limit is AFL's default of 50mb VM-size. Re-running with a manually-set ulimit -v reproduced the out-of-memory-hard-abort. I'll post an example when I get one again - I did not pay too much attention to it since there is always ::with_size_limit().

Does it make sense to test the generated patterns against a randomly generated instead of a static string? Basically the only thing AFL is looking for right now are arithmetic overflows and panics...

@BurntSushi
Copy link
Member Author

I'm definitely interested in regexes that use more than 50MB of heap. If you could create a new issue with examples that'd be great!

Does it make sense to test the generated patterns against a randomly generated instead of a static string? Basically the only thing AFL is looking for right now are arithmetic overflows and panics...

Yes, definitely.

@lukaslueg
Copy link
Contributor

Is there a minimum length for the haystack? We want the regex to switch strategies (if it does that at all)

@BurntSushi
Copy link
Member Author

It can switch strategies depending on the length, but the specific length isn't fixed. If the combined size of the input and the regex exceeds ~256KB, then the capturing engine will switch from backtracking to the Pike VM. So if your haystack is over 256KB, that should do it. Note though that this only happens when extracting captures (the captures method but not is_match or find).

@lukaslueg
Copy link
Contributor

I'll continue running AFL-generated regexes against a small haystack since matching a string of 260kb is painfully slow. Two crashing bugs and one OOM-situation so far - I'll switch strategies when things quite down on the small haystack

@BurntSushi
Copy link
Member Author

@lukaslueg Thank you! :-)

@lukaslueg
Copy link
Contributor

Can you give some info on when regex actually executes the simd path?

@BurntSushi
Copy link
Member Author

@lukaslueg Any time there are multiple literals in the prefix. For example, aaa|bbb|ccc|ddd|eee|fff should execute the SIMD algorithm.

Note that you need to compile like so:

RUSTFLAGS="-C target-feature=+ssse3" cargo build --release --features simd-accel

And certainly, you'll need a nightly compiler.

@lukaslueg
Copy link
Contributor

That's what I'm doing. I've also patched MAX_SIZE_BYTES in backtrack.rs to be 256 instead of 256kb in order to give PikeVM a rub without slowing things down too much. Any more suggestions?

@BurntSushi
Copy link
Member Author

That's all I can think of at the moment for things based on length. There is also a reverse suffix literal optimization, e.g., \w+Foobar will search for Foobar before actually matching \w+, but I don't know if you have that kind of control with AFL or not.

@lukaslueg
Copy link
Contributor

You do. It would be of great help if you could post some examples that you know will trigger certain behavior.

@BurntSushi
Copy link
Member Author

Examples that should use the reverse suffix literal optimization:

  • \w+foobar
  • \w+foobar\w+foobar
  • \d+foobar\d+foobar
  • \d+foobar\d+(foo|baz)barbar

@lukaslueg
Copy link
Contributor

"\w+a" also? the shorter the better, since it's a random string we are matching against.

Other behavior like the simd-path?

@BurntSushi
Copy link
Member Author

No to \w+a. The reverse suffix literal optimization only kicks in when the literal is long enough (under the assumption that a longer literal yields fewer false positives). You can see the logic here: https://github.com/rust-lang-nursery/regex/blob/master/src/exec.rs#L1154-L1174

Other examples:

  • ^foobar (begin anchored literal)
  • foobar$ (end anchored literal)
  • ^foobar$ (anchored literal)
  • ^\w+ (begin anchored)
  • \w+$ (end anchored)
  • ^\w+$ (anchored)

@lukaslueg
Copy link
Contributor

I've added some examples as above which increase the number of execution paths covered. Things have quieted down though and I'll wait for the reported bugs to get fixed before continuing running AFL. I've got everything in a docker container so things can pick up fast once regex's codebase changes.

@lukaslueg
Copy link
Contributor

Looking at #241: Is there a way for reliably force a certain regex engine without patching the code? It would be interesting to compare the result of different engines.
I've also thought about comparing the result to RE2, yet this seems hard to actually get right without too many false positives.

@BurntSushi
Copy link
Member Author

@lukaslueg Sorry, I missed your most recent comment here. Issue #241 actually contains the relevant code to force a particular regex engine. You may be surprised to find out that things like ExecBuilder are technically exposed through a publicly-accessible-but-not-public-interface.

Comparing with RE2 (or even PCRE2) is probably a good idea.

FYI, I have a PR incoming that fixes most of your bugs. :-)

@BurntSushi
Copy link
Member Author

I think @lukaslueg has done enough of this to make this issue closeable for now. I welcome more fuzzing though! :-)

@lukaslueg
Copy link
Contributor

As a matter of fact, I just spent half a day reworking the fuzzing architecture to make it self-contained. The latest HEAD is as of just now being fuzzed in The Cloud© on a 24/7 machine.

There already have been some patterns turning up where the different regex-engines disagree on match-groups, most likely exposing bugs in at least one of them. I'll spend some time during the next days to minimize them and file bugs individually.

@BurntSushi
Copy link
Member Author

@lukaslueg Awesome, thanks!

@BurntSushi BurntSushi reopened this Dec 30, 2016
@utkarshkukreti
Copy link
Contributor

<offtopic slightly> @lukaslueg that sounds very cool! Is the code/scripts you wrote for this available online somewhere? And are you using https://github.com/frewsxcv/afl.rs?

@lukaslueg
Copy link
Contributor

The code is currently not available, I'll sync a repo once things have stabilized. I used to have a docker container with a custom-built llvm/rust/afl-combo but rust moved ahead of afl.rs and I switched to the container provided by afl.rs just recently. Turns out it is also quite old, unstable and not reproducible so I'm unhappy with this solution as well.

@lukaslueg
Copy link
Contributor

I've started comparing results from libpcre to rust-regex using AFL. This is probably way more tricky that once thought because the regex engines seem to interpret things differently. For starters, the regex \b.\b produces a match with rust-regex but does not with libpcre in the same randomly generated haystack.
I suspect libpcre handles \bdifferently than rust-regex. Some guidance on this @BurntSushi ? I could filter out patterns that are known to produce different results.

@BurntSushi
Copy link
Member Author

@lukaslueg In rust-regex, \b is Unicode aware by default. In fact, most everything is Unicode aware. You'll probably need to enable Unicode support in PCRE, although I'm not sure if that will cover word boundaries. For rust-regex, you can revert to the ASCII versions of various things by disabling Unicode support. e.g., (?-u:\b) is an ASCII-only word boundary, (?-u:\w) is an ASCII-only word character class and so on.

Thanks for doing this by the way. :-)

@lukaslueg
Copy link
Contributor

I've given regex fuzzing love for the last 12 days on a dedicated machine and around 900 million rounds were executed. Variations of #375 show up, due to #345 mismatches were not examined. All Quiet on the Western Front; close?

@BurntSushi
Copy link
Member Author

@lukaslueg Hope so! Sorry I'm dragging my feet. These bugs take a lot of lead up time to load context into my brain, so I tend to put them off.

@frewsxcv
Copy link
Member

frewsxcv commented Dec 5, 2017

fyi, i ran AFL on this crate a few weeks back and found this panic. i've also got a fuzz target linked if anyone wants to run afl.rs themselves for this crate

@BurntSushi
Copy link
Member Author

@frewsxcv Thanks! It takes a while to fix these bugs. It takes a lot of time just to build up the context. I tend to let them pile up a little so I can amortize it.

@frewsxcv
Copy link
Member

frewsxcv commented Dec 5, 2017

oh for sure! i didn't meant to suggest urgency in getting it fix, just linking in case others wanted to see the fuzz target i used and the sort of bug it found. keep up the great work @BurntSushi :)

@BurntSushi
Copy link
Member Author

@frewsxcv Oh ya no worries! Thanks for filing bugs!

@sanmai-NL
Copy link

sanmai-NL commented Feb 24, 2018

@lukaslueg: Do you have thoughts on the advantages/disadvantages of using rust-fuzz/honggfuzz-rs instead/additionally?

@lukaslueg
Copy link
Contributor

Back in June 2017 AFL ran for several weeks on some cloud-node and found no (further) crashes and variations of #321 (which is yet unresolved). The low hanging fruits due to obvious bugs have been picked up.

I could still be interesting to have honggfuzz take a look at regex with respect to "behavior", though.
My thought on how to proceed would be to try to fix #321 (so we have consistent behavior within regex) and then start running fuzzed inputs against regex and re2 / pcre2 to see if they agree.

I started doing that with pcre2 as a reference. It turned out that the rust-glue available at the time was not the best and idiosyncrasies in pcre2 produced an unmanageable amount of noise. I did not put too much effort into it, though.

@ethanpailes
Copy link
Contributor

For anyone looking for more ways to fuzz, I believe any regex that parses should pass backends_are_consistent.

See #468 for more info.

@lukaslueg
Copy link
Contributor

I've not fuzzed regexp since #345 showed up because the duplicates cloud the view: One would have to manually and carefully inspect whether or not two diverging test cases are actually the same underlying problem in order not to waste anyones time.

Turns out fuzzers are surprisingly good at producing funny regular expressions, so I think the works are worthwhile. Yet I also think we need to take the slow route, fix one problem at a time and start over - which is not as bad as it sounds.

Ping me if #345 gets fixed by cough someone. @BurntSushi maybe we can get something like backends_are_consistent as a #[doc(hidden)] into regex?

@ethanpailes
Copy link
Contributor

There is already an internal module for doing various dubious things like lobotomizing the execution planner for testing purposes, so that could be a good place to put backends_are_consistent. It might also be worthwhile to expose an input param in backends_are_consistent so that the fuzzer can drive the randomness. I understand fuzzers and quickcheck operate a little differently.

@lukaslueg
Copy link
Contributor

The input probably should &[u8] for both the haystack and the regex. In case it's valid utf8, it runs a utf8-version; in any case it runs the byte-version. Return type should be something like Option<Result<(), &str>>, which is None if the regex is invalid, Some(Ok(())) if nothing happened and Some(Err(e)) if the results are inconsistent.

@davisjam
Copy link
Contributor

Some ideas here on a starting set of regexes for fuzzing.

  1. The rxxr2 project has a mechanism to automatically generate regex patterns, and tests them against itself and some other engines that follow a similar spec. Possibly a nice way to seed fuzzing efforts.
  2. I have a giant corpus of real JavaScript and Python regexes (~300K) from my FSE'18 paper that could provide a realistic set of starting points.

@BurntSushi
Copy link
Member Author

@davisjam Those sound like great ideas! The corpus may need a bit of massaging since this crate doesn't support all the same features that Javascript/Python does.

@davisjam
Copy link
Contributor

@BurntSushi If leveraging my regex corpus would be of interest, I have a filtered set that compiles in Rust. I don't currently have the cycles to filter out redundant regexes (e.g. those that use identical feature sets) or attempt to bring those into the Rust engine, but I'd be happy to share what I've got so far.

@BurntSushi
Copy link
Member Author

BurntSushi commented Jan 31, 2019

@davisjam Thanks! If you can put it in an accessible place, then I can grab it. I don't know when I'll get a chance to look at it, but it seems useful. One point of clarity: what is the license on your test cases?

@davisjam
Copy link
Contributor

davisjam commented Jan 31, 2019

@BurntSushi

If you can put it in an accessible place, then I can grab it

I've added this to my TODO list. Won't be in the short term though. Maybe a month or two...

What is the license?

I'm a researcher, so the license is "Whatever benefits the world and also acknowledges/cites me".

@BurntSushi
Copy link
Member Author

@davisjam Sounds good. The MIT license would be simplest.

@BurntSushi
Copy link
Member Author

This fuzzer landed and it has been running on Google's infrastructure for a few years now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants