apply AFL to regex #203
I'm willing to do this if you want. |
@SeanRBurton Yes, that'd be lovely, thank you! |
I'm currently hitting |
@lukaslueg Could you please open issues with test cases that reproduce your problem? What is the memory limit? |
The memory limit is AFL's default of 50mb VM-size. Re-running with a manually-set Does it make sense to test the generated patterns against a randomly generated instead of a static string? Basically the only thing AFL is looking for right now are arithmetic overflows and panics... |
I'm definitely interested in regexes that use more than 50MB of heap. If you could create a new issue with examples that'd be great!
Yes, definitely. |
Is there a minimum length for the haystack? We want the regex to switch strategies (if it does that at all) |
It can switch strategies depending on the length, but the specific length isn't fixed. If the combined size of the input and the regex exceeds ~256KB, then the capturing engine will switch from backtracking to the Pike VM. So if your haystack is over 256KB, that should do it. Note though that this only happens when extracting captures (the |
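A minimal sketch of building a haystack large enough to cross the ~256KB figure quoted above, using only the standard library. The helper name `make_haystack`, the xorshift generator, and the lowercase-ASCII alphabet are assumptions for illustration; none of them come from the regex crate or from the fuzzing setup discussed in this thread.

```rust
/// Hypothetical helper: build a pseudo-random lowercase ASCII haystack of `len` bytes.
/// A xorshift64 generator keeps runs reproducible from the seed.
fn make_haystack(seed: u64, len: usize) -> String {
    // xorshift must not start from zero, so substitute a fixed nonzero state.
    let mut state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
    let mut out = String::with_capacity(len);
    for _ in 0..len {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Map the generator state onto 'a'..='z'.
        let c = b'a' + (state % 26) as u8;
        out.push(c as char);
    }
    out
}

fn main() {
    // ~256KB is the approximate threshold quoted in the thread; go a little over it.
    let threshold = 256 * 1024;
    let haystack = make_haystack(42, threshold + 4096);
    println!("haystack length: {} bytes", haystack.len());
}
```

The resulting string could then be passed to a capture-extracting search so that the combined size of pattern and input exceeds the threshold. Whether a given pattern actually changes engines still depends on the crate version.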
I'll continue running AFL-generated regexes against a small haystack since matching a string of 260kb is painfully slow. Two crashing bugs and one OOM-situation so far - I'll switch strategies when things quiet down on the small haystack
@lukaslueg Thank you! :-) |
Can you give some info on when regex actually executes the simd path? |
@lukaslueg Any time there are multiple literals in the prefix. For example, Note that you need to compile like so:
And certainly, you'll need a nightly compiler. |
That's what I'm doing. I've also patched |
That's all I can think of at the moment for things based on length. There is also a reverse suffix literal optimization, e.g., |
You do. It would be of great help if you could post some examples that you know will trigger certain behavior. |
Examples that should use the reverse suffix literal optimization:
|
"\w+a" also? the shorter the better, since it's a random string we are matching against. Other behavior like the simd-path? |
No to Other examples:
|
I've added some examples as above which increase the number of execution paths covered. Things have quieted down though and I'll wait for the reported bugs to get fixed before continuing running AFL. I've got everything in a docker container so things can pick up fast once regex's codebase changes. |
Looking at #241: Is there a way to reliably force a certain regex engine without patching the code? It would be interesting to compare the results of different engines.
@lukaslueg Sorry, I missed your most recent comment here. Issue #241 actually contains the relevant code to force a particular regex engine. You may be surprised to find out that things like Comparing with RE2 (or even PCRE2) is probably a good idea. FYI, I have a PR incoming that fixes most of your bugs. :-) |
I think @lukaslueg has done enough of this to make this issue closeable for now. I welcome more fuzzing though! :-) |
As a matter of fact, I just spent half a day reworking the fuzzing architecture to make it self-contained. The latest HEAD is as of just now being fuzzed in The Cloud© on a 24/7 machine. There have already been some patterns turning up where the different regex-engines disagree on match-groups, most likely exposing bugs in at least one of them. I'll spend some time over the next few days minimizing them and filing bugs individually.
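The kind of cross-engine comparison described above can be sketched roughly as follows. This is an illustration only: `diff_engines`, the `Span` alias, and the two stand-in "engines" (plain substring searches, one deliberately buggy) are hypothetical and are not the code used for the fuzzing runs. It compares only the overall match span, whereas the disagreements mentioned in the thread concern match groups.

```rust
/// Span of a match as (start, end) byte offsets, or None when there is no match.
type Span = Option<(usize, usize)>;

/// Hypothetical helper: run two engines on the same input and report any disagreement.
fn diff_engines<A, B>(pattern: &str, haystack: &str, a: A, b: B) -> Result<Span, String>
where
    A: Fn(&str, &str) -> Span,
    B: Fn(&str, &str) -> Span,
{
    let (ra, rb) = (a(pattern, haystack), b(pattern, haystack));
    if ra == rb {
        Ok(ra)
    } else {
        Err(format!("engines disagree on {:?}: {:?} vs {:?}", pattern, ra, rb))
    }
}

fn main() {
    // Stand-in engines: both treat the pattern as a literal substring.
    let literal = |p: &str, h: &str| h.find(p).map(|s| (s, s + p.len()));
    // Deliberately buggy variant: refuses to match the empty pattern.
    let buggy = |p: &str, h: &str| {
        if p.is_empty() {
            None
        } else {
            h.find(p).map(|s| (s, s + p.len()))
        }
    };

    for pattern in ["bc", ""] {
        match diff_engines(pattern, "abcd", literal, buggy) {
            Ok(span) => println!("agree on {:?}: {:?}", pattern, span),
            Err(msg) => println!("{}", msg),
        }
    }
}
```

In a real harness the two closures would wrap different regex engines, and each disagreement would be saved as a test case to minimize by hand.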
@lukaslueg Awesome, thanks! |
<offtopic slightly> @lukaslueg that sounds very cool! Is the code/scripts you wrote for this available online somewhere? And are you using https://github.com/frewsxcv/afl.rs? |
The code is currently not available, I'll sync a repo once things have stabilized. I used to have a docker container with a custom-built llvm/rust/afl-combo but rust moved ahead of afl.rs and I switched to the container provided by afl.rs just recently. Turns out it is also quite old, unstable and not reproducible so I'm unhappy with this solution as well. |
I've started comparing results from libpcre to rust-regex using AFL. This is probably way trickier than I first thought because the regex engines seem to interpret things differently. For starters, the regex |
@lukaslueg In rust-regex, Thanks for doing this by the way. :-) |
@lukaslueg Hope so! Sorry I'm dragging my feet. These bugs take a lot of lead up time to load context into my brain, so I tend to put them off. |
fyi, i ran AFL on this crate a few weeks back and found this panic. i've also got a fuzz target linked if anyone wants to run afl.rs themselves for this crate |
@frewsxcv Thanks! It takes a while to fix these bugs. It takes a lot of time just to build up the context. I tend to let them pile up a little so I can amortize it. |
oh for sure! i didn't mean to suggest urgency in getting it fixed, just linking in case others wanted to see the fuzz target i used and the sort of bug it found. keep up the great work @BurntSushi :)
@frewsxcv Oh ya no worries! Thanks for filing bugs! |
@lukaslueg: Do you have thoughts on the advantages/disadvantages of using |
Back in June 2017 AFL ran for several weeks on some cloud-node and found no (further) crashes and variations of #321 (which is as yet unresolved). The low-hanging fruit due to obvious bugs has been picked. It could still be interesting to have honggfuzz take a look at I started doing that with |
For anyone looking for more ways to fuzz, I believe any regex that parses should pass See #468 for more info. |
I've not fuzzed regexp since #345 showed up because the duplicates cloud the view: One would have to manually and carefully inspect whether or not two diverging test cases are actually the same underlying problem in order not to waste anyone's time. Turns out fuzzers are surprisingly good at producing funny regular expressions, so I think the work is worthwhile. Yet I also think we need to take the slow route, fix one problem at a time and start over - which is not as bad as it sounds. Ping me if #345 gets fixed by *cough* someone. @BurntSushi maybe we can get something like |
There is already an |
The input probably should |
Some ideas here on a starting set of regexes for fuzzing.
|
@davisjam Those sound like great ideas! The corpus may need a bit of massaging since this crate doesn't support all the same features that Javascript/Python does. |
@BurntSushi If leveraging my regex corpus would be of interest, I have a filtered set that compiles in Rust. I don't currently have the cycles to filter out redundant regexes (e.g. those that use identical feature sets) or attempt to bring those into the Rust engine, but I'd be happy to share what I've got so far. |
@davisjam Thanks! If you can put it in an accessible place, then I can grab it. I don't know when I'll get a chance to look at it, but it seems useful. One point of clarity: what is the license on your test cases? |
I've added this to my TODO list. Won't be in the short term though. Maybe a month or two...
I'm a researcher, so the license is "Whatever benefits the world and also acknowledges/cites me". |
@davisjam Sounds good. The MIT license would be simplest. |
This fuzzer landed and it has been running on Google's infrastructure for a few years now. |
Currently, the parser is "fuzzed" to some extent with quickcheck, which ensures that a randomly generated AST can be roundtripped to concrete syntax and back. However, we should also try to apply more generic fuzzing techniques not only to the parser, but to regex searching itself. AFL seems like an ideal approach.