Closed
Description
I have a set of ~20 very simple, mostly unanchored regexes to run against a buffer. The chance of at least one match is on the order of a few percent. My idea was to use a RegexSet to scan the buffer once (in reality a stream of ~20 GB) and then rescan with the individual regexes to locate the matches only if there are in fact any.
As it turns out, the RegexSet pass is much, much slower than scanning the buffer once per regex:
#![feature(test)]

extern crate test;
extern crate regex;

static PATTERNS: &'static [&'static str] = &[
    "ABC", "FOO", "BAR", "FOOBAR", "BARFOO",
    "foo", "haha", "asdfk", "safdkj", "sadhakjs", "345845hj",
    "^234jkl2$", "^ß0gjk2$", "^lkjsdf$", "^234rklj$", "2234sg$"
];

fn main() {
    println!("Hello, world!");
}

// 10 KiB of ASCII spaces: none of the patterns match.
fn fixture() -> Vec<u8> {
    std::iter::repeat(32).take(10*1024).collect()
}

// One scan of the buffer per individual Regex.
#[bench]
fn single(b: &mut test::Bencher) {
    let fix = fixture();
    let haystick = std::str::from_utf8(&fix).unwrap();
    let set: Vec<_> = PATTERNS.iter().map(|p| regex::Regex::new(p).unwrap()).collect();
    b.iter(move || {
        let m: Vec<_> = set.iter().map(|r| r.is_match(haystick)).collect();
        test::black_box(m);
    });
}

// A single scan of the buffer with a RegexSet.
#[bench]
fn multiple(b: &mut test::Bencher) {
    let fix = fixture();
    let haystick = std::str::from_utf8(&fix).unwrap();
    let set = regex::RegexSet::new(PATTERNS).unwrap();
    b.iter(move || {
        //let m = set.is_match(haystick);
        let m: Vec<_> = set.matches(haystick).into_iter().collect();
        test::black_box(m);
    });
}

Running target/release/retest-ae36173f0b85f6dc
running 2 tests
test multiple ... bench: 22,619 ns/iter (+/- 384)
test single ... bench: 3,057 ns/iter (+/- 114)
test result: ok. 0 passed; 0 failed; 0 ignor
The RegexSet being much, much slower than rescanning is very surprising.