RegexSet much slower than multiple single scans #247

@lukaslueg

I have a set of ~20 very simple, unanchored regexes to run against a buffer. The chance of at least one match is on the order of a few percent. My idea was to scan the buffer once with a RegexSet (in reality a stream of ~20 GB) and, only if there are in fact any matches, rescan with the individual regexes to locate them.

As it turns out, the RegexSet pass is much slower than scanning the buffer once per individual regex:

#![feature(test)]

extern crate test;
extern crate regex;

static PATTERNS: &'static [&'static str] = &[
    "ABC", "FOO", "BAR", "FOOBAR", "BARFOO",
    "foo", "haha", "asdfk", "safdkj", "sadhakjs", "345845hj",
    "^234jkl2$", "^ß0gjk2$", "^lkjsdf$", "^234rklj$", "2234sg$"
];

fn main() {
    println!("Hello, world!");
}

fn fixture() -> Vec<u8> {
    std::iter::repeat(32).take(10*1024).collect()
}

#[bench]
fn single(b: &mut test::Bencher) {
    let fix = fixture();
    let haystick = std::str::from_utf8(&fix).unwrap();
    let set: Vec<_> = PATTERNS.iter().map(|p| regex::Regex::new(p).unwrap()).collect();
    b.iter(move || {
        let m: Vec<_> = set.iter().map(|r| r.is_match(haystick)).collect();
        test::black_box(m);
    });
}

#[bench]
fn multiple(b: &mut test::Bencher) {
    let fix = fixture();
    let haystick = std::str::from_utf8(&fix).unwrap();
    let set = regex::RegexSet::new(PATTERNS).unwrap();
    b.iter(move || {
        //let m = set.is_match(haystick);
        let m: Vec<_> = set.matches(haystick).into_iter().collect();
        test::black_box(m);
    });
}

Running target/release/retest-ae36173f0b85f6dc

running 2 tests
test multiple ... bench:      22,619 ns/iter (+/- 384)
test single   ... bench:       3,057 ns/iter (+/- 114)

test result: ok. 0 passed; 0 failed; 0 ignor

That the RegexSet is so much slower than rescanning with each regex is very surprising.
