regex is less efficient than it could be #14029

comex · 2014-05-08T05:23:34Z

Consider this code:

#![feature(phase)]
extern crate regex;
#[phase(syntax)]
extern crate regex_macros;

pub fn is_all_a(s: &str) -> bool {
    return regex!("^a+$").is_match(s);
}

Ideally this would optimize away to a small function that just iterates over the string and checks for characters other than 'a'.

Instead, it:

calls malloc several times to start out;
goes through an indirect call unless LTO is enabled - might not usually be a big deal, but I would like to eventually be able to efficiently match a regex on a single character in lieu of writing out all the possibilities manually
to the 'exec' function, which itself, even with LTO (and -O) enabled, makes many non-inlined calls, including to malloc, char_range_at, char_range_at_reverse, etc.

Without LTO, it generates about 7kb of code for one regex, or 34kb if I put 8 regexes in that function. Not the end of the world, but it adds up.

I recognize the regex implementation is new, but I thought this was worth filing anyway as room for improvement.

rustc 0.11-pre-nightly (2dcbad5 2014-05-06 22:01:43 -0700)

The text was updated successfully, but these errors were encountered:

BurntSushi · 2014-06-04T05:15:18Z

Here's the expanded code I'm seeing with rustc --pretty expanded scratch.rs. Output of rustc --version:

rustc 0.11.0-pre (2ec3eea 2014-06-03 14:06:42 -0700)
host: x86_64-unknown-linux-gnu

BurntSushi · 2014-06-04T05:24:31Z

I'm having trouble identifying where you're seeing heap allocation. I only see it for the return value.

How much juice you want to squeeze really depends on how much analysis you want to do on the regex and how many implementations of a matching algorithm you want to write. For example, the current algorithm accounts for simple matching, the position of a match and the location of submatches. To get a tighter function, you might want to split these three cases out into their own separate matching algorithms.

This is also complicated by the fact that we're trying to guarantee O(mn) worst case time, which is why the generated code is so big (it's an entire Pike VM). To get a tight loop like you want, I suspect you'll need to do analysis on the regex to identify special cases. RE2 uses a "one pass NFA simulation" for certain types of regexes. Here's what Russ Cox has to say about it:

Let's define a “one-pass regular expression” to be a regular expression with the property that at each input byte during an anchored match, there is only one alternative that makes sense for a given input byte. For example, x_yx_ is one-pass: you read x's until a y, then you read the y, then you keep reading x's. At no point do you have to guess what to do or back up and try a different guess. On the other hand, x_x is not one-pass: when you're looking at an input x, it's not clear whether you should use it to extend the x_ or as the final x. More examples: ([^x])x(.) is one-pass; (.)x(.) is not. (\d+)-(\d+) is one-pass; (\d+).(\d+) is not.

A simple intuition for identifying one-pass regular expressions is that it's always immediately obvious when a repetition ends. It must also be immediately obvious which branch of an | to take: x(y|z) is one-pass, but (xy|xz) is not.

Because there's only one possible next choice, the one-pass NFA implementation never needs to make a copy of the submatch boundary set. The one-pass engine executes in two halves. During compilation, the one-pass code analyzes the compiled form of the program to determine whether it is one-pass. If so, the engine computes a data structure recording what to do at each possible state and input byte. Then at execution the one-pass engine can fly through the string, finding the match (or not) in, well, one pass.

The other direction we could go in is building a DFA.

frewsxcv · 2015-01-25T20:27:01Z

In-tree regex was removed in #21458 . This should probably be moved to https://github.com/rust-lang/regex

rust-highfive · 2015-01-25T20:37:53Z

This issue has been moved to the regex repo: rust-lang/regex#26

huonw added the A-libs label May 8, 2014

rust-highfive mentioned this issue Jan 25, 2015

implement better code generation for the regex plugin rust-lang/regex#26

Closed

rust-highfive closed this as completed Jan 25, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

regex is less efficient than it could be #14029

regex is less efficient than it could be #14029

comex commented May 8, 2014

BurntSushi commented Jun 4, 2014

BurntSushi commented Jun 4, 2014

frewsxcv commented Jan 25, 2015

rust-highfive commented Jan 25, 2015

regex is less efficient than it could be #14029

regex is less efficient than it could be #14029

Comments

comex commented May 8, 2014

BurntSushi commented Jun 4, 2014

BurntSushi commented Jun 4, 2014

frewsxcv commented Jan 25, 2015

rust-highfive commented Jan 25, 2015