Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex is less efficient than it could be #14029

Closed
comex opened this issue May 8, 2014 · 4 comments
Closed

regex is less efficient than it could be #14029

comex opened this issue May 8, 2014 · 4 comments

Comments

@comex
Copy link
Contributor

comex commented May 8, 2014

Consider this code:

#![feature(phase)]
extern crate regex;
#[phase(syntax)]
extern crate regex_macros;

pub fn is_all_a(s: &str) -> bool {
    return regex!("^a+$").is_match(s);
}

Ideally this would optimize away to a small function that just iterates over the string and checks for characters other than 'a'.

Instead, it:

  • calls malloc several times to start out;
  • goes through an indirect call unless LTO is enabled - might not usually be a big deal, but I would like to eventually be able to efficiently match a regex on a single character in lieu of writing out all the possibilities manually
  • to the 'exec' function, which itself, even with LTO (and -O) enabled, makes many non-inlined calls, including to malloc, char_range_at, char_range_at_reverse, etc.

Without LTO, it generates about 7kb of code for one regex, or 34kb if I put 8 regexes in that function. Not the end of the world, but it adds up.

I recognize the regex implementation is new, but I thought this was worth filing anyway as room for improvement.

rustc 0.11-pre-nightly (2dcbad5 2014-05-06 22:01:43 -0700)

@huonw huonw added the A-libs label May 8, 2014
@BurntSushi
Copy link
Member

Here's the expanded code I'm seeing with rustc --pretty expanded scratch.rs. Output of rustc --version:

rustc 0.11.0-pre (2ec3eea 2014-06-03 14:06:42 -0700)
host: x86_64-unknown-linux-gnu

@BurntSushi
Copy link
Member

I'm having trouble identifying where you're seeing heap allocation. I only see it for the return value.

How much juice you want to squeeze really depends on how much analysis you want to do on the regex and how many implementations of a matching algorithm you want to write. For example, the current algorithm accounts for simple matching, the position of a match and the location of submatches. To get a tighter function, you might want to split these three cases out into their own separate matching algorithms.

This is also complicated by the fact that we're trying to guarantee O(mn) worst case time, which is why the generated code is so big (it's an entire Pike VM). To get a tight loop like you want, I suspect you'll need to do analysis on the regex to identify special cases. RE2 uses a "one pass NFA simulation" for certain types of regexes. Here's what Russ Cox has to say about it:

Let's define a “one-pass regular expression” to be a regular expression with the property that at each input byte during an anchored match, there is only one alternative that makes sense for a given input byte. For example, x_yx_ is one-pass: you read x's until a y, then you read the y, then you keep reading x's. At no point do you have to guess what to do or back up and try a different guess. On the other hand, x_x is not one-pass: when you're looking at an input x, it's not clear whether you should use it to extend the x_ or as the final x. More examples: ([^x])x(.) is one-pass; (.)x(.) is not. (\d+)-(\d+) is one-pass; (\d+).(\d+) is not.

A simple intuition for identifying one-pass regular expressions is that it's always immediately obvious when a repetition ends. It must also be immediately obvious which branch of an | to take: x(y|z) is one-pass, but (xy|xz) is not.

Because there's only one possible next choice, the one-pass NFA implementation never needs to make a copy of the submatch boundary set. The one-pass engine executes in two halves. During compilation, the one-pass code analyzes the compiled form of the program to determine whether it is one-pass. If so, the engine computes a data structure recording what to do at each possible state and input byte. Then at execution the one-pass engine can fly through the string, finding the match (or not) in, well, one pass.

The other direction we could go in is building a DFA.

@frewsxcv
Copy link
Member

In-tree regex was removed in #21458 . This should probably be moved to https://github.com/rust-lang/regex

@rust-highfive
Copy link
Collaborator

This issue has been moved to the regex repo: rust-lang/regex#26

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants