Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request]: match multiple regular expressions simultaneously obtaining capture groups #822

Closed
notaphplover opened this issue Dec 3, 2021 · 8 comments
Labels

Comments

@notaphplover
Copy link

notaphplover commented Dec 3, 2021

Describe your feature request

Given a vector of regular expressions, I'd like to match all of then in a simultaneously obtaining an array of matches with capture groups.

use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches.
let matches: Vec<_> = set.matches_and_captures("foobar").into_iter().collect();
assert_eq!(matches, vec![<Captures-0>, 2, 3, 4, 6]);

Where Captures-i represents an object with an index (or similar name) with value i and a captures (or similar name) value with an Option with captures for the regex at index i.

Of course, matches_and_captures is just an example of name for this feature at RegexSet.

Motivation

WASM and wasm-pack in addition to this feature would allow us to build a routing algorithm faster than find-my-way algorithm, the current Fastify routing algorithm, which is the fastest routing algorithm in the NodeJs ecosytem.

I'm open to contribute to this repo in order to achieve this feature and avoid unnecesary forks :)

@notaphplover notaphplover changed the title match multiple regular expressions simultaneously obtaining capture groups [Feature request]: match multiple regular expressions simultaneously obtaining capture groups Dec 3, 2021
@BurntSushi
Copy link
Member

This has been requested a few times in one form or another. See #259 and #352 for example. I think there are more, but a quick search doesn't turn them up.

Also, the docs for RegexSet specifically call out its limitations and suggest a work-around, but it isn't clear to me why that work-around is insufficient for you.

I do think this may be possible one day, once #656 is complete. In particular, regex-automata 0.2 will have first class support for multi-regex matching in all of its engines, and that includes extracting match information and capture groups (where capture groups are supported).

I'll leave this open for now, but this is not just a simple matter of adding new APIs. I don't think it makes sense to attempt this until regex-automata 0.2 lands.

@notaphplover
Copy link
Author

Hi @BurntSushi, after reading your comments I have some doubts regarding the internals of RegexSet. I was asuming there's a DFA (when possible) compiled and the internals somehow can trace back which Regex are matched. Are my assumpions right?

If thats the case, the feature requested should be better for performance reasons since processing the DFA and tracing back capture groups could be done in O(n) where n is proportional to the size of the text input. Processing multiple regex should be done on O(n*m) where n is proportional to the size of the text input and m is the amount of Regex to process. Of course, the problem I want to face requires Regex which can be compiled to DFA.

RegexSet is not the only way to solve this. I could compose a single Regex as the union of multiple ones and I could track back which capture groups are associated to a certain Regex but this approach would require checking all the capture groups in each search and I really want to avoid it.

It's been a while since the last time I implemented a Regex library so I my assumptions may be wrong. thank you so much for your time

I'll leave this open for now, but this is not just a simple matter of adding new APIs. I don't think it makes sense to attempt this until regex-automata 0.2 lands.

I have no rush :), I would be pleased to help if necessary

@BurntSushi
Copy link
Member

BurntSushi commented Dec 8, 2021

Nits aside (see below), I believe you are basically correct. And that's exactly how regex-automata 0.2 will make this possible. Retro-fitting this to the existing regex crate is too difficult IMO. The unfortunate bit is that regex-automata 0.2 is taking quite a long time, but I'm still making progress.

OK, now I'll address some nits. I don't think these are material to the overall point, but I mention them because I think it's important to be precise here. And in particular, I suspect this will have performance implications for your strategy.

I was asuming there's a DFA (when possible) compiled and the internals somehow can trace back which Regex are matched. Are my assumpions right?

So this is not quite correct... Currently, the regex crate doesn't use any DFAs anywhere. It uses NFAs, bounded backtracking and a hybrid NFA/DFA (also called a "lazy DFA"). The issue with DFAs is that they take up a lot of resources, and take exponential time in the worst case to build. The lazy DFA strikes a good balance there that mitigates those downsides. It is generally appropriate to think of a lazy DFA as just a DFA in terms of its search time complexity, although it does have some important differences in its execution model that would surprise you if you went into it thinking it was traditional DFA. (For example, a lazy DFA builds itself at search time, so there is no pre-populated transition table. Instead, that is written incrementally as-needed during search.)

But a lazy DFA is only used if possible. Otherwise, bounded backtracking or the Pike VM ("thompson NFA with capturing support") is used.

Once regex-automata 0.2 lands, this crate may wind up using real DFAs in some circumstances where it knows there isn't any exorbitant resource usage. But that's in the phase beyond integrating regex-automata 0.2 I think, and would be an implementation detail/optimization.

If thats the case, the feature requested should be better for performance reasons since processing the DFA and tracing back capture groups could be done in O(n) where n is proportional to the size of the text input. Processing multiple regex should be done on O(n*m) where n is proportional to the size of the text input and m is the amount of Regex to process. Of course, the problem I want to face requires Regex which can be compiled to DFA.

DFAs do not have the computational power to implement capturing groups. Technically, NFAs don't either, but it is typically very easy to bolt on support of capturing groups to an NFA simulation by virtue of the fact that NFA simulations tend to require tracking the match progress of multiple states. This is relevant to you because it means that if you want to resolve capturing groups, it would have to use one of the slower execution engines instead of the much faster DFA.

Note also that when using a DFA, even when using multiple regexes, search time is only proportional to the length of the haystack, so O(n) where n ~ len(haystack). It's only when you use the NFA execution engines that you get bumped up to O(n * m). (Also, a lazy DFA is O(n * m) as well, since in the worst/pathological case, it will essentially degrade to performing subset construction for every byte of input. But it's rare in practice and can be detected and mitigated by switching to another regex engine.)

RegexSet is not the only way to solve this. I could compose a single Regex as the union of multiple ones and I could track back which capture groups are associated to a certain Regex but this approach would require checking all the capture groups in each search and I really want to avoid it.

I see two possible paths here:

  1. Add capturing support to RegexSet and use that. This would necessarily force the use of a slower regex engine, but will give you your answer in one pass over the haystack.
  2. Build a DFA or a lazy DFA that tells you which regex matched, and then use a slower regex engine to resolve the capturing groups for just that matched regex. This requires two passes over the haystack. This is also doable with the current regex crate API.

It sounds to me like you have ruled out (2) as being slower than (1) (perhaps due to a belief that a DFA can resolve the capturing groups in one pass), but it is not at all clear to me that it is true. It may look like (2) has to be slower than (1), but the key difference is that (1) has to run an NFA with multiple regexes in it, which will tend to take longer since there are more states to manage and keep track of. In (2), you only need to run it for a single regex that you know matches. There should be less overhead associated with that.

Now, there are ways to solve this problem with a single pass over the haystack while retaining the speed you might expect from a DFA. But you have to used something more powerful than a DFA known as a "tagged" DFA. re2c implements this approach and its author has published several papers on the topic. This is well beyond the scope of this library or even regex-automata however. If you wanted to go this route, you'd have to essentially implement your own regex engine, but you at least don't have to write the parser. :-)

@notaphplover
Copy link
Author

notaphplover commented Dec 9, 2021

So this is not quite correct... Currently, the regex crate doesn't use any DFAs anywhere. It uses NFAs, bounded backtracking and a hybrid NFA/DFA (also called a "lazy DFA"). The issue with DFAs is that they take up a lot of resources, and take exponential time in the worst case to build. The lazy DFA strikes a good balance there that mitigates those downsides. It is generally appropriate to think of a lazy DFA as just a DFA in terms of its search time complexity, although it does have some important differences in its execution model that would surprise you if you went into it thinking it was traditional DFA. (For example, a lazy DFA builds itself at search time, so there is no pre-populated transition table. Instead, that is written incrementally as-needed during search.)

But a lazy DFA is only used if possible. Otherwise, bounded backtracking or the Pike VM ("thompson NFA with capturing support") is used.

All right then, thank you so much for the explanation :)

I see two possible paths here:

  1. Add capturing support to RegexSet and use that. This would necessarily force the use of a slower regex engine, but will give you your answer in one pass over the haystack.
  2. Build a DFA or a lazy DFA that tells you which regex matched, and then use a slower regex engine to resolve the capturing groups for just that matched regex. This requires two passes over the haystack. This is also doable with the current regex crate API.

It sounds to me like you have ruled out (2) as being slower than (1) (perhaps due to a belief that a DFA can resolve the capturing groups in one pass), but it is not at all clear to me that it is true. It may look like (2) has to be slower than (1), but the key difference is that (1) has to run an NFA with multiple regexes in it, which will tend to take longer since there are more states to manage and keep track of. In (2), you only need to run it for a single regex that you know matches. There should be less overhead associated with that.

Exactly, I was expecting capture groups to be resolved in one pass.

Now, there are ways to solve this problem with a single pass over the haystack while retaining the speed you might expect from a DFA. But you have to used something more powerful than a DFA known as a "tagged" DFA. re2c implements this approach and its author has published several papers on the topic. This is well beyond the scope of this library or even regex-automata however. If you wanted to go this route, you'd have to essentially implement your own regex engine, but you at least don't have to write the parser. :-)

I have started to read the paper, I'll probably try to implement it once I fully understand it.

My issue is solved now but I think it would be great to implement this feature in a future. Should I close the issue?

Thank you so much man, that state of the art's review was really helpful :)

@BurntSushi
Copy link
Member

Let's leave it open. It's useful to track this enhancement to RegexSet since it's a pretty common request. I'm like 95% sure that it's feasible to do. It will need some API design though, which we can work out here.

@CAD97
Copy link
Contributor

CAD97 commented Jan 29, 2022

I just want to drop an interesting note here. I'm implementing a lexer on top of just the regex crate (for comparison to logos, among other things), and got some interesting results. Because I'm manipulating the regex at proc-macro time, I could easily build a giant list of alternations to do "single pass" matching between the set of regex using capture groups.

Using a two-pass approach with RegexSet instead gave (depending on the exact benchmark load) between a 63% to 85% reduction in runtime (170% to 578% improvement in throughput).

The single pass approach was building a regex \A(?:(option-1)|(option-2)|(option-3)), and the two pass is just the normal RegexSet and then using a Regex to get the match position.

(If anyone wants to see my code for the benchmark, it's here ([RegexSet]), [capturing groups]).)

@BurntSushi
Copy link
Member

FWIW, this is now working (with full capture support) in my in-progress work on regex-automata:

$ regex-cli captures nfa thompson pikevm --matches -kall -Koverlapping 'foobar' '\w+' '\d+' '\pL+' 'foo' 'bar' 'barfoo' 'fo(ob)ar'
build pike vm time:  597.636µs
 create cache time:  67.594µs
       search time:  18.495µs
         counts(0):  "{0: 6}"
         counts(1):  "{0: 0}"
         counts(2):  "{0: 6}"
         counts(3):  "{0: 1}"
         counts(4):  "{0: 1}"
         counts(5):  "{0: 0}"
         counts(6):  "{0: 1, 1: 1}"

PatternID(0): {0: (0, 1)}
PatternID(2): {0: (0, 1)}
PatternID(0): {0: (0, 2)}
PatternID(2): {0: (0, 2)}
PatternID(0): {0: (0, 3)}
PatternID(2): {0: (0, 3)}
PatternID(3): {0: (0, 3)}
PatternID(0): {0: (0, 4)}
PatternID(2): {0: (0, 4)}
PatternID(0): {0: (0, 5)}
PatternID(2): {0: (0, 5)}
PatternID(0): {0: (0, 6)}
PatternID(2): {0: (0, 6)}
PatternID(6): {0: (0, 6), 1: (2, 4)}
PatternID(4): {0: (3, 6)}

@BurntSushi
Copy link
Member

OK, so I am unfortunately going to close this. I declared premature success in the previous comment. The problem here is that supproting overlapping matches---even without capturing groups---is incredibly tricky and full of foot guns. regex-automata 0.3 will have some support for overlapping matches, but only at the very lowest levels. The support is enough that you can actually implement your own RegexSet overlapping search, but it only gives you the overall match span:

/// Execute an overlapping search, and for each match found, also find its
/// overlapping starting positions.
///
/// N.B. This routine used to be part of the crate API, but 1) it wasn't clear
/// to me how useful it was and 2) it wasn't clear to me what its semantics
/// should be. In particular, a potentially surprising footgun of this routine
/// that it is worst case *quadratic* in the size of the haystack. Namely, it's
/// possible to report a match at every position, and for every such position,
/// scan all the way to the beginning of the haystack to find the starting
/// position. Typical leftmost non-overlapping searches don't suffer from this
/// because, well, matches can't overlap. So subsequent searches after a match
/// is found don't revisit previously scanned parts of the haystack.
///
/// Its semantics can be strange for other reasons too. For example, given
/// the regex '.*' and the haystack 'zz', the full set of overlapping matches
/// is: [0, 0], [1, 1], [0, 1], [2, 2], [1, 2], [0, 2]. The ordering of
/// those matches is quite strange, but makes sense when you think about the
/// implementation: an end offset is found left-to-right, and then one or more
/// starting offsets are found right-to-left.
///
/// Nevertheless, we provide this routine in our test suite because it's
/// useful to test the low level DFA overlapping search and our test suite
/// is written in a way that requires starting offsets.
fn try_search_overlapping<A: Automaton>(
    re: &Regex<A>,
    input: &Input<'_, '_>,
) -> Result<TestResult> {
    let mut matches = vec![];
    let mut fwd_state = OverlappingState::start();
    let (fwd_dfa, rev_dfa) = (re.forward(), re.reverse());
    while let Some(end) = {
        fwd_dfa.try_search_overlapping_fwd(input, &mut fwd_state)?;
        fwd_state.get_match()
    } {
        let revsearch = input
            .clone()
            .pattern(Some(end.pattern()))
            .earliest(false)
            .range(input.start()..end.offset());
        let mut rev_state = OverlappingState::start();
        while let Some(start) = {
            rev_dfa.try_search_overlapping_rev(&revsearch, &mut rev_state)?;
            rev_state.get_match()
        } {
            // let start = rev_dfa
            // .try_search_rev(rev_cache, &revsearch)?
            // .expect("reverse search must match if forward search does");
            let span = ret::Span { start: start.offset(), end: end.offset() };
            // Some tests check that we don't yield matches that split a
            // codepoint when UTF-8 mode is enabled, so skip those here.
            if input.get_utf8()
                && span.start == span.end
                && !input.is_char_boundary(span.end)
            {
                continue;
            }
            let mat = ret::Match { id: end.pattern().as_usize(), span };
            matches.push(mat);
        }
    }
    Ok(TestResult::matches(matches))
}

How to generalize this to capturing groups is not particularly clear to me.

I think it's also worth pointing out that the OP of this issue talks about wanting this for perf reasons, but resolving capturing groups almost always needs to take a slower code path.

So anyway, I'm going to close this particular request for now because I don't see it happening. However, while regex-automata 0.3 won't come with this out of the box, it will drastically lower the bar for anyone who wants to try and code this up on their own.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants