Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add multi-regex matching #156

Closed
BurntSushi opened this issue Jan 31, 2016 · 1 comment
Closed

add multi-regex matching #156

BurntSushi opened this issue Jan 31, 2016 · 1 comment
Assignees

Comments

@BurntSushi
Copy link
Member

Lately I've been thinking a lot about providing a "multi regex" similar to RE2's "regex set" functionality. The problem they solve is, "I have multiple regexes that I want to run over some large search text once, and I want to see every match." The poor man's way of doing this is to combine them in a single regex of alternations, e.g., re1|re2|re3|.... Two problems with that though:

  1. The current search machinery reports non-overlapping matches. That is, it's impossible for one alternation in a regex to share a match with another alternation in the same regex.
  2. To check which expressions matched, one adds capture groups and then inspect them after a match. Requiring capture groups for this functionality is bad because it incurs a performance penalty and simply isn't needed.

We can start relatively simple by providing an API that answers these three questions:

  1. Do any of the given regexes match anywhere? (analogous to is_match)
  2. If so, which of those regexes match? Where do they match? (analogous to find)
  3. Can you show me all matches? (analogous to find_iter)

Adding capture groups to this API seems possible, but is tricky, so I suggest doing that after an initial implementation is done.

@BurntSushi BurntSushi added this to the 1.0 milestone Jan 31, 2016
@BurntSushi BurntSushi self-assigned this Feb 15, 2016
@BurntSushi BurntSushi removed this from the 1.0 milestone Feb 20, 2016
@BurntSushi
Copy link
Member Author

I managed to get something half-working in the multi branch. It should be able to correctly detect which regexes in a particular set match in some search text using a single pass, but if you ask for anything more than that (like where they matched, subsequent matches or sub-captures), then the results are not quite correct. For example, consider matching the regexes .* and a. What should the behavior be? In my opinion, it should report multiple matches of a (if they exist) before possibly reporting one match of .*. I contend that any other behavior is counter intuitive. The problem is that the correct behavior implies a significant change to the way the matching engines are currently written. It would require being able to use the matching engines in an iterator like fashion such that its internal state was carried over on subsequent searches.

Even though I wound up with something less powerful than what I wanted, I do think being able to tell which regexes match some text in a single pass is useful. At least one important use case I can think of is a URL router, which could contain possibly many regexes but usually only has one URL to search. It would be very hard to build something like that out-of-crate (and is also fast), which to me suggests it has a place in this library. If a caller needs more detailed info about the match (like captures), then they can re-run only the matched regexes on the search text one at a time.

BurntSushi added a commit that referenced this issue Feb 22, 2016
Regex sets permit matching multiple (possibly overlapping) regular
expressions in a single scan of the search text. This adds a few new
types, with `RegexSet` being the primary one.

All matching engines support regex sets, including the lazy DFA.

This commit also refactors a lot of the code around handling captures
into a central `Search`, which now also includes a set of matches that
is used by regex sets to determine which regex has matched.

We also merged the `Program` and `Insts` type, which were split up when
adding the lazy DFA, but the code seemed more complicated because of it.

Closes #156.
BurntSushi added a commit that referenced this issue Feb 22, 2016
Regex sets permit matching multiple (possibly overlapping) regular
expressions in a single scan of the search text. This adds a few new
types, with `RegexSet` being the primary one.

All matching engines support regex sets, including the lazy DFA.

This commit also refactors a lot of the code around handling captures
into a central `Search`, which now also includes a set of matches that
is used by regex sets to determine which regex has matched.

We also merged the `Program` and `Insts` type, which were split up when
adding the lazy DFA, but the code seemed more complicated because of it.

Closes #156.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant