Optimisation opportunity: literals in the middle of regex #510

RReverser · 2018-09-06T13:23:07Z

This has been discussed in private before, but I decided to turn it into an actual issue so that it doesn't get lost over time.

Right now, regexes like literal.* and .*literal get optimised by extracting a literal prefix (followed by a forward DFA) or a literal suffix (followed by a reverse DFA), respectively. Regexes like .*literal.*, however, are not: they are executed as a full DFA, even though they could be treated as the general case of the literal optimisation, where the literal is detected and found, and then the reverse and forward DFAs are run from its boundaries in opposite directions.
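A minimal sketch of that idea, using only the standard library rather than the regex crate's internals. The function name `find_around_literal` is hypothetical, and the backward and forward scans for `\n` are stand-ins for the reverse and forward DFAs of `.*` (which, without the s flag, stops at newlines):

```rust
/// Hypothetical helper emulating a search for `.*<needle>.*`.
/// Returns the byte span of the match, or `None` if the literal is absent.
fn find_around_literal(haystack: &str, needle: &str) -> Option<(usize, usize)> {
    // Step 1: find the inner literal with a plain substring search.
    let at = haystack.find(needle)?;
    // Step 2: scan backwards from the literal's start (stand-in for the
    // reverse DFA of `.*`), stopping just after the previous newline.
    let start = haystack[..at].rfind('\n').map_or(0, |i| i + 1);
    // Step 3: scan forwards from the literal's end (stand-in for the
    // forward DFA of `.*`), stopping at the next newline.
    let after = at + needle.len();
    let end = haystack[after..]
        .find('\n')
        .map_or(haystack.len(), |i| after + i);
    Some((start, end))
}
```

For this particular pattern every occurrence of the literal is a match, so the first occurrence is enough and no candidate is ever rejected.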

Example benchmark from original discussion: https://gist.github.com/RReverser/a48c9a7332df9ee7bc7abe5f1f708bb7

And numbers showing the current difference:

test with_dots            ... bench:         109 ns/iter (+/- 8)
test with_dots_and_s_mode ... bench:         109 ns/iter (+/- 9)
test without_back_dots    ... bench:         249 ns/iter (+/- 5)
test without_dots         ... bench:          38 ns/iter (+/- 0)
test without_front_dots   ... bench:          56 ns/iter (+/- 1)

BurntSushi · 2018-09-06T13:43:37Z

Thanks for filing this! It does look like an optimization is possible in this case. We need to be extremely careful in how we deal with this, though, to make sure we avoid worst-case quadratic time complexity. I don't think this specific example is susceptible to that, since every occurrence of literal is itself guaranteed to be a match.
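To illustrate where the quadratic behaviour could come from, here is a rough sketch that only counts work; it is not how the crate is implemented. It assumes a hypothetical pattern shaped like `[a-z@]*@[0-9]`, where `@` is the inner literal, the prefix class also contains `@`, and the forward half is assumed to fail at every candidate. The function name and the `bounded` flag are made up for the example:

```rust
/// Counts the bytes visited by reverse scans started at every `@` candidate.
/// With `bounded` set, a scan never goes back past the previous candidate.
fn reverse_scan_steps(haystack: &[u8], bounded: bool) -> usize {
    // Bytes accepted by the (hypothetical) prefix `[a-z@]*`.
    let in_prefix_class = |b: u8| b.is_ascii_lowercase() || b == b'@';
    let mut steps = 0;
    let mut floor = 0; // lowest offset a reverse scan may reach
    for (i, &b) in haystack.iter().enumerate() {
        if b != b'@' {
            continue;
        }
        // Reverse scan from the candidate towards `floor`.
        let mut j = i;
        while j > floor && in_prefix_class(haystack[j - 1]) {
            j -= 1;
            steps += 1;
        }
        if bounded {
            // Text before this candidate has already been scanned once.
            floor = i + 1;
        }
    }
    steps
}
```

Without the bound, each candidate rescans everything before it, so the total grows quadratically with the number of candidates. With the bound, the total stays linear. A real engine could not simply truncate the scan like this: on hitting the bound it would have to fall back to another strategy to keep match semantics correct.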

Note that the current infrastructure in this crate is nowhere near capable of implementing something like this in a maintainable way. I'll keep cases like this in mind while doing a crate overhaul in the coming months (or possibly years).

RReverser · 2018-09-06T14:30:23Z

Sure, I looked into the code to see how easy it would be to implement this and came to the same conclusion, that it would currently require significant changes, but I decided to file it nevertheless.

Btw, a few more things I've noticed, though I'm not sure whether they're worth dedicated issues or should just be left as part of this one:

- Even using lazy quantifiers, as in .*?literal.*?, doesn't seem to produce any prefixes or suffixes. I guess the missing prefixes are understandable, since .*? apparently still has to return all possible non-newline chars before the literal, but I suppose suffixes could still be generated as an optimisation opportunity (since the .*? at the end can be ignored altogether).
- is_match currently just calls into shortest_match, which in turn, for e.g. the pattern literal.* (already optimisable) and the string abcliteraldef, still tries to find the entire match by executing the DFA after the prefix. It looks like another optimisation opportunity: just ignore any nullable sub-patterns at the ends of the pattern when we don't care about the actual match location.
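The second point can be sketched with the standard library alone. Both function names below are hypothetical and do not correspond to anything in the regex crate. The first emulates a full search for a `literal.*`-shaped pattern, the second answers only whether a match exists:

```rust
/// Full search for a `<needle>.*`-shaped pattern: find the literal prefix,
/// then extend to the end of the line (stand-in for the forward DFA).
fn find_literal_then_dots(haystack: &str, needle: &str) -> Option<(usize, usize)> {
    let start = haystack.find(needle)?;
    let after = start + needle.len();
    let end = haystack[after..]
        .find('\n')
        .map_or(haystack.len(), |i| after + i);
    Some((start, end))
}

/// Existence check only: the trailing `.*` is nullable, so it cannot turn a
/// literal hit into a non-match, and the forward scan can be skipped.
fn is_match_literal_then_dots(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}
```

The two agree on whether a match exists, but the existence check never touches the text after the literal.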

I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy.

Bug fixes:

- [BUG #1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions:

- A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex.
- A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`.
- `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
- The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library.
- Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate.

New features:

- [FEATURE #244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`.
- [FEATURE #259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
- [FEATURE #476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`.
- [FEATURE #644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
- [FEATURE #675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`.
- [FEATURE #824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access.
- [FEATURE #961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times.
- [FEATURE #1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

- [PERF #68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching.
- [PERF #510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`.
- [PERF #787](rust-lang/regex#787), [PERF #891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have specific issues devoted to them.)

Bug fixes:

- [BUG #429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines.
- [BUG #517](rust-lang/regex#517): Fix matching bug with capture groups.
- [BUG #579](rust-lang/regex#579): Fix matching bug with word boundaries.
- [BUG #779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
- [BUG #850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines.
- [BUG #921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`.
- [BUG #976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility.
- [BUG #1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>

BurntSushi added the enhancement label Sep 6, 2018

BurntSushi added the fix-incoming label Feb 23, 2023

BurntSushi closed this as completed in aa64e6d Jul 5, 2023
