loosen ASCII compatible rules + improve reverse suffix optimization #1105

BurntSushi · 2023-10-14T13:29:14Z

Basically, patterns like (?-u:☃) are now allowed. Previously they were banned since -u disables Unicode mode. But since it's just a literal and patterns must be valid UTF-8, there is a simple and unambiguous interpretation: the UTF-8 encoding of the codepoint. Note though that Unicode character classes, including even (?-u:[☃]), are still banned. I think this restriction could probably be lifted, but it's not quite as obvious since disabling Unicode mode is supposed to switch the atom of matching from the codepoint to the byte, and something like [☃] seems to require that the atom of matching is the codepoint.

This PR also contains a tweak to the reverse suffix optimization to make it a bit more broadly applicable. This actually brings it in line with the reverse inner optimization. Basically, instead of limiting its use to cases where there is a single non-empty common suffix, we expand its use to whenever the prefilter built from the suffixes of the pattern is believed to be "fast."

In some ad hoc profiling, I noticed an extra function call that really didn't need to be there.

Previously, patterns like `(?-u:☃)` were banned under the logic that Unicode scalar values shouldn't be available unless Unicode mode is enabled. But since patterns are required to be UTF-8, there really isn't any difficulty in just interpreting Unicode literals as their corresponding UTF-8 encoding. Note though that Unicode character classes, even things like `(?-u:[☃])`, remain banned. We probably could make character classes work too, but it's unclear how that plays with ASCII compatible mode requiring that a single byte is the fundamental atom of matching (whereas Unicode mode requires that Unicode scalar values are the fundamental atom of matching).
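To make the "UTF-8 encoding of the codepoint" interpretation concrete, here is a minimal sketch using only the standard library. It is not the regex crate's implementation; `literal_as_utf8` and `find_bytes` are hypothetical helper names made up for illustration. It only shows that a non-ASCII literal has an unambiguous byte-level meaning, so it can be matched even when the atom of matching is the byte.

```rust
/// Hypothetical helper (not part of the regex crate): the UTF-8 encoding
/// of a single Unicode scalar value, i.e. the byte sequence that a literal
/// like `☃` would stand for when Unicode mode is disabled.
fn literal_as_utf8(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: a naive byte-oriented substring search that
/// returns the offset of the first match. The haystack need not be
/// valid UTF-8, which mirrors matching on bytes rather than codepoints.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, and an empty needle matches at 0.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

For example, `literal_as_utf8('☃')` yields the three bytes `E2 98 83`, and `find_bytes` can locate them in a haystack that contains invalid UTF-8 elsewhere. A class like `[☃]` has no equally obvious reading in this byte-oriented model, which is the concern described above.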

Previously, we only used the reverse suffix optimization if it found a non-empty longest common suffix *and* if the prefilter considered itself fast. The longest common suffix requirement was a heuristic used in the old regex crate before we grew the "is prefilter fast" heuristic. We change this optimization to just use the "is prefilter fast" heuristic instead of requiring a non-empty longest common suffix. This is, after all, what the inner literal optimization does. And in the inner literal case, one should probably be even more conservative because of the extra work that needs to be done. So if things are going okay with the inner literal optimization, then we should be fine with the reverse suffix optimization doing essentially the same thing.
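The change in gating condition can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code in `regex-automata`: the function names are hypothetical, and `prefilter_is_fast` stands in for whatever the real "is prefilter fast" heuristic reports.

```rust
/// Hypothetical helper: the longest common suffix of a set of suffix
/// literals. Returns an empty slice if there are no literals or if they
/// share no common suffix.
fn longest_common_suffix<'a>(literals: &[&'a [u8]]) -> &'a [u8] {
    let (first, rest): (&'a [u8], &[&'a [u8]]) = match literals.split_first() {
        Some((first, rest)) => (*first, rest),
        None => return &[],
    };
    let mut len = first.len();
    for lit in rest {
        // Count matching bytes while walking both literals from the end.
        let common = first
            .iter()
            .rev()
            .zip(lit.iter().rev())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(common);
    }
    &first[first.len() - len..]
}

/// Old gate (as described above): require a non-empty longest common
/// suffix *and* a prefilter that is believed to be fast.
fn old_use_reverse_suffix(lcs: &[u8], prefilter_is_fast: bool) -> bool {
    !lcs.is_empty() && prefilter_is_fast
}

/// New gate: rely only on the "is prefilter fast" heuristic, as the
/// reverse inner optimization does.
fn new_use_reverse_suffix(prefilter_is_fast: bool) -> bool {
    prefilter_is_fast
}
```

With suffix literals like `foobar` and `quuxbar`, both gates agree whenever the prefilter is fast, since the common suffix `bar` is non-empty. With suffixes like `abc` and `xyz`, there is no common suffix, so only the new gate permits the optimization (still assuming the prefilter built from those suffixes is judged fast).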

BurntSushi added 3 commits October 13, 2023 09:51

automata/meta: force some prefilter inlining

b6aabdc

In some ad hoc profiling, I noticed an extra function call that really didn't need to be there.

BurntSushi merged commit 8a8d599 into master Oct 14, 2023
