implement regex engine for handling regexes like `^(<match at most one codepoint>){m,n}$` #802

demurgos · 2021-08-19T11:44:40Z

What version of regex are you using?

1.5.4

Describe the bug at a high level.

I use regexes to validate string inputs. Usually the strings are fairly small and there are no issues. Today I wanted to accept any text as long as it is shorter than 10000 unicode codepoints. I expected the following regex to work ^(?s:.){0,10000}$.

This triggered a CompiledTooBig(10485760) error instead.

What are the steps to reproduce the behavior?

Code:

fn main() {
    let re = regex::Regex::new(r"^(?s:.){0,10000}$").unwrap();
    dbg!(re);
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf2062cc98fc1a2c2afe61c88ea0cc86

What is the actual behavior?

Output:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 3.51s
     Running `target/debug/playground`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: CompiledTooBig(10485760)', src/main.rs:2:54
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

What is the expected behavior?

The regex should compile and be reasonably small. It looks like the memory requirements grow very large with the maximum string length checked by this regex.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2021-08-19T12:03:08Z

I'm not quite sure why you're expecting something different here. The error message is pretty clear: the compiled regex is too big. You are free to increase the limit though: https://docs.rs/regex/1.5.4/regex/struct.RegexBuilder.html#method.size_limit

and be reasonably small

But the state machine isn't reasonably small. Firstly, . matches any UTF-8 encoding of any Unicode scalar value. So that on its own bloats the state machine itself. Secondly, the {0,10000} is saying that any number of Unicode scalar values between 0 and 10000 will reach a match state. Even if you minimize the corresponding DFA (as a proxy for thinking concretely about its size, a full DFA isn't actually compiled by this crate), then you're still looking at 80,000+ states. So there is no theoretically better approach using general state machines.

Now in theory, the regex engine could detect this particular variant of a regex and implement something more optimized that doesn't use a state machine. But it's a narrow enough optimization that it would overall be pretty brittle. I don't know whether it's worth doing. Maybe this formulation is common enough that it's worth specializing this and other similarish patterns.

Finally, you could use (?s-u:.) to match any arbitrary byte. That should decrease the size required quite a bit (possibly an order of magnitude).

Today I wanted to accept any text as long as it is shorter than 10000 unicode codepoints.

A regex is really just the wrong tool for this. It would be soooo much faster to just do s.chars().count() <= 10000. The only reason anyone should use a regex for something like this is if you have no other choice. e.g., The regex is the only interface you have.

demurgos · 2021-08-19T12:24:05Z

Thank you for your quick reply.

A regex is really just the wrong tool for this. It would be soooo much faster to just do s.chars().count() <= 10000. The only reason anyone should use a regex for something like this is if you have no other choice. e.g., The regex is the only interface you have.

First of all, I definitely agree with you here. The issue came up because I only had regexes as built-in checks for strings in the framework I used. It worked well enough so far, but it will definitely have to change (this is how I will fix my immediate issue).

I still opened the issue because it looks to me like a situation that could be handled more efficiently.

Reading your answer and thinking more about it, it's obvious that expanding {n,m} quantifiers will quickly generate a lot of states. I naively thought it could be collapsed and use a counter... but yes: it is no longer a graph-like state-machine and it would only work for specific cases. Figuring out such optimizations in a general way seems to be like a huge amount of work.

If you feel that the issue still has some merit, you may leave it open; otherwise it's OK for me to close it since it does not seem actionable to me.

BurntSushi · 2021-08-19T12:49:47Z

I'll leave it open for now, because I do think "regex is the only available/convenient interface" is pretty common, and in that circumstance, it's not too uncommon to want to use that interface to put limits on the lengths of things. I did mark it as a "question" though, because I'm not totally convinced we should do something here. But if there is a simple characterization that covers simple use cases, e.g., of the form ^(<match at most one codepoint>){m,n}$, then I think we could implement that specific case more efficiently without too much trouble. (Famous last words.)

Bounded repetitions are indeed a difficult area for regex engines to deal with. There are some techniques for handling them and they do indeed involve counters of some sort. Generally speaking, it's done by making the NFA simulation more powerful than a state machine. As it stands now, bounded repetitions are really just like macros that do the "obvious" replacement. e.g., x{2,5} is identical to xxx?x?x?.

demurgos changed the title ~~CompiledTooBig on regex only checking for string length~~ CompiledTooBig error on regex only checking for string length Aug 19, 2021

BurntSushi added enhancement question labels Aug 19, 2021

BurntSushi changed the title ~~CompiledTooBig error on regex only checking for string length~~ implement regex engine for handling regexes like ^(<match at most one codepoint>){m,n}$ Mar 6, 2023

BurntSushi added the help wanted label Mar 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement regex engine for handling regexes like `^(<match at most one codepoint>){m,n}$` #802

implement regex engine for handling regexes like `^(<match at most one codepoint>){m,n}$` #802

demurgos commented Aug 19, 2021

BurntSushi commented Aug 19, 2021

demurgos commented Aug 19, 2021 •

edited

Loading

BurntSushi commented Aug 19, 2021

implement regex engine for handling regexes like ^(<match at most one codepoint>){m,n}$ #802

implement regex engine for handling regexes like ^(<match at most one codepoint>){m,n}$ #802

Comments

demurgos commented Aug 19, 2021

What version of regex are you using?

Describe the bug at a high level.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

BurntSushi commented Aug 19, 2021

demurgos commented Aug 19, 2021 • edited Loading

BurntSushi commented Aug 19, 2021

implement regex engine for handling regexes like `^(<match at most one codepoint>){m,n}$` #802

implement regex engine for handling regexes like `^(<match at most one codepoint>){m,n}$` #802

demurgos commented Aug 19, 2021 •

edited

Loading