implement regex engine for handling regexes like ^(<match at most one codepoint>){m,n}$
#802
Comments
I'm not quite sure why you're expecting something different here. The error message is pretty clear: the compiled regex is too big. You are free to increase the limit though: https://docs.rs/regex/1.5.4/regex/struct.RegexBuilder.html#method.size_limit
But the state machine isn't reasonably small. Firstly,
Now in theory, the regex engine could detect this particular variant of a regex and implement something more optimized that doesn't use a state machine. But it's a narrow enough optimization that it would overall be pretty brittle. I don't know whether it's worth doing. Maybe this formulation is common enough that it's worth specializing this and other similarish patterns.
Finally, you could use
A regex is really just the wrong tool for this. It would be soooo much faster to just do
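A direct check along these lines might look like the following minimal sketch. It assumes the goal is the one from the original report (accept a string only if it has at most a given number of Unicode codepoints); the function name `within_codepoint_limit` is hypothetical and not part of any library.

```rust
/// Returns true if `s` contains at most `max` Unicode scalar values.
///
/// Hypothetical helper for illustration only. It looks at no more than
/// `max + 1` characters, so an oversized input is rejected without
/// walking the whole string.
fn within_codepoint_limit(s: &str, max: usize) -> bool {
    // `chars()` iterates over Unicode scalar values, not bytes.
    // Taking one more than `max` is enough to tell whether the limit
    // was exceeded; `saturating_add` avoids overflow for huge limits.
    s.chars().take(max.saturating_add(1)).count() <= max
}
```

Calling `within_codepoint_limit(input, 10_000)` would then play the role of the length-limiting regex, using constant memory and a single pass over at most 10001 characters.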
Thank you for your quick reply.
First of all, I definitely agree with you here. The issue came up because I only had regexes as built-in checks for strings in the framework I used. It worked well enough so far, but it will definitely have to change (this is how I will fix my immediate issue).
I still opened the issue because it looks to me like a situation that could be handled more efficiently. Reading your answer and thinking more about it, it's obvious that expanding
If you feel that the issue still has some merit, you may leave it open; otherwise it's OK for me to close it since it does not seem actionable to me.
I'll leave it open for now, because I do think "regex is the only available/convenient interface" is pretty common, and in that circumstance, it's not too uncommon to want to use that interface to put limits on the lengths of things. I did mark it as a "question" though, because I'm not totally convinced we should do something here. But if there is a simple characterization that covers simple use cases, e.g., of the form
Bounded repetitions are indeed a difficult area for regex engines to deal with. There are some techniques for handling them and they do indeed involve counters of some sort. Generally speaking, it's done by making the NFA simulation more powerful than a state machine. As it stands now, bounded repetitions are really just like macros that do the "obvious" replacement. e.g.,
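The macro-like view of bounded repetition can be illustrated with a small sketch. The function below is hypothetical and works on pattern text only. It is not how the regex crate is implemented internally, just one "obvious" rewriting of `atom{min,max}` into `min` mandatory copies followed by `max - min` optional copies.

```rust
/// Expands `atom{min,max}` into an equivalent pattern without a counted
/// repetition. Illustration only: `expand_bounded` is a made-up name and
/// the textual rewriting is an assumption about what "obvious" means.
fn expand_bounded(atom: &str, min: usize, max: usize) -> String {
    assert!(min <= max, "min must not exceed max");
    let mut out = String::new();
    // The first `min` copies are mandatory.
    for _ in 0..min {
        out.push_str(atom);
    }
    // Each remaining copy is individually optional.
    for _ in min..max {
        out.push_str(atom);
        out.push('?');
    }
    out
}
```

With this sketch, `expand_bounded("a", 2, 4)` yields `aaa?a?`, and expanding `(?s:.)` with bounds `0` and `10000` yields a pattern of 70000 characters. The expanded form grows linearly with the upper bound, and in Unicode mode each copy of `.` itself needs several states to decode UTF-8, which is consistent with the compiled program exceeding the default size limit.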
^(<match at most one codepoint>){m,n}$
What version of regex are you using?
1.5.4
Describe the bug at a high level.
I use regexes to validate string inputs. Usually the strings are fairly small and there are no issues. Today I wanted to accept any text as long as it is shorter than 10000 unicode codepoints. I expected the following regex to work
^(?s:.){0,10000}$
This triggered a CompiledTooBig(10485760) error instead.
What are the steps to reproduce the behavior?
Code:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=bf2062cc98fc1a2c2afe61c88ea0cc86
What is the actual behavior?
Output:
What is the expected behavior?
The regex should compile and be reasonably small. It looks like the memory requirements grow very large with the maximum string length checked by this regex.