Bad regex generation where capture groups won't be used #116827
Replies: 3 comments 1 reply
-
Why? |
Beta Was this translation helpful? Give feedback.
-
@stephentoub : Because some scavenged use cases use \1 in the regex, which means backtracking. A few more use $1 in the replace in places that would never require backtracking, but I'm pretty sure the regex runner isn't that smart. |
Beta Was this translation helpful? Give feedback.
-
I'm sitting here talking about it due to a lack of good ideas. The object is to prevent accident not malice; and this is supposed to be usable by the support team. I had started with a proposal for regex option: CaptureGroupsUnusedInMatch but it was dirty because .Replace would still use them and I didn't like the idea much. I determined that call site analysis would work; which means that the compiled form could also use it as the compiler has access to the call site; and then reduced the call site analysis to "if it's one of these two; you know". But even then I was uncertain if the idea was actually good. I forgot that non-capturing groups were a thing; this suggests the input transform may well be better: detect all unused capture groups and convert them to non-capture groups. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
On analysis of some code, we had something that reduced to this:
Regex.Replace("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!", "(a+){20}$", "<repl>")
where all three arguments were user-sourced strings.
As you can guess, this runs terribly. Analysis was frustratingly bad because this regex has a closed form DFA and so should run in linear time.
Use-case analysis said we can't actually set RegexOptions.NonBacktracking; however a real case of a catastrophic backtracking was so difficult to construct that it's not happening. However; bogus capture group backtracking is likely to actually happen, and suck the system performance away. Putting a timeout doesn't fix the problem it just makes it fail instead.
What it boils down to is Regex.Replace and Regex.EnumerateMatches don't return the capture groups and so the runtime can guarantee in most cases it doesn't have to build out all those capture groups and pick a fast DFA. (Yes it does have to look at replace contents to find the last capture group it looked at...)
We can fake it most of the time by parsing the regex and replace and checking for backtracking. I can write down cases for which this doesn't work; no measure of how likely such cases really are.
Beta Was this translation helpful? Give feedback.
All reactions