Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VRE_capture() interface wrt more matching groups than group count #3725

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nigoroll
Copy link
Member

@nigoroll nigoroll commented Oct 28, 2021

The new VRE_capture() interface does not provide a way to detect the case where the number of matching subexpressions is higher than the group count passed.

This patch basically restores the original pcre (pcre1 if you will) behaviour to return a number higher than the number of groups to signal this case.

Alternatively, if we wanted to have the pcre2-like semantics ...

The return from pcre2_match() is [...] zero if the vector of offsets is too small

... we would basically need to make the VRE_capture() interface identical to vre_capture() and allow to pass a count pointer, otherwise we had no way to return the number of matches.

ping @slimhazard

ref: https://gitlab.com/uplex/varnish/libvmod-re/-/merge_requests/1

@nigoroll nigoroll requested a review from Dridi October 28, 2021 17:23
@nigoroll nigoroll force-pushed the vre_capture_fix branch 2 times, most recently from 9a2fa48 to bc7b5cf Compare November 1, 2021 11:00
@nigoroll
Copy link
Member Author

nigoroll commented Nov 4, 2021

On a related question: Shouldn't callers of VRE_capture() be given a way to query how many capture groups are required?

@nigoroll
Copy link
Member Author

nigoroll commented Nov 4, 2021

Another related question: Or should VRE_compile() offer a way to specify "no more than X capture groups"?

@slimhazard
Copy link
Contributor

On a related question: Shouldn't callers of VRE_capture() be given a way to query how many capture groups are required?

pcre2 has the means to get that info from a compiled regex, so VRE could expose it.

@slimhazard
Copy link
Contributor

Another related question: Or should VRE_compile() offer a way to specify "no more than X capture groups"?

Not sure what you mean by this. If the caller doesn't want to capture more than N groups, well then don't.

One way to ensure that is to write the regex so that it has no more than N captures.

Do you have something in mind like "tell pcre2 to never capture than N groups, because I'll never need more"? pcre2 has a lot of bells and whistles, but I don't recall that one. There are so many ways to configure pcre2 that I could easily be wrong. But I think it's likely that Philip Hazel reasoned that if you don't want more than N captures, then you wouldn't have more in your regex in the first place.

@nigoroll
Copy link
Member Author

nigoroll commented Nov 5, 2021

If the caller doesn't want to capture more than N groups, well then don't.

The point is to fail the compilation if the pattern contains more than N groups, rather than failing later for a match.

@slimhazard
Copy link
Contributor

The point is to fail the compilation if the pattern contains more than N groups, rather than failing later for a match.

Got it. That's certainly possible and should be easy, same reason as for the other point -- pcre2 can tell you how many groups there are for a compiled regex.

@nigoroll nigoroll mentioned this pull request Nov 8, 2021
7 tasks
@Dridi
Copy link
Member

Dridi commented Nov 8, 2021

About that, VCL regsub[all]() and the underlying VRT_regsub() and VRE_sub() are limited to positional capture groups, so that limits us to 9 capture groups plus the special "all of them" group 0. In order not to break existing VCL, we didn't adopt pcre2_substitute() that has its own very similar syntax but different ($1 vs \1). It also has named capture groups which we don't take advantage of for the same reason.

Because of this hard limit, I'm tempted to say querying the number of groups should be a pcre2 operation not available in VRE, accessible via VRE_unpack().

@nigoroll nigoroll marked this pull request as draft November 16, 2021 12:43
@nigoroll nigoroll marked this pull request as ready for review November 16, 2021 12:59
@nigoroll
Copy link
Member Author

After the last comment from @Dridi I have now added a compile time check for a maximum of 9 capture groups: As explained by @Dridi , we have no facility in varnish-cache to use more capture groups, so we should not accept them at compile time. As a side effect, this could also avoid potential memory consumption issues when matching.

For the case where capture groups are not used for capturing, a hint is output with the error message:

too many capturing groups (maximum 9) - tip: use (?:...) for non capturing groups

ping @slimhazard

lib/libvarnish/vre.c Outdated Show resolved Hide resolved
@slimhazard
Copy link
Contributor

Has it been possible in the past to have more than 9 capturing groups in a regex, and yet not a problem as long as you don't go higher than 9 in any use of regsub()? That's a marginal case, of course, I'm just wondering if this could be someone's breaking change, if the number of groups has never been checked as an error condition before.

@nigoroll
Copy link
Member Author

@slimhazard yes, this is a breaking change, thus the tip in the error message

@nigoroll nigoroll force-pushed the vre_capture_fix branch 2 times, most recently from c60d4d7 to 21ec451 Compare November 17, 2021 12:02
@Dridi
Copy link
Member

Dridi commented Nov 17, 2021

I think it is too heavy-handed to limit the number of capturing groups at regex compile time, and it goes against the spirit of 6503249.

Not everyone knows that a group doesn't have to capture:

  • (capturing )?group
  • (?:simply optional)?group

My comment about the 10 capturing groups was regarding our regsub() implementation that uses its own syntax with substitution from \0 to \9. As far as we are concerned, we only support the 9 capturing groups plus the special 0th group. It shouldn't be forbidden to have more capture groups, it's simply that I don't see the lack of support for querying the number of capturing groups as a regression since to my knowledge the former VRE API didn't offer that, and contrary to the former VRE API we can access the pcre2_code directly.

In its current state, the new VRE API should be an improvement in that regard, not a regression.

Now regarding the original problem, the way you initially tried to solve it in 9a2fa48 is probably better if you:

  • aren't going to query the number of groups in the first place (outside of VRE)
  • want to pass an arbitrary number of txt (venturing in undefined behavior land abve 10 groups)
  • plan to react to a capture overflow after the facts

Now that I have a better understanding, I think the original patch is better. That makes VRE_match() a VRE_capture() with less arguments, which is still an improvement in my mind.

@nigoroll nigoroll marked this pull request as draft November 22, 2021 13:04
@nigoroll nigoroll force-pushed the vre_capture_fix branch 4 times, most recently from d0783b6 to f529fdc Compare February 1, 2022 09:29
@nigoroll nigoroll force-pushed the vre_capture_fix branch 3 times, most recently from d861038 to 03b0478 Compare April 27, 2022 11:34
in order to signal potential regsub() failures at compile time.
The new VRE_capture() interface did not provide a way to detect the
case where the number of matching subexpressions was higher than the
group count passed.

This patch basically restores the original pcre (pcre1 if you will)
behaviour to return a number higher than the number of groups to
signal this case.

Alternatively, if we wanted to have the pcre2-like semantics (pcre2
returns 0 for this case), we would basically need to make the
VRE_capture() interface identical to vre_capture() and allow to pass a
count pointer, otherwise we had no way to return the number of
matches.
PCRE2_ERROR_TOO_MANY_CAPTURES was added in April 2019, which, apparently,
is too new for some of our circ^Wbuild boxes.
@nigoroll
Copy link
Member Author

nigoroll commented Nov 8, 2022

Related:

Working on a recent addition to the re vmod, the question arose if the VRE API could be changed to allow this use case.

Unless I overlook something, in addition to this ticket, only a JIT option argument would be needed in VRE_compile().

@Dridi I think I might also have understood a reason for PCRE2 to use offsets for captures: They are a natural choice when relocating the subject (you can just keep them unaltered if you do).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants