regular expression addition - lookbehind assertions and lookahead assertions #998

Open
liamquin opened this issue Feb 4, 2024 · 2 comments
Labels
Enhancement A change or improvement to an existing feature PRG-hard Categorized as "hard" at the Prague f2f, 2024 PRG-optional Categorized as "optional for 4.0" at the Prague f2f, 2024 XQFO An issue related to Functions and Operators

Comments

@liamquin

liamquin commented Feb 4, 2024

Look-ahead assertions are, I think, the most useful feature not found in QT regular expressions, and look-behind assertions as well.

This lets you do things like

  replace( ., '
     / ( [^/]+ ) (*positive_lookahead: /)
    ', '...', 'x')

replacing components between /..../ but not consuming the trailing /, so that
/a/b/c/d/ comes out as /../../../../
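The effect of not consuming the trailing / can be sketched in Python, whose `re` module already supports the classic `(?=...)` form. This is only an illustration, not the proposed XPath behaviour; the replacement string `/..` is an assumption chosen to reproduce the output quoted above.

```python
import re

path = "/a/b/c/d/"

# With a lookahead, the trailing "/" must be present but is not consumed,
# so it is still available as the leading "/" of the next match.
with_lookahead = re.sub(r"/([^/]+)(?=/)", "/..", path)

# Without a lookahead, the trailing "/" is consumed by each match,
# so every second component is skipped.
without_lookahead = re.sub(r"/([^/]+)/", "/../", path)

print(with_lookahead)     # /../../../../
print(without_lookahead)  # /../b/../d/
```

Without the assertion, a second substitution pass would be needed to catch the skipped components.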

Perl uses
(?=pattern), (*pla:pattern), (*positive_lookahead:pattern)
(?!pattern), (*nla:pattern), (*negative_lookahead:pattern)
to match only if the current position is (or is not) followed by a match for pattern,

and
(?<=pattern), \K, (*plb:pattern), (*positive_lookbehind:pattern)
(?<!pattern), (*nlb:pattern), (*negative_lookbehind:pattern)
for zero-width look-behind assertions.
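As a rough illustration of the four classic forms, here is a small Python sketch (Python's `re` supports these four, but not \K or the `(*...)` spellings). The sample text and variable names are made up for the example.

```python
import re

text = "price: $42, cost: 17, $8"

# (?=...)  positive lookahead: digits that are followed by a comma
followed_by_comma = re.findall(r"\d+(?=,)", text)

# (?!...)  negative lookahead: digits not followed by a comma or another digit
not_followed_by_comma = re.findall(r"\d+(?![\d,])", text)

# (?<=...) positive lookbehind: digits preceded by a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", text)

# (?<!...) negative lookbehind: digits not preceded by a dollar sign or a digit
not_after_dollar = re.findall(r"(?<![$\d])\d+", text)

print(followed_by_comma)      # ['42', '17']
print(not_followed_by_comma)  # ['8']
print(after_dollar)           # ['42', '8']
print(not_after_dollar)       # ['17']
```

In every case only the digits are returned; the comma or dollar sign is tested but is not part of the match.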

Note that libpcre (and older Perl versions) restrict lookbehind assertions to fixed length. You can write
(?<=dog|cat) food
to match " food" preceded by "dog" or "cat", but you cannot write
(?<=dogs?|cats?) barking

\C is also forbidden, as are capturing subgroups. But the facility is still very useful, and reduces the need for repeated substitutions.
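The fixed-length restriction can be seen in Python's `re` module, which enforces a similar rule (the details differ from libpcre, so treat this as a sketch of the idea rather than of PCRE's exact behaviour). The `workaround` pattern is one possible rewrite, not something proposed in this issue.

```python
import re

# Fixed-length alternatives are accepted in a lookbehind.
fixed = re.compile(r"(?<=dog|cat) food")
fixed_matches = fixed.findall("dog food, cat food, bird food")
print(fixed_matches)  # [' food', ' food']

# A variable-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=dogs?|cats?) barking")
    variable_accepted = True
except re.error:
    variable_accepted = False
print(variable_accepted)  # False

# One possible workaround: alternate several fixed-length lookbehinds.
workaround = re.compile(r"(?:(?<=dog)|(?<=dogs)|(?<=cat)|(?<=cats)) barking")
print(bool(workaround.search("the dogs barking")))   # True
print(bool(workaround.search("the birds barking")))  # False
```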

I propose adding only the first form in each case, not the newer "*" forms, which are less widely supported.

@ChristianGruen ChristianGruen added XQFO An issue related to Functions and Operators Enhancement A change or improvement to an existing feature labels Feb 4, 2024
@michaelhkay
Contributor

michaelhkay commented Feb 6, 2024

This sounds feasible to me, though it's a fair bit of work (on specification, tests, and implementation). I suggest we restrict it so that (a) lookbehind has to be fixed length, and (b) neither lookahead nor lookbehind allows capturing groups.

Lookbehind syntax uses the < character, which will make it rather inconvenient, especially as &-escaping inside string literals is recognized (and required) in XQuery but not in XPath. Perhaps we should allow leftwards arrow (U+2190) as an alternative.
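To illustrate the inconvenience, here is a Python sketch that assumes the expression is embedded in an XML attribute (for example an XSLT `select`, which is my assumption and is not stated above). The element name `e`, the helper `as_xml`, and the arrow-to-`<` normalisation are all hypothetical; the arrow form is only the suggestion made in this comment, not existing syntax.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

pattern = "(?<=dog|cat) food"

def as_xml(p: str) -> str:
    """Wrap a regex in a hypothetical element with an XPath-like attribute."""
    return '<e select="matches(., \'' + p + '\')"/>'

# A raw "<" inside an attribute value is not well-formed XML.
try:
    ET.fromstring(as_xml(pattern))
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# The author has to write &lt; instead; the parser turns it back into "<".
escaped = escape(pattern)
parsed = ET.fromstring(as_xml(escaped)).get("select")

# Hypothetical alternative: write U+2190 and normalise it before compiling.
arrow_pattern = "(?\u2190=dog|cat) food"
normalised = arrow_pattern.replace("(?\u2190", "(?<")

print(raw_accepted)                      # False
print(escaped)                           # (?&lt;=dog|cat) food
print(normalised == pattern)             # True
print(escape(arrow_pattern) == arrow_pattern)  # True: no XML escaping needed
```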

@liamquin
Author

liamquin commented Feb 7, 2024

We could use the *plb *pla *nlb *nla syntax instead of < maybe, or even the longer versions ("terseness shall not be a goal" hah).
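The alphabetic forms avoid < entirely and map one-to-one onto the classic forms. A minimal sketch of that mapping in Python follows; `to_classic` and `ALPHA_FORMS` are hypothetical names, and the translation is deliberately naive (it does not handle an escaped `\(` or text inside character classes), so it shows the correspondence rather than a real implementation.

```python
import re

# Perl's alphabetic assertion names and their classic equivalents.
ALPHA_FORMS = {
    "pla": "(?=",  "positive_lookahead": "(?=",
    "nla": "(?!",  "negative_lookahead": "(?!",
    "plb": "(?<=", "positive_lookbehind": "(?<=",
    "nlb": "(?<!", "negative_lookbehind": "(?<!",
}

# Matches "(*name:" for any of the names above.
_ALPHA_RE = re.compile(r"\(\*(" + "|".join(ALPHA_FORMS) + r"):")

def to_classic(pattern: str) -> str:
    """Rewrite (*name:...) assertions into their (?=, (?!, (?<=, (?<! forms."""
    return _ALPHA_RE.sub(lambda m: ALPHA_FORMS[m.group(1)], pattern)

classic = to_classic("/([^/]+)(*positive_lookahead:/)")
print(classic)                               # /([^/]+)(?=/)
print(re.sub(classic, "/..", "/a/b/c/d/"))   # /../../../../
print(to_classic("(*plb:dog|cat) food"))     # (?<=dog|cat) food
```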

The restrictions are OK, if sometimes a pain; I think libpcre has them too.

You’ve been wanting to use non-ASCII characters for yonks, now’s the chance :-) but I’d rather use something people have a chance of looking up online, so I'd go for (*positive-lookbehind-assertion:foo) I think.

People already have trouble with { } in regular expressions, it’s an excellent point.

@ndw ndw added PRG-hard Categorized as "hard" at the Prague f2f, 2024 PRG-optional Categorized as "optional for 4.0" at the Prague f2f, 2024 labels Jun 4, 2024