Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix suffixed-+'s on regex's #2536

Merged
merged 1 commit into from
Aug 24, 2023
Merged

fix suffixed-+'s on regex's #2536

merged 1 commit into from
Aug 24, 2023

Conversation

nekopsykose
Copy link
Contributor

these are not valid, and libc++ fails on these with:

libc++abi: terminating due to uncaught exception of type std::__1::regex_error: One of *?+{ was not preceded by a valid regular expression.

it happens to work on libstdc++, but the intention is the same without the + and works with both


part of #2535

@nekopsykose
Copy link
Contributor Author

well, maybe this was actually intended to be something else? i can't find a reference in libstdc++ for this kind of style being recognised or meaning something special..

@wwmm
Copy link
Owner

wwmm commented Aug 24, 2023

I will wait for @Digitalone1 opinion on this as she is the one that worked the most with the regex strings. At least on Arch Linux and gcc-libs there is no issue with them.

@nekopsykose
Copy link
Contributor Author

At least on Arch Linux and gcc-libs there is no issue with them.

it's handled by the c++ stdlib (libc++/libstdc++/...), and yea, the gcc libstdc++ handles this fine. i guess i am curious on how it actually works with libstdc++ at all myself, even if this fix is correct, because it strikes me as bizarre :D

@github-actions
Copy link

github-actions bot commented Aug 24, 2023

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

the + suffix is an extension for possessive quantifiers; e.g. for input string aaa, regex a+a will match (the a+ will match the first two a's, leaving the final a) while a++a will not (because a++ will consume all the a characters, leaving nothing else to match)

however, the c++ standard's ecmascript syntax does not specify possessive quantifiers; as far as i can tell, they exist in PCRE and in java, as well as in python (from version 3.11), perl, ruby, and boost, but not in standard c++ or javascript

this means they should be removed, but probably with care, as the meaning can change subtly

i guess libstdc++ implements them as an extension but fails to mention anywhere that it does

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

more findings: i just tested this with libstdc++ and it does not actually implement the semantics of possessive quantifiers; testing my a++a pattern with the aaa input string returns a match in libstdc++ (it does not in python, for example)

so this might as well be a bug in libstdc++, as they don't mention the extension anywhere (and perhaps @Digitalone1 implemented it with the assumption that they match possessively, but they do not)

@Digitalone1
Copy link
Contributor

I will wait for @Digitalone1 opinion on this as she is the one that worked the most with the regex strings. At least on Arch Linux and gcc-libs there is no issue with them.

I'm against this. Possessive quantifiers are more efficient and safer because if they are not respected, the regex fails faster. More info here:

Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches.

The main practical benefit of possessive quantifiers is to speed up your regular expression. In particular, possessive quantifiers allow your regex to fail faster.

This is because they prevent backtracking. More info here.

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.

Backtracking is central to the power of regular expressions; it makes it possible for expressions to be powerful and flexible, and to match very complex patterns. At the same time, this power comes at a cost. Backtracking is often the single most important factor that affects the performance of the regular expression engine.

I know that C++ reference does not mention them and that's surprising to me (I only found it here), but it works for most, if not all, Linux distributions and it's good to have them.

A specific patch for Chimera could be made as it was made for std::from_chars. But this is only my opinion.

@wwmm you decide if this is worth to merge. The problem supporting other compilers is that maybe today it's only regex (and also std::from_chars), but tomorrow can be another thing that forces us to change plans. Don't know if it's worth to do this if most of Linux distributions work out of the box with libstdc++ in an easier and more efficient way.

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

@Digitalone1 libstdc++ literally does not implement them; the semantics are identical to ordinary, just check it

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

this is not possessive semantics: https://ideone.com/j75Rdp

the libstdc++ documentation does not mention the existence of any such extension, therefore it can safely be concluded that the only reason that syntax seemingly works is that the implementation has a bug, not that the implementation implements the extension

@Digitalone1
Copy link
Contributor

@Digitalone1 libstdc++ literally does not implement them; the semantics are identical to ordinary, just check it

Already tested with debug messages on the command line when I introduced them.

https://github.com/wwmm/easyeffects/blob/master/src/equalizer_ui.cpp#L428-L431

this is not possessive semantics: https://ideone.com/j75Rdp

language: C++14 (gcc 8.3)

We try to use C++20. gcc is now on 13.2.1

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

q66@chimera:~$ c++ --version
c++ (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
q66@chimera:~$ c++ -std=c++20 test.cc -o test
q66@chimera:~$ ./test
aaa

at least with gcc 12 it still does not match possessively

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

gcc 13: https://godbolt.org/z/1WYK417Mz

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

additionally, i'm grepping the libstdc++ codebase and there are zero references to any such extension existing in it

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

made an upstream report too: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111129

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

@Digitalone1 GCC upstream says:

Confirmed. There is no such extension in libstdc++, this is just a bug.

@Digitalone1
Copy link
Contributor

That's interesting. I will retest it on my system and let you know the result.

@jwakely
Copy link

jwakely commented Aug 24, 2023

@Digitalone1 libstdc++ literally does not implement them; the semantics are identical to ordinary, just check it

Already tested with debug messages on the command line when I introduced them.
https://github.com/wwmm/easyeffects/blob/master/src/equalizer_ui.cpp#L428-L431

I don't know how you tested it, but that regex_search has always returned true with all versions of GCC:
https://godbolt.org/z/q3WvTqoMq

this is not possessive semantics: https://ideone.com/j75Rdp

language: C++14 (gcc 8.3)

We try to use C++20. gcc is now on 13.2.1

Doesn't matter, the std::regex semantics are unchanged except for the addition of the multiline option.

@Digitalone1
Copy link
Contributor

I don't know how you tested it, but that regex_search has always returned true with all versions of GCC: https://godbolt.org/z/q3WvTqoMq

That's weird. See https://en.cppreference.com/w/cpp/regex/regex_search

Returns true if a match exists, false otherwise.

@q66
Copy link
Contributor

q66 commented Aug 24, 2023

yes, because the semantics are not possessive, so it returns a match

basically, adding a + to a quantifier with libstdc++ is identical to as if there was no +, as this is a gcc bug,and the easyeffects code is relying on a specific bug to work, despite there being no change in behavior over correct code (without the pluses)

@jwakely
Copy link

jwakely commented Aug 24, 2023

GCC shouldn't even compile the regex, because it's not valid. But we incorrectly accept the invalid regex and also say it matches.

I'm not sure what we do, maybe we just completely ignore the extra + quantifiers, so we treat (a*+a++b) as (a*a+b) which does match "aaab".

Edit: But as qp66 says, there's no benefit to using an invalid regex that happens to "work" with one implementation due to a bug (which will get fixed), rather than just using a correct regex that works with all implementations.

@Digitalone1
Copy link
Contributor

I'm pretty sure it was returning false at the time I tested it. If you say so, I made a big mistake.

@Digitalone1
Copy link
Contributor

Digitalone1 commented Aug 24, 2023

I tested it another time and I confirm it returns true on the following

std::string s = "aaa";
std::smatch matches;
auto res = std::regex_search(s, matches, std::regex("(a++a)"));

So it was my mistake. I'm sorry. This PR is okay to me.

@q66 @nekopsykose Also the lines here can be removed, thanks: https://github.com/wwmm/easyeffects/blob/master/src/equalizer_ui.cpp#L428-L431

these are not valid, and libc++ fails on these with:

libc++abi: terminating due to uncaught exception of type std::__1::regex_error: One of *?+{ was not preceded by a valid regular expression.

it happens to work on libstdc++, but the intention is the same without
the + and works with both
@wwmm wwmm merged commit d1a9347 into wwmm:master Aug 24, 2023
8 checks passed
@jwakely
Copy link

jwakely commented Aug 24, 2023

I'm pretty sure it was returning false at the time I tested it. If you say so, I made a big mistake.

It returned false with GCC 4.8 when std::regex was unimplemented, but regex_search also returned false when match "a" against "a", because no matching was implemented yet.

Since GCC 4.9.0 (the first release with working std::regex) it returned true.

@Digitalone1
Copy link
Contributor

Digitalone1 commented Aug 24, 2023

It returned false with GCC 4.8 when std::regex was unimplemented, but regex_search also returned false when match "a" against "a", because no matching was implemented yet.

Since GCC 4.9.0 (the first release with working std::regex) it returned true.

It was 2-3 years ago, could not be a version so low. I misunderstood the results of the tests and the compiler was not failing adding the + suffix. That made me think possessive quantifiers were supported.

Anyway, in 2023 not having this feature in C++ is a bit awkward. PHP has it from years.

@jwakely
Copy link

jwakely commented Aug 24, 2023

I think PHP uses PCRE for its regex engine. C++ supports ecmascript and posix regexes, and neither of those supports that feature.

@Digitalone1
Copy link
Contributor

I think PHP uses PCRE for its regex engine. C++ supports ecmascript and posix regexes, and neither of those supports that feature.

So if I want to optimize a regex avoiding the backtracking, is there no way to do it on C++?

At least atomic groups are supported?

@jwakely
Copy link

jwakely commented Aug 26, 2023

You want to optimize a regex in C++, don't use std::regex. PCRE2 can be used in C++.

@Digitalone1
Copy link
Contributor

You want to optimize a regex in C++, don't use std::regex. PCRE2 can be used in C++.

How? With an external library? It's doable, but I don't like to add an extra dependency on this project. Let's keep it simple, even if it's not optimized...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants