Skip to content

<regex>: Remove usage of non-standard _Uelem from parser #5592

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented Jun 15, 2025

Towards #995. A second PR in the future will remove _Uelem from the matcher.

Requirements on the character type _Elem in the standard

Effectively, the standard currently spells out the following guarantees for the character type:

  • The regex traits class can handle the character type for all the operations in [re.traits].
  • Indirectly, through the reference to string_type as a basic_string in the regex traits class requirements, the traits_type of the string_type (which should be char_traits<_Elem>) must support all operations in [char.traits.require].
  • According to [strings.general]/1 referenced by [re.general]/2, _Elem must be a non-array trivially copyable standard-layout and trivially default-constructible type.

Alas, this is woefully underspecified because regex has to convert and compare code points (integers) to characters (see also LWG-3835). There is a de facto requirement that regex must be able to convert and compare integers to characters, otherwise regexes couldn't be parsed or line endings couldn't be matched. There is also a de facto requirement that regex must be able to convert characters to integers again to implement [re.grammar]/12. But how these conversions and comparisons actually work is not specified at all.

One idea out might be to rely on the existing int_type in the character traits type, but the standard (a) does not actually specify any property for this type other than that it can somehow represent all characters + eof() (see [char.traits.typedefs]/1) and (b) immediately goes on to violate this only requirement in the specializations for Unicode character types (see LWG-2959). Moreover, there doesn't appear to be an actual requirement that int_type is an integer type -- there is only a guarantee that it can be used through the API of the character traits class. So int_type does not seem helpful, rather, relying on it would just open its own can of worms.

Requirements on _Elem in the regex implementations before and after this PR

The following fundamental requirements on _Elem remain unchanged by this PR:

  • All integral conversions must be consistent with some implicit one-to-one-mapping between unsigned integers (code points) and characters (up to lossiness and sign extension), and the natural ordering of these unsigned integers must be consistent with the lt() and eq() functions in the character traits class.
  • Lossy casts to and from integers are allowed and behave as one would expect for casts between integers.
  • There must not be any gaps between code points, i.e., if a character has code point c, then 0, .., c-1 must be valid code points of characters as well (in the sense that we can obtain a different character object for each of them through casting).
  • The code points for the basic character set (to the degree they have special meaning in regexes) plus the paragraph and line separator must agree with the corresponding Unicode code points. For the paragraph and line separator, this must only be the case if the _Elem type is large enough to represent these code points.
  • _Elem{} must represent NUL (with code point 0).

Requirements on _Elem in the regex implementation before this PR

The current implementation makes at least the following additional assumptions on convertibility and comparability:

  • _Elem must be equality-comparable to itself, char, int and the internal enum type _Meta_type (maybe including some implicit conversion).
  • _Elem type must be implicitly convertible to int and _Meta_type must be implicitly convertible to _Elem.
  • char, int and unsigned int must be explicitly convertible to _Elem.
  • _Elem must be explicitly convertible to _Meta_type.
  • The code point for any character can fit into an int, i.e., conversion _Elem -> int -> _Elem must yield the original character again.
  • The character traits must provide an unsigned integral type _Uelem such that explicit conversion to this type yields the code point for any character, i.e., conversion _Elem -> _Uelem -> _Elem produces the original character again and the natural ordering of the _Uelem values after conversion must be consistent with the lt() and eq() functions in the character traits class. (Note that conversion to an arbitrary but big unsigned integral type does not achieve this if the character type behaves like a signed integral type because the converted value will be sign-extended.)
  • make_unsigned<_Elem> must be well-defined. (This mostly defeats the purpose of _Uelem, because specializing make_unsigned<_Elem> for user-defined types is forbidden.)

This list might be non-exhaustive.

Requirements on _Elem in the regex implementation after this PR

This PR imposes the following requirements on comparisons and conversions:

  • _Elem must be equality-comparable to itself.
  • _Elem must be explicitly convertible to unsigned char and unsigned int.
    • If _Elem is an integral or enum type, it must also be explicitly convertible to and from make_unsigned<_Elem>. Explicit conversion to this character type thus yields the code points of the characters.
    • Otherwise, conversion to unsigned int must yield the unsigned code point for a character, if the code point can fit into an unsigned int.
      • This means that _Elem must behave like an unsigned integer type when converted to unsigned int.
  • char, unsigned char and unsigned int must be explicitly convertible to _Elem.

This list should be exhaustive; the new test checks that we don't do any conversions not listed above.

We cannot just drop most of the special logic for signed integral and enum types because we must support char.

But if desired, we could drop explicit convertibility of _Elem from and to char and unsigned char in favor of unsigned int (and make_unsigned<_Elem> for integral or enum _Elem) only. But this will mean we will have to add even more casts in <regex>.

Differences in requirements

Essentially, we have the following new requirements:

  • _Elem must be explicitly convertible to unsigned int.
  • _Elem must support explicit conversion to and from unsigned char.
  • _Elem must convert like an unsigned integer type when explicitly converted to unsigned int, if _Elem is not an integral or enum type. (This requirement is only kind of new: Before this PR, regex didn't even compile for such types, except when entering UB territory by specializing make_unsigned.)

In exchange, the following requirements will be dropped when the changes are completed for the matcher as well:

  • No more equality-comparability between _Elem and other integral types.
  • _Elem does not have to be convertible to and from int and _Meta_type.
  • No more implicit conversions from and to _Elem.
  • The code points for _Elem no longer have to fit into an int.
  • No more _Uelem in the regex traits class.
  • No more reliance on make_unsigned<_Elem> for types that are not integral or enum.

Changes

  • Add new _Unescaped_char member to _Parser2 to represent the character represented by some kind of escape sequence.
    • This allows us to drop the "must fit into an int" requirement, because we no longer have to represent such characters in the _Val member.
  • Lots of static_cast's added to avoid implicit conversions.
    • For many _Meta_type values, we first have to cast them to char before casting them to _Elem.
    • In two cases, I avoided casts from _Meta_type values by temporarily saving the character from the input character sequence.
  • To decide whether a character can be stored in the bitmap, the test checks now whether a character code point fits into an unsigned char.
    • Such tests are done by roundtrip casting: _Elem -> unsigned char -> _Elem yields the original character again if and only if the code point is less than 256.
  • Only call strchr() in _Parser2::_Trans() when the character code point fits into an unsigned char.
    • It's actually surprising that this logic worked for wregex (probably by accident). It uses the fact that casting to _Meta_type for non-ASCII code points produces meta values the parser doesn't know about, and the parser treats unknown meta values as non-special. In any case, this change is also necessary to drop the "must fit into an int" requirement.
  • Replace some ordering checks by calls to the lt() function of the character traits class.
    • This lets us avoid some casting and case distinction between integral/enum types and other types. We can also avoid the addition of a requirement that the character code points must fit into an unsigned int or similar.
  • Rework the logic for the bitmap optimization and small character range optimization for the new requirements.
    • The code for the small character range optimization is no longer generated if sizeof(_Elem) = 1U: The bitmap optimization already handles all characters with code point < 256, so this is just dead code.
    • Checks whether character code points are too large (to store the character in the bitmap or the code point in unsigned int) are done using roundtrip casting.
    • We have to differentiate between integral/enum types and other user-defined types for the small character range optimization, as we have to cast the characters to different unsigned integer types.
      • When _Elem is not an integral or enum type, the small character range optimization is not performed for code points unrepresentable by unsigned int.
  • Store the named character classes and equivalence classes if sizeof(_Elem) > 1U.
    • This is sufficient (due to the requirement that code points must have no gaps), but not necessary: We could have a character type with sizeof(_Elem) > 1U with maximal code point < 256. A more accurate check appears much more ugly to me, though, for little gain.
  • Revise logic to cast integer value (from escape sequence) to an _Elem in _CharacterEscape():
    • Again, we have to differentiate between integral/enum types and other user-defined types to perform the correct casts.
    • We check whether the integer value corresponds to a code point by performing roundtrip and testing whether it yields the same integer value again.
  • No more switch on _Elem. All switches are now performed on unsigned char (after checking that the code point is less than 256 while handling the case for code points >= 256 separately) or int (when it is on _Mchar). In the one switch on _Mchar, the value is cast to int first to prevent a warning that there aren't case labels for some of the enum identifiers.

Test

For now, the test only checks that the parser compiles and doesn't crash on two user-defines character types. We can only check semantic correctness after adjusting the matcher similarly.

The test currently suppresses warning C6510 because #5563 hasn't been merged yet.

@muellerj2 muellerj2 requested a review from a team as a code owner June 15, 2025 17:41
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Jun 15, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Jun 16, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jun 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working regex meow is a substring of homeowner
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

2 participants