<regex>: Remove usage of non-standard _Uelem from parser #5592
Towards #995. A second PR in the future will remove _Uelem from the matcher.

Requirements on the character type _Elem in the standard

Effectively, the standard currently spells out the following guarantees for the character type:
- string_type as a basic_string in the regex traits class requirements, the traits_type of the string_type (which should be char_traits<_Elem>) must support all operations in [char.traits.require].
- _Elem must be a non-array trivially copyable standard-layout and trivially default-constructible type.

Alas, this is woefully underspecified because
regex has to convert and compare code points (integers) to characters (see also LWG-3835). There is a de facto requirement that regex must be able to convert and compare integers to characters, otherwise regexes couldn't be parsed or line endings couldn't be matched. There is also a de facto requirement that regex must be able to convert characters to integers again to implement [re.grammar]/12. But how these conversions and comparisons actually work is not specified at all.

One way out might be to rely on the existing int_type in the character traits type, but the standard (a) does not actually specify any property for this type other than that it can somehow represent all characters + eof() (see [char.traits.typedefs]/1) and (b) immediately goes on to violate this only requirement in the specializations for Unicode character types (see LWG-2959). Moreover, there doesn't appear to be an actual requirement that int_type is an integer type -- there is only a guarantee that it can be used through the API of the character traits class. So int_type does not seem helpful; rather, relying on it would just open its own can of worms.

Requirements on _Elem in the regex implementations before and after this PR

The following fundamental requirements on
_Elem remain unchanged by this PR:
- lt() and eq() functions in the character traits class.
- c, then 0, .., c-1 must be valid code points of characters as well (in the sense that we can obtain a different character object for each of them through casting).
- _Elem type is large enough to represent these code points.
- _Elem{} must represent NUL (with code point 0).

Requirements on _Elem in the regex implementation before this PR

The current implementation makes at least the following additional assumptions on convertibility and comparability:
- _Elem must be equality-comparable to itself, char, int and the internal enum type _Meta_type (maybe including some implicit conversion).
- _Elem type must be implicitly convertible to int and _Meta_type must be implicitly convertible to _Elem.
- char, int and unsigned int must be explicitly convertible to _Elem.
- _Elem must be explicitly convertible to _Meta_type.
- int, i.e., conversion _Elem -> int -> _Elem must yield the original character again.
- _Uelem such that explicit conversion to this type yields the code point for any character, i.e., conversion _Elem -> _Uelem -> _Elem produces the original character again and the natural ordering of the _Uelem values after conversion must be consistent with the lt() and eq() functions in the character traits class. (Note that conversion to an arbitrary but big unsigned integral type does not achieve this if the character type behaves like a signed integral type because the converted value will be sign-extended.)
- make_unsigned<_Elem> must be well-defined. (This mostly defeats the purpose of _Uelem, because specializing make_unsigned<_Elem> for user-defined types is forbidden.)

This list might be non-exhaustive.
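The sign-extension pitfall mentioned in the parenthetical note on _Uelem can be reproduced with signed char. The following is a minimal sketch, not code from this PR; the helper names widen_directly and widen_via_make_unsigned are hypothetical and only contrast a direct cast to a wide unsigned type with a cast through make_unsigned first.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper: cast straight to a "big" unsigned type.
// For signed character types the value is sign-extended first.
template <class Elem>
constexpr unsigned int widen_directly(Elem ch) noexcept {
    return static_cast<unsigned int>(ch);
}

// Hypothetical helper: cast to the unsigned counterpart of Elem first,
// which is what a _Uelem-style typedef provides for integral types.
template <class Elem>
constexpr unsigned int widen_via_make_unsigned(Elem ch) noexcept {
    using Uelem = typename std::make_unsigned<Elem>::type;
    return static_cast<unsigned int>(static_cast<Uelem>(ch));
}
```

With a signed char holding -23 (bit pattern 0xE9), the second helper yields the code point 0xE9, while the first yields a value far above 0xFF, so ordering and range checks on the widened value would no longer agree with the character's code point.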
Requirements on _Elem in the regex implementation after this PR

This PR imposes the following requirements on comparisons and conversions:
- _Elem must be equality-comparable to itself.
- _Elem must be explicitly convertible to unsigned char and unsigned int.
- _Elem is an integral or enum type, it must also be explicitly convertible to and from make_unsigned<_Elem>. Explicit conversion to this character type thus yields the code points of the characters.
- unsigned int must yield the unsigned code point for a character, if the code point can fit into an unsigned int.
- _Elem must behave like an unsigned integer type when converted to unsigned int.
- char, unsigned char and unsigned int must be explicitly convertible to _Elem.

This list should be exhaustive; the new test checks that we don't do any conversions not listed above.
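To make the conversion surface listed above concrete, here is a sketch of a user-defined character type that offers only those operations. The type wrapped_char is hypothetical and is not one of the types used in this PR's test. It also only illustrates the conversions: actually instantiating basic_regex with it would additionally require a suitable char_traits specialization and a regex traits class.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical character type exposing only explicit conversions.
class wrapped_char {
public:
    wrapped_char() = default; // trivial; wrapped_char{} value-initializes to code point 0

    // Explicit conversions from char, unsigned char and unsigned int.
    // char goes through unsigned char so negative values map to 128..255.
    constexpr explicit wrapped_char(char c) noexcept : code_(static_cast<unsigned char>(c)) {}
    constexpr explicit wrapped_char(unsigned char c) noexcept : code_(c) {}
    constexpr explicit wrapped_char(unsigned int c) noexcept : code_(c) {}

    // Explicit conversions to unsigned char (truncating) and unsigned int (code point).
    constexpr explicit operator unsigned char() const noexcept {
        return static_cast<unsigned char>(code_);
    }
    constexpr explicit operator unsigned int() const noexcept { return code_; }

    // Equality with itself only; no comparison against char, int or enums.
    friend constexpr bool operator==(wrapped_char lhs, wrapped_char rhs) noexcept {
        return lhs.code_ == rhs.code_;
    }
    friend constexpr bool operator!=(wrapped_char lhs, wrapped_char rhs) noexcept {
        return !(lhs == rhs);
    }

private:
    unsigned int code_;
};

static_assert(std::is_trivially_default_constructible<wrapped_char>::value, "must be trivially default-constructible");
static_assert(std::is_trivially_copyable<wrapped_char>::value, "must be trivially copyable");
static_assert(std::is_standard_layout<wrapped_char>::value, "must be standard-layout");
```

Note that constructing this sketch from a plain int would be ambiguous between the three constructors, which is in line with the goal of no longer requiring any conversion to or from int.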
We cannot just drop most of the special logic for signed integral and enum types because we must support char.

But if desired, we could drop explicit convertibility of _Elem to and from char and unsigned char in favor of unsigned int (and make_unsigned<_Elem> for integral or enum _Elem) only. However, this would mean adding even more casts in <regex>.

Differences in requirements
Essentially, we have the following new requirements:
- _Elem must be explicitly convertible to unsigned int.
- _Elem must support explicit conversion to and from unsigned char.
- _Elem must convert like an unsigned integer type when explicitly converted to unsigned int, if _Elem is not an integral or enum type. (This requirement is only kind of new: Before this PR, regex didn't even compile for such types, except when entering UB territory by specializing make_unsigned.)

In exchange, the following requirements will be dropped when the changes are completed for the matcher as well:
- _Elem and other integral types.
- _Elem does not have to be convertible to and from int and _Meta_type.
- _Elem.
- _Elem no longer have to fit into an int.
- _Uelem in the regex traits class.
- make_unsigned<_Elem> for types that are not integral or enum.

Changes
- _Unescaped_char member to _Parser2 to represent the character represented by some kind of escape sequence.
- int" requirement, because we no longer have to represent such characters in the _Val member.
- static_cast's added to avoid implicit conversions.
- _Meta_type values, we first have to cast them to char before casting them to _Elem.
- _Meta_type values by temporarily saving the character from the input character sequence.
- unsigned char.
- _Elem -> unsigned char -> _Elem yields the original character again if and only if the code point is less than 256.
- strchr() in _Parser2::_Trans() when the character code point fits into an unsigned char.
- wregex (probably by accident). It uses the fact that casting to _Meta_type for non-ASCII code points produces meta values the parser doesn't know about, and the parser treats unknown meta values as non-special. In any case, this change is also necessary to drop the "must fit into an int" requirement.
- lt() function of the character traits class.
- unsigned int or similar.
- sizeof(_Elem) = 1U: The bitmap optimization already handles all characters with code point < 256, so this is just dead code.
- unsigned int) are done using roundtrip casting.
- _Elem is not an integral or enum type, the small character range optimization is not performed for code points unrepresentable by unsigned int.
- sizeof(_Elem) > 1U.
- sizeof(_Elem) > 1U with maximal code point < 256. A more accurate check appears much uglier to me, though, for little gain.
- _Elem in _CharacterEscape():
- switch on _Elem. All switches are now performed on unsigned char (after checking that the code point is less than 256 while handling the case for code points >= 256 separately) or int (when it is on _Mchar). In the one switch on _Mchar, the value is cast to int first to prevent a warning that there aren't case labels for some of the enum identifiers.

Test
For now, the test only checks that the parser compiles and doesn't crash on two user-defined character types. We can only check semantic correctness after adjusting the matcher similarly.
The test currently suppresses warning C6510 because #5563 hasn't been merged yet.
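For readers unfamiliar with the roundtrip-casting idiom and the "switch on unsigned char" pattern described under Changes, the following sketch shows how the two can combine. It is not the code from <regex>: the names fits_in_uchar, escape_kind and classify_escape are hypothetical, and the handled escapes are chosen only for illustration.

```cpp
#include <cassert>

// Hypothetical helper: true if ch survives Elem -> unsigned char -> Elem,
// which holds if and only if its code point is less than 256.
// Only explicit casts and Elem == Elem are used.
template <class Elem>
constexpr bool fits_in_uchar(Elem ch) noexcept {
    return static_cast<Elem>(static_cast<unsigned char>(ch)) == ch;
}

// Hypothetical result type, for illustration only.
enum class escape_kind { none, newline, tab };

// Never switches on Elem itself: code points >= 256 are filtered out first,
// then the switch runs on unsigned char.
template <class Elem>
constexpr escape_kind classify_escape(Elem ch) noexcept {
    if (!fits_in_uchar(ch)) {
        return escape_kind::none; // would be handled separately
    }

    switch (static_cast<unsigned char>(ch)) {
    case 'n':
        return escape_kind::newline;
    case 't':
        return escape_kind::tab;
    default:
        return escape_kind::none;
    }
}
```

The roundtrip check matters because a plain truncating cast would make a wide character such as code point 0x100 + 'n' look like 'n'; with the check in front, such a character falls into the separate "code point >= 256" path instead.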