Deprecate `std::ctype`, `std::ctype_byname`, `std::isupper()`, and `std::toupper()` #2

tahonermann · 2018-04-23T00:29:29Z

The standard library specifies a number of interfaces that cannot be made to work reasonably well for Unicode. For example, from <locale>:

std::ctype, std::ctype_byname
Character classification functions (e.g., std::isupper())
Character conversion functions (e.g., std::toupper())

Such interfaces are candidates for deprecation, replacement, and eventual removal.

cubbimew · 2018-04-25T14:27:09Z

To be fair, isupper could be trivially implemented as a test for Unicode's General_Category Lu. Of course, other C/POSIX character classes don't map to Unicode categories that well. There is ISO TR 30112:2014 (draft), which defines what POSIX classes and conversions should do for every Unicode code point, but I'd agree it isn't what a forward-looking library spec should be considering: I'd like a ctype (or a replacement code point classifier) that can tell me if a code point has General_Category Cc rather than if it is "cntrl as interpreted by TR 30112" (which doesn't actually match Cc)

rmartinho · 2018-04-25T14:36:04Z

General_Category is the wrong property, I think. Maybe it's ok for Cc (if what you want to test really is C0&C1 control characters), but definitely wrong for isupper. isupper should check Uppercase, which doesn't match gc. Always doubt yourself when you think what you need is General_Category.

cubbimew · 2018-04-25T16:58:48Z

Fair point, @rmartinho : TR 30112's definition of isupper includes non-letters with a case, such as Ⓐ
Anyway, to make my comment clearer:

it may be argued that a definition of those things in Unicode terms exists (will exist if TR becomes IS)
a ctype/isxyz/toxyz replacement would be something that checks the category, and, as pointed out, other (all?) character properties that can be defined for a code point
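To make the General_Category versus Uppercase distinction above concrete, here is a minimal sketch. The enum, the struct, the function names, and the four-entry table are all hypothetical and hand-copied for illustration only; a real classifier would be generated from the Unicode Character Database rather than written by hand.

```cpp
#include <cstddef>

// Hypothetical, illustration-only subset of General_Category values.
enum class GeneralCategory { Lu, Lt, Nl, So };

struct CodePointInfo {
    char32_t cp;
    GeneralCategory gc;
    bool uppercase;  // the derived Uppercase property (Lu plus Other_Uppercase)
};

// Hand-picked sample entries; NOT a complete table.
constexpr CodePointInfo kSample[] = {
    {0x0041, GeneralCategory::Lu, true},   // A: LATIN CAPITAL LETTER A
    {0x24B6, GeneralCategory::So, true},   // Ⓐ: CIRCLED LATIN CAPITAL LETTER A
    {0x2160, GeneralCategory::Nl, true},   // Ⅰ: ROMAN NUMERAL ONE
    {0x01C5, GeneralCategory::Lt, false},  // ǅ: titlecase digraph
};

// Linear lookup in the sample table; returns nullptr for unknown code points.
inline const CodePointInfo* find_sample(char32_t cp) {
    for (std::size_t i = 0; i < sizeof(kSample) / sizeof(kSample[0]); ++i) {
        if (kSample[i].cp == cp) return &kSample[i];
    }
    return nullptr;
}

// "isupper" defined as General_Category == Lu.
inline bool is_upper_by_gc(char32_t cp) {
    const CodePointInfo* e = find_sample(cp);
    return e != nullptr && e->gc == GeneralCategory::Lu;
}

// "isupper" defined as the Uppercase property.
inline bool is_upper_by_property(char32_t cp) {
    const CodePointInfo* e = find_sample(cp);
    return e != nullptr && e->uppercase;
}
```

With these sample entries the two definitions agree on A but disagree on Ⓐ and Ⅰ, which have the Uppercase property without being in category Lu.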

dimztimz · 2018-05-05T21:11:10Z

Can't deprecate this, it's used by iostreams.

tahonermann · 2018-05-05T21:54:42Z

Can't deprecate this, it's used by iostreams.

I'm not sure what you are referring to by "this", but deprecation is not removal. We can deprecate features that are still in use.

tahonermann · 2018-07-25T18:43:48Z

Changed title to limit scope. Focus on issues currently identified and described in this issue.

cor3ntin · 2019-08-02T11:00:50Z

Here is a potential replacement http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1628r0.pdf

Note that this is low level (which doesn't mean it shouldn't be provided, as it is useful/necessary for lexers among other things), but in general Unicode recommends that this kind of thing be done on strings rather than code points, in both locale-independent and tailored fashion.

Exhaustive list of functions to deprecate

| Name | Description |
| --- | --- |
| isspace(std::locale) | checks if a character is classified as whitespace by a locale (function template) |
| isblank(std::locale) (C++11) | checks if a character is classified as a blank character by a locale (function template) |
| iscntrl(std::locale) | checks if a character is classified as a control character by a locale (function template) |
| isupper(std::locale) | checks if a character is classified as uppercase by a locale (function template) |
| islower(std::locale) | checks if a character is classified as lowercase by a locale (function template) |
| isalpha(std::locale) | checks if a character is classified as alphabetic by a locale (function template) |
| isdigit(std::locale) | checks if a character is classified as a digit by a locale (function template) |
| ispunct(std::locale) | checks if a character is classified as punctuation by a locale (function template) |
| isxdigit(std::locale) | checks if a character is classified as a hexadecimal digit by a locale (function template) |
| isalnum(std::locale) | checks if a character is classified as alphanumeric by a locale (function template) |
| isprint(std::locale) | checks if a character is classified as printable by a locale (function template) |
| isgraph(std::locale) | checks if a character is classified as graphical by a locale (function template) |
| toupper(std::locale) | converts a character to uppercase using the ctype facet of a locale (function template) |
| tolower(std::locale) | converts a character to lowercase using the ctype facet of a locale (function template) |

tahonermann · 2019-08-02T21:21:16Z

Exhaustive list of functions to deprecate

Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.
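For readers unfamiliar with the distinction, a minimal sketch of the three spellings under discussion follows. The wrapper function names are hypothetical; the calls inside them are the existing `<cctype>` and `<locale>` interfaces.

```cpp
#include <cctype>
#include <locale>

// Variant inherited from C (<cctype>): takes an int and consults the global C locale.
inline bool upper_via_c(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// Variant from <locale>: a function template taking an explicit std::locale.
inline bool upper_via_locale(char c, const std::locale& loc = std::locale::classic()) {
    return std::isupper(c, loc);
}

// The same query made directly on the std::ctype facet,
// which is what the <locale> overload is specified in terms of.
inline bool upper_via_facet(char c, const std::locale& loc = std::locale::classic()) {
    return std::use_facet<std::ctype<char>>(loc).is(std::ctype_base::upper, c);
}
```

Only the second and third forms involve `std::locale` and `std::ctype`; the first is the one whose specification comes from C.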

cor3ntin · 2019-08-02T21:57:39Z

Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.

Good question
An alternative might be to add deprecated (or deleted, might be a hard sell?) overloads for char8_t, char16_t, char32_t
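As a rough sketch of what deleted overloads would buy, assuming a hypothetical `sketch` namespace (the actual suggestion concerns the standard functions, which user code cannot overload this way) and a hypothetical `accepts` detection trait:

```cpp
#include <cctype>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, for illustration only

// The narrow-character classifier, kept as-is.
inline bool is_upper(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// Without these, a char16_t or char32_t argument would silently convert to char.
// [[deprecated]] definitions would be the softer alternative to "= delete".
bool is_upper(char16_t) = delete;
bool is_upper(char32_t) = delete;
#ifdef __cpp_char8_t
bool is_upper(char8_t) = delete;
#endif

// Detection trait: true if is_upper(T) is a usable call.
template <class T, class = void>
struct accepts : std::false_type {};

template <class T>
struct accepts<T, decltype(void(is_upper(std::declval<T>())))> : std::true_type {};

}  // namespace sketch

static_assert(sketch::accepts<char>::value, "char is still accepted");
static_assert(!sketch::accepts<char16_t>::value, "char16_t is rejected at compile time");
static_assert(!sketch::accepts<char32_t>::value, "char32_t is rejected at compile time");
```

The effect is that passing a Unicode code unit or code point type becomes a compile-time error instead of a silent narrowing conversion.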

tahonermann · 2019-08-03T00:30:34Z

An alternative might be to add deprecated (or deleted, might be a hard sell?) overloads for char8_t, char16_t, char32_t

I think we should focus more on what an appropriate C replacement would look like first.

cor3ntin · 2019-08-03T06:36:24Z

Would C be interested in supporting unicode character properties?

tahonermann · 2019-08-03T14:56:53Z

Would C be interested in supporting unicode character properties?

No idea. My guess is that they would require any replacements to work with (wide) execution encoding and thus existing (non-Unicode) encodings. I see the motivation for replacement being:

improved error handling; no EOF value handling.
no UB on values not representable in unsigned char.
not code unit value based so that variable length encodings can be supported.

The point is more that we can’t deprecate these (in C) without replacements (in C).
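The second bullet is the classic pitfall with the C interfaces. A minimal sketch of the usual workaround, with a hypothetical wrapper name:

```cpp
#include <cctype>

// std::isupper takes an int whose value must be representable as unsigned char
// or equal EOF. Where plain char is signed, passing a negative char directly
// (for example a UTF-8 lead byte such as 0xC3) is undefined behavior.
// Converting to unsigned char first keeps the call well defined.
inline bool is_upper_narrow(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}
```

The cast removes the undefined behavior but not the third problem: the function still sees one code unit at a time, so it cannot classify a character that a variable-length encoding spreads over several code units.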

cor3ntin · 2019-08-03T16:08:16Z

The thing is - I'm pretty sure Unicode character properties are NOT a replacement. Unicode character properties should NOT be locale dependent in any way: cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.

Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set, a negative answer means either

foo is not an uppercase letter
foo is not part of this non-Unicode character set

Is that useful information? Is a replacement useful? If it is, we still need two APIs, and maybe we can fix the existing one by fixing your second and third bullet points.

tahonermann · 2019-08-03T21:26:13Z

The thing is - I'm pretty sure Unicode character properties are NOT a replacement.

I agree. They might be used in the implementation of a replacement though.

Unicode character properties should NOT be locale dependent in any way: cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.

I agree, but the provided example is specifically passing a Unicode code point, so I don't think anyone would expect a locale dependency (this is not true for case mapping algorithms in general, but is for Unicode code point properties).

Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set

Technically, it supports all values that fit in a value of unsigned char (which is usually 8-bit in practice).

a negative answer means either

foo is not an uppercase letter

foo is not part of this non-Unicode character set

Or foo isn't a code point at all (e.g., a trailing code unit value).

A code point based interface would solve all three of the bullet points I listed. (I would be fine with passing an invalid code point, errm, scalar value being a precondition violation; long live Contracts 2.0!)
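A minimal sketch of what such a code point based interface might look like. `cp_isupper` is the hypothetical name used earlier in this thread, `is_scalar_value` is an assumed helper, and the ranges are a tiny hand-written illustration (Basic Latin and Greek capitals only), not real Unicode property data.

```cpp
#include <cassert>

// A Unicode scalar value is any code point up to U+10FFFF except the
// surrogate range U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical classifier: no EOF sentinel, no locale, no dependence on the
// execution encoding. Passing a non-scalar value is a precondition violation,
// checked here with assert as a stand-in for a contract.
inline bool cp_isupper(char32_t cp) {
    assert(is_scalar_value(cp));
    return (cp >= 0x0041 && cp <= 0x005A)     // A..Z
        || (cp >= 0x0391 && cp <= 0x03A1)     // Greek capital Alpha..Rho
        || (cp >= 0x03A3 && cp <= 0x03AB);    // Greek capital Sigma..Upsilon with dialytika
}
```

Under these assumptions `cp_isupper(0x0393)` (Γ) is true on every platform, and a trailing code unit can never be passed by mistake because the parameter is a whole code point rather than a code unit.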

tahonermann added the "enhancement" and "help wanted" labels Apr 23, 2018

tahonermann changed the title ~~Deprecate text/string/character interfaces that are too broken to fix~~ Deprecate std::ctype, std::ctype_byname, std::isupper(), and std::toupper() Jul 25, 2018

tahonermann added the "paper needed" label Aug 6, 2018

tahonermann removed the "enhancement" label Nov 18, 2018
