Deprecate std::ctype, std::ctype_byname, std::isupper(), and std::toupper() #2

Open
tahonermann opened this issue Apr 23, 2018 · 14 comments
Labels: help wanted (Extra attention is needed); paper needed (A paper proposing a specific solution is needed)

Comments

@tahonermann
Member

The standard library specifies a number of interfaces that cannot be made to work reasonably well for Unicode. For example, from <locale>:

  • std::ctype, std::ctype_byname
  • Character classification functions (e.g., std::isupper())
  • Character conversion functions (e.g., std::toupper())

Such interfaces are candidates for deprecation, replacement, and eventual removal.
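
To illustrate the problem, here is a minimal sketch of the one-character-in, one-character-out model behind std::toupper(). The helper name upper_per_char is hypothetical and the classic "C" locale is an assumption chosen to keep the example portable. Applied to UTF-8 text, the function sees individual code units rather than characters, and the interface could not expand "ß" to "SS" in any case because it returns exactly one character per input.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: uppercases a string one char at a time using the
// <locale> overload of std::toupper with the classic "C" locale.
std::string upper_per_char(const std::string& s) {
    const std::locale& loc = std::locale::classic();
    std::string out;
    out.reserve(s.size());
    for (char c : s) {
        // Each call maps exactly one char to exactly one char.
        out.push_back(std::toupper(c, loc));
    }
    return out;
}
```

Given the UTF-8 bytes for "straße", the ASCII letters are uppercased while the two code units of "ß" pass through unchanged, so the result is neither "STRASSE" nor an error.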

@tahonermann added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Apr 23, 2018
@cubbimew

To be fair, isupper could be trivially implemented as a test for Unicode's General_Category Lu. Of course, other C/POSIX character classes don't map to Unicode categories that well. There is ISO TR 30112:2014 (draft), which defines what POSIX classes and conversions should do for every Unicode code point, but I'd agree it isn't what a forward-looking library spec should be considering: I'd like a ctype (or a replacement code point classifier) that can tell me if a code point has General_Category Cc rather than if it is "cntrl as interpreted by TR 30112" (which doesn't actually match Cc)

@rmartinho
Collaborator

General_Category is the wrong property, I think. Maybe it's ok for Cc (if what you want to test really is C0&C1 control characters), but definitely wrong for isupper. isupper should check Uppercase, which doesn't match gc. Always doubt yourself when you think what you need is General_Category.

@cubbimew

Fair point, @rmartinho : TR 30112's definition of isupper includes non-letters with a case, such as Ⓐ
Anyway, to make my comment clearer:

  1. it may be argued that a definition of those things in Unicode terms exists (will exist if TR becomes IS)
  2. a ctype/isxyz/toxyz replacement would be something that checks the category, and, as pointed out, other (all?) character properties that can be defined for a code point
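
The difference between General_Category Lu and the Uppercase property discussed above can be sketched with a tiny hand-written table. The type and function names here are hypothetical, and the three entries are only an excerpt; a real classifier would be generated from the Unicode Character Database.

```cpp
#include <cstddef>

// General_Category values needed for the few code points below.
enum class GeneralCategory { Lu, Lt, So };

struct CodePointInfo {
    char32_t code_point;
    GeneralCategory gc;
    bool uppercase;  // the derived Uppercase property
};

// Illustrative excerpt only, not a complete property table.
const CodePointInfo kSample[] = {
    {0x0041, GeneralCategory::Lu, true},   // LATIN CAPITAL LETTER A
    {0x24B6, GeneralCategory::So, true},   // CIRCLED LATIN CAPITAL LETTER A
    {0x01C5, GeneralCategory::Lt, false},  // titlecase digraph, not Uppercase
};

// Returns the table entry for cp, or nullptr if cp is not in the excerpt.
const CodePointInfo* lookup(char32_t cp) {
    for (std::size_t i = 0; i < sizeof(kSample) / sizeof(kSample[0]); ++i) {
        if (kSample[i].code_point == cp) return &kSample[i];
    }
    return nullptr;
}

// Classification by General_Category alone.
bool is_gc_lu(char32_t cp) {
    const CodePointInfo* info = lookup(cp);
    return info != nullptr && info->gc == GeneralCategory::Lu;
}

// Classification by the Uppercase property.
bool has_uppercase_property(char32_t cp) {
    const CodePointInfo* info = lookup(cp);
    return info != nullptr && info->uppercase;
}
```

For U+24B6 (Ⓐ) the two tests disagree: it is not in category Lu, yet it has the Uppercase property.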

@dimztimz

dimztimz commented May 5, 2018

Can't deprecate this, it's used by iostreams.

@tahonermann
Member Author

Can't deprecate this, it's used by iostreams.

I'm not sure what you are referring to by "this", but deprecation is not removal. We can deprecate features that are still in use.

@tahonermann changed the title from "Deprecate text/string/character interfaces that are too broken to fix" to "Deprecate std::ctype, std::ctype_byname, std::isupper(), and std::toupper()" on Jul 25, 2018
@tahonermann
Member Author

Changed title to limit scope. Focus on issues currently identified and described in this issue.

@tahonermann added the paper needed label (A paper proposing a specific solution is needed) on Aug 6, 2018
@tahonermann removed the enhancement label (New feature or request) on Nov 18, 2018
@cor3ntin
Collaborator

cor3ntin commented Aug 2, 2019

Here is a potential replacement http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1628r0.pdf

Note that this is low level (which doesn't mean it shouldn't be provided, as it is useful/necessary for lexers among other things), but in general Unicode recommends that this kind of thing be done on strings rather than code points, in both locale-independent and tailored fashion.

Exhaustive list of functions to deprecate

Name                          Description
isspace(std::locale)          checks if a character is classified as whitespace by a locale (function template)
isblank(std::locale) (C++11)  checks if a character is classified as a blank character by a locale (function template)
iscntrl(std::locale)          checks if a character is classified as a control character by a locale (function template)
isupper(std::locale)          checks if a character is classified as uppercase by a locale (function template)
islower(std::locale)          checks if a character is classified as lowercase by a locale (function template)
isalpha(std::locale)          checks if a character is classified as alphabetic by a locale (function template)
isdigit(std::locale)          checks if a character is classified as a digit by a locale (function template)
ispunct(std::locale)          checks if a character is classified as punctuation by a locale (function template)
isxdigit(std::locale)         checks if a character is classified as a hexadecimal digit by a locale (function template)
isalnum(std::locale)          checks if a character is classified as alphanumeric by a locale (function template)
isprint(std::locale)          checks if a character is classified as printable by a locale (function template)
isgraph(std::locale)          checks if a character is classified as graphical by a locale (function template)
toupper(std::locale)          converts a character to uppercase using the ctype facet of a locale (function template)
tolower(std::locale)          converts a character to lowercase using the ctype facet of a locale (function template)

@tahonermann
Member Author

Exhaustive list of functions to deprecate

Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.
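
For readers less familiar with the two families, here is a small sketch of how they differ at the call site. The wrapper names upper_c and upper_cxx are hypothetical, and defaulting to the classic locale is an assumption made for the example.

```cpp
#include <cctype>
#include <locale>

// C-derived variant from <cctype>: takes an int that must be EOF or
// representable as unsigned char, and consults the global C locale.
bool upper_c(char c) {
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

// C++-only variant from <locale>: a function template that takes the
// locale explicitly and uses that locale's std::ctype facet.
bool upper_cxx(char c, const std::locale& loc = std::locale::classic()) {
    return std::isupper(c, loc);
}
```

Only the second form is specified purely by C++, which is why deprecating the first form would involve C as well.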

@cor3ntin
Collaborator

cor3ntin commented Aug 2, 2019

Just the variants that take a std::locale argument? I think we want to deprecate them all, but the other variants are defined by C. Deprecating them will require specifying suitable replacements for both C and C++.

Good question
An alternative might be to add deprecated (or deleted, though that might be a hard sell?) overloads for char8_t, char16_t, char32_t
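
A rough sketch of what the deleted-overload idea could look like, written in a hypothetical namespace sketch rather than std so that it compiles on its own. The accepts trait is also hypothetical and exists only to show which argument types are rejected. This assumes C++17; the char8_t overload is guarded because that type requires C++20.

```cpp
#include <cctype>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, not a proposed API

// Mirrors the C signature: the argument must be EOF or representable
// as unsigned char.
inline bool isupper(int c) { return std::isupper(c) != 0; }

// Without these, a char16_t or char32_t argument would silently convert
// to int and select the overload above.
bool isupper(char16_t) = delete;
bool isupper(char32_t) = delete;
#ifdef __cpp_char8_t
bool isupper(char8_t) = delete;
#endif

// Detects whether sketch::isupper can be called with an argument of type T.
template <class T, class = void>
struct accepts : std::false_type {};

template <class T>
struct accepts<T, std::void_t<decltype(isupper(std::declval<T>()))>>
    : std::true_type {};

}  // namespace sketch
```

With these declarations, a call such as sketch::isupper(U'A') fails to compile, while calls with int or char arguments continue to work.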

@tahonermann
Member Author

An alternative might be to add deprecated (or deleted, though that might be a hard sell?) overloads for char8_t, char16_t, char32_t

I think we should focus more on what an appropriate C replacement would look like first.

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019 via email

@tahonermann
Member Author

Would C be interested in supporting unicode character properties?

No idea. My guess is that they would require any replacements to work with (wide) execution encoding and thus existing (non-Unicode) encodings. I see the motivation for replacement being:

  • improved error handling; no EOF value handling.
  • no UB on values not representable in unsigned char.
  • not code unit value based so that variable length encodings can be supported.

The point is more that we can’t deprecate these (in C) without replacements (in C).
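
The first two motivations can be made concrete with a sketch of a checked wrapper. The names in_ctype_domain and checked_isupper are hypothetical and std::optional is just one possible way to report "no answer"; this is not a proposed replacement, and it does nothing for the third point since it still works on single code unit values.

```cpp
#include <cctype>
#include <climits>  // UCHAR_MAX
#include <cstdio>   // EOF
#include <optional>

// The only arguments for which the <cctype> functions have defined behavior.
bool in_ctype_domain(int v) {
    return v == EOF || (v >= 0 && v <= UCHAR_MAX);
}

// Hypothetical wrapper: yields an empty optional instead of invoking
// undefined behavior or silently treating EOF as a character.
std::optional<bool> checked_isupper(int v) {
    if (v == EOF || !in_ctype_domain(v)) return std::nullopt;
    return std::isupper(v) != 0;
}
```

A value such as 0x0393 (the code point of Γ) is outside the domain, so the wrapper returns an empty optional rather than calling std::isupper with it.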

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019 via email

@tahonermann
Member Author

The thing is - I'm pretty sure Unicode character properties are NOT a replacement.

I agree. They might be used in the implementation of a replacement though.

Unicode character properties should NOT be locale dependent in any way; cp_isupper(U'Γ') should always be true, regardless of the execution encoding, platform, etc.

I agree, but the provided example is specifically passing a Unicode code point, so I don't think anyone would expect a locale dependency (this is not true for case mapping algorithms in general, but is for Unicode code point properties).

Ignoring the fact that isupper(foo) (for example) does not support anything but the first 255 values of a given character set

Technically, it supports all values that fit in a value of unsigned char (which is usually 8-bit in practice).

a negative answer means either

  • foo is not an uppercase letter
  • foo is not part of this non-unicode character set

Or foo isn't a code point at all (e.g., a trailing code unit value).

A code point based interface would solve all three of the bullet points I listed. (I would be fine with passing an invalid code point, errm, scalar value being a precondition violation; long live Contracts 2.0!)
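
As a sketch of such an interface, assuming the precondition is expressed with a plain assert until something like contracts is available: cp_isupper is the name used earlier in this thread, is_scalar_value is a hypothetical helper, and the classification covers only ASCII and the basic Greek capitals for illustration.

```cpp
#include <cassert>

// True for Unicode scalar values: code points up to U+10FFFF,
// excluding the surrogate range U+D800..U+DFFF.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Locale-independent classifier over code points. The ranges below are a
// tiny illustrative subset, not the full Uppercase property.
bool cp_isupper(char32_t cp) {
    assert(is_scalar_value(cp) && "precondition: cp is a Unicode scalar value");
    return (cp >= U'A' && cp <= U'Z') ||
           (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2);  // Greek capitals
}
```

Because the argument is a whole code point, there is no EOF sentinel to handle, no unsigned char range restriction, and no ambiguity about trailing code units; cp_isupper(0x0393), the code point of Γ, is true regardless of locale.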

5 participants