Specify what constitutes white-space characters #69

Open
tahonermann opened this issue Mar 23, 2021 · 7 comments
Labels: clarification, help wanted, paper needed


@tahonermann
Member

The C++ standard defines behavior that depends on whether a character constitutes white-space, but never defines what those characters are. Uses of the "whitespace" and "white-space" terms appear in:

P2178 proposal 2 sought to clarify the set of characters that constitute white-space and proposed the following set. These characters all satisfy the immutable Pattern_White_Space property (see UAX #44 and/or search for Pattern_White_Space in the UCD).

  • U+0009: CHARACTER TABULATION
  • U+000A: LINE FEED (LF)
  • U+000B: LINE TABULATION
  • U+000C: FORM FEED (FF)
  • U+000D: CARRIAGE RETURN (CR)
  • U+0020: SPACE
  • U+0085: NEXT LINE (NEL)
  • U+200E: LEFT-TO-RIGHT MARK
  • U+200F: RIGHT-TO-LEFT MARK
  • U+2028: LINE SEPARATOR
  • U+2029: PARAGRAPH SEPARATOR
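
As a rough illustration of the set listed above, a membership test could be sketched as follows. The function name `is_pattern_white_space` is hypothetical; it does not come from the standard or from P2178, and the code simply hard-codes the eleven code points in the list.

```cpp
// Sketch only: hard-codes the Pattern_White_Space code points listed above.
// The name is_pattern_white_space is an illustrative assumption.
constexpr bool is_pattern_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)   // TAB, LF, VT, FF, CR
        || c == 0x0020                    // SPACE
        || c == 0x0085                    // NEXT LINE (NEL)
        || c == 0x200E || c == 0x200F     // LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
        || c == 0x2028 || c == 0x2029;    // LINE SEPARATOR, PARAGRAPH SEPARATOR
}
```

Because the property is immutable, a fixed table like this would not need to track later Unicode versions.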

The above set of characters excludes the following characters that satisfy the (not immutable) White_Space property (see UAX #44 and/or search for White_Space in the UCD).

  • U+00A0: NO-BREAK SPACE
  • U+1680: OGHAM SPACE MARK
  • U+2000: EN QUAD
  • U+2001: EM QUAD
  • U+2002: EN SPACE
  • U+2003: EM SPACE
  • U+2004: THREE-PER-EM SPACE
  • U+2005: FOUR-PER-EM SPACE
  • U+2006: SIX-PER-EM SPACE
  • U+2007: FIGURE SPACE
  • U+2008: PUNCTUATION SPACE
  • U+2009: THIN SPACE
  • U+200A: HAIR SPACE
  • U+202F: NARROW NO-BREAK SPACE
  • U+205F: MEDIUM MATHEMATICAL SPACE
  • U+3000: IDEOGRAPHIC SPACE

When addressing this issue, we may want to take the opportunity to replace the existing "whitespace" and "white-space" terminology with "blank space"; ISO guidance may require such a renaming in the future.

@tahonermann
Member Author

tahonermann commented Mar 23, 2021

Actually, the standard does supply a list of whitespace characters in [lex.pptoken]p2:

... Preprocessing tokens can be separated by whitespace; this consists of comments ([lex.comment]), or whitespace characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. ...

and again in [lex.token]p1:

... Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “whitespace”), as described below, are ignored except as they serve to separate tokens.

[Note 1: Some whitespace is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters. — end note]
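
For comparison, the five whitespace characters named in the [lex.pptoken]p2 quote above could be tested with something like the following sketch. The function name `is_lex_whitespace_char` is hypothetical, and the sketch deliberately ignores comments, which the standard also counts as whitespace.

```cpp
// Sketch only: the five characters named in [lex.pptoken]p2 as quoted above.
// The name is_lex_whitespace_char is an illustrative assumption.
constexpr bool is_lex_whitespace_char(char c) {
    return c == ' '     // space
        || c == '\t'    // horizontal tab
        || c == '\n'    // new-line
        || c == '\v'    // vertical tab
        || c == '\f';   // form-feed
}
```

Note that carriage return is not in the quoted list, so this predicate rejects `'\r'`, unlike `std::isspace` in the default "C" locale.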

@steve-downey
Collaborator

steve-downey commented Mar 23, 2021 via email

@tahonermann
Member Author

P2295 addresses this. The wording in revision 0 proposes a subset of the characters in P2178; it omits:

  • U+200E: LEFT-TO-RIGHT MARK
  • U+200F: RIGHT-TO-LEFT MARK

tahonermann added the clarification and paper submitted labels on Mar 23, 2021
@tahonermann
Member Author

Later revisions of P2295 no longer address this.

tahonermann added the help wanted and paper needed labels and removed the paper submitted label on Mar 28, 2021
@cor3ntin
Collaborator

P2348, an early draft of which is available at https://isocpp.org/files/papers/D2348R0.pdf, rewords the handling of whitespace and new lines without extending the set.

@tahonermann
Member Author

This issue was discussed on the Unicode.org mailing list. There was a recommendation from a Unicode expert that, for programming languages, Pattern_White_Space may be a useful starting point, but that it might make sense to drop the U+200E and U+200F bidirectional markers and add U+3000 (IDEOGRAPHIC SPACE).
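
Read literally, that recommendation would give a set like the one in this sketch: Pattern_White_Space without U+200E and U+200F, plus U+3000. The function name `is_suggested_white_space` is hypothetical, and this is only one interpretation of the mailing list suggestion, not proposed wording.

```cpp
// Sketch only: Pattern_White_Space minus the two bidirectional marks,
// plus IDEOGRAPHIC SPACE. The name is_suggested_white_space is an
// illustrative assumption.
constexpr bool is_suggested_white_space(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D)   // TAB, LF, VT, FF, CR
        || c == 0x0020                    // SPACE
        || c == 0x0085                    // NEXT LINE (NEL)
        || c == 0x2028 || c == 0x2029     // LINE SEPARATOR, PARAGRAPH SEPARATOR
        || c == 0x3000;                   // IDEOGRAPHIC SPACE (added)
}
```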

@jensmaurer
Collaborator

The total feedback was a single response, though.
