---
title: "Please Don't Rewrite My String Literals"
document: D1879R1
date: 2019-08-23
audience:
  - SG16
  - EWG-I
author:
  - name: Zach Laine
    email: <whatwasthataddress@gmail.com>
toc: false
monofont: "DejaVu Sans Mono"
---

# Revisions

- R1
  - Use the terminology conventions from [@P1859R0].
  - Add Tony Tables.
  - Explain why MSVC's current behavior is conforming.
  - Be explicit that only string literals are affected (and not identifiers).

# Terminology

The terminology used in this paper follows the conventions introduced in [@P1859R0].

# The u8 string literal prefix does not do what you think it does

I was writing tests for a Unicode library for Boost. The tests included some non-ASCII string literals, at least one of which included Unicode U+03C2, "GREEK SMALL LETTER FINAL SIGMA". In UTF-8, it is two code units, 0xcf and 0x82. In the editor in which I initially wrote that code point, I entered it as those two code units, and the editor showed it as I have here: a single glyph representing code point U+03C2.

Note that in every editor I used during the testing process, I saw the source code as u8"ς". A hex dump of the source file showed 0xcf 0x82 for this code point.

I started on Linux, got the tests passing, and then ran them on Mac. So far, so good. Then, I ran them on MSVC, where they did not pass. Sometime during the resulting investigation, I wrote this expression, which evaluated to true:

```cpp
strlen(u8"ς") == 5
```

After asking around a bit online, I learned about the /utf-8 MSVC compiler flag. That flag fixed my tests.

To see why, consider this variable declaration:

```cpp
char str[3] = u8"ς";
```

When compiled with GCC, with Clang, or with MSVC given either a UTF-8 BOM in the source file or the /utf-8 flag, that declaration is equivalent to this:

```cpp
char str[3] = {0xcf, 0x82, 0x0};
```

When compiled with MSVC without a BOM or the /utf-8 flag, it is equivalent to this:

```cpp
char str[6] = {0xc3, 0x8f, 0xe2, 0x80, 0x9a, 0x0};
```

So, my two UTF-8 code units were silently rewritten as 5 chars: MSVC read the source bytes 0xcf 0x82 as Windows-1252, in which they denote U+00CF "Ï" and U+201A "‚", and then encoded those two code points into the literal as five UTF-8 code units that no longer spell "ς". To make matters worse, the declaration without the u8 prefix gets me back to the bits I want, regardless of whether the /utf-8 flag is in use:

```cpp
char str[3] = "ς"; // Identical to "char str[3] = {0xcf, 0x82, 0x0};".
```
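
To see the rewriting directly, you can dump the code units the compiler actually stored. Here is a minimal sketch; like the rest of this paper's examples, it assumes the pre-C++20 behavior in which a u8 string literal initializes an array of char:

```cpp
#include <cstdio>
#include <cstring>

int main() {
    char const str[] = u8"ς";
    // Prints "cf 82 " where the source bytes are read as UTF-8, and
    // "c3 8f e2 80 9a " under MSVC's default Windows-1252 interpretation.
    for (std::size_t i = 0; i < std::strlen(str); ++i)
        std::printf("%02x ", static_cast<unsigned char>(str[i]));
    std::printf("\n");
}
```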

# That's a lousy user experience

This is well-defined, non-erroneous behavior on the part of all compilers involved. All the modes of compilation above are standards-conforming as far as I know. In [lex.phases], the description of phase 1 of translation starts with this sentence:

> [1]{.pnum}Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined.

This applies to all characters in the file, regardless of whether they are within a u8-, u-, or U-prefixed string literal.

Many users who deal in Unicode on Windows (or portably) already know about this issue and are dealing with it. However, SG16 is trying its best to get Unicode support into standard C++. As such, the hope is that we'll get more Unicode-naive users to start using Unicode-aware C++ features to future-proof their code. Such users are going to write nonportable string literals all over the place if the status quo remains.

Specifically, users must be taught that they cannot portably use the u8 string literal prefix for a string literal they know to be UTF-8 encoded; in some cases, they must omit the u8 prefix to get a UTF-8 encoded string literal in their final program.
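
Until then, a codebase can defend itself with a compile-time check. This is only a sketch: the expected value 3 assumes the two UTF-8 code units of U+03C2 plus the null terminator.

```cpp
// Fails to compile if the source bytes of "ς" were reinterpreted in a
// non-UTF-8 source character set (e.g. read as Windows-1252, which
// yields 5 code units instead of 2, so sizeof would be 6).
static_assert(sizeof(u8"ς") == 3, "u8 literal was not treated as UTF-8");
```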

# The fix

To fix this, I want to make it ill-formed for a u8-, u-, or U-prefixed string literal to appear in a TU whose source and/or execution character sets would cause the meaning of the literal to change. The meaning of the literal is preserved if:

- the bits do not change from what the user wrote in the source file, or

- the literal is transcoded to another UTF format, such that the original code points represented by the original bits in the original UTF format are preserved.

This lets users specify that they want a particular UTF character set to be used for a particular string literal, see it in their editor the way they entered it, and have it appear in their object code with no unexpected semantics.
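
As an illustration of the second bullet above, here is a sketch that assumes a UTF-8 source file: the bits of a u-prefixed literal necessarily change during translation, but the code points are preserved, so it stays well-formed under this proposal.

```cpp
// The UTF-8 source bytes 0xcf 0x82 denote U+03C2; the u prefix asks for
// UTF-16, so the literal is transcoded to the single code unit 0x03c2.
// The bits change, but the code points do not.
char16_t str[] = u"ς";
static_assert(sizeof(str) / sizeof(str[0]) == 2,
              "one UTF-16 code unit plus the null terminator");
```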

Note that this only applies to transformations applied to the contents of u8-, u-, or U-prefixed string literals during phases 1 and 5 of translation. In particular, it does not apply to identifiers. So, this code would retain its current semantics:

```cpp
char ς[] = "status quo";
```

Also note that this change will not silently change any existing code, and it will not cause any existing string literal to be diagnosed if that literal's meaning happens to be preserved. In particular, ASCII contents of u8 string literals will not be ill-formed when the source character set is Windows-1252 or EBCDIC.

# Semantic changes

Below, I show differences across source character sets. The same results apply to differences across execution character sets.

::: tonytable

### Before
```cpp
// Source character set: Unicode (UTF-8)
char str[] = u8"ς";
assert(strlen(str) == 2); // ok
assert(strlen(str) == 5); // error
```

### After
```cpp
// Source character set: Unicode (UTF-8)
char str[] = u8"ς";
assert(strlen(str) == 2); // ok
assert(strlen(str) == 5); // error
```

```cpp
// Source character set: Unicode (UTF-16)
char str[] = u8"ς";
assert(strlen(str) == 2); // ok
assert(strlen(str) == 5); // error
```

```cpp
// Source character set: Unicode (UTF-16)
char str[] = u8"ς";
assert(strlen(str) == 2); // ok
assert(strlen(str) == 5); // error
```

```cpp
// Source character set: Windows-1252
char str[6] = u8"ς";
assert(strlen(str) == 2); // error
assert(strlen(str) == 5); // ok
```

```cpp
// Source character set: Windows-1252
char str[3] = u8"ς";      // ill-formed
assert(strlen(str) == 2); // never evaluated
assert(strlen(str) == 5); // never evaluated
```

```cpp
// Source character set: Windows-1252
char str[] = u8"asdf";
assert(strlen(str) == 4); // ok
```

```cpp
// Source character set: Windows-1252
char str[] = u8"asdf";    // still well-formed
assert(strlen(str) == 4); // ok
```

```cpp
// Any source character set
char ς[] = u8"asdf";
assert(strlen(ς) == 4);   // ok
```

```cpp
// Any source character set
char ς[] = u8"asdf";      // same well-formedness
assert(strlen(ς) == 4);   // ok
```

```cpp
// Any source character set
char ς[] = "ς";
```

```cpp
// Any source character set
char ς[] = "ς";           // same well-formedness
```

:::