Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: set /utf-8 flag when using MSVC #4975

Merged
merged 1 commit into from Feb 22, 2023
Merged

Conversation

fghzxm
Copy link
Contributor

@fghzxm fghzxm commented Feb 21, 2023

MSVC by default decodes source files in the current Windows code page if they don't have a Unicode BOM, and also encodes strings and chars into the current code page before storing them into the compiled binary. Our files are always encoded in UTF-8, and our code always assumes runtime strings are encoded in UTF-8, so we should pass the /utf-8 flag to MSVC.

Microsoft Docs:
https://learn.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8

@fghzxm
Copy link
Contributor Author

fghzxm commented Feb 21, 2023

Clarification: "strings" and "chars" refer to only narrow (char) strings and chars, not wide (wchar_t) ones.

@mikedld
Copy link
Member

mikedld commented Feb 21, 2023

Is there a specific issue that you're trying to solve with this?

@fghzxm
Copy link
Contributor Author

fghzxm commented Feb 21, 2023

When building Transmission on Windows, if the active code page is set to certain values (mine is 936 (GBK)), MSVC fails to correctly parse certain C++ files. For example, compiling any C++ file that includes libtransmission/utils.h on code page 936 yields the errors

<...>\libtransmission\utils.h(1): warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in Unicode format to prevent data loss
<...>\libtransmission\utils.h(216): error C2146: syntax error: missing ')' before identifier 'size_t'
<...>\libtransmission\utils.h(216): error C3646: 'size_t': unknown override specifier
<...>\libtransmission\utils.h(216): error C3646: 'tr_strvToBuf': unknown override specifier
<...>\libtransmission\utils.h(216): error C2059: syntax error: '('
<...>\libtransmission\utils.h(216): error C2062: type 'char' unexpected
<...>\libtransmission\utils.h(216): error C2059: syntax error: ')'

The warning on line 1 refers to the copyright symbol (©). The errors on line 216 are caused by a U+FFFD (0xEF 0xBF 0xBD in UTF-8 encoding) character inside a comment on line 207:

// vvv Line 207 vvv   ---------------------------------------------------------------------------------v U+FFFD
[[nodiscard]] std::string tr_strv_replace_invalid(std::string_view sv, uint32_t replacement = 0xFFFD /**/);

/**
 * @brief copies `src` into `buf`.
 *
 * - Always returns std::size(src).
 * - `src` will be copied into `buf` iff `buflen >= std::size(src)`
 * - `buf` will also be zero terminated iff `buflen >= std::size(src) + 1`.
 */ // <<< Line 215 <<<
size_t tr_strvToBuf(std::string_view src, char* buf, size_t buflen);
// ^^^ Line 216 ^^^

Code page 936 (GBK) is a multi-byte encoding; A byte in the 0x81-0xFE range is combined with the following byte to form a single character. When MSVC tries to decode the file at the position of the U+FFFD character, it sees 0xEF 0xBF 0xBD 0x2A, 0x2A being the * character, and interprets them as 2 multi-byte characters. It therefore fails to see the comment closer */, eats the ); as if it were part of the comment, and only closes the comment on line 215.

Since the repository currently has many files that contain non-ASCII characters, the solution is not to just remove the offending character in libtransmission/utils.h, but to tell MSVC to properly decode our files from the beginning. Also, the issue is even scarier for code pages on which MSVC happens to parse and compile all the files; because many non-ASCII characters are located in string literals, MSVC will silently decode the bytes incorrectly, producing corrupted string literals from the parser which will ultimately end up in the produced binaries.

@ckerr ckerr requested a review from mikedld February 21, 2023 12:33
MSVC by default decodes source files in the current Windows code page if
they don't have a Unicode BOM, and also encodes strings and chars into
the current code page before storing them into the compiled binary. Our
files are always encoded in UTF-8, and our code always assumes runtime
strings are encoded in UTF-8, so we should pass the `/utf-8` flag to
MSVC.

Microsoft Docs:
https://learn.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants