-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Fix parsing of json with UTF 8 BOM in the host #115967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @vitek-karas, @agocke, @VSadov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes the UTF-8 BOM detection in the native JSON parser, enables writing test files with or without a BOM, and adds tests to ensure both runtimeconfig.json
and .deps.json
files with BOM are handled correctly.
- Correct BOM byte comparison in
json_parser.cpp
and cast literals tochar
- Introduce
WriteJsonWithOptionalUtf8Bom
helper and updateRuntimeConfig.Save
to use it - Add test coverage for BOM in both runtime config and deps.json
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
src/native/corehost/json_parser.cpp | Fix index typo and cast BOM bytes to char |
src/installer/tests/TestUtils/RuntimeConfig.cs | Add Save() overload and pass BOM flag to save helper |
src/installer/tests/TestUtils/FileUtils.cs | Implement WriteJsonWithOptionalUtf8Bom to write with/without BOM |
src/installer/tests/HostActivation.Tests/NativeHostApis.cs | Extend test to run with/without BOM flag |
src/installer/tests/HostActivation.Tests/DependencyResolution/DepsFile.cs | Add test verifying .deps.json parsing with BOM |
// Write without UTF8 BOM (default behavior) | ||
File.WriteAllText(filePath, jsonContent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default overload of WriteAllText may emit a UTF-8 BOM on some platforms. To guarantee no BOM, use an explicit new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)
when writing without BOM.
// Write without UTF8 BOM (default behavior) | |
File.WriteAllText(filePath, jsonContent); | |
// Write without UTF8 BOM | |
File.WriteAllText(filePath, jsonContent, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)); |
Copilot uses AI. Check for mistakes.
byte[] utf8Bom = new byte[] { 0xEF, 0xBB, 0xBF }; | ||
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The BOM byte sequence is defined inline. Consider extracting this array into a shared constant or static readonly field to avoid duplication and clarify intent.
byte[] utf8Bom = new byte[] { 0xEF, 0xBB, 0xBF }; | |
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent); | |
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent); | |
byte[] fileBytes = new byte[Utf8Bom.Length + jsonBytes.Length]; |
Copilot uses AI. Check for mistakes.
src/native/corehost/json_parser.cpp
Outdated
@@ -123,7 +123,7 @@ bool json_parser_t::parse_file(const pal::string_t& path) | |||
size_t size = m_size; | |||
|
|||
// Skip over UTF-8 BOM, if present | |||
if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[1] == 0xBF) | |||
if (size >= 3 && data[0] == (char)0xEF && data[1] == (char)0xBB && data[2] == (char)0xBF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparing signed char
values to byte literals can be error-prone. Consider casting the data pointer to unsigned char*
or comparing static_cast<unsigned char>(data[i])
to ensure correct behavior across platforms.
if (size >= 3 && data[0] == (char)0xEF && data[1] == (char)0xBB && data[2] == (char)0xBF) | |
if (size >= 3 && static_cast<unsigned char>(data[0]) == 0xEF && static_cast<unsigned char>(data[1]) == 0xBB && static_cast<unsigned char>(data[2]) == 0xBF) |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically a correct cast. On a platform like linux arm32, char is unsigned while arm64 has char signed (0xEF > 0x7F). Since we are adding an explicit cast, it makes sense to add it on LHS (or leave it without cast as we have in main).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit that I don't know enough about cross-plat C to determine the best way to fix this. If somebody knows, please tell me and I'll change it. Thank you for your help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just apply copilot's suggestion. It is more correct. cc @janvorli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @am11 here. Since we don't tend to use static_cast
in the runtime, I'd use just (unsigned char)
style cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C-style cast is actually more work for compiler: https://en.cppreference.com/w/cpp/language/explicit_cast
1) When the C-style cast is encountered, the compiler attempts to interpret it as the following cast expressions, in this order:
a) [const_cast](https://en.cppreference.com/w/cpp/language/const_cast.html)<type-id >(unary-expression );
b) [static_cast](https://en.cppreference.com/w/cpp/language/static_cast.html)<type-id >(unary-expression ), with extensions: pointer or reference to a [derived class](https://en.cppreference.com/w/cpp/language/derived_class.html) is additionally allowed to be cast to pointer or reference to unambiguous base class (and vice versa) even if the base class is [inaccessible](https://en.cppreference.com/w/cpp/language/access.html) (that is, this cast ignores the private inheritance specifier). Same applies to casting [pointer to member](https://en.cppreference.com/w/cpp/language/pointer.html) to pointer to member of unambiguous non-virtual base;
c) a static_cast (with extensions) followed by const_cast;
d) [reinterpret_cast](https://en.cppreference.com/w/cpp/language/reinterpret_cast.html)<type-id >(unary-expression );
e) a reinterpret_cast followed by const_cast.
So eventually it will pick the right one (static_cast in this case). But if we know the precise flavor, it's better to use it; makes code more readable and less work for the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So eventually it will pick the right one (static_cast in this case).
It isn't about not picking the right one, it is about what it represents. It doesn't enforce the rules and that is the issue. Overhead on the compiler is generally negligible. This isn't about the cost to the compiler or if it will pick the right one. This is purely about enforcing the narrowest contract we can afford and in this case that is static_cast<>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C-style cast doesn't necessarily represent a reinterpret_cast
. It's more involved than that. If we know which one we need, we should use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're missing the point of what I'm saying. The checklist or how the compiler operates in this case is irrelvant for the user of the language. The point here is the C style cast can be correct in the moment or it can instantly fallback to a naive reinterpret_cast<>
. I am well aware of the sequence of operations, but none of that is indicated in code and changing the type of the right side can turn a correct C style cast into a reinterpret_cast<>
without an indication to the users, which means for all intents and purposes a C style cast is a lurking reinterpret_cast<>
. The implementation of the language here is a sad artifact of history and should never be relied upon so considering it what it will degrade to is the point. Many of these types of descriptions in the standard (cppreference.com included) are there for compiler authors, not C++ users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a consensus and that was on using static_cast
in this case. We should continue this practice in C++ code (JIT code uses precise casts heavily).
When I read your "The classic C-style case in C++ is the reinterpret_cast<>
" comment, I remembered reading about those governing rules and reinterpret_cast
being the last resort. That's why I replied. Your later interpretation is not I'm disagreeing with, but since runtime code base is partially C and partially C++, I wanted to highlight that the C-style cast is not necessarily evil; modern compiler versions, which official builds are using, are smart enough to make a right choice (considering in C mode, that's the only option so it doesn't automatically makes the entire language worse).
/cc @elinor-fung |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There are two bugs in this line:
runtime/src/native/corehost/json_parser.cpp
Line 126 in d896e85
A typo where the last comparison should compare the 3rd byte.
Comparing chars to numbers. Chars are treated as signed, so the comparison will actually never succeed because the constant numbers are treated as positive numbers, but the effective value of the BOM will be treated as negative numbers.
This change fixes it and adds tests for both runtimeconfig anf deps.json.
Fixes #115915