Skip to content

Fix parsing of json with UTF 8 BOM in the host #115967

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 2, 2025

Conversation

vitek-karas
Copy link
Member

There are two bugs in this line:

if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[1] == 0xBF)

A typo where the last comparison should compare the 3rd byte.
Comparing chars to numbers. Chars are treated as signed, so the comparison will actually never succeed because the constant numbers are treated as positive numbers, but the effective value of the BOM will be treated as negative numbers.

This change fixes it and adds tests for both runtimeconfig anf deps.json.

Fixes #115915

@vitek-karas vitek-karas added this to the 10.0.0 milestone May 24, 2025
@vitek-karas vitek-karas self-assigned this May 24, 2025
@Copilot Copilot AI review requested due to automatic review settings May 24, 2025 21:09
Copy link
Contributor

Tagging subscribers to this area: @vitek-karas, @agocke, @VSadov
See info in area-owners.md if you want to be subscribed.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the UTF-8 BOM detection in the native JSON parser, enables writing test files with or without a BOM, and adds tests to ensure both runtimeconfig.json and .deps.json files with BOM are handled correctly.

  • Correct BOM byte comparison in json_parser.cpp and cast literals to char
  • Introduce WriteJsonWithOptionalUtf8Bom helper and update RuntimeConfig.Save to use it
  • Add test coverage for BOM in both runtime config and deps.json

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/native/corehost/json_parser.cpp Fix index typo and cast BOM bytes to char
src/installer/tests/TestUtils/RuntimeConfig.cs Add Save() overload and pass BOM flag to save helper
src/installer/tests/TestUtils/FileUtils.cs Implement WriteJsonWithOptionalUtf8Bom to write with/without BOM
src/installer/tests/HostActivation.Tests/NativeHostApis.cs Extend test to run with/without BOM flag
src/installer/tests/HostActivation.Tests/DependencyResolution/DepsFile.cs Add test verifying .deps.json parsing with BOM

Comment on lines 58 to 59
// Write without UTF8 BOM (default behavior)
File.WriteAllText(filePath, jsonContent);
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default overload of WriteAllText may emit a UTF-8 BOM on some platforms. To guarantee no BOM, use an explicit new UTF8Encoding(encoderShouldEmitUTF8Identifier: false) when writing without BOM.

Suggested change
// Write without UTF8 BOM (default behavior)
File.WriteAllText(filePath, jsonContent);
// Write without UTF8 BOM
File.WriteAllText(filePath, jsonContent, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));

Copilot uses AI. Check for mistakes.

Comment on lines 47 to 48
byte[] utf8Bom = new byte[] { 0xEF, 0xBB, 0xBF };
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent);
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The BOM byte sequence is defined inline. Consider extracting this array into a shared constant or static readonly field to avoid duplication and clarify intent.

Suggested change
byte[] utf8Bom = new byte[] { 0xEF, 0xBB, 0xBF };
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent);
byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent);
byte[] fileBytes = new byte[Utf8Bom.Length + jsonBytes.Length];

Copilot uses AI. Check for mistakes.

@@ -123,7 +123,7 @@ bool json_parser_t::parse_file(const pal::string_t& path)
size_t size = m_size;

// Skip over UTF-8 BOM, if present
if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[1] == 0xBF)
if (size >= 3 && data[0] == (char)0xEF && data[1] == (char)0xBB && data[2] == (char)0xBF)
Copy link
Preview

Copilot AI May 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing signed char values to byte literals can be error-prone. Consider casting the data pointer to unsigned char* or comparing static_cast<unsigned char>(data[i]) to ensure correct behavior across platforms.

Suggested change
if (size >= 3 && data[0] == (char)0xEF && data[1] == (char)0xBB && data[2] == (char)0xBF)
if (size >= 3 && static_cast<unsigned char>(data[0]) == 0xEF && static_cast<unsigned char>(data[1]) == 0xBB && static_cast<unsigned char>(data[2]) == 0xBF)

Copilot uses AI. Check for mistakes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically a correct cast. On a platform like linux arm32, char is unsigned while arm64 has char signed (0xEF > 0x7F). Since we are adding an explicit cast, it makes sense to add it on LHS (or leave it without cast as we have in main).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that I don't know enough about cross-plat C to determine the best way to fix this. If somebody knows, please tell me and I'll change it. Thank you for your help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just apply copilot's suggestion. It is more correct. cc @janvorli

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @am11 here. Since we don't tend to use static_cast in the runtime, I'd use just (unsigned char) style cast

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C-style cast is actually more work for compiler: https://en.cppreference.com/w/cpp/language/explicit_cast

1) When the C-style cast is encountered, the compiler attempts to interpret it as the following cast expressions, in this order:
a) [const_cast](https://en.cppreference.com/w/cpp/language/const_cast.html)<type-id >(unary-expression );
b) [static_cast](https://en.cppreference.com/w/cpp/language/static_cast.html)<type-id >(unary-expression ), with extensions: pointer or reference to a [derived class](https://en.cppreference.com/w/cpp/language/derived_class.html) is additionally allowed to be cast to pointer or reference to unambiguous base class (and vice versa) even if the base class is [inaccessible](https://en.cppreference.com/w/cpp/language/access.html) (that is, this cast ignores the private inheritance specifier). Same applies to casting [pointer to member](https://en.cppreference.com/w/cpp/language/pointer.html) to pointer to member of unambiguous non-virtual base;
c) a static_cast (with extensions) followed by const_cast;
d) [reinterpret_cast](https://en.cppreference.com/w/cpp/language/reinterpret_cast.html)<type-id >(unary-expression );
e) a reinterpret_cast followed by const_cast.

So eventually it will pick the right one (static_cast in this case). But if we know the precise flavor, it's better to use it; makes code more readable and less work for the compiler.

Copy link
Member

@AaronRobinsonMSFT AaronRobinsonMSFT May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So eventually it will pick the right one (static_cast in this case).

It isn't about not picking the right one, it is about what it represents. It doesn't enforce the rules and that is the issue. Overhead on the compiler is generally negligible. This isn't about the cost to the compiler or if it will pick the right one. This is purely about enforcing the narrowest contract we can afford and in this case that is static_cast<>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C-style cast doesn't necessarily represent a reinterpret_cast. It's more involved than that. If we know which one we need, we should use that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're missing the point of what I'm saying. The checklist or how the compiler operates in this case is irrelvant for the user of the language. The point here is the C style cast can be correct in the moment or it can instantly fallback to a naive reinterpret_cast<>. I am well aware of the sequence of operations, but none of that is indicated in code and changing the type of the right side can turn a correct C style cast into a reinterpret_cast<> without an indication to the users, which means for all intents and purposes a C style cast is a lurking reinterpret_cast<>. The implementation of the language here is a sad artifact of history and should never be relied upon so considering it what it will degrade to is the point. Many of these types of descriptions in the standard (cppreference.com included) are there for compiler authors, not C++ users.

Copy link
Member

@am11 am11 May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a consensus and that was on using static_cast in this case. We should continue this practice in C++ code (JIT code uses precise casts heavily).

When I read your "The classic C-style case in C++ is the reinterpret_cast<>" comment, I remembered reading about those governing rules and reinterpret_cast being the last resort. That's why I replied. Your later interpretation is not I'm disagreeing with, but since runtime code base is partially C and partially C++, I wanted to highlight that the C-style cast is not necessarily evil; modern compiler versions, which official builds are using, are smart enough to make a right choice (considering in C mode, that's the only option so it doesn't automatically makes the entire language worse).

@AaronRobinsonMSFT
Copy link
Member

/cc @elinor-fung

Copy link
Member

@elinor-fung elinor-fung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@vitek-karas vitek-karas merged commit 664ac25 into dotnet:main Jun 2, 2025
150 checks passed
@vitek-karas vitek-karas deleted the bomjsontest branch June 17, 2025 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF-8 BOM causes .deps.json and .runtimeconfig.json to break and prevent app from launching
7 participants