Fix parsing of json with UTF 8 BOM in the host #115967

vitek-karas · 2025-05-24T21:09:22Z

There are two bugs in this line:

runtime/src/native/corehost/json_parser.cpp

Line 126 in d896e85

if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[1] == 0xBF)

1. A typo: the last comparison tests data[1] again where it should test the third byte, data[2].
2. Comparing chars to integer literals. Where char is signed, a BOM byte such as 0xEF is read as a negative value, while the literal 0xEF is a positive int, so the comparison never succeeds.
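
To make the two bugs concrete, here is a minimal sketch rather than the actual corehost code. The function names `original_check` and `has_utf8_bom` are hypothetical, and `signed char` is spelled out in the first function (an assumption made only so the outcome does not depend on how a given platform defines plain `char`).

```cpp
#include <cstddef>

// The check as originally written, but taking `signed char` explicitly
// (illustrative assumption) so the result is the same on every platform.
inline bool original_check(const signed char* data, std::size_t size)
{
    // Each data[i] is promoted to int before ==. A stored 0xEF byte becomes
    // -17, which never equals the int literal 0xEF (239).
    // The last term also re-tests data[1] instead of data[2].
    return size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[1] == 0xBF;
}

// One possible fix: read each byte as unsigned before comparing,
// and compare the third byte in the last term.
inline bool has_utf8_bom(const char* data, std::size_t size)
{
    return size >= 3
        && static_cast<unsigned char>(data[0]) == 0xEF
        && static_cast<unsigned char>(data[1]) == 0xBB
        && static_cast<unsigned char>(data[2]) == 0xBF;
}
```

With a buffer that really starts with EF BB BF, `original_check` returns false while `has_utf8_bom` returns true. Compilers may warn that the comparisons in `original_check` are always false, which is exactly the symptom described above.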

This change fixes it and adds tests for both runtimeconfig.json and deps.json.

Fixes #115915

dotnet-policy-service · 2025-05-24T21:09:54Z

Tagging subscribers to this area: @vitek-karas, @agocke, @VSadov
See info in area-owners.md if you want to be subscribed.

Copilot

Pull Request Overview

This PR fixes the UTF-8 BOM detection in the native JSON parser, enables writing test files with or without a BOM, and adds tests to ensure both runtimeconfig.json and .deps.json files with BOM are handled correctly.

Correct BOM byte comparison in json_parser.cpp and cast literals to char
Introduce WriteJsonWithOptionalUtf8Bom helper and update RuntimeConfig.Save to use it
Add test coverage for BOM in both runtime config and deps.json

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

File	Description
src/native/corehost/json_parser.cpp	Fix index typo and cast BOM bytes to `char`
src/installer/tests/TestUtils/RuntimeConfig.cs	Add `Save()` overload and pass BOM flag to save helper
src/installer/tests/TestUtils/FileUtils.cs	Implement `WriteJsonWithOptionalUtf8Bom` to write with/without BOM
src/installer/tests/HostActivation.Tests/NativeHostApis.cs	Extend test to run with/without BOM flag
src/installer/tests/HostActivation.Tests/DependencyResolution/DepsFile.cs	Add test verifying `.deps.json` parsing with BOM

Copilot · 2025-05-24T21:10:29Z

src/installer/tests/TestUtils/FileUtils.cs

+                // Write without UTF8 BOM (default behavior)
+                File.WriteAllText(filePath, jsonContent);


The default overload of WriteAllText may emit a UTF-8 BOM on some platforms. To guarantee no BOM, use an explicit new UTF8Encoding(encoderShouldEmitUTF8Identifier: false) when writing without BOM.

Suggested change

-                // Write without UTF8 BOM (default behavior)
-                File.WriteAllText(filePath, jsonContent);
+                // Write without UTF8 BOM
+                File.WriteAllText(filePath, jsonContent, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));

Copilot · 2025-05-24T21:10:29Z

src/installer/tests/TestUtils/FileUtils.cs

+                byte[] utf8Bom = new byte[] { 0xEF, 0xBB, 0xBF };
+                byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent);


[nitpick] The BOM byte sequence is defined inline. Consider extracting this array into a shared constant or static readonly field to avoid duplication and clarify intent.

Suggested change

-                byte[] utf8Bom = new byte[] { 0xEF, 0xBB, 0xBF };
-                byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent);
+                byte[] jsonBytes = Encoding.UTF8.GetBytes(jsonContent);
+                byte[] fileBytes = new byte[Utf8Bom.Length + jsonBytes.Length];

Copilot · 2025-05-24T21:10:30Z

src/native/corehost/json_parser.cpp


    // Skip over UTF-8 BOM, if present
-    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[1] == 0xBF)
+    if (size >= 3 && data[0] == (char)0xEF && data[1] == (char)0xBB && data[2] == (char)0xBF)


Comparing signed char values to byte literals can be error-prone. Consider casting the data pointer to unsigned char* or comparing static_cast<unsigned char>(data[i]) to ensure correct behavior across platforms.

Suggested change

-    if (size >= 3 && data[0] == (char)0xEF && data[1] == (char)0xBB && data[2] == (char)0xBF)
+    if (size >= 3 && static_cast<unsigned char>(data[0]) == 0xEF && static_cast<unsigned char>(data[1]) == 0xBB && static_cast<unsigned char>(data[2]) == 0xBF)

This is technically a correct cast. On a platform like linux arm32, char is unsigned while arm64 has char signed (0xEF > 0x7F). Since we are adding an explicit cast, it makes sense to add it on LHS (or leave it without cast as we have in main).
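
As a rough illustration of the "which side to cast" question, the sketch below compares the two spellings for a single byte. The helper names are hypothetical and not from the PR. Whether plain `char` is signed is implementation-defined, so the constant only reports what the current compiler does.

```cpp
#include <limits>

// Reports how this particular compiler/target defines plain `char`.
constexpr bool plain_char_is_signed = std::numeric_limits<char>::is_signed;

// Cast on the right-hand side: narrow the literal to char, as in the PR's diff.
inline bool is_ef_rhs_cast(char c)
{
    return c == (char)0xEF;
}

// Cast on the left-hand side: widen the byte as unsigned, as in the suggestion.
inline bool is_ef_lhs_cast(char c)
{
    return static_cast<unsigned char>(c) == 0xEF;
}
```

On common compilers the two helpers agree for every possible byte value regardless of `plain_char_is_signed`. The left-hand cast states the intent (treat the data as raw bytes) more directly, which seems to be the argument made in this thread.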

I admit that I don't know enough about cross-plat C to determine the best way to fix this. If somebody knows, please tell me and I'll change it. Thank you for your help.

I think we should just apply copilot's suggestion. It is more correct. cc @janvorli

I agree with @am11 here. Since we don't tend to use static_cast in the runtime, I'd use just (unsigned char) style cast

C-style cast is actually more work for the compiler: https://en.cppreference.com/w/cpp/language/explicit_cast

> 1) When the C-style cast is encountered, the compiler attempts to interpret it as the following cast expressions, in this order:
> a) [const_cast](https://en.cppreference.com/w/cpp/language/const_cast.html)<type-id>(unary-expression);
> b) [static_cast](https://en.cppreference.com/w/cpp/language/static_cast.html)<type-id>(unary-expression), with extensions: pointer or reference to a [derived class](https://en.cppreference.com/w/cpp/language/derived_class.html) is additionally allowed to be cast to pointer or reference to unambiguous base class (and vice versa) even if the base class is [inaccessible](https://en.cppreference.com/w/cpp/language/access.html) (that is, this cast ignores the private inheritance specifier). Same applies to casting [pointer to member](https://en.cppreference.com/w/cpp/language/pointer.html) to pointer to member of unambiguous non-virtual base;
> c) a static_cast (with extensions) followed by const_cast;
> d) [reinterpret_cast](https://en.cppreference.com/w/cpp/language/reinterpret_cast.html)<type-id>(unary-expression);
> e) a reinterpret_cast followed by const_cast.

So eventually it will pick the right one (static_cast in this case). But if we know the precise flavor, it's better to use it; makes code more readable and less work for the compiler.

> So eventually it will pick the right one (static_cast in this case).

It isn't about not picking the right one, it is about what it represents. It doesn't enforce the rules and that is the issue. Overhead on the compiler is generally negligible. This isn't about the cost to the compiler or if it will pick the right one. This is purely about enforcing the narrowest contract we can afford and in this case that is static_cast<>.

C-style cast doesn't necessarily represent a reinterpret_cast. It's more involved than that. If we know which one we need, we should use that.

I think you're missing the point of what I'm saying. The checklist, or how the compiler operates in this case, is irrelevant for the user of the language. The point here is that the C-style cast can be correct in the moment or it can instantly fall back to a naive reinterpret_cast<>. I am well aware of the sequence of operations, but none of that is indicated in code, and changing the type of the right side can turn a correct C-style cast into a reinterpret_cast<> without any indication to the user, which means that for all intents and purposes a C-style cast is a lurking reinterpret_cast<>. The implementation of the language here is a sad artifact of history and should never be relied upon, so what it will degrade to is the point. Many of these types of descriptions in the standard (cppreference.com included) are there for compiler authors, not C++ users.
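
The "lurking reinterpret_cast<>" concern can be sketched at compile time. The probe below is an illustrative assumption, not code from the runtime repository, and every name in it is hypothetical. It asks the compiler whether a given cast expression is well-formed for a pair of types, once spelled as `static_cast` and once as a C-style cast.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

namespace cast_probe {
// The int overload participates only when the cast expression is well-formed;
// otherwise overload resolution falls back to the (...) overload.
template <typename To, typename From>
auto try_static(int) -> decltype(static_cast<To>(std::declval<From>()), std::true_type{});
template <typename To, typename From>
auto try_static(...) -> std::false_type;

template <typename To, typename From>
auto try_c_style(int) -> decltype((To)std::declval<From>(), std::true_type{});
template <typename To, typename From>
auto try_c_style(...) -> std::false_type;
} // namespace cast_probe

template <typename To, typename From>
struct static_cast_ok : decltype(cast_probe::try_static<To, From>(0)) {};
template <typename To, typename From>
struct c_style_cast_ok : decltype(cast_probe::try_c_style<To, From>(0)) {};

// char -> unsigned char: both spellings are a plain value conversion.
static_assert(static_cast_ok<unsigned char, char>::value, "static_cast accepts char");
static_assert(c_style_cast_ok<unsigned char, char>::value, "C-style cast accepts char");

// Change the operand to a pointer: static_cast is rejected, while the
// C-style cast still compiles because it falls through to reinterpret_cast.
static_assert(!static_cast_ok<std::uintptr_t, const char*>::value, "static_cast rejects a pointer");
static_assert(c_style_cast_ok<std::uintptr_t, const char*>::value, "C-style cast accepts a pointer");
```

For the BOM check both spellings behave identically. The difference only shows once an operand type changes, which is when `static_cast` turns a silent reinterpretation into a compile error.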

I think we have a consensus and that was on using static_cast in this case. We should continue this practice in C++ code (JIT code uses precise casts heavily).

When I read your "The classic C-style case in C++ is the reinterpret_cast<>" comment, I remembered reading about those governing rules and reinterpret_cast being the last resort. That's why I replied. Your later interpretation is not what I'm disagreeing with, but since the runtime code base is partially C and partially C++, I wanted to highlight that the C-style cast is not necessarily evil; modern compiler versions, which official builds are using, are smart enough to make the right choice (in C mode that's the only option, so it doesn't automatically make the entire language worse).

src/installer/tests/TestUtils/FileUtils.cs

AaronRobinsonMSFT · 2025-05-26T00:00:06Z

/cc @elinor-fung

src/installer/tests/TestUtils/FileUtils.cs

elinor-fung

Thanks!

vitek-karas added 3 commits May 23, 2025 22:10

Add tests separately for runtimeconfig and deps

3fc57c6

Fix the bug and tests

44ccd81

Formating fixes

1da6194

vitek-karas added this to the 10.0.0 milestone May 24, 2025

vitek-karas self-assigned this May 24, 2025

Copilot AI review requested due to automatic review settings May 24, 2025 21:09

vitek-karas added the area-Host label May 24, 2025

Copilot AI reviewed May 24, 2025

vitek-karas mentioned this pull request May 24, 2025

Fix typo in UTF-8 BOM detection in corehost #115942

Closed

agocke approved these changes May 24, 2025

GerardSmit reviewed May 24, 2025

src/installer/tests/TestUtils/FileUtils.cs (outdated, resolved)

This was referenced May 25, 2025

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

Test failure: baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.cmd #110173

Open

elinor-fung reviewed May 27, 2025

src/installer/tests/TestUtils/FileUtils.cs (outdated, resolved)

vitek-karas added 2 commits May 28, 2025 16:40

Feedback from code review

bd850a7

Merge branch 'main' into bomjsontest

1a9e8ea

elinor-fung approved these changes May 28, 2025

Merge branch 'main' into bomjsontest

bae6ba7

vitek-karas enabled auto-merge (squash) May 29, 2025 07:01

vitek-karas merged commit 664ac25 into dotnet:main Jun 2, 2025
150 checks passed

vitek-karas deleted the bomjsontest branch June 17, 2025 07:43

github-actions bot locked and limited conversation to collaborators Jul 17, 2025

Conversation

vitek-karas commented May 24, 2025
dotnet-policy-service bot commented May 24, 2025
Copilot AI left a comment
Copilot AI May 24, 2025
Copilot AI May 24, 2025
Copilot AI May 24, 2025
am11 May 25, 2025
vitek-karas May 26, 2025
am11 May 26, 2025
janvorli May 26, 2025
am11 May 27, 2025
AaronRobinsonMSFT May 27, 2025 (edited)
am11 May 27, 2025
AaronRobinsonMSFT May 28, 2025
am11 May 28, 2025 (edited)
AaronRobinsonMSFT commented May 26, 2025
elinor-fung left a comment