Use non-BOM encodings #2370

filmor · 2024-05-04T14:54:56Z

Use non-BOM encodings for both C#->Python and Python->C# conversions: the byte order is always the native one, and these strings never carry a BOM.

Fixes #2369.

The documentation of `PyUnicode_DecodeUTF16`, which we use, states that when `byteorder` is not passed or `*byteorder` is 0, the first two bytes are interpreted as a BOM (U+FEFF, zero-width no-break space) and skipped if they match one. That is incorrect when converting a string known to have no BOM, which is the case for all strings coming from C#.
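To make the `byteorder` behaviour concrete, here is a minimal sketch that calls `PyUnicode_DecodeUTF16` through `ctypes.pythonapi` on CPython. It is only an illustration, not the code changed in this PR; the helper `decode` and the variable names are made up for the example.

```python
import ctypes
import sys

# Bind the C API function; signature per the CPython documentation:
# PyObject *PyUnicode_DecodeUTF16(const char *str, Py_ssize_t size,
#                                 const char *errors, int *byteorder)
decode_utf16 = ctypes.pythonapi.PyUnicode_DecodeUTF16
decode_utf16.restype = ctypes.py_object
decode_utf16.argtypes = [
    ctypes.c_char_p,
    ctypes.c_ssize_t,
    ctypes.c_char_p,
    ctypes.POINTER(ctypes.c_int),
]

little = sys.byteorder == "little"
native_codec = "utf-16-le" if little else "utf-16-be"
native_order = -1 if little else 1  # -1: little endian, 1: big endian

# A BOM-less UTF-16 buffer whose first character happens to be U+FEFF.
text = "\ufeffabc"
raw = text.encode(native_codec)


def decode(data, order):
    """Hypothetical helper: decode `data` with the given *byteorder value."""
    byteorder = ctypes.c_int(order)
    return decode_utf16(data, len(data), None, ctypes.byref(byteorder))


auto_detected = decode(raw, 0)               # leading U+FEFF is consumed as a BOM
explicit_native = decode(raw, native_order)  # leading U+FEFF is kept as data
```

With `*byteorder` set to 0 the leading character is lost, while the explicit native byte order returns the string unchanged.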

lostmsu · 2024-05-06T06:48:13Z

@filmor can you ELI5? For someone not familiar with intricacies of BOM, but aware of byte order issues.

My biggest question is whether this change could introduce bugs when handling strings that actually have a BOM. E.g. imagine a scenario where someone serialized and persisted something with a BOM using 3.0.3; after this change in 3.0.4, if they read it back, the BOM will be in their string data.

filmor · 2024-05-06T09:11:46Z

It's the other way round. Strings that are passed between Python and .NET are UTF-16 in the respective native byte order (usually LE), without a BOM. The functions that we were using for the conversions (in particular `PyUnicode_DecodeUTF16` and the default encoding objects from `Encoding`) try to be "smart" and will interpret a leading FE FF or FF FE as the byte order mark, removing it from the converted string. Passing the correct endianness explicitly disables this behaviour.
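The same "smart" versus explicit distinction can be seen with Python's built-in codecs, used here purely as a stand-in for the conversion functions named above. This is a hedged sketch: the variable names are illustrative and none of this is pythonnet code.

```python
# U+FEFF at the start of this string is real data, not a marker.
text = "\ufeffhello"

# Explicit-endian codec: encodes exactly the code units, no BOM added.
explicit_bytes = text.encode("utf-16-le")

# Generic codec: prepends a BOM in the platform's native byte order.
generic_bytes = "hello".encode("utf-16")
has_bom = generic_bytes[:2] in (b"\xff\xfe", b"\xfe\xff")

# Decoding BOM-less data with the generic codec drops the leading FF FE,
# because it is taken for a byte order mark.
lossy = explicit_bytes.decode("utf-16")

# Decoding with the explicit codec does no BOM detection: lossless round trip.
lossless = explicit_bytes.decode("utf-16-le")
```

Only the explicit-endian pair round-trips a string that legitimately begins with U+FEFF.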

filmor force-pushed the fix-bom-strings branch 2 times, most recently from 49db3bf to 07f65c7 Compare May 5, 2024 18:41

filmor added 2 commits May 5, 2024 20:42

Use non-BOM encodings

4c46c6d

filmor force-pushed the fix-bom-strings branch from 07f65c7 to dc6f5ef Compare May 5, 2024 18:42

filmor marked this pull request as ready for review May 5, 2024 18:42

filmor requested a review from lostmsu May 5, 2024 18:44

filmor self-assigned this May 7, 2024

lostmsu merged commit 195cde6 into pythonnet:master May 10, 2024
27 checks passed

filmor deleted the fix-bom-strings branch May 10, 2024 19:55
