Use non-BOM encodings #2370

Merged
merged 2 commits into from
May 10, 2024

filmor
Member

@filmor filmor commented May 4, 2024

Use non-BOM encodings for both C#->Python and Python->C#, as the byte order is always the native one and a BOM is never used.

Fixes #2369.

@filmor filmor force-pushed the fix-bom-strings branch 2 times, most recently from 49db3bf to 07f65c7 Compare May 5, 2024 18:41
filmor added 2 commits May 5, 2024 20:42
The documentation of `PyUnicode_DecodeUTF16` states that when `byteorder`
is not passed or `*byteorder` is 0, the first two bytes, if they are the
BOM (U+FEFF, zero-width no-break space), are interpreted as a byte order
mark and skipped. That is incorrect when we convert a known "non BOM"
string, which all strings from C# are.
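
The behaviour described in the commit message can be reproduced from Python itself. This is a minimal sketch, not the pythonnet code: it assumes CPython and uses `codecs.utf_16_ex_decode`, the helper the standard `utf-16` codec calls internally, which takes a byte order argument with the same convention as the C API (-1 little endian, 0 auto-detect, 1 big endian). The sample string and variable names are made up for illustration.

```python
import codecs
import sys

# Byte order values follow the C API convention:
# -1 = little endian, 0 = auto-detect from a BOM, 1 = big endian.
NATIVE_ORDER = -1 if sys.byteorder == "little" else 1
native_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# A string whose first character is a real U+FEFF, encoded in native
# byte order. The explicit-endian codec does not prepend a BOM.
raw = "\ufeffabc".encode(native_codec)

# byteorder=0: the leading two bytes are taken as a BOM and dropped.
auto_text, _, detected = codecs.utf_16_ex_decode(raw, "strict", 0, True)

# Explicit byte order: no BOM handling, every code unit is decoded as data.
explicit_text, _, _ = codecs.utf_16_ex_decode(raw, "strict", NATIVE_ORDER, True)

print(repr(auto_text))      # 'abc'
print(repr(explicit_text))  # '\ufeffabc'
```

With auto-detection the leading character is silently lost, which matches the symptom in the linked issue.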
@filmor filmor marked this pull request as ready for review May 5, 2024 18:42
@filmor filmor requested a review from lostmsu May 5, 2024 18:44
@lostmsu
Member

lostmsu commented May 6, 2024

@filmor can you ELI5? For someone not familiar with intricacies of BOM, but aware of byte order issues.

My biggest question is whether this change has any potential to introduce bugs in handling strings that actually have a BOM. E.g. imagine a scenario where someone serialized and persisted something with a BOM using 3.0.3; after this change, if they read it back in 3.0.4, the BOM will be in their string data.

@filmor
Member Author

filmor commented May 6, 2024

It's the other way round. Strings that are being passed between Python and .NET are UTF-16 in the respective native byte order (usually LE), without a BOM. The functions that we were using for the conversions (in particular `PyUnicode_DecodeUTF16` and the default encoding objects from `Encoding`) try to be "smart" and will interpret a leading FE FF or FF FE as the byte order mark, removing it from the converted string. By passing the correct endianness explicitly, this behaviour is disabled.
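
To illustrate why the explicit byte order is the safe choice in both directions, here is a rough simulation of a string crossing the boundary, using only Python's built-in codecs. It is an analogy rather than the actual pythonnet conversion code; the `roundtrip` helper and the sample text are hypothetical.

```python
import sys

# Explicit-endian codec matching the machine's native byte order.
NATIVE_CODEC = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def roundtrip(text, decode_codec):
    """Encode as native-order UTF-16 without a BOM, then decode with decode_codec."""
    raw = text.encode(NATIVE_CODEC)
    return raw.decode(decode_codec)


text = "\ufeffhello"  # starts with a genuine zero-width no-break space

lossy = roundtrip(text, "utf-16")      # generic codec treats leading U+FEFF as a BOM
exact = roundtrip(text, NATIVE_CODEC)  # explicit byte order keeps every code unit

# On the encoding side the generic codec also adds a BOM of its own,
# while the explicit-endian codec emits only the payload.
with_bom = "hi".encode("utf-16")
without_bom = "hi".encode(NATIVE_CODEC)

print(len(text), len(lossy), len(exact))  # 6 5 6
print(len(with_bom), len(without_bom))    # 6 4
```

Data that was persisted with a BOM is a separate concern: it is decoded by whatever reads the file, before the resulting string is handed across the Python/.NET boundary.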

@filmor filmor self-assigned this May 7, 2024
@lostmsu lostmsu merged commit 195cde6 into pythonnet:master May 10, 2024
27 checks passed
@filmor filmor deleted the fix-bom-strings branch May 10, 2024 19:55
Development

Successfully merging this pull request may close these issues.

Possible bug in reading zero width no-break space character
2 participants