Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace wchar_t string decoding implementation with a uint32_t-based one #555

Merged
merged 1 commit into from Jun 20, 2022

Conversation

JustAnotherArchivist
Copy link
Collaborator

This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.

Fixes #552

Remarks:

  • For the disgusting Py_UCS4 == JSUINT32 check magic, see the comments on Surrogates fix fails tests with PyPy on Windows #552.
  • The changelog might need a little touch-up here as this is essentially a continuation of Fix handling of surrogates on decoding #550.
  • I have not run any performance comparisons yet. In general, I would expect it to perform at least as well as the previous implementation since PyUnicode_FromWideChar does some extra work compared to PyUnicode_FromKindAndData (mostly due to surrogate handling). On 16-bit wchar_t platforms, the larger buffer size might have some impact though; I won't be able to run comparisons for that though, I think.

@codecov-commenter
Copy link

codecov-commenter commented Jun 18, 2022

Codecov Report

Merging #555 (bc7bdff) into main (cc70119) will increase coverage by 0.03%.
The diff coverage is 90.90%.

@@            Coverage Diff             @@
##             main     #555      +/-   ##
==========================================
+ Coverage   91.81%   91.84%   +0.03%     
==========================================
  Files           6        6              
  Lines        1856     1852       -4     
==========================================
- Hits         1704     1701       -3     
+ Misses        152      151       -1     
Impacted Files Coverage Δ
tests/test_ujson.py 99.45% <ø> (+0.17%) ⬆️
lib/ultrajsondec.c 92.56% <90.00%> (ø)
python/JSONtoObj.c 88.04% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc70119...bc7bdff. Read the comment docs.

lib/ultrajsondec.c Outdated Show resolved Hide resolved
This fixes character handling on platforms with 16-bit wchar_t (notably, Windows), which was broken (in different ways) on both CPython and PyPy.

Fixes ultrajson#552
Copy link
Collaborator

@bwoodsend bwoodsend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I was expecting a replacement of all strings to be a much bigger, scarier looking change set.

@hugovk hugovk merged commit 67ec071 into ultrajson:main Jun 20, 2022
20 checks passed
@JustAnotherArchivist
Copy link
Collaborator Author

Yeah, much of the code essentially assumed 32-bit ints already for proper operation, so not many changes were needed at all.

Also, just realised I forgot about the benchmarks. Some quick tests right now indicate that it's very marginally faster than the previous code by a couple per cent or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog: Fixed For any bug fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Surrogates fix fails tests with PyPy on Windows
4 participants