Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for arbitrary size integers #548

Merged
merged 6 commits into from Jun 18, 2022

Conversation

JustAnotherArchivist
Copy link
Collaborator

This implements the idea I mentioned in #440: when an int is too large to be handled with C integers, do the conversion with Python's C functions instead and treat it as a char array internally.

The code already works, but it's obviously not ready for merging. I'm submitting it now for feedback.


I only looked at performance briefly so far. As the code is now, it shouldn't noticeably impact throughput when no large ints are in the input (just one additional jump on encoding and a bigger jump on decoding, I think). Given how complex they are compared to ujson's implementation, I was surprised to find that the CPython functions actually perform very well. On encoding, with some simple python3 -m timeit -s 'import ujson' 'ujson.dumps(i)' tests, I found the following:

i time note
0 255 ns
9 267 ns
10 280 ns
256 282 ns
257 282 ns Not interned, clearly the creation of new int objects is not relevant.
65536 289 ns
9223372036854775807 (2**63-1) 337 ns Last value to make use of the long long
9223372036854775808 501 ns
18446744073709551615 (2**64-1) 503 ns Last value to make use of the unsigned long long
18446744073709551616 504 ns First value to make use of this new implementation

I also briefly looked into getting rid of the PyLong_AsLongLong and PyLong_AsUnsignedLongLong calls entirely and just using this method for everything (though I didn't clean everything up, just commented out that code in Object_beginTypeContext). For ]-2**63, 2**63[, this is slower than the existing code. (The range [-9, 9] is marginally faster because CPython has optimised code for that.) Outside of that range, it performs better. Specifically, ujson.dumps(9223372036854775808) takes about 395 ns on my machine then, 20 % less than the existing code. I haven't thoroughly tested removing only the unsigned call, but it seems to not be of any value. In other words, if someone were to encode lots of large ints, this might be a considerable performance increase for them, and if most ints were single-digit, it would be about equal. For others, it would be significantly worse. I'm not sure what to do with this insight.

static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
// PyLong_FromString requires a NUL-terminated string in CPython, contrary to the documentation: https://github.com/python/cpython/issues/59200
char *buf = PyObject_Malloc(length + 1);
memcpy(buf, value, length);
buf[length] = '\0';
return PyLong_FromString(buf, NULL, 10);
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memcpy here is certainly not ideal. It would probably be faster to buffer the next byte, replace it with a NUL, feed the char array through PyLong_FromString, and then restore it. I think that should always be possible/safe without buffer overruns, but I haven't spent a great deal of thought about it yet. Do you think it's worth exploring that sort of hack?

Copy link
Collaborator

@bwoodsend bwoodsend Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, now that's a disgusting but tantalising thing to do. Two ways I imagine that causing damage are:

  1. The integer we're parsing is the last part of the JSON so the next byte would be out of bounds.
  2. Someone is reading the JSON string in multiple threads and the second thread happens to look at that byte when it's temporarily nullified.

1 is moot if the string we're parsing is NULL terminated (which I'm fairly certain that they already are). 2 feels like a very nasty way to derail something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relevant part of the decoder boils down to:

  1. If the argument is a string, call PyUnicode_AsUTF8String to convert to bytes.
  2. Call PyBytes_AsString, which returns a NULL-terminated char* that is the internal buffer of the bytes object.
  3. Iterate over that buffer.

So no out-of-bounds issue. As for multithreading, yes, that could be an issue (if the user passes bytes). But on the other hand, ujson already isn't thread-safe as far as I can tell. For example, modifying a dict from another thread while encoding it should crash the encoder.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran some tests on this, parsing an array equivalent to [2**64 + i for i in range(1000)], i.e. 1000 ints all using this code path. On my machine, parsing such a string averages 128 μs with memcpy and 108 μs with the suggested hack.

Compared to the other code paths for smaller ints (note that the number of digits obviously has a direct impact on throughput as well):

input time
2**64 + i, memcpy 128 μs
2**64 + i, buffer hack 108 μs
2**62 + i (newUnsignedLong) 60 μs
-2**63 + i (newLong) 64 μs
2**30 + i (newInt) 45 μs
-2**31 + i (newInt) 45 μs

Command, for reference: python3 -m timeit -s 'import ujson; s = "[" + ",".join(str(2**64 + i) for i in range(1000)) + "]"' 'ujson.loads(s)'

Implementation of the hack:

static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
  char buf = value[length];
  value[length] = '\0';
  PyObject *out = PyLong_FromString(value, NULL, 10);
  value[length] = buf;
  return out;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a pretty small speedup really. On my machine it's even less (115μs vs 110μs). Definitely not worth the hack in my mind.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree. It's basically only relevant if all you ever do is parse huge integers. I don't think that's common enough to warrant the ugly hack.

@JustAnotherArchivist
Copy link
Collaborator Author

@bwoodsend Small comment on the previous discussion in #440 just to avoid confusion: I realised that an extra type isn't needed. JT_RAW already covers exactly this use case on encoding and decoding doesn't care about the different kind of ints anyway.

@JustAnotherArchivist
Copy link
Collaborator Author

Just to confirm, I did more performance checks on both encoding and decoding. As expected, throughput for ints within the previously supported range [-2**63, 2**64-1] is not measurably impacted by this change.

@JustAnotherArchivist JustAnotherArchivist force-pushed the arbitrary-ints branch 2 times, most recently from b33ec84 to 5c9cfa8 Compare June 16, 2022 03:26
@codecov-commenter
Copy link

codecov-commenter commented Jun 16, 2022

Codecov Report

Merging #548 (773c04a) into main (b47c3a7) will decrease coverage by 0.02%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
- Coverage   91.83%   91.80%   -0.03%     
==========================================
  Files           6        6              
  Lines        1824     1855      +31     
==========================================
+ Hits         1675     1703      +28     
- Misses        149      152       +3     
Impacted Files Coverage Δ
python/objToJSON.c 89.92% <90.90%> (-0.22%) ⬇️
tests/test_ujson.py 99.27% <91.66%> (-0.17%) ⬇️
lib/ultrajsondec.c 92.56% <100.00%> (+0.06%) ⬆️
python/JSONtoObj.c 88.04% <100.00%> (+0.68%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b47c3a7...773c04a. Read the comment docs.

@JustAnotherArchivist
Copy link
Collaborator Author

JustAnotherArchivist commented Jun 16, 2022

Apparently PyPy's PyNumber_ToBase requires the value to fit into long long. I'll file an issue about this. It's unlikely to get fixed soon though, which would mean that arbitrary int encoding support would be restricted to CPython for the time being. (Decoding, i.e. PyLong_FromString, is fine though.)

Edit: Issue filed: https://foss.heptapod.net/pypy/pypy/-/issues/3765

tests/test_ujson.py Outdated Show resolved Hide resolved
tests/test_ujson.py Outdated Show resolved Hide resolved
@hugovk
Copy link
Member

hugovk commented Jun 17, 2022

Before merging, let's also test on the CI with PyPy on Windows.

Obviously #552 is a blocker, so we can either wait for that to be fixed, or could temporarily skip its failing test on PyPy/Windows to enable testing here first?

@bwoodsend
Copy link
Collaborator

bwoodsend commented Jun 17, 2022

xfailing the offending test on Windows+PyPy, re-enabling CI for PyPy on either Windows or all platforms then rebasing this PR on top of that would be my plan.

@JustAnotherArchivist
Copy link
Collaborator Author

Either way is fine with me.

@hugovk
Copy link
Member

hugovk commented Jun 18, 2022

xfailing the offending test on Windows+PyPy, re-enabling CI for PyPy on either Windows or all platforms then rebasing this PR on top of that would be my plan.

Sounds good: #553

@hugovk hugovk added the changelog: Added For new features label Jun 18, 2022
@hugovk
Copy link
Member

hugovk commented Jun 18, 2022

#553 is merged, we should be good to merge this too. Thanks both!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog: Added For new features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants