
UltraJSON has inconsistent behavior on parsing large integral values. #440

Closed
TheShiftedBit opened this issue Nov 10, 2020 · 12 comments · Fixed by #544

Comments

@TheShiftedBit

For some integers that are too large to fit in 64 bits, UltraJSON raises a ValueError. For others, it parses successfully, but to the wrong value. This happens both for numbers that are out of range in the positive direction and for numbers that are out of range in the negative direction.

What did you do?

Decoded an integer that was too big or too small.

What did you expect to happen?

I expected UltraJSON to raise a ValueError.

What actually happened?

UltraJSON parsed the integer... into the wrong value.

What versions are you using?

  • OS: Debian 5.7.17-1rodete4 (2020-10-01) x86_64 GNU/Linux
  • Python: 3.8.5
  • UltraJSON: HEAD as of Nov 9 2020 (commit hash f26f967)

Please include code that reproduces the issue.

import ujson

ujson.loads("18446744073709551615")  # Parses fine: the maximum 64-bit value
ujson.loads("18446744073709551616")  # Raises ValueError: Value is too big!

# This value is even bigger:
ujson.loads("33333333303333333333")  # Parses, but to the wrong value: 14886589229623781717

# The same thing happens for negative numbers. For example:
ujson.loads("-80888888888888888888")  # Parses to the wrong value: -7101912594050682424
@JustAnotherArchivist
Collaborator

This is happening due to incorrect overflow* detection. The decoder parses the input digit by digit: on each step, the previous value is multiplied by ten and the next digit is added. The code then checks whether the result is smaller than the previous value, on the assumption that a wraparound always produces a smaller number. However, a wraparound can land on a value that is still larger than the previous one, so overflows can happen without triggering that check.

For example, take the input 33333333303333336789 (just the last few digits changed to make it more readable). When the last digit is parsed, prevIntValue = 3333333330333333678. Multiplying by ten wraps around to 14886589229623785164, and adding the last digit then yields ...73. This result is larger than prevIntValue, so the error condition is not triggered despite the overflow.

(* Technically, unsigned ints cannot actually overflow (cause undefined behaviour etc.) in C. I'm using 'overflow' in the looser sense to mean wrapping around.)
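
A minimal C sketch of the failure mode described above (this is an illustrative reimplementation of the parse loop, not ujson's actual source):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the digit-by-digit parse loop described above
   (assumption: simplified illustration, not ujson's actual code). */
static uint64_t parse_uint64(const char *s, int *wrapped)
{
    uint64_t value = 0;
    for (; *s; s++) {
        uint64_t prev = value;
        value = value * 10 + (uint64_t)(*s - '0');  /* wraps modulo 2^64 */
        if (value < prev)
            *wrapped = 1;  /* the existing check: only catches some wraps */
    }
    return value;
}

int main(void)
{
    int wrapped = 0;
    /* The last step wraps past 2^64, but the result is still larger
       than the previous value, so the check never fires. */
    printf("%llu (wrap detected: %d)\n",
           (unsigned long long)parse_uint64("33333333303333333333", &wrapped),
           wrapped);
    /* Output: 14886589229623781717 (wrap detected: 0) */
    return 0;
}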

@jlusiardi

@bwoodsend
Collaborator

I don't think that matching constraints caused by JavaScript's slightly questionable behaviour of munging integers and floats under the hood would make us very popular, especially given that Python's json handles arbitrarily sized integers without complaint.

@jlusiardi

Ok, works for me ;)

@jlusiardi

jlusiardi commented May 29, 2022

Nevertheless, https://datatracker.ietf.org/doc/html/rfc7159#section-6 recommends considering the range [-(2**53)+1, (2**53)-1] as interoperable.

@JustAnotherArchivist
Collaborator

JustAnotherArchivist commented May 29, 2022

ujson is compatible with that as it already supports the range [-(2**63), (2**64)-1]. This issue is merely about smaller/larger numbers producing incorrect values rather than a decoding error.

@bwoodsend
Collaborator

I believe that @jlusiardi is suggesting that we change ujson to raise an error for any value outside of the range [-(2**53)+1, (2**53)-1]. That would make out-of-range detection easier, since we wouldn't have to tiptoe around 64-bit integer overflow.
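
A hypothetical sketch of why that would simplify detection (not a proposed patch; sign handling omitted): because 2**53 - 1 is far below INT64_MAX / 10, the running value can be bounds-checked after every digit without int64 arithmetic ever getting close to wrapping.

#include <stdint.h>

#define JSON_INTEROP_MAX ((int64_t)9007199254740991)  /* 2**53 - 1 */

/* Hypothetical sketch: with a 2**53 cap, the running value is
   rejected long before int64_t arithmetic could wrap, so no
   wraparound detection is needed at all. */
static int64_t parse_interop_int(const char *s, int *out_of_range)
{
    int64_t value = 0;
    for (; *s; s++) {
        value = value * 10 + (*s - '0');  /* max intermediate ~9.0e16 */
        if (value > JSON_INTEROP_MAX) {
            *out_of_range = 1;
            return 0;
        }
    }
    return value;
}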

@JustAnotherArchivist
Collaborator

Hmm, right. I'm not a fan of artificially restricting the range like that.

But now I wonder why we aren't just reading digits into a char* buffer and then using PyLong_FromString. That would give us arbitrary int support, I'd expect. It's probably significantly slower, but we could use that if the number would overflow, I suppose. Or even if the number gets close to ULLONG_MAX / 10 and might therefore overflow later, if we don't care too much about performance in those edge cases.
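
A hypothetical sketch of that fallback (the function name decode_big_int and the buffer handling are illustrative; PyLong_FromString itself is the real CPython API):

#include <Python.h>
#include <string.h>

/* Hypothetical fallback for digit runs that would overflow a 64-bit int
   (assumption: illustrative only, not actual ujson code). */
static PyObject *decode_big_int(const char *start, size_t len)
{
    char *buf = PyMem_Malloc(len + 1);
    if (buf == NULL)
        return PyErr_NoMemory();
    memcpy(buf, start, len);
    buf[len] = '\0';
    /* CPython builds an arbitrary-precision int from the digit string. */
    PyObject *result = PyLong_FromString(buf, NULL, 10);
    PyMem_Free(buf);
    return result;  /* NULL with an exception set if the digits were invalid */
}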

@bwoodsend
Collaborator

I agree. If JSON were used exclusively by JavaScript talking to other JavaScript applications, then I would enforce the limited range, but since JSON has become pretty universal, I don't think it makes sense to inherit JavaScript's shortcomings.

But now I wonder why we aren't just reading digits into a char* buffer and then using PyLong_FromString.

I think it's just for the perceived cleanliness of having a pure C implementation in a lib directory and a CPython wrapper in the python directory. I wouldn't be opposed to breaking that convention.

@JustAnotherArchivist
Collaborator

Good point, and I do like that, actually. We could add an extra type for these, e.g. JT_INTSTR. The parser and encoder would deal with char* and the Python interface would convert these to/from PyLong, still keeping those things entirely separated.

@bwoodsend
Collaborator

JT_INTSTR being an alias for a char array, correct? I.e. conversion to int is deferred until it gets up the stack into the CPython code.

@JustAnotherArchivist
Collaborator

Well, not quite an alias, just an entry in the JSTYPES enum, but yes, it would represent a char array and act exactly as you described.
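
A minimal sketch of what that could look like (the enum contents shown are illustrative, not ujson's actual JSTYPES definition):

/* Sketch only: illustrative, not the real JSTYPES enum. */
typedef enum {
    JT_NULL,
    JT_TRUE,
    JT_FALSE,
    JT_INT,     /* fits in a machine integer: parsed in C as before */
    JT_DOUBLE,
    JT_UTF8,
    JT_ARRAY,
    JT_OBJECT,
    JT_INTSTR,  /* raw digit span carried as a char array; the CPython
                   layer converts it to/from PyLong */
} JSTYPES;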
