
UltraJSON has inconsistent behavior on parsing large integral values. #440

Closed
TheShiftedBit opened this issue Nov 10, 2020 · 12 comments · Fixed by #544

Comments

@TheShiftedBit

For some integers that are too large to fit in 64 bits, UltraJSON raises a ValueError. For others, it parses successfully, but to the wrong value. This happens both for numbers that are out of range in the positive direction and for numbers that are out of range in the negative direction.

What did you do?

Decoded an integer that was too big or too small.

What did you expect to happen?

I expected UltraJSON to raise a ValueError.

What actually happened?

UltraJSON parsed the integer... into the wrong value.

What versions are you using?

  • OS: Debian 5.7.17-1rodete4 (2020-10-01) x86_64 GNU/Linux
  • Python: 3.8.5
  • UltraJSON: HEAD as of Nov 9 2020 (commit hash f26f967)

Please include code that reproduces the issue.

import ujson

ujson.loads("18446744073709551615")  # Parses fine: the maximum 64-bit value
ujson.loads("18446744073709551616")  # Raises ValueError: Value is too big!

# This value is even bigger:
ujson.loads("33333333303333333333")  # Parses, but to the wrong value: 14886589229623781717

# The same thing happens for negative numbers. For example:
ujson.loads("-80888888888888888888")  # Parses to the wrong value: -7101912594050682424
@JustAnotherArchivist
Collaborator

This is happening due to incorrect overflow* detection. The decoder parses the input digit by digit: on each step, the previous value is multiplied by ten and the next digit is added. The code then checks whether the result is smaller than the previous value, on the assumption that a wraparound always produces a smaller number. However, a wraparound can land on a value that is still larger than the previous one, so overflows can happen without triggering that check.

For example, take the input 33333333303333336789 (just the last few digits changed to make it more readable). When the last digit is parsed, prevIntValue = 3333333330333333678. Multiplying by ten wraps around to 14886589229623785164, and adding the last digit then yields ...73. This result is larger than prevIntValue, so the error condition is not triggered despite the overflow.

(* Technically, unsigned ints cannot actually overflow (cause undefined behaviour etc.) in C. I'm using 'overflow' in the looser sense to mean wrapping around.)
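
A minimal C sketch of the failure mode described above (this is an illustrative reimplementation of the parse loop, not ujson's actual source):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the digit-by-digit parse loop described above
   (assumption: simplified illustration, not ujson's actual code). */
static uint64_t parse_uint64(const char *s, int *wrapped)
{
    uint64_t value = 0;
    for (; *s; s++) {
        uint64_t prev = value;
        value = value * 10 + (uint64_t)(*s - '0');  /* wraps modulo 2^64 */
        if (value < prev)
            *wrapped = 1;  /* the existing check: only catches some wraps */
    }
    return value;
}

int main(void)
{
    int wrapped = 0;
    /* The last step wraps past 2^64, but the result is still larger
       than the previous value, so the check never fires. */
    printf("%llu (wrap detected: %d)\n",
           (unsigned long long)parse_uint64("33333333303333333333", &wrapped),
           wrapped);
    /* Output: 14886589229623781717 (wrap detected: 0) */
    return 0;
}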

@jlusiardi

@bwoodsend
Collaborator

I don't think that matching constraints caused by JavaScript's slightly questionable behaviour of munging integers and floats under the hood would make us very popular, especially given that Python's json handles arbitrarily sized integers without complaint.

@jlusiardi

Ok, works for me ;)

@jlusiardi

jlusiardi commented May 29, 2022

Nevertheless, https://datatracker.ietf.org/doc/html/rfc7159#section-6 recommends considering the range [-(2**53)+1, (2**53)-1] as interoperable.

@JustAnotherArchivist
Collaborator

JustAnotherArchivist commented May 29, 2022

ujson is compatible with that as it already supports the range [-(2**63), (2**64)-1]. This issue is merely about smaller/larger numbers producing incorrect values rather than a decoding error.

@bwoodsend
Collaborator

I believe that @jlusiardi is suggesting that we change ujson to raise an error for any value outside of the range [-(2**53)+1, (2**53)-1]. That would make out-of-range detection easier, since we wouldn't have to tiptoe around 64-bit integer overflow.
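
A hypothetical sketch of why that would simplify detection (not a proposed patch; sign handling omitted): because 2**53 - 1 is far below INT64_MAX / 10, the running value can be bounds-checked after every digit without int64 arithmetic ever getting close to wrapping.

#include <stdint.h>

#define JSON_INTEROP_MAX ((int64_t)9007199254740991)  /* 2**53 - 1 */

/* Hypothetical sketch: with a 2**53 cap, the running value is
   rejected long before int64_t arithmetic could wrap, so no
   wraparound detection is needed at all. */
static int64_t parse_interop_int(const char *s, int *out_of_range)
{
    int64_t value = 0;
    for (; *s; s++) {
        value = value * 10 + (*s - '0');  /* max intermediate ~9.0e16 */
        if (value > JSON_INTEROP_MAX) {
            *out_of_range = 1;
            return 0;
        }
    }
    return value;
}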

@JustAnotherArchivist
Collaborator

Hmm, right. I'm not a fan of artificially restricting the range like that.

But now I wonder why we aren't just reading digits into a char* buffer and then using PyLong_FromString. That would give us arbitrary int support, I'd expect. It's probably significantly slower, but we could use that if the number would overflow, I suppose. Or even if the number gets close to ULLONG_MAX / 10 and might therefore overflow later, if we don't care too much about performance in those edge cases.
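
A hypothetical sketch of that fallback (the function name decode_big_int and the buffer handling are illustrative; PyLong_FromString itself is the real CPython API):

#include <Python.h>
#include <string.h>

/* Hypothetical fallback for digit runs that would overflow a 64-bit int
   (assumption: illustrative only, not actual ujson code). */
static PyObject *decode_big_int(const char *start, size_t len)
{
    char *buf = PyMem_Malloc(len + 1);
    if (buf == NULL)
        return PyErr_NoMemory();
    memcpy(buf, start, len);
    buf[len] = '\0';
    /* CPython builds an arbitrary-precision int from the digit string. */
    PyObject *result = PyLong_FromString(buf, NULL, 10);
    PyMem_Free(buf);
    return result;  /* NULL with an exception set if the digits were invalid */
}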

@bwoodsend
Collaborator

I agree. If JSON were used exclusively by JavaScript talking to other JavaScript applications, then I would enforce the limited range, but since JSON has become pretty universal, I don't think it makes sense to inherit JavaScript's shortcomings.

But now I wonder why we aren't just reading digits into a char* buffer and then using PyLong_FromString.

I think it's just for the perceived cleanliness of having a pure C implementation in a lib directory and a CPython wrapper in the python directory. I wouldn't be opposed to breaking that convention.

@JustAnotherArchivist
Collaborator

Good point, and I do like that, actually. We could add an extra type for these, e.g. JT_INTSTR. The parser and encoder would deal with char* and the Python interface would convert these to/from PyLong, still keeping those things entirely separated.

@bwoodsend
Collaborator

JT_INTSTR being an alias for a char array, correct? I.e. conversion to int is deferred until it gets up the stack into the CPython code.

@JustAnotherArchivist
Collaborator

Well, not quite an alias, just an entry in the JSTYPES enum, but yes, it would represent a char array and act exactly as you described.
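
A minimal sketch of what that could look like (the enum contents shown are illustrative, not ujson's actual JSTYPES definition):

/* Sketch only: illustrative, not the real JSTYPES enum. */
typedef enum {
    JT_NULL,
    JT_TRUE,
    JT_FALSE,
    JT_INT,     /* fits in a machine integer: parsed in C as before */
    JT_DOUBLE,
    JT_UTF8,
    JT_ARRAY,
    JT_OBJECT,
    JT_INTSTR,  /* raw digit span carried as a char array; the CPython
                   layer converts it to/from PyLong */
} JSTYPES;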
