New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Options for "NaN" strings as NaN floats #543
Conversation
You're right: we shouldn't support single quotes. That's a Python print-out, not JSON. This PR isn't so much about supporting @aswin1447's specific files as the general features of data like this:
as though it were in a single list with commas in between. This would not be by literally transforming the text: see and It's already doing some funny business with ArrayBuilder's The other part of this is to optionally read and write specified strings as floating point https://github.com/scikit-hep/awkward-1.0/blob/master/src/libawkward/io/json.cpp (This is also the only place where we can use RapidJSON directly—it's hidden from the interface/header files so that downstream users of the library won't also depend on RapidJSON. That's why some of the calling conventions are convoluted.) This PR can also include issue #192, passing a Python file pointer down to the C++ (which also mentions this issue of taking a stream of JSON objects, by the way). |
Sure, the issues are related. I was thinking of using a generator to stream the JSON fragments: def stream_read_json(fn):
start_pos = 0
with open(fn, 'r') as f:
while True:
try:
obj = json.load(f)
yield obj
return
except json.JSONDecodeError as e:
f.seek(start_pos)
json_str = f.read(e.pos)
obj = json.loads(json_str)
start_pos += e.pos
yield obj but I'm not sure about performance - it's way too slow or the test file is way too big :-) |
It would be much better to tweak the rules under which |
(*) failing integration tests are due to a C++17 @jpivarski - I run into a curious comparison error. The reported diff seems to be identical... > assert awkward1.to_list(array) == [1.1, 2.2, 3.3, numpy.inf, -numpy.inf,
[4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],
[12.12, 13.13, 14.14, 15.15, 16.16, 17.17],
[[[18.18, 19.19, 20.2, 21.21, 22.22],
[23.23, 24.24, 25.25, 26.26, 27.27,
28.28, 29.29, 30.3, 31.31, 32.32,
33.33, 34.34, 35.35, 36.36, 37.37],
[38.38],
[39.39, 40.4, numpy.nan, numpy.nan, 41.41, 42.42, 43.43]],
[[44.44, 45.45, 46.46, 47.47, 48.48],
[49.49, 50.5, 51.51, 52.52, 53.53,
54.54, 55.55, 56.56, 57.57, 58.58,
59.59, 60.6, 61.61, 62.62, 63.63],
[64.64],
[65.65, 66.66, numpy.nan, numpy.nan, 67.67, 68.68, 69.69]]]]
E assert [1.1,\n 2.2,\n 3.3,\n inf,\n -inf,\n [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],\n [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],\n [[[18.18, 19.19, 20.2, 21.21, 22.22],\n [23.23,\n 24.24,\n 25.25,\n 26.26,\n 27.27,\n 28.28,\n 29.29,\n 30.3,\n 31.31,\n 32.32,\n 33.33,\n 34.34,\n 35.35,\n 36.36,\n 37.37],\n [38.38],\n [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],\n [[44.44, 45.45, 46.46, 47.47, 48.48],\n [49.49,\n 50.5,\n 51.51,\n 52.52,\n 53.53,\n 54.54,\n 55.55,\n 56.56,\n 57.57,\n 58.58,\n 59.59,\n 60.6,\n 61.61,\n 62.62,\n 63.63],\n [64.64],\n [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]] == [1.1,\n 2.2,\n 3.3,\n inf,\n -inf,\n [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],\n [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],\n [[[18.18, 19.19, 20.2, 21.21, 22.22],\n [23.23,\n 24.24,\n 25.25,\n 26.26,\n 27.27,\n 28.28,\n 29.29,\n 30.3,\n 31.31,\n 32.32,\n 33.33,\n 34.34,\n 35.35,\n 36.36,\n 37.37],\n [38.38],\n [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],\n [[44.44, 45.45, 46.46, 47.47, 48.48],\n [49.49,\n 50.5,\n 51.51,\n 52.52,\n 53.53,\n 54.54,\n 55.55,\n 56.56,\n 57.57,\n 58.58,\n 59.59,\n 60.6,\n 61.61,\n 62.62,\n 63.63],\n [64.64],\n [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]]
E At index 7 diff: [[[18.18, 19.19, 20.2, 21.21, 22.22], [23.23, 24.24, 25.25, 26.26, 27.27, 28.28, 29.29, 30.3, 31.31, 32.32, 33.33, 34.34, 35.35, 36.36, 37.37], [38.38], [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]], [[44.44, 45.45, 46.46, 47.47, 48.48], [49.49, 50.5, 51.51, 52.52, 53.53, 54.54, 55.55, 56.56, 57.57, 58.58, 59.59, 60.6, 61.61, 62.62, 63.63], [64.64], [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]] != [[[18.18, 19.19, 20.2, 21.21, 22.22], [23.23, 24.24, 25.25, 26.26, 27.27, 28.28, 29.29, 30.3, 31.31, 32.32, 33.33, 34.34, 35.35, 36.36, 37.37], [38.38], [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]], [[44.44, 45.45, 46.46, 47.47, 48.48], [49.49, 50.5, 51.51, 52.52, 53.53, 54.54, 55.55, 56.56, 57.57, 58.58, 59.59, 60.6, 61.61, 62.62, 63.63], [64.64], [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]
E Full diff:
E [
E 1.1,
E 2.2,
E 3.3,
E inf,
E -inf,
E [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],
E [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],
E [[[18.18, 19.19, 20.2, 21.21, 22.22],
E [23.23,
E 24.24,
E 25.25,
E 26.26,
E 27.27,
E 28.28,
E 29.29,
E 30.3,
E 31.31,
E 32.32,
E 33.33,
E 34.34,
E 35.35,
E 36.36,
E 37.37],
E [38.38],
E [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],
E [[44.44, 45.45, 46.46, 47.47, 48.48],
E [49.49,
E 50.5,
E 51.51,
E 52.52,
E 53.53,
E 54.54,
E 55.55,
E 56.56,
E 57.57,
E 58.58,
E 59.59,
E 60.6,
E 61.61,
E 62.62,
E 63.63],
E [64.64],
E [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]],
E ]
tests/test_0437-stream-of-many-json-files.py:26: AssertionError |
Maybe it's because NaN is not equal to NaN (according to IEEE rules): >>> float("nan") == float("nan")
False
>>> np.nan == np.nan
False A suggestion on StackOverflow is to replace the NaN values with strings, which can be done after the conversion to Python lists. >>> data1 = [[44.44, 45.45, 46.46, 47.47, 48.48],
... [49.49, 50.5, 51.51, 52.52, 53.53,
... 54.54, 55.55, 56.56, 57.57, 58.58,
... 59.59, 60.6, 61.61, 62.62, 63.63],
... [64.64],
... [65.65, 66.66, float("nan"), float("nan"), 67.67, 68.68, 69.69]]
>>> data2 = [[44.44, 45.45, 46.46, 47.47, 48.48],
... [49.49, 50.5, 51.51, 52.52, 53.53,
... 54.54, 55.55, 56.56, 57.57, 58.58,
... 59.59, 60.6, 61.61, 62.62, 63.63],
... [64.64],
... [65.65, 66.66, float("nan"), float("nan"), 67.67, 68.68, 69.69]]
>>> data1 == data2
False
>>>
>>> def fix(x):
... if isinstance(x, dict):
... return dict((k, fix(v)) for k, v in x.items())
... elif isinstance(x, list):
... return [fix(v) for v in x]
... elif isinstance(x, tuple):
... return tuple(fix(v) for v in x)
... elif np.isnan(x):
... return "nan"
... else:
... return x
...
>>> fix(data1) == fix(data2)
True There's no such problem with infinity: >>> float("inf") == float("inf")
True
>>> float("inf") == float("-inf")
False
>>> float("-inf") == float("-inf")
True
>>> np.inf == np.inf
True
>>> np.inf == -np.inf
False
>>> -np.inf == -np.inf
True In the IEEE floating point definition, only one bit pattern corresponds to infinity, another bit pattern corresponds to minus infinity, but there's a large space of bit patterns that all correspond to not-a-number. Here are two of them: >>> np.array([0, 0, 0, 0, 0, 0, 248, 127], dtype=np.uint8).view(np.float64)
array([nan])
>>> np.array([0, 0, 0, 0, 0, 0, 249, 127], dtype=np.uint8).view(np.float64)
array([nan]) Rather than requiring implementations of IEEE floating point arithmetic to complicate equality checks by checking to see if both sides are in or not in the "not-a-number" class, they're just defined as being not equal to each other. (After all, most implementations these days are in hardware, and you want to keep that as simple as possible.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'd rather these options not be a part of ArrayBuilderOptions, since they only apply to JSON (as they work around a JSON limitation). ArrayBuilder is used in a lot of contexts other than JSON. If there's some reason why it would be very difficult to do this, then we can consider this an internal detail, since it's not exposed to the Python interface to ArrayBuilder, but it's looking to me like a bad direction to grow.
The reason we have ArrayBuilderOptions is because ArrayBuilder makes a tree of indeterminate depth. At any moment, it could encounter new data that broadens the type of its array, so each node needs to have access to the configuration options, which includes such things as "how big should the initial allocation be?"
These JSON options don't need to be pushed down this way. The JSON-handling code is all in a SAX parser class that doesn't create nodes of a tree (it calls methods on ArrayBuilder, which internally builds those nodes). So the JSON options don't have to be "pushed down" in the same way. That's why they don't really belong in ArrayBuilderOptions.
Also, about the options themselves: users need to be able to specify the spelling of "nan"
, "inf"
, and "-inf"
, since there's no standard—different JSON writers generate different strings: "NaN"
, "Inf"
, "-Inf"
, "Infinity"
, "-Infinity"
, and "minusInfinity"
are all things I've seen. A single dataset won't have multiple spellings, so these are three new options:
- the string that gets mapped to
std::numeric_limits<double>::quiet_NaN()
- the string that gets mapped to
std::numeric_limits<double>::infinity()
- the string that gets mapped to
-std::numeric_limits<double>::infinity()
any of which can be None (i.e. no string that gets mapped to a numeric limit).
done. It's the
reimplemented. There are a few parameters with default definitions that can be overwritten by a user. It's a long-ish list. Please, comment. convertNanAndInf=False # to float
replaceNanAndInf=False # with another string
fromNan='NaN'
fromInf='inf'
fromMinusInf='-inf'
toNan='NaN'
toInf='inf'
toMinusInf='-inf' for example: array = awkward1.from_json('tests/samples/test-nan-inf.json', convertNanAndInf=True,
fromInf='Infinity', fromNan='None')
looking into it :-) Thanks! |
The def to_json(
array,
destination=None,
pretty=False,
maxdecimals=None,
nan_string=None,
infinity_string=None,
minus_infinity_string=None,
buffersize=65536,
):
...
def from_json(
source,
nan_string=None,
infinity_string=None,
minus_infinity_string=None,
highlevel=True,
behavior=None,
initial=1024,
resize=1.5,
buffersize=65536,
):
... Giving them the same names highlights the symmetry. I've grouped them here with other JSON parameters ( There needs to be a distinction between "no conversion string" and "the conversion string that is ' |
…en writing out json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great, thanks! Just a few inline comments.
Also, the "awkward1
" → "awkward
" name change will be happening Nov 30 or Dec 1, so you'll want this PR to be done by then. See details about the name change here:
https://github.com/scikit-hep/awkward-1.0/wiki/Name-change-procedure
Thanks, @jpivarski ! I think, I'm done with this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some cases to test. These first ones should raise an error (ValueError, as usual), but not hang:
"""{"one": 1, "two": 2.2,"""
"""{"one": 1,
"two": 2.2,"""
"""{"one": 1,
"two": 2.2,
"""
"""{"one": 1, "two": 2.2, "three": "THREE"}
{"one": 10, "two": 22,"""
"""{"one": 1, "two": 2.2, "three": "THREE"}
{"one": 10, "two": 22,
"""
"""["one", "two","""
"""["one",
"two","""
"""["one",
"two",
"""
And these should recognize that there are two JSON objects in the stream, despite the whitespace:
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2} {"one": 10, "two": 22}'
'{"one": 1, \t "two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2} \t {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}\n{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}\n\r{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2} \n {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2} \n\r {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}\n'
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}\n\r'
'{"one": 1, "two": 2.2}\n{"one": 10, "two": 22}\n'
'{"one": 1, "two": 2.2}\n\r{"one": 10, "two": 22}\n\r'
'{"one": 1, \n"two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, \n\r"two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}'
'{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}'
'{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}\n'
'{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}\n\r'
'{"one": 1, "two": 2.2}{"one": 10, \n"two": 22}'
'{"one": 1, "two": 2.2}{"one": 10, \n\r"two": 22}'
'{"one": 1, "two": 2.2}\n{"one": 10, \n"two": 22}'
'{"one": 1, "two": 2.2}\n\r{"one": 10, \n\r"two": 22}'
'["one", "two"]["uno", "dos"]'
'["one", "two"]\n["uno", "dos"]'
'["one", "two"]\n\r["uno", "dos"]'
'["one", "two"] ["uno", "dos"]'
'["one", "two"] \n ["uno", "dos"]'
'["one", "two"] \n\r ["uno", "dos"]'
'["one", \n "two"]["uno", "dos"]'
'["one", \n "two"]\n["uno", "dos"]'
'["one", \n\r "two"]["uno", "dos"]'
'["one", \n\r "two"]\n\r["uno", "dos"]'
'"one""two"'
'"one"\n"two"'
'"one"\n\r"two"'
'"one" "two"'
'"one" \n "two"'
'"one" \n\r "two"'
If it's too difficult to allow carriage returns within JSON objects and also between them (while not allowing incomplete JSON objects to make the parser hang), we could make the definition of a "JSONS" stream more strict: we could adopt a rule that valid JSON documents must be separated by carriage returns, but allow any whitespace. (That also solves the \n
vs \n\r
problem: we can consider \n
to be the true carriage return and \r
is just whitespace, like
and \t
.) That reduces the set of valid examples above. But even with this restrictive rule, blank lines should be allowed and skipped:
"""{"one": 1, "two": 2.2}
{"one": 10, "two": 22}"""
"""{"one": 1, "two": 2.2}
{"one": 10, "two": 22}
"""
"""1
2
3"""
"""1
2
3
"""
That's probably the easiest way to allow the stream to end with a newline.
{"one": 1, \t "two": 2.2}{"one": 10, "two": 22} | ||
{"one": 1, \n"two": 2.2}{"one": 10, "two": 22} | ||
{"one": 1, \n\r"two": 2.2}{"one": 10, "two": 22} | ||
{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22} | ||
{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22} | ||
{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}\n | ||
{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}\n\r | ||
{"one": 1, "two": 2.2}{"one": 10, \n"two": 22} | ||
{"one": 1, "two": 2.2}{"one": 10, \n\r"two": 22} | ||
{"one": 1, "two": 2.2}\n{"one": 10, \n"two": 22} | ||
{"one": 1, "two": 2.2}\n\r{"one": 10, \n\r"two": 22} | ||
["one", \n "two"]["uno", "dos"] | ||
["one", \n "two"]\n["uno", "dos"] | ||
["one", \n\r "two"]["uno", "dos"] | ||
["one", \n\r "two"]\n\r["uno", "dos"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not carriage return/tab characters: they're literal backslash and n
, r
, t
in the text file. I meant for these to be independent tests (given as Python strings in the Python test file). If they're in a standalone JSON file like this, it's all one test, rather than testing features like "extra carriage return in the first line" and "extra carriage return in the second line," etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Yes, the strings are added as independent tests and the tests pass. I'll remove the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
…he file name and a string
@jpivarski - the tests pass. There is a workaround for a string that is not a file name. The following change would help to eliminate the need to distinguish between a file name and a string. Would that be satisfactory? diff --git a/src/awkward1/operations/convert.py b/src/awkward1/operations/convert.py
index 6ffb080..11c135b 100644
--- a/src/awkward1/operations/convert.py
+++ b/src/awkward1/operations/convert.py
@@ -6,6 +6,8 @@ import numbers
import json
import collections
import math
+import io
+import os
import threading
import distutils.version
@@ -842,7 +844,7 @@ def from_json(source,
See also #ak.to_json.
"""
layout = awkward1._ext.fromjson(
- source,
+ source if not os.path.isfile(source) else open(source, "r").read(),
nan_string=nan_string,
infinity_string=infinity_string,
minus_infinity_string=minus_infinity_string, then there would be only need for |
Currently, have you put in any changes to let As for this proposed solution: it reads the whole JSON file as a string before doing any conversion. If a user is passing a filename or Python file object to the function, they're not expecting that, and very large files would then be a problem. (Note: a filled Awkward Array uses much less memory than the text of a JSON file.) We do need >>> file = open("setup.py", "r")
>>> file.fileno()
3 (Don't check The |
No, I did not push any changes - all this is in my private area :-)
Yes, you are right.
Ok, thanks.
Please, let me know if I should continue working on this PR or is it ok asis to be merged? Thanks! |
I just put in one correction, hiding I also noticed this: >>> ak.from_json('["one", \n"two"] \n ["three"]')
<Array ['one', 'two', 'three'] type='3 * string'> The output should be
because each separate JSON document should be separate. Sorry that I wasn't clear when listing test cases. Meanwhile, >>> ak.from_json('{"x": 1, "y": 2} \n {"x": 10, "y": 20}')
<Array [[{x: 1, y: 2}], [{x: 10, y: 20}]] type='2 * var * {"x": int64, "y": int64}'> is too deep: it should be
I'll be in a meeting now, so I'm not working on that now. I'll check in again tomorrow. Thanks! As for passing in the |
done.
done.
The last thing on my list is parsing a string that starts with a number: str = """ "1
2
3 " """
array = awkward1.from_json(str)
> assert awkward1.to_list(array) == [1, 2, 3]
E assert [2, 3] == [1, 2, 3]
E At index 0 diff: 2 != 1
E Right contains one more item: 3
E Full diff:
E - [2, 3]
E + [1, 2, 3]
E ? +++ |
@jpivarski - I think, I'll close it and make a new PR after the name migration. |
Address Issue #437
"NaN"
and"Inf"
strings asNaN
andInf
floats"NaN"
and"Inf"
strings as other "predefined" strings@jpivarski - that seems to be working out of the box. I'm creating the PR to test it on different architectures.