
Options for "NaN" strings as NaN floats #543

Closed
ianna wants to merge 27 commits into scikit-hep/awkward-1.0 from ianna/handle_stream_of_many_JSON_files

Conversation

ianna
Collaborator

@ianna ianna commented Nov 17, 2020

Address Issue #437

  • stream JSON fragments from a JSON file
  • stream JSON fragments from a string (same as above?)
  • convert "NaN" and "Inf" strings to NaN and Inf floats
  • convert "NaN" and "Inf" strings to other "predefined" strings
  • allow the user to redefine the "from_string" and "to_string" values (the UI?)

@jpivarski - that seems to be working out of the box. I'm creating the PR to test it on different architectures.

@ianna ianna marked this pull request as draft November 17, 2020 15:58
@jpivarski
Member

You're right: we shouldn't support single quotes. That's a Python print-out, not JSON.

This PR isn't so much about supporting @aswin1447's specific files as it is about the general feature of handling data like this:

{"x": 1.1, "y": []}
{"x": 2.2, "y": [1]}
{"x": 3.3, "y": [1, 2]}
{"x": 4.4, "y": [1, 2, 3]}
{"x": 5.5, "y": [1, 2, 3, 4]}
{"x": 6.6, "y": [1, 2, 3, 4, 5]}

as though it were in a single list with commas in between. This wouldn't be done by literally transforming the text: see

https://github.com/scikit-hep/awkward-1.0/blob/05ca0f3202e3aa2886bce1b7e811ebde52289a40/src/python/io.cpp#L20-L79

and

https://github.com/scikit-hep/awkward-1.0/blob/05ca0f3202e3aa2886bce1b7e811ebde52289a40/src/awkward1/operations/convert.py#L827-L833

It's already doing some funny business with ArrayBuilder's beginlist and endlist to support records; just a little more tweaking would be required to tell RapidJSON to keep parsing after a JSON document is done and to handle the beginlist/endlist calls so that it returns an array, rather than a record. It might not be necessary to rely on carriage returns (\n or \r\n) to separate JSON items in a stream, though this is a very common case. (In principle, it should be enough to see that the RapidJSON parser has come to the end of a valid JSON object but there's more text left; JSON provides its own delimiters.)
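
(As a pure-Python illustration of that idea only, not the implementation: the standard-library decoder's raw_decode can consume one document at a time from a concatenated stream. In this PR the equivalent logic would live in the C++ RapidJSON SAX handler.)

    import json

    def iter_json_documents(text):
        # Illustration only: yield each complete JSON document from a stream
        # of concatenated documents, skipping whitespace between them.
        decoder = json.JSONDecoder()
        pos = 0
        while pos < len(text):
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos == len(text):
                break
            obj, pos = decoder.raw_decode(text, pos)
            yield obj

    docs = list(iter_json_documents('{"x": 1.1, "y": []}\n{"x": 2.2, "y": [1]}'))
    # docs == [{'x': 1.1, 'y': []}, {'x': 2.2, 'y': [1]}]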

The other part of this is to optionally read and write specified strings as floating point nan, inf, and -inf. The nicest interface would let users specify the strings that should be interpreted as these values, with no such interpretation being the default (each option can be None or a string). The place where this would be applied is:

https://github.com/scikit-hep/awkward-1.0/blob/master/src/libawkward/io/json.cpp

(This is also the only place where we can use RapidJSON directly—it's hidden from the interface/header files so that downstream users of the library won't also depend on RapidJSON. That's why some of the calling conventions are convoluted.)

This PR can also include issue #192, passing a Python file pointer down to the C++ (which also mentions this issue of taking a stream of JSON objects, by the way).

@ianna
Collaborator Author

ianna commented Nov 18, 2020

Sure, the issues are related. I was thinking of using a generator to stream the JSON fragments:

    import json

    def stream_read_json(fn):
        # Yield JSON documents one by one from a file of concatenated documents.
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    # If the rest of the file is one valid document, emit it and stop.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # Otherwise parse only up to where the decoder stopped,
                    # then continue from that position.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    start_pos += e.pos
                    yield obj

but I'm not sure about performance - it's way too slow or the test file is way too big :-)

@jpivarski
Member

but I'm not sure about performance - it's way too slow or the test file is way too big :-)

It would be much better to tweak the rules under which ArrayBuilder::beginlist() and ArrayBuilder::endlist() are called. The difference would be two function calls (in C++) versus modifying the text stream. Generally, these data files are big. (That's why we use RapidJSON's SAX parser, rather than creating a DOM—it can make the difference between having or not having enough memory in realistic use-cases.)

@ianna
Collaborator Author

ianna commented Nov 19, 2020

(*) The failing integration tests are due to a C++17 string_view. I'll deal with it another way :-)

@jpivarski - I ran into a curious comparison error. The two sides of the reported diff seem to be identical...
Are the numpy.inf and numpy.nan handled as I expect? What shall I use instead? Thanks

>       assert awkward1.to_list(array) == [1.1, 2.2, 3.3, numpy.inf, -numpy.inf,
            [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],
            [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],
            [[[18.18,   19.19,   20.2,    21.21,   22.22],
              [23.23,   24.24,   25.25,   26.26,   27.27,
               28.28,   29.29,   30.3,    31.31,   32.32,
               33.33,   34.34,   35.35,   36.36,   37.37],
              [38.38],
              [39.39, 40.4, numpy.nan, numpy.nan, 41.41, 42.42, 43.43]],
             [[44.44,   45.45,   46.46,   47.47,   48.48],
              [49.49,   50.5,    51.51,   52.52,   53.53,
               54.54,   55.55,   56.56,   57.57,   58.58,
               59.59,   60.6,    61.61,   62.62,   63.63],
              [64.64],
              [65.65, 66.66, numpy.nan, numpy.nan, 67.67, 68.68, 69.69]]]]
E       assert [1.1,\n 2.2,\n 3.3,\n inf,\n -inf,\n [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],\n [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],\n [[[18.18, 19.19, 20.2, 21.21, 22.22],\n   [23.23,\n    24.24,\n    25.25,\n    26.26,\n    27.27,\n    28.28,\n    29.29,\n    30.3,\n    31.31,\n    32.32,\n    33.33,\n    34.34,\n    35.35,\n    36.36,\n    37.37],\n   [38.38],\n   [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],\n  [[44.44, 45.45, 46.46, 47.47, 48.48],\n   [49.49,\n    50.5,\n    51.51,\n    52.52,\n    53.53,\n    54.54,\n    55.55,\n    56.56,\n    57.57,\n    58.58,\n    59.59,\n    60.6,\n    61.61,\n    62.62,\n    63.63],\n   [64.64],\n   [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]] == [1.1,\n 2.2,\n 3.3,\n inf,\n -inf,\n [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],\n [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],\n [[[18.18, 19.19, 20.2, 21.21, 22.22],\n   [23.23,\n    24.24,\n    25.25,\n    26.26,\n    27.27,\n    28.28,\n    29.29,\n    30.3,\n    31.31,\n    32.32,\n    33.33,\n    34.34,\n    35.35,\n    36.36,\n    37.37],\n   [38.38],\n   [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],\n  [[44.44, 45.45, 46.46, 47.47, 48.48],\n   [49.49,\n    50.5,\n    51.51,\n    52.52,\n    53.53,\n    54.54,\n    55.55,\n    56.56,\n    57.57,\n    58.58,\n    59.59,\n    60.6,\n    61.61,\n    62.62,\n    63.63],\n   [64.64],\n   [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]]
E         At index 7 diff: [[[18.18, 19.19, 20.2, 21.21, 22.22], [23.23, 24.24, 25.25, 26.26, 27.27, 28.28, 29.29, 30.3, 31.31, 32.32, 33.33, 34.34, 35.35, 36.36, 37.37], [38.38], [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]], [[44.44, 45.45, 46.46, 47.47, 48.48], [49.49, 50.5, 51.51, 52.52, 53.53, 54.54, 55.55, 56.56, 57.57, 58.58, 59.59, 60.6, 61.61, 62.62, 63.63], [64.64], [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]] != [[[18.18, 19.19, 20.2, 21.21, 22.22], [23.23, 24.24, 25.25, 26.26, 27.27, 28.28, 29.29, 30.3, 31.31, 32.32, 33.33, 34.34, 35.35, 36.36, 37.37], [38.38], [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]], [[44.44, 45.45, 46.46, 47.47, 48.48], [49.49, 50.5, 51.51, 52.52, 53.53, 54.54, 55.55, 56.56, 57.57, 58.58, 59.59, 60.6, 61.61, 62.62, 63.63], [64.64], [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]
E         Full diff:
E           [
E            1.1,
E            2.2,
E            3.3,
E            inf,
E            -inf,
E            [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],
E            [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],
E            [[[18.18, 19.19, 20.2, 21.21, 22.22],
E              [23.23,
E               24.24,
E               25.25,
E               26.26,
E               27.27,
E               28.28,
E               29.29,
E               30.3,
E               31.31,
E               32.32,
E               33.33,
E               34.34,
E               35.35,
E               36.36,
E               37.37],
E              [38.38],
E              [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],
E             [[44.44, 45.45, 46.46, 47.47, 48.48],
E              [49.49,
E               50.5,
E               51.51,
E               52.52,
E               53.53,
E               54.54,
E               55.55,
E               56.56,
E               57.57,
E               58.58,
E               59.59,
E               60.6,
E               61.61,
E               62.62,
E               63.63],
E              [64.64],
E              [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]],
E           ]

tests/test_0437-stream-of-many-json-files.py:26: AssertionError

@jpivarski
Member

Are the numpy.inf and numpy.nan handled as I expect? What shall I use instead?

Maybe it's because NaN is not equal to NaN (according to IEEE rules):

>>> float("nan") == float("nan")
False
>>> np.nan == np.nan
False

A suggestion on StackOverflow is to replace the NaN values with strings, which can be done after the conversion to Python lists.

>>> data1 = [[44.44,   45.45,   46.46,   47.47,   48.48],
...          [49.49,   50.5,    51.51,   52.52,   53.53,
...           54.54,   55.55,   56.56,   57.57,   58.58,
...           59.59,   60.6,    61.61,   62.62,   63.63],
...          [64.64],
...          [65.65, 66.66, float("nan"), float("nan"), 67.67, 68.68, 69.69]]
>>> data2 = [[44.44,   45.45,   46.46,   47.47,   48.48],
...          [49.49,   50.5,    51.51,   52.52,   53.53,
...           54.54,   55.55,   56.56,   57.57,   58.58,
...           59.59,   60.6,    61.61,   62.62,   63.63],
...          [64.64],
...          [65.65, 66.66, float("nan"), float("nan"), 67.67, 68.68, 69.69]]
>>> data1 == data2
False
>>> 
>>> def fix(x):
...     if isinstance(x, dict):
...         return dict((k, fix(v)) for k, v in x.items())
...     elif isinstance(x, list):
...         return [fix(v) for v in x]
...     elif isinstance(x, tuple):
...         return tuple(fix(v) for v in x)
...     elif np.isnan(x):
...         return "nan"
...     else:
...         return x
... 
>>> fix(data1) == fix(data2)
True

There's no such problem with infinity:

>>> float("inf") == float("inf")
True
>>> float("inf") == float("-inf")
False
>>> float("-inf") == float("-inf")
True
>>> np.inf == np.inf
True
>>> np.inf == -np.inf
False
>>> -np.inf == -np.inf
True

In the IEEE floating point definition, only one bit pattern corresponds to infinity and one to minus infinity, but there's a large space of bit patterns that all correspond to not-a-number. Here are two of them:

>>> np.array([0, 0, 0, 0, 0, 0, 248, 127], dtype=np.uint8).view(np.float64)
array([nan])
>>> np.array([0, 0, 0, 0, 0, 0, 249, 127], dtype=np.uint8).view(np.float64)
array([nan])

Rather than requiring implementations of IEEE floating point arithmetic to complicate equality checks by testing whether both sides are in the "not-a-number" class, NaN values are simply defined as not being equal to each other. (After all, most implementations these days are in hardware, and you want to keep that as simple as possible.)
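
(For completeness, an alternative to the string-replacement trick above is a NaN-aware recursive comparison, sketched here:)

    import math

    def nan_equal(a, b):
        # Treat NaN as equal to NaN for test comparisons; everything else uses
        # ordinary equality. Handles nested lists and dicts.
        if isinstance(a, list) and isinstance(b, list):
            return len(a) == len(b) and all(nan_equal(x, y) for x, y in zip(a, b))
        if isinstance(a, dict) and isinstance(b, dict):
            return a.keys() == b.keys() and all(nan_equal(a[k], b[k]) for k in a)
        if isinstance(a, float) and isinstance(b, float):
            return (math.isnan(a) and math.isnan(b)) or a == b
        return a == b

    nan_equal([1.1, float("nan")], [1.1, float("nan")])   # True
    [1.1, float("nan")] == [1.1, float("nan")]            # False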

@ianna ianna requested a review from jpivarski November 20, 2020 13:28
Member

@jpivarski jpivarski left a comment


Oh, I'd rather these options not be a part of ArrayBuilderOptions, since they only apply to JSON (as they work around a JSON limitation). ArrayBuilder is used in a lot of contexts other than JSON. If there's some reason why it would be very difficult to do this, then we can consider this an internal detail, since it's not exposed to the Python interface to ArrayBuilder, but it's looking to me like a bad direction to grow.

The reason we have ArrayBuilderOptions is because ArrayBuilder makes a tree of indeterminate depth. At any moment, it could encounter new data that broadens the type of its array, so each node needs to have access to the configuration options, which includes such things as "how big should the initial allocation be?"

These JSON options don't need to be pushed down this way. The JSON-handling code is all in a SAX parser class that doesn't create nodes of a tree (it calls methods on ArrayBuilder, which internally builds those nodes). So the JSON options don't have to be "pushed down" in the same way. That's why they don't really belong in ArrayBuilderOptions.

Also, about the options themselves: users need to be able to specify the spelling of "nan", "inf", and "-inf", since there's no standard—different JSON writers generate different strings: "NaN", "Inf", "-Inf", "Infinity", "-Infinity", and "minusInfinity" are all things I've seen. A single dataset won't have multiple spellings, so these are three new options:

  • the string that gets mapped to std::numeric_limits<double>::quiet_NaN()
  • the string that gets mapped to std::numeric_limits<double>::infinity()
  • the string that gets mapped to -std::numeric_limits<double>::infinity()

any of which can be None (i.e. no string that gets mapped to a numeric limit).
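
(Illustration only: a rough pure-Python sketch of those semantics. The helper name is hypothetical and the post-processing approach is not how the PR implements it; the real mapping happens inside the C++ SAX handler in json.cpp while parsing.)

    import json
    import math

    def map_special_strings(obj, nan_string=None, infinity_string=None,
                            minus_infinity_string=None):
        # Build the string-to-float mapping from whichever options are set.
        mapping = {}
        if nan_string is not None:
            mapping[nan_string] = math.nan
        if infinity_string is not None:
            mapping[infinity_string] = math.inf
        if minus_infinity_string is not None:
            mapping[minus_infinity_string] = -math.inf

        def walk(x):
            if isinstance(x, dict):
                return {k: walk(v) for k, v in x.items()}
            if isinstance(x, list):
                return [walk(v) for v in x]
            if isinstance(x, str):
                return mapping.get(x, x)
            return x

        return walk(obj)

    data = json.loads('[1.1, "NaN", "Infinity", "-Infinity"]')
    map_special_strings(data, nan_string="NaN", infinity_string="Infinity",
                        minus_infinity_string="-Infinity")
    # -> [1.1, nan, inf, -inf]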

@ianna
Collaborator Author

ianna commented Nov 20, 2020

Oh, I'd rather these options not be a part of ArrayBuilderOptions, since they only apply to JSON (as they work around a JSON limitation). ArrayBuilder is used in a lot of contexts other than JSON. If there's some reason why it would be very difficult to do this, then we can consider this an internal detail, since it's not exposed to the Python interface to ArrayBuilder, but it's looking to me like a bad direction to grow.

The reason we have ArrayBuilderOptions is because ArrayBuilder makes a tree of indeterminate depth. At any moment, it could encounter new data that broadens the type of its array, so each node needs to have access to the configuration options, which includes such things as "how big should the initial allocation be?"

These JSON options don't need to be pushed down this way. The JSON-handling code is all in a SAX parser class that doesn't create nodes of a tree (it calls methods on ArrayBuilder, which internally builds those nodes). So the JSON options don't have to be "pushed down" in the same way. That's why they don't really belong in ArrayBuilderOptions.

done. It's the *Options that confused me :-)

Also, about the options themselves: users need to be able to specify the spelling of "nan", "inf", and "-inf", since there's no standard—different JSON writers generate different strings: "NaN", "Inf", "-Inf", "Infinity", "-Infinity", and "minusInfinity" are all things I've seen. A single dataset won't have multiple spellings, so these are three new options:

  • the string that gets mapped to std::numeric_limits<double>::quiet_NaN()
  • the string that gets mapped to std::numeric_limits<double>::infinity()
  • the string that gets mapped to -std::numeric_limits<double>::infinity()

reimplemented. There are a few parameters with default definitions that can be overridden by the user. It's a long-ish list. Please comment.

convertNanAndInf=False # to float
replaceNanAndInf=False # with another string
fromNan='NaN'
fromInf='inf'
fromMinusInf='-inf'
toNan='NaN'
toInf='inf'
toMinusInf='-inf'

for example:

    array = awkward1.from_json('tests/samples/test-nan-inf.json', convertNanAndInf=True,
        fromInf='Infinity', fromNan='None')

any of which can be None (i.e. no string that gets mapped to a numeric limit).

looking into it :-) Thanks!

@jpivarski
Member

The to_json and from_json functions are separate and can get separate sets of options; neither needs all 6, each needs only 3. They can take the same names, too:

def to_json(
    array,
    destination=None,
    pretty=False,
    maxdecimals=None,
    nan_string=None,
    infinity_string=None,
    minus_infinity_string=None,
    buffersize=65536,
):
    ...

def from_json(
    source,
    nan_string=None,
    infinity_string=None,
    minus_infinity_string=None,
    highlevel=True,
    behavior=None,
    initial=1024,
    resize=1.5,
    buffersize=65536,
):
    ...

Giving them the same names highlights the symmetry. I've grouped them here with the other JSON parameters (pretty and maxdecimals in to_json), before highlevel and behavior, which are usually last; the ArrayBuilder parameters initial, resize, and buffersize are of even less interest, so they come even later.

There needs to be a distinction between "no conversion string" and "the conversion string that is 'None'", since the latter is a possible encoding of NaN.
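
(A hypothetical usage sketch, assuming the proposed keywords above land as written; it is not the final API:)

    import awkward1

    # Hypothetical call: assumes from_json grows the keywords proposed above.
    array = awkward1.from_json(
        '[1.1, 2.2, "NaN", "Infinity", "-Infinity"]',
        nan_string="NaN",
        infinity_string="Infinity",
        minus_infinity_string="-Infinity",
    )
    # Note the distinction: nan_string="None" would map the literal string
    # "None" to nan, whereas nan_string=None (the default) maps nothing.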

@ianna ianna marked this pull request as ready for review November 23, 2020 17:25
@ianna ianna requested a review from jpivarski November 23, 2020 17:25
Member

@jpivarski jpivarski left a comment


This is looking great, thanks! Just a few inline comments.

Also, the "awkward1" → "awkward" name change will be happening Nov 30 or Dec 1, so you'll want this PR to be done by then. See details about the name change here:

https://github.com/scikit-hep/awkward-1.0/wiki/Name-change-procedure

Inline review comments (resolved): several on src/libawkward/io/json.cpp, plus src/python/content.cpp and tests/test_0437-stream-of-many-json-files.py
@ianna
Collaborator Author

ianna commented Nov 24, 2020

Thanks, @jpivarski! I think I'm done with this PR.

Member

@jpivarski jpivarski left a comment


Here are some cases to test. These first ones should raise an error (ValueError, as usual), but not hang:

"""{"one": 1, "two": 2.2,"""
"""{"one": 1,
"two": 2.2,"""
"""{"one": 1,
"two": 2.2,
"""
"""{"one": 1, "two": 2.2, "three": "THREE"}
{"one": 10, "two": 22,"""
"""{"one": 1, "two": 2.2, "three": "THREE"}
{"one": 10, "two": 22,
"""
"""["one", "two","""
"""["one",
"two","""
"""["one",
"two",
"""

And these should recognize that there are two JSON objects in the stream, despite the whitespace:

'{"one": 1, "two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}     {"one": 10, "two": 22}'
'{"one": 1, \t "two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}  \t   {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}\n{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}\n\r{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}     \n     {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}     \n\r     {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}\n'
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}\n\r'
'{"one": 1, "two": 2.2}\n{"one": 10, "two": 22}\n'
'{"one": 1, "two": 2.2}\n\r{"one": 10, "two": 22}\n\r'
'{"one": 1, \n"two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, \n\r"two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}'
'{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}'
'{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}\n'
'{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}\n\r'
'{"one": 1, "two": 2.2}{"one": 10, \n"two": 22}'
'{"one": 1, "two": 2.2}{"one": 10, \n\r"two": 22}'
'{"one": 1, "two": 2.2}\n{"one": 10, \n"two": 22}'
'{"one": 1, "two": 2.2}\n\r{"one": 10, \n\r"two": 22}'
'["one", "two"]["uno", "dos"]'
'["one", "two"]\n["uno", "dos"]'
'["one", "two"]\n\r["uno", "dos"]'
'["one", "two"]     ["uno", "dos"]'
'["one", "two"]  \n   ["uno", "dos"]'
'["one", "two"]  \n\r   ["uno", "dos"]'
'["one", \n "two"]["uno", "dos"]'
'["one", \n "two"]\n["uno", "dos"]'
'["one", \n\r "two"]["uno", "dos"]'
'["one", \n\r "two"]\n\r["uno", "dos"]'
'"one""two"'
'"one"\n"two"'
'"one"\n\r"two"'
'"one"     "two"'
'"one"  \n   "two"'
'"one"  \n\r   "two"'

If it's too difficult to allow carriage returns within JSON objects and also between them (while not allowing incomplete JSON objects to make the parser hang), we could make the definition of a "JSONS" stream more strict: we could adopt a rule that valid JSON documents must be separated by carriage returns, but allow any whitespace. (That also solves the \n vs \n\r problem: we can consider \n to be the true carriage return and \r is just whitespace, like spaces and \t.) That reduces the set of valid examples above. But even with this restrictive rule, blank lines should be allowed and skipped:

"""{"one": 1, "two": 2.2}

{"one": 10, "two": 22}"""
"""{"one": 1, "two": 2.2}

{"one": 10, "two": 22}
"""
"""1
2

3"""
"""1
     2

3
"""

That's probably the easiest way to allow the stream to end with a newline.
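
(Under that stricter rule, the stream handling reduces to something like this pure-Python sketch, for illustration only; the real parsing would stay in the C++ SAX handler:)

    import json

    def parse_json_lines(text):
        # One JSON document per line; blank (whitespace-only) lines are
        # skipped, and \r is treated as ordinary whitespace.
        out = []
        for line in text.split("\n"):
            if line.strip():
                out.append(json.loads(line))
        return out

    parse_json_lines('{"one": 1, "two": 2.2}\n\n{"one": 10, "two": 22}\n')
    # -> [{'one': 1, 'two': 2.2}, {'one': 10, 'two': 22}]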

Comment on lines 1 to 15
{"one": 1, \t "two": 2.2}{"one": 10, "two": 22}
{"one": 1, \n"two": 2.2}{"one": 10, "two": 22}
{"one": 1, \n\r"two": 2.2}{"one": 10, "two": 22}
{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}
{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}
{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}\n
{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}\n\r
{"one": 1, "two": 2.2}{"one": 10, \n"two": 22}
{"one": 1, "two": 2.2}{"one": 10, \n\r"two": 22}
{"one": 1, "two": 2.2}\n{"one": 10, \n"two": 22}
{"one": 1, "two": 2.2}\n\r{"one": 10, \n\r"two": 22}
["one", \n "two"]["uno", "dos"]
["one", \n "two"]\n["uno", "dos"]
["one", \n\r "two"]["uno", "dos"]
["one", \n\r "two"]\n\r["uno", "dos"]
Member


These are not carriage return/tab characters: they're literal backslash and n, r, t in the text file. I meant for these to be independent tests (given as Python strings in the Python test file). If they're in a standalone JSON file like this, it's all one test, rather than testing features like "extra carriage return in the first line" and "extra carriage return in the second line," etc.

Collaborator Author


Thanks! Yes, the strings are added as independent tests and the tests pass. I'll remove the file.

Collaborator Author


done

@ianna
Collaborator Author

ianna commented Nov 25, 2020

@jpivarski - the tests pass. There is a workaround for the case where the source string is not a file name.

The following change would help to eliminate the need to distinguish between a file name and a string. Would that be satisfactory?

diff --git a/src/awkward1/operations/convert.py b/src/awkward1/operations/convert.py
index 6ffb080..11c135b 100644
--- a/src/awkward1/operations/convert.py
+++ b/src/awkward1/operations/convert.py
@@ -6,6 +6,8 @@ import numbers
 import json
 import collections
 import math
+import io
+import os
 import threading
 import distutils.version
 
@@ -842,7 +844,7 @@ def from_json(source,
     See also #ak.to_json.
     """
     layout = awkward1._ext.fromjson(
-        source,
+        source if not os.path.isfile(source) else open(source, "r").read(),
         nan_string=nan_string,
         infinity_string=infinity_string,
         minus_infinity_string=minus_infinity_string,

then only ak::FromJsonString would be needed and FromJsonFile could be deleted.
Please let me know if that's the direction you'd like me to take and if you'd like this to be part of the PR. I've got it working locally and all tests pass. Thanks!

@jpivarski
Member

Currently, have you put in any changes to let to_json and from_json accept a Python file object? I didn't see any such change in the GitHub diff, just the diff you quoted. (I'm only asking because I want to know if there's some code I'm supposed to look at.)

As for this proposed solution: it reads the whole JSON file as a string before doing any conversion. If a user is passing a filename or Python file object to the function, they're not expecting that, and very large files would then be a problem. (Note: a filled Awkward Array uses much less memory than the text of a JSON file.)

We do need FromJsonFile to do this incremental parsing. That's why we're using RapidJSON's SAX parser. If given a Python file object, you can get its file-handle as an integer:

>>> file = open("setup.py", "r")
>>> file.fileno()
3

(Don't check isinstance to see if it's a plain file—a lot of interfaces in Python have this fileno() method so they can be used interchangeably. Do hasattr(file, "fileno") to check for it. This also doesn't cover a common Python case of file-like objects that don't have a fileno but do have a read method. We can put that off for later, as it will require a FromJsonStringIncremental, which simply hasn't been written yet.)
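
(For illustration, the Python-level dispatch could look roughly like this; the routing names are made up and FromJsonStringIncremental does not exist yet:)

    import os

    def dispatch_json_source(source):
        # Illustrative only: decide how from_json might route its input.
        if hasattr(source, "fileno"):
            return ("fileno", source.fileno())     # hand the file descriptor to C++
        elif hasattr(source, "read"):
            return ("in-memory", source.read())    # file-like object without fileno
        elif os.path.isfile(source):
            return ("filename", source)            # incremental parse from the file
        else:
            return ("json-text", source)           # treat the string as JSON itself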

The source of from_json has a weakness that if a string is both a filename and valid JSON, it's ambiguous how to parse it. I think we pick "filename" in these cases. Having a filename that is valid JSON is weird.

@ianna
Collaborator Author

ianna commented Nov 25, 2020

Currently, have you put in any changes to let to_json and from_json accept a Python file object? I didn't see any such change in the GitHub diff, just the diff you quoted. (I'm only asking because I want to know if there's some code I'm supposed to look at.)

No, I did not push any changes - all this is in my private area :-)

As for this proposed solution: it reads the whole JSON file as a string before doing any conversion. If a user is passing a filename or Python file object to the function, they're not expecting that, and very large files would then be a problem. (Note: a filled Awkward Array uses much less memory than the text of a JSON file.)

Yes, you are right.

We do need FromJsonFile to do this incremental parsing. That's why we're using RapidJSON's SAX parser. If given a Python file object, you can get its file-handle as an integer:

>>> file = open("setup.py", "r")
>>> file.fileno()
3

(Don't check isinstance to see if it's a plain file—a lot of interfaces in Python have this fileno() method so they can be used interchangeably. Do hasattr(file, "fileno") to check for it. This also doesn't cover a common Python case of file-like objects that don't have a fileno but do have a read method. We can put that off for later, as it will require a FromJsonStringIncremental, which simply hasn't been written yet.)

Ok, thanks.

The source of from_json has a weakness that if a string is both a filename and valid JSON, it's ambiguous how to parse it. I think we pick "filename" in these cases. Having a filename that is valid JSON is weird.

Please let me know if I should continue working on this PR or if it's OK as-is to be merged. Thanks!

@jpivarski
Member

I just put in one correction, hiding nan_and_inf_as_float from users. (That's an optimization that should be implicit from the nan_string, infinity_string, or minus_infinity_string being None.)

I also noticed this:

>>> ak.from_json('["one", \n"two"] \n ["three"]')
<Array ['one', 'two', 'three'] type='3 * string'>

The output should be

[['one', 'two'], ['three']]

because each JSON document in the stream should remain separate. Sorry that I wasn't clear when listing the test cases.

Meanwhile,

>>> ak.from_json('{"x": 1, "y": 2} \n {"x": 10, "y": 20}')
<Array [[{x: 1, y: 2}], [{x: 10, y: 20}]] type='2 * var * {"x": int64, "y": int64}'>

is too deep: it should be

[{'x': 1, 'y': 2}, {'x': 10, 'y': 20}]

I'll be in a meeting now, so I'm not working on that now. I'll check in again tomorrow. Thanks!

As for passing in the FILE handle with fileno, that doesn't need to be in this PR.

@ianna
Collaborator Author

ianna commented Nov 27, 2020

I just put in one correction, hiding nan_and_inf_as_float from users. (That's an optimization that should be implicit from the nan_string, infinity_string, or minus_infinity_string being None.)

I also noticed this:

>>> ak.from_json('["one", \n"two"] \n ["three"]')
<Array ['one', 'two', 'three'] type='3 * string'>

The output should be

[['one', 'two'], ['three']]

because each JSON document in the stream should remain separate. Sorry that I wasn't clear when listing the test cases.

done.

Meanwhile,

>>> ak.from_json('{"x": 1, "y": 2} \n {"x": 10, "y": 20}')
<Array [[{x: 1, y: 2}], [{x: 10, y: 20}]] type='2 * var * {"x": int64, "y": int64}'>

is too deep: it should be

[{'x': 1, 'y': 2}, {'x': 10, 'y': 20}]

done.

I'll be in a meeting now, so I'm not working on that now. I'll check in again tomorrow. Thanks!

As for passing in the FILE handle with fileno, that doesn't need to be in this PR.

The last thing on my list is parsing a string that starts with a number:

        str = """ "1
        2
    
        3 " """
        array = awkward1.from_json(str)
>       assert awkward1.to_list(array) == [1, 2, 3]
E       assert [2, 3] == [1, 2, 3]
E         At index 0 diff: 2 != 1
E         Right contains one more item: 3
E         Full diff:
E         - [2, 3]
E         + [1, 2, 3]
E         ?  +++

@ianna
Collaborator Author

ianna commented Nov 30, 2020

@jpivarski - I think I'll close it and make a new PR after the name migration.

@ianna ianna closed this Nov 30, 2020
@ianna ianna deleted the ianna/handle_stream_of_many_JSON_files branch December 4, 2020 08:42
@jpivarski jpivarski mentioned this pull request Jan 28, 2021