
Options for "NaN" strings as NaN floats #543

Closed
ianna wants to merge 27 commits into scikit-hep/awkward-1.0 from ianna/handle_stream_of_many_JSON_files

Conversation

ianna
Collaborator

@ianna ianna commented Nov 17, 2020

Address Issue #437

  • stream JSON fragments from a JSON file
  • stream JSON fragments from a string (same as above?)
  • convert "NaN" and "Inf" strings to NaN and Inf floats
  • convert "NaN" and "Inf" strings to other "predefined" strings
  • allow the user to redefine the "from_string" and "to_string" values (the UI?)

@jpivarski - that seems to be working out of the box. I'm creating the PR to test it on different architectures.

@ianna ianna marked this pull request as draft November 17, 2020 15:58
@jpivarski
Member

You're right: we shouldn't support single quotes. That's a Python print-out, not JSON.

This PR isn't so much about supporting @aswin1447's specific files as it is about the general feature of handling data like this:

{"x": 1.1, "y": []}
{"x": 2.2, "y": [1]}
{"x": 3.3, "y": [1, 2]}
{"x": 4.4, "y": [1, 2, 3]}
{"x": 5.5, "y": [1, 2, 3, 4]}
{"x": 6.6, "y": [1, 2, 3, 4, 5]}

as though it were in a single list with commas in between. This wouldn't be done by literally transforming the text: see

https://github.com/scikit-hep/awkward-1.0/blob/05ca0f3202e3aa2886bce1b7e811ebde52289a40/src/python/io.cpp#L20-L79

and

https://github.com/scikit-hep/awkward-1.0/blob/05ca0f3202e3aa2886bce1b7e811ebde52289a40/src/awkward1/operations/convert.py#L827-L833

It's already doing some funny business with ArrayBuilder's beginlist and endlist to support records; just a little more tweaking would be required to tell RapidJSON to keep parsing after a JSON document is done and to handle the beginlist/endlist calls so that it returns an array, rather than a record. It might not be necessary to rely on carriage returns (\n or \r\n) to separate JSON items in a stream, though this is a very common case. (In principle, it should be enough to see that the RapidJSON parser has come to the end of a valid JSON object but there's more text left; JSON provides its own delimiters.)
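
(As a pure-Python illustration of that idea only, not the implementation: the standard-library decoder's raw_decode can consume one document at a time from a concatenated stream. In this PR the equivalent logic would live in the C++ RapidJSON SAX handler.)

    import json

    def iter_json_documents(text):
        # Illustration only: yield each complete JSON document from a stream
        # of concatenated documents, skipping whitespace between them.
        decoder = json.JSONDecoder()
        pos = 0
        while pos < len(text):
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos == len(text):
                break
            obj, pos = decoder.raw_decode(text, pos)
            yield obj

    docs = list(iter_json_documents('{"x": 1.1, "y": []}\n{"x": 2.2, "y": [1]}'))
    # docs == [{'x': 1.1, 'y': []}, {'x': 2.2, 'y': [1]}]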

The other part of this is to optionally read and write specified strings as floating point nan, inf, and -inf. The nicest interface would let users specify the strings that should be interpreted as these values, with no such interpretation being the default (each option can be None or a string). The place where this would be applied is:

https://github.com/scikit-hep/awkward-1.0/blob/master/src/libawkward/io/json.cpp

(This is also the only place where we can use RapidJSON directly—it's hidden from the interface/header files so that downstream users of the library won't also depend on RapidJSON. That's why some of the calling conventions are convoluted.)

This PR can also include issue #192, passing a Python file pointer down to the C++ (which also mentions this issue of taking a stream of JSON objects, by the way).

@ianna
Collaborator Author

ianna commented Nov 18, 2020

Sure, the issues are related. I was thinking of using a generator to stream the JSON fragments:

    import json

    def stream_read_json(fn):
        # Yield JSON documents one by one from a file of concatenated documents.
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    # If the rest of the file is one valid document, emit it and stop.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # Otherwise parse only up to where the decoder stopped,
                    # then continue from that position.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    start_pos += e.pos
                    yield obj

but I'm not sure about performance - it's way too slow or the test file is way too big :-)

@jpivarski
Member

but I'm not sure about performance - it's way too slow or the test file is way too big :-)

It would be much better to tweak the rules under which ArrayBuilder::beginlist() and ArrayBuilder::endlist() are called. The difference would be two function calls (in C++) versus modifying the text stream. Generally, these data files are big. (That's why we use RapidJSON's SAX parser, rather than creating a DOM—it can make the difference between having or not having enough memory in realistic use-cases.)

@ianna
Collaborator Author

ianna commented Nov 19, 2020

(*) The failing integration tests are due to a C++17 string_view. I'll deal with it another way :-)

@jpivarski - I ran into a curious comparison error. The two sides of the reported diff seem to be identical...
Are the numpy.inf and numpy.nan handled as I expect? What shall I use instead? Thanks

>       assert awkward1.to_list(array) == [1.1, 2.2, 3.3, numpy.inf, -numpy.inf,
            [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],
            [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],
            [[[18.18,   19.19,   20.2,    21.21,   22.22],
              [23.23,   24.24,   25.25,   26.26,   27.27,
               28.28,   29.29,   30.3,    31.31,   32.32,
               33.33,   34.34,   35.35,   36.36,   37.37],
              [38.38],
              [39.39, 40.4, numpy.nan, numpy.nan, 41.41, 42.42, 43.43]],
             [[44.44,   45.45,   46.46,   47.47,   48.48],
              [49.49,   50.5,    51.51,   52.52,   53.53,
               54.54,   55.55,   56.56,   57.57,   58.58,
               59.59,   60.6,    61.61,   62.62,   63.63],
              [64.64],
              [65.65, 66.66, numpy.nan, numpy.nan, 67.67, 68.68, 69.69]]]]
E       assert [1.1,\n 2.2,\n 3.3,\n inf,\n -inf,\n [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],\n [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],\n [[[18.18, 19.19, 20.2, 21.21, 22.22],\n   [23.23,\n    24.24,\n    25.25,\n    26.26,\n    27.27,\n    28.28,\n    29.29,\n    30.3,\n    31.31,\n    32.32,\n    33.33,\n    34.34,\n    35.35,\n    36.36,\n    37.37],\n   [38.38],\n   [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],\n  [[44.44, 45.45, 46.46, 47.47, 48.48],\n   [49.49,\n    50.5,\n    51.51,\n    52.52,\n    53.53,\n    54.54,\n    55.55,\n    56.56,\n    57.57,\n    58.58,\n    59.59,\n    60.6,\n    61.61,\n    62.62,\n    63.63],\n   [64.64],\n   [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]] == [1.1,\n 2.2,\n 3.3,\n inf,\n -inf,\n [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],\n [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],\n [[[18.18, 19.19, 20.2, 21.21, 22.22],\n   [23.23,\n    24.24,\n    25.25,\n    26.26,\n    27.27,\n    28.28,\n    29.29,\n    30.3,\n    31.31,\n    32.32,\n    33.33,\n    34.34,\n    35.35,\n    36.36,\n    37.37],\n   [38.38],\n   [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],\n  [[44.44, 45.45, 46.46, 47.47, 48.48],\n   [49.49,\n    50.5,\n    51.51,\n    52.52,\n    53.53,\n    54.54,\n    55.55,\n    56.56,\n    57.57,\n    58.58,\n    59.59,\n    60.6,\n    61.61,\n    62.62,\n    63.63],\n   [64.64],\n   [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]]
E         At index 7 diff: [[[18.18, 19.19, 20.2, 21.21, 22.22], [23.23, 24.24, 25.25, 26.26, 27.27, 28.28, 29.29, 30.3, 31.31, 32.32, 33.33, 34.34, 35.35, 36.36, 37.37], [38.38], [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]], [[44.44, 45.45, 46.46, 47.47, 48.48], [49.49, 50.5, 51.51, 52.52, 53.53, 54.54, 55.55, 56.56, 57.57, 58.58, 59.59, 60.6, 61.61, 62.62, 63.63], [64.64], [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]] != [[[18.18, 19.19, 20.2, 21.21, 22.22], [23.23, 24.24, 25.25, 26.26, 27.27, 28.28, 29.29, 30.3, 31.31, 32.32, 33.33, 34.34, 35.35, 36.36, 37.37], [38.38], [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]], [[44.44, 45.45, 46.46, 47.47, 48.48], [49.49, 50.5, 51.51, 52.52, 53.53, 54.54, 55.55, 56.56, 57.57, 58.58, 59.59, 60.6, 61.61, 62.62, 63.63], [64.64], [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]]
E         Full diff:
E           [
E            1.1,
E            2.2,
E            3.3,
E            inf,
E            -inf,
E            [4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.1, 11.11],
E            [12.12, 13.13, 14.14, 15.15, 16.16, 17.17],
E            [[[18.18, 19.19, 20.2, 21.21, 22.22],
E              [23.23,
E               24.24,
E               25.25,
E               26.26,
E               27.27,
E               28.28,
E               29.29,
E               30.3,
E               31.31,
E               32.32,
E               33.33,
E               34.34,
E               35.35,
E               36.36,
E               37.37],
E              [38.38],
E              [39.39, 40.4, nan, nan, 41.41, 42.42, 43.43]],
E             [[44.44, 45.45, 46.46, 47.47, 48.48],
E              [49.49,
E               50.5,
E               51.51,
E               52.52,
E               53.53,
E               54.54,
E               55.55,
E               56.56,
E               57.57,
E               58.58,
E               59.59,
E               60.6,
E               61.61,
E               62.62,
E               63.63],
E              [64.64],
E              [65.65, 66.66, nan, nan, 67.67, 68.68, 69.69]]],
E           ]

tests/test_0437-stream-of-many-json-files.py:26: AssertionError

@jpivarski
Member

Are the numpy.inf and numpy.nan handled as I expect? What shall I use instead?

Maybe it's because NaN is not equal to NaN (according to IEEE rules):

>>> float("nan") == float("nan")
False
>>> np.nan == np.nan
False

A suggestion on StackOverflow is to replace the NaN values with strings, which can be done after the conversion to Python lists.

>>> data1 = [[44.44,   45.45,   46.46,   47.47,   48.48],
...          [49.49,   50.5,    51.51,   52.52,   53.53,
...           54.54,   55.55,   56.56,   57.57,   58.58,
...           59.59,   60.6,    61.61,   62.62,   63.63],
...          [64.64],
...          [65.65, 66.66, float("nan"), float("nan"), 67.67, 68.68, 69.69]]
>>> data2 = [[44.44,   45.45,   46.46,   47.47,   48.48],
...          [49.49,   50.5,    51.51,   52.52,   53.53,
...           54.54,   55.55,   56.56,   57.57,   58.58,
...           59.59,   60.6,    61.61,   62.62,   63.63],
...          [64.64],
...          [65.65, 66.66, float("nan"), float("nan"), 67.67, 68.68, 69.69]]
>>> data1 == data2
False
>>> 
>>> def fix(x):
...     if isinstance(x, dict):
...         return dict((k, fix(v)) for k, v in x.items())
...     elif isinstance(x, list):
...         return [fix(v) for v in x]
...     elif isinstance(x, tuple):
...         return tuple(fix(v) for v in x)
...     elif np.isnan(x):
...         return "nan"
...     else:
...         return x
... 
>>> fix(data1) == fix(data2)
True

There's no such problem with infinity:

>>> float("inf") == float("inf")
True
>>> float("inf") == float("-inf")
False
>>> float("-inf") == float("-inf")
True
>>> np.inf == np.inf
True
>>> np.inf == -np.inf
False
>>> -np.inf == -np.inf
True

In the IEEE floating point definition, only one bit pattern corresponds to infinity and one to minus infinity, but there's a large space of bit patterns that all correspond to not-a-number. Here are two of them:

>>> np.array([0, 0, 0, 0, 0, 0, 248, 127], dtype=np.uint8).view(np.float64)
array([nan])
>>> np.array([0, 0, 0, 0, 0, 0, 249, 127], dtype=np.uint8).view(np.float64)
array([nan])

Rather than requiring implementations of IEEE floating point arithmetic to complicate equality checks by testing whether both sides are in the "not-a-number" class, NaN values are simply defined as not being equal to each other. (After all, most implementations these days are in hardware, and you want to keep that as simple as possible.)
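
(For completeness, an alternative to the string-replacement trick above is a NaN-aware recursive comparison, sketched here:)

    import math

    def nan_equal(a, b):
        # Treat NaN as equal to NaN for test comparisons; everything else uses
        # ordinary equality. Handles nested lists and dicts.
        if isinstance(a, list) and isinstance(b, list):
            return len(a) == len(b) and all(nan_equal(x, y) for x, y in zip(a, b))
        if isinstance(a, dict) and isinstance(b, dict):
            return a.keys() == b.keys() and all(nan_equal(a[k], b[k]) for k in a)
        if isinstance(a, float) and isinstance(b, float):
            return (math.isnan(a) and math.isnan(b)) or a == b
        return a == b

    nan_equal([1.1, float("nan")], [1.1, float("nan")])   # True
    [1.1, float("nan")] == [1.1, float("nan")]            # False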

@ianna ianna requested a review from jpivarski November 20, 2020 13:28
Member

@jpivarski jpivarski left a comment


Oh, I'd rather these options not be a part of ArrayBuilderOptions, since they only apply to JSON (as they work around a JSON limitation). ArrayBuilder is used in a lot of contexts other than JSON. If there's some reason why it would be very difficult to do this, then we can consider this an internal detail, since it's not exposed to the Python interface to ArrayBuilder, but it's looking to me like a bad direction to grow.

The reason we have ArrayBuilderOptions is because ArrayBuilder makes a tree of indeterminate depth. At any moment, it could encounter new data that broadens the type of its array, so each node needs to have access to the configuration options, which includes such things as "how big should the initial allocation be?"

These JSON options don't need to be pushed down this way. The JSON-handling code is all in a SAX parser class that doesn't create nodes of a tree (it calls methods on ArrayBuilder, which internally builds those nodes). So the JSON options don't have to be "pushed down" in the same way. That's why they don't really belong in ArrayBuilderOptions.

Also, about the options themselves: users need to be able to specify the spelling of "nan", "inf", and "-inf", since there's no standard—different JSON writers generate different strings: "NaN", "Inf", "-Inf", "Infinity", "-Infinity", and "minusInfinity" are all things I've seen. A single dataset won't have multiple spellings, so these are three new options:

  • the string that gets mapped to std::numeric_limits<double>::quiet_NaN()
  • the string that gets mapped to std::numeric_limits<double>::infinity()
  • the string that gets mapped to -std::numeric_limits<double>::infinity()

any of which can be None (i.e. no string that gets mapped to a numeric limit).
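
(Illustration only: a rough pure-Python sketch of those semantics. The helper name is hypothetical and the post-processing approach is not how the PR implements it; the real mapping happens inside the C++ SAX handler in json.cpp while parsing.)

    import json
    import math

    def map_special_strings(obj, nan_string=None, infinity_string=None,
                            minus_infinity_string=None):
        # Build the string-to-float mapping from whichever options are set.
        mapping = {}
        if nan_string is not None:
            mapping[nan_string] = math.nan
        if infinity_string is not None:
            mapping[infinity_string] = math.inf
        if minus_infinity_string is not None:
            mapping[minus_infinity_string] = -math.inf

        def walk(x):
            if isinstance(x, dict):
                return {k: walk(v) for k, v in x.items()}
            if isinstance(x, list):
                return [walk(v) for v in x]
            if isinstance(x, str):
                return mapping.get(x, x)
            return x

        return walk(obj)

    data = json.loads('[1.1, "NaN", "Infinity", "-Infinity"]')
    map_special_strings(data, nan_string="NaN", infinity_string="Infinity",
                        minus_infinity_string="-Infinity")
    # -> [1.1, nan, inf, -inf]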

@ianna
Collaborator Author

ianna commented Nov 20, 2020

Oh, I'd rather these options not be a part of ArrayBuilderOptions, since they only apply to JSON (as they work around a JSON limitation). ArrayBuilder is used in a lot of contexts other than JSON. If there's some reason why it would be very difficult to do this, then we can consider this an internal detail, since it's not exposed to the Python interface to ArrayBuilder, but it's looking to me like a bad direction to grow.

The reason we have ArrayBuilderOptions is because ArrayBuilder makes a tree of indeterminate depth. At any moment, it could encounter new data that broadens the type of its array, so each node needs to have access to the configuration options, which includes such things as "how big should the initial allocation be?"

These JSON options don't need to be pushed down this way. The JSON-handling code is all in a SAX parser class that doesn't create nodes of a tree (it calls methods on ArrayBuilder, which internally builds those nodes). So the JSON options don't have to be "pushed down" in the same way. That's why they don't really belong in ArrayBuilderOptions.

done. It's the *Options that confused me :-)

Also, about the options themselves: users need to be able to specify the spelling of "nan", "inf", and "-inf", since there's no standard—different JSON writers generate different strings: "NaN", "Inf", "-Inf", "Infinity", "-Infinity", and "minusInfinity" are all things I've seen. A single dataset won't have multiple spellings, so these are three new options:

  • the string that gets mapped to std::numeric_limits<double>::quiet_NaN()
  • the string that gets mapped to std::numeric_limits<double>::infinity()
  • the string that gets mapped to -std::numeric_limits<double>::infinity()

reimplemented. There are a few parameters with default definitions that can be overridden by the user. It's a long-ish list. Please comment.

convertNanAndInf=False # to float
replaceNanAndInf=False # with another string
fromNan='NaN'
fromInf='inf'
fromMinusInf='-inf'
toNan='NaN'
toInf='inf'
toMinusInf='-inf'

for example:

    array = awkward1.from_json('tests/samples/test-nan-inf.json', convertNanAndInf=True,
        fromInf='Infinity', fromNan='None')

any of which can be None (i.e. no string that gets mapped to a numeric limit).

looking into it :-) Thanks!

@jpivarski
Member

The to_json and from_json functions are separate and can get separate sets of options; neither needs all 6, each needs only 3. They can take the same names, too:

def to_json(
    array,
    destination=None,
    pretty=False,
    maxdecimals=None,
    nan_string=None,
    infinity_string=None,
    minus_infinity_string=None,
    buffersize=65536,
):
    ...

def from_json(
    source,
    nan_string=None,
    infinity_string=None,
    minus_infinity_string=None,
    highlevel=True,
    behavior=None,
    initial=1024,
    resize=1.5,
    buffersize=65536,
):
    ...

Giving them the same names highlights the symmetry. I've grouped them here with the other JSON parameters (pretty and maxdecimals in to_json), before highlevel and behavior, which are usually last; the ArrayBuilder parameters initial, resize, and buffersize are of even less interest, so they come even later.

There needs to be a distinction between "no conversion string" and "the conversion string that is 'None'", since the latter is a possible encoding of NaN.
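
(A hypothetical usage sketch, assuming the proposed keywords above land as written; it is not the final API:)

    import awkward1

    # Hypothetical call: assumes from_json grows the keywords proposed above.
    array = awkward1.from_json(
        '[1.1, 2.2, "NaN", "Infinity", "-Infinity"]',
        nan_string="NaN",
        infinity_string="Infinity",
        minus_infinity_string="-Infinity",
    )
    # Note the distinction: nan_string="None" would map the literal string
    # "None" to nan, whereas nan_string=None (the default) maps nothing.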

@ianna ianna marked this pull request as ready for review November 23, 2020 17:25
@ianna ianna requested a review from jpivarski November 23, 2020 17:25
Member

@jpivarski jpivarski left a comment


This is looking great, thanks! Just a few inline comments.

Also, the "awkward1" → "awkward" name change will be happening Nov 30 or Dec 1, so you'll want this PR to be done by then. See details about the name change here:

https://github.com/scikit-hep/awkward-1.0/wiki/Name-change-procedure

Inline review comments (resolved): several on src/libawkward/io/json.cpp, plus src/python/content.cpp and tests/test_0437-stream-of-many-json-files.py
@ianna
Collaborator Author

ianna commented Nov 24, 2020

Thanks, @jpivarski! I think I'm done with this PR.

Member

@jpivarski jpivarski left a comment


Here are some cases to test. These first ones should raise an error (ValueError, as usual), but not hang:

"""{"one": 1, "two": 2.2,"""
"""{"one": 1,
"two": 2.2,"""
"""{"one": 1,
"two": 2.2,
"""
"""{"one": 1, "two": 2.2, "three": "THREE"}
{"one": 10, "two": 22,"""
"""{"one": 1, "two": 2.2, "three": "THREE"}
{"one": 10, "two": 22,
"""
"""["one", "two","""
"""["one",
"two","""
"""["one",
"two",
"""

And these should recognize that there are two JSON objects in the stream, despite the whitespace:

'{"one": 1, "two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}     {"one": 10, "two": 22}'
'{"one": 1, \t "two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}  \t   {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}\n{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}\n\r{"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}     \n     {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}     \n\r     {"one": 10, "two": 22}'
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}\n'
'{"one": 1, "two": 2.2}{"one": 10, "two": 22}\n\r'
'{"one": 1, "two": 2.2}\n{"one": 10, "two": 22}\n'
'{"one": 1, "two": 2.2}\n\r{"one": 10, "two": 22}\n\r'
'{"one": 1, \n"two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, \n\r"two": 2.2}{"one": 10, "two": 22}'
'{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}'
'{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}'
'{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}\n'
'{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}\n\r'
'{"one": 1, "two": 2.2}{"one": 10, \n"two": 22}'
'{"one": 1, "two": 2.2}{"one": 10, \n\r"two": 22}'
'{"one": 1, "two": 2.2}\n{"one": 10, \n"two": 22}'
'{"one": 1, "two": 2.2}\n\r{"one": 10, \n\r"two": 22}'
'["one", "two"]["uno", "dos"]'
'["one", "two"]\n["uno", "dos"]'
'["one", "two"]\n\r["uno", "dos"]'
'["one", "two"]     ["uno", "dos"]'
'["one", "two"]  \n   ["uno", "dos"]'
'["one", "two"]  \n\r   ["uno", "dos"]'
'["one", \n "two"]["uno", "dos"]'
'["one", \n "two"]\n["uno", "dos"]'
'["one", \n\r "two"]["uno", "dos"]'
'["one", \n\r "two"]\n\r["uno", "dos"]'
'"one""two"'
'"one"\n"two"'
'"one"\n\r"two"'
'"one"     "two"'
'"one"  \n   "two"'
'"one"  \n\r   "two"'

If it's too difficult to allow carriage returns within JSON objects and also between them (while not allowing incomplete JSON objects to make the parser hang), we could make the definition of a "JSONS" stream more strict: we could adopt a rule that valid JSON documents must be separated by carriage returns, but allow any whitespace. (That also solves the \n vs \n\r problem: we can consider \n to be the true carriage return and \r is just whitespace, like spaces and \t.) That reduces the set of valid examples above. But even with this restrictive rule, blank lines should be allowed and skipped:

"""{"one": 1, "two": 2.2}

{"one": 10, "two": 22}"""
"""{"one": 1, "two": 2.2}

{"one": 10, "two": 22}
"""
"""1
2

3"""
"""1
     2

3
"""

That's probably the easiest way to allow the stream to end with a newline.
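
(Under that stricter rule, the stream handling reduces to something like this pure-Python sketch, for illustration only; the real parsing would stay in the C++ SAX handler:)

    import json

    def parse_json_lines(text):
        # One JSON document per line; blank (whitespace-only) lines are
        # skipped, and \r is treated as ordinary whitespace.
        out = []
        for line in text.split("\n"):
            if line.strip():
                out.append(json.loads(line))
        return out

    parse_json_lines('{"one": 1, "two": 2.2}\n\n{"one": 10, "two": 22}\n')
    # -> [{'one': 1, 'two': 2.2}, {'one': 10, 'two': 22}]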

Comment on lines 1 to 15
{"one": 1, \t "two": 2.2}{"one": 10, "two": 22}
{"one": 1, \n"two": 2.2}{"one": 10, "two": 22}
{"one": 1, \n\r"two": 2.2}{"one": 10, "two": 22}
{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}
{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}
{"one": 1, \n"two": 2.2}\n{"one": 10, "two": 22}\n
{"one": 1, \n\r"two": 2.2}\n\r{"one": 10, "two": 22}\n\r
{"one": 1, "two": 2.2}{"one": 10, \n"two": 22}
{"one": 1, "two": 2.2}{"one": 10, \n\r"two": 22}
{"one": 1, "two": 2.2}\n{"one": 10, \n"two": 22}
{"one": 1, "two": 2.2}\n\r{"one": 10, \n\r"two": 22}
["one", \n "two"]["uno", "dos"]
["one", \n "two"]\n["uno", "dos"]
["one", \n\r "two"]["uno", "dos"]
["one", \n\r "two"]\n\r["uno", "dos"]
Member


These are not carriage return/tab characters: they're literal backslash and n, r, t in the text file. I meant for these to be independent tests (given as Python strings in the Python test file). If they're in a standalone JSON file like this, it's all one test, rather than testing features like "extra carriage return in the first line" and "extra carriage return in the second line," etc.

Collaborator Author


Thanks! Yes, the strings are added as independent tests and the tests pass. I'll remove the file.

Collaborator Author


done

@ianna
Collaborator Author

ianna commented Nov 25, 2020

@jpivarski - the tests pass. There is a workaround for the case where the source string is not a file name.

The following change would help to eliminate the need to distinguish between a file name and a string. Would that be satisfactory?

diff --git a/src/awkward1/operations/convert.py b/src/awkward1/operations/convert.py
index 6ffb080..11c135b 100644
--- a/src/awkward1/operations/convert.py
+++ b/src/awkward1/operations/convert.py
@@ -6,6 +6,8 @@ import numbers
 import json
 import collections
 import math
+import io
+import os
 import threading
 import distutils.version
 
@@ -842,7 +844,7 @@ def from_json(source,
     See also #ak.to_json.
     """
     layout = awkward1._ext.fromjson(
-        source,
+        source if not os.path.isfile(source) else open(source, "r").read(),
         nan_string=nan_string,
         infinity_string=infinity_string,
         minus_infinity_string=minus_infinity_string,

then only ak::FromJsonString would be needed and FromJsonFile could be deleted.
Please let me know if that's the direction you'd like me to take and if you'd like this to be part of the PR. I've got it working locally and all tests pass. Thanks!

@jpivarski
Member

Currently, have you put in any changes to let to_json and from_json accept a Python file object? I didn't see any such change in the GitHub diff, just the diff you quoted. (I'm only asking because I want to know if there's some code I'm supposed to look at.)

As for this proposed solution: it reads the whole JSON file as a string before doing any conversion. If a user is passing a filename or Python file object to the function, they're not expecting that, and very large files would then be a problem. (Note: a filled Awkward Array uses much less memory than the text of a JSON file.)

We do need FromJsonFile to do this incremental parsing. That's why we're using RapidJSON's SAX parser. If given a Python file object, you can get its file-handle as an integer:

>>> file = open("setup.py", "r")
>>> file.fileno()
3

(Don't check isinstance to see if it's a plain file—a lot of interfaces in Python have this fileno() method so they can be used interchangeably. Do hasattr(file, "fileno") to check for it. This also doesn't cover a common Python case of file-like objects that don't have a fileno but do have a read method. We can put that off for later, as it will require a FromJsonStringIncremental, which simply hasn't been written yet.)
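
(For illustration, the Python-level dispatch could look roughly like this; the routing names are made up and FromJsonStringIncremental does not exist yet:)

    import os

    def dispatch_json_source(source):
        # Illustrative only: decide how from_json might route its input.
        if hasattr(source, "fileno"):
            return ("fileno", source.fileno())     # hand the file descriptor to C++
        elif hasattr(source, "read"):
            return ("in-memory", source.read())    # file-like object without fileno
        elif os.path.isfile(source):
            return ("filename", source)            # incremental parse from the file
        else:
            return ("json-text", source)           # treat the string as JSON itself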

The source of from_json has a weakness that if a string is both a filename and valid JSON, it's ambiguous how to parse it. I think we pick "filename" in these cases. Having a filename that is valid JSON is weird.

@ianna
Collaborator Author

ianna commented Nov 25, 2020

Currently, have you put in any changes to let to_json and from_json accept a Python file object? I didn't see any such change in the GitHub diff, just the diff you quoted. (I'm only asking because I want to know if there's some code I'm supposed to look at.)

No, I did not push any changes - all this is in my private area :-)

As for this proposed solution: it reads the whole JSON file as a string before doing any conversion. If a user is passing a filename or Python file object to the function, they're not expecting that, and very large files would then be a problem. (Note: a filled Awkward Array uses much less memory than the text of a JSON file.)

Yes, you are right.

We do need FromJsonFile to do this incremental parsing. That's why we're using RapidJSON's SAX parser. If given a Python file object, you can get its file-handle as an integer:

>>> file = open("setup.py", "r")
>>> file.fileno()
3

(Don't check isinstance to see if it's a plain file—a lot of interfaces in Python have this fileno() method so they can be used interchangeably. Do hasattr(file, "fileno") to check for it. This also doesn't cover a common Python case of file-like objects that don't have a fileno but do have a read method. We can put that off for later, as it will require a FromJsonStringIncremental, which simply hasn't been written yet.)

Ok, thanks.

The source of from_json has a weakness that if a string is both a filename and valid JSON, it's ambiguous how to parse it. I think we pick "filename" in these cases. Having a filename that is valid JSON is weird.

Please let me know if I should continue working on this PR or if it's OK as-is to be merged. Thanks!

@jpivarski
Member

I just put in one correction, hiding nan_and_inf_as_float from users. (That's an optimization that should be implicit from the nan_string, infinity_string, or minus_infinity_string being None.)

I also noticed this:

>>> ak.from_json('["one", \n"two"] \n ["three"]')
<Array ['one', 'two', 'three'] type='3 * string'>

The output should be

[['one', 'two'], ['three']]

because each JSON document in the stream should remain separate. Sorry that I wasn't clear when listing the test cases.

Meanwhile,

>>> ak.from_json('{"x": 1, "y": 2} \n {"x": 10, "y": 20}')
<Array [[{x: 1, y: 2}], [{x: 10, y: 20}]] type='2 * var * {"x": int64, "y": int64}'>

is too deep: it should be

[{'x': 1, 'y': 2}, {'x': 10, 'y': 20}]

I'll be in a meeting now, so I'm not working on that now. I'll check in again tomorrow. Thanks!

As for passing in the FILE handle with fileno, that doesn't need to be in this PR.

@ianna
Collaborator Author

ianna commented Nov 27, 2020

I just put in one correction, hiding nan_and_inf_as_float from users. (That's an optimization that should be implicit from the nan_string, infinity_string, or minus_infinity_string being None.)

I also noticed this:

>>> ak.from_json('["one", \n"two"] \n ["three"]')
<Array ['one', 'two', 'three'] type='3 * string'>

The output should be

[['one', 'two'], ['three']]

because each JSON document in the stream should remain separate. Sorry that I wasn't clear when listing the test cases.

done.

Meanwhile,

>>> ak.from_json('{"x": 1, "y": 2} \n {"x": 10, "y": 20}')
<Array [[{x: 1, y: 2}], [{x: 10, y: 20}]] type='2 * var * {"x": int64, "y": int64}'>

is too deep: it should be

[{'x': 1, 'y': 2}, {'x': 10, 'y': 20}]

done.

I'll be in a meeting now, so I'm not working on that now. I'll check in again tomorrow. Thanks!

As for passing in the FILE handle with fileno, that doesn't need to be in this PR.

The last thing on my list is parsing a string that starts with a number:

        str = """ "1
        2
    
        3 " """
        array = awkward1.from_json(str)
>       assert awkward1.to_list(array) == [1, 2, 3]
E       assert [2, 3] == [1, 2, 3]
E         At index 0 diff: 2 != 1
E         Right contains one more item: 3
E         Full diff:
E         - [2, 3]
E         + [1, 2, 3]
E         ?  +++

@ianna
Collaborator Author

ianna commented Nov 30, 2020

@jpivarski - I think I'll close it and make a new PR after the name migration.

@ianna ianna closed this Nov 30, 2020
@ianna ianna deleted the ianna/handle_stream_of_many_JSON_files branch December 4, 2020 08:42
@jpivarski jpivarski mentioned this pull request Jan 28, 2021