Reusing parser modifies previous results #17

ROpdebee · 2021-04-18T14:02:59Z

Installed via PyPI

from cysimdjson import JSONParser
parser = JSONParser()
result = parser.parse(b'{"hello": "world"}')
print(result['hello'])  # "world"
new_result = parser.parse(b'{"hello": "universe"}')
print(result['hello'])  # "universe"

This only happens when reusing a previous parser instance. I'm not sure whether this is by design, but if it is, it should probably be explicitly documented to avoid confusion.

It also becomes especially iffy when mixing different types:

result = parser.parse(b'{"hello": "world"}')
print(type(result))  # JSONObject
print(list(result.keys()))  # ['hello']
new_result = parser.parse(b'[1,2,3]')
print(type(result))  # JSONObject
print(type(new_result))  # JSONArray
print(list(result.keys()))  # ['hello', 'hello', 'hello']

So if it is by design, it might be worthwhile to somehow invalidate the previous reference when a new document is parsed.
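To make the suggestion concrete, here is a minimal sketch of one way old references could be invalidated. It is not cysimdjson code: `GuardedParser` and `GuardedView` are hypothetical names, and `json.loads` stands in for the real parsing step. Each parse bumps a generation counter, and a result refuses access once its generation is no longer current.

```python
import json


class GuardedParser:
    """Hypothetical parser wrapper that tracks which document is current."""

    def __init__(self) -> None:
        self._generation = 0
        self._document = None

    def parse(self, data: bytes) -> "GuardedView":
        self._generation += 1  # every parse invalidates older views
        self._document = json.loads(data)  # stand-in for the real parsing step
        return GuardedView(self, self._generation)


class GuardedView:
    """Hypothetical result object that remembers the generation it belongs to."""

    def __init__(self, parser: GuardedParser, generation: int) -> None:
        self._parser = parser
        self._generation = generation

    def _check(self) -> None:
        # Refuse access once the parser has moved on to a newer document.
        if self._generation != self._parser._generation:
            raise RuntimeError("stale result: the parser has parsed a newer document")

    def __getitem__(self, key):
        self._check()
        return self._parser._document[key]


parser = GuardedParser()
result = parser.parse(b'{"hello": "world"}')
print(result["hello"])  # world
new_result = parser.parse(b'{"hello": "universe"}')
print(new_result["hello"])  # universe
try:
    result["hello"]
except RuntimeError as exc:
    print("stale access rejected:", exc)
```

With a guard like this, the stale `result` fails loudly instead of silently returning data from the newer document.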

ateska · 2021-04-20T13:21:50Z

Thanks for reporting, we will look at that.

ateska · 2021-04-23T19:18:07Z

OK, after some digging, this is actually the expected behaviour: it follows from simdjson's requirement on the lifetime of the document. See the remark at https://simdjson.org/api/0.9.1/classsimdjson_1_1dom_1_1parser.html#ab3e5bbb1974a1932aead90ad63883a23

I will try to provide some kind of indication of this as per your suggestion.

lemire · 2021-04-23T19:57:00Z

Note that the parser holds the allocated memory. By reusing the parser from document to document, we reuse the allocated memory. The net result is that we avoid memory reallocation, which is an expensive process. On some systems, it is more expensive to allocate memory than to parse JSON!
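As a rough illustration of why buffer reuse makes earlier results change, here is a toy model in pure Python. It does not reflect simdjson's or cysimdjson's internals: `ToyParser` and `ToyView` are hypothetical, and the "document" is just raw bytes. The parser allocates its buffer once, each parse overwrites it in place, and every view reads lazily from that shared buffer.

```python
class ToyParser:
    """Hypothetical stand-in for a parser that owns one reusable buffer."""

    def __init__(self, capacity: int = 64) -> None:
        self._buffer = bytearray(capacity)  # allocated once, reused for every document
        self._length = 0

    def parse(self, data: bytes) -> "ToyView":
        if len(data) > len(self._buffer):
            raise ValueError("document larger than the toy buffer")
        self._buffer[: len(data)] = data  # overwrite in place, no new allocation
        self._length = len(data)
        return ToyView(self)


class ToyView:
    """Lazy view: reads from the parser's buffer on every access."""

    def __init__(self, parser: ToyParser) -> None:
        self._parser = parser

    def text(self) -> str:
        p = self._parser
        return bytes(p._buffer[: p._length]).decode()


parser = ToyParser()
first = parser.parse(b"world")
before = first.text()
second = parser.parse(b"universe")
after = first.text()
print(before, after)  # world universe
```

The first view never copied its data, so after the second parse it reports the new contents, which mirrors the behaviour in the original report.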

TkTech · 2021-05-23T05:28:13Z

You can see one approach to stopping this in pysimdjson, https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437. It's likely not our final fix, since it's a bit buggy on PyPy, whose garbage collector might wait a long, long time to clean up things pointing into the Parser's document even if it's been explicitly deleted. In @ROpdebee's example above, it would have raised a RuntimeError instead of a potential segfault.
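For readers who want a feel for this kind of guard, here is a minimal sketch of a parser that refuses to be reused while results from the previous document are still referenced. It is an assumption-laden illustration, not pysimdjson's actual code: `StrictParser` and `StrictView` are hypothetical, `json.loads` stands in for real parsing, and live results are tracked with a `weakref.WeakSet`. Because it relies on results being collected promptly, it is also sensitive to garbage collector timing.

```python
import gc
import json
import weakref


class StrictParser:
    """Hypothetical parser that refuses reuse while old results are alive."""

    def __init__(self) -> None:
        self._live_views = weakref.WeakSet()  # entries vanish when a view is collected
        self._document = None

    def parse(self, data: bytes) -> "StrictView":
        if len(self._live_views) > 0:
            raise RuntimeError(
                "cannot reuse parser while results from the previous "
                "document are still referenced"
            )
        self._document = json.loads(data)  # stand-in for the real parsing step
        view = StrictView(self)
        self._live_views.add(view)
        return view


class StrictView:
    """Hypothetical result object that reads from the parser's current document."""

    def __init__(self, parser: StrictParser) -> None:
        self._parser = parser

    def __getitem__(self, key):
        return self._parser._document[key]


parser = StrictParser()
result = parser.parse(b'{"hello": "world"}')
try:
    parser.parse(b'{"hello": "universe"}')
except RuntimeError as exc:
    print("reuse rejected:", exc)

del result    # drop the last reference to the old result
gc.collect()  # on runtimes without reference counting, collection may be delayed
new_result = parser.parse(b'{"hello": "universe"}')
print(new_result["hello"])  # universe
```

On CPython the old result disappears as soon as `del result` runs. On a runtime with a tracing collector it may linger, so the second parse could keep failing until a collection happens.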

ateska self-assigned this Apr 23, 2021

ateska added the question and wontfix labels Apr 23, 2021

ateska closed this as completed Feb 12, 2022
