Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reusing parser modifies previous results #17

Closed
ROpdebee opened this issue Apr 18, 2021 · 4 comments
Closed

Reusing parser modifies previous results #17

ROpdebee opened this issue Apr 18, 2021 · 4 comments
Assignees
Labels
question Further information is requested wontfix This will not be worked on

Comments

@ROpdebee
Copy link

Installed via PyPI

from cysimdjson import JSONParser
parser = JSONParser()
result = parser.parse(b'{"hello": "world"}')
print(result['hello'])  # "world"
new_result = parser.parse(b'{"hello": "universe"}')
print(result['hello'])  # "universe"

This only happens when reusing a previous parser instance. I'm not sure whether this is by design, but if it is, it should probably be explicitly documented to avoid confusion.

It also becomes especially iffy when mixing different types:

result = parser.parse(b'{"hello": "world"}')
print(type(result))  # JSONObject
print(list(result.keys()))  # ['hello']
new_result = parser.parse(b'[1,2,3]')
print(type(result))  # JSONObject
print(type(new_result))  # JSONArray
print(list(result.keys()))  # ['hello', 'hello', 'hello']

So if it is by design, it might be worthwhile to somehow invalidate the previous reference when a new document is parsed.

@ateska
Copy link
Contributor

ateska commented Apr 20, 2021

Thanks for reporting, we will look at that.

@ateska
Copy link
Contributor

ateska commented Apr 23, 2021

Ok, after some digging, this is actually correct behaviour - it is linked with the SIMDJSON requirement for the lifecycle of the document, see the remark at https://simdjson.org/api/0.9.1/classsimdjson_1_1dom_1_1parser.html#ab3e5bbb1974a1932aead90ad63883a23

I will try to provide some kind of indication of this as per your suggestion.

@ateska ateska self-assigned this Apr 23, 2021
@ateska ateska added question Further information is requested wontfix This will not be worked on labels Apr 23, 2021
@lemire
Copy link
Contributor

lemire commented Apr 23, 2021

Note that the parser holds the allocated memory. By reusing the parser from document to document, we reuse the allocated memory. The net result is to avoid memory reallocation, an expensive process. On some systems, it is more expensive to allocate memory than to parse JSON !

@TkTech
Copy link

TkTech commented May 23, 2021

You can see one approach to stopping this in pysimdjson, https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437. It's likely not our final fix, since it's a bit buggy on pypy whose garbage collector might wait a long, long time to cleanup things pointing into the Parser's document even if it's been explicitly deleted. In @ROpdebee's example above, it would have raised a RuntimeError instead of a potential segfault.

@ateska ateska closed this as completed Feb 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

4 participants