
response.json() call makes unnecessary memory allocation #5968

Closed
GeorgeA92 opened this issue Jul 10, 2023 · 4 comments · Fixed by #6016

Comments

@GeorgeA92
Contributor

GeorgeA92 commented Jul 10, 2023

Summary

def json(self):
    """
    .. versionadded:: 2.2

    Deserialize a JSON document to a Python object.
    """
    if self._cached_decoded_json is _NONE:
        self._cached_decoded_json = json.loads(self.text)
    return self._cached_decoded_json

As a result of the response.text call inside response.json(), by the end of the parse method (and related middleware methods) the application holds in RAM both the bytes and the str representation of the response, in addition to the parsed object stored in response._cached_decoded_json.

response.body is already enough to obtain the parsed Python dict object, so we don't need response.text (and we certainly don't need to keep the str representation of the response in RAM).

@property
def text(self):
    """Body as unicode"""
    # access self.encoding before _cached_ubody to make sure
    # _body_inferred_encoding is called
    benc = self.encoding
    if self._cached_ubody is None:
        charset = f"charset={benc}"
        self._cached_ubody = html_to_unicode(charset, self.body)[1]
    return self._cached_ubody

We also don't need to apply the encoding-detection logic from response.text, since per the JSON specification we expect only Unicode input.
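To illustrate the point: json.loads accepts bytes directly (since Python 3.6) and detects UTF-8/16/32 itself, so no intermediate str copy of the body is needed. A minimal standalone sketch (not using Scrapy):

```python
import json
from sys import getsizeof

# a JSON body similar in size to a real response payload
body = ('{"data": "' + "x" * 1000 + '"}').encode("utf8")

# parse straight from bytes: no str copy of the body is created
parsed = json.loads(body)

# decoding to str first keeps a second, comparably sized object alive
as_text = body.decode("utf8")
print(getsizeof(body), getsizeof(as_text))
```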

Reproducible code sample
from json import loads, dumps
from sys import getsizeof
from scrapy.http import TextResponse

def create_record(id):
    return {"name": "quotes", "website": "quotes.toscrape.com", "type":"tutorial", "id": id}

def create_sample_json_body():
    return {'data': [create_record(r) for r in range(100)]}

# generating response body - closer to real json
body = dumps(create_sample_json_body()).encode('utf8')

a = TextResponse(url='https://quotes.toscrape.com/example_1', body=body[:], encoding='utf8', flags=['response.json()'])
b = TextResponse(url='https://quotes.toscrape.com/example_2', body=body[:], encoding='utf8', flags=['json.loads(response.text) + import'])
c = TextResponse(url='https://quotes.toscrape.com/example_3', body=body[:], encoding='utf8', flags=['json.loads(response.body) + import'])

a_parsed = a.json()
b_parsed = loads(b.text)
c_parsed = loads(c.body)

for response in [a, b, c]:
    print(f'response {response}')
    print(f'option: {response.flags[0]}')
    print(f'bytes:\t{getsizeof(response.body)}\t{response.body}'[:300])
    print(f'str:\t{getsizeof(response._cached_ubody)}\t{response._cached_ubody}'[:300])
    print(f'bytes+str size combined:\t{getsizeof(response.body) + getsizeof(response._cached_ubody)}\n')
Log output
response <200 https://quotes.toscrape.com/example_1>
option: response.json()
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_2>
option: json.loads(response.text) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_3>
option: json.loads(response.body) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449


Process finished with exit code 0

Motivation

Calling response.json() is mentioned in the docs and often referred to as the preferred, easy built-in Scrapy way to immediately obtain a Python object (dict) from a JSON response (without an external import of json, etc.). It is an easy-to-use method, but as described here it is not memory-efficient.

Describe alternatives you've considered

  1. Instead of response.json() I use json.loads(response.body), though this also requires an additional import of json.
  2. If I am 100% sure that I don't need response.body or response.text, I can safely del response.body and del response._cached_ubody (if it exists).
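The first alternative can be sketched as a hypothetical helper (parse_json_and_free is not a Scrapy API; it just combines the two workarounds above). It parses straight from the raw bytes and releases any cached str copy of the body:

```python
import json
from types import SimpleNamespace

def parse_json_and_free(response):
    """Hypothetical helper: parse from raw bytes, drop the cached str copy."""
    parsed = json.loads(response.body)
    if getattr(response, "_cached_ubody", None) is not None:
        response._cached_ubody = None  # release the decoded-str copy
    return parsed

# stand-in for a TextResponse whose body was already decoded to str
fake = SimpleNamespace(body=b'{"id": 1}', _cached_ubody='{"id": 1}')
parsed = parse_json_and_free(fake)
```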

Additional context

The influence on per-response RAM allocations (and of the specific bytes->str conversion in response.text) is briefly mentioned in scrapy/parsel#210 (comment).

@wRAR
Member

wRAR commented Jul 14, 2023

Hmm, should it simply be changed to self._cached_decoded_json = json.loads(self.body)?
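A minimal sketch of what that one-line change would amount to, using a hypothetical stand-in class (not Scrapy's actual TextResponse):

```python
import json

_NONE = object()  # sentinel mirroring the one used by scrapy's TextResponse

class JsonResponseSketch:
    """Hypothetical stand-in for TextResponse with the proposed change applied."""

    def __init__(self, body: bytes):
        self.body = body
        self._cached_decoded_json = _NONE

    def json(self):
        # parse straight from the raw bytes: no str copy is created or cached
        if self._cached_decoded_json is _NONE:
            self._cached_decoded_json = json.loads(self.body)
        return self._cached_decoded_json

resp = JsonResponseSketch(b'{"id": 7}')
parsed = resp.json()
```

The second call returns the same cached object, so parsing happens only once per response.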

@jxlil
Contributor

jxlil commented Aug 16, 2023

@wRAR, I implemented the change you suggested and ran a test. It appears to be working now. The str response is no longer being stored in RAM.

For the test, I used the script that @GeorgeA92 shared:

response <200 https://quotes.toscrape.com/example_1>
option: response.json()
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449

response <200 https://quotes.toscrape.com/example_2>
option: json.loads(response.text) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_3>
option: json.loads(response.body) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449

@GeorgeA92
Contributor Author

I think self._cached_decoded_json = json.loads(self.body) can be a solution (assuming this change doesn't break anything else).

@jxlil We expect to see a prepared pull request with related new tests added, according to our contribution policy.

@jxlil
Contributor

jxlil commented Aug 17, 2023

@GeorgeA92 I've just created a PR with the changes.
