
response.json() call makes unnecessary memory allocation #5968

Closed
@GeorgeA92

Description


Summary

def json(self):
    """
    .. versionadded:: 2.2

    Deserialize a JSON document to a Python object.
    """
    if self._cached_decoded_json is _NONE:
        self._cached_decoded_json = json.loads(self.text)
    return self._cached_decoded_json

As a result of the response.text call inside response.json(), by the end of the parse method (and the related middleware methods) the application holds in RAM not only the parsed object stored in response._cached_decoded_json but also both the bytes and the str representations of the response.

response.body alone is enough to obtain the parsed Python dict object, so we don't need response.text (and we definitely don't need to keep the str representation of the response in RAM).

@property
def text(self):
    """Body as unicode"""
    # access self.encoding before _cached_ubody to make sure
    # _body_inferred_encoding is called
    benc = self.encoding
    if self._cached_ubody is None:
        charset = f"charset={benc}"
        self._cached_ubody = html_to_unicode(charset, self.body)[1]
    return self._cached_ubody

Also, we don't need to apply the encoding-detection logic from response.text, because according to the JSON spec we expect only Unicode input.
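For reference, since Python 3.6 json.loads accepts bytes directly and auto-detects UTF-8/UTF-16/UTF-32 as required by RFC 8259, so no separate decoding step is needed before parsing:

```python
import json

# json.loads handles raw bytes itself, detecting the encoding
# per the JSON spec, so bytes -> str conversion is unnecessary.
body = '{"name": "quotes", "id": 1}'.encode("utf8")
parsed = json.loads(body)
print(parsed["name"])  # prints "quotes"
```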

reproducible code sample
from json import loads, dumps
from sys import getsizeof
from scrapy.http import TextResponse

def create_record(id):
    return {"name": "quotes", "website": "quotes.toscrape.com", "type":"tutorial", "id": id}

def create_sample_json_body():
    return {'data': [create_record(r) for r in range(100)]}

# generating response body - closer to real json
body = dumps(create_sample_json_body()).encode('utf8')

a = TextResponse(url='https://quotes.toscrape.com/example_1', body=body[:], encoding='utf8', flags=['response.json()'])
b = TextResponse(url='https://quotes.toscrape.com/example_2', body=body[:], encoding='utf8', flags=['json.loads(response.text) + import'])
c = TextResponse(url='https://quotes.toscrape.com/example_3', body=body[:], encoding='utf8', flags=['json.loads(response.body) + import'])

a_parsed = a.json()
b_parsed = loads(b.text)
c_parsed = loads(c.body)

for response in [a, b, c]:
    print(f'response {response}')
    print(f'option: {response.flags[0]}')
    print(f'bytes:\t{getsizeof(response.body)}\t{response.body}'[:300])
    print(f'str:\t{getsizeof(response._cached_ubody)}\t{response._cached_ubody}'[:300])
    print(f'bytes+str size combined:\t{getsizeof(response.body) + getsizeof(response._cached_ubody)}\n')
log output
response <200 https://quotes.toscrape.com/example_1>
option: response.json()
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_2>
option: json.loads(response.text) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_3>
option: json.loads(response.body) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449


Process finished with exit code 0

Motivation

Calling response.json() is mentioned in the docs and often referred to as the preferred, easy built-in Scrapy way to immediately obtain a Python object (dict) from a JSON response (without an external import of json, etc.). It is an easy-to-use method, but as described here, it is not memory-efficient.

Describe alternatives you've considered

  1. Instead of response.json() I use json.loads(response.body), though this requires an additional import of json.
  2. If I am 100% sure that I don't need response.body or response.text, I can safely del response.body and del response._cached_ubody (if it exists).
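A minimal sketch of what the suggested fix could look like, using a simplified stand-in class rather than the actual scrapy.http.TextResponse implementation: json() parses directly from self.body, so the cached str copy of the response is never materialized.

```python
import json

_NONE = object()  # sentinel mirroring Scrapy's internal _NONE marker


class Response:
    """Simplified stand-in for scrapy.http.TextResponse (illustration only)."""

    def __init__(self, body: bytes):
        self.body = body
        self._cached_decoded_json = _NONE

    def json(self):
        # Parse the raw bytes directly: json.loads performs its own
        # encoding detection, so response.text is never built or cached.
        if self._cached_decoded_json is _NONE:
            self._cached_decoded_json = json.loads(self.body)
        return self._cached_decoded_json
```

With this sketch, repeated json() calls still return the same cached object, but only the bytes body and the parsed dict are held in RAM.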

Additional context

The influence on per-response RAM allocations (and the impact of the response.text bytes->str conversion specifically) is briefly mentioned in scrapy/parsel#210 (comment).
