
response.json() call makes unnecessary memory allocation #5968

Closed
GeorgeA92 opened this issue Jul 10, 2023 · 4 comments · Fixed by #6016

Comments

@GeorgeA92
Contributor

GeorgeA92 commented Jul 10, 2023

Summary

def json(self):
    """
    .. versionadded:: 2.2

    Deserialize a JSON document to a Python object.
    """
    if self._cached_decoded_json is _NONE:
        self._cached_decoded_json = json.loads(self.text)
    return self._cached_decoded_json

As a result of the response.text call inside response.json(), by the end of the parse method (and related middleware methods) the application holds in RAM both the bytes and the str representation of the response, in addition to the parsed object stored in response._cached_decoded_json.

response.body is already enough to obtain the parsed Python dict object, so we don't need response.text (and we certainly don't need to keep the str representation of the response in RAM).

@property
def text(self):
    """Body as unicode"""
    # access self.encoding before _cached_ubody to make sure
    # _body_inferred_encoding is called
    benc = self.encoding
    if self._cached_ubody is None:
        charset = f"charset={benc}"
        self._cached_ubody = html_to_unicode(charset, self.body)[1]
    return self._cached_ubody

We also don't need to apply the encoding-detection logic from response.text, since per the JSON specification we expect only Unicode input.
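To illustrate the point: json.loads accepts bytes directly (since Python 3.6) and detects UTF-8/16/32 itself, so no intermediate str copy of the body is needed. A minimal standalone sketch (not using Scrapy):

```python
import json
from sys import getsizeof

# a JSON body similar in size to a real response payload
body = ('{"data": "' + "x" * 1000 + '"}').encode("utf8")

# parse straight from bytes: no str copy of the body is created
parsed = json.loads(body)

# decoding to str first keeps a second, comparably sized object alive
as_text = body.decode("utf8")
print(getsizeof(body), getsizeof(as_text))
```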

Reproducible code sample
from json import loads, dumps
from sys import getsizeof
from scrapy.http import TextResponse

def create_record(id):
    return {"name": "quotes", "website": "quotes.toscrape.com", "type":"tutorial", "id": id}

def create_sample_json_body():
    return {'data': [create_record(r) for r in range(100)]}

# generating response body - closer to real json
body = dumps(create_sample_json_body()).encode('utf8')

a = TextResponse(url='https://quotes.toscrape.com/example_1', body=body[:], encoding='utf8', flags=['response.json()'])
b = TextResponse(url='https://quotes.toscrape.com/example_2', body=body[:], encoding='utf8', flags=['json.loads(response.text) + import'])
c = TextResponse(url='https://quotes.toscrape.com/example_3', body=body[:], encoding='utf8', flags=['json.loads(response.body) + import'])

a_parsed = a.json()
b_parsed = loads(b.text)
c_parsed = loads(c.body)

for response in [a, b, c]:
    print(f'response {response}')
    print(f'option: {response.flags[0]}')
    print(f'bytes:\t{getsizeof(response.body)}\t{response.body}'[:300])
    print(f'str:\t{getsizeof(response._cached_ubody)}\t{response._cached_ubody}'[:300])
    print(f'bytes+str size combined:\t{getsizeof(response.body) + getsizeof(response._cached_ubody)}\n')
Log output
response <200 https://quotes.toscrape.com/example_1>
option: response.json()
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_2>
option: json.loads(response.text) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_3>
option: json.loads(response.body) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449


Process finished with exit code 0

Motivation

Calling response.json() is mentioned in the docs and often referred to as the preferred, easy built-in Scrapy way to immediately obtain a Python object (dict) from a JSON response (without an external import of json, etc.). It is an easy-to-use method, but as described here it is not memory-efficient.

Describe alternatives you've considered

  1. Instead of response.json() I use json.loads(response.body), though this also requires an additional import of json.
  2. If I am 100% sure that I don't need response.body or response.text, I can safely del response.body and del response._cached_ubody (if it exists).
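The first alternative can be sketched as a hypothetical helper (parse_json_and_free is not a Scrapy API; it just combines the two workarounds above). It parses straight from the raw bytes and releases any cached str copy of the body:

```python
import json
from types import SimpleNamespace

def parse_json_and_free(response):
    """Hypothetical helper: parse from raw bytes, drop the cached str copy."""
    parsed = json.loads(response.body)
    if getattr(response, "_cached_ubody", None) is not None:
        response._cached_ubody = None  # release the decoded-str copy
    return parsed

# stand-in for a TextResponse whose body was already decoded to str
fake = SimpleNamespace(body=b'{"id": 1}', _cached_ubody='{"id": 1}')
parsed = parse_json_and_free(fake)
```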

Additional context

The influence on per-response RAM allocations (and of the specific bytes->str conversion in response.text) is briefly mentioned in scrapy/parsel#210 (comment).

@wRAR
Member

wRAR commented Jul 14, 2023

Hmm, should it simply be changed to self._cached_decoded_json = json.loads(self.body)?
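A minimal sketch of what that one-line change would amount to, using a hypothetical stand-in class (not Scrapy's actual TextResponse):

```python
import json

_NONE = object()  # sentinel mirroring the one used by scrapy's TextResponse

class JsonResponseSketch:
    """Hypothetical stand-in for TextResponse with the proposed change applied."""

    def __init__(self, body: bytes):
        self.body = body
        self._cached_decoded_json = _NONE

    def json(self):
        # parse straight from the raw bytes: no str copy is created or cached
        if self._cached_decoded_json is _NONE:
            self._cached_decoded_json = json.loads(self.body)
        return self._cached_decoded_json

resp = JsonResponseSketch(b'{"id": 7}')
parsed = resp.json()
```

The second call returns the same cached object, so parsing happens only once per response.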

@jxlil
Contributor

jxlil commented Aug 16, 2023

@wRAR, I implemented the change you suggested and ran a test. It appears to be working now. The str response is no longer being stored in RAM.

For the test, I used the script that @GeorgeA92 shared:

response <200 https://quotes.toscrape.com/example_1>
option: response.json()
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449

response <200 https://quotes.toscrape.com/example_2>
option: json.loads(response.text) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_3>
option: json.loads(response.body) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449

@GeorgeA92
Contributor Author

I think self._cached_decoded_json = json.loads(self.body) can be a solution (assuming this change doesn't break anything else).

@jxlil We expect to see a prepared pull request with related new tests added, according to our contribution policy.

@jxlil
Contributor

jxlil commented Aug 17, 2023

@GeorgeA92 I've just created a PR with the changes.
