Description
Summary
scrapy/scrapy/http/response/text.py
Lines 74 to 82 in 52c0726
As a result of the `response.text` call inside `response.json`, the application ends up holding in RAM, in addition to the parsed object stored in `response._cached_decoded_json`, both the `bytes` and the `str` (from `response.text`) representations of the response by the end of the `parse` method (and related middlewares' methods). `response.body` is already enough to obtain the parsed Python `dict` object, so we don't need `response.text` (and we definitely don't need to hold the response as a `str` in RAM).
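A minimal sketch of the idea (hypothetical class and attribute handling, not Scrapy's actual implementation): `json()` can feed the raw bytes straight to `json.loads`, so the `str` cache is never populated:

```python
import json


class TextResponseSketch:
    """Simplified stand-in for scrapy.http.TextResponse (hypothetical)."""

    def __init__(self, body: bytes):
        self.body = body
        self._cached_decoded_json = None
        self._cached_ubody = None  # str cache used by .text; stays empty here

    def json(self):
        # json.loads accepts bytes directly, so no intermediate str copy
        # of the body is created or cached.
        if self._cached_decoded_json is None:
            self._cached_decoded_json = json.loads(self.body)
        return self._cached_decoded_json


resp = TextResponseSketch(b'{"id": 1}')
assert resp.json() == {"id": 1}
assert resp._cached_ubody is None  # no str representation was materialized
```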
scrapy/scrapy/http/response/text.py
Lines 84 to 93 in 52c0726
Also, we don't need to apply the encoding-detection logic from `response.text`, as according to the JSON spec only Unicode input is expected.
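This is consistent with the standard library: since Python 3.6, `json.loads` accepts `bytes` directly and performs its own UTF-8/UTF-16/UTF-32 detection (via `json.detect_encoding`), so no separate decoding step is required:

```python
import json

payload = {"a": 1}

# json.loads handles bytes in any of the encodings allowed by the JSON spec
assert json.loads('{"a": 1}'.encode('utf-8')) == payload
assert json.loads('{"a": 1}'.encode('utf-16')) == payload
assert json.loads('{"a": 1}'.encode('utf-32')) == payload
```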
Reproducible code sample
```python
from json import loads, dumps
from sys import getsizeof

from scrapy.http import TextResponse


def create_record(id):
    return {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": id}


def create_sample_json_body():
    return {'data': [create_record(r) for r in range(100)]}


# generating response body - closer to a real json payload
body = dumps(create_sample_json_body()).encode('utf8')

a = TextResponse(url='https://quotes.toscrape.com/example_1', body=body[:], encoding='utf8', flags=['response.json()'])
b = TextResponse(url='https://quotes.toscrape.com/example_2', body=body[:], encoding='utf8', flags=['json.loads(response.text) + import'])
c = TextResponse(url='https://quotes.toscrape.com/example_3', body=body[:], encoding='utf8', flags=['json.loads(response.body) + import'])

a_parsed = a.json()
b_parsed = loads(b.text)
c_parsed = loads(c.body)

for response in [a, b, c]:
    print(f'response {response}')
    print(f'option: {response.flags[0]}')
    print(f'bytes:\t{getsizeof(response.body)}\t{response.body}'[:300])
    print(f'str:\t{getsizeof(response._cached_ubody)}\t{response._cached_ubody}'[:300])
    print(f'bytes+str size combined:\t{getsizeof(response.body) + getsizeof(response._cached_ubody)}\n')
```
Log output

```
response <200 https://quotes.toscrape.com/example_1>
option: response.json()
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_2>
option: json.loads(response.text) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	8449	{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website": "
bytes+str size combined:	16882

response <200 https://quotes.toscrape.com/example_3>
option: json.loads(response.body) + import
bytes:	8433	b'{"data": [{"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 0}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 1}, {"name": "quotes", "website": "quotes.toscrape.com", "type": "tutorial", "id": 2}, {"name": "quotes", "website
str:	16	None
bytes+str size combined:	8449

Process finished with exit code 0
```
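For reference, the overhead seen in the log can be reproduced without Scrapy. On CPython, a `bytes` object and its decoded ASCII `str` each carry a small fixed header on top of the payload, so keeping both alive costs slightly more than twice the payload size:

```python
from sys import getsizeof

payload = b'x' * 8400             # stand-in for a raw response body
as_str = payload.decode('utf-8')  # what response.text would cache

# Header sizes are CPython-version dependent, but the pattern holds:
# the str is at least as large as the bytes, and holding both roughly
# doubles the per-response footprint.
print(getsizeof(payload), getsizeof(as_str))
assert getsizeof(as_str) >= getsizeof(payload)
assert getsizeof(payload) + getsizeof(as_str) > 2 * len(payload)
```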
Motivation
Calling `response.json()` is mentioned in the docs and often referred to as the preferred and easy built-in Scrapy way to immediately get a Python object (`dict`) from a JSON response (without an external `import json`, etc.). It is an easy-to-use method, but as described above it is not memory-efficient.
Describe alternatives you've considered
- Instead of `response.json()` I use `json.loads(response.body)`, though it also requires an additional `import json`.
- In case I am 100% sure that I don't need `response.body` or `response.text`, I can safely `del response.body` and `del response._cached_ubody` (if it exists).
Additional context
The influence on RAM allocations on a per-response basis (and the impact of the `response.text` bytes->str conversion call in particular) is briefly mentioned in scrapy/parsel#210 (comment).