Using scrapy.utils.response.response_httprepr inside the DownloaderStats middleware causes the application to make an unnecessary memory allocation. response_httprepr is used only once, to calculate response sizes for the downloader/response_bytes stat in that middleware.
In the current implementation response_httprepr returns bytes (an immutable type), so to calculate downloader/response_bytes the application allocates nearly as much additional memory as the original response occupies, only to call len on the result inside the middleware.
To demonstrate the impact of this, I wrote this spider:
spider code
import sys
from importlib import import_module

import scrapy


class MemoryHttpreprSpider(scrapy.Spider):
    name = 'memory_httprepr'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.stats.DownloaderStats': None
        }
    }

    # the same as in MemoryUsage extension:
    def get_virtual_size(self):
        size = self.resource.getrusage(self.resource.RUSAGE_SELF).ru_maxrss
        if sys.platform != 'darwin':
            # on macOS ru_maxrss is in bytes, on Linux it is in KB
            size *= 1024
        return size

    def start_requests(self):
        try:
            self.resource = import_module('resource')
        except ImportError:
            pass
        self.logger.info(f"used memory on start: {str(self.get_virtual_size())}")
        yield scrapy.Request(url='https://speed.hetzner.de/100MB.bin', callback=self.parse)
        #yield scrapy.Request(url='http://quotes.toscrape.com', callback=self.parse)

    def parse(self, response, **kwargs):
        self.logger.info(f"used memory after downloading response: {str(self.get_virtual_size())}")
Description
Relevant code: scrapy/scrapy/utils/response.py, lines 45 to 60 in 26836c4
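The allocation described here can be illustrated without Scrapy. The following is a minimal sketch, not Scrapy's implementation: httprepr_bytes and httprepr_len are hypothetical stand-ins, and the status line, header, and 10 MiB body are made-up sample values. It uses tracemalloc to compare the peak allocation of joining everything into one bytes object with summing the lengths of the parts.

```python
import tracemalloc


def httprepr_bytes(status_line: bytes, headers: bytes, body: bytes) -> bytes:
    """Build the full raw representation (hypothetical stand-in, not Scrapy's code)."""
    return b"".join([status_line, b"\r\n", headers, b"\r\n\r\n", body])


def httprepr_len(status_line: bytes, headers: bytes, body: bytes) -> int:
    """Compute the same length without building the joined bytes object."""
    return len(status_line) + 2 + len(headers) + 4 + len(body)


def peak_alloc(func, *args):
    """Return (result, peak bytes allocated by Python while func ran)."""
    tracemalloc.start()
    result = func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


# Allocated before tracing starts, so it is not counted in the peaks below.
body = b"x" * (10 * 1024 * 1024)
status_line = b"HTTP/1.1 200 OK"
headers = b"Content-Type: application/octet-stream"

joined, peak_join = peak_alloc(httprepr_bytes, status_line, headers, body)
size, peak_len = peak_alloc(httprepr_len, status_line, headers, body)

print("same length:", len(joined) == size)
print("peak when joining:", peak_join, "peak when summing lengths:", peak_len)
```

Because bytes objects are immutable, the join has to copy the body into a new object, so its peak is at least the body size, while the length-only variant allocates almost nothing.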
Steps to Reproduce
The spider above includes:
- a get_virtual_size method, identical to the one in the MemoryUsage extension
- a custom_settings entry that disables the DownloaderStats middleware, which uses response_httprepr and request_httprepr
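The same measurement can be tried without a network download. This is a rough sketch under stated assumptions: peak_rss_bytes is a hypothetical helper that mirrors get_virtual_size, the 50 MiB body is a stand-in for a downloaded response, and the resource module is available only on Unix-like systems.

```python
import resource  # Unix-only, as in the spider
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of this process, in bytes."""
    size = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform != "darwin":
        size *= 1024  # Linux reports KB, macOS reports bytes
    return size


before = peak_rss_bytes()
body = b"x" * (50 * 1024 * 1024)  # stand-in for a downloaded response body
after_body = peak_rss_bytes()

# Building a full copy only to take its length, as the description outlines.
copy = b"".join([b"HTTP/1.1 200 OK\r\n\r\n", body])
after_copy = peak_rss_bytes()

print("growth from body:", after_body - before)
print("growth from copy:", after_copy - after_body)
```

Exact numbers depend on the platform and allocator, so the printed growth should be read as an indication, not a precise accounting.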
[memory_httprepr] used memory on start: | 61 587 456 | 61 521 920 | 61 558 784
[memory_httprepr] used memory after downloading response: | 375 910 400 | 271 179 776 | 61 558 784
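The growth implied by these log values can be worked out directly. The sketch below assumes, since the column labels are not shown, that the first value in each row is the 100MB download with DownloaderStats enabled, the second is the same download with the middleware disabled, and the third is a run against a much smaller page.

```python
# Values copied from the two log lines, in bytes.
start = [61_587_456, 61_521_920, 61_558_784]
after = [375_910_400, 271_179_776, 61_558_784]

# Memory growth per run between start and response callback.
growth = [a - s for s, a in zip(start, after)]

# Extra growth of the first run over the second (assumed: stats on vs. off).
extra = growth[0] - growth[1]

print(growth)                          # [314322944, 209657856, 0]
print(extra, round(extra / 2**20, 1))  # 104665088 99.8
```

Under that assumption, the extra growth of about 99.8 MiB is close to the size of the downloaded file, which is consistent with one additional copy of the response being allocated.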
Versions
[scrapy.utils.log] Scrapy 2.4.0 started (bot: httprepr)
[scrapy.utils.log] Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 23 2020, 14:32:57) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Linux-4.15.0-76-generic-x86_64-with-glibc2.2.5