Description
Summary
Usage of HttpCompressionMiddleware needs to be reflected in Scrapy stats.
Motivation
To estimate how efficiently Scrapy uses memory and to prevent memory leaks like this, I need to know:
- the number of request/response objects that can be active (can be obtained using trackref)
- the amount of memory required to store that number of request/response objects.
A lot of websites use compression to reduce traffic. In that case I would like to calculate the average size of decompressed responses to estimate the second item above.
Decompression means that at some point the application has to allocate memory for both the compressed and the decompressed response body, and I need to know both sizes to get a more complete picture of Scrapy's memory usage.
Also, the decompressed body can be several times larger than the compressed response, which affects Scrapy's memory usage.
Describe alternatives you've considered
The easiest one is to change the priority of the DownloaderStats middleware and compare the downloader/response_bytes stat between runs.
custom_settings = {
    "DOWNLOAD_DELAY": 1,
    "DOWNLOADER_MIDDLEWARES": {
        # Priority 50 makes DownloaderStats see requests first and responses last.
        "scrapy.downloadermiddlewares.stats.DownloaderStats": 50,
    },
}
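Why the priority change has this effect: Scrapy calls process_request in increasing priority order and process_response in decreasing priority order, so a middleware with a low number handles the response after the ones with higher numbers. A minimal sketch of that ordering, using the default priorities of the two middlewares involved (590 and 850); `call_order` and `stats_sees_decompressed` are hypothetical helpers for illustration, not part of Scrapy:

```python
# Default priorities from Scrapy's DOWNLOADER_MIDDLEWARES_BASE (subset).
DEFAULT = {"HttpCompressionMiddleware": 590, "DownloaderStats": 850}
# Same subset with DownloaderStats moved to priority 50.
MOVED = {"HttpCompressionMiddleware": 590, "DownloaderStats": 50}


def call_order(priorities):
    """Hypothetical helper: (process_request order, process_response order)."""
    request_order = sorted(priorities, key=priorities.get)
    response_order = list(reversed(request_order))
    return request_order, response_order


def stats_sees_decompressed(priorities):
    """True if DownloaderStats handles the response after decompression."""
    _, response_order = call_order(priorities)
    return (response_order.index("DownloaderStats")
            > response_order.index("HttpCompressionMiddleware"))


print(stats_sees_decompressed(DEFAULT))  # False
print(stats_sees_decompressed(MOVED))    # True
```

The same ordering explains why downloader/request_bytes drops: at priority 50 the request is measured before the other middlewares have added their headers.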
Stats from quotes.toscrape.com spider (it uses gzip compression) with default settings:
{'downloader/request_bytes': 2642,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 24534,
And with changed priority of DownloaderStats middleware:
{'downloader/request_bytes': 912, # size reduced as it didn't count size of request headers populated by downloader middlewares
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 110191, # it counted size of decompressed data
Average size of a compressed response (default settings): 2453 bytes.
Average size of a decompressed response: 11019 bytes (~4.5 times larger).
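For reference, the averages follow directly from the two stats dumps. The short calculation below assumes one response per request, since the excerpts show only downloader/request_count:

```python
request_count = 10           # downloader/request_count in both runs
compressed_bytes = 24534     # downloader/response_bytes, default settings
decompressed_bytes = 110191  # downloader/response_bytes, DownloaderStats at priority 50

avg_compressed = compressed_bytes / request_count
avg_decompressed = decompressed_bytes / request_count
ratio = decompressed_bytes / compressed_bytes

print(f"{avg_compressed:.1f} {avg_decompressed:.1f} {ratio:.2f}")
# prints: 2453.4 11019.1 4.49
```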
Additional context
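A potential solution is to add something like this:
self.stats.inc_value('decompressed_bytes', len(decoded_body), spider=spider)
to the process_response method of HttpCompressionMiddleware, where decoded_body stands for the decompressed body. The size is passed as the count because inc_value increments by 1 by default.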
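A rough, self-contained sketch of the idea, runnable without Scrapy. `StatsStub` is a stand-in that mimics only the `inc_value(key, count=1, start=0, spider=None)` signature of Scrapy's stats collector, and `record_decompression` and the `compressed_bytes` / `decompressed_bytes` keys are hypothetical names chosen for illustration, not the middleware's actual code:

```python
import gzip


class StatsStub:
    """Stand-in for Scrapy's stats collector; only inc_value is modelled."""

    def __init__(self):
        self.values = {}

    def inc_value(self, key, count=1, start=0, spider=None):
        self.values[key] = self.values.get(key, start) + count


def record_decompression(stats, compressed_body, spider=None):
    """Decompress a gzip body and record both sizes (hypothetical stat keys)."""
    decoded_body = gzip.decompress(compressed_body)
    # Pass the byte sizes as the count argument.
    stats.inc_value("compressed_bytes", len(compressed_body), spider=spider)
    stats.inc_value("decompressed_bytes", len(decoded_body), spider=spider)
    return decoded_body


stats = StatsStub()
body = gzip.compress(b"<html>" + b"<p>quote</p>" * 500 + b"</html>")
decoded = record_decompression(stats, body)
print(stats.values)
```

Recording both sizes in the same place would make the compression ratio available from a single run, without changing middleware priorities.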