
Usage of HttpCompressionMiddleware needs to be reflected in Scrapy stats #4797

Closed
GeorgeA92 opened this issue Sep 19, 2020 · 4 comments · Fixed by #4799

Comments

@GeorgeA92
Contributor

Summary

Usage of HttpCompressionMiddleware needs to be reflected in Scrapy stats.

Motivation

In order to estimate Scrapy memory usage efficiency and prevent memory leaks like this, I need to know:

  1. the number of request/response objects that can be active at once (can be obtained with trackref);
  2. the amount of memory required to store that number of request/response objects.
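The trackref idea in item 1 boils down to registering each object in a per-class collection of weak references, so counting live objects never keeps them alive. A stdlib-only sketch of that counting trick (this is an illustration, not Scrapy's actual `scrapy.utils.trackref` implementation, which uses a `WeakKeyDictionary` and a `prefs()` helper):

```python
import weakref
from collections import defaultdict

# One WeakSet per class: entries vanish automatically when the object dies.
live_refs = defaultdict(weakref.WeakSet)

class TrackedRef:
    """Register each instance so len(live_refs[cls]) counts live objects."""
    def __init__(self):
        live_refs[type(self)].add(self)

class Response(TrackedRef):
    pass

r1, r2 = Response(), Response()
print(len(live_refs[Response]))  # 2
del r1                           # refcount drops to zero, WeakSet shrinks
print(len(live_refs[Response]))  # 1
```

Multiplying such a live-object count by an average per-object size is exactly why the average decompressed response size below matters.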

Many websites use compression to reduce traffic, so to estimate item 2 I would like to calculate the average size of decompressed responses.

Decompression means that at some point the application has to allocate memory for both the compressed and the decompressed response body, and I need to know both sizes to get a more complete picture of Scrapy memory usage.

Moreover, the decompressed body is typically several times larger than the compressed response, which significantly affects Scrapy memory usage.

Describe alternatives you've considered

The easiest one is to change the priority of the DownloaderStats middleware and check the difference in the downloader/response_bytes stat.

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.stats.DownloaderStats": 50,
        },
    }

Stats from a quotes.toscrape.com spider (the site uses gzip compression) with default settings:

{'downloader/request_bytes': 2642,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 24534,

And with changed priority of DownloaderStats middleware:

{'downloader/request_bytes': 912, # smaller, since it no longer counts request headers populated by later downloader middlewares
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 110191,   # now counts the size of the decompressed data

Average size of a compressed response (default settings): 2453 bytes.
Average size of a decompressed response: 11019 bytes (~4.5 times larger).
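The averages follow directly from the two stats dumps above:

```python
# Numbers taken from the two stats dumps in this issue.
default_bytes = 24534      # downloader/response_bytes, default priorities (compressed)
reordered_bytes = 110191   # same stat with DownloaderStats at priority 50 (decompressed)
responses = 10             # downloader/response_count in both runs

avg_compressed = default_bytes / responses      # 2453.4 bytes
avg_decompressed = reordered_bytes / responses  # 11019.1 bytes
ratio = avg_decompressed / avg_compressed       # ~4.5x

print(avg_compressed, avg_decompressed, round(ratio, 1))
```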

Additional context

A potential solution is to add something like this to the process_response method of HttpCompressionMiddleware:

    self.stats.inc_value('decompressed_bytes', count=len(decoded_body), spider=spider)

(inc_value increments by count, so passing the decoded body length accumulates bytes rather than merely counting calls.)
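Sketched with a plain dict standing in for Scrapy's StatsCollector; the stat names and the record_decompression helper below are illustrative only, not necessarily what #4799 merged:

```python
import gzip

def record_decompression(stats: dict, compressed: bytes, decompressed: bytes) -> None:
    """Accumulate what a stats-aware HttpCompressionMiddleware could record:
    one response counted, plus compressed and decompressed byte totals."""
    stats["httpcompression/response_count"] = stats.get("httpcompression/response_count", 0) + 1
    stats["httpcompression/compressed_bytes"] = stats.get("httpcompression/compressed_bytes", 0) + len(compressed)
    stats["httpcompression/decompressed_bytes"] = stats.get("httpcompression/decompressed_bytes", 0) + len(decompressed)

stats = {}
body = b"quote " * 500  # 3000 bytes of repetitive text, compresses well
record_decompression(stats, gzip.compress(body), body)
print(stats["httpcompression/response_count"])      # 1
print(stats["httpcompression/decompressed_bytes"])  # 3000
```

Dividing the accumulated decompressed bytes by the response count gives exactly the average size this issue asks for, without reordering any middlewares.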

@luckyguy73

fyi the issue #4797 description has a typo "relfected in Scrapy stats" -> "reflected in Scrapy stats"

@GeorgeA92 GeorgeA92 changed the title Usage of HttpCompressionMiddleware needs to be relfected in Scrapy stats Usage of HttpCompressionMiddleware needs to be reflected in Scrapy stats Oct 5, 2020
@codevbus

codevbus commented Oct 6, 2020

I'd like to take this.

@Gallaecio
Member

@codevbus #4799 is quite close to completion already.

@codevbus

codevbus commented Oct 6, 2020 via email
