
Usage of HttpCompressionMiddleware needs to be reflected in Scrapy stats #4797

@GeorgeA92


Summary

Usage of HttpCompressionMiddleware needs to be reflected in Scrapy stats.

Motivation

To estimate Scrapy's memory usage and prevent memory leaks like this, I need to know:

  1. the number of request/response objects that can be active (this can be obtained with trackref)
  2. the amount of memory required to store that number of request/response objects.

Many websites use compression to reduce traffic. In that case I would like to calculate the average size of decompressed responses in order to estimate item 2.

Decompression means that at some point the application has to allocate memory for both the compressed and the decompressed response body, and I need to know these sizes to get a more complete picture of Scrapy's memory usage.

Also, the decompressed body is typically several times larger than the compressed response, which affects Scrapy's memory usage.
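To illustrate the point about both bodies being in memory at once, here is a minimal standard-library sketch. The payload is made up for illustration and the measured ratio depends entirely on the content; it is not a measurement of Scrapy itself.

```python
import gzip

# Made-up, highly repetitive HTML-like payload (an assumption for illustration).
body = b"<div class='quote'>placeholder text</div>\n" * 500

compressed = gzip.compress(body)            # what arrives over the wire
decompressed = gzip.decompress(compressed)  # what the spider callback sees

compressed_size = len(compressed)
decompressed_size = len(decompressed)

# While decompressing, both byte strings are alive at the same time.
peak_bytes = compressed_size + decompressed_size
ratio = decompressed_size / compressed_size

print(f"compressed:   {compressed_size} bytes")
print(f"decompressed: {decompressed_size} bytes")
print(f"both in memory: {peak_bytes} bytes (ratio {ratio:.1f}x)")
```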

Describe alternatives you've considered

The easiest one is to change the priority of the DownloaderStats middleware and compare the downloader/response_bytes stat between runs.

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "DOWNLOADER_MIDDLEWARES": {
            # Low priority number: process_response runs after decompression
            "scrapy.downloadermiddlewares.stats.DownloaderStats": 50,
        },
    }

Stats from a quotes.toscrape.com spider (the site uses gzip compression) with default settings:

{'downloader/request_bytes': 2642,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 24534,

And with the changed priority of the DownloaderStats middleware:

{'downloader/request_bytes': 912, # size reduced as it didn't count size of request headers populated by downloader middlewares
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 110191,   # it counted size of decompressed data 

Average size of a compressed response (default settings): 2453 bytes.
Average size of a decompressed response: 11019 bytes (~4.5 times larger).
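The averages above can be reproduced from the two stats dumps. This is a small sketch that assumes each of the 10 requests produced exactly one response, since the response count is not shown in the excerpts; the helper name is made up for illustration.

```python
# Values copied from the two stats dumps above.
default_stats = {
    "downloader/request_count": 10,
    "downloader/response_bytes": 24534,
}
reordered_stats = {
    "downloader/request_count": 10,
    "downloader/response_bytes": 110191,
}


def average_response_size(stats):
    """Hypothetical helper: response bytes divided by request count.

    Assumes one response per request.
    """
    return stats["downloader/response_bytes"] / stats["downloader/request_count"]


avg_compressed = average_response_size(default_stats)
avg_decompressed = average_response_size(reordered_stats)
ratio = avg_decompressed / avg_compressed

print(f"compressed:   {avg_compressed:.0f} bytes")
print(f"decompressed: {avg_decompressed:.0f} bytes")
print(f"ratio:        {ratio:.2f}x")
```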

Additional context

A potential solution is to add something like this to the process_response method of HttpCompressionMiddleware:

    self.stats.inc_value('decompressed_bytes', spider=spider)
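Until something like that exists in HttpCompressionMiddleware itself, the same number could be collected by a separate downloader middleware. The sketch below is an assumption, not Scrapy code: the class name, the stat key, and the stand-in objects used for the demo are all hypothetical. Note that inc_value increments by 1 unless a count is passed, so the sketch passes the body length explicitly.

```python
from types import SimpleNamespace


class DecompressedSizeStats:
    """Hypothetical downloader middleware; not part of Scrapy."""

    STAT_KEY = "httpcompression/decompressed_bytes"  # assumed stat name

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and exposes the stats collector as crawler.stats.
        return cls(crawler.stats)

    def process_response(self, request, response, spider):
        # Add the body size in bytes, not just 1 per response.
        self.stats.inc_value(self.STAT_KEY, len(response.body), spider=spider)
        return response


class FakeStats:
    """Stand-in for a stats collector, only for this demo."""

    def __init__(self):
        self.values = {}

    def inc_value(self, key, count=1, start=0, spider=None):
        self.values[key] = self.values.get(key, start) + count


# Demo with stand-in objects instead of a running crawler.
crawler = SimpleNamespace(stats=FakeStats())
middleware = DecompressedSizeStats.from_crawler(crawler)
for size in (11000, 11038):
    response = SimpleNamespace(body=b"x" * size)
    middleware.process_response(request=None, response=response, spider=None)

total = crawler.stats.values[DecompressedSizeStats.STAT_KEY]
print(total)
```

To see decompressed bodies, such a middleware would need a DOWNLOADER_MIDDLEWARES priority number below that of HttpCompressionMiddleware (590 in the default settings), for the same reason that moving DownloaderStats to 50 made it count decompressed data. As written, it counts every response body it sees, including ones that were never compressed.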
