response_httprepr in DownloaderStats middleware causes Scrapy to make unnecessary memory allocations #4964

Closed
GeorgeA92 opened this issue Jan 28, 2021 · 4 comments · Fixed by #4972

Comments

@GeorgeA92
Contributor

Description

Usage of scrapy.utils.response.response_httprepr inside the DownloaderStats middleware causes the application to make unnecessary memory allocations.
response_httprepr is used in only one place: to calculate response sizes for the downloader/response_bytes stat in that middleware.

In the current implementation response_httprepr returns bytes (an immutable type), so to calculate downloader/response_bytes the application additionally allocates nearly the same amount of memory as the original response occupies, only to take the len of the result inside the middleware.
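For context, the middleware's only use of it is in process_response (a paraphrase of scrapy/downloadermiddlewares/stats.py, with the unrelated stat counters omitted):

```python
def process_response(self, request, response, spider):
    # The entire serialised response is built only to measure its length.
    reslen = len(response_httprepr(response))
    self.stats.inc_value('downloader/response_bytes', reslen, spider=spider)
    return response
```

response_httprepr itself builds the full serialised response: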

```python
def response_httprepr(response):
    """Return raw HTTP representation (as bytes) of the given response. This
    is provided only for reference, since it's not the exact stream of bytes
    that was received (that's not exposed by Twisted).
    """
    values = [
        b"HTTP/1.1 ",
        to_bytes(str(response.status)),
        b" ",
        to_bytes(http.RESPONSES.get(response.status, b'')),
        b"\r\n",
    ]
    if response.headers:
        values.extend([response.headers.to_string(), b"\r\n"])
    values.extend([b"\r\n", response.body])
    return b"".join(values)
```
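The duplicate allocation can be avoided by summing the lengths of the same components instead of joining them. A minimal sketch of the idea (the response_size name and exact byte accounting are illustrative, not the actual change from #4972):

```python
from twisted.web import http

from scrapy.utils.python import to_bytes


def response_size(response):
    """Sum the lengths of the parts response_httprepr would join,
    without building a second, body-sized bytes object."""
    # Status line: b"HTTP/1.1 " + status + b" " + reason + b"\r\n"
    size = len(b"HTTP/1.1 ") + len(to_bytes(str(response.status))) + len(b" ")
    size += len(to_bytes(http.RESPONSES.get(response.status, b""))) + len(b"\r\n")
    if response.headers:
        # Headers are tiny compared to the body, so serialising them is cheap.
        size += len(response.headers.to_string()) + len(b"\r\n")
    # Blank line separating headers from body, plus the body itself.
    size += len(b"\r\n") + len(response.body)
    return size
```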

Steps to Reproduce

To demonstrate the impact, I made the following spider:
```python
import sys
from importlib import import_module

import scrapy


class MemoryHttpreprSpider(scrapy.Spider):
    name = 'memory_httprepr'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.stats.DownloaderStats': None
        }
    }

    # the same as in the MemoryUsage extension:
    def get_virtual_size(self):
        size = self.resource.getrusage(self.resource.RUSAGE_SELF).ru_maxrss
        if sys.platform != 'darwin':
            # on macOS ru_maxrss is in bytes, on Linux it is in KB
            size *= 1024
        return size

    def start_requests(self):
        try:
            self.resource = import_module('resource')
        except ImportError:
            pass
        self.logger.info(f"used memory on start: {self.get_virtual_size()}")
        yield scrapy.Request(url='https://speed.hetzner.de/100MB.bin', callback=self.parse)
        # yield scrapy.Request(url='http://quotes.toscrape.com', callback=self.parse)

    def parse(self, response, **kwargs):
        self.logger.info(f"used memory after downloading response: {self.get_virtual_size()}")
```
The resulting log output (memory in bytes):

| log output | 100MB, default settings | 100MB, no DownloaderStats | quotes, default |
|---|---|---|---|
| used memory on start | 61 587 456 | 61 521 920 | 61 558 784 |
| used memory after downloading response | 375 910 400 | 271 179 776 | 61 558 784 |

Versions

[scrapy.utils.log] Scrapy 2.4.0 started (bot: httprepr)
[scrapy.utils.log] Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 23 2020, 14:32:57) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Linux-4.15.0-76-generic-x86_64-with-glibc2.2.5

@GeorgeA92
Contributor Author

Remaining things related to this issue:

  • Usage of request_httprepr (a sketch of the equivalent fix follows this list).
  • In the table from the issue description we can clearly see that three ~100 MB memory chunks were allocated:
    1. the original response.body, received from the Twisted side;
    2. the response_httprepr copy described in this issue;
    3. probably a side effect of Wrong type(response) for binary responses #4240: Scrapy identified this response as text, and when it was converted to str another ~100 MB memory chunk was allocated.
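The same pattern applies to request_httprepr, which joins the request line, a Host header, the remaining headers and the body into one bytes object. A sketch of an allocation-free equivalent under the same assumptions as above (request_size is an illustrative name; the simplified path handling ignores URL params):

```python
from urllib.parse import urlparse

from scrapy.utils.python import to_bytes


def request_size(request):
    # Mirrors request_httprepr's layout:
    # "<method> <path> HTTP/1.1\r\nHost: <host>\r\n<headers>\r\n\r\n<body>"
    parsed = urlparse(request.url)
    path = parsed.path or '/'
    if parsed.query:
        path += '?' + parsed.query
    size = len(to_bytes(request.method)) + len(b" ") + len(to_bytes(path))
    size += len(b" HTTP/1.1\r\n")
    size += len(b"Host: ") + len(to_bytes(parsed.hostname or '')) + len(b"\r\n")
    if request.headers:
        size += len(request.headers.to_string()) + len(b"\r\n")
    size += len(b"\r\n") + len(request.body or b"")
    return size
```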

@aadityasinha-dotcom

Can I work on this issue?

@NIKZ3

NIKZ3 commented Mar 30, 2021

I would like to work on this issue

@elacuesta
Member

For the record, there's already #4972 for this
