Skip to content

response_httprepr memory issue fixed #4972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jan 28, 2022
Merged

Conversation

GeorgeA92
Copy link
Contributor

@GeorgeA92 GeorgeA92 commented Feb 2, 2021

This code change aimed to fix memory issue #4964

Fixes #4964

@codecov
Copy link

codecov bot commented Feb 2, 2021

Codecov Report

Merging #4972 (f958ccd) into master (23cfdb0) will increase coverage by 0.35%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master    #4972      +/-   ##
==========================================
+ Coverage   88.16%   88.52%   +0.35%     
==========================================
  Files         162      163       +1     
  Lines       10487    10603     +116     
  Branches     1517     1527      +10     
==========================================
+ Hits         9246     9386     +140     
+ Misses        966      943      -23     
+ Partials      275      274       -1     
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/stats.py 91.66% <91.66%> (-0.93%) ⬇️
scrapy/utils/response.py 89.79% <100.00%> (+0.43%) ⬆️
scrapy/extensions/feedexport.py 95.09% <0.00%> (-0.23%) ⬇️
scrapy/utils/python.py 87.64% <0.00%> (ø)
scrapy/http/request/rpc.py 100.00% <0.00%> (ø)
scrapy/spiders/__init__.py 100.00% <0.00%> (ø)
scrapy/http/response/text.py 100.00% <0.00%> (ø)
scrapy/commands/startproject.py 100.00% <0.00%> (ø)
scrapy/http/request/__init__.py 97.80% <0.00%> (ø)
scrapy/http/request/json_request.py 94.87% <0.00%> (ø)
... and 18 more

@wRAR
Copy link
Member

wRAR commented Feb 2, 2021

"/home/runner/work/scrapy/scrapy/scrapy/downloadermiddlewares/stats.py:3:1: F401 'scrapy.utils.response.response_httprepr' imported but unused"

@GeorgeA92
Copy link
Contributor Author

GeorgeA92 commented Jun 4, 2021

Added function for calculating headers size.

Implementation of get_header_size based on w3lib.http.headers_dict_to_raw function used in response_httprepr

For url http://quotes.toscrape.com/page/1/
I received:
len(response_httprepr(response)) - 2204 bytes
len(response.body) - 2039 bytes
len(response.body) + get_header_size(response.headers) - 2183 bytes

In this implementation DownloaderStats middleware instead of memory intensive len(response_httprepr(response)) will use len(response.body) + get_header_size(response.headers) for calculating response size used in downloader/response_bytes stats parameter

@Gallaecio
Copy link
Member

So it’s still not the same size. Could you work out an implementation that returns the same as response_httprepr, taking into account additional data that response_httprepr takes into account (e.g. status code, line breaks)?

@GeorgeA92
Copy link
Contributor Author

@Gallaecio Added counting status code related part and remaining linebreaks.

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we now have a test that compares the result to that of len(response_httprepr(response)), and verifies it matches for some corner cases? (empty/non-empty body, 0-1-2 headers)

@GeorgeA92
Copy link
Contributor Author

@Gallaecio

Can we now have a test that compares the result to that of len(response_httprepr(response)), and verifies it matches for some corner cases? (empty/non-empty body, 0-1-2 headers)

Ready -> GeorgeA92@3805bab

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

I would still like to address #4972 (comment), but since this is a test it is not that important.

@wRAR wRAR merged commit 4bdaa54 into scrapy:master Jan 28, 2022
@GeorgeA92 GeorgeA92 deleted the httprepr-4964 branch January 10, 2023 08:51
@@ -50,6 +51,7 @@ def response_status_message(status: Union[bytes, float, int, str]) -> str:
return f'{status_int} {to_unicode(message)}'


@deprecated
def response_httprepr(response: Response) -> bytes:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wRAR FYI this function is deprecated here (and removed in 2.11) but never mentioned in the release logs (nor its removal)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, I’ve logged #6207.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

response_httprepr in DownloaderStats middleware causing scrapy to make unnecessary memory allocation.
5 participants