
[MRG+1] Add ability to change max_active_size by setting #3551

Merged (2 commits into scrapy:master, Jan 23, 2020)

Conversation

@jpbalarini (Contributor) commented Dec 26, 2018

This PR fixes #1410

In our setup we had URLs whose responses were bigger than 5 MB (the default value), and we were reaching a deadlock scenario.
These changes make max_active_size configurable via the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting.
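
For example, in a project's settings.py (a hypothetical snippet; the 10 MB value is arbitrary):

```python
# settings.py (illustrative): raise the scraper slot's soft limit
# from the 5 MB default, e.g. when response bodies exceed 5000000 bytes.
SCRAPER_SLOT_MAX_ACTIVE_SIZE = 10000000  # 10 MB
```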

Thanks!

@codecov (bot) commented Dec 26, 2018

Codecov Report

Merging #3551 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3551      +/-   ##
==========================================
+ Coverage   84.44%   84.45%   +<.01%     
==========================================
  Files         167      167              
  Lines        9401     9402       +1     
  Branches     1396     1396              
==========================================
+ Hits         7939     7940       +1     
  Misses       1204     1204              
  Partials      258      258
Impacted Files                        Coverage Δ
scrapy/core/scraper.py                88.51% <100%> (ø) ⬆️
scrapy/settings/default_settings.py   98.64% <100%> (ø) ⬆️

@codecov (bot) commented Dec 26, 2018

Codecov Report

Merging #3551 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3551      +/-   ##
==========================================
- Coverage   84.06%   84.03%   -0.03%     
==========================================
  Files         166      166              
  Lines        9737     9738       +1     
  Branches     1454     1454              
==========================================
- Hits         8185     8183       -2     
- Misses       1297     1301       +4     
+ Partials      255      254       -1
Impacted Files                        Coverage Δ
scrapy/core/scraper.py                89.67% <100%> (ø) ⬆️
scrapy/settings/default_settings.py   98.7% <100%> (ø) ⬆️
scrapy/utils/test.py                  51.66% <0%> (-6.67%) ⬇️
scrapy/utils/defer.py                 92.5% <0%> (-2.5%) ⬇️
scrapy/core/downloader/__init__.py    90.83% <0%> (+1.52%) ⬆️
scrapy/utils/trackref.py              85.71% <0%> (+2.85%) ⬆️

@jpbalarini (Contributor, Author) commented Jan 3, 2019

@kmike what do you think of these changes?
Thanks!

Review comment on the proposed documentation text:

    Default: ``5000000``

    Maximum size (in bytes) which the scraper can handle at once. Change this setting
    if your requests will have a body larger than this amount.

@dangra (Member) commented Jan 3, 2019

The explanation of the setting is not accurate: even if the value is 1 byte, the scraper will still process bigger responses.

It is a soft limit, in the sense that the scraper processes multiple responses until the sum of all responses being concurrently processed by the scraper component (mainly spider code) goes above the limit. When it goes over, Scrapy core pauses the downloader (active downloads continue as normal, but new ones are not started). This mechanism prevents response bodies from blowing up process memory by keeping only a sensible number of responses in memory.
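
A minimal sketch of that bookkeeping (names modeled loosely on scrapy/core/scraper.py; this is illustrative, not the exact implementation):

```python
class Slot:
    """Tracks the total size of responses the scraper is processing."""

    MIN_RESPONSE_SIZE = 1024  # count even tiny or absent bodies as something

    def __init__(self, max_active_size=5000000):
        self.max_active_size = max_active_size
        self.active_size = 0

    def add_response(self, body_length):
        # Called when the scraper starts processing a response.
        self.active_size += max(body_length, self.MIN_RESPONSE_SIZE)

    def finish_response(self, body_length):
        # Called when spider code is done with the response.
        self.active_size -= max(body_length, self.MIN_RESPONSE_SIZE)

    def needs_backout(self):
        # Checked by the engine before starting new downloads; while this
        # returns True, no new requests are sent to the downloader.
        return self.active_size > self.max_active_size
```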

@jpbalarini (Contributor, Author) replied Jan 3, 2019

@dangra what do you think of the following (basically what you said):

Maximum size (in bytes) which the scraper handles concurrently. It works as a soft limit, in the sense that the scraper component will process multiple concurrent responses until the sum of all of them goes above the limit. When this happens, Scrapy core pauses the downloader (active downloads continue as normal, but new ones are not started). This mechanism prevents response bodies from taking too much process memory by keeping only a sensible number of responses in memory.

Feel free to correct anything you want to rephrase or add.

@Gallaecio (Member) commented Apr 26, 2019

What about this?

Soft limit (in bytes) for response data being processed.

While the sum of the sizes of all responses being processed is above this value, Scrapy does not process new requests.

@dangra (Member) commented Jan 3, 2019

I think the change is simple, and it makes sense to make that limit configurable via settings, but I wonder what real-world case motivates it.

IMO, increasing the limit will only mitigate and delay the real cause of the deadlock mentioned in #1410.

@jpbalarini (Contributor, Author) commented Jan 17, 2019

@dangra what do you think of the reply I made to your comment above? If you are OK with it, I can update the README with that.
Thanks!

@jpbalarini force-pushed the change_scraper_slot branch 2 times, most recently from 05e71be to a404de7, on May 20, 2019
@jpbalarini (Contributor, Author) commented May 20, 2019

@Gallaecio changed the documentation to what you proposed. Let me know what you think.
Thanks!


Review comment on the proposed documentation text:

    SCRAPER_SLOT_MAX_ACTIVE_SIZE
    ----------------------------

    Default: ``5000000``

@Gallaecio (Member) commented May 22, 2019

💄 I recently found out about PEP 515; maybe we should use it here? Even though it only works in Python 3.6+, I think it makes sense to use it in the documentation.
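
For reference, PEP 515 (Python 3.6+) allows underscores as digit-group separators in numeric literals, so the default could be written as:

```python
SCRAPER_SLOT_MAX_ACTIVE_SIZE = 5_000_000  # same value as 5000000, easier to read
```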

@Gallaecio (Member) commented May 22, 2019

I left a minor, aesthetic comment, but the changes already look good to me as they are.

@Gallaecio changed the title from "Add ability to change max_active_size by setting" to "[MRG+1] Add ability to change max_active_size by setting" on May 22, 2019
@jpbalarini (Contributor, Author) commented May 22, 2019

> I left a minor, aesthetic comment, but the changes already look good to me as they are.

@Gallaecio Corrected the documentation as you suggested.

@kmike merged commit 6a98d66 into scrapy:master on Jan 23, 2020 (2 checks passed)
@kmike (Member) commented Jan 23, 2020

Thanks @jpbalarini!

Merging this pull request may close: allow specifying max_active_size for scraper.slot (#1410)