
[MRG+1] Downloader-aware Priority Queue for Scrapy #3520

Merged · 40 commits · Apr 6, 2019

Conversation

whalebot-helmsman
Contributor

whalebot-helmsman commented Dec 5, 2018

This is the long-awaited (by the community: #1802, #2474 and #3140) priority queue implementation for the multi-host case. Contrary to past proposals, it is based on direct interaction with the downloader via Downloader signals, not on a round-robin principle; round-robin is much slower according to benchmarks. It became possible after a new signal was added in #3393.
The intention was to provide a solution for this pain point in as short a term as possible, so two problems were left out of scope:

  • I didn't directly use Adding domain #3160, but took a lot of inspiration from it. Thanks @tianhuil. #3160 replaces priorities with hostnames as the slot used for scheduling; as a result, request priorities do not affect scheduling. I instead tried to merge priority and hostname into a single structure that is consumable by the old downstream priority queue, so priorities still affect scheduling. Neither way is perfect; a perfect solution would require refactoring the current Scheduler structure.
  • This new priority queue is not set as the default, because it does not support the CONCURRENCY_PER_IP option. To support it, we would have to resolve hostnames when adding requests to the queue. Hostname resolution is a deferred operation, and code calling Scheduler.enqueue_request would have to be ready for that, which is not the case now.
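The "merge priority and hostname into a single structure" idea can be sketched roughly as follows. This is an illustrative simplification with made-up names (`downloader_slot`, `merged_key`), not the actual code in scrapy/pqueues.py:

```python
# Illustrative sketch of a "merged" scheduling key: combine the request's
# priority with its downloader slot (hostname), so a plain downstream
# priority queue still orders by priority while keeping per-host grouping.
# Names here are hypothetical, not Scrapy's actual API.
from urllib.parse import urlparse


def downloader_slot(url):
    # Simplified stand-in for the downloader's slot detection, which is
    # hostname-based (the real Downloader can also honour request meta).
    return urlparse(url).hostname or ''


def merged_key(priority, url):
    # Tuples compare element-wise, so sorting by this key orders requests
    # by priority first and only then separates them by host slot.
    return (priority, downloader_slot(url))
```

With a key like this, requests keep their priority ordering (unlike a hostname-only slot as in #3160), while the queue can still tell which host slot a request belongs to.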

I ran benchmarks using https://github.com/scrapy/scrapy-bench. In the single-host case (bookworm spider) the new priority queue does not affect performance. In the multi-host case (broadworm spider) it improves performance 10x. Exact numbers are available at https://docs.google.com/spreadsheets/d/1f8iSzfmLuiG9O-mDgG4iDb4fGvAlnkC4sH_zr9Qlj-s/edit#gid=0, and full logs at https://gist.github.com/whalebot-helmsman/83ed45a1aa1b4221ab6be2f36226c5e3. A round-robin-based queue also does not affect the single-host case, but improves the multi-host one only 2-3x; that's why I removed the round-robin implementation.

Settings used for bookworm https://github.com/scrapy/scrapy-bench/blob/f03011d63f4149eeaad6dd7908ffa6fcbbb3304e/books/books/settings.py
Settings used for broadworm https://github.com/scrapy/scrapy-bench/blob/f03011d63f4149eeaad6dd7908ffa6fcbbb3304e/broad/broad/settings.py

List of changes:

  • added tests for the current priority queue
  • created the priority queue in the scheduler using create_instance
  • added a DOWNLOADER_SLOT constant to Downloader
  • added a wrapper for queuelib.PriorityQueue to handle merged priorities
  • added the DownloaderAwarePriorityQueue implementation
  • added tests for DownloaderAwarePriorityQueue
  • added a request_left_downloader signal, as response_downloaded is sent only in case of a successful download
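For reference, the new queue can be opted into from project settings once merged; it is not the default, as noted above. The setting name and class path below match this PR's additions, but should be checked against the Scrapy version in use:

```python
# settings.py (sketch): opt in to the downloader-aware queue added by this PR.
# Not enabled by default because CONCURRENCY_PER_IP is not supported by it.
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'

# Per-IP concurrency must remain disabled (its default) with this queue.
CONCURRENCY_PER_IP = 0
```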

@codecov

codecov bot commented Dec 5, 2018

Codecov Report

Merging #3520 into master will increase coverage by 0.55%.
The diff coverage is 96.32%.

@@            Coverage Diff             @@
##           master    #3520      +/-   ##
==========================================
+ Coverage   84.72%   85.28%   +0.55%     
==========================================
  Files         168      169       +1     
  Lines        9460     9582     +122     
  Branches     1407     1422      +15     
==========================================
+ Hits         8015     8172     +157     
+ Misses       1188     1160      -28     
+ Partials      257      250       -7
Impacted Files Coverage Δ
scrapy/squeues.py 100% <100%> (ø) ⬆️
scrapy/settings/default_settings.py 98.64% <100%> (ø) ⬆️
scrapy/core/downloader/__init__.py 89.05% <100%> (+0.08%) ⬆️
scrapy/core/scheduler.py 90.35% <86.2%> (+28.21%) ⬆️
scrapy/pqueues.py 98.97% <98.97%> (ø)
scrapy/downloadermiddlewares/redirect.py 96.77% <0%> (+0.05%) ⬆️
scrapy/core/engine.py 92.76% <0%> (+0.45%) ⬆️
scrapy/dupefilters.py 98.07% <0%> (+1.92%) ⬆️
scrapy/spiders/crawl.py 82.35% <0%> (+3.97%) ⬆️
... and 2 more

@whalebot-helmsman
Contributor Author

whalebot-helmsman commented Dec 17, 2018

It is not possible to check correct scheduling for DownloaderAwarePriorityQueue in an integration test: too many things happen simultaneously (signals, site responses and scheduling moments). I removed the integration test and kept the unit test.
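In contrast, unit tests can exercise the queue deterministically. A toy illustration of the idea (stdlib only; `ToyDownloaderAwareQueue` is a made-up simplification, not the PR's implementation): with a fixed sequence of pushes, pops are fully predictable because the queue always picks the least-busy slot.

```python
# Unit-test-style sketch of why this behaviour is checkable in isolation:
# with fixed pushes and no real network, pops are fully deterministic.
# This toy queue only mimics the idea of picking the least-loaded
# downloader slot; it is not Scrapy code.
from collections import defaultdict, deque


class ToyDownloaderAwareQueue:
    def __init__(self):
        self.slots = defaultdict(deque)   # slot name -> pending requests
        self.active = defaultdict(int)    # slot name -> in-flight downloads

    def push(self, slot, request):
        self.slots[slot].append(request)

    def pop(self):
        # Pick the non-empty slot with the fewest in-flight downloads.
        candidates = [s for s, q in self.slots.items() if q]
        if not candidates:
            return None
        slot = min(candidates, key=lambda s: self.active[s])
        self.active[slot] += 1
        return self.slots[slot].popleft()

    def request_finished(self, slot):
        self.active[slot] -= 1


q = ToyDownloaderAwareQueue()
q.push('a.example', 'a1')
q.push('a.example', 'a2')
q.push('b.example', 'b1')
# The first pop serves a.example; the second prefers the idle b.example slot.
first, second = q.pop(), q.pop()
```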

@tianhuil

I really hope we can get this feature merged. It's important, and I just haven't had time to work on my PR.

@kmike
Member

kmike commented Dec 26, 2018

I've started to look at the code; proposed (incomplete, and rather mechanical) cleanup so far: whalebot-helmsman#2

kmike added this to the v1.7 milestone Jan 31, 2019
kmike mentioned this pull request Feb 9, 2019
@whalebot-helmsman
Contributor Author

I removed the commits related to the request_left_downloader signal, but they are preserved in the https://github.com/whalebot-helmsman/scrapy/tree/signal-backup branch

@kmike
Member

kmike commented Mar 22, 2019

Hey @whalebot-helmsman!

I've left some comments about tests, but besides these comments the implementation looks good to me.

Could you please also clean up the commit history?

I think after fixing these minor issues with tests and cleaning up the commit history the PR is ready to merge 🎉

whalebot-helmsman and others added 5 commits March 22, 2019 15:40
handle exception
use O(N) instead of O(NlogN)
here we have request as struct
additional check for emptiness
small performance improvement
do not consume another request
test number of responses
mark requests
back to 3 slots test case
raise exceptions in case of missed meta
add marks to requests and work only with your own requests
only disk queue should obtain signals
separate functions for slot read/write
use signals without variable
stop crawler
get signals in correct place
logic test for downloader-aware priority queue
update comment for structure
ensure text type
transform slot name to path
use implicit structure
use unicode type implicitly
use real crawler
add signals
more slot accounting
simple implementation of pop
small slot accounting code
no need for custom len function
ability to call super in py27
add slots
generic tests for downloader aware queue
dummy implementation of crawler aware priority queue
move common logic to base class
rename class
pass crawler to pqclass constructor
do not copy queuelib.PriorityQueue code
add comment about new class
remove obsolete function
modify behaviour of queuelib.PriorityQueue to dodge very complex priority
better way to get name
remove obsolete commentary
check boundaries
function for priority conversion with known limits
correct import path
move file
do not switch on by default as ip concurrency not supported
set scheduler slot in case of empty slot
use constant
single place for added urls
single place for constants
use as default queue
correct format for error text
test migration from old version with on disk queue
in these tests we have only two inflection points - jobdir and priority_queue_cls
we do not need separate mock spider, use usual one
do not rely on order of dict elements, imply order of list
test round-robin behaviour of priority queue
add comments and requirements for our magic function
remove debug logging
put queues into slot
as we fabricate priorities we do not need special types anymore
fabricate priority for priority queue
more versatile priorities
Scheduler class is not inflection point
wrap correct types
check for emptiness before initialization
tests for new priority queue
correct default type for startprios
use exact values
put common settings to base class
test priorities for disk scheduler
test dequeue for disk scheduler
test length for disk scheduler
setUp/tearDown methods for on disk schedulers
new methods
remove excessive line
base class to handle scheduler creation
correct method names
test priorities
deque test
close scheduler on test end
enqueue some requests
test template for scheduler
use downloader slot
I/O implementation for RoundRobin queue
round-robin implementation without I/O and slot detection
wrappers for every disk queue class
PEP8 fixes
no need to close implicitly
do not use pytest
need to put it into class
remove round-robin queue
additional check for empty queue
use pytest tmpdir fixture
kmike changed the title from "Downloader-aware Priority Queue for Scrapy" to "[MRG+1] Downloader-aware Priority Queue for Scrapy" Mar 26, 2019
@kmike
Member

kmike commented Mar 26, 2019

I think it's good to merge, but the commit history is still not super clean:

  • there are commits not related to this PR, such as afdb69e or 9c31480; if I'm not mistaken, they'd be duplicated in Scrapy's history if we merge it;
  • some of the commit names can be improved, e.g. 9093495 shouldn't be named "actually apply __slots__ suggestion", since it is a refactoring

I'm OK with merging it as-is, but if it's not too hard, it'd be good to clean it up.

Co-Authored-By: whalebot-helmsman <whalebot.helmsman@gmail.com>
Member

Gallaecio left a comment

I’ve left a few comments, but they are mostly aesthetic. It looks good to me.

@kmike Do you think we should get a third opinion here?

Gallaecio and others added 10 commits March 29, 2019 10:28
Co-Authored-By: whalebot-helmsman <whalebot.helmsman@gmail.com>
Co-Authored-By: whalebot-helmsman <whalebot.helmsman@gmail.com>
Co-Authored-By: whalebot-helmsman <whalebot.helmsman@gmail.com>
Co-Authored-By: whalebot-helmsman <whalebot.helmsman@gmail.com>
…lmsman/scrapy into round-robin-scheduler-tested
@kmike
Member

kmike commented Apr 6, 2019

Thanks @whalebot-helmsman and everyone who provided feedback!

I think we don't need a third opinion @Gallaecio (asked @dangra a while ago about this).

@whalebot-helmsman
Contributor Author

Thanks for the review, @kmike, @lucywang000 and @Gallaecio!

@elacuesta
Member

elacuesta commented Jan 30, 2020

Hi there @whalebot-helmsman! I think the request_left_downloader signal could still be useful; do you have any plans to submit a new PR with it?
