[MRG+1] Downloader-aware Priority Queue for Scrapy #3520
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #3520      +/-   ##
==========================================
+ Coverage   84.72%   85.28%   +0.55%
==========================================
  Files         168      169       +1
  Lines        9460     9582     +122
  Branches     1407     1422      +15
==========================================
+ Hits         8015     8172     +157
+ Misses       1188     1160      -28
+ Partials      257      250       -7
```
It is not possible to check correct scheduling for
I really hope we can get this feature in; it's important, and I just haven't had time to work on my own PR.
I've started to look at the code; proposed (incomplete, and rather mechanical) cleanup so far: whalebot-helmsman#2
I removed commits related to
Hey @whalebot-helmsman! I've left some comments about tests, but apart from those the implementation looks good to me. Could you please also clean up the commit history? After fixing these minor test issues and cleaning up the history, the PR is ready to merge 🎉
(Commit list from the rewritten history, condensed: implementation of the downloader-aware priority queue with downloader signals and per-slot accounting, priority conversion with known limits, removal of the round-robin implementation since IP concurrency is not supported, migration tests for on-disk queues, scheduler and priority-queue tests, and PEP8/cleanup fixes. One commit's initial content covers the workaround for scrapy#2473.)
I think that's good to merge, but commit history is not super-clean still:
I'm ok with merging it as-is, but if that's not too hard, it'd be good to clean it up.
Co-Authored-By: whalebot-helmsman <whalebot.helmsman@gmail.com>
I’ve left a few comments, but they are mostly aesthetic. It looks good to me.
@kmike Do you think we should get a third opinion here?
Thanks @whalebot-helmsman and everyone who provided feedback! I think we don't need a third opinion @Gallaecio (asked @dangra a while ago about this).
Thanks for the review, @kmike, @lucywang000 and @Gallaecio!
Hi there @whalebot-helmsman, I think the
This is the long-awaited by the community (#1802, #2474 and #3140) priority queue implementation for the multi-host case. Contrary to past proposals, it is based on direct interaction with the downloader (via downloader signals), not on a round-robin principle; round-robin is much slower according to benchmarks. It became possible after a new signal was added in #3393. The intention was to provide a solution for this pain point in the shortest possible term, so two problems are left out of scope. One of them is support for the `CONCURRENCY_PER_IP` option: to enable it we would have to resolve hostnames when adding requests to the queue, which is a deferred operation, so code calling `Scheduler.enqueue_request` would need to be ready for that situation; this is not the case now.

I ran benchmarks using https://github.com/scrapy/scrapy-bench. In the single-host case (bookworm spider) the new priority queue does not affect performance. In the multi-host case (broadworm spider) it improves performance about 10x. Exact numbers are available at https://docs.google.com/spreadsheets/d/1f8iSzfmLuiG9O-mDgG4iDb4fGvAlnkC4sH_zr9Qlj-s/edit#gid=0, and full logs at https://gist.github.com/whalebot-helmsman/83ed45a1aa1b4221ab6be2f36226c5e3. A round-robin-based queue also does not affect the single-host case, but improves the multi-host one only 2-3x; that is why I removed the round-robin implementation.
Settings used for bookworm https://github.com/scrapy/scrapy-bench/blob/f03011d63f4149eeaad6dd7908ffa6fcbbb3304e/books/books/settings.py
Settings used for broadworm https://github.com/scrapy/scrapy-bench/blob/f03011d63f4149eeaad6dd7908ffa6fcbbb3304e/broad/broad/settings.py
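The core idea of the downloader-aware queue can be sketched in a few lines. The following is a minimal illustration, not the PR's actual code: the class name `DownloaderAwarePQ` and its bookkeeping methods are simplified assumptions. Each downloader slot gets its own priority heap, and `pop` prefers the slot with the fewest requests currently in the downloader, so one busy host cannot starve the others:

```python
import heapq
from collections import defaultdict


class DownloaderAwarePQ:
    """Sketch of a downloader-aware priority queue (illustrative names,
    not Scrapy's real API). Lower priority numbers pop first."""

    def __init__(self):
        self.queues = defaultdict(list)  # slot -> heap of (priority, seq, request)
        self.active = defaultdict(int)   # slot -> requests currently downloading
        self._seq = 0                    # tie-breaker for stable heap ordering

    def push(self, slot, priority, request):
        heapq.heappush(self.queues[slot], (priority, self._seq, request))
        self._seq += 1

    def pop(self):
        if not self.queues:
            return None
        # Choose the least-loaded slot so a busy host cannot starve others.
        slot = min(self.queues, key=lambda s: self.active[s])
        priority, _, request = heapq.heappop(self.queues[slot])
        if not self.queues[slot]:
            del self.queues[slot]
        self.active[slot] += 1  # a downloader signal would decrement this later
        return request

    def request_left_downloader(self, slot):
        # Hypothetical signal handler: the download for this slot finished.
        self.active[slot] -= 1
```

Pushing two requests for slot `a` and one for slot `b` yields pops alternating between slots while their active counts differ, which is the behaviour the benchmarks reward in the multi-host case.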
List of changes:

- `create_instance`
- `DOWNLOADER_SLOT` constant added to `Downloader`
- `queuelib.PriorityQueue` adjusted to handle merged priorities
- `request_left_downloader` signal added, as `response_downloaded` is sent only in case of a successful download
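The "merged priorities" change can be illustrated with a hypothetical helper (names and bounds are assumptions, not the PR's real code): when priority values are known to lie within fixed limits, a slot index and a per-slot priority can be packed into a single integer key, so one flat `queuelib.PriorityQueue`-style structure can serve all slots without key collisions:

```python
def merge_priority(slot_index: int, priority: int,
                   min_prio: int = -100, max_prio: int = 100) -> int:
    """Pack (slot, priority) into one integer key (hypothetical helper).

    With priorities known to lie in [min_prio, max_prio], each slot gets a
    disjoint band of keys, so requests for different slots never collide.
    """
    if not min_prio <= priority <= max_prio:
        raise ValueError("priority out of known limits")
    width = max_prio - min_prio + 1
    return slot_index * width + (priority - min_prio)


def split_priority(merged: int,
                   min_prio: int = -100, max_prio: int = 100) -> tuple:
    """Invert merge_priority: recover (slot_index, priority)."""
    width = max_prio - min_prio + 1
    slot_index, offset = divmod(merged, width)
    return slot_index, offset + min_prio
```

The bounds check is what the "priority conversion with known limits" commits refer to: without known limits the bands would overlap and a very high priority in one slot could masquerade as a low priority in the next.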