Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove specifics of downstream request queues from scheduler #3884

Merged
merged 47 commits into from Feb 22, 2020

Conversation

whalebot-helmsman
Copy link
Contributor

@whalebot-helmsman whalebot-helmsman commented Jul 18, 2019

Implementing other types of downstream (only FIFO/LIFO Disk/Memory at this moment) queues without touching Scheduler code is a desirable outcome for this PR.

To achieve this, next things are done:

  • request_from_dict/request_to_dict specifics moved to downstream queue code
  • crawler is added as a parameter of constructor to disk queues to provide spider for request_from_dict/request_to_dict functions. crawler will be used to get other settings for downstream queues from crawler.settings
  • common constructor interface for every downstream queue __init__(self, crawler, key)
  • _newmq/_newdq methods removed from Scheduler and downstream_queue_cls added to constructors of priority queues
  • priority parameter removed from priority queues push method, so they decide internally what to use as priority
  • support of key parameter in downstream queue's constructor require to rewrite code of queuelib.pqueue.PriorityQueue in scrapy.pqueues.ScrapyPriorityQueue, so we don't use queuelib version any more
  • after several code refactorings there was no much code in _SlotPriorityQueues class. Remained code is merged to DownloaderAwarePriorityQueue class
  • ability to create directories hierarchy is added to downstream disk queues
  • as a side effect on disk queues for DownloaderAwarePriorityQueue is stored in hierarchy now requests.queue/host/priority instead requests.queue/host-priority. With a lot of host-priority combinations plain structure create a lot of files in a single directory. This negatively affects performance of creation/deletion downstream queues
  • there is no need in _Priority structure any more, it is deleted

I decided to not unify constructor interface for priority queues and downstream queues.

After this PR something like #3723 will be much easier and we can provide implementations of downstream queues for different hosted and self-hosted message queue systems.

@codecov
Copy link

codecov bot commented Jul 18, 2019

Codecov Report

No coverage uploaded for pull request base (master@69c6eb9). Click here to learn what that means.
The diff coverage is 87.36%.

@@            Coverage Diff            @@
##             master    #3884   +/-   ##
=========================================
  Coverage          ?   83.95%           
=========================================
  Files             ?      165           
  Lines             ?     9953           
  Branches          ?     1479           
=========================================
  Hits              ?     8356           
  Misses            ?     1339           
  Partials          ?      258
Impacted Files Coverage Δ
scrapy/dupefilters.py 98.11% <ø> (ø)
scrapy/commands/genspider.py 83.72% <ø> (ø)
scrapy/pipelines/media.py 97.29% <ø> (ø)
scrapy/utils/display.py 31.25% <ø> (ø)
scrapy/spiders/init.py 100% <ø> (ø)
scrapy/commands/check.py 23.18% <ø> (ø)
scrapy/utils/template.py 100% <ø> (ø)
scrapy/statscollectors.py 92.15% <ø> (ø)
scrapy/exceptions.py 91.3% <ø> (ø)
scrapy/signalmanager.py 80% <ø> (ø)
... and 129 more

Copy link
Member

@Gallaecio Gallaecio left a comment

I’ve left a few 💄 aesthetic comments, but before reviewing any further I must ask about the changes to the signature of the constructor and from_crawler method of ScrapyPriorityQueue. Any chance we can make these changes while keeping those signatures backward compatible?

scrapy/core/scheduler.py Outdated Show resolved Hide resolved
scrapy/pqueues.py Outdated Show resolved Hide resolved
scrapy/pqueues.py Outdated Show resolved Hide resolved
scrapy/pqueues.py Outdated Show resolved Hide resolved
scrapy/pqueues.py Outdated Show resolved Hide resolved
@whalebot-helmsman
Copy link
Contributor Author

whalebot-helmsman commented Aug 6, 2019

@Gallaecio

Any chance we can make these changes while keeping those signatures backward compatible?

Current signatures has a very short lifespan. Probability of someone using them is very low. May be you have some specific case in mind?

@Gallaecio
Copy link
Member

Gallaecio commented Aug 6, 2019

May be you have some specific case in mind?

No specific case, I’m just a bit worried that there might be someone affected.

@Gallaecio
Copy link
Member

Gallaecio commented Feb 12, 2020

I’m not sure if this is the right pull request to address this, but if we aim to eventually support message queues, should we refactor the scheduler internals to switch “disk” and “memory” to more abstract concepts like “persistent” and “volatile”?

@kmike kmike requested a review from wRAR Feb 13, 2020
Copy link
Member

@Gallaecio Gallaecio left a comment

The changes look good to me, although I wonder if we should have more tests for them.

scrapy/squeues.py Outdated Show resolved Hide resolved
@whalebot-helmsman
Copy link
Contributor Author

whalebot-helmsman commented Feb 19, 2020

should we refactor the scheduler internals to switch “disk” and “memory” to more abstract concepts like “persistent” and “volatile”?

I think we should in future, not in this PR.

wRAR
wRAR approved these changes Feb 22, 2020
@wRAR wRAR merged commit 67ee0b0 into scrapy:master Feb 22, 2020
2 checks passed
@whalebot-helmsman whalebot-helmsman deleted the scheduler-refactoring-2 branch Feb 26, 2020
@whalebot-helmsman
Copy link
Contributor Author

whalebot-helmsman commented Feb 26, 2020

Thanks @Gallaecio , @kmike and @wRAR for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants