Adding Round Robin Queue #21

tianhuil · 2018-02-25T05:39:45Z

This queue (with tests) is meant to solve the issues raised in scrapy/scrapy#2474 and scrapy/scrapy#1802.

I would like a domain scheduler implemented here that scrapes in a domain-smart way, by cycling through the domains round-robin. This has two benefits:

- Spreading out load on the target servers instead of hitting one server with many requests at once.
- Reducing delays caused by server-overloaded errors or CONCURRENT_REQUESTS_PER_IP-type restrictions.

This implements the proposed solution in scrapy/scrapy#1802. I would like to merge the round-robin queue first, and then merge the domain scheduler changes into scrapy.
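To make the round-robin idea above concrete, here is a minimal sketch of cycling through domains. It is an illustration only, not the code added by this PR: the class name `DomainRoundRobin`, the helper `drain`, and the example URLs are all hypothetical, and it assumes the domain key is simply the URL's network location.

```python
from collections import deque
from urllib.parse import urlparse


class DomainRoundRobin:
    """Hypothetical sketch of per-domain round-robin; not the PR's class."""

    def __init__(self):
        self._queues = {}         # domain -> deque of pending URLs (FIFO)
        self._rotation = deque()  # domains that still have pending URLs

    def push(self, url):
        # Assumption: the domain key is the URL's netloc.
        domain = urlparse(url).netloc
        if domain not in self._queues:
            self._queues[domain] = deque()
            self._rotation.append(domain)
        self._queues[domain].append(url)

    def pop(self):
        if not self._rotation:
            return None
        # Take the next domain in the rotation and serve one URL from it.
        domain = self._rotation.popleft()
        queue = self._queues[domain]
        url = queue.popleft()
        if queue:
            # Domain still has work: send it to the back of the rotation.
            self._rotation.append(domain)
        else:
            del self._queues[domain]
        return url

    def __len__(self):
        return sum(len(q) for q in self._queues.values())


def drain(urls):
    """Push all URLs, then pop until empty, returning the pop order."""
    rr = DomainRoundRobin()
    for url in urls:
        rr.push(url)
    order = []
    while len(rr):
        order.append(rr.pop())
    return order
```

With three URLs queued for one domain and one each for two others, `drain` interleaves the domains rather than exhausting the first one before moving on, which is the load-spreading behaviour described above.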

codecov · 2018-02-25T05:44:27Z

Codecov Report

Merging #21 into master will decrease coverage by 0.99%.
The diff coverage is 92.3%.

@@           Coverage Diff           @@
##           master      #21   +/-   ##
=======================================
- Coverage   98.52%   97.53%   -1%     
=======================================
  Files           3        4    +1     
  Lines         204      243   +39     
  Branches       26       34    +8     
=======================================
+ Hits          201      237   +36     
- Misses          1        2    +1     
- Partials        2        4    +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| queuelib/__init__.py | `100% <100%> (ø)` | ⬆️ |
| queuelib/rrqueue.py | `92.1% <92.1%> (ø)` | |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 06439b7...b608deb.

dangra · 2018-02-28T15:08:38Z

.gitignore

@@ -0,0 +1,101 @@
+# Byte-compiled / optimized / DLL files


Please remove this file. Files of this kind belong to the developer's environment; if we have something project-specific, we add it here.

dangra · 2018-02-28T15:26:51Z

I like the idea and implementation. Let's merge as soon as my .gitignore comment is sorted out.

tianhuil · 2018-03-01T01:30:52Z

Thanks @dangra: I removed the .gitignore. Let me know if you'd like me to do anything else!

dangra · 2018-03-01T12:59:57Z

@cathalgarvey LGTM to merge and release, but we need to fix the travis-ci build, which seems broken due to a missing pypy binary.

dangra · 2018-03-08T13:52:37Z

The broken travis-ci builds are addressed by #22.

tianhuil added 2 commits February 24, 2018 20:16

Add Round Robin Queue

4f88939

Add Round Robin to README

74e68af

This was referenced Feb 25, 2018

Modify scrapy scheduler to allow for using queue class other than PriorityQueue scrapy/scrapy#1802

Closed

Optimised dequeing scrapy/scrapy#2474

Closed

tianhuil added 2 commits February 25, 2018 00:44

Tweak README

22761af

Remove print

377a70e

tianhuil mentioned this pull request Feb 26, 2018

Round Robin Domain Crawling second scheduler to improve performance scrapy/scrapy#3140

Closed

dangra reviewed Feb 28, 2018


tianhuil added 2 commits February 28, 2018 20:24

start_keys -> start_domains

5bfe5a3

remove .gitignore

32642ef

rename Queue -> RRQueue

b608deb

dangra merged commit e013af8 into scrapy:master Mar 8, 2018

tianhuil mentioned this pull request Mar 9, 2018

Adding domain scrapy/scrapy#3160

Closed
