[MRG] Fix engine to support filtered start_requests #707
Conversation
@redapple checking if it's
            log.err(None, 'Obtaining request from start requests', \
                spider=spider)
        else:
            self.crawl(request, spider)

-        if self.spider_is_idle(spider) and slot.close_if_idle:
+        if (self.spider_is_idle(spider) and slot.close_if_idle):
dangra
Apr 28, 2014
Member
nitpick, no need to change this line :)
redapple
Apr 28, 2014
Author
Contributor
ah, yeah, sorry. it was from the previous trials adding the test here instead of spider_is_idle...
@@ -112,12 +112,13 @@ def _next_request(self, spider):
             except StopIteration:
                 slot.start_requests = None
             except Exception as exc:
+                slot.start_requests = None
dangra
Apr 28, 2014
Member
good one, I hope this was spotted by our current test suite.
redapple
Apr 28, 2014
Author
Contributor
yes, the tests are blocked without this
dangra
Apr 28, 2014
Member
yes, that's perfect.
nramirezuy
Apr 29, 2014
Contributor
Can I ask, why Exception and not the specific one?
dangra
Apr 29, 2014
Member
@nramirezuy: because that catch-all prevents unexpected errors in spider code from leaving the engine in a hanged state. If an error in .start_requests() happens, it will be logged as such and the engine gracefully stopped once idle.
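To illustrate dangra's point, here is a hedged sketch (not Scrapy's actual code; the names consume_start_requests, slot, and logger are illustrative) of why the engine wraps start-request consumption in a broad except Exception: the iterator is user-written code and can raise anything, and without the catch-all a failure there would break the engine loop instead of letting it stop gracefully once idle.

```python
import logging

def consume_start_requests(slot, logger):
    """Pull the next request from slot.start_requests, tolerating user errors.

    Returns the next request, or None if the iterator is exhausted or broken.
    """
    try:
        return next(slot.start_requests)
    except StopIteration:
        # Normal exhaustion: no more start requests to consume.
        slot.start_requests = None
    except Exception as exc:
        # User code failed: disable the iterator so the engine does not
        # retry it forever, log the error, and let the engine shut down
        # gracefully once it becomes idle.
        slot.start_requests = None
        logger.error("Error obtaining start request: %r", exc)
    return None
```

Catching only a specific exception type here would miss arbitrary bugs in user-supplied start_requests generators, which is exactly the case the catch-all is guarding against.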
LGTM!
I'll squash that
are you going to squash into a single commit?
last 3: 1 for tests and one for fix
ok, squash them and I merge asap.
[MRG] Fix engine to support filtered start_requests
start_requests can contain duplicates, and when the DupeFilter is active, in ExecutionEngine._next_request() these duplicate requests are not enqueued by self.crawl(request, spider).
This fix simply adds a check to spider_is_idle() to test if there are still start requests to consume before stopping everything.
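The idle check described above can be sketched as follows. This is a simplified illustration, not Scrapy's actual implementation: the Slot class and attribute names here are stand-ins. The key idea is that an unexhausted start_requests iterator means the spider is not idle, even if the dupe filter has rejected every request enqueued so far.

```python
class Slot:
    """Minimal stand-in for the engine's per-spider slot."""
    def __init__(self, start_requests):
        # An iterator of start requests, set to None once fully consumed.
        self.start_requests = iter(start_requests)
        self.scheduler_has_pending = False  # requests waiting in the scheduler
        self.inprogress = []                # requests currently being downloaded

def spider_is_idle(slot):
    if slot.scheduler_has_pending:
        return False
    if slot.inprogress:
        return False
    # The fix: still-unconsumed start requests mean the spider is not idle,
    # even when the dupe filter dropped everything scheduled so far.
    if slot.start_requests is not None:
        return False
    return True
```

Without the start_requests check, a spider whose initial requests were all filtered as duplicates would be declared idle and closed before the remaining start requests were ever consumed.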