[MRG] Fix engine to support filtered start_requests #707

Merged: 2 commits, Apr 28, 2014
4 changes: 3 additions & 1 deletion scrapy/core/engine.py
@@ -112,6 +112,7 @@ def _next_request(self, spider):
             except StopIteration:
                 slot.start_requests = None
             except Exception as exc:
+                slot.start_requests = None
Member:
good one, I hope this was spotted by our current test suite.

Contributor Author:
yes, the tests are blocked without this

Member:
yes, that's perfect.

Contributor:
Can I ask: why Exception and not the specific one?

Member:
@nramirezuy: because that catch-all prevents unexpected errors in spider code from leaving the engine in a hung state. If an error happens in .start_requests(), it will be logged as such and the engine gracefully stopped once idle.
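In other words, the engine wraps the pull from start_requests roughly like the following sketch; the `Slot` class and `next_start_request` helper are simplified stand-ins, not the literal engine code:

```python
# Illustrative stand-ins for the engine's slot and logger; not Scrapy code.
import logging

log = logging.getLogger(__name__)

class Slot(object):
    def __init__(self, start_requests):
        self.start_requests = iter(start_requests)

def next_start_request(slot, spider_name):
    if slot.start_requests is None:
        return None
    try:
        return next(slot.start_requests)
    except StopIteration:
        # Generator exhausted: the normal way start requests end.
        slot.start_requests = None
    except Exception:
        # Catch-all: an arbitrary error in spider code must not leave the
        # engine hung; drop the generator so the engine can go idle, and
        # log the failure instead of propagating it.
        slot.start_requests = None
        log.exception('Obtaining request from start requests (%s)', spider_name)
    return None
```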

Contributor:
@dangra We were missing these cases, #708

                 log.err(None, 'Obtaining request from start requests', \
                     spider=spider)
             else:
@@ -156,7 +157,8 @@ def spider_is_idle(self, spider):
         scraper_idle = self.scraper.slot.is_idle()
         pending = self.slot.scheduler.has_pending_requests()
         downloading = bool(self.downloader.active)
-        idle = scraper_idle and not (pending or downloading)
+        pending_start_requests = self.slot.start_requests is not None
+        idle = scraper_idle and not (pending or downloading or pending_start_requests)
         return idle

     @property
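To see why the spider_is_idle() change matters: before it, a spider whose early start requests were all filtered out could have nothing pending or downloading, so the engine could stop while the start_requests generator still held URLs. A toy version of the check, with stand-ins for the real engine state:

```python
# Toy version of the idle check; real state lives on Scrapy's slot,
# scraper and downloader objects.
class FakeSlot(object):
    def __init__(self, start_requests):
        self.start_requests = start_requests  # a generator, or None when drained

def spider_is_idle(slot, scraper_idle, pending, downloading):
    pending_start_requests = slot.start_requests is not None
    return scraper_idle and not (pending or downloading or pending_start_requests)

slot = FakeSlot(start_requests=iter(['http://localhost:8998/echo']))
# Generator not yet drained: the engine must keep running.
assert spider_is_idle(slot, True, False, False) is False
slot.start_requests = None
# Drained (or failed and cleared): now the engine may stop.
assert spider_is_idle(slot, True, False, False) is True
```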
23 changes: 23 additions & 0 deletions scrapy/tests/spiders.py
@@ -157,3 +157,26 @@ def on_error(self, failure):
         self.meta['failure'] = failure
         if callable(self.errback_func):
             return self.errback_func(failure)
+
+
+class DuplicateStartRequestsSpider(Spider):
+    dont_filter = True
+    name = 'duplicatestartrequests'
+    distinct_urls = 2
+    dupe_factor = 3
+
+    def start_requests(self):
+        for i in range(0, self.distinct_urls):
+            for j in range(0, self.dupe_factor):
+                url = "http://localhost:8998/echo?headers=1&body=test%d" % i
+                yield self.make_requests_from_url(url)
+
+    def make_requests_from_url(self, url):
+        return Request(url, dont_filter=self.dont_filter)
+
+    def __init__(self, url="http://localhost:8998", *args, **kwargs):
+        super(DuplicateStartRequestsSpider, self).__init__(*args, **kwargs)
+        self.visited = 0
+
+    def parse(self, response):
+        self.visited += 1
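For readers wondering how the filtered case drops requests: Scrapy's scheduler consults a dupe filter keyed on a request fingerprint, and dont_filter=True bypasses it. A rough, self-contained approximation of the idea (the real RFPDupeFilter canonicalizes URLs and hashes method, URL and body):

```python
# Rough approximation of fingerprint-based duplicate filtering, for
# illustration only; not Scrapy's actual RFPDupeFilter implementation.
import hashlib

class ToyDupeFilter(object):
    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        fp = hashlib.sha1(('%s %s' % (method, url)).encode('utf-8')).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

dupefilter = ToyDupeFilter()
# Mirrors the spider above: 2 distinct URLs, each yielded 3 times.
urls = ['http://localhost:8998/echo?headers=1&body=test%d' % i
        for i in range(2) for _ in range(3)]
scheduled = [u for u in urls if not dupefilter.request_seen('GET', u)]
assert len(scheduled) == 2  # with dont_filter=False, dupes are dropped
assert len(urls) == 6       # with dont_filter=True, all 6 would be crawled
```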
17 changes: 16 additions & 1 deletion scrapy/tests/test_crawl.py
@@ -5,7 +5,7 @@
 from twisted.trial.unittest import TestCase
 from scrapy.utils.test import docrawl, get_testlog
 from scrapy.tests.spiders import FollowAllSpider, DelaySpider, SimpleSpider, \
-    BrokenStartRequestsSpider, SingleRequestSpider
+    BrokenStartRequestsSpider, SingleRequestSpider, DuplicateStartRequestsSpider
 from scrapy.tests.mockserver import MockServer
 from scrapy.http import Request

@@ -113,6 +113,21 @@ def test_start_requests_lazyness(self):
         #self.assertTrue(spider.seedsseen.index(None) < spider.seedsseen.index(99),
         #                spider.seedsseen)

+    @defer.inlineCallbacks
+    def test_start_requests_dupes(self):
+        settings = {"CONCURRENT_REQUESTS": 1}
+        spider = DuplicateStartRequestsSpider(dont_filter=True,
+                                              distinct_urls=2,
+                                              dupe_factor=3)
+        yield docrawl(spider, settings)
+        self.assertEqual(spider.visited, 6)
+
+        spider = DuplicateStartRequestsSpider(dont_filter=False,
+                                              distinct_urls=3,
+                                              dupe_factor=4)
+        yield docrawl(spider, settings)
+        self.assertEqual(spider.visited, 3)
+
     @defer.inlineCallbacks
     def test_unbounded_response(self):
         # Completeness of responses without Content-Length or Transfer-Encoding