
CrawlSpider: support process_links as generator #555

Merged
merged 1 commit into scrapy:master on Feb 18, 2014

Conversation

redapple (Contributor) commented on Jan 23, 2014

Adding tests for CrawlSpider's process_links
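For context, here is a hedged sketch (not the PR's actual test code; the spider and method names are made up, and modern import paths are shown, which at the time of this PR lived under scrapy.contrib) of a CrawlSpider whose process_links hook is written as a generator, the case this change makes work:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class GeneratorProcessLinksSpider(CrawlSpider):
    name = "generator_process_links"
    start_urls = ["http://example.com/"]

    rules = (
        # process_links may be the name of a spider method; here that
        # method yields links one by one instead of returning a list
        Rule(LinkExtractor(), callback="parse_item", process_links="filter_links"),
    )

    def filter_links(self, links):
        for link in links:
            if "item" in link.url:
                yield link

    def parse_item(self, response):
        yield {"url": response.url}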

An inline review note was left on this part of the test diff:

        body=self.test_body)

    class _CrawlSpider(self.spider_class):
        import re


redapple (Author, Contributor), Jan 23, 2014

probably not the best location

pablohoffman (Member) commented on Jan 28, 2014

This also adds some changes to CrawlSpider (perhaps those which motivated the tests?). Can we split them into two PRs, with a description of the intention behind the CrawlSpider changes?

redapple (Author, Contributor) commented on Jan 28, 2014

You mean this change?

-            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
+            links = list(set(rule.link_extractor.extract_links(response)) - seen)

It was just cosmetic, since there's a set at hand (and it can be removed from this PR).

The main change is this one though,

-            seen = seen.union(links)
             for link in links:
+                seen.add(link)

otherwise the links generator gets consumed while updating the seen set, the loop on links exits immediately, and thus no requests get generated.
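To make the failure mode concrete, here is a minimal standalone sketch (the URLs and the extract_links_gen helper are made up; Scrapy's real loop builds Request objects rather than appending to a list):

def extract_links_gen():
    # stand-in for process_links written as a generator
    yield "http://example.com/a"
    yield "http://example.com/b"

# old pattern: union() consumes the generator before the loop runs
links = extract_links_gen()
seen = set()
seen = seen.union(links)           # generator exhausted here
print([link for link in links])    # [] -- no requests would be generated

# new pattern: mark links as seen inside the loop
links = extract_links_gen()
seen = set()
followed = []
for link in links:
    seen.add(link)
    followed.append(link)          # a Request would be built here
print(followed)                    # both links are processed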

@@ -49,11 +49,11 @@ def _requests_to_follow(self, response):
             return
         seen = set()
         for n, rule in enumerate(self._rules):
-            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
+            links = list(set(rule.link_extractor.extract_links(response)) - seen)

kmike (Member), Jan 28, 2014

This changes the order in which requests are scheduled - is that intended?

redapple (Author, Contributor), Jan 28, 2014

Ah, good point! Not intended, no.
There could be a test for that.
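A small illustration of the ordering concern (example URLs are made up): going through a set can reorder the extracted links, while the original list comprehension preserves the extractor's order.

extracted = ["http://example.com/3", "http://example.com/1", "http://example.com/2"]
seen = set()

in_order = [l for l in extracted if l not in seen]   # keeps extraction order
via_set = list(set(extracted) - seen)                # arbitrary set iteration order

print(in_order)  # prints the three URLs in extraction order
print(via_set)   # order depends on set iteration and may differ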

dangra (Member) commented on Jan 31, 2014

squash and merge?

Commit: Adding tests for CrawlSpider's process_links
redapple (Author, Contributor) commented on Jan 31, 2014

Done.

redapple (Author, Contributor) commented on Feb 18, 2014

anything preventing this from getting merged?

dangra added a commit that referenced this pull request Feb 18, 2014
CrawlSpider: support process_links as generator
@dangra dangra merged commit ec48df7 into scrapy:master Feb 18, 2014
1 check passed: default (The Travis CI build passed)