Response.follow_all #4057
Codecov Report
@@ Coverage Diff @@
## master #4057 +/- ##
==========================================
+ Coverage 83.9% 84.13% +0.22%
==========================================
Files 165 166 +1
Lines 9639 9761 +122
Branches 1448 1462 +14
==========================================
+ Hits 8088 8212 +124
+ Misses 1304 1296 -8
- Partials 247 253 +6
Thanks @elacuesta! I think there is one tricky case with follow_all API design: a case where some of the links are invalid. For example,

    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse)

instead of

    # follow pagination links
    for a in response.css('li.next a'):
        yield response.follow(a, self.parse)

because the latter often fails in practice. This is common e.g. for pagination links - often a link to the current page is still an It'd be nice to be able to write yield from response.follow_all(".pagination a") but with the current implementation I wonder if we should solve it somehow in the API of response.follow_all, and skip some of the "bad" cases. It requires a discussion though, I'm not sure what's the best approach.
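The "skip the bad cases" idea can be sketched without Scrapy. The block below is only a standard-library illustration, assuming the invalid links are `<a>` elements with no `href` attribute; `AnchorCollector`, `follow_all_urls`, the `PAGER` markup and the example.com URL are hypothetical names chosen for the sketch, not part of Scrapy's API or of this pull request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag; None when the attribute is absent."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def follow_all_urls(base_url, html):
    """Hypothetical helper: yield absolute URLs, skipping anchors without href."""
    collector = AnchorCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if href is None:
            continue  # the "bad" case: an anchor that cannot be followed
        yield urljoin(base_url, href)


# Assumed markup: the first anchor (the current page) has no href.
PAGER = (
    '<div class="pagination">'
    '<a>1</a><a href="/page/2/">2</a><a href="/page/3/">3</a>'
    '</div>'
)
urls = list(follow_all_urls("http://example.com/page/1/", PAGER))
print(urls)
```

Skipping silently keeps the calling code short; the cost is that a selector that matches only non-followable anchors produces no requests and no error.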
Good point. If we leave it as it currently is, I think the error message is probably clear enough ( yield from response.follow_all(css=".pagination a::attr(href)") On the other hand, I agree that it would be cleaner to write
On second thought, who would pass
@kmike updated, please check again
Co-Authored-By: Adrián Chaves <adrian@chaves.io>
tests/sample_data/link_extractor/sgml_linkextractor_no_href.html
Thanks @elacuesta for working on it, and @Gallaecio + @wRAR for reviews! I think that's very close to being ready, but it'd be good to update non-reference docs as well (tutorial?). We can do it after the merge as well, if @Gallaecio can take it.
Only the tutorial had
@@ -625,12 +625,12 @@ attribute automatically. So the code can be shortened further::
    for a in response.css('li.next a'):
        yield response.follow(a, callback=self.parse)

.. note::
    To create multiple requests from an iterable, you can use
This explanation is a bit confusing, because li.next a should match a single link in most cases. A more natural example would be something like response.css('.author + a') - follow all links to authors - though the selector could be too complex for a tutorial (or maybe not? the same selector is present in examples below).
Yeah, I looked at the loop and did not notice the actual CSS expression. May I remove the loops as I revert back to follow? (elacuesta#4)
Thanks @Gallaecio!
There is a use case for a loop though (and for follow_all on a single link), as it works on the last page, where there is no "next" link; the new code (with [0]) fails with an exception.
I'm on the fence on whether this pattern is good or not (using follow_all for cases where 0 or 1 results are expected). I can see myself using this pattern in spider code, but it can make the code less readable.
To be honest, I don't remember if we agreed on something specific here, but the following is a somewhat verbose IndexError-safe alternative using response.follow:

    a = response.css('li.next a')
    if a:
        yield response.follow(a[0], callback=self.parse)
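The three patterns weighed in this thread (indexing with `[0]`, guarding with `if a:`, and looping) can be compared with plain lists standing in for the result of `response.css(...)`. This is a rough sketch, not Scrapy code: the helper names and the sample paths are hypothetical, and the only assumption is that the selector result behaves like a list that may be empty.

```python
def next_with_index(links):
    # Mirrors `response.follow(links[0], ...)`: raises IndexError when empty.
    return links[0]


def next_with_guard(links):
    # Mirrors the `if a:` guard: returns None when there is no "next" link.
    if links:
        return links[0]
    return None


def next_with_loop(links):
    # Mirrors `for a in ...: yield ...`: an empty input yields nothing.
    return [link for link in links]


middle_page = ["/page/3/"]  # one "next" link found
last_page = []              # no "next" link on the last page

print(next_with_guard(middle_page), next_with_loop(middle_page))
print(next_with_guard(last_page), next_with_loop(last_page))

try:
    next_with_index(last_page)
except IndexError:
    print("indexing an empty result raises IndexError")
```

The loop and the guard both tolerate the last page; the guard states the "zero or one link" expectation explicitly, while the loop is shorter but reads as if many links were expected.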
Fixes #2582