Response.follow_all #4057

Merged: kmike merged 22 commits into scrapy:master from elacuesta:response_follow_all on Jan 22, 2020

Conversation

elacuesta (Member) commented Oct 2, 2019

Fixes #2582

codecov bot commented Oct 2, 2019

Codecov Report

Merging #4057 into master will increase coverage by 0.22%.
The diff coverage is 92.59%.

@@            Coverage Diff             @@
##           master    #4057      +/-   ##
==========================================
+ Coverage    83.9%   84.13%   +0.22%     
==========================================
  Files         165      166       +1     
  Lines        9639     9761     +122     
  Branches     1448     1462      +14     
==========================================
+ Hits         8088     8212     +124     
+ Misses       1304     1296       -8     
- Partials      247      253       +6
Impacted Files Coverage Δ
scrapy/utils/markup.py 100% <ø> (ø) ⬆️
scrapy/utils/multipart.py 100% <ø> (ø) ⬆️
scrapy/linkextractors/lxmlhtml.py 92.3% <ø> (ø) ⬆️
scrapy/logformatter.py 92% <ø> (ø) ⬆️
scrapy/middleware.py 90.38% <0%> (ø) ⬆️
scrapy/core/engine.py 92.88% <100%> (ø) ⬆️
scrapy/utils/log.py 88.88% <100%> (+0.38%) ⬆️
scrapy/spiders/feed.py 66.15% <100%> (ø) ⬆️
scrapy/core/downloader/middleware.py 100% <100%> (ø) ⬆️
scrapy/contracts/default.py 88% <100%> (ø) ⬆️
... and 32 more

kmike (Member) commented Oct 7, 2019

Thanks @elacuesta!

I think there is one tricky case with the follow_all API design: the case where some of the links are invalid.

For example, response.follow_all(response.css("a")), or response.follow_all(css="a") may match some <a> elements without "href" attribute. This is not an uncommon case; I recall noticing it early, when updating a tutorial for response.follow. Check https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns - code like this is used everywhere:

    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse)

instead of

    # follow pagination links
    for a in response.css('li.next a'):
        yield response.follow(a, self.parse)

because the latter often fails in practice. This is common e.g. for pagination links - often a link to the current page is still an <a> element, but without href.

It'd be nice to be able to write

yield from response.follow_all(".pagination a")

but with the current implementation response.follow_all would fail for these elements (raise an exception), which would leave some of the links on the page followed and some not - only links before the first exception are processed.
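
To illustrate that failure mode, here is a minimal sketch (my illustration, not code from this PR), assuming a generator-based follow_all and a stand-in selector class:

    def follow_all(selectors):
        # Hypothetical generator-style implementation: requests are
        # produced lazily, one selector at a time.
        for sel in selectors:
            href = sel.attrib.get('href')
            if href is None:
                # Mirrors the ValueError that response.follow raises for
                # an <a> element without an href attribute.
                raise ValueError('<a> element has no href attribute')
            yield href  # stand-in for building a scrapy.Request

    class FakeSelector:
        # Minimal stand-in for a parsel Selector over an <a> element.
        def __init__(self, href=None):
            self.attrib = {} if href is None else {'href': href}

    selectors = [FakeSelector('page1.html'), FakeSelector(), FakeSelector('page3.html')]
    followed = []
    try:
        for request in follow_all(selectors):
            followed.append(request)
    except ValueError:
        pass

    print(followed)  # ['page1.html'] - page3.html is never reached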

I wonder if we should solve it somehow in the API of response.follow_all, and skip some of the "bad" cases.

It requires a discussion though, I'm not sure what's the best approach.

elacuesta (Member, Author) commented Oct 8, 2019

Good point 🤔

If we leave it as it currently is, I think the error message is probably clear enough (<a> element has no href attribute) and the same solution used for follow also applies to follow_all:

yield from response.follow_all(css=".pagination a::attr(href)")

On the other hand, I agree that it would be cleaner to write .pagination a, but I wouldn't like to silently skip invalid links. An alternative could be to add a skip_invalid boolean parameter to follow_all, to be more explicit about that behaviour. What do you think @kmike ?

elacuesta (Member, Author) commented Oct 9, 2019

On second thought, who would pass skip_invalid=False? Skipping invalid links should be fine as long as it's properly documented.
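
A rough sketch of that skip-by-default behaviour (my illustration, assuming a small URL-extraction helper; the merged code may differ): selectors that cannot be turned into a URL are silently dropped before any request is built, so a single href-less <a> no longer aborts the whole batch.

    from contextlib import suppress

    def _url_from_selector(sel):
        # Stand-in helper: an <a> element without an href attribute
        # cannot produce a URL.
        href = sel.attrib.get('href')
        if href is None:
            raise ValueError('<a> element has no href attribute')
        return href

    def _urls_from_selectors(selectors):
        # Invalid elements are skipped instead of raising mid-iteration.
        urls = []
        for sel in selectors:
            with suppress(ValueError):
                urls.append(_url_from_selector(sel))
        return urls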

elacuesta (Member, Author) commented Oct 10, 2019

@kmike updated, please check again 🙂

(Ten review comments on scrapy/http/response/__init__.py and scrapy/http/response/text.py, marked outdated and resolved.)
elacuesta and others added 5 commits Oct 14, 2019
Co-Authored-By: Adrián Chaves <adrian@chaves.io> (×4)
wRAR approved these changes Nov 12, 2019
(A review comment on scrapy/http/response/text.py, marked outdated and resolved.)
kmike (Member) commented Nov 12, 2019

Thanks @elacuesta for working on it, and @Gallaecio + @wRAR for reviews!

I think that's very close to being ready, but it'd be good to update the non-reference docs as well (tutorial?). We can do it after the merge as well, if @Gallaecio can take it.

Gallaecio (Member) commented Nov 25, 2019

> it'd be good to update non-reference docs as well (tutorial?).

@elacuesta elacuesta#3

Only the tutorial had follow usages that were better suited for follow_all.

@@ -625,12 +625,12 @@ attribute automatically. So the code can be shortened further::

    for a in response.css('li.next a'):
        yield response.follow(a, callback=self.parse)

.. note::
    To create multiple requests from an iterable, you can use response.follow_all instead::
kmike (Member) commented Nov 25, 2019

This explanation is a bit confusing, because li.next a should match a single link in most cases.

A more natural example would be something like response.css('.author + a') - follow all links to authors - though the selector could be too complex for a tutorial (or maybe not? the same selector is present in examples below).

Gallaecio (Member) replied Nov 25, 2019

Yeah, I looked at the loop and did not notice the actual CSS expression. May I remove the loops as I revert back to follow? (elacuesta#4)

kmike (Member) replied Dec 3, 2019

Thanks @Gallaecio!

There is a use case for a loop though (and for follow_all on a single link), as it works on the last page, where there is no "next" link; the new code (with [0]) fails with an exception.

I'm on the fence about whether this pattern is good or not (using follow_all for cases where 0 or 1 results are expected). I can see myself using this pattern in spider code, but it can make the code less readable.
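
To make the trade-off concrete (my example, not from the PR), inside a spider callback:

    # follow_all yields zero requests on the last page, where nothing matches:
    yield from response.follow_all(css='li.next a', callback=self.parse)

    # Indexing the selection instead raises IndexError on the last page:
    yield response.follow(response.css('li.next a')[0], callback=self.parse)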

elacuesta (Member, Author) replied Jan 13, 2020

To be honest, I don't remember if we agreed on something specific here, but the following is a somewhat verbose IndexError-safe alternative using response.follow:

a = response.css('li.next a')
if a:
    yield response.follow(a[0], callback=self.parse)

elacuesta (Member, Author) replied Jan 22, 2020

Replaced li.next occurrences with ul.pager, which is still consistent with the source page. Hopefully it's clearer now, let me know what you think.
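
For reference, the updated tutorial pattern would then read roughly like this (a sketch based on the ul.pager selector mentioned above):

    yield from response.follow_all(css='ul.pager a', callback=self.parse)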


(A review comment on scrapy/http/response/text.py, marked outdated and resolved.)
elacuesta added this to the v2.0 milestone Jan 18, 2020
kmike merged commit c0a7dfb into scrapy:master Jan 22, 2020
2 checks passed
elacuesta deleted the response_follow_all branch Jan 22, 2020