
response.follow_all or SelectorList.follow_all shortcut #2582

Closed
kmike opened this issue Feb 22, 2017 · 7 comments · Fixed by #4057

Comments

kmike (Member) commented Feb 22, 2017

What do you think about adding a response.follow_all shortcut that returns a list of requests? This is inspired by this note in the docs:

response.follow(response.css('li.next a')) is not valid because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like in the example above, or response.follow(response.css('li.next a')[0]) is fine.

So instead of

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

users would be able to write (in Python 3)

yield from response.follow_all(response.css('li.next a::attr(href)'), self.parse)

We can also add 'css' and 'xpath' support to it, as keyword arguments; it would shorten the code to this:

yield from response.follow_all(css='li.next a::attr(href)', callback=self.parse)

(this is a follow-up to #1940 and #2540)
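The proposed shortcut can be sketched with stand-in classes (these are not Scrapy's real Request/Response, just minimal stubs to show the intended behavior; the css=/xpath= handling is elided):

```python
from urllib.parse import urljoin

class Request:
    def __init__(self, url, callback=None, **kwargs):
        self.url = url
        self.callback = callback
        self.kwargs = kwargs

class Response:
    def __init__(self, base_url, hrefs):
        self.base_url = base_url
        self._hrefs = hrefs  # stands in for extracted ::attr(href) results

    def follow(self, href, callback=None, **kwargs):
        # resolve relative links against the page URL, as response.follow does
        return Request(urljoin(self.base_url, href), callback, **kwargs)

    def follow_all(self, hrefs=None, callback=None, **kwargs):
        # a css=/xpath= keyword would run the query here; this stub
        # just iterates over the precollected hrefs
        if hrefs is None:
            hrefs = self._hrefs
        for href in hrefs:
            yield self.follow(href, callback, **kwargs)

resp = Response('http://example.com/page1', ['/page2', '/page3'])
reqs = list(resp.follow_all(callback=lambda r: None))
```

A spider callback would then simply `yield from` the generator instead of writing the for loop by hand.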

kmike added the discuss label Feb 22, 2017
kmike (Member, Author) commented Feb 22, 2017

An alternative is to implement it on SelectorList:

yield from response.css('li.next a').follow_all(self.parse, meta={...})
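The SelectorList variant has one design wrinkle: the list would need a back-reference to its response in order to build requests. A minimal sketch with stub classes (all names are assumptions, not an implemented API):

```python
from urllib.parse import urljoin

class Request:
    def __init__(self, url, callback=None, **kwargs):
        self.url, self.callback, self.kwargs = url, callback, kwargs

class SelectorList(list):
    def __init__(self, hrefs, response):
        super().__init__(hrefs)
        self._response = response  # back-reference, an assumption

    def follow_all(self, callback=None, **kwargs):
        for href in self:
            yield self._response.follow(href, callback, **kwargs)

class Response:
    def __init__(self, base_url):
        self.base_url = base_url

    def follow(self, href, callback=None, **kwargs):
        return Request(urljoin(self.base_url, href), callback, **kwargs)

    def css(self, query):
        # stand-in: pretend the query matched two next-page links
        return SelectorList(['/page2', '/page3'], self)

reqs = list(Response('http://example.com/').css('li.next a').follow_all())
```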

kmike changed the title from response.follow_all shortcut to response.follow_all or SelectorList.follow_all shortcut Feb 22, 2017
kmike (Member, Author) commented Feb 22, 2017

For Python 2 users we may add a RequestSet container class, so that yield works. It could be helpful for an await API as well, to allow waiting for several responses at the same time while executing them in parallel.

page_responses = await response.css('.pages a::attr(href)').follow_all()
# page_responses is a list of scrapy.Response objects for all pages

or

responses = await RequestSet([Request(url1), Request(url2)])

I'm not sure we can guarantee the order in the responses array.
But this async API is likely a separate question, for a far future :)
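The shape of that far-future API can be approximated today with plain asyncio, using a coroutine as a stand-in for Scrapy's actual download machinery (names and behavior are illustrative assumptions):

```python
import asyncio

async def fetch(url):
    await asyncio.sleep(0)  # stands in for the actual network download
    return f"<response for {url}>"

async def request_set(urls):
    # gather runs the fetches concurrently; asyncio does return results
    # in input order, though the comment above notes Scrapy itself may
    # not be able to guarantee response order
    return await asyncio.gather(*(fetch(u) for u in urls))

responses = asyncio.run(request_set(["http://example.com/1",
                                     "http://example.com/2"]))
```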

immerrr (Contributor) commented Feb 27, 2017

A RequestSet class is useful regardless of whether or not you are using Python 3.

It's useful to track state that's shared among a subset of requests, but not all of them. E.g. you are parsing a subsection of a site and want to track duplicates or some statistics for that subsection only: you need to store information that becomes irrelevant as soon as the last request from that subset is dealt with. If you store that info on the spider, you need to carefully track the set of active requests using callbacks & errbacks. The implementation is error-prone and hard to get right on the first try, so it would be nice to ship it in a well-tested state with the package.
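The manual bookkeeping described above can be sketched as a counter-based tracker (all names hypothetical); this is the error-prone pattern a shipped RequestSet would replace:

```python
class SubsectionTracker:
    """Per-subsection state a spider would otherwise keep on itself."""

    def __init__(self, n_requests):
        self.pending = n_requests
        self.seen_urls = set()
        self.stats = {"items": 0}

    def on_response(self, url, n_items):
        self.seen_urls.add(url)
        self.stats["items"] += n_items
        self._done_one()

    def on_error(self, url):
        self._done_one()  # easy to forget: errbacks must decrement too

    def _done_one(self):
        self.pending -= 1
        if self.pending == 0:
            self.finish()

    def finish(self):
        # last request of the subset dealt with: the state can be dropped
        self.stats = None

tracker = SubsectionTracker(2)
tracker.on_response("http://example.com/a", 3)
tracker.on_error("http://example.com/b")
```

Missing a single errback (or miscounting redirects and retries) leaves the state alive forever, which is exactly why a well-tested container is attractive.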

kmike (Member, Author) commented Feb 27, 2017

@immerrr do you have an idea of what this RequestSet API could look like?
It is also somewhat related to #1226, though in an indirect way.

immerrr (Contributor) commented Feb 27, 2017

@kmike, only a rough one.

I think of RequestSet as something that:

  • has a DeferredList-like API (asyncio.gather has the drawback of being a function rather than an object)
  • knows that it contains Requests as deferreds
  • has its own callback to be run when the request set is dealt with, and probably an errback too for symmetry: a Deferred that fires when the request set is being cleaned up
  • has its own 'meta' dictionary, much like the one shared between Request & Response objects
  • since it knows that it contains Requests, it can piggyback on the first received value and do response.meta['request_set'] = self so that the callbacks can access the shared data
  • (maybe) it should silently copy fields from request_set.meta to response.meta if they are unset in request.meta, or maybe even make request.meta a ChainDict with fallback to request_set.meta
  • it should wrap requests coming from its respective response callbacks unless specifically asked not to do that, e.g. with request.meta['request_set'] = None
  • (maybe) it should be possible to return other RequestSet from response callbacks
  • (maybe) returned RequestSets should be made nestable, i.e. keep the parent RequestSet alive during their lifetime unless explicitly asked not to with request_set.meta['request_set'] = None (if nesting is considered, the request_set metadata key seems redundant and we might consider parent_set instead)
  • not sure if it's worth it to make them nestable, i.e. if a certain response callback produces a different RequestSet, should it be owned by the parent request set?
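The points above can be condensed into a rough sketch; every name here (RequestSet, meta['request_set'], request_done) is an assumption from this discussion, not an implemented Scrapy API, and the Deferred machinery is omitted:

```python
class Request:
    def __init__(self, url):
        self.url = url
        self.meta = {}

class RequestSet:
    def __init__(self, requests, callback=None, meta=None):
        self.meta = meta or {}    # shared state, like Request.meta
        self.callback = callback  # fires once the whole set is done
        self._pending = set()
        for req in requests:
            self.add(req)

    def add(self, req):
        # piggyback on the request so response callbacks can reach the
        # shared state; meta['request_set'] = None opts a request out
        if req.meta.get('request_set', True) is not None:
            req.meta['request_set'] = self
        self._pending.add(req)

    def request_done(self, req):
        self._pending.discard(req)
        if not self._pending and self.callback is not None:
            self.callback(self)

done = []
rs = RequestSet([Request('http://a'), Request('http://b')],
                callback=lambda s: done.append(s))
for req in list(rs._pending):
    rs.request_done(req)
```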

One more thing to consider is cross-referencing RequestSets, i.e. when two requests that should belong to one RequestSet are produced by different callbacks and thus have different scopes. Maybe a simple WeakValueDictionary would suffice to look up the sets and ensure the references are cleaned up as necessary. But then you'd have the usual get-or-create operation, which might be worth shipping a reference implementation for.
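That get-or-create lookup could look roughly like this (the registry, key scheme, and RequestSet stub are assumptions for illustration):

```python
import weakref

class RequestSet:
    def __init__(self, name):
        self.name = name

# weak values: an entry disappears once nothing else holds the set,
# so cleanup happens automatically when its last request is gone
_request_sets = weakref.WeakValueDictionary()

def get_or_create_request_set(key):
    """Return the live set registered under `key`, or create one."""
    rs = _request_sets.get(key)
    if rs is None:
        rs = RequestSet(key)
        _request_sets[key] = rs
    return rs

a = get_or_create_request_set('books-section')
b = get_or_create_request_set('books-section')  # same live object as a
```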

kmike (Member, Author) commented Feb 27, 2017

@immerrr could you please copy-paste this to a new issue?

immerrr (Contributor) commented Feb 27, 2017

@kmike done: #2600
