
Better API for creating requests from responses #1940

Closed
kmike opened this issue Apr 20, 2016 · 16 comments

kmike commented Apr 20, 2016

Sometimes a Request needs information about the response in order to be sent correctly. There are at least two use cases:

  1. URL encoding depends on response encoding (see [MRG+1] Use newer w3lib.url.safe_url_string() and re-enable HTTP request tests #1923);
  2. relative URLs should be resolved based on response.url or its base url (see Support relative urls better #548).

I think the current API is not good enough. The most obvious code is not correct:

for url in response.xpath("//a/@href").extract():
    yield scrapy.Request(url, self.parse)

To do that correctly, the user has to write the following:

for url in response.xpath("//a/@href").extract():
    yield scrapy.Request(response.urljoin(url), self.parse, encoding=response.encoding)

Or this:

for link in LinkExtractor().extract_links(response):
    yield scrapy.Request(link.url, self.parse, encoding=response.encoding)

The LinkExtractor solution has gotchas, e.g. canonicalize_url is called by default and fragments are removed. This means that e.g. AJAX crawlable URLs are not handled (no escaped_fragment even if a website supports it); it also makes it harder to use Scrapy with scrapy-splash, which can handle fragments.

This is all too easy to get wrong; I think just documenting these gotchas is not good enough for a framework - it should make the easiest way to write something also the correct way. IMHO the API shouldn't require the user to instantiate weird objects or pass the response encoding:

for url in response.xpath("//a/@href").extract():
    something.send_request(url, self.parse)

This can be implemented if we provide a method on Response to send new requests.

A related use case is async def functions or methods (#1144 (comment)): it is not possible to yield Requests in async def functions, so adding a request should be either a method of self or a method of response if we want to support async def callbacks.

FormRequest.from_response(response, ...) can also be written as something like response.submit(...).

@kmike kmike added the discuss label Apr 20, 2016

Digenis commented Apr 20, 2016

More LinkExtractor gotchas: #1941

I think providing a method in the response to prepare requests and urls
can be implemented fast and reliably without breaking compatibility.

The rest, sending the request to the scheduler, sounds like something that requires a separate issue.


kmike commented Apr 25, 2016

Another way to fix the response.encoding problem is to create a middleware which sets request.encoding to response.encoding if it is not set explicitly. But it would be backwards incompatible, and @dangra didn't like the middleware approach for relative URLs and ruled it out (which was a good decision).
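
For illustration only, a rough sketch of what such a middleware could look like; the EncodingSpiderMiddleware name is made up, and it sidesteps the "set explicitly" problem by treating the default 'utf-8' as "not set":

import scrapy

class EncodingSpiderMiddleware(object):
    """Hypothetical spider middleware: rebuild outgoing requests with the
    encoding of the response they came from, unless the request already
    carries a non-default encoding."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, scrapy.Request) and obj.encoding == 'utf-8':
                # Request.replace() returns a copy with the new encoding applied
                obj = obj.replace(encoding=response.encoding)
            yield obj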

@redapple redapple added this to the v1.3 milestone Sep 1, 2016
@redapple redapple modified the milestones: v1.4, v1.3 Nov 23, 2016

kmike commented Jan 30, 2017

OK, to move this forward: what do you think about adding response.Request and response.FormRequest shortcuts?

Draft implementation:

class Response(object):
    # ....
    def Request(self, url, callback=None, method='GET', headers=None, body=None,
                cookies=None, meta=None, encoding=None, priority=0,
                dont_filter=False, errback=None):
        encoding = self.encoding if encoding is None else encoding
        url = self.urljoin(url.strip())
        return scrapy.Request(url, callback, method, headers, body, cookies,
                              meta, encoding, priority, dont_filter, errback)

    def FormRequest(self, *args, **kwargs):
        return scrapy.FormRequest.from_response(self, *args, **kwargs)
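
For illustration, a callback using these draft shortcuts might look like this (hypothetical, since the methods above are only a proposal, and parse_results is a made-up callback name):

def parse(self, response):
    # follow links: the draft Request() resolves relative URLs and
    # inherits the response encoding automatically
    for href in response.css("a::attr(href)").extract():
        yield response.Request(href, self.parse)

    # roughly equivalent to scrapy.FormRequest.from_response(response, ...)
    yield response.FormRequest(formdata={'q': 'scrapy'},
                               callback=self.parse_results)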

"Sweet" names like response.follow and response.submit can be reserved for the API which adds requests to the scheduler directly. It is not clear how to implement these sweet names without adding a link to Crawler to the Response though..

An advantage of response.Request is that it mimics scrapy.Request, but does the right thing.
A disadvantage is that it could look strange to newcomers, and that it can be kind of hard to explain why we have both scrapy.Request and response.Request.

A disadvantage of the response.follow name is that users may expect it to schedule a request, but it will still have to be yielded.


dangra commented Jan 30, 2017

> A related use case is async def functions or methods (#1144 (comment)): it is not possible to yield Requests in async def functions, so adding a request should be either a method of self or a method of response if we want to support async def callbacks.

This is not true anymore since https://www.python.org/dev/peps/pep-0525/, right?
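
For context, a minimal sketch of what PEP 525 async generators make possible from Python 3.6 on: yield inside an async def callback. Whether a given Scrapy version actually accepts such callbacks is a separate question, and fetch_extra_data is a made-up coroutine:

import scrapy

async def parse(self, response):
    extra = await fetch_extra_data(response.url)  # hypothetical coroutine
    for href in response.css("a::attr(href)").extract():
        # PEP 525 allows yield inside an async def function
        yield scrapy.Request(response.urljoin(href), self.parse)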


kmike commented Jan 30, 2017

We can also add Selector support to Response.Request, in order to write this:

for a in response.css("a.my-link"):
    yield response.Request(a, self.parse)

instead of this:

for href in response.css("a.my-link::attr(href)").extract():
    yield response.Request(href, self.parse)

(or this, using scrapy.Request):

for href in response.css("a.my-link::attr(href)").extract():
    yield scrapy.Request(response.urljoin(href), self.parse, encoding=response.encoding)
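
A rough sketch of how Selector support could work, assuming a hypothetical _url_from_selector helper; how non-<a> elements should be handled is exactly the open question:

def _url_from_selector(sel):
    # Element selectors (e.g. response.css("a.my-link")): take the href attribute.
    href = sel.xpath('@href').extract_first()
    if href is not None:
        return href.strip()
    # Attribute selectors (e.g. "a.my-link::attr(href)"): the selector itself
    # already holds the URL text.
    return sel.extract().strip()

response.Request (or response.follow) could then call this helper whenever it is given a Selector instead of a string.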


kmike commented Jan 30, 2017

@dangra yep! It is Python 3.6+ only, but I'm fine with that.


dangra commented Jan 30, 2017

I like the idea of adding response methods to generate compelling requests, but I am not fond of using capitalized method names that also mimic Scrapy built-in classes.

IMO we can set the request class that the new methods are going to use to instantiate requests as a Response class attribute. So, for example, XML-RPC responses default to scrapy.http.request.XmlRpcRequest.
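
A sketch of that idea; the request_class attribute and the follow method are hypothetical names, not an existing Scrapy API:

import scrapy
from scrapy.http import XmlRpcRequest

class Response(object):
    # each Response subclass declares which Request class its helpers build
    request_class = scrapy.Request

    def follow(self, url, **kwargs):
        return self.request_class(self.urljoin(url), **kwargs)

class XmlRpcResponse(Response):  # hypothetical subclass for XML-RPC responses
    request_class = XmlRpcRequest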


dangra commented Jan 30, 2017

> IMO we can set the request class that the new methods are going to use to instantiate requests as a Response class attribute. So, for example, XML-RPC responses default to scrapy.http.request.XmlRpcRequest.

This is food for another discussion thread; my point is that using capitalized method names for the feature described in this ticket is not future-compatible.


kmike commented Jan 30, 2017

If adding a link to the Crawler to the Response is too much (I personally feel it is too much) then we don't have to 'reserve' the response.follow / response.submit names. If Response.Request deviates further from vanilla scrapy.Request (e.g. if it gets Selector support) then I think it adds points to response.follow (or similar) names. The same for response.submit vs response.FormRequest; the former gives a chance to clean up the API. Disadvantages of "sweet" names still apply though.


kmike commented Jan 30, 2017

> IMO we can set the request class that the new methods are going to use to instantiate requests as a Response class attribute. So, for example, XML-RPC responses default to scrapy.http.request.XmlRpcRequest.

I'm not sure I got it correctly; could you please provide an example?


dangra commented Jan 30, 2017

> If adding a link to the Crawler to the Response is too much (I personally feel it is too much) then we don't have to 'reserve' the response.follow / response.submit names.

I too feel inclined towards not linking the Crawler to responses.

> If Response.Request deviates further from vanilla scrapy.Request (e.g. if it gets Selector support) then I think it adds points to response.follow (or similar) names. The same for response.submit vs response.FormRequest; the former gives a chance to clean up the API.

+1

> Disadvantages of "sweet" names still apply though.

Not sure what disadvantages you refer to :)


kmike commented Jan 30, 2017

> Not sure what disadvantages you refer to :)

For a newcomer, a very easy mistake to make is to write

for a in response.css("a.my-link"):
    response.follow(a, self.parse)

instead of

for a in response.css("a.my-link"):
    yield response.follow(a, self.parse)

With yield scrapy.Request it is clear we're creating a Request object, but not necessarily executing an action. response.follow reads like an action, but it does nothing on its own. But I guess this applies to all async APIs, so people can get used to it.


kmike commented Jan 30, 2017

Ideally, we should support both resp2 = await response.follow(url) and yield response.follow(url)


kmike commented Jan 30, 2017

So, wishlist for response.follow:

  • support relative and absolute URLs;
  • set encoding correctly;
  • support Selector objects for <a> elements (what about other elements? img src?);
  • support Link objects;
  • support all Request options.
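
For illustration, a rough sketch of a follow method covering most of the wishlist; this is only a sketch of the items above (Link detection and Selector handling are simplified), not the implementation that eventually landed:

import scrapy
from scrapy.link import Link

class Response(object):
    # ... existing attributes and methods: url, encoding, urljoin(), ...

    def follow(self, url, callback=None, **kwargs):
        # Link objects from link extractors
        if isinstance(url, Link):
            url = url.url
        # Selector objects, duck-typed; assumes an <a> element or an href attribute
        elif hasattr(url, 'xpath'):
            href = url.xpath('@href').extract_first()
            url = href if href is not None else url.extract()
        # relative or absolute URL string: resolve against the response,
        # keep its encoding, and pass any other Request option through
        kwargs.setdefault('encoding', self.encoding)
        return scrapy.Request(self.urljoin(url.strip()), callback, **kwargs)

For example, yield response.follow(link, self.parse) would then work the same for a Link, a Selector or a plain (possibly relative) URL string.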


dangra commented Jan 30, 2017

> Ideally, we should support both resp2 = await response.follow(url) and yield response.follow(url)

Yes, I think it is the desired interface and should be possible to implement.

@Gallaecio

Fixed in #2540.
