Better API for creating requests from responses #1940
Comments
More LinkExtractor gotchas: #1941. I think providing a method in the response to prepare requests and URLs makes sense; the rest, sending the request to the scheduler, sounds like something that requires a separate issue.
Another way to fix the response.encoding problem is to create a middleware which sets request.encoding to response.encoding if it is not set explicitly. But it would be backwards incompatible, and @dangra didn't like the middleware approach for relative URLs and ruled it out (which was a good decision).
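For illustration only, a rough sketch of what such a middleware could look like (the class name and the "was the encoding set explicitly" heuristic are assumptions, not part of any proposal here):

```python
import scrapy

class ResponseEncodingMiddleware(object):
    """Spider middleware sketch: make outgoing requests inherit the
    encoding of the response they were extracted from."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Request.encoding defaults to 'utf-8'; here we treat that
            # default as "not set explicitly", which is exactly the kind
            # of guesswork that makes this approach fragile.
            if isinstance(obj, scrapy.Request) and obj.encoding == 'utf-8':
                obj = obj.replace(encoding=response.encoding)
            yield obj
```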
Ok, to move it forward: what do you think about adding response.Request and response.FormRequest methods? Draft implementation:

```python
import scrapy

class Response(object):
    # ....
    def Request(self, url, callback=None, method='GET', headers=None, body=None,
                cookies=None, meta=None, encoding=None, priority=0,
                dont_filter=False, errback=None):
        encoding = self.encoding if encoding is None else encoding
        url = self.urljoin(url.strip())
        return scrapy.Request(url, callback, method, headers, body, cookies,
                              meta, encoding, priority, dont_filter, errback)

    def FormRequest(self, *args, **kwargs):
        return scrapy.FormRequest.from_response(self, *args, **kwargs)
```

"Sweet" names like response.follow and response.submit are also an option. An advantage of response.Request is that it mimics scrapy.Request, but does the right thing. A disadvantage of the response.follow name is that users can expect it to schedule a request, but it will still have to be yielded.
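For context, a spider callback using the draft above would look roughly like this (the selectors and form fields are made up):

```python
def parse(self, response):
    # Relative URLs and the page encoding are handled by response.Request.
    for href in response.css('a.next-page::attr(href)').extract():
        yield response.Request(href, self.parse)

    # response.FormRequest delegates to FormRequest.from_response, so the
    # form fields found in the page are pre-filled automatically.
    yield response.FormRequest(formdata={'q': 'scrapy'},
                               callback=self.parse_search_results)
```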
this is not true anymore since https://www.python.org/dev/peps/pep-0525/, right?
We can also add Selector support to Response.Request, in order to write this:

```python
for a in response.css("a.my-link"):
    yield response.Request(a, self.parse)
```

instead of this:

```python
for href in response.css("a.my-link::attr(href)").extract():
    yield response.Request(href, self.parse)
```

(or this, using scrapy.Request):

```python
for href in response.css("a.my-link::attr(href)").extract():
    yield scrapy.Request(response.urljoin(href), self.parse, encoding=response.encoding)
```
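One way Selector support could be implemented (a sketch, not a final design; the helper name is made up):

```python
import scrapy
from scrapy.selector import Selector

def _url_from_selector(sel):
    # Hypothetical helper: accept both attribute selectors
    # (response.css('a::attr(href)')) and element selectors like <a>.
    if isinstance(sel.root, str):
        return sel.root
    href = sel.xpath('@href').extract_first()
    if href is None:
        raise ValueError('Selector has no href attribute: %r' % sel)
    return href

class Response(object):
    # ...
    def Request(self, url, callback=None, **kwargs):
        if isinstance(url, Selector):
            url = _url_from_selector(url)
        kwargs.setdefault('encoding', self.encoding)
        return scrapy.Request(self.urljoin(url.strip()), callback, **kwargs)
```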
@dangra yep! It is Python 3.6+ only, but I'm fine with that.
I like the idea of adding response methods to generate follow-up requests, but I am not fond of using capitalized method names that also mimic scrapy builtin classes. IMO we can set the request class the new methods use to instantiate requests as a Response class attribute. So for example xmlrpc responses default to XmlRpcRequest.
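If I read the suggestion right, it amounts to something like this (the class, attribute, and method names are only illustrative; there is no XmlRpcResponse class in Scrapy):

```python
import scrapy
from scrapy.http import XmlRpcRequest

class Response(object):
    # Request class used by the new helper methods; subclasses override
    # the attribute instead of the methods themselves.
    request_class = scrapy.Request

    def follow(self, url, callback=None, **kwargs):
        return self.request_class(self.urljoin(url), callback, **kwargs)

class XmlRpcResponse(Response):
    request_class = XmlRpcRequest
```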
This is food for another discussion thread; my point is that using capitalized method names for the feature described in this ticket is not future-proof.
If adding a link to Crawler to the Response is too much (I personally feel it is too much) then we don't have to 'reserve' the response.follow / response.submit names. If Response.Request deviates from vanilla scrapy.Request further (e.g. if it gets Selector support) then I think that is a point in favour of response.follow (or similar) names. The same goes for response.submit vs response.FormRequest; the former gives a chance to clean up the API. Disadvantages of "sweet" names still apply though.
I'm not sure I got it correctly; could you please provide an example?
I feel inclined towards not linking Crawler to responses too.
+1
not sure what disadvantages you refer to :)
As a newcomer, a very easy mistake to make is to write

```python
for a in response.css("a.my-link"):
    response.follow(a, self.parse)
```

instead of

```python
for a in response.css("a.my-link"):
    yield response.follow(a, self.parse)
```

With
Ideally, we should support both
So, wishlist for response.follow:
Yes, I think it is the desired interface and should be possible to implement.
Fixed in #2540.
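For reference, the API that landed in #2540 (response.follow, released in Scrapy 1.4) is used roughly like this; it accepts relative URLs and, for HTML responses, <a> selectors and Link objects:

```python
def parse(self, response):
    for a in response.css('a.my-link'):
        # urljoin and the response encoding are handled for us.
        yield response.follow(a, callback=self.parse)
```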
Sometimes a Request needs information about the response to be sent correctly. There are at least 2 use cases:

- relative URLs extracted from a page should be joined with the response URL (response.urljoin);
- the new request should use the encoding of the response it was extracted from (response.encoding) instead of the default UTF-8.
I think the current API is not good enough. The most obvious code is not correct:
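(The original snippet did not survive the formatting; presumably it was something like the naive spider callback below, which breaks on relative URLs and ignores the page encoding.)

```python
def parse(self, response):
    for href in response.css('a::attr(href)').extract():
        # Broken: href may be relative, and the request won't reuse
        # the response encoding.
        yield scrapy.Request(href, self.parse)
```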
To do that correctly the user has to write the following:
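(Again reconstructing the likely example:)

```python
def parse(self, response):
    for href in response.css('a::attr(href)').extract():
        yield scrapy.Request(response.urljoin(href), self.parse,
                             encoding=response.encoding)
```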
Or this:
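(Presumably the LinkExtractor variant looked roughly like this:)

```python
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    for link in LinkExtractor(restrict_css='a.my-link').extract_links(response):
        # link.url is already absolute and decoded, but see the
        # canonicalization gotchas below.
        yield scrapy.Request(link.url, self.parse)
```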
The LinkExtractor solution has gotchas, e.g. canonicalize_url is called by default and fragments are removed. It means that e.g. AJAX crawlable URLs are not handled (no escaped_fragment handling even if a website supports it); it also makes it harder to use Scrapy with scrapy-splash, which can handle fragments. This is all too easy to get wrong; I think just documenting these gotchas is not good enough for a framework - it should make the easiest way to write something also the correct way. IMHO the API shouldn't require the user to instantiate weird objects or pass the response encoding:
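(A sketch of the kind of call the issue asks for; the method name is a placeholder borrowed from the discussion above:)

```python
def parse(self, response):
    for href in response.css('a::attr(href)').extract():
        # One call that takes care of both urljoin and the encoding.
        yield response.follow(href, self.parse)
```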
This can be implemented if we provide a method on Response to send new requests.
A related use case is async def functions or methods (#1144 (comment)): it is not possible to yield Requests in async def functions, so adding a request should be either a method of self or a method of response if we want to support async def callbacks.

FormRequest.from_response(response, ...) can also be written as something like response.submit(...).
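To make the last point concrete, the comparison would be roughly (the form field names and callback are made up):

```python
def parse(self, response):
    # Today:
    yield scrapy.FormRequest.from_response(
        response, formdata={'user': 'me', 'pass': 'secret'},
        callback=self.after_login)

    # With a response method (hypothetical name):
    yield response.submit(formdata={'user': 'me', 'pass': 'secret'},
                          callback=self.after_login)
```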