Start requests #38
Conversation
* add common logic of extracting api arguments in POST and GET
* extract Scrapy Request arguments and parameters for API into separate variables
* validate Scrapy Request arguments on resource level
* update tests so that they test POST and GET handlers
scrapyrt/core.py
Outdated
    dfd = self.crawler_process.crawl(self.spider_name, *args, **kwargs)
except KeyError as e:
    # Spider not found.
    # TODO is it the only possible exception here?
it's the only exception that requires a 404 response - because the spider wasn't found
hey @chekunkov thanks for the comments, I'm going to include your feedback in subsequent commits. One thing I wanted to ask you and perhaps other people in the community. My initial design choice was:
Do we all agree this is a good principle? What do you think about enabling start_requests by default and disabling them with an argument? I see some people are confused by the fact that start_requests are disabled, because it somehow does not match expectations about how Scrapy spiders work. Maybe this confusion tells us that we should adjust our app and make it more intuitive? We initially created ScrapyRT for a project where we didn't need start_requests, but maybe our original use case was not typical, and maybe the world at large will benefit from us changing our assumptions in this respect?
@pawelmhm my problem with that is backwards compatibility. We know that some (many?) projects rely on the fact that by default start requests are disabled; by changing this here we will break those projects.

Another problem is the initial idea of ScrapyRT: take a spider that performs a long broad crawl, specify a url and a callback name, and return the result from that callback without running the whole spider. If we switch to start_requests enabled by default, many users will simply experience timeouts or hanging responses from ScrapyRT endpoints, because they'll start crawls that take much more time than the expected request-response time. For example, imagine a spider that has the root of some relatively large online shop as its start url. If users want to run long running tasks there, they'd better use https://github.com/scrapy/scrapyd
I haven't tried this, but how will this be different from calling the spider with …
you can do it like this if it suits your needs. However I understand that some people have some custom start_requests
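For context, the kind of spider in question builds its own requests in start_requests; a minimal sketch (spider name and URLs are made up):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Custom logic that a single "url" API argument can't replace,
        # e.g. requests built from configuration or a login step.
        for url in ["http://example.com/a", "http://example.com/b"]:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url}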
request_data will be changed to api_params; spider_data will be changed to scrapy_request_args
are known to work
and also add docstring for that
One thing to keep in mind @chekunkov is that there is a potentially breaking change here. The api_parameters resource attribute requires you to register an argument for the API before using it. If a user has a resource that subclasses the ScrapyRT resource and has some custom api param, e.g. parameter "auth" or api parameter "force_crawl", and he does something with it, his argument will be lost if he calls …

I think that separating api_params from scrapy_request_args and being explicit about what exactly is an api parameter is a good thing though, and that we should stick to it. It seems better to me if users register api parameters as attributes. But of course we need to document that and notify everyone.
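To make that concern concrete, a hypothetical subclass of this kind might look like the sketch below (the "auth" parameter is made up, and the exact shape of the api_parameters attribute is an assumption):

from scrapyrt.resources import CrawlResource


class AuthCrawlResource(CrawlResource):
    # Hypothetical custom resource: under explicit registration the
    # "auth" parameter must be listed in api_parameters (assumed here
    # to be a list on the base class), otherwise it would be dropped
    # before the handler ever sees it.
    api_parameters = CrawlResource.api_parameters + ["auth"]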
So I think it should work mostly ok, just need to add docs for the start_requests parameter.
scrapyrt/utils.py
Outdated
| """ | ||
| :param dictionary: Dictionary with parameters passed to API | ||
| :param raise_error: raise ValueError if key is not valid arg for | ||
| scrapy.httpRequest |
missing dot between http and Request
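For readers following along, here is a minimal sketch of what such a utility could look like; the real implementation in scrapyrt/utils.py may differ, and the introspection approach below is an assumption:

import inspect

from scrapy import Request


def extract_scrapy_request_args(dictionary, raise_error=False):
    """Return the subset of ``dictionary`` that maps to valid
    scrapy.http.Request arguments, dropping (or rejecting) the rest.

    :param dictionary: Dictionary with parameters passed to API
    :param raise_error: raise ValueError if key is not valid arg for
        scrapy.http.Request
    """
    result = dictionary.copy()
    # getargspec matches the Python 2 era of this codebase;
    # inspect.signature would be the modern equivalent.
    valid_args = inspect.getargspec(Request.__init__).args
    for key in dictionary:
        if key not in valid_args:
            result.pop(key)
            if raise_error:
                raise ValueError(
                    "{!r} is not a valid argument for scrapy.Request".format(key))
    return result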
scrapyrt/resources.py
Outdated
raise Error(400, e.message)

if not api_params.get("start_requests"):
    self.get_required_argument(api_params, "request")
I'd suggest rearranging the code like this to make it easier to follow:
if api_params.get("start_requests"):
    _request = api_params.get("request") or {}
else:
    _request = self.get_required_argument(api_params, "request")

try:
    # validate Scrapy Request args
    scrapy_request_args = extract_scrapy_request_args(
        _request, raise_error=True)
except ValueError as e:
    raise Error(400, e.message)
@pawelmhm what do you think about this one?
ah yes will add it (sorry missed that)
@chekunkov I cleaned this up here: 59d0a01 and also added a bunch of other tests. Test coverage should be much improved after this PR, which should help in porting to Python 3.
While I think that registering api parameters explicitly may be a good idea, I don't think it should be introduced in this PR, because:
We can keep the idea of such validation as a github issue and target it for a future version.

Variables renaming looks good to me as far as we keep compatibility in places like this. I agree that the new naming is more clear; we can aim the callback kwargs renaming for a future version too.
Hmm yeah, actually maybe you're right, let's just do start_requests first and move registering api params to another PR. But we can keep the scrapy_request_args utility and extract Scrapy args, just don't clean non-scrapy.http.Request arguments. It seems like the current signature of …
I think we do keep compatibility in this line you linked, or am I missing something? I'm not sure if we need much kwarg renaming because most kwargs are passed with a double star (**kwargs), at least in resources.
Right, we do keep compatibility in this line, I pointed it out as an example of a place where renaming didn't (and shouldn't) happen because of compatibility. Sorry if my message wasn't clear enough.
there was no test for this, should be one now
+ test for missing "request" argument to POST endpoint
test if missing request parameter in POST will raise 400
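A rough sketch of the shape of that check (assuming a ScrapyRT instance reachable locally on the default port; the project's real tests manage the service themselves and may use a different client):

import requests


def test_post_without_request_argument_returns_400():
    # POST body deliberately omits the "request" key, so the API
    # should answer with 400 Bad Request.
    resp = requests.post(
        "http://localhost:9080/crawl.json",
        json={"spider_name": "example"},
    )
    assert resp.status_code == 400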
chekunkov left a comment
LGTM
Add support for start_requests. This is still a somewhat early beta, but I'm opening the PR to start a design discussion and get early feedback.
Core of changes
If the user provides the argument start_requests=True it will enable start_requests in the spider and the url argument won't be required. If there is no start_requests, the url argument is required and the API will return 400 Bad Request (as it does currently). This behavior should be common across GET and POST endpoints.
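A sketch of how the two modes would be called, assuming a local ScrapyRT instance on the default port with a spider named "example" (spider name and URLs are placeholders):

import requests

# start_requests enabled: the spider's own start_requests() is used,
# so no "url" argument is needed.
requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "start_requests": "true"},
)

# start_requests not enabled (current behaviour): "url" is required,
# and omitting it yields 400 Bad Request.
requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "url": "http://example.com/page",
            "callback": "parse"},
)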
Other changes