
Conversation

@pawelmhm (Member) commented May 30, 2016

Add support for start_requests. This is still somewhat of an early beta, but I'm opening the PR to start a design discussion and get early feedback.

Core of changes

If the user provides the argument start_requests=true, start_requests will be enabled in the spider and the url argument won't be required. If start_requests is not passed, the url argument is required and the API returns 400 Bad Request (as it does currently). This behavior should be common across the GET and POST endpoints.
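
A quick sketch of how this could look from a client's point of view; the spider name, host, and port are illustrative (9080 is ScrapyRT's default port), and the exact spelling of the parameter follows the description above rather than any released docs:

import requests

# Without start_requests the url argument stays mandatory, as today;
# omitting it yields 400 Bad Request.
plain = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "url": "http://example.com", "callback": "parse"},
)

# With start_requests=true the spider's own start_requests() is used
# and url may be omitted.
with_start_requests = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "start_requests": "true"},
)

print(plain.status_code, with_start_requests.status_code)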

Other changes

  • Improve tests in test_resource_crawl so that they test both GET and POST for the same things. This should improve test coverage.
  • Add start_requests_spider that is initialized from a string template, formatted with a start_url taken from the mockserver. This required changing the order of things in the setUp method in tests/servers.py and adding a site argument (so that we can easily obtain site.url()).
  • Separate scrapy.Request arguments from API arguments (a rough sketch of this extraction follows this list). This is needed because it makes validation easier, and it also allows us to remove API arguments from the arguments passed to scrapy.Request. It is important in the GET handler.
  • Slightly improved validation of invalid Request arguments - 400 errors can be raised either in the crawler manager (if an argument has an incorrect value) or in the API (if an argument has an incorrect name).
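
A rough sketch of what the extraction mentioned in the third bullet could look like; this is a simplified illustration, not the PR's exact code:

import inspect

from scrapy import Request


def extract_scrapy_request_args(dictionary, raise_error=False):
    """Return only the items of ``dictionary`` that scrapy.Request accepts.

    :param dictionary: dictionary with parameters passed to the API
    :param raise_error: raise ValueError if a key is not a valid argument
        for scrapy.http.Request
    """
    result = dictionary.copy()
    request_args = inspect.signature(Request.__init__).parameters
    for name in dictionary:
        if name not in request_args:
            result.pop(name)
            if raise_error:
                raise ValueError(
                    "{!r} is not a valid argument for scrapy.Request".format(name))
    return result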

* add common logic of extracting api arguments in POST and GET
* extract Scrapy Request arguments and parameters for API into separate variables
* validate Scrapy Request arguments on resource level
* update test so that they test POST and GET handlers
scrapyrt/core.py Outdated
    dfd = self.crawler_process.crawl(self.spider_name, *args, **kwargs)
except KeyError as e:
    # Spider not found.
    # TODO is it the only possible exception here?
Contributor

it's the only exception that requires 404 response - because spider wasn't found

@pawelmhm (Member, Author)

hey @chekunkov, thanks for the comments, I'm going to include your feedback in subsequent commits. One thing I wanted to ask you and perhaps other people in the community. My initial design choice was:

If the user provides the argument start_requests=true, start_requests will be enabled in the spider and the url argument won't be required. If start_requests is not passed, the url argument is required and the API returns 400 Bad Request (as it does currently). This behavior should be common across the GET and POST endpoints.

Do we all agree this is a good principle? What do you think about enabling start_requests by default and disabling them with an argument? I see some people are confused by the fact that start_requests are disabled, because it somehow does not match expectations about how Scrapy spiders work. Maybe this confusion tells us that we should adjust our app and make it more intuitive? We initially created ScrapyRT for a project where we didn't need start_requests, but maybe our original use case was not typical, and maybe the world at large will benefit from us changing our assumptions in this respect?

@chekunkov (Contributor)

What do you think about enabling start_requests by default and disabling them with an argument?

@pawelmhm my problem with that is backwards compatibility. We know that some (many?) projects rely on the fact that start requests are disabled by default; by changing this here we would break those projects.

Another problem is the initial idea of ScrapyRT: take a spider that performs a long broad crawl, specify a url and a callback name, and return the result from that callback without running the whole spider. If we switch to start_requests being enabled by default, many users will simply experience timeouts or hanging responses from ScrapyRT endpoints, because they'll start crawls that take much more time than the expected request-response time. For example, imagine a spider that has the root of some relatively large online shop as its start url. If users want to run long-running tasks, they are better off using https://github.com/scrapy/scrapyd

@rmax commented Sep 2, 2016

I haven't tried this, but how will this be different from calling the spider with url=<start-url> and callback=parse?

@pawelmhm (Member, Author) commented Sep 6, 2016

I haven't tried this, but how will this be different from calling the spider with url=<start-url> and callback=parse?

You can do it like that if it suits your needs. However, I understand that some people have custom start_requests defined. Their start_requests overrides scrapy.Spider's start_requests and issues a bunch of requests with custom callbacks; after this they perhaps want to schedule another url with a callback. I know about spiders that work like this. For example, they need to log in before making a request, so the code for login is constant and starts in start_requests. In this scenario it is more convenient to log in first and later schedule a request based on the querystring parameters url and callback.
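
A hedged illustration of the kind of spider described above (all names and URLs are made up): start_requests() always performs the login step first, and only afterwards does it make sense to schedule the url/callback supplied through the API.

import scrapy


class LoginFirstSpider(scrapy.Spider):
    name = "login_first"

    def start_requests(self):
        # Constant login step that must run before anything else.
        yield scrapy.FormRequest(
            "http://example.com/login",
            formdata={"user": "bot", "pass": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Only now schedule further requests, e.g. the url/callback
        # passed via the ScrapyRT querystring parameters.
        yield scrapy.Request("http://example.com/private", callback=self.parse_private)

    def parse_private(self, response):
        yield {"title": response.css("title::text").extract_first()}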

@pawelmhm (Member, Author) commented Dec 22, 2016

One thing to keep in mind @chekunkov is that there is a potentially breaking change here. The api_parameters resource attribute requires you to register an argument for the API before using it. If a user has a resource that subclasses the ScrapyRT resource and has some custom API parameter, e.g. "auth" or "force_crawl", and they do something with it, their argument will be lost when they call super(MyResource, self).render_GET(request).

I think that separating api_params from scrapy_request_args and being explicit about what exactly is an API parameter is a good thing though, and that we should stick to it. It seems better to me if users register API parameters as attributes. But of course we need to document that and notify everyone.
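
A hypothetical subclass illustrating the compatibility concern above; the "auth" parameter and api_parameters being a plain list are assumptions for illustration, and the attribute only exists in the shape proposed in this PR:

from scrapyrt.resources import CrawlResource


class MyResource(CrawlResource):
    # The custom parameter has to be registered up front, otherwise it is
    # treated as an unknown API argument and stripped before it reaches us.
    api_parameters = CrawlResource.api_parameters + ["auth"]

    def render_GET(self, request):
        # ... read the "auth" parameter from the request and act on it ...
        return super(MyResource, self).render_GET(request)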

@pawelmhm (Member, Author)

So I think it should work mostly OK; I just need to add docs for the start_requests parameter.

"""
:param dictionary: Dictionary with parameters passed to API
:param raise_error: raise ValueError if key is not valid arg for
scrapy.httpRequest
Contributor

missing dot between http and Request

raise Error(400, e.message)

if not api_params.get("start_requests"):
    self.get_required_argument(api_params, "request")
Contributor

I'd suggest rearranging the code like this to make it easier to follow:

if api_params.get("start_requests"):
    _request = api_params.get("request") or {}
else:
    _request = self.get_required_argument(api_params, "request")
try:
    # validate Scrapy Request args
    scrapy_request_args = extract_scrapy_request_args(
        _request, raise_error=True)
except ValueError as e:
    raise Error(400, e.message)

Contributor

@pawelmhm what do you think about this one?

@pawelmhm (Member, Author) commented Jan 4, 2017

ah yes will add it (sorry missed that)

@pawelmhm (Member, Author)

@chekunkov I cleaned this up here: 59d0a01 and also added a bunch of other tests. Test coverage should be much improved after this PR, which should help with porting to Python 3.

@chekunkov (Contributor) commented Jan 3, 2017

While I think that registering api parameters explicitly may be a good idea, I don't think it should be introduced in this PR, because:

  1. Compatibility. If we keep this change we need to bump the major version of the package, while I'd expect 1.0 to have much more useful design improvements and rewrites. Otherwise it'll be a waste of a major version bump and we will not be able to introduce other breaking changes for a while.

  2. This change has nothing to do with the original intent of providing an option to run a spider with start_requests.

We can keep the idea of such validation as a GitHub issue and target it for version 1.0.

Variable renaming looks good to me as long as we keep compatibility in places like this. I agree that the new naming is clearer; we can aim the callback kwargs renaming for v1.0. In addition, in v1.0 we could provide a single context object that is passed through the callbacks related to the same request - this would simplify the signatures of many internal functions and provide a way to pass and access values in different parts of the callback chain.
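
A loose sketch of the single-context-object idea (pure illustration, nothing here is agreed design or existing code): one object created per API request and threaded through the callback chain instead of an ever-growing list of positional arguments.

class CrawlContext(object):
    """Carries everything related to one API request through the callbacks."""

    def __init__(self, api_params, scrapy_request_args):
        self.api_params = api_params                    # raw parameters from the API call
        self.scrapy_request_args = scrapy_request_args  # validated scrapy.Request kwargs
        self.items = []                                  # values accumulated along the chain
        self.errors = []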

@pawelmhm (Member, Author) commented Jan 4, 2017

While I think that registering api parameters explicitly may be a good idea, I don't think it should be introduced in this PR

Hmm yeah, actually maybe you're right, let's just do start_requests first and move registering API params to another PR. But we can keep the scrapy_request_args utility and extract the Scrapy args, just without cleaning out the non-scrapy.http.Request arguments. It seems like the current signature of prepare_crawl relies on the fact that the first argument (now called api_params) contains all parameters passed to the API (Scrapy and non-Scrapy alike), so we should keep it like that to avoid breaking code that might subclass this.
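
A very loose, self-contained sketch of the compatibility point above (all names are placeholders, not the real scrapyrt resource code): the first argument of prepare_crawl keeps carrying every parameter the API received, while the extracted scrapy.Request kwargs travel separately.

class FakeResource(object):
    def handle(self, all_api_params):
        # Simplified whitelist standing in for the real argument extraction.
        scrapy_request_args = {
            key: value for key, value in all_api_params.items()
            if key in ("url", "callback", "method", "meta")
        }
        # Subclasses overriding prepare_crawl keep seeing the full dict.
        return self.prepare_crawl(all_api_params, scrapy_request_args)

    def prepare_crawl(self, api_params, scrapy_request_args):
        return api_params, scrapy_request_args


print(FakeResource().handle({"url": "http://example.com", "start_requests": True}))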

Variable renaming looks good to me as long as we keep compatibility in places like this. I agree that the new naming is clearer; we can aim the callback kwargs renaming for v1.0.

I think we do keep compatibility in the line you linked, or am I missing something? I'm not sure we need much kwarg renaming because most kwargs are passed with a double star (**kwargs), at least in resources.

@chekunkov (Contributor)

I think we do keep compatibility in the line you linked, or am I missing something?

Right, we do keep compatibility in that line; I pointed it out as an example of a place where renaming didn't (and shouldn't) happen because of compatibility. Sorry if my message wasn't clear enough.

@chekunkov (Contributor) left a comment

LGTM

@pawelmhm merged commit aaced79 into master on Jan 23, 2017
@pawelmhm deleted the start_requests branch on January 23, 2017 07:43