
Conversation

@pawelmhm (Member) commented May 30, 2016

Add support for start_requests. This is still somewhat of an early beta, but I'm opening the PR to start a design discussion and get early feedback.

Core of changes

If the user provides the argument start_requests=true, start_requests will be enabled in the spider and the url argument won't be required. If start_requests is not passed, the url argument is required and the API returns 400 Bad Request (as it does currently). This behavior should be common across the GET and POST endpoints.
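
A quick sketch of how this could look from a client's point of view; the spider name, host, and port are illustrative (9080 is ScrapyRT's default port), and the exact spelling of the parameter follows the description above rather than any released docs:

import requests

# Without start_requests the url argument stays mandatory, as today;
# omitting it yields 400 Bad Request.
plain = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "url": "http://example.com", "callback": "parse"},
)

# With start_requests=true the spider's own start_requests() is used
# and url may be omitted.
with_start_requests = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "start_requests": "true"},
)

print(plain.status_code, with_start_requests.status_code)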

Other changes

  • Improve tests in test_resource_crawl so that they test both GET and POST for the same things. This should improve test coverage.
  • Add start_requests_spider that is initialized from a string template, formatted with a start_url taken from the mockserver. This required changing the order of things in the setUp method in tests/servers.py and adding a site argument (so that we can easily obtain site.url()).
  • Separate scrapy.Request arguments from API arguments (a rough sketch of this extraction follows this list). This is needed because it makes validation easier, and it also allows us to remove API arguments from the arguments passed to scrapy.Request. It is important in the GET handler.
  • Slightly improved validation of invalid Request arguments - 400 errors can be raised either in the crawler manager (if an argument has an incorrect value) or in the API (if an argument has an incorrect name).
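
A rough sketch of what the extraction mentioned in the third bullet could look like; this is a simplified illustration, not the PR's exact code:

import inspect

from scrapy import Request


def extract_scrapy_request_args(dictionary, raise_error=False):
    """Return only the items of ``dictionary`` that scrapy.Request accepts.

    :param dictionary: dictionary with parameters passed to the API
    :param raise_error: raise ValueError if a key is not a valid argument
        for scrapy.http.Request
    """
    result = dictionary.copy()
    request_args = inspect.signature(Request.__init__).parameters
    for name in dictionary:
        if name not in request_args:
            result.pop(name)
            if raise_error:
                raise ValueError(
                    "{!r} is not a valid argument for scrapy.Request".format(name))
    return result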

* add common logic of extracting api arguments in POST and GET
* extract Scrapy Request arguments and parameters for API into separate variables
* validate Scrapy Request arguments on resource level
* update test so that they test POST and GET handlers
scrapyrt/core.py Outdated
    dfd = self.crawler_process.crawl(self.spider_name, *args, **kwargs)
except KeyError as e:
    # Spider not found.
    # TODO is it the only possible exception here?
Contributor

it's the only exception that requires 404 response - because spider wasn't found

@pawelmhm (Member, Author)

hey @chekunkov, thanks for the comments, I'm going to include your feedback in subsequent commits. One thing I wanted to ask you and perhaps other people in the community. My initial design choice was:

If the user provides the argument start_requests=true, start_requests will be enabled in the spider and the url argument won't be required. If start_requests is not passed, the url argument is required and the API returns 400 Bad Request (as it does currently). This behavior should be common across the GET and POST endpoints.

Do we all agree this is a good principle? What do you think about enabling start_requests by default and disabling them with an argument? I see some people are confused by the fact that start_requests are disabled, because it somehow does not match expectations about how Scrapy spiders work. Maybe this confusion tells us that we should adjust our app and make it more intuitive? We initially created ScrapyRT for a project where we didn't need start_requests, but maybe our original use case was not typical, and maybe the world at large will benefit from us changing our assumptions in this respect?

@chekunkov (Contributor)

What do you think about enabling start_requests by default and disabling them with an argument?

@pawelmhm my problem with that is backwards compatibility. We know that some (many?) projects rely on the fact that start requests are disabled by default; by changing this here we would break those projects.

Another problem is the initial idea of ScrapyRT: take a spider that performs a long broad crawl, specify a url and a callback name, and return the result from that callback without running the whole spider. If we switch to start_requests being enabled by default, many users will simply experience timeouts or hanging responses from ScrapyRT endpoints, because they'll start crawls that take much more time than the expected request-response time. For example, imagine a spider that has the root of some relatively large online shop as its start url. If users want to run long-running tasks, they are better off using https://github.com/scrapy/scrapyd

@rmax commented Sep 2, 2016

I haven't tried this, but how will this be different from calling the spider with url=<start-url> and callback=parse?

@pawelmhm (Member, Author) commented Sep 6, 2016

I haven't tried this, but how will this be different from calling the spider with url=<start-url> and callback=parse?

You can do it like that if it suits your needs. However, I understand that some people have custom start_requests defined. Their start_requests overrides scrapy.Spider's start_requests and issues a bunch of requests with custom callbacks; after this they perhaps want to schedule another url with a callback. I know about spiders that work like this. For example, they need to log in before making a request, so the code for login is constant and starts in start_requests. In this scenario it is more convenient to log in first and later schedule a request based on the querystring parameters url and callback.
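
A hedged illustration of the kind of spider described above (all names and URLs are made up): start_requests() always performs the login step first, and only afterwards does it make sense to schedule the url/callback supplied through the API.

import scrapy


class LoginFirstSpider(scrapy.Spider):
    name = "login_first"

    def start_requests(self):
        # Constant login step that must run before anything else.
        yield scrapy.FormRequest(
            "http://example.com/login",
            formdata={"user": "bot", "pass": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Only now schedule further requests, e.g. the url/callback
        # passed via the ScrapyRT querystring parameters.
        yield scrapy.Request("http://example.com/private", callback=self.parse_private)

    def parse_private(self, response):
        yield {"title": response.css("title::text").extract_first()}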

@pawelmhm (Member, Author) commented Dec 22, 2016

One thing to keep in mind @chekunkov is that there is a potentially breaking change here. The api_parameters resource attribute requires you to register an argument for the API before using it. If a user has a resource that subclasses the ScrapyRT resource and has some custom API parameter, e.g. "auth" or "force_crawl", and they do something with it, their argument will be lost when they call super(MyResource, self).render_GET(request).

I think that separating api_params from scrapy_request_args and being explicit about what exactly is an API parameter is a good thing though, and that we should stick to it. It seems better to me if users register API parameters as attributes. But of course we need to document that and notify everyone.
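
A hypothetical subclass illustrating the compatibility concern above; the "auth" parameter and api_parameters being a plain list are assumptions for illustration, and the attribute only exists in the shape proposed in this PR:

from scrapyrt.resources import CrawlResource


class MyResource(CrawlResource):
    # The custom parameter has to be registered up front, otherwise it is
    # treated as an unknown API argument and stripped before it reaches us.
    api_parameters = CrawlResource.api_parameters + ["auth"]

    def render_GET(self, request):
        # ... read the "auth" parameter from the request and act on it ...
        return super(MyResource, self).render_GET(request)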

@pawelmhm (Member, Author)

So I think it should work mostly OK; I just need to add docs for the start_requests parameter.

"""
:param dictionary: Dictionary with parameters passed to API
:param raise_error: raise ValueError if key is not valid arg for
scrapy.httpRequest
Contributor

missing dot between http and Request

raise Error(400, e.message)

if not api_params.get("start_requests"):
    self.get_required_argument(api_params, "request")
Contributor

I'd suggest rearranging the code like this to make it easier to follow:

if api_params.get("start_requests"):
    _request = api_params.get("request") or {}
else:
    _request = self.get_required_argument(api_params, "request")
try:
    # validate Scrapy Request args
    scrapy_request_args = extract_scrapy_request_args(
        _request, raise_error=True)
except ValueError as e:
    raise Error(400, e.message)

Contributor

@pawelmhm what do you think about this one?

@pawelmhm (Member, Author) commented Jan 4, 2017

ah yes will add it (sorry missed that)

@pawelmhm (Member, Author)

@chekunkov I cleaned this up here: 59d0a01 and also added a bunch of other tests. Test coverage should be much improved after this PR, which should help with porting to Python 3.

@chekunkov (Contributor) commented Jan 3, 2017

While I think that registering api parameters explicitly may be a good idea, I don't think it should be introduced in this PR, because:

  1. Compatibility. If we keep this change we need to bump the major version of the package, while I'd expect 1.0 to have much more useful design improvements and rewrites. Otherwise it'll be a waste of a major version bump and we will not be able to introduce other breaking changes for a while.

  2. This change has nothing to do with the original intent of providing an option to run a spider with start_requests.

We can keep the idea of such validation as a GitHub issue and target it for version 1.0.

Variable renaming looks good to me as long as we keep compatibility in places like this. I agree that the new naming is clearer; we can aim the callback kwargs renaming for v1.0. In addition, in v1.0 we could provide a single context object that is passed through the callbacks related to the same request - this would simplify the signatures of many internal functions and provide a way to pass and access values in different parts of the callback chain.
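
A loose sketch of the single-context-object idea (pure illustration, nothing here is agreed design or existing code): one object created per API request and threaded through the callback chain instead of an ever-growing list of positional arguments.

class CrawlContext(object):
    """Carries everything related to one API request through the callbacks."""

    def __init__(self, api_params, scrapy_request_args):
        self.api_params = api_params                    # raw parameters from the API call
        self.scrapy_request_args = scrapy_request_args  # validated scrapy.Request kwargs
        self.items = []                                  # values accumulated along the chain
        self.errors = []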

@pawelmhm (Member, Author) commented Jan 4, 2017

While I think that registering api parameters explicitly may be a good idea, I don't think it should be introduced in this PR

Hmm yeah, actually maybe you're right, let's just do start_requests first and move registering API params to another PR. But we can keep the scrapy_request_args utility and extract the Scrapy args, just without cleaning out the non-scrapy.http.Request arguments. It seems like the current signature of prepare_crawl relies on the fact that the first argument (now called api_params) contains all parameters passed to the API (Scrapy and non-Scrapy alike), so we should keep it like that to avoid breaking code that might subclass this.
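
A very loose, self-contained sketch of the compatibility point above (all names are placeholders, not the real scrapyrt resource code): the first argument of prepare_crawl keeps carrying every parameter the API received, while the extracted scrapy.Request kwargs travel separately.

class FakeResource(object):
    def handle(self, all_api_params):
        # Simplified whitelist standing in for the real argument extraction.
        scrapy_request_args = {
            key: value for key, value in all_api_params.items()
            if key in ("url", "callback", "method", "meta")
        }
        # Subclasses overriding prepare_crawl keep seeing the full dict.
        return self.prepare_crawl(all_api_params, scrapy_request_args)

    def prepare_crawl(self, api_params, scrapy_request_args):
        return api_params, scrapy_request_args


print(FakeResource().handle({"url": "http://example.com", "start_requests": True}))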

Variable renaming looks good to me as long as we keep compatibility in places like this. I agree that the new naming is clearer; we can aim the callback kwargs renaming for v1.0.

I think we do keep compatibility in the line you linked, or am I missing something? I'm not sure we need much kwarg renaming because most kwargs are passed with a double star (**kwargs), at least in resources.

@chekunkov (Contributor)

I think we do keep compatibility in the line you linked, or am I missing something?

Right, we do keep compatibility in that line; I pointed it out as an example of a place where renaming didn't (and shouldn't) happen because of compatibility. Sorry if my message wasn't clear enough.

@chekunkov (Contributor) left a comment

LGTM

@pawelmhm merged commit aaced79 into master on Jan 23, 2017
@pawelmhm deleted the start_requests branch on January 23, 2017 07:43