
allow users to pass spider arguments via url #29

Closed
pawelmhm opened this issue Mar 24, 2016 · 26 comments
@pawelmhm
Member

pawelmhm commented Mar 24, 2016

When running Scrapy from command line you can do:

> scrapy crawl foo_spider -a zipcode=10001

but this is NOT possible with ScrapyRT now. You cannot pass arguments to spiders; you can only pass arguments for the request. Adding support for "command line" arguments is not difficult to implement and seems important IMO.

You could simply pass

localhost:8050/crawl.json?spider=foo.spider&zipcode=10001&url=some_url
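
For context, a minimal sketch (illustrative spider name) of how Scrapy exposes -a arguments as spider attributes; this is the behavior the request above asks ScrapyRT to mirror for URL parameters:

import scrapy


class FooSpider(scrapy.Spider):
    name = 'foo_spider'

    def parse(self, response):
        # "scrapy crawl foo_spider -a zipcode=10001" surfaces the argument
        # as self.zipcode through Spider.__init__'s kwargs handling
        self.logger.info('zipcode: %s', getattr(self, 'zipcode', None))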

EDIT: clarifying that we're talking about passing arguments to the API via the URL.

@pawelmhm pawelmhm changed the title allow users to pass command line arguments for spiders allow users to pass command line arguments for spiders via url May 30, 2016
@pawelmhm pawelmhm changed the title allow users to pass command line arguments for spiders via url allow users to pass spider arguments for spiders via url May 30, 2016
@pawelmhm pawelmhm changed the title allow users to pass spider arguments for spiders via url allow users to pass spider arguments via url Jan 23, 2017
@titmy

titmy commented Apr 19, 2017

Having similar problems.

@gdelfresno

I managed to achieve this in a very easy way.

ScrapyRT allows you to configure the CrawlResource class, so with a simple modification to the prepare_crawl method it will pass arguments to the spider. Just add these lines before the call to self.run_crawl:

# keep only the parameters ScrapyRT does not consume itself
crawler_params = api_params.copy()
for api_param in ['max_requests', 'start_requests', 'spider_name', 'url']:
    crawler_params.pop(api_param, None)
# the remaining API parameters become spider arguments
kwargs.update(crawler_params)

Then, just configure ScrapyRT's RESOURCES setting to point to your custom CrawlResource class.
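
Putting both steps together, a minimal sketch of such a subclass; this assumes prepare_crawl takes (api_params, scrapy_request_args, *args, **kwargs), so check your installed version's resources.py for the exact signature:

from scrapyrt.resources import CrawlResource


class ArgsCrawlResource(CrawlResource):

    def prepare_crawl(self, api_params, scrapy_request_args, *args, **kwargs):
        # forward every API parameter ScrapyRT does not consume itself
        # as a spider argument (like "scrapy crawl -a name=value")
        crawler_params = api_params.copy()
        for api_param in ['max_requests', 'start_requests', 'spider_name', 'url']:
            crawler_params.pop(api_param, None)
        kwargs.update(crawler_params)
        return super(ArgsCrawlResource, self).prepare_crawl(
            api_params, scrapy_request_args, *args, **kwargs)

and in your ScrapyRT settings module (the module path here is illustrative):

RESOURCES = {
    'crawl.json': 'myproject.resources.ArgsCrawlResource',
}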

Could this or something similar be added to the core class? I can do a PR.

@dotungvp1994

dotungvp1994 commented Jun 19, 2017

@gdelfresno: hey, I'm trying a custom CrawlResource but my project doesn't change :(. Can you show me an example project using a custom CrawlResource?

Edit: Yeah, I did pass arguments to the spider following your guide, but I had to change the library source in /usr/bin/local :(. I tried adding the CrawlResource override in spider.py and settings.py, but it's not working.

@dotungvp1994

@pawelmhm: I have the same problem you described a year ago. After a year, do you have a solution? I'm a newbie to Python and Scrapy, can you help me? Sorry, my English is bad :(

@pawelmhm
Member Author

hey @dotungvp1994, yeah, we'll prioritize this and add it to the next release, but I can't give you an exact ETA yet. We'll need some time to implement it for sure, not sure how much.

@pianista215

Any news about this? We are having the same problems. We will try @gdelfresno's solution, maybe by building a modified Docker image with it :)

@dotungvp1994

@pianista215: a tip for your problem: you can pass arguments in the request's meta data and get them back from the response.
I think it will be useful for you.

@pianista215

pianista215 commented Nov 8, 2017

Edited: Sorry @dotungvp1994, I'm trying to pass it to the API:

curl -XPOST -d '{ "spider_name":"XXX", "start_requests":true, "request":{ "meta": {"lookup_until_date": "23-09-2017" } } }' "http://localhost:9081/crawl.json" >> response

But unfortunately, it is not in my response.request.meta dictionary in the parse method.

Am I missing something?

@pianista215

pianista215 commented Nov 8, 2017

Hi @dotungvp1994 ,

If you are still interested, it is really easy with @gdelfresno's trick.

First modify the file mentioned above, or if you prefer, use the Docker image I've already built with the changes:

pianista215/scrapyrt:0.10-parameter-patched

Now you have to modify your spider to read the parameters from kwargs in __init__:


    lookup_until_date = None
    __allowed = ("lookup_until_date",)  # trailing comma: a one-element tuple

    def __init__(self, lookup_until_date=None, *args, **kwargs):
        super(Ibex35Spider, self).__init__(*args, **kwargs)
        # assign the named parameter explicitly; the signature consumes it,
        # so the kwargs loop below never sees it
        self.lookup_until_date = lookup_until_date
        for k, v in kwargs.items():  # kwargs.iteritems() on Python 2
            assert k in self.__class__.__allowed
            setattr(self, k, v)

Now your lookup_until_date is populated if you invoke ScrapyRT like this:
curl -XPOST -d '{ "spider_name":"XXX", "start_requests":true, "lookup_until_date": "05-11-2017" }' "http://localhost:9080/crawl.json"

@dotungvp1994

dotungvp1994 commented Nov 9, 2017

@pianista215: Yeah, I know. I will try @gdelfresno's trick, which works, but I thought passing arguments via the meta data was the same.
If you pass arguments with
curl -XPOST -d '{ "spider_name":"XXX", "start_requests":true, "request":{ "meta": {"lookup_until_date": "23-09-2017" } } }' "http://localhost:9081/crawl.json" >> response
you can get lookup_until_date with response.meta["lookup_until_date"] :D.
Yeah, I have seen the changes you made in your repository, I will try it. 👍
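
For reference, a minimal sketch (hypothetical callback) of reading such a forwarded meta value in the spider, assuming ScrapyRT passes the "request" object's meta through to the Scrapy Request it builds:

def parse(self, response):
    # value forwarded via the API's "request": {"meta": {...}} field
    lookup_until_date = response.meta.get('lookup_until_date')
    self.logger.info('lookup_until_date: %s', lookup_until_date)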

@gdelfresno

@pawelmhm Implemented here gdelfresno@ee3be05

Do you want me to open a pull request?

@pawelmhm
Member Author

sure, open a PR and let's discuss this @gdelfresno

@gdelfresno

@pawelmhm PR ready #72

@aleroot

aleroot commented Jan 29, 2018

I have tested the pull request and it is working well; I think it should be merged.

@shadiakiki1986

Any news?

@shadiakiki1986

In my case, I just used the "meta" field as suggested earlier in this thread.

@janceChun

crawler_params = api_params.copy()
for api_param in ['max_requests', 'start_requests', 'spider_name', 'url']:
    crawler_params.pop(api_param, None)
kwargs.update(crawler_params)

Working well for me:
curl -XPOST -d '{ "spider_name":"xxx", "start_requests":true, "lookup_until_date":"23-09-2017"}' "http://localhost:9080/crawl.json"

@llermaly

@janceChun @shadiakiki1986 @dotungvp1994 I couldn't access the meta params from the spider. How did you do it exactly? I want to use those meta params in the spider's request form. Thanks!

@shadiakiki1986

@llermaly check here

@llermaly

llermaly commented Oct 27, 2018

@shadiakiki1986 Thanks.

I finally patched the file with @gdelfresno's code; my problem was that I didn't know where resources.py was.

In my case it was stored in /usr/local/lib/python3.5/dist-packages/scrapyrt/resources.py

After that I could access the variable with self.param.

@TapanHP1995

any update on merging #72 ?

@pawelmhm
Member Author

pawelmhm commented Oct 3, 2019

@TapanInexture this is going to be part of the next release, which will probably ship this month.

@pawelmhm pawelmhm self-assigned this Oct 3, 2019
@pawelmhm
Member Author

pawelmhm commented Oct 3, 2019

There are two complications here. One is that arguments can override spider methods, so someone could crash your spider by passing a bad argument. See this Scrapy issue scrapy/scrapy#1633; for example, passing the argument "start_requests" will break the spider. So we should validate arguments.

The other thing is that it seems better to isolate spider arguments and make them JSON. For example, you could pass:

http://localhost/crawl.json?url=http://aa.com&spider_arguments=%7B%22zipcode%22%3A%20%2214001%22%7D

where spider_arguments is %7B%22zipcode%22%3A%20%2214001%22%7D, which is URL-encoded {"zipcode": "14001"}. This way you will be able to pass any object as an argument: a dictionary, a list, etc. It will be more flexible, and it won't collide with same-name API parameters and request arguments. E.g. someone could pass a dont_filter spider argument, which could collide with the dont_filter Request argument and cause trouble.
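
A minimal client-side sketch of building such a URL (host and values are illustrative): serialize the spider arguments to JSON, then URL-encode them into a single query parameter.

import json
from urllib.parse import quote, urlencode

spider_args = {"zipcode": "14001"}
query = urlencode(
    {"url": "http://aa.com", "spider_arguments": json.dumps(spider_args)},
    quote_via=quote,  # percent-encode spaces as %20 rather than "+"
)
print("http://localhost/crawl.json?" + query)
# http://localhost/crawl.json?url=http%3A//aa.com&spider_arguments=%7B%22zipcode%22%3A%20%2214001%22%7D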

I'm implementing this on a branch, will create a PR soon.

@doverradio

Hi, once you pass in these arguments, how do you retrieve them in your script? Sorry, I'm attempting this for the first time.

@pawelmhm
Member Author

pawelmhm commented Apr 26, 2021

They are available as spider attributes, @doverradio. I'll add this to the documentation. E.g. if you pass the argument zipcode, it should be available in the spider as self.zipcode:

url: 'http://localhost:8000/crawl.json?spider_name=my_spider&url=http://foo&crawl_args={"zipcode":"XXXX"}&callback=parse_xxx'

spider code

def parse_xxx(self, response):
    print('zipcode is ' + self.zipcode)

Crawl args need to be passed as JSON.
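
A minimal client sketch of the same call, using the example values above (the requests library is just one convenient way to issue it); crawl_args is serialized to JSON before being put in the query string:

import json
import requests

resp = requests.get(
    "http://localhost:8000/crawl.json",
    params={
        "spider_name": "my_spider",
        "url": "http://foo",
        "crawl_args": json.dumps({"zipcode": "XXXX"}),  # must be JSON
        "callback": "parse_xxx",
    },
)
print(resp.json())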

@pawelmhm
Member Author

closed via #120
