
allow users to pass spider arguments via url #29

Closed
pawelmhm opened this issue Mar 24, 2016 · 26 comments
@pawelmhm
Member

pawelmhm commented Mar 24, 2016

When running Scrapy from command line you can do:

> scrapy crawl foo_spider -a zipcode=10001

but this is NOT possible with ScrapyRT now. You cannot pass arguments to spiders; you can only pass arguments for the request. Adding support for "command line" arguments is not difficult to implement and seems important IMO.

You could simply pass

localhost:8050/crawl.json?spider=foo.spider&zipcode=10001&url=some_url
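
For context, a minimal sketch (illustrative spider name) of how Scrapy exposes -a arguments as spider attributes; this is the behavior the request above asks ScrapyRT to mirror for URL parameters:

import scrapy


class FooSpider(scrapy.Spider):
    name = 'foo_spider'

    def parse(self, response):
        # "scrapy crawl foo_spider -a zipcode=10001" surfaces the argument
        # as self.zipcode through Spider.__init__'s kwargs handling
        self.logger.info('zipcode: %s', getattr(self, 'zipcode', None))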

EDIT: clarifying that we're talking about passing arguments to the API via the URL.

@pawelmhm pawelmhm changed the title allow users to pass command line arguments for spiders allow users to pass command line arguments for spiders via url May 30, 2016
@pawelmhm pawelmhm changed the title allow users to pass command line arguments for spiders via url allow users to pass spider arguments for spiders via url May 30, 2016
@pawelmhm pawelmhm changed the title allow users to pass spider arguments for spiders via url allow users to pass spider arguments via url Jan 23, 2017
@titmy

titmy commented Apr 19, 2017

Having similar problems.

@gdelfresno

I managed to achieve this in a very easy way.

ScrapyRT allows you to configure the CrawlResource class, so with a simple modification to the prepare_crawl method it will pass arguments to the spider. Just add these lines before the call to self.run_crawl:

# keep only the parameters ScrapyRT does not consume itself
crawler_params = api_params.copy()
for api_param in ['max_requests', 'start_requests', 'spider_name', 'url']:
    crawler_params.pop(api_param, None)
# the remaining API parameters become spider arguments
kwargs.update(crawler_params)

Then, just configure ScrapyRT's RESOURCES setting to point to your custom CrawlResource class.
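
Putting both steps together, a minimal sketch of such a subclass; this assumes prepare_crawl takes (api_params, scrapy_request_args, *args, **kwargs), so check your installed version's resources.py for the exact signature:

from scrapyrt.resources import CrawlResource


class ArgsCrawlResource(CrawlResource):

    def prepare_crawl(self, api_params, scrapy_request_args, *args, **kwargs):
        # forward every API parameter ScrapyRT does not consume itself
        # as a spider argument (like "scrapy crawl -a name=value")
        crawler_params = api_params.copy()
        for api_param in ['max_requests', 'start_requests', 'spider_name', 'url']:
            crawler_params.pop(api_param, None)
        kwargs.update(crawler_params)
        return super(ArgsCrawlResource, self).prepare_crawl(
            api_params, scrapy_request_args, *args, **kwargs)

and in your ScrapyRT settings module (the module path here is illustrative):

RESOURCES = {
    'crawl.json': 'myproject.resources.ArgsCrawlResource',
}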

Could this or something similar be added to the core class? I can do a PR.

@dotungvp1994

dotungvp1994 commented Jun 19, 2017

@gdelfresno: hey, I'm trying a custom CrawlResource but my project doesn't change :(. Can you show me an example project using a custom CrawlResource?

Edit: Yeah, I did pass arguments to the spider following your guide, but I had to change the library source in /usr/bin/local :(. I tried adding the CrawlResource override in spider.py and settings.py, but it's not working.

@dotungvp1994

@pawelmhm: I have the same problem you described a year ago. After a year, do you have a solution? I'm a newbie to Python and Scrapy, can you help me? Sorry, my English is bad :(

@pawelmhm
Member Author

hey @dotungvp1994, yeah, we'll prioritize this and add it to the next release, but I can't give you an exact ETA yet. We'll need some time to implement it for sure, not sure how much.

@pianista215

Any news about this? We are having the same problems. We will try @gdelfresno's solution, maybe by building a modified Docker image with it :)

@dotungvp1994

@pianista215: a tip for your problem: you can pass arguments in the request's meta data and get them back from the response.
I think it will be useful for you.

@pianista215

pianista215 commented Nov 8, 2017

Edited: Sorry @dotungvp1994, I'm trying to pass it to the API:

curl -XPOST -d '{ "spider_name":"XXX", "start_requests":true, "request":{ "meta": {"lookup_until_date": "23-09-2017" } } }' "http://localhost:9081/crawl.json" >> response

But unfortunately, it is not in my response.request.meta dictionary in the parse method.

Am I missing something?

@pianista215

pianista215 commented Nov 8, 2017

Hi @dotungvp1994 ,

If you are still interested, it is really easy with @gdelfresno's trick.

First modify the file mentioned above, or if you prefer, use the Docker image I've already built with the changes:

pianista215/scrapyrt:0.10-parameter-patched

Now you have to modify your spider to read the parameters from kwargs in __init__:


    lookup_until_date = None
    __allowed = ("lookup_until_date",)  # trailing comma: a one-element tuple

    def __init__(self, lookup_until_date=None, *args, **kwargs):
        super(Ibex35Spider, self).__init__(*args, **kwargs)
        # assign the named parameter explicitly; the signature consumes it,
        # so the kwargs loop below never sees it
        self.lookup_until_date = lookup_until_date
        for k, v in kwargs.items():  # kwargs.iteritems() on Python 2
            assert k in self.__class__.__allowed
            setattr(self, k, v)

Now your lookup_until_date is populated if you invoke ScrapyRT like this:
curl -XPOST -d '{ "spider_name":"XXX", "start_requests":true, "lookup_until_date": "05-11-2017" }' "http://localhost:9080/crawl.json"

@dotungvp1994

dotungvp1994 commented Nov 9, 2017

@pianista215: Yeah, I know. I will try @gdelfresno's trick, which works, but I thought passing arguments via the meta data was the same.
If you pass arguments with
curl -XPOST -d '{ "spider_name":"XXX", "start_requests":true, "request":{ "meta": {"lookup_until_date": "23-09-2017" } } }' "http://localhost:9081/crawl.json" >> response
you can get lookup_until_date with response.meta["lookup_until_date"] :D.
Yeah, I have seen the changes you made in your repository, I will try it. 👍
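
For reference, a minimal sketch (hypothetical callback) of reading such a forwarded meta value in the spider, assuming ScrapyRT passes the "request" object's meta through to the Scrapy Request it builds:

def parse(self, response):
    # value forwarded via the API's "request": {"meta": {...}} field
    lookup_until_date = response.meta.get('lookup_until_date')
    self.logger.info('lookup_until_date: %s', lookup_until_date)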

@gdelfresno

@pawelmhm Implemented here gdelfresno@ee3be05

Do you want me to open a pull request?

@pawelmhm
Member Author

sure, open a PR and let's discuss this @gdelfresno

@gdelfresno

@pawelmhm PR ready #72

@aleroot

aleroot commented Jan 29, 2018

I have tested the pull request and it is working well; I think it should be merged.

@shadiakiki1986

Any news?

@shadiakiki1986

In my case, I just used the "meta" field as suggested earlier in this thread.

@janceChun

crawler_params = api_params.copy()
for api_param in ['max_requests', 'start_requests', 'spider_name', 'url']:
    crawler_params.pop(api_param, None)
kwargs.update(crawler_params)

Working well for me:
curl -XPOST -d '{ "spider_name":"xxx", "start_requests":true, "lookup_until_date":"23-09-2017"}' "http://localhost:9080/crawl.json"

@llermaly

@janceChun @shadiakiki1986 @dotungvp1994 I couldn't access the meta params from the spider. How did you do it exactly? I want to use those meta params in the spider's request form. Thanks!

@shadiakiki1986

@llermaly check here

@llermaly

llermaly commented Oct 27, 2018

@shadiakiki1986 Thanks.

I finally patched the file with @gdelfresno's code; my problem was that I didn't know where resources.py was.

In my case it was stored in /usr/local/lib/python3.5/dist-packages/scrapyrt/resources.py

After that I could access the variable with self.param.

@TapanHP1995

any update on merging #72 ?

@pawelmhm
Member Author

pawelmhm commented Oct 3, 2019

@TapanInexture this is going to be part of the next release, which will probably ship this month.

@pawelmhm pawelmhm self-assigned this Oct 3, 2019
@pawelmhm
Member Author

pawelmhm commented Oct 3, 2019

There are two complications here. One is that arguments can override spider methods, so someone could crash your spider by passing a bad argument. See this Scrapy issue scrapy/scrapy#1633; for example, passing the argument "start_requests" will break the spider. So we should validate arguments.

The other thing is that it seems better to isolate spider arguments and make them JSON. For example, you could pass:

http://localhost/crawl.json?url=http://aa.com&spider_arguments=%7B%22zipcode%22%3A%20%2214001%22%7D

where spider_arguments is %7B%22zipcode%22%3A%20%2214001%22%7D, which is URL-encoded {"zipcode": "14001"}. This way you will be able to pass any object as an argument: a dictionary, a list, etc. It will be more flexible, and it won't collide with same-name API parameters and request arguments. E.g. someone could pass a dont_filter spider argument, which could collide with the dont_filter Request argument and cause trouble.
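
A minimal client-side sketch of building such a URL (host and values are illustrative): serialize the spider arguments to JSON, then URL-encode them into a single query parameter.

import json
from urllib.parse import quote, urlencode

spider_args = {"zipcode": "14001"}
query = urlencode(
    {"url": "http://aa.com", "spider_arguments": json.dumps(spider_args)},
    quote_via=quote,  # percent-encode spaces as %20 rather than "+"
)
print("http://localhost/crawl.json?" + query)
# http://localhost/crawl.json?url=http%3A//aa.com&spider_arguments=%7B%22zipcode%22%3A%20%2214001%22%7D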

I'm implementing this on a branch, will create a PR soon.

@doverradio

Hi, once you pass in these arguments, how do you retrieve them in your script? Sorry, I'm attempting this for the first time.

@pawelmhm
Member Author

pawelmhm commented Apr 26, 2021

They are available as spider attributes, @doverradio. I'll add this to the documentation. E.g. if you pass the argument zipcode, it should be available in the spider as self.zipcode:

url: 'http://localhost:8000/crawl.json?spider_name=my_spider&url=http://foo&crawl_args={"zipcode":"XXXX"}&callback=parse_xxx'

spider code

def parse_xxx(self, response):
    print('zipcode is ' + self.zipcode)

Crawl args need to be passed as JSON.
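
A minimal client sketch of the same call, using the example values above (the requests library is just one convenient way to issue it); crawl_args is serialized to JSON before being put in the query string:

import json
import requests

resp = requests.get(
    "http://localhost:8000/crawl.json",
    params={
        "spider_name": "my_spider",
        "url": "http://foo",
        "crawl_args": json.dumps({"zipcode": "XXXX"}),  # must be JSON
        "callback": "parse_xxx",
    },
)
print(resp.json())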

@pawelmhm
Member Author

closed via #120
