Load settings dynamically on a per-spider basis #2392


Closed · mohmad-null opened this issue Nov 10, 2016 · 3 comments · Fixed by #6038



mohmad-null commented Nov 10, 2016

Feature request.

It would be good if Scrapy had an easily accessible means of reading settings on a per-spider basis and then making them available to the spider. From my many attempts so far, all of the components for this appear, in theory, to be in place. Populating settings is already done:
https://doc.scrapy.org/en/latest/topics/settings.html#populating-the-settings - the problem is then accessing them.
Ideally this would work in a fashion that's compatible with scrapyd (so no calling process.crawl(spider, my_settings)).

Ideally: a project could have a generic project-wide settings.py file with both the standard settings and any custom ones added by the developer. Then, using a command-line argument to indicate the settings file to use, the __init__ method of the spider would override specific settings (much as custom_settings does), and these settings would then be accessible throughout the spider via self.settings in the usual way.
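
Here's a hypothetical sketch of that desired flow (the settings_file argument and the __init__-time override are illustrations of the request, not anything Scrapy supports today):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, settings_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Hypothetical: settings_file arrives as a spider argument
        # (e.g. scrapy crawl myspider -a settings_file=client_a, which
        # scrapyd can also pass), and overriding settings here is
        # exactly what isn't possible today.
        ...

    def parse(self, response):
        # Desired: the overrides would then be visible in the usual way.
        delay = self.settings.getfloat('DOWNLOAD_DELAY')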

Current Problems
custom_settings
Unfortunately, custom_settings doesn't seem to be usable for this, because it cannot be set in __init__; it must be declared as a class attribute before the spider is instantiated.
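
For reference, a minimal sketch of the difference, assuming current behaviour (Scrapy reads custom_settings before the spider is instantiated):

import scrapy

class WorkingSpider(scrapy.Spider):
    name = 'working'
    # Read by Scrapy before instantiation, so it takes effect:
    custom_settings = {'DOWNLOAD_DELAY': 2}

class BrokenSpider(scrapy.Spider):
    name = 'broken'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Too late: the crawler's settings were already built (and frozen)
        # before __init__ ran, so this assignment has no effect.
        self.custom_settings = {'DOWNLOAD_DELAY': 2}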

settings.py
Currently, even if a user is willing to use a completely different settings.py file for each spider (thereby duplicating most of it), that isn't readily possible either.

import os
from scrapy.utils.project import get_project_settings

os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'
these_settings = get_project_settings()

The above only loads the settings into the variable these_settings; they are not used by the spider and are not accessible via self.settings.

Desire for feature
Judging by Stack Overflow, this is something a lot of people want. The fact that there are so many answers, all so different, shows there isn't a particularly good way of doing it:

http://stackoverflow.com/questions/9814827/creating-a-generic-scrapy-spider
http://stackoverflow.com/questions/12996910/how-to-setup-and-launch-a-scrapy-spider-programmatically-urls-and-settings
http://stackoverflow.com/questions/35662146/dynamic-spider-generation-with-scrapy-subclass-init-error
http://stackoverflow.com/questions/40510526/how-to-load-different-settings-for-different-scrapy-spiders
http://stackoverflow.com/questions/2396529/using-one-scrapy-spider-for-several-websites

Being able to readily get allowed_domains and start_urls from such per-spider settings would also be good.


kmike (Member) commented Nov 10, 2016

To read settings in a spider, one can use the Spider.settings attribute. It doesn't work in the __init__ method, but it works e.g. in start_requests.
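
A minimal sketch, using a made-up setting name MY_SETTING:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def start_requests(self):
        # self.settings is populated by the time start_requests is called
        value = self.settings.get('MY_SETTING')  # example name, not a real setting
        self.logger.info('MY_SETTING is %s', value)
        yield from super().start_requests()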

If the goal is to change settings, it becomes more complicated. Generally, one can change settings only before other components are configured, so initialization order is important.

There is an undocumented Spider.update_settings method which receives the project-wide settings and updates them; maybe we should document it and make it public. But I'm not sure it should be the final solution.
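
A minimal sketch of overriding it, assuming its current classmethod form (it receives the crawler's settings before the spider is created):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    @classmethod
    def update_settings(cls, settings):
        # Apply custom_settings first, as the base implementation does
        super().update_settings(settings)
        # Then layer further per-spider changes on top at 'spider' priority
        settings.set('DOWNLOAD_DELAY', 5, priority='spider')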

See also the discussion at #1305 - there is a proposal to allow changing settings in Spider.__init__, though it is backwards incompatible.

There is also a PR for 'addons' - components which can change settings (#1272).


vionemc commented Nov 7, 2018

Spider.update_settings doesn't work as expected. I prefer to change the settings via the command line when starting the spider.

https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options

For example: scrapy crawl myspider -s LOG_FILE=scrapy.log
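
Settings passed with -s are applied at 'cmdline' priority, which is the highest, so they override both the project settings and a spider's custom_settings; several can be combined:

scrapy crawl myspider -s LOG_FILE=scrapy.log -s DOWNLOAD_DELAY=2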


sgstq commented Aug 25, 2020

Hi, I just faced the same problem.

My scenario is:
I run spiders from a script with the following:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

setting = set_settings(get_project_settings(), config)
process = CrawlerProcess(setting)
process.crawl(spider_name)
process.start()

where set_settings() changes the project settings depending on the passed config. This works fine until I have a spider where I need to define custom_settings and merge it with the settings defined in set_settings() (DOWNLOADER_MIDDLEWARES, for example).

Calling get_project_settings() in or before the spider's __init__() obviously doesn't work, because it reads the settings from settings.py, which are no longer the relevant ones.

I will be happy to hear any ideas, thanks.
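
For what it's worth, one possible workaround sketch: since custom_settings is applied at 'spider' priority, overrides applied from the script at the higher 'cmdline' priority survive it. Whether dict settings such as DOWNLOADER_MIDDLEWARES then merge per-key or replace wholesale may depend on the Scrapy version, so this is a sketch rather than a guaranteed fix:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# config is assumed to be a plain dict of setting overrides;
# 'cmdline' outranks the 'spider' priority used by custom_settings.
settings.setdict(config, priority='cmdline')
process = CrawlerProcess(settings)
process.crawl(spider_name)
process.start()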
