
Make it possible to update settings in __init__ or from_crawler #3663

Closed
ejulio opened this issue Mar 7, 2019 · 14 comments · Fixed by #6038

Comments

@ejulio
Contributor

ejulio commented Mar 7, 2019

This issue might be related to #1305

I noticed that settings are frozen in https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L57
However, in a given project I had a requirement to change some settings based on spider arguments. An alternative would be to write this spider as a base class and extend it with specific spiders that set the proper settings.
However, I think it would make sense to freeze the settings only after the spider and the other components have been initialized, or to provide some other entry point to configure settings based on arguments.
The other option is to use -s arguments, but in my case I was changing the FEED_EXPORT_FIELDS setting (https://docs.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS).

Any thoughts here?

@GeorgeA92
Contributor

Usage of the -s argument with the list-based FEED_EXPORT_FIELDS setting:
scrapy crawl quotes -s FEED_EXPORT_FIELDS=author,quote -o data_without_tags.csv

Setting a list setting like FEED_EXPORT_FIELDS from the command line works for all settings that are read with the BaseSettings.getlist method:

def getlist(self, name, default=None):
    """
    Get a setting value as a list. If the setting original type is a list, a
    copy of it will be returned. If it's a string it will be split by ",".

    For example, settings populated through environment variables set to
    ``'one,two'`` will return a list ['one', 'two'] when using this method.

    :param name: the setting name
    :type name: string

    :param default: the value to return if no setting is found
    :type default: any
    """
    value = self.get(name, default or [])
    if isinstance(value, six.string_types):
        value = value.split(',')
    return list(value)
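The splitting behaviour can be seen in isolation with a simplified, dependency-free version of getlist (a sketch using plain str instead of six.string_types; the real method lives on BaseSettings):

```python
def getlist(settings, name, default=None):
    # Simplified BaseSettings.getlist: comma-split strings, copy lists
    value = settings.get(name, default or [])
    if isinstance(value, str):
        value = value.split(',')
    return list(value)

# A string value, as produced by -s FEED_EXPORT_FIELDS=author,quote
settings = {"FEED_EXPORT_FIELDS": "author,quote"}
fields = getlist(settings, "FEED_EXPORT_FIELDS")
print(fields)  # ['author', 'quote']
```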

I noticed that the BaseSettings.freeze method does only one thing:

def freeze(self):
    """
    Disable further changes to the current settings.

    After calling this method, the present state of the settings will become
    immutable. Trying to change values through the :meth:`~set` method and
    its variants won't be possible and will be alerted.
    """
    self.frozen = True

The frozen attribute is used in the _assert_mutability method, which is what actually prevents any changes to the settings:

def _assert_mutability(self):
    if self.frozen:
        raise TypeError("Trying to modify an immutable Settings object")

But if we set the frozen attribute back to False, the settings become mutable again, and the application can modify them through any of the methods that call _assert_mutability:
set, setmodule, update, delete, __delitem__
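The whole mechanism can be reproduced in a few lines of plain Python (a minimal stand-in for BaseSettings, just to illustrate the frozen flag; not the real Scrapy class):

```python
class FreezableSettings:
    """Minimal stand-in illustrating the frozen/_assert_mutability pattern."""

    def __init__(self):
        self._values = {}
        self.frozen = False

    def _assert_mutability(self):
        if self.frozen:
            raise TypeError("Trying to modify an immutable Settings object")

    def set(self, name, value):
        self._assert_mutability()
        self._values[name] = value

    def get(self, name, default=None):
        return self._values.get(name, default)

    def freeze(self):
        self.frozen = True


s = FreezableSettings()
s.set("DOWNLOAD_DELAY", 2)
s.freeze()
# s.set(...) would now raise TypeError, but flipping the flag back works:
s.frozen = False
s.set("DOWNLOAD_DELAY", 5)
print(s.get("DOWNLOAD_DELAY"))  # 5
```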

Spider code that updates settings inside from_crawler would look like this:

class SomeSpider(scrapy.Spider):
    ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        if crawler.settings.frozen:
            # Temporarily unfreeze the settings, change them, then refreeze
            crawler.settings.frozen = False
            crawler.settings.set("SETTING", "NEW_VALUE")
            crawler.settings.freeze()
        spider = cls(*args, **kwargs)
        spider._set_crawler(crawler)
        return spider

@ejulio
Contributor Author

ejulio commented Mar 11, 2019

@GeorgeA92 , thanks for your reply.
I agree that it is possible to use -s, but my main concern is that I'd be writing the field configuration outside the spider. By keeping this setting in the code, it sits side by side with my items and is less error-prone. Also, with -s I have to type out the fields every time I start a job, so a mistyped field name could cause errors. If there is no other way, this would be my preferred approach.

I also agree that your from_crawler solution works, but that way I'm hacking the framework.
It can also lead to unwanted behavior, and if something changes in the future, it might force code changes just because I was relying on undocumented behavior.

My main reasoning here is to check if it is the scope of Scrapy to allow this kind of configuration or if this behavior (freeze the settings before the spider is created) is expected for some reason.

@asciidiego

I am facing a similar issue. I have to change the settings of a specific spider based on information read from a JSON file that serves as a configuration file for the spider's behaviour. The problem is, I have to change the setting according to the spider's arguments as well. Is there a better way to address this problem? I do not want to create a separate spider just for this case; more code to maintain means more problems.

@ejulio
Contributor Author

ejulio commented May 3, 2019

Hey, I was looking to work on this issue, but I got in a kind of deadlock 😛
Looking for ideas:

The spider is instantiated here: https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L84
However, I cannot freeze the settings only at this point, because all the other components have already been instantiated.
So that could cause some confusion: a setting used by a middleware could be updated after the middleware has already started.

I thought about moving the spider instantiation to https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L41 and deprecating update_settings.
However, there I won't have access to the input arguments.
It would be possible if I changed the runner to receive the arguments when it is instantiated instead of when crawl is called.

@Gallaecio
Member

I’m quite unsure on how to best address this. Maybe we could allow defining a new class method on spiders that has access to both spider arguments and settings?
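One possible shape for such an entry point, sketched with a stand-in Settings class so it runs standalone (update_settings_with_args, the Settings stub, and QuotesSpider are hypothetical names for this sketch, not an existing Scrapy API):

```python
class Settings(dict):
    """Tiny stand-in for scrapy.settings.Settings, just for this sketch."""

    def set(self, name, value):
        self[name] = value


class BaseSpider:
    custom_settings = None

    @classmethod
    def update_settings_with_args(cls, settings, *args, **kwargs):
        # Hypothetical hook: runs before the settings are frozen and
        # receives the spider arguments alongside the settings.
        settings.update(cls.custom_settings or {})


class QuotesSpider(BaseSpider):
    @classmethod
    def update_settings_with_args(cls, settings, *args, **kwargs):
        super().update_settings_with_args(settings, *args, **kwargs)
        if "fields" in kwargs:
            settings.set("FEED_EXPORT_FIELDS", kwargs["fields"].split(","))


settings = Settings()
QuotesSpider.update_settings_with_args(settings, fields="author,quote")
print(settings["FEED_EXPORT_FIELDS"])  # ['author', 'quote']
```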

@ejulio
Contributor Author

ejulio commented May 7, 2019

@Gallaecio, probably this is the best way to go.
Add cli_args to __init__ and use them in update_settings.
We just need to make sure it won't break backwards compatibility.

@GeorgeA92
Contributor

The spider is instantiated here https://github.com/scrapy/scrapy/blob/master/scrapy/crawler.py#L84
However, I cannot freeze the settings only at this point because all other components were already instantiated.

As we can see in crawler.py, after self.spidercls.update_settings(self.settings) (the method which reads the custom_settings spider class attribute) and before spider instantiation, the StatsCollector and the Scrapy extensions are instantiated.

But the rest of the Scrapy components (downloader, downloader middlewares, item pipelines, spider middlewares, etc.) are instantiated after the spider, in self.engine = self._create_engine().

Some project-level components (SpiderLoader, logging, ...) are instantiated before self.spidercls.update_settings(self.settings), so their settings cannot be changed via the spider's custom_settings attribute.

@joeharrison714

joeharrison714 commented Jan 4, 2021

I need to modify settings based on a spider argument and I found this thread. Is there any new information about how this could be accomplished?

@GeorgeA92
Contributor

@joeharrison714 #4196

@rpocase

rpocase commented Aug 24, 2021

A possible workaround for this use case, depending on your run model, is to override settings using Crawler. In my case, I'm executing via a scripted workflow and may need to override settings at run time. Since CrawlerProcess.crawl can take a Crawler object, it is pretty seamless to absorb the crawler settings and override the settings I care about.

import copy

from scrapy.crawler import Crawler, CrawlerProcess

def run_crawl(process: CrawlerProcess):
    # Copy the process settings so they can be overridden per crawl
    s = copy.deepcopy(process.settings)
    s["MY_SETTING"] = "overridden"
    c = Crawler(MySpider, settings=s)
    process.crawl(c)
    return process.join()

@iamumairayub

Hi, see my https://gist.github.com/iamumairayub/452432a2e78255de890e5e3d925efaa4 on how to take settings values from the command line or even from a DB.

@Dhruv97Sharma

One possible solution could be to create a few class variables, use them in the custom_settings passed to the spider, and then update the values of those class variables in the spider's __init__, so that when the custom settings are applied they use the updated values passed to __init__.

Example:

class NameOfYourSpider(scrapy.Spider):
    name = "spider_name"
    var1 = None

    custom_settings = {
        'ROBOTSTXT_OBEY': var1 if var1 is not None else False,
    }

    # ... other class variables

    def __init__(self, obey_robotstxt, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Update the class variable backing the custom setting
        type(self).var1 = obey_robotstxt

@SardarDelha

SardarDelha commented Mar 3, 2023

According to the scrapyd docs, to pass Scrapy settings you need to set setting=DOWNLOAD_DELAY=2 in the query to scrapyd's schedule endpoint. As far as I know, this is the only supported way to pass settings through scrapyd.

@webee

webee commented Sep 12, 2023

Use -s CONFIGS=/path/to/config/file to specify custom_settings and other custom configuration:

    @classmethod
    def update_settings(cls, settings):
        configs = {}
        configs_file = settings.get("CONFIGS")
        if configs_file:
            # data.load is a project-specific helper that parses the config file
            configs = data.load(configs_file)

        cls.custom_settings = cls.custom_settings or {}
        for k, v in configs.get("custom_settings", {}).items():
            cls.custom_settings[k] = v

        cls.custom_settings["CONFIGS"] = configs

        super().update_settings(settings)

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

        self.configs = self.custom_settings["CONFIGS"]

Then settings can be set/overridden via custom_settings in the config file, and we can also refer to self.configs in the spider.
So that's it.
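For completeness, the data.load helper above is not part of Scrapy; a minimal version, assuming the config file is JSON, could look like this:

```python
import json

def load(path):
    # Parse a JSON config file such as:
    # {"custom_settings": {"DOWNLOAD_DELAY": 2}, "other": "stuff"}
    with open(path) as f:
        return json.load(f)
```

Any format works (YAML, TOML, ...) as long as the helper returns a dict with an optional custom_settings key.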
