Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per-key priorities for dictionary-like settings #1135

Closed
jdemaeyer opened this issue Apr 4, 2015 · 3 comments
Closed

Per-key priorities for dictionary-like settings #1135

jdemaeyer opened this issue Apr 4, 2015 · 3 comments

Comments

@jdemaeyer
Copy link
Contributor

@jdemaeyer jdemaeyer commented Apr 4, 2015

I'm new, so here's a quick intro: Hi! I'm Jakob :) I came upon this while working on my GSoC proposal.

During Julia's GSoC, settings priorities were introduced to allow updating settings from different places without paying attention to order. They work great for 'simple' (non-compound) settings, but there are two issues with the dictionary-like settings (e.g. DOWNLOADER_MIDDLEWARES or EXTENSIONS) mainly used to manage extensions:

  1. They are fiddly to update (and not overwrite) when honoring priorities and mutability (#1110)
  2. They only assign a single priority to the complete dictionary. As this contains no information on which keys exactly were changed with what priority, the dictionary priority is not very meaninful. It forbids updating a key still at default setting (lowest priority) with a medium priority as soon as any key has been updated with a higher priority.

This should prove especially problematic for the proposed add-on structure (SEP-021, discussed in #591). To resolve this, every key could have its own priority associated with it. @curita suggested making the compound setting variables an instance of the Settings class (and no longer an instance of dict). This could further clean up the Settings API by deprecating the _BASE settings, as the 'real' settings dictionaries (without appendix) could be used for default settings without fearing that they get lost when reading from settings.py. I've gathered some thoughts on this in my proposal (rather long, sorry, tl;dr is that it should be possible in a fully transparent, as in "backwards-compatible with no changes to the API", fashion):

As the Settings class already provides a __getitem__() method, this will introduce no API change to reading these settings.

There are currently three places where the dict-like settings are written to:

  1. When defined in scrapy/settings/default_settings.py
  2. When reading from settings.py in the Settings.setmodule() method
  3. When combining the dictionaries with their _BASE in scrapy.utils.conf.build_component_list()

Scrapy's code could be updated in the following fashion with full backwards compatibility, even for non-intended uses (such as users working directly on a _BASE dictionary):

  1. Complete Settings dictionary-like interface by implementing:

    • __setitem__(self, k, v) method that will use some default priority (maybe 'project')
    • __iter__(self) method which returns an iterator over Settings.attributes
    • update(self, custom, priority = 'project') method that behaves like dict.update() while respecting mutability and priorities. If custom is a dict object, the given priority will be used for all keys. If it is a Settings object, the already existing per-key priority values will be used. The setdict() method should become a proxy to this (more general) method
  2. Deprecate _BASE dictionaries by replacing them with empty ones (for backwards-compatibility) and moving default settings into 'real' dictionary, i.e.

    SPIDER_MIDDLEWARES = {}
    SPIDER_MIDDLEWARES_BASE = {
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        # ...
    }

    becomes

      SPIDER_MIDDLEWARES_BASE = {}
      SPIDER_MIDDLEWARES = Settings( values = {
          'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
          # ...
          }, priority = 'default' )
  3. Configuration in settings.py should be no different for the user. The Settings.setmodule() method should therefore recognise which of its attributes are themselves Settings instances, and call their respective update() methods instead of replacing them. Alternatively, this check could be done in the SettingsAttribute.set() method.

  4. Introduce a small change to the build_component_list() helper function such that it works on Settings instances instead of on dict:

    def build_component_list(base, custom):
        # ...
        # OLD:  compdict = base.copy()
        compdict = Settings(base, priority = 'default')
        # As before:
        compdict.update(custom)
        # ...
  5. For the 1.0 release, the settings, middleware managers and build_component_list() helper function could be tidied up by removing support for the deprecated _BASE settings

@curita
Copy link
Member

@curita curita commented Apr 5, 2015

Great work Jakob, thanks! I really like your implementation ideas.

Some thoughts:

Another place where you can update settings is in the class attribute custom_settings from spiders, settings defined there are populated in Spider.update_settings with a ‘spider’ priority. Actually the idea of using priorities per keyword in dict-like settings came out from here, I often end up rewriting update_settings in my spiders since I practically never want to replace all middlewares/pipelines/etc., I just want to update those defined at project level in settings.py.

I don’t particularly like using Settings objects directly in default_settings.py, this will create a circular import since default_settings.py should import scrapy.settings and scrapy.settings already imports default_settings.py. We could avoid this by importing default_settings.py in scrapy.settings only when required (i.e. putting the from . import default_settings line inside every function that uses it instead of keeping it at top level) but I don’t like this approach since it can create problems in the future, I think we should keep imports to a minimum in default_settings.py, using builtin types only, as it is right now.

I particularly like the idea of deprecating _BASE settings, one of the purposes of using settings priorities was to avoid these kinds of patterns. Not sure if I would keep those _BASE settings with an empty dict for backward compatibility, we’re modifying their value so users using them probably won’t get what they were expecting. I think supporting to set _BASE settings in settings.py, custom_settings, etc. is enough, and I’d remove those from default_settings.py.

There are some missing things to consider about the old setting and getting dictionaries usage. It makes sense that setdict calls the suggested Settings.update, but we should review what should return getdict. Not sure if returning the actual Settings instance here is a good idea, this is going to break backward compatibility since users expect a dictionary (Settings will have the same interface as dictionaries but I think that’s not enough), after all getdict tries to cast any given value to dictionary. We’ll have the same problem while getting dictionary settings using ‘get’ directly, I don’t like the idea too much since this is an also common way to retrieve those dictionaries, but it’s probably best that we return Settings instances here.

Another thing we have to make sure is that every dictionary setting allows to disable keys, otherwise we won’t be able to ‘turn off’ keys that were already set (For example, ‘scrapy.contrib.downloadermiddleware.retry.RetryMiddleware', a default downloader middleware, is added as key to DOWNLOADER_MIDDLEWARES with a default value of 500, there should be a value to indicate we want to disable it in the middleware manager). I think most dictionary settings allow to set None as value to indicate that we want to disable some key, but we should verify this.

Also we should evaluate what to do with user defined settings (those not mentioned in default_settings.py) that happen to take dictionary values. I’m not sure we should promote those to Settings instances since their intended usage could differ from those in default_settings.py (say, by actually wanting to override the whole dictionary with each ‘set’ or ‘setdict’, or not providing a value to disable keys).

@jdemaeyer
Copy link
Contributor Author

@jdemaeyer jdemaeyer commented Apr 11, 2015

Another place where you can update settings is in the class attribute |custom_settings| from spiders, settings defined there are populated in Spider.update_settings with a ‘spider’ priority.

Right! I missed that. I think with that in mind the check whether a SettingsAttribute.value is an instance of Settings should be done in SettingsAttribute.set, not in some of the 'setter' methods of Settings.

I don’t particularly like using Settings objects directly in |default_settings.py|, this will create a circular import since |default_settings.py| should import |scrapy.settings| and |scrapy.settings| already imports |default_settings.py|. We could avoid this by importing |default_settings.py| in |scrapy.settings| only when required (i.e. putting the |from . import default_settings| line inside every function that uses it instead of keeping it at top level) but I don’t like this approach since it can create problems in the future, I think we should keep imports to a minimum in |default_settings.py|, using builtin types only, as it is right now.

I agree. As the default settings are loaded with a setmodule() call in Settings.__init__() we will have to put the promotion from dict to settings somewhere in that vicinity. However, this could create a conflict when we decide to not auto-promote user-defined dicts. I'm not quite sure what the nicest way to solve this would be. I see two approaches which I both dislike:

  1. Auto-promote dicts to Settings in Settings.set(), but allow giving a pre-existing SettingsAttribute instance, then ask developers who do want to use dicts without per-key priorities to manually wrap them in SettingsAttribute
  2. Provide a 'promote_dicts' keyword to the setmodule() method, which would be False by default but True when loading the defaults. However we should probably introduce that keyword to the set and setdict method then as well, then do the promotion only in set

I particularly like the idea of deprecating _BASE settings, one of the purposes of using settings priorities was to avoid these kinds of patterns. Not sure if I would keep those _BASE settings with an empty dict for backward compatibility, we’re modifying their value so users using them probably won’t get what they were expecting. I think supporting to set _BASE settings in |settings.py|, |custom_settings|, etc. is enough, and I’d remove those from |default_settings.py|.

Sounds reasonable. As Settings.__init__() can work with values = None (in case the _BASE setting was not defined anywhere) this shouldn't even require a modifying the change in build_component_list proposed above. =)

There are some missing things to consider about the old setting and getting dictionaries usage. It makes sense that |setdict| calls the suggested |Settings.update|, but we should review what should return |getdict|. Not sure if returning the actual Settings instance here is a good idea, this is going to break backward compatibility since users expect a dictionary (Settings will have the same interface as
dictionaries but I think that’s not enough), after all |getdict| tries to cast any given value to dictionary. We’ll have the same problem while getting dictionary settings using ‘get’ directly, I don’t like the idea
too much since this is an also common way to retrieve those dictionaries, but it’s probably best that we return Settings instances here.

I also think we should return the Settings instance in get(), developers might want to work with its methods. For getdict, if we complete the dict-like interface of Settings (#1149), it can be casted to dict by using dict(mysettings) (i.e. at it is done in getdict()) with the expected result.

Another thing we have to make sure is that every dictionary setting allows to disable keys, otherwise we won’t be able to ‘turn off’ keys that were already set (For example, ‘scrapy.contrib.downloadermiddleware.retry.RetryMiddleware', a default downloader middleware, is added as key to DOWNLOADER_MIDDLEWARES with a default value of 500, there should be a value to indicate we want to disable it in the middleware manager). I think most dictionary settings
allow to set |None| as value to indicate that we want to disable some key, but we should verify this.

We should definitely check whether all builtin settings allow disabling via None. I have also included a mutability-and-priority-aware delete() method (with a __del__() proxy) in #1149.

On a side note, I think it would be nice if we could completely decouple Settings from the default settings, and then simply instantiate it with Settings(default_settings, priority = 'default') where needed (most prominently in get_project_settings()). However this seems strongly backwards-incompatible. When I tried it it gave me 228 failed tests (mainly from tests which used CrawlerProcess(Settings(...)) instead of getting settings via get_project_settings())...

@jdemaeyer
Copy link
Contributor Author

@jdemaeyer jdemaeyer commented Oct 29, 2015

Implemented via #1149 :)

@jdemaeyer jdemaeyer closed this Oct 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.