From b167d9abed04f1029426ac932fe251ff6da038b5 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Fri, 19 Jun 2015 15:01:24 +0200 Subject: [PATCH 01/13] Introduce BaseSettings with full dictionary interface --- docs/topics/api.rst | 75 +++++++++--- scrapy/settings/__init__.py | 89 ++++++++++++-- tests/test_cmdline/__init__.py | 14 +++ tests/test_cmdline/extensions.py | 5 + tests/test_settings/__init__.py | 155 +++++++++++++++++++----- tests/test_settings/default_settings.py | 3 + 6 files changed, 278 insertions(+), 63 deletions(-) diff --git a/docs/topics/api.rst b/docs/topics/api.rst index f54341eb888..923bd80b0c6 100644 --- a/docs/topics/api.rst +++ b/docs/topics/api.rst @@ -140,26 +140,41 @@ Settings API For a detailed explanation on each settings sources, see: :ref:`topics-settings`. +.. function:: get_settings_priority(priority) + + Small helper function that looks up a given string priority in the + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` dictionary and returns its + numerical value, or directly returns a given numerical priority. + .. class:: Settings(values={}, priority='project') This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization. - After instantiation of this class, the new object will have the global - default settings described on :ref:`topics-settings-ref` already - populated. + It is a direct subclass and supports all methods of + :class:`~scrapy.settings.BaseSettings`. Additionally, after instantiation + of this class, the new object will have the global default settings + described on :ref:`topics-settings-ref` already populated. + +.. class:: BaseSettings(values={}, priority='project') - Additional values can be passed on initialization with the ``values`` - argument, and they would take the ``priority`` level. If the latter + Instances of this class behave like dictionaries, but store priorities + along with their ``(key, value)`` pairs, and can be frozen (i.e. marked + immutable). + + Key-value entries can be passed on initialization with the ``values`` + argument, and they would take the ``priority`` level (unless ``values`` is + already an instance of :class:`~scrapy.settings.BaseSettings`, in which + case the existing priority levels will be kept). If the ``priority`` argument is a string, the priority name will be looked up in - :attr:`~scrapy.settings.SETTINGS_PRIORITIES`. Otherwise, a expecific - integer should be provided. + :attr:`~scrapy.settings.SETTINGS_PRIORITIES`. Otherwise, a specific integer + should be provided. Once the object is created, new settings can be loaded or updated with the - :meth:`~scrapy.settings.Settings.set` method, and can be accessed with the - square bracket notation of dictionaries, or with the - :meth:`~scrapy.settings.Settings.get` method of the instance and its value - conversion variants. When requesting a stored key, the value with the + :meth:`~scrapy.settings.BaseSettings.set` method, and can be accessed with + the square bracket notation of dictionaries, or with the + :meth:`~scrapy.settings.BaseSettings.get` method of the instance and its + value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved. .. method:: set(name, value, priority='project') @@ -180,16 +195,23 @@ Settings API :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer :type priority: string or int - .. method:: setdict(values, priority='project') + .. 
method:: update(values, priority='project') Store key/value pairs with a given priority. This is a helper function that calls - :meth:`~scrapy.settings.Settings.set` for every item of ``values`` + :meth:`~scrapy.settings.BaseSettings.set` for every item of ``values`` with the provided ``priority``. + If ``values`` is a string, it is assumed to be JSON-encoded and parsed + into a dict with ``json.loads()`` first. If it is a + :class:`~scrapy.settings.BaseSettings` instance, the per-key priorities + will be used and the ``priority`` parameter ignored. This allows + inserting/updating settings with different priorities with a single + command. + :param values: the settings names and values - :type values: dict + :type values: dict or string or :class:`~scrapy.settings.BaseSettings` :param priority: the priority of the settings. Should be a key of :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer @@ -200,7 +222,7 @@ Settings API Store settings from a module with a given priority. This is a helper function that calls - :meth:`~scrapy.settings.Settings.set` for every globally declared + :meth:`~scrapy.settings.BaseSettings.set` for every globally declared uppercase variable of ``module`` with the provided ``priority``. :param module: the module or the path of the module @@ -272,8 +294,12 @@ Settings API .. method:: getdict(name, default=None) Get a setting value as a dictionary. If the setting original type is a - dictionary, a copy of it will be returned. If it's a string it will - evaluated as a json dictionary. + dictionary, a copy of it will be returned. If it is a string it will be + evaluated as a JSON dictionary. In the case that it is a + :class:`~scrapy.settings.BaseSettings` instance itself, it will be + converted to a dictionary, containing all its current settings values + as they would be returned by :meth:`~scrapy.settings.BaseSettings.get`, + and losing all information about priority and mutability. :param name: the setting name :type name: string @@ -305,6 +331,21 @@ Settings API Alias for a :meth:`~freeze` call in the object returned by :meth:`copy` + .. method:: getpriority(name) + + Return the current numerical priority value of a setting, or ``None`` if + the given ``name`` does not exist. + + :param name: the setting name + :type name: string + + .. method:: maxpriority() + + Return the numerical value of the highest priority present throughout + all settings, or the numerical value for ``default`` from + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` if there are no settings + stored. + .. 
_topics-api-spiderloader: SpiderLoader API diff --git a/scrapy/settings/__init__.py b/scrapy/settings/__init__.py index af0d0dff199..fa7fa317893 100644 --- a/scrapy/settings/__init__.py +++ b/scrapy/settings/__init__.py @@ -2,7 +2,7 @@ import json import copy import warnings -from collections import MutableMapping +from collections import Mapping, MutableMapping from importlib import import_module from scrapy.utils.deprecate import create_deprecated_class @@ -19,6 +19,12 @@ 'cmdline': 40, } +def get_settings_priority(priority): + if isinstance(priority, six.string_types): + return SETTINGS_PRIORITIES[priority] + else: + return priority + class SettingsAttribute(object): @@ -45,21 +51,22 @@ def __str__(self): __repr__ = __str__ -class Settings(object): +class BaseSettings(MutableMapping): def __init__(self, values=None, priority='project'): self.frozen = False self.attributes = {} - self.setmodule(default_settings, priority='default') - if values is not None: - self.setdict(values, priority) + self.update(values, priority) def __getitem__(self, opt_name): value = None - if opt_name in self.attributes: + if opt_name in self: value = self.attributes[opt_name].value return value + def __contains__(self, name): + return name in self.attributes + def get(self, name, default=None): return self[name] if self[name] is not None else default @@ -88,19 +95,34 @@ def getdict(self, name, default=None): value = json.loads(value) return dict(value) + def getpriority(self, name): + prio = None + if name in self: + prio = self.attributes[name].priority + return prio + + def maxpriority(self): + if len(self) > 0: + return max(self.getpriority(name) for name in self) + else: + return get_settings_priority('default') + + def __setitem__(self, name, value): + self.set(name, value) + def set(self, name, value, priority='project'): self._assert_mutability() - if isinstance(priority, six.string_types): - priority = SETTINGS_PRIORITIES[priority] - if name not in self.attributes: - self.attributes[name] = SettingsAttribute(value, priority) + priority = get_settings_priority(priority) + if name not in self: + if isinstance(value, SettingsAttribute): + self.attributes[name] = value + else: + self.attributes[name] = SettingsAttribute(value, priority) else: self.attributes[name].set(value, priority) def setdict(self, values, priority='project'): - self._assert_mutability() - for name, value in six.iteritems(values): - self.set(name, value, priority) + self.update(values, priority) def setmodule(self, module, priority='project'): self._assert_mutability() @@ -110,6 +132,28 @@ def setmodule(self, module, priority='project'): if key.isupper(): self.set(key, getattr(module, key), priority) + def update(self, values, priority='project'): + self._assert_mutability() + if isinstance(values, six.string_types): + values = json.loads(values) + if values is not None: + if isinstance(values, BaseSettings): + for name, value in six.iteritems(values): + self.set(name, value, values.getpriority(name)) + else: + for name, value in six.iteritems(values): + self.set(name, value, priority) + + def delete(self, name, priority='project'): + self._assert_mutability() + priority = get_settings_priority(priority) + if priority >= self.getpriority(name): + del self.attributes[name] + + def __delitem__(self, name): + self._assert_mutability() + del self.attributes[name] + def _assert_mutability(self): if self.frozen: raise TypeError("Trying to modify an immutable Settings object") @@ -125,6 +169,17 @@ def frozencopy(self): copy.freeze() return 
copy + def __iter__(self): + return iter(self.attributes) + + def __len__(self): + return len(self.attributes) + + def __str__(self): + return str(self.attributes) + + __repr__ = __str__ + @property def overrides(self): warnings.warn("`Settings.overrides` attribute is deprecated and won't " @@ -174,6 +229,14 @@ def __iter__(self, k, v): return iter(self.o) +class Settings(BaseSettings): + + def __init__(self, values=None, priority='project'): + super(Settings, self).__init__() + self.setmodule(default_settings, 'default') + self.update(values, priority) + + class CrawlerSettings(Settings): def __init__(self, settings_module=None, **kw): diff --git a/tests/test_cmdline/__init__.py b/tests/test_cmdline/__init__.py index 1e2905e9582..5192fb0fa4c 100644 --- a/tests/test_cmdline/__init__.py +++ b/tests/test_cmdline/__init__.py @@ -1,4 +1,5 @@ import os +import json import sys import shutil import pstats @@ -54,3 +55,16 @@ def test_profiling(self): self.assertIn('tottime', stats) finally: shutil.rmtree(path) + + def test_override_dict_settings(self): + settingsstr = self._execute('settings', '--get', 'EXTENSIONS', '-s', + ('EXTENSIONS={"tests.test_cmdline.extensions.TestExtension": ' + '100, "tests.test_cmdline.extensions.DummyExtension": 200}')) + # XXX: There's gotta be a smarter way to do this... + self.assertNotIn("...", settingsstr) + for char in ("'", "<", ">", 'u"'): + settingsstr = settingsstr.replace(char, '"') + settingsdict = json.loads(settingsstr) + self.assertIn('tests.test_cmdline.extensions.DummyExtension', settingsdict) + self.assertIn('value=200', settingsdict['tests.test_cmdline.extensions.DummyExtension']) + self.assertIn('value=100', settingsdict['tests.test_cmdline.extensions.TestExtension']) diff --git a/tests/test_cmdline/extensions.py b/tests/test_cmdline/extensions.py index 4d347966a6a..72867eb560c 100644 --- a/tests/test_cmdline/extensions.py +++ b/tests/test_cmdline/extensions.py @@ -8,3 +8,8 @@ def __init__(self, settings): @classmethod def from_crawler(cls, crawler): return cls(crawler.settings) + + +class DummyExtension(object): + pass + diff --git a/tests/test_settings/__init__.py b/tests/test_settings/__init__.py index 54b834aa0dc..a473f3c3f91 100644 --- a/tests/test_settings/__init__.py +++ b/tests/test_settings/__init__.py @@ -2,7 +2,8 @@ import unittest import warnings -from scrapy.settings import Settings, SettingsAttribute, CrawlerSettings +from scrapy.settings import (BaseSettings, Settings, SettingsAttribute, + CrawlerSettings) from tests import mock from . 
import default_settings @@ -33,35 +34,16 @@ class SettingsTest(unittest.TestCase): if six.PY3: assertItemsEqual = unittest.TestCase.assertCountEqual - def setUp(self): - self.settings = Settings() - - @mock.patch.dict('scrapy.settings.SETTINGS_PRIORITIES', {'default': 10}) - @mock.patch('scrapy.settings.default_settings', default_settings) - def test_initial_defaults(self): - settings = Settings() - self.assertEqual(len(settings.attributes), 1) - self.assertIn('TEST_DEFAULT', settings.attributes) - attr = settings.attributes['TEST_DEFAULT'] - self.assertIsInstance(attr, SettingsAttribute) - self.assertEqual(attr.value, 'defvalue') - self.assertEqual(attr.priority, 10) +class BaseSettingsTest(unittest.TestCase): - @mock.patch.dict('scrapy.settings.SETTINGS_PRIORITIES', {}) - @mock.patch('scrapy.settings.default_settings', {}) - def test_initial_values(self): - settings = Settings({'TEST_OPTION': 'value'}, 10) - self.assertEqual(len(settings.attributes), 1) - self.assertIn('TEST_OPTION', settings.attributes) + if six.PY3: + assertItemsEqual = unittest.TestCase.assertCountEqual - attr = settings.attributes['TEST_OPTION'] - self.assertIsInstance(attr, SettingsAttribute) - self.assertEqual(attr.value, 'value') - self.assertEqual(attr.priority, 10) + def setUp(self): + self.settings = BaseSettings() def test_set_new_attribute(self): - self.settings.attributes = {} self.settings.set('TEST_OPTION', 'value', 0) self.assertIn('TEST_OPTION', self.settings.attributes) @@ -70,6 +52,12 @@ def test_set_new_attribute(self): self.assertEqual(attr.value, 'value') self.assertEqual(attr.priority, 0) + def test_set_settingsattribute(self): + myattr = SettingsAttribute(0, 30) # Note priority 30 + self.settings.set('TEST_ATTR', myattr, 10) + self.assertEqual(self.settings.get('TEST_ATTR'), 0) + self.assertEqual(self.settings.getpriority('TEST_ATTR'), 30) + def test_set_instance_identity_on_update(self): attr = SettingsAttribute('value', 0) self.settings.attributes = {'TEST_OPTION': attr} @@ -79,13 +67,11 @@ def test_set_instance_identity_on_update(self): self.assertIs(attr, self.settings.attributes['TEST_OPTION']) def test_set_calls_settings_attributes_methods_on_update(self): - with mock.patch.object(SettingsAttribute, '__setattr__') as mock_setattr, \ - mock.patch.object(SettingsAttribute, 'set') as mock_set: + attr = SettingsAttribute('value', 10) + with mock.patch.object(attr, '__setattr__') as mock_setattr, \ + mock.patch.object(attr, 'set') as mock_set: - attr = SettingsAttribute('value', 10) self.settings.attributes = {'TEST_OPTION': attr} - mock_set.reset_mock() - mock_setattr.reset_mock() for priority in (0, 10, 20): self.settings.set('TEST_OPTION', 'othervalue', priority) @@ -94,6 +80,19 @@ def test_set_calls_settings_attributes_methods_on_update(self): mock_set.reset_mock() mock_setattr.reset_mock() + def test_setitem(self): + settings = BaseSettings() + settings.set('key', 'a', 'default') + settings['key'] = 'b' + self.assertEqual(settings['key'], 'b') + self.assertEqual(settings.getpriority('key'), 20) + settings['key'] = 'c' + self.assertEqual(settings['key'], 'c') + settings['key2'] = 'x' + self.assertIn('key2', settings) + self.assertEqual(settings['key2'], 'x') + self.assertEqual(settings.getpriority('key2'), 20) + def test_setdict_alias(self): with mock.patch.object(self.settings, 'set') as mock_set: self.settings.setdict({'TEST_1': 'value1', 'TEST_2': 'value2'}, 10) @@ -118,7 +117,8 @@ class ModuleMock(): def test_setmodule_alias(self): with mock.patch.object(self.settings, 'set') as mock_set: 
self.settings.setmodule(default_settings, 10) - mock_set.assert_called_with('TEST_DEFAULT', 'defvalue', 10) + mock_set.assert_any_call('TEST_DEFAULT', 'defvalue', 10) + mock_set.assert_any_call('TEST_DICT', {'key': 'val'}, 10) def test_setmodule_by_path(self): self.settings.attributes = {} @@ -132,11 +132,55 @@ def test_setmodule_by_path(self): self.assertItemsEqual(six.iterkeys(self.settings.attributes), six.iterkeys(ctrl_attributes)) - for attr, ctrl_attr in zip(six.itervalues(self.settings.attributes), - six.itervalues(ctrl_attributes)): + for key in six.iterkeys(ctrl_attributes): + attr = self.settings.attributes[key] + ctrl_attr = ctrl_attributes[key] self.assertEqual(attr.value, ctrl_attr.value) self.assertEqual(attr.priority, ctrl_attr.priority) + def test_update(self): + settings = BaseSettings({'key_lowprio': 0}, priority=0) + settings.set('key_highprio', 10, priority=50) + custom_settings = BaseSettings({'key_lowprio': 1, 'key_highprio': 11}, priority=30) + custom_settings.set('newkey_one', None, priority=50) + custom_dict = {'key_lowprio': 2, 'key_highprio': 12, 'newkey_two': None} + + settings.update(custom_dict, priority=20) + self.assertEqual(settings['key_lowprio'], 2) + self.assertEqual(settings.getpriority('key_lowprio'), 20) + self.assertEqual(settings['key_highprio'], 10) + self.assertIn('newkey_two', settings) + self.assertEqual(settings.getpriority('newkey_two'), 20) + + settings.update(custom_settings) + self.assertEqual(settings['key_lowprio'], 1) + self.assertEqual(settings.getpriority('key_lowprio'), 30) + self.assertEqual(settings['key_highprio'], 10) + self.assertIn('newkey_one', settings) + self.assertEqual(settings.getpriority('newkey_one'), 50) + + settings.update({'key_lowprio': 3}, priority=20) + self.assertEqual(settings['key_lowprio'], 1) + + def test_update_jsonstring(self): + settings = BaseSettings({'number': 0, 'dict': BaseSettings({'key': 'val'})}) + settings.update('{"number": 1, "newnumber": 2}') + self.assertEqual(settings['number'], 1) + self.assertEqual(settings['newnumber'], 2) + settings.set("dict", '{"key": "newval", "newkey": "newval2"}') + self.assertEqual(settings['dict']['key'], "newval") + self.assertEqual(settings['dict']['newkey'], "newval2") + + def test_delete(self): + settings = BaseSettings({'key': None}) + settings.set('key_highprio', None, priority=50) + settings.delete('key') + settings.delete('key_highprio') + self.assertNotIn('key', settings) + self.assertIn('key_highprio', settings) + del settings['key_highprio'] + self.assertNotIn('key_highprio', settings) + def test_get(self): test_configuration = { 'TEST_ENABLED1': '1', @@ -190,6 +234,18 @@ def test_get(self): self.assertEqual(settings.getdict('TEST_DICT3', {'key1': 5}), {'key1': 5}) self.assertRaises(ValueError, settings.getdict, 'TEST_LIST1') + def test_getpriority(self): + settings = BaseSettings({'key': 'value'}, priority=99) + self.assertEqual(settings.getpriority('key'), 99) + self.assertEqual(settings.getpriority('nonexistentkey'), None) + + def test_maxpriority(self): + # Empty settings should return 'default' + self.assertEqual(self.settings.maxpriority(), 0) + self.settings.set('A', 0, 10) + self.settings.set('B', 0, 30) + self.assertEqual(self.settings.maxpriority(), 30) + def test_copy(self): values = { 'TEST_BOOL': True, @@ -254,6 +310,39 @@ def test_deprecated_attribute_defaults(self): self.assertIn('BAR', self.settings.defaults) +class SettingsTest(unittest.TestCase): + + if six.PY3: + assertItemsEqual = unittest.TestCase.assertCountEqual + + def setUp(self): 
+ self.settings = Settings() + + @mock.patch.dict('scrapy.settings.SETTINGS_PRIORITIES', {'default': 10}) + @mock.patch('scrapy.settings.default_settings', default_settings) + def test_initial_defaults(self): + settings = Settings() + self.assertEqual(len(settings.attributes), 2) + self.assertIn('TEST_DEFAULT', settings.attributes) + + attr = settings.attributes['TEST_DEFAULT'] + self.assertIsInstance(attr, SettingsAttribute) + self.assertEqual(attr.value, 'defvalue') + self.assertEqual(attr.priority, 10) + + @mock.patch.dict('scrapy.settings.SETTINGS_PRIORITIES', {}) + @mock.patch('scrapy.settings.default_settings', {}) + def test_initial_values(self): + settings = Settings({'TEST_OPTION': 'value'}, 10) + self.assertEqual(len(settings.attributes), 1) + self.assertIn('TEST_OPTION', settings.attributes) + + attr = settings.attributes['TEST_OPTION'] + self.assertIsInstance(attr, SettingsAttribute) + self.assertEqual(attr.value, 'value') + self.assertEqual(attr.priority, 10) + + class CrawlerSettingsTest(unittest.TestCase): def test_deprecated_crawlersettings(self): diff --git a/tests/test_settings/default_settings.py b/tests/test_settings/default_settings.py index 23005d4c6e4..c24b5a9b9c6 100644 --- a/tests/test_settings/default_settings.py +++ b/tests/test_settings/default_settings.py @@ -1,2 +1,5 @@ TEST_DEFAULT = 'defvalue' + +TEST_DICT = {'key': 'val'} + From 2e6f5ac7b67c6171e3f407e619da9357d97b3193 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Fri, 19 Jun 2015 15:09:36 +0200 Subject: [PATCH 02/13] Deprecate _BASE settings, unify _BASE backwards-compatibility --- docs/topics/downloader-middleware.rst | 24 ++- docs/topics/extensions.rst | 20 ++- docs/topics/feed-exports.rst | 41 +++-- docs/topics/settings.rst | 142 +++++++----------- docs/topics/spider-middleware.rst | 24 ++- scrapy/commands/check.py | 5 +- scrapy/commands/crawl.py | 8 +- scrapy/commands/runspider.py | 8 +- scrapy/core/downloader/handlers/__init__.py | 8 +- scrapy/core/downloader/middleware.py | 3 +- scrapy/core/spidermw.py | 3 +- .../downloadermiddlewares/defaultheaders.py | 5 +- scrapy/extension.py | 3 +- scrapy/extensions/feedexport.py | 4 +- scrapy/pipelines/__init__.py | 10 +- scrapy/settings/__init__.py | 41 ++++- scrapy/settings/default_settings.py | 25 +-- scrapy/utils/conf.py | 41 +++-- tests/test_settings/__init__.py | 49 +++++- tests/test_utils_conf.py | 43 ++++-- 20 files changed, 268 insertions(+), 239 deletions(-) diff --git a/docs/topics/downloader-middleware.rst b/docs/topics/downloader-middleware.rst index 6d986bbf761..1e87f7f44db 100644 --- a/docs/topics/downloader-middleware.rst +++ b/docs/topics/downloader-middleware.rst @@ -23,22 +23,20 @@ Here's an example:: 'myproject.middlewares.CustomDownloaderMiddleware': 543, } -The :setting:`DOWNLOADER_MIDDLEWARES` setting is merged with the -:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting defined in Scrapy (and not meant to -be overridden) and then sorted by order to get the final sorted list of enabled -middlewares: the first middleware is the one closer to the engine and the last -is the one closer to the downloader. - -To decide which order to assign to your middleware see the -:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting and pick a value according to +The specified :setting:`DOWNLOADER_MIDDLEWARES` setting is merged with the +default one (i.e. it does not overwrite it) and then sorted by order to get the +final sorted list of enabled middlewares: the first middleware is the one +closer to the engine and the last is the one closer to the downloader. 
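
As a rough sketch of the merge-and-sort behaviour described above (the project
middleware path below is only a placeholder, and both dicts are trimmed for
brevity)::

    defaults = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    }
    project = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disabled
    }
    merged = dict(defaults, **project)
    enabled = sorted((path for path, order in merged.items() if order is not None),
                     key=merged.get)
    # ['scrapy.downloadermiddlewares.retry.RetryMiddleware',
    #  'myproject.middlewares.CustomDownloaderMiddleware']

The entry assigned ``None`` drops out entirely, and the remaining components end
up ordered from the engine side towards the downloader side.
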
+ +To decide which order to assign to your middleware see the default +:setting:`DOWNLOADER_MIDDLEWARES` setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied. -If you want to disable a built-in middleware (the ones defined in -:setting:`DOWNLOADER_MIDDLEWARES_BASE` and enabled by default) you must define it -in your project's :setting:`DOWNLOADER_MIDDLEWARES` setting and assign `None` -as its value. For example, if you want to disable the user-agent middleware:: +If you want to disable a built-in middleware you must define it in your +project's :setting:`DOWNLOADER_MIDDLEWARES` setting and assign ``None`` as its +value. For example, if you want to disable the user-agent middleware:: DOWNLOADER_MIDDLEWARES = { 'myproject.middlewares.CustomDownloaderMiddleware': 543, @@ -162,7 +160,7 @@ middleware, see the :ref:`downloader middleware usage guide `. For a list of the components enabled by default (and their orders) see the -:setting:`DOWNLOADER_MIDDLEWARES_BASE` setting. +:setting:`DOWNLOADER_MIDDLEWARES` setting. .. _cookies-mw: diff --git a/docs/topics/extensions.rst b/docs/topics/extensions.rst index fb5220e9df4..a71b8bcee3b 100644 --- a/docs/topics/extensions.rst +++ b/docs/topics/extensions.rst @@ -42,17 +42,15 @@ by a string: the full Python path to the extension's class name. For example:: As you can see, the :setting:`EXTENSIONS` setting is a dict where the keys are the extension paths, and their values are the orders, which define the -extension *loading* order. Extensions orders are not as important as middleware -orders though, and they are typically irrelevant, ie. it doesn't matter in -which order the extensions are loaded because they don't depend on each other -[1]. +extension *loading* order. The specified :setting:`EXTENSIONS` setting is merged +with the default one (i.e. it does not overwrite it) and then sorted by order +to get the final sorted list of enabled extensions. -However, this feature can be exploited if you need to add an extension which -depends on other extensions already loaded. - -[1] This is is why the :setting:`EXTENSIONS_BASE` setting in Scrapy (which -contains all built-in extensions enabled by default) defines all the extensions -with the same order (``500``). +As extensions typically do not depend on each other, their loading order is +irrelevant in most cases. This is why the default :setting:`EXTENSIONS` setting +defines all extensions with the same order (``500``). However, this feature can +be exploited if you need to add an extension which depends on other extensions +already loaded. Available, enabled and disabled extensions ========================================== @@ -65,7 +63,7 @@ Disabling an extension ====================== In order to disable an extension that comes enabled by default (ie. those -included in the :setting:`EXTENSIONS_BASE` setting) you must set its order to +included in the default :setting:`EXTENSIONS` setting) you must set its order to ``None``. For example:: EXTENSIONS = { diff --git a/docs/topics/feed-exports.rst b/docs/topics/feed-exports.rst index d9444e34ae2..d8b8da166bb 100644 --- a/docs/topics/feed-exports.rst +++ b/docs/topics/feed-exports.rst @@ -265,16 +265,6 @@ Whether to export empty feeds (ie. feeds with no items). 
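
To illustrate the extension-ordering note above: when one extension really does
have to be set up after another, the order values can express that, since
components with lower orders are loaded first. Both paths below are purely
hypothetical::

    EXTENSIONS = {
        'myproject.extensions.MetricsCollector': 500,  # hypothetical
        'myproject.extensions.MetricsReporter': 600,   # hypothetical, loaded afterwards
    }
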
FEED_STORAGES ------------- -Default:: ``{}`` - -A dict containing additional feed storage backends supported by your project. -The keys are URI schemes and the values are paths to storage classes. - -.. setting:: FEED_STORAGES_BASE - -FEED_STORAGES_BASE ------------------- - Default:: { @@ -285,36 +275,39 @@ Default:: 'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage', } -A dict containing the built-in feed storage backends supported by Scrapy. +A dict containing all feed storage backends supported by your project. The keys +are URI schemes and the values are paths to storage classes. + +When you set :setting:`FEED_STORAGES` manually, e.g. in your project's settings +module, it will be merged with the default, not overwrite it. If you want to +disable any of the default feed storage backends, you must assign ``None`` as +their value. .. setting:: FEED_EXPORTERS FEED_EXPORTERS -------------- -Default:: ``{}`` - -A dict containing additional exporters supported by your project. The keys are -URI schemes and the values are paths to :ref:`Item exporter ` -classes. - -.. setting:: FEED_EXPORTERS_BASE - -FEED_EXPORTERS_BASE -------------------- - Default:: - FEED_EXPORTERS_BASE = { + { 'json': 'scrapy.exporters.JsonItemExporter', 'jsonlines': 'scrapy.exporters.JsonLinesItemExporter', + 'jl': 'scrapy.exporters.JsonLinesItemExporter', 'csv': 'scrapy.exporters.CsvItemExporter', 'xml': 'scrapy.exporters.XmlItemExporter', 'marshal': 'scrapy.exporters.MarshalItemExporter', + 'pickle': 'scrapy.exporters.PickleItemExporter', } -A dict containing the built-in feed exporters supported by Scrapy. +A dict containing all feed exporters supported by your project. The keys are +URI schemes and the values are paths to :ref:`Item exporter ` +classes. +When you set :setting:`FEED_EXPORTERS` manually, e.g. in your project's settings +module, it will be merged with the default, not overwrite it. If you want to +disable any of the default feed exporters, you must assign ``None`` as their +value. .. _URI: http://en.wikipedia.org/wiki/Uniform_Resource_Identifier .. _Amazon S3: http://aws.amazon.com/s3/ diff --git a/docs/topics/settings.rst b/docs/topics/settings.rst index 48406540694..642f4eb84a5 100644 --- a/docs/topics/settings.rst +++ b/docs/topics/settings.rst @@ -269,6 +269,11 @@ Default:: The default headers used for Scrapy HTTP Requests. They're populated in the :class:`~scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware`. +When you set :setting:`DEFAULT_REQUEST_HEADERS` manually, e.g. in your +project's settings module, it will be merged with the default, not overwrite it. +If you want to disable any of the default request headers (and not replace them) +you must assign ``None`` as their value. + .. setting:: DEPTH_LIMIT DEPTH_LIMIT @@ -350,16 +355,6 @@ The downloader to use for crawling. DOWNLOADER_MIDDLEWARES ---------------------- -Default:: ``{}`` - -A dict containing the downloader middlewares enabled in your project, and their -orders. For more info see :ref:`topics-downloader-middleware-setting`. - -.. 
setting:: DOWNLOADER_MIDDLEWARES_BASE - -DOWNLOADER_MIDDLEWARES_BASE ---------------------------- - Default:: { @@ -369,6 +364,7 @@ Default:: 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400, 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500, 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550, + 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560, 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580, 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590, 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600, @@ -379,10 +375,16 @@ Default:: 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900, } -A dict containing the downloader middlewares enabled by default in Scrapy. You -should never modify this setting in your project, modify -:setting:`DOWNLOADER_MIDDLEWARES` instead. For more info see -:ref:`topics-downloader-middleware-setting`. +A dict containing the downloader middlewares enabled in your project, and their +orders. Low orders are closer to the engine, high orders are closer to the +downloader. + +When you set :setting:`DOWNLOADER_MIDDLEWARES` manually, e.g. in your project's +settings module, it will be merged with the default, not overwrite it. If you +want to disable any of the default downloader middlewares you must assign +``None`` as their value. + +For more info see :ref:`topics-downloader-middleware-setting`. .. setting:: DOWNLOADER_STATS @@ -423,33 +425,23 @@ spider attribute. DOWNLOAD_HANDLERS ----------------- -Default: ``{}`` - -A dict containing the request downloader handlers enabled in your project. -See `DOWNLOAD_HANDLERS_BASE` for example format. - -.. setting:: DOWNLOAD_HANDLERS_BASE - -DOWNLOAD_HANDLERS_BASE ----------------------- - Default:: { 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', - 'http': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler', - 'https': 'scrapy.core.downloader.handlers.http.HttpDownloadHandler', + 'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler', + 'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler', 's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler', + 'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler', } -A dict containing the request download handlers enabled by default in Scrapy. -You should never modify this setting in your project, modify -:setting:`DOWNLOAD_HANDLERS` instead. -If you want to disable any of the above download handlers you must define them -in your project's :setting:`DOWNLOAD_HANDLERS` setting and assign `None` -as their value. For example, if you want to disable the file download -handler:: +A dict containing the request downloader handlers enabled in your project. + +When you set :setting:`DOWNLOAD_HANDLERS` manually, e.g. in your project's +settings module, it will be merged with the default, not overwrite it. If you +want to disable any of the default download handlers you must assign ``None`` +as their value. For example, if you want to disable the file download handler:: DOWNLOAD_HANDLERS = { 'file': None, @@ -552,15 +544,6 @@ to ``vi`` (on Unix systems) or the IDLE editor (on Windows). EXTENSIONS ---------- -Default:: ``{}`` - -A dict containing the extensions enabled in your project, and their orders. - -.. setting:: EXTENSIONS_BASE - -EXTENSIONS_BASE ---------------- - Default:: { @@ -575,13 +558,19 @@ Default:: 'scrapy.extensions.throttle.AutoThrottle': 0, } -The list of available extensions. 
Keep in mind that some of them need to -be enabled through a setting. By default, this setting contains all stable -built-in extensions. +A dict containing the extensions enabled in your project, and their orders. By +default, this setting contains all stable built-in extensions. Keep in mind that +some of them need to be enabled through a setting. + +When you set :setting:`EXTENSIONS` manually, e.g. in your project's settings +module, it will be merged with the default, not overwrite it. If you want to +disable any of the default enabled extensions you must assign ``None`` as their +value. For more information See the :ref:`extensions user guide ` and the :ref:`list of available extensions `. + .. setting:: ITEM_PIPELINES ITEM_PIPELINES @@ -589,12 +578,9 @@ ITEM_PIPELINES Default: ``{}`` -A dict containing the item pipelines to use, and their orders. The dict is -empty by default order values are arbitrary but it's customary to define them -in the 0-1000 range. - -Lists are supported in :setting:`ITEM_PIPELINES` for backwards compatibility, -but they are deprecated. +A dict containing the item pipelines to use, and their orders. Order values are +arbitrary, but it is customary to define them in the 0-1000 range. Lower orders +process before higher orders. Example:: @@ -603,16 +589,6 @@ Example:: 'mybot.pipelines.validate.StoreMyItem': 800, } -.. setting:: ITEM_PIPELINES_BASE - -ITEM_PIPELINES_BASE -------------------- - -Default: ``{}`` - -A dict containing the pipelines enabled by default in Scrapy. You should never -modify this setting in your project, modify :setting:`ITEM_PIPELINES` instead. - .. setting:: LOG_ENABLED LOG_ENABLED @@ -638,7 +614,7 @@ LOG_FILE Default: ``None`` -File name to use for logging output. If None, standard error will be used. +File name to use for logging output. If ``None``, standard error will be used. .. setting:: LOG_FORMAT @@ -902,16 +878,6 @@ The scheduler to use for crawling. SPIDER_CONTRACTS ---------------- -Default:: ``{}`` - -A dict containing the scrapy contracts enabled in your project, used for -testing spiders. For more info see :ref:`topics-contracts`. - -.. setting:: SPIDER_CONTRACTS_BASE - -SPIDER_CONTRACTS_BASE ---------------------- - Default:: { @@ -920,9 +886,13 @@ Default:: 'scrapy.contracts.default.ScrapesContract': 3, } -A dict containing the scrapy contracts enabled by default in Scrapy. You should -never modify this setting in your project, modify :setting:`SPIDER_CONTRACTS` -instead. For more info see :ref:`topics-contracts`. +A dict containing the scrapy contracts enabled in your project, used for +testing spiders. For more info see :ref:`topics-contracts`. + +When you set :setting:`SPIDER_CONTRACTS` manually, e.g. in your project's +settings module, it will be merged with the default, not overwrite it. If you +want to disable any of the default contracts you must assign ``None`` as their +value. .. setting:: SPIDER_LOADER_CLASS @@ -939,16 +909,6 @@ The class that will be used for loading spiders, which must implement the SPIDER_MIDDLEWARES ------------------ -Default:: ``{}`` - -A dict containing the spider middlewares enabled in your project, and their -orders. For more info see :ref:`topics-spider-middleware-setting`. - -.. setting:: SPIDER_MIDDLEWARES_BASE - -SPIDER_MIDDLEWARES_BASE ------------------------ - Default:: { @@ -959,10 +919,14 @@ Default:: 'scrapy.spidermiddlewares.depth.DepthMiddleware': 900, } -A dict containing the spider middlewares enabled by default in Scrapy. 
You -should never modify this setting in your project, modify -:setting:`SPIDER_MIDDLEWARES` instead. For more info see -:ref:`topics-spider-middleware-setting`. +A dict containing the spider middlewares enabled in your project, and their +orders. Low orders are closer to the engine, high orders are closer to the +spider. For more info see :ref:`topics-spider-middleware-setting`. + +When you set :setting:`SPIDER_MIDDLEWARES` manually, e.g. in your project's +settings module, it will be merged with the default, not overwrite it. If you +want to disable any of the default spider middlewares you must assign ``None`` +as their value. .. setting:: SPIDER_MODULES diff --git a/docs/topics/spider-middleware.rst b/docs/topics/spider-middleware.rst index 84daaaa5573..d448801d3ab 100644 --- a/docs/topics/spider-middleware.rst +++ b/docs/topics/spider-middleware.rst @@ -24,22 +24,20 @@ Here's an example:: 'myproject.middlewares.CustomSpiderMiddleware': 543, } -The :setting:`SPIDER_MIDDLEWARES` setting is merged with the -:setting:`SPIDER_MIDDLEWARES_BASE` setting defined in Scrapy (and not meant to -be overridden) and then sorted by order to get the final sorted list of enabled -middlewares: the first middleware is the one closer to the engine and the last -is the one closer to the spider. - -To decide which order to assign to your middleware see the -:setting:`SPIDER_MIDDLEWARES_BASE` setting and pick a value according to where +The specified :setting:`SPIDER_MIDDLEWARES` setting is merged with the default +one (i.e. it does not overwrite it) and then sorted by order to get the final +sorted list of enabled middlewares: the first middleware is the one closer to +the engine and the last is the one closer to the spider. + +To decide which order to assign to your middleware see the default +:setting:`SPIDER_MIDDLEWARES` setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied. -If you want to disable a builtin middleware (the ones defined in -:setting:`SPIDER_MIDDLEWARES_BASE`, and enabled by default) you must define it -in your project :setting:`SPIDER_MIDDLEWARES` setting and assign `None` as its -value. For example, if you want to disable the off-site middleware:: +If you want to disable a builtin middleware you must define it in your project's +:setting:`SPIDER_MIDDLEWARES` setting and assign ``None`` as its value. For +example, if you want to disable the off-site middleware:: SPIDER_MIDDLEWARES = { 'myproject.middlewares.CustomSpiderMiddleware': 543, @@ -173,7 +171,7 @@ information on how to use them and how to write your own spider middleware, see the :ref:`spider middleware usage guide `. For a list of the components enabled by default (and their orders) see the -:setting:`SPIDER_MIDDLEWARES_BASE` setting. +:setting:`SPIDER_MIDDLEWARES` setting. 
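
Assigning ``None`` works the same way across all of these dict-based settings:
the entry survives the merge, but is filtered out before the components are
built. The code changes below lean on ``scrapy.utils.conf.remove_none_values``
for that step; a minimal stand-in with the assumed behaviour could look like::

    def remove_none_values(mapping):
        """Drop entries whose value is None, i.e. disabled components."""
        return dict((key, value) for key, value in mapping.items()
                    if value is not None)

    handlers = remove_none_values({
        'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
        'file': None,  # disabled by the project
    })
    # {'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler'}
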
DepthMiddleware --------------- diff --git a/scrapy/commands/check.py b/scrapy/commands/check.py index 2917b8ba726..2a89a13877e 100644 --- a/scrapy/commands/check.py +++ b/scrapy/commands/check.py @@ -58,10 +58,7 @@ def add_options(self, parser): def run(self, args, opts): # load contracts - contracts = build_component_list( - self.settings['SPIDER_CONTRACTS_BASE'], - self.settings['SPIDER_CONTRACTS'], - ) + contracts = build_component_list(self.settings._getcomposite('SPIDER_CONTRACTS')) conman = ContractsManager([load_object(c) for c in contracts]) runner = TextTestRunner(verbosity=2 if opts.verbose else 1) result = TextTestResult(runner.stream, runner.descriptions, runner.verbosity) diff --git a/scrapy/commands/crawl.py b/scrapy/commands/crawl.py index 72df1147695..9c8a3d4ce4b 100644 --- a/scrapy/commands/crawl.py +++ b/scrapy/commands/crawl.py @@ -1,6 +1,6 @@ import os from scrapy.commands import ScrapyCommand -from scrapy.utils.conf import arglist_to_dict +from scrapy.utils.conf import arglist_to_dict, remove_none_values from scrapy.exceptions import UsageError @@ -34,10 +34,8 @@ def process_options(self, args, opts): self.settings.set('FEED_URI', 'stdout:', priority='cmdline') else: self.settings.set('FEED_URI', opts.output, priority='cmdline') - valid_output_formats = ( - list(self.settings.getdict('FEED_EXPORTERS').keys()) + - list(self.settings.getdict('FEED_EXPORTERS_BASE').keys()) - ) + feed_exporters = remove_none_values(self.settings._getcomposite('FEED_EXPORTERS')) + valid_output_formats = feed_exporters.keys() if not opts.output_format: opts.output_format = os.path.splitext(opts.output)[1].replace(".", "") if opts.output_format not in valid_output_formats: diff --git a/scrapy/commands/runspider.py b/scrapy/commands/runspider.py index 88f5a30152e..7d85984c3bf 100644 --- a/scrapy/commands/runspider.py +++ b/scrapy/commands/runspider.py @@ -5,7 +5,7 @@ from scrapy.utils.spider import iter_spider_classes from scrapy.commands import ScrapyCommand from scrapy.exceptions import UsageError -from scrapy.utils.conf import arglist_to_dict +from scrapy.utils.conf import arglist_to_dict, remove_none_values def _import_file(filepath): @@ -57,10 +57,8 @@ def process_options(self, args, opts): self.settings.set('FEED_URI', 'stdout:', priority='cmdline') else: self.settings.set('FEED_URI', opts.output, priority='cmdline') - valid_output_formats = ( - list(self.settings.getdict('FEED_EXPORTERS').keys()) + - list(self.settings.getdict('FEED_EXPORTERS_BASE').keys()) - ) + feed_exporters = remove_none_values(self.settings._getcomposite('FEED_EXPORTERS')) + valid_output_formats = feed_exporters.keys() if not opts.output_format: opts.output_format = os.path.splitext(opts.output)[1].replace(".", "") if opts.output_format not in valid_output_formats: diff --git a/scrapy/core/downloader/handlers/__init__.py b/scrapy/core/downloader/handlers/__init__.py index 6c9514af6a4..9b118c39bc5 100644 --- a/scrapy/core/downloader/handlers/__init__.py +++ b/scrapy/core/downloader/handlers/__init__.py @@ -4,6 +4,7 @@ from twisted.internet import defer import six from scrapy.exceptions import NotSupported, NotConfigured +from scrapy.utils.conf import remove_none_values from scrapy.utils.httpobj import urlparse_cached from scrapy.utils.misc import load_object from scrapy import signals @@ -19,13 +20,8 @@ def __init__(self, crawler): self._schemes = {} # stores acceptable schemes on instancing self._handlers = {} # stores instanced handlers for schemes self._notconfigured = {} # remembers failed handlers - handlers = 
crawler.settings.get('DOWNLOAD_HANDLERS_BASE') - handlers.update(crawler.settings.get('DOWNLOAD_HANDLERS', {})) + handlers = remove_none_values(crawler.settings._getcomposite('DOWNLOAD_HANDLERS')) for scheme, clspath in six.iteritems(handlers): - # Allow to disable a handler just like any other - # component (extension, middleware, etc). - if clspath is None: - continue self._schemes[scheme] = clspath crawler.signals.connect(self._close, signals.engine_stopped) diff --git a/scrapy/core/downloader/middleware.py b/scrapy/core/downloader/middleware.py index 413a05dd147..0433443d099 100644 --- a/scrapy/core/downloader/middleware.py +++ b/scrapy/core/downloader/middleware.py @@ -15,8 +15,7 @@ class DownloaderMiddlewareManager(MiddlewareManager): @classmethod def _get_mwlist_from_settings(cls, settings): - return build_component_list(settings['DOWNLOADER_MIDDLEWARES_BASE'], \ - settings['DOWNLOADER_MIDDLEWARES']) + return build_component_list(settings._getcomposite('DOWNLOADER_MIDDLEWARES')) def _add_middleware(self, mw): if hasattr(mw, 'process_request'): diff --git a/scrapy/core/spidermw.py b/scrapy/core/spidermw.py index c1c5b10fcd5..b5c80c350be 100644 --- a/scrapy/core/spidermw.py +++ b/scrapy/core/spidermw.py @@ -18,8 +18,7 @@ class SpiderMiddlewareManager(MiddlewareManager): @classmethod def _get_mwlist_from_settings(cls, settings): - return build_component_list(settings['SPIDER_MIDDLEWARES_BASE'], \ - settings['SPIDER_MIDDLEWARES']) + return build_component_list(settings._getcomposite('SPIDER_MIDDLEWARES')) def _add_middleware(self, mw): super(SpiderMiddlewareManager, self)._add_middleware(mw) diff --git a/scrapy/downloadermiddlewares/defaultheaders.py b/scrapy/downloadermiddlewares/defaultheaders.py index f1d2bd6311f..c8924c04a63 100644 --- a/scrapy/downloadermiddlewares/defaultheaders.py +++ b/scrapy/downloadermiddlewares/defaultheaders.py @@ -4,6 +4,8 @@ See documentation in docs/topics/downloader-middleware.rst """ +from scrapy.utils.conf import remove_none_values + class DefaultHeadersMiddleware(object): @@ -12,7 +14,8 @@ def __init__(self, headers): @classmethod def from_crawler(cls, crawler): - return cls(crawler.settings.get('DEFAULT_REQUEST_HEADERS').items()) + headers = remove_none_values(crawler.settings['DEFAULT_REQUEST_HEADERS']) + return cls(headers.items()) def process_request(self, request, spider): for k, v in self._headers: diff --git a/scrapy/extension.py b/scrapy/extension.py index f68b1ba6822..4ceb32c6847 100644 --- a/scrapy/extension.py +++ b/scrapy/extension.py @@ -12,5 +12,4 @@ class ExtensionManager(MiddlewareManager): @classmethod def _get_mwlist_from_settings(cls, settings): - return build_component_list(settings['EXTENSIONS_BASE'], \ - settings['EXTENSIONS']) + return build_component_list(settings._getcomposite('EXTENSIONS')) diff --git a/scrapy/extensions/feedexport.py b/scrapy/extensions/feedexport.py index 7560e89d341..fb07657d69d 100644 --- a/scrapy/extensions/feedexport.py +++ b/scrapy/extensions/feedexport.py @@ -18,6 +18,7 @@ from w3lib.url import file_uri_to_path from scrapy import signals +from scrapy.utils.conf import remove_none_values from scrapy.utils.ftp import ftp_makedirs_cwd from scrapy.exceptions import NotConfigured from scrapy.utils.misc import load_object @@ -195,8 +196,7 @@ def item_scraped(self, item, spider): return item def _load_components(self, setting_prefix): - conf = dict(self.settings['%s_BASE' % setting_prefix]) - conf.update(self.settings[setting_prefix]) + conf = remove_none_values(self.settings._getcomposite(setting_prefix)) d 
= {} for k, v in conf.items(): try: diff --git a/scrapy/pipelines/__init__.py b/scrapy/pipelines/__init__.py index d433498f50c..8df0d315439 100644 --- a/scrapy/pipelines/__init__.py +++ b/scrapy/pipelines/__init__.py @@ -13,15 +13,7 @@ class ItemPipelineManager(MiddlewareManager): @classmethod def _get_mwlist_from_settings(cls, settings): - item_pipelines = settings['ITEM_PIPELINES'] - if isinstance(item_pipelines, (tuple, list, set, frozenset)): - from scrapy.exceptions import ScrapyDeprecationWarning - import warnings - warnings.warn('ITEM_PIPELINES defined as a list or a set is deprecated, switch to a dict', - category=ScrapyDeprecationWarning, stacklevel=1) - # convert old ITEM_PIPELINE list to a dict with order 500 - item_pipelines = dict(zip(item_pipelines, range(500, 500+len(item_pipelines)))) - return build_component_list(settings['ITEM_PIPELINES_BASE'], item_pipelines) + return build_component_list(settings._getcomposite('ITEM_PIPELINES')) def _add_middleware(self, pipe): super(ItemPipelineManager, self)._add_middleware(pipe) diff --git a/scrapy/settings/__init__.py b/scrapy/settings/__init__.py index fa7fa317893..7eea562e1af 100644 --- a/scrapy/settings/__init__.py +++ b/scrapy/settings/__init__.py @@ -36,13 +36,21 @@ class SettingsAttribute(object): def __init__(self, value, priority): self.value = value - self.priority = priority + if isinstance(self.value, BaseSettings): + self.priority = max(self.value.maxpriority(), priority) + else: + self.priority = priority def set(self, value, priority): """Sets value if priority is higher or equal than current priority.""" - if priority >= self.priority: - self.value = value - self.priority = priority + if isinstance(self.value, BaseSettings): + # Ignore self.priority if self.value has per-key priorities + self.value.update(value, priority) + self.priority = max(self.value.maxpriority(), priority) + else: + if priority >= self.priority: + self.value = value + self.priority = priority def __str__(self): return " Date: Thu, 2 Jul 2015 16:51:15 +0200 Subject: [PATCH 03/13] Move Settings documentation to docstrings --- docs/topics/api.rst | 209 ++---------------------------------- scrapy/settings/__init__.py | 193 ++++++++++++++++++++++++++++++++- 2 files changed, 195 insertions(+), 207 deletions(-) diff --git a/docs/topics/api.rst b/docs/topics/api.rst index 923bd80b0c6..42c0133c13e 100644 --- a/docs/topics/api.rst +++ b/docs/topics/api.rst @@ -140,211 +140,14 @@ Settings API For a detailed explanation on each settings sources, see: :ref:`topics-settings`. -.. function:: get_settings_priority(priority) +.. autofunction:: get_settings_priority - Small helper function that looks up a given string priority in the - :attr:`~scrapy.settings.SETTINGS_PRIORITIES` dictionary and returns its - numerical value, or directly returns a given numerical priority. - -.. class:: Settings(values={}, priority='project') - - This object stores Scrapy settings for the configuration of internal - components, and can be used for any further customization. - - It is a direct subclass and supports all methods of - :class:`~scrapy.settings.BaseSettings`. Additionally, after instantiation - of this class, the new object will have the global default settings - described on :ref:`topics-settings-ref` already populated. - -.. class:: BaseSettings(values={}, priority='project') - - Instances of this class behave like dictionaries, but store priorities - along with their ``(key, value)`` pairs, and can be frozen (i.e. marked - immutable). 
- - Key-value entries can be passed on initialization with the ``values`` - argument, and they would take the ``priority`` level (unless ``values`` is - already an instance of :class:`~scrapy.settings.BaseSettings`, in which - case the existing priority levels will be kept). If the ``priority`` - argument is a string, the priority name will be looked up in - :attr:`~scrapy.settings.SETTINGS_PRIORITIES`. Otherwise, a specific integer - should be provided. - - Once the object is created, new settings can be loaded or updated with the - :meth:`~scrapy.settings.BaseSettings.set` method, and can be accessed with - the square bracket notation of dictionaries, or with the - :meth:`~scrapy.settings.BaseSettings.get` method of the instance and its - value conversion variants. When requesting a stored key, the value with the - highest priority will be retrieved. - - .. method:: set(name, value, priority='project') - - Store a key/value attribute with a given priority. - - Settings should be populated *before* configuring the Crawler object - (through the :meth:`~scrapy.crawler.Crawler.configure` method), - otherwise they won't have any effect. - - :param name: the setting name - :type name: string - - :param value: the value to associate with the setting - :type value: any - - :param priority: the priority of the setting. Should be a key of - :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer - :type priority: string or int - - .. method:: update(values, priority='project') - - Store key/value pairs with a given priority. - - This is a helper function that calls - :meth:`~scrapy.settings.BaseSettings.set` for every item of ``values`` - with the provided ``priority``. - - If ``values`` is a string, it is assumed to be JSON-encoded and parsed - into a dict with ``json.loads()`` first. If it is a - :class:`~scrapy.settings.BaseSettings` instance, the per-key priorities - will be used and the ``priority`` parameter ignored. This allows - inserting/updating settings with different priorities with a single - command. - - :param values: the settings names and values - :type values: dict or string or :class:`~scrapy.settings.BaseSettings` - - :param priority: the priority of the settings. Should be a key of - :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer - :type priority: string or int - - .. method:: setmodule(module, priority='project') - - Store settings from a module with a given priority. - - This is a helper function that calls - :meth:`~scrapy.settings.BaseSettings.set` for every globally declared - uppercase variable of ``module`` with the provided ``priority``. - - :param module: the module or the path of the module - :type module: module object or string - - :param priority: the priority of the settings. Should be a key of - :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer - :type priority: string or int - - .. method:: get(name, default=None) - - Get a setting value without affecting its original type. - - :param name: the setting name - :type name: string - - :param default: the value to return if no setting is found - :type default: any - - .. method:: getbool(name, default=False) - - Get a setting value as a boolean. For example, both ``1`` and ``'1'``, and - ``True`` return ``True``, while ``0``, ``'0'``, ``False`` and ``None`` - return ``False```` - - For example, settings populated through environment variables set to ``'0'`` - will return ``False`` when using this method. 
- - :param name: the setting name - :type name: string - - :param default: the value to return if no setting is found - :type default: any - - .. method:: getint(name, default=0) - - Get a setting value as an int - - :param name: the setting name - :type name: string - - :param default: the value to return if no setting is found - :type default: any - - .. method:: getfloat(name, default=0.0) - - Get a setting value as a float - - :param name: the setting name - :type name: string - - :param default: the value to return if no setting is found - :type default: any - - .. method:: getlist(name, default=None) - - Get a setting value as a list. If the setting original type is a list, a - copy of it will be returned. If it's a string it will be split by ",". - - For example, settings populated through environment variables set to - ``'one,two'`` will return a list ['one', 'two'] when using this method. - - :param name: the setting name - :type name: string - - :param default: the value to return if no setting is found - :type default: any - - .. method:: getdict(name, default=None) - - Get a setting value as a dictionary. If the setting original type is a - dictionary, a copy of it will be returned. If it is a string it will be - evaluated as a JSON dictionary. In the case that it is a - :class:`~scrapy.settings.BaseSettings` instance itself, it will be - converted to a dictionary, containing all its current settings values - as they would be returned by :meth:`~scrapy.settings.BaseSettings.get`, - and losing all information about priority and mutability. - - :param name: the setting name - :type name: string - - :param default: the value to return if no setting is found - :type default: any - - .. method:: copy() - - Make a deep copy of current settings. - - This method returns a new instance of the :class:`Settings` class, - populated with the same values and their priorities. - - Modifications to the new object won't be reflected on the original - settings. - - .. method:: freeze() - - Disable further changes to the current settings. - - After calling this method, the present state of the settings will become - immutable. Trying to change values through the :meth:`~set` method and - its variants won't be possible and will be alerted. - - .. method:: frozencopy() - - Return an immutable copy of the current settings. - - Alias for a :meth:`~freeze` call in the object returned by :meth:`copy` - - .. method:: getpriority(name) - - Return the current numerical priority value of a setting, or ``None`` if - the given ``name`` does not exist. - - :param name: the setting name - :type name: string - - .. method:: maxpriority() +.. autoclass:: Settings + :show-inheritance: + :members: - Return the numerical value of the highest priority present throughout - all settings, or the numerical value for ``default`` from - :attr:`~scrapy.settings.SETTINGS_PRIORITIES` if there are no settings - stored. +.. autoclass:: BaseSettings + :members: .. _topics-api-spiderloader: diff --git a/scrapy/settings/__init__.py b/scrapy/settings/__init__.py index 7eea562e1af..1216aabcb54 100644 --- a/scrapy/settings/__init__.py +++ b/scrapy/settings/__init__.py @@ -20,6 +20,11 @@ } def get_settings_priority(priority): + """ + Small helper function that looks up a given string priority in the + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` dictionary and returns its + numerical value, or directly returns a given numerical priority. 
+ """ if isinstance(priority, six.string_types): return SETTINGS_PRIORITIES[priority] else: @@ -60,6 +65,26 @@ def __str__(self): class BaseSettings(MutableMapping): + """ + Instances of this class behave like dictionaries, but store priorities + along with their ``(key, value)`` pairs, and can be frozen (i.e. marked + immutable). + + Key-value entries can be passed on initialization with the ``values`` + argument, and they would take the ``priority`` level (unless ``values`` is + already an instance of :class:`~scrapy.settings.BaseSettings`, in which + case the existing priority levels will be kept). If the ``priority`` + argument is a string, the priority name will be looked up in + :attr:`~scrapy.settings.SETTINGS_PRIORITIES`. Otherwise, a specific integer + should be provided. + + Once the object is created, new settings can be loaded or updated with the + :meth:`~scrapy.settings.BaseSettings.set` method, and can be accessed with + the square bracket notation of dictionaries, or with the + :meth:`~scrapy.settings.BaseSettings.get` method of the instance and its + value conversion variants. When requesting a stored key, the value with the + highest priority will be retrieved. + """ def __init__(self, values=None, priority='project'): self.frozen = False @@ -76,28 +101,94 @@ def __contains__(self, name): return name in self.attributes def get(self, name, default=None): + """ + Get a setting value without affecting its original type. + + :param name: the setting name + :type name: string + + :param default: the value to return if no setting is found + :type default: any + """ return self[name] if self[name] is not None else default def getbool(self, name, default=False): """ - True is: 1, '1', True - False is: 0, '0', False, None + Get a setting value as a boolean. + + ``1``, ``'1'``, and ``True`` return ``True``, while ``0``, ``'0'``, + ``False`` and ``None`` return ``False``. + + For example, settings populated through environment variables set to + ``'0'`` will return ``False`` when using this method. + + :param name: the setting name + :type name: string + + :param default: the value to return if no setting is found + :type default: any """ return bool(int(self.get(name, default))) def getint(self, name, default=0): + """ + Get a setting value as an int. + + :param name: the setting name + :type name: string + + :param default: the value to return if no setting is found + :type default: any + """ return int(self.get(name, default)) def getfloat(self, name, default=0.0): + """ + Get a setting value as a float. + + :param name: the setting name + :type name: string + + :param default: the value to return if no setting is found + :type default: any + """ return float(self.get(name, default)) def getlist(self, name, default=None): + """ + Get a setting value as a list. If the setting original type is a list, a + copy of it will be returned. If it's a string it will be split by ",". + + For example, settings populated through environment variables set to + ``'one,two'`` will return a list ['one', 'two'] when using this method. + + :param name: the setting name + :type name: string + + :param default: the value to return if no setting is found + :type default: any + """ value = self.get(name, default or []) if isinstance(value, six.string_types): value = value.split(',') return list(value) def getdict(self, name, default=None): + """ + Get a setting value as a dictionary. If the setting original type is a + dictionary, a copy of it will be returned. 
If it is a string it will be + evaluated as a JSON dictionary. In the case that it is a + :class:`~scrapy.settings.BaseSettings` instance itself, it will be + converted to a dictionary, containing all its current settings values + as they would be returned by :meth:`~scrapy.settings.BaseSettings.get`, + and losing all information about priority and mutability. + + :param name: the setting name + :type name: string + + :param default: the value to return if no setting is found + :type default: any + """ value = self.get(name, default or {}) if isinstance(value, six.string_types): value = json.loads(value) @@ -118,12 +209,25 @@ def _getcomposite(self, name): return self[name] def getpriority(self, name): + """ + Return the current numerical priority value of a setting, or ``None`` if + the given ``name`` does not exist. + + :param name: the setting name + :type name: string + """ prio = None if name in self: prio = self.attributes[name].priority return prio def maxpriority(self): + """ + Return the numerical value of the highest priority present throughout + all settings, or the numerical value for ``default`` from + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` if there are no settings + stored. + """ if len(self) > 0: return max(self.getpriority(name) for name in self) else: @@ -133,6 +237,23 @@ def __setitem__(self, name, value): self.set(name, value) def set(self, name, value, priority='project'): + """ + Store a key/value attribute with a given priority. + + Settings should be populated *before* configuring the Crawler object + (through the :meth:`~scrapy.crawler.Crawler.configure` method), + otherwise they won't have any effect. + + :param name: the setting name + :type name: string + + :param value: the value to associate with the setting + :type value: any + + :param priority: the priority of the setting. Should be a key of + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer + :type priority: string or int + """ self._assert_mutability() priority = get_settings_priority(priority) if name not in self: @@ -147,6 +268,20 @@ def setdict(self, values, priority='project'): self.update(values, priority) def setmodule(self, module, priority='project'): + """ + Store settings from a module with a given priority. + + This is a helper function that calls + :meth:`~scrapy.settings.BaseSettings.set` for every globally declared + uppercase variable of ``module`` with the provided ``priority``. + + :param module: the module or the path of the module + :type module: module object or string + + :param priority: the priority of the settings. Should be a key of + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer + :type priority: string or int + """ self._assert_mutability() if isinstance(module, six.string_types): module = import_module(module) @@ -155,6 +290,27 @@ def setmodule(self, module, priority='project'): self.set(key, getattr(module, key), priority) def update(self, values, priority='project'): + """ + Store key/value pairs with a given priority. + + This is a helper function that calls + :meth:`~scrapy.settings.BaseSettings.set` for every item of ``values`` + with the provided ``priority``. + + If ``values`` is a string, it is assumed to be JSON-encoded and parsed + into a dict with ``json.loads()`` first. If it is a + :class:`~scrapy.settings.BaseSettings` instance, the per-key priorities + will be used and the ``priority`` parameter ignored. This allows + inserting/updating settings with different priorities with a single + command. 
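A minimal usage sketch of the behaviour described above (the setting names
``KEY1``–``KEY3`` are invented for illustration; priority numbers follow
``SETTINGS_PRIORITIES``)::

    from scrapy.settings import BaseSettings

    s = BaseSettings({'KEY1': 'default value'}, priority='default')
    s.update({'KEY1': 'project value', 'KEY2': 'project value'},
             priority='project')
    s.update('{"KEY3": "cmdline value"}', priority='cmdline')  # JSON string
    # Per-key priorities of a BaseSettings argument are kept; the
    # ``priority`` parameter is ignored in that case:
    s.update(BaseSettings({'KEY1': 'spider value'}, priority='spider'))
    s.getpriority('KEY1')  # 30, i.e. SETTINGS_PRIORITIES['spider']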
+ + :param values: the settings names and values + :type values: dict or string or :class:`~scrapy.settings.BaseSettings` + + :param priority: the priority of the settings. Should be a key of + :attr:`~scrapy.settings.SETTINGS_PRIORITIES` or an integer + :type priority: string or int + """ self._assert_mutability() if isinstance(values, six.string_types): values = json.loads(values) @@ -181,12 +337,33 @@ def _assert_mutability(self): raise TypeError("Trying to modify an immutable Settings object") def copy(self): + """ + Make a deep copy of current settings. + + This method returns a new instance of the :class:`Settings` class, + populated with the same values and their priorities. + + Modifications to the new object won't be reflected on the original + settings. + """ return copy.deepcopy(self) def freeze(self): + """ + Disable further changes to the current settings. + + After calling this method, the present state of the settings will become + immutable. Trying to change values through the :meth:`~set` method and + its variants won't be possible and will be alerted. + """ self.frozen = True def frozencopy(self): + """ + Return an immutable copy of the current settings. + + Alias for a :meth:`~freeze` call in the object returned by :meth:`copy`. + """ copy = self.copy() copy.freeze() return copy @@ -252,6 +429,15 @@ def __iter__(self, k, v): class Settings(BaseSettings): + """ + This object stores Scrapy settings for the configuration of internal + components, and can be used for any further customization. + + It is a direct subclass and supports all methods of + :class:`~scrapy.settings.BaseSettings`. Additionally, after instantiation + of this class, the new object will have the global default settings + described on :ref:`topics-settings-ref` already populated. + """ def __init__(self, values=None, priority='project'): # Do not pass kwarg values here. We don't want to promote user-defined @@ -261,8 +447,7 @@ def __init__(self, values=None, priority='project'): self.setmodule(default_settings, 'default') # Promote default dictionaries to BaseSettings instances for per-key # priorities - for name in self: - val = self[name] + for name, val in six.iteritems(self): if isinstance(val, dict): self.set(name, BaseSettings(val, 'default'), 'default') self.update(values, priority) From 649735f412439ef82be06bd8e7546152520c0c54 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Mon, 10 Aug 2015 23:43:06 +0200 Subject: [PATCH 04/13] Add ExecutionEngine.close() method --- scrapy/core/engine.py | 15 +++++++++++++++ scrapy/crawler.py | 6 ++++-- tests/test_crawl.py | 10 ++++++++++ tests/test_engine.py | 24 ++++++++++++++++++++++++ 4 files changed, 53 insertions(+), 2 deletions(-) diff --git a/scrapy/core/engine.py b/scrapy/core/engine.py index 992327bfeea..753d6165d0e 100644 --- a/scrapy/core/engine.py +++ b/scrapy/core/engine.py @@ -84,6 +84,21 @@ def stop(self): dfd = self._close_all_spiders() return dfd.addBoth(lambda _: self._finish_stopping_engine()) + def close(self): + """Close the execution engine gracefully. + + If it has already been started, stop it. In all cases, close all spiders + and the downloader. 
+ """ + if self.running: + # Will close spiders and downloader + return self.stop() + elif self.open_spiders: + # Will close downloader + return self._close_all_spiders() + else: + return defer.succeed(self.downloader.close()) + def pause(self): """Pause the execution engine""" self.paused = True diff --git a/scrapy/crawler.py b/scrapy/crawler.py index 2f1a92d3190..95d56d67128 100644 --- a/scrapy/crawler.py +++ b/scrapy/crawler.py @@ -72,9 +72,11 @@ def crawl(self, *args, **kwargs): start_requests = iter(self.spider.start_requests()) yield self.engine.open_spider(self.spider, start_requests) yield defer.maybeDeferred(self.engine.start) - except Exception: + except Exception as e: self.crawling = False - raise + if self.engine is not None: + yield self.engine.close() + raise e def _create_spider(self, *args, **kwargs): return self.spidercls.from_crawler(self, *args, **kwargs) diff --git a/tests/test_crawl.py b/tests/test_crawl.py index 6d21acab08f..82aaf20279c 100644 --- a/tests/test_crawl.py +++ b/tests/test_crawl.py @@ -226,3 +226,13 @@ def cb(response): s = dict(est[0]) self.assertEqual(s['engine.spider.name'], crawler.spider.name) self.assertEqual(s['len(engine.scraper.slot.active)'], 1) + + @defer.inlineCallbacks + def test_graceful_crawl_error_handling(self): + crawler = get_crawler(SimpleSpider) + class TestError(Exception): + pass + with mock.patch('scrapy.crawler.ExecutionEngine.open_spider') as mock_os: + mock_os.side_effect = TestError + yield self.assertFailure(crawler.crawl(), TestError) + self.assertFalse(crawler.crawling) diff --git a/tests/test_engine.py b/tests/test_engine.py index e14957eae75..dad921a60d8 100644 --- a/tests/test_engine.py +++ b/tests/test_engine.py @@ -19,6 +19,7 @@ from twisted.trial import unittest from scrapy import signals +from scrapy.core.engine import ExecutionEngine from scrapy.utils.test import get_crawler from pydispatch import dispatcher from tests import tests_datadir @@ -234,6 +235,29 @@ def _assert_signals_catched(self): self.assertEqual({'spider': self.run.spider, 'reason': 'finished'}, self.run.signals_catched[signals.spider_closed]) + @defer.inlineCallbacks + def test_close_downloader(self): + e = ExecutionEngine(get_crawler(TestSpider), lambda: None) + yield e.close() + + @defer.inlineCallbacks + def test_close_spiders_downloader(self): + e = ExecutionEngine(get_crawler(TestSpider), lambda: None) + yield e.open_spider(TestSpider(), []) + self.assertEqual(len(e.open_spiders), 1) + yield e.close() + self.assertEqual(len(e.open_spiders), 0) + + @defer.inlineCallbacks + def test_close_engine_spiders_downloader(self): + e = ExecutionEngine(get_crawler(TestSpider), lambda: None) + yield e.open_spider(TestSpider(), []) + e.start() + self.assertTrue(e.running) + yield e.close() + self.assertFalse(e.running) + self.assertEqual(len(e.open_spiders), 0) + if __name__ == "__main__": if len(sys.argv) > 1 and sys.argv[1] == 'runserver': From b45ca14eb495b012d0f745805d75f770a9711d7d Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Wed, 15 Jul 2015 17:27:57 +0200 Subject: [PATCH 05/13] Allow passing Python objects to middleware dict settings --- scrapy/middleware.py | 5 ++++- scrapy/utils/misc.py | 7 ++++++- tests/test_middleware.py | 13 +++++++++++++ tests/test_utils_misc/__init__.py | 4 +++- 4 files changed, 26 insertions(+), 3 deletions(-) diff --git a/scrapy/middleware.py b/scrapy/middleware.py index a7adc39e3a0..47020f1002d 100644 --- a/scrapy/middleware.py +++ b/scrapy/middleware.py @@ -1,3 +1,4 @@ +from inspect import isclass import logging 
from collections import defaultdict @@ -30,7 +31,9 @@ def from_settings(cls, settings, crawler=None): for clspath in mwlist: try: mwcls = load_object(clspath) - if crawler and hasattr(mwcls, 'from_crawler'): + if not isclass(mwcls): + mw = mwcls + elif crawler and hasattr(mwcls, 'from_crawler'): mw = mwcls.from_crawler(crawler) elif hasattr(mwcls, 'from_settings'): mw = mwcls.from_settings(settings) diff --git a/scrapy/utils/misc.py b/scrapy/utils/misc.py index 4215e41d27b..3b5c9436e4d 100644 --- a/scrapy/utils/misc.py +++ b/scrapy/utils/misc.py @@ -31,10 +31,15 @@ def arg_to_iter(arg): def load_object(path): """Load an object given its absolute object path, and return it. - object can be a class, function, variable o instance. + If ``path`` is not a string, it will be returned. + + The object can be a class, function, variable, or instance. path ie: 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware' """ + if not isinstance(path, six.string_types): + return path + try: dot = path.rindex('.') except ValueError: diff --git a/tests/test_middleware.py b/tests/test_middleware.py index b6d885330a7..4e3c67d2089 100644 --- a/tests/test_middleware.py +++ b/tests/test_middleware.py @@ -91,3 +91,16 @@ def test_enabled_from_settings(self): mwman = TestMiddlewareManager.from_settings(settings) classes = [x.__class__ for x in mwman.middlewares] self.assertEqual(classes, [M1, M3]) + + def test_instances_from_settings(self): + settings = Settings() + myM3 = M3() + class InstanceTestMiddlewareManager(MiddlewareManager): + @classmethod + def _get_mwlist_from_settings(cls, settings): + return [ 'tests.test_middleware.M1', M2, myM3 ] + mwman = InstanceTestMiddlewareManager.from_settings(settings) + self.assertIsInstance(mwman.middlewares[0], M1) + self.assertIsInstance(mwman.middlewares[1], M2) + self.assertIs(mwman.middlewares[2], myM3) + diff --git a/tests/test_utils_misc/__init__.py b/tests/test_utils_misc/__init__.py index 01460a10b64..06af3c00940 100644 --- a/tests/test_utils_misc/__init__.py +++ b/tests/test_utils_misc/__init__.py @@ -11,7 +11,9 @@ class UtilsMiscTestCase(unittest.TestCase): def test_load_object(self): obj = load_object('scrapy.utils.misc.load_object') - assert obj is load_object + self.assertIs(obj, load_object) + not_a_string = int(1000) + self.assertIs(load_object(not_a_string), not_a_string) self.assertRaises(ImportError, load_object, 'nomodule999.mod.function') self.assertRaises(NameError, load_object, 'scrapy.utils.misc.load_object999') From 039675ca1338d1b8ec4abf970fae918fe5d44f06 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Mon, 1 Jun 2015 17:39:41 +0200 Subject: [PATCH 06/13] Redraft SEP-021 --- sep/sep-021.rst | 340 +++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 290 insertions(+), 50 deletions(-) diff --git a/sep/sep-021.rst b/sep/sep-021.rst index 628a95dd26c..ce500fc0031 100644 --- a/sep/sep-021.rst +++ b/sep/sep-021.rst @@ -17,19 +17,31 @@ Scrapy currently supports many hooks and mechanisms for extending its functionality, but no single entry point for enabling and configuring them. 
Instead, the hooks are spread over: -* Spider middlewares (SPIDER_MIDDLEWARES) -* Downloader middlewares (DOWNLOADER_MIDDLEWARES) -* Downloader handlers (DOWNLOADER_HANDLERS) -* Item pipelines (ITEM_PIPELINES) -* Feed exporters and storages (FEED_EXPORTERS, FEED_STORAGES) -* Overrideable components (DUPEFILTER_CLASS, STATS_CLASS, SCHEDULER, SPIDER_MANAGER_CLASS, ITEM_PROCESSOR, etc) -* Generic extensions (EXTENSIONS) -* CLI commands (COMMANDS_MODULE) - -One problem of this approach is that enabling an extension often requires -modifying many settings, often in a coordinated way, which is complex and error -prone. Add-ons are meant to fix this by providing a simple mechanism for -enabling extensions. +* Spider middlewares (``SPIDER_MIDDLEWARES``) +* Downloader middlewares (``DOWNLOADER_MIDDLEWARES``) +* Downloader handlers (``DOWNLOADER_HANDLERS``) +* Item pipelines (``ITEM_PIPELINES``) +* Feed exporters and storages (``FEED_EXPORTERS``, ``FEED_STORAGES``) +* Overrideable components (``DUPEFILTER_CLASS``, ``STATS_CLASS``, + ``SCHEDULER``, ``SPIDER_MANAGER_CLASS``, ``ITEM_PROCESSOR``, etc.) +* Generic extensions (``EXTENSIONS``) +* CLI commands (``COMMANDS_MODULE``) + +This approach has several shortfalls: + +* Enabling an extension often requires modifying many settings, often in a + coordinated way, which is complex and error prone. +* Extension developers have little control over ensuring their library + dependencies and configuration requirements are met, especially since most + extensions never 'see' a fully-configured crawler before it starts running. +* The user is burdened with supervising potential interplay of extensions, + especially non-included ones, ranging from setting name clashes to mutually + excluding dependencies/configuration requirements. + +*Add-ons* search to remedy these shortcomings by enhancing Scrapy's extension +management, making it easy-to-use and transparent for users while giving more +configuration control to developers. + Design goals and non-goals ========================== @@ -37,8 +49,8 @@ Design goals and non-goals Goals: * simple to manage: adding or removing extensions should be just a matter of - adding or removing lines in a ``scrapy.cfg`` file -* backward compatibility with enabling extension the "old way" (ie. modifying + adding or removing lines in a configuration file +* backward compatibility with enabling extension the "old way" (i.e. modifying settings directly) Non-goals: @@ -46,62 +58,290 @@ Non-goals: * a way to publish, distribute or discover extensions (use pypi for that) -Managing add-ons -================ +User experience: managing add-ons +================================= -Add-ons are defined in the ``scrapy.cfg`` file, inside the ``[addons]`` -section. +Add-ons are enabled and configured either via Scrapy's settings, or (for add-ons +not bound to any project) in ``scrapy.cfg``. -To enable the "httpcache" addon, either shipped with Scrapy or in the Python -search path, create an entry for it in your ``scrapy.cfg``, like this:: +In the settings, add-ons can be enabled by adding either their name (for +built-in add-ons), their Python path, or their file path, to a +``INSTALLED_ADDONS`` setting. If necessary, each add-on can be configured by +providing a dictionary-valued setting with the uppercase add-on name. 
For +example, to enable and configure the built-in ``httpcache`` add-on and enable +(without configuring) two custom add-ons, one via Python path and one via file +path, add these entries to your settings module:: - [addons] - httpcache = + INSTALLED_ADDONS = ( + 'httpcache', + 'mymodule.filters.myfilter', + 'mymodule/filters/otherfilter.py', + ) -You may also specify the full path to an add-on (which may be either a .py file -or a folder containing __init__.py):: + HTTPCACHE = { + 'ignore_http_codes': [404, 503], + } - [addons] - mongodb_pipeline = /path/to/mongodb_pipeline.py +In ``scrapy.cfg``, add-ons are enabled and configured with one section per +add-on. The section names correspond to the entries of ``INSTALLED_ADDONS``. +The configuration from above could look like this:: + [addon:httpcache] + ignore_http_codes = 404,503 -Writing add-ons -=============== + [addon:mymodule.filters.myfilter] -Add-ons are Python modules that implement the following callbacks. + [addon:mymodule/filters/otherfilter.py] -addon_configure ---------------- -Receives the Settings object and modifies it to enable the required components. -If it raises an exception, Scrapy will print it and exit. +Developer experience: writing add-ons +===================================== -Examples:: +Add-ons are (any) Python *objects* that implement Scrapy's *add-on interface*. +The interface is enforced through ``zope.interface``. This leaves the choice of +Python object up the developer. Examples: - def addon_configure(settings): - settings.overrides['DOWNLADER_MIDDLEWARES'].update({ - 'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900, - }) +* for a small pipeline, the add-on interface could be implemented in the same + class that also implements the ``open/close_spider`` and ``process_item`` + callbacks +* for larger add-ons, or for clearer structure, the interface could be provided + by a stand-alone module + +The absolute minimum interface consists of just two attributes: + +* ``NAME``: string with add-on name +* ``VERSION``: PEP-440 style version string + +To be any useful, an add-on should implement at least one of the following +callback methods: + +* ``update_addons()``: adds and configures other add-ons +* ``update_settings()``: sets configuration (such as default values for this + add-on and required settings for other extensions) and enables needed + components. +* ``check_configuration()``: receives the fully-initialized ``Crawler`` + instance before it starts running, performs additional dependency and + configuration requirement checks + +Additionally, an add-on may (and should, where appropriate) provide one or more +variables that can be used for automated detection of possible dependency +clashes: + +* ``REQUIRES``: list of built-in or custom components required by this add-on, + as PEP-440 strings +* ``MODIFIES``: list of components whose functionality is affected or replaced + by this add-on (a custom HTTP cache should list ``httpcache`` here) +* ``PROVIDES``: list of components provided by this add-on (e.g. ``mongodb`` + for an extension that provides generic read/write access to a MongoDB + database, releasing other components from having to provide their own + database access methods) + +update_addons() +----------------- + +Called: +~~~~~~~ + +Shortly after initialisation of the ``Crawler`` object. 
+ +Arguments: +~~~~~~~~~~ + +* ``config``: configuration of this add-on +* ``addons``: the add-on manager, providing methods to add and configure add-ons + +Purpose: +~~~~~~~~ + +* Configure and enable related add-ons, useful for 'umbrella add-ons' which + chain-load other add-ons based on the configuration + +Examples: +~~~~~~~~~ :: - def addon_configure(settings): + def update_addons(config, addons): + if 'httpcache' not in addons.enabled: + addons.add('httpcache', {'expiration_secs': 60}) + +or:: + + def update_addons(config, addons): + if 'otheraddon' in addons.enabled: + addons.configs['otheraddon']['some_config_name'] = True + +update_settings() +----------------- + +Called: +~~~~~~~ + +Directly after the ``update_addons()`` callback of all add-ons has been called. + +Arguments: +~~~~~~~~~~ + +* ``config``: configuration of this add-on +* ``settings``: the crawler's ``Settings`` instance containing all project + settings + +Purpose: +~~~~~~~~ + +* Modify ``settings`` to enable required components +* Expose some add-on specific configuration (``config``) into the global + settings namespace (``settings``) if necessary +* Raise exception if components can not be properly configured (e.g. on missing + dependencies); Scrapy will print this exception *and exit* (making users + explicitly acknowledge that the add-on does not work by forcing them to + disable it). + +Side note: +~~~~~~~~~~ + +The ``MiddlewareManager.from_settings()`` method will receive a slight +modification to allow directly placing Python objects instead of class paths +in the middleware dict settings. This way, add-ons can place already +instantiated components into the settings. This allows keeping configuration +as local to components as possible and avoids cluttering up the global +settings namespace. Furthermore, it allows reusing components (e.g. using +two instances of the same mongodb pipeline to write to different locations). + +Examples: +~~~~~~~~~ + +:: + + def update_settings(config, settings): + # Don't care where this module is located + settings.set['DOWNLADER_MIDDLEWARES']({ + __name__ + '.downloadermw.coolmw': 900, + }) + + # Instantiate components to not expose settings into + # the global namespace + from .pipelines import MySQLPipeline + mysqlpl = MySQLPipeline(password = config['password']) + settings.set['ITEM_PIPELINES']({ + mysqlpl: 200, + }) + +or:: + + def update_settings(config, settings): + # Assuming this class also has a process_item() method + settings.set['ITEM_PIPELINES']({ + self: 200, + }) + +or:: + + def update_settings(config, settings): try: import boto except ImportError: raise RuntimeError("boto library is required") +check_configuration() +--------------------- + +Called: +~~~~~~~ + +Shortly before the crawler starts crawling. + +Arguments: +~~~~~~~~~~ + +* ``config``: configuration of this add-on +* ``crawler``: fully-initialized ``Crawler`` object, ready to start crawling + +Purpose: +~~~~~~~~ + +* Perform post-initialization checks like making sure the extension and its + dependencies were configured properly. +* Raise exception if a critical check failed; Scrapy will print this exception + *and exit* (see ``update_settings()`` purpose for rationale on this). + +Examples: +~~~~~~~~~ + +:: + + def check_configuration(config, crawler): + if 'some.other.addon' not in crawler.addons.enabled: + raise RuntimeError("Some other add-on required to use this add-on") + + +Implementation +============== + +A new core component, the *add-on manager*, is introduced to Scrapy. 
It
+facilitates loading add-ons, gathering and providing information on them,
+calling their callbacks at appropriate times, and performing basic checks for
+dependency and configuration clashes.
+
+Layout
+------
+
+A new ``AddonManager`` class is introduced, providing methods to
+
+* add and remove add-ons,
+* search for add-ons by name,
+* read enabled add-ons and their configurations from the settings object and
+  from ``scrapy.cfg``,
+* enable and disable add-ons,
+* check for possible dependency incompatibilities by inspecting the collected
+  ``REQUIRES``, ``MODIFIES`` and ``PROVIDES`` add-on variables,
+* call the add-on callbacks.
+
+Integration into start-up process
+---------------------------------
+
+The settings used to crawl are not complete until the spider-specific settings
+have been loaded in ``Crawler.__init__()``. Add-on management follows this
+approach and only starts loading add-ons when the crawler is initialised.
+
+Instantiation and the calls to ``update_addons()`` and ``update_settings()``
+happen in ``Crawler.__init__()``. The final check (i.e. the callback to
+``check_configuration()``) is coded into the ``Crawler.crawl()`` method after
+creating the engine.
+
+Finding add-ons
+---------------
+
+Add-on localisation is governed by the add-on paths given in
+``INSTALLED_ADDONS`` (or by the section names if using ``scrapy.cfg``). If
+nothing is found at the given path, it is tried again with ``addons.``
+prepended (i.e. pointing to the project's ``addons`` folder or module), then
+with ``scrapy.addons.`` prepended (i.e. pointing to Scrapy's ``addons``
+submodule). If the object found has an ``_addon`` attribute, that attribute
+will be treated as the found add-on. This allows, for example, changing the
+add-on based on the Python version.
+
+Updating existing extensions
+----------------------------
+
+An ``Addon`` class is introduced that add-on developers may or may not
+subclass, depending on how much of the 'default functionality' they want.
+Naturally, it does not provide ``NAME`` and ``VERSION``. Its default
+``update_settings()`` exposes the add-on configuration into the global
+settings namespace under an appropriate name, e.g. this section from
+``scrapy.cfg``::
+
+    [httpcache]
+    dir = /some/dir

-crawler_ready
--------------
+would expose ``HTTPCACHE_DIR``.

-``crawler_ready`` receives a Crawler object after it has been initialized and
-is meant to be used to perform post-initialization checks like making sure the
-extension and its dependencies were configured properly. If it raises an
-exception, Scrapy will print and exit.
+Add-on modules will be written for all built-in extensions and placed in
+``scrapy.addons``. For many default Scrapy components, it will be sufficient to
+create a subclass of ``Addon`` with minor or no method modifications. The
+component code remains where it is (i.e. in ``scrapy.pipelines``, etc.).

-Examples::
+Later, the global settings namespace could be cleaned up in a
+backwards-incompatible fashion by deprecating support for the global setting
+names, e.g. ``HTTPCACHE_DIR``, and instead instantiating the components with
+the add-on configuration in ``update_settings()``.
- def crawler_ready(crawler): - if 'some.other.addon' not in crawler.extensions.enabled: - raise RuntimeError("Some other addon is required to use this addon") From fa2554cf26934c9a6e71f87285ae512cd5153407 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Wed, 19 Aug 2015 16:16:53 +0200 Subject: [PATCH 07/13] Introduce add-ons via AddonManager and Addon base class --- docs/topics/api.rst | 11 + docs/topics/settings.rst | 9 + scrapy/addons/__init__.py | 497 ++++++++++++++++++ scrapy/interfaces.py | 23 +- scrapy/settings/__init__.py | 1 + scrapy/settings/default_settings.py | 2 + scrapy/utils/conf.py | 12 +- scrapy/utils/misc.py | 41 ++ scrapy/utils/project.py | 12 + tests/test_addons/__init__.py | 388 ++++++++++++++ tests/test_addons/addonmod.py | 16 + tests/test_addons/addons.py | 40 ++ tests/test_addons/cfg.cfg | 5 + tests/test_addons/project/__init__.py | 0 tests/test_addons/project/addons/__init__.py | 0 tests/test_addons/project/addons/addonmod.py | 7 + tests/test_addons/project/addons/addonmod2.py | 7 + tests/test_addons/scrapy_addons/__init__.py | 0 tests/test_addons/scrapy_addons/addonmod.py | 7 + tests/test_addons/scrapy_addons/addonmod2.py | 7 + tests/test_addons/scrapy_addons/addonmod3.py | 7 + tests/test_utils_misc/__init__.py | 26 +- tests/test_utils_misc/testmod.py | 1 + tests/test_utils_misc/testpkg/__init__.py | 1 + tests/test_utils_misc/testpkg/submod.py | 1 + tests/test_utils_project.py | 27 + 26 files changed, 1142 insertions(+), 6 deletions(-) create mode 100644 scrapy/addons/__init__.py create mode 100644 tests/test_addons/__init__.py create mode 100644 tests/test_addons/addonmod.py create mode 100644 tests/test_addons/addons.py create mode 100644 tests/test_addons/cfg.cfg create mode 100644 tests/test_addons/project/__init__.py create mode 100644 tests/test_addons/project/addons/__init__.py create mode 100644 tests/test_addons/project/addons/addonmod.py create mode 100644 tests/test_addons/project/addons/addonmod2.py create mode 100644 tests/test_addons/scrapy_addons/__init__.py create mode 100644 tests/test_addons/scrapy_addons/addonmod.py create mode 100644 tests/test_addons/scrapy_addons/addonmod2.py create mode 100644 tests/test_addons/scrapy_addons/addonmod3.py create mode 100644 tests/test_utils_misc/testmod.py create mode 100644 tests/test_utils_misc/testpkg/__init__.py create mode 100644 tests/test_utils_misc/testpkg/submod.py create mode 100644 tests/test_utils_project.py diff --git a/docs/topics/api.rst b/docs/topics/api.rst index 42c0133c13e..0c22b3ce9e9 100644 --- a/docs/topics/api.rst +++ b/docs/topics/api.rst @@ -149,6 +149,17 @@ Settings API .. autoclass:: BaseSettings :members: +.. _topics-api-addonmanager: + +AddonManager API +================ + +.. module:: scrapy.addons + :synopsis: Add-on manager + +.. autoclass:: AddonManager + :members: + .. _topics-api-spiderloader: SpiderLoader API diff --git a/docs/topics/settings.rst b/docs/topics/settings.rst index 642f4eb84a5..de8bcf36813 100644 --- a/docs/topics/settings.rst +++ b/docs/topics/settings.rst @@ -570,6 +570,15 @@ value. For more information See the :ref:`extensions user guide ` and the :ref:`list of available extensions `. +.. setting:: INSTALLED_ADDONS + +INSTALLED_ADDONS +---------------- + +Default: ``()`` + +A tuple containing paths to the add-ons enabled in your project. For more +information, see :ref:`topics-addons`. .. 
setting:: ITEM_PIPELINES diff --git a/scrapy/addons/__init__.py b/scrapy/addons/__init__.py new file mode 100644 index 00000000000..59e59e15ff5 --- /dev/null +++ b/scrapy/addons/__init__.py @@ -0,0 +1,497 @@ +from collections import defaultdict, Mapping +from importlib import import_module +from inspect import isclass +import os +import six +import warnings + +from pkg_resources import WorkingSet, Distribution, Requirement +import zope.interface +from zope.interface.verify import verifyObject + +from scrapy.exceptions import NotConfigured +from scrapy.interfaces import IAddon +from scrapy.settings import BaseSettings +from scrapy.utils.conf import config_from_filepath, get_config +from scrapy.utils.misc import load_module_or_object +from scrapy.utils.project import get_project_path + + +@zope.interface.implementer(IAddon) +class Addon(object): + + basic_settings = None + """``dict`` of settings that will be exported via :meth:`export_basics`.""" + + default_config = None + """``dict`` with default configuration.""" + + config_mapping = None + """``dict`` with mappings from config names to setting names. The given + setting names will be taken as given, i.e. they will be neither prefixed + nor uppercased. + """ + + component_type = None + """Component setting into which to export via :meth:`export_component`. Can + be any of the dictionary-like component setting names (e.g. + ``DOWNLOADER_MIDDLEWARES``) or any of their abbreviations in + :attr:`~scrapy.addons.COMPONENT_TYPE_ABBR`. If ``None``, + :meth:`export_component` will do nothing. + """ + + component_key = None + """Key to be used in the component dictionary setting when exporting via + :meth:`export_component`. This is only useful for the settings that have + no order, e.g. ``DOWNLOAD_HANDLERS`` or ``FEED_EXPORTERS``. + """ + + component_order = 0 + """Component order to use when not given in the add-on configuration. Has + no effect for component types that use :attr:`component_key`. + """ + + component = None + """Component to be inserted via :meth:`export_component`. This can be + anything that can be used in the dictionary-like component settings, i.e. + a class path, a class, or an instance. If ``None``, it is assumed that the + add-on itself is also provides the component interface, and ``self`` will be + used. + """ + + settings_prefix = None + """Prefix with which the add-on configuration will be exported into the + global settings namespace via :meth:`export_config`. If ``None``, + :attr:`name` will be used. If ``False``, no configuration will be exported. + """ + + def export_component(self, config, settings): + """Export the component in :attr:`component` into the dictionary-like + component setting derived from :attr:`component_type`. + + Where applicable, the order parameter of the component (i.e. the + dictionary value) will be retrieved from the ``order`` add-on + configuration value. + + :param config: Add-on configuration from which to read component order + :type config: ``dict`` + + :param settings: Settings object into which to export component + :type settings: :class:`~scrapy.settings.Settings` + """ + if self.component_type: + comp = self.component or self + if self.component_key: + # e.g. for DOWNLOAD_HANDLERS: {'http': 'myclass'} + k = self.component_key + v = comp + else: + # e.g. 
for DOWNLOADER_MIDDLEWARES: {'myclass': 100} + k = comp + v = config.get('order', self.component_order) + settings.set(self.component_type, {k: v}, 'addon') + + def export_basics(self, settings): + """Export the :attr:`basic_settings` attribute into the settings object. + + All settings will be exported with ``addon`` priority (see + :ref:`topics-api-settings`). + + :param settings: Settings object into which to expose the basic settings + :type settings: :class:`~scrapy.settings.Settings` + """ + for setting, value in six.iteritems(self.basic_settings or {}): + settings.set(setting, value, 'addon') + + def export_config(self, config, settings): + """Export the add-on configuration, all keys in caps and with + :attr:`settings_prefix` or :attr:`name` prepended, into the settings + object. + + For example, the add-on configuration ``{'key': 'value'}`` will export + the setting ``ADDONNAME_KEY`` with a value of ``value``. All settings + will be exported with ``addon`` priority (see + :ref:`topics-api-settings`). + + :param config: Add-on configuration to be exposed + :type config: ``dict`` + + :param settings: Settings object into which to export the configuration + :type settings: :class:`~scrapy.settings.Settings` + """ + if self.settings_prefix is False: + return + conf = self.default_config or {} + conf.update(config) + prefix = self.settings_prefix or self.name + # Since default exported config is case-insensitive (everything will be + # uppercased), make mapped config case-insensitive as well + conf_mapping = {k.lower(): v + for k, v in six.iteritems(self.config_mapping or {})} + for key, val in six.iteritems(conf): + if key.lower() in conf_mapping: + key = conf_mapping[key.lower()] + else: + key = (prefix + '_' + key).upper() + settings.set(key, val, 'addon') + + def update_settings(self, config, settings): + """Export both the basic settings and the add-on configuration. I.e., + call :meth:`export_basics` and :meth:`export_config`. + + For more advanced add-ons, you may want to override this callback. + + :param config: Add-on configuration + :type config: ``dict`` + + :param settings: Crawler settings object + :type settings: :class:`~scrapy.settings.Settings` + """ + self.export_component(config, settings) + self.export_basics(settings) + self.export_config(config, settings) + + +class AddonManager(Mapping): + """This class facilitates loading and storing :ref:`topics-addons`. + + You can treat it like a read-only dictionary in which keys correspond to + add-on names and values correspond to the add-on objects:: + + addons = AddonManager() + # ... load some add-ons here + print addons.enabled # prints names of all enabled add-ons + print addons['TestAddon'].version # prints version of add-on with name + # 'TestAddon' + + """ + + def __init__(self): + self._addons = {} + self.configs = {} + self._disable_on_add = [] + + def __getitem__(self, name): + return self._addons[name] + + def __delitem__(self, name): + del self._addons[name] + del self.configs[name] + + def __iter__(self): + return iter(self._addons) + + def __len__(self): + return len(self._addons) + + def add(self, addon, config=None): + """Store an add-on. + + If ``addon`` is a string, it will be treated as add-on path and passed + to :meth:`get_addon`. Otherwise, ``addon`` must be a Python object + implementing or providing Scrapy's add-on interface. The interface + will be enforced through ``zope.interface``'s ``verifyObject()``. + + If ``addon`` is a class, it will be instantiated. 
You can avoid this + (for example if you have implemented the add-on callbacks as class + methods) by declaring -- via ``zope.interface`` -- that your class + directly *provides* ``scrapy.interfaces.IAddon``. + + :param addon: The add-on object (or path) to be stored + :type addon: Any Python object providing the add-on interface or ``str`` + + :param config: The add-on configuration dictionary + :type config: ``dict`` + """ + addon = self.get_addon(addon) + if isclass(addon) and not IAddon.providedBy(addon): + addon = addon() + if not IAddon.providedBy(addon): + zope.interface.alsoProvides(addon, IAddon) + # zope.interface's exceptions are already quite helpful. Still, should + # we catch them and log an error message? + verifyObject(IAddon, addon) + name = addon.name + if name in self: + raise ValueError("Addon '{}' already loaded".format(name)) + self._addons[name] = addon + self.configs[name] = config or {} + if name in self._disable_on_add: + self.configs[name]['_enabled'] = False + self._disable_on_add.remove(name) + + def remove(self, addon): + """Remove an add-on. + + If ``addon`` is the name of a stored add-on, that add-on will be + removed. Otherwise, you can use the argument in the same fashion as + in :meth:`add`. + + :param addon: The add-on name, object, or path to be removed + :type addon: Any Python object providing the add-on interface or ``str`` + """ + if addon in self: + del self[addon] + elif hasattr(addon, 'name') and addon.name in self: + del self[addon.name] + else: + try: + del self[self.get_addon(addon).name] + except NameError: + raise KeyError + + @staticmethod + def get_addon(path): + """Get an add-on object by its Python or file path. + + ``path`` is assumed to be either a Python or a file path of a Scrapy + add-on. If no object is found at ``path``, it is tried again first with + ``projectname.addons`` prepended (pointing to the current project's + ``addons`` folder), then with ``scrapy.addons`` prepended (poiting to + Scrapy's built-in add-ons). These convenience shortcuts will only work + with Python paths, not file paths. + + If the object or module pointed to by ``path`` has an attribute named + ``_addon`` that attribute will be assumed to be the add-on. + :meth:`get_addon` will keep following ``_addon`` attributes until it + finds an object that does not have an attribute named ``_addon``. + + :param path: Python or file path to an add-on + :type path: ``str`` + """ + if isinstance(path, six.string_types): + prefixes = ['', 'scrapy.addons.'] + try: + prefixes.insert(1, get_project_path() + '.addons.') + except NotConfigured: + warnings.warn("Unable to locate project Python path") + for prefix in prefixes: + fullpath = prefix + path + try: + obj = load_module_or_object(fullpath) + except NameError: + pass + else: + break + else: + raise NameError("Could not find add-on '%s'" % path) + else: + obj = path + if hasattr(obj, '_addon'): + obj = AddonManager.get_addon(obj._addon) + return obj + + def load_dict(self, addonsdict): + """Load add-ons and configurations from given dictionary. + + Each add-on should be an entry in the dictionary, where the key + corresponds to the add-on path. The value should be a dictionary + representing the add-on configuration. 
+ + Example add-on dictionary:: + + addonsdict = { + 'path.to.addon1': { + 'setting1': 'value', + 'setting2': 42, + }, + 'path/to/addon2.py': { + 'addon2setting': True, + }, + } + + :param addonsdict: dictionary where keys correspond to add-on paths \ + and values correspond to their configuration + :type addonsdict: ``dict`` + """ + for addonpath, addoncfg in six.iteritems(addonsdict): + self.add(addonpath, addoncfg) + + def load_settings(self, settings): + """Load add-ons and configurations from settings object. + + This will invoke :meth:`get_addon` for every add-on path in the + ``INSTALLED_ADDONS`` setting. For each of these add-ons, the + configuration will be read from the dictionary setting whose name + matches the uppercase add-on name. + + :param settings: The :class:`~scrapy.settings.Settings` object from \ + which to read the add-on configuration + :type settings: :class:`~scrapy.settings.Settings` + """ + paths = settings.getlist('INSTALLED_ADDONS') + addons = [self.get_addon(path) for path in paths] + configs = [settings.getdict(addon.name.upper()) for addon in addons] + for a, c in zip(addons, configs): + self.add(a, c) + + def load_cfg(self, cfg=None): + """Load add-ons and configurations from given ``ConfigParser`` object or + config file path. + + Each add-on should have its own section, where the section has a name in + the form ``addon:my_addon_path``. The add-on object is searched for via + the :meth:`get_addon` method, ``my_addon_path`` can be either a Python + or a file path. + + If ``cfg`` is ``None``, ``scrapy.cfg`` will be used. + + :param cfg: ``ConfigParser`` object or config file path from which to \ + read add-on configuration + :type cfg: ``ConfigParser`` or ``str`` + """ + if cfg is None: + cfg = get_config() + elif isinstance(cfg, six.string_types): + cfg = config_from_filepath(cfg) + for secname in cfg.sections(): + if secname.startswith("addon:"): + addonkey = secname.split("addon:", 1)[1] + addoncfg = dict(cfg.items(secname)) + self.add(addonkey, addoncfg) + + def check_dependency_clashes(self): + """Check for incompatibilities in add-on dependencies. + + Add-ons can provide information about their dependencies in their + ``provides``, ``modifies`` and ``requires`` attributes. This method will + raise an ``ImportError`` if + + * a component required by an add-on is not provided by any other add-on, + or + * a component modified by an add-on is not provided by any other add-on, + or + * the same component is provided by more than one add-on, + + and warn when a component required by an add-on is modified by any other + add-on. 
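A minimal sketch of the attributes this check inspects (the add-on names and
version strings below are invented for illustration)::

    from scrapy.addons import Addon, AddonManager

    class MongoPipelineAddon(Addon):
        name = 'mongopipeline'
        version = '1.0'
        requires = ['mongodb>=1.0']   # must be provided by another add-on

    class MongoDBAddon(Addon):
        name = 'mongodb'              # every add-on also "provides" its own name
        version = '1.2'

    manager = AddonManager()
    manager.add(MongoPipelineAddon)
    manager.add(MongoDBAddon)
    manager.check_dependency_clashes()  # passes: 'mongodb' 1.2 satisfies 'mongodb>=1.0'
    # Without MongoDBAddon, the same call would raise ImportError for the
    # missing 'mongodb' component.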
+ """ + # Collect all active add-ons and the components they provide + ws = WorkingSet('') + def add_dist(project_name, version, **kwargs): + if project_name in ws.entry_keys.get('scrapy', []): + raise ImportError("Component {} provided by multiple add-ons" + "".format(project_name)) + else: + dist = Distribution(project_name=project_name, version=version, + **kwargs) + ws.add(dist, entry='scrapy') + for name in self: + ver = self[name].version + add_dist(name, ver) + for provides_name in getattr(self[name], 'provides', []): + add_dist(provides_name, ver) + + # Collect all required and modified components + def compile_attribute_dict(attribute_name): + attrs = defaultdict(list) + for name in self: + for entry in getattr(self[name], attribute_name, []): + attrs[entry].append(name) + return attrs + modified = compile_attribute_dict('modifies') + required = compile_attribute_dict('requires') + + req_or_mod = set(required.keys()).union(modified.keys()) + for reqstr in req_or_mod: + req = Requirement.parse(reqstr) + # May raise VersionConflict. Do we want to catch it and raise + # our own exception or is it helpful enough? + if ws.find(req) is None: + raise ImportError( + "Add-ons {} require or modify missing component {}" + "".format(required[reqstr]+modified[reqstr], reqstr)) + + mod_and_req = set(required.keys()).intersection(modified.keys()) + for conflict in mod_and_req: + warnings.warn("Component '{}', required by add-ons {}, is modified " + "by add-ons {}".format(conflict, required[conflict], + modified[conflict])) + + def disable(self, addon): + """Disable an add-on, i.e. prevent its callbacks from being called. + + If you disable an add-on before it is loaded, it will be disabled as + soon as it is added to the :class:`AddonManager`. + + :param addon: Name of the add-on to be disabled + :type addon: ``str`` + """ + if addon in self: + self.configs[addon]['_enabled'] = False + else: + self._disable_on_add.append(addon) + + def enable(self, addon): + """Re-enable a disabled add-on. + + Will raise ``ValueError`` if the add-on is neither already loaded nor + marked for being disabled on adding. + + :param addon: Name of the add-on to be enabled + :type addon: ``str`` + """ + if addon in self: + self.configs[addon]['_enabled'] = True + elif addon in self._disable_on_add: + self._disable_on_add.remove(addon) + else: + raise ValueError("Add-ons need to be added before they can be " + "enabled") + + @property + def disabled(self): + """Names of disabled add-ons""" + return ([a for a in self if not self.configs[a].get('_enabled', True)] + + self._disable_on_add) + + @property + def enabled(self): + """Names of enabled add-ons""" + return [a for a in self if self.configs[a].get('_enabled', True)] + + def _call_if_exists(self, obj, cbname, *args, **kwargs): + if obj is None: + return + try: + cb = getattr(obj, cbname) + except AttributeError: + return + else: + cb(*args, **kwargs) + + def _call_addon(self, addonname, cbname, *args, **kwargs): + if self.configs[addonname].get('_enabled', True): + self._call_if_exists(self[addonname], cbname, + self.configs[addonname], *args, **kwargs) + + def update_addons(self): + """Call ``update_addons()`` of all held add-ons. + + This will also call ``update_addons()`` of all add-ons that are added + last minute during the ``update_addons()`` routine of other add-ons. 
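A minimal end-to-end sketch of driving the manager (the ``LoggingAddon``
add-on, its name, and its configuration key are invented for illustration;
the call order mirrors the crawler integration described in the SEP above)::

    from scrapy.addons import Addon, AddonManager
    from scrapy.settings import Settings

    class LoggingAddon(Addon):
        name = 'mylogging'
        version = '1.0'
        basic_settings = {'LOG_LEVEL': 'INFO'}

    settings = Settings()
    manager = AddonManager()
    manager.add(LoggingAddon, {'key': 'value'})
    manager.update_addons()             # add-ons may chain-load further add-ons here
    manager.update_settings(settings)   # exports LOG_LEVEL and MYLOGGING_KEY
    manager.check_dependency_clashes()
    settings.getpriority('LOG_LEVEL')   # 15, the new 'addon' priority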
+ """ + called_addons = set() + while called_addons != set(self): + for name in set(self).difference(called_addons): + called_addons.add(name) + self._call_addon(name, 'update_addons', self) + + def update_settings(self, settings): + """Call ``update_settings()`` of all held add-ons. + + :param settings: The :class:`~scrapy.settings.Settings` object to be \ + updated + :type settings: :class:`~scrapy.settings.Settings` + """ + for name in self: + self._call_addon(name, 'update_settings', settings) + + def check_configuration(self, crawler): + """Call ``check_configuration()`` of all held add-ons. + + :param crawler: the fully-initialized crawler + :type crawler: :class:`~scrapy.crawler.Crawler` + """ + for name in self: + self._call_addon(name, 'check_configuration', crawler) diff --git a/scrapy/interfaces.py b/scrapy/interfaces.py index eb93c6f7e2a..75b72899e8a 100644 --- a/scrapy/interfaces.py +++ b/scrapy/interfaces.py @@ -1,6 +1,6 @@ -from zope.interface import Interface +import zope.interface -class ISpiderLoader(Interface): +class ISpiderLoader(zope.interface.Interface): def from_settings(settings): """Return an instance of the class for the given settings""" @@ -20,3 +20,22 @@ def find_by_request(request): # ISpiderManager is deprecated, don't use it! # An alias is kept for backwards compatibility. ISpiderManager = ISpiderLoader + + +class IAddon(zope.interface.Interface): + """Scrapy add-on""" + + name = zope.interface.Attribute("""Add-on name""") + version = zope.interface.Attribute("""Add-on version string (PEP440)""") + + # XXX: Can methods be declared optional? I.e., can I enforce the signature + # but not the existence of a method? + + #def update_addons(config, addons): + # """Enables and configures other add-ons""" + + #def update_settings(config, settings): + # """Modifies `settings` to enable and configure required components""" + + #def check_configuration(config, crawler): + # """Performs post-initialization checks on fully configured `crawler`""" diff --git a/scrapy/settings/__init__.py b/scrapy/settings/__init__.py index 1216aabcb54..a5b62e27a7b 100644 --- a/scrapy/settings/__init__.py +++ b/scrapy/settings/__init__.py @@ -14,6 +14,7 @@ SETTINGS_PRIORITIES = { 'default': 0, 'command': 10, + 'addon': 15, 'project': 20, 'spider': 30, 'cmdline': 40, diff --git a/scrapy/settings/default_settings.py b/scrapy/settings/default_settings.py index bca8c99b240..79f252a81cd 100644 --- a/scrapy/settings/default_settings.py +++ b/scrapy/settings/default_settings.py @@ -160,6 +160,8 @@ HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy' HTTPCACHE_GZIP = False +INSTALLED_ADDONS = () + ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager' ITEM_PIPELINES = {} diff --git a/scrapy/utils/conf.py b/scrapy/utils/conf.py index 80c64465706..7b9b3e38d80 100644 --- a/scrapy/utils/conf.py +++ b/scrapy/utils/conf.py @@ -81,14 +81,20 @@ def init_env(project='default', set_syspath=True): sys.path.append(projdir) -def get_config(use_closest=True): - """Get Scrapy config file as a SafeConfigParser""" - sources = get_sources(use_closest) +def config_from_filepath(sources): + """Create a SafeConfigParser and read in the given `sources`, which can be + either a filename or a list of filenames.""" cfg = SafeConfigParser() cfg.read(sources) return cfg +def get_config(use_closest=True): + """Get Scrapy config file as a SafeConfigParser""" + sources = get_sources(use_closest) + return config_from_filepath(sources) + + def get_sources(use_closest=True): xdg_config_home = 
os.environ.get('XDG_CONFIG_HOME') or \ os.path.expanduser('~/.config') diff --git a/scrapy/utils/misc.py b/scrapy/utils/misc.py index 3b5c9436e4d..b2b614e41e7 100644 --- a/scrapy/utils/misc.py +++ b/scrapy/utils/misc.py @@ -1,5 +1,8 @@ """Helper functions which doesn't fit anywhere else""" +import itertools +import os.path import re +import sys import hashlib from importlib import import_module from pkgutil import iter_modules @@ -56,6 +59,26 @@ def load_object(path): return obj +def load_module_or_object(path): + """Load python module or (non-module) object from given path. + + Path can be both a Python or a file path. + """ + try: + return import_module(path) + except ImportError: + pass + try: + return load_object(path) + except (ValueError, NameError, ImportError): + pass + try: + return get_module_from_filepath(path) + except ImportError: + pass + raise NameError("Could not load '%s'" % path) + + def walk_modules(path): """Loads a module and all its submodules from a the given module path and returns them. If *any* module throws an exception while importing, that @@ -78,6 +101,23 @@ def walk_modules(path): return mods +def get_module_from_filepath(path): + """Load and return a python module/package from a file path""" + path = path.rstrip("/") + if path.endswith('.py'): + path = path.rsplit('.py', 1)[0] + basefolder, modname = os.path.split(path) + # XXX: There are other ways to import modules from a full path which don't + # need to modify PYTHONPATH, see + # https://stackoverflow.com/questions/67631/ + # These methods differ between py2 and py3, and apparently the + # py3 method was deprecated in Python 3.4 + sys.path.insert(0, basefolder) + mod = import_module(modname) + sys.path.pop(0) + return mod + + def extract_regex(regex, text, encoding='utf-8'): """Extract a list of unicode strings from the given text/encoding using the following policies: @@ -117,3 +157,4 @@ def md5sum(file): break m.update(d) return m.hexdigest() + diff --git a/scrapy/utils/project.py b/scrapy/utils/project.py index a15a0d90f37..a1266c87944 100644 --- a/scrapy/utils/project.py +++ b/scrapy/utils/project.py @@ -71,3 +71,15 @@ def get_project_settings(): settings.setdict(env_overrides, priority='project') return settings + +def get_project_path(): + """Return the Python path of the current project. + + This fails when the settings module does not live in the project's root. + """ + if not inside_project(): + raise NotConfigured("Not inside a project") + settings_module_path = os.environ.get(ENVVAR) + if not settings_module_path: + raise NotConfigured("Unable to locate project's python path") + return settings_module_path.rsplit('.', 1)[0] diff --git a/tests/test_addons/__init__.py b/tests/test_addons/__init__.py new file mode 100644 index 00000000000..c98f0ab6581 --- /dev/null +++ b/tests/test_addons/__init__.py @@ -0,0 +1,388 @@ +import os.path +import six +from six.moves.configparser import SafeConfigParser +import sys +from tests import mock +import unittest +import warnings + +from pkg_resources import VersionConflict +import zope.interface +from zope.interface.verify import verifyObject +from zope.interface.exceptions import BrokenImplementation + +import scrapy.addons +from scrapy.addons import Addon, AddonManager +from scrapy.crawler import Crawler +from scrapy.interfaces import IAddon +from scrapy.settings import BaseSettings, Settings + +from . import addons +from . 
import addonmod + + +class AddonTest(unittest.TestCase): + + def setUp(self): + self.rawaddon = Addon() + class AddonWithAttributes(Addon): + name = 'Test' + version = '1.0' + self.testaddon = AddonWithAttributes() + + def test_interface(self): + # Raw Addon should fail exactly b/c name and version are not given + self.assertFalse(hasattr(self.rawaddon, 'name')) + self.assertFalse(hasattr(self.rawaddon, 'version')) + self.assertRaises(BrokenImplementation, verifyObject, IAddon, + self.rawaddon) + verifyObject(IAddon, self.testaddon) + + def test_export_component(self): + settings = BaseSettings({'ITEM_PIPELINES': {}}, 'default') + self.testaddon.component_type = None + self.testaddon.export_component({}, settings) + self.assertEqual(len(settings['ITEM_PIPELINES']), 0) + self.testaddon.component_type = 'ITEM_PIPELINES' + self.testaddon.component = 'test.component' + self.testaddon.export_component({}, settings) + six.assertCountEqual(self, settings['ITEM_PIPELINES'], + ['test.component']) + self.assertEqual(settings['ITEM_PIPELINES']['test.component'], 0) + self.testaddon.component_order = 313 + self.testaddon.export_component({}, settings) + self.assertEqual(settings['ITEM_PIPELINES']['test.component'], 313) + self.testaddon.component_type = 'DOWNLOAD_HANDLERS' + self.testaddon.component_key = 'http' + self.testaddon.export_component({}, settings) + self.assertEqual(settings['DOWNLOAD_HANDLERS']['http'], + 'test.component') + + def test_export_basics(self): + settings = BaseSettings() + self.testaddon.basic_settings = {'TESTKEY': 313, 'OTHERKEY': True} + self.testaddon.export_basics(settings) + self.assertEqual(settings['TESTKEY'], 313) + self.assertEqual(settings['OTHERKEY'], True) + self.assertEqual(settings.getpriority('TESTKEY'), 15) + + def test_export_config(self): + settings = BaseSettings() + self.testaddon.settings_prefix = None + self.testaddon.config_mapping = {'MAPPED_key': 'MAPPING_WORKED'} + self.testaddon.default_config = {'key': 55, 'defaultkey': 100} + self.testaddon.export_config({'key': 313, 'OTHERKEY': True, + 'mapped_KEY': 99}, settings) + self.assertEqual(settings['TEST_KEY'], 313) + self.assertEqual(settings['TEST_DEFAULTKEY'], 100) + self.assertEqual(settings['TEST_OTHERKEY'], True) + self.assertNotIn('MAPPED_key', settings) + self.assertNotIn('MAPPED_KEY', settings) + self.assertEqual(settings['MAPPING_WORKED'], 99) + self.assertEqual(settings.getpriority('TEST_KEY'), 15) + + self.testaddon.settings_prefix = 'PREF' + self.testaddon.export_config({'newkey': 99}, settings) + self.assertEqual(settings['PREF_NEWKEY'], 99) + + with mock.patch.object(settings, 'set') as mock_set: + self.testaddon.settings_prefix = False + self.testaddon.export_config({'thirdnewkey': 99}, settings) + self.assertEqual(mock_set.call_count, 0) + + def test_update_settings(self): + settings = BaseSettings() + settings.set('TEST_KEY1', 'default', priority='default') + settings.set('TEST_KEY2', 'project', priority='project') + self.testaddon.settings_prefix = None + self.testaddon.basic_settings = {'OTHERTEST_KEY': 'addon'} + addon_config = {'key1': 'addon', 'key2': 'addon', 'key3': 'addon'} + self.testaddon.update_settings(addon_config, settings) + self.assertEqual(settings['OTHERTEST_KEY'], 'addon') + self.assertEqual(settings['TEST_KEY1'], 'addon') + self.assertEqual(settings['TEST_KEY2'], 'project') + self.assertEqual(settings['TEST_KEY3'], 'addon') + + +class AddonManagerTest(unittest.TestCase): + + TESTCFGPATH = os.path.join(os.path.dirname(__file__), 'cfg.cfg') + ADDONMODPATH = 
os.path.join(os.path.dirname(__file__), 'addonmod.py') + + def setUp(self): + self.manager = AddonManager() + + def test_add(self): + manager = AddonManager() + manager.add(addonmod, {'key': 'val1'}) + manager.add('tests.test_addons.addons.GoodAddon') + six.assertCountEqual(self, manager, ['AddonModule', 'GoodAddon']) + self.assertIsInstance(manager['GoodAddon'], addons.GoodAddon) + six.assertCountEqual(self, manager.configs['AddonModule'], ['key']) + self.assertEqual(manager.configs['AddonModule']['key'], 'val1') + self.assertRaises(ValueError, manager.add, addonmod) + + def test_add_dont_instantiate_providing_classes(self): + class ProviderGoodAddon(addons.GoodAddon): + pass + zope.interface.directlyProvides(ProviderGoodAddon, IAddon) + manager = AddonManager() + manager.add(ProviderGoodAddon) + self.assertIs(manager['GoodAddon'], ProviderGoodAddon) + + def test_add_verifies(self): + brokenaddon = self.manager.get_addon( + 'tests.test_addons.addons.BrokenAddon') + self.assertRaises(zope.interface.exceptions.BrokenImplementation, + self.manager.add, + brokenaddon) + + def test_add_adds_missing_interface_declaration(self): + class GoodAddonWithoutDeclaration(object): + name = 'GoodAddonWithoutDeclaration' + version = '1.0' + self.manager.add(GoodAddonWithoutDeclaration) + + def test_remove(self): + manager = AddonManager() + def test_gets_removed(removearg): + manager.add(addonmod) + self.assertIn('AddonModule', manager) + manager.remove(removearg) + self.assertNotIn('AddonModule', manager) + test_gets_removed('AddonModule') + test_gets_removed(addonmod) + test_gets_removed('tests.test_addons.addonmod') + test_gets_removed(self.ADDONMODPATH) + self.assertRaises(KeyError, manager.remove, 'nonexistent') + self.assertRaises(KeyError, manager.remove, addons.GoodAddon()) + + def test_get_addon(self): + goodaddon = self.manager.get_addon( + 'tests.test_addons.addons.GoodAddon') + self.assertIs(goodaddon, addons.GoodAddon) + + loaded_addonmod = self.manager.get_addon(self.ADDONMODPATH) + # XXX: The module is in fact imported twice under different names into + # sys.modules, is there a good assertion for module equality? + self.assertEqual(loaded_addonmod.name, addonmod.name) + + # Does not provide interface, but has _addon attribute pointing to + # GoodAddon instance + addonspath = os.path.join(os.path.dirname(__file__), 'addons.py') + goodaddon = self.manager.get_addon(addonspath) + # XXX: Again, the imported class and addons.GoodAddon are different + # since they are imported twice. How to use isInstance? 
+ self.assertEqual(goodaddon.name, addons.GoodAddon.name) + + self.assertRaises(NameError, self.manager.get_addon, 'xy.n_onexistent') + + def test_get_addon_forward(self): + class SomeCls(object): + _addon = 'tests.test_addons.addons.GoodAddon' + self.assertIs(self.manager.get_addon(SomeCls()), addons.GoodAddon) + + def test_get_addon_nested(self): + x = addons.GoodAddon('outer') + x._addon = addons.GoodAddon('middle') + x._addon._addon = addons.GoodAddon('inner') + self.assertIs(self.manager.get_addon(x), x._addon._addon) + + @mock.patch.object(scrapy.addons, 'get_project_path', + return_value='tests.test_addons.project') + def test_get_addon_prefixes(self, get_project_path_mock): + # From python path + self.assertEqual(self.manager.get_addon('addonmod').FROM, + 'test_addons.addonmod') + + # From project 'addons' folder + self.assertEqual(self.manager.get_addon('addonmod2').FROM, + 'test_addons.project.addons.addonmod2') + # Assert prefix priority '' > 'project.addons' + self.assertEqual(self.manager.get_addon('addonmod').FROM, + 'test_addons.addonmod') + + # From scrapy's 'addons' + from . import scrapy_addons + with mock.patch.dict('sys.modules', {'scrapy.addons': scrapy_addons}): + self.assertEqual(self.manager.get_addon('addonmod3').FROM, + 'test_addons.scrapy_addons.addonmod3') + # Assert prefix priority 'project.addons' > 'scrapy.addons' + self.assertEqual(self.manager.get_addon('addonmod2').FROM, + 'test_addons.project.addons.addonmod2') + # Assert prefix priority '' > 'scrapy.addons.' + self.assertEqual(self.manager.get_addon('addonmod').FROM, + 'test_addons.addonmod') + + def test_load_dict_load_settings(self): + def _test_load_method(func, *args, **kwargs): + manager = AddonManager() + getattr(manager, func)(*args, **kwargs) + six.assertCountEqual(self, manager, ['GoodAddon', 'AddonModule']) + self.assertIsInstance(manager['GoodAddon'], addons.GoodAddon) + six.assertCountEqual(self, manager.configs['GoodAddon'], + ['key']) + self.assertEqual(manager.configs['GoodAddon']['key'], 'val2') + # XXX: Check module equality, see above + self.assertEqual(manager['AddonModule'].name, addonmod.name) + self.assertIn('key', manager.configs['AddonModule']) + self.assertEqual(manager.configs['AddonModule']['key'], 'val1') + + addonsdict = { + self.ADDONMODPATH: { + 'key': 'val1', + }, + 'tests.test_addons.addons.GoodAddon': {'key': 'val2'}, + } + _test_load_method('load_dict', addonsdict) + + settings = BaseSettings() + settings.set('INSTALLED_ADDONS', [ + self.ADDONMODPATH, + 'tests.test_addons.addons.GoodAddon', + ]) + settings.set('ADDONMODULE', {'key': 'val1'}) + settings.set('GOODADDON', {'key': 'val2'}) + _test_load_method('load_settings', settings) + + def test_load_cfg(self): + manager = AddonManager() + manager.load_cfg(self.TESTCFGPATH) + six.assertCountEqual(self, manager, ['GoodAddon', 'AddonModule']) + self.assertIsInstance(manager['GoodAddon'], addons.GoodAddon) + six.assertCountEqual(self, manager.configs['GoodAddon'], ['key']) + self.assertEqual(manager.configs['GoodAddon']['key'], 'val1') + # XXX: Check module equality, see above + self.assertEqual(manager['AddonModule'].name, addonmod.name) + six.assertCountEqual(self, manager.configs['AddonModule'], ['key']) + self.assertEqual(manager.configs['AddonModule']['key'], 'val2') + + def test_enabled_disabled(self): + manager = AddonManager() + manager.add(addons.GoodAddon('FirstAddon')) + manager.add(addons.GoodAddon('SecondAddon')) + self.assertEqual(set(manager.enabled), + set(('FirstAddon', 'SecondAddon'))) + 
self.assertEqual(manager.disabled, []) + manager.disable('FirstAddon') + self.assertEqual(manager.enabled, ['SecondAddon']) + self.assertEqual(manager.disabled, ['FirstAddon']) + manager.enable('FirstAddon') + self.assertEqual(set(manager.enabled), + set(('FirstAddon', 'SecondAddon'))) + self.assertEqual(manager.disabled, []) + + def test_enable_before_add(self): + manager = AddonManager() + self.assertRaises(ValueError, manager.enable, 'FirstAddon') + manager.disable('FirstAddon') + manager.enable('FirstAddon') + manager.add(addons.GoodAddon('FirstAddon')) + self.assertIn('FirstAddon', manager.enabled) + + def test_disable_before_add(self): + manager = AddonManager() + manager.disable('FirstAddon') + manager.add(addons.GoodAddon('FirstAddon')) + self.assertEqual(manager.disabled, ['FirstAddon']) + + def test_callbacks(self): + first_addon = addons.GoodAddon('FirstAddon') + second_addon = addons.GoodAddon('SecondAddon') + + manager = AddonManager() + manager.add(first_addon, {'test': 'first'}) + manager.add(second_addon, {'test': 'second'}) + crawler = mock.create_autospec(Crawler) + settings = BaseSettings() + + with mock.patch.object(first_addon, 'update_addons') as ua_first, \ + mock.patch.object(second_addon, 'update_addons') as ua_second, \ + mock.patch.object(first_addon, 'update_settings') as us_first, \ + mock.patch.object(second_addon, 'update_settings') as us_second, \ + mock.patch.object(first_addon, 'check_configuration') as cc_first, \ + mock.patch.object(second_addon, 'check_configuration') as cc_second: + manager.update_addons() + ua_first.assert_called_once_with(manager.configs['FirstAddon'], + manager) + ua_second.assert_called_once_with(manager.configs['SecondAddon'], + manager) + manager.update_settings(settings) + us_first.assert_called_once_with(manager.configs['FirstAddon'], + settings) + us_second.assert_called_once_with(manager.configs['SecondAddon'], + settings) + manager.check_configuration(crawler) + cc_first.assert_called_once_with(manager.configs['FirstAddon'], + crawler) + cc_second.assert_called_once_with(manager.configs['SecondAddon'], + crawler) + self.assertEqual(ua_first.call_count, 1) + self.assertEqual(ua_second.call_count, 1) + self.assertEqual(us_first.call_count, 1) + self.assertEqual(us_second.call_count, 1) + + us_first.reset_mock() + us_second.reset_mock() + manager.disable('FirstAddon') + manager.update_settings(settings) + self.assertEqual(us_first.call_count, 0) + manager.enable('FirstAddon') + manager.update_settings(settings) + self.assertEqual(us_first.call_count, 1) + self.assertEqual(us_second.call_count, 2) + + def test_update_addons_last_minute_add(self): + class AddedAddon(addons.GoodAddon): + name = 'AddedAddon' + + class FirstAddon(addons.GoodAddon): + name = 'FirstAddon' + def update_addons(self, config, addons): + addons.add(AddedAddon()) + + manager = AddonManager() + first_addon = FirstAddon() + with mock.patch.object(first_addon, 'update_addons', + wraps=first_addon.update_addons) as ua_first, \ + mock.patch.object(AddedAddon, 'update_addons') as ua_added: + manager.add(first_addon, {'non-empty': 'dict'}) + manager.update_addons() + six.assertCountEqual(self, manager, ['FirstAddon', 'AddedAddon']) + ua_first.assert_called_once_with(manager.configs['FirstAddon'], + manager) + ua_added.assert_called_once_with(manager.configs['AddedAddon'], + manager) + + def test_check_dependency_clashes_attributes(self): + provides = addons.GoodAddon("ProvidesAddon") + provides.provides = ('test', ) + provides2 = addons.GoodAddon("ProvidesAddon2") 
+ provides2.provides = ('test', ) + requires = addons.GoodAddon("RequiresAddon") + requires.requires = ('test', ) + requires_name = addons.GoodAddon("RequiresNameAddon") + requires_name.requires = ('ProvidesAddon', ) + requires_newer = addons.GoodAddon("RequiresNewerAddon") + requires_newer.requires = ('test>=2.0', ) + modifies = addons.GoodAddon("ModifiesAddon") + modifies.modifies = ('test', ) + + def check_with(*addons): + manager = AddonManager() + for a in addons: + manager.add(a) + return manager.check_dependency_clashes() + + self.assertRaises(ImportError, check_with, requires) + self.assertRaises(ImportError, check_with, modifies) + self.assertRaises(ImportError, check_with, provides, provides2) + self.assertRaises(VersionConflict, check_with, provides, requires_newer) + with warnings.catch_warnings(record=True) as w: + check_with(provides, modifies) + check_with(provides) + check_with(provides, requires) + check_with(provides, requires_name) + self.assertEqual(len(w), 0) + check_with(requires, provides, modifies) + self.assertEqual(len(w), 1) diff --git a/tests/test_addons/addonmod.py b/tests/test_addons/addonmod.py new file mode 100644 index 00000000000..8ecf4b81d63 --- /dev/null +++ b/tests/test_addons/addonmod.py @@ -0,0 +1,16 @@ +import zope.interface + +from scrapy.interfaces import IAddon + +zope.interface.moduleProvides(IAddon) + +FROM = "test_addons.addonmod" + +name = "AddonModule" +version = "1.0" + +def update_settings(config, settings): + pass + +def check_configuration(config, crawler): + pass diff --git a/tests/test_addons/addons.py b/tests/test_addons/addons.py new file mode 100644 index 00000000000..f3442b192b1 --- /dev/null +++ b/tests/test_addons/addons.py @@ -0,0 +1,40 @@ +import zope.interface + +from scrapy.addons import Addon +from scrapy.interfaces import IAddon + + +class Addon(object): + FROM = 'test_addons.addons' + + +@zope.interface.declarations.implementer(IAddon) +class GoodAddon(object): + + name = 'GoodAddon' + version = '1.0' + + def __init__(self, name=None, version=None): + if name is not None: + self.name = name + if version is not None: + self.version = version + + def update_addons(self, config, addons): + pass + + def update_settings(self, config, settings): + pass + + def check_configuration(self, config, crawler): + pass + + +@zope.interface.declarations.implementer(IAddon) +class BrokenAddon(object): + + name = 'BrokenAddon' + # No version + + +_addon = GoodAddon() diff --git a/tests/test_addons/cfg.cfg b/tests/test_addons/cfg.cfg new file mode 100644 index 00000000000..98c4f0f2532 --- /dev/null +++ b/tests/test_addons/cfg.cfg @@ -0,0 +1,5 @@ +[addon:tests.test_addons.addons.GoodAddon] +key = val1 + +[addon:tests/test_addons/addonmod.py] +key = val2 diff --git a/tests/test_addons/project/__init__.py b/tests/test_addons/project/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/test_addons/project/addons/__init__.py b/tests/test_addons/project/addons/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/test_addons/project/addons/addonmod.py b/tests/test_addons/project/addons/addonmod.py new file mode 100644 index 00000000000..66ca644f8f5 --- /dev/null +++ b/tests/test_addons/project/addons/addonmod.py @@ -0,0 +1,7 @@ +import zope.interface + +from scrapy.interfaces import IAddon + +zope.interface.moduleProvides(IAddon) + +FROM = 'test_addons.project.addons.addonmod' diff --git a/tests/test_addons/project/addons/addonmod2.py b/tests/test_addons/project/addons/addonmod2.py new file 
mode 100644 index 00000000000..0dbdd70ff88 --- /dev/null +++ b/tests/test_addons/project/addons/addonmod2.py @@ -0,0 +1,7 @@ +import zope.interface + +from scrapy.interfaces import IAddon + +zope.interface.moduleProvides(IAddon) + +FROM = 'test_addons.project.addons.addonmod2' diff --git a/tests/test_addons/scrapy_addons/__init__.py b/tests/test_addons/scrapy_addons/__init__.py new file mode 100644 index 00000000000..e69de29bb2d diff --git a/tests/test_addons/scrapy_addons/addonmod.py b/tests/test_addons/scrapy_addons/addonmod.py new file mode 100644 index 00000000000..fa479aa68ba --- /dev/null +++ b/tests/test_addons/scrapy_addons/addonmod.py @@ -0,0 +1,7 @@ +import zope.interface + +from scrapy.interfaces import IAddon + +zope.interface.moduleProvides(IAddon) + +FROM = 'test_addons.scrapy_addons.addonmod' diff --git a/tests/test_addons/scrapy_addons/addonmod2.py b/tests/test_addons/scrapy_addons/addonmod2.py new file mode 100644 index 00000000000..da053af4ae3 --- /dev/null +++ b/tests/test_addons/scrapy_addons/addonmod2.py @@ -0,0 +1,7 @@ +import zope.interface + +from scrapy.interfaces import IAddon + +zope.interface.moduleProvides(IAddon) + +FROM = 'test_addons.scrapy_addons.addonmod2' diff --git a/tests/test_addons/scrapy_addons/addonmod3.py b/tests/test_addons/scrapy_addons/addonmod3.py new file mode 100644 index 00000000000..c645214789d --- /dev/null +++ b/tests/test_addons/scrapy_addons/addonmod3.py @@ -0,0 +1,7 @@ +import zope.interface + +from scrapy.interfaces import IAddon + +zope.interface.moduleProvides(IAddon) + +FROM = 'test_addons.scrapy_addons.addonmod3' diff --git a/tests/test_utils_misc/__init__.py b/tests/test_utils_misc/__init__.py index 06af3c00940..f33562b7d90 100644 --- a/tests/test_utils_misc/__init__.py +++ b/tests/test_utils_misc/__init__.py @@ -3,7 +3,8 @@ import unittest from scrapy.item import Item, Field -from scrapy.utils.misc import load_object, arg_to_iter, walk_modules +from scrapy.utils.misc import (load_object, load_module_or_object, arg_to_iter, + walk_modules, get_module_from_filepath) __doctests__ = ['scrapy.utils.misc'] @@ -17,6 +18,15 @@ def test_load_object(self): self.assertRaises(ImportError, load_object, 'nomodule999.mod.function') self.assertRaises(NameError, load_object, 'scrapy.utils.misc.load_object999') + def test_load_module_or_object(self): + testmod = load_module_or_object(__name__ + '.testmod') + self.assertTrue(hasattr(testmod, 'TESTVAR')) + testmod = load_module_or_object( + os.path.join(os.path.dirname(__file__), 'testmod.py')) + self.assertTrue(hasattr(testmod, 'TESTVAR')) + obj = load_object('scrapy.utils.misc.load_object') + self.assertIs(obj, load_object) + def test_walk_modules(self): mods = walk_modules('tests.test_utils_misc.test_walk_modules') expected = [ @@ -57,6 +67,20 @@ def test_walk_modules_egg(self): finally: sys.path.remove(egg) + def test_get_module_from_filepath(self): + testmodpath = os.path.join(os.path.dirname(__file__), 'testmod.py') + testmod = get_module_from_filepath(testmodpath) + self.assertTrue(hasattr(testmod, 'TESTVAR')) + + testpkgpath = os.path.join(os.path.dirname(__file__), 'testpkg') + testpkg = get_module_from_filepath(testpkgpath) + self.assertTrue(hasattr(testpkg, 'TESTVAR2')) + # Check submodule access + import testpkg.submod + self.assertTrue(hasattr(testpkg.submod, 'TESTVAR3')) + self.assertIs(testpkg.submod.TESTVAR3, + load_object(testpkg.__name__ + ".submod.TESTVAR3")) + def test_arg_to_iter(self): class TestItem(Item): diff --git a/tests/test_utils_misc/testmod.py 
b/tests/test_utils_misc/testmod.py new file mode 100644 index 00000000000..eb540335fdf --- /dev/null +++ b/tests/test_utils_misc/testmod.py @@ -0,0 +1 @@ +TESTVAR = True diff --git a/tests/test_utils_misc/testpkg/__init__.py b/tests/test_utils_misc/testpkg/__init__.py new file mode 100644 index 00000000000..12cc2f6d9e6 --- /dev/null +++ b/tests/test_utils_misc/testpkg/__init__.py @@ -0,0 +1 @@ +TESTVAR2 = True diff --git a/tests/test_utils_misc/testpkg/submod.py b/tests/test_utils_misc/testpkg/submod.py new file mode 100644 index 00000000000..8a07e359201 --- /dev/null +++ b/tests/test_utils_misc/testpkg/submod.py @@ -0,0 +1 @@ +TESTVAR3 = True diff --git a/tests/test_utils_project.py b/tests/test_utils_project.py new file mode 100644 index 00000000000..cea4d99504d --- /dev/null +++ b/tests/test_utils_project.py @@ -0,0 +1,27 @@ +import os +from tests import mock +import unittest + +from scrapy.exceptions import NotConfigured +from scrapy.utils.project import get_project_path, inside_project + + +class UtilsProjectTestCase(unittest.TestCase): + + @mock.patch('scrapy.utils.project.inside_project', return_value=True) + def test_get_project_path(self, mock_ip): + def _test(settingsmod, expected): + with mock.patch.dict('os.environ', + {'SCRAPY_SETTINGS_MODULE': settingsmod}): + self.assertEqual(get_project_path(), expected) + _test('project.settings', 'project') + _test('project.othername', 'project') + _test('nested.project.settings', 'nested.project') + + with mock.patch.dict('os.environ', {}, clear=True): + self.assertRaises(NotConfigured, get_project_path) + + mock_ip.return_value = False + with mock.patch.dict('os.environ', + {'SCRAPY_SETTINGS_MODULE': 'some.settings'}): + self.assertRaises(NotConfigured, get_project_path) From f57bc04b9e153a4aafa9f5e3cafb998daf526083 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Wed, 19 Aug 2015 16:17:09 +0200 Subject: [PATCH 08/13] Integrate add-ons into start-up process --- scrapy/cmdline.py | 6 +++++- scrapy/crawler.py | 19 ++++++++++++++----- scrapy/utils/test.py | 4 ++-- tests/test_crawl.py | 17 +++++++++++++++++ tests/test_crawler.py | 24 ++++++++++++++++++++++++ 5 files changed, 62 insertions(+), 8 deletions(-) diff --git a/scrapy/cmdline.py b/scrapy/cmdline.py index 35050c13d96..b403df570b5 100644 --- a/scrapy/cmdline.py +++ b/scrapy/cmdline.py @@ -6,6 +6,7 @@ import pkg_resources import scrapy +from scrapy.addons import AddonManager from scrapy.crawler import CrawlerProcess from scrapy.xlib import lsprofcalltree from scrapy.commands import ScrapyCommand @@ -118,6 +119,9 @@ def execute(argv=None, settings=None): conf.settings = settings # ------------------------------------------------------------------ + addons = AddonManager() + addons.load_cfg() + inproject = inside_project() cmds = _get_commands_dict(settings, inproject) cmdname = _pop_command_name(argv) @@ -139,7 +143,7 @@ def execute(argv=None, settings=None): opts, args = parser.parse_args(args=argv[1:]) _run_print_help(parser, cmd.process_options, args, opts) - cmd.crawler_process = CrawlerProcess(settings) + cmd.crawler_process = CrawlerProcess(settings, addons) _run_print_help(parser, _run_command, cmd, args, opts) sys.exit(cmd.exitcode) diff --git a/scrapy/crawler.py b/scrapy/crawler.py index 95d56d67128..dddd7c655ad 100644 --- a/scrapy/crawler.py +++ b/scrapy/crawler.py @@ -6,6 +6,7 @@ from twisted.internet import reactor, defer from zope.interface.verify import verifyClass, DoesNotImplement +from scrapy.addons import AddonManager from scrapy.core.engine import ExecutionEngine 
from scrapy.resolver import CachingThreadedResolver from scrapy.interfaces import ISpiderLoader @@ -23,7 +24,7 @@ class Crawler(object): - def __init__(self, spidercls, settings=None): + def __init__(self, spidercls, settings=None, addons=None): if isinstance(settings, dict) or settings is None: settings = Settings(settings) @@ -31,6 +32,12 @@ def __init__(self, spidercls, settings=None): self.settings = settings.copy() self.spidercls.update_settings(self.settings) + self.addons = addons if addons is not None else AddonManager() + self.addons.load_settings(self.settings) + self.addons.update_addons() + self.addons.check_dependency_clashes() + self.addons.update_settings(self.settings) + self.signals = SignalManager(self) self.stats = load_object(self.settings['STATS_CLASS'])(self) @@ -69,6 +76,7 @@ def crawl(self, *args, **kwargs): try: self.spider = self._create_spider(*args, **kwargs) self.engine = self._create_engine() + self.addons.check_configuration(self) start_requests = iter(self.spider.start_requests()) yield self.engine.open_spider(self.spider, start_requests) yield defer.maybeDeferred(self.engine.start) @@ -110,10 +118,11 @@ class CrawlerRunner(object): ":meth:`crawl` and managed by this class." ) - def __init__(self, settings=None): + def __init__(self, settings=None, addons=None): if isinstance(settings, dict) or settings is None: settings = Settings(settings) self.settings = settings + self.addons = addons self.spider_loader = _get_spider_loader(settings) self._crawlers = set() self._active = set() @@ -167,7 +176,7 @@ def _done(result): def _create_crawler(self, spidercls): if isinstance(spidercls, six.string_types): spidercls = self.spider_loader.load(spidercls) - return Crawler(spidercls, self.settings) + return Crawler(spidercls, self.settings, self.addons) def stop(self): """ @@ -209,8 +218,8 @@ class CrawlerProcess(CrawlerRunner): process. See :ref:`run-from-script` for an example. """ - def __init__(self, settings=None): - super(CrawlerProcess, self).__init__(settings) + def __init__(self, settings=None, addons=None): + super(CrawlerProcess, self).__init__(settings, addons) install_shutdown_handlers(self._signal_shutdown) configure_logging(self.settings) log_scrapy_info(self.settings) diff --git a/scrapy/utils/test.py b/scrapy/utils/test.py index bec9bdda97b..40fac67a3ec 100644 --- a/scrapy/utils/test.py +++ b/scrapy/utils/test.py @@ -20,7 +20,7 @@ def assert_aws_environ(): if 'AWS_ACCESS_KEY_ID' not in os.environ: raise SkipTest("AWS keys not found") -def get_crawler(spidercls=None, settings_dict=None): +def get_crawler(spidercls=None, settings_dict=None, addons=None): """Return an unconfigured Crawler object. If settings_dict is given, it will be used to populate the crawler settings with a project level priority. 
@@ -29,7 +29,7 @@ def get_crawler(spidercls=None, settings_dict=None): from scrapy.settings import Settings from scrapy.spiders import Spider - runner = CrawlerRunner(Settings(settings_dict)) + runner = CrawlerRunner(Settings(settings_dict), addons) return runner._create_crawler(spidercls or Spider) def get_pythonpath(): diff --git a/tests/test_crawl.py b/tests/test_crawl.py index 82aaf20279c..3369296b4bc 100644 --- a/tests/test_crawl.py +++ b/tests/test_crawl.py @@ -6,6 +6,8 @@ from twisted.internet import defer from twisted.trial.unittest import TestCase +from scrapy.addons import Addon, AddonManager +from scrapy.crawler import ExecutionEngine from scrapy.utils.test import get_crawler from tests import mock from tests.spiders import FollowAllSpider, DelaySpider, SimpleSpider, \ @@ -236,3 +238,18 @@ class TestError(Exception): mock_os.side_effect = TestError yield self.assertFailure(crawler.crawl(), TestError) self.assertFalse(crawler.crawling) + + @defer.inlineCallbacks + def test_abort_on_addon_failed_check(self): + class FailedCheckAddon(Addon): + name = 'FailedCheckAddon' + version = '1.0' + def check_configuration(self, config, crawler): + raise ValueError + addonmgr = AddonManager() + addonmgr.add(FailedCheckAddon()) + crawler = get_crawler(SimpleSpider, addons=addonmgr) + # Doesn't work in 'precise' test environment: + #with self.assertRaises(ValueError): + # yield crawler.crawl() + yield self.assertFailure(crawler.crawl(), ValueError) diff --git a/tests/test_crawler.py b/tests/test_crawler.py index 53a1202e343..dfad11405ec 100644 --- a/tests/test_crawler.py +++ b/tests/test_crawler.py @@ -2,6 +2,7 @@ import unittest import scrapy +from scrapy.addons import Addon, AddonManager from scrapy.crawler import Crawler, CrawlerRunner, CrawlerProcess from scrapy.settings import Settings, default_settings from scrapy.spiderloader import SpiderLoader @@ -51,6 +52,29 @@ class CustomSettingsSpider(DefaultSpider): self.assertFalse(settings.frozen) self.assertTrue(crawler.settings.frozen) + def test_populate_addons_settings(self): + class TestAddon(Addon): + name = 'TestAddon' + version = '1.0' + addonconfig = {'TEST1': 'addon', 'TEST2': 'addon', 'TEST3': 'addon'} + class TestAddon2(Addon): + name = 'testAddon2' + version = '1.0' + addonconfig2 = {'TEST': 'addon2'} + + settings = Settings() + settings.set('TESTADDON_TEST1', 'project', priority='project') + settings.set('TESTADDON_TEST2', 'default', priority='default') + addonmgr = AddonManager() + addonmgr.add(TestAddon(), addonconfig) + addonmgr.add(TestAddon2(), addonconfig2) + crawler = Crawler(DefaultSpider, settings, addonmgr) + + self.assertEqual(crawler.settings['TESTADDON_TEST1'], 'project') + self.assertEqual(crawler.settings['TESTADDON_TEST2'], 'addon') + self.assertEqual(crawler.settings['TESTADDON_TEST3'], 'addon') + self.assertEqual(crawler.settings['TESTADDON2_TEST'], 'addon2') + def test_crawler_accepts_dict(self): crawler = Crawler(DefaultSpider, {'foo': 'bar'}) self.assertEqual(crawler.settings['foo'], 'bar') From 2ed71b1184885ec8d382ba2cada774873612eae5 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Mon, 17 Aug 2015 02:48:12 +0200 Subject: [PATCH 09/13] Add built-in add-ons --- scrapy/addons/__init__.py | 3 + scrapy/addons/builtins.py | 293 +++++++++++++++++++++++++++++ tests/test_addons/test_builtins.py | 42 +++++ 3 files changed, 338 insertions(+) create mode 100644 scrapy/addons/builtins.py create mode 100644 tests/test_addons/test_builtins.py diff --git a/scrapy/addons/__init__.py b/scrapy/addons/__init__.py index 
59e59e15ff5..420d46b6884 100644 --- a/scrapy/addons/__init__.py +++ b/scrapy/addons/__init__.py @@ -495,3 +495,6 @@ def check_configuration(self, crawler): """ for name in self: self._call_addon(name, 'check_configuration', crawler) + + +from scrapy.addons.builtins import * diff --git a/scrapy/addons/builtins.py b/scrapy/addons/builtins.py new file mode 100644 index 00000000000..ff7902afbcf --- /dev/null +++ b/scrapy/addons/builtins.py @@ -0,0 +1,293 @@ +import scrapy +from scrapy.addons import Addon + +__all__ = ['make_builtin_addon', + + 'depth', 'httperror', 'offsite', 'referer', 'urllength', + + 'ajaxcrawl', 'chunked', 'cookies', 'defaultheaders', + 'downloadtimeout', 'httpauth', 'httpcache', 'httpcompression', + 'httpproxy', 'metarefresh', 'redirect', 'retry', 'robotstxt', + 'stats', 'useragent', + + 'autothrottle', 'corestats', 'closespider', 'debugger', 'feedexport', + 'logstats', 'memdebug', 'memusage', 'spiderstate', 'stacktracedump', + 'statsmailer', 'telnetconsole', + ] + + +def make_builtin_addon(addon_name, comp_type, comp, order=0, + addon_default_config=None, addon_version=None): + class ThisAddon(Addon): + name = addon_name + version = addon_version or scrapy.__version__ + component_type = comp_type + component = comp + component_order = order + default_config = addon_default_config or {} + + return ThisAddon + + +# XXX: Below are CLASSES that have lowercase names. This is in line with the +# original SEP-021 but violates PEP8. +# We might consider prepending all built-in addon names with scrapy_ or similar +# to reduce the chance of name clashes. + +# SPIDER MIDDLEWARES + +depth = make_builtin_addon( + 'depth', + 'SPIDER_MIDDLEWARES', + 'scrapy.spidermiddlewares.depth.DepthMiddleware', + 900, +) + +httperror = make_builtin_addon( + 'httperror', + 'SPIDER_MIDDLEWARES', + 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', + 50, +) + +offsite = make_builtin_addon( + 'offsite', + 'SPIDER_MIDDLEWARES', + 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', + 500, +) + +referer = make_builtin_addon( + 'referer', + 'SPIDER_MIDDLEWARES', + 'scrapy.spidermiddlewares.referer.RefererMiddleware', + 700, + {'enabled': True}, +) + +urllength = make_builtin_addon( + 'urllength', + 'SPIDER_MIDDLEWARES', + 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', + 800, +) + + +# DOWNLOADER MIDDLEWARES + +ajaxcrawl = make_builtin_addon( + 'ajaxcrawl', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware', + 560, +) + +chunked = make_builtin_addon( + 'chunked', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', + 830, +) + +cookies = make_builtin_addon( + 'cookies', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', + 700, + {'enabled': True}, +) + +defaultheaders = make_builtin_addon( + 'defaultheaders', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', + 550, +) +# Assume every config entry is a header +def defaultheaders_export_config(self, config, settings): + conf = self.default_config or {} + conf.update(config) + settings.set('DEFAULT_REQUEST_HEADERS', conf, 'addon') +defaultheaders.export_config = defaultheaders_export_config + +downloadtimeout = make_builtin_addon( + 'downloadtimeout', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', + 350, +) +downloadtimeout.config_mapping = {'timeout': 'DOWNLOAD_TIMEOUT', + 'download_timeout': 'DOWNLOAD_TIMEOUT'} + 
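+# With the mapping above, export_config() turns an add-on config such as
+# {'timeout': 180} into the single setting DOWNLOAD_TIMEOUT = 180, rather than
+# a DOWNLOADTIMEOUT_* setting derived from the add-on name.
+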
+httpauth = make_builtin_addon(
+    'httpauth',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
+    300,
+)
+
+httpcache = make_builtin_addon(
+    'httpcache',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware',
+    900,
+    {'enabled': True},
+)
+
+httpcompression = make_builtin_addon(
+    'httpcompression',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
+    590,
+    {'enabled': True},
+)
+httpcompression.config_mapping = {'enabled': 'COMPRESSION_ENABLED'}
+
+httpproxy = make_builtin_addon(
+    'httpproxy',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
+    750,
+)
+
+metarefresh = make_builtin_addon(
+    'metarefresh',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
+    580,
+    {'enabled': True},
+)
+metarefresh.config_mapping = {'max_times': 'REDIRECT_MAX_TIMES'}
+
+redirect = make_builtin_addon(
+    'redirect',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
+    600,
+    {'enabled': True},
+)
+
+retry = make_builtin_addon(
+    'retry',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.retry.RetryMiddleware',
+    500,
+    {'enabled': True},
+)
+
+robotstxt = make_builtin_addon(
+    'robotstxt',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
+    100,
+    {'obey': True},
+)
+
+stats = make_builtin_addon(
+    'stats',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.stats.DownloaderStats',
+    850,
+)
+
+useragent = make_builtin_addon(
+    'useragent',
+    'DOWNLOADER_MIDDLEWARES',
+    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
+    400,
+)
+useragent.config_mapping = {'user_agent': 'USER_AGENT'}
+
+
+# ITEM PIPELINES
+
+
+# EXTENSIONS
+
+autothrottle = make_builtin_addon(
+    'throttle',
+    'EXTENSIONS',
+    'scrapy.extensions.throttle.AutoThrottle',
+    0,
+    {'enabled': True},
+)
+
+corestats = make_builtin_addon(
+    'corestats',
+    'EXTENSIONS',
+    'scrapy.extensions.corestats.CoreStats',
+    0,
+)
+
+closespider = make_builtin_addon(
+    'closespider',
+    'EXTENSIONS',
+    'scrapy.extensions.closespider.CloseSpider',
+    0,
+)
+
+debugger = make_builtin_addon(
+    'debugger',
+    'EXTENSIONS',
+    'scrapy.extensions.debug.Debugger',
+    0,
+)
+
+feedexport = make_builtin_addon(
+    'feedexport',
+    'EXTENSIONS',
+    'scrapy.extensions.feedexport.FeedExporter',
+    0,
+)
+feedexport.settings_prefix = 'FEED'
+
+logstats = make_builtin_addon(
+    'logstats',
+    'EXTENSIONS',
+    'scrapy.extensions.logstats.LogStats',
+    0,
+)
+
+memdebug = make_builtin_addon(
+    'memdebug',
+    'EXTENSIONS',
+    'scrapy.extensions.memdebug.MemoryDebugger',
+    0,
+    {'enabled': True},
+)
+
+memusage = make_builtin_addon(
+    'memusage',
+    'EXTENSIONS',
+    'scrapy.extensions.memusage.MemoryUsage',
+    0,
+    {'enabled': True},
+)
+
+spiderstate = make_builtin_addon(
+    'spiderstate',
+    'EXTENSIONS',
+    'scrapy.extensions.spiderstate.SpiderState',
+    0,
+)
+
+stacktracedump = make_builtin_addon(
+    'stacktracedump',
+    'EXTENSIONS',
+    'scrapy.extensions.debug.StackTraceDump',
+    0,
+)
+
+statsmailer = make_builtin_addon(
+    'statsmailer',
+    'EXTENSIONS',
+    'scrapy.extensions.statsmailer.StatsMailer',
+    0,
+)
+
+telnetconsole = make_builtin_addon(
+    'telnetconsole',
+    'EXTENSIONS',
+    'scrapy.telnet.TelnetConsole',
+    0,
+)
diff --git a/tests/test_addons/test_builtins.py b/tests/test_addons/test_builtins.py
new file mode 100644
index 00000000000..607c911fb7e
--- /dev/null
+++ 
b/tests/test_addons/test_builtins.py @@ -0,0 +1,42 @@ +import unittest + +import scrapy +import scrapy.addons +from scrapy.addons.builtins import make_builtin_addon +from scrapy.settings import Settings + + +class BuiltinAddonsTest(unittest.TestCase): + + def test_make_builtin_addon(self): + httpcache = make_builtin_addon( + 'httpcache', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware', + 900, + {'enabled': True}, + ) + self.assertEqual(httpcache.name, 'httpcache') + self.assertEqual(httpcache.component_type, 'DOWNLOADER_MIDDLEWARES') + self.assertEqual(httpcache.component, 'scrapy.downloadermiddlewares.' + 'httpcache.HttpCacheMiddleware') + self.assertEqual(httpcache.component_order, 900) + self.assertEqual(httpcache.default_config, {'enabled': True}) + self.assertEqual(httpcache.version, scrapy.__version__) + httpcache = make_builtin_addon( + 'httpcache', + 'DOWNLOADER_MIDDLEWARES', + 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware', + 900, + {'enabled': True}, + '99.9', + ) + self.assertEqual(httpcache.version, '99.9') + + def test_defaultheaders_export_config(self): + settings = Settings() + dh = scrapy.addons.defaultheaders() + dh.export_config({'X-Test-Header': 'val'}, settings) + self.assertIn('X-Test-Header', settings['DEFAULT_REQUEST_HEADERS']) + self.assertEqual(settings['DEFAULT_REQUEST_HEADERS']['X-Test-Header'], + 'val') From 3044fd6054e85790ec6e6e0b4fccec939abd56bb Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Tue, 28 Jul 2015 12:58:05 +0200 Subject: [PATCH 10/13] Document add-ons --- docs/index.rst | 4 + docs/topics/addons.rst | 387 ++++++++++++++++++++++++++++++++++++++ scrapy/addons/__init__.py | 4 +- 3 files changed, 394 insertions(+), 1 deletion(-) create mode 100644 docs/topics/addons.rst diff --git a/docs/index.rst b/docs/index.rst index 0d21f5d4030..3e8a220e913 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -206,6 +206,7 @@ Extending Scrapy :hidden: topics/architecture + topics/addons topics/downloader-middleware topics/spider-middleware topics/extensions @@ -217,6 +218,9 @@ Extending Scrapy :doc:`topics/architecture` Understand the Scrapy architecture. +:doc:`topics/addons` + Enable and configure built-in and third-party extensions. + :doc:`topics/downloader-middleware` Customize how pages get requested and downloaded. diff --git a/docs/topics/addons.rst b/docs/topics/addons.rst new file mode 100644 index 00000000000..39ef286eba5 --- /dev/null +++ b/docs/topics/addons.rst @@ -0,0 +1,387 @@ +.. _topics-addons: + +======= +Add-ons +======= + +Scrapy's add-on system is a framework which unifies managing and configuring +components that extend Scrapy's core functionality, such as middlewares, +extensions, or pipelines. It provides users with a plug-and-play experience in +Scrapy extension management, and grants extensive configuration control to +developers. + + +Activating and configuring add-ons +================================== + +Add-ons and their configuration live in Scrapy's +:class:`~scrapy.addons.AddonManager`. During Scrapy's start-up process, and +only then, the add-on manager will read a list of enabled add-ons and their +configurations from your settings. There are two places where you can provide +the paths to add-ons you want to enable: + +* the ``INSTALLED_ADDONS`` setting, and +* the ``scrapy.cfg`` file. + +As Scrapy settings can be modified from many places, e.g. 
in a project's
+``settings.py``, in a Spider's ``custom_settings`` attribute, or from the
+command line, using the ``INSTALLED_ADDONS`` setting is the preferred way to
+manage add-ons.
+
+The ``INSTALLED_ADDONS`` setting is a tuple in which every item is a path to an
+add-on. The path can be either a Python path or a file path. Specifying the
+full add-on Python path is more precise, but it is not necessary if the add-on
+is either built into Scrapy or lives in your project's ``addons`` submodule.
+
+The configuration of an add-on, if necessary at all, is stored as a dictionary
+setting whose name is the uppercase add-on name.
+
+This is an example where an internal add-on and two third-party add-ons (one of
+which requires no configuration) are enabled/configured in a project's
+``settings.py``::
+
+    INSTALLED_ADDONS = (
+        'httpcache',
+        'path.to.some.addon',
+        'path/to/other/addon.py',
+    )
+
+    HTTPCACHE = {
+        'expiration_secs': 60,
+        'ignore_http_codes': [404, 405],
+    }
+
+    SOMEADDON = {
+        'some_config': True,
+    }
+
+It is also possible to manage add-ons from ``scrapy.cfg``. While the syntax is
+a little friendlier, be aware that this file, and therefore the configuration in
+it, is not bound to a particular Scrapy project. While this should not pose a
+problem when you use the project on your development machine only, a common
+stumbling block is that ``scrapy.cfg`` is not deployed via ``scrapyd-deploy``.
+
+In ``scrapy.cfg``, section names, prepended with ``addon:``, replace the
+dictionary keys. That is, the configuration from above would look like this:
+
+.. code-block:: cfg
+
+    [addon:httpcache]
+    expiration_secs = 60
+    ignore_http_codes = 404,405
+
+    [addon:path.to.some.addon]
+    some_config = true
+
+    [addon:path/to/other/addon.py]
+
+
+Enabling and configuring add-ons within Python code
+---------------------------------------------------
+
+The :class:`~scrapy.addons.AddonManager` will only read from Scrapy's settings
+and from ``scrapy.cfg`` *at the beginning* of Scrapy's start-up process.
+Afterwards, i.e. as soon as the :class:`~scrapy.addons.AddonManager` is
+populated, changing the ``INSTALLED_ADDONS`` setting or any of the add-on
+configuration dictionary settings will have no effect.
+
+If you want to enable, disable, or configure add-ons in Python code, for example
+when writing your own add-on, you will have to use the
+:class:`~scrapy.addons.AddonManager`. You can access the add-on manager through
+either ``crawler.addons`` or, if you are writing an add-on, through the
+``addons`` argument of the :meth:`update_addons` callback. The add-on manager
+provides many useful methods and attributes to facilitate interacting with the
+add-ons framework, e.g.:
+
+* an :meth:`~scrapy.addons.AddonManager.add` method to load add-ons,
+* the :attr:`~scrapy.addons.AddonManager.enabled` list of enabled add-ons,
+* :meth:`~scrapy.addons.AddonManager.enable` and
+  :meth:`~scrapy.addons.AddonManager.disable` methods,
+* the :attr:`~scrapy.addons.AddonManager.configs` dictionary which holds the
+  configuration of all add-ons
+
+In this example, we ensure that the ``httpcache`` add-on is loaded, and that
+its ``expiration_secs`` configuration is set to ``60``::
+
+    # addons is an instance of AddonManager
+    if 'httpcache' not in addons:
+        addons.add('httpcache', {'expiration_secs': 60})
+    else:
+        addons.configs['httpcache']['expiration_secs'] = 60
+
+
+Writing your own add-ons
+========================
+
+Add-ons are (any) Python *objects* that provide Scrapy's *add-on interface*. 
+The interface is enforced through ``zope.interface``. This leaves the choice of
+Python object up to the developer. Examples:
+
+* for a small pipeline, the add-on interface could be implemented in the same
+  class that also implements the ``open/close_spider`` and ``process_item``
+  callbacks
+* for larger add-ons, or for clearer structure, the interface could be provided
+  by a stand-alone module
+
+The absolute minimum interface consists of two attributes:
+
+.. attribute:: name
+
+    string with add-on name
+
+.. attribute:: version
+
+    version string (:pep:`440`, e.g. ``'1.0.1'``)
+
+Of course, stating just these two attributes will not get you very far. Add-ons
+can provide three callback methods that are called at various stages before the
+crawling process starts:
+
+.. method:: update_settings(config, settings)
+
+    This method is called during the initialization of the
+    :class:`~scrapy.crawler.Crawler`. Here, you should perform dependency checks
+    (e.g. for external Python libraries) and update the
+    :class:`~scrapy.settings.Settings` object as needed, e.g. enable components
+    for this add-on or set required configuration of other extensions.
+
+    :param config: Configuration of this add-on
+    :type config: ``dict``
+
+    :param settings: The settings object storing Scrapy/component configuration
+    :type settings: :class:`~scrapy.settings.Settings`
+
+.. method:: check_configuration(config, crawler)
+
+    This method is called when the :class:`~scrapy.crawler.Crawler` has been
+    fully initialized, immediately before it starts crawling. You can perform
+    additional dependency and configuration checks here.
+
+    :param config: Configuration of this add-on
+    :type config: ``dict``
+
+    :param crawler: Fully initialized Scrapy crawler
+    :type crawler: :class:`~scrapy.crawler.Crawler`
+
+.. method:: update_addons(config, addons)
+
+    This method is called immediately before :meth:`update_settings`, and should
+    be used to enable and configure other *add-ons* only.
+
+    When using this callback, be aware that there is no guarantee in which order
+    the :meth:`update_addons` callbacks of enabled add-ons will be called.
+    Add-ons that are added to the :class:`~scrapy.addons.AddonManager` during
+    this callback will also have their :meth:`update_addons` method called.
+
+    :param config: Configuration of this add-on
+    :type config: ``dict``
+
+    :param addons: Add-on manager holding all loaded add-ons
+    :type addons: :class:`~scrapy.addons.AddonManager`
+
+Additionally, add-ons may (and should, where appropriate) provide one or more
+attributes that can be used for limited automated detection of possible
+dependency clashes:
+
+.. attribute:: requires
+
+    list of built-in or custom components needed by this add-on, as strings.
+
+.. attribute:: modifies
+
+    list of built-in or custom components whose functionality is affected or
+    replaced by this add-on (a custom HTTP cache should list ``httpcache`` here)
+
+.. attribute:: provides
+
+    list of components provided by this add-on (e.g. ``mongodb`` for an
+    extension that provides generic read/write access to a MongoDB database)
+
+The entries in the :attr:`requires` and :attr:`modifies` attributes can be add-on
+names or components from other add-ons' :attr:`provides` attribute. You can
+specify :pep:`440`-style information about required versions. Examples::
+
+    requires = ['httpcache']
+    requires = ['otheraddon >= 2.0', 'yetanotheraddon']
+
+The Python object or module that is pointed to by an add-on path (e.g. 
given in +the ``INSTALLED_ADDONS`` setting, or given to +:meth:`~scrapy.addons.AddonManager.add`) does not necessarily have to be an +add-on. Instead, it can provide an ``_addon`` attribute. This attribute can be +either an add-on or another add-on path. + + +Add-on base class +================= + +Scrapy comes with a built-in base class for add-ons which provides some +convenience functionality: + +* basic settings can be exported via :meth:`~scrapy.addons.Addon.export_basics`, + configurable via :attr:`~scrapy.addons.Addon.basic_settings`. +* a single component (e.g. an item pipeline or a downloader middleware) can be + inserted into Scrapy's settings via + :meth:`~scrapy.addons.Addon.export_component`, configurable via + :attr:`~scrapy.addons.Addon.component_type`, + :attr:`~scrapy.addons.Addon.component_key`, + :attr:`~scrapy.addons.Addon.component`, and the ``order`` key in + :attr:`~scrapy.addons.Addon.default_config`. +* the add-on configuration can be exposed into Scrapy's settings via + :meth:`~scrapy.addons.Addon.export_config`, configurable via + :attr:`~scrapy.addons.Addon.default_config`, + :attr:`~scrapy.addons.Addon.config_mapping`, and + :attr:`~scrapy.addons.Addon.settings_prefix`. + +By default, the base add-on class will expose the add-on configuration into +Scrapy's settings namespace, in caps and with the add-on name prepended. It is +easy to write your own functionality while still being able to use the +convenience functions by overwriting +:meth:`~scrapy.addons.Addon.update_settings`. + +.. module:: scrapy.addons + +.. autoclass:: Addon + :members: + + +Add-on examples +=============== + +Set some basic configuration using the :class:`Addon` base class:: + + from scrapy.addons import Addon + + class MyAddon(Addon): + name = 'myaddon' + version = '1.0' + component = 'path.to.mypipeline' + component_type = 'ITEM_PIPELINES' + component_order = 200 + basic_settings = { + 'DNSCACHE_ENABLED': False, + } + +Check dependencies:: + + from scrapy.addons import Addon + + class MyAddon(Addon): + name = 'myaddon' + version = '1.0' + + def update_settings(self, config, settings): + try: + import boto + except ImportError: + raise RuntimeError("myaddon requires the boto library") + else: + self.export_config(config, settings) + +Enable a component that lives relative to the add-on (see +:ref:`topics-api-settings`):: + + from scrapy.addons import Addon + + class MyAddon(Addon): + name = 'myaddon' + version = '1.0' + component = __name__ + '.downloadermw.coolmw' + component_type = 'DOWNLOADER_MIDDLEWARES' + component_order = 900 + +Instantiate components ad hoc:: + + from path.to.my.pipelines import MySQLPipeline + + class MyAddon(object): + name = 'myaddon' + version = '1.0' + + def update_settings(self, config, settings): + mysqlpl = MySQLPipeline(password=config['password']) + settings.set( + 'ITEM_PIPELINES', + {mysqlpl: 200}, + priority='addon', + ) + +Provide add-on interface along component interface:: + + class MyPipeline(object): + name = 'mypipeline' + version = '1.0' + + def process_item(self, item, spider): + # Do some processing here + return item + + def update_settings(self, config, settings): + settings.set( + 'ITEM_PIPELINES', + {self: 200}, + priority='addon', + ) + +Enable another addon (see :ref:`topics-api-addonmanager`):: + + class MyAddon(object): + name = 'myaddon' + version = '1.0' + + def update_addons(self, config, addons): + if 'httpcache' not in addons.enabled: + addons.add('httpcache', {'expiration_secs': 60}) + +Check configuration of fully initialized 
crawler (see
+:ref:`topics-api-crawler`)::
+
+    class MyAddon(object):
+        name = 'myaddon'
+        version = '1.0'
+
+        def update_settings(self, config, settings):
+            settings.set('DNSCACHE_ENABLED', False, priority='addon')
+
+        def check_configuration(self, config, crawler):
+            if crawler.settings.getbool('DNSCACHE_ENABLED'):
+                # The spider, some other add-on, or the user messed with the
+                # DNS cache setting
+                raise ValueError("myaddon is incompatible with DNS cache")
+
+Provide add-on interface through a module:
+
+.. No idea why just using '::' doesn't work for this one
+.. code-block:: python
+
+    name = 'AddonModule'
+    version = '1.0'
+
+    class MyPipeline(object):
+        # ...
+
+    class MyDownloaderMiddleware(object):
+        # ...
+
+    def update_settings(config, settings):
+        settings.set(
+            'ITEM_PIPELINES',
+            {MyPipeline(): 200},
+            priority='addon',
+        )
+        settings.set(
+            'DOWNLOADER_MIDDLEWARES',
+            {MyDownloaderMiddleware(): 800},
+            priority='addon',
+        )
+
+Forward to other add-ons depending on Python version::
+
+    # This could be a Python module, say project/pipelines/mypipeline.py, but
+    # could also be done inside a class, etc.
+    import six
+
+    if six.PY3:
+        # We're running Python 3
+        _addon = 'path.to.addon'
+    else:
+        _addon = 'path.to.other.addon'
diff --git a/scrapy/addons/__init__.py b/scrapy/addons/__init__.py
index 420d46b6884..15460143c8a 100644
--- a/scrapy/addons/__init__.py
+++ b/scrapy/addons/__init__.py
@@ -156,13 +156,15 @@ class AddonManager(Mapping):
     """This class facilitates loading and storing :ref:`topics-addons`.
 
     You can treat it like a read-only dictionary in which keys correspond to
-    add-on names and values correspond to the add-on objects::
+    add-on names and values correspond to the add-on objects. Add-on
+    configurations are saved in the :attr:`configs` dictionary attribute::
 
         addons = AddonManager()
        # ... 
load some add-ons here print addons.enabled # prints names of all enabled add-ons print addons['TestAddon'].version # prints version of add-on with name # 'TestAddon' + print addons.configs['TestAddon'] # prints configuration of 'TestAddon' """ From e5c9492fe57dbb239d8659ab42634155acc05b42 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Mon, 24 Aug 2015 18:04:58 +0200 Subject: [PATCH 11/13] Remove unused imports in add-ons --- scrapy/addons/__init__.py | 3 --- tests/test_addons/__init__.py | 4 +--- 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/scrapy/addons/__init__.py b/scrapy/addons/__init__.py index 15460143c8a..b1d6e14cb59 100644 --- a/scrapy/addons/__init__.py +++ b/scrapy/addons/__init__.py @@ -1,7 +1,5 @@ from collections import defaultdict, Mapping -from importlib import import_module from inspect import isclass -import os import six import warnings @@ -11,7 +9,6 @@ from scrapy.exceptions import NotConfigured from scrapy.interfaces import IAddon -from scrapy.settings import BaseSettings from scrapy.utils.conf import config_from_filepath, get_config from scrapy.utils.misc import load_module_or_object from scrapy.utils.project import get_project_path diff --git a/tests/test_addons/__init__.py b/tests/test_addons/__init__.py index c98f0ab6581..84870ec520a 100644 --- a/tests/test_addons/__init__.py +++ b/tests/test_addons/__init__.py @@ -1,7 +1,5 @@ import os.path import six -from six.moves.configparser import SafeConfigParser -import sys from tests import mock import unittest import warnings @@ -15,7 +13,7 @@ from scrapy.addons import Addon, AddonManager from scrapy.crawler import Crawler from scrapy.interfaces import IAddon -from scrapy.settings import BaseSettings, Settings +from scrapy.settings import BaseSettings from . import addons from . import addonmod From 03c7cb42aa2756da467a50019d33371aab2b9b69 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Wed, 26 Aug 2015 01:43:59 +0200 Subject: [PATCH 12/13] Fix class signatures in Extensions docs --- docs/topics/extensions.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/topics/extensions.rst b/docs/topics/extensions.rst index a71b8bcee3b..3e4c2e260c2 100644 --- a/docs/topics/extensions.rst +++ b/docs/topics/extensions.rst @@ -182,7 +182,7 @@ Telnet console extension .. module:: scrapy.telnet :synopsis: Telnet console -.. class:: scrapy.telnet.TelnetConsole +.. class:: TelnetConsole Provides a telnet console for getting into a Python interpreter inside the currently running Scrapy process, which can be very useful for debugging. @@ -199,7 +199,7 @@ Memory usage extension .. module:: scrapy.extensions.memusage :synopsis: Memory usage extension -.. class:: scrapy.extensions.memusage.MemoryUsage +.. class:: MemoryUsage .. note:: This extension does not work in Windows. @@ -228,7 +228,7 @@ Memory debugger extension .. module:: scrapy.extensions.memdebug :synopsis: Memory debugger extension -.. class:: scrapy.extensions.memdebug.MemoryDebugger +.. class:: MemoryDebugger An extension for debugging memory usage. It collects information about: @@ -244,7 +244,7 @@ Close spider extension .. module:: scrapy.extensions.closespider :synopsis: Close spider extension -.. class:: scrapy.extensions.closespider.CloseSpider +.. class:: CloseSpider Closes a spider automatically when some conditions are met, using a specific closing reason for each condition. @@ -315,7 +315,7 @@ StatsMailer extension .. module:: scrapy.extensions.statsmailer :synopsis: StatsMailer extension -.. 
class:: scrapy.extensions.statsmailer.StatsMailer +.. class:: StatsMailer This simple extension can be used to send a notification e-mail every time a domain has finished scraping, including the Scrapy stats collected. The email @@ -331,7 +331,7 @@ Debugging extensions Stack trace dump extension ~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. class:: scrapy.extensions.debug.StackTraceDump +.. class:: StackTraceDump Dumps information about the running process when a `SIGQUIT`_ or `SIGUSR2`_ signal is received. The information dumped is the following: @@ -360,7 +360,7 @@ There are at least two ways to send Scrapy the `SIGQUIT`_ signal: Debugger extension ~~~~~~~~~~~~~~~~~~ -.. class:: scrapy.extensions.debug.Debugger +.. class:: Debugger Invokes a `Python debugger`_ inside a running Scrapy process when a `SIGUSR2`_ signal is received. After the debugger is exited, the Scrapy process continues From 68ea9519274d24f4eefff6a839a00933d4311568 Mon Sep 17 00:00:00 2001 From: Jakob de Maeyer Date: Fri, 21 Aug 2015 16:29:27 +0200 Subject: [PATCH 13/13] Document built-in add-ons --- docs/topics/addons.rst | 123 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 123 insertions(+) diff --git a/docs/topics/addons.rst b/docs/topics/addons.rst index 39ef286eba5..4dab15a2ad9 100644 --- a/docs/topics/addons.rst +++ b/docs/topics/addons.rst @@ -385,3 +385,126 @@ Forward to other add-ons depending on Python version:: _addon = 'path.to.addon' else: _addon = 'path.to.other.addon' + + +Built-in add-on reference +========================= + +Scrapy comes with gateway add-ons that you can use to configure the built-in +middlewares and extensions. For example, to activate and configure the +:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, instead +of placing this in your ``settings.py``:: + + HTTPCACHE_ENABLED = True + HTTPCACHE_EXPIRATION_SECS = 60 + HTTPCACHE_IGNORE_HTTP_CODES = [404] + +you can also use the add-on framework:: + + INSTALLED_ADDONS = ( + # ..., + 'httpcache', + ) + + HTTPCACHE = { + 'expiration_secs': 60, + 'ignore_http_codes': [404], + } + +Note that you *must* enable built-in addons by placing them in your +``INSTALLED_ADDONS`` setting before you can use them for configuring built-in +components. I.e., configuring the ``HTTPCACHE`` setting will have no effect +when ``httpcache`` is not listed in ``INSTALLED_ADDONS``. + +In general, the add-on names match the lowercase name of the component, with its +type suffix removed (i.e. the add-on configuring the +:class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware` is called +``httpcache``), and the configuration option names match the names of the +settings they map to, with the component prefix removed (i.e. +``expiration_secs`` maps to :setting:`HTTPCACHE_EXPIRATION_SECS`, as above). 
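+For illustration, this is roughly how such a gateway add-on is defined
+internally via the ``make_builtin_addon`` factory (the component path below is
+a made-up example, not one of the built-in add-ons)::
+
+    from scrapy.addons import make_builtin_addon
+
+    # the name 'mymw' means the add-on is configured through a MYMW settings
+    # dict, e.g. MYMW = {'enabled': True}
+    mymw = make_builtin_addon(
+        'mymw',                                # add-on name
+        'DOWNLOADER_MIDDLEWARES',              # dict setting to register in
+        'myproject.middlewares.MyMiddleware',  # component path
+        543,                                   # component order
+        {'enabled': True},                     # default add-on configuration
+    )
+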
+The available add-ons are: + + ++--------------------------------------+--------------------------------------+ +| Add-on | Notes | ++======================================+======================================+ +| **Spider middlewares** | ++--------------------------------------+--------------------------------------+ +| depth (:class:`~scrapy.spidermi\ | | +| ddlewares.depth.DepthMiddleware`) | | ++--------------------------------------+--------------------------------------+ +| httperror (:class:`~scrapy.spid\ | | +| ermiddlewares.httperror.HttpErrorMi\ | | +| ddleware`) | | ++--------------------------------------+--------------------------------------+ +| offsite (:class:`~scrapy.spid\ | | +| ermiddlewares.offsite.OffsiteMiddle\ | | +| ware`) | | ++--------------------------------------+--------------------------------------+ +| referer (:class:`~scrapy.spid\ | | +| ermiddlewares.referer.RefererMiddle\ | | +| ware`) | | ++--------------------------------------+--------------------------------------+ +| urllength (:class:`~scrapy.spid\ | | +| ermiddlewares.urllength.UrlLengthMi\ | | +| ddleware`) | | ++--------------------------------------+--------------------------------------+ +| **Downloader middlewares** | ++--------------------------------------+--------------------------------------+ +| ajaxcrawl (:class:`~scrapy.download\ | | +| ermiddlewares.ajaxcrawl.AjaxCrawlMi\ | | +| ddleware`) | | ++--------------------------------------+--------------------------------------+ +| chunked (:class:`~scrapy.download\ | | +| ermiddlewares.chunked.ChunkedTrans\ | | +| ferMiddleware`) | | ++--------------------------------------+--------------------------------------+ +| cookies (:class:`~scrapy.download\ | | +| ermiddlewares.cookies.CookiesMiddle\ | | +| ware`) | | ++--------------------------------------+--------------------------------------+ +| defaultheaders (:class:`~scrapy.down\| Every configuration entry is treated | +| loadermiddlewares.defaultheaders.Def\| as a default header. | +| aultHeadersMiddleware`) | | ++--------------------------------------+--------------------------------------+ +| **Extensions** | ++--------------------------------------+--------------------------------------+ +| autothrottle | Installing sets | +| (:ref:`topics-autothrottle`) | :setting:`AUTOTHROTTLE_ENABLED` to | +| | ``True``. | ++--------------------------------------+--------------------------------------+ +| corestats (:class:`~scrapy.exten\ | | +| sions.corestats.CoreStats`) | | ++--------------------------------------+--------------------------------------+ +| closespider (:class:`~scrapy.exten\ | | +| sions.closespider.CloseSpider`) | | ++--------------------------------------+--------------------------------------+ +| debugger (:class:`~scrapy.exten\ | | +| sions.debug.Debugger`) | | ++--------------------------------------+--------------------------------------+ +| feedexport (:ref:`topics-feed-expor\ | | +| ts`) | | ++--------------------------------------+--------------------------------------+ +| logstats (:class:`~scrapy.exten\ | | +| sions.logstats.LogStats`) | | ++--------------------------------------+--------------------------------------+ +| memdebug (:class:`~scrapy.exten\ | Installing sets | +| sions.memdebug.MemoryDebugger`) | :setting:`MEMDEBUG_ENABLED` to | +| | ``True``. 
| ++--------------------------------------+--------------------------------------+ +| memusage (:class:`~scrapy.exten\ | Installing sets | +| sions.memusage.MemoryUsage`) | :setting:`MEMUSAGE_ENABLED` to | +| | ``True``. | ++--------------------------------------+--------------------------------------+ +| spiderstate (:class:`~scrapy.exten\ | | +| sions.spiderstate.SpiderState`) | | ++--------------------------------------+--------------------------------------+ +| stacktracedump (:class:`~scrapy.ext\ | | +| ensions.debug.StackTraceDump`) | | ++--------------------------------------+--------------------------------------+ +| statsmailer (:class:`~scrapy.exten\ | | +| sions.statsmailer.StatsMailer`) | | ++--------------------------------------+--------------------------------------+ +| telnetconsole (:ref:`topics-telnet\ | | +| console`) | | ++--------------------------------------+--------------------------------------+