Add-ons #1272

Open · wants to merge 19 commits into base: master

8 participants
jdemaeyer (Contributor) commented Jun 1, 2015

Based on #1149 and #1423 #1586 master

Closes #591 and #1215

Implementation of SEP-021 (updated version from this PR, original version)

Design decision to-do here

Old to-do:

  • Design decisions
    • Configuration entry point (settled on INSTALLED_ADDONS setting and additional support for scrapy.cfg so far)
    • Integration into start-up process
    • Add-on interactions (can add-ons configure other add-ons?)
    • Find better name for AddonManager?
    • Update SEP
  • Code
    • Add-on interface
    • Base Addon class
    • Load add-on configuration from scrapy.cfg
    • Load add-on configuration from settings module
    • Search add-ons by name
      • Improve get_project_path() robustness (done partially, still fails when settings module does not live in project's root)
    • Load & verify add-ons
    • Add-on "holding" (a place where loaded add-ons live)
    • Helper functions to call add-on callbacks
    • Dependency clash management
      • Use pkg_resources' extensive dependency management (this supersedes the earlier "add version support" item)
      • Should pkg_resources be used throughout the manager and not just within the dependency checking? Will go to separate PR if we decide to
    • Integrate into standard start-up process
      • Integrate dependency clash checks
    • Allow putting objects (not just class paths) in component lists, e.g. PIPELINES (check #1215)
    • Write add-ons for built-in components
      • Fix add-ons for components which do not prepend their settings with their name
    • Allow setting global-namespace settings via the add-on framework (supersedes moving settings for built-in components from the global namespace into add-ons, which might have made them harder to configure)
    • Allow spider to implement add-on callbacks (like it already does with update_settings())
  • Tests
    • Base Addon class
    • Loading configuration from cfg files
    • Loading configuration from Python modules
    • Search add-ons
    • Load and verify add-ons
    • Callback helper functions
    • Dependency clash management
    • Integration into start-up process
    • Putting objects in component lists
  • Documentation
    • Docstrings for AddonManager, then use autodoc in docs/topics/api.rst
    • New topic on add-ons
      • Mention _addon attribute, more versatile examples (using _addon, module as add-on)


jdemaeyer (Contributor) commented Jun 1, 2015
I think one of the major open questions is where the add-on manager should live. Here's an excerpt from my proposal on that (I have opted for the inside of crawler option for the SEP so far):

Integration into existing start-up process

The question where the add-on manager should be instantiated and when the add-on callbacks should be called is not as obvious as it may seem at first glance. Outlined below are three options and their rationale:

Outside of crawler

Besides the different priority and level of direct user involvement, settings made by the add-on callbacks are not different from settings made in settings.py. The callbacks should therefore be called around the same time that the settings module is loaded. This puts the instantiation of an AddonManager instance outside of the Crawler instance.

Given that the project settings are read in the get_project_settings() function in scrapy.utils.project, this seems a reasonable place to call AddonManager.update_settings(). However, we cannot instantiate the add-on manager within this function, as the function is left (and the manager would therefore be lost) long before the crawler becomes ready (when we wish to call the second round of add-on callbacks).

There are two possible approaches to instantiating the add-on manager outside of get_project_settings():

  1. In a normal Scrapy start via the command-line tool, calling get_project_settings() is embedded in the execute() function in scrapy.cmdline. In summary, the changes to this function would be (with analogous changes in scrapy/conf.py, where necessary, for backwards compatibility):
    1. Instantiate add-on manager before calling get_project_settings()
    2. Pass add-on manager to get_project_settings() when calling it (the function then calls update_settings()).
    3. Connect the manager's check_configuration method to the engine_started signal (this could also be done in the add-on manager's __init__() method)
  2. Alternatively, we could (grudgingly) introduce another singleton to Scrapy (e.g. scrapy.addonmanager.addonmanager). This would allow moving the add-on management code above into the more appropriate get_project_settings() function.

Integrating add-on management outside of the crawler ensures that settings management, except for spider-specific settings, remains within a rather small part of the start-up process (before instantiating a Crawler): the helper function get_project_settings() indeed keeps returning the full set of project (i.e. non-spider) settings. The downside is that it either introduces a new singleton or clutters up a function (execute()) that should be more about parsing command line options and starting a crawler process than about loading add-ons and settings.
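The outside-of-crawler flow from points 1.1-1.3 could be sketched roughly as follows. AddonManager, get_project_settings() and execute() are simplified stand-ins for this sketch, not the actual Scrapy implementations:

```python
# Rough sketch of the "outside of crawler" option (points 1.1-1.3 above).
# All names are stand-ins for illustration only.

class AddonManager:
    """Holds loaded add-ons and applies their callbacks."""

    def __init__(self, addons=None):
        self.addons = list(addons or [])

    def update_settings(self, settings):
        for addon in self.addons:
            addon.update_settings(settings)

    def check_configuration(self, crawler):
        for addon in self.addons:
            addon.check_configuration(crawler)


def get_project_settings(addon_manager=None):
    settings = {"EXTENSIONS": []}  # placeholder for the real settings loader
    if addon_manager is not None:
        # step 2: the manager fills in the add-on settings
        addon_manager.update_settings(settings)
    return settings


def execute():
    manager = AddonManager()  # step 1: instantiate before loading settings
    settings = get_project_settings(manager)
    # step 3 would connect manager.check_configuration to engine_started here
    return settings
```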

Inside of crawler

The settings used to crawl are not complete until the spider-specific settings have been loaded in Crawler.__init__(). Add-on management could follow this approach and only start loading add-ons when the crawler is initialised.

Instantiation and the call to AddonManager.update_settings() would happen in Crawler.__init__(). The final checks (i.e. the callback to AddonManager.check_configuration()) could then again either be tied to the engine_started signal, or coded into the Crawler.crawl() method after creating the engine.

Integrating add-on management inside of the crawler avoids introducing a new singleton or cluttering up the execute() function, but splits the compilation of the complete configuration settings across the start-up process. This is especially critical since many of the settings previously made in settings.py will move to scrapy.cfg with the implementation of this proposal, and may prove backwards-incompatible since the get_project_settings() helper function no longer works as expected (and as its name suggests).
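For contrast, a rough sketch of the inside-of-crawler option; Crawler here is a toy stand-in (the real Crawler API differs), and the two callback phases follow the scheme described above:

```python
# Toy "inside of crawler" sketch: the crawler itself owns the add-ons and
# runs the two callback phases. All names are illustrative stand-ins.

class Crawler:
    def __init__(self, settings, addons=()):
        self.settings = dict(settings)
        self.addons = list(addons)
        # first phase: update_settings() runs while settings are compiled
        for addon in self.addons:
            addon.update_settings(self.settings)
        self.engine = None

    def crawl(self):
        self.engine = object()  # stand-in for creating the engine
        # second phase: final checks run once the engine exists
        # (alternatively tied to the engine_started signal)
        for addon in self.addons:
            addon.check_configuration(self)
```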


kmike (Member) commented Jun 3, 2015

Hey @jdemaeyer,

I've read your proposal and the SEP for the first time, so please excuse me if I'm missing some context :)

  1. Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?

  2. Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?

  3. How can user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?

  4. How can user enable an addon if there is no Scrapy project?

  5. How to enable an addon if scrapy runspider is used to start a spider?

  6. Why is scrapy.cfg needed? It is confusing to have two config files, and it is an unnecessary boilerplate if you don't have a Scrapy project.

  7. In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at #1215).

  8. Do you think it is possible to make the Scrapy add-ons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we make it so that the user can also use the AddonManager (or an individual addon) to update or check settings, without invoking all the scrapy crawl machinery?

  9. It looks like the proposed AddonManager does two unrelated things:

    • it loads addon classes from scrapy.cfg;
    • it uses loaded addons to update settings and check crawler.

    IMHO it is better to separate those. Scrapy users who don't use scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as CrawlerProcess call and a spider, all in the same self-contained Python script.

// a rant: I think it is usually a bad idea to name something ...Manager :) "Manager" tells us that a class "manages" something. That is quite broad, so 1) often such classes end up doing several unrelated things and 2) it is not clear what the class is doing just from its name - it can do anything. In Scrapy we renamed SpiderManager to SpiderLoader because of that. It turns out this is not only my opinion.


jdemaeyer (Contributor) commented Jun 3, 2015

Hey @kmike, thanks for the feedback!

1 . Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?

That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"

2 . Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?

Add-ons should create a plug-and-play experience for the user. To check that everything "Works As Advertised™", and to provide help when it doesn't, they need to know the actual state of the crawler, and not just what its state is supposed to be according to the configuration. E.g. an extension listed in settings['EXTENSIONS'] could have thrown NotConfigured. Add-ons can only detect this if they have access to the crawler.
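To illustrate, here is a hypothetical add-on whose check_configuration() can only do its job with access to the crawler; the crawler.extensions attribute and all class names used here are assumptions for the sketch, not Scrapy's actual API:

```python
# Hypothetical add-on: settings can claim an extension is enabled, but only
# the crawler reveals whether it actually survived start-up (an extension
# may raise NotConfigured in its constructor).

class HttpCacheAddon:
    def update_settings(self, settings):
        settings.setdefault("EXTENSIONS", []).append("HttpCacheExtension")

    def check_configuration(self, crawler):
        # inspect the *loaded* extensions, not the configured ones
        loaded = [type(ext).__name__ for ext in crawler.extensions]
        if "HttpCacheExtension" not in loaded:
            raise RuntimeError(
                "HttpCacheExtension was configured but not loaded "
                "(it may have raised NotConfigured)")
```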

3 . How can user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?

Hm, don't both of these instantiate Crawler objects, where the add-on manager would then be instantiated and load add-ons? Then again, I guess it comes down to whether a scrapy.cfg exists in these use cases (also see the answer to 6).

4 . How can user enable an addon if there is no Scrapy project?
5 . How to enable an addon if scrapy runspider is used to start a spider?

That would not be possible with the current suggested implementation. Just like you cannot have a settings.py file or use scrapyd-deploy in these cases. It could be made possible if add-on configuration were to live in settings using per-spider settings. I'll put the (current) rationale for scrapy.cfg below.

6 . Why is scrapy.cfg needed? It is confusing to have two config files, and it is an unnecessary boilerplate if you don't have a Scrapy project.

I opted for scrapy.cfg for two reasons:

  1. Its syntax is more user-friendly for uninitiated users

  2. It makes it easy to keep settings local through the ini sections. E.g. something like this:

      [mysql-pipe]
      server = some.server
      port = 1234
      user = a
      password = b
    
      [mongodb-pipe]
      server = some.other.server
      user = c
      password = d

    is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible name clashes.

I think if done properly, users won't have to touch settings.py for any of the shipped add-ons at all (instead they could set, say, HTTPCACHE_DIR in scrapy.cfg; we could even tidy up the settings namespace this way). Those users that then need to touch settings.py will most probably know what they're doing anyway.
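As a purely illustrative example (the section and option names below are made up, not part of the proposal), the HTTP cache configuration could then look like this in scrapy.cfg:

```ini
; hypothetical scrapy.cfg fragment; names are illustrative only
[httpcache]
dir = /tmp/scrapy-cache
expiration_secs = 3600
```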

7 . In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at #1215).

Uh, I didn't see #1215 before, that looks good. The SEP goes a little further than the PR though: It suggests to allow not only classes, but instances. This will allow add-ons to instantiate components and keep settings as local as possible. Objects retrieved from X = load_object() are (so far) always assumed to be classes and then instantiated with y = X(). That should probably stay that way, and it is a good question where the "allow instances" code could live so it is as generic as possible.

8 . Do you think it is possible to make Scrapy addons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we made it so that user can also use the AddonManger (or individual addon) to update or check settings, without invoking all the scrapy crawl machinery?

Hm, I'd say it is already quite modular. The user could in principle instantiate their own AddonManager and hand it the .cfg file. Or they could load an add-on object and call its update_settings() method directly.

9 . It looks like the proposed AddonManager does two unrelated things:

  • it loads addon classes from scrapy.cfg;
  • it uses loaded addons to update settings and check crawler.
    IMHO it is better to separate those. Scrapy users who don't use scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as CrawlerProcess call and a spider, all in the same self-contained Python script.

That is a good point. The Manager idea arose from the #591 discussion, but now that I come to think of it, splitting it up seems to make more sense. Also, it would avoid having a "...Manager"-named class :). The loaded add-ons need to live somewhere between loading <-> updating <-> checking, though. With the current SEP, that could be in an attribute of Crawler (much like extensions).

Maybe like this?

  • An AddonLoader class, or even only a function, is handed the scrapy.cfg file and provides/returns an { 'addon name': (<addon object>, <addon config>) } dict. The check_dependency_clashes() functionality also lives within this class/function.
  • The addons dict is given to the Crawler object upon instantiation, much like settings.
  • Crawler calls the add-ons update_settings() and check_configuration() functions at appropriate times.
  • If the user does not want the Crawler to call the callbacks (e.g. because she does it somewhere else), she hands over an empty addons dict on instantiation. In this case, it might be tricky that the check_configuration() callback requires the crawler object.
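A minimal sketch of the split proposed above, with configparser standing in for the real scrapy.cfg handling and a simple registry mapping add-on names to already-imported add-on objects (the function name and registry argument are illustrative assumptions):

```python
# Sketch: a loader function (rather than a Manager) returns the
# {'addon name': (addon object, addon config)} dict; the crawler merely
# consumes it. check_dependency_clashes() would also live here.
import configparser


def load_addons(cfg_text, registry):
    """Parse ini text and pair each section with its add-on object.

    registry maps add-on names to already-imported objects (classes or
    modules); the real loader would resolve dotted paths instead.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {name: (registry[name], dict(parser[name]))
            for name in parser.sections() if name in registry}
```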

I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.


kmike (Member) commented Jun 3, 2015

Hi @jdemaeyer,

That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"

Hm, do you think we can use exactly the same code path for Spider then? I.e. maybe a spider can implement the addon interface itself? I'm not sure when it can be useful, but why not :) This way a spider will be able not only to update settings, but also to check them.
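A toy illustration of that idea: a spider implementing the same two-method add-on interface, so one code path could serve both. The method names follow the discussion; nothing here is actual Scrapy API:

```python
# Hypothetical spider that doubles as an add-on: it can both update and
# afterwards check the settings it depends on.

class MySpider:
    name = "my_spider"

    def update_settings(self, settings):
        settings["DOWNLOAD_DELAY"] = 0.5

    def check_configuration(self, crawler):
        # the spider verifies its own requirements against the crawler
        if crawler.settings.get("DOWNLOAD_DELAY") != 0.5:
            raise RuntimeError("DOWNLOAD_DELAY was overridden")
```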

It makes it easy to keep settings local through the ini sections. E.g. something like this: (... snip ...) is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible nameclashes.

I don't think it is a problem to separate sections in settings.py. Django-like approach works:

MYSQL_PIPE = dict(
    server='some-server',
    port=1234,
    user='foo',
    password='bar'
)

Maybe like this? ...

Yeah, your approach looks good to me.

Some questions:

Why are { 'addon name': (<addon object>, <addon config>) } dicts needed, what are keys ('addon names') for? Can it be a list of (<addon class>, <addon config>) pairs, or just a list of instantiated addon objects?

I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.

What's the problem with having an ADDONS list in settings.py? It means one less config file and more flexibility - addons will be able to disable or enable or check other addons. This way you also avoid changing the Crawler API and other related APIs - no need to add an 'addons' argument. I think that two config files are less user-friendly than one config file. By default settings.py may contain only this ADDONS option.
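A sketch of that single-config-file idea: settings.py carries only an ADDONS entry, and a small helper applies each add-on. The ADDONS format, the helper, and the MysqlPipeAddon class are all assumptions for illustration, not a settled API:

```python
# Hypothetical single-file configuration: settings.py declares ADDONS as
# (class, config) pairs, and apply_addons() runs their callbacks.

class MysqlPipeAddon:
    def __init__(self, config):
        self.config = config  # e.g. server/port/user/password

    def update_settings(self, settings):
        settings["ITEM_PIPELINES"] = {"myproject.pipelines.MysqlPipeline": 300}


# in settings.py, this could be the only required entry:
ADDONS = [
    (MysqlPipeAddon, {"server": "some-server", "port": 1234}),
]


def apply_addons(settings, addons=ADDONS):
    """Instantiate each (class, config) pair and let it update the settings."""
    for cls, config in addons:
        cls(config).update_settings(settings)
    return settings
```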


jdemaeyer (Contributor) commented Jun 10, 2015

I've written the add-on interface spec and two add-on loaders, one loading from scrapy.cfg and one loading from settings.py.

Is there a way I can retrieve the project module name? I could use the first part of os.environ['SCRAPY_SETTINGS_MODULE'], but it seems to me that there should be a cleaner way.

Again, thanks for the feedback @kmike!

Hm, do you think we can use exactly the same code path for Spider then?

I think that should be possible, but I'm not sure how to enforce that the update_settings() method is not called twice.

I don't think it is a problem to separate sections in settings.py. Django-like approach works:

One problem with the current SEP is that the add-on name (= variable name) is allowed to be a python path to the add-on, but python does not allow dots in variable names. I have therefore allowed a _name variable in the settings.py dicts for now.

Why are { 'addon name': (<addon object>, <addon config>) } dicts needed, what are keys ('addon names') for? Can it be a list of (<addon class>, <addon config>) pairs, or just a list of instantiated addon objects?

The keys are simply whatever the addon was called in scrapy.cfg/settings.py. I'm not yet fully sure that we can drop them this early, so I'd prefer to leave them in. It cannot be a list of instantiated add-on objects with the current SEP b/c we also allow add-ons to be modules right now (i.e. they cannot be instantiated).

I don't really like the idea of exposing add-ons in the global settings, because they are essentially "one level above it": Add-ons fiddle with settings and extensions, but never the other way around. That's why I've designed the settings module add-on loader to only load variables starting with addon_ (in lower case, so they're not picked up by Settings.setmodule()) so far. It would be cool to allow add-ons to load/configure other add-ons, though. In that case an add-on manager could become useful again...
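A minimal sketch of that filtering rule (the function name is an assumption, not the actual loader's API): only lowercase addon_-prefixed variables are treated as add-on configs, so Settings.setmodule(), which copies uppercase names, never sees them.

```python
import types

def load_addon_configs(module):
    """Collect add-on config dicts from a settings module (sketch).

    Only lowercase ``addon_``-prefixed variables are picked up, so
    ``Settings.setmodule()`` (which copies UPPERCASE names) ignores them.
    """
    configs = {}
    for name in vars(module):
        if name.startswith('addon_'):
            configs[name[len('addon_'):]] = getattr(module, name)
    return configs

# Toy settings module showing the split between the two namespaces:
settings = types.ModuleType('settings')
settings.BOT_NAME = 'demo'                         # regular uppercase setting
settings.addon_httpcache = {'_name': 'httpcache'}  # add-on config

print(load_addon_configs(settings))  # {'httpcache': {'_name': 'httpcache'}}
```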


jdemaeyer (Contributor) commented Jun 24, 2015

After some more coding, I've decided to again merge the AddonLoader and what I called the AddonHolder into an AddonManager. There are just too many cross-dependencies: I wanted the holder to still be able to load add-ons last minute (allowing add-ons to load other add-ons, though I'm not exactly sure whether we want to allow this) and check dependency issues, while at the same time getting rid of the confusing { used_key: (addon, configuration) } structure of the loader.

Now, the AddonManager provides two class methods that the user may use when she doesn't want to use the 'holding' part, and provides convenience methods to add add-ons in case she doesn't want to use the 'loading' part.

I have integrated the add-on system into Scrapy's start-up process (it currently reads from both scrapy.cfg and the settings module), but did no testing of a "fully integrated" system so far.

One issue currently in my head is: Do we allow add-ons to load/configure other add-ons? I have not fully thought through the implications of this. The main problem I see is that at the moment the first callback of an add-on is called (update_settings()), other add-ons might already have been called. I could see this work if we require update_settings() to not do any settings introspection/dependency checking/interaction with other add-ons whatsoever. Being able to load/configure an add-on at this point could be useful for "umbrella add-ons" that load other add-ons depending on their configuration


jdemaeyer (Contributor) commented Jul 3, 2015

I've put some thought into the "add-ons loading/configuring other add-ons" issue. The big problem with the current implementation is that the first time code of an add-on is executed is during the first round of callbacks to update_settings(). Should an add-on load or reconfigure another add-on here, other add-ons might already have been called. While it is possible to ensure that the update_settings() method of the newly added add-on is called, there is no guarantee (and in fact, it is quite unlikely) that all add-ons see the same add-on configuration in their update_settings().

I see three possible approaches to this:

  1. Forbid add-ons from loading or configuring other add-ons. In this case 'umbrella add-ons' would not be possible and all cross-configuration dependencies would again be burdened onto the user.
  2. Forbid add-ons from doing any kind of settings introspection in update_settings(), instead only allowing them to make changes to the settings object or load other add-ons. In this case, configuring already enabled add-ons should be avoided, as there is no guarantee that their update_settings() method has not already been called.
  3. Add a third callback, update_addons(config, addonmgr), to the add-on interface. Only loading and configuring other add-ons should be done in this method. While it may be allowed, developers should be aware that depending on config (of their own add-on, i.e. the one whose update_addons() is currently called) is fragile, as once again, there is no guarantee in which order add-ons will be called back and whether some other add-on will want to touch config.

Currently I have implemented option 2. I have not quite made up my mind about it just yet, but I think option 3 is my favourite for now.
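Under option 3, the interface would grow to three callbacks. A rough, self-contained sketch (the callback signatures and the settings stub are assumptions, not the final API):

```python
class DummySettings(dict):
    """Tiny stand-in for Scrapy's Settings, just for this sketch."""
    def set(self, name, value):
        self[name] = value

class MyAddon:
    NAME = 'myaddon'
    VERSION = '1.0'

    def update_addons(self, config, addonmgr):
        # Option 3: the only place allowed to load/configure other add-ons,
        # e.g. an 'umbrella add-on' could call addonmgr.add(...) here.
        pass

    def update_settings(self, config, settings):
        # Only write to settings here; no introspection of other add-ons,
        # since their update_settings() may not have run yet.
        settings.set('MYADDON_LIMIT', config.get('limit', 10))

    def check_configuration(self, crawler):
        # Runs once the crawler is fully configured; raise to abort the crawl.
        pass

settings = DummySettings()
MyAddon().update_settings({'limit': 5}, settings)
print(settings['MYADDON_LIMIT'])  # 5
```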


jdemaeyer (Contributor) commented Jul 3, 2015

I'm trying to write a test that ensures the crawler breaks down when an add-on's check_configuration() callback raises an exception, but I don't know enough about Twisted to understand why it doesn't work :/

Here's the test, put into the CrawlTestCase in tests/test_crawl.py:

    @defer.inlineCallbacks
    def test_abort_on_addon_failed_check(self):
        class FailedCheckAddon(Addon):
            NAME = 'FailedCheckAddon'
            VERSION = '1.0'
            def check_configuration(self, crawler):
                raise ValueError
        addonmgr = AddonManager()
        addonmgr.add(FailedCheckAddon())
        crawler = get_crawler(SimpleSpider, addons=addonmgr)
        yield self.assertFailure(crawler.crawl(), ValueError)

And this is the output I get from tox:

====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:

DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>

I've also tried assertRaises, which tells me:

FailTest: ValueError not raised (<Deferred at 0x7fea3d2b3e60 current result: <twisted.python.failure.Failure <type 'exceptions.ValueError'>>> returned)
curita (Member) commented Jul 7, 2015

The twisted error when testing check_configuration() is a little tricky, and it's actually something that wasn't handled before.

====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:

DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>

This output means there's a pending call that wasn't stopped properly. In this case it is a twisted LoopingCall started by the Downloader. The downloader is stored in the engine, so that call reference can be accessed in a crawler as crawler.engine.downloader._slot_gc_loop.

The engine is instantiated in Crawler.crawl(). The problem is, that call is already running when check_configuration() is called and fails in FailedCheckAddon, but there is no place that handles that failure and closes the downloader (which stops the looping call) properly. Not sure why that wasn't handled for errors happening after creating the engine; I guess it's not noticeable unless you run tests that tell you about it.

So, there are two ways of fixing this. The first one is to move the line calling addons.check_configuration() above the instantiation of the engine. This would fix the test but doesn't help solving the underlying problem. The second approach would be to close the downloader if anything raises after the engine creation. Something like:

try:
    self.spider = self._create_spider(*args, **kwargs)
    self.engine = self._create_engine()
    self.addons.check_configuration(self)
    ...
except Exception:
    self.crawling = False
    if self.engine is not None:
        self.engine.downloader.close()
    raise

should be enough. If you want to, you can implement this fix in this PR (or in another one), or we can implement it at some other time and follow the first approach for now.

Last thing about the test: I'd refrain from using self.assertFailure() if possible; it's something present in trial but not in the builtin unittest, and it can be avoided. This should work too:

@defer.inlineCallbacks
def test_abort_on_addon_failed_check(self):
    class FailedCheckAddon(Addon):
        NAME = 'FailedCheckAddon'
        VERSION = '1.0'
        def check_configuration(self, crawler):
            raise ValueError
    crawler = get_crawler(SimpleSpider)
    crawler.addons.add(FailedCheckAddon())
    with self.assertRaises(ValueError):
        yield crawler.crawl()
curita (Member) commented Jul 14, 2015

I think there are some things to discuss about the design, I took note on some questions and suggestions I had so we can talk them through.

There are some concerns about the usage of the scrapy.cfg file to set the addons configuration. I agree here with @kmike that we need to support using the current settings for a couple of reasons:

  • I'm not sure if we can maintain the locality of the addons configuration. This could be backward incompatible when scrapy components get ported; there could be addons that depend on other addons' config. Not sure if I'm stretching the issue, but at least it was mentioned in #1334 (sharing a retry setting with redirect middlewares).
  • As mentioned earlier in the pr, scrapy.cfg can't be the only way of configuring addons because having a project is not required for running scrapy, setting core components is something that should be possible with single spiders.
  • We can't remove the already present flexibility of configuring components dynamically via command line or custom_settings.

Both options, using the current settings and the scrapy.cfg files could coexist, but if there is any complication keeping both we should favor the first one.

On a related note, I'm not sure it's a good idea to discard _ENABLED settings (read about discarding them in the SEP), especially if we're keeping the possibility of dynamic configuration. Users may want to disable a component temporarily without deleting or commenting out the whole .cfg section.

Some miscellaneous comments in the addons interface:

  • Related with what I mentioned before: if we lose settings locality, is it required to instantiate the addons so early in the loading process? Could update_settings() and check_configuration() be class methods?
  • An idea :) There could be more non-required attributes in the zope.interface with general information about the addons, for example, some field with a brief description, a dictionary detailing the settings that the addon reads, maybe a list with some non-required dependencies. This sort of copies some utilities from python packages, I wonder if we can make a specific interface for python packages that grabs these settings (name, version, dependencies, etc.) using setuptools.
  • I think we should add a field in the addons (maybe not required?) where the devs could set a fixed number denoting the loading order position for that addon. I know that we have REQUIRES/MODIFIES/PROVIDES to try to address that, but I doubt this is going to be enough to define all cases.
  • A detail: is it ok to enforce having a check_configuration() method in the addons? I understand update_settings() being required (addons' only connection with the crawling engine is by updating the settings, so there is no point in not having this method), but enforcing check_configuration() seems a little restrictive. We could check in AddonManager.check_configuration() whether the method exists and call it only in that case.
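The last two bullets could be sketched like this (the ORDER attribute and the helper names are assumptions, not part of the current implementation):

```python
def sorted_addons(addons):
    # Fixed loading-order field: add-ons without an ORDER attribute
    # fall back to a default slot of 0.
    return sorted(addons, key=lambda addon: getattr(addon, 'ORDER', 0))

def run_checks(addons, crawler):
    # Treat check_configuration() as optional: call it only where defined.
    for addon in sorted_addons(addons):
        check = getattr(addon, 'check_configuration', None)
        if check is not None:
            check(crawler)

class A:
    ORDER = 10

class B:  # no ORDER attribute, so it defaults to 0 and sorts first
    pass

print([type(a).__name__ for a in sorted_addons([A(), B()])])  # ['B', 'A']
```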
jdemaeyer (Contributor) commented Jul 15, 2015

Pheew okay, improving the error handling in Crawler.crawl() took me an awful lot of time and I learned quite a bit about Deferreds and Scrapy's architectural details. I'm still really unhappy, so here's what I have, up for discussion. ;) (Sorry for the wall of text, I'm doing this also to sort out my thoughts)

So I put in the

    if self.engine is not None:
        self.engine.downloader.close()

in the Crawler.crawl() except clause as you suggested, and the test I had worked fine. But of course I wanted to introduce that code The Right Way™ so I wrote another test that checks whether the crawler shuts down gracefully when there's any error during the crawl startup process.

The test is in tests.test_crawl.CrawlTestCase.test_graceful_crawl_error_handling. I mocked ExecutionEngine.start to throw an exception and asserted that Crawler.crawling == False and that the downloader's close method has been called (although I couldn't figure out how to properly wrap the mock around the original method):

    @defer.inlineCallbacks
    def test_graceful_crawl_error_handling(self):
        crawler = get_crawler(SimpleSpider)
        class TestError(Exception):
            pass
        from scrapy.core.downloader import Downloader
        Downloader._orig_close = Downloader.close
        def close_downloader(*args, **kwargs):
            # Ugly work-around since I couldn't get supplying
            # wraps=Downloader.close to the mock to work...
            Downloader._orig_close(crawler.engine.downloader)

        with mock.patch('scrapy.crawler.ExecutionEngine.start') as mock_es, \
             mock.patch.object(Downloader, 'close') as mock_dc:
            mock_es.side_effect = TestError
            mock_dc.side_effect = close_downloader
            with self.assertRaises(TestError):
                yield crawler.crawl()
            self.assertFalse(crawler.crawling)
            self.assertTrue(mock_dc.called)

Now I got a different delayed call remaining after the test: <DelayedCall 0x7f069902cab8 [59.7916400433s] called=0 cancelled=0 LoopingCall<60.0>(LogStats.log, *(<SimpleSpider 'simple' at 0x7f06992bbd90>,), **{})(). I figured (after a while) that I needed to close the spider through the engine. However, as ExecutionEngine.close_spider() also closes the downloader, I need a case separation for the graceful shutdown. So here's what it looks like now:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            self.addons.check_configuration(self)
            start_requests = iter(self.spider.start_requests())
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            self.crawling = False
            if self.engine is not None:
                if self.engine.spider is not None:
                    self.engine.close_spider(self.spider)
                else:
                    self.engine.downloader.close()
            raise

Both tests pass with this, but I think there are some issues. First, the except clause shouldn't have to figure out in which line the error occurred. This is probably better (exception-handling) style but quite ugly:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            try:
                self.addons.check_configuration(self)
                start_requests = iter(self.spider.start_requests())
                yield self.engine.open_spider(self.spider, start_requests)
            except Exception:
                self.engine.downloader.close()
                raise
            try:
                yield defer.maybeDeferred(self.engine.start)
            except Exception:
                self.engine.close_spider(self.spider)
                raise
        except Exception:
            self.crawling = False
            raise

Maybe ExecutionEngine could get a new close() method that does the logic and decides whether it needs to stop the engine, close the spider, or close the downloader (or the stop() method could be updated)?
Then, there's the test itself. It should test behaviour and not implementation: I don't really care whether Downloader.close() was called, I just want to know whether the reactor is clean. But the downloader doesn't have a status attribute or something. I could check whether the Downloader._slot_gc_loop was stopped, but again that would check implementation not behaviour.

Maybe it's best to remove the Downloader.close() mock and rely on twisted.trial to check whether the reactor is clean after the error shutdown?
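The proposed ExecutionEngine.close() could centralize that decision. A toy sketch with stand-in classes (not Scrapy's real engine; the attributes mirror the snippets above):

```python
class DownloaderStub:
    """Stand-in for Downloader, tracking whether close() ran."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

class EngineSketch:
    """Stand-in for ExecutionEngine, just enough to show close()."""
    def __init__(self):
        self.running = False
        self.spider = None
        self.downloader = DownloaderStub()

    def stop(self):
        ...  # full engine shutdown (omitted in this sketch)

    def close_spider(self, spider):
        # as noted above, close_spider() also closes the downloader
        self.downloader.close()

    def close(self):
        # Decide how much tear-down is needed based on how far
        # start-up got before the error:
        if self.running:
            return self.stop()                     # engine started
        if self.spider is not None:
            return self.close_spider(self.spider)  # spider was opened
        self.downloader.close()                    # only the downloader exists

engine = EngineSketch()
engine.close()
print(engine.downloader.closed)  # True
```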

Contributor

jdemaeyer commented Jul 15, 2015

Pheew okay, improving the error handling in Crawler.crawl() took me an awful lot of time and I learned quite a bit about Deferreds and Scrapy's architectural details. I'm still really unhappy, so here's what I have, up for discussion. ;) (Sorry for the wall of text, I'm doing this also to sort out my thoughts)

So I put in the

       if self.engine is not None:
            self.engine.downloader.close()

in the Crawler.crawl() except clause as you suggested, and the test I had worked fine. But of course I wanted to introduce that code The Right Way™ so I wrote another test that checks whether the crawler shuts down gracefully when there's any error during the crawl startup process.

The test is in tests.test_crawl.CrawlTestCase.test_graceful_crawl_error_handling. I mocked ExecutionEngine.start to throw an exception and asserted that Crawler.crawling == False and that the downloader's close method has been called (although I couldn't figure out how to properly wrap the mock around the original method):

    @defer.inlineCallbacks
    def test_graceful_crawl_error_handling(self):
        crawler = get_crawler(SimpleSpider)
        class TestError(Exception):
            pass
        from scrapy.core.downloader import Downloader
        Downloader._orig_close = Downloader.close
        def close_downloader(*args, **kwargs):
            # Ugly work-around since I couldn't get supplying
            # wraps=Downloader.close to the mock to work...
            Downloader._orig_close(crawler.engine.downloader)

        with mock.patch('scrapy.crawler.ExecutionEngine.start') as mock_es, \
             mock.patch.object(Downloader, 'close') as mock_dc:
            mock_es.side_effect = TestError
            mock_dc.side_effect = close_downloader
            with self.assertRaises(TestError):
                yield crawler.crawl()
            self.assertFalse(crawler.crawling)
            self.assertTrue(mock_dc.called)

Now I got a different delayed call remaining after the test: <DelayedCall 0x7f069902cab8 [59.7916400433s] called=0 cancelled=0 LoopingCall<60.0>(LogStats.log, *(<SimpleSpider 'simple' at 0x7f06992bbd90>,), **{})(). I figured (after a while) that I needed to close the spider through the engine. However, as ExecutionEngine.close_spider() also closes the downloader, I need a case separation for the graceful shutdown. So here's what it looks like now:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            self.addons.check_configuration(self)
            start_requests = iter(self.spider.start_requests())
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            self.crawling = False
            if self.engine is not None:
                if self.engine.spider is not None:
                    self.engine.close_spider(self.spider)
                else:
                    self.engine.downloader.close()
            raise

Both tests pass with this, but I think there are some issues. First, the except clause shouldn't have to figure out in which line the error occurred. This is probably better exception-handling style, but quite ugly:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            try:
                self.addons.check_configuration(self)
                start_requests = iter(self.spider.start_requests())
                yield self.engine.open_spider(self.spider, start_requests)
            except Exception:
                self.engine.downloader.close()
                raise
            try:
                yield defer.maybeDeferred(self.engine.start)
            except Exception:
                self.engine.close_spider(self.spider)
                raise
        except Exception:
            self.crawling = False
            raise

Maybe ExecutionEngine could get a new close() method that does the logic and decides whether it needs to stop the engine, close the spider, or close the downloader (or the stop() method could be updated)?
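A minimal sketch of the decision logic such an ExecutionEngine.close() could encapsulate (the class and attribute names here are illustrative, not Scrapy's actual API): stop a running engine, close an opened spider, or fall back to closing only the downloader.

```python
class EngineSketch:
    """Toy stand-in for ExecutionEngine, just to show the case separation."""

    def __init__(self):
        self.running = False
        self.spider = None

    def close(self):
        # Gracefully shut down, whatever state start-up reached:
        if self.running:
            return 'stop'              # full stop, which also closes the spider
        elif self.spider is not None:
            return 'close_spider'      # spider opened but engine never started
        return 'close_downloader'      # nothing opened yet; downloader only

engine = EngineSketch()
engine.spider = object()
print(engine.close())  # -> close_spider
```

With a method like this, the except clause in Crawler.crawl() would collapse to a single self.engine.close() call.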
Then, there's the test itself. It should test behaviour, not implementation: I don't really care whether Downloader.close() was called, I just want to know whether the reactor is clean. But the downloader doesn't expose a status attribute or anything similar. I could check whether Downloader._slot_gc_loop was stopped, but again that would check implementation, not behaviour.

Maybe it's best to remove the Downloader.close() mock and rely on twisted.trial to check whether the reactor is clean after the error shutdown?

jdemaeyer (Contributor) commented Jul 15, 2015

> Last thing about the test: I'd refrain to use self.assertFailure() if possible, it's something present in trial and not in the builtin unittest that can be omitted. This should work too:

Hm, why do we subclass trial's TestCase but then refrain from using its methods? I have (temporarily?) switched back to assertFailure because the twisted version in precise does not support using assertRaises as a context manager, and replacing trial's TestCase with the built-in one lets the tests pass but warns me that LogCapture instances are not uninstalled by shutdown (and I wasn't sure how bad that is ;))


jdemaeyer (Contributor) commented Jul 15, 2015

Hey everyone, thanks again for the feedback (especially for the full review @curita :)), I really appreciate it. Here's my two cents on some of the design issues (once again, caution, wall of text):

Add-on configuration entry point

After I had to discuss the add-on management in a blog post a couple of days ago, I have backpedaled a little on my preference. scrapy.cfg is a place for those settings that are not necessarily bound to any particular project (if I have multiple projects I probably still want to upload them to the same scrapyd server). Add-on configuration is (most commonly) bound to a project, much like other crawling settings, and I agree that it should therefore live inside the project. If we don't want to introduce a new settings file (which I think would be a bad idea), that leaves us with the project settings module.

I see two possible syntaxes for this:

  1. Each add-on has its own variable, like it is implemented right now, and is not exposed into the global settings:

    addon_MYPIPELINE = dict(
        _name = 'path.to.mypipeline',
        user = 'blah',
        ...)
    addon_MYEXT = dict(...)
  2. There is an ADDONS dictionary setting that lives among all other global settings:

    ADDONS = {
        'path.to.mypipeline': {
            'user': 'blah',
            ...
        },
        'path.to.myext': {...},
    }
    
    # or
    
    ADDONS = {
        'path.to.mypipeline': dict(
            user = 'blah',
            ...
        ),
        'path.to.myext': dict(...),
    }

The first approach does not mix settings and add-ons, following the 'add-ons are one level above settings' philosophy I had in mind earlier. However the syntax is not very user-friendly, particularly having to use the _name key. The second approach puts add-ons onto the same level as the settings they are intended to fiddle with. I don't see an immediate drawback of this, but it strikes me as odd design. I also find the nested dictionaries, like those used for logging.config.dictConfig(), a little tedious. The user experience would be much more cumbersome than what we had initially hoped for. The big plus is that it would integrate really well with custom Spider settings.

Now the user might want to enable an add-on system-wide, and I think we could keep the support for scrapy.cfg add-on configuration for this use case only. However, the docs should emphasise that this configuration does not belong to a project, and is therefore not deployed.

Add-ons without a project

I agree that there should be a way to enable and configure add-ons for stand-alone spiders. That would be no problem at all when using syntax 2 from above, so I guess that is what it will boil down to. We could keep syntax 1 if we introduced some new mechanism to enable/configure add-ons from within a spider, but that seems unnecessarily confusing. And frankly, I think syntax 1 is just as ugly as syntax 2, if not a little uglier. ;D

Local configuration

First I want to point out that the (local) add-on configuration is not inaccessible to other add-ons or components. All add-on configurations are available to all components via the crawler.addons.configs dictionary. They just no longer clog up the global settings namespace. That being said, for backwards compatibility we should definitely keep the global settings for all builtin components. The idea I have for the builtin add-ons right now is that they simply take their configuration and expose it into the global settings namespace, much like in scrapy.addons.Addon.update_settings(). The components themselves won't be touched at all. This would allow users to configure built-in add-ons/components through both the new (add-on config) and old (settings module) way, and keep it possible to override component configuration ad hoc via Spider settings or the command line.
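A minimal sketch of that gateway idea, with illustrative names (this is not the actual scrapy.addons.Addon code): the add-on re-exports each local config key into the global settings namespace, prefixed with the add-on's uppercased name.

```python
class GatewayAddon:
    """Illustrative add-on that only forwards its config to global settings."""
    name = 'httpcache'

    def update_settings(self, config, settings):
        # e.g. config['expiration_secs'] -> HTTPCACHE_EXPIRATION_SECS
        for key, value in config.items():
            settings['%s_%s' % (self.name.upper(), key.upper())] = value

settings = {}
GatewayAddon().update_settings({'enabled': True, 'expiration_secs': 60}, settings)
print(sorted(settings))  # ['HTTPCACHE_ENABLED', 'HTTPCACHE_EXPIRATION_SECS']
```

The components never see the add-on; they keep reading the prefixed global settings they always have.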

Add-on callback placement

> Related with what I mentioned before, if we loose settings locality, is it required to instantiate the addons so early in the loading process? Could update_settings() and check_configuration() be class methods?

I'm not quite sure what you mean, aren't the callbacks already methods of a class in the common use case (i.e. when the add-on interface is provided by a class instance, such as in the scrapy.addons.Addon class)? I have used the call to Spider.update_settings() as a guide for where the add-on callbacks should be placed. An add-on might want to replace the stats class, so I guess one line below where it is right now is the latest point where we could call them. Then again, I just realised there are already some settings that cannot be changed in add-ons right now, like the spider loader class. I'm curious what you had in mind?

Add-on interface

+1 for dropping the requirement of check_configuration(), it is indeed very conceivable that many add-ons won't want to use this callback.

I like the additional attributes for further describing the add-on, and reusing setuptools/distutils/packaging etc., in particular for replacing the clumpy version check function I coded (I wasn't finished, I swear ;), but of course I could've figured that there's got to be a much better implementation ready out there).

As to the additional field for load order, I'm not really sure it helps us. As opposed to the current implementation of component orders, the user would have no (easy) influence on the add-on load order. That means there needs to be some kind of communication between devs, or a guide with recommended load order numbers for different use cases. While that might be doable, it seems to me that this will become hard to maintain if full-blown dependency management is introduced at some point.

Add-ons configuring other add-ons

I'm just keeping this in here as a reminder to myself to think about it. As I outlined earlier, I think as soon as the first update_settings() callback is touched, depending on whether we want to introduce the load order attribute, add-on configuration should be fixed. This only leaves the option of introducing a third (optional) callback, configure_addons(), that is called before update_settings().


jdemaeyer (Contributor) commented Jul 23, 2015

7eb837d is a rough version of my idea for the builtin add-ons, and it'd be cool to know if this is in line with the maintainer's ideas before I start writing tests and documentation.

In principle, the add-ons don't do much. All they do is act as a gateway to the 'traditional' settings. Neither the components nor Scrapy's default settings have been touched, so configuring components via the old way still works as expected. Additionally, it allows changing the settings via the add-on mechanism. That is, if I want to enable and configure the HTTP cache middleware, instead of adding this to my settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 60
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]

I can now alternatively use this in my scrapy.cfg (or the equivalent in the ADDONS setting as outlined in the comments above):

[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 500,503
; enabled is already implicitly set through the existence of this section

I think this is close to the idea of the original SEP-021, with the addition that each add-on has its own section.
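The mapping above could be sketched with the standard library's configparser (the section and option names follow the example; the wiring itself is my assumption, not the PR's actual code, and a StringIO stands in for a real scrapy.cfg file):

```python
import configparser
import io

# Stand-in for scrapy.cfg; each [addon:NAME] section configures one add-on.
cfg = io.StringIO("""\
[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 500,503
""")

parser = configparser.ConfigParser()
parser.read_file(cfg)

settings = {}
for section in parser.sections():
    if not section.startswith('addon:'):
        continue
    name = section.split(':', 1)[1].upper()
    settings[name + '_ENABLED'] = True  # presence of the section enables it
    for key, value in parser.items(section):
        settings['%s_%s' % (name, key.upper())] = value

print(settings['HTTPCACHE_EXPIRATION_SECS'])  # '60' (raw cfg values are strings)
```

Note that configparser hands back raw strings, so a real implementation would still need per-option type conversion (e.g. splitting the comma-separated code list into integers).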

If there ever comes a new milestone release where backwards-incompatible cleanup is encouraged, the add-on configuration could be removed from the global settings and live within the add-ons/components instead, making them fully independent of any Scrapy/settings ecosystem. Components with interdependencies could still expose some settings into the global settings namespace, or access other add-ons settings through the add-on manager. Or, as they are interdependent anyways, they could be configured through the same add-on.


curita (Member) commented Jul 29, 2015

Sorry I took so long to reply, I'm still unsure on the alternatives for the add-ons configuration in settings.py, but I'll comment some thoughts I have about this matter. Most of them are things you already said but I wanted to write them to get my head around this.

Add-on configuration entry point
I agree that it seems an odd design decision having the add-ons config in the settings file, but it definitely simplifies things. Out of both presented options, the ADDONS variable seems the more flexible one. I'm not crazy about it but I don't mind configuring add-ons this way in settings.py, django uses this approach with multiple nested dicts a lot.

What I'm unsure of is that I think we're overcomplicating how current components are configured. We would be changing these:

scrapy crawl myspider -s HTTPCACHE_EXPIRATION_SECS=60
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'HTTPCACHE_EXPIRATION_SECS': 60,
    }

to these:

scrapy crawl myspider -s ADDONS={'httpcache':{'expiration_secs': 60}}
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'ADDONS': {
            'httpcache': {
                'expiration_secs': 60,
            }
        }
    }

and I think that's sort of a usability step back.

I wouldn't mind doing this in external components that I probably won't be configuring too often, but for builtin components I'm not so sure, this could become too tedious too soon. I know we're still supporting the current global settings, but if we're going to favor a single config entry point that one should be as carefully considered as possible.

For this reason I started to prefer your first design option since it loses the outermost enclosing dictionary, but a) the add-on configurations should be loaded with the other settings so they can be configured using the command line and custom_settings, though I think adding them as ADDONS_* isn't much of a problem besides cluttering the settings a little, and b) the ADDONS_ prefix isn't really much of a difference from the other approach, it's a workaround for defining a namespace.

So, I dislike having to explicitly state that you're configuring an add-on if that can be deduced some other way. A third option could be to copy some other functionality from Django here (I wonder how many times that has been said in scrapy's history 😄), I'm thinking something like:

INSTALLED_ADDONS = (
    'scrapy.addons.builtins.depth',  # or 'depth' maybe
    'scrapy.addons.builtins.httperror',
    ... # all builtins, this could be added to the settings.py template
    'path.to.mypipeline',
)

MYPIPELINE = {  # uppercased add-on name
    'user': 'blah',
    ...
}

Not sure if possible though, but I kind of like copying the INSTALLED_APPS django setting, and the separated add-on config looks a lot like current settings:

HTTPCACHE = {
    'enabled': True,
    'expiration_secs': 60,
    'ignore_http_codes': [500, 503],
}
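The deduction being proposed here could look roughly like this (all paths, the settings dict, and the lookup rule are illustrative, not an agreed design): the config variable name is derived by uppercasing the last path component of each INSTALLED_ADDONS entry.

```python
INSTALLED_ADDONS = (
    'scrapy.addons.builtins.httpcache',
    'path.to.mypipeline',
)

# Stand-in for the project's settings module namespace:
settings_module = {
    'HTTPCACHE': {'enabled': True, 'expiration_secs': 60},
    'MYPIPELINE': {'user': 'blah'},
}

def addon_config(addon_path, settings):
    # Config variable name = uppercased last path component of the add-on.
    return settings.get(addon_path.rsplit('.', 1)[-1].upper(), {})

print(addon_config('path.to.mypipeline', settings_module))  # {'user': 'blah'}
```

An add-on with no matching variable would simply get an empty config, so enabling with defaults stays a one-line change to INSTALLED_ADDONS.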

Add-on callback placement
Don't mind the comment I made before, forgot how zope.interface.Interface worked and thought it was limiting update_settings and check_configuration to be only instance methods. Then I thought about having the possibility to call those methods without instantiating the class (if those are classmethods) but I don't see any advantage over instantiating it and calling them later. I like the current callback placement.

Builtin add-ons
I really like the overall implementation minus a detail: I pictured adding components to their related component-dict settings slightly differently. I was thinking add-ons managing a single component could have type and order fields in the interface (maybe order can be a config too) that could be used for automatically inserting themselves (or their path) into the component-dict setting defined in type.

I realize these builtin add-ons are sort of encapsulating the actual components, they have to add the components' paths to the settings and not themselves, but maybe the inserted path/obj could be parametrized for this particular case.


curita (Member) commented Jul 29, 2015

Forgot to comment on the error handling in Crawler.crawl() 😓

I don't have a strong preference here, I don't think your last implementation dealing with failures in every deferred looks that ugly. Maybe all those try/excepts could be flattened by handling callbacks manually (i.e. without inlineCallbacks), adding the corresponding errbacks for exception handling to the deferreds, but I think your version is more readable. I like either updating the engine.stop() method or adding a new engine.close() too.

For monitoring pending calls, I like relying on trial to check whether the reactor is clean, seems simpler and I think I read this was a good practice in the twisted testing guidelines. trial probably calls reactor.getDelayedCalls() for this, maybe we could call it explicitly but I think leaving this to the testing suite is probably the best option.

About assertFailure: we tried to get rid of trial some time in the past because it wasn't fully Python 3 compatible. Removing the trial.TestCase inheritance wasn't possible though, because the pytest plugin we're using to mimic trial behavior doesn't work on unittest.TestCase classes. But seeing how we're already using assertFailure in other tests, and how we probably won't be changing that inheritance now, I agree there isn't any problem using it.


jdemaeyer (Contributor) commented Aug 1, 2015

A few small updates/thoughts:

I opted for a new ExecutionEngine.close() method instead of updating the stop() method. The reason is that I didn't want to remove the assert statement made in ExecutionEngine.stop() (after all the whole idea was to clean up even if the engine wasn't running yet).
Code: close() method and usage in Crawler.crawl()
I've also cleaned up the graceful shutdown test.

I've enhanced the base Addon class: it is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind? I've also updated the built-in add-ons to make use of that new feature, and they look much tidier now.
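A rough guess at how those attributes might fit together (the attribute names come from the comment above; the semantics, a component path keyed by its order inside the component-dict setting, are my assumption, not the PR's verified code):

```python
class SingleComponentAddon:
    """Illustrative add-on that manages exactly one component."""
    component = 'myproject.pipelines.MyPipeline'   # object or class path
    component_type = 'ITEM_PIPELINES'              # component-dict setting
    component_key = 300                            # order within that dict

    def export_component(self, settings):
        # Insert the component into its component-dict setting at the
        # configured order, creating the dict if it does not exist yet.
        settings.setdefault(self.component_type, {})[self.component] = \
            self.component_key

settings = {}
SingleComponentAddon().export_component(settings)
print(settings)  # {'ITEM_PIPELINES': {'myproject.pipelines.MyPipeline': 300}}
```

With this shape, a built-in add-on only needs to declare the three attributes and inherit the export logic.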

@curita makes a very valid point with how the ADDONS dictionary is a usability stepback. I like the Django-like INSTALLED_ADDONS approach and have added a corresponding method and test alongside the old methods. A few thoughts:

  • A stumbling block with this approach is that from the command line it's only possible to overwrite the INSTALLED_ADDONS setting, not append to it; not quite sure how to overcome that...
  • Similarly, we somehow need to communicate to users that they should use append in their Spider's update_settings
  • A way out of this could be to allow using both ADDONS and INSTALLED_ADDONS, however that would result in three entry points to the add-on framework. Not very streamlined :/
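The append-vs-overwrite pitfall mentioned above can be sketched like this (the add-on names are made up, and a plain dict stands in for Scrapy's Settings object):

```python
# Project-wide settings, as they might appear in settings.py
project_settings = {
    'INSTALLED_ADDONS': ('httpcache', 'myproject.addons.proxies'),
}

def overwrite_update_settings(settings):
    # Overwrites the setting: the project-wide add-ons are silently lost
    settings['INSTALLED_ADDONS'] = ('myproject.addons.spider_only',)

def append_update_settings(settings):
    # Appends to the setting: the project-wide add-ons survive
    settings['INSTALLED_ADDONS'] += ('myproject.addons.spider_only',)
```

A spider that assigns a fresh tuple drops everything configured at the project level, which is why the documentation would need to steer users toward appending.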

@kmike asked in one of his first posts whether Spiders could also implement the add-on interface and have their methods called through the same code path. I very much like that idea, but so far I couldn't wrap my head around a couple of questions: Should we support config for spiders, and what would it be? Should the Spider be added to the add-on manager? Or should the add-on manager's update_settings() etc. methods have an optional keyword argument where the Spider is passed to call its callbacks?

Now that I've coded around the add-on manager quite a bit, I think @kmike was indeed right that it's not very useful to save the used_keys; I'll drop them soon.

I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?
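For reference, a minimal sketch of version checking with pkg_resources; the function name and call shape here are assumptions for illustration, not the PR's actual code:

```python
import pkg_resources

def check_dependency_clashes(requirements, installed_versions):
    """Return (requirement, installed_version) pairs that clash.

    `requirements` are pip-style strings such as 'scrapy>=1.0';
    `installed_versions` maps project names to version strings.
    """
    clashes = []
    for req_str in requirements:
        req = pkg_resources.Requirement.parse(req_str)
        version = installed_versions.get(req.project_name)
        # Requirement.__contains__ accepts a version string, so this
        # checks whether the installed version satisfies the specifier
        if version is not None and version not in req:
            clashes.append((req_str, version))
    return clashes
```

pkg_resources also ships a working_set that already knows the installed distributions, so an add-on manager could skip the explicit `installed_versions` mapping entirely; that is the direction "making more use of that package" could take.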

Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However, I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?


curita Aug 10, 2015

Member

I opted for a new ExecutionEngine.close() method instead of updating the stop() method. [...]

Love how clean that solution resulted, if you want you can open a separate pull request with it and its test and we can merge it right away.

I've enhanced base Addon class: It is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind?

That was pretty much what I had in mind :) Not sure if I'd have added the abbreviated versions of the component_type values though; they seem rather cryptic for the sake of saving keystrokes, and I'd prefer users to stick with the full names. Maybe I'd have added the component order as a class attribute to match the component and component_key placements (in addition to it being a config option), but this isn't necessary, and it could be taken the wrong way by suggesting the order can't be changed.

A stumbling block with this approach [INSTALLED_ADDONS] is that it's not possible to append, not overwrite, to the INSTALLED_ADDONS setting from the command line; not quite sure how to overcome that...

That's a prevailing problem with any list-valued setting :( I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though; maybe we could support a general enabled config in AddonManager?), it isn't a big drawback to have a fixed INSTALLED_ADDONS in settings.py listing all add-ons used across the spiders in a project. This is far from ideal (dynamically adding/removing add-ons to the list will be hard), but I think it's a reasonable compromise.

@kmike mentioned in one of his first posts if it would be possible that Spiders also implement the add-on interface and have their methods called through the same code path.

I don't recall right now any constraint on spiders being add-ons (hm, wait, couldn't calling the spider's update_settings() at the same time as the other add-ons' update_settings() conflict with the add-ons' config? Could a spider still disable another add-on this way?), but if there aren't any issues (f8cdf1a?) let's implement it 👍

Should we support config for spiders and what would it be?

If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.

Should the Spider be added to the add-on manager? Or should the add-on manager update_settings() etc. method have an optional keyword argument where the Spider is passed to call its callbacks?

The second option is the one you implemented in f8cdf1a, right? I like it; I think this distinction should be helpful at some point.

I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?

Not sure myself. I thought we could have a custom Addon class for add-ons defined inside packages and get version and requires from that package using pkg_resources, but I'm not sure if that's even something that should be done with it.

Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?

Cleaning up the commit history is a must to get the PR fully reviewed once it's done, but I don't see how its readability could be improved any further. Your code is really well documented, so don't worry about its length too much :)


jdemaeyer Aug 13, 2015

Contributor

If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.

(Comment moved to #1442)


jdemaeyer Aug 13, 2015

Contributor

I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though, maybe we could support a general enabled config in AddonManager?)

calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?

I added the possibility to disable any add-on by setting the _enabled key in its config to False; that should allow disabling add-ons both from the Spider's update_addons() method and from the command line.

Once again, since there is currently no guarantee in which order add-on callbacks are called, setting an add-on's config['_enabled'] = False from a second add-on may not prevent the first add-on's update_addons() method from being called. However, it just struck me that, should we settle for the INSTALLED_ADDONS list setting, the user now has the possibility to order add-ons, so maybe I could replace the AddonManager._addons dictionary with an ordered dictionary (also, INSTALLED_ADDONS could be a dictionary-like {'addonpath': order} setting instead of a list, just like the component settings; that would also resolve the "not easy to append instead of overwrite" problem).
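A sketch of how an ordered add-on mapping plus the _enabled config could interact; the function name, the use of bare callables as stand-ins for add-on objects, and the add-on names are all illustrative, not the PR's actual API:

```python
from collections import OrderedDict

def run_update_addons(addons, configs):
    """Call each add-on callback in INSTALLED_ADDONS order, skipping
    any add-on whose config sets _enabled to False."""
    # Make sure every add-on has a config dict other add-ons can touch
    for name in addons:
        configs.setdefault(name, {})
    called = []
    for name, callback in addons.items():
        if not configs[name].get('_enabled', True):
            continue  # disabled, e.g. by an earlier add-on's callback
        callback(configs)  # stand-in for addon.update_addons(config, addons)
        called.append(name)
    return called

# Because the mapping is ordered, an earlier add-on can reliably
# disable a later one before the later callback runs:
addons = OrderedDict()
addons['first'] = lambda configs: configs['second'].update(_enabled=False)
addons['second'] = lambda configs: None
```

With an unordered dict there would be no such guarantee, which is exactly the ordering concern raised above.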

Note to self so I don't forget to remove some code later: 2039892 made the _source config superfluous


codecov-io Aug 16, 2015

Current coverage is 83.42%

Merging #1272 into master will increase coverage by +0.69% as of 48b077b

Powered by Codecov. Updated on successful CI builds.


jdemaeyer Aug 16, 2015

Contributor

hm, wait, calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?)

(Comment moved to #1442)


jdemaeyer Nov 11, 2015

Contributor

Rebased onto master, this PR now once again contains no commits from other PRs. (Finally the green checkmark is back :))


scrapy/utils/misc.py (outdated review thread)
INSTALLED_ADDONS = (
'httpcache',
'path.to.some.addon',
'path/to/other/addon.py',


@kmike

kmike Feb 1, 2017

Member

path support is removed, right?


[addon:path.to.some.addon]
some_config = true
[addon:path/to/other/addon.py]


@kmike

kmike Feb 1, 2017

Member

does it still work?


Default: ``()``
A tuple containing paths to the add-ons enabled in your project. For more


@kmike

kmike Feb 1, 2017

Member

"paths" can be misleading


addons = AddonManager()
# ... load some add-ons here
print addons.enabled # prints names of all enabled add-ons


@kmike

kmike Feb 1, 2017

Member

it is better to use Python3-compatible syntax in examples


INSTALLED_ADDONS = (
'httpcache',
'mymodule.filters.myfilter',
'mymodule/filters/otherfilter.py',


@kmike

kmike Feb 1, 2017

Member

does it still work?


Add-ons are Python modules that implement the following callbacks.
[addon:mymodule/filters/otherfilter.py]


@kmike

kmike Feb 1, 2017

Member

does it still work?


pawelmhm Feb 21, 2017

Member

this is so huge, is there some way to save pieces of this work by splitting it into several smaller PRs? e.g. I'm looking forward to the changes in load_object; is it ok if I steal that from here (cherry-pick) and create a new PR with it? https://github.com/scrapy/scrapy/pull/1272/files#diff-2536f98aac2c474ec1f424c8a943b9d2


coveralls Mar 9, 2017

Coverage Status

Changes unknown when pulling a65fc0d on jdemaeyer:enhancement/addons into scrapy:master.


@kmike kmike referenced this pull request Aug 20, 2018

Open

Adding domain #3160
