Add-ons (SEP-021) Discussion #591

Closed
nyov opened this Issue Feb 12, 2014 · 12 comments

7 participants
@nyov
Contributor

nyov commented Feb 12, 2014

I totally dig this idea, but I also think this needs to be well thought-out or it could spell more trouble than it solves and be hard to deprecate and fix later.

My thoughts on the current proposal:

[addons] in scrapy.cfg would not provide any way to pass configuration settings. So we still need to modify settings.py for, say, the database connection parameters.
That doesn't look very portable.

Also, if two different addons expose or use the same settings, it gets ugly.
Say we have different addons, one with a MongoPipeline and one with a MongoQueue, which by chance both use the MONGO_COLLECTION setting name.

This is obviously already a problem, but I think with add-ons, people would be more likely to expect a "plug-and-play" system that "just works", and be less careful about checking individual add-on components for name clashes or dependency issues.

Unless add-ons are expected to be configured prior to use, or shipped only as bundles of hardcoded configuration settings, I think we need to go one layer deeper and keep add-ons and their settings in the same place.
Maybe we can use an [addon_(name)] (or simply [(addon_name)]) ini section which can then take multiple parameters.
These settings should be namespaced to the add-on.
Example:

[httpcache] # (looking in python path, no settings)
#enabled = True # (does ini-style require a parameter to recognize the section?)
[mongodb_pipeline]
path = /path/to/mongodb_pipeline.py # (explicit path)
host = 'localhost'
port = 27017

These settings would then expand to (addon_name)_(setting), e.g. mongodb_pipeline_port or MONGODB_PIPELINE_PORT to prevent namespace clashes between addons.
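To make the expansion concrete, here is a minimal sketch of how such sections could be mapped to namespaced settings. load_addon_settings is a hypothetical helper, not existing Scrapy API; it assumes scrapy.cfg is read with the stdlib ConfigParser and that keys like 'path' and 'enabled' are reserved for the add-on machinery itself:

try:
    from configparser import ConfigParser            # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2

def load_addon_settings(cfg_path='scrapy.cfg', reserved=('path', 'enabled')):
    """Expand each [addon_name] section into ADDON_NAME_KEY settings."""
    parser = ConfigParser()
    parser.read(cfg_path)
    expanded = {}
    for section in parser.sections():
        if section in ('settings', 'deploy'):
            continue  # standard scrapy.cfg sections, not add-ons
        for key, value in parser.items(section):
            if key in reserved:
                continue  # e.g. 'path' locates the add-on code, it is not a setting
            expanded['%s_%s' % (section.upper(), key.upper())] = value
    return expanded

# [mongodb_pipeline] with "port = 27017" would become {'MONGODB_PIPELINE_PORT': '27017'}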

(Would this copy settings on scrapyd deployment?)
Alternatively, maybe all the add-on declaration and configuration could live in settings.py, but I think it should be in a single place.

@nyov

Contributor

nyov commented Feb 12, 2014

A rather more complex system.

For the uninitiated Scrapy user, per-addon post-initialization checks may not be safe enough
if they happily mash up whatever add-ons they find somewhere.

Since add-ons would encapsulate the whole spectrum of Scrapy hooks and settings,
without regard to existing settings or other add-ons, strange cross-dependencies could occur.

For a truly robust addon system we may need something like a dependency tree in scrapy components?
Maybe I'm overthinking this, but one add-on may need to replace an existing component which would break another add-on.

Mostly this won't be an issue for simple pipeline and middleware add-ons,
but if one needs to replace an enabled default with itself, or is an exclusive component like the SCHEDULER, it might break other, unsuspecting add-on bundles in unpredictable ways.

Now the add-on could be paranoid, and maybe check that all the settings or extensions are actually what it would expect (e.g. "I need httpcache_storage to be FilesystemCacheStorage"),
but even then this would be error-prone, since it can't possibly predict the whole target system setup (oops, my httpcache might not implement httpcache_storage).

But I wonder if it makes sense to address this issue.
I can't really envision this, implementation-wise, but if we could say
"ExtensionManager implements extensions",
"FeedExporter requires extensions"
or
"Scheduler implements scheduler, has DUPEFILTER_CLASS",
"RFPDupeFilter requires DUPEFILTER_CLASS"
and then I am using a CustomFrontierScheduler which exposes no DUPEFILTER_CLASS,
my add-on, bundling RFPDupeFilter and setting DUPEFILTER_CLASS, may fail in a defined manner.

But this could also work instead of fixed priority-based ordering for middlewares/pipelines/extensions, which would seem a more robust base for complex add-on trees:
"HttpCacheMiddleware implements downloadermw, provides httpcache, has httpcache_storage",
"HttpCacheMiddleware runs-after $most-dmws",
"HttpCacheMiddleware runs-before (nothing)",
where $most-dmws is a meta keyword and defines something like
"$most-dmws runs-after httpauth, redirect, cookies" (rather, more $keyword "levels" here)
"httpauth runs-after robotstxt, runs-before httpproxy"
and so on.

With these hints, Scrapy could then auto-sort middlewares, extensions, etc. and find unresolvable or circular dependencies in other components.
And my component could say "I depend on (any) httpcache" or "I implement a mongo-backed stats system, I depend on mongocache, fail/deactivate if it's not in use" (an EXTENSION depending on a MIDDLEWARE).
In such a manner, there should be no limit to cascading dependencies; one could throw all manner of well-defined modules together in a project and switch whole dependency trees on/off with only a few settings.
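To illustrate, a minimal sketch of how such "runs-after" hints could be auto-sorted, with cycle detection; the component names and the hint format here are made up for illustration, not anything Scrapy provides:

def order_components(runs_after):
    """runs_after maps each component to the components it must run after.
    Returns components ordered so dependencies come first; raises on cycles."""
    ordered, visiting, done = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError('circular dependency involving %r' % name)
        visiting.add(name)
        for dep in runs_after.get(name, ()):
            visit(dep)
        visiting.discard(name)
        done.add(name)
        ordered.append(name)

    for name in list(runs_after):
        visit(name)
    return ordered

hints = {
    'httpauth': ['robotstxt'],
    'httpproxy': ['httpauth'],
    'httpcache': ['httpauth', 'redirect', 'cookies'],
}
print(order_components(hints))
# e.g. ['robotstxt', 'httpauth', 'httpproxy', 'redirect', 'cookies', 'httpcache']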

oh, and /cc @pablohoffman

@Mnw2212

Mnw2212 commented Feb 19, 2014

Hey,
I am new here and I hope it's not an issue if I comment here.

What I have understood from the last two posts by nyov is that if we make it simple enough to add extensions through the configuration file, the following issues may occur.

  1. People will not be careful about extension names and other settings, which may lead to name clashes or error-prone extension behavior.
  2. Add-ons may be enabled irrespective of other add-ons or settings, which may eventually lead to dependency issues.

nyov has proposed possible solutions, but I could not quite picture their implementation.
I feel the solution to this may be complex, but a starting point could be to create something like a dependency tree (as nyov said) to let the user know of the dependencies that the extension affects, and then give them a choice to either continue or change the extension.

Another thing that may be done is to assign priorities to the dependencies that will be affected. What I mean is that whenever the extension is in use, the settings may be changed according to it, but the moment it ceases, the settings are changed back to normal (or default).

These are just my preliminary views on this issue and I am not sure if they will be of much importance; please let me know if I have understood something the wrong way.

Thank You,
Mayuresh

@SudShekhar

Contributor

SudShekhar commented Mar 24, 2015

I have been trying to come up with a feasible solution to this problem. I would appreciate any feedback:

  1. We can define a single class AddOnManager which will tackle the issue of importing modules for every such extension/middleware/pipeline etc.
  2. Users will have to define the name of the extension and any relevant settings (db username, password, etc.) in settings.py. We can have all extensions clubbed under an ADD_ONS dict.
  3. To the _get_mwlist_from_settings method present in all the MiddlewareManagers, we add a call to AddOnManager
 return build_components_list(...) + AddOnManager.get_add_ons(cls.name, settings, crawler)

The get_add_ons function will call the addon_configure function for each add-on and after finishing all the add-ons, check the crawler_ready function. The cls.name field will be used to match add-ons built for that component.

  4. We mandate each add-on to be an instance of an abstract AddOn class. This class defines functions such as update_settings (to change the value of a setting and mark it as unchangeable by future add-ons, useful in case of conflict; also to prefix the add-on name to the setting name), report_errors, etc. (A rough sketch of such a class follows at the end of this comment.)

  5. Each add-on must add its name, version and requirements to the global settings. This way all add-ons which are instantiated after this can access the settings to satisfy their dependencies. Any add-on requiring a module not present in the crawler raises an error in the crawler_ready function. Any conflict in settings between two add-ons will be caught either in addon_configure or crawler_ready.

I believe that this mechanism will handle add-ons for extensions, middlewares, pipelines and downloaders but would love to hear from the maintainers and mentors on this topic.
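As mentioned in point 4, here is a rough sketch of what such an abstract AddOn class could look like. Every name and hook below just follows the proposal above (nothing is existing Scrapy API), and settings is treated as a plain dict for brevity:

class AddOn(object):
    """Hypothetical base class for add-ons."""
    name = None
    version = None
    requires = ()                # names of add-ons/components this one needs
    default_settings = {}        # settings this add-on wants to inject

    def update_settings(self, settings):
        # prefix every setting with the add-on name to avoid clashes,
        # and register this add-on so later add-ons can see it
        for key, value in self.default_settings.items():
            settings['%s_%s' % (self.name.upper(), key.upper())] = value
        settings.setdefault('ENABLED_ADDONS', {})[self.name] = self.version

    def report_errors(self, message):
        raise RuntimeError('[%s] %s' % (self.name, message))

    def crawler_ready(self, crawler):
        # called once everything is configured; verify requirements
        enabled = crawler.settings.get('ENABLED_ADDONS', {})
        for req in self.requires:
            if req not in enabled:
                self.report_errors('missing requirement: %s' % req)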

@curita

Member

curita commented Mar 25, 2015

  1. We can define a single class AddOnManager which will tackle the issue of importing modules for every such extension/middleware/pipeline etc.

+1 seems reasonable

  2. Users will have to define the name of the extension and any relevant settings (db username, password etc) in the settings.py. We can have all extensions clubbed under ADD_ONS dict.

I understand that adding an ADD_ONS setting in settings.py is more flexible (we won't have to implement anything new to handle it and we can use python directly), but I don't think we need this kind of flexibility, just a section to declare each addon and its settings is enough. scrapy.cfg is the first configuration file read (that's where we look for the settings.py path actually), so configuring the addons there could let us set extensions before settings.py is read. For example, a new extension could set a different Settings class. Not sure if this is needed, we can evaluate it, but I think I prefer sticking with scrapy.cfg if there are no apparent drawbacks.

  3. To the _get_mwlist_from_settings method present in all the MiddlewareManagers, we add a call to AddOnManager

return build_components_list(...) + AddOnManager.get_add_ons(cls.name, settings, crawler)

The get_add_ons function will call the addon_configure function for each add-on and after finishing all the add-ons, check the crawler_ready function. The cls.name field will be used to match add-ons built for that component.

I haven't made an in-depth review yet, but I think this has a few implementation possibilities. For example, we could call from MiddlewareManager._get_mwlist_from_settings some method in AddOnManager that just calls addon_configure in all addons (though it doesn't even need to be called here, we can call it when we're updating the settings outside MiddlewareManager), and then call crawler_ready in MiddlewareManager.from_settings, when it's checking what addons are configured (maybe this could be outside the MiddlewareManager too).

I'd personally like to keep the MiddlewareManagers as they are, say configuring the addons before or after running the managers (some managers can be configured by settings, I think that's my main reason for wanting to keep them untouched, that way we can maintain backward compatibility with user defined managers), but I'm not sure this is possible.

  4. We mandate each add-on to be an instance of an abstract AddOn class. This class defines functions such as update_settings (to change the value of a setting and mark it as unchangeable by future add-ons, useful incase of conflict; also to prefix addon name to setting name) , report_errors etc.

Makes sense.

  5. Each add-on must add its name, version and requirements to the global settings. This way all add-ons which are instantiated after this, can access the settings to satisfy their dependencies. Any add-on requiring a module not present in crawler raises an error in the crawler_ready function. Any conflict in settings (between two add-ons, will be caught either in addon_configure or crawler_ready)

Makes sense, the AddOn class could have those "name", "version" and "requirements" as class attributes, and some method in AddOnManager could take those and construct a dependency tree (I'd use an attribute in the crawler for this instead of the global settings, since those are user defined).

As @nyov said, it's kind of hard to get a robust dependency manager since there are a lot of components that could interact in different ways, maybe we could continue with the approach described in the SEP (setting the order of the middlewares in their '<component_MIDDLEWARE>' setting and checking dependencies manually in crawler_ready) for this GSoC unless we have spare time to implement it.

On a really minor side note, I hate the name "crawler_ready" :P if it's going to be used just for checking dependencies, let's change it to "check_dependencies" or "check_configuration" or something along those lines. I like changing "addon_configure" to "update_settings" as well, that's how it's called in the Spider class.

/cc @nramirezuy I recall you wanted to let different components update the settings, this idea is going to do that.

@SudShekhar

Contributor

SudShekhar commented Mar 26, 2015

we could call in MiddlewareManager._get_mwlist_from_settings some method in AddOnManager that just calls addon_configure in all addons (though it doesn't even need to be called here, we can call it when we're updating the settings outside MiddlewareManager), and then call crawler_ready in MiddlewareManager.from_settings, when it's checking what addons are configured (maybe this could be outside the MidlewareManager too).

Actually my original idea was to place add-ons in a single folder, further classifying them into sub-folders (e.g. extensions, spidermiddlewares, etc.). This way the AddOnManager would just have to load the add-ons defined in a particular sub-folder (depending on cls.name) and instantiate them. This would automatically impose an instantiation order between the sub-folders (DownloaderAddOns first, ExtensionAddOns last, etc.).

But on second thought, your method lets a particular add-on be loaded across multiple MiddlewareManagers (depending on its check_configuration function) without having to keep several copies of the code. So +1 for this one.

we could continue with the approach described in the SEP (setting the order of the middlewares in their '<component_MIDDLEWARE>' setting and checking dependencies manually in crawler_ready)

Indeed this approach seems to be the best. Regarding the dependency manager, another issue I can think of is the case of spurious/unneeded settings. For example:

We are running the SpiderMiddlewareManager; it runs through all the add-ons and calls their update_settings. Now our crawler.addon_settings dict (assume this is where we keep the settings) is configured. But when we run the check_configuration function for all the add-ons, a couple of them fail. Now we should remove the failed add-ons' settings from crawler.addon_settings (just to be safe).

I guess we can handle this either by assuming extension developers will implement a rollback_update_settings function which will do the opposite of update_settings OR by having the AddOnManager perform a dry run, find the list of enabled addons and then call their update_settings on the crawler.addon_settings (I personally prefer the second one).
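A minimal sketch of the dry-run variant, assuming check_configuration can be pointed at the trial settings; every name here (configure_addons, crawler.addon_settings, the hooks) is hypothetical and just follows the discussion above:

import copy

from scrapy.exceptions import NotConfigured

def configure_addons(addons, crawler):
    """Apply update_settings to a throwaway copy first, keep only the
    add-ons whose check_configuration passes, then apply those for real."""
    trial = copy.deepcopy(crawler.addon_settings)
    for addon in addons:
        addon.update_settings(trial)

    surviving = []
    for addon in addons:
        try:
            addon.check_configuration(trial, crawler)
            surviving.append(addon)
        except NotConfigured:
            pass  # dropped; its trial settings are simply discarded

    for addon in surviving:
        addon.update_settings(crawler.addon_settings)
    return surviving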

Two more things:

  1. If an add-on doesn't have its dependencies satisfied, do we stop or continue crawling? I believe this should be defined by the user. I suggest we define a critical = [yes/no] flag inside the section for each add-on in scrapy.cfg. This can help us decide what to do.

  2. I am guessing this approach will work for all the MiddlewareManager components. But I am not too sure how to simplify the procedure for loading COMMANDS_MODULE, DUPEFILTER_CLASS etc. Are we assuming that these will be automatically overridden by the enabled extensions (and thus, don't need to be handled separately)?

@nramirezuy

Member

nramirezuy commented Mar 26, 2015

@curita
2) When you deploy a project, just settings.py is added to the egg; I don't see what the problem is with configuring the extensions before reading settings.py when we have settings scopes.

5) If an add-on is enabled and the dependencies aren't satisfied, let it explode and close the crawler.

@jdemaeyer

Contributor

jdemaeyer commented Apr 4, 2015

I'm new, so here's a quick intro: Hi! I'm Jakob :) I've also put some thoughts into add-on management for my GSoC proposal.

I don't think Scrapy should continue when there is an error while loading one of the add-ons. Depending on the add-on it probably means a substantial change in functionality (e.g. results are lost when a pipeline add-on fails). The user should be made to explicitly acknowledge that by disabling the add-on herself, not just be informed through a warning message or similar. Instead of marking an add-on as non-critical they might as well start the spider for a second and see if there are any problems.

I'm not sure if there is an advantage in an AddOn class: If we force existing extensions (e.g. a pipeline) to subclass it, this would forbid add-ons from using multiple hooks (an add-on could no longer consist of a pipeline and a Spider middleware, etc.). If we make developers define a new class for their addon (subclassing AddOn), we might as well make them define a module/.py file, thereby enforcing clearer structure (one file/module per addon) and freeing us from having to instantiate the add-on class.

For the dependency trees, maybe as a first step towards a full-fledged solution the add-ons could be advised to supply some additional variables (besides their name and version), e.g. REQUIRES, MODIFIES, PROVIDES. A rather simple function in the add-on manager could then check whether the requirement of one add-on is in the MODIFIES and PROVIDES lists of other add-ons.
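For what it's worth, that check could be as small as the sketch below; REQUIRES, MODIFIES, PROVIDES and NAME are the suggested attributes, nothing here is existing Scrapy API:

def check_addon_dependencies(addons):
    """Return a list of unmet requirements across all loaded add-ons."""
    available = set()
    for addon in addons:
        available.add(getattr(addon, 'NAME', None))
        available.update(getattr(addon, 'PROVIDES', ()))
        available.update(getattr(addon, 'MODIFIES', ()))

    problems = []
    for addon in addons:
        for requirement in getattr(addon, 'REQUIRES', ()):
            if requirement not in available:
                problems.append('%s requires %s, which no loaded add-on provides'
                                % (addon.NAME, requirement))
    return problems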

@curita

Member

curita commented Apr 5, 2015

@SudShekhar:
About the rollback for updated settings when some add-on fails, we should evaluate if this is necessary (we could just stop the execution if an add-on fails, as was suggested). In case it’s necessary, this should be done entirely on our side, to avoid add-on developers having to duplicate the logic of updating the settings. Making a dry run could make sense then, but I think we should study different use-cases for settings updated by add-ons; porting some extensions to this new add-on system should help. Take for example the spider middlewares: those you want to use are defined in SPIDER_MIDDLEWARES, but if something goes wrong when initializing them they just throw a NotConfigured error and are disabled inside SpiderMiddlewareManager, no need to roll back settings. We could do something similar for add-ons maybe, but we should review what settings can be updated and how to make sure.

About the critical=[yes/no] part, this is tied with closing the crawler immediately if dependencies aren't satisfied. Probably it’s easier to consider that any explicitly enabled add-on is critical.

I think update_settings should work for setting things besides middlewares too (like the suggested COMMANDS_MODULE and DUPEFILTER_CLASS).

@curita


curita commented Apr 5, 2015

@jdemaeyer
I’m thinking that an Addon class is good for enforcing the methods we’re expecting add-ons to have (update_settings and check_configuration), and maybe for having a common parent for all add-on classes, but I don’t mind checking that they satisfy the interface some other way.

Providing additional variables to manage dependencies seems like a reasonable implementation. Those variables, if implemented, should be carefully considered to avoid backward incompatible changes or too many rewrites in existing addons after we promote their usage.

@nyov

Contributor

nyov commented Apr 5, 2015

I agree on an AddonManager class as proposed to handle add-ons, but I would see an interface implementation (using zope.interface as explained here) as a more versatile alternative to requiring the subclassing of an Addon class.
@jdemaeyer, it would also be nice (IMO) if add-ons weren't required to be thrown together in an addons directory, but rather kept the distinction between spidermw, downloadermw, pipelines, etc., with the code separated. I hope that might be possible if we only need to check for interface implementers. (And for the proposal: contrib is going away, see #1063.)

On the other hand, if all add-ons are required to be split into their own .py file, and be their own module (in Python's sense), maybe it's possible to use __name__, __version__, __requires__, __depends__ (etc.) module globals as Addon boilerplate instead. But I'm not sure about that, not sure at all.
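For illustration, a possible zope.interface-based contract; the interface and its method names are only my guess at what SEP-021 could standardize, the example add-on is made up, and settings is treated as a plain dict:

from zope.interface import Attribute, Interface, implementer

class IAddon(Interface):
    name = Attribute("unique add-on name")
    version = Attribute("add-on version string")

    def update_settings(settings):
        """Inject this add-on's configuration into the settings."""

    def check_configuration(crawler):
        """Raise NotConfigured if requirements are not met."""

@implementer(IAddon)
class HttpCacheAddon(object):
    name = 'httpcache'
    version = '1.0'

    def update_settings(self, settings):
        settings['HTTPCACHE_ENABLED'] = True

    def check_configuration(self, crawler):
        pass

# the manager only checks the interface, not a base class:
assert IAddon.providedBy(HttpCacheAddon())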

@SudShekhar

Contributor

SudShekhar commented Apr 8, 2015

Hi,
First of all, sorry for the late response. I had my university exams going on.

@curita : So my basic idea goes like this: we just pass a variable functionality (say, with value SPIDER_MIDDLEWARES) to the add-on's update_settings function. At this point, all add-ons which don't implement this functionality (aren't SPIDER_MIDDLEWARES) raise the NotConfigured error; the others carry on. This way we can have add-ons which are enabled at multiple points (and the add-ons can decide what to do at which component). Any other error (apart from NotConfigured) at this stage stops crawling.
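On the manager side this could look roughly like the sketch below; get_add_ons and the functionality argument are the hypothetical pieces just described (the exact signature would of course follow whatever the manager ends up passing):

from scrapy.exceptions import NotConfigured

def get_add_ons(functionality, addons, settings):
    """Ask every add-on to configure itself for this hook; add-ons that
    don't implement the hook opt out by raising NotConfigured."""
    enabled = []
    for addon in addons:
        try:
            addon.update_settings(functionality, settings)
        except NotConfigured:
            continue  # e.g. a pipeline-only add-on asked about SPIDER_MIDDLEWARES
        enabled.append(addon)
    return enabled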

My reasons for proposing the AddOn class:

  1. It lets us define API-like access to crawler settings and other parts of the code. We can use the existing priority count for settings (default, command, ...) or define a completely new framework independent of old code. We can define a uniform error-reporting interface (to server/console/file).
  2. In general, we get more control over the add-ons.
  3. If we allow some add-ons to fail and still continue the crawling (the critical=[yes/no] idea), this might make it easier to implement. One example off the top of my head: an IPython-type add-on which visualizes crawling and logs to a local file and a remote server. We might want to allow one or several of these functionalities to fail without hindering crawling.

@nyov : Indeed zope.interface is much more versatile in this regard. I guess the question comes down to whether we want to control the way add-ons update internal settings or not. I would love to have your and others' views on this issue.
And regarding the idea of splitting add-ons into different folders, it will make instantiating add-ons much easier (the functionality variable isn't needed), but then how do we handle add-ons with multiple hooks?

@jdemaeyer

Contributor

jdemaeyer commented May 25, 2015

I will be working on an implementation of add-on management starting now!

While I finish up per-key settings priorities (#1149) as prerequisite (almost all add-ons will want to touch the dictionary settings), I will draft an updated version of the SEP this week. I think these are some of the major open issues (sorry, lots of text, feel free to skip sections):

1. How is the add-on interface enforced?

I think our options here are:

  1. Module: Promotes clear structure (one module per add-on), but does not allow us to provide any defaults (not all add-ons will want to do post-init tests; it might be good to have a dummy check_configuration() callback)
  2. Subclass of AddOn: Allows us to provide default callbacks, and, if we want that, to control how add-ons interact with the settings
  3. zope.interface: Leaves the choice of object type (module, class) up to the developer

I tend towards zope.interface, since it is the most versatile. Especially since we could still provide a base AddOn class with dummy callbacks and leave it up to the developer whether they want to use it. Scrapy will have to deal with determining whether an addon path points to a module or a class, but that should only be a small update to scrapy.utils.misc.load_object().

2. Do we want to control how add-ons interact with settings?

@SudShekhar raised the question whether we want to provide some kind of API with which the add-ons interact with Scrapy's settings, rather than directly handing them the Settings object. I don't think this is necessary. After all, the people writing add-ons are developers; they should know what they're doing, I don't see a reason to restrict settings access. I'm open to being convinced otherwise though.

3. Are add-ons configured in scrapy.cfg or settings.py?

I am not yet completely familiar with what file is intended for what. It seems reasonable to me to load add-ons as early as possible, since they might want to mess with / replace core components (including the Settings class). But if scrapy.cfg is not shipped by scrapyd-deploy, how do we make the add-on configuration portable?

4. Can we keep (some) configuration local to the add-on?

It isn't 100% clear to me from the previous discussion whether people had this in mind already or not. Instead of exposing every single setting into the global settings space, we could hand over the configuration section (for this add-on) from scrapy.cfg to an add-on callback. The developer can then decide if she wants to insert them into the Settings instance or not.

For example, a pipeline add-on might define the add-on callbacks and the pipeline class in the same module. This way, database settings (etc.) could be stored in module variables rather than in the global Scrapy settings, further reducing the chance of name clashes.

This example gives rise to a follow-up question: If we allowed setting not only paths, but also objects as values in the (priority, value) middleware settings, the add-on could take care of instantiating a pipeline/middleware/extension. This could again be useful to avoid cluttering up the global settings space. (Plus, in the example above, the add-on callbacks and the pipeline callbacks could be defined in the same class, storing local settings in instance attributes. The same pipeline class could then be reused to write to multiple destinations, for example.)
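A sketch of what that could look like, with the add-on keeping its scrapy.cfg section to itself and, if object values were allowed, registering an already-configured pipeline instance. The callback signature, the config argument and MongoPipeline are assumptions for illustration only:

class MongoPipeline(object):
    """Hypothetical pipeline configured through its constructor."""
    def __init__(self, host, port):
        self.host, self.port = host, port

    def process_item(self, item, spider):
        return item  # write the item to MongoDB here

class MongoPipelineAddon(object):
    def update_settings(self, config, settings):
        # 'config' is this add-on's own [mongodb_pipeline] section from
        # scrapy.cfg; nothing database-related leaks into global settings
        host = config.get('host', 'localhost')
        port = int(config.get('port', 27017))
        # if objects (not only import paths) were allowed as values in the
        # ITEM_PIPELINES dict, the add-on could register the instance itself:
        settings.setdefault('ITEM_PIPELINES', {})[MongoPipeline(host, port)] = 300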

5. Where does the add-on manager live?

This is the question where I feel the most clueless as to what could be the best solution, and I've written down some thoughts on this in my GSoC proposal.

Right now I see two possible solutions, both of which I dislike:

  1. Setting changes made by the add-ons are not very different from setting changes made by the user in settings.py, and should therefore be included in what is returned by scrapy.utils.project.get_project_settings(). However, the add-on manager cannot live inside of that function since it is left long before we want to call the check_configuration callbacks. If we don't want to start passing around the add-on manager, we will have to introduce a new singleton, e.g. scrapy.addons.addonmanager.
  2. Spider-specific settings are loaded in Crawler.__init__(). Add-on management could follow this approach: instantiate the add-on manager and do the first round of callbacks here, then do the second round of callbacks in Crawler.crawl() or on the engine_started signal. This avoids the extra singleton but further rips apart compiling the complete configuration settings.

6. Can add-ons interact, and how?

I have not yet thought this through, but I like the idea that add-ons can interact with each other. For example, instead of forcing every extension/middleware/pipeline to provide their own mongodb interface and the corresponding settings variables, there could be an extension (as in set in EXTENSIONS) providing a mongodb API (promoting UNIX "do one thing and do it well" philosophy). Other add-ons can then access this extension and its methods through Crawler.addons['mongodb'] or similar.

Actually this can probably already be done through Crawler.extensions. Is there a way we can promote better modularity?

7. Dependency management

So far I feel full-fledged, package-manager-style dependency management might be a bit overkill for the first version of an add-on manager. In my proposal, I suggested that each add-on must provide the two attributes NAME and VERSION, and should provide REQUIRES, MODIFIES, and PROVIDES to allow some basic dependency checks. I agree with @curita that the variables whose usage we promote should be carefully selected to avoid future rewrites, and I would love some feedback on the ones suggested here.

@nyov I agree that add-ons should not be required to be put into an addons directory, rather developers should be allowed to decide whether they want to provide the add-on interface alongside their pipeline/middleware/extension or provide it in an extra module/class. I had the addons directory in mind for the latter case.
