Feeds Enhancement: Post-Processing #5168
This comment is exclusively about the plugin interface, which I think is the most important one (the manager will be an internal component, but users will write their own plugins). I don't think the processing method should be static. Plugins should be Scrapy components (i.e. it should be possible to add a …), and I think we should make the plugin interface similar to, for example, …
So that plugin chaining could look like this:
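For illustration, a minimal self-contained sketch of that kind of chaining, with made-up plugin classes and an in-memory buffer standing in for the target storage (not the exact interface proposed in this comment):

```python
import io


class PrefixPlugin:
    """Illustrative plugin: prefixes each write, then forwards it to the file it wraps."""

    def __init__(self, file):
        self.file = file

    def write(self, data):
        return self.file.write(b"item: " + data)

    def close(self):
        self.file.close()


class UppercasePlugin:
    """Illustrative plugin: uppercases the data, then forwards it."""

    def __init__(self, file):
        self.file = file

    def write(self, data):
        return self.file.write(data.upper())

    def close(self):
        self.file.close()


target = io.BytesIO()                          # stands in for the feed storage file
chain = UppercasePlugin(PrefixPlugin(target))  # the outer plugin receives the data first
chain.write(b'{"name": "example"}\n')
print(target.getvalue())                       # b'item: {"NAME": "EXAMPLE"}\n'
chain.close()
```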
Of course, with an arbitrary number of plugins, the manager code would have to actually call the open and close methods of those plugins, instead of using …
I think chained plugins could look like this:
```python
# plugin1, plugin2, plugin3 subclass a parent plugin class
p1 = plugin1(original_file)
p2 = plugin2(p1)
p3 = plugin3(p2)
p3.write(data)  # will be called by the manager class
```
So a single write call processes the data through all the plugins and finally writes to the target storage.
This is a valid method as well to achieve the pipeline. Should we keep both methods of pipelining and give the user the option to choose one?
Could you elaborate? Which two methods are we talking about?
I mean the way the pipelining flow will work: 1) write data into the next plugin whenever a write call comes in, or 2) collect everything and write it into the next plugin only once, when closing.
I am thinking that we want to allow and encourage 1. But an API designed for 1 still allows a plugin to use 2 if needed: Scrapy could write multiple times into plugin1, but instead of calling write on plugin2 each time, plugin1 could store the input data in a temporary file, and once its close method is called, from that method it could perform a single plugin2.write() call followed by plugin2.close().
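For illustration, a minimal sketch of that buffering approach, assuming the plain write/close plugin interface discussed above (the class names and the PrintSink stand-in are made up for the example):

```python
import tempfile


class BufferingPlugin:
    """Hypothetical sketch of approach 2: buffer every write in a temporary
    file and forward the whole payload to the next plugin only on close()."""

    def __init__(self, next_plugin):
        self.next_plugin = next_plugin
        self._buffer = tempfile.TemporaryFile()

    def write(self, data):
        # Scrapy may call write() many times; nothing reaches the next plugin yet.
        return self._buffer.write(data)

    def close(self):
        self._buffer.seek(0)
        self.next_plugin.write(self._buffer.read())  # single write into the next plugin
        self._buffer.close()
        self.next_plugin.close()


class PrintSink:
    """Stand-in for the next plugin or the target storage file."""

    def write(self, data):
        print(f"received {len(data)} bytes in a single call")

    def close(self):
        pass


plugin = BufferingPlugin(PrintSink())
plugin.write(b"line 1\n")
plugin.write(b"line 2\n")
plugin.close()  # forwards all 14 bytes at once, then closes the sink
```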
Ok, yeah, the implementation of the plugins will ultimately depend on the user (obviously 😅).
@Gallaecio, you mentioned making the compression plugins flexible by letting them have different parameter values available to them. I'm thinking of letting users pass those parameters through a dict in … For example:
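One possible shape for such a nested dict, sketched here only for illustration (the postprocessing_params key and the compresslevel parameter are made-up names, not part of the proposal):

```python
# Sketch of the dict-based idea: parameters nested under each declared plugin.
FEEDS = {
    "items.json": {
        "format": "json",
        "postprocessing": ["scrapy.extensions.feedexport.GZipProcessor"],
        "postprocessing_params": {
            "scrapy.extensions.feedexport.GZipProcessor": {"compresslevel": 5},
        },
    },
}
```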
So parameters are passed only to those plugins which are declared in the postprocessing list.
Also, we're passing …
While namespacing plugin settings into a dictionary sounds reasonable in either of the approaches you suggest, I think the simplest approach, and the one most in line with existing code, is to let plugin-specific options be defined directly among the feed options:
```python
{
    'items.json': {
        'format': 'json',
        'postprocessing': ["scrapy.extensions.feedexport.GZipProcessor"],
        'gzip_compression_level': 1,
    },
}
```
This goes in line with the approach of Scrapy settings, where namespaces are built into the setting names. It also makes it easy for one setting to be interpreted by more than one plugin, which is not something I would generally recommend, but I believe use cases for that could exist.
@Gallaecio I noticed that #5168 has been merged into master, but I was wondering when this feature will be available in the Scrapy Python package (I'm using the latest version, 2.5.1).
We are aiming to merge #4978 before we release 2.6.
Summary
A feed post-processing enhancement will enable plugins such as compression, minification, and beautification, which can then be added to the feed exporting workflow.
Motivation/Proposal
A post-processing feature can help add more extensions dedicated to before-export processing to Scrapy. To help achieve extensibility, a `PostProcessingManager` can be used, which will use plugin-like Scrapy components to process the data before writing it to the target files. The `PostProcessingManager` can act as a wrapper around the slot's storage, so whenever a write event takes place, the data is run through the plugins in a pipeline-like way to be processed and then written to the target file.

A number of plugins can be created, but there will be a need to specify the order in which these plugins are used, as some won't be able to process the data after it has been processed by another (e.g. minifying won't work on a compressed file). These plugins will be required to have a certain interface so that the `PostProcessingManager` can use them without breaking down from unidentified components.

A few built-in plugins can be provided, such as compression plugins: gzip, lzma, bz2.
PostProcessingManager class prototype:
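A minimal sketch of what such a manager could look like, assuming it wraps the slot's target file and builds the plugin chain from the configured plugin classes (the method names and constructor signature here are assumptions, not the final API):

```python
class PostProcessingManager:
    """Hypothetical sketch: wraps the feed slot's target file and pipes every
    write through the configured plugins, in order."""

    def __init__(self, plugins, file, feed_options=None):
        self.file = file
        self.feed_options = feed_options or {}
        self.head = self._build_chain(plugins)

    def write(self, data):
        # Looks like a plain file write to the feed slot, but the data actually
        # enters the first plugin of the chain.
        return self.head.write(data)

    def close(self):
        # Closing the first plugin cascades down the chain to the real file.
        self.head.close()

    def _build_chain(self, plugins):
        # Build the chain back to front: the last plugin wraps the target file,
        # and each earlier plugin wraps the one after it.
        wrapped = self.file
        for plugin_class in reversed(plugins):
            wrapped = plugin_class(wrapped, self.feed_options)
        return wrapped
```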
PostProcessorPlugin class interface:
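A sketch of the plugin interface under the same assumptions, i.e. a file-like object that processes data and forwards it to the file it wraps (the abstract base class and type hints are illustrative):

```python
from abc import ABC, abstractmethod
from typing import Any, BinaryIO, Dict


class PostProcessorPlugin(ABC):
    """Hypothetical interface: a plugin behaves like a writable file that
    processes data and forwards it to the file-like object it wraps."""

    def __init__(self, file: BinaryIO, feed_options: Dict[str, Any]) -> None:
        self.file = file
        self.feed_options = feed_options

    @abstractmethod
    def write(self, data: bytes) -> int:
        """Process ``data`` and write the result to the wrapped file,
        returning the number of bytes written."""

    @abstractmethod
    def close(self) -> None:
        """Flush anything still buffered and close the wrapped file."""
```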
GzipPlugin example:
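A sketch of a gzip plugin implementing that interface, assuming the compression level is read from a namespaced feed option as discussed in the comments (the exact option name and default are assumptions):

```python
import gzip
from typing import Any, BinaryIO, Dict


class GzipPlugin:
    """Hypothetical gzip plugin: compresses everything written through it."""

    def __init__(self, file: BinaryIO, feed_options: Dict[str, Any]) -> None:
        self.file = file
        compress_level = feed_options.get("gzip_compression_level", 9)
        self.gzipfile = gzip.GzipFile(
            fileobj=file, mode="wb", compresslevel=compress_level
        )

    def write(self, data: bytes) -> int:
        return self.gzipfile.write(data)

    def close(self) -> None:
        self.gzipfile.close()
        self.file.close()
```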
settings.py example:
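A settings.py sketch along the lines of the feed options shown earlier in the thread (the GZipProcessor import path is taken from the comment above and may not match the final implementation):

```python
# settings.py (sketch): enable a post-processing plugin for one feed and pass
# its compression level as a namespaced feed option.
FEEDS = {
    "items.json": {
        "format": "json",
        "postprocessing": ["scrapy.extensions.feedexport.GZipProcessor"],
        "gzip_compression_level": 1,
    },
}
```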
Describe alternatives you've considered
This feature idea is actually an expansion of compression support (see #2174). Item Pipelines could be used for compression as well, but implementing this feature instead can give the user more options for post-processing while making it easier to activate those post-processing components for specific feeds.
Additional context
This feature proposal is part of a GSoC project (see #4963). This issue has been created to get input from the Scrapy community to refine the proposed feature.