Feeds Enhancement: Post-Processing #5168

Closed
drs-11 opened this issue May 29, 2021 · 12 comments · Fixed by #5190

drs-11 commented May 29, 2021

Summary

A feed post-processing enhancement will enable plugins such as compression, minification and beautification, which can then be added to the feed export workflow.

Motivation/Proposal

A post-processing feature can help add more extensions dedicated to pre-export processing to Scrapy. To achieve extensibility, a PostProcessingManager can be used, which will use "plugin"-like Scrapy components to process the data before writing it to the target files.

The PostProcessingManager can act as a wrapper around the slot's storage, so whenever a write event takes place the data is run through the plugins in a pipeline-like way and then written to the target file.

A number of plugins can be created, but the order in which they are applied must be specified, as some won't be able to process data that has already been processed by another (e.g. minification won't work on compressed data). Plugins will be required to implement a common interface so that the PostProcessingManager can use them without breaking on unidentified components.

A few built-in plugins can be provided, such as compression plugins for gzip, lzma and bz2.

PostProcessingManager class prototype:

class PostProcessingManager:
    """
    This will manage and use the declared plugins to process data in a
    pipeline.

    :param plugins: all the declared plugins for the uri
    :type plugins: list
    :param file: target file whose data will be processed before write
    :type file: file like object
    """

    def __init__(self, plugins, file):
        # 1) load the plugins here
        # 2) save the file as an attribute
        ...

    def write(self, data):
        """
        Uses all the declared plugins to process the data first, then writes
        the processed data to the target file.

        :param data: data passed to be written to target file
        :type data: bytes
        :return: number of bytes written
        :rtype: int
        """

    def close(self):
        """
        Close the target file along with all the plugins.
        """

PostProcessorPlugin class interface:

from zope.interface import Interface


class PostProcessorPlugin(Interface):
    """
    Interface for plugins that will be used by PostProcessingManager. It
    defines the necessary processing methods.
    """

    def __init__(self, file, feed_options):
        """
        Initialize the plugin with the target file to which post-processed
        data will be written and the feed-specific options.
        """

    def write(self, data):
        """
        Exposed method which takes the data passed, processes it and then
        writes it to the target file.

        :param data: data passed to be written to target file
        :type data: bytes
        :return: number of bytes written
        :rtype: int
        """

    def close(self):
        """
        Closes this plugin wrapper.
        """

    @staticmethod
    def process(data):
        """
        Processes the data and returns it.

        :param data: input data
        :type data: bytes
        :return: processed data
        :rtype: bytes
        """

GzipPlugin example:

import gzip

from zope.interface import implementer


@implementer(PostProcessorPlugin)
class GzipPlugin:
    COMPRESS_LEVEL = 9

    def __init__(self, file, feed_options):
        # initialise various parameters for gzipping
        self.file = gzip.GzipFile(fileobj=file, mode=file.mode,
                                  compresslevel=self.COMPRESS_LEVEL)

    def write(self, data):
        return self.file.write(data)

    def close(self):
        self.file.close()

    @staticmethod
    def process(data):
        return gzip.compress(data, compresslevel=GzipPlugin.COMPRESS_LEVEL)

settings.py example:

from myproject.pluginfile import MyPlugin

FEEDS = {
    'item1.json': {
        'format': 'json',
        'post-processing': ['gzip'],
    },
    'item2.xml': {
        'post-processing': [MyPlugin, 'xz'],    # order is important
    },
}


POST_PROC_PLUGINS_BASE = {
    'gzip': 'scrapy.utils.postprocessors.GzipPlugin',
    'xz': 'scrapy.utils.postprocessors.LZMAPlugin',
    'bz2': 'scrapy.utils.postprocessors.Bz2Plugin',
}
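
As an illustration only, the manager could resolve the entries of a feed's post-processing list against this mapping with scrapy.utils.misc.load_object; the helper below is just a sketch and not part of the proposal itself:

from scrapy.utils.misc import load_object


def resolve_plugins(declared, plugins_base):
    """Turn a feed's 'post-processing' list into plugin classes.

    ``declared`` may mix short names ('gzip'), dotted import paths and
    already-imported classes, as in the settings.py example above.
    """
    resolved = []
    for entry in declared:
        if isinstance(entry, str):
            # short names map to dotted paths via POST_PROC_PLUGINS_BASE;
            # other strings are assumed to be import paths themselves
            path = plugins_base.get(entry, entry)
            resolved.append(load_object(path))
        else:
            resolved.append(entry)  # already a class object, e.g. MyPlugin
    return resolved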

Describe alternatives you've considered

This feature idea is actually an expansion of compression support (see #2174). Item pipelines could be used for compression as well, but implementing this feature instead gives the user more post-processing options while making it easier to activate those components for specific feeds.

Additional context

This feature proposal is part of a GSoC project (see #4963). This issue has been created to get input from the Scrapy community to refine the proposed feature.

drs-11 commented May 30, 2021

As PostProcessingManager will be the wrapper around the feed's storage, I think write and close won't be needed in the PostProcessorPlugin interface.
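
For illustration, a trimmed-down interface under that assumption might look like the sketch below (reusing the Interface base from the prototype above; whether process should stay a static method is discussed in the next comment):

class PostProcessorPlugin(Interface):
    """
    Plugin interface reduced to construction and processing; writing to and
    closing the target file would be handled by PostProcessingManager itself.
    """

    def __init__(self, file, feed_options):
        """
        Initialize the plugin with the target file and the feed-specific
        options.
        """

    def process(self, data):
        """
        Process ``data`` (bytes) and return the processed bytes.
        """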

@Gallaecio

This comment is exclusively about the plugin interface, which I think is the most important one (the manager will be an internal component, but users will write their own plugins).

I don’t think the processing method should be static. Plugins should be Scrapy components (i.e. it should be possible to add a from_crawler method to them to initialize them with access to settings), and hence they could read custom settings that condition their behavior. I think we need a regular method that can access object variables.
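
To make that concrete, a non-static plugin could look roughly like this; from_crawler follows the usual Scrapy component convention, and the FEED_GZIP_COMPRESSLEVEL setting name is invented purely for illustration:

import gzip


class GzipPlugin:
    def __init__(self, file, compresslevel=9):
        self.file = file
        self.compresslevel = compresslevel

    @classmethod
    def from_crawler(cls, crawler, file):
        # read a (hypothetical) setting so the plugin's behaviour can be
        # configured, which a static process method could not do
        level = crawler.settings.getint('FEED_GZIP_COMPRESSLEVEL', 9)
        return cls(file, compresslevel=level)

    def process(self, data):
        # regular method: it can use state initialized from settings
        return gzip.compress(data, compresslevel=self.compresslevel)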

And I think we should make the plugin interface similar to, for example, zipfile.ZipFile:

with Plugin(output_file_object) as plugin:
    plugin.write(data)

So that plugin chaining could look like this:

with Plugin2(output_file_object) as plugin2:
    with Plugin1(plugin2) as plugin1:
        plugin1.write(data)

Of course, with an arbitrary number of plugins, the manager code would have to actually call the open and close methods of those plugins instead of using with, but I hope this example helps visualize the main idea: plugins should get write calls, and in those calls they should be able to write into the next plugin (or the final file, in the case of the last plugin), but not required to (e.g. some plugins may store the input data internally, and write it all at once into the next plugin when the input data stops [close method gets called]).
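
For an arbitrary number of plugins, the chaining described above could be built up by the manager roughly like this (a sketch only; names are not part of the proposal):

def build_chain(plugin_classes, output_file_object):
    """Wrap the output file so that data written to the returned object
    flows plugin1 -> plugin2 -> ... -> output_file_object."""
    target = output_file_object
    # wrap from the plugin closest to the file outwards
    for plugin_cls in reversed(plugin_classes):
        target = plugin_cls(target)
    return target

# the manager then calls target.write(data) for each chunk and, when the
# feed is finished, target.close(); each plugin's close() is expected to
# close (or flush into) the next stage in turn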

drs-11 commented Jun 1, 2021

I think chained write and close methods look simplest to me right now.
So something like:

# plugin1, plugin2 and plugin3 subclass a parent plugin class
p1 = plugin1(original_file)
p2 = plugin2(p1)
p3 = plugin3(p2)

p3.write(data) # will be called by manager class

So a single write call processes the data through all the plugins and finally writes to the target storage.

some plugins may store the input data internally, and write it all at once into the next plugin when the input data stops [close method gets called]

This is a valid method as well to achieve the pipeline. Should we keep both methods of pipelining and give the user the option to choose one?

@Gallaecio

This is a valid method as well to achieve the pipeline. Should we keep both methods of pipelining and give the user the option to choose one?

Could you elaborate? Which 2 methods are we talking about?

drs-11 commented Jun 2, 2021

I mean the way the pipelining flow will work: 1) write data into the next plugin whenever the write method is called, or 2) write it all at once into the next plugin when the close method is called, as you have said.

@Gallaecio

I am thinking that we want to allow and encourage 1.

But an API designed for 1 still allows for a plugin to use 2 if needed. Scrapy could write multiple times into plugin1, but instead of calling write on plugin2 each time, plugin1 could store the input data into a temporary file, and once its close method is called, from that method it could perform a single plugin2.write() call followed by plugin2.close().
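
A rough sketch of such a buffering plugin (purely illustrative, names invented):

import tempfile


class BufferingPlugin:
    """
    Stores incoming data in a temporary file and performs a single write
    into the next plugin (or the target file) when closed.
    """

    def __init__(self, file):
        self.file = file  # the next plugin in the chain, or the target file
        self.buffer = tempfile.TemporaryFile()

    def write(self, data):
        return self.buffer.write(data)

    def close(self):
        self.buffer.seek(0)
        # single write into the next stage, as described above
        self.file.write(self.buffer.read())
        self.buffer.close()
        self.file.close()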

drs-11 commented Jun 3, 2021

Ok yeah, the implementation of plugins will ultimately depend on the user (obviously 😅).

drs-11 commented Jun 24, 2021

@Gallaecio, you mentioned making the compression plugins flexible by letting them have different parameter values available to them. I'm thinking of letting users pass those parameters through a dict in the FEEDS setting, where the key will be the declared plugin and the value will be another dict of parameters used as kwargs.

For example:

{
    'items.json': {
        'format': 'json',
        'postprocessing': ["myproject.plugins.plugin1"],
        'postproc-parameters': { "myproject.plugins.plugin1": {"para1": "val1", "para2": "val2"} },
    },
}

So parameters are passed only to those plugins which are declared in the postproc-parameters field. Does this seem okay or could there be a better alternative?

drs-11 commented Jun 24, 2021

Also, we're passing feed_options to the plugins, so we could also declare a plugin1-parameters option and then pass those parameters as a dict to whichever plugin accepts the plugin1-parameters option. An example of that variant is shown below.
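
For example, that variant could look like this in FEEDS (plugin and parameter names are only placeholders):

{
    'items.json': {
        'format': 'json',
        'postprocessing': ['myproject.plugins.plugin1'],
        'plugin1-parameters': {'para1': 'val1', 'para2': 'val2'},
    },
}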

@Gallaecio

While namespacing plugin settings into a dictionary, as in either of the approaches you suggest, sounds reasonable, I think the simplest approach, and the one most in line with existing code, is to let plugin-specific options be defined directly among feed_options, with the namespace built into each setting name. For example:

{
    'items.json': {
        'format': 'json',
        'postprocessing': ["scrapy.extensions.feedexport.GZipProcessor"],
        'gzip_compression_level': 1,
    },
}

This goes in line with the approach of Scrapy settings, where namespaces are built into the setting names. It also makes it easy for one setting to be interpreted by more than one plugin, which is not something I would generally recommend, but I believe use cases for that could exist.
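
A minimal sketch of how a plugin could then read its namespaced option from the feed options it receives (assuming, as in the interface prototype above, that the plugin gets feed_options at construction time):

class GZipProcessor:
    def __init__(self, file, feed_options):
        # 'gzip_compression_level' is namespaced directly into the feed
        # options, as in the FEEDS example above
        self.compresslevel = feed_options.get('gzip_compression_level', 9)
        self.file = file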

@SashiDareddy

@Gallaecio I noticed that #5168 has been merged into master, but I was wondering when this feature will be available in the Scrapy Python package (I'm using the latest version, 2.5.1).

@Gallaecio

We are aiming to merge #4978 before we release 2.6.
