Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acccessor extending approach limits functional programming approach, make direct monkey-patching also possible #1080

Closed
smartass101 opened this issue Nov 4, 2016 · 9 comments

Comments

@smartass101
Copy link

Hi, thatnks for creating and continuing development of xarray. I'm in the process of converting my own functions and classes to it which did something very similar (label indexing, plotting, etc.) but was inferior in many ways.

Right now I'm designing a set of functions for digital signal processing (I need them the most, though inteprolation is also important), mostly lowpass/highpass filters and spectrograms based on scipy.signal. Initially I started writing a dsp accessor with such methods, but later I realized, that this accessor approach makes it quite hard to do something like dataset.apply(lowpass, 5). Instead, one has to do something like dataset.apply(lambda d: d.dsp.lowpass(0.5)) which is less convenient than the clear functional programming apply approach.

I agree that making sure that adding a method to the class does not overwrite something else is a good idea, but that can be done for single methods as well. It would be even possible to save replaced method somewhere and replace them later if requested. The great advantage is that the added methods can still be first-class functions as well.

Such methods cannot save state as easily as accessor methods, but in many cases that is not necessary.

I actually implemented something similar for my DataArray-like class (before xarray existed, now I'm trying to convert to xarray) with such plugin handling (below with slight modifications for DataArray). Let me know what you think.

'''Module for handling various DataArray method plugins'''
from xarray import DataArray
from types import FunctionType

# map: name of patched method -> stack of previous methods
_REPLACED_METHODS = {}


def patch_dataarray(method_func):
    '''Sets method_func as a method of the DataArray class

    The method name is inferred from method_func.__name__
    
    Can be used as decorator for functions that should be added to the
    DataArray class as methods, for example::

        @patch_dataarray
        def square(self, arg):
            return self**2
        
    The decorated function then becomes a method of the class, so
    these two are equivalent::

        foo(sig) == sig.foo()

    '''
    method_name = method_func.__name__
    method_stack = _REPLACED_METHODS.setdefault(method_name, [])
    method_stack.append(getattr(DataArray, method_name, None))
    setattr(DataArray, method_name, method_func)
    return method_func


def restore_method(method_func):
    '''Restore a previous version of a method of the DataArray class'''
    method_name = method_func.__name__
    try:
        method_stack = _REPLACED_METHODS[method_name]
    except KeyError:
        return                  # no previous method to restore
    previous_method = method_stack.pop(-1)
    if previous_method is None:
        delattr(DataArray, method_name)
    else:
        setattr(DataArray, method_name, previous_method)


def unload_module_patches(module):
    '''Restore previous versions of methods found in the given module'''
    for name in dir(module):
        obj = getattr(module, name)
        if isinstance(obj, FunctionType):
            restore_method(obj)


def patch_dataarray_wraps(func, func_name=None):
    '''Return a decorator that patches DataArray with the decorated function

    and copies the name of the func and adds a line to the docstring
    about wrapping the function
    '''
    if func_name is None:
        func_name = func.__name__
    def updater(new_func):
        '''copy the function name and add a docline'''
        new_func.__name__ = func_name
        new_func.__doc__ = (('Wrapper around function %s\n\n' % func_name)
                            + new_func.__doc__)
        return patch_dataarray(new_func)
    return updater
@shoyer
Copy link
Member

shoyer commented Nov 5, 2016

I don't see a conflict between encouraging accessors and making duplicate methods/functions. If you want duplicate method/functions, you can totally do that on an accessor class:

import functools
import xarray

@xarray.register_dataarray_accessor('custom')
class CustomAccessor(object):
    def __init__(self, data_array):
        self._data_array = data_array

def patch_custom(func):
    @functools.wraps(func)
    def method(accessor, *args, **kwargs):
        return func(accessor._data_array, *args, **kwargs)
    setattr(CustomAccessor, func.__name__, method)
    return func

@patch_custom
def square(data_array):
    return data_array ** 2

xarray.DataArray([1, 2, 3], dims='x').custom.square()
# <xarray.DataArray (x: 3)>
# array([1, 4, 9])
# Coordinates:
#  * x        (x) int64 0 1 2

If you really desire, you can even make method-like accessors by adding a __call__ method, e.g.,

@xarray.register_dataarray_accessor('square')
class Square(object):
    def __init__(self, data_array):
        self._data_array = data_array
    def __call__(self):
        return self._data_array ** 2

or even simpler

@xarray.register_dataarray_accessor('square')
def square_accessor(data_array):
    def square():
        return data_array ** 2
    return square

I would definitely discourage writing too many of such methods, though.

Finally, it seems like the simplest solution to your concern about needing methods for Dataset.apply would be to register an accessor for Dataset as well, with register_dataset_accessor.

@smartass101
Copy link
Author

Thank you for your response.

I still don't understand why you are pushing accessors in place of methods to such an extent. Is it because of namespace growth/conflicts? There are already many methods like diff, any which don't seem particularly more important than others. For instance, ndarray has no diff method yet you implement it.

While the solutions you presented are usable, they seem like workarounds and somewhat redundant or add extra like overhead (in terms of writing code). Registering extra dataset accessors where DataArray method application would do seems again redundant.

I would definitely discourage writing too many of such methods, though.

Could you please give some clear arguments why you discourage the use of normal methods? The two arguments listed in the docs don't really make a compelling case against method monkey-patching, because

  1. name clashes can be easily checked for either approach (in either case you just check the existence of a class attribute)
  2. caching on the dataset sometimes makes no sense and just adds redundancy and complicates the design and registering of extra functionality

I'm not trying to say that the accessor approach is wrong, I'm sure it makes sense for certain plugins. I'm just trying to share my experience with a very similar case where the simpler method approach turned out to be satisfactory and I think enabling it would increase the chances of more xarray plugins (which may not need accessor logic) coming to life.

Btw, perhaps it might be better to (perhaps optionally) issue a warning when overriding an existing class attribute during registering instead of completely refusing to do so.

@shoyer
Copy link
Member

shoyer commented Nov 6, 2016

Is it because of namespace growth/conflicts? There are already many methods like diff, any which don't seem particularly more important than others. For instance, ndarray has no diff method yet you implement it.

Indeed. My thinking was the xarray.Dataset and xarray.DataArray are in the "xarray" namespace. We allow you to register an extension namespace, but want to keep it well contained and under one attribute, so it's clear(er) to users and developers what is going on, and where the code comes from.

A stricter approach would have been to put everything under an attribute just for extensions, e.g., Dataset.x.namespace instead of Dataset.namespace, but this gets even more cumbersome -- and also conflicts with variables named x!

Could you please give some clear arguments why you discourage the use of normal methods? The two arguments listed in the docs don't really make a compelling case against method monkey-patching, because

  1. name clashes can be easily checked for either approach (in either case you just check the existence of a class attribute)

I'll add a note about the value of namespaces to the doc.

  1. caching on the dataset sometimes makes no sense and just adds redundancy and complicates the design and registering of extra functionality

We could certainly turn this off (optionally) if there are cases where it does the wrong thing. Could you go into this in a little more detail, perhaps with a concrete example? My expectation was that this should have minimal design or performance downsides.

@smartass101
Copy link
Author

The namespace argument doesn't seem very convincing since you already implement many methods which may shadow variables (mean, diff). By limiting control of the namespace you make some uses somewhat inconvenient. If you want users to use DataArray as a general and universal and also extensible container, limiting its namespace goes against that. If they shadow vars by their methods, that's their decision to make.

While it may seem cleaner to have a stricter API, in real use cases users care more about convenient code access than where it came from. And when they look at the method object it will clearly tell them where it was defined. Python's introspection capabilities are powerful enough that users can find out such information.

What I meant by the 2. point was that in many cases one just needs a simple method and with the accessor approach one has to write extra lines of code like the ones you suggested earlier that may later seem cryptic. Caching of the accessor can be indeed useful, just not always. If you want people to develop plugins, make it as simple as possible and yet also advanced for those who require it. And then there"s also the problem of accessors not being usable in functional programming paradigms.

Tl;dr: accessors have benefits (namespace containment, caching) but also limitations (not functional paradigm, overkill sometimes). Give users more control over methods and you'll get more plugins.

On November 6, 2016 2:22:44 PM GMT+01:00, Stephan Hoyer notifications@github.com wrote:

Is it because of namespace growth/conflicts? There are already many
methods like diff, any which don't seem particularly more important
than others. For instance, ndarray has no diff method yet you implement
it.

Indeed. My thinking was the xarray.Dataset and xarray.DataArray are
in the "xarray" namespace. We allow you to register an extension
namespace, but want to keep it well contained and under one attribute,
so it's clear(er) to users and developers what is going on, and where
the code comes from.

A stricter approach would have been to put everything under an
attribute just for extensions, e.g., Dataset.x.namespace instead of
Dataset.namespace, but this gets even more cumbersome -- and also
conflicts with variables named x!

Could you please give some clear arguments why you discourage the use
of normal methods? The two arguments listed in the docs don't really
make a compelling case against method monkey-patching, because

  1. name clashes can be easily checked for either approach (in either
    case you just check the existence of a class attribute)

I'll add a wrote about the value of namespaces to the doc.

  1. caching on the dataset sometimes makes no sense and just adds
    redundancy and complicates the design and registering of extra
    functionality

We could certainly turn this off (optionally) if there are cases where
it does the wrong thing. Could you go into this in a little more
detail, perhaps with a concrete example? My expectation was that this
should have minimal design or performance downsides.

You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
#1080 (comment)

Sent from my Android device with K-9 Mail. Please excuse my brevity.

@shoyer
Copy link
Member

shoyer commented Nov 8, 2016

Python's introspection capabilities are powerful enough that users can find out such information.

I don't really agree here. Code is read in text form more often than it is interactively explored.

At Google, our Python style guide actually prohibits writing import like from xarray import Dataset. You have to write import xarray or import xarray as xr and always use the namespace.

And then there"s also the problem of accessors not being usable in functional programming paradigms.

xarray objects are already non-threadsafe, in the same way that the built-in list and dict are not threadsafe. I don't see how caching attributes changes this. You can choose whether or not to save state on the accessor (and of course, generally it would be better not to).

Finally, I'll note that we also have the .pipe method (e.g., array.pipe(square)), so if you just want functions that you can call with method chaining syntax, you don't even need to write an accessor at all.

You are certainly welcome to monkey patch -- that's the Python philosophy, after all -- but I'm not going to recommend it or make it easy. But I would even subclassing before considering monkey patching -- at least then your methods are contained to your own code, instead of contaminating a global namespace.

@smartass101
Copy link
Author

Code is read in text form more often than it is interactively explored.

Good point, in that case explicit namespacing indeed helps.

At Google, our Python style guide actually prohibits writing import like from xarray import Dataset. You have to write import xarray or import xarray as xr and always use the namespace.

A module-level namespace has nothing to do with the class namespace, but I see you try to tie them, which makes sense in relationship with the argument about reading code in text form. However, that may not be clear for Python programmers as those namespaces are not tied in reality, better mention it in the docs. BTW, if you are enforcing some specific style guide, please note that in the docs. And I hope you strike the right balance between style complacency and universality.

xarray objects are already non-threadsafe, in the same way that the built-in list and dict are not threadsafe. I don't see how caching attributes changes this. You can choose whether or not to save state on the accessor (and of course, generally it would be better not to).

My problem with non-functional paradigms lies more in the apply, map... paradigms which accessors don't fit into than thread safety.

Finally, I'll note that we also have the .pipe method (e.g., array.pipe(square)), so if you just want functions that you can call with method chaining syntax, you don't even need to write an accessor at all.

That is indeed a good alternative, just not sure my colleagues will like the transition from sig.lowpass(0.2).multiply(3) to sig.pipe(xdsp.lowpass, 0.2).pipe(np.multiply, 3). A benefit of pipe is that methods can be tab-completed from namespaces (useful for interactive usage) and that any compatible function can be used, not just registered methods. Perhaps I will suggest DataArray.__call__ = DataArray.pipe (maybe that could be added in xarray ? should I make an issue for that?) which would make it quite convenient to write only sig(xdsp.lowpass, 0.2)(np.multiply, 3) which is almost the same in terms of chars written and has quite clear syntax (calling a signal with a function argument applies the function to it).

@shoyer
Copy link
Member

shoyer commented Nov 17, 2016

I think we probably need to agree to disagree here. I will update the docs in response to feedback (which is greatly appreciated!) and when I do so I will close out this issue.

Perhaps I will suggest DataArray.__call__ = DataArray.pipe (maybe that could be added in xarray ?

This seems too magical to me, but you are welcome to make another issue to see what others think. __call__ is not searchable in the way the .pipe is.

@fmaussion
Copy link
Member

Sorry for chiming in here. Using the example above, if we have the choice between:

  1. sig.lowpass(0.2).multiply(3)
  2. sig.accessorlib.lowpass(0.2).multiply(3)
  3. sig.pipe(xdsp.lowpass, 0.2).pipe(np.multiply, 3)
  4. sig(xdsp.lowpass, 0.2)(np.multiply, 3)

Of course, (1) is attractive because straight forward (as the dev of a small xarray accessor, I also forget very often that I have to add an attribute between the dataset and my function call). But (2) has the huge advantage that it clearly says where the code of the function is found, and where to ask questions when things do not work as expected.

(3) is OK for me since it is very explicit, but I find that (4) is quite ugly.

@smartass101
Copy link
Author

Thank you for continuing this discussion even though you didn't agree with the initial proposal.
I have accepted and embraced option 3) as it is indeed about the cleanest and most readable option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants