
Add trapz to DataArray for mathematical integration #1288

Closed
jeffbparker opened this issue Feb 28, 2017 · 26 comments


@jeffbparker

jeffbparker commented Feb 28, 2017

Since scientific data is often an approximation to a continuous function, when we write mean() or sum(), our underlying intention is often to approximate an integral. For example, if we have temperature of a rod T(t, x) as a function of time and space, the average value Tavg(x) is the integral of T(t,x) with respect to x, divided by the length.

I would guess that in practice, many uses of mean() and sum() are intended to approximate integrals of continuous functions. That is typically my use, at least. But simply adding up all values is a Riemann sum approximation to an integral, which is not very accurate.

For approximating an integral, it seems to me that the trapezoidal rule (trapz() in numpy) should be preferred to sum() or mean() in essentially all cases, as the trapezoidal rule is more accurate while still being efficient.

It would be very useful to have trapz() as a method of DataArrays, so one could write, e.g., for an average value, Tavg = T.trapz(dim='time') / totalTime. Currently, I would have to use numpy's method and then rebuild the reduced-dimensional array myself:

TavgVal = np.trapz(T, T['time'], axis=0) / totalTime
Tavg = xr.DataArray(TavgVal, coords=[T['space']], dims=['space'])
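
A reusable version of that workaround might look like the following (a minimal sketch; trapz_da is a hypothetical helper, not part of xarray):

import numpy as np
import xarray as xr

def trapz_da(da, dim):
    # Hypothetical helper: integrate a DataArray along `dim` with
    # np.trapz and rebuild the reduced-dimension DataArray.
    axis = da.get_axis_num(dim)
    values = np.trapz(da.values, x=da[dim].values, axis=axis)
    kept = [d for d in da.dims if d != dim]
    coords = {d: da.coords[d] for d in kept if d in da.coords}
    return xr.DataArray(values, coords=coords, dims=kept)

# e.g., Tavg = trapz_da(T, 'time') / totalTime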

It could even be useful to have a function like mean_trapz() that calculates the mean value based on trapz. More generally, one could imagine having other integration methods too. E.g., data.integrate(dim='x', method='simpson'). But trapz is probably good enough for many cases and a big improvement over mean, and trapz is very simple even for unequally spaced data. And trapz shouldn't be much less efficient in principle, although in practice I find np.trapz() to be several times slower than np.mean().

Quick examples demonstrating sum/mean vs. trapz to convince you of the superiority of trapz:

x = np.linspace(0, 2, 200)
y = 1/3 * x**3
dx = x[1] - x[0]
integralRiemann =  dx * np.sum(y)  # 1.3467673375251465
integralTrapz = np.trapz(y, x)  # 1.3333670025167712
integralExact = 4/3  # 1.3333333333333333

This second example demonstrates the special advantages of trapz() for periodic functions because the trapezoidal rule happens to be extremely accurate for periodic functions integrated over their period.

x = np.linspace(0, 2*np.pi, 200)
y = np.cos(x)**2
meanRiemann = np.mean(y)  #  0.50249999999999995
meanTrapz = np.trapz(y, x) / (2*np.pi)  # 0.5
meanExact = 1/2  # 0.5
@jeffbparker
Author

jeffbparker commented Feb 28, 2017

I don't at the moment see a reason to use a different API than DataArray.mean or DataArray.sum. DataArrays assume a default spacing of 1 if coordinates are not given, which is exactly what np.trapz does. So the API for trapz might look like:

DataArray.trapz(dim=None, axis=None, skipna=None, keep_attrs=False, **kwargs)

@shoyer
Member

shoyer commented Feb 28, 2017

I agree that the API should mostly copy the mean/sum reduce methods (and in fact the implementation could probably share much of the logic). But there's still a question of whether the API should expose multiple methods like DataArray.trapz/DataArray.simps or a single method like DataArray.integrate (with method='simps'/method='trapz').

As long as there isn't something else we'd want to reserve the name for, I like the sound of integrate a little better, because it's more self-descriptive. trapz is only obvious if you know the name of the NumPy method. In contrast, integrate is the obvious way to approximate an integral. I would only hold off on using integrate if there is different functionality that comes to mind with the same name.

It looks like SciPy implements Simpson's rule with the same API (see scipy.integrate.simps), so that would be easy to support, too. Given how prevalent SciPy is these days, I would have no compunctions about making scipy required for this method and defaulting to method='simps' for DataArray.integrate.

It would be useful to have dask.array versions of these functions, too, but that's not essential for a first pass. The implementation of trapz is very simple, so this would be quite easy to add to dask.

CC @spencerahill @rabernat @lesommer in case any of you have opinions about this

@fmaussion
Member

+1 for integrate

The cumulative integral is used very frequently in the atmospheric sciences, too: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.integrate.cumtrapz.html

@jeffbparker
Author

jeffbparker commented Feb 28, 2017

An integrate method is probably better for the reason you describe: it's more obvious. I believe the name trapz originally came from Matlab.

With a general integrate, it would probably also be useful to allow optional lower_bound and upper_bound arguments as a convenience for integrating over a subset of the data, rather than requiring the user to slice first. If those arguments aren't given, they would default to the full extent of the data.
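
For illustration, those optional bounds would just be shorthand for slicing before integrating (a sketch; integrate, lower_bound, and upper_bound are the hypothetical names under discussion):

# hypothetical convenience arguments
Tavg = T.integrate(dim='time', lower_bound=0.0, upper_bound=5.0)
# equivalent to slicing first, then integrating
Tavg = T.sel(time=slice(0.0, 5.0)).integrate(dim='time')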

@rabernat
Contributor

Having an xarray wrapper on trapz or cumtrapz would definitely be useful for many users. I weakly prefer not to use the name integrate and instead keep the standard scipy names because they make clear the numerical algorithm that is being applied. The issue is that certain types of gridded data (such as output from numerical models) should actually not be integrated with the trapezoidal rule but rather should use the native finite volume discretization for their computational grid. The goal of our hypothetical pangeo vector calculus package is to implement integrals and derivatives in such a context. A built-in xarray integration function would apply in cases where the data is assumed to be continuous, and where no auxiliary information about the grid (beyond the coordinates) is available.

I will also make the same comment I always make when such feature requests are raised: yes, it always seems desirable to add new features to xarray on a function-by-function basis. But where does it end? Why not implement the rest of the scipy.integrate module? And why stop there? As a community we need to develop a roadmap that clearly defines the scope of xarray. Once apply is stable, it might not be that hard to wrap a large fraction of the scipy library. But maybe that should live in a separate package.

@shoyer
Member

shoyer commented Feb 28, 2017

As usual @rabernat raises some excellent points!

I weakly prefer not to use the name integrate and instead keep the standard scipy names because they make clear the numerical algorithm that is being applied.

Yes, this is a totally valid concern, if a user might expect integrate to be calculating something different.

One point in favor of calling this integrate is that the name is highly searchable, which provides an excellent place to include documentation about how to integrate in general (including links to other packages, like pangeo's vector calculus package). But we know that nobody reads documentation ;).

But where does it end? Why not implement the rest of the scipy.integrate module?

Looking at the rest of scipy.integrate, in some ways the functionality of trapz/cumtrapz/simps is uniquely well suited for xarray: they are focused on data ("given fixed samples") rather than solving a system of equations ("given a function").

numpy.gradient feels very complementary as well, so I could see that as also in scope, but there are similar concerns for the name. There might be some value in complementary names for integrals/gradients.

As a community we need to develop a roadmap that clearly defines the scope of xarray.

I doubt we'll be able to come up with hard and fast rules, but maybe we can enumerate some principles, e.g.,

  • Features should be useful to users in multiple fields.
  • Features should be primarily about working with labeled data.
  • We are aiming for the 20% of functionality that covers 80% of use cases, not the long tail.
  • We don't want implementations of any complex numerical methods in xarray (like NumPy rather than SciPy).
  • Sometimes it's OK to include a feature in xarray because it makes logical sense with the rest of the package even if it's slightly domain specific (e.g., CF-conventions for netCDF files).

@rabernat
Contributor

And I'm fine with integrate if that is the consensus here.

@fmaussion
Member

I weakly prefer not to use the name integrate and instead keep the standard scipy names because they make clear the numerical algorithm that is being applied.

Yes, this is a totally valid concern, if a user might expect integrate to be calculating something different.
One point in favor of calling this integrate is that the name is highly searchable, which provides an excellent place to include documentation about how to integrate in general (including links to other packages, like pangeo's vector calculus package). But we know that nobody reads documentation ;).

integrate would allow doing things like:

da.integrate(how='rectangle')

da.integrate(how='trapezoidal')

@spencerahill
Contributor

I like the integrate idea. Nothing further to add that isn't already covered nicely by @rabernat's concerns and @shoyer's responses above.

@dopplershift
Contributor

👍 for the functionality (both integrate and gradient) working with DataArrays. My concern is that this doesn't feel like functionality that inherently belongs as a method on a DataArray -- if it doesn't need to be a method, it shouldn't be. In numpy and scipy, these are separate functions and I think they work fine that way.

Another way to look at it is that methods are there to encapsulate some kind of manipulation of internal state or to ensure that some kind of invariant is maintained. I don't see how integrate does any of this for DataArray -- everything integrate would do can be accomplished using the public API. So really what you're buying is being able to write this:

da.integrate(dim='x', method='trapezoidal')

instead of

integrate(da, dim='x', method='trapezoidal')

If you want to see what the pathological case of putting everything as a method for convenience looks like, go look at all the plot methods on matplotlib's Axes class. Pay special attention to the tangled web of stuff that comes from having ready access to the class's internals.

My real preference would be just to have this work:

ds = xr.tutorial.load_dataset('rasm')
np.trapz(ds['Tair'], axis='x')

but I have no idea what that would take, so I'm perfectly fine with xarray gaining its own implementation.

@jeffbparker
Author

The issue is that certain types of gridded data (such as output from numerical models) should actually not be integrated with the trapezoidal rule but rather should use the native finite volume discretization for their computational grid.

  • We are aiming for the 20% of functionality that covers 80% of use cases, not the long tail.
  • We don't want implementations of any complex numerical methods in xarray (like NumPy rather than SciPy).

I can see the problems down the road that @rabernat brings up. Say you have a high-order finite volume discretization and some numerical implementation of high-order integration for that gridding. What would your interface be? You could write it as new_integrate(da, dim, domain) but then it may be confusing to have da.integrate be different (and less accurate).

That might bring us back to the algorithmically descriptive name trapz, but then what about @shoyer's point that given a fixed gridding, da.integrate is the most readable choice of name? Perhaps allow generic extension of da.integrate by letting the method keyword of da.integrate accept a function as an argument that performs the actual integration?

@nbren12
Contributor

nbren12 commented Mar 20, 2017

I would also like to see an integrate function. I have had one monkey patched in my own xarray routines for a while now. Also wanted: cumtrapz and friends. Maybe this could be implemented by adding an optional cumulative flag. This shouldn't be too hard to do. For example, in the following cumtrapz implementation all that would need to be changed is the final cumsum call.

def cumtrapz(A, dim):
    """Cumulative trapezoidal rule (aka Tai's method)

    Notes
    -----
    The trapezoidal rule is given by
        int f(x) dx ~= sum_i (f_i + f_{i+1}) * dx_i / 2
    """
    x = A[dim]
    # Coordinate spacing; the first element has no predecessor, so fill with 0.
    dx = x - x.shift(**{dim: 1})
    dx = dx.fillna(0.0)
    # Per-interval trapezoid areas, accumulated along the dimension.
    return ((A.shift(**{dim: 1}) + A) * dx / 2.0)\
        .fillna(0.0)\
        .cumsum(dim)
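
For reference, a quick check (a sketch, assuming the cumtrapz above is in scope) that it matches scipy.integrate.cumtrapz with initial=0:

import numpy as np
import xarray as xr
import scipy.integrate

x = np.linspace(0, 2 * np.pi, 200)
da = xr.DataArray(np.cos(x), coords={'x': x}, dims='x')

result = cumtrapz(da, 'x')  # the xarray version above
expected = scipy.integrate.cumtrapz(np.cos(x), x, initial=0)
np.testing.assert_allclose(result.values, expected)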

@shoyer
Member

shoyer commented Mar 20, 2017

Sorry for letting this lapse.

Yes, we absolutely want this functionality in some form.

My concern is that this doesn't feel like functionality that inherently belongs as a method on a DataArray -- if it doesn't need to be a method, it shouldn't be. In numpy and scipy, these are separate functions and I think they work fine that way.

This is a fair point, and I agree with you from a purist OO-programming/software-engineering perspective (TensorFlow, for example, takes this approach). But with xarray, we have been taking a different path, putting methods on objects for the convenience of method chaining (like pandas). So from a consistency perspective, I think it's fine to keep these as methods. This is somewhat similar even to NumPy, where a number of the most commonly used functions are also methods.

Perhaps allow generic extension of da.integrate by letting the method keyword of da.integrate accept a function as an argument that performs the actual integration?

I don't see a big advantage to adding such an extension point. Almost assuredly it's less text and more clear to simply write ds.pipe(my_integrate, 'x') or my_integrate(ds, 'x') rather than ds.integrate('x', my_integrate).

Maybe this could be implemented by adding an optional cumulative flag.

I normally don't like adding flags for switching functionality entirely, but maybe that would make sense here if there's enough shared code (e.g., simply substituting cumsum for sum). The alternative is something like cum_integrate, which sounds kind of awkward and adds yet another method.

One thing that can be useful to do before writing code is to write out a docstring with all the bells and whistles we might eventually add. So let's give that a shot here and see if integrate still makes sense:

integrate(dim, method='trapz', cumulative=False)

Arguments
---------
dim : str or DataArray
    DataArray or reference to an existing coordinate, labeling
    what to integrate over.
method : 'trapz' or 'simps', optional
    Whether to use the trapezoidal rule or Simpson's rule.
cumulative : bool, optional
    Whether to do a non-cumulative (default) or cumulative integral.

I could also imagine possibly adding a bounds or limits argument that specifies multiple limits for controlling multiple integrals at once (e.g., dim='x' and bounds=[0, 10, 20, 30, 40, 50] would result in an x dimension of length 5). This would certainly be useful for some of my current work. But maybe we should save this sort of add for later...
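
A minimal sketch of how the cumulative flag might share the trapezoidal code path (hypothetical; only method='trapz' is sketched, and all names are assumptions):

def integrate(da, dim, method='trapz', cumulative=False):
    if method != 'trapz':
        raise NotImplementedError(method)
    x = da[dim]
    # Per-interval trapezoid areas; handles unequal spacing.
    dx = (x - x.shift(**{dim: 1})).fillna(0.0)
    areas = ((da.shift(**{dim: 1}) + da) * dx / 2.0).fillna(0.0)
    # The two variants differ only in the final reduction.
    return areas.cumsum(dim) if cumulative else areas.sum(dim)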

@fmaussion
Member

An argument against a single function is that the shape of the returned array is different in each case. Also, cumtrapz has an initial keyword which changes the shape of the returned array. It is currently set to None by default, but IMO it should default to 0.

If this is not a problem, I would also like to have one single function for integration (simpler from a user perspective).

@nbren12
Contributor

nbren12 commented Mar 20, 2017 via email

@shoyer
Member

shoyer commented Mar 20, 2017

By the way, the cumtrapz implementation I pasted above matches the scipy version when initial=0, which I also think would be a more sane default for integration.

Yes, I agree with both of you that we should fix initial=0. (I don't know if I would even bother with adding the option.)

As far as implementation is concerned, is there any performance downside to using xarray's shift operators versus delving deeper into dask with map_blocks, etc.? I looked into using dask's cumreduction function, but am not sure it is possible to implement the trapezoid method that way without changing dask.

From a performance perspective, it would be totally fine to implement this either in terms of high-level xarray operations like shift/sum/cumsum (manipulating full xarray objects) or in terms of high-level dask.array operations like dask.array.cumsum (manipulating dask arrays). I would do whatever is easiest. I'm pretty sure there is no reason why you need to get into dask's low-level API like map_blocks and cumreduction.

@lamorton

If you give a mouse a cookie, he'll ask for a glass of milk. There are a whole slew of NumPy/SciPy functions that would really benefit from using xarray to organize input/output. I've written wrappers for svd, fft, psd, gradient, and specgram, for starters. Perhaps a new package would be in order?

@shoyer
Member

shoyer commented Apr 13, 2017

Perhaps a new package would be in order?

I would also be very happy to include many of these in a submodule inside xarray, e.g., xarray.scipy for wrappers of the scipy API. This would make it easier to use internal methods like apply_ufunc (though hopefully that will be public API soon).

@marberi

marberi commented May 19, 2017

+1 for integrate. I found this thread when having the same problem.

@scollis

scollis commented Jun 27, 2017

Adding my +1 without offering to do the work. :) This would be very welcome!

@yvikhlya

yvikhlya commented Dec 1, 2017

Hello. I discovered xarray a few days ago and find it very useful for my job. Integration along a coordinate is one of the few things I have found missing so far.

@gajomi
Contributor

gajomi commented Jan 22, 2018

I've written wrappers for svd, fft, psd, gradient, and specgram, for starters

@lamorton I really like the suggestion from @shoyer about submodules for wrappers of other libraries, but in the meantime I would very much like to check out your implementations of fft and gradient in particular, if they are public somewhere. I have been hacking on at least the latter, plus other functions in the numpy/scipy scope.

@lamorton

lamorton commented Jan 22, 2018

@gajomi I can find a place to upload what I have. I foresee some difficulty making a general wrapper due to the issue of naming conventions, but I like the idea too.

Edit: Here's what I have so far ... YMMV, it's still kinda rough. https://github.com/lamorton/SciPyXarray

@nbren12
Contributor

nbren12 commented Jan 22, 2018

I would also be very interested in seeing your code, @lamorton. Overall, I think the xarray community could really benefit from some kind of centralized contrib package with a low barrier to entry for these kinds of functions. So far, I suspect there has been a large amount of code duplication for routine tasks like the fft, since I have also written a function for that.

@roxyboy

roxyboy commented Jan 22, 2018

I've also contributed to developing a python package (xrft) for fft that preserves the metadata of multidimensional xarray datasets.

@shoyer
Member

shoyer commented Jan 22, 2018

I opened #1850 to discuss xarray-contrib.
