add quantile method to DataArray #1187

jhamman · 2016-12-28T01:06:51Z

This PR adds the quantile method to the DataArray. There may be a way to better fit this in with .apply or reduce, but it wasn't immediately clear to me how to do that. It uses np.nanpercentile under the hood, so this shouldn't be expected to work well with dask. The main advantage of this method over using apply(np.nanpercentile) is the handling of the quantile coordinate.

example usage:

In [12]: x = np.random.random(size=(2, 3, 800))

In [13]: da = xr.DataArray(x, dims=('x', 'y', 'time'))

In [14]: da.quantile([1, 5, 10, 25, 50, 75, 90, 95, 99], dim='time', interpolation='lower')
Out[14]: 
<xarray.DataArray (quantile: 9, x: 2, y: 3)>
array([[[ 0.00835474,  0.01126747,  0.00778847],
        [ 0.00803924,  0.00919259,  0.01150164]],

       [[ 0.04902814,  0.05346976,  0.04493341],
        [ 0.04236611,  0.05273082,  0.05858802]],

       [[ 0.09370776,  0.09976448,  0.09256707],
        [ 0.08943787,  0.09331907,  0.08832309]],

       [[ 0.25416402,  0.22577298,  0.24407393],
        [ 0.25386087,  0.23052584,  0.24621966]],

       [[ 0.534169  ,  0.46017551,  0.49817391],
        [ 0.50968059,  0.49427688,  0.51573855]],

       [[ 0.76633412,  0.73405412,  0.77210754],
        [ 0.76759837,  0.74243892,  0.76703357]],

       [[ 0.90832116,  0.89495854,  0.91818434],
        [ 0.91492771,  0.88063773,  0.91416636]],

       [[ 0.95260527,  0.95132871,  0.95979701],
        [ 0.95988286,  0.93137055,  0.95941658]],

       [[ 0.98597133,  0.98883232,  0.99013424],
        [ 0.98951238,  0.97550784,  0.99224201]]])
Coordinates:
  * quantile  (quantile) float64 1.0 5.0 10.0 25.0 50.0 75.0 90.0 95.0 99.0
  o x         (x) -
  o y         (y) -

closes #303
fixes #561

shoyer

This will be super handy!

I agree the logic is different enough (inserting a new axis) that this doesn't fit into reduce or apply.

shoyer · 2016-12-28T17:46:19Z

xarray/core/dataarray.py

+            inclusive.
+        dim : str or sequence of str, optional
+            Dimension(s) over which to apply quantile.
+        axis : int or sequence of int, optional


I would not bother with the axis argument. It makes the logic below quite convoluted, and in my experience with xarray it's rarely useful.

shoyer · 2016-12-28T17:47:03Z

xarray/core/dataarray.py

@@ -1736,5 +1736,93 @@ def dot(self, other):

        return type(self)(new_data, new_coords, new_dims)

+    def quantile(self, q, dim=None, axis=None, interpolation='linear'):
+        """


If possible, start with a one-line description (without the extra line break).

shoyer · 2016-12-28T17:47:57Z

xarray/core/dataarray.py

+
+        Parameters
+        ----------
+        q : float in range of [0,100] (or sequence of floats)


Pandas uses quantiles between 0 and 1. Since we're using the name quantile (rather than percentile), I would stick with that convention (just multiply by 100 to use nanpercentile under the hood).
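A minimal sketch of that conversion, assuming a plain NumPy array rather than the code in this PR. The helper name quantile_from_fraction is hypothetical. Because np.nanpercentile takes percentages in [0, 100], a quantile q in [0, 1] maps to 100 * q:

```python
import numpy as np


def quantile_from_fraction(data, q, axis=None):
    """Compute quantiles for q given in [0, 1] (the pandas convention).

    Hypothetical helper: it scales q to the [0, 100] range that
    np.nanpercentile expects.
    """
    q = np.asarray(q, dtype=np.float64)
    if np.any((q < 0) | (q > 1)):
        raise ValueError("quantiles must be in the range [0, 1]")
    return np.nanpercentile(data, q * 100.0, axis=axis)


data = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
median = quantile_from_fraction(data, 0.5)  # the NaN entry is ignored
```

Validating the range up front means that a caller who passes 50 instead of 0.5 gets an error rather than a silently wrong answer.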

shoyer · 2016-12-28T17:51:37Z

xarray/core/dataarray.py

+            remain after the reduction of the array. If the input
+            contains integers or floats smaller than ``float64``, the output
+            data-type is ``float64``. Otherwise, the output data-type is the
+            same as that of the input.


When is "same as the input" not float64? (I guess this is an issue with the numpy docstring, but I would probably just leave out the extra detail here rather than copying it.)

shoyer · 2016-12-28T17:52:37Z

xarray/core/dataarray.py

+
+        # Construct the return DataArray
+        ps = DataArray(ps, dims=new_dims, name=self.name)
+        if not isscalar:


Maybe add quantile as a coordinate even if it's a scalar?

shoyer · 2016-12-28T17:53:58Z

xarray/core/dataarray.py

+        if not isscalar:
+            new_dims = ['quantile'] + new_dims
+
+        ps = np.nanpercentile(self.data, q, axis=axis,


Let's make sure this fails loudly on dask arrays rather than implicitly converting to numpy.
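One way such a guard could look, as a sketch under assumptions: the wrapper name eager_nanpercentile is hypothetical, and dask is treated as an optional dependency so the check degrades to a no-op when it is not installed.

```python
import numpy as np

try:
    import dask.array as dask_array
    dask_array_type = (dask_array.Array,)
except ImportError:
    # dask is optional; with an empty tuple, isinstance() is always False
    dask_array_type = ()


def eager_nanpercentile(data, q, axis=None):
    """Hypothetical wrapper that refuses dask arrays instead of loading them."""
    if isinstance(data, dask_array_type):
        raise TypeError(
            "quantile does not work for arrays stored as dask arrays; "
            "load the data via .compute() or .load() first"
        )
    return np.nanpercentile(data, q, axis=axis)
```

Raising explicitly keeps the cost of loading a large lazy array visible to the caller instead of hiding it inside the method.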

shoyer · 2016-12-28T17:56:15Z

xarray/core/dataarray.py

@@ -1736,5 +1736,93 @@ def dot(self, other):

        return type(self)(new_data, new_coords, new_dims)

+    def quantile(self, q, dim=None, axis=None, interpolation='linear'):


It would be nice to also have this method for Dataset (or maybe even for Variable, too). Take a look at Dataset.reduce for an example.

To avoid duplicate implementation, many DataArray methods have the core of their logic implemented for Dataset and then use _to_temp_dataset() and _to_from_dataset() to convert back and forth.

Sure. Which do you prefer? It fits quite easily in the Variable since we don't need coords to calculate the percentiles. The DataArray and Dataset methods can just be wrappers.

Probably the most consistent with the current approach is to implement the guts (calling np.nanpercentile) on Variable, and then make a wrapper for handling coordinates on Dataset (and call that from the DataArray method).
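The layering described here can be sketched with toy classes. ToyVariable and ToyDataArray are hypothetical stand-ins, not xarray's real classes, and the sketch collapses the Dataset layer into a single wrapper for brevity. It also assumes q is given in [0, 1].

```python
import numpy as np


class ToyVariable:
    """Hypothetical stand-in for Variable: named dimensions plus data."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self.data = np.asarray(data, dtype=np.float64)

    def quantile(self, q, dim):
        # The numerical core lives here; no coordinates are needed.
        q = np.asarray(q, dtype=np.float64)
        axis = self.dims.index(dim)
        new_dims = [d for d in self.dims if d != dim]
        if q.ndim != 0:
            # For array-like q, nanpercentile puts the quantile axis first.
            new_dims = ['quantile'] + new_dims
        values = np.nanpercentile(self.data, q * 100.0, axis=axis)
        return ToyVariable(new_dims, values)


class ToyDataArray:
    """Hypothetical wrapper: delegates the math, then attaches coordinates."""

    def __init__(self, variable, coords=None):
        self.variable = variable
        self.coords = dict(coords or {})

    def quantile(self, q, dim):
        result = self.variable.quantile(q, dim)
        # Drop the coordinate of the reduced dimension, add the new one.
        coords = {k: v for k, v in self.coords.items() if k != dim}
        coords['quantile'] = np.asarray(q, dtype=np.float64)
        return ToyDataArray(result, coords)


var = ToyVariable(('x', 'time'), np.arange(8).reshape(2, 4))
arr = ToyDataArray(var, coords={'time': np.arange(4)})
out = arr.quantile([0.0, 0.5, 1.0], dim='time')
```

Keeping the percentile call in the lowest layer means the wrappers only have to deal with coordinate bookkeeping.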

shoyer · 2016-12-28T17:58:52Z

xarray/core/dataarray.py

+        if isscalar:
+            q = float(q)
+        else:
+            q = np.asarray(q, dtype=np.float64)


It might be slightly cleaner to unilaterally coerce q to a float64 array, and then use q.ndim != 0 to check for nonscalar q.
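A small sketch of that suggestion, with a hypothetical helper name normalize_q. A scalar becomes a 0-d float64 array, so one code path handles both cases:

```python
import numpy as np


def normalize_q(q):
    """Coerce q to a float64 array and report whether it is non-scalar."""
    q = np.asarray(q, dtype=np.float64)
    return q, q.ndim != 0


q_scalar, scalar_is_array = normalize_q(0.5)              # 0-d array
q_list, list_is_array = normalize_q([0.1, 0.5, 0.9])      # 1-d array
```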

shoyer · 2016-12-28T18:00:31Z

xarray/core/dataarray.py

+        # Construct the return DataArray
+        ps = DataArray(ps, dims=new_dims, name=self.name)
+        if not isscalar:
+            ps['quantile'] = DataArray(q, dims=('quantile', ),


Maybe slightly cleaner: ps.coords['quantile'] = ('quantile', q) or ps.coords['quantile'] = Variable('quantile', q). There's no need to construct a full DataArray here, and the name='quantile' bit in particular is redundant with the name on the left-hand side of the assignment operation.

shoyer · 2016-12-28T18:01:32Z

xarray/test/test_dataarray.py

@@ -1328,6 +1328,22 @@ def test_reduce(self):
        expected = DataArray(5, {'c': -999})
        self.assertDataArrayIdentical(expected, actual)

+    def test_quantile(self):
+        for method in ['linear', 'lower', 'higher', 'nearest', 'midpoint']:


No need to test all these methods here, since we have no logic specific to them. It's enough to verify (once) that the method argument gets passed on to NumPy.
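A single pass-through check might look like the sketch below. The helper nanpercentile_compat is hypothetical; it exists only because, as far as I know, newer NumPy releases name this keyword method while older ones name it interpolation.

```python
import numpy as np


def nanpercentile_compat(data, q, axis=None, interpolation='linear'):
    """Forward `interpolation` to NumPy under whichever keyword it accepts.

    Assumption: newer NumPy releases call this keyword `method`, while
    older ones call it `interpolation`.
    """
    try:
        return np.nanpercentile(data, q, axis=axis, method=interpolation)
    except TypeError:
        return np.nanpercentile(data, q, axis=axis,
                                interpolation=interpolation)


def test_interpolation_is_forwarded():
    data = np.array([0.0, 1.0, 2.0, 3.0])
    # 'linear' interpolates between 1 and 2; 'lower' picks the lower one.
    assert nanpercentile_compat(data, 50, interpolation='linear') == 1.5
    assert nanpercentile_compat(data, 50, interpolation='lower') == 1.0


test_interpolation_is_forwarded()
```

Two modes that give different answers on the same data are enough to show the argument reaches NumPy; the remaining modes are NumPy's responsibility to test.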

jhamman · 2016-12-28T23:15:17Z

@shoyer - I moved the majority of the logic over to Variable and wrote wrapper methods for the Dataset and DataArray objects. I am getting an error on the py27-min build on travis that may or may not be a since-fixed bug in numpy. Unpinning the version numbers in the build fixes the problem. Compare the builds for ddd5211 and aa81a17. I'm assuming we want to keep those versions pinned as they were, so thoughts on a possible fix would be great...

shoyer

Looks much better now. This also needs basic tests for the Dataset and DataArray methods.

shoyer · 2017-01-07T03:06:57Z

xarray/core/dataset.py

+
+        variables = OrderedDict()
+        for name, var in iteritems(self.variables):
+            variables[name] = var.quantile(q, dim=dim,


I think we need a condition like the one in Dataset.reduce to skip variables that don't use dim.

Also, coordinate variables should be skipped. See Dataset.reduce for the right logic.
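The skipping rules can be sketched on a plain dict of variables. This is an illustration under assumptions, not the logic of Dataset.reduce itself: the function name dataset_quantile and the name -> (dims, data) layout are hypothetical, and q is assumed to be in [0, 1].

```python
import numpy as np


def dataset_quantile(variables, coord_names, q, dim):
    """Apply a quantile over `dim` to a dict of name -> (dims, data).

    Hypothetical sketch: variables that do not use `dim` pass through
    unchanged, and coordinate variables are never reduced.
    """
    q = np.asarray(q, dtype=np.float64)
    result = {}
    for name, (dims, data) in variables.items():
        if dim not in dims:
            # Nothing to reduce: keep the variable as it is.
            result[name] = (dims, data)
        elif name in coord_names:
            # A coordinate along the reduced dimension is dropped.
            continue
        else:
            axis = dims.index(dim)
            new_dims = tuple(d for d in dims if d != dim)
            if q.ndim != 0:
                new_dims = ('quantile',) + new_dims
            values = np.nanpercentile(
                np.asarray(data, dtype=np.float64), q * 100.0, axis=axis)
            result[name] = (new_dims, values)
    return result


variables = {
    'time': (('time',), [0, 1, 2, 3]),
    'x': (('x',), [10, 20]),
    'temp': (('x', 'time'), np.arange(8).reshape(2, 4)),
    'mask': (('x',), [1, 0]),
}
out = dataset_quantile(variables, {'time', 'x'}, 0.5, 'time')
```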

shoyer · 2017-01-07T03:07:42Z

xarray/core/variable.py

+
+        if isinstance(self.data, dask_array_type):
+            TypeError("quantile does not work for arrays stored as dask "
+                      "arrays. Load the data via .load() prior to calling "


.load() should be .compute() or .load()

shoyer · 2017-01-07T03:09:09Z

xarray/core/variable.py

+                axis = self.get_axis_num(dim)
+                new_dims.remove(dim)
+            else:
+                axis = [self.get_axis_num(d) for d in dim]


minor note: get_axis_num actually works on either individual or lists of dimensions

shoyer · 2017-01-07T03:13:27Z

doc/whats-new.rst

@@ -165,6 +165,9 @@ Enhancements
  and attributes. The method prints to a buffer (e.g. ``stdout``) with output
  similar to what the command line utility ``ncdump -h`` produces (:issue:`1150`).
  By `Joe Hamman <https://github.com/jhamman>`_.
+- New :py:meth:`~DataArray.quantile` method to calculate quantiles from
+  DataArray objects (:issue:`xxxx`).


update issue number to this PR

shoyer · 2017-01-07T03:14:26Z

ci/requirements-py27-min.yml

@@ -2,8 +2,8 @@ name: test_env
 dependencies:
  - python=2.7
  - pytest
-  - numpy==1.9.3


This does look like a possible NumPy bug. Instead of updating the required version of numpy for everything, I would just skip it explicitly for the failing test.
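An explicit skip could be sketched as follows, using only the standard library's unittest. The version_tuple helper and the TestQuantile class are hypothetical, and the 1.10 threshold is taken from the later commit messages in this PR. Versions are parsed into integer tuples because comparing the raw strings would order '1.9.3' after '1.10'.

```python
import unittest

import numpy as np


def version_tuple(version):
    """Parse the leading 'major.minor' of a version string into ints."""
    major, minor = version.split('.')[:2]
    return int(major), int(minor)


NUMPY_TOO_OLD = version_tuple(np.__version__) < (1, 10)


class TestQuantile(unittest.TestCase):
    @unittest.skipIf(NUMPY_TOO_OLD, 'quantile test requires numpy >= 1.10')
    def test_median_ignores_nan(self):
        data = np.array([1.0, 2.0, np.nan, 4.0])
        self.assertEqual(np.nanpercentile(data, 50), 2.0)
```

Skipping only the affected test leaves the pinned minimum-version build in place for everything else.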

shoyer · 2017-01-07T03:19:01Z

.travis.yml

@@ -61,9 +61,9 @@ install:
  - source activate test_env
  # scipy should not have been installed, but it's included in older versions of
  # the conda pandas package
-  - if [[ "$CONDA_ENV" == "py27-min" ]]; then


better not to remove this unless necessary

jhamman · 2017-01-21T18:47:12Z

@shoyer - all comments addressed and tests are passing.

shoyer

Needs a fix to the Dataset.quantile docstring, otherwise looks good to go!

shoyer · 2017-01-23T01:31:15Z

xarray/core/dataset.py

@@ -2547,6 +2546,87 @@ def roll(self, **shifts):

        return self._replace_vars_and_dims(variables)

+    def quantile(self, q, dim=None, numeric_only=False, keep_attrs=False,
+                 interpolation='linear'):


numeric_only and keep_attrs are missing from the docstring

shoyer · 2017-01-23T02:01:27Z

doc/api.rst

@@ -270,6 +270,7 @@ Computation
   DataArray.get_axis_num
   DataArray.diff
   DataArray.dot
+   DataArray.quantile


add this for Dataset, too

add quantile method to DataArray

c825334

shoyer reviewed Dec 28, 2016

Joe Hamman added 4 commits December 28, 2016 12:49

pep utils.py

3d87534

initial fixes after @shoyer's review

52fd4d6

move quantile to Variable, add wrapper methods to Dataset and DataArray

ddd5211

unpin numpy/pandas for quick test

aa81a17

shoyer reviewed Jan 7, 2017

Joe Hamman added 4 commits January 20, 2017 19:41

further refinement of quantile methods and tests for dataset/dataarray/variable

39394ec

skip quantile tests when numpy version is less than 1.10

fa80291

require numpy version 1.10 or later for quantile

c2d31fd

use LooseVersion, skip on Variable

1f3a990

Merge branch 'master' of github.com:pydata/xarray into feature/quantile

c07b711

shoyer approved these changes Jan 23, 2017

update doc strings and pass keep_attrs from dataarray in quantile method

f6507a1

shoyer reviewed Jan 23, 2017

add Dataset.quantile to docstring

d8ba569

shoyer merged commit d5f4af5 into pydata:master Jan 23, 2017

jhamman deleted the feature/quantile branch January 23, 2017 18:23
