Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add quantile method to DataArray #1187

Merged
merged 12 commits into from
Jan 23, 2017
Merged

add quantile method to DataArray #1187

merged 12 commits into from
Jan 23, 2017

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Dec 28, 2016

This PR adds the quantile method to the DataArray. There may be a way to better fit this in with .apply or reduce it wasn't immediately clear to me how to do that. It uses np.nanpercentile under the hood so this shouldn't be expected to work well with dask. The main advantage to this method, over using apply(np.nanpercentile) is the handling of the quantile coordinate.

example usage:

In [12]: x = np.random.random(size=(2, 3, 800))

In [13]: da = xr.DataArray(x, dims=('x', 'y', 'time'))

In [14]: da.quantile([1, 5, 10, 25, 50, 75, 90, 95, 99], dim='time', interpolation='lower')
Out[14]: 
<xarray.DataArray (quantile: 9, x: 2, y: 3)>
array([[[ 0.00835474,  0.01126747,  0.00778847],
        [ 0.00803924,  0.00919259,  0.01150164]],

       [[ 0.04902814,  0.05346976,  0.04493341],
        [ 0.04236611,  0.05273082,  0.05858802]],

       [[ 0.09370776,  0.09976448,  0.09256707],
        [ 0.08943787,  0.09331907,  0.08832309]],

       [[ 0.25416402,  0.22577298,  0.24407393],
        [ 0.25386087,  0.23052584,  0.24621966]],

       [[ 0.534169  ,  0.46017551,  0.49817391],
        [ 0.50968059,  0.49427688,  0.51573855]],

       [[ 0.76633412,  0.73405412,  0.77210754],
        [ 0.76759837,  0.74243892,  0.76703357]],

       [[ 0.90832116,  0.89495854,  0.91818434],
        [ 0.91492771,  0.88063773,  0.91416636]],

       [[ 0.95260527,  0.95132871,  0.95979701],
        [ 0.95988286,  0.93137055,  0.95941658]],

       [[ 0.98597133,  0.98883232,  0.99013424],
        [ 0.98951238,  0.97550784,  0.99224201]]])
Coordinates:
  * quantile  (quantile) float64 1.0 5.0 10.0 25.0 50.0 75.0 90.0 95.0 99.0
  o x         (x) -
  o y         (y) -

closes #303
fixes #561

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be super handy!

I agree the logic is different enough (inserting a new axis) that this doesn't fit into reduce or apply.

inclusive.
dim : str or sequence of str, optional
Dimension(s) over which to apply quantile.
axis : int or sequence of int, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not bother with the axis argument. It makes the logic below quite convoluted, and in my experience with xarray it's rarely useful.

@@ -1736,5 +1736,93 @@ def dot(self, other):

return type(self)(new_data, new_coords, new_dims)

def quantile(self, q, dim=None, axis=None, interpolation='linear'):
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, start with a one line description (without the extra line break)


Parameters
----------
q : float in range of [0,100] (or sequence of floats)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas uses quantiles between 0 and 1. Since we're using the name quantile (rather than percentile), I would stick with that convention (just divide by 100 to use nanpercentile under the hood).

remain after the reduction of the array. If the input
contains integers or floats smaller than ``float64``, the output
data-type is ``float64``. Otherwise, the output data-type is the
same as that of the input.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is "same as the input" not float64? (I guess this is an issue with numpy docstring, but I would probably just leave out the extra detail here rather than copying it)


# Construct the return DataArray
ps = DataArray(ps, dims=new_dims, name=self.name)
if not isscalar:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add quantile as a coordinate even if it's a scalar?

if not isscalar:
new_dims = ['quantile'] + new_dims

ps = np.nanpercentile(self.data, q, axis=axis,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure this fails loudly on dask arrays rather than implicitly converting to numpy.

@@ -1736,5 +1736,93 @@ def dot(self, other):

return type(self)(new_data, new_coords, new_dims)

def quantile(self, q, dim=None, axis=None, interpolation='linear'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to also have this method for Dataset (or maybe even for Variable, too). Take a look at Dataset.reduce for an example.

To avoid duplicate implementation, many DataArray methods have the core of their logic implemented for Dataset and then use _to_temp_dataset() and _to_from_dataset() to convert back and forth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Which do you prefer? It fits quite easily in the Variable since we don't need coords to calculate the percentiles. The DataArray and Dataset methods can just be wrappers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the most consistent with the current approach is to implement the guts (calling np.nanpercentile) on Variable, and then make a wrapper for handling coordinates on Dataset (and call that from the DataArray method).

if isscalar:
q = float(q)
else:
q = np.asarray(q, dtype=np.float64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be slightly cleaner to unilaterally coerce q to a float64 array, and then use q.ndim != 0 to check for nonscalar q.

# Construct the return DataArray
ps = DataArray(ps, dims=new_dims, name=self.name)
if not isscalar:
ps['quantile'] = DataArray(q, dims=('quantile', ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe slightly cleaner: ps.coords['quantile'] = ('quantile', q) or ps.coords['quantile'] = Variable('quantile', q). There's no need to construct a full DataArray here and the `name='quantile' bit in particular is redundant with the name on the left-hand-side of the assignment operation.

@@ -1328,6 +1328,22 @@ def test_reduce(self):
expected = DataArray(5, {'c': -999})
self.assertDataArrayIdentical(expected, actual)

def test_quantile(self):
for method in ['linear', 'lower', 'higher', 'nearest', 'midpoint']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to test all these methods here, since we have no logic specific to them. It's enough to verify (once) that the method argument gets passed on to NumPy.

@jhamman
Copy link
Member Author

jhamman commented Dec 28, 2016

@shoyer - I moved the majority of the logic over to Variable and wrote wrapper methods for the Dataset and DataArray objects. I am getting an error on the py27-min build on travis that may or may not be an since fixed bug in numpy. Unpinning the version numbers in the build fixes the problem. Compare the builds for ddd5211 and aa81a17. I'm assuming we want to keep those versions pinned as they were so thoughts on a possible fix would be great...

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better now. This also needs basic tests for the Dataset and DataArray methods.


variables = OrderedDict()
for name, var in iteritems(self.variables):
variables[name] = var.quantile(q, dim=dim,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a condition like the one in Dataset.reduce to skip variables that don't use dim.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, coordinate variable should be skipped. See Dataset.reduce for the right logic.


if isinstance(self.data, dask_array_type):
TypeError("quantile does not work for arrays stored as dask "
"arrays. Load the data via .load() prior to calling "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.load() should be .compute() or .load()

axis = self.get_axis_num(dim)
new_dims.remove(dim)
else:
axis = [self.get_axis_num(d) for d in dim]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor note: get_axis_num actually works on either individual or lists of dimensions

@@ -165,6 +165,9 @@ Enhancements
and attributes. The method prints to a buffer (e.g. ``stdout``) with output
similar to what the command line utility ``ncdump -h`` produces (:issue:`1150`).
By `Joe Hamman <https://github.com/jhamman>`_.
- New :py:meth:`~DataArray.quantile` method to calculate quantiles from
DataArray objects (:issue:`xxxx`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update issue number to this PR

@@ -2,8 +2,8 @@ name: test_env
dependencies:
- python=2.7
- pytest
- numpy==1.9.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does look a possible NumPy bug. Instead of updating the required version of numpy for everything, I would just skip it explicitly for the failing test.

@@ -61,9 +61,9 @@ install:
- source activate test_env
# scipy should not have been installed, but it's included in older versions of
# the conda pandas package
- if [[ "$CONDA_ENV" == "py27-min" ]]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better not to remove this unless necessary

@jhamman
Copy link
Member Author

jhamman commented Jan 21, 2017

@shoyer - all comments addressed and tests are passing.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a fix to the Dataset.quantile docstring, otherwise looks good to go!

@@ -2547,6 +2546,87 @@ def roll(self, **shifts):

return self._replace_vars_and_dims(variables)

def quantile(self, q, dim=None, numeric_only=False, keep_attrs=False,
interpolation='linear'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numeric_only and keep_attrs are missing from the docstring

@@ -270,6 +270,7 @@ Computation
DataArray.get_axis_num
DataArray.diff
DataArray.dot
DataArray.quantile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this for Dataset, too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@shoyer shoyer merged commit d5f4af5 into pydata:master Jan 23, 2017
@jhamman jhamman deleted the feature/quantile branch January 23, 2017 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

calculate percentiles Add a percentile/quantile method
2 participants