-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add quantile method to DataArray #1187
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be super handy!
I agree the logic is different enough (inserting a new axis) that this doesn't fit into reduce
or apply
.
inclusive. | ||
dim : str or sequence of str, optional | ||
Dimension(s) over which to apply quantile. | ||
axis : int or sequence of int, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not bother with the axis
argument. It makes the logic below quite convoluted, and in my experience with xarray it's rarely useful.
@@ -1736,5 +1736,93 @@ def dot(self, other): | |||
|
|||
return type(self)(new_data, new_coords, new_dims) | |||
|
|||
def quantile(self, q, dim=None, axis=None, interpolation='linear'): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, start with a one line description (without the extra line break)
|
||
Parameters | ||
---------- | ||
q : float in range of [0,100] (or sequence of floats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas uses quantiles between 0 and 1. Since we're using the name quantile
(rather than percentile
), I would stick with that convention (just divide by 100 to use nanpercentile
under the hood).
remain after the reduction of the array. If the input | ||
contains integers or floats smaller than ``float64``, the output | ||
data-type is ``float64``. Otherwise, the output data-type is the | ||
same as that of the input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is "same as the input" not float64? (I guess this is an issue with numpy docstring, but I would probably just leave out the extra detail here rather than copying it)
|
||
# Construct the return DataArray | ||
ps = DataArray(ps, dims=new_dims, name=self.name) | ||
if not isscalar: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add quantile
as a coordinate even if it's a scalar?
if not isscalar: | ||
new_dims = ['quantile'] + new_dims | ||
|
||
ps = np.nanpercentile(self.data, q, axis=axis, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure this fails loudly on dask arrays rather than implicitly converting to numpy.
@@ -1736,5 +1736,93 @@ def dot(self, other): | |||
|
|||
return type(self)(new_data, new_coords, new_dims) | |||
|
|||
def quantile(self, q, dim=None, axis=None, interpolation='linear'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to also have this method for Dataset
(or maybe even for Variable
, too). Take a look at Dataset.reduce
for an example.
To avoid duplicate implementation, many DataArray
methods have the core of their logic implemented for Dataset and then use _to_temp_dataset()
and _to_from_dataset()
to convert back and forth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Which do you prefer? It fits quite easily in the Variable
since we don't need coords to calculate the percentiles. The DataArray
and Dataset
methods can just be wrappers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the most consistent with the current approach is to implement the guts (calling np.nanpercentile
) on Variable
, and then make a wrapper for handling coordinates on Dataset
(and call that from the DataArray
method).
if isscalar: | ||
q = float(q) | ||
else: | ||
q = np.asarray(q, dtype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be slightly cleaner to unilaterally coerce q
to a float64 array, and then use q.ndim != 0
to check for nonscalar q.
# Construct the return DataArray | ||
ps = DataArray(ps, dims=new_dims, name=self.name) | ||
if not isscalar: | ||
ps['quantile'] = DataArray(q, dims=('quantile', ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe slightly cleaner: ps.coords['quantile'] = ('quantile', q)
or ps.coords['quantile'] = Variable('quantile', q)
. There's no need to construct a full DataArray here and the `name='quantile' bit in particular is redundant with the name on the left-hand-side of the assignment operation.
@@ -1328,6 +1328,22 @@ def test_reduce(self): | |||
expected = DataArray(5, {'c': -999}) | |||
self.assertDataArrayIdentical(expected, actual) | |||
|
|||
def test_quantile(self): | |||
for method in ['linear', 'lower', 'higher', 'nearest', 'midpoint']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to test all these methods here, since we have no logic specific to them. It's enough to verify (once) that the method
argument gets passed on to NumPy.
@shoyer - I moved the majority of the logic over to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better now. This also needs basic tests for the Dataset and DataArray methods.
|
||
variables = OrderedDict() | ||
for name, var in iteritems(self.variables): | ||
variables[name] = var.quantile(q, dim=dim, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a condition like the one in Dataset.reduce
to skip variables that don't use dim
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, coordinate variable should be skipped. See Dataset.reduce
for the right logic.
|
||
if isinstance(self.data, dask_array_type): | ||
TypeError("quantile does not work for arrays stored as dask " | ||
"arrays. Load the data via .load() prior to calling " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.load()
should be .compute() or .load()
axis = self.get_axis_num(dim) | ||
new_dims.remove(dim) | ||
else: | ||
axis = [self.get_axis_num(d) for d in dim] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor note: get_axis_num
actually works on either individual or lists of dimensions
@@ -165,6 +165,9 @@ Enhancements | |||
and attributes. The method prints to a buffer (e.g. ``stdout``) with output | |||
similar to what the command line utility ``ncdump -h`` produces (:issue:`1150`). | |||
By `Joe Hamman <https://github.com/jhamman>`_. | |||
- New :py:meth:`~DataArray.quantile` method to calculate quantiles from | |||
DataArray objects (:issue:`xxxx`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update issue number to this PR
@@ -2,8 +2,8 @@ name: test_env | |||
dependencies: | |||
- python=2.7 | |||
- pytest | |||
- numpy==1.9.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does look a possible NumPy bug. Instead of updating the required version of numpy for everything, I would just skip it explicitly for the failing test.
@@ -61,9 +61,9 @@ install: | |||
- source activate test_env | |||
# scipy should not have been installed, but it's included in older versions of | |||
# the conda pandas package | |||
- if [[ "$CONDA_ENV" == "py27-min" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better not to remove this unless necessary
@shoyer - all comments addressed and tests are passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a fix to the Dataset.quantile
docstring, otherwise looks good to go!
@@ -2547,6 +2546,87 @@ def roll(self, **shifts): | |||
|
|||
return self._replace_vars_and_dims(variables) | |||
|
|||
def quantile(self, q, dim=None, numeric_only=False, keep_attrs=False, | |||
interpolation='linear'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numeric_only
and keep_attrs
are missing from the docstring
@@ -270,6 +270,7 @@ Computation | |||
DataArray.get_axis_num | |||
DataArray.diff | |||
DataArray.dot | |||
DataArray.quantile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this for Dataset
, too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
This PR adds the
quantile
method to the DataArray. There may be a way to better fit this in with.apply
orreduce
it wasn't immediately clear to me how to do that. It usesnp.nanpercentile
under the hood so this shouldn't be expected to work well with dask. The main advantage to this method, over usingapply(np.nanpercentile)
is the handling of thequantile
coordinate.example usage:
closes #303
fixes #561