
Feature/average #650

Closed
wants to merge 1 commit into from

Conversation

jhamman
Member

jhamman commented Nov 9, 2015

closes #422

@max-sixty
Collaborator

xref pandas-dev/pandas#10030

@jhamman
Member Author

jhamman commented Nov 9, 2015

Thanks @MaximilianR. There has been an open issue here on this for a while (#422).

@shoyer - I'm actually not sure I love how I implemented this but I'm teaching a session on open source contributions and code review today so I threw this up here as an example.

@shoyer
Member

shoyer commented Nov 10, 2015

Any thoughts on the tradeoff between adding average vs adding a weights argument to mean? I guess it's nice that this mirrors NumPy.


# if NaNs are present, we need individual weights
valid = self.notnull()
if valid.any():
Member

I think you have this logic backwards? I think it should be if not valid.all().

That said, these sorts of conditionals (that look at the data) are usually best avoided when using dask, because they bring construction of the computation graph to a screeching halt while the data is evaluated.

Member Author

I think you are right about the logic being backwards. I'm not sure how that happened. There was some discussion about why we needed this conditional in this comment: #422 (comment).

If we can think of a way to side step this, I'm fine with removing it.

Member

I think we can just use sum_of_weights = weights.where(valid).sum(dim=dim, axis=axis). That will work regardless of whether there are any nulls in the array.

This line is here because we don't want to count weights corresponding to NaN values in the sum.
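The masking idea can be illustrated with plain NumPy (a sketch only; xarray's .where(valid) on a DataArray behaves analogously but this is not the PR's actual code): zeroing the weights wherever the data is null gives the correct sum of weights whether or not NaNs are present, with no data-dependent branch.

```python
import numpy as np

# NumPy sketch of the weights.where(valid) idea: zero the weights wherever
# the data is NaN, so no `if valid.any()` branch is needed and nothing
# forces eager evaluation under dask.
data = np.array([1.0, 2.0, np.nan, 4.0])
weights = np.array([1.0, 2.0, 3.0, 4.0])

valid = ~np.isnan(data)                          # analogue of self.notnull()
masked_weights = np.where(valid, weights, 0.0)   # analogue of weights.where(valid)
sum_of_weights = masked_weights.sum()            # 1 + 2 + 4 = 7
average = np.nansum(data * masked_weights) / sum_of_weights
```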

Collaborator

Ah yes, I must have been thinking in terms of isnan.

@jhamman
Member Author

jhamman commented Nov 10, 2015

Any thoughts on the tradeoff between adding average vs adding a weights argument to mean? I guess it's nice that this mirrors NumPy.

That would be the main motivation.

If Pandas is going the way of pandas-dev/pandas#10030 via mean, I think we could do that as well. I actually like that approach more since we tend to call it a "weighted mean" (see title of pandas issue).

@shoyer
Member

shoyer commented Nov 12, 2015

If you think the ability to return sum_of_weights is important, then this probably makes sense as a separate method -- that would be pretty confusing to add to mean. Otherwise, I would be inclined to simply add weights=None to mean. That would require a bit of refactoring but shouldn't be too bad.
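A rough sketch of what the weights=None variant could look like on a plain array (the signature and NaN handling here are hypothetical, not xarray's API): with no weights it reduces to an ordinary nan-skipping mean, otherwise it masks the weights of null values before normalizing.

```python
import numpy as np

# Hypothetical sketch of mean(weights=None): plain mean when no weights
# are given, weighted mean otherwise. Not xarray's actual method.
def mean(data, weights=None):
    data = np.asarray(data, dtype=float)
    if weights is None:
        return np.nanmean(data)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(data)
    w = np.where(valid, weights, 0.0)   # ignore weights of NaN values
    total = w.sum()
    return np.nansum(data * w) / total if total > 0 else np.nan
```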

@jhamman
Member Author

jhamman commented Nov 12, 2015

Okay, let's go with the mean refactor. We'll drop the returned arg and just add weights to the method.

@mathause - any comment?

@mathause
Collaborator

Didn't realize you were working on this. Pulling it into mean is fine for me (if you need the weights it is a one-liner). average in numpy seems comparatively complicated - maybe that's why it got its own function...

  • average with no valid elements (or 0 weight) seems to return NaN which is fine
  • maybe you need to add tests when the data contains NaN

@jhamman you showed this in a lecture? cool :)

@jhamman
Member Author

jhamman commented Feb 18, 2016

I'm doing some cleanup on my outstanding issues/PRs. After thinking about this again, I'm not all that keen on pushing this into the mean method. I actually think it will end up being a bit of an ordeal to make happen. mean is currently injected as one of the NAN_REDUCE_METHODS. It's not entirely clear to me that it will be "cleaner" to refactor mean to support weights. Thoughts?

@mathause
Collaborator

I am fine with having it as an extra method. I think it is an important feature to have - I use this function every day.

@shoyer
Member

shoyer commented Feb 19, 2016

I would still lean toward trying to put this into mean. You already have most of what you need -- it would just be a matter of dropping mean from the list of injected methods.

#770 might help with some of the redundant code (if/when we get around to it).

@mathause
Collaborator

It seems incorporating this into mean may not be very practical, and average may not be the cleanest solution. Do you know if a weighted mean is planned in pandas?

Anyway, I have tried to put together some corner cases where there are NaNs in the data or the weights. Unfortunately there is no np.nanaverage, so I also compared it to np.ma.average. I put together a gist with a lot of examples:

https://gist.github.com/mathause/720cbca2d97597a99534581b8ca296a5

The above implementation works fine, however there are currently two cases where I expect another answer:

data = [1, np.nan]; weights = [0, 1.]
>>> 0.

I think this should return NaN.

data = [1, 1.]; weights = [np.nan, np.nan]
>>> 0
data = [1, 1.]; weights = [np.nan, 0]
>>> 0

I think these should also return NaN.
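The corner-case behaviour described above can be sketched as follows (the function name is illustrative and exists in neither NumPy nor xarray): return NaN whenever the weights contain NaN, or when no weight falls on a valid value.

```python
import numpy as np

# Sketch of a weighted mean that returns NaN for the corner cases listed
# above. Purely illustrative; not the PR's implementation.
def weighted_nanmean(data, weights):
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.isnan(weights).any():
        return np.nan                    # NaN weights propagate
    valid = ~np.isnan(data)
    total = weights[valid].sum()
    if total == 0:
        return np.nan                    # no weight on any valid value
    return (data[valid] * weights[valid]).sum() / total

weighted_nanmean([1, np.nan], [0, 1.])        # -> nan
weighted_nanmean([1, 1.], [np.nan, np.nan])   # -> nan
weighted_nanmean([1, 1.], [np.nan, 0])        # -> nan
```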

@shoyer
Member

shoyer commented May 10, 2016

Do you know if a weighted mean is planned in pandas?

Like most new features for pandas (or xarray, for that matter), there isn't anyone who has committed to working on it -- it will depend on the interest of a contributor.

@mathause
Collaborator

I could imagine continuing to work on this - however, there are some open design questions:

  • Do we include skipna? (I would say yes)
  • Do we allow the weights to contain NaN? (I would say yes, although disallowing it would make it easier.)
  • Does skipna also apply to the weights or are NaNs always skipped in the weights? (I would suggest the latter.)
  • Do we need a skipna_weights for a fine grained control of this? (This sounds unnecessary)
  • Do you agree with the above given examples?
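One possible reading of the semantics suggested in the questions above, as a sketch (the name and signature are hypothetical): skipna controls NaNs in the data only, while NaN weights are always treated as missing, with no separate skipna_weights.

```python
import numpy as np

# Hypothetical semantics sketch: skipna applies to the data; NaN weights
# are always skipped (treated as weight 0). Names are illustrative only.
def average(data, weights, skipna=True):
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = np.where(np.isnan(weights), 0.0, weights)  # NaN weights always skipped
    if skipna:
        keep = ~np.isnan(data)
        data, weights = data[keep], weights[keep]
    total = weights.sum()
    return (data * weights).sum() / total if total > 0 else np.nan
```

With skipna=False a NaN in the data propagates into the result; with skipna=True it is simply dropped.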

@max-sixty
Collaborator

How about designing this as a groupby-like interface? In the same way as .rolling (or .expanding & .ewm in pandas)?

So for example ds.weighted(weights=ds.dim).mean().

And then this is extensible, clean, pandan-tic.
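A toy version of the proposed interface, sketched on plain NumPy arrays (the real xarray accessor, if implemented, may look quite different):

```python
import numpy as np

class Weighted:
    # Toy stand-in for the proposed ds.weighted(...) object; operates on
    # plain NumPy arrays rather than Datasets. Purely illustrative.
    def __init__(self, obj, weights):
        self.obj = np.asarray(obj, dtype=float)
        self.weights = np.asarray(weights, dtype=float)

    def mean(self):
        valid = ~np.isnan(self.obj)
        w = np.where(valid, self.weights, 0.0)   # ignore weights of NaN data
        total = w.sum()
        if total == 0:
            return np.nan
        return np.nansum(self.obj * w) / total

def weighted(obj, weights):
    # mirrors the ds.weighted(weights=...) spelling from the comment above
    return Weighted(obj, weights)

weighted([1.0, 3.0, np.nan], [1.0, 1.0, 5.0]).mean()  # -> 2.0
```

Like .rolling or .groupby, the intermediate object could later grow other reductions (.sum(), .std(), ...) without touching mean itself.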

@shoyer
Member

shoyer commented May 10, 2016

@MaximilianR great idea! A groupby like interface is much cleaner than adding more orthogonal code paths to .mean and the like.

@jhamman
Member Author

jhamman commented May 11, 2016

@MaximilianR - I really like this idea. I'm going to close this PR and we can continue to discuss this feature in the original issue (#422 (comment)).

jhamman closed this May 11, 2016
Successfully merging this pull request may close these issues.

add average function
4 participants