
Feature/average #650

Closed
wants to merge 1 commit into from

Conversation

jhamman
Member

jhamman commented Nov 9, 2015

closes #422

@max-sixty
Collaborator

xref pandas-dev/pandas#10030

@jhamman
Member Author

jhamman commented Nov 9, 2015

Thanks @MaximilianR. There has been an open issue here on this for a while (#422).

@shoyer - I'm actually not sure I love how I implemented this but I'm teaching a session on open source contributions and code review today so I threw this up here as an example.

@shoyer
Member

shoyer commented Nov 10, 2015

Any thoughts on the tradeoff between adding average vs adding a weights argument to mean? I guess it's nice that this mirrors NumPy.


# if NaNs are present, we need individual weights
valid = self.notnull()
if valid.any():
Member

I think you have this logic backwards? I think it should be if not valid.all().

That said, these sorts of conditionals (that look at the data) are usually best avoided when using dask, because they bring construction of the computation graph to a screeching halt while the data is evaluated.

Member Author

I think you are right about the logic being backwards. I'm not sure how that happened. There was some discussion about why we needed this conditional in this comment: #422 (comment).

If we can think of a way to side step this, I'm fine with removing it.

Member

I think we can just use sum_of_weights = weights.where(valid).sum(dim=dim, axis=axis). That will work regardless of whether there are any nulls in the array.

This line is here because we don't want to count weights corresponding to NaN values in the sum.
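The masking idea can be illustrated with plain NumPy (a sketch only; xarray's .where(valid) on a DataArray behaves analogously but this is not the PR's actual code): zeroing the weights wherever the data is null gives the correct sum of weights whether or not NaNs are present, with no data-dependent branch.

```python
import numpy as np

# NumPy sketch of the weights.where(valid) idea: zero the weights wherever
# the data is NaN, so no `if valid.any()` branch is needed and nothing
# forces eager evaluation under dask.
data = np.array([1.0, 2.0, np.nan, 4.0])
weights = np.array([1.0, 2.0, 3.0, 4.0])

valid = ~np.isnan(data)                          # analogue of self.notnull()
masked_weights = np.where(valid, weights, 0.0)   # analogue of weights.where(valid)
sum_of_weights = masked_weights.sum()            # 1 + 2 + 4 = 7
average = np.nansum(data * masked_weights) / sum_of_weights
```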

Collaborator

Ah yes, I must have been thinking in terms of isnan.

@jhamman
Member Author

jhamman commented Nov 10, 2015

Any thoughts on the tradeoff between adding average vs adding a weights argument to mean? I guess it's nice that this mirrors NumPy.

That would be the main motivation.

If Pandas is going the way of pandas-dev/pandas#10030 via mean, I think we could do that as well. I actually like that approach more since we tend to call it a "weighted mean" (see title of pandas issue).

@shoyer
Member

shoyer commented Nov 12, 2015

If you think the ability to return sum_of_weights is important, then this probably makes sense as a separate method -- that would be pretty confusing to add to mean. Otherwise, I would be inclined to simply add weights=None to mean. That would require a bit of refactoring but shouldn't be too bad.
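A rough sketch of what the weights=None variant could look like on a plain array (the signature and NaN handling here are hypothetical, not xarray's API): with no weights it reduces to an ordinary nan-skipping mean, otherwise it masks the weights of null values before normalizing.

```python
import numpy as np

# Hypothetical sketch of mean(weights=None): plain mean when no weights
# are given, weighted mean otherwise. Not xarray's actual method.
def mean(data, weights=None):
    data = np.asarray(data, dtype=float)
    if weights is None:
        return np.nanmean(data)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(data)
    w = np.where(valid, weights, 0.0)   # ignore weights of NaN values
    total = w.sum()
    return np.nansum(data * w) / total if total > 0 else np.nan
```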

@jhamman
Member Author

jhamman commented Nov 12, 2015

Okay, let's go with the mean refactor. We'll drop the returned arg and just add weights to the method.

@mathause - any comment?

@mathause
Collaborator

Didn't realize you were working on this. Pulling it into mean is fine for me (if you need the weights it is a one-liner). average in numpy seems comparatively complicated - maybe that's why it got its own function...

  • average with no valid elements (or 0 weight) seems to return NaN which is fine
  • maybe you need to add tests when the data contains NaN

@jhamman you showed this in a lecture? cool :)

@jhamman
Member Author

jhamman commented Feb 18, 2016

I'm doing some cleanup on my outstanding issues/PRs. After thinking about this again, I'm not all that keen on pushing this into the mean method. I actually think it will end up being a bit of an ordeal to make happen. mean is currently injected as one of the NAN_REDUCE_METHODS. It's not entirely clear to me that it will be "cleaner" to refactor mean to support weights. Thoughts?

@mathause
Collaborator

I am fine with having it as an extra method. I think it is an important feature to have - I use this function every day.

@shoyer
Member

shoyer commented Feb 19, 2016

I would still lean toward trying to put this into mean. You already have most of what you need -- it would just be a matter of dropping mean from the list of injected methods.

#770 might help with some of the redundant code (if/when we get around to it).

@mathause
Collaborator

It seems incorporating this into mean may not be very practical, and average may not be the cleanest solution. Do you know if a weighted mean is planned in pandas?

Anyway, I have tried to put together some corner cases where there are NaNs in the data or the weights. Unfortunately there is no np.nanaverage, so I also compared it to np.ma.average. I put together a gist with a lot of examples:

https://gist.github.com/mathause/720cbca2d97597a99534581b8ca296a5

The above implementation works fine, however there are currently two cases where I expect another answer:

data = [1, np.nan]; weights = [0, 1.]
>>> 0.

I think this should return NaN.

data = [1, 1.]; weights = [np.nan, np.nan]
>>> 0
data = [1, 1.]; weights = [np.nan, 0]
>>> 0

I think these should also return NaN.
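The corner-case behaviour described above can be sketched as follows (the function name is illustrative and exists in neither NumPy nor xarray): return NaN whenever the weights contain NaN, or when no weight falls on a valid value.

```python
import numpy as np

# Sketch of a weighted mean that returns NaN for the corner cases listed
# above. Purely illustrative; not the PR's implementation.
def weighted_nanmean(data, weights):
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.isnan(weights).any():
        return np.nan                    # NaN weights propagate
    valid = ~np.isnan(data)
    total = weights[valid].sum()
    if total == 0:
        return np.nan                    # no weight on any valid value
    return (data[valid] * weights[valid]).sum() / total

weighted_nanmean([1, np.nan], [0, 1.])        # -> nan
weighted_nanmean([1, 1.], [np.nan, np.nan])   # -> nan
weighted_nanmean([1, 1.], [np.nan, 0])        # -> nan
```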

@shoyer
Member

shoyer commented May 10, 2016

Do you know if a weighted mean is planned in pandas?

Like most new features for pandas (or xarray, for that matter), there isn't anyone who has committed to working on it -- it will depend on the interest of a contributor.

@mathause
Collaborator

I could imagine continuing to work on this - however, there are some open design questions:

  • Do we include skipna? (I would say yes)
  • Do we allow the weights to contain NaN? (I would say yes, although disallowing it would make it easier.)
  • Does skipna also apply to the weights or are NaNs always skipped in the weights? (I would suggest the latter.)
  • Do we need a skipna_weights for a fine grained control of this? (This sounds unnecessary)
  • Do you agree with the above given examples?
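One possible reading of the semantics suggested in the questions above, as a sketch (the name and signature are hypothetical): skipna controls NaNs in the data only, while NaN weights are always treated as missing, with no separate skipna_weights.

```python
import numpy as np

# Hypothetical semantics sketch: skipna applies to the data; NaN weights
# are always skipped (treated as weight 0). Names are illustrative only.
def average(data, weights, skipna=True):
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = np.where(np.isnan(weights), 0.0, weights)  # NaN weights always skipped
    if skipna:
        keep = ~np.isnan(data)
        data, weights = data[keep], weights[keep]
    total = weights.sum()
    return (data * weights).sum() / total if total > 0 else np.nan
```

With skipna=False a NaN in the data propagates into the result; with skipna=True it is simply dropped.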

@max-sixty
Collaborator

How about designing this as a groupby-like interface? In the same way as .rolling (or .expanding & .ewm in pandas)?

So for example ds.weighted(weights=ds.dim).mean().

And then this is extensible, clean, pandan-tic.
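A toy version of the proposed interface, sketched on plain NumPy arrays (the real xarray accessor, if implemented, may look quite different):

```python
import numpy as np

class Weighted:
    # Toy stand-in for the proposed ds.weighted(...) object; operates on
    # plain NumPy arrays rather than Datasets. Purely illustrative.
    def __init__(self, obj, weights):
        self.obj = np.asarray(obj, dtype=float)
        self.weights = np.asarray(weights, dtype=float)

    def mean(self):
        valid = ~np.isnan(self.obj)
        w = np.where(valid, self.weights, 0.0)   # ignore weights of NaN data
        total = w.sum()
        if total == 0:
            return np.nan
        return np.nansum(self.obj * w) / total

def weighted(obj, weights):
    # mirrors the ds.weighted(weights=...) spelling from the comment above
    return Weighted(obj, weights)

weighted([1.0, 3.0, np.nan], [1.0, 1.0, 5.0]).mean()  # -> 2.0
```

Like .rolling or .groupby, the intermediate object could later grow other reductions (.sum(), .std(), ...) without touching mean itself.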

@shoyer
Member

shoyer commented May 10, 2016

@MaximilianR great idea! A groupby like interface is much cleaner than adding more orthogonal code paths to .mean and the like.

@jhamman
Member Author

jhamman commented May 11, 2016

@MaximilianR - I really like this idea. I'm going to close this PR and we can continue to discuss this feature in the original issue (#422 (comment)).

jhamman closed this May 11, 2016
Successfully merging this pull request may close these issues.

add average function
4 participants