
Weighted mean #51

Merged: 12 commits into rust-ndarray:master, Sep 25, 2019

Conversation

@nilgoyette (Contributor):

Here's a first version of weighted_mean and weighted_mean_axis.

Disclaimers:

  1. I don't really know where weighted_mean and friends should go. Is summary_statistics ok?
  2. I had to move return_err_if_empty and return_err_unless_same_shape because they are useful elsewhere.
  3. There's a little code duplication in weighted_mean_axis, to avoid (2 conditions + 1 unwrap) x nb_lanes. Maybe I could create an inner function called inner_weighted_mean or something, then call it in both functions?

Questions:

  1. Why are the summary_statistics tests (and others) not in /tests/*? I thought that the public API was supposed to be tested outside the crate. Is this not a "standard"?

@fmorency:

I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

@LukeMathWalker (Member) commented Sep 18, 2019:

> 1. I don't really know where weighted_mean and friends should go. Is summary_statistics ok?

I'd say it makes sense for them to go there, given that it's where mean is.

> 2. I had to move return_err_if_empty and return_err_unless_same_shape because they are useful elsewhere.

It makes perfect sense.

> 3. There's a little code duplication in weighted_mean_axis, to avoid (2 conditions + 1 unwrap) x nb_lanes. Maybe I could create an inner function called inner_weighted_mean or something, then call it in both functions?

I don't feel too strongly in either direction - if you feel like doing it, all the better 👍

> Questions:
>
> 1. Why are the summary_statistics tests (and others) not in /tests/*? I thought that the public API was supposed to be tested outside the crate. Is this not a "standard"?

I'd say that it's standard for integration tests to go outside the crate, yes. On the other hand, it's sometimes nice to have tests next to the code they refer to - we haven't been very consistent across the crate. It would probably make sense to migrate them to the tests folder.

Overall, this looks good to me, thanks for working on it! I suggested one area of improvement around testing that would increase the robustness of our current check - let me know what you think about it @nilgoyette.

@LukeMathWalker (Member):

> I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

I am not sure - I am concerned about introducing rounding errors with a normalization step that might or might not be necessary.
I would lean on the side of being explicit and letting the user take care of normalisation, if they need it.

@fmorency:

> > I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.
>
> I am not sure - I am concerned about introducing rounding errors with a normalization step that might or might not be necessary.
> I would lean on the side of being explicit and letting the user take care of normalisation, if they need it.

I see your point and I tend to agree.

However, my concern is that the weighted_mean function doesn't return an actual weighted mean, except when the weights are normalized. From the definition in the documentation, I would tend to:

  • Rename this function to weighted_sum and note in the documentation that if the weights are pre-normalized, the result is a weighted mean.
  • Write an additional function weighted_mean that actually computes the weighted mean given any input - i.e. divide by the sum of the weights.

Thoughts?
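
In code, the proposal amounts to something like this (a rough sketch using names from this discussion, not the final API):

```rust
use ndarray::Array1;

/// Weighted sum: Σ wᵢ·xᵢ. This equals the weighted mean only when the
/// weights are normalized, i.e. Σ wᵢ = 1.
fn weighted_sum(data: &Array1<f64>, weights: &Array1<f64>) -> f64 {
    data.iter().zip(weights).map(|(x, w)| x * w).sum()
}

/// Weighted mean for arbitrary weights: Σ wᵢ·xᵢ / Σ wᵢ.
fn weighted_mean(data: &Array1<f64>, weights: &Array1<f64>) -> f64 {
    weighted_sum(data, weights) / weights.sum()
}
```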

@LukeMathWalker (Member):

> > > I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.
> >
> > I am not sure - I am concerned about introducing rounding errors with a normalization step that might or might not be necessary.
> > I would lean on the side of being explicit and letting the user take care of normalisation, if they need it.
>
> I see your point and I tend to agree.
>
> However, my concern is that the weighted_mean function doesn't return an actual weighted mean, except when the weights are normalized. From the definition in the documentation, I would tend to:
>
>   • Rename this function to weighted_sum and note in the documentation that if the weights are pre-normalized, the result is a weighted mean.
>   • Write an additional function weighted_mean that actually computes the weighted mean given any input - i.e. divide by the sum of the weights.
>
> Thoughts?

I think that makes perfect sense 👍


```rust
/// Returns the [`arithmetic weighted mean`] x̅ along `axis`. Assumes that the weights are
/// already normalized.
/// Like `weighted_mean`, but assumes that the `weights` are normalized. In that case, this
```

@LukeMathWalker (Member):

I would just describe what a weighted sum is and then point to the equivalence with weighted_mean if the weights are normalised.
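
One possible phrasing along those lines (a wording sketch, not the merged doc comment):

```rust
/// Returns the weighted sum of all elements of the array, Σ wᵢ·xᵢ.
/// Equivalent to `weighted_mean` when the `weights` sum to one,
/// i.e. when they are normalized.
```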

```diff
@@ -28,13 +28,17 @@ where
where
    A: Clone + FromPrimitive + Add<Output = A> + Div<Output = A> + Zero;

/// Returns the [`arithmetic weighted mean`] x̅ of all elements in the array. Assumes that the
/// weights are already normalized.
/// Returns the [`arithmetic weighted mean`] x̅ of all elements in the array. Use `weighted_sum`
```
@LukeMathWalker (Member), Sep 20, 2019:

What happens if the weights sum to 0? I assume the function panics:

  • we should add it to the docs;
  • we should add a test for it.

Same story for the _axis version.
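
A sketch of such a test, assuming the float behavior settled on below in this thread (zero-sum weights yield NaN rather than an error):

```rust
use ndarray::array;
use ndarray_stats::SummaryStatisticsExt;

#[test]
fn weighted_mean_with_zero_sum_weights() {
    let a = array![1.0_f64, 2.0, 3.0];
    let w = array![0.0_f64, 0.0, 0.0];
    // Σ wᵢ·xᵢ / Σ wᵢ = 0.0 / 0.0, which is NaN for floats.
    assert!(a.weighted_mean(&w).unwrap().is_nan());
}
```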

@nilgoyette (Contributor, Author):

Oh, I forgot about this. Thank you.

@nilgoyette (Contributor, Author):

In fact, it panics on integers and returns NaN on floating-point numbers, which is quite normal. Do we want to panic on floating-point numbers, or do we let the users learn the hard facts of IEEE 754?
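
The underlying behavior in plain Rust, for reference:

```rust
fn main() {
    // Floats follow IEEE 754: 0.0 / 0.0 is quietly NaN, which is what a
    // weighted mean with zero-sum weights produces.
    let weights_sum = 0.0_f64;
    println!("{}", 0.0 / weights_sum); // prints "NaN"

    // Integers have no NaN to fall back on, so division panics instead.
    let weights_sum = 0_i64;
    // println!("{}", 0 / weights_sum); // panics: "attempt to divide by zero"
    let _ = weights_sum;
}
```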

@LukeMathWalker (Member):

I would let them know the hard facts, unfortunately 😂
We could use something along the lines of what we say for mean_axis in ndarray:

> Panics if axis is out of bounds or if the length of the axis is zero and division by zero panics for type A.

Which makes me think that we need to add a test and a note about zero-length axes on both weighted_mean_axis and weighted_sum_axis.

@nilgoyette (Contributor, Author):

Sorry, I don't understand this part. How can "the length of the axis" be zero? Are we talking about an Array0 (a single number)? Surely the len of an Array0 is 1 and not 0? Or is it considered a data point with no length?

Also, reading ndarray's version, it seems like it's saying "Panics if x OR (y AND z)". Should they all be "or"?

@LukeMathWalker (Member):

You can have an Array2 with shape [0, n], where n is strictly greater than 0.

"Panics if x OR (y AND z)"

That is indeed correct - if the axis length is 0, but division by zero doesn't panic for A (e.g. floats), then the method does not panic.
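
Concretely, a minimal sketch of such an array:

```rust
use ndarray::{Array2, Axis};

fn main() {
    // An empty array that still carries non-trivial shape information.
    let a = Array2::<f64>::zeros((0, 3));
    assert_eq!(a.len(), 0);           // no elements at all...
    assert_eq!(a.len_of(Axis(1)), 3); // ...yet Axis(1) still has length 3
    // Along Axis(0) there are 3 lanes, each of length 0.
    assert_eq!(a.lanes(Axis(0)).into_iter().count(), 3);
}
```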

@nilgoyette (Contributor, Author):

Sorry to disturb you again, but I'm lost here.

  1. Those zero-length dimensions don't make sense to me. It looks like they are uselessly complex empty arrays. Do you have more information on them? Because I can't code something for a "feature" that I don't understand.
  2. I removed the "MultiInputError::EmptyInput if self is empty" error from weighted_mean_axis and weighted_sum_axis because I don't understand why they should return an error when the array is empty. It might seem to make sense for weighted_mean_axis, because we divide all lanes by weights_sum, but there's no lane to divide anyway.
  3. I can't respect "Panics if axis is out of bounds or if the length of the axis is zero and division by zero panics for type A." because my functions use map_axis, which panics with an error not documented for map_axis:

```rust
let a = Array2::<f32>::zeros((0, 20));
a.mean_axis(Axis(0));                                      // passes
a.weighted_mean_axis(Axis(0), &Array1::zeros(0)).unwrap(); // panics:
// panicked at 'collapse_axis: Index 0 must be less than axis length 0 for array with shape [0, 20]'
```

Of course I could try to use something else (there are other ways to code weighted_sum_axis), but I don't see why I should, mostly because of 1).

@LukeMathWalker (Member), Sep 24, 2019:

No issue at all, happy to reach consensus on these changes 😄

> 1. Those zero-length dimensions don't make sense to me. It looks like they are uselessly complex empty arrays. Do you have more information on them? Because I can't code something for a "feature" that I don't understand.

Indeed, I wouldn't consider them a feature - they are edge cases (empty arrays with associated non-trivial dimension-related information), but given that we are writing a library, we need to take care of them (or at least document what happens if you pass them in).
Would it be possible to forbid them? It's a possibility, but as long as ndarray's codebase allows them, we need to deal with them.

> 2. I removed the "MultiInputError::EmptyInput if self is empty" error from weighted_mean_axis and weighted_sum_axis because I don't understand why they should return an error when the array is empty. It might seem to make sense for weighted_mean_axis, because we divide all lanes by weights_sum, but there's no lane to divide anyway.

I agree on weighted_sum_axis / weighted_sum - there is a very good default value to return (A::zero()) and the behaviour is consistent with sum in ndarray.
For the mean, though, it gets trickier. What is the mean of an empty array of integers? 0? I would argue that there is no sound answer to that question, hence it makes sense to return None. Given that those functions can fail in multiple ways (mismatched shapes), it's more ergonomic to return an error variant than a Result<Option<A>, ErrType> (see mean in ndarray).
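
The ergonomic difference, sketched with deliberately simplified signatures (not the crate's exact ones):

```rust
use ndarray_stats::errors::MultiInputError;

// Empty input as an error variant: every failure mode lives in one enum,
// so a single `?` propagates all of them.
fn mean_as_error(data: &[f64]) -> Result<f64, MultiInputError> {
    if data.is_empty() {
        return Err(MultiInputError::EmptyInput);
    }
    Ok(data.iter().sum::<f64>() / data.len() as f64)
}

// The nested alternative: callers must now handle `Err` *and* `None` separately.
fn mean_as_option(data: &[f64]) -> Result<Option<f64>, MultiInputError> {
    if data.is_empty() {
        return Ok(None);
    }
    Ok(Some(data.iter().sum::<f64>() / data.len() as f64))
}
```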

> 3. I can't respect "Panics if axis is out of bounds or if the length of the axis is zero and division by zero panics for type A." because my functions use map_axis, which panics with an error not documented for map_axis:
>
> ```rust
> let a = Array2::<f32>::zeros((0, 20));
> a.mean_axis(Axis(0));                                      // passes
> a.weighted_mean_axis(Axis(0), &Array1::zeros(0)).unwrap(); // panics:
> // panicked at 'collapse_axis: Index 0 must be less than axis length 0 for array with shape [0, 20]'
> ```
>
> Of course I could try to use something else (there are other ways to code weighted_sum_axis), but I don't see why I should, mostly because of 1).

This was a bug in ndarray 0.12.1 (indeed when dealing with 0-length axes 😛). It has been fixed in 0.13 - as soon as CI finishes on #52, I will merge it into master, so that you can get the fix in this branch as well.

@LukeMathWalker (Member):

#52 has been merged - I'd say that we can get this PR merged and then cut a release 👍

@nilgoyette (Contributor, Author):

I think it's better now. I had to remove "or if the length of the axis is zero and division by zero panics for type A." from weighted_mean_axis's docs, because the MultiInputError::EmptyInput error happens before any panic. Imo it's better that way.

Can you please check the first two assert_eq! of weighted_sum_dimension_zero? As I wrote, [0, N] arrays don't make sense to me, so I'm not sure what the right output is.

@LukeMathWalker (Member) left a comment:

Everything looks good 👍
Thanks for working on this @nilgoyette 🙏

@LukeMathWalker merged commit 496b8ac into rust-ndarray:master on Sep 25, 2019.