
Weighted mean #51

Merged: 12 commits into rust-ndarray:master, Sep 25, 2019

Conversation

@nilgoyette (Contributor):

Here's a first version of weighted_mean and weighted_mean_axis.

Disclaimers:

  1. I don't really know where weighted_mean and friends should go. Is summary_statistics ok?
  2. I had to move return_err_if_empty and return_err_unless_same_shape because they are useful elsewhere.
  3. There's a little code duplication in weighted_mean_axis, to avoid (2 conditions + 1 unwrap) x nb_lanes. Maybe I could create an inner function called inner_weighted_mean or something, then call it in both functions?

Questions:

  1. Why are the summary_statistics tests (and others) not in /tests/*? I thought that the public API was supposed to be tested outside the crate. Is this not a "standard"?

@fmorency:

I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

@LukeMathWalker (Member) commented Sep 18, 2019:

> 1. I don't really know where weighted_mean and friends should go. Is summary_statistics ok?

I'd say it makes sense for them to go there, given that it's where mean is.

> 2. I had to move return_err_if_empty and return_err_unless_same_shape because they are useful elsewhere.

It makes perfect sense.

> 3. There's a little code duplication in weighted_mean_axis, to avoid (2 conditions + 1 unwrap) x nb_lanes. Maybe I could create an inner function called inner_weighted_mean or something, then call it in both functions?

I don't feel too strongly in either direction - if you feel like doing it, all the better 👍

> Questions:
>
> 1. Why are the summary_statistics tests (and others) not in /tests/*? I thought that the public API was supposed to be tested outside the crate. Is this not a "standard"?

I'd say that it's standard for integration tests to go outside the crate, yes. On the other hand, it's sometimes nice to have tests next to the code they refer to - we haven't been very consistent across the crate. It would probably make sense to migrate them to the tests folder.

Overall, this looks good to me, thanks for working on it! I suggested one area of improvement around testing that would increase the robustness of our current check - let me know what you think about it @nilgoyette.

@LukeMathWalker (Member):

> I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

I am not sure - I am concerned about introducing rounding errors with a normalization step that might or might not be necessary.
I would lean on the side of being explicit and letting the user take care of normalisation, if they need it.

@fmorency:

> > I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.
>
> I am not sure - I am concerned about introducing rounding errors with a normalization step that might or might not be necessary.
> I would lean on the side of being explicit and letting the user take care of normalisation, if they need it.

I see your point and I tend to agree.

However, my concern is that the weighted_mean function doesn't return an actual weighted mean, except when the weights are normalized. From the definition in the documentation, I would tend to:

  • Rename this function to weighted_sum and note in the documentation that if the weights are pre-normalized, the result is a weighted mean.
  • Write an additional function weighted_mean that actually computes the weighted mean given any input - i.e. divide by the sum of the weights.

Thoughts?
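
In code, the proposal amounts to something like this (a rough sketch using names from this discussion, not the final API):

```rust
use ndarray::Array1;

/// Weighted sum: Σ wᵢ·xᵢ. This equals the weighted mean only when the
/// weights are normalized, i.e. Σ wᵢ = 1.
fn weighted_sum(data: &Array1<f64>, weights: &Array1<f64>) -> f64 {
    data.iter().zip(weights).map(|(x, w)| x * w).sum()
}

/// Weighted mean for arbitrary weights: Σ wᵢ·xᵢ / Σ wᵢ.
fn weighted_mean(data: &Array1<f64>, weights: &Array1<f64>) -> f64 {
    weighted_sum(data, weights) / weights.sum()
}
```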

@LukeMathWalker (Member):

> > > I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.
> >
> > I am not sure - I am concerned about introducing rounding errors with a normalization step that might or might not be necessary.
> > I would lean on the side of being explicit and letting the user take care of normalisation, if they need it.
>
> I see your point and I tend to agree.
>
> However, my concern is that the weighted_mean function doesn't return an actual weighted mean, except when the weights are normalized. From the definition in the documentation, I would tend to:
>
>   • Rename this function to weighted_sum and note in the documentation that if the weights are pre-normalized, the result is a weighted mean.
>   • Write an additional function weighted_mean that actually computes the weighted mean given any input - i.e. divide by the sum of the weights.
>
> Thoughts?

I think that makes perfect sense 👍


```rust
/// Returns the [`arithmetic weighted mean`] x̅ along `axis`. Assumes that the weights are
/// already normalized.
/// Like `weighted_mean`, but assumes that the `weights` are normalized. In that case, this
```

@LukeMathWalker (Member):

I would just describe what a weighted sum is and then point to the equivalence with weighted_mean if the weights are normalised.
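
One possible phrasing along those lines (a wording sketch, not the merged doc comment):

```rust
/// Returns the weighted sum of all elements of the array, Σ wᵢ·xᵢ.
/// Equivalent to `weighted_mean` when the `weights` sum to one,
/// i.e. when they are normalized.
```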

```diff
@@ -28,13 +28,17 @@ where
where
    A: Clone + FromPrimitive + Add<Output = A> + Div<Output = A> + Zero;

/// Returns the [`arithmetic weighted mean`] x̅ of all elements in the array. Assumes that the
/// weights are already normalized.
/// Returns the [`arithmetic weighted mean`] x̅ of all elements in the array. Use `weighted_sum`
```
@LukeMathWalker (Member), Sep 20, 2019:

What happens if the weights sum to 0? I assume the function panics:

  • we should add it to the docs;
  • we should add a test for it.

Same story for the _axis version.
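
A sketch of such a test, assuming the float behavior settled on below in this thread (zero-sum weights yield NaN rather than an error):

```rust
use ndarray::array;
use ndarray_stats::SummaryStatisticsExt;

#[test]
fn weighted_mean_with_zero_sum_weights() {
    let a = array![1.0_f64, 2.0, 3.0];
    let w = array![0.0_f64, 0.0, 0.0];
    // Σ wᵢ·xᵢ / Σ wᵢ = 0.0 / 0.0, which is NaN for floats.
    assert!(a.weighted_mean(&w).unwrap().is_nan());
}
```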

@nilgoyette (Contributor, Author):

Oh, I forgot about this. Thank you.

@nilgoyette (Contributor, Author):

In fact, it panics on integers and returns NaN on floating-point numbers, which is quite normal. Do we want to panic on floating-point numbers, or do we let the users learn the hard facts of IEEE 754?
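
The underlying behavior in plain Rust, for reference:

```rust
fn main() {
    // Floats follow IEEE 754: 0.0 / 0.0 is quietly NaN, which is what a
    // weighted mean with zero-sum weights produces.
    let weights_sum = 0.0_f64;
    println!("{}", 0.0 / weights_sum); // prints "NaN"

    // Integers have no NaN to fall back on, so division panics instead.
    let weights_sum = 0_i64;
    // println!("{}", 0 / weights_sum); // panics: "attempt to divide by zero"
    let _ = weights_sum;
}
```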

@LukeMathWalker (Member):

I would let them know the hard facts, unfortunately 😂
We could use something along the lines of what we say for mean_axis in ndarray:

> Panics if axis is out of bounds or if the length of the axis is zero and division by zero panics for type A.

Which makes me think that we need to add a test and a note about zero-length axes on both weighted_mean_axis and weighted_sum_axis.

@nilgoyette (Contributor, Author):

Sorry, I don't understand this part. How can "the length of the axis" be zero? Are we talking about an Array0 (a single number)? Surely the len of an Array0 is 1 and not 0? Or is it considered a data point with no length?

Also, reading ndarray's version, it seems like it's saying "Panics if x OR (y AND z)". Should they all be "or"?

@LukeMathWalker (Member):

You can have an Array2 with shape [0, n], where n is strictly greater than 0.

"Panics if x OR (y AND z)"

That is indeed correct - if the axis length is 0, but division by zero doesn't panic for A (e.g. floats), then the method does not panic.
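
Concretely, a minimal sketch of such an array:

```rust
use ndarray::{Array2, Axis};

fn main() {
    // An empty array that still carries non-trivial shape information.
    let a = Array2::<f64>::zeros((0, 3));
    assert_eq!(a.len(), 0);           // no elements at all...
    assert_eq!(a.len_of(Axis(1)), 3); // ...yet Axis(1) still has length 3
    // Along Axis(0) there are 3 lanes, each of length 0.
    assert_eq!(a.lanes(Axis(0)).into_iter().count(), 3);
}
```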

@nilgoyette (Contributor, Author):

Sorry to disturb you again, but I'm lost here.

  1. Those zero-length dimensions don't make sense to me. It looks like they are uselessly complex empty arrays. Do you have more information on them? Because I can't code something for a "feature" that I don't understand.
  2. I removed the "MultiInputError::EmptyInput if self is empty" error from weighted_mean_axis and weighted_sum_axis because I don't understand why they should return an error when the array is empty. It might seem to make sense for weighted_mean_axis, because we divide all lanes by weights_sum, but there's no lane to divide anyway.
  3. I can't respect "Panics if axis is out of bounds or if the length of the axis is zero and division by zero panics for type A." because my functions use map_axis, which panics with an error not documented for map_axis:

```rust
let a = Array2::<f32>::zeros((0, 20));
a.mean_axis(Axis(0));                                      // passes
a.weighted_mean_axis(Axis(0), &Array1::zeros(0)).unwrap(); // panics:
// panicked at 'collapse_axis: Index 0 must be less than axis length 0 for array with shape [0, 20]'
```

Of course I could try to use something else (there are other ways to code weighted_sum_axis), but I don't see why I should, mostly because of 1).

@LukeMathWalker (Member), Sep 24, 2019:

No issue at all, happy to reach consensus on these changes 😄

> 1. Those zero-length dimensions don't make sense to me. It looks like they are uselessly complex empty arrays. Do you have more information on them? Because I can't code something for a "feature" that I don't understand.

Indeed, I wouldn't consider them a feature - they are edge cases (empty arrays with associated non-trivial dimension-related information), but given that we are writing a library, we need to take care of them (or at least document what happens if you pass them in).
Would it be possible to forbid them? It's a possibility, but as long as ndarray's codebase allows them, we need to deal with them.

> 2. I removed the "MultiInputError::EmptyInput if self is empty" error from weighted_mean_axis and weighted_sum_axis because I don't understand why they should return an error when the array is empty. It might seem to make sense for weighted_mean_axis, because we divide all lanes by weights_sum, but there's no lane to divide anyway.

I agree on weighted_sum_axis / weighted_sum - there is a very good default value to return (A::zero()) and the behaviour is consistent with sum in ndarray.
For the mean, though, it gets trickier. What is the mean of an empty array of integers? 0? I would argue that there is no sound answer to that question, hence it makes sense to return None. Given that those functions can fail in multiple ways (mismatched shapes), it's more ergonomic to return an error variant than a Result<Option<A>, ErrType> (see mean in ndarray).
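
The ergonomic difference, sketched with deliberately simplified signatures (not the crate's exact ones):

```rust
use ndarray_stats::errors::MultiInputError;

// Empty input as an error variant: every failure mode lives in one enum,
// so a single `?` propagates all of them.
fn mean_as_error(data: &[f64]) -> Result<f64, MultiInputError> {
    if data.is_empty() {
        return Err(MultiInputError::EmptyInput);
    }
    Ok(data.iter().sum::<f64>() / data.len() as f64)
}

// The nested alternative: callers must now handle `Err` *and* `None` separately.
fn mean_as_option(data: &[f64]) -> Result<Option<f64>, MultiInputError> {
    if data.is_empty() {
        return Ok(None);
    }
    Ok(Some(data.iter().sum::<f64>() / data.len() as f64))
}
```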

> 3. I can't respect "Panics if axis is out of bounds or if the length of the axis is zero and division by zero panics for type A." because my functions use map_axis, which panics with an error not documented for map_axis:
>
> ```rust
> let a = Array2::<f32>::zeros((0, 20));
> a.mean_axis(Axis(0));                                      // passes
> a.weighted_mean_axis(Axis(0), &Array1::zeros(0)).unwrap(); // panics:
> // panicked at 'collapse_axis: Index 0 must be less than axis length 0 for array with shape [0, 20]'
> ```
>
> Of course I could try to use something else (there are other ways to code weighted_sum_axis), but I don't see why I should, mostly because of 1).

This was a bug in ndarray 0.12.1 (indeed when dealing with 0-length axes 😛). It has been fixed in 0.13 - as soon as CI finishes on #52, I will merge it into master, so that you can get the fix in this branch as well.

@LukeMathWalker (Member):

#52 has been merged - I'd say that we can get this PR merged and then cut a release 👍

@nilgoyette (Contributor, Author):

I think it's better now. I had to remove "or if the length of the axis is zero and division by zero panics for type A." from weighted_mean_axis's docs, because the MultiInputError::EmptyInput error happens before any panic. Imo it's better that way.

Can you please check the first two assert_eq! of weighted_sum_dimension_zero? As I wrote, [0, N] arrays don't make sense to me, so I'm not sure what the right output is.

@LukeMathWalker (Member) left a comment:

Everything looks good 👍
Thanks for working on this @nilgoyette 🙏

@LukeMathWalker merged commit 496b8ac into rust-ndarray:master on Sep 25, 2019.