Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats.resampling: automatically detect whether statistic is vectorized #16651

Merged
merged 3 commits into from
Jul 20, 2022

Conversation

mdhaber
Copy link
Contributor

@mdhaber mdhaber commented Jul 20, 2022

Reference issue

raphaelvallat/pingouin#189

What does this implement/fix?

@raphaelvallat commented:

I do love the vectorized argument in scipy.stats.bootstrap, but I think for the (many) Python users that are not familiar with vectorization, this can be a little difficult to understand and use ( it took me a few trials to understand how to configure the axes, etc). Do you think that there might be a way to automatically detect if a function is vectorizable? Like if there is an axis keyword-argumnt, like in most numpy/scipy functions?

The answer is yes, and this PR implements it.

Additional information

Please review when you have a moment, @raphaelallat!

@mdhaber mdhaber added scipy.stats enhancement A new feature or improvement labels Jul 20, 2022
Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt, I am +1 here. Seems sensible as it's standard to use axis for that and I am not aware of other use/conventions.

@@ -638,6 +665,35 @@ def statistic(x, axis):
assert_allclose(res.statistic, expected.statistic)
assert_allclose(res.pvalue, expected.pvalue, atol=self.atol)

def test_vectorized(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tests are very similar. Did you try to parametrize or even have a if for options?

Copy link
Contributor Author

@mdhaber mdhaber Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to mention it - yes, I did try to write one test for all three functions (you can see how I used fun=...), but it got unwieldy. It would have been fewer lines, but more complicated than I thought it was worth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok because I would have thought that just having fun, options, rvs as param should be enough.

If you still prefer to keep 3 tests, fine. Just maybe only define once the 2 functions statistic. Also options does not need to be a function.

Copy link
Contributor Author

@mdhaber mdhaber Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would not, if you'd permit me to use the global random state or a simple seed. In fact, if I could do that, I'd be happy to try to combine all the tests again! iIRC the main difficulty in combining them was because I have to worry about passing identical Generators to multiple calls of the function - but the functions have different needs.

But fat chance of that! : P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I can combine the tests if we don't check that the results match - only that the assertions don't fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the generator part. But fine with me not to check the results as we have other tests for all that.

@tupui tupui added this to the 1.10.0 milestone Jul 20, 2022
@ilayn
Copy link
Member

ilayn commented Jul 20, 2022

I would suggest we don't use vectorization as a feature name in the future because it has a different meaning. This also happened in the recent QR related pull request and batch processing is not vectorization per se. I am not opposed to it but just saying.

@mdhaber
Copy link
Contributor Author

mdhaber commented Jul 20, 2022

batch processing is not vectorization per se

Do you mean that you would prefer to make a distinction between how the processor performs the computation (e.g. SIMD is "vectorized") and how the user writes the code (operating on a whole array at once is not "vectorized")?

Copy link

@raphaelvallat raphaelvallat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @mdhaber!

@@ -265,10 +269,14 @@ def bootstrap(data, statistic, *, n_resamples=9999, batch=None,
`statistic`. Memory usage is O(`batch`*``n``), where ``n`` is the
sample size. Default is ``None``, in which case ``batch = n_resamples``
(or ``batch = max(n_resamples, n)`` for ``method='BCa'``).
vectorized : bool, default: ``True``
vectorized : bool, optional
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I didn't actually change any defaults to None. Will do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure about this. But it makes sense and should be more robust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can manage this without a deprecation warning:

  • If the user was passing either vectorized=True or vectorized=False explicitly, nothing changes.
  • If the user was relying on the bootstrap default vectorized=True, then (if their code is working) their statistic must already have axis and be vectorized.
  • If the user was relying on the monte_carlo_test or permutation_test default vectorized=False, then nothing changes unless their statistic 1) has a keyword argument called axis and 2) is not actually vectorized. I wouldn't expect many cases of this, and rather than silently producing the wrong result, they'll get an error because the results of the statistic will not be the right shape (unless they've done something really wonky that I'm having trouble predicting). We'll also put this in the release notes so that users can watch out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed it should be transparent for users because we expected axis.

Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright LGTM, let's move forward. Thanks Matt. CI failure is unrelated.

@tupui tupui merged commit 7ab63a5 into scipy:main Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants