ENH: enable stats.shapiro() to take n-dimension input, handle nan and assign axis #12916
Conversation
… assign axis

Very often SciPy is used with Pandas, and users might want to verify the normality of multiple columns. A function that accepts only 1d input and cannot omit np.nan is cumbersome. Such an interface does not match other APIs like `stats.ttest_ind()` either. An improvement is proposed here to handle np.nan according to the given `nan_policy`. Operation on a multi-dimensional array is possible, and an `axis` can be assigned. The default behavior remains the same as the old implementation; only assigning the `axis` and `nan_policy` options causes different behavior when handling the input array.
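For illustration, the proposed `nan_policy='omit'` behavior amounts to dropping NaNs from each 1-d slice before applying the test along the chosen axis. A minimal NumPy-only sketch (the helper name `omit_and_apply` is hypothetical, and `np.mean` stands in for the Shapiro statistic so the example is self-contained):

```python
import numpy as np

def omit_and_apply(x, func, axis=0):
    # Hypothetical helper: apply a 1-d statistic `func` along `axis`,
    # dropping NaNs from each slice first (sketch of nan_policy='omit').
    x = np.moveaxis(np.asarray(x, dtype=float), axis, -1)
    return np.apply_along_axis(lambda row: func(row[~np.isnan(row)]), -1, x)

data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [5.0, 6.0]])
print(omit_and_apply(data, np.mean, axis=0))  # per-column mean, NaN omitted: [3. 4.]
```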
I just did a quick, superficial read-through and added small comments. It's a good sign that there seem to be some detailed tests added and CI is green, but this will have to wait for the stats regulars to look over.
THANKS.txt
Outdated
@@ -243,6 +243,7 @@ Shashaank N for contributions to scipy.signal.
Frank Torres for fixing a bug with solve_bvp for large problems.
Ben West for updating the Gamma distribution documentation.
Terry Davis for documentation improvements in scipy.ndimage.morphology
Po-Wen Kao for improving stats.shapiro() API.
I think the way we are managing the "THANKS.txt" file is that if/when this PR is accepted/merged, an entry would be added to the wiki page:
https://github.com/scipy/scipy/wiki/*THANKS.txt-additions-modifications-for-1.6.0*
I believe the reason is mostly to avoid merge conflicts, but @rlucas7 will know--the contents of the file may eventually live on a website instead?
Thanks. I will remove it from the PR and leave it on the wiki page.
Yeah, as far as I understand, the plan is to move the thanks to the scipy.org website. The reason is to avoid merge conflicts. Merge conflicts are a pain for maintainers and the release manager; for us, though, it is more of an inconvenience. For new contributors I think it's an even bigger hassle because of how often the THANKS file creates merge conflicts: the contributor then needs to master git rebase or open a new PR if the rebase fails (a common occurrence).
All this creates a lot more effort that undoubtedly deters potential new contributors.
The change to the authors tool was already merged:
#12793
but the PR to remove the THANKS is still open:
#12792
I opened the wiki page so that we can move things there as we transition to adding the THANKS on the scipy.org site.
@tylerjereddy does that clarify?
scipy/stats/morestats.py
Outdated
@@ -1617,6 +1617,17 @@ def shapiro(x):
----------
x : array_like
    Array of sample data.
axis : int or None, optional
    Axis along which to compute test.
    If None, input is flatterned into 1d array.
typo "flatterned"
The nightly build failure doesn't seem to have anything to do with my code.
Thank you
Usually to retrigger the CI we close a PR, wait about 30-60 seconds and then reopen. Doing so should retrigger CI.
Sincerely,
…-Lucas Roberts
On Oct 11, 2020, at 4:16 AM, Po-Wen Kao ***@***.***> wrote:
The nightly build failure doesn't seem to have anything to do with my code.
I only changed comments, and test_matrix_io.py has not been modified since the last successful build.
Could you trigger the build process again? @tylerjereddy
scipy/sparse/tests/test_csc.py ...... [ 77%]
scipy/sparse/tests/test_csr.py ........ [ 77%]
scipy/sparse/tests/test_extract.py .. [ 77%]
scipy/sparse/tests/test_matrix_io.py ...... [ 77%]
Fatal Python error: Segmentation fault
Current thread 0x00007feccf69d740 (most recent call first):
File "/home/runner/.local/lib/python3.9/site-packages/numpy/core/numeric.py", line 2276 in within_tol
File "/home/runner/.local/lib/python3.9/site-packages/numpy/core/numeric.py", line 2290 in isclose
.....
Thank you
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
Did scipy.stats switch to In the old times, the statistics/
You are right. So two options for default value:
I didn't look carefully enough to see that there was a ravel.
@josef-pkt @rlucas7 Do you have a preference for either option, or better ideas? :D
I leave backwards compatibility decisions to current maintainers. I strongly prefer the consistent
Hmm, my preference here is to go through a round of deprecation warnings, even though it is inconsistent behavior. The reasoning is that it's been in the codebase for 5 years, so someone might have code that relies on the existing behavior. If there had only been 1 or 2 releases since the behavior was added, the perspective might be different. Even though it is a non-consistent default behavior that we want to change, we'll need to notify consumers of the package in advance that we'll no longer support the current behavior. I'll add some comments inline in the PR to indicate the changes needed to accommodate what I'm proposing.
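A default-behavior change like this is typically staged with a `DeprecationWarning` before the default flips. A generic stdlib-only sketch (the function name and messages are hypothetical, not the actual scipy code; a plain mean stands in for the test statistic):

```python
import warnings

def shapiro_like(x, axis=None):
    # Hypothetical sketch: warn while axis=None (flatten) is still the
    # default, so users can opt in to the future per-axis behavior early.
    if axis is None:
        warnings.warn(
            "The default of flattening n-d input is deprecated; "
            "pass axis explicitly to silence this warning.",
            DeprecationWarning, stacklevel=2)
        flat = [v for row in x for v in row]  # flatten 2-d input
        return sum(flat) / len(flat)
    cols = list(zip(*x))  # axis=0: operate column-wise
    return [sum(c) / len(c) for c in cols]

data = [[1.0, 2.0], [3.0, 4.0]]
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    print(shapiro_like(data))        # old flattening behavior: 2.5
print(shapiro_like(data, axis=0))    # new per-column behavior: [2.0, 3.0]
```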
left a couple inline changes
scipy/stats/morestats.py
Outdated
return ShapiroResult(w, pw)

def _apply_sharipo_1d(x: np.ma.MaskedArray):
def _apply_sharipo_1d(x: np.ma.MaskedArray):
def _apply_shapiro_1d(x: np.ma.MaskedArray):
Changing my recommendation on the deprecation part; the other changes are still needed.
scipy/stats/tests/test_morestats.py
Outdated
def test_nan_policies(self):
    n1 = self.nd.copy()
    n1[0, 0, 0] = np.NaN
Style comment: we usually write np.NaN as np.nan.
- use np.nan instead of np.NaN
- fix typo: rename _apply_sharipo_1d() to _apply_shapiro_1d()
The behavior of the function differs from the old implementation:
- axis=0 by default; 2d input will not be flattened
- new test cases for the default behavior (axis=0) are added
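The behavioral difference can be illustrated with a stand-in 1-d statistic (`np.mean` here, since the point is the shape handling, not the test itself):

```python
import numpy as np

x = np.arange(6.0).reshape(3, 2)   # 2-d input

# Old behavior: input is flattened before the 1-d routine runs.
flattened = np.mean(x.ravel())                   # one scalar result

# New default (axis=0): the 1-d routine runs once per column.
per_column = np.apply_along_axis(np.mean, 0, x)  # one result per column

print(flattened)    # 2.5
print(per_column)   # [2. 3.]
```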
@hp5588 Thanks for submitting this. I wonder what you think of a different approach. Since this ends up calling

```python
from functools import wraps  # so that wrapper preserves docstring

def _vectorize_1s_hypotest_factory(result_creator):
    def vectorize_hypotest_decorator(hypotest_fun_in):
        @wraps(hypotest_fun_in)
        def vectorize_hypotest_wrapper(x, *args, axis=0,
                                       nan_policy='propagate', **kwds):
            x = np.atleast_1d(x)
            # Addresses nan_policy="raise"
            _, nan_policy = scipy.stats.stats._contains_nan(x, nan_policy)
            # Addresses nan_policy="omit"
            if nan_policy == 'omit':
                def hypotest_fun(x, *args, **kwds):
                    x = x[~np.isnan(x)]
                    return hypotest_fun_in(x, *args, **kwds)
            else:
                hypotest_fun = hypotest_fun_in
            x = np.moveaxis(x, axis, -1)
            res = np.apply_along_axis(hypotest_fun, axis=-1, arr=x)
            return result_creator(res)
        return vectorize_hypotest_wrapper
    return vectorize_hypotest_decorator
```

Then the decorator could be applied to the

```python
def _shapiro_result_creator(res):
    return ShapiroResult(res[..., 0], res[..., 1])

@_vectorize_1s_hypotest_factory(_shapiro_result_creator)
def shapiro(x):
    ...
```

One advantage is that because we are not modifying the

What do you think? I have a PR that proposes a strategy like this for n-sample tests (well, just two right now, but it generalizes) in gh-13312, but I think it would make sense to have something special for 1d.

Update: actually I added this decorator to gh-13312 and applied it to
Hi @mdhaber, thanks for your comments. Sorry for the late reply; I have been a bit busy recently.
No, I didn't plan to do that, not unless there are problems with their current behavior. We can separate the two parts - vectorization from
OK, then I will first try to wrap my PR using the decorator, and then we can open another PR to update other functions.
Since I've already written the decorator and applied it to Shapiro in gh-13312, I'm not sure that is needed. It would be very helpful if you would review that PR instead.
This is exactly the sort of thing that would be great to check. I used
We'll take care of this as part of gh-14651. Thanks @hp5588! If you're still interested, you're welcome to open a PR adding the decorator introduced in gh-13312 to
Reference issue

What does this implement/fix?

Very often SciPy is used with Pandas, and users might want to verify the normality of multiple columns. A function that accepts only 1d input and cannot omit np.nan is cumbersome. Such an interface is not consistent with other APIs like `stats.ttest_ind()` either. An improvement is proposed here to handle np.nan according to the given `nan_policy`. Operation on a multi-dimensional array is possible and can compute along a given `axis`. The default behavior remains the same as the old implementation; only assigning the `axis` and `nan_policy` options causes different behavior when handling the input array. Any comment is welcome :D

Additional information