
ENH: enable stats.shapiro() to take n-dimension input, handle nan and assign axis #12916

Closed
wants to merge 5 commits

Conversation

powen-kao

@powen-kao powen-kao commented Oct 4, 2020

Reference issue

What does this implement/fix?

Scipy is very often used together with Pandas, and users may want to verify the normality of multiple columns. A function that accepts only 1-d input and cannot omit np.nan is cumbersome, and such an interface is also inconsistent with other APIs such as stats.ttest_ind().

An improvement is proposed here to handle np.nan according to the given nan_policy. Operation on a multi-dimensional array is possible, and the test can be computed along a given axis. The default behavior remains the same as in the old implementation; only setting the axis and nan_policy options changes how the input array is handled.
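
A rough sketch of the intended usage (the axis and nan_policy arguments below are what this PR proposes; they are not part of the released shapiro API):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))  # e.g. three DataFrame columns
data[0, 1] = np.nan               # one missing value in the second column

# Proposed: one Shapiro-Wilk test per column, NaNs dropped per nan_policy.
res = stats.shapiro(data, axis=0, nan_policy='omit')
print(res.statistic.shape)  # (3,)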

Any comment is welcome :D

@tylerjereddy tylerjereddy added the enhancement and scipy.stats labels Oct 10, 2020
Contributor

@tylerjereddy tylerjereddy left a comment


I just did a quick/superficial read-through and added small comments. It's a good sign that there seem to be some detailed tests added and CI is green, but this will have to wait for the stats regulars to look it over.

THANKS.txt Outdated
@@ -243,6 +243,7 @@ Shashaank N for contributions to scipy.signal.
Frank Torres for fixing a bug with solve_bvp for large problems.
Ben West for updating the Gamma distribution documentation.
Terry Davis for documentation improvements in scipy.ndimage.morphology
Po-Wen Kao for improving stats.shapiro() API.
Contributor


I think the way we are managing the "THANKS.txt" file is that if/when this PR is accepted/merged, an entry would be added to the wiki page:

https://github.com/scipy/scipy/wiki/THANKS.txt-additions-modifications-for-1.6.0

I believe the reason is mostly to avoid merge conflicts, but @rlucas7 will know--the contents of the file may eventually live on a website instead?

Author


Thanks. I will remove it from this PR and add the entry on the wiki page instead.

Member


I think the way we are managing the "THANKS.txt" file is that if/when this PR is accepted/merged, an entry would be added to the wiki page:

https://github.com/scipy/scipy/wiki/THANKS.txt-additions-modifications-for-1.6.0

I believe the reason is mostly to avoid merge conflicts, but @rlucas7 will know--the contents of the file may eventually live on a website instead?

Yeah, as far as I understand, the plan is to move THANKS to the scipy.org website. The reason is to avoid merge conflicts. Merge conflicts are a pain for maintainers and the release manager, though for us it is more of an inconvenience. For new contributors I think it's an even bigger hassle because of how often the THANKS file creates merge conflicts; the contributor then needs to master git rebase, or open a new PR if the rebase fails (a common occurrence).
All this creates extra effort that undoubtedly deters potential new contributors.

The change to the authors tool was already merged:
#12793

but the PR to remove the THANKS is still open:
#12792

I opened the wiki page so that we can move things there as we transition to adding the THANKS on the scipy.org site.

@tylerjereddy does that clarify?

@@ -1617,6 +1617,17 @@ def shapiro(x):
----------
x : array_like
Array of sample data.
axis: int or None, optional
Axis along which to compute test.
If None, input is flatterned into 1d array.
Contributor


typo "flatterned"

@powen-kao
Author

powen-kao commented Oct 11, 2020

The nightly build failure doesn't seem to have anything to do with my code.
I only changed comments, and test_sparsetools.py has not been modified since the last successful build.
Could you trigger the build again? @tylerjereddy

scipy/sparse/tests/test_csc.py ...... [ 77%]
scipy/sparse/tests/test_csr.py ........ [ 77%]
scipy/sparse/tests/test_extract.py .. [ 77%]
scipy/sparse/tests/test_matrix_io.py ...... [ 77%]
Fatal Python error: Segmentation fault

Current thread 0x00007feccf69d740 (most recent call first):
File "/home/runner/.local/lib/python3.9/site-packages/numpy/core/numeric.py", line 2276 in within_tol
File "/home/runner/.local/lib/python3.9/site-packages/numpy/core/numeric.py", line 2290 in isclose
File "<array_function internals>", line 5 in isclose
File "/home/runner/.local/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1522 in compare
File "/home/runner/.local/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 788 in assert_array_compare
File "/home/runner/.local/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527 in assert_allclose
File "/home/runner/work/scipy/scipy/build/testenv/lib/python3.9/site-packages/scipy/sparse/tests/test_sparsetools.py", line 315 in test_upcast

.....

Thank you

@rlucas7
Member

rlucas7 commented Oct 11, 2020 via email

@josef-pkt
Member

Did scipy.stats switch to axis=None as the default?

In the old times, the statistics/stats default was always axis=0 (for csv/dataframe like data). I haven't paid much attention in a while.

@powen-kao
Author

powen-kao commented Oct 11, 2020

Did scipy.stats switch to axis=None as the default?

In the old times, the statistics/stats default was always axis=0 (for csv/dataframe like data). I haven't paid much attention in a while.

You are right.
But I notice that the old code expects a 2-D input array to be flattened, which would no longer happen with a default of axis=0; in the new API the input is flattened only when axis=None (see the sketch below).

So there are two options for the default value:

  1. axis=0, which breaks backward compatibility
  2. axis=None, which leaves the API inconsistent
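
To illustrate the two options (a minimal sketch; the axis argument is the proposed API, not the released one):

import numpy as np
from scipy import stats

x = np.random.normal(size=(20, 3))

# Released behavior: 2-D input is ravelled, one test over all 60 values.
stats.shapiro(x)

# Proposed behavior:
# stats.shapiro(x, axis=0)     # one test per column (breaks old flattening)
# stats.shapiro(x, axis=None)  # flatten first, reproducing the old behavior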

@powen-kao powen-kao closed this Oct 11, 2020
@powen-kao powen-kao reopened this Oct 11, 2020
@josef-pkt
Member

I didn't look carefully enough to see that there was a ravel.
It sneaked in 5 years ago with "Improve implementation details".

@powen-kao powen-kao closed this Oct 12, 2020
@powen-kao powen-kao reopened this Oct 12, 2020
@powen-kao
Author

@josef-pkt @rlucas7 Do you have a preference for either of the options above, or better ideas? :D


@josef-pkt
Member

I leave backwards-compatibility decisions to the current maintainers.

I strongly prefer the consistent axis=0 as the default, which is also the main application in the original motivation in the first comment.

@rlucas7
Member

rlucas7 commented Nov 1, 2020

I didn't look carefully enough to see that there was a ravel.
It sneaked in 5 years ago "Improve implementation details"

Hmm, my preference here is to go through a round of deprecation warnings, even though it is inconsistent behavior. The reasoning is that it has been in the codebase for 5 years, so someone might have code that relies on the existing behavior; if there had only been 1 or 2 releases since the behavior was added, the perspective might be different. Even though this is an inconsistent default that we want to change, we need to notify consumers of the package in advance that we will no longer support the current behavior. I'll add some comments inline in the PR to indicate the changes needed to accommodate what I'm proposing.
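
For illustration, a minimal sketch of what such a deprecation round could look like; the wrapper name, sentinel, and warning text below are hypothetical, not taken from this PR:

import warnings

import numpy as np
from scipy import stats

_NO_VALUE = object()  # sentinel: distinguishes "axis not passed" from axis=None

def shapiro_with_warning(x, axis=_NO_VALUE):
    # Hypothetical wrapper sketching the proposed deprecation round.
    x = np.asarray(x)
    if axis is _NO_VALUE and x.ndim > 1:
        warnings.warn(
            "N-d input is currently ravelled; a future release will compute "
            "the test along axis=0 by default. Pass axis=None to keep the "
            "old behavior.", DeprecationWarning, stacklevel=2)
    # For now, preserve the old behavior: flatten and run a single test.
    return stats.shapiro(np.ravel(x))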

Member

@rlucas7 rlucas7 left a comment


Left a couple of inline changes.

return ShapiroResult(w, pw)


def _apply_sharipo_1d(x: np.ma.MaskedArray):
Member


Suggested change
def _apply_sharipo_1d(x: np.ma.MaskedArray):
def _apply_shapiro_1d(x: np.ma.MaskedArray):

Member

@rlucas7 rlucas7 left a comment


Changing my recommendation on the deprecation part; other changes are still needed.


def test_nan_policies(self):
n1 = self.nd.copy()
n1[0, 0, 0] = np.NaN
Member


Style comment: we usually write np.NaN as np.nan.

- use np.nan instead of np.NaN
- fix typo: rename _apply_sharipo_1d() to _apply_shapiro_1d()
The behavior of the function differs from the old implementation:
- axis=0 by default; 2-d input will not be flattened
- new test cases for the default behavior (axis=0) are added
@mdhaber
Contributor

mdhaber commented Jan 11, 2021

@hp5588 Thanks for submitting this. I wonder what you think of a different approach.

Since this ends up calling np.apply_along_axis rather than vectorizing the computation at a lower level, I think we could achieve the same thing by taking the core of what you have and creating a decorator out of it:

import numpy as np
import scipy.stats

from functools import wraps  # so that the wrapper preserves the docstring

def _vectorize_1s_hypotest_factory(result_creator):
    def vectorize_hypotest_decorator(hypotest_fun_in):
        @wraps(hypotest_fun_in)
        def vectorize_hypotest_wrapper(x, *args, axis=0,
                                       nan_policy='propagate', **kwds):

            x = np.atleast_1d(x)

            # Addresses nan_policy="raise"
            _, nan_policy = scipy.stats.stats._contains_nan(x, nan_policy)

            # Addresses nan_policy="omit"
            if nan_policy == 'omit':
                def hypotest_fun(x, *args, **kwds):
                    x = x[~np.isnan(x)]
                    return hypotest_fun_in(x, *args, **kwds)
            else:
                hypotest_fun = hypotest_fun_in

            x = np.moveaxis(x, axis, -1)
            res = np.apply_along_axis(hypotest_fun, axis=-1, arr=x)
            return result_creator(res)

        return vectorize_hypotest_wrapper
    return vectorize_hypotest_decorator

Then the decorator could be applied to the shapiro function, which doesn't have to be modified at all:

def _shapiro_result_creator(res):
    return ShapiroResult(res[..., 0], res[..., 1])


@_vectorize_1s_hypotest_factory(_shapiro_result_creator)
def shapiro(x):
    ...

One advantage is that because we are not modifying the shapiro function, it is less likely that we make a mistake that changes the way shapiro works on 1-d input. That makes it easier to review, IMO. Another advantage is that the same decorator can be applied to other one-sample tests, such as jarque_bera, or to other one-sample statistics functions without modification. Of course, once we've reviewed the decorator code once, it makes reviewing the application of the decorator to other functions very easy (compared to reviewing the modification of each function separately).
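
As a quick usage sketch (hypothetical, assuming the decorator above is applied to the real shapiro rather than the stub):

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
res = shapiro(x, axis=0, nan_policy='propagate')
res.statistic.shape  # (4,): one Shapiro-Wilk statistic per column
res.pvalue.shape     # (4,)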

What do you think? I have a PR that proposes a strategy like this for n-sample tests (well, just two right now, but it generalizes) in gh-13312, but I think it would make sense to have something special for 1d.

Update: actually, I added this decorator to gh-13312 and applied it to shapiro and jarque_bera. Looks like I need to read the comments about backwards compatibility carefully, though. It's unfortunate that the behavior in the past has been to ravel n-d arrays.

@powen-kao
Author

Hi @mdhaber,

Thanks for your comments, and sorry for the late reply; I have been a bit busy recently.
The decorator approach sounds like a good idea for wrapping all the existing functions. My only concern: are we going to modify the functions that already have part of the proposed features (e.g., remove the nan_policy parameter from stats.ttest_ind()) and apply the decorator instead, to achieve code consistency?

@mdhaber
Contributor

mdhaber commented Jan 28, 2021

are we going to modify the functions that already have part of the proposed features (e.g., remove the nan_policy parameter from stats.ttest_ind())

No, I didn't plan to do that, not unless there are problems with their current behavior.

We can separate the two parts (vectorization and nan_policy handling) in the future if desired, but initially I'd apply both in one wrapper to the functions that have neither. Any other changes can wait.

@powen-kao
Author

OK, then I will first try to wrap my PR using the decorator, and then we can open another PR to update other functions.
By the way, do you think IDEs (e.g. PyCharm) can recognize the decorator and thus activate autocomplete?

@mdhaber
Contributor

mdhaber commented Jan 28, 2021

OK, then I will first try to wrap my PR using the decorator, and then we can open another PR to update other functions.

Since I've already written the decorator and applied it to Shapiro in gh-13312, I'm not sure that is needed. It would be very helpful if you would review that PR instead.

By the way, do you think IDEs (e.g. PyCharm) can recognize the decorator and thus activate autocomplete?

This is exactly the sort of thing that would be great to check. I used functools.wraps, so I hope so. Thank you!
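
For what it's worth, a quick standalone check of the runtime metadata (a sketch, not from this thread; IDE completion also relies on static analysis, so this only verifies what functools.wraps copies over):

from functools import wraps

def deco(f):
    @wraps(f)
    def wrapper(*args, **kwds):
        return f(*args, **kwds)
    return wrapper

@deco
def shapiro(x):
    """Toy stand-in for the real function."""

print(shapiro.__name__)     # 'shapiro'
print(shapiro.__doc__)      # docstring preserved, so help() still works
print(shapiro.__wrapped__)  # original function; used by inspect.signature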

@mdhaber
Contributor

mdhaber commented Feb 20, 2022

We'll take care of this as part of gh-14651. Thanks @hp5588! If you're still interested, you're welcome to open a PR adding the decorator introduced in gh-13312 to shapiro. (I had applied the decorator to shapiro during testing, but removed it before merge. It worked fine, but I think each function we apply the decorator to deserves attention in its own PR.)

@mdhaber mdhaber closed this Feb 20, 2022