
ENH: enable stats.shapiro() to take n-dimension input, handle nan and assign axis #12916

Closed
wants to merge 5 commits

Conversation

powen-kao

@powen-kao powen-kao commented Oct 4, 2020

Reference issue

What does this implement/fix?

Scipy is very often used together with Pandas, and users may want to verify the normality of multiple columns. A function that accepts only 1-d input and cannot omit np.nan is cumbersome, and such an interface is also inconsistent with other APIs such as stats.ttest_ind().

An improvement is proposed here to handle np.nan according to the given nan_policy. Operation on a multi-dimensional array is possible, and the test can be computed along a given axis. The default behavior remains the same as in the old implementation; only setting the axis and nan_policy options changes how the input array is handled.
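
A rough sketch of the intended usage (the axis and nan_policy arguments below are what this PR proposes; they are not part of the released shapiro API):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))  # e.g. three DataFrame columns
data[0, 1] = np.nan               # one missing value in the second column

# Proposed: one Shapiro-Wilk test per column, NaNs dropped per nan_policy.
res = stats.shapiro(data, axis=0, nan_policy='omit')
print(res.statistic.shape)  # (3,)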

Any comment is welcome :D

@tylerjereddy tylerjereddy added the enhancement and scipy.stats labels Oct 10, 2020
Contributor

@tylerjereddy tylerjereddy left a comment


I just did a quick/superficial read-through and added small comments. It's a good sign that there seem to be some detailed tests added and CI is green, but this will have to wait for the stats regulars to look it over.

THANKS.txt Outdated
@@ -243,6 +243,7 @@ Shashaank N for contributions to scipy.signal.
Frank Torres for fixing a bug with solve_bvp for large problems.
Ben West for updating the Gamma distribution documentation.
Terry Davis for documentation improvements in scipy.ndimage.morphology
Po-Wen Kao for improving stats.shapiro() API.
Contributor


I think the way we are managing the "THANKS.txt" file is that if/when this PR is accepted/merged, an entry would be added to the wiki page:

https://github.com/scipy/scipy/wiki/THANKS.txt-additions-modifications-for-1.6.0

I believe the reason is mostly to avoid merge conflicts, but @rlucas7 will know--the contents of the file may eventually live on a website instead?

Author


Thanks. I will remove it from this PR and add the entry on the wiki page instead.

Member


I think the way we are managing the "THANKS.txt" file is that if/when this PR is accepted/merged, an entry would be added to the wiki page:

https://github.com/scipy/scipy/wiki/THANKS.txt-additions-modifications-for-1.6.0

I believe the reason is mostly to avoid merge conflicts, but @rlucas7 will know--the contents of the file may eventually live on a website instead?

Yeah, as far as I understand, the plan is to move THANKS to the scipy.org website. The reason is to avoid merge conflicts. Merge conflicts are a pain for maintainers and the release manager, though for us it is more of an inconvenience. For new contributors I think it's an even bigger hassle because of how often the THANKS file creates merge conflicts; the contributor then needs to master git rebase, or open a new PR if the rebase fails (a common occurrence).
All this creates extra effort that undoubtedly deters potential new contributors.

The change to the authors tool was already merged:
#12793

but the PR to remove the THANKS is still open:
#12792

I opened the wiki page so that we can move things there as we transition to adding the THANKS on the scipy.org site.

@tylerjereddy does that clarify?

@@ -1617,6 +1617,17 @@ def shapiro(x):
----------
x : array_like
Array of sample data.
axis: int or None, optional
Axis along which to compute test.
If None, input is flatterned into 1d array.
Contributor


typo "flatterned"

@powen-kao
Author

powen-kao commented Oct 11, 2020

The nightly build failure doesn't seem to have anything to do with my code.
I only changed comments, and test_sparsetools.py has not been modified since the last successful build.
Could you trigger the build again? @tylerjereddy

scipy/sparse/tests/test_csc.py ...... [ 77%]
scipy/sparse/tests/test_csr.py ........ [ 77%]
scipy/sparse/tests/test_extract.py .. [ 77%]
scipy/sparse/tests/test_matrix_io.py ...... [ 77%]
Fatal Python error: Segmentation fault

Current thread 0x00007feccf69d740 (most recent call first):
File "/home/runner/.local/lib/python3.9/site-packages/numpy/core/numeric.py", line 2276 in within_tol
File "/home/runner/.local/lib/python3.9/site-packages/numpy/core/numeric.py", line 2290 in isclose
File "<array_function internals>", line 5 in isclose
File "/home/runner/.local/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1522 in compare
File "/home/runner/.local/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 788 in assert_array_compare
File "/home/runner/.local/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1527 in assert_allclose
File "/home/runner/work/scipy/scipy/build/testenv/lib/python3.9/site-packages/scipy/sparse/tests/test_sparsetools.py", line 315 in test_upcast

.....

Thank you

@rlucas7
Member

rlucas7 commented Oct 11, 2020 via email

@josef-pkt
Member

Did scipy.stats switch to axis=None as the default?

In the old times, the statistics/stats default was always axis=0 (for csv/dataframe like data). I haven't paid much attention in a while.

@powen-kao
Author

powen-kao commented Oct 11, 2020

Did scipy.stats switch to axis=None as the default?

In the old times, the statistics/stats default was always axis=0 (for csv/dataframe like data). I haven't paid much attention in a while.

You are right.
But I notice that the old code expects a 2-D input array to be flattened, which would no longer happen with a default of axis=0; in the new API the input is flattened only when axis=None (see the sketch below).

So there are two options for the default value:

  1. axis=0, which breaks backward compatibility
  2. axis=None, which leaves the API inconsistent
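
To illustrate the two options (a minimal sketch; the axis argument is the proposed API, not the released one):

import numpy as np
from scipy import stats

x = np.random.normal(size=(20, 3))

# Released behavior: 2-D input is ravelled, one test over all 60 values.
stats.shapiro(x)

# Proposed behavior:
# stats.shapiro(x, axis=0)     # one test per column (breaks old flattening)
# stats.shapiro(x, axis=None)  # flatten first, reproducing the old behavior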

@powen-kao powen-kao closed this Oct 11, 2020
@powen-kao powen-kao reopened this Oct 11, 2020
@josef-pkt
Member

I didn't look carefully enough to see that there was a ravel.
It sneaked in 5 years ago with "Improve implementation details".

@powen-kao powen-kao closed this Oct 12, 2020
@powen-kao powen-kao reopened this Oct 12, 2020
@powen-kao
Author

@josef-pkt @rlucas7 Do you have a preference for either of the options above, or better ideas? :D


@josef-pkt
Member

I leave backwards-compatibility decisions to the current maintainers.

I strongly prefer the consistent axis=0 as the default, which is also the main application in the original motivation in the first comment.

@rlucas7
Member

rlucas7 commented Nov 1, 2020

I didn't look carefully enough to see that there was a ravel.
It sneaked in 5 years ago "Improve implementation details"

Hmm, my preference here is to go through a round of deprecation warnings, even though it is inconsistent behavior. The reasoning is that it has been in the codebase for 5 years, so someone might have code that relies on the existing behavior; if there had only been 1 or 2 releases since the behavior was added, the perspective might be different. Even though this is an inconsistent default that we want to change, we need to notify consumers of the package in advance that we will no longer support the current behavior. I'll add some comments inline in the PR to indicate the changes needed to accommodate what I'm proposing.
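
For illustration, a minimal sketch of what such a deprecation round could look like; the wrapper name, sentinel, and warning text below are hypothetical, not taken from this PR:

import warnings

import numpy as np
from scipy import stats

_NO_VALUE = object()  # sentinel: distinguishes "axis not passed" from axis=None

def shapiro_with_warning(x, axis=_NO_VALUE):
    # Hypothetical wrapper sketching the proposed deprecation round.
    x = np.asarray(x)
    if axis is _NO_VALUE and x.ndim > 1:
        warnings.warn(
            "N-d input is currently ravelled; a future release will compute "
            "the test along axis=0 by default. Pass axis=None to keep the "
            "old behavior.", DeprecationWarning, stacklevel=2)
    # For now, preserve the old behavior: flatten and run a single test.
    return stats.shapiro(np.ravel(x))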

Member

@rlucas7 rlucas7 left a comment


Left a couple of inline changes.

return ShapiroResult(w, pw)


def _apply_sharipo_1d(x: np.ma.MaskedArray):
Member


Suggested change
def _apply_sharipo_1d(x: np.ma.MaskedArray):
def _apply_shapiro_1d(x: np.ma.MaskedArray):

Member

@rlucas7 rlucas7 left a comment


Changing my recommendation on the deprecation part; other changes are still needed.


def test_nan_policies(self):
n1 = self.nd.copy()
n1[0, 0, 0] = np.NaN
Member


Style comment: we usually write np.NaN as np.nan.

- use np.nan instead of np.NaN
- fix typo: rename _apply_sharipo_1d() to _apply_shapiro_1d()
The behavior of the function differs from the old implementation:
- axis=0 by default; 2-d input will not be flattened
- new test cases for the default behavior (axis=0) are added
@mdhaber
Contributor

mdhaber commented Jan 11, 2021

@hp5588 Thanks for submitting this. I wonder what you think of a different approach.

Since this ends up calling np.apply_along_axis rather than vectorizing the computation at a lower level, I think we could achieve the same thing by taking the core of what you have and creating a decorator out of it:

import numpy as np
import scipy.stats

from functools import wraps  # so that the wrapper preserves the docstring

def _vectorize_1s_hypotest_factory(result_creator):
    def vectorize_hypotest_decorator(hypotest_fun_in):
        @wraps(hypotest_fun_in)
        def vectorize_hypotest_wrapper(x, *args, axis=0,
                                       nan_policy='propagate', **kwds):

            x = np.atleast_1d(x)

            # Addresses nan_policy="raise"
            _, nan_policy = scipy.stats.stats._contains_nan(x, nan_policy)

            # Addresses nan_policy="omit"
            if nan_policy == 'omit':
                def hypotest_fun(x, *args, **kwds):
                    x = x[~np.isnan(x)]
                    return hypotest_fun_in(x, *args, **kwds)
            else:
                hypotest_fun = hypotest_fun_in

            x = np.moveaxis(x, axis, -1)
            res = np.apply_along_axis(hypotest_fun, axis=-1, arr=x)
            return result_creator(res)

        return vectorize_hypotest_wrapper
    return vectorize_hypotest_decorator

Then the decorator could be applied to the shapiro function, which doesn't have to be modified at all:

def _shapiro_result_creator(res):
    return ShapiroResult(res[..., 0], res[..., 1])


@_vectorize_1s_hypotest_factory(_shapiro_result_creator)
def shapiro(x):
    ...

One advantage is that because we are not modifying the shapiro function, it is less likely that we make a mistake that changes the way shapiro works on 1-d input. That makes it easier to review, IMO. Another advantage is that the same decorator can be applied to other one-sample tests, such as jarque_bera, or to other one-sample statistics functions without modification. Of course, once we've reviewed the decorator code once, it makes reviewing the application of the decorator to other functions very easy (compared to reviewing the modification of each function separately).
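
As a quick usage sketch (hypothetical, assuming the decorator above is applied to the real shapiro rather than the stub):

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
res = shapiro(x, axis=0, nan_policy='propagate')
res.statistic.shape  # (4,): one Shapiro-Wilk statistic per column
res.pvalue.shape     # (4,)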

What do you think? I have a PR that proposes a strategy like this for n-sample tests (well, just two right now, but it generalizes) in gh-13312, but I think it would make sense to have something special for 1d.

Update: actually, I added this decorator to gh-13312 and applied it to shapiro and jarque_bera. Looks like I need to read the comments about backwards compatibility carefully, though. It's unfortunate that the behavior in the past has been to ravel n-d arrays.

@powen-kao
Author

Hi @mdhaber,

Thanks for your comments, and sorry for the late reply; I have been a bit busy recently.
The decorator approach sounds like a good idea for wrapping all the existing functions. My only concern: are we going to modify the functions that already have part of the proposed features (e.g., remove the nan_policy parameter from stats.ttest_ind()) and apply the decorator instead, to achieve code consistency?

@mdhaber
Contributor

mdhaber commented Jan 28, 2021

are we going to modify the functions that already have part of the proposed features (e.g., remove the nan_policy parameter from stats.ttest_ind())

No, I didn't plan to do that, not unless there are problems with their current behavior.

We can separate the two parts (vectorization and nan_policy handling) in the future if desired, but initially I'd apply both in one wrapper to the functions that have neither. Any other changes can wait.

@powen-kao
Author

OK, then I will first try to wrap my PR using the decorator, and then we can open another PR to update other functions.
By the way, do you think IDEs (e.g. PyCharm) can recognize the decorator and thus activate autocomplete?

@mdhaber
Contributor

mdhaber commented Jan 28, 2021

OK, then I will first try to wrap my PR using the decorator, and then we can open another PR to update other functions.

Since I've already written the decorator and applied it to Shapiro in gh-13312, I'm not sure that is needed. It would be very helpful if you would review that PR instead.

By the way, do you think IDEs (e.g. PyCharm) can recognize the decorator and thus activate autocomplete?

This is exactly the sort of thing that would be great to check. I used functools.wraps, so I hope so. Thank you!
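
For what it's worth, a quick standalone check of the runtime metadata (a sketch, not from this thread; IDE completion also relies on static analysis, so this only verifies what functools.wraps copies over):

from functools import wraps

def deco(f):
    @wraps(f)
    def wrapper(*args, **kwds):
        return f(*args, **kwds)
    return wrapper

@deco
def shapiro(x):
    """Toy stand-in for the real function."""

print(shapiro.__name__)     # 'shapiro'
print(shapiro.__doc__)      # docstring preserved, so help() still works
print(shapiro.__wrapped__)  # original function; used by inspect.signature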

@mdhaber
Contributor

mdhaber commented Feb 20, 2022

We'll take care of this as part of gh-14651. Thanks @hp5588! If you're still interested, you're welcome to open a PR adding the decorator introduced in gh-13312 to shapiro. (I had applied the decorator to shapiro during testing, but removed it before merge. It worked fine, but I think each function we apply the decorator to deserves attention in its own PR.)

@mdhaber mdhaber closed this Feb 20, 2022