ENH: stats: add Page's L test #12531
scipy/stats/__init__.py
Outdated
@@ -265,6 +265,7 @@
brunnermunzel
combine_pvalues
jarque_bera
pagel
Immediate first thought: `pagel` like "bagel"? This doesn't seem like a good function name. Maybe `page_l`?
Thanks for taking a look. `page_l` was the original name, but I changed it for consistency with `kendalltau`, `spearmanr`, `pearsonr`, `johnsonsb`, `johnsonsu`, `mannwhitneyu`, and `friedmanchisquare`: surname adjoined with variable name. There are some functions that have a `_` after the surname, but they don't have the name of a variable after them; they're more of a description (e.g. `fisher_exact`, `yeojohnson_normmax`).
Either is fine with me, but @mdhaber's argument for following the common pattern pushes me towards `pagel`.
@rgommers, I'd like to merge this soon. Are you OK with `pagel`?
Sure. It's a terrible name, but at least it's consistent with the other terrible names - and `page_l` also won't tell anyone what the function does.
I've found two versions in R, one called `page.trend.test` (in the `crank` package) and another called `page.test` (in the `cultevo` package). In this blog post, the author of the `cultevo` package argues that the test isn't actually a trend test. If that argument is convincing, then including `trend` in the name isn't quite accurate.
Perhaps something with the word `ordered`, e.g. `page_ordered_test` (with or without the underscores, with or without the word `test`). The name Page is clearly associated with this test, so I think we want to keep `page` in there.
Naming things is hard. Suggestions for names that are not terrible would be appreciated!
`page_test` or `page_l_test` is what I'd choose. We have `ttest`, `binomtest` etc. as well - makes it a bit more descriptive.
The only thing preventing me from merging this is @rgommers's objection to the name. I know @mdhaber prefers `pagel`, and (for better or worse) that style is consistent with many other names in `scipy.stats`. I don't have a strong preference, and when that happens I try to go with the original author's preference. Does anyone else have input? Is there a set of objective criteria we can apply to evaluate the quality of a name?
Maybe just ping on the mailing list? If no one else objects or has a clear preference, go with `pagel`.
scipy/stats/_pagel.py
Outdated
in the following order: tutorial, lecture, seminar.

>>> table = [[3, 4, 3],
             [2, 2, 4],
minor: this requires `...` at the start of each continuation line to be valid doctest syntax.
Yup, thanks!
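For readers unfamiliar with the doctest point raised here, a minimal sketch of the continuation-line syntax follows. The helper names `first_row` and `check_docstring` and the two-row table are hypothetical and not taken from the PR; only the standard-library `doctest` module is used.

```python
import doctest


def first_row(table):
    """Return the first row of a nested list (hypothetical helper).

    A statement that spans several lines needs ``...`` on every line
    after the first, otherwise doctest treats those lines as output.

    >>> table = [[3, 4, 3],
    ...          [2, 2, 4]]
    >>> first_row(table)
    [3, 4, 3]
    """
    return table[0]


def check_docstring(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the runner's report text; only the counts are of interest.
    result = runner.run(test, out=lambda text: None)
    return result.failed, result.attempted


failed, attempted = check_docstring(first_row)
```

Both examples in the docstring are attempted, and neither should fail when the `...` prefixes are present.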
scipy/stats/_pagel.py
Outdated
* there are :math:`n \geq 3` treatments,
* :math:`m \geq 2` subjects are observed for each treatment, and
* the observations are hypothesized to have a particular order.
The Wikipedia page has useful context here, such as that the test is related to `spearmanr` and has more statistical power than `friedmanchisquare`. I think that as we add more of these lesser-known tests, such context becomes more and more important to include.
OK, adding.
It works, but is suboptimal. If it gets its own file then it should be imported directly from |
There is a lot more code to be written. If, in the end, it's too small for its own file, I'll move it. In the meantime, easier to develop in its own file.
I was following the example of
and included in
Should I fix both of them? |
I created `_hypotests.py` for new statistical tests / hypothesis testing to avoid adding more and more code to `stats.py`. Could we add Page's L to that file (and potentially revise the way the functions are imported)?
scipy/stats/_pagel.py
Outdated
import scipy.stats


Page_L_Result = namedtuple('Page_L_Result',
As @WarrenWeckesser suggested, I'm going to change this to some other sort of object that requires items to be accessed by name.
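One way to get a result object whose items must be accessed by name is a frozen dataclass. This is only a sketch of the idea, not necessarily what the PR ended up using; the class name `PageLResult`, its fields, and the numbers below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLResult:
    """Hypothetical result container; field names are assumptions."""
    statistic: float
    pvalue: float
    method: str


# Placeholder values for illustration only, not output of a real test.
res = PageLResult(statistic=1.0, pvalue=0.5, method="asymptotic")

# Unlike a namedtuple, the object is not iterable, so positional
# unpacking fails and callers have to write res.statistic, res.pvalue.
try:
    statistic, pvalue, method = res
    unpack_failed = False
except TypeError:
    unpack_failed = True
```

The design trade-off is that new fields can be added later without breaking code that unpacks the result as a tuple, because no such code can exist.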
TLDR: I added
Help? I try to load it with:
How can I fix this?
and uses it like:
I changed my code to use |
Thanks @WarrenWeckesser! |
CircleCI was successful in 51c88a9 but fails in ed3b9d0. How did those changes break the doc build (without any error messages, of course)? Update: sounds like
|
I made a bunch of suggested changes inline. There are also many lines in the test class `TestPageL` that are longer than 79 characters. Those should be fixed before we merge the PR.
I have one API change to consider. In almost every example that I find (web pages, text books, R function documentation), the given data is the raw observations, not the ranked observations. (Off the top of my head, the only case where I recall the given data being the ranked observations is the example from the original paper.) I suspect this is, in fact, the most common use-case. This means the users will almost always have to give the argument `ranked=False` when using the function. I think we should make that the default.
scipy/stats/_pagel.py
Outdated
``'asymptotic'`` *p*-values, however, tend to be smaller (i.e. less
conservative) than the ``'exact'`` *p*-values.


Remove a blank line.
Oops; I'll include this in the next commit.
I remember I chose this default for speed. Ranking the data is the most expensive part of the (asymptotic) tests, I think, so I thought we should skip ranking by default. But you're right; we should probably make it convenient first, and if the user cares about speed, they can pay attention to the available parameters.
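To make the `ranked` discussion concrete, here is a rough pure-Python sketch of how the L statistic relates to raw versus pre-ranked data. It is not the PR's implementation: the function names `rank_row` and `page_l_statistic` are hypothetical, ties are given average ranks as an assumption, and the columns are assumed to be already arranged in the hypothesized order.

```python
def rank_row(row):
    """Rank one subject's observations from 1..n, averaging over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the value at position i.
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def page_l_statistic(data, ranked=False):
    """Sum over columns of (column index) * (sum of ranks in that column).

    data holds one row per subject and one column per treatment.
    With ranked=False (the default argued for in the review) each row is
    ranked first; ranked=True skips that step for speed.
    """
    rows = data if ranked else [rank_row(row) for row in data]
    n_treatments = len(rows[0])
    column_sums = [sum(row[j] for row in rows) for j in range(n_treatments)]
    return sum((j + 1) * column_sums[j] for j in range(n_treatments))


# Hypothetical raw scores: both subjects order the treatments identically.
raw = [[10, 20, 30],
       [5, 7, 9]]
l_from_raw = page_l_statistic(raw)
l_from_ranks = page_l_statistic([[1, 2, 3], [1, 2, 3]], ranked=True)
```

Both calls give the same statistic, which is the point of the default: passing raw observations with `ranked=False` costs a ranking step but saves the user from doing it by hand.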
OK, please send me the formatted lines and I will include them, or you're welcome to push directly. Update: done.
…review Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
Done. |
Thanks Matt! I noticed three more tiny whitespace issues, otherwise this looks ready.
Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
I think it is ready! I'll hold off merging this until Monday, to give time for previous commenters (or anyone else) to take a look at the updated version.
scipy/stats/_pagel.py
Outdated
We use the example from [3]_: 10 students are asked to rate three
teaching methods - tutorial, lecture, and seminar - on a scale of 1-5,
with 1 being the lowest and 5 being the highest. We have decided that
a confidence level of 99% is required to reject the null hypothsis in favor
- a confidence level of 99% is required to reject the null hypothsis in favor
+ a confidence level of 99% is required to reject the null hypothesis in favor
@WarrenWeckesser Done, I think. If doctests fail due to line break, please resolve as you see fit.
Well that's nice that doctests passed.
Thanks @mdhaber, merged. I added a brief note about it to the 1.7.0 release notes.
In the CZI Proposal, we indicated that we would add Page's L test.
This is part of our effort to address the top level Statistics Enhancements roadmap item "Expand the set of hypothesis tests."
Original post and updates (most have been addressed)
The initial commit is just proposed documentation so we can discuss the signature and functionality. Update: it's all here now.
I may be getting carried away with the three different methods. Some other `stats` methods implement only asymptotic approximations, so I assume I could do the same here, but I think that naively-implemented exact (permutation) and Monte Carlo methods would be useful without adding too much difficulty. Update: Naive exact was too slow, but I added a more efficient exact method.

Questions:
- … `scipy.stats` correctly? Do you agree that `pagel` is the appropriate name, considering other tests are named like `kendalltau`, `pearsonr`, and `spearmanr`? Update: I followed `epps_singleton_2samp` as an example, which imports into `stats.py` and adds it into `__all__` there. Should I move these both directly to `__init__.py`?
- … `ranks`. Update: I think we should keep it. R does.
- `'auto'` would simply select between `'exact'` and `'asymptotic'` based on Table 2 of the original paper. As Wikipedia states it, "The approximation is reliable for more than 20 subjects with any number of conditions, for more than 12 subjects when there are 4 or more conditions, and for any number of subjects when there are 9 or more conditions."
- … `method` argument instead of `method='mc'` plus a separate `n_s` argument?
- … `auto`, resort to Monte Carlo when exact will be too slow and asymptotic will be too inaccurate? Update: I'm implementing the exact algorithm from this paper
- … `'exact'` and `'mc'` calculations make sense? Are they correct?

I'll add an example problem at the end of the documentation later. Fingers crossed Sphinx doesn't give me too much of a headache.... Update: surprisingly, no issues!
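The `'auto'` rule quoted from Wikipedia in the questions above could be written roughly as follows. This is a sketch of that quoted rule only, not the PR's code; the function name `choose_method` is hypothetical, and `m` (subjects) and `n` (conditions/treatments) follow the symbols used in the docstring excerpt.

```python
def choose_method(m, n):
    """Pick 'asymptotic' when the quoted reliability rule holds, else 'exact'.

    m: number of subjects; n: number of conditions (treatments).
    Hypothetical helper that encodes the rule as quoted from Wikipedia.
    """
    reliable = (
        m > 20                  # more than 20 subjects, any number of conditions
        or (m > 12 and n >= 4)  # more than 12 subjects with 4 or more conditions
        or n >= 9               # any number of subjects with 9 or more conditions
    )
    return "asymptotic" if reliable else "exact"


# The docstring example: 10 students rating 3 teaching methods.
method_for_example = choose_method(m=10, n=3)
```

With 10 subjects and 3 conditions none of the three conditions holds, so this sketch falls back to the exact method.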
@WarrenWeckesser