
ENH: stats: add Page's L test #12531

Merged
merged 40 commits into from
Jan 27, 2021

Conversation


@mdhaber mdhaber commented Jul 12, 2020

In the CZI Proposal, we indicated that we would add Page's L test.

This is part of our effort to address the top level Statistics Enhancements roadmap item "Expand the set of hypothesis tests."


Original post and updates (most have been addressed)
The initial commit is just proposed documentation so we can discuss the signature and functionality. Update: it's all here now.

I may be getting carried away with the three different methods. Some other stats methods implement only asymptotic approximations, so I assume I could do the same here, but I think that naively-implemented exact (permutation) and Monte Carlo methods would be useful without adding too much difficulty. Update: Naive exact was too slow, but I added a more efficient exact method.

Questions:

  1. Is it OK to have the function in its own file? Have I integrated it into scipy.stats correctly? Do you agree that pagel is the appropriate name, considering other tests are named like kendalltau, pearsonr, and spearmanr? Update: I followed epps_singleton_2samp as an example, which imports into stats.py and adds it into __all__ there. Should I move both of these directly to __init__.py?
  2. Can you suggest a better name for any argument?
  3. Should we allow the user to pass in raw data instead of ranks? If not, then we don't need the second argument, and perhaps a better name for the first argument would be ranks. Update: I think we should keep it. R does.
  4. Should we allow the user to pass in data with the columns out of the hypothesized order? If not, then we don't need the third argument.
  5. Do we want all the proposed methods? When there are no ties, 'auto' would simply select between 'exact' and 'asymptotic' based on Table 2 of the original paper. As Wikipedia states, "The approximation is reliable for more than 20 subjects with any number of conditions, for more than 12 subjects when there are 4 or more conditions, and for any number of subjects when there are 9 or more conditions."
  6. If we're going to have a Monte Carlo method, how should the user specify the number of samples? Can they just put an integer into the method argument instead of method='mc' plus a separate n_s argument?
  7. Reference [2] says "If there are ties within blocks, mean ranks can be used. Such ties reduce the variance of L... As a consequence, the p-value based on the uncorrected variance would be too large, hence waiving the correction would not violate the significance level." In other words, the asymptotic method estimates the p-value conservatively in the presence of ties. Is that OK, or is it important to adjust for the possibility of ties? [2] suggests Van de Wiel and Di Bucchianico for that. Update: That reference is not very explicit about how to perform the computation. I'd like to leave this out of the PR.
  8. I was originally planning on implementing the exact method naively - actually generating all the rank permutations and computing the L statistic for each. That is going to be too slow for some of the larger combinations of k and n. Is it ok to eliminate the exact method, or, if the user selects auto, resort to Monte Carlo when exact will be too slow and asymptotic will be too inaccurate? Update: I'm implementing the exact algorithm from this paper
  9. Does the documentation's explanation for the 'exact' and 'mc' calculations make sense? Are they correct?
  10. Is the original article quoted too much in the documentation?
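To make questions 8–9 concrete, here is a sketch of the naive exact approach described above (not the more efficient published algorithm that was ultimately implemented). The helper names `page_l_statistic` and `naive_exact_pvalue` are illustrative, not the PR's API; the L statistic is the sum over treatments of the hypothesized position times that treatment's rank sum:

```python
import itertools

def page_l_statistic(ranks):
    """Page's L: sum over treatments j of j * (rank sum of column j)."""
    n = len(ranks[0])
    col_sums = [sum(row[j] for row in ranks) for j in range(n)]
    return sum((j + 1) * col_sums[j] for j in range(n))

def naive_exact_pvalue(ranks):
    """P(L >= observed) under the null that every within-subject ordering
    is equally likely.  Cost is (n!)**m -- demonstration only, which is
    exactly why this approach is too slow for larger k and n."""
    m, n = len(ranks), len(ranks[0])
    observed = page_l_statistic(ranks)
    perms = list(itertools.permutations(range(1, n + 1)))
    hits = total = 0
    # Enumerate every combination of per-subject rank orderings.
    for choice in itertools.product(perms, repeat=m):
        total += 1
        hits += page_l_statistic(choice) >= observed
    return hits / total

ranks = [[1, 2, 3],
         [1, 3, 2],
         [1, 2, 3]]
print(page_l_statistic(ranks))   # 41 (= 1*3 + 2*7 + 3*8)
print(naive_exact_pvalue(ranks)) # 7/216, about 0.0324
```

A Monte Carlo variant would sample from `itertools.product`'s space at random instead of enumerating it, which is the trade-off question 6 is about.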

I'll add an example problem at the end of the documentation later. Fingers crossed Sphinx doesn't give me too much of a headache.... Update: surprisingly, no issues!

@WarrenWeckesser

@mdhaber mdhaber added the scipy.stats and enhancement labels Jul 12, 2020
@mdhaber mdhaber added this to the 1.6.0 milestone Jul 12, 2020
@@ -265,6 +265,7 @@
brunnermunzel
combine_pvalues
jarque_bera
pagel
Member
Immediate first thought: pagel like "bagel"? This doesn't seem like a good function name. Maybe page_l?

Contributor Author

Thanks for taking a look. page_l was the original name, but I changed it for consistency with kendalltau, spearmanr, pearsonr, johnsonsb, johnsonsu, mannwhitneyu, and friedmanchisquare: surname adjoined with variable name. There are some functions that have a _ after the surname, but they don't have the name of a variable after them; they're more of a description (e.g. fisher_exact, yeojohnson_normmax).

Member
Either is fine with me, but @mdhaber's argument for following the common pattern pushes me towards pagel.

Member
@rgommers, I'd like to merge this soon. Are you OK with pagel?

Member
Sure. It's a terrible name, but at least it's consistent with the other terrible names - and page_l also won't tell anyone what the function does.

Member
I've found two versions in R, one called page.trend.test (in the crank package) and another called page.test (in the cultevo package). In this blog post, the author of the cultevo package argues that the test isn't actually a trend test. If that argument is convincing, then including trend in the name isn't quite accurate.

Perhaps something with the word ordered, e.g. page_ordered_test (with or without the underscores, with or without the word test). The name Page is clearly associated with this test, so I think we want to keep page in there.

Naming things is hard. Suggestions for names that are not terrible would be appreciated!

Member
page_test or page_l_test is what I'd choose. We have ttest, binomtest, etc. as well, which makes the name a bit more descriptive.

Member
The only thing preventing me from merging this is @rgommers' objection to the name. I know @mdhaber prefers pagel, and (for better or worse) that style is consistent with many other names in scipy.stats. I don't have a strong preference, and when that happens I try to go with the original author's preference. Does anyone else have input? Is there a set of objective criteria we can apply to evaluate the quality of a name?

Member
Maybe just ping on the mailing list? If no one else objects or has a clear preference, go with pagel

in the following order: tutorial, lecture, seminar.

>>> table = [[3, 4, 3],
[2, 2, 4],
Member

minor: this requires ... at the start of each line to be valid doctest syntax.
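For illustration, the fix this comment asks for (prefixing each continuation line with `...`) can be verified with the standard `doctest` module. This snippet is a standalone sketch, not the PR's actual docstring:

```python
import doctest

docstring = '''
>>> table = [[3, 4, 3],
...          [2, 2, 4]]
>>> len(table)
2
'''

# Because the continuation line starts with "...", the parser treats the
# two lines as a single source statement and the example runs cleanly.
parser = doctest.DocTestParser()
test = parser.get_doctest(docstring, {}, 'example', None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed)  # 0 -- both examples pass
```

Without the `...` prefix, the parser would treat the second line as expected output for the first, and the doctest would fail.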

Contributor Author

Yup, thanks!

scipy/stats/_pagel.py (resolved review thread)
* there are :math:`n \geq 3` treatments,
* :math:`m \geq 2` subjects are observed for each treatment, and
* the observations are hypothesized to have a particular order.

Member

The Wikipedia page has useful context here, like it's related to spearmanr and has more statistical power than friedmanchisquare. I think when we add more of these lesser-known tests, such context is becoming more and more important to add.

Contributor Author

OK, adding.

@rgommers
Member

Have I integrated it into scipy.stats correctly?

It works, but is suboptimal. If it gets its own file then it should be imported directly from stats/__init__.py. However, I don't think this should be in its own file. If stats.py gets too large, we should just break it up in logical sections. One file named after one function this small is a bit odd.

@mdhaber
Contributor Author

mdhaber commented Jul 13, 2020

One file named after one function this small is a bit odd.

There is a lot more code to be written. If, in the end, it's too small for its own file, I'll move it. In the meantime, it's easier to develop in its own file.

If it gets its own file then it should be imported directly from stats/__init__.py.

I was following the example of epps_singleton_2samp, which gets imported into stats.py:

from ._hypotests import epps_singleton_2samp
from ._pagel import pagel

and included in __all__ there,

           'brunnermunzel', 'epps_singleton_2samp', 'pagel']

Should I fix both of them?

@chrisb83
Member

If stats.py gets too large, we should just break it up in logical sections. One file named after one function this small is a bit odd.

I created _hypotests.py for new statistical tests / hypothesis testing to avoid adding more and more code to stats.py. Could we add Page's L to that file (and potentially revise the way the functions are imported)?

import scipy.stats


Page_L_Result = namedtuple('Page_L_Result',
Contributor Author

@mdhaber mdhaber Jul 17, 2020

As @WarrenWeckesser suggested, I'm going to change this to some other sort of object that requires items to be accessed by name.

@mdhaber
Contributor Author

mdhaber commented Jul 17, 2020

TLDR: I added scipy/stats/pagel_exact.npy, which the exact method uses to look up pre-calculated PMF values, but CI can't find the file.

E   FileNotFoundError: [Errno 2] No such file or directory: 'C:\\hostedtoolcache\\windows\\Python\\3.7.8\\x64\\lib\\site-packages\\scipy\\stats\\pagel_exact.npy'

Help?


I try to load it with:

    dir_path = os.path.dirname(os.path.realpath(__file__))
    datafile = os.path.join(dir_path, "pagel_exact.npy")
    all_pmfs = np.load(datafile, allow_pickle=True).item()

How can I fix this?
update: can't check now, but maybe .gitignore needs an exception.
update: nope, it's not that.
update:
scipy\interpolate\tests\test_interpnd.py has a function:

def data_file(basename):
    return os.path.join(os.path.abspath(os.path.dirname(__file__)),
                        'data', basename)

and uses it like:

points = np.load(data_file('estimate_gradients_hang.npy'))

I changed my code to use abspath just like this function, but it's still not working. I don't see anything in .gitignore that would prevent the .npy files from getting copied. Do the CI definitions prevent it?
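For context on this kind of failure: a file that exists in the repository but is missing from the installed package usually means it was never declared as package data, so the build never copies it into site-packages. With the numpy.distutils-based build scipy used at the time, the declaration would look roughly like the following; this is a sketch of the general mechanism, and the exact contents of scipy/stats/setup.py are an assumption, not quoted from the PR:

```python
# Hypothetical excerpt in the style of scipy/stats/setup.py
# (numpy.distutils). Without add_data_files, non-Python files such as
# .npy data are left out of the installed package, and np.load() then
# fails with FileNotFoundError on CI even though git tracks the file.
def configuration(parent_package='', top_path=None):
    from numpy.distutils.misc_util import Configuration
    config = Configuration('stats', parent_package, top_path)
    config.add_data_files('pagel_exact.npy')
    return config
```

.gitignore only affects what git tracks; it has no effect on what the build step installs, which is why ruling it out (as the updates above did) was the right call.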

@mdhaber
Contributor Author

mdhaber commented Jul 19, 2020

Thanks @WarrenWeckesser!

@mdhaber
Contributor Author

mdhaber commented Jul 19, 2020

CircleCI was successful in 51c88a9 but fails in ed3b9d0. How did those changes break the doc build (without any error messages, of course)?

Update: sounds like data was a bad name.

/home/circleci/repo/build/testenv/lib/python3.7/site-packages/scipy/stats/morestats.py:docstring of scipy.stats.boxcox_llf:11: WARNING: py:obj reference target not found: data
/home/circleci/repo/build/testenv/lib/python3.7/site-packages/scipy/stats/morestats.py:docstring of scipy.stats.boxcox_llf:18: WARNING: py:obj reference target not found: data
/home/circleci/repo/build/testenv/lib/python3.7/site-packages/scipy/stats/morestats.py:docstring of scipy.stats.boxcox_llf:18: WARNING: py:obj reference target not found: data
writing output... [ 89%] generated/scipy.stats.kstwobign .. generated/scipy.stats.rv_discrete.support
writing output... [ 94%] generated/scipy.stats.rv_discrete.var .. sparse.csgraph
/home/circleci/repo/build/testenv/lib/python3.7/site-packages/scipy/stats/morestats.py:docstring of scipy.stats.yeojohnson_llf:12: WARNING: py:obj reference target not found: data
/home/circleci/repo/build/testenv/lib/python3.7/site-packages/scipy/stats/morestats.py:docstring of scipy.stats.yeojohnson_llf:19: WARNING: py:obj reference target not found: data

@WarrenWeckesser WarrenWeckesser left a comment

I made a bunch of suggested changes inline. There are also many lines in the test class TestPageL that are longer than 79 characters. Those should be fixed before we merge the PR.

I have one API change to consider. In almost every example that I find (web pages, text books, R function documentation), the given data is the raw observations, not the ranked observations. (Off the top of my head, the only case where I recall the given data being the ranked observations is the example from the original paper.) I suspect this is, in fact, the most common use-case. This means the users will almost always have to give the argument ranked=False when using the function. I think we should make that the default.

``'asymptotic'`` *p*-values, however, tend to be smaller (i.e. less
conservative) than the ``'exact'`` *p*-values.


Member

Remove a blank line.


Contributor Author

Oops; I'll include this in the next commit.

scipy/stats/_pagel.py (resolved review threads)
scipy/stats/tests/test_stats.py (resolved review threads)
@mdhaber
Contributor Author

mdhaber commented Dec 10, 2020

This means the users will almost always have to give the argument ranked=False when using the function. I think we should make that the default.

I remember I chose this default for speed. Ranking the data is the most expensive part of the (asymptotic) tests, I think, so I thought we should skip ranking by default.

But you're right; we should probably make it convenient first, and if the user cares about speed, they can pay attention to the available parameters.
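For reference, the within-subject ranking that a `ranked=False` default triggers (midranks for ties, as `scipy.stats.rankdata` computes) can be sketched in pure Python. `rank_row` is an illustrative helper, not the PR's internal name, and each subject's row is ranked independently:

```python
def rank_row(row):
    """Within-subject ranks for one row, using midranks for ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(row):
        # Extend j over the run of tied values starting at sorted position i.
        j = i
        while j + 1 < len(row) and row[order[j + 1]] == row[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return ranks

print(rank_row([3, 1, 4, 1]))  # [3.0, 1.5, 4.0, 1.5]
```

Applying this per row is an O(m * n log n) preprocessing step, which is the cost being weighed against convenience in the comment above.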

@mdhaber
Contributor Author

mdhaber commented Dec 10, 2020

There are also many lines in the test class TestPageL that are longer than 79 characters. Those should be fixed before we merge the PR.

OK, please send me the formatted lines and I will include them, or you're welcome to push directly.

Update: done.

@mdhaber
Contributor Author

mdhaber commented Dec 10, 2020

This means the users will almost always have to give the argument ranked=False when using the function. I think we should make that the default.

Done.

@WarrenWeckesser WarrenWeckesser left a comment

Thanks Matt! I noticed three more tiny whitespace issues, otherwise this looks ready.

scipy/stats/tests/test_stats.py (resolved review threads)
Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
@WarrenWeckesser WarrenWeckesser left a comment

I think it is ready! I'll hold off merging this until Monday, to give time for previous commenters (or anyone else) to take a look at the updated version.

We use the example from [3]_: 10 students are asked to rate three
teaching methods - tutorial, lecture, and seminar - on a scale of 1-5,
with 1 being the lowest and 5 being the highest. We have decided that
a confidence level of 99% is required to reject the null hypothsis in favor
Contributor Author

Suggested change
a confidence level of 99% is required to reject the null hypothsis in favor
a confidence level of 99% is required to reject the null hypothesis in favor

scipy/stats/_pagel.py (resolved review threads)
@mdhaber
Contributor Author

mdhaber commented Jan 25, 2021

@WarrenWeckesser Done, I think. If doctests fail due to line break, please resolve as you see fit.

@mdhaber
Contributor Author

mdhaber commented Jan 26, 2021

Well that's nice that doctests passed.

@WarrenWeckesser
Member

Thanks @mdhaber, merged. I added a brief note about it to the 1.7.0 release notes.
