DOC: stats: add realistic examples to variance tests #17778

tupui · 2023-01-12T18:16:00Z

What does this implement/fix?

Adds a realistic example to scipy.stats.levene. The example is from:

C.I. BLISS (1952), The Statistics of Bioassay: With Special
Reference to the Vitamins, pp 499-503,
doi: 10.1016/C2013-0-12584-6.

Additional information

When this looks good, I'll add similar examples to the other variance tests.

mdhaber

I'm curious if a variance test was performed in the paper?

I'd be interested in seeing the results of the permutation version of the test. If that portion of the example belongs in the spearmanr/kendalltau, might as well follow the rest of that format here.

Then again, I'm having second thoughts about putting these in all the function documentation. The files are growing considerably, and I think they might be too lengthy (too much plotting code for the amount we use the function itself). Maybe we should move them over into a section of tutorials for hypothesis tests like we do for the distributions?

tupui · 2023-01-14T11:24:36Z

> I'm curious if a variance test was performed in the paper?

It was not done, but I've seen some blog posts (of questionable quality) using this example for that.

> I'd be interested in seeing the results of the permutation version of the test. If that portion of the example belongs in the spearmanr/kendalltau, might as well follow the rest of that format here.

Permutation version? Not sure I understand.

> Then again, I'm having second thoughts about putting these in all the function documentation. The files are growing considerably, and I think they might be too lengthy (too much plotting code for the amount we use the function itself). Maybe we should move them over into a section of tutorials for hypothesis tests like we do for the distributions?

I think there are 2 separate things here.

File length: I think we could consider splitting some of the big files into logical groupings, similar to your refactoring of the API doc. It's common for the API doc structure to match the file/code structure. Having huge files like more_stats or stats is not very self-explanatory.
Details/plots: I also have this concern. I think we have 2 ways of proceeding: 1. continue as we are and consolidate things at the end; 2. think now about the result we want and refactor/document the functions accordingly. We started with 1., which is not optimal but still an improvement. One benefit is that each example is self-contained for the user. At the moment users don't navigate much in our docs (according to the metrics), but that's a biased assessment since our doc is constructed to favor that.

I propose we discuss that next week.

mdhaber · 2023-01-14T16:09:32Z

> Permutation version? Not sure I understand.

It is possible to use permutation_test to approximate the null distribution similarly to how it is done in the correlation tests, and it is useful to do so if the sample size is small. (If that is not desired here, we should be able to answer the question: why is it desirable to show how to do that in the correlation tests but not here? I think we should be consistent about the depth of each example, ideally.)

> I think there are 2 separate things here.

Yes, you're right, there are really two issues.

Yes, I was planning on that. I want to be careful to preserve git blame and the history of the code, if possible. I'm hoping that if we copy the big files in one commit so that git recognizes that they have the same history, we can then remove things from each copy to make them unique.
Yes, I think proceeding as we are and refactoring at the end is OK. That's what I had in mind.

tupui · 2023-01-15T17:26:34Z

> It is possible to use permutation_test to approximate the null distribution similarly to how it is done in the correlation tests, and it is useful to do so if the sample size is small. (If that is not desired here, we should be able to answer the question: why is it desirable to show how to do that in the correlation tests but not here? I think we should be consistent about the depth of each example, ideally.)

We would get the following:

def statistic(x, y, z):
    return stats.levene(x, y, z).statistic

ref = stats.permutation_test((small_dose, medium_dose, large_dose), statistic,
                             permutation_type='pairings')

# PermutationTestResult(statistic=0.6457341109631506, pvalue=1.0, null_distribution=array([0.64573411, 0.64573411, 0.64573411, ..., 0.64573411, 0.64573411, 0.64573411]))

Shall I include it then?

mdhaber · 2023-01-15T20:20:00Z

I think it would be good to include the stuff about the F distribution being an asymptotic approximation and to show how permutation_test can be used to get a more accurate p-value.

The code is not quite right.

Based on how the levene statistic works, it looks like we should only consider values of the statistic larger than the observed value to be "more extreme". Therefore, we need to use alternative='greater' in the call to permutation_test.
Also, the levene statistic doesn't require the sample sizes to be equal, so we immediately know that permutation_type='pairings' and permutation_type='samples' are inappropriate. permutation_type='independent' is appropriate because we are testing the null hypothesis that the samples are independently drawn from distributions with the same variance.
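
One way to see why the earlier call returned a constant null distribution: 'pairings' only reorders observations within each sample, and the Levene statistic does not depend on that order. Here is a minimal sketch of that invariance; the arrays are hypothetical synthetic stand-ins, not the data from the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for three dose groups (not the data from the PR).
samples = [rng.normal(size=6) for _ in range(3)]
observed = stats.levene(*samples).statistic

# Mimic one 'pairings' resample: shuffle the order within each sample only.
reordered = [rng.permutation(s) for s in samples]
shuffled = stats.levene(*reordered).statistic

# The statistic is unchanged (up to floating-point rounding), so every
# resample reproduces the observed value.
print(np.isclose(observed, shuffled))
```

Because every resample gives the same value, the resampled null distribution carries no information about the hypothesis.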

The call should be:

ref = stats.permutation_test((small_dose, medium_dose, large_dose), statistic, 
                             permutation_type='independent', 
                             alternative='greater')

And we would see a pretty substantial discrepancy between the asymptotic null distribution and the randomized approximation.
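
A self-contained way to compare the two p-values numerically, as a rough sketch. The three arrays are hypothetical synthetic samples of unequal size (assumed here only for illustration), and `n_resamples=999` is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)

# Hypothetical samples with unequal sizes; NOT the data used in the PR.
small_dose = rng.normal(loc=1.0, scale=0.5, size=8)
medium_dose = rng.normal(loc=1.0, scale=0.5, size=10)
large_dose = rng.normal(loc=1.0, scale=0.5, size=12)


def statistic(x, y, z):
    # Levene statistic only; the p-value comes from the resampling.
    return stats.levene(x, y, z).statistic


# p-value from the F distribution used by stats.levene
asymptotic = stats.levene(small_dose, medium_dose, large_dose)

# p-value from a randomized permutation test
res = stats.permutation_test(
    (small_dose, medium_dose, large_dose), statistic,
    permutation_type='independent', alternative='greater',
    n_resamples=999, random_state=rng)

print(f"F approximation: {asymptotic.pvalue:.3f}")
print(f"permutation test: {res.pvalue:.3f}")
```

How far apart the two numbers are depends on the data and the sample sizes, so this only shows the mechanics of the comparison.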

tupui · 2023-01-16T15:15:49Z

I did the update. (I noticed the colour of the hist does not correspond to the legend for kendalltau etc.; I can adjust here if you want.)

For the rest I am honestly getting lost with wording. Could you make suggestions? Thanks!

mdhaber

Thanks @tupui! Yes, go ahead and add this example to the other variance tests.

tupui · 2023-01-26T17:56:46Z

@mdhaber the failure is not related. Depending on how close this is to being merged, I could put the fix here, which is just a variable name change in a stats test that was introduced yesterday.

mdhaber · 2023-01-27T05:25:11Z

> I could put the fix here which is just a variable name change in a stats test which was introduced yesterday.

Oops, I took care of that in gh-17865.

mdhaber

Looks close. ~~If you wouldn't mind digging a bit into the conditions under which the null distributions used in the tests were derived, I'd appreciate it.~~ (Never mind. The computational experiments don't lie.) The statement about the null distributions being an asymptotic approximation is not always true. In any case, it does not appear to be the most important thing to say for bartlett.

mdhaber · 2023-01-27T05:30:08Z

scipy/stats/_morestats.py

+    >>> flig_val = np.linspace(0, 8, 100)
+    >>> pdf = dist.pdf(flig_val)
+    >>> fig, ax = plt.subplots(figsize=(8, 5))
+    >>> def lev_plot(ax):  # we'll re-use this


The name of this test could be changed. In some PRs I removed the prefix and just made it plot so that it would be easier to copy.

Yeah I will do this 👍

mdhaber · 2023-01-27T05:58:12Z

scipy/stats/_morestats.py

+    Note that the chi-square distribution provides an asymptotic approximation
+    of the null distribution; it is only accurate for samples with many


I'm not sure if this is true here. It might be, but I'm not sure it's the thing that's worth mentioning.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1638083107694713882823079058616272161)
ps = []
ss = []
for i in range(10000):
    samples = rng.normal(size=(5, 5))
    s, p = stats.bartlett(*samples)
    ss.append(s)
    ps.append(p)

x = np.linspace(0, np.max(ss))
dist = stats.chi2(df=len(samples)-1)
plt.plot(x, dist.pdf(x))
plt.hist(ss, density=True, bins=50)
plt.show()

The graph matches the chi2 distribution quite well even for just a few observations per sample.

If we just change to the uniform distribution, it's a very different story:
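
A numerical version of that comparison, as a sketch rather than the exact experiment above: it counts how often `stats.bartlett` rejects at the 5% level when all samples really share one distribution. The helper `rejection_rate`, the trial count, and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_groups, n_obs = 2000, 5, 5


def rejection_rate(draw, alpha=0.05):
    """Fraction of simulated null datasets that Bartlett's test rejects."""
    rejections = 0
    for _ in range(n_trials):
        # All groups come from the same population, so the null holds.
        samples = draw(size=(n_groups, n_obs))
        rejections += stats.bartlett(*samples).pvalue < alpha
    return rejections / n_trials


normal_rate = rejection_rate(rng.normal)
uniform_rate = rejection_rate(rng.uniform)
print(f"normal: {normal_rate:.3f}, uniform: {uniform_rate:.3f}")
```

If the chi-square null distribution were accurate in both cases, both rates would sit near 0.05; a rate well away from that for the uniform samples reflects the sensitivity to non-normality.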

The following might be more important:

Note that the chi-square distribution provides the null distribution when the observations are normally distributed. For samples drawn from non-normal populations, it may be more appropriate to perform a permutation test...

But it would probably be worth doing some research to confirm.

tupui · 2023-01-27T11:40:53Z

Thanks Matt. Since these are just examples, I propose to only write what we know to be true. I don't think there is much value to gain, for the purpose of the examples, in going deeper.

mdhaber · 2023-01-27T18:21:57Z

scipy/stats/_morestats.py

+    Note that the chi-square distribution provides an asymptotic approximation
+    of the null distribution.


> Since these are just examples, I propose to only write what we know to be true.

But we don't know that this is true. It doesn't appear to be true.

We do know that bartlett is sensitive to non-normality. It is mentioned in the docstring.

So my suggestion was to remove the part we don't know and replace it with something we do know:

Note that the chi-square distribution provides the null distribution when the observations are normally distributed. For small samples drawn from non-normal populations, it may be more appropriate to perform a permutation test...
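
For reference, a permutation test along those lines could look like the sketch below. The uniform samples, the seed, and `n_resamples=999` are hypothetical choices for illustration, not values taken from the docstring.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

# Hypothetical small samples from a non-normal (uniform) population.
samples = [rng.uniform(size=6) for _ in range(3)]


def statistic(*data):
    # Bartlett statistic only; the chi-square p-value is not used here.
    return stats.bartlett(*data).statistic


chi2_result = stats.bartlett(*samples)
perm_result = stats.permutation_test(
    samples, statistic, permutation_type='independent',
    alternative='greater', n_resamples=999, random_state=rng)

print(f"chi-square p-value: {chi2_result.pvalue:.3f}")
print(f"permutation p-value: {perm_result.pvalue:.3f}")
```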

tupui · 2023-01-27T18:53:26Z

Thanks Matt!

tupui added the scipy.stats and Documentation labels Jan 12, 2023

tupui requested a review from mdhaber January 12, 2023 18:16

DOC: add realistic example for stats.levene.

9cebc04

[skip cirrus] [skip actions]

tupui force-pushed the bio_var branch from ec26a17 to 9cebc04 Compare January 12, 2023 18:27

j-bowhay requested changes Jan 12, 2023

Update scipy/stats/_morestats.py

23a9537

Co-authored-by: Jake Bowhay <60778417+j-bowhay@users.noreply.github.com>

tupui force-pushed the bio_var branch from bd5cba8 to c50be9b Compare January 12, 2023 20:05

DOC: fix linter and small formating.

b77769e

[skip cirrus] [skip actions]

tupui force-pushed the bio_var branch from c50be9b to b77769e Compare January 13, 2023 00:19

mdhaber reviewed Jan 13, 2023

tupui and others added 2 commits January 16, 2023 15:10

DOC: fix typos. [skip ci]

464a6eb

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

DOC: add permutation test.

f626146

[skip cirrus] [skip actions]

mdhaber mentioned this pull request Jan 16, 2023

DOC: stats.levene: corrections [skip cirrus] [skip actions] tupui/scipy#13

Merged

DOC: stats.levene: corrections [skip cirrus] [skip actions]

20adab7

* DOC: stats.levene: corrections * Apply suggestions from code review Co-authored-by: Matt Haberland <mhaberla@calpoly.edu> Co-authored-by: Pamphile Roy <roy.pamphile@gmail.com>

tupui commented Jan 17, 2023

DOC: fix indentation and spacing. [skip actions] [skip cirrus]

c8649b6

mdhaber reviewed Jan 19, 2023

tupui added 2 commits January 19, 2023 18:59

DOC: add realistic example for stats.fligner

02ce535

DOC: add realistic example for stats.bartlett.

9fdbe24

[skip actions] [skip cirrus]

mdhaber reviewed Jan 27, 2023

DOC: fix naming and statement about null distributions.

61bffa6

[skip actions] [skip cirrus]

mdhaber reviewed Jan 27, 2023

tupui commented Jan 27, 2023

Update scipy/stats/_morestats.py

2898152

[skip ci] Co-authored-by: Pamphile Roy <roy.pamphile@gmail.com>

mdhaber merged commit d3212e4 into scipy:main Jan 27, 2023

tupui deleted the bio_var branch January 27, 2023 18:53

tupui added this to the 1.11.0 milestone Jan 27, 2023

mdhaber mentioned this pull request Jan 30, 2023

SciPy: Fundamental Tools for Biomedical Research mdhaber/scipy#90

Open

55 tasks
