ENH: stats: Add the conditional odds ratio and CI to fisher_exact. #12288

WarrenWeckesser · 2020-06-01T07:07:44Z

In fisher_exact, compute the conditional odds ratio and its
confidence interval.

The return value of fisher_exact is changed to an instance
of a class created by _make_tuple_bunch(). The new return
values related to the conditional odds ratio are included as
attributes of this object.

The code for fisher_exact is now in its own file. The tests
for fisher_exact are also moved to a separate file. Some of
the test values have been given more precise "known" values
(computed using code written with mpmath or in C with GMP).

An R script is included that generates the Python file
"fisher_exact_results_from_r.py" that is used in the test suite.
The generated Python file has the input parameters and R results
written in a form that can be loaded into Python with just an
import statement.

WarrenWeckesser · 2020-06-02T04:37:48Z

Test failures are not related to this PR.

WarrenWeckesser · 2020-06-06T15:52:11Z

I just pushed an update (rebased). The PR now includes both the conditional odds ratio and the conditional confidence interval. The new computations match the values in the R function fisher.test (in the estimate and conf.int fields of its result). For example, in R:

> table = matrix(c(8, 2, 1, 5), nrow=2, ncol=2, byrow=TRUE)
> table
     [,1] [,2]
[1,]    8    2
[2,]    1    5
> result = fisher.test(table)
> result$estimate
odds ratio 
  15.46969 
> result$conf.int
[1]    1.008849 1049.791446
attr(,"conf.level")
[1] 0.95

With the updated fisher_exact in this PR:

In [3]: from scipy.stats import fisher_exact                                                                        

In [4]: table = np.array([[8, 2], [1, 5]])                                                                          

In [5]: result = fisher_exact(table)                                                                                

In [6]: result.conditional_odds_ratio                                                                               
Out[6]: 15.46581971692097

In [7]: result.conditional_odds_ratio_ci()
Out[7]: ConfidenceInterval(low=1.0088492079476903, high=1059.6394950862784)

(Note: The code was updated to add the parentheses to result.conditional_odds_ratio_ci(), as it is a method in the latest update to the PR.)

The numerical values do not match R's values exactly, and in many cases, the values agree to only a few decimal places. Each time I have checked such a discrepancy, I have found it is because R often gives only 3 to 5 digits of precision, and the result from fisher_exact is much more accurate.

WarrenWeckesser · 2020-06-29T19:12:03Z

I have switched this from "draft" to "ready for review". There is a lot here, so let me know if there is anything I can do to make it easier to review.

mdhaber · 2020-07-09T21:58:42Z

scipy/stats/tests/generate_fisher_exact_results_from_r.R

+# [1,]  100    2
+# [2,] 1000    5
+# 
+table_data = list(


Where do these values come from?

(Getting back to this PR--you might have to jog your memory a bit.)

These are arrays that were tested in

scipy/scipy/stats/tests/test_stats.py

Line 407 in d4789ba

class TestFisherExact(object):

mdhaber · 2020-07-09T22:03:44Z

scipy/stats/tests/test_stats.py

-    def test_less_greater(self):
-        tables = (
-            # Some tables to compare with R:
-            [[2, 7], [8, 2]],


It looks like these tables are no longer tested, or were they moved somewhere other than test_less_greater?

These tables are included in the file fisher_exact_results_from _r.py, which includes all three alternatives for each table.

mdhaber · 2020-07-09T22:06:24Z

scipy/stats/tests/test_stats.py

-    [1] 1.701815e-09
-    """
-
-    def test_basic(self):


Can you explain where these and test_precise went?

The test class TestFisherExact should cover all these matrices. Note that the results from R are stored in the generated file fisher_exact_results_from_r.py. This data is used in the test method TestFisherExact.test_results_from_r in test_fisher_exact.py.

scipy/stats/tests/test_fisher_exact.py

scipy/stats/_fisher_exact.py

mdhaber · 2020-07-09T22:14:29Z

scipy/stats/_fisher_exact.py

+            the noncentrality parameter of Fisher's noncentral
+            hypergeometric distribution with the same hypergeometric
+            parameters as ``table`` whose mean is ``table[0, 0]``.
+            (FIXME: I know this need work!)


Does it still?

It could use a little copy-editing. I'll tweak the docstring in my next update.

mdhaber · 2020-07-09T22:20:44Z

scipy/stats/_fisher_exact.py

+    >>> result.pvalue
+    0.03496503496503495
+
+    The probability that we would observe this or an even more imbalanced


"The probability under the null hypothesis that ..." would be helpful.

Sounds good, I'll add that.

mdhaber · 2020-07-09T22:20:48Z

scipy/stats/_fisher_exact.py

+
+    The sample and conditional odds ratios for this example are
+
+    >>> result.sample_odds_ratio


as would interpretation of these

In fisher_exact, compute the conditional odds ratio and its confidence interval. The return value of `fisher_exact` is changed to an instance of a class created by _make_tuple_bunch(). The new return values related to the conditional odds ratio are included as attributes of this object. The code for `fisher_exact` is now in its own file. The tests for `fisher_exact` are also moved to a separate file. Some of the test values have been given more precise "known" values (computed using code written with mpmath or in C with GMP). An R script is included that generates the Python file "fisher_exact_results_from_r.py" that is used in the test suite. The generated Python file has the input parameters and R results written in a form that can be loaded into Python with just an import statement.

…terval Instead of always computing the confidence interval, the object returned by fisher_exact now has a method that is called to compute the confidence interval.

…test results.

rlucas7 · 2020-11-15T21:09:42Z

@WarrenWeckesser this should probably go under the contingency namespace as well.

Is there any reason it should be here in stats?

tylerjereddy · 2020-12-02T03:47:14Z

@rlucas7 @WarrenWeckesser @mdhaber No activity in last 16 days--will there be consensus here for a merge very soon? Otherwise, maybe bump the milestone and if it does get merged before I branch just switch back the milestone.

WarrenWeckesser · 2020-12-03T12:24:08Z

@rlucas7 wrote

@WarrenWeckesser this should probably go under the contingency namespace as well.

Is there any reason it should be here in stats?

fisher_exact has been in the stats namespace for many years. I don't think we want to change that now. And even if we wanted to, that change is orthogonal to the changes in this PR.

For reference, a bit of history (also relevant to #13153 and #13048): the contingency submodule was added when I added chi2_contingency to stats. Because of the general preference for a flat namespace, it was natural to put chi2_contingency in the stats namespace. The two functions margins and expected_freq that are used by chi2_contingency seemed like they might be useful tools, so, after some discussion (which you can read in the original pull request), the submodule stats.contingency was made public, and the functions margins and expected_freq were made public in that submodule.

These days I try to be more conservative about what we make public, and if I went back in time, I'd argue against making margins and expected_freq public. They are somewhere specialized, and there was no specific demand for them other than their internal use in chi2_contingency. If we hadn't made margins and expected_freq public back then, we wouldn't have needed the public contingency submodule, and we might not be discussing now whether or not the new functions (i.e. association and relative_risk) belong in stats. Lessons learned and all that.

Having said that, I'm OK with using the stats.contingency namespace for new functions such as association and relative_risk. Since we have it, we might as well use it.

…sher_exact result.

WarrenWeckesser · 2020-12-04T17:15:09Z

After looking into more instances of the use of the odds ratio in texts and papers, I think it is important that we also include the confidence interval for the sample odds ratio. In a new commit, I have added the method sample_odds_ratio_ci() to the FisherExactResult class. This method implements a widely used formula for the confidence interval. I haven't added unit tests for this method yet.

WarrenWeckesser · 2020-12-04T17:21:25Z

For example, with the new method, we can reproduce the results at https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717_ComparingFrequencies/PH717_ComparingFrequencies8.html:

In [54]: c = np.array([[992, 2260], [165, 1017]])                                         

In [55]: result = fisher_exact(c)                                                         

In [56]: result.sample_odds_ratio                                                         
Out[56]: 2.7054545454545456

In [57]: result.sample_odds_ratio_ci()                                                    
Out[57]: ConfidenceInterval(low=2.2583385178687365, high=3.2410926172522103)

mdhaber · 2020-12-05T00:00:15Z

@WarrenWeckesser If I understand correctly, a lot of this will be implementation of Fisher's NCHG distribution. Before I review all that, we should make some decisions about mdhaber#31. If FNCHG is going to be added as a distribution, shouldn't this rely on that? If so, I think we should add FNCHG first, then add this.

Also, I suppose I can use a separate diff tool to review the changes against the existing fisher_exact, but would it be crazy to make that a separate PR? One in which we move fisher_exact to its own file (otherwise unchanged) and one in which we add the odds ratio CI?

mdhaber · 2020-12-08T08:25:13Z

If this will close gh-11131, maybe add that up top so it will be closed when this merges.

WarrenWeckesser · 2021-01-04T10:54:08Z

I decided it makes more sense to create a new function for the odds ratio statistic. The new PR gh-13340 is a replacement for this PR. In the new PR, I tweak the docstring of fisher_exact, but otherwise that function is not touched.

WarrenWeckesser · 2021-02-10T21:43:04Z

Replaced by gh-13340, so closing.

WarrenWeckesser added scipy.stats enhancement A new feature or improvement labels Jun 1, 2020

WarrenWeckesser marked this pull request as draft June 1, 2020 07:08

WarrenWeckesser mentioned this pull request Jun 1, 2020

Return Bunch objects from stats functions #3665

Closed

WarrenWeckesser force-pushed the fisher-cond-oddsratio branch 2 times, most recently from ea25e94 to b051f5d Compare June 2, 2020 01:34

WarrenWeckesser closed this Jun 2, 2020

WarrenWeckesser reopened this Jun 2, 2020

WarrenWeckesser mentioned this pull request Jun 2, 2020

DOC: stats.fisher_exact does not match R functionality for oddsratio calculation #11131

Closed

WarrenWeckesser force-pushed the fisher-cond-oddsratio branch 2 times, most recently from 4037cca to 2628935 Compare June 6, 2020 15:37

WarrenWeckesser changed the title ~~ENH: stats: Add the conditional MLE odds ratio to fisher_exact.~~ ENH: stats: Add the conditional odds ratio and CI to fisher_exact. Jun 6, 2020

WarrenWeckesser force-pushed the fisher-cond-oddsratio branch from 2628935 to f5c0ad9 Compare June 6, 2020 16:37

WarrenWeckesser force-pushed the fisher-cond-oddsratio branch from f5c0ad9 to 3ac9c7d Compare June 22, 2020 21:20

WarrenWeckesser closed this Jun 25, 2020

WarrenWeckesser reopened this Jun 25, 2020

WarrenWeckesser force-pushed the fisher-cond-oddsratio branch from 3ac9c7d to 1b3bcc8 Compare June 25, 2020 23:45

WarrenWeckesser marked this pull request as ready for review June 29, 2020 18:59