Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: Add the conditional odds ratio and CI to fisher_exact. #12288

Closed

Conversation

WarrenWeckesser
Copy link
Member

@WarrenWeckesser WarrenWeckesser commented Jun 1, 2020

In fisher_exact, compute the conditional odds ratio and its
confidence interval.

The return value of fisher_exact is changed to an instance
of a class created by _make_tuple_bunch(). The new return
values related to the conditional odds ratio are included as
attributes of this object.

The code for fisher_exact is now in its own file. The tests
for fisher_exact are also moved to a separate file. Some of
the test values have been given more precise "known" values
(computed using code written with mpmath or in C with GMP).

An R script is included that generates the Python file
"fisher_exact_results_from_r.py" that is used in the test suite.
The generated Python file has the input parameters and R results
written in a form that can be loaded into Python with just an
import statement.

@WarrenWeckesser WarrenWeckesser added scipy.stats enhancement A new feature or improvement labels Jun 1, 2020
@WarrenWeckesser WarrenWeckesser marked this pull request as draft June 1, 2020 07:08
@WarrenWeckesser WarrenWeckesser force-pushed the fisher-cond-oddsratio branch 2 times, most recently from ea25e94 to b051f5d Compare June 2, 2020 01:34
@WarrenWeckesser
Copy link
Member Author

Test failures are not related to this PR.

@WarrenWeckesser WarrenWeckesser force-pushed the fisher-cond-oddsratio branch 2 times, most recently from 4037cca to 2628935 Compare June 6, 2020 15:37
@WarrenWeckesser WarrenWeckesser changed the title ENH: stats: Add the conditional MLE odds ratio to fisher_exact. ENH: stats: Add the conditional odds ratio and CI to fisher_exact. Jun 6, 2020
@WarrenWeckesser
Copy link
Member Author

WarrenWeckesser commented Jun 6, 2020

I just pushed an update (rebased). The PR now includes both the conditional odds ratio and the conditional confidence interval. The new computations match the values in the R function fisher.test (in the estimate and conf.int fields of its result). For example, in R:

> table = matrix(c(8, 2, 1, 5), nrow=2, ncol=2, byrow=TRUE)
> table
     [,1] [,2]
[1,]    8    2
[2,]    1    5
> result = fisher.test(table)
> result$estimate
odds ratio 
  15.46969 
> result$conf.int
[1]    1.008849 1049.791446
attr(,"conf.level")
[1] 0.95

With the updated fisher_exact in this PR:

In [3]: from scipy.stats import fisher_exact                                                                        

In [4]: table = np.array([[8, 2], [1, 5]])                                                                          

In [5]: result = fisher_exact(table)                                                                                

In [6]: result.conditional_odds_ratio                                                                               
Out[6]: 15.46581971692097

In [7]: result.conditional_odds_ratio_ci()
Out[7]: ConfidenceInterval(low=1.0088492079476903, high=1059.6394950862784)

(Note: The code was updated to add the parentheses to result.conditional_odds_ratio_ci(), as it is a method in the latest update to the PR.)

The numerical values do not match R's values exactly, and in many cases, the values agree to only a few decimal places. Each time I have checked such a discrepancy, I have found it is because R often gives only 3 to 5 digits of precision, and the result from fisher_exact is much more accurate.

@WarrenWeckesser
Copy link
Member Author

I have switched this from "draft" to "ready for review". There is a lot here, so let me know if there is anything I can do to make it easier to review.

# [1,] 100 2
# [2,] 1000 5
#
table_data = list(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these values come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Getting back to this PR--you might have to jog your memory a bit.)

These are arrays that were tested in

class TestFisherExact(object):

def test_less_greater(self):
tables = (
# Some tables to compare with R:
[[2, 7], [8, 2]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these tables are no longer tested, or were they moved somewhere other than test_less_greater?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tables are included in the file fisher_exact_results_from _r.py, which includes all three alternatives for each table.

[1] 1.701815e-09
"""

def test_basic(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain where these and test_precise went?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test class TestFisherExact should cover all these matrices. Note that the results from R are stored in the generated file fisher_exact_results_from_r.py. This data is used in the test method TestFisherExact.test_results_from_r in test_fisher_exact.py.

the noncentrality parameter of Fisher's noncentral
hypergeometric distribution with the same hypergeometric
parameters as ``table`` whose mean is ``table[0, 0]``.
(FIXME: I know this need work!)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could use a little copy-editing. I'll tweak the docstring in my next update.

>>> result.pvalue
0.03496503496503495

The probability that we would observe this or an even more imbalanced
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The probability under the null hypothesis that ..." would be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll add that.


The sample and conditional odds ratios for this example are

>>> result.sample_odds_ratio
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as would interpretation of these

In fisher_exact, compute the conditional odds ratio and its
confidence interval.

The return value of `fisher_exact` is changed to an instance
of a class created by _make_tuple_bunch().  The new return
values related to the conditional odds ratio are included as
attributes of this object.

The code for `fisher_exact` is now in its own file.  The tests
for `fisher_exact` are also moved to a separate file.  Some of
the test values have been given more precise "known" values
(computed using code written with mpmath or in C with GMP).

An R script is included that generates the Python file
"fisher_exact_results_from_r.py" that is used in the test suite.
The generated Python file has the input parameters and R results
written in a form that can be loaded into Python with just an
import statement.
…terval

Instead of always computing the confidence interval, the object returned
by fisher_exact now has a method that is called to compute the confidence
interval.
@WarrenWeckesser WarrenWeckesser marked this pull request as ready for review November 8, 2020 05:29
@WarrenWeckesser WarrenWeckesser added this to the 1.6.0 milestone Nov 10, 2020
@rlucas7
Copy link
Member

rlucas7 commented Nov 15, 2020

@WarrenWeckesser this should probably go under the contingency namespace as well.

Is there any reason it should be here in stats?

@tylerjereddy
Copy link
Contributor

tylerjereddy commented Dec 2, 2020

@rlucas7 @WarrenWeckesser @mdhaber No activity in last 16 days--will there be consensus here for a merge very soon? Otherwise, maybe bump the milestone and if it does get merged before I branch just switch back the milestone.

@WarrenWeckesser
Copy link
Member Author

@rlucas7 wrote

@WarrenWeckesser this should probably go under the contingency namespace as well.

Is there any reason it should be here in stats?

fisher_exact has been in the stats namespace for many years. I don't think we want to change that now. And even if we wanted to, that change is orthogonal to the changes in this PR.


For reference, a bit of history (also relevant to #13153 and #13048): the contingency submodule was added when I added chi2_contingency to stats. Because of the general preference for a flat namespace, it was natural to put chi2_contingency in the stats namespace. The two functions margins and expected_freq that are used by chi2_contingency seemed like they might be useful tools, so, after some discussion (which you can read in the original pull request), the submodule stats.contingency was made public, and the functions margins and expected_freq were made public in that submodule.

These days I try to be more conservative about what we make public, and if I went back in time, I'd argue against making margins and expected_freq public. They are somewhere specialized, and there was no specific demand for them other than their internal use in chi2_contingency. If we hadn't made margins and expected_freq public back then, we wouldn't have needed the public contingency submodule, and we might not be discussing now whether or not the new functions (i.e. association and relative_risk) belong in stats. Lessons learned and all that.

Having said that, I'm OK with using the stats.contingency namespace for new functions such as association and relative_risk. Since we have it, we might as well use it.

@WarrenWeckesser
Copy link
Member Author

After looking into more instances of the use of the odds ratio in texts and papers, I think it is important that we also include the confidence interval for the sample odds ratio. In a new commit, I have added the method sample_odds_ratio_ci() to the FisherExactResult class. This method implements a widely used formula for the confidence interval. I haven't added unit tests for this method yet.

@WarrenWeckesser
Copy link
Member Author

For example, with the new method, we can reproduce the results at https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717_ComparingFrequencies/PH717_ComparingFrequencies8.html:

In [54]: c = np.array([[992, 2260], [165, 1017]])                                         

In [55]: result = fisher_exact(c)                                                         

In [56]: result.sample_odds_ratio                                                         
Out[56]: 2.7054545454545456

In [57]: result.sample_odds_ratio_ci()                                                    
Out[57]: ConfidenceInterval(low=2.2583385178687365, high=3.2410926172522103)

@mdhaber
Copy link
Contributor

mdhaber commented Dec 5, 2020

@WarrenWeckesser If I understand correctly, a lot of this will be implementation of Fisher's NCHG distribution. Before I review all that, we should make some decisions about mdhaber#31. If FNCHG is going to be added as a distribution, shouldn't this rely on that? If so, I think we should add FNCHG first, then add this.

Also, I suppose I can use a separate diff tool to review the changes against the existing fisher_exact, but would it be crazy to make that a separate PR? One in which we move fisher_exact to its own file (otherwise unchanged) and one in which we add the odds ratio CI?

@mdhaber
Copy link
Contributor

mdhaber commented Dec 8, 2020

If this will close gh-11131, maybe add that up top so it will be closed when this merges.

@WarrenWeckesser
Copy link
Member Author

I decided it makes more sense to create a new function for the odds ratio statistic. The new PR gh-13340 is a replacement for this PR. In the new PR, I tweak the docstring of fisher_exact, but otherwise that function is not touched.

@WarrenWeckesser
Copy link
Member Author

Replaced by gh-13340, so closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants