Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: stats: Add the function odds_ratio. #13340

Merged
merged 37 commits into from
Aug 8, 2022
Merged

Conversation

WarrenWeckesser
Copy link
Member

The new function scipy.stats.contingency.odds_ratio computes
the odds ratio and p-value for a 2x2 contingency table. An option
allows the user to select either the sample odds ratio or the
conditional maximum likelihood estimate of the odds ratio. The
returned object provides a method for computing the confidence
interval of the odds ratio.

Closes gh-11131.

The new function `scipy.stats.contingency.odds_ratio` computes
the odds ratio and p-value for a 2x2 contingency table.  An option
allows the user to select either the sample odds ratio or the
conditional maximum likelihood estimate of the odds ratio.  The
returned object provides a method for computing the confidence
interval of the odds ratio.

Closes scipygh-11131.
@WarrenWeckesser WarrenWeckesser added scipy.stats enhancement A new feature or improvement labels Jan 4, 2021
@WarrenWeckesser
Copy link
Member Author

This PR replaces gh-12288. This PR creates a new function in the scipy.stats.contingency namespace specifically for the odds ratio statistic, similar to how the proposed function relative_risk in gh-13048 creates a function for another widely used statistic for 2x2 contingency tables, the relative risk.

There is code in the file _odds_ratio.py that implements a private version of Fisher's noncentral hypergeometric distribution. Some of that code will be able to use the implementation in gh-13330 instead, but we don't necessarily have to block this PR waiting for gh-13330. If it turns out that gh-13330 is merged first, I can update this PR to use it, but if this PR goes in first, the private implementation can be updated after gh-13330 is merged.

Add odds_ratio and OddsRatioResult to the contingency module docstring.
This should make the refguide-check happy.
Copy link
Contributor

@sethtroisi sethtroisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments, I hope to dive into _odds_ratio.py tomorrow.

scipy/stats/stats.py Outdated Show resolved Hide resolved
scipy/stats/tests/generate_fisher_exact_results_from_r.R Outdated Show resolved Hide resolved
scipy/stats/tests/generate_fisher_exact_results_from_r.R Outdated Show resolved Hide resolved
@mdhaber
Copy link
Contributor

mdhaber commented Jan 4, 2021

Some of that code will be able to use the implementation in gh-13330 instead, but we don't necessarily have to block this PR waiting for gh-13330.

@WarrenWeckesser you proposed the addition of the two noncentral hypergeometric distributions. I implemented them in gh-13330 as we discussed. Rather than asking others to review the private code, then re-review when you swap things out, would it be more efficient for you to review gh-13330 first?

@WarrenWeckesser
Copy link
Member Author

@mdhaber, yes, that makes sense. I'll mark this as draft for now.

* Use the new nchypergeom_fisher distribution instead of the private
  implementation.
* Don't put `OddsRatioResult` into the public `scipy.stats` namespace.
@WarrenWeckesser WarrenWeckesser marked this pull request as ready for review February 22, 2021 20:40
@WarrenWeckesser
Copy link
Member Author

I have updated this PR to use the new nchypergeom_fisher distribution.

scipy/stats/__init__.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
* Tweak whitespace in the fisher_exact docstring.
* Add a comment in the R script generate_fisher_exact_results_from_r.R
  that shows how to run the script.
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_odds_ratio.py Show resolved Hide resolved
scipy/stats/stats.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far I've reviewed documentation, tests, and sample odds ratio calculations (odds ratio, p value, and CI). Calculations look good; biggest comments are that sample odds ratio calculations and input validation need tests.

I see the reference for the conditional odds ratio CI. Is there one particular reference that you'd recommend I follow for the conditional odds ratio itself and for the fact that the fisher_exact p-value should be the same as this p-value?

scipy/stats/stats.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Show resolved Hide resolved
scipy/stats/_odds_ratio.py Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Show resolved Hide resolved
@WarrenWeckesser
Copy link
Member Author

Here are some notes about relevant references, for anyone looking to learn more about inference with the odds ratio.

  • The following text covers the calculation in detail, in section 4.1.2. Specifically, equation (4.7) is the equation that must be solved for the CMLE.

    • H. Sahai and A. Khurshid (1996), Statistics in Epidemiology: Methods, Techniques, and Applications, CRC Press LLC, Boca Raton, Florida.
  • This publication is often cited when the CMLE is disussed:

    • Breslow NE, Day NE. Statistical methods in cancer research. Volume I - The analysis of case-control studies. IARC Sci Publ. 1980;(32):5-338. PMID: 7216345.

    The PDF is available at

    https://publications.iarc.fr/Book-And-Report-Series/Iarc-Scientific-Publications/Statistical-Methods-In-Cancer-Research-Volume-I-The-Analysis-Of-Case-Control-Studies-1980

    The odds ratio and the CMLE are discussed in section 4.2, beginning on page 124. The defining equation for the CMLE is (4.5).

  • Inference for the odds ratio is discussed in section 10.3 of the text

    • M. Hollander, D. A. Wolfe and E. Chicken (2014), Nonparametric Statistical Methods (third edition), John Wiley & Sons, Inc., Hoboken, NJ, USA.

    The CMLE is covered in comments 14, 15 and 16 in the "Comments" subsection on pages 519-520. Note that there is a typographical error in comment 16: the reference to equation (10.38) should be (10.58).

scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
* Use nchypergeom_fisher.support(M, n, N, 1) to compute the support
  (instead of hypergeom.support(M, n, N)).
* Add more tests of parameter validation for odds_ratio() and for the
  odds_ratio_ci() method of the result.
Copy link
Contributor

@mdhaber mdhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was really close. Shall we finish it? I think the remaining request was just documenting these alternatives and for more specific comments to be added to the code to point readers to the appropriate reference + equation.

scipy/stats/_odds_ratio.py Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
@mdhaber
Copy link
Contributor

mdhaber commented May 6, 2021

I'm re-running CircleCI. Looked like a temporary glitch.

@mdhaber
Copy link
Contributor

mdhaber commented May 7, 2021

Re-ran CircleCI; same failure. Not seeing it in other PRs. But I can't see how the latest two commits would have cause this.

Aside: something I thought about when we were discussion the alternatives - from a UI perspective, how should we think about the relationship between this function and fisher_exact? Does this function essentially make fisher_exact obsolete but retain it for backwards compatibility (and its commonly-used name for the test)?

Update 7/30/2022: the function now only provides a point estimate and confidence interval of the statistic rather than duplicating the p-value of fisher_exact. I made this and other changes to bring the interfaces more in line with typical stats functions, avoiding exposure of attributes unless necessary. Future PRs can build on this if desired.

scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
scipy/stats/_odds_ratio.py Outdated Show resolved Hide resolved
@@ -427,6 +427,7 @@
contingency.margins
contingency.relative_risk
contingency.association
contingency.odds_ratio
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't object to this being in scipy.stats, but here is fine, too. Ultimately, it was probably added here because of #13048 (review).

Comment on lines +331 to +332
table : array_like of ints
A 2x2 contingency table. Elements must be non-negative integers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this stays in stats.contingency, we could consider calling this observed to match chi2_contingency, expected_freq, and association.

The documentation explains how the table is laid out, so I don't see a need for the four separate arguments of relative_risk.

If this goes in stats, I would keep table to match fisher_exact, barnard_exact, and boschloo_exact.

On the other hand, I can see leaving it as table either way and using _rename_parameter to make the stats.contingency functions consistent with stats.

----------
table : array_like of ints
A 2x2 contingency table. Elements must be non-negative integers.
kind : str, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not much precedent for kind in scipy.stats, but method is not appropriate IMO since it's more a matter of choice/definition rather than algorithm, and I'd avoid the keyword type.

@mdhaber
Copy link
Contributor

mdhaber commented Aug 8, 2022

No responses to email, but this seems fundamental, was requested by a user, and at least two maintainers have wanted to have this function, so let's merge it. For now, this has been trimmed down to the bare minimum (statistic and confidence interval), but if we want to change parameter names, add output, etc., that can all be done in a backward compatible way.

@mdhaber mdhaber merged commit f5bd9af into scipy:main Aug 8, 2022
tartopohm pushed a commit to tartopohm/scipy that referenced this pull request Aug 13, 2022
* ENH: stats: Add the function odds_ratio.

The new function `scipy.stats.contingency.odds_ratio` computes
the odds ratio for a 2x2 contingency table.  An option
allows the user to select either the sample odds ratio or the
conditional maximum likelihood estimate of the odds ratio.  The
returned object provides a method for computing the confidence
interval of the odds ratio.

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@mdhaber mdhaber added this to the 1.10.0 milestone Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: stats.fisher_exact does not match R functionality for oddsratio calculation
3 participants