Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enrichr: Local mode of GO/enrichment analysis does not provide Odds Ratios in the results #132

Closed
sreichl opened this issue Sep 26, 2021 · 17 comments

Comments

@sreichl
Copy link

sreichl commented Sep 26, 2021

Hi, thanks for this package!
This is more of a feature request than a bug report. I think the odds ratio is quite an important value to interpret enrichment analysis results.

Setup

I am reporting a problem with GSEApy version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import gseapy; print(gseapy.__version__)

3.9.5 (default, Jun 4 2021, 12:28:51)
[GCC 7.5.0]
CPython
Linux-3.10.0-1062.18.1.el7.x86_64-x86_64-with-glibc2.17
0.10.5

Expected behaviour

When using the Enrichr functionality of GSEApy gseapy.enrichr() in local mode, to be able to provide a custom background gene set, the resulting data frame contains the same columns (including odds ratio) as in the vanilla mode (direct query of enrichr).

Actual behaviour

When using the Enrichr functionality of GSEApy gseapy.enrichr() in local mode, to be able to provide a custom background gene set, the resulting data frame does not contain Odds Ratios, although the vanilla mode returns Odds Ratios from enrichr.

Steps to reproduce

It is already apparent in the respective example in the docs.

@zqfang
Copy link
Owner

zqfang commented Sep 29, 2021

Do you mind share me the code you've used ? @sreichl

@sreichl
Copy link
Author

sreichl commented Oct 11, 2021

Hi @zqfang,
sorry for the late reply.

Sure, I can share the code I use now for determining odds ratio manually based on the following variables and results I get from GSEApy:

  • bg_n = number of background genes
  • gene_list_n = number of genes in the gene list (ie query genes)
  • gene_set_n = number of genes in the (corrected by background) gene set (eg pathways/GO terms)
  • overlap_n = number of genes overlapping with between the (corrected by background) gene set and the gene list

The last two parameters can be extracted directly from the GSEApy results, column "Overlap".

def odds_ratio_calc(bg_n, gene_list_n, gene_set_n, overlap_n): 
    import scipy.stats as stats

    # make contingency table
    table=np.array([[gene_set_n, bg_n-gene_set_n],[overlap_n, gene_list_n-overlap_n]])

    # perform Fisher's exact test
    oddsratio, pvalue = stats.fisher_exact(table)

    # return (inverse) oddsratio
    return (1/oddsratio)

You could also construct the contingency table differently to then directly get the odds ratio you are looking for, without the 1/x step at the end.

I hope this helps!

Cheers, S

@zqfang
Copy link
Owner

zqfang commented Oct 25, 2021

Sorry for rely late. Just back to work now. This feature could be implemented. The Odds ratio is new thing since the Enrichr Sever updated.

zqfang pushed a commit that referenced this issue Oct 25, 2021
@sreichl
Copy link
Author

sreichl commented Oct 26, 2021

thanks for implementing!

@zqfang zqfang closed this as completed Oct 26, 2021
@sreichl
Copy link
Author

sreichl commented Oct 28, 2021

Hi,
sorry to bother you again, but I compared the results of your currently implemented odds ratio calculation and my own calculations and noticed that they are not the same (close but different).

Therefore I looked into the committed code and I think the formula you are applying for determining the odds ratio is not correct.

instead of this

expect_count = k*m/bg
oddr= x / expect_count

you should use this

oddr= (x*(bg-m))/(m*(k-x))

if I am not mistaken.

I hope this helps.
Cheers, S

@zqfang
Copy link
Owner

zqfang commented Oct 28, 2021

Great. Thanks for the correction. I thought odd ratio was ( experiment_hits / expected_hits ).

zqfang pushed a commit that referenced this issue Oct 28, 2021
@sreichl
Copy link
Author

sreichl commented Oct 29, 2021

Hi, no worries and thanks.

As per wikipedia: "An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B."

Or in this recommended paper: "The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure."

Exposures in our case are the gene lists (eg from a differential analysis) and Outcomes are the gene sets in which we are testing enrichment (eg GO terms) or the other way around.

Cheers, S

@zqfang
Copy link
Owner

zqfang commented Oct 29, 2021

Thanks for the clarification. I'm really appreciate it.

@yossi-liron
Copy link

yossi-liron commented Dec 5, 2021

The added code causes divion-by-zero crash when x==k in the expression
oddr= (x*(bg-m))/(m*(k-x))
which occurs when the enriched list is a subset of the category

@sreichl
Copy link
Author

sreichl commented Dec 5, 2021

Hi @yossi-liron, great catch that is true!

I think this only occurs if the query gene list completely overlaps with the category gene list (eg GO Term).
Still, this exception has to be addressed, and apparently we are not the first ones to encounter this.

The pragmatic solution is to add 0.5 to every cell of the contingency table to avoid divisions by zero, called Haldane-Anscombe correction.

This would mean in the case of k==x

oddr= ((x+0.5)*(bg-m+0.5))/((m+0.5)*(k-x+0.5))

What do you think?

Cheers, S

@yossi-liron
Copy link

Two alternatives : returning an 'inf' value or the suggested correction formulat.
I'd vote for the formula , because:

  1. Exception will be prevented (main problem solved)
  2. When x increases in size toward m, the oddr does move in the right direction instead of being stuck on the less informative inf value.

..And if the caller insists on rejecting the oddr , she still has the information that x==k from the overlap value

@136s
Copy link
Contributor

136s commented Dec 8, 2023

Hi,

Sorry for replying to this closed issue. I just had a question about the implementation based on this discussion.

Why did @zqfang adopt the formula oddr = (x * (bg - m)) / (m * (k - x)) to odds ratio? This is equivalent to oddr = (a * (c + d)) / ((a + b) * c) when I refered to the docstring in gseapy.stats.calc_pvalues().
However, as mentioned within the docstring, the enrichr's definition is oddsRatio = (1.0 * a * d) / Math.max(1.0 * b * c, 1).
In this definition the odds ratio of enrichr and gseapy are different, are they not?

where:
a are the overlapping genes
b are the genes in the annotated set - overlapping genes
c are the genes in the input set - overlapping genes
d are the 20,000 genes (or total genes in the background) - genes in the annotated set - genes in the input set + overlapping genes

x = len(query.intersection(category)) # = a
bg = len(background) # = a + b + c + d
m = len(category)  # = a + b
k = len(query)  # = a + c

In the above equation, 0.5 is omitted for Haldane-Anscombe correction for simplicity.

It would be helpful if you could point out my misunderstanding.

@sreichl
Copy link
Author

sreichl commented Dec 8, 2023

Hi @136s ,
good catch. I was stumped at first, but I think they have an error in their documentation as multiplication with 1 does not make sense in their formula. I have created an issue in their GitHub repository: MaayanLab/enrichr_issues#78.

If the formula is indeed wrong and instead of 1.0*... it is always 1.0+..., then the two formulas are equivalent, right?

$$\frac{a \cdot (c + d)}{(a + b) \cdot c} = \frac{ac + ad}{ac + bc} = \frac{ac \cdot (1 + ad)}{ac \cdot (1 + bc)} = \frac{1 + ad}{1 + bc}$$

@136s
Copy link
Contributor

136s commented Dec 19, 2023

Hi, @sreichl ,

Since there seems to be no replies on enrichr_issues, I will write my thoughts.

I read "Definition in terms of joint and conditional probabilities" section on wikipedia thought the following:

Where $Y=1$ for inclusion in query and $X=1$ for inclusion in gene_set, $p_{11}=a, p_{10}=b, p_{01}=c, p_{00}=d$.
So, the odds ratio is ${\displaystyle {\frac {p_{11}p_{00}} {p_{10}p_{01}} }={\frac {ad}{bc}}}$.

@sreichl
Copy link
Author

sreichl commented Dec 20, 2023

Hi @136s ,
I think you are right and we have been missing two components in the formula. This would also explain the "weird" times 1 in the enrichr formula (maybe as a reminder of a simplification step).

I think the correct formula given our variables should

$$\frac{x\cdot(bg-m-k+x)}{(m-x)\cdot(k-x)} = \frac{ad}{bc}$$

in python w/ Haldane-Anscombe correction

oddr = ((x + 0.5) * (bg - m -k + x + 0.5)) / ((m - x + 0.5) * (k - x + 0.5)

What do you think @136s @zqfang @yossi-liron?

@136s
Copy link
Contributor

136s commented Dec 21, 2023

Hi @sreichl,

Thank you for agreeing with my opinion.
Thanks also for reconsidering the correct formula.
I can agree with the pull request #238 I submitted the other day as it is also the same formula.

@zqfang
Copy link
Owner

zqfang commented Dec 21, 2023

Thanks @sreichl , @136s . It's not easy to detect this issue in so much detail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants