Enrichr: Local mode of GO/enrichment analysis does not provide Odds Ratios in the results #132

sreichl · 2021-09-26T13:47:29Z

Hi, thanks for this package!
This is more of a feature request than a bug report. I think the odds ratio is quite an important value to interpret enrichment analysis results.

Setup

I am reporting a problem with GSEApy version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import gseapy; print(gseapy.__version__)

3.9.5 (default, Jun 4 2021, 12:28:51)
[GCC 7.5.0]
CPython
Linux-3.10.0-1062.18.1.el7.x86_64-x86_64-with-glibc2.17
0.10.5

Expected behaviour

When using the Enrichr functionality of GSEApy gseapy.enrichr() in local mode, to be able to provide a custom background gene set, the resulting data frame contains the same columns (including odds ratio) as in the vanilla mode (direct query of enrichr).

Actual behaviour

When using the Enrichr functionality of GSEApy gseapy.enrichr() in local mode, to be able to provide a custom background gene set, the resulting data frame does not contain Odds Ratios, although the vanilla mode returns Odds Ratios from enrichr.

Steps to reproduce

It is already apparent in the respective example in the docs.

The text was updated successfully, but these errors were encountered:

zqfang · 2021-09-29T14:44:29Z

Do you mind share me the code you've used ? @sreichl

sreichl · 2021-10-11T14:24:59Z

Hi @zqfang,
sorry for the late reply.

Sure, I can share the code I use now for determining odds ratio manually based on the following variables and results I get from GSEApy:

bg_n = number of background genes
gene_list_n = number of genes in the gene list (ie query genes)
gene_set_n = number of genes in the (corrected by background) gene set (eg pathways/GO terms)
overlap_n = number of genes overlapping with between the (corrected by background) gene set and the gene list

The last two parameters can be extracted directly from the GSEApy results, column "Overlap".

def odds_ratio_calc(bg_n, gene_list_n, gene_set_n, overlap_n): 
    import scipy.stats as stats

    # make contingency table
    table=np.array([[gene_set_n, bg_n-gene_set_n],[overlap_n, gene_list_n-overlap_n]])

    # perform Fisher's exact test
    oddsratio, pvalue = stats.fisher_exact(table)

    # return (inverse) oddsratio
    return (1/oddsratio)

You could also construct the contingency table differently to then directly get the odds ratio you are looking for, without the 1/x step at the end.

I hope this helps!

Cheers, S

zqfang · 2021-10-25T21:51:26Z

Sorry for rely late. Just back to work now. This feature could be implemented. The Odds ratio is new thing since the Enrichr Sever updated.

sreichl · 2021-10-26T15:04:54Z

thanks for implementing!

sreichl · 2021-10-28T12:12:58Z

Hi,
sorry to bother you again, but I compared the results of your currently implemented odds ratio calculation and my own calculations and noticed that they are not the same (close but different).

Therefore I looked into the committed code and I think the formula you are applying for determining the odds ratio is not correct.

instead of this

expect_count = k*m/bg
oddr= x / expect_count

you should use this

oddr= (x*(bg-m))/(m*(k-x))

if I am not mistaken.

I hope this helps.
Cheers, S

zqfang · 2021-10-28T17:38:29Z

Great. Thanks for the correction. I thought odd ratio was ( experiment_hits / expected_hits ).

sreichl · 2021-10-29T14:41:09Z

Hi, no worries and thanks.

As per wikipedia: "An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B."

Or in this recommended paper: "The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure."

Exposures in our case are the gene lists (eg from a differential analysis) and Outcomes are the gene sets in which we are testing enrichment (eg GO terms) or the other way around.

Cheers, S

zqfang · 2021-10-29T18:04:03Z

Thanks for the clarification. I'm really appreciate it.

yossi-liron · 2021-12-05T08:55:03Z

The added code causes divion-by-zero crash when x==k in the expression
oddr= (x*(bg-m))/(m*(k-x))
which occurs when the enriched list is a subset of the category

sreichl · 2021-12-05T10:11:11Z

Hi @yossi-liron, great catch that is true!

I think this only occurs if the query gene list completely overlaps with the category gene list (eg GO Term).
Still, this exception has to be addressed, and apparently we are not the first ones to encounter this.

The pragmatic solution is to add 0.5 to every cell of the contingency table to avoid divisions by zero, called Haldane-Anscombe correction.

This would mean in the case of k==x

oddr= ((x+0.5)*(bg-m+0.5))/((m+0.5)*(k-x+0.5))

What do you think?

Cheers, S

yossi-liron · 2021-12-06T11:39:51Z

Two alternatives : returning an 'inf' value or the suggested correction formulat.
I'd vote for the formula , because:

Exception will be prevented (main problem solved)
When x increases in size toward m, the oddr does move in the right direction instead of being stuck on the less informative inf value.

..And if the caller insists on rejecting the oddr , she still has the information that x==k from the overlap value

136s · 2023-12-08T05:40:20Z

Hi,

Sorry for replying to this closed issue. I just had a question about the implementation based on this discussion.

Why did @zqfang adopt the formula oddr = (x * (bg - m)) / (m * (k - x)) to odds ratio? This is equivalent to oddr = (a * (c + d)) / ((a + b) * c) when I refered to the docstring in gseapy.stats.calc_pvalues().
However, as mentioned within the docstring, the enrichr's definition is oddsRatio = (1.0 * a * d) / Math.max(1.0 * b * c, 1).
In this definition the odds ratio of enrichr and gseapy are different, are they not?

where:
a are the overlapping genes
b are the genes in the annotated set - overlapping genes
c are the genes in the input set - overlapping genes
d are the 20,000 genes (or total genes in the background) - genes in the annotated set - genes in the input set + overlapping genes

x = len(query.intersection(category)) # = a
bg = len(background) # = a + b + c + d
m = len(category)  # = a + b
k = len(query)  # = a + c

In the above equation, 0.5 is omitted for Haldane-Anscombe correction for simplicity.

It would be helpful if you could point out my misunderstanding.

sreichl · 2023-12-08T11:03:53Z

Hi @136s ,
good catch. I was stumped at first, but I think they have an error in their documentation as multiplication with 1 does not make sense in their formula. I have created an issue in their GitHub repository: MaayanLab/enrichr_issues#78.

If the formula is indeed wrong and instead of 1.0*... it is always 1.0+..., then the two formulas are equivalent, right?

$$\frac{a \cdot (c + d)}{(a + b) \cdot c} = \frac{ac + ad}{ac + bc} = \frac{ac \cdot (1 + ad)}{ac \cdot (1 + bc)} = \frac{1 + ad}{1 + bc}$$

136s · 2023-12-19T08:14:46Z

Hi, @sreichl ,

Since there seems to be no replies on enrichr_issues, I will write my thoughts.

I read "Definition in terms of joint and conditional probabilities" section on wikipedia thought the following:

Where $Y=1$ for inclusion in query and $X=1$ for inclusion in gene_set, $p_{11}=a, p_{10}=b, p_{01}=c, p_{00}=d$.
So, the odds ratio is ${\displaystyle {\frac {p_{11}p_{00}} {p_{10}p_{01}} }={\frac {ad}{bc}}}$.

sreichl · 2023-12-20T12:28:24Z

Hi @136s ,
I think you are right and we have been missing two components in the formula. This would also explain the "weird" times 1 in the enrichr formula (maybe as a reminder of a simplification step).

I think the correct formula given our variables should

$$\frac{x\cdot(bg-m-k+x)}{(m-x)\cdot(k-x)} = \frac{ad}{bc}$$

in python w/ Haldane-Anscombe correction

oddr = ((x + 0.5) * (bg - m -k + x + 0.5)) / ((m - x + 0.5) * (k - x + 0.5)

What do you think @136s @zqfang @yossi-liron?

136s · 2023-12-21T01:15:54Z

Hi @sreichl,

Thank you for agreeing with my opinion.
Thanks also for reconsidering the correct formula.
I can agree with the pull request #238 I submitted the other day as it is also the same formula.

zqfang · 2023-12-21T21:31:39Z

Thanks @sreichl , @136s . It's not easy to detect this issue in so much detail.

zqfang pushed a commit that referenced this issue Oct 25, 2021

add Odds ratio, #132

3ee1eec

zqfang closed this as completed Oct 26, 2021

zqfang pushed a commit that referenced this issue Oct 28, 2021

odds ratio correction, #132

0b350ee

zqfang pushed a commit that referenced this issue Dec 7, 2021

Haldane-Anscombe correction for odds ratio, #132

b363801

136s mentioned this issue Dec 8, 2023

Odds ratios differ between enrichr and gseapy.enrich() #237

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enrichr: Local mode of GO/enrichment analysis does not provide Odds Ratios in the results #132

Enrichr: Local mode of GO/enrichment analysis does not provide Odds Ratios in the results #132

sreichl commented Sep 26, 2021

zqfang commented Sep 29, 2021

sreichl commented Oct 11, 2021

zqfang commented Oct 25, 2021

sreichl commented Oct 26, 2021

sreichl commented Oct 28, 2021

zqfang commented Oct 28, 2021

sreichl commented Oct 29, 2021

zqfang commented Oct 29, 2021

yossi-liron commented Dec 5, 2021 •

edited

sreichl commented Dec 5, 2021

yossi-liron commented Dec 6, 2021

136s commented Dec 8, 2023

sreichl commented Dec 8, 2023

136s commented Dec 19, 2023

sreichl commented Dec 20, 2023

136s commented Dec 21, 2023

zqfang commented Dec 21, 2023

Enrichr: Local mode of GO/enrichment analysis does not provide Odds Ratios in the results #132

Enrichr: Local mode of GO/enrichment analysis does not provide Odds Ratios in the results #132

Comments

sreichl commented Sep 26, 2021

Setup

Expected behaviour

Actual behaviour

Steps to reproduce

zqfang commented Sep 29, 2021

sreichl commented Oct 11, 2021

zqfang commented Oct 25, 2021

sreichl commented Oct 26, 2021

sreichl commented Oct 28, 2021

zqfang commented Oct 28, 2021

sreichl commented Oct 29, 2021

zqfang commented Oct 29, 2021

yossi-liron commented Dec 5, 2021 • edited

sreichl commented Dec 5, 2021

yossi-liron commented Dec 6, 2021

136s commented Dec 8, 2023

sreichl commented Dec 8, 2023

136s commented Dec 19, 2023

sreichl commented Dec 20, 2023

136s commented Dec 21, 2023

zqfang commented Dec 21, 2023

yossi-liron commented Dec 5, 2021 •

edited