Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P values of Enrichr are not uniform calculated when switching from online mode to local mode #122

Closed
hsiaoyi0504 opened this issue May 26, 2021 · 13 comments

Comments

@hsiaoyi0504
Copy link
Contributor

Setup

I am reporting a problem with GSEApy version, Python version, and operating
system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import gseapy; print(gseapy.__version__)
3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:10:52) 
[Clang 11.1.0 ]
CPython
macOS-11.2.3-arm64-arm-64bit
0.10.4

Expected behaviour

The p-values should be the same when switching the online mode to local mode.

Actual behaviour

The p-values seem to be different switching the online mode to local mode.

Steps to reproduce

On-line mode:

enr2_down = gp.enrichr(downregulated_genes.index.to_list(), gene_sets='KEGG_2019_Human',
                 organism='Human', # don't forget to set organism to the one you desired! e.g. Yeast
                 description='test_name',
                 outdir='enrichr_kegg', cutoff=0.5)
enr2_down.results.head(20)

image

Local mode (KEGG_2019_Human.gmt file is download from the Enrichr website):

enr2_down_local = gp.enrichr(downregulated_genes.index.to_list(), gene_sets='./data/KEGG_2019_Human.gmt',
                 organism='Human', # don't forget to set organism to the one you desired! e.g. Yeast
                 description='test_name',
                 outdir='enrichr_kegg', cutoff=0.5)
enr2_down_local.results.sort_values('P-value').head(20)

image

@hsiaoyi0504
Copy link
Contributor Author

One thing I notice is that the total numbers of genes in PI3K-Akt signaling pathway term (354 and 352) are different in these two modes. Does that mean the gmt files used by the Enrichr sever are different from those we can download from Enrichr sever?

@hsiaoyi0504
Copy link
Contributor Author

Another thought on this issue is that there are total 49 genes with term 'Malaria', but result of the local mode shows only 48 genes there.

@hsiaoyi0504
Copy link
Contributor Author

I suspect this is the origin of the issue:

category = category.intersection(background)

There might be some genes in KEGG_2019_HUMAN not included in the
https://github.com/zqfang/GSEApy/blob/149f3ec0fabaae1ceefc1eec4a73ec423fd41cb1/gseapy/data/hsapiens_gene_ensembl.background.genes.txt

@hsiaoyi0504
Copy link
Contributor Author

hsiaoyi0504 commented May 26, 2021

For example, KLRC4-KLRK1 gene (it's associated with the term 'Malaria') is not in gseapy/data/hsapiens_gene_ensembl.background.genes.txt. Where we can get an updated version of this file?

@zqfang
Copy link
Owner

zqfang commented May 26, 2021

the could set the background gene list by your own.

enr2_down_local = gp.enrichr(downregulated_genes.index.to_list(), gene_sets='./data/KEGG_2019_Human.gmt',
                 organism='Human', 
                 background= ['gene1', 'gene2', 'gene3', ....], 
                 outdir='enrichr_kegg', cutoff=0.5)
enr2_down_local.results.sort_values('P-value').head(20)

you could get the updated file from biomart ensembl: http://uswest.ensembl.org/biomart/martview/e666d46813690a3366756dfcbc11466e

@hsiaoyi0504
Copy link
Contributor Author

Thanks. I have downloaded it and opened a pull request #123.

@hsiaoyi0504
Copy link
Contributor Author

Even with the updated background, the p-values are still different. The number of the genes with "PI3K-Akt signaling pathway" now becomes 351. I am wondering if we should not do an intersection here:

category = category.intersection(background)

@zqfang
Copy link
Owner

zqfang commented May 27, 2021

@hsiaoyi0504 , sorry, the intersection for the background gene is necessary here. We need to makesure all genes could be found in the background. That's, it's weird(wrong) that you draw a blue ball from the urn which only contains red and white ball.

@hsiaoyi0504
Copy link
Contributor Author

hsiaoyi0504 commented May 27, 2021

Yes, I understand that. However, given the results I get here, some genes in KEGG_2019_HUMAN are not in the human gene list fetched from the Ensembl through Biomart. It's very weird. What I propose is a temporary fix like if background = 'hsapiens_gene_ensembl' or 'mmusculus_gene_ensembl', no need to do the intersection operation.

@zqfang
Copy link
Owner

zqfang commented May 28, 2021

@hsiaoyi0504 , If you go deep, you need to curated a background genes annotated just only from KEGG database (for the KEGG_2019_HUMAN.gmt). As far as I know, KEGG covers much less genes than ensembl or NCBI.

Why not just input a number for background to skip the background checking ? your gene (KLRC4-KLRK1) will be counted.

@hsiaoyi0504
Copy link
Contributor Author

hsiaoyi0504 commented May 28, 2021

That's a good workaround. I didn't think of that. Thanks for the suggestion.

@zqfang
Copy link
Owner

zqfang commented May 28, 2021

I'm wondering why you need the local mode instead of using EnrichrAPI. EnrirchR is more powerfull and has more metrics

@hsiaoyi0504
Copy link
Contributor Author

hsiaoyi0504 commented May 28, 2021

As you mentioned in #121, background genes can only be customized in the local mode.
At the same time, the machines I used are not guaranteed to have access to the internet in some scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants