Consider evidence type in GOEA #119

Closed
jeffsmith8 opened this issue Jan 30, 2019 · 17 comments

Comments

@jeffsmith8

Hi
Apologies if I've missed it somewhere, but does goatools also have a facility for limiting associations to specific evidence types for p-value testing?

I'm also interested in filtering by qualifier as raised earlier, though I've noticed that the gene2go file is nearly wholly empty of qualifier statements (at least for human, ergo I assume for all).

@dvklopfenstein
Collaborator

Hello,

Thank you for your interest in GOATOOLS and for taking the time to write us.

Yes. GOATOOLS does have a facility for limiting associations to specific evidence types for pval testing. Are you running from your own script or from our scripts/find_enrichment.py script?

If you are running from your own scripts, use the keyword argument, evidence_set, when calling the function read_ncbi_gene2go. The value for evidence_set should be a set containing the evidence codes that you would like.
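
To make the idea concrete, here is a minimal sketch of evidence-code filtering in plain Python. It does not call GOATOOLS and is not its implementation. It assumes the tab-separated gene2go column order published by NCBI (tax_id, GeneID, GO_ID, Evidence, ...); the function name `filter_gene2go` and the sample rows are hypothetical and only for illustration.

```python
import csv
import io
from collections import defaultdict

# Toy gene2go-style rows (illustrative values, not real annotations).
# Assumed columns: tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category
SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "9606\t1\tGO:0000001\tIDA\t-\tterm a\t-\tProcess\n"
    "9606\t1\tGO:0000002\tIEA\t-\tterm b\t-\tFunction\n"
    "9606\t2\tGO:0000001\tIMP\t-\tterm a\t-\tProcess\n"
    "9606\t2\tGO:0000003\tTAS\t-\tterm c\t-\tComponent\n"
)

def filter_gene2go(handle, evidence_set):
    """Return {GeneID: {GO_ID, ...}}, keeping rows whose evidence code is in evidence_set."""
    id2gos = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip the header line and blank lines
        gene_id, go_id, evidence = row[1], row[2], row[3]
        if evidence in evidence_set:
            id2gos[gene_id].add(go_id)
    return dict(id2gos)

experimental = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
id2gos = filter_gene2go(io.StringIO(SAMPLE), experimental)
print(id2gos)
```

The resulting gene-to-GO-terms dictionary has the general shape that an enrichment analysis consumes, so the filter can sit in front of whatever association reader you already use.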

If you are running from our find_enrichment script, we need to add a command-line-argument to pass this information to the GOEA.

Here is the plan to address your issue:

  1. Give the user the ability to specify the evidence codes to the find_enrichment script.
  2. Add an example in the Jupyter notebooks showing how to do it from inside your own scripts.
  3. Add evidence code tests.

Thank you again for your interest in GOATOOLS.

@dvklopfenstein
Collaborator

Hello @jeffsmith8 , Thank you for your excellent request. It will surely be a popular feature. I am working on it now.

To better inform the tests that need to be written: how are you doing your enrichment analyses? Are you running the script, find_enrichment.py? Or are you writing your own scripts?

@jeffsmith8
Author

Hey there

I haven't really progressed with this because I've been occupied with a lot of lab work and some more urgent analysis. Nevertheless it's still a feature I would utilise in the future because my usual practice is to only review GO Associations that fall in the experimental evidence groups as below (comment snippet from one of my own GO filtering functions):

Evidence Codes:
1: ['GO_ExpEvidence','EXP','IDA','IPI','IMP','IGI','IEP']
2: ['GO_HTPEvidence','HTP','HDA','HMP','HGI','HEP']
3: ['GO_CompEvidence','ISS','ISO','ISA','ISM','IGC','IBA','IBD','IKR','IRD','RCA']
4: ['GO_AuthEvidence','TAS','NAS']
5: ['GO_CurEvidence','IC','ND']
6: ['GO_eEvidence','IEA']

Not sure if that helps?
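
For anyone wanting to reuse the grouping above, it could be expressed as a small lookup table. This is only a sketch: the group labels are the ones from the comment snippet, while `keep_groups` and the toy annotation tuples are hypothetical names and data made up for illustration.

```python
# Evidence groups as listed in the comment snippet above.
EVIDENCE_GROUPS = {
    "GO_ExpEvidence": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "GO_HTPEvidence": ["HTP", "HDA", "HMP", "HGI", "HEP"],
    "GO_CompEvidence": ["ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"],
    "GO_AuthEvidence": ["TAS", "NAS"],
    "GO_CurEvidence": ["IC", "ND"],
    "GO_eEvidence": ["IEA"],
}

# Reverse lookup: evidence code -> group label
CODE_TO_GROUP = {
    code: group for group, codes in EVIDENCE_GROUPS.items() for code in codes
}

def keep_groups(annotations, groups):
    """Keep (gene, go_id, evidence_code) tuples whose code belongs to one of `groups`."""
    return [anno for anno in annotations if CODE_TO_GROUP.get(anno[2]) in groups]

# Toy annotations (illustrative only, not real data)
annotations = [
    ("geneA", "GO:0000001", "IDA"),
    ("geneB", "GO:0000002", "IEA"),
    ("geneC", "GO:0000003", "HDA"),
]
kept = keep_groups(annotations, {"GO_ExpEvidence", "GO_HTPEvidence"})
print(kept)
```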

@dvklopfenstein
Collaborator

dvklopfenstein commented May 2, 2019

Thank you for taking the time to write us and for your interest in GOATOOLS. It was a great request. The information you provided in the last issue post was extremely helpful in determining the user-interface for the extended functionality.

I have standardized the way evidence codes can be used across all annotation formats (GPAD, GAF, and NCBI's gene2go).

To get a list of the Evidence codes, do:

$ python3 scripts/find_enrichment.py --ev_help_short

EVIDENCE GROUP AND CODES:
    Experimental       : EXP IDA IPI IMP IGI IEP
    Similarity         : ISS ISO ISA ISM IGC IBA IBD IKR IRD IMR
    Combinatorial      : RCA
    High_Throughput    : HTP HDA HMP HGI HEP
    Author             : TAS NAS
    Curatorial         : IC
    No biological data : ND
    Automatic          : IEA

To get a more detailed list of evidence codes containing their descriptions do:

$ python3 scripts/find_enrichment.py --ev_help

If you only wanted to use annotations returned from experimental evidence groups when using the find_enrichment.py script, you would use the include evidence argument, ev_inc:

--ev_inc=Experimental

This is the same as listing all experimental codes individually:

--ev_inc=EXP,IDA,IPI,IMP,IGI,IEP

You can also EXCLUDE evidence codes. If you want to use all evidence codes except the ones inferred from Electronic Annotation, use this argument:

--ev_exc=IEA

Please give it a try and let us know what you think. Thank you again for taking the time to contact us.

@cross12tamu

This is awesome! Great work!

@cross12tamu

cross12tamu commented May 2, 2019

I am curious whether there are plans to incorporate ECO codes for evidence type. I'm wondering if y'all would want a mapper or something, since I think the GOC is switching to ECO for gpad/gaf annotation development.

For example,

'IDA' == "ECO:0000314",
'IMP' == "ECO:0000315",

etc...etc...

@dvklopfenstein
Collaborator

dvklopfenstein commented May 2, 2019

Thank you! @cross12tamu! It was a lot of work because we unified handling of all the different annotation formats. Each format had been added at different times from different requests. Now they are all derived off of one base class.

We added a mapper to map ECO IDs to the evidence code letters, which was needed to keep support across all annotation formats consistent.

The mapper downloads this file:

https://raw.githubusercontent.com/evidenceontology/evidenceontology/master/gaf-eco-mapping-derived.txt

And then creates this Python module in GOATOOLS:

goatools/anno/eco2group.py

ECO2GRP = {
    'ECO:0000030': 'ISA',
    'ECO:0000031': 'ISA',
    'ECO:0000032': 'ISA',
    'ECO:0000053': 'IEA',
    'ECO:0000209': 'IEA',
    'ECO:0000210': 'IEA',
...
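
As an illustration of how such a mapping might be consumed, here is a small lookup sketch. It is not the GOATOOLS code: the table below contains only the entries quoted in this thread (the real mapping is much longer), and `ECO2CODE` and `to_evidence_code` are hypothetical names.

```python
# Entries copied from this thread only; the real mapping is much longer.
ECO2CODE = {
    "ECO:0000030": "ISA",
    "ECO:0000031": "ISA",
    "ECO:0000032": "ISA",
    "ECO:0000053": "IEA",
    "ECO:0000209": "IEA",
    "ECO:0000210": "IEA",
    "ECO:0000314": "IDA",
    "ECO:0000315": "IMP",
}

def to_evidence_code(value, eco2code=ECO2CODE):
    """Map an ECO ID to its letter code; pass letter codes through unchanged.

    Returns None for an ECO ID that is not in the table.
    """
    if value.startswith("ECO:"):
        return eco2code.get(value)
    return value

print(to_evidence_code("ECO:0000314"))
print(to_evidence_code("IMP"))
```

Returning None for an unknown ECO ID, rather than raising, lets the caller decide whether a missing mapping should be skipped or reported.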

@cross12tamu

awesome! 👍

@dvklopfenstein
Collaborator

dvklopfenstein commented May 2, 2019

@cross12tamu : That is a great idea to add ECO IDs too.

I will not be able to do it right now due to work on other high priority issues.

If you want to take a crack at it, go ahead. Before submitting, please write one or more tests that cover your new functionality and be sure to run:

$ make clobber
$ make pytest

Only GPAD supports the ECO IDs, so if a user specifies an ECO ID and is not using the GPAD format, you should ignore the user's ECO ID and write a message telling them that the other formats do not contain ECO IDs and so the user ECO IDs will be ignored.
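
The behavior requested above might look roughly like the following. This is a sketch under stated assumptions, not existing GOATOOLS code: the function name `split_user_codes`, the `anno_format` strings, and the wording of the message are all hypothetical.

```python
import warnings

def split_user_codes(user_codes, anno_format):
    """Separate letter codes from ECO IDs.

    ECO IDs are dropped, with a warning, unless the annotation format is GPAD.
    Returns (letter_codes, eco_ids).
    """
    eco_ids = [code for code in user_codes if code.startswith("ECO:")]
    letter_codes = [code for code in user_codes if not code.startswith("ECO:")]
    if eco_ids and anno_format.lower() != "gpad":
        warnings.warn(
            "{fmt} annotations do not contain ECO IDs; ignoring: {ids}".format(
                fmt=anno_format, ids=", ".join(eco_ids)
            )
        )
        eco_ids = []
    return letter_codes, eco_ids

print(split_user_codes(["IDA", "ECO:0000314"], "gpad"))
```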

@jeffsmith8
Author

jeffsmith8 commented May 2, 2019

Wonderful! I'm flattered you found my suggestion useful, and grateful that it has even been added as a feature! I look forward to using this tool, and am particularly pleased you have added a facility for custom evidence code combinations. Great work :)

This is a bit of a segue, though I'll add it in case it is already a GOATOOLS feature or deemed useful...

One other filter I like to include is the QUALIFIER so that searches can also be limited by biological process, molecular function or cellular compartment.

I generally find this useful because much of my work is in proteomics, where our experimental techniques are usually selected to target these categories, i.e. affinity proteomics targets molecular function, cell fractionation targets cellular compartment. As such I often wonder whether p-value testing in GO enrichment offers more power when it is similarly matched to a chosen technique or hypothesis being tested (at present I limit myself to a descriptive approach rather than a statistical one). It also seems like a worthwhile research question if it hasn't been tackled before.

@dvklopfenstein
Collaborator

One final note: You can combine including evidence codes and evidence groups with excluding evidence codes.

For example,

--ev_inc=Experimental,Similarity
--ev_exc=IEP,IMR

Results in these codes being used:

    Experimental       : EXP IDA IPI IMP IGI
    Similarity         : ISS ISO ISA ISM IGC IBA IBD IKR IRD

For reference:

$ python3 scripts/find_enrichment.py --ev_help_short

EVIDENCE GROUP AND CODES:
    Experimental       : EXP IDA IPI IMP IGI IEP
    Similarity         : ISS ISO ISA ISM IGC IBA IBD IKR IRD IMR
    Combinatorial      : RCA
    High_Throughput    : HTP HDA HMP HGI HEP
    Author             : TAS NAS
    Curatorial         : IC
    No biological data : ND
    Automatic          : IEA
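
The include/exclude combination described above can be sketched as simple set arithmetic. This is an illustration, not the GOATOOLS implementation: the group table is copied from the `--ev_help_short` output, the group keys use the labels as printed (the names the command line actually accepts may differ, for example for "No biological data"), and `expand` and `resolve` are hypothetical helper names.

```python
# Group table copied from the --ev_help_short output shown in this thread.
EV_GROUPS = {
    "Experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "Similarity": ["ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "IMR"],
    "Combinatorial": ["RCA"],
    "High_Throughput": ["HTP", "HDA", "HMP", "HGI", "HEP"],
    "Author": ["TAS", "NAS"],
    "Curatorial": ["IC"],
    "No biological data": ["ND"],
    "Automatic": ["IEA"],
}

def expand(names):
    """Expand group names to their codes; anything else is treated as a single code."""
    codes = set()
    for name in names:
        codes.update(EV_GROUPS.get(name, [name]))
    return codes

def resolve(ev_inc=None, ev_exc=None):
    """Return the set of evidence codes left after applying include, then exclude."""
    all_codes = {code for codes in EV_GROUPS.values() for code in codes}
    included = expand(ev_inc.split(",")) if ev_inc else all_codes
    excluded = expand(ev_exc.split(",")) if ev_exc else set()
    return included - excluded

used = resolve(ev_inc="Experimental,Similarity", ev_exc="IEP,IMR")
print(sorted(used))
```

With no include argument the sketch starts from every known code, so `resolve(ev_exc="IEA")` yields everything except IEA.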

@dvklopfenstein
Collaborator

@jeffsmith8 , Thank you very much for the detailed description of your usage model and how you would benefit from being able to specify which namespace to use in a GOEA run.

I just added this for #127

To specify running only biological process from the command line, use the --ns option:

Namespace examples:
--ns=BP
--ns=BP,MF
--ns=CC

Where the namespace abbreviations are BP, MF, and CC:

NS Namespace
BP Biological Process
MF Molecular Function
CC Cellular Component

So an example of a full command-line call to run a GOEA on just the molecular function branch (--ns=MF) is:

python3 scripts/find_enrichment.py ids_stu_gene2go_9606.txt ids_pop_gene2go_9606.txt gene2go --pval=0.05 --method=fdr_bh --pval_field=fdr_bh --outfile=results_gene2go_9606.xlsx --ns=MF

You can combine this with --ev_inc and --ev_exc to include and exclude evidence codes.
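
If you drive the script from Python, the argument list could be assembled along these lines. This is a sketch only: `build_goea_command` is a hypothetical helper, not part of GOATOOLS, and it uses only the options and file names mentioned in this thread.

```python
def build_goea_command(study, population, association,
                       ns=None, ev_inc=None, ev_exc=None, extra=()):
    """Assemble an argument list for scripts/find_enrichment.py.

    ns, ev_inc and ev_exc are iterables of strings, joined with commas.
    """
    cmd = ["python3", "scripts/find_enrichment.py", study, population, association]
    if ns:
        cmd.append("--ns=" + ",".join(ns))
    if ev_inc:
        cmd.append("--ev_inc=" + ",".join(ev_inc))
    if ev_exc:
        cmd.append("--ev_exc=" + ",".join(ev_exc))
    cmd.extend(extra)
    return cmd

cmd = build_goea_command(
    "ids_stu_gene2go_9606.txt",
    "ids_pop_gene2go_9606.txt",
    "gene2go",
    ns=["MF"],
    ev_inc=["Experimental"],
    extra=["--pval=0.05", "--method=fdr_bh", "--pval_field=fdr_bh",
           "--outfile=results_gene2go_9606.xlsx"],
)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems if a file name contains spaces.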

@dvklopfenstein
Collaborator

dvklopfenstein commented May 8, 2019

@tanghaibao , Can you upload a new version to PyPI and Bioconda? The latest updates for the user-experience:

  • Add option to run GOEAs with user-specified evidence codes or evidence code groups
  • Add Evidence Code help to inform user of Evidence groups and codes, with descriptions
  • Add option to run a GOEA on only one branch

The internal changes are:

  • Add one base class that all annotation format readers derive from
  • GOEAs are run on each branch

These changes will close
#127 and
#119

And then the next high priority issues that will be addressed are:
#126 and
#117

Thank you @BrianLohman, @jeffsmith8, and @cross12tamu for your ideas and requests, and for taking the time to convey them to us. Your detailed descriptions of your use cases help us to build a better user experience for the new functionality. Thank you so much.

Thank you @dgpinheiro and @risserlin for opening the issues concerning relationships. This will be the next task to tackle.

@tanghaibao
Owner

@dvklopfenstein

Version tag updated and uploaded to PyPI.

(hijacking this message thread ...) Somehow the recent Travis CI tests have been failing often due to network timeouts (see: https://travis-ci.org/tanghaibao/goatools/jobs/529067076). Not sure what the nature of the problem is here. @dvklopfenstein would you mind taking a look, and/or disabling the tests that failed on the Travis server?

Thanks,
Haibao

@dvklopfenstein
Collaborator

dvklopfenstein commented May 9, 2019

@tanghaibao,
Thank you for the version in PyPI.

Regarding TravisCI: Before committing, to ensure a good build, I always run all of the tests using this:

$ make clobber
$ make pytest

I commit only if all of these tests pass. Then it gets to TravisCI ...

You are correct that the TravisCI failures have all been timeouts lately. The Travis config "No language set" always passes. The configs "Python 2.7" and "Python 3.6" often fail with timeouts. I always check the failures and ensure that all are timeout failures.

The timeouts are occurring when the tests download the annotation files. In an attempt to fix, I placed the download tests first. The tests that follow should not need to download after the first tests. But TravisCI seems to continue to try downloading the files, so perhaps the tests are split into individual jobs and distributed across different machines where there are no previously downloaded annotation files.
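
The download-once idea can be illustrated with a generic pattern: skip the download when the file is already present, and retry when the network times out. This is a sketch, not what the GOATOOLS tests do; `ensure_file` and `flaky_fetch` are hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
import os
import tempfile
import time

def ensure_file(path, fetch, retries=3, delay=0.0):
    """Call fetch(path) only if path is missing; retry on OSError up to `retries` times.

    Returns True if a download happened, False if the cached file was reused.
    """
    if os.path.exists(path):
        return False  # already downloaded, nothing to do
    for attempt in range(1, retries + 1):
        try:
            fetch(path)
            return True
        except OSError:  # TimeoutError and ConnectionError are subclasses of OSError
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # simple linear backoff

calls = []

def flaky_fetch(path):
    """Stand-in for a real download: fails once, then writes a placeholder file."""
    calls.append(path)
    if len(calls) < 2:
        raise TimeoutError("simulated network timeout")
    with open(path, "w") as out:
        out.write("placeholder\n")

with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "gene2go")
    first = ensure_file(target, flaky_fetch)   # one simulated failure, then success
    second = ensure_file(target, flaky_fetch)  # file exists, fetch is not called again

print(first, second, len(calls))
```

A cache like this only helps when later tests run on the same machine as the first download, which matches the suspicion above that the jobs are being distributed across machines.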

Many people have had issues with TravisCI timeouts and there appears to be no satisfying resolution (travis-ci/travis-ci#9587).

I am taking a look at this...

@dvklopfenstein
Collaborator

The TravisCI tests are now passing.

Here is the TravisCI resolution for test failures due to timeouts on downloading the annotation files:

TravisCI now runs all tests on Python 2.7, 3.6, and 3.7 using the generic language setting on osx, with Python installed by hand. This avoids the prebuilt virtual environments used for Linux builds, which are where the annotation file download timeouts occur.

@tanghaibao
Owner

@dvklopfenstein

Good to know the tests are now passing. Thanks.
