Why are some GO IDs (and associated genes) not included in the "main data table"? #12

Open
laurahspencer opened this issue Jul 11, 2022 · 14 comments

Comments

@laurahspencer

laurahspencer commented Jul 11, 2022

I'm interested in particular GO terms in my input files, but they are not included in the "main data table" (GO division)_(input filename). I understand that not all original GO terms are analyzed, because some are represented by either a) a more specific term or b) a highly similar term. I would like to see which redundant/similar GO term absorbed my GO terms of interest, but I can't find them anywhere in the GO_MWU output. Are all original GO IDs supposed to be accounted for in the main data table?

Any insight would be very helpful. Attached is some GO_MWU code and results showing that the GO term of interest (and its associated genes) is missing from the GO_MWU output. You'll see that I have relaxed the filtering settings to not remove any GO categories that contain a large fraction of genes or only a few genes, in an effort to not throw out GO terms.

testing_gomwu.zip

@z0on
Owner

z0on commented Jul 13, 2022 via email

@laurahspencer
Author

Hi Misha,

Thanks for the info!

I've double-checked that the GO term is indeed in the GO ontology file (go.obo) and is assigned to the namespace "biological_process" (see below), and that the GO term is in the background list of GO terms that I input into GO_MWU.

[Term]
id: GO:0006313
name: transposition, DNA-mediated
namespace: biological_process
alt_id: GO:0006317
alt_id: GO:0006318
def: "Any process involved in a type of transpositional recombination which occurs via a DNA intermediate." [GOC:jp, ISBN:0198506732, ISBN:1555812090]
synonym: "Class II transposition" EXACT []
synonym: "DNA transposition" EXACT [GOC:dph]
synonym: "P-element excision" NARROW []
synonym: "P-element transposition" NARROW []
synonym: "Tc1/mariner transposition" NARROW []
synonym: "Tc3 transposition" NARROW []
is_a: GO:0006310 ! DNA recombination
is_a: GO:0032196 ! transposition
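A check like the one described above can be scripted. The following is a minimal sketch, not part of GO_MWU: the function names `parse_obo_terms` and `find_term` are hypothetical, and the parser only handles simple `tag: value` lines in `[Term]` stanzas like the one quoted here. It looks a term up by either its primary ID or an `alt_id` and reports its namespace and parents.

```python
def parse_obo_terms(text):
    """Parse [Term] stanzas from OBO-formatted text into a list of dicts.

    Each dict maps a tag (e.g. "id", "alt_id") to a list of string values.
    """
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop trailing "! comment" text such as "! DNA recombination".
            value = value.split(" ! ")[0].strip()
            current.setdefault(tag, []).append(value)
    return terms


def find_term(terms, go_id):
    """Return the term whose primary id or alt_id matches go_id, else None."""
    for term in terms:
        if go_id in term.get("id", []) or go_id in term.get("alt_id", []):
            return term
    return None


# Abbreviated copy of the stanza quoted in this comment.
SAMPLE = """\
[Term]
id: GO:0006313
name: transposition, DNA-mediated
namespace: biological_process
alt_id: GO:0006317
alt_id: GO:0006318
is_a: GO:0006310 ! DNA recombination
is_a: GO:0032196 ! transposition
"""

terms = parse_obo_terms(SAMPLE)
term = find_term(terms, "GO:0006317")  # lookup via an alternate ID
print(term["id"][0], term["namespace"][0], term["is_a"])
```

To run it against a full ontology file, the text of go.obo would be read and passed to `parse_obo_terms` in place of `SAMPLE`.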

I have also experimented with relaxing the smallest and largest settings (smallest=1, largest=.99) and set clusterCutHeight=0, but my GO terms are still missing. Regarding your fourth bullet, I found one offspring of my GO term in the results, but it isn't associated with any of the genes that map to my GO term of interest (the genes it does map to aren't significant). Further, the genes that are associated with my GO term of interest are also missing from the output. Are all significant genes supposed to be re-assigned to other GO terms, or are they supposed to be removed?

Thanks for the help!

@z0on
Owner

z0on commented Jul 14, 2022 via email

@laurahspencer
Author

By "background" I mean the goAnnotations list, and yes, that list is lousy with my GO term of interest (6,259 of 29,127 genes are associated with my GO term).

My gene set contains 732 genes linked to my GO term of interest (the term actually accounts for ~28% of all "significant" genes).

For context, I'm analyzing WGCNA results, so my input includes all genes measured, with module membership scores for the genes assigned to the focal module. The GO term relates to transposons, of which there are many in my focal species' genome.

I guess the big question now, and what I should have started this issue with, is why do so many of my genes get discarded despite me relaxing the settings that filter/merge GO terms?
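Counts like the ones quoted in this comment can be reproduced with a short script. This is a rough sketch under one assumption: that the annotation table is two tab-separated columns, a gene ID followed by semicolon-separated GO IDs. The function names and the toy gene IDs are hypothetical and are not taken from GO_MWU or from the attached files.

```python
def genes_with_term(annotation_lines, go_id):
    """Return the set of gene IDs whose GO field lists go_id."""
    genes = set()
    for line in annotation_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene, _, go_field = line.partition("\t")
        go_ids = [t.strip() for t in go_field.split(";")]
        if go_id in go_ids:
            genes.add(gene)
    return genes


def term_fraction(annotation_lines, go_id):
    """Return (genes with the term, all annotated genes, fraction)."""
    all_genes = {line.split("\t", 1)[0] for line in annotation_lines if line.strip()}
    hits = genes_with_term(annotation_lines, go_id)
    return len(hits), len(all_genes), len(hits) / len(all_genes)


# Toy annotation table (hypothetical gene IDs), one "gene<TAB>GO;GO" row each.
toy_annotations = [
    "gene1\tGO:0006313;GO:0006310",
    "gene2\tGO:0008150",
    "gene3\tGO:0006313",
    "gene4\tGO:0032196;GO:0008150",
]

n_hits, n_genes, fraction = term_fraction(toy_annotations, "GO:0006313")
print(n_hits, n_genes, fraction)
```

Running the same functions on the lines of the real annotation table would give the share of annotated genes carrying the term, which can then be compared with the `largest` setting.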

@z0on
Owner

z0on commented Jul 15, 2022 via email

@laurahspencer
Author

laurahspencer commented Jul 15, 2022

Yes, I have played with that setting quite a bit and tried various levels up to 0.99 (see the code I attached in my first comment). Does that setting have a hard-coded ceiling (e.g., nothing above 50% is analyzed)?

@z0on
Owner

z0on commented Jul 15, 2022 via email

@laurahspencer
Author

Thanks for checking! I've tried playing with the Perl code but haven't had any breakthroughs.

@z0on
Owner

z0on commented Jul 25, 2022 via email

@laurahspencer
Author

Yes, the output changes. For example, here's the output when I used the following settings: largest=0.99 smallest=1 cutHeight=0 (genes of interest still get discarded).

go.obo WGCNA-genes_for-GOMWU.tab WGCNA-module_lightgreen.csv BP largest=0.99 smallest=1 cutHeight=0

Run parameters:

largest GO category as fraction of all genes (largest)  : 0.99
         smallest GO category as # of genes (smallest)  : 1
                clustering threshold (clusterCutHeight) : 0

-----------------
retrieving GO hierarchy, reformatting data...

-------------
go_reformat:
Genes with GO annotations, but not listed in measure table: 1

Terms without defined level (old ontology?..): 0
-------------
-------------
go_nrify:
1174 categories, 2585 genes; size range 1-2559.15
	1 too broad
	0 too small
	1173 remaining

removing redundancy:

calculating GO term similarities based on shared genes...
598 non-redundant GO categories of good size
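The upper end of the size range reported in the log, 2559.15, equals 0.99 × 2585, which suggests that the "too broad" cutoff is the `largest` fraction multiplied by the number of genes counted in that step. The sketch below reproduces that arithmetic. It is an illustration only, not GO_MWU's actual code: the function names, the strict inequalities, and the toy category sizes are all assumptions.

```python
def size_limits(n_genes, largest, smallest):
    """Return (minimum, maximum) allowed category size.

    Assumes the maximum is `largest` (a fraction) times the gene count,
    as the log's "size range 1-2559.15" suggests.
    """
    return smallest, largest * n_genes


def classify_categories(sizes, n_genes, largest, smallest):
    """Split categories into too broad, too small, and remaining.

    The strict comparisons (> and <) are an assumption of this sketch.
    """
    low, high = size_limits(n_genes, largest, smallest)
    too_broad = [go for go, n in sizes.items() if n > high]
    too_small = [go for go, n in sizes.items() if n < low]
    remaining = [go for go, n in sizes.items() if low <= n <= high]
    return too_broad, too_small, remaining


# Values taken from the log above: 2585 genes, largest=0.99, smallest=1.
low, high = size_limits(2585, 0.99, 1)
print(low, round(high, 2))

# Hypothetical category sizes, for illustration only.
toy_sizes = {"GO:A": 2580, "GO:B": 732, "GO:C": 1}
too_broad, too_small, remaining = classify_categories(toy_sizes, 2585, 0.99, 1)
print(too_broad, too_small, remaining)
```

Under these assumptions, a category is dropped as too broad only if it holds more than 2559.15 of the 2585 genes counted at this step.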

@z0on
Owner

z0on commented Jul 25, 2022 via email

@laurahspencer
Author

Yes, sorry, I definitely used the option clusterCutHeight.

@z0on
Owner

z0on commented Jul 25, 2022 via email

@z0on
Owner

z0on commented Oct 11, 2022 via email
