obo file and the association file out of sync? #23

IanCodes · 2020-12-10T11:51:20Z

Hi,

I have followed the tutorial and think I have the correct outputs, although the tutorial showed that the *tblout file should be used, which is not created by 'hmmer2go run'. I used *_scan instead. When I run Ontologizer I get a message that says:

Skipping association of item "NECHADRAFT_88713_8" to GO:0050662 because term is obsolete!
(Are the obo file and the association file in sync?)

I may be interpreting this incorrectly, but I assume this is because there is a discrepancy between the latest obo file and the one used by 'hmmer2go fetchmap -o pfam2go'. Is it possible to use an updated file for fetchmap, or do I need to use an old obo file?

Also, the gene names in the getorf output are followed by '_N'. So when creating the files for -s and -p, does it matter that those gene names do not have the suffixes?

I hope that makes sense. Thank you.

sestaton · 2020-12-10T21:35:21Z

The mapping file used for GO terms is always the latest. The program will fetch the term mapping file if it is not given as an option. That said, the latest file is from Jan. of 2019, so it is possible that some terms have been deprecated since that release. Unless that specific term is important to you I would just ignore this because it is just a warning.
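To check which annotations are affected rather than ignoring the warning blindly, the obsolete terms can be read straight from the obo file. The following is only a rough sketch, not part of HMMER2GO or Ontologizer: it assumes the obo file marks obsolete terms with `is_obsolete: true` inside `[Term]` stanzas and that the GO ID is the fifth tab-separated column of the GAF file. The function names and the sample lines are made up for illustration.

```python
def obsolete_terms(obo_lines):
    """Collect IDs of [Term] stanzas that carry 'is_obsolete: true'."""
    obsolete = set()
    in_term = False
    current_id = None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id:"):
            current_id = line[3:].strip()
        elif in_term and line.startswith("is_obsolete:"):
            if line.split(":", 1)[1].strip() == "true" and current_id:
                obsolete.add(current_id)
    return obsolete


def obsolete_annotations(gaf_lines, obsolete):
    """Return (object ID, GO ID) pairs whose GO ID is in the obsolete set."""
    hits = []
    for raw in gaf_lines:
        if raw.startswith("!") or not raw.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = raw.rstrip("\n").split("\t")
        if len(cols) > 4 and cols[4] in obsolete:
            hits.append((cols[1], cols[4]))
    return hits


# Toy data for illustration only; real files would be read with open().
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0000001
name: toy current term

[Term]
id: GO:0050662
name: toy obsolete term
is_obsolete: true
""".splitlines()

SAMPLE_GAF = [
    "!gaf-version: 2.0",
    "db\tNECHADRAFT_88713_8\tNECHADRAFT_88713_8\t\tGO:0050662",
    "db\tNECHADRAFT_94976_13\tNECHADRAFT_94976_13\t\tGO:0000001",
]

obsolete = obsolete_terms(SAMPLE_OBO)
hits = obsolete_annotations(SAMPLE_GAF, obsolete)
print(hits)
```

If the list of hits is short and none of the terms matter for the study, the warning can be left alone as suggested above.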

For the second part, about the gene names, can you show an example? That would be the best way forward so I can make sure I know what you are referring to.

IanCodes · 2020-12-11T10:31:17Z

Hi,

Thanks for the reply. Looking further I can see there are only 349 obsolete terms. So I am wondering if there is a mismatch between the original gene names and names with a suffix?

Here is an example of the input files for getorf (used -c):
Input CDS seq =

>NECHADRAFT_89548
ATGACagtagaggaaggaaggcaggcaattgatcaaatggatgCTAATGCGCAGGTAGTAGCCGAATCAT
...

Output file =

>NECHADRAFT_89548_2 [1 - 309] 
MTVEEGRQAIDQMDANAQVVAESSRRSGQGRSARPGVRRCGVCGQPGHNARTCQVVIETS
...

The 'scan' output file from the 'run' program includes the gene name with the suffix =

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
GMC_oxred_N          PF00732.20 **NECHADRAFT_94976_13**  -            1.7e-65  221.4   0.0   2.3e-65  220.9   0.0   1.2   1   0   0   1   1   1   1 GMC oxidoreductase

If I run ontologizer2 using the default output and the latest GO obo file I get error messages:

java -jar ~/sources/ontologizer2/Ontologizer.jar -s solani_q0.1_log2FC_gte1.txt -a F_solani_GCF_000151355_GOterm_mapping.gaf -g ~/sources/ontologizer2/go.obo -p solani_genes_in_population.txt
Parse obo file "/home/ian/sources/ontologizer2/go.obo"
Dec 11, 2020 10:04:10 AM ontologizer.ontology.OBOParser doParse
INFO: Got 47218 terms and 90583 relations in 346 ms
Details of parsed obo file:
  date:			null
  format:		1.2
  term definitions:	47218
Building graph
Dec 11, 2020 10:04:10 AM ontologizer.ontology.Ontology assignLevel1TermsAndFixRoot
INFO: Ontology contains multiple level-one terms: "molecular_function" ,"cellular_component" ,"biological_process". Adding artificial root term "GO:0000000".
Dec 11, 2020 10:04:10 AM ontologizer.set.StudySetFactory createFromFile
INFO: Processing studyset solani_q0.1_log2FC_gte1.txt
Dec 11, 2020 10:04:10 AM ontologizer.set.StudySetFactory createFromFile
INFO: Processing studyset solani_genes_in_population.txt
Skipping association of item "NECHADRAFT_88713_8" to GO:0050662 because term is obsolete!
(Are the obo file and the association file in sync?)
...
Dec 11, 2020 10:18:42 AM ontologizer.association.AssociationParser importAssociationFile
INFO: 39747 associations parsed, 0 of which were kept while 0 malformed lines had to be ignored.
Dec 11, 2020 10:18:42 AM ontologizer.association.AssociationParser importAssociationFile
INFO: A further 39747 associations were skipped due to various reasons whereas 0 of those where explicitly qualified with NOT, 349 referred to obsolete terms and 0 didn't match the requested evidence codes
Dec 11, 2020 10:18:42 AM ontologizer.association.AssociationParser importAssociationFile
INFO: A total of 1669 terms are directly associated to 0 items.

In case my input files for ontologizer are at fault, here they are:
-s (DE genes), -p contains all genes in RNA-seq

gene-NECHADRAFT_45473
gene-NECHADRAFT_41228
gene-NECHADRAFT_10996
...

-g is the latest obo file.
-a is the GAF output from map2gaf
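A quick way to test the name-mismatch idea is to count how many names in the -s/-p files also appear as IDs in the GAF file. This is a hedged sketch under the assumption that the item name is the second tab-separated column of the GAF file; the helper names and the sample GAF line are invented for the example, and real data would be read from the files passed to -a, -s and -p.

```python
def gaf_object_ids(gaf_lines):
    """Collect the second column (object ID) of every annotation line."""
    ids = set()
    for raw in gaf_lines:
        if raw.startswith("!") or not raw.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = raw.rstrip("\n").split("\t")
        if len(cols) > 1:
            ids.add(cols[1])
    return ids


def overlap_report(study_names, gaf_ids):
    """Count names in the study set, IDs in the GAF, and names shared by both."""
    study = {name.strip() for name in study_names if name.strip()}
    return {
        "study": len(study),
        "gaf": len(gaf_ids),
        "shared": len(study & gaf_ids),
    }


# Toy data: the study names are from the example above, the GAF line is made up.
study_names = [
    "gene-NECHADRAFT_45473",
    "gene-NECHADRAFT_41228",
    "gene-NECHADRAFT_10996",
]
gaf_lines = [
    "!gaf-version: 2.0",
    "db\tNECHADRAFT_88713_8\tNECHADRAFT_88713_8\t\tGO:0050662",
]

report = overlap_report(study_names, gaf_object_ids(gaf_lines))
print(report)
```

A "shared" count of zero would be consistent with the log line reporting that 0 associations were kept, and would point at the names rather than at the obo file.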

I realise ontologizer isn't your problem, but if you notice anything wrong with the input that would be helpful.

Thanks.

IanCodes · 2020-12-11T12:34:55Z

UPDATE -
I managed to get ontologizer running by removing the ORF number suffix from the end of the gene names. I removed 'db.' from the second column of the GAF file. I also noticed some of the genes were prefixed with 'gene-'.
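In case it helps anyone else, the clean-up can be scripted along these lines. This is a minimal sketch, not something HMMER2GO provides: it assumes 'db.' and 'gene-' appear as prefixes on the names, and that the ORF number is a trailing '_N' added by getorf. The function names are made up. The suffix is stripped only from names known to come from getorf, because the original IDs (e.g. NECHADRAFT_89548) also end in an underscore followed by digits.

```python
import re

# Trailing "_<digits>" as appended to sequence names in the getorf output.
ORF_SUFFIX = re.compile(r"_\d+$")


def strip_prefixes(name, prefixes=("db.", "gene-")):
    """Remove any of the given leading prefixes from a name."""
    name = name.strip()
    for prefix in prefixes:
        if name.startswith(prefix):
            name = name[len(prefix):]
    return name


def normalize_orf_name(name):
    """Clean a getorf-derived name: drop prefixes, then the ORF number.

    Apply this only to names that carry an ORF suffix; on an original
    ID it would wrongly remove the numeric part of the ID itself.
    """
    return ORF_SUFFIX.sub("", strip_prefixes(name))


def normalize_study_name(name):
    """Clean a name from the -s/-p files: drop prefixes, keep the ID intact."""
    return strip_prefixes(name)


print(normalize_orf_name("db.NECHADRAFT_88713_8"))
print(normalize_study_name("gene-NECHADRAFT_45473"))
```

After both sides are normalized this way, the names in the GAF file and in the -s/-p files should be directly comparable.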

sestaton · 2020-12-19T00:40:08Z

Thanks for the detailed responses. This is very helpful. I'll try to recreate these issues over the next week or two and make a new release. It sounds like the GAF file being produced is no longer compatible with Ontologizer, which needs to be fixed.

For the sequence name format, it is informative to know the coordinates and frame of the translation, so I do not want to omit this by default. It could be an option to remove it if it causes problems downstream, though. Perhaps this could be logged in a separate file. I'll have to experiment to understand this better, so I'll leave this open until resolved.

sestaton · 2021-01-31T18:10:36Z

Concerning the issues, please try the latest release (v0.18.0) and take a look at the changes in the release. I have incorporated the comments discussed here, including: keeping the original IDs in the output, discarding obsolete GO terms, updating the GAF file format to work with Ontologizer, and other new features.

One of the new features is filling out the GAF fields more fully, such as the taxon ID based on the study, logging the GO DB and HMMER2GO versions being used, logging the run time, adding Dbxrefs to the GAF file, and other improvements.

Also, the demonstration has been updated with more context and examples.

I'm going to close this issue now, but feel free to comment if you still see issues and I'll reopen the issue. Thanks again for the feedback.

sestaton closed this as completed Jan 31, 2021
