Error in the gff output file when using --add_genes --add_mrna (prokka version 1.12) #338

Closed
Juke34 opened this issue Sep 18, 2018 · 4 comments

@Juke34

Juke34 commented Sep 18, 2018

We found several problems in the GFF file output by Prokka version 1.12. As a result the format is invalid and the file cannot reliably be used with other tools.

  1. The mRNA feature lacks a Parent attribute.
  2. CDS features have two parents, one of which is the gene feature, when only the mRNA feature is expected.
  3. tRNA features have an mRNA as parent, whereas a tRNA is expected to have a gene as parent.
  4. A minor point, but a bit awkward: the CDS feature comes before the mRNA and the gene in the file.
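
For anyone who wants to check their own file for the four points above, here is a minimal sketch in Python. It only looks at column 3 (feature type) and the `ID`/`Parent` attributes in column 9; the function names `parse_attributes` and `check_gff3` are made up for this example, and it is not a full GFF3 validator.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=a,b' into {'ID': ['x'], 'Parent': ['a', 'b']}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value.split(",")
    return attrs


def check_gff3(lines):
    """Return (line_number, message) tuples for the four problems listed above."""
    features = []  # (line_number, feature_type, feature_id, parent_ids)
    for number, line in enumerate(lines, start=1):
        if line.startswith("##FASTA"):
            break  # the sequence section follows, no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", [None])[0]
        features.append((number, cols[2], feature_id, attrs.get("Parent", [])))

    type_of = {fid: ftype for _, ftype, fid, _ in features if fid}
    first_seen = {}
    for number, _, fid, _ in features:
        if fid:
            first_seen.setdefault(fid, number)

    problems = []
    for number, ftype, _, parents in features:
        if ftype == "mRNA" and not parents:
            problems.append((number, "mRNA has no Parent attribute"))
        if ftype == "CDS" and len(parents) > 1:
            problems.append((number, "CDS has more than one Parent"))
        if ftype == "tRNA" and any(type_of.get(p) == "mRNA" for p in parents):
            problems.append((number, "tRNA has an mRNA as Parent"))
        if any(first_seen.get(p, 0) > number for p in parents):
            problems.append((number, f"{ftype} appears before its parent"))
    return problems
```

Calling `check_gff3(open("annotation.gff"))` (the file name is a placeholder) returns an empty list when none of the four patterns is found.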

Related to those problems, I'm not sure the conversion to EMBL with the gff3toembl script still does the job. I guess it's the same with EMBLmyGFF3.

@andrewjpage

Have you considered using this other tool? It was built to convert GFF files from Prokka into EMBL format and was used to submit over 20,000 assemblies to the EMBL:

http://joss.theoj.org/papers/10.21105/joss.00080

It's unclear what you mean by 'the format is invalid'. Are you saying that your software cannot parse the file because it has hard-coded a eukaryotic gene model?

@Juke34
Author

Juke34 commented Sep 19, 2018

By invalid I mean the gff3 produced by prokka 1.12 doesn't follow the [format specification](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md).

I was just saying that it could be problematic for other tools to use this gff. It would be important for such a widely used tool to produce a proper gff output.

That being said, I did some tests this morning:
Artemis => it complains about a few things but manages to handle the file properly.
EMBLmyGFF3 => it doesn't complain, but the Python parser duplicates the features, so you end up with twice as many features as expected.
I tried to test gff3toembl but didn't succeed due to problems with Genometools. I guess it works fine, otherwise we would have seen people complaining, but I would be interested in confirmation that gff3toembl works with this gff output.
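
To illustrate how a CDS with two parents could end up counted twice, here is a small sketch. It assumes a parser that walks the hierarchy top-down from every feature without a Parent; that is only one plausible mechanism and not a description of what the EMBLmyGFF3 parser actually does. The function name `collect_top_down` and the feature IDs are made up.

```python
from collections import defaultdict


def collect_top_down(features):
    """features: list of (feature_id, feature_type, parent_ids).

    Start from every feature that has no Parent and record each feature
    reached while descending through the Parent links.
    """
    children = defaultdict(list)
    for fid, _, parents in features:
        for parent in parents:
            children[parent].append(fid)
    kind = {fid: ftype for fid, ftype, _ in features}

    visited = []

    def walk(fid):
        visited.append((kind[fid], fid))
        for child in children[fid]:
            walk(child)

    for fid, _, parents in features:
        if not parents:
            walk(fid)
    return visited


# Made-up records mirroring the reported layout: the mRNA has no Parent,
# and the CDS lists both the gene and the mRNA as parents.
records = [
    ("gene1", "gene", []),
    ("mrna1", "mRNA", []),
    ("cds1", "CDS", ["gene1", "mrna1"]),
]
for feature in collect_top_down(records):
    print(feature)
```

In this toy case `cds1` is reached once through `gene1` and once through `mrna1`, so it shows up twice in the result.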

@tseemann
Owner

tseemann commented Sep 19, 2018

@Juke34 Prokka uses the GFF format that NCBI uses for bacterial genomes. This is different to the "standard" GFF gene model of gene->mRNA->exons->CDS because we don't have exons/introns. It's designed to be compatible with the standard bacterial Genbank format (.gbk, .gbff) too. I do understand how the true GFF model works in eukaryotic datasets (although it's usually GTF/GFF2.5 not 3.0).

Because Prokka doesn't have mRNA features, those bugs must be coming from elsewhere. I only produce CDS by default, but gene features can be generated with the --addgenes option too, to conform to what NCBI prefers. But as CDS == gene for nearly every bacterial gene in Genbank (no one annotates UTRs), it is overkill for most bacterial genome pipelines.

I could write a post-processing step to make a fully Ensembl-compatible GFF3 file. I already have a prototype, as bcftools csq needs it.
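
As a rough idea of what such a post-processing step might look like, here is a minimal sketch that wraps each CDS in a gene -> mRNA -> CDS hierarchy. It is not the prototype mentioned above, and it does not try to match the exact attributes that Ensembl or bcftools csq expect. The function name `add_gene_and_mrna` and the `_gene` / `_mRNA` ID suffixes are assumptions made up for this example.

```python
def add_gene_and_mrna(lines):
    """Yield GFF3 lines in which every CDS is nested as gene -> mRNA -> CDS.

    Assumes the input has CDS features only (no gene or mRNA features yet).
    """
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True  # everything after this is sequence, pass through
        cols = line.rstrip("\n").split("\t")
        if in_fasta or line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield line
            continue

        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        cds_id = attrs.get("ID")
        if cds_id is None:
            yield line  # nothing to hang parents on, leave untouched
            continue

        gene_id = f"{cds_id}_gene"  # hypothetical naming scheme
        mrna_id = f"{cds_id}_mRNA"  # hypothetical naming scheme

        # Parents reuse the CDS coordinates, score and strand; phase is "."
        gene = cols[:2] + ["gene"] + cols[3:7] + [".", f"ID={gene_id}"]
        mrna = cols[:2] + ["mRNA"] + cols[3:7] + [".", f"ID={mrna_id};Parent={gene_id}"]

        # Keep the other CDS attributes, but point Parent at the new mRNA only
        rest = ";".join(f"{k}={v}" for k, v in attrs.items() if k not in ("ID", "Parent"))
        cds_attrs = f"ID={cds_id};Parent={mrna_id}" + (f";{rest}" if rest else "")
        cds = cols[:8] + [cds_attrs]

        # Emit parents before children
        for record in (gene, mrna, cds):
            yield "\t".join(record) + "\n"
```

Non-CDS features such as tRNA are passed through unchanged here; a real converter would also need to give them a gene parent.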

@tseemann self-assigned this on Sep 19, 2018

@Juke34
Author

Juke34 commented Sep 19, 2018

Thank you for your input @tseemann.
You're right, it is related to the options --addgenes and --addmrna. The colleague who sent me the file confirmed that he used those options.
We usually don't use those options, and we post-process the data with our own tool to make a fully gff3-compliant file.

To summarize: the gff3 file produced by default is correct, but becomes slightly wrong when using the --addgenes and/or --addmrna options (only the format is affected; the structural and functional annotation itself is not).

P.S.: Our gff3 parser deals with the RefSeq format, which doesn't contain any mRNA features, so using the --addgenes option alone would not have raised any problem. I therefore conclude that it is the --addmrna option, introduced in version 1.12, that creates an unexpected gff-like file when activated.

@Juke34 changed the title from "Error in the gff output file (prokka version 1.12)" to "Error in the gff output file when using --add_genes --add_mrna (prokka version 1.12)" on Sep 19, 2018
3 participants