gff3 specification #127

Open
nhartwic opened this issue May 10, 2024 · 1 comment

nhartwic commented May 10, 2024

The gff-version listed by Helixer is "3.2.1", but I haven't been able to find a specification for that version. The only well-detailed specification of GFF3 that I've ever found is the one from Sequence Ontology, for "3.1.26"...

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

...According to this standard, CDS lines that are meant to be part of the same sequence should have the same ID. This marks those lines as a multi-line feature. Helixer doesn't do this. Helixer marks each CDS segment with its own ID.

Is there an actual specification for 3.2.1 somewhere? If so, does it disagree with 3.1.26 in terms of how CDS are supposed to be represented?

EDIT: For reference, the current GFF from Helixer looks like...

1       Helixer gene    241717  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003
1       Helixer mRNA    241717  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1;Parent=Zmays.B73.HPIv02_1_000003
1       Helixer exon    241717  241720  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.exon.1;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer five_prime_UTR  241717  241717  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.five_prime_UTR.1;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer CDS     241718  241720  .       +       0       ID=Zmays.B73.HPIv02_1_000003.1.CDS.1;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer exon    241835  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.exon.2;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer CDS     241835  242875  .       +       0       ID=Zmays.B73.HPIv02_1_000003.1.CDS.2;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer three_prime_UTR 242876  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.three_prime_UTR.1;Parent=Zmays.B73.HPIv02_1_000003.1

...According to the sequence ontology standard, it should be...

1       Helixer gene    241717  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003
1       Helixer mRNA    241717  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1;Parent=Zmays.B73.HPIv02_1_000003
1       Helixer exon    241717  241720  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.exon.1;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer five_prime_UTR  241717  241717  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.five_prime_UTR.1;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer CDS     241718  241720  .       +       0       ID=Zmays.B73.HPIv02_1_000003.1.CDS;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer exon    241835  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.exon.2;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer CDS     241835  242875  .       +       0       ID=Zmays.B73.HPIv02_1_000003.1.CDS;Parent=Zmays.B73.HPIv02_1_000003.1
1       Helixer three_prime_UTR 242876  242876  .       +       .       ID=Zmays.B73.HPIv02_1_000003.1.three_prime_UTR.1;Parent=Zmays.B73.HPIv02_1_000003.1

I know the difference seems small, but it's the difference between a "correct" parsing of the file producing two distinct proteins and producing a single (correct) protein.
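
To make the difference concrete, here is a minimal Python sketch of what an ID-based grouping sees versus a Parent-based one, using the two CDS rows from the Helixer output above. `parse_attributes` and `group_cds_by` are hypothetical helpers written for this illustration, not part of Helixer or any GFF library, and the attribute parsing is deliberately simplified (no URL-decoding, no multi-valued Parent).

```python
from collections import defaultdict

# The two CDS rows from the Helixer output above, with tab-separated columns.
HELIXER_CDS = [
    "1\tHelixer\tCDS\t241718\t241720\t.\t+\t0\t"
    "ID=Zmays.B73.HPIv02_1_000003.1.CDS.1;Parent=Zmays.B73.HPIv02_1_000003.1",
    "1\tHelixer\tCDS\t241835\t242875\t.\t+\t0\t"
    "ID=Zmays.B73.HPIv02_1_000003.1.CDS.2;Parent=Zmays.B73.HPIv02_1_000003.1",
]


def parse_attributes(column9):
    """Split the ninth GFF3 column into a dict of tag -> value (simplified)."""
    return dict(pair.split("=", 1) for pair in column9.strip().split(";") if pair)


def group_cds_by(lines, key):
    """Collect CDS segments as (start, end) tuples under the given attribute."""
    groups = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # skip comments, directives, and non-CDS features
        attrs = parse_attributes(cols[8])
        groups[attrs[key]].append((int(cols[3]), int(cols[4])))
    return dict(groups)


by_id = group_cds_by(HELIXER_CDS, "ID")          # spec-style grouping
by_parent = group_cds_by(HELIXER_CDS, "Parent")  # what many parsers do instead

print(len(by_id), "CDS feature(s) when grouping by ID")
print(len(by_parent), "CDS feature(s) when grouping by Parent")
```

Grouping by ID yields two separate one-segment CDS features, while grouping by Parent yields one feature with both segments.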

@alisandra alisandra added this to the v0.3.4 milestone Jun 2, 2024
alisandra (Collaborator) commented

Ah, thanks for raising this. That will help us get it patched up in the next round of changes / next version.

As a practical option until then, many parsers use the Parent attribute to determine what belongs together and will parse this correctly despite the IDs. I think gffread should do it.
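
For tools that do insist on ID-based grouping, another stopgap could be to rewrite the CDS IDs before parsing. The following Python is only a sketch, not something Helixer ships: `share_cds_ids` is a hypothetical helper, and the `<Parent>.CDS` naming is an assumption borrowed from the corrected example in the first comment. It only handles CDS rows with a single Parent and leaves everything else untouched.

```python
def share_cds_ids(lines):
    """Yield GFF3 lines, giving every CDS row the ID '<Parent>.CDS' (assumed naming)."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            attrs = dict(p.split("=", 1) for p in cols[8].split(";") if p)
            parent = attrs.get("Parent")
            # Only rewrite the simple single-parent case.
            if parent and "," not in parent:
                attrs["ID"] = parent + ".CDS"
                cols[8] = ";".join(f"{k}={v}" for k, v in attrs.items())
        yield "\t".join(cols)


# The two CDS rows from the Helixer output in the first comment.
example = [
    "1\tHelixer\tCDS\t241718\t241720\t.\t+\t0\t"
    "ID=Zmays.B73.HPIv02_1_000003.1.CDS.1;Parent=Zmays.B73.HPIv02_1_000003.1",
    "1\tHelixer\tCDS\t241835\t242875\t.\t+\t0\t"
    "ID=Zmays.B73.HPIv02_1_000003.1.CDS.2;Parent=Zmays.B73.HPIv02_1_000003.1",
]

fixed = list(share_cds_ids(example))
for row in fixed:
    print(row)
```

After the rewrite, both segments carry the same ID, so a parser following the Sequence Ontology convention would treat them as one multi-line CDS feature.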
