bcftools csq cannot parse bacterial gff3 files #530

johnlees · 2016-12-20T15:35:51Z

I have tried the bcftools csq command with the attached fasta, and three different gff files (one from ncbi, one ncbi modified to add transcripts, and one from ensembl bacteria). In all cases the gff cannot be parsed, and I haven't found the error message informative as to why.

Files to replicate this are attached here:
bcftools_csq.zip

The commands I tried with their stderr are in bcftools_csq.stderr

Parsing Streptococcus_pneumoniae_atcc_700669.ASM2666v1.32.chromosome.Chromosome.gff3.gz ...
[csq.c:699 gff_parse_id] Could not parse the line: Chromosome	ena	gene	186	1547	.	+	.	ID=gene:SPN23F00010;Name=dnaA;biotype=protein_coding;description=chromosomal replication initiator protein;gene_id=SPN23F00010;logic_name=ena

bcftools csq -f Streptococcus_pneumoniae_ATCC_700669_v1.fa -g transcripts.gtf dhh.vcf.gz 2>> bcftools_csq.stderr
Parsing transcripts.gtf ...
[csq.c:692 gff_parse_id] Could not parse the line, "Parent=transcript:" not present: FM211187	protein_coding	exon	186	1547	.	+	0	gene_id ""FM211187.2.1""; transcript_id ""dnaA""; exon_number "1"; 

bcftools csq -f Streptococcus_pneumoniae_ATCC_700669_v1.fa -g Streptococcus_pneumoniae_ATCC_700669_v1.gff dhh.vcf.gz 2>> bcftools_csq.stderr
Parsing Streptococcus_pneumoniae_ATCC_700669_v1.gff ...
ignore: FM211187	EMBL	databank_entry	1	2221315	0.000	+	.	ID="FM211187.1";organism="Streptococcus pneumoniae ATCC 700669";strain="ATCC 700669";mol_type="genomic DNA";db_xref="taxon:561276"
[csq.c:692 gff_parse_id] Could not parse the line, "Parent=transcript:" not present: FM211187	EMBL	CDS	186	1547	0.000	+	0	ID="FM211187.2";transl_table=11;gene="dnaA";locus_tag="SPN23F00010";product="chromosomal replication initiator protein";db_xref="GOA:B8ZJH9";db_xref="InterPro:IPR001957";db_xref="InterPro:IPR003593";db_xref="InterPro:IPR010921";db_xref="InterPro:IPR013159";db_xref="InterPro:IPR013317";db_xref="InterPro:IPR018312";db_xref="InterPro:IPR020591";db_xref="UniProtKB/Swiss-Prot:B8ZJH9";protein_id="CAR67867.1";translation="MKEKQFWNRILEFAQERLTRSMYDFYAIQAELIKVEENVATIFLPRSEMEMVWEKQLKDIIVVAGFEIYDAEITPHYIFTKPQDTTSSQVEEATNLTLYDYSPKLVSIPYSDTGLKEKYTFDNFIQGDGNVWAVSAALAVSEDLALTYNPLFIYGGPGLGKTHLLNAIGNEILKNIPNARVKYIPAESFINDFLDHLRLGEMEKFKKTYRSLDLLLIDDIQSLSGKKVATQEEFFNTFNALHDKQKQIVLTSDRSPKHLEGLEERLVTRFSWGLTQTITPPDFETRIAILQSKTEHLGYNFQSDTLEYLAGQFDSNVRDLEGAINDITLIARVKKIKDITIDIAAEAIRARKQDVSQMLVIPIDKIQTEVGNFYGVSIKEMKGSRRLQNIVLARQVAMYLSRELTDNSLPKIGKEFGGKDHTTVIHAHAKIKSLIDQDDNLRLEIESIKKKIK"```

The text was updated successfully, but these errors were encountered:

pd3 · 2016-12-20T15:51:16Z

The error messages actually do indicate the problem: the "Parent=transcript:" keyword is not found in some records, see an example of a supported file format here: ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz

Maybe there could be a small perl script added for convertin the most common GFF variants to a format supported by bcftools/csq.

johnlees · 2016-12-20T15:57:02Z

Ah I see, thanks a lot. Such a perl script would certainly be useful, but I'd guess it'd be difficult to cover all the variants of GFF files.

I will write one for ensembl bacteria files at some point, and attach it here in case it's useful to others in the future.

pd3 · 2016-12-20T15:59:15Z

That would be great, thanks

tseemann · 2018-03-30T02:04:32Z

@pd3 @johnlees I encountered this problem today too in my attempts to find an alternative to snpEff.

Clearly "Parent=transcript:" not present means it wants that tag. There does seem to be information in the source code comments about the expected gene mode:

https://github.com/samtools/bcftools/blob/develop/csq.c

    The gff parsing logic
        We collect features such by combining gff lines A,B,C as follows:
            A .. gene line with a supported biotype
                    A.ID=~/^gene:/
            B .. transcript line referencing A
                    B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
            C .. corresponding CDS, exon, and UTR lines:
                    C[3] in {"CDS","exon","three_prime_UTR","five_prime_UTR"} && C.Parent=~/^transcript:B.ID/ 
        For coding biotypes ("protein_coding" or "polymorphic_pseudogene") the
        complete chain link C -> B -> A is required. For the rest, link B -> A suffices.

It's even in perl regexp form so I can read it :-P

Here's a CDS from chrY formatted for brevity

gene    2786855 2787699 .       -       .       ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gen>
mRNA    2786855 2787699 .       -       .       ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_>
3'_UTR  2786855 2786988 .       -       .       Parent=transcript:ENST00000383070
exon    2786855 2787699 .       -       .       Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001494622;ra>
CDS     2786989 2787603 .       -       0       ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
5'_UTR  2787604 2787699 .       -       .       Parent=transcript:ENST00000383070

pd3 · 2018-03-30T13:14:50Z

The documentation excerpt pasted above states that gene, transcript and CDS lines are required, but in your example the transcript is not shown. (By the way, the program must have printed also the line where the substring "Parent=transcript:" was not found, so you should be able to check where the program got stuck.)

Here is the expected format in more detail:

# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
# (the most interesting is "protein_coding").
#
# Attributes required for 
#   gene lines:
#   - ID=gene:<gene_id> 
#   - biotype=<biotype>
#   - Name=<gene_name>      [optional]
#
#   transcript lines:
#   - ID=transcript:<transcript_id>
#   - Parent=gene:<gene_id>
#   - biotype=<biotype>
#
#   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
#   - Parent=transcript:<transcript_id>
#
# Supported biotypes:
#   - see the function gff_parse_biotype() in bcftools/csq.c

1   ignored_field  gene            21  2148  .   -   .   ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1   ignored_field  transcript      21  2148  .   -   .   ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1   ignored_field  three_prime_UTR 21  2054  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  exon            21  2148  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  CDS             21  2148  .   -   1   Parent=transcript:TranscriptId
1   ignored_field  five_prime_UTR  210 2148  .   -   .   Parent=transcript:TranscriptId

And many more examples can be found in bcftools/test/csq/*/*.gff.

If you are going to write a gff2gff conversion script I am happy to host it in bcftools/misc. This should be straightforward really, looking at your example you'd just need to codify the naming (note the UTR notation) and add a transcript for each CDS.

tseemann · 2018-03-30T22:28:31Z

I finally managed to get it working. What the perl rules example doesn't say is that each of the lines must also have biotype=protein_coding (or other approved type).

This is what worked:

gene  ID=gene:foo;biotype=protein_coding
mRNA  ID=transcript:foo;Parent=gene:foo;biotype=protein_coding
CDS   ID=CDS:foo;Parent=mRNA:foo;biotype=protein_coding

On my mini example:

Indexed 63 transcripts, 0 exons, 63 CDSs, 0 UTRs

BCSQ=frameshift|ILHGENPG_00058|ILHGENPG_00058|protein_coding|+|434VHASSMLESARSITRSTLVGGHAGGNGGIKND*>434VLLQC*|53660GTACAT>GT

Bacterial GFF files come in 2 flavours

Every gene has a gene + CDS/rRNA/etc feature (NCBI style)
Every gene only has the CDS/rRNA/etc feature.

However this is now changing as more predictor tools and experimental evidence is being used for the UTRs like promoters and terminators etc.

My popular annotation tool Prokka can produce either. I just need to make an option to output in bcftools csq format.

My script is currently dependent on Bio::Perl and only works fromgbk2gff, I still need to write a gff2gff one, but I am happy to contribute a pull request soon.

tseemann · 2018-03-30T22:32:15Z

@pd3 I notice the {transcript,gene,CDS}: part of the ID is not reported in the BCSQ line as part of the ID. Forgive my ignorance, but I assume this type: notation is a semi-formalised naming convention in the model organism community?

pd3 · 2018-03-31T09:02:55Z

This is just a practical decision to avoid the complexity of parsing of the many GFF flavors that exist out there. Ensembl's human GFF is one of the most frequently used, so that became the default.

Note that contrary to your statement above, the biotype attribute does not need to be present for each line, only the gene and transcript lines. The GFF description is now included in the manual page https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098

tseemann · 2018-04-07T06:35:50Z

@pd3 thanks so much for the clarification of the biotype and for the extra docs.
the htslib teams on github are so helpful and responsive. thank you!

danrlu · 2020-04-16T04:36:37Z

Thanks for the discussion above! It's very helpful!
Please allow me to do a quick summary, and point out that spelling and letter case needs to be exact match:

column 3 (type)                    column 9 (attributes)
xxx                                ID=gene:xxx;biotype=xxx
xxx                                ID=transcript:xxx;Parent=gene:xxx;biotype=xxx
exon                               Parent=transcript:xxx
CDS                                Parent=transcript:xxx
five_prime_UTR                     Parent=transcript:xxx
three_prime_UTR                    Parent=transcript:xxx

Gene line and transcript line are identified by column 9 having ID=gene or ID=transcript, so column 3 can be anything (mRNA, lncRNA, etc.).

List of supported biotype here: https://github.com/samtools/bcftools/blob/develop/csq.c
biotype=protein_coding is most useful.

Thank you for developing these wonderful tools!

flashton2003 · 2020-07-10T09:08:01Z

@pd3 @tseemann @johnlees did any of you end up with a script to convert GFF3 to bcftools csq compatible Ensembl GFF3?

pd3 · 2020-07-10T09:31:30Z

The only code made publicly available I am aware of is here #1208 (comment). I'd be very happy to host a gff2gff script in bcftools/misc but cannot commit to developing it myself unfortunately. Even if its functionality is limited, it would still be a worthwhile contribution.

flashton2003 · 2020-07-10T10:39:13Z

Thank you @pd3! If I roll my own I'll post it here at least.

flashton2003 · 2020-07-14T14:48:23Z

This is my script. Takes a gff converted from gbk with the BioPerl gbk2gff3.pl script. Outputs something which can be used by bcftools csq. It's probably only useful as a starting point for others with the same problem to work from, but it works from my use case.

https://gist.github.com/flashton2003/b246ce509300a8669d27de7a4eb5c4c9

Sorry, I don't have time to make something more general.

Expected input here, expected output here.

pd3 · 2020-09-07T16:02:10Z

This script has been now added, thank you @flashton2003

flashton2003 · 2020-09-07T17:29:33Z

You're welcome!

…

On Mon, Sep 7, 2020 at 6:02 PM Petr Danecek ***@***.***> wrote: This script has been now added, thank you @flashton2003 <https://github.com/flashton2003> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#530 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAT6FEWRFK5ALQCZKA5OABDSET7ZFANCNFSM4C2H3UEA> .

mcshane added the enhancement label Feb 27, 2017

pd3 mentioned this issue Apr 29, 2020

[FR] overlapping CDS #1208

Closed

pd3 added a commit that referenced this issue Sep 7, 2020

Add gff2gff.py script contributed by @flashton2003 in #530

d1b74cd

lindenb mentioned this issue Jul 4, 2022

Suggestion to support syntax for bcftools csq gpertea/gffread#95

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bcftools csq cannot parse bacterial gff3 files #530

bcftools csq cannot parse bacterial gff3 files #530

johnlees commented Dec 20, 2016

pd3 commented Dec 20, 2016

johnlees commented Dec 20, 2016

pd3 commented Dec 20, 2016

tseemann commented Mar 30, 2018 •

edited

pd3 commented Mar 30, 2018

tseemann commented Mar 30, 2018 •

edited

tseemann commented Mar 30, 2018

pd3 commented Mar 31, 2018

tseemann commented Apr 7, 2018

danrlu commented Apr 16, 2020 •

edited

flashton2003 commented Jul 10, 2020

pd3 commented Jul 10, 2020

flashton2003 commented Jul 10, 2020

flashton2003 commented Jul 14, 2020

pd3 commented Sep 7, 2020

flashton2003 commented Sep 7, 2020 via email

bcftools csq cannot parse bacterial gff3 files #530

bcftools csq cannot parse bacterial gff3 files #530

Comments

johnlees commented Dec 20, 2016

pd3 commented Dec 20, 2016

johnlees commented Dec 20, 2016

pd3 commented Dec 20, 2016

tseemann commented Mar 30, 2018 • edited

pd3 commented Mar 30, 2018

tseemann commented Mar 30, 2018 • edited

tseemann commented Mar 30, 2018

pd3 commented Mar 31, 2018

tseemann commented Apr 7, 2018

danrlu commented Apr 16, 2020 • edited

flashton2003 commented Jul 10, 2020

pd3 commented Jul 10, 2020

flashton2003 commented Jul 10, 2020

flashton2003 commented Jul 14, 2020

pd3 commented Sep 7, 2020

flashton2003 commented Sep 7, 2020 via email

tseemann commented Mar 30, 2018 •

edited

tseemann commented Mar 30, 2018 •

edited

danrlu commented Apr 16, 2020 •

edited