Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools csq cannot parse bacterial gff3 files #530

Open
johnlees opened this issue Dec 20, 2016 · 16 comments
Open

bcftools csq cannot parse bacterial gff3 files #530

johnlees opened this issue Dec 20, 2016 · 16 comments

Comments

@johnlees
Copy link

I have tried the bcftools csq command with the attached fasta, and three different gff files (one from ncbi, one ncbi modified to add transcripts, and one from ensembl bacteria). In all cases the gff cannot be parsed, and I haven't found the error message informative as to why.

Files to replicate this are attached here:
bcftools_csq.zip

The commands I tried with their stderr are in bcftools_csq.stderr

Parsing Streptococcus_pneumoniae_atcc_700669.ASM2666v1.32.chromosome.Chromosome.gff3.gz ...
[csq.c:699 gff_parse_id] Could not parse the line: Chromosome	ena	gene	186	1547	.	+	.	ID=gene:SPN23F00010;Name=dnaA;biotype=protein_coding;description=chromosomal replication initiator protein;gene_id=SPN23F00010;logic_name=ena

bcftools csq -f Streptococcus_pneumoniae_ATCC_700669_v1.fa -g transcripts.gtf dhh.vcf.gz 2>> bcftools_csq.stderr
Parsing transcripts.gtf ...
[csq.c:692 gff_parse_id] Could not parse the line, "Parent=transcript:" not present: FM211187	protein_coding	exon	186	1547	.	+	0	gene_id ""FM211187.2.1""; transcript_id ""dnaA""; exon_number "1"; 

bcftools csq -f Streptococcus_pneumoniae_ATCC_700669_v1.fa -g Streptococcus_pneumoniae_ATCC_700669_v1.gff dhh.vcf.gz 2>> bcftools_csq.stderr
Parsing Streptococcus_pneumoniae_ATCC_700669_v1.gff ...
ignore: FM211187	EMBL	databank_entry	1	2221315	0.000	+	.	ID="FM211187.1";organism="Streptococcus pneumoniae ATCC 700669";strain="ATCC 700669";mol_type="genomic DNA";db_xref="taxon:561276"
[csq.c:692 gff_parse_id] Could not parse the line, "Parent=transcript:" not present: FM211187	EMBL	CDS	186	1547	0.000	+	0	ID="FM211187.2";transl_table=11;gene="dnaA";locus_tag="SPN23F00010";product="chromosomal replication initiator protein";db_xref="GOA:B8ZJH9";db_xref="InterPro:IPR001957";db_xref="InterPro:IPR003593";db_xref="InterPro:IPR010921";db_xref="InterPro:IPR013159";db_xref="InterPro:IPR013317";db_xref="InterPro:IPR018312";db_xref="InterPro:IPR020591";db_xref="UniProtKB/Swiss-Prot:B8ZJH9";protein_id="CAR67867.1";translation="MKEKQFWNRILEFAQERLTRSMYDFYAIQAELIKVEENVATIFLPRSEMEMVWEKQLKDIIVVAGFEIYDAEITPHYIFTKPQDTTSSQVEEATNLTLYDYSPKLVSIPYSDTGLKEKYTFDNFIQGDGNVWAVSAALAVSEDLALTYNPLFIYGGPGLGKTHLLNAIGNEILKNIPNARVKYIPAESFINDFLDHLRLGEMEKFKKTYRSLDLLLIDDIQSLSGKKVATQEEFFNTFNALHDKQKQIVLTSDRSPKHLEGLEERLVTRFSWGLTQTITPPDFETRIAILQSKTEHLGYNFQSDTLEYLAGQFDSNVRDLEGAINDITLIARVKKIKDITIDIAAEAIRARKQDVSQMLVIPIDKIQTEVGNFYGVSIKEMKGSRRLQNIVLARQVAMYLSRELTDNSLPKIGKEFGGKDHTTVIHAHAKIKSLIDQDDNLRLEIESIKKKIK"```
@pd3
Copy link
Member

pd3 commented Dec 20, 2016

The error messages actually do indicate the problem: the "Parent=transcript:" keyword is not found in some records, see an example of a supported file format here: ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz

Maybe there could be a small perl script added for convertin the most common GFF variants to a format supported by bcftools/csq.

@johnlees
Copy link
Author

Ah I see, thanks a lot. Such a perl script would certainly be useful, but I'd guess it'd be difficult to cover all the variants of GFF files.

I will write one for ensembl bacteria files at some point, and attach it here in case it's useful to others in the future.

@pd3
Copy link
Member

pd3 commented Dec 20, 2016

That would be great, thanks

@tseemann
Copy link

tseemann commented Mar 30, 2018

@pd3 @johnlees I encountered this problem today too in my attempts to find an alternative to snpEff.

Clearly "Parent=transcript:" not present means it wants that tag. There does seem to be information in the source code comments about the expected gene mode:

https://github.com/samtools/bcftools/blob/develop/csq.c

    The gff parsing logic
        We collect features such by combining gff lines A,B,C as follows:
            A .. gene line with a supported biotype
                    A.ID=~/^gene:/
            B .. transcript line referencing A
                    B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
            C .. corresponding CDS, exon, and UTR lines:
                    C[3] in {"CDS","exon","three_prime_UTR","five_prime_UTR"} && C.Parent=~/^transcript:B.ID/ 
        For coding biotypes ("protein_coding" or "polymorphic_pseudogene") the
        complete chain link C -> B -> A is required. For the rest, link B -> A suffices.

It's even in perl regexp form so I can read it :-P

Here's a CDS from chrY formatted for brevity

gene    2786855 2787699 .       -       .       ID=gene:ENSG00000184895;Name=SRY;biotype=protein_coding;description=sex determining region Y [Source:HGNC Symbol%3BAcc:HGNC:11311];gen>
mRNA    2786855 2787699 .       -       .       ID=transcript:ENST00000383070;Parent=gene:ENSG00000184895;Name=SRY-201;biotype=protein_coding;ccdsid=CCDS14772.1;tag=basic;transcript_>
3'_UTR  2786855 2786988 .       -       .       Parent=transcript:ENST00000383070
exon    2786855 2787699 .       -       .       Parent=transcript:ENST00000383070;Name=ENSE00001494622;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001494622;ra>
CDS     2786989 2787603 .       -       0       ID=CDS:ENSP00000372547;Parent=transcript:ENST00000383070;protein_id=ENSP00000372547
5'_UTR  2787604 2787699 .       -       .       Parent=transcript:ENST00000383070

@pd3
Copy link
Member

pd3 commented Mar 30, 2018

The documentation excerpt pasted above states that gene, transcript and CDS lines are required, but in your example the transcript is not shown. (By the way, the program must have printed also the line where the substring "Parent=transcript:" was not found, so you should be able to check where the program got stuck.)

Here is the expected format in more detail:

# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
# (the most interesting is "protein_coding").
#
# Attributes required for 
#   gene lines:
#   - ID=gene:<gene_id> 
#   - biotype=<biotype>
#   - Name=<gene_name>      [optional]
#
#   transcript lines:
#   - ID=transcript:<transcript_id>
#   - Parent=gene:<gene_id>
#   - biotype=<biotype>
#
#   other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
#   - Parent=transcript:<transcript_id>
#
# Supported biotypes:
#   - see the function gff_parse_biotype() in bcftools/csq.c

1   ignored_field  gene            21  2148  .   -   .   ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1   ignored_field  transcript      21  2148  .   -   .   ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1   ignored_field  three_prime_UTR 21  2054  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  exon            21  2148  .   -   .   Parent=transcript:TranscriptId
1   ignored_field  CDS             21  2148  .   -   1   Parent=transcript:TranscriptId
1   ignored_field  five_prime_UTR  210 2148  .   -   .   Parent=transcript:TranscriptId

And many more examples can be found in bcftools/test/csq/*/*.gff.

If you are going to write a gff2gff conversion script I am happy to host it in bcftools/misc. This should be straightforward really, looking at your example you'd just need to codify the naming (note the UTR notation) and add a transcript for each CDS.

@tseemann
Copy link

tseemann commented Mar 30, 2018

I finally managed to get it working. What the perl rules example doesn't say is that each of the lines must also have biotype=protein_coding (or other approved type).

This is what worked:

gene  ID=gene:foo;biotype=protein_coding
mRNA  ID=transcript:foo;Parent=gene:foo;biotype=protein_coding
CDS   ID=CDS:foo;Parent=mRNA:foo;biotype=protein_coding

On my mini example:

Indexed 63 transcripts, 0 exons, 63 CDSs, 0 UTRs

BCSQ=frameshift|ILHGENPG_00058|ILHGENPG_00058|protein_coding|+|434VHASSMLESARSITRSTLVGGHAGGNGGIKND*>434VLLQC*|53660GTACAT>GT

Bacterial GFF files come in 2 flavours

  1. Every gene has a gene + CDS/rRNA/etc feature (NCBI style)
  2. Every gene only has the CDS/rRNA/etc feature.

However this is now changing as more predictor tools and experimental evidence is being used for the UTRs like promoters and terminators etc.

My popular annotation tool Prokka can produce either. I just need to make an option to output in bcftools csq format.

My script is currently dependent on Bio::Perl and only works fromgbk2gff, I still need to write a gff2gff one, but I am happy to contribute a pull request soon.

@tseemann
Copy link

@pd3 I notice the {transcript,gene,CDS}: part of the ID is not reported in the BCSQ line as part of the ID. Forgive my ignorance, but I assume this type: notation is a semi-formalised naming convention in the model organism community?

@pd3
Copy link
Member

pd3 commented Mar 31, 2018

This is just a practical decision to avoid the complexity of parsing of the many GFF flavors that exist out there. Ensembl's human GFF is one of the most frequently used, so that became the default.

Note that contrary to your statement above, the biotype attribute does not need to be present for each line, only the gene and transcript lines. The GFF description is now included in the manual page https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098

@tseemann
Copy link

tseemann commented Apr 7, 2018

@pd3 thanks so much for the clarification of the biotype and for the extra docs.
the htslib teams on github are so helpful and responsive. thank you!

@danrlu
Copy link

danrlu commented Apr 16, 2020

Thanks for the discussion above! It's very helpful!
Please allow me to do a quick summary, and point out that spelling and letter case needs to be exact match:

column 3 (type)                    column 9 (attributes)
xxx                                ID=gene:xxx;biotype=xxx
xxx                                ID=transcript:xxx;Parent=gene:xxx;biotype=xxx
exon                               Parent=transcript:xxx
CDS                                Parent=transcript:xxx
five_prime_UTR                     Parent=transcript:xxx
three_prime_UTR                    Parent=transcript:xxx

Gene line and transcript line are identified by column 9 having ID=gene or ID=transcript, so column 3 can be anything (mRNA, lncRNA, etc.).

List of supported biotype here: https://github.com/samtools/bcftools/blob/develop/csq.c
biotype=protein_coding is most useful.

Thank you for developing these wonderful tools!

@flashton2003
Copy link

@pd3 @tseemann @johnlees did any of you end up with a script to convert GFF3 to bcftools csq compatible Ensembl GFF3?

@pd3
Copy link
Member

pd3 commented Jul 10, 2020

The only code made publicly available I am aware of is here #1208 (comment). I'd be very happy to host a gff2gff script in bcftools/misc but cannot commit to developing it myself unfortunately. Even if its functionality is limited, it would still be a worthwhile contribution.

@flashton2003
Copy link

Thank you @pd3! If I roll my own I'll post it here at least.

@flashton2003
Copy link

This is my script. Takes a gff converted from gbk with the BioPerl gbk2gff3.pl script. Outputs something which can be used by bcftools csq. It's probably only useful as a starting point for others with the same problem to work from, but it works from my use case.

https://gist.github.com/flashton2003/b246ce509300a8669d27de7a4eb5c4c9

Sorry, I don't have time to make something more general.

Expected input here, expected output here.

@pd3
Copy link
Member

pd3 commented Sep 7, 2020

This script has been now added, thank you @flashton2003

@flashton2003
Copy link

flashton2003 commented Sep 7, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants