bcftools csq cannot parse bacterial gff3 files #530
Comments
The error messages actually do indicate the problem: the "Parent=transcript:" keyword is not found in some records. See an example of a supported file format here: ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz Maybe a small Perl script could be added for converting the most common GFF variants to a format supported by bcftools/csq. |
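For anyone who wants to find the offending records before running csq, here is a minimal Python sketch that lists child features whose attribute column (column 9) lacks the "Parent=transcript:" keyword mentioned above. The function name and the set of feature types checked are assumptions made for illustration; they are not taken from the bcftools source.

```python
# Feature types assumed (not confirmed in this thread) to need a transcript parent.
CHILD_TYPES = {"CDS", "exon", "five_prime_UTR", "three_prime_UTR"}


def find_missing_transcript_parent(lines):
    """Return (line_number, line) pairs for child features lacking 'Parent=transcript:'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and GFF3 directives/comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        if fields[2] in CHILD_TYPES and "Parent=transcript:" not in fields[8]:
            problems.append((number, line))
    return problems


# Hypothetical records: the second CDS points at a gene ID, not a transcript.
example = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=CDS:p1;Parent=transcript:t1",
    "chr1\tsrc\tCDS\t100\t190\t.\t+\t0\tID=cds1;Parent=gene1",
]
for number, _ in find_missing_transcript_parent(example):
    print(f"line {number}: no 'Parent=transcript:' in attributes")
```

To check a real file, pass an open file handle instead of the example list, e.g. `find_missing_transcript_parent(open("annotation.gff3"))`, where the file name is a placeholder.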
Ah I see, thanks a lot. Such a perl script would certainly be useful, but I'd guess it'd be difficult to cover all the variants of GFF files. I will write one for ensembl bacteria files at some point, and attach it here in case it's useful to others in the future. |
That would be great, thanks |
@pd3 @johnlees I encountered this problem today too in my attempts to find an alternative to Clearly https://github.com/samtools/bcftools/blob/develop/csq.c
It's even in perl regexp form so I can read it :-P Here's a CDS from
|
The documentation excerpt pasted above states that gene, transcript and CDS lines are required, but in your example the transcript is not shown. (By the way, the program must have printed also the line where the substring Here is the expected format in more detail:
And many more examples can be found in If you are going to write a |
I finally managed to get it working. What the perl rules example doesn't say is that each of the lines must also have This is what worked:
On my mini example:
Bacterial GFF files come in 2 flavours
However, this is now changing as more predictor tools and experimental evidence are being used for the UTRs like promoters and terminators etc. My popular annotation tool Prokka can produce either. I just need to make an option to output in My script is currently dependent on |
@pd3 I notice the |
This is just a practical decision to avoid the complexity of parsing the many GFF flavors that exist out there. Ensembl's human GFF is one of the most frequently used, so that became the default. Note that contrary to your statement above, the biotype attribute does not need to be present for each line, only the gene and transcript lines. The GFF description is now included in the manual page https://github.com/samtools/bcftools/blob/develop/doc/bcftools.txt#L1066-L1098 |
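The biotype rule above can be checked with a short Python sketch that reports gene and transcript lines carrying no biotype attribute. It assumes that gene and transcript lines are recognisable by Ensembl-style "ID=gene:" and "ID=transcript:" prefixes in column 9; that assumption and the helper names are illustrative only, so check them against the manual page linked above.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def missing_biotype(lines):
    """Return (line_number, ID) for gene/transcript lines without a biotype attribute."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attributes = parse_attributes(fields[8])
        feature_id = attributes.get("ID", "")
        # Assumption: gene/transcript lines use Ensembl-style ID prefixes.
        is_gene_or_transcript = feature_id.startswith(("gene:", "transcript:"))
        if is_gene_or_transcript and "biotype" not in attributes:
            problems.append((number, feature_id))
    return problems
```

CDS and other child lines are ignored on purpose, since only the gene and transcript lines need the attribute.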
@pd3 thanks so much for the clarification of the |
Thanks for the discussion above! It's very helpful!
Gene line and transcript line are identified by column 9 having List of supported biotype here: https://github.com/samtools/bcftools/blob/develop/csq.c Thank you for developing these wonderful tools! |
The only code made publicly available I am aware of is here #1208 (comment). I'd be very happy to host a |
Thank you @pd3! If I roll my own I'll post it here at least. |
This is my script. Takes a gff converted from gbk with the BioPerl gbk2gff3.pl script. Outputs something which can be used by bcftools csq. It's probably only useful as a starting point for others with the same problem to work from, but it works from my use case. https://gist.github.com/flashton2003/b246ce509300a8669d27de7a4eb5c4c9 Sorry, I don't have time to make something more general. |
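For readers who just want the general idea, here is a rough Python sketch of this kind of conversion. It is not the logic of the linked gist. It assumes a simple input in which each CDS has a `Parent` pointing directly at a gene `ID`, synthesizes one transcript per gene, and hard-codes a `protein_coding` biotype. The function name, the `mRNA` feature type and the biotype default are all hypothetical choices, and whether bcftools csq accepts the output as-is has not been verified here.

```python
def to_csq_style(lines, biotype="protein_coding"):
    """Rewrite simple gene -> CDS GFF3 records as gene -> transcript -> CDS records."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass directives, comments and malformed lines through unchanged.
        if line.startswith("#") or len(fields) != 9:
            out.append(line)
            continue
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        if fields[2] == "gene" and "ID" in attrs:
            gene_id = attrs["ID"]
            gene = fields[:8] + [f"ID=gene:{gene_id};biotype={biotype}"]
            # Hypothetical rule: one transcript spanning the whole gene, reusing its ID.
            transcript = fields[:2] + ["mRNA"] + fields[3:8] + [
                f"ID=transcript:{gene_id};Parent=gene:{gene_id};biotype={biotype}"
            ]
            out.append("\t".join(gene))
            out.append("\t".join(transcript))
        elif fields[2] == "CDS" and "Parent" in attrs:
            # Re-point the CDS at the synthesized transcript; other attributes are dropped.
            out.append("\t".join(fields[:8] + [f"Parent=transcript:{attrs['Parent']}"]))
        else:
            out.append(line)
    return out
```

Real bacterial annotations have non-coding genes, multi-CDS operons and other features, so this only covers the simplest case.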
This script has been now added, thank you @flashton2003 |
You're welcome!
|
I have tried the bcftools csq command with the attached fasta, and three different gff files (one from ncbi, one ncbi modified to add transcripts, and one from ensembl bacteria). In all cases the gff cannot be parsed, and I haven't found the error message informative as to why.
Files to replicate this are attached here:
bcftools_csq.zip
The commands I tried with their stderr are in bcftools_csq.stderr