Add support for GFF files from NCBI #120

Closed
andrewjpage opened this issue May 13, 2015 · 6 comments

Comments

@andrewjpage
Member

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz

@tseemann
Contributor

NCBI GFF files don't include the ##FASTA section at the end. Also, the seqid (column 1) is just the NC_ number and not the full gi||ref|| ID that is in the FASTA.
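
The seqid mismatch described above can be checked before combining the files. The following is a minimal sketch, not something Roary provides: `missing_seqids` is a hypothetical helper, and it assumes the FASTA ID is the first whitespace-delimited word of each `>` header line and that the FASTA file is not empty.

```shell
# missing_seqids GFF FASTA
# Hypothetical helper: prints each seqid from GFF column 1 that has no
# FASTA header whose first word is exactly that seqid.
missing_seqids() {
    awk -F '\t' '
        # First file (the FASTA): record the first word of every header line.
        NR == FNR {
            if (/^>/) {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)
                fasta[id] = 1
            }
            next
        }
        # Second file (the GFF): stop at an embedded ##FASTA section.
        /^##FASTA/ { exit }
        # Skip comment/directive lines and anything that is not a 9-column feature.
        /^#/ || NF < 9 { next }
        # Report each unmatched seqid once.
        !($1 in fasta) && !($1 in seen) { seen[$1] = 1; print $1 }
    ' "$2" "$1"
}

# Example call (placeholder file names):
#   missing_seqids annotation.gff genome.fna
```

Any seqid it prints would need the corresponding FASTA header rewritten (or the GFF column rewritten) so that the two agree.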

@andrewjpage
Member Author

I rewrote Bio-RetrieveAssemblies (cpanm Bio::RetrieveAssemblies) so that it downloads WGS and RefSeq assemblies and converts them to GFF (including the FASTA section at the end), which should make it easier to get existing data from NCBI. It filters out dodgy stuff using RefWeak. There's a tweak or two I need to make to Roary first, but I hope that in a day or so you will be able to create a pan genome of all S. typhi by doing:

retrieve_assemblies -o my_files -f gff typhi
roary my_files/*.gff

@andrewjpage
Member Author

So Roary now works grand with WGS and RefSeq assemblies. It's not everything in GenBank, but it provides a vast quantity of data for people to play around with.

retrieve_assemblies -a -f gff typhi

http://sanger-pathogens.github.io/Roary/index.html#genbank_files
https://github.com/sanger-pathogens/Bio-RetrieveAssemblies

@stellareichling
Contributor

stellareichling commented Apr 28, 2020

Hello
I am also trying to get Roary running with GFF3 files from NCBI (annotated by PGAP). Is it possible to do so without taking a detour via Bio-RetrieveAssemblies?
Many thanks in advance and kind regards

@andrewjpage
Member Author

GFF3 files from NCBI usually only contain the annotation and not the nucleotide sequence. You'll need a single file that contains both the annotation and the assembly. You can either append the assembly to the end of the GFF file (with ##FASTA as the delimiter) or convert the gbff file (GenBank file with annotation and assembly) to a GFF file. Or you could use the script above.
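
The appending option can be sketched in shell as follows. This is a minimal sketch, assuming the annotation and the genomic FASTA have already been downloaded and decompressed; `append_fasta` is a hypothetical helper, not part of Roary, and the file names in the example call are placeholders.

```shell
# append_fasta GFF FASTA OUT
# Hypothetical helper: writes the GFF annotation, a single ##FASTA
# delimiter line, and then the nucleotide sequences into OUT.
append_fasta() {
    gff=$1
    fasta=$2
    out=$3
    {
        # Copy the annotation, stopping at any existing ##FASTA section
        # so the delimiter ends up in the output only once.
        awk '/^##FASTA/ { exit } { print }' "$gff"
        echo '##FASTA'
        cat "$fasta"
    } > "$out"
}

# Example call (placeholder file names):
#   append_fasta annotation.gff genome.fna combined.gff
```

As noted earlier in the thread, the seqid in column 1 of the GFF may not match the IDs in the FASTA headers, so it is worth checking that they agree in the combined file before passing it to `roary`.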

@vappiah

vappiah commented Jan 5, 2021

Hi @andrewjpage, I tried your suggestion about appending, but I still got an error. Below is one such message:
2021/01/05 22:56:24 Could not extract any protein sequences from fixed_input_files/pMUM001.gff. Does the file contain the assembly as well as the annotation?
Attached is the gff
pMUM001.gff.txt
