Building genbank/refseq databases from assembly_summary.txt #7

Closed
luizirber opened this issue Jul 17, 2020 · 1 comment

luizirber commented Jul 17, 2020

related to sourmash-bio/sourmash#970

Each subset of RefSeq and GenBank has an assembly_summary.txt file.
For example, the one for fungi: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
All refseq subsets: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
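
Since each subset sits at a predictable path, the URL of its summary file can be built from the section and subset names. A minimal sketch: the helper name `assembly_summary_url` is made up for illustration, and the pattern is inferred from the fungi URL above rather than from any NCBI guarantee.

```python
BASE_URL = "https://ftp.ncbi.nlm.nih.gov/genomes"


def assembly_summary_url(section, subset):
    """Return the assembly_summary.txt URL for e.g. ("refseq", "fungi")."""
    # Only the two top-level sections mentioned in this issue are accepted.
    if section not in ("refseq", "genbank"):
        raise ValueError(f"unknown section: {section!r}")
    return f"{BASE_URL}/{section}/{subset}/assembly_summary.txt"


print(assembly_summary_url("refseq", "fungi"))
```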

Benefits of using assembly_summary.txt:

  • We can generate good names for signatures from the columns, using assembly_accession, organism_name, infraspecific_name and asm_name. For example, for GCF_001477545.1, the name could be GCF_001477545.1 Pneumocystis carinii B80 strain=B80, Pneu_cari_B80_V3
  • The taxid field can be used to generate TaxInfo and save it in the Zipped SBT during indexing. Because we control both the name (instead of using --name-from-first) and how it is saved in the TaxInfo, scripts for converting results like gather_to_opal.py can be simplified.
  • We can distribute one database per refseq/genbank subset, so people don't need to download a gigantic one for everything. If they want to use all of them, that works too (just list them all in gather or search).
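
The first two points could look roughly like the sketch below: read the tab-separated file, build a name from the four columns, and keep the taxid alongside it. This is only an illustration, not how sourmash does it. The helper names (`read_assembly_summary`, `signature_name`) are hypothetical, the sample row carries just a handful of the real file's columns, and its taxid value is a placeholder rather than the real one.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per assembly from an assembly_summary.txt file handle."""
    header = None
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            # The last comment line before the data holds the column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        if not line or header is None:
            continue
        yield dict(zip(header, line.split("\t")))


def signature_name(row):
    """Build 'accession organism infraspecific, asm_name' from one row."""
    parts = [row["assembly_accession"], row["organism_name"]]
    infra = row.get("infraspecific_name", "")
    if infra and infra != "na":  # skip empty or "na" values
        parts.append(infra)
    name = " ".join(parts)
    asm_name = row.get("asm_name", "")
    if asm_name:
        name += f", {asm_name}"
    return name


# Reduced sample: real files have many more columns; taxid is a placeholder.
SAMPLE = "\n".join([
    "#   See the README for a description of the columns in this file.",
    "# " + "\t".join(["assembly_accession", "taxid", "organism_name",
                      "infraspecific_name", "asm_name"]),
    "\t".join(["GCF_001477545.1", "12345", "Pneumocystis carinii B80",
               "strain=B80", "Pneu_cari_B80_V3"]),
]) + "\n"

rows = list(read_assembly_summary(io.StringIO(SAMPLE)))
names = {row["assembly_accession"]: signature_name(row) for row in rows}
taxids = {row["assembly_accession"]: int(row["taxid"]) for row in rows}
print(names["GCF_001477545.1"])
```

The `taxids` mapping is the piece that would feed whatever taxonomy record ends up stored with the index.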

More info: https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

ctb commented Apr 2, 2022

Much belated update that I think finally resolves this:

The ncbi-assemblies examples in https://github.com/sourmash-bio/database-examples show how to use assembly_summary.txt to generate information in the new fromfile format merged in sourmash-bio/sourmash#1885.
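
For a rough idea of what that looks like, here is a sketch that writes a CSV with the `name`, `genome_filename` and `protein_filename` columns, which is my understanding of what the fromfile format expects (check the sourmash docs before relying on it). The helper `write_fromfile_csv`, the `genomes/` directory layout and the file naming are all assumptions for illustration, and the protein column is left empty here.

```python
import csv
import io

FROMFILE_COLUMNS = ["name", "genome_filename", "protein_filename"]


def write_fromfile_csv(rows, out, genome_dir="genomes"):
    """Write one CSV line per assembly_summary.txt row (given as dicts)."""
    writer = csv.DictWriter(out, fieldnames=FROMFILE_COLUMNS)
    writer.writeheader()
    for row in rows:
        acc = row["assembly_accession"]
        writer.writerow({
            "name": f"{acc} {row['organism_name']}",
            # Hypothetical local layout: one downloaded genome per accession.
            "genome_filename": f"{genome_dir}/{acc}_genomic.fna.gz",
            "protein_filename": "",
        })


example_rows = [
    {"assembly_accession": "GCF_001477545.1",
     "organism_name": "Pneumocystis carinii B80"},
]
buffer = io.StringIO()
write_fromfile_csv(example_rows, buffer)
print(buffer.getvalue())
```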

Moreover, I'm 99.9% sure that wort uses the same information from assembly_summary.txt to name the signatures it generates automatically.

closing!
