Make sourmash databases BDBag-compatible? #991

Open
luizirber opened this issue May 15, 2020 · 2 comments

@luizirber
Member

@taylorreiter mentioned NCBI's new datasets tool on Slack. The file it downloads is a zip archive in the BDBag format. This is the content after running ./datasets download assembly GCA_003583405.1 and unzipping the file:

.
├── bag-info.txt
├── bagit.txt
├── data
│   ├── dataset_catalog.json
│   └── GCA_003583405.1
│       ├── data_report.yaml
│       └── GCA_003583405.1_CHULA_Jazt_1.1_for_version_1.1_of_the_Jishengella_sp._nov._AZ1-13_genome_from_a_lab_in_CHULA_genomic.fna
├── fetch.txt
├── manifest-md5.txt
└── tagmanifest-md5.txt

What would be needed to make sourmash databases into BDBag-compatible datasets?
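
As a rough starting point for that question, here is a minimal sketch of wrapping a directory in the plain BagIt layout shown above (bagit.txt, manifest-md5.txt, and a data/ payload directory). It uses only the Python standard library. The function name `make_minimal_bag` and the idea of treating a sourmash database as a plain directory of files are assumptions for illustration, not anything sourmash provides. It does not write bag-info.txt, tagmanifest-md5.txt, or any BDBag-specific metadata, so a dedicated bagging library would likely be the better choice in practice.

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_minimal_bag(src_dir: Path, bag_dir: Path) -> Path:
    """Hypothetical helper: copy src_dir into bag_dir/data and write the
    two tag files a bare BagIt bag needs (bagit.txt, manifest-md5.txt)."""
    payload = bag_dir / "data"
    shutil.copytree(src_dir, payload)  # creates bag_dir and data/ as needed

    # Bag declaration: BagIt version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to bag root>" per file.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        rel = path.relative_to(bag_dir).as_posix()
        lines.append(f"{md5sum(path)}  {rel}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir
```

Calling `make_minimal_bag(Path("my-db"), Path("my-db-bag"))` (both paths hypothetical) would leave the original directory untouched and produce a new directory that generic BagIt tooling should be able to inspect.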

@luizirber luizirber added the idea label May 15, 2020
@luizirber
Member Author

Especially interesting: the unresolved/rehydrate use case in the examples:

Download a compact package, also known as an unresolved bag, containing data reports and file locations only for all 29 primate RefSeq genomes, then retrieve the data when needed using rehydrate:

# First download the compact package (<10 MB) (unresolved bag), containing data reports and file locations for 29 primate (Taxonomy ID: 9443) RefSeq genomes
$ ./datasets download assembly tax-id 9443 --refseq --limit ALL --unresolved --filename primates_refseq_unresolved.zip

# Then unzip the unresolved bag
$ unzip primates_refseq_unresolved.zip

# When needed, use the rehydrate command to get the genome sequences for these assemblies (about 80 GB of data)
$ ./datasets rehydrate --filename .
Found 563 files for rehydration
1h4m51s [====================================================================] 100%

This might fit neatly with the discussion in #985 (comment).
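
To make the unresolved-bag idea concrete: in the BagIt layout, fetch.txt lists payload files that are not shipped in the archive, one per line as a URL, a length in bytes (or `-` if unknown), and a path under data/. The sketch below parses that file and reports which entries are still missing locally. It is not how `datasets rehydrate` is implemented; `FetchEntry`, `parse_fetch`, and `missing_entries` are hypothetical names, and actually downloading the files is deliberately left out.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives "-" (unknown length)
    path: str              # path relative to the bag root, e.g. data/...


def parse_fetch(text: str) -> list[FetchEntry]:
    """Parse the contents of a BagIt fetch.txt into entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields so the path may contain spaces.
        url, length, path = line.split(None, 2)
        entries.append(
            FetchEntry(url, None if length == "-" else int(length), path)
        )
    return entries


def missing_entries(bag_dir: Path) -> list[FetchEntry]:
    """Return fetch.txt entries whose payload file is not yet on disk."""
    fetch_file = bag_dir / "fetch.txt"
    if not fetch_file.exists():
        return []  # a fully resolved bag may have no fetch.txt at all
    entries = parse_fetch(fetch_file.read_text(encoding="utf-8"))
    return [e for e in entries if not (bag_dir / e.path).exists()]
```

A sourmash equivalent could, under the same assumptions, ship only the tag files plus fetch.txt and resolve the listed entries on demand.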

@ctb
Contributor

ctb commented Feb 28, 2021

I want to drop in a reference to frictionless data, too!
