We provide a number of pre-built collections and indexed databases that you can use with sourmash.
For each k-mer size, three types of databases may be available: Zipfile (`.zip`), SBT (`.sbt.zip`), and LCA (`.lca.json.gz`). The Zipfile and SBT databases are built with scaled=1000, and the LCA databases are built with scaled=10,000.
We recommend using the Zipfile databases for `sourmash gather` and the SBT databases for `sourmash search`. You must use the LCA databases for `sourmash lca` operations.
You can read more about the different database and index types here.
Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1 and up.
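As a rough illustration of the recommendations and version requirements above, the following sketch maps a command to its recommended database type and checks it against a sourmash version string. The names `MIN_VERSION`, `RECOMMENDED`, `can_use`, and `pick_database` are hypothetical helpers written for this example; they are not part of sourmash.

```python
# Minimum sourmash version per database type, taken from the note above.
MIN_VERSION = {
    "zip": (4, 1),  # Zipfile collections: v4.1 and up
    "sbt": (3, 5),  # SBT databases: v3.5 and later
    "lca": (3, 5),  # LCA databases: v3.5 and later
}

# Recommended (or required) database type per command, as stated above.
RECOMMENDED = {"gather": "zip", "search": "sbt", "lca": "lca"}


def parse_version(text):
    """Turn '4.1.2' into (4, 1, 2); assumes purely numeric components."""
    return tuple(int(part) for part in text.split("."))


def can_use(db_type, sourmash_version):
    """Return True if this sourmash version can read the database type."""
    return parse_version(sourmash_version)[:2] >= MIN_VERSION[db_type]


def pick_database(command, sourmash_version):
    """Return the recommended database type, or raise if the version is too old."""
    db_type = RECOMMENDED[command]
    if not can_use(db_type, sourmash_version):
        needed = ".".join(str(n) for n in MIN_VERSION[db_type])
        raise ValueError(f"{db_type} databases need sourmash v{needed} or later")
    return db_type


if __name__ == "__main__":
    print(pick_database("gather", "4.1.0"))
    print(can_use("zip", "3.5.0"))
```

With an older installation, one option under these assumptions is to fall back to the SBT database for that k-mer size.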
All databases below can be downloaded via the command line with `curl -L <url> -o <output>`, where `<url>` is the URL below, and `<output>` is the filename you want to use locally.
The databases do not need to be unpacked or prepared in any way after download.
You can verify that they've been successfully downloaded (and view database properties such as `ksize` and `scaled`) with `sourmash sig summarize <output>`.
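The download and verification steps can also be scripted. The sketch below simply wraps the two commands given above (`curl -L <url> -o <output>` and `sourmash sig summarize <output>`); the function names and the `run` parameter are hypothetical, and the URL in the usage example is a placeholder, not a real database location.

```python
import shlex
import subprocess


def curl_command(url, output):
    """Build the curl invocation described above."""
    return ["curl", "-L", url, "-o", output]


def summarize_command(output):
    """Build the command that verifies a download and shows ksize/scaled."""
    return ["sourmash", "sig", "summarize", output]


def fetch_and_check(url, output, run=subprocess.run):
    """Download a database, then summarize it; stops at the first failure.

    `run` is injectable so the command sequence can be inspected without
    touching the network (an assumption made for this example).
    """
    for cmd in (curl_command(url, output), summarize_command(output)):
        print("running:", shlex.join(cmd))
        run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder URL and filename: substitute a download link from the tables.
    print(shlex.join(curl_command("https://example.org/database.zip", "database.zip")))
```

Calling `fetch_and_check(url, output)` with a real download link assumes that both `curl` and `sourmash` are on your `PATH`.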
GTDB R07-RS207 consists of 317,542 genomes organized into 65,703 species clusters.
The lineage spreadsheet (for `sourmash tax` commands) is available at the species level and at the strain level.
The GTDB genomic representatives are a low-redundancy subset of Genbank genomes, with 65,703 species-level genomes.
K-mer size | Zipfile collection | SBT | LCA |
---|---|---|---|
21 | download (1.7 GB) | download (3.5 GB) | download (181 MB) |
31 | download (1.7 GB) | download (3.5 GB) | download (181 MB) |
51 | download (1.7 GB) | download (3.5 GB) | download (181 MB) |
These are databases for the full GTDB release, each containing 317,542 genomes.
K-mer size | Zipfile collection | SBT | LCA |
---|---|---|---|
21 | download (9.4 GB) | download (19 GB) | download (351 MB) |
31 | download (9.4 GB) | download (19 GB) | download (351 MB) |
51 | download (9.4 GB) | download (19 GB) | download (351 MB) |
The zip files below contain signatures for all microbial Genbank genomes as of March 2022, based on the assembly_summary files provided here.
Since some of the files are extremely large, we only provide them in Zip format.
Taxonomic spreadsheets for each domain are provided below as well.
47,952 genomes:
genbank-2022.03-viral.lineages.csv.gz
8,750 genomes:
genbank-2022.03-archaea-k21.zip
genbank-2022.03-archaea-k31.zip
genbank-2022.03-archaea-k51.zip
genbank-2022.03-archaea.lineages.csv.gz
1,193 genomes:
genbank-2022.03-protozoa-k21.zip
genbank-2022.03-protozoa-k31.zip
genbank-2022.03-protozoa-k51.zip
genbank-2022.03-protozoa.lineages.csv.gz
10,286 genomes:
genbank-2022.03-fungi.lineages.csv.gz
1,148,011 genomes:
genbank-2022.03-bacteria-k21.zip
genbank-2022.03-bacteria-k31.zip
genbank-2022.03-bacteria-k51.zip
genbank-2022.03-bacteria.lineages.csv.gz
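The lineage spreadsheets are gzip-compressed CSV files, so they can be inspected without unpacking them first. The following is a minimal sketch using only the Python standard library; `peek_lineages` is a hypothetical helper, and it makes no assumption about the column names, which it reads from the file's own header row.

```python
import csv
import gzip
from itertools import islice


def peek_lineages(source, n_rows=3):
    """Return (column names, first n_rows data rows) from a gzipped CSV.

    `source` may be a path such as 'genbank-2022.03-archaea.lineages.csv.gz'
    or an open binary file object.
    """
    with gzip.open(source, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        header, rows = peek_lineages(sys.argv[1])
        print(header)
        for row in rows:
            print(row)
```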
All files below are available under https://osf.io/wxf9z/. The GTDB taxonomy spreadsheet (in a format suitable for `sourmash lca index`) is available here.
The GTDB genomic representatives are a low-redundancy subset of Genbank genomes.
K-mer size | Zipfile collection | SBT | LCA |
---|---|---|---|
21 | download (1.3 GB) | download (2.6 GB) | download (114 MB) |
31 | download (1.3 GB) | download (2.6 GB) | download (131 MB) |
51 | download (1.3 GB) | download (2.6 GB) | download (137 MB) |
These databases contain the complete GTDB collection of 258,406 genomes.
K-mer size | Zipfile collection | SBT | LCA |
---|---|---|---|
21 | download (7.8 GB) | download (15 GB) | download (266 MB) |
31 | download (7.8 GB) | download (15 GB) | download (286 MB) |
51 | download (7.8 GB) | download (15 GB) | download (299 MB) |
Database release workflows are being archived at sourmash-bio/database-releases.
Some more details on database use and construction:
- Zipfile collections can be used for a linear search. The signatures were calculated with scaled=1000, which robustly supports searches for matches of ~10kb or larger.
- SBT databases are indexed versions of the Zipfile collections that support faster search. They are also indexed with scaled=1000.
- LCA databases are indexed versions of the Zipfile collections that also contain taxonomy information and can be used with regular search as well as with the `lca` subcommands for taxonomic analysis. They are indexed with scaled=10,000, which robustly supports searches for 100kb or larger matches.
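The relationship between the scaled value and the smallest robustly detectable match can be sketched with simple arithmetic. This example assumes that a sketch keeps roughly 1 in `scaled` distinct k-mers and that a match has about one distinct k-mer per base. The threshold of about 10 hashes is inferred from the two figures in the list above (10kb at scaled=1000, 100kb at scaled=10,000) and is not a documented sourmash constant.

```python
def expected_hashes(match_bp, scaled):
    """Approximate number of sketch hashes covering a match of match_bp bases.

    Assumes about one distinct k-mer per base and a sampling rate of 1/scaled.
    """
    return match_bp / scaled


def min_match_bp(scaled, min_hashes=10):
    """Smallest match size that still yields about min_hashes hashes.

    min_hashes=10 is an assumption inferred from the figures above.
    """
    return scaled * min_hashes


if __name__ == "__main__":
    for scaled in (1000, 10_000):
        size = min_match_bp(scaled)
        print(f"scaled={scaled}: ~{size} bp match, "
              f"~{expected_hashes(size, scaled):.0f} hashes")
```

Under these assumptions, raising the scaled value by a factor of 10 shrinks the database but raises the smallest robustly detectable match by the same factor.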
The memory usage and run time of sourmash depend on the type of search, the query, and the database you're searching. As a guide, here is a range of numbers:
Search type | Query | Database | Max RAM | Time |
---|---|---|---|---|
gather | Bacterial genome | GTDB complete (280k) | 1 GB | 6 minutes |
gather | Simple metagenome | GTDB reps .zip (65k) | 2 GB | 6 minutes |
gather | Real metagenome | All Genbank (1.2m) | 100 GB | 3 hours |
lca summarize | Simple metagenome | GTDB reps .sql (65k) | 400 MB | 20 seconds |
lca summarize | Simple metagenome | GTDB reps .json (65k) | 6.2 GB | 1 minute 20 seconds |
Please see sourmash#1958 for detailed GTDB numbers and gather paper#47 for detailed Genbank numbers.
Legacy databases are available here.