All files below are available under https://osf.io/wxf9z/. The GTDB taxonomy spreadsheet (in a format suitable for `sourmash lca index`) is available here.
For each k-mer size, three databases are available.
- Zipfile collections can be used for a linear search. The signatures were calculated with scaled=1000, which robustly supports searches for matches of ~10 kb or larger.
- SBT databases are indexed versions of the Zipfile collections that support faster search. They are also indexed with scaled=1000.
- LCA databases are indexed versions of the Zipfile collections that also contain taxonomy information and can be used with regular search as well as with the `lca` subcommands for taxonomic analysis. They are indexed with scaled=10,000, which robustly supports searches for matches of 100 kb or larger.
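The link between the scaled value and the smallest match that can be found reliably can be illustrated with a little arithmetic. The sketch below assumes that a scaled sketch retains roughly one in every `scaled` distinct k-mers; the helper name `expected_hashes` is hypothetical and is not part of sourmash.

```python
def expected_hashes(match_length_bp: int, scaled: int, ksize: int = 31) -> float:
    """Approximate number of sketch hashes contributed by a match.

    Assumption: a scaled sketch keeps about 1/scaled of the distinct
    k-mers, and every k-mer in the matching region is distinct.
    """
    # A sequence of length L contains L - k + 1 k-mers (never fewer than 0).
    n_kmers = max(match_length_bp - ksize + 1, 0)
    return n_kmers / scaled


if __name__ == "__main__":
    # Compare the two scaled values used by the databases listed here.
    for scaled, length in [(1000, 10_000), (10_000, 100_000), (10_000, 10_000)]:
        n = expected_hashes(length, scaled)
        print(f"scaled={scaled}, match={length} bp: about {n:.1f} hashes")
```

Under these assumptions, a 10 kb match at scaled=1000 and a 100 kb match at scaled=10,000 both yield on the order of ten hashes, whereas a 10 kb match at scaled=10,000 yields only about one, too few to detect reliably.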
You can read more about the different database and index types here.
Legacy databases are available here.
Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up.
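The version requirements above can be encoded in a small lookup, for example to choose which database format to download for a given installation. This is only a sketch: the dictionary keys and function names are hypothetical, and the minimum versions are taken directly from the note above.

```python
import re

# Minimum sourmash version per database type, as stated in the note above.
MIN_VERSION = {
    "zip": (4, 1, 0),  # Zipfile collections
    "sbt": (3, 5, 0),  # SBT databases
    "lca": (3, 5, 0),  # LCA databases
}


def parse_version(text: str) -> tuple:
    """Turn a string such as '4.1.0' or '3.5' into a 3-tuple of ints."""
    numbers = []
    for part in text.split(".")[:3]:
        match = re.match(r"\d+", part)  # ignore suffixes such as 'rc1'
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad '3.5' to (3, 5, 0)
        numbers.append(0)
    return tuple(numbers)


def is_supported(db_type: str, sourmash_version: str) -> bool:
    """Return True if the given sourmash version can read this database type."""
    return parse_version(sourmash_version) >= MIN_VERSION[db_type]


if __name__ == "__main__":
    for db_type in ("zip", "sbt", "lca"):
        print(db_type, is_supported(db_type, "3.5.0"))
```

With sourmash 3.5.0 installed, this check reports that the SBT and LCA databases are usable but the Zipfile collections are not.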
The GTDB genomic representatives are a low-redundancy subset of GenBank genomes.
K-mer size | Zipfile collection | SBT | LCA |
---|---|---|---|
21 | download (1.3 GB) | download (2.6 GB) | download (114 MB) |
31 | download (1.3 GB) | download (2.6 GB) | download (131 MB) |
51 | download (1.3 GB) | download (2.6 GB) | download (137 MB) |
These databases contain the complete GTDB collection of 258,406 genomes.
K-mer size | Zipfile collection | SBT | LCA |
---|---|---|---|
21 | download (7.8 GB) | download (15 GB) | download (266 MB) |
31 | download (7.8 GB) | download (15 GB) | download (286 MB) |
51 | download (7.8 GB) | download (15 GB) | download (299 MB) |