Prepared databases

GTDB R06-rs202 - DNA databases

All files below are available under https://osf.io/wxf9z/. The GTDB taxonomy spreadsheet (in a format suitable for sourmash lca index) is available here.

For each k-mer size, three databases are available.

  • Zipfile collections can be used for a linear search. The signatures were calculated with scaled=1000, which robustly supports searches for matches of ~10 kb or larger.
  • SBT databases are indexed versions of the Zipfile collections that support faster search. They are also indexed with scaled=1000.
  • LCA databases are indexed versions of the Zipfile collections that also contain taxonomy information. They can be used with regular search as well as with the lca subcommands for taxonomic analysis. They are indexed with scaled=10000, which robustly supports searches for matches of ~100 kb or larger.
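
The two match-size figures above follow the same proportion: the smallest robustly supported match is about ten times the scaled value. The sketch below makes that arithmetic explicit. The function name `approx_min_match_bp` and the `hashes_per_match=10` default are illustrative assumptions inferred from the numbers in this document; they are not part of sourmash.

```python
def approx_min_match_bp(scaled: int, hashes_per_match: int = 10) -> int:
    """Rough size (in bp) of the smallest match a given scaled value supports.

    ASSUMPTION: hashes_per_match=10 is inferred from the figures in this
    document (scaled=1000 -> ~10 kb, scaled=10000 -> ~100 kb). It is not a
    sourmash parameter.
    """
    if scaled <= 0:
        raise ValueError("scaled must be a positive integer")
    # With a scaled value of S, roughly one k-mer in S is kept as a hash,
    # so a region of N bp contributes about N / S hashes to a signature.
    return scaled * hashes_per_match


for scaled in (1000, 10_000):
    print(f"scaled={scaled}: ~{approx_min_match_bp(scaled) // 1000} kb")
```

Under that assumption, the loop prints roughly 10 kb for scaled=1000 and 100 kb for scaled=10000, matching the figures quoted for the Zipfile/SBT and LCA databases respectively.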

You can read more about the different database and index types here.

Legacy databases are available here.

Note that the SBT and LCA databases can be used with sourmash v3.5 and later, while Zipfile collections can only be used with sourmash v4.1.0 and up.
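
The version requirements above can be checked before downloading anything. The following is a minimal sketch, assuming plain numeric version strings such as `4.1.0` or `v3.5` (no `rc` or `dev` suffixes). The names `MIN_VERSION`, `parse_version`, and `is_usable` are hypothetical helpers written for this example, not sourmash functions.

```python
# Minimum sourmash version for each database format, as stated in this document.
MIN_VERSION = {
    "zip": (4, 1, 0),  # Zipfile collections: v4.1.0 and up
    "sbt": (3, 5, 0),  # SBT databases: v3.5 and later
    "lca": (3, 5, 0),  # LCA databases: v3.5 and later
}


def parse_version(text: str) -> tuple:
    """Turn '4.1.0' or 'v3.5' into a 3-tuple of ints, padding with zeros.

    ASSUMPTION: the string contains only dot-separated integers,
    optionally prefixed with 'v'.
    """
    parts = [int(p) for p in text.lstrip("v").split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def is_usable(db_format: str, sourmash_version: str) -> bool:
    """Return True if the installed version meets the format's minimum."""
    return parse_version(sourmash_version) >= MIN_VERSION[db_format]


print(is_usable("zip", "4.0.0"))  # Zipfile collection with sourmash 4.0.0
print(is_usable("sbt", "3.5"))    # SBT database with sourmash 3.5
```

Tuples compare element by element, so `(4, 0, 0) >= (4, 1, 0)` is false and the first call reports that a Zipfile collection is not usable with 4.0.0, while the second call reports that an SBT database is usable with 3.5.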

GTDB genomic representatives (47.8k genomes)

The GTDB genomic representatives are a low-redundancy subset of GenBank genomes.

| K-mer size | Zipfile collection | SBT | LCA |
| --- | --- | --- | --- |
| 21 | download (1.3 GB) | download (2.6 GB) | download (114 MB) |
| 31 | download (1.3 GB) | download (2.6 GB) | download (131 MB) |
| 51 | download (1.3 GB) | download (2.6 GB) | download (137 MB) |

GTDB all genomes (258k genomes)

These databases contain the complete GTDB collection of 258,406 genomes.

| K-mer size | Zipfile collection | SBT | LCA |
| --- | --- | --- | --- |
| 21 | download (7.8 GB) | download (15 GB) | download (266 MB) |
| 31 | download (7.8 GB) | download (15 GB) | download (286 MB) |
| 51 | download (7.8 GB) | download (15 GB) | download (299 MB) |
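
For planning disk space, the download sizes listed in the two tables can be summed for whichever combination of collection, format, and k-mer size is needed. This is a rough sketch only: the sizes are transcribed from the tables in this document, 1 GB is taken as 1000 MB, and the names `SIZES_MB` and `total_download_mb` are hypothetical.

```python
# Download sizes in MB, transcribed from the tables in this document.
# ASSUMPTION: 1 GB is treated as 1000 MB; the listed sizes are rounded.
SIZES_MB = {
    "representatives": {
        "zip": {21: 1300, 31: 1300, 51: 1300},
        "sbt": {21: 2600, 31: 2600, 51: 2600},
        "lca": {21: 114, 31: 131, 51: 137},
    },
    "all": {
        "zip": {21: 7800, 31: 7800, 51: 7800},
        "sbt": {21: 15000, 31: 15000, 51: 15000},
        "lca": {21: 266, 31: 286, 51: 299},
    },
}


def total_download_mb(collection: str, formats: list, ksizes: list) -> int:
    """Sum the listed download sizes for the chosen formats and k-mer sizes."""
    return sum(SIZES_MB[collection][fmt][k] for fmt in formats for k in ksizes)


# All three formats of the genomic representatives at k=31:
print(total_download_mb("representatives", ["zip", "sbt", "lca"], [31]))  # 4031
```

The result is the download size only; it does not account for any additional space a tool may need when working with the files.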