updates of indexed databases

sourmash-bio · Feb 8, 2021 · 40762f3
1 parent 5e1f92b
commit 40762f3

Showing 2 changed files with 95 additions and 2 deletions.
diff --git a/doc/command-line.md b/doc/command-line.md
@@ -61,15 +61,16 @@ Matrix:
 
 To get a list of subcommands, run `sourmash` without any arguments.
 
-There are five main subcommands: `sketch`, `compare`, `plot`,
-`search`, and `gather`.  See [the tutorial](tutorials.md) for a
+There are six main subcommands: `sketch`, `compare`, `plot`,
+`search`, `gather`, and `index`.  See [the tutorial](tutorials.md) for a
 walkthrough of these commands.
 
 * `sketch` creates signatures.
 * `compare` compares signatures and builds a distance matrix.
 * `plot` plots distance matrices created by `compare`.
 * `search` finds matches to a query signature in a collection of signatures.
 * `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures
+* `index` builds a fast index for many (thousands of) signatures.
 
 There are also a number of commands that work with taxonomic
 information; these are grouped under the `sourmash lca`
@@ -288,6 +289,37 @@ genomes with no (or incomplete) taxonomic information.  Use `sourmash
 lca summarize` to classify a metagenome using a collection of genomes
 with taxonomic information.
 
+### `sourmash index` - build an SBT index of signatures
+
+The `sourmash index` command creates a Zipped SBT database
+(`.sbt.zip`) from a collection of signatures.  This can be used to
+create databases from private collections of genomes, and can also be
+used to create databases for e.g. subsets of GenBank.
+
+These databases support fast search and gather on large collections
+of signatures in low memory.
+
+SBTs can only be created on scaled signatures, and all signatures in
+an SBT must be of compatible types (i.e. the same k-mer size and
+molecule type). You can specify the usual command line selectors
+(`-k`, `--scaled`, `--dna`, `--protein`, etc.) to pick out the types
+of signatures to include.
+
+Usage:
+```
+sourmash index database [ list of input signatures/directories/databases ]
+```
+
+This will create a `database.sbt.zip` file containing the SBT of the
+input signatures. You can create an "unpacked" version by specifying
+`database.sbt.json` instead, which creates the JSON file as well as a
+subdirectory of files under `.sbt.database`.
+
+Note that you can use `--from-file` to pass `index` a text file
+containing a list of files to index; you can also provide individual
+signature files, directories full of signatures, or other sourmash
+databases.
+
 ## `sourmash lca` subcommands for taxonomic classification
 
 These commands use LCA databases (created with `lca index`, below, or
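
A minimal sketch of the `--from-file` workflow described in the `sourmash index` section above. The `sigs` directory, the `siglist.txt` file name, and the `-k 31` selector are illustrative assumptions, not part of this commit; the sketch also assumes `--from-file` reads one path per line. The `sourmash` call is guarded so the script does nothing if sourmash is not installed or no signatures are found.

```shell
#!/bin/sh
# Hypothetical layout: signature files live under ./sigs (assumed name).
mkdir -p sigs

# Write one signature path per line (assumed format for --from-file).
find sigs -name '*.sig' | sort > siglist.txt

# Build database.sbt.zip only if sourmash is available and the list is non-empty.
if command -v sourmash >/dev/null 2>&1 && [ -s siglist.txt ]; then
    # -k 31 picks out one k-mer size so all indexed signatures are compatible.
    sourmash index -k 31 database --from-file siglist.txt
fi
```

Passing `database.sbt.json` in place of `database` would request the unpacked form instead.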

diff --git a/doc/using-sourmash-a-guide.md b/doc/using-sourmash-a-guide.md
@@ -1,5 +1,9 @@
 # Using sourmash: a practical guide
 
+```{contents}
+   :depth: 2
+```
+
 So! You've installed sourmash, run a few of the tutorials and commands,
 and now you actually want to *use* it.  This guide is here to answer some
 of your questions, and explain why we can't answer others.
@@ -145,3 +149,60 @@ names them based on their FASTA headers, and places them all in a single
 `.sig` file, `file.fa.sig`.  (This behavior is triggered by the option
 `--singleton`, which tells sourmash to treat each individual sequence in
 the file as an independent sequence.)
+
+## How do I store and search collections of signatures?
+
+sourmash supports a variety of signature loading and storage options for
+flexibility.  If you have only a few hundred signatures, here are some
+options:
+
+* you can put all your signature files in a directory and search them all
+  using the path to the directory.
+* you can use `sourmash sig cat` to concatenate multiple signatures into a
+  single file.
+* you can compress any signature file using `gzip` and sourmash will
+  still load it.
+
+If you have more than a few hundred genome signatures that you
+regularly search, it might be worth creating an indexed database of
+them that will support faster searches.
+
+sourmash supports two types of indexed databases: Sequence Bloom
+Trees, or SBTs; and reverse indices, or LCAs.  (You can read more
+detail about their implementation and design considerations
+[in Chapter 2 of Dr. Luiz Irber's thesis, "Efficient indexing of collections of signatures"](https://github.com/luizirber/phd/releases/download/2020.09.28/thesis.pdf).)
+
+### Sequence Bloom Tree (SBT) indexed databases
+
+Sequence Bloom Trees (SBTs) (see
+[Solomon and Kingsford, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/))
+are on-disk databases that support low-memory querying of tens to hundreds
+of thousands of signatures.  They can be created using `sourmash index`.
+
+SBTs are the lowest-memory way to run search or gather on a collection
+of signatures. The tradeoff is that they may be quite large on disk,
+because SBTs also contain intermediate nodes in the tree.  The default
+way to store SBTs is in a Zip file, with the extension `.sbt.zip`, that
+can be built and searched directly from the command line.
+
+### Reverse indexed (LCA) databases
+
+Reverse indexed or LCA databases are *in-memory* databases that, once
+loaded from disk, support fast search and gather across tens of thousands
+of signatures.  They can be created using `sourmash lca index` ([docs](command-line.md#sourmash-lca-index-build-an-lca-database)).
+
+LCA databases are currently stored in JSON files (that can be gzipped).
+As these files get larger, the time required to load them from disk
+can be substantial.
+
+LCA databases are also currently (sourmash 2.0-4.0) the only databases
+that support the inclusion of taxonomic information in the database,
+and there is an associated collection of commands
+[under `sourmash lca`](command-line.md#sourmash-lca-subcommands-for-taxonomic-classification).
+However, they can also be used as regular indexed databases for search
+and gather as above.
+
+(These are called "LCA databases" because they were originally created
+to support "lowest common ancestor" taxonomic analyses, like those in
+Kraken; their functionality has evolved a lot since, but their name
+hasn't changed to match!)
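
The guide's rule of thumb (search files or directories directly for a few hundred signatures, build an indexed database beyond that) can be sketched as a small helper. The `THRESHOLD` of 300 is an illustrative stand-in for "a few hundred", since the guide gives no exact cutoff; `suggest_storage` and the `sigs` directory are hypothetical names, not part of sourmash.

```shell
#!/bin/sh
# Illustrative cutoff only; the guide says "a few hundred" without a number.
THRESHOLD=300

suggest_storage() {
    # $1: number of signatures in the collection
    if [ "$1" -gt "$THRESHOLD" ]; then
        echo "index"       # worth building an indexed database, e.g. with `sourmash index`
    else
        echo "directory"   # search the directory of signature files directly
    fi
}

# Count .sig files under a hypothetical ./sigs directory (0 if it does not exist).
count=$(find sigs -name '*.sig' 2>/dev/null | wc -l | tr -d ' ')
echo "signatures: $count, suggestion: $(suggest_storage "$count")"
```

Whether an SBT or an LCA database is the better index then depends on the tradeoffs described above: SBTs use less memory but more disk, while LCA databases load fully into memory.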