updates of indexed databases
ctb committed Feb 8, 2021
1 parent 5e1f92b commit 40762f3
Showing 2 changed files with 95 additions and 2 deletions.
36 changes: 34 additions & 2 deletions doc/command-line.md
@@ -61,15 +61,16 @@ Matrix:

To get a list of subcommands, run `sourmash` without any arguments.

There are five main subcommands: `sketch`, `compare`, `plot`,
`search`, and `gather`. See [the tutorial](tutorials.md) for a
There are six main subcommands: `sketch`, `compare`, `plot`,
`search`, `gather`, and `index`. See [the tutorial](tutorials.md) for a
walkthrough of these commands.

* `sketch` creates signatures.
* `compare` compares signatures and builds a distance matrix.
* `plot` plots distance matrices created by `compare`.
* `search` finds matches to a query signature in a collection of signatures.
* `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures.
* `index` builds a fast index for many (thousands of) signatures.

There are also a number of commands that work with taxonomic
information; these are grouped under the `sourmash lca`
@@ -288,6 +289,37 @@ genomes with no (or incomplete) taxonomic information. Use `sourmash
lca summarize` to classify a metagenome using a collection of genomes
with taxonomic information.

### `sourmash index` - build an SBT index of signatures

The `sourmash index` command creates a Zipped SBT database
(`.sbt.zip`) from a collection of signatures. This can be used to
create databases from private collections of genomes, and can also be
used to create databases for e.g. subsets of GenBank.

These databases support fast search and gather on large collections
of signatures in low memory.

SBTs can only be created from scaled signatures, and all signatures in
an SBT must be of compatible types (i.e. the same k-mer size and
molecule type). You can specify the usual command line selectors
(`-k`, `--scaled`, `--dna`, `--protein`, etc.) to pick out the types
of signatures to include.

Usage:
```
sourmash index database [ list of input signatures/directories/databases ]
```

This will create a `database.sbt.zip` file containing the SBT of the
input signatures. To create an "unpacked" version instead, specify
`database.sbt.json` as the database name; this creates the JSON file
as well as a subdirectory of files under `.sbt.database`.

Note that you can use `--from-file` to pass `index` a text file
containing a list of files to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.
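
As a rough sketch of the `--from-file` workflow described above, the
following collects signature paths into a text file and passes that
file to `sourmash index`. The directory `sigs/`, the list file
`siglist.txt`, the database name `mydb`, and the helper
`make_sig_list` are all placeholders made up for illustration, and
`-k 31` is only an example selector value. The `sourmash` call is
skipped when sourmash is not installed or no signatures are found.

```shell
#!/bin/sh
# Hypothetical helper: write the path of every *.sig file under
# directory $1 into list file $2, one path per line.
make_sig_list() {
    find "$1" -type f -name '*.sig' | sort > "$2"
}

# Placeholder locations; adjust to your own layout.
sig_dir="sigs"
list_file="siglist.txt"

if [ -d "$sig_dir" ]; then
    make_sig_list "$sig_dir" "$list_file"
    # Only call sourmash if it is installed and the list is non-empty.
    if command -v sourmash >/dev/null 2>&1 && [ -s "$list_file" ]; then
        # Builds mydb.sbt.zip from the listed signatures, keeping k=31 sketches.
        sourmash index -k 31 mydb --from-file "$list_file"
    fi
fi
```

A list file is mostly useful when there are too many signature files
to name comfortably on the command line.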

## `sourmash lca` subcommands for taxonomic classification

These commands use LCA databases (created with `lca index`, below, or
61 changes: 61 additions & 0 deletions doc/using-sourmash-a-guide.md
@@ -1,5 +1,9 @@
# Using sourmash: a practical guide

```{contents}
:depth: 2
```

So! You've installed sourmash, run a few of the tutorials and commands,
and now you actually want to *use* it. This guide is here to answer some
of your questions, and explain why we can't answer others.
@@ -145,3 +149,60 @@ names them based on their FASTA headers, and places them all in a single
`.sig` file, `file.fa.sig`. (This behavior is triggered by the option
`--singleton`, which tells sourmash to treat each individual sequence in
the file as an independent sequence.)

## How do I store and search collections of signatures?

sourmash supports a variety of signature loading and storage options for
flexibility. If you have only a few hundred signatures, here are some
options:

* you can put all your signature files in a directory and search them all
using the path to the directory.
* you can use `sourmash sig cat` to concatenate multiple signatures into a
single file.
* you can compress any signature file using `gzip` and sourmash will
load the compressed file directly.
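
The three options above might look like this on the command line. This
is only a sketch: `query.sig`, `a.sig`, `b.sig`, `all.sig`, and the
`sigs/` directory are placeholder names, and `compress_sig` is a small
helper defined here for illustration, not part of sourmash. The
sourmash commands run only if sourmash and the placeholder inputs are
present.

```shell
#!/bin/sh
# Hypothetical helper: gzip a signature file, keeping the original in place.
compress_sig() {
    gzip -c "$1" > "$1.gz"
}

if command -v sourmash >/dev/null 2>&1 \
   && [ -f query.sig ] && [ -f a.sig ] && [ -f b.sig ] && [ -d sigs ]; then
    # Option 1: search every signature found under a directory.
    sourmash search query.sig sigs/

    # Option 2: concatenate several signatures into one file and search it.
    sourmash sig cat a.sig b.sig -o all.sig
    sourmash search query.sig all.sig

    # Option 3: compress the combined file and search the .gz directly.
    compress_sig all.sig
    sourmash search query.sig all.sig.gz
fi
```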

If you have more than a few hundred genome signatures that you
regularly search, it might be worth creating an indexed database of
them that will support faster searches.

sourmash supports two types of indexed databases: Sequence Bloom
Trees, or SBTs; and reverse indices, or LCAs. (You can read more
detail about their implementation and design considerations
[in Chapter 2 of Dr. Luiz Irber's thesis, "Efficient indexing of collections of signatures"](https://github.com/luizirber/phd/releases/download/2020.09.28/thesis.pdf).)

### Sequence Bloom Tree (SBT) indexed databases

Sequence Bloom Trees (SBTs) (see
[Solomon and Kingsford, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/))
are on-disk databases that support low-memory queries of tens to
hundreds of thousands of signatures. They can be created using
`sourmash index`.

SBTs are the lowest-memory way to run search or gather on a collection
of signatures. The tradeoff is that they may be quite large on disk,
because SBTs also contain intermediate nodes in the tree. The default
way to store SBTs is in a Zip file, named `.sbt.zip`, that can be
built and searched directly from the command line.
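
To see the memory-versus-disk tradeoff on your own data, you could
compare the size of the input signatures with the size of the
resulting database, and then query the database directly. In this
sketch, `sigs/`, `mydb.sbt.zip`, and `query.sig` are placeholder
names, and `size_kb` is a helper defined here for illustration; each
step is skipped when its inputs are missing.

```shell
#!/bin/sh
# Hypothetical helper: print the disk usage of a file or directory in KB.
size_kb() {
    du -sk "$1" | cut -f1
}

# Placeholder names; adjust to your own files.
sig_dir="sigs"
db="mydb.sbt.zip"
query="query.sig"

# Compare the on-disk size of the raw signatures and the SBT built from them.
if [ -d "$sig_dir" ] && [ -f "$db" ]; then
    echo "signatures: $(size_kb "$sig_dir") KB; SBT: $(size_kb "$db") KB"
fi

# Search and gather can both take the zipped SBT as the database argument.
if command -v sourmash >/dev/null 2>&1 && [ -f "$db" ] && [ -f "$query" ]; then
    sourmash search "$query" "$db"
    sourmash gather "$query" "$db"
fi
```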

### Reverse indexed (LCA) databases

Reverse indexed or LCA databases are *in-memory* databases that, once
loaded from disk, support fast search and gather across tens of thousands
of signatures. They can be created using `sourmash lca index` ([docs](command-line.md#sourmash-lca-index-build-an-lca-database)).

LCA databases are currently stored in JSON files (that can be gzipped).
As these files get larger, the time required to load them from disk
can be substantial.

LCA databases are also currently (sourmash 2.0-4.0) the only databases
that support the inclusion of taxonomic information in the database,
and there is an associated collection of commands
[under `sourmash lca`](command-line.md#sourmash-lca-subcommands-for-taxonomic-classification).
However, they can also be used as regular indexed databases for search
and gather as above.

(These are called "LCA databases" because they were originally created
to support "lowest common ancestor" taxonomic analyses, similar to
Kraken; their functionality has evolved a lot since, but their name
hasn't changed to match!)
