Commit

MRG: update the CLI docs and help for search --containment and `prefetch` (#2971)

Adds useful information about the order of containment searches:
* `search --containment A B` reports the containment of A in B;
* `prefetch A B` reports the containment of B in A.

Fixes #2968.
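
The asymmetry described above can be illustrated with plain Python sets standing in for sketches. This is only a sketch under stated assumptions: the `containment` and `jaccard` helpers and the toy hash sets are hypothetical and are not part of the sourmash API.

```python
def containment(query: set, subject: set) -> float:
    """Fraction of the query's hashes that also appear in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy stand-ins (assumed data): a "genome" fully present in a larger "metagenome".
genome = set(range(100))
metagenome = set(range(1000))

# Containment depends on which set is treated as the query.
genome_in_metagenome = containment(genome, metagenome)
metagenome_in_genome = containment(metagenome, genome)

# Jaccard is symmetric: swapping the arguments gives the same value.
jaccard_forward = jaccard(genome, metagenome)
jaccard_reverse = jaccard(metagenome, genome)

print(genome_in_metagenome, metagenome_in_genome, jaccard_forward, jaccard_reverse)
```

In these terms, `search --containment A B` corresponds to `containment(A, B)`, while `prefetch A B` corresponds to `containment(B, A)`.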
ctb committed Feb 5, 2024
1 parent 3af9a04 commit 5e6fdb9
Showing 2 changed files with 14 additions and 1 deletion.
12 changes: 11 additions & 1 deletion doc/command-line.md
@@ -325,6 +325,13 @@ Match information can be saved to a CSV file with `-o/--output`; with
`-o`, all matches above the threshold will be saved, not just those
printed to stdout (which are limited to `-n/--num-results`).

+ The `--containment` flag calculates the containment of the query in
+ database matches; this is an asymmetric order-dependent measure,
+ unlike Jaccard. Here, `search --containment Q A B C D` will report the
+ containment of `Q` in each of `A`, `B`, `C`, and `D`. This is opposite
+ to the order used by `prefetch`, where the composite sketch (e.g. metagenomes)
+ is the query, and the matches are contained items (e.g. genomes).

As of sourmash 4.2.0, `search` supports `--picklist`, to
[select a subset of signatures to search, based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures). This
can be used to search only a small subset of a large collection, or to
@@ -477,7 +484,10 @@ The `prefetch` subcommand searches a collection of scaled signatures
for matches in a large database, using containment. It is similar to
`search --containment`, while taking a `--threshold-bp` argument like
`gather` does for thresholding matches (instead of using Jaccard
- similarity or containment).
+ similarity or containment). Note that `prefetch` uses the composite
+ sketch (e.g. a metagenome) as the query, and finds all matching
+ subjects (e.g. genomes) from the database - the arguments are in the
+ opposite order from `search --containment`.

`sourmash prefetch` is intended to select a subset of a large database
for further processing. As such, it can search very large collections
3 changes: 3 additions & 0 deletions src/sourmash/cli/search.py
@@ -35,6 +35,9 @@
[1] https://en.wikipedia.org/wiki/Jaccard_index
+ When `--containment` is provided, the containment of the query in each
+ of the search signatures or databases is reported.
---
"""
