[MRG] add picklists to selector protocol and provide initial Index support (#1588)

* various cleanups of sourmash_args

* cleanup flakes errors

* clean up sourmash.sig submodule

* initial picklist implementation

* integrate picklists into sourmash sig extract

* basic tests for picklist functionality

* track found etc

* add picklists to selectors

* split pickfile out a little bit

* split column_type out of SignaturePicklist a bit

* picklist tests for .signatures() methods on Index classes

* split pickfile out a little bit

* split column_type out of SignaturePicklist a bit

* test 'Index.find' on picklists for SBTs and LCAs

* factor out picklist checks to 'passes_all_picklists' fn

* update comments, constructor, etc.

* fix tests :)

* more picklist tests

* verify output

* add --picklist-require-all &c

* documentation

* test with --md5 selector

* cover untested code with tests

* trap errors and be nice to users

* remove comment

* fix tests for new SignaturePicklist

* move picklist.py from sourmash.sig into sourmash

* move picklist reporting into sourmash_args

* fix space

* add picklist args throughout, eek.

* add picklists and tests for search, gather, index

* add picklists to prefetch

* add picklists to sourmash compare

* add picklists to lca index

* block multiple picklists on SBTs and LCAs, for now

* add picklist test that checks indexing-and-then-search == index

* add a test for using prefetch CSV as picklist

* remove debugging print

* add docs

* remove order dependence from test

* further attempt to fix test

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
ctb and bluegenes committed Jun 18, 2021
1 parent a2d438d commit 74de59a
Showing 27 changed files with 729 additions and 114 deletions.
99 changes: 61 additions & 38 deletions doc/command-line.md
@@ -177,15 +177,14 @@ sourmash compare file1.sig [ file2.sig ... ]
```

Options:
```
--output -- save the distance matrix to this file (as a numpy binary matrix)
--ksize -- do the comparisons at this k-mer size.
--containment -- calculate containment instead of similarity.
C(i, j) = size(i intersection j) / size(i).
--from-file -- append the list of files in this text file to the input

* `--output` -- save the distance matrix to this file (as a numpy binary matrix)
* `--ksize` -- do the comparisons at this k-mer size.
* `--containment` -- calculate containment instead of similarity; `C(i, j) = size(i intersection j) / size(i)`
* `--from-file` -- append the list of files in this text file to the input
signatures.
--ignore-abundance -- ignore abundances in signatures.
```
* `--ignore-abundance` -- ignore abundances in signatures.
* `--picklist` -- select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures)

**Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.
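The asymmetry in the note above can be seen with a small sketch. It treats two sketches as plain Python sets of hash values, which is an illustrative assumption and not how sourmash stores signatures; `jaccard` and `containment` are hypothetical helper names.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |a ∩ b| / |a ∪ b|. Symmetric in a and b."""
    return len(a & b) / len(a | b)


def containment(a: set, b: set) -> float:
    """Containment C(a, b) = |a ∩ b| / |a|. Not symmetric in general."""
    return len(a & b) / len(a)


# Two toy "sketches" represented as plain sets of hash values.
small = {1, 2, 3, 4}
large = {1, 2, 3, 4, 5, 6, 7, 8}

print(jaccard(small, large), jaccard(large, small))          # 0.5 0.5
print(containment(small, large), containment(large, small))  # 1.0 0.5
```

Because `containment(small, large)` and `containment(large, small)` differ, a matrix of containment values is not symmetric.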

@@ -249,6 +248,9 @@ similarity match
...
```

As of sourmash 4.2.0, `search` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash gather` - find metagenome members

The `gather` subcommand selects the best reference genomes to use for
@@ -289,6 +291,9 @@ which matches are no longer reported; by default, this is set to
50kb. See the Appendix in
[Classifying Signatures](classifying-signatures.md) for details.

As of sourmash 4.2.0, `gather` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

Note:

Use `sourmash gather` to classify a metagenome against a collection of
@@ -350,6 +355,9 @@ containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.

As of sourmash 4.2.0, `index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash prefetch` - select subsets of very large databases for more processing

The `prefetch` subcommand searches a collection of scaled signatures
@@ -375,6 +383,7 @@ Other options include:
* `--threshold-bp` to require a minimum estimated bp overlap for output;
* `--scaled` for downsampling;
* `--force` to continue past survivable errors;
* `--picklist` to select a subset of signatures with [a picklist](#using-picklists-to-subset-large-collections-of-signatures).

### Alternative search mode for low-memory (but slow) search: `--linear`

@@ -589,6 +598,9 @@ see
You can use `--from-file` to pass `lca index` a text file containing a
list of file names to index.

As of sourmash 4.2.0, `lca index` supports `--picklist`, to
[select a subset of signatures based on a CSV file](#using-picklists-to-subset-large-collections-of-signatures).

### `sourmash lca rankinfo` - examine an LCA database

The `sourmash lca rankinfo` command displays k-mer specificity
@@ -821,36 +833,8 @@ will extract the same signature, which has an accession number of
#### Using picklists with `sourmash sig extract`

As of sourmash 4.2.0, `extract` also supports picklists, a feature by
which you can select signatures based on values in a CSV file.

For example,
```
sourmash sig extract --picklist list.csv:md5:md5sum <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.
which you can select signatures based on values in a CSV file. See
[Using picklists to subset large collections of signatures](#using-picklists-to-subset-large-collections-of-signatures), below.

### `sourmash signature flatten` - remove abundance information from signatures

@@ -963,6 +947,45 @@ signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
scaled values will be made compatible.

### Using picklists to subset large collections of signatures

As of sourmash 4.2.0, many commands support *picklists*, a feature by
which you can select or "pick out" signatures based on values in a CSV
file.

For example,
```
sourmash sig extract --picklist list.csv:md5sum:md5 <signatures>
```
will extract only the signatures that have md5sums matching the
column `md5sum` in the CSV file `list.csv`.

The `--picklist` argument string must be of the format
`pickfile:colname:coltype`, where `pickfile` is the path to a CSV
file, `colname` is the name of the column to select from the CSV
file (based on the headers in the first line of the CSV file),
and `coltype` is the type of match.

The following `coltype`s are currently supported by `sourmash sig extract`:

* `name` - exact match to signature's name
* `md5` - exact match to signature's md5sum
* `md5prefix8` - match to 8-character prefix of signature's md5sum
* `md5short` - same as `md5prefix8`
* `ident` - exact match to signature's identifier
* `identprefix` - match to signature's identifier, before '.'

Identifiers are constructed by using the first space-delimited word in
the signature name.
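The argument format and match types described above can be sketched in a few lines of Python. This is a minimal illustration, not sourmash's own implementation: `parse_picklist_arg`, `ident`, `KEY_FUNCS`, `load_pickset`, and `matches` are hypothetical names, the signature name and md5 value are made up, and the sketch assumes picklist values are compared as-is against the key computed from each signature.

```python
import csv
import io


def parse_picklist_arg(argstr):
    """Split 'pickfile:colname:coltype' into its three parts."""
    parts = argstr.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected pickfile:colname:coltype, got {argstr!r}")
    return tuple(parts)


def ident(name):
    """Identifier: the first space-delimited word of the signature name."""
    return name.split(" ")[0]


# Map each coltype to a function that computes the comparison key from a
# hypothetical (name, md5) pair describing one signature.
KEY_FUNCS = {
    "name": lambda name, md5: name,
    "md5": lambda name, md5: md5,
    "md5prefix8": lambda name, md5: md5[:8],
    "md5short": lambda name, md5: md5[:8],
    "ident": lambda name, md5: ident(name),
    "identprefix": lambda name, md5: ident(name).split(".")[0],
}


def load_pickset(csv_file, colname):
    """Collect the values of one named column from an open CSV file."""
    return {row[colname] for row in csv.DictReader(csv_file)}


def matches(name, md5, coltype, pickset):
    """True if the signature's key for this coltype is in the picklist."""
    return KEY_FUNCS[coltype](name, md5) in pickset


# Made-up example data: a one-column CSV and one signature description.
pickfile, colname, coltype = parse_picklist_arg("list.csv:accession:identprefix")
pickset = load_pickset(io.StringIO("accession\nABC123\n"), colname)

sig_name = "ABC123.1 example genome"
sig_md5 = "0123456789abcdef0123456789abcdef"
print(matches(sig_name, sig_md5, coltype, pickset))             # True
print(matches(sig_name, sig_md5, "md5prefix8", {"01234567"}))   # True
print(matches(sig_name, sig_md5, "name", {"ABC123.1"}))         # False
```

The last call returns `False` because `name` requires an exact match to the full signature name, while `ident` and `identprefix` only look at its first word.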

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` to construct an initial CSV file that you can
then edit further.
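Editing such a CSV can also be scripted. The sketch below copies only the rows whose value in one column passes a test; `filter_rows` is a hypothetical helper, and the `md5` and `name` column headers are assumptions for illustration, so check the header line of your own file before relying on them.

```python
import csv
import io


def filter_rows(src, dst, column, keep):
    """Copy rows from CSV `src` to `dst` when keep(row[column]) is true.

    Returns the number of rows written (not counting the header).
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    n = 0
    for row in reader:
        if keep(row[column]):
            writer.writerow(row)
            n += 1
    return n


# Made-up input standing in for a CSV of signature descriptions.
src = io.StringIO(
    "md5,name\n"
    "aaaa,ABC123.1 genome one\n"
    "bbbb,XYZ789.1 genome two\n"
)
dst = io.StringIO()
kept = filter_rows(src, dst, "name", lambda v: v.startswith("ABC"))
print(kept)            # 1
print(dst.getvalue())  # header line plus the single ABC123.1 row
```

With real files, `src` and `dst` would be opened with `open(path, newline="")`, as the `csv` module documentation recommends.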

In addition to `sig extract`, the following commands support
`--picklist` selection: `index`, `search`, `gather`, `prefetch`,
`compare`, and `lca index`.

### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
4 changes: 3 additions & 1 deletion src/sourmash/cli/compare.py
@@ -1,6 +1,7 @@
"""compare sequence signatures made by compute"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -47,6 +48,7 @@ def subparser(subparsers):
    subparser.add_argument(
        '-p', '--processes', metavar='N', type=int, default=None,
        help='Number of processes to use to calculate similarity')
    add_picklist_args(subparser)


def main(args):
9 changes: 6 additions & 3 deletions src/sourmash/cli/gather.py
@@ -1,6 +1,7 @@
"""search a metagenome signature against dbs"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -60,8 +61,6 @@ def subparser(subparsers):
        '--cache-size', default=0, type=int, metavar='N',
        help='number of internal SBT nodes to cache in memory (default: 0, cache all nodes)'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)

    # advanced parameters
    subparser.add_argument(
@@ -80,6 +79,10 @@ def subparser(subparsers):
        help="use prefetch before gather; see documentation",
    )

    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
    import sourmash
6 changes: 4 additions & 2 deletions src/sourmash/cli/index.py
@@ -25,7 +25,8 @@
---
"""

from sourmash.cli.utils import add_moltype_args, add_ksize_arg
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -44,7 +45,6 @@ def subparser(subparsers):
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    add_ksize_arg(subparser, 31)
    subparser.add_argument(
        '-d', '--n_children', metavar='D', type=int, default=2,
        help='number of children for internal nodes; default=2'
@@ -70,7 +70,9 @@ def subparser(subparsers):
        '--scaled', metavar='FLOAT', type=float, default=0,
        help='downsample signatures to the specified scaled factor'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
9 changes: 6 additions & 3 deletions src/sourmash/cli/lca/index.py
@@ -1,6 +1,7 @@
"""create LCA database"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -18,8 +19,6 @@ def subparser(subparsers):
    subparser.add_argument(
        '--scaled', metavar='S', default=10000, type=float
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
@@ -53,6 +52,10 @@ def subparser(subparsers):
        help='ignore signatures with no taxonomy entry'
    )

    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
    import sourmash
4 changes: 3 additions & 1 deletion src/sourmash/cli/prefetch.py
@@ -1,6 +1,7 @@
"""search a signature against dbs, find all overlaps"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -63,6 +64,7 @@ def subparser(subparsers):
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
4 changes: 3 additions & 1 deletion src/sourmash/cli/search.py
@@ -1,6 +1,7 @@
"""search a signature against other signatures"""

from sourmash.cli.utils import add_ksize_arg, add_moltype_args
from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
                                add_picklist_args)


def subparser(subparsers):
@@ -59,6 +60,7 @@ def subparser(subparsers):
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
12 changes: 3 additions & 9 deletions src/sourmash/cli/sig/extract.py
@@ -2,7 +2,8 @@

import sys

from sourmash.cli.utils import add_moltype_args, add_ksize_arg
from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
                                add_picklist_args)


def subparser(subparsers):
@@ -25,16 +26,9 @@ def subparser(subparsers):
        '--name', default=None,
        help='select signatures whose name contains this substring'
    )
    subparser.add_argument(
        '--picklist', default=None,
        help="select signatures based on a picklist, i.e. 'file.csv:colname:coltype'"
    )
    subparser.add_argument(
        '--picklist-require-all', default=False, action='store_true',
        help="require that all picklist values be found or else fail"
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
10 changes: 10 additions & 0 deletions src/sourmash/cli/utils.py
@@ -50,6 +50,16 @@ def add_ksize_arg(parser, default=31):
        help='k-mer size; default={d}'.format(d=default)
    )

def add_picklist_args(parser):
    parser.add_argument(
        '--picklist', default=None,
        help="select signatures based on a picklist, i.e. 'file.csv:colname:coltype'"
    )
    parser.add_argument(
        '--picklist-require-all', default=False, action='store_true',
        help="require that all picklist values be found or else fail"
    )
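
To see what this shared helper gives each subcommand, the sketch below attaches it to a stand-alone `argparse` parser and parses a sample command line. The function body is copied from the diff; the parser, its `demo` program name, and the sample argument values are assumptions for illustration, not part of sourmash.

```python
import argparse


def add_picklist_args(parser):
    # Same two options as the helper added to src/sourmash/cli/utils.py.
    parser.add_argument(
        '--picklist', default=None,
        help="select signatures based on a picklist, i.e. 'file.csv:colname:coltype'"
    )
    parser.add_argument(
        '--picklist-require-all', default=False, action='store_true',
        help="require that all picklist values be found or else fail"
    )


# A stand-in parser; "demo" is a hypothetical program name.
parser = argparse.ArgumentParser(prog="demo")
add_picklist_args(parser)

args = parser.parse_args(
    ["--picklist", "list.csv:md5sum:md5", "--picklist-require-all"]
)
print(args.picklist)              # list.csv:md5sum:md5
print(args.picklist_require_all)  # True

# With no flags given, both options fall back to their defaults.
defaults = parser.parse_args([])
print(defaults.picklist, defaults.picklist_require_all)  # None False
```

Keeping both options in one helper means every subcommand that calls it exposes the same flag names, defaults, and help text.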


def opfilter(path):
    return not path.startswith('__') and path not in ['utils']