summary: selectors are good, let's maybe have more of them. #1524

ctb · 2021-05-15T13:06:54Z

This is an update of, and replacement for, #1072, which introduced the idea of a database.select(...) function.

This issue is being updated after the release of sourmash 4.1.

In #1406 and #1392, we significantly expanded selector functionality.

The key changes were actually in #1420, a PR into #1406. This introduced the following method on Index classes:

def select(self, ksize=None, moltype=None, scaled=None, num=None,
           abund=None, containment=None)

with this docstring:

Return Index containing only signatures that match requirements.
Current arguments can be any or all of:
* ksize
* moltype
* scaled
* num
* containment
'select' will raise ValueError if the requirements are incompatible
with the Index subclass.
'select' may return an empty object or None if no matches can be
found.

This was added into LinearIndex, LazyLinearIndex, ZipFileLinearIndex, and MultiIndex, as well as the SBT and LCA Database classes.
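To make the contract in the docstring concrete, here is a minimal sketch of a select-style filter on a toy in-memory index. `ToySig` and `ToyIndex` are hypothetical stand-ins, not sourmash classes, and only the ksize and moltype arguments are modeled; the real method also takes scaled, num, abund and containment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToySig:
    """Hypothetical stand-in for a signature; not a sourmash class."""
    name: str
    ksize: int
    moltype: str


class ToyIndex:
    """Hypothetical in-memory index that mimics the select() contract."""

    def __init__(self, sigs: List[ToySig]):
        self.sigs = list(sigs)

    def select(self, ksize: Optional[int] = None,
               moltype: Optional[str] = None) -> "ToyIndex":
        # Keep only signatures matching every requirement that was given;
        # a requirement left as None places no constraint.
        kept = [
            s for s in self.sigs
            if (ksize is None or s.ksize == ksize)
            and (moltype is None or s.moltype == moltype)
        ]
        # Return a new index; the original is left untouched.
        return ToyIndex(kept)

    def __len__(self) -> int:
        return len(self.sigs)


db = ToyIndex([
    ToySig("a", 21, "DNA"),
    ToySig("b", 31, "DNA"),
    ToySig("c", 31, "protein"),
])
dna31 = db.select(ksize=31, moltype="DNA")
```

In this toy version an unmatched request simply yields an empty index, which corresponds to the "may return an empty object" case in the docstring.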

The ultimate idea is to cleanly support databases and collections with richer signature types, see #198.

A few design decisions were made as part of this - the most consequential one is that select only filters for compatible signatures; it doesn't downsample or otherwise modify them. See #1072 (comment) for links.

Items from #1072 not tackled in #1420:

* md5sum, name/accession, and taxonomic ID selectors
* abundance selection and/or flattening
* method chaining: db = db.select(ksize=31).select(moltype='DNA')
* selection of compatible signatures via passed-in signature or Index object, e.g. db = db.select(other_db) or db = db.select(some_sig)

(not all of these may be good ideas, but leaving them in here for discussion ;).
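As a rough illustration of the method chaining item, the sketch below uses a hypothetical `ToyIndex` over plain dict records (not a sourmash class) whose `select` returns another index. Under that assumption, chaining falls out for free and is equivalent to a single combined call.

```python
class ToyIndex:
    """Hypothetical index over dict records; not a sourmash class."""

    def __init__(self, records):
        self.records = list(records)

    def select(self, **requirements):
        # Each keyword is an exact-match requirement on a record field.
        kept = [
            r for r in self.records
            if all(r.get(key) == value for key, value in requirements.items())
        ]
        return ToyIndex(kept)


db = ToyIndex([
    {"name": "a", "ksize": 21, "moltype": "DNA"},
    {"name": "b", "ksize": 31, "moltype": "DNA"},
    {"name": "c", "ksize": 31, "moltype": "protein"},
])

# Because select() returns another index, calls can be chained ...
chained = db.select(ksize=31).select(moltype="DNA")
# ... and chaining gives the same result as one combined call.
combined = db.select(ksize=31, moltype="DNA")
```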

Some other TODO items:

* #1427 (write more comprehensive Index.select(...) tests) suggests we need more select tests
* write some Python API docs and/or doctests for selectors

Notes and additional thoughts:

* I suspect method chaining works fine now.
* abund/noabund should be easy.
* not sure if md5sum, name/accession, and tax ID selectors belong with selectors, or would perhaps better belong under manifests; see #1352 (support/provide/require manifests for collections of signatures?).
* selection of compatible signatures via a passed-in signature (e.g. db.select(sig) yields all compatible signatures) could be a better, cleaner way to fix #809 (sourmash gather doesn't automatically figure out ksize from database), and could replace #934 ([WIP] Refactor subject database and signature loading for search, gather, and multigather).
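One way the passed-in-signature idea could look is sketched below: the selector arguments are derived from a query signature instead of being typed by hand. Everything here (`ToySig`, `requirements_from`, the free-standing `select` function) is hypothetical and only models ksize and moltype; it is not how sourmash implements or plans to implement this.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ToySig:
    """Hypothetical stand-in for a signature; not a sourmash class."""
    name: str
    ksize: int
    moltype: str


def requirements_from(template: ToySig) -> dict:
    # Hypothetical helper: derive selector arguments from a query signature.
    return {"ksize": template.ksize, "moltype": template.moltype}


def select(sigs: Iterable[ToySig], ksize: int, moltype: str) -> List[ToySig]:
    # Plain filter: keep signatures whose parameters match the request.
    return [s for s in sigs if s.ksize == ksize and s.moltype == moltype]


database = [
    ToySig("a", 21, "DNA"),
    ToySig("b", 31, "DNA"),
    ToySig("c", 31, "protein"),
]
query = ToySig("query", 31, "DNA")

# The caller never states ksize or moltype; they come from the query.
compatible = select(database, **requirements_from(query))
```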


ctb · 2021-10-10T17:15:20Z

#1433 discusses the distinction between finding compatible sketches and/or filtering out incompatible ones; and making the sketches compatible. Maybe we should think about an apply_select method, that returns a collection of compatible sketches?

ctb · 2022-08-03T11:11:35Z

> #1433 discusses the distinction between finding compatible sketches and/or filtering out incompatible ones; and making the sketches compatible. Maybe we should think about an apply_select method, that returns a collection of compatible sketches?

We have a bunch of janky code that does this on the fly, but I'm pretty happy with the overall design philosophy. And downsampling doesn't seem to be that slow. So I'm inclined to not worry about doing downsampling on the fly, and instead I think we should stick with declarative approaches ("I want a sketch at this scaled/etc") and do clever caching or lazy evaluation underneath.
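A minimal sketch of the declarative idea, under loud assumptions: `ToySketch` and `at_scaled` are hypothetical names, the downsampling rule is a simplified version of scaled MinHash thresholding (keep hashes below 2**64 // scaled), and `functools.lru_cache` stands in for whatever caching or lazy evaluation would sit underneath. The caller only states the scaled value it wants.

```python
from functools import lru_cache
from typing import FrozenSet, NamedTuple

HASH_SPACE = 2 ** 64  # size of a 64-bit hash space


class ToySketch(NamedTuple):
    """Hypothetical scaled sketch: a set of hashes plus its scaled value."""
    hashes: FrozenSet[int]
    scaled: int


@lru_cache(maxsize=None)
def at_scaled(sketch: ToySketch, scaled: int) -> ToySketch:
    """Return `sketch` at the requested scaled, downsampling on demand."""
    if scaled < sketch.scaled:
        # Hashes dropped at a coarser scaled cannot be recovered.
        raise ValueError("cannot upsample a scaled sketch")
    if scaled == sketch.scaled:
        return sketch
    # Simplified rule: keep only hashes below the new threshold.
    max_hash = HASH_SPACE // scaled
    kept = frozenset(h for h in sketch.hashes if h < max_hash)
    return ToySketch(kept, scaled)


threshold_2000 = HASH_SPACE // 2000
original = ToySketch(
    frozenset({1, threshold_2000 - 1, threshold_2000 + 1}), 1000
)

# "I want this sketch at scaled=2000"; the work happens once, then is cached.
coarse = at_scaled(original, 2000)
again = at_scaled(original, 2000)
```

The second call is served from the cache, so repeated requests for the same sketch at the same scaled value cost a dictionary lookup in this toy version.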

ctb mentioned this issue May 15, 2021: start thinking about a standard selector framework for signature search/compatibility #1072 (Closed)

ctb changed the title ~~selectors are good, let's maybe have more of them.~~ summary: selectors are good, let's maybe have more of them. May 20, 2021

ctb mentioned this issue Sep 23, 2021: Add ability to adjust --num-hashes, --scaled on-the-fly in 'compare' and 'categorize' #560 (Closed)

ctb mentioned this issue Aug 3, 2022: revisit LinearIndex.select behavior for scaled (and num?) #1433 (Closed)

ctb mentioned this issue Dec 16, 2022: think about distributing zipfile databases with different scaled values #2408 (Open)
