summary: selectors are good, let's maybe have more of them. #1524

Open
2 tasks
ctb opened this issue May 15, 2021 · 2 comments

ctb commented May 15, 2021

This is an update of & replacement for #1072, which introduced the idea of a database.select(...) function.

This issue is being updated after the release of sourmash 4.1.


In #1406 and #1392, we significantly expanded selector functionality.

The key changes were actually in #1420, a PR into #1406. This introduced the following method on Index classes -

def select(self, ksize=None, moltype=None, scaled=None, num=None,
           abund=None, containment=None)

with this docstring:

Return Index containing only signatures that match requirements.
Current arguments can be any or all of:
* ksize
* moltype
* scaled
* num
* containment
'select' will raise ValueError if the requirements are incompatible
with the Index subclass.
'select' may return an empty object or None if no matches can be
found.

This was added into LinearIndex, LazyLinearIndex, ZipFileLinearIndex, and MultiIndex, as well as the SBT and LCA Database classes.

The ultimate idea is to cleanly support databases and collections with richer signature types, see #198.

A few design decisions were made as part of this - the most consequential one is that select just selects compatible signatures, and doesn't actually do any downsampling or anything. See #1072 (comment) for links.
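To make the "select filters, it does not downsample" decision concrete, here is a minimal sketch using stand-in classes. `SketchRecord`, `ToyIndex`, and `SingleKsizeIndex` are hypothetical names invented for illustration and are not sourmash classes; the compatibility rule for `scaled` (a sketch qualifies if it could later be downsampled to the requested value) is likewise an assumption of this sketch, not a statement about sourmash's behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SketchRecord:
    """Hypothetical stand-in for a signature; not a sourmash class."""
    name: str
    ksize: int
    moltype: str
    scaled: int  # 0 means "not a scaled sketch"


class ToyIndex:
    """Hypothetical in-memory index, used only to illustrate select()."""

    def __init__(self, records: List[SketchRecord]):
        self.records = list(records)

    def select(self, ksize: Optional[int] = None,
               moltype: Optional[str] = None,
               scaled: Optional[int] = None) -> "ToyIndex":
        def compatible(rec: SketchRecord) -> bool:
            if ksize is not None and rec.ksize != ksize:
                return False
            if moltype is not None and rec.moltype != moltype:
                return False
            # Assumption: compatible means "could be downsampled to the
            # requested scaled later", i.e. 0 < rec.scaled <= scaled.
            if scaled is not None and not (0 < rec.scaled <= scaled):
                return False
            return True

        # Records come back unchanged: selection filters, it never downsamples.
        return ToyIndex([r for r in self.records if compatible(r)])


class SingleKsizeIndex(ToyIndex):
    """Hypothetical subclass that can only ever hold one ksize."""

    def __init__(self, records: List[SketchRecord], ksize: int):
        super().__init__(records)
        self.ksize = ksize

    def select(self, ksize=None, moltype=None, scaled=None):
        # Requirements this index can never satisfy raise ValueError,
        # mirroring the docstring quoted above.
        if ksize is not None and ksize != self.ksize:
            raise ValueError(f"this index only contains ksize={self.ksize}")
        picked = super().select(ksize=ksize, moltype=moltype, scaled=scaled)
        return SingleKsizeIndex(picked.records, self.ksize)


db = ToyIndex([
    SketchRecord("a", 31, "DNA", 1000),
    SketchRecord("b", 31, "DNA", 10000),
    SketchRecord("c", 21, "protein", 200),
])
picked = db.select(ksize=31, scaled=2000)
```

Here `picked` holds only record `a`, and that record still carries its original `scaled` of 1000; any downsampling to 2000 is left to the caller.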

Items from #1072 not tackled in #1420:

  • md5sum, name/accession, and taxonomic ID selectors
  • abundance selection and/or flattening
  • method chaining: db = db.select(ksize=31).select(moltype='dna')
  • selection of compatible signatures via passed-in signature or Index object, e.g. db = db.select(other_db) or db = db.select(some_sig)

(not all of these may be good ideas, but leaving them in here for discussion ;).
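As a rough illustration of two of the ideas above (method chaining, and selecting by a passed-in signature), the sketch below uses an immutable toy index whose `select` always returns a new index. `Rec`, `ChainableIndex`, and the `template` parameter are hypothetical and exist only for this example; nothing here reflects how sourmash does or would implement these selectors.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rec:
    """Hypothetical signature metadata; not a sourmash class."""
    name: str
    ksize: int
    moltype: str


@dataclass(frozen=True)
class ChainableIndex:
    """Hypothetical immutable index whose select() returns a new index."""
    records: Tuple[Rec, ...]

    def select(self, template: Optional[Rec] = None, *,
               ksize: Optional[int] = None,
               moltype: Optional[str] = None) -> "ChainableIndex":
        # A passed-in record supplies any requirement not given explicitly.
        if template is not None:
            ksize = template.ksize if ksize is None else ksize
            moltype = template.moltype if moltype is None else moltype
        kept = tuple(
            r for r in self.records
            if (ksize is None or r.ksize == ksize)
            and (moltype is None or r.moltype == moltype)
        )
        return ChainableIndex(kept)


db = ChainableIndex((
    Rec("a", 31, "DNA"),
    Rec("b", 21, "DNA"),
    Rec("c", 31, "protein"),
))

chained = db.select(ksize=31).select(moltype="DNA")
combined = db.select(ksize=31, moltype="DNA")
by_sig = db.select(Rec("query", 31, "DNA"))
```

Because each call returns a fresh index and never mutates the original, chaining two selects gives the same result as one combined call, and selecting by a template record reduces to the same filter.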

Some other TODO items:


Notes and additional thoughts:

@ctb ctb changed the title selectors are good, let's maybe have more of them. summary: selectors are good, let's maybe have more of them. May 20, 2021

ctb commented Oct 10, 2021

#1433 discusses the distinction between finding compatible sketches and/or filtering out incompatible ones; and making the sketches compatible. Maybe we should think about an apply_select method, that returns a collection of compatible sketches?


ctb commented Aug 3, 2022

> #1433 discusses the distinction between finding compatible sketches and/or filtering out incompatible ones; and making the sketches compatible. Maybe we should think about an apply_select method, that returns a collection of compatible sketches?

We have a bunch of janky code that does this on the fly, but I'm pretty happy with the overall design philosophy. And downsampling doesn't seem to be that slow. So I'm inclined to not worry about doing downsampling on the fly, and instead I think we should stick with declarative approaches ("I want a sketch at this scaled/etc") and do clever caching or lazy evaluation underneath.
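One way to read "declarative requests with caching underneath" is sketched below: the caller asks for hashes at a given scaled value, and the object downsamples on first request and reuses the result afterwards. `LazySketch` and `at_scaled` are hypothetical names, and the retention threshold of roughly 2**64 / scaled is an assumption of this sketch rather than the exact sourmash formula.

```python
from typing import Dict, FrozenSet, Iterable

MAX_HASH = 2**64  # size of a 64-bit hash space (assumed for this sketch)


class LazySketch:
    """Hypothetical scaled sketch: downsample on request, cache the result."""

    def __init__(self, hashes: Iterable[int], scaled: int):
        self.scaled = scaled
        self.hashes = frozenset(hashes)
        # The native resolution is available without any work.
        self._cache: Dict[int, FrozenSet[int]] = {scaled: self.hashes}
        self.downsample_calls = 0  # counter, only to make caching visible

    def at_scaled(self, scaled: int) -> FrozenSet[int]:
        """Return the hashes this sketch would have at the requested scaled."""
        if scaled < self.scaled:
            raise ValueError("cannot produce a finer resolution than stored")
        if scaled not in self._cache:
            self.downsample_calls += 1
            # Assumption: keep hashes below roughly 2**64 / scaled.
            max_hash = MAX_HASH // scaled
            self._cache[scaled] = frozenset(
                h for h in self.hashes if h < max_hash
            )
        return self._cache[scaled]


sketch = LazySketch([5, 2**61, 2**62 + 7], scaled=2)
first = sketch.at_scaled(4)   # computed on first request
second = sketch.at_scaled(4)  # served from the cache
```

The caller only states the resolution it wants; whether the answer is computed, cached, or already stored is an internal detail of the object.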
