Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better conceptual definition on minhash/signature/query #616

Open
luizirber opened this issue Jan 9, 2019 · 12 comments
Open

Better conceptual definition on minhash/signature/query #616

luizirber opened this issue Jan 9, 2019 · 12 comments
Labels
5.0 issues to address for a 5.0 release
Milestone

Comments

@luizirber
Copy link
Member

I think we should separate query, signature and minhash better around the codebase.
It's all pretty entangled, but they are different things!

  • MinHash: the low-level minhash object. Specific k, num/max_hash.
  • Signature: a collection of minhash for one dataset. Many k and num/max_hash combinations, but they are all derived from the same source.
  • query: what we pass for find/search/gather. gather assumes query is a Signature; LCA assumes query is a MinHash.

We can also lift up some functionality from MinHash into Signature: add_sequence can be a Signature method, and will call add_kmer (or equivalent) in all the MinHash defined for that signature. At the moment we do all this parsing during compute (for example), where for each sequence we need to iterate with the appropriate k size and so on.

Another note: Signature is a collection of MinHash at the moment, but would be pretty interesting to allow it to keep HLL/BF/CMS/HistoSketch representations of the data too.

Some examples

  • the signature file we save is a list of Signature, but many methods assume there is only one available (ignoring the other)
  • in feature/bf_query I'm attaching a Nodegraph with the Sig/MH content to the query to make search faster (but it is weird, because Signature shouldn't have a Nodegraph attached to it!)

(things to consider on #556)

@ctb
Copy link
Contributor

ctb commented Jan 9, 2019 via email

@taylorreiter
Copy link
Contributor

@luizirber and I recently had a conversation via slack about naming conventions for signatures. His explanation was helpful and I think it could be helpful to add it to the documentation. At the least, now it wont be lost to the ages in disappearing slack threads:

Taylor: "what’s the correct terminology for a signature that contains multiple minhashes in sourmash? like if I create a signature for each contig in a fasta, and put them all in the same signature…what is the baby sig called?"

Luiz:
sooooo... I like calling signature a collection of sketch, and a sketch can be a MinHash (for now), or a draff sketch, or an HLL, or a bloom filter...
that's how the Rust code is organized (part of the discussion above)

so, how I like to think about it (with examples):

  • Sketch: a MinHash(k=21, dna) for one genomic dataset
  • Signature : a collection of Sketches: [MinHash(k=31), MinHash(k=21)], but all related to the same original genomic dataset
  • Index: a collection of Signatures. Not all indices support all signatures at the same time, so they might be constrained to one parameter (like k=21) and Sketch type that all Signature must have

it's a bit silly to have so many hierarchies when we only have one implementation for Sketch, and Index all operate over MinHash es... but part of that is also being able to quickly add new sketches or indices, and still support the core operations (search and gather, or more generally similarity and containment). That's part of the reason for #556, for example

@luizirber luizirber added the 4.0 issues to address for a 4.0 release label Jun 8, 2020
@ctb
Copy link
Contributor

ctb commented Aug 9, 2020

We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature may contain multiple MinHash objects, but the code operating on signatures behaves as if each SourmashSignature has a single MinHash. Since the signature loading code enforces that, it all works ... as long as we serialize them first 😂. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.

@luizirber
Copy link
Member Author

luizirber commented Aug 10, 2020

We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature may contain multiple MinHash objects, but the code operating on signatures behaves as if each SourmashSignature has a single MinHash. Since the signature loading code enforces that, it all works ... as long as we serialize them first joy. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.

I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =]

On the bright side, I tried to keep the Rust codebase doing operations over Signatures instead of over Sketches. For now the first thing these operations do is extract the matching MinHash in each sig and do the regular MinHash operations, but I think it is quite doable to "hide" most of the operations (especially Containment and Similarity) inside the Signature impl in the same way.

This can be lifted to Python too, with the awesome benefit of avoiding a new copy of the MinHash in sig.minhash every single time it's called (because that's what happens nowadays 😨)

@ctb
Copy link
Contributor

ctb commented Aug 10, 2020

I like how you avoided the word "just" in that comment in favor of "quite doable."

@ctb
Copy link
Contributor

ctb commented Aug 20, 2020

as I dug in to SourmashSignature a bit more in #1179, I noticed that md5sum is calculated on the sig.minhash not on the SourmashSignature itself. So this is another place where we have a split between how we operate on signatures vs how we generate them.

@ctb ctb added 5.0 issues to address for a 5.0 release and removed 4.0 issues to address for a 4.0 release labels Jan 10, 2021
@ctb
Copy link
Contributor

ctb commented Jan 10, 2021

I think this should be punted to 5.0, or later.

@ctb
Copy link
Contributor

ctb commented Apr 10, 2021

We also have a big conceptual dichotomy, where the code computing signatures behaves as if each SourmashSignature may contain multiple MinHash objects, but the code operating on signatures behaves as if each SourmashSignature has a single MinHash. Since the signature loading code enforces that, it all works ... as long as we serialize them first joy. I think I introduced this silliness Back When and @luizirber faithfully copied it over into Rust, probably while shaking his head in dismay.

I'm well aware... But the idea of a Signature as a collection of Sketches came later anyway (during the refactoring), so don't blame yourself too much =]

:)

Over in #1392, I'm starting to make the clear distinction that search and gather return the original SourmashSignature inserted into the Index, without any modifications done during the search. This opens the door to doing searches on more complex SourmashSignature objects (e.g. multiple sketches, multiple types of sketches) where the search uses the "best" sketch type shared between the query signature and the contents of the database, but returns the entire signature. There are conceptual challenges to work out (e.g. what does threshold mean, in the case where you might not know exactly what kind of sketches are being compared!?) but I think this is a desirable kind of flexibility.

This also makes #198 even more interesting, because then you could build an SBT that lets you find matches based on the kind of sketch that was indexed, but returns a richer set of sketches (as part of the returned signature object).

@ctb
Copy link
Contributor

ctb commented May 7, 2022

in re #2039,

@luizirber reminded me:

Another one might be HLL, it already is implemented/exposed as a sketch in Rust. Need to figure out the Python side (more than one sketch per sig? A select method for choosing sketches, like select in indices chooses sigs?)

and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge?

@luizirber
Copy link
Member Author

and now I'm curious, what operations does the HLL (as implemented) currently support? definitely cardinality counting, and merge. does it support containment as a byproduct of merge?

since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment

Caveat: estimators don't work well with wildly different cardinalities, so genome containment in metagenome (or vice-versa) don't work well =(

@ctb
Copy link
Contributor

ctb commented May 9, 2022

since we are using the estimators from https://arxiv.org/abs/1706.07290, we can do similarity and containment

Caveat: estimators don't work well with wildly different cardinalities, so genome containment in metagenome (or vice-versa) don't work well =(

ok, thx - this makes me less enthusiastic about putting a lot of UX time into adding it, but I think it is an excellent "nth" method to add to expand internal support for more sketches, because a lot of sourmash power comes from the containment stuff.

@ctb
Copy link
Contributor

ctb commented Aug 3, 2022

ref #1616

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5.0 issues to address for a 5.0 release
Projects
None yet
Development

No branches or pull requests

3 participants