New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
some thoughts on modulo hash & modimizer, and connections with sourmash 'scaled' signatures #1
Comments
nice! looking forward to seeing where you go with this project ;) |
hi Richard, @luizirber (now Dr. @luizirber :) wrote this up, together with many associated benchmarks, for his thesis - if you're interested in seeing what we've done with it, the link is here: https://github.com/luizirber/phd/releases We're also working on a paper. Please let us know how to best link to your work - I was thinking of just citing this repository, but happy to do something else! |
you might be interested in this conversation - https://twitter.com/krsahlin/status/1463169988689285125 - where a few related references are mentioned, and we converge on the name FracMinHash. |
hi @richarddurbin, @luizirber pointed me at this repo and I wanted to drop you a note --
we have been using an analogous concept in sourmash, that is identical in concept to modulo hash but uses a slightly different technical approach. (modulo hash was coined by Broder in his 1997 paper on MinHash; see also the
mash screen
blog post from Adam Phillippy.)briefly, in sourmash we choose a max_hash value below which all hash values are kept; this max_hash is the inverse of the density, which we refer to as the scaled value. (
max_hash = 2**64/scaled
) This gives us slightly more options for resolution than modulo hash, and also has the advantage of being interconvertible with mash-style MinHash approaches.we have this set of notes on the advantage of modulo hash, as well as some API documentation that goes further into depth on interconvertibility.
our code is not so pretty but here is the max_hash implementation.
we have a draft f1000research software paper nearly written that I'd be happy to send you if you want to see more polished prose :)
happy to discuss more!
other than saying hi, the main purpose of this issue is to recommend that you investigate (or at least support) mash compatibility by using murmurhash hashing with a seed of 42 and reverse complement handling. see our example Python code here. murmurhash is less efficient than rolling hash but on the other hand it's nice to be able to interconvert between these efforts!
The text was updated successfully, but these errors were encountered: