Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some thoughts on modulo hash & modimizer, and connections with sourmash 'scaled' signatures #1

Open
ctb opened this issue Feb 3, 2019 · 4 comments

Comments

@ctb
Copy link

ctb commented Feb 3, 2019

hi @richarddurbin, @luizirber pointed me at this repo and I wanted to drop you a note --

we have been using an analogous concept in sourmash, that is identical in concept to modulo hash but uses a slightly different technical approach. (modulo hash was coined by Broder in his 1997 paper on MinHash; see also the mash screen blog post from Adam Phillippy.)

briefly, in sourmash we choose a max_hash value below which all hash values are kept; this max_hash is the inverse of the density, which we refer to as the scaled value. (max_hash = 2**64/scaled) This gives us slightly more options for resolution than modulo hash, and also has the advantage of being interconvertible with mash-style MinHash approaches.

we have this set of notes on the advantage of modulo hash, as well as some API documentation that goes further into depth on interconvertibility.

our code is not so pretty but here is the max_hash implementation.

we have a draft f1000research software paper nearly written that I'd be happy to send you if you want to see more polished prose :)

happy to discuss more!

other than saying hi, the main purpose of this issue is to recommend that you investigate (or at least support) mash compatibility by using murmurhash hashing with a seed of 42 and reverse complement handling. see our example Python code here. murmurhash is less efficient than rolling hash but on the other hand it's nice to be able to interconvert between these efforts!

@richarddurbin
Copy link
Owner

richarddurbin commented Feb 4, 2019 via email

@ctb
Copy link
Author

ctb commented Feb 4, 2019

nice! looking forward to seeing where you go with this project ;)

@ctb
Copy link
Author

ctb commented Oct 5, 2020

hi Richard, @luizirber (now Dr. @luizirber :) wrote this up, together with many associated benchmarks, for his thesis - if you're interested in seeing what we've done with it, the link is here:

https://github.com/luizirber/phd/releases

We're also working on a paper. Please let us know how to best link to your work - I was thinking of just citing this repository, but happy to do something else!

@ctb
Copy link
Author

ctb commented Nov 28, 2021

you might be interested in this conversation - https://twitter.com/krsahlin/status/1463169988689285125 - where a few related references are mentioned, and we converge on the name FracMinHash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants