add docs on FracMinHash downsampling #1799

Open
ctb opened this issue Jan 18, 2022 · 2 comments
Labels
doc (documentation content or issues), faq (things to add to an FAQ or docs)

Comments

@ctb
Contributor

ctb commented Jan 18, 2022

@drtamermansour asked some questions on Slack about how FracMinHash signatures with different scaled values are handled in practice. I took a look in the docs and couldn't find anything that was clearly written, so we should add that somewhere.

(On the plus side, it's pretty well tested, I think?)

Off the top of my head,

  • for most purposes, when the query and a subject signature have different scaled values, they are downsampled to a common scaled, i.e. scaled is increased until the two match. This results in a loss of resolution in situations where the signature gets modified for further searching.
  • this loss of resolution can be ...problematic when searching multiple databases with gather, in particular; if there's a match to a low rez signature, then the query will be downsampled to match and will forevermore be low rez.
  • also, there are some databases that cannot be downsampled properly, SBTs in particular.
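To make the downsampling point above concrete, here is a minimal pure-Python sketch of the idea. It is not sourmash's implementation: the hash function (blake2b), the 64-bit hash space, and the helper names `hash_kmer`, `sketch`, `downsample`, and `jaccard_at_common_scaled` are all assumptions made for illustration. The only thing it is meant to show is that a FracMinHash sketch keeps hashes below a threshold set by scaled, so raising scaled is just a matter of filtering, while lowering it would need hashes that were never kept.

```python
import hashlib

MAX_HASH_SPACE = 2**64  # assumed 64-bit hash space, for illustration only


def hash_kmer(kmer: str) -> int:
    """Hash a k-mer to a 64-bit integer (blake2b is a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(kmers, scaled: int) -> set:
    """Keep only hashes below MAX_HASH_SPACE // scaled (roughly 1/scaled of them)."""
    threshold = MAX_HASH_SPACE // scaled
    return {h for h in map(hash_kmer, kmers) if h < threshold}


def downsample(hashes: set, old_scaled: int, new_scaled: int) -> set:
    """Raise scaled by dropping hashes above the new, lower threshold."""
    if new_scaled < old_scaled:
        # The hashes needed for a lower scaled were discarded at sketch time.
        raise ValueError(
            f"new scaled {new_scaled} is lower than current scaled {old_scaled}"
        )
    threshold = MAX_HASH_SPACE // new_scaled
    return {h for h in hashes if h < threshold}


def jaccard_at_common_scaled(a: set, a_scaled: int, b: set, b_scaled: int) -> float:
    """Downsample both sketches to the larger scaled, then compare."""
    common = max(a_scaled, b_scaled)
    a_ds = downsample(a, a_scaled, common)
    b_ds = downsample(b, b_scaled, common)
    union = a_ds | b_ds
    return len(a_ds & b_ds) / len(union) if union else 0.0
```

In this toy version, `downsample(sketch(kmers, 500), 500, 1000)` gives the same set as `sketch(kmers, 1000)`, which is why comparing at the larger scaled is always possible. The resolution loss described above comes from keeping the downsampled result around instead of the original sketch.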

This was all actually written up internally in the code base - see #407 and PR #1420 - but the details didn't make it into the docs. Oops!

ctb added the doc and faq labels Jan 18, 2022
@drtamermansour

In the current implementation, when the query and a subject signature have different scaled values, sourmash rescales the DB but not the sample.

I tried:

  • sample scale=500 & DB scale=1000 ==> runtime error (ValueError: new scaled 500 is lower than current sample scaled 1000)
  • sample scale=2000 & DB scale=1000 ==> works fine
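The two cases above are what you would expect if only the DB signature is rescaled, to the sample's scaled. A rough sketch of that policy follows. It is a toy model, not sourmash code: `downsample_scaled` and `match_subject_to_query` are hypothetical helpers, the 64-bit threshold is an assumption, and the hash values are made up so that they fall on either side of the scaled=2000 threshold.

```python
MAX_HASH_SPACE = 2**64  # assumed 64-bit hash space, for illustration only


def downsample_scaled(hashes: set, current_scaled: int, new_scaled: int) -> set:
    """Keep hashes below the new threshold; scaled can only go up."""
    if new_scaled < current_scaled:
        raise ValueError(
            f"new scaled {new_scaled} is lower than current sample scaled {current_scaled}"
        )
    threshold = MAX_HASH_SPACE // new_scaled
    return {h for h in hashes if h < threshold}


def match_subject_to_query(query_scaled: int, subject_hashes: set, subject_scaled: int) -> set:
    """Hypothetical policy: rescale only the subject (DB) to the query's scaled."""
    return downsample_scaled(subject_hashes, subject_scaled, query_scaled)


# Made-up hashes that would all be kept at scaled=1000.
db_hashes = {10, 2**53, 2**54}

# sample scaled=2000, DB scaled=1000: the DB is downsampled, and 2**54 is dropped.
coarser_db = match_subject_to_query(2000, db_hashes, 1000)

# sample scaled=500, DB scaled=1000: the DB would need hashes it never stored.
try:
    match_subject_to_query(500, db_hashes, 1000)
except ValueError as err:
    message = str(err)
```

Under this policy the sample is never touched, so a sample with a lower scaled than the DB fails instead of being downsampled itself.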

@ctb
Contributor Author

ctb commented Jan 20, 2022 via email
