-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
debiasing FracMinHash - plans and progress #1798
Comments
@luizirber I recall you mentioning trying to estimate cardinality from the sketches themselves. However, in PR #1270 I see a bunch of HLL references. IMHO, the most straight-forward implementation would be to calculate the cardinality to high precision when forming the sketches, but I see how that would break backward compatibility. Did you come up with a mathematically sound way to calculate cardinality from the sketches, or would you like @mahmudhera and I to take a swing at it? I.e. sketch_size/scale_factor is an intuitive estimate of true cardinality, but it's not immediately obvious if the variance is better/worse than HLL |
This would be really helpful for #1788 -- I've currently implemented using |
@bluegenes what would be helpful in this regard? Since this basically boils down to a binomial random variable, whatever you would like can be explicitly calculated. Eg. do you want confidence intervals around the number of unique kmers? Specification of scale factor given a guess about number of unique kmers? Bound on how well you can estimate the number of unique k-mers given a scale factor? etc. Just let me know and I can send you a python class implementing what it is you'd like |
Thanks @dkoslicki! I guess my ideal situation would be implementing HLL cardinality estimation going forward so we could use it for ANI estimation, at least for new sketches. To avoid backward incompatibility, we could fall back to sketch estimation when necessary. But there are design, compatibility concerns at play, so I'll leave that up to @ctb! I mostly wanted to drop my branch here as motivation for any decisions/progress there. If that's not an option, perhaps a bound on how much this k-mer estimation affects ANI estimation would be helpful? |
@mahmudhera @dkoslicki since we don't have HLL yet, I think we need to alert the user when the scaling factor is likely to be insufficient for the dataset. So in the example above, if the set has fewer than ~4.4x10^6 elements, we could raise a warning during sketching (and later prevent the user from using this sketch for ANI estimation, ref #2003). Would you be able to help me with a function for this? |
@bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (eg. knowing there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for? |
I've been estimating via |
I see: there is a bit of circular logic if I use |
tagging in @ctb! |
I have to dig into how to best add such information into the signature format. So it's definitely not a next-week kind of change, but it is on my list. After that, it's mostly a versioning challenge, and should be annoying and detail oriented, but not that hard ;). |
Gotcha! Then I'll throw together a function that has the desired behavior, and you/we can plop in the HLL estimate when it eventually gets implemented |
@bluegenes see PR #2031 for the function. Once you place it where it needs to go in the main text, I can add the de-bias term, or you can if you like. The idea will be to:
|
Oh yeah, and once @mahmudhera is available again, we will want to discuss where to put the probability of corner cases (eg. high ANI and high scale factor leading to sketches not being able to distinguish between inputs) |
@bluegenes Should we leave this open since the bias part hasn't been implemented yet? ala this comment |
I imagine that we would just need to change this line to have the bias term in it:
with the logic around that to make sure the estimate of |A| is ok. Do you want me to take a crack at that, or would you like to @bluegenes ? |
ah, right, so something like the following --
@ctb should we not return containment if size is inaccurate? Or return containment but also notify users with a warning? This is a pretty big change in terms of handling of very small sketches -- might affect the majority of our tests... |
no idea! warning for sure. |
@bluegenes yup, that looks correct! And I think a warning makes the most sense too: if |
#2032 adds functionality to warn users about small / potentially inaccurate sketches and prevent ANI estimation from these. However, adding the de-biasing term to our containment function would change our output for small sketches. For semantic versioning reasons it may be best to warn users for now, and make this change (/ HLL cardinality estimation) for the next major release. cc @dkoslicki @ctb |
In re this paper,
Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances
the question arose, is this being implemented in sourmash? see Twitter thread and @luizirber draft PR #1270.
quoth @dkoslicki -
and @luizirber -
also for grins, check out this paper:
The minimizer Jaccard estimator is biased and inconsistent
The text was updated successfully, but these errors were encountered: