
debiasing FracMinHash - plans and progress #1798

Open
ctb opened this issue Jan 18, 2022 · 20 comments · Fixed by #2031

Comments

@ctb
Contributor

ctb commented Jan 18, 2022

In re this paper,

Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

the question arose, is this being implemented in sourmash? see Twitter thread and @luizirber draft PR #1270.

quoth @dkoslicki -

Planned. Should be really straightforward to implement, as sourmash already had some cardinality estimators built in, iirc

and @luizirber -

Plan was to avoid changing the sig format because it could be calculated from the MH data, instead of building a HLL during sketching and saving it too (or just the # of unique k-mers from the HLL)


also for grins, check out this paper:

The minimizer Jaccard estimator is biased and inconsistent

@dkoslicki
Collaborator

@luizirber I recall you mentioning trying to estimate cardinality from the sketches themselves. However, in PR #1270 I see a bunch of HLL references. IMHO, the most straightforward implementation would be to calculate the cardinality to high precision when forming the sketches, but I see how that would break backward compatibility.

Did you come up with a mathematically sound way to calculate cardinality from the sketches, or would you like @mahmudhera and I to take a swing at it? I.e. sketch_size/scale_factor is an intuitive estimate of true cardinality, but it's not immediately obvious if the variance is better/worse than HLL

@dkoslicki
Collaborator

Ok, turns out easier than I expected: the sketch_size/scale_factor obeys a nice two-sided Chernoff bound:

Pr( | |X|/s − |A| | ≥ δ|A| ) ≤ 2·exp(−s·|A|·δ²/3)

where |X|/s is the sketch_size/scale_factor estimator, δ is the relative error, and |A| is the true set cardinality.
So you don't really run into issues unless the underlying cardinality |A| is small.

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.

Interestingly, this approach will always be less space efficient than HLL. HLL experiences a relative error of something like 1.04/Sqrt(m) for a sketch of size m. For FracMinHash, you end up with a relative error of (at absolute minimum, often much worse) 2.07944/Sqrt(m) for a sketch of size m.

The interesting wrinkle is that in practice, we don't know |A| a priori. Besides the aforementioned backward compatibility issues, I would then lean towards HLL unless you want to pass over the data twice: once to get a good estimate of |A| and use it to choose an intelligent scale_factor for the desired relative error, then again to sketch.
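The arithmetic behind the ~4.4x10^6 figure can be sketched in a few lines. This assumes the bound takes the standard form 2·exp(−s·|A|·δ²/3), and the function name `min_cardinality` is hypothetical, not a sourmash API:

```python
import math

def min_cardinality(scaled, rel_err=0.05, confidence=0.95):
    # Smallest true set size |A| for which the two-sided Chernoff bound
    #   2*exp(-s*|A|*rel_err^2 / 3) <= 1 - confidence   (with s = 1/scaled)
    # holds, i.e. the estimator |X|/s stays within rel_err of |A|
    # with the requested confidence.
    s = 1.0 / scaled
    return 3 * math.log(2 / (1 - confidence)) / (s * rel_err**2)

# scaled=1000, 95% confidence, 5% relative error:
print(min_cardinality(1000))  # ~4.43e6, matching the ~4.4x10^6 figure above
```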

@bluegenes
Contributor

This would be really helpful for #1788 -- I've currently implemented using n_unique_kmers = scaled * len(minhash).

@dkoslicki
Collaborator

@bluegenes what would be helpful in this regard? Since this basically boils down to a binomial random variable, whatever you would like can be explicitly calculated. E.g. do you want confidence intervals around the number of unique k-mers? Specification of scale factor given a guess about the number of unique k-mers? A bound on how well you can estimate the number of unique k-mers given a scale factor? etc.

Just let me know and I can send you a python class implementing what it is you'd like
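For instance, a confidence interval around the number of unique k-mers could look roughly like this. This is a normal-approximation sketch only, not the exact binomial/Chernoff calculation, and `unique_kmer_ci` is a hypothetical name:

```python
from statistics import NormalDist

def unique_kmer_ci(num_hashes, scaled, confidence=0.95):
    # Treat the sketch size as a Binomial(|A|, s) draw with s = 1/scaled.
    # Point estimate for |A| is num_hashes * scaled; the interval width
    # comes from the normal approximation to the binomial.
    s = 1.0 / scaled
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    point = num_hashes * scaled
    half_width = z * (num_hashes * (1 - s)) ** 0.5 / s
    return point - half_width, point + half_width

lo, hi = unique_kmer_ci(5000, 1000)
# point estimate 5,000,000 unique k-mers, interval roughly +/- 139,000
```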

@bluegenes
Contributor

bluegenes commented Jan 25, 2022

Thanks @dkoslicki!

I guess my ideal situation would be implementing HLL cardinality estimation going forward so we could use it for ANI estimation, at least for new sketches. To avoid backward incompatibility, we could fall back to sketch estimation when necessary. But there are design and compatibility concerns at play, so I'll leave that up to @ctb! I mostly wanted to drop my branch here as motivation for any decisions/progress there.

If that's not an option, perhaps a bound on how much this k-mer estimation affects ANI estimation would be helpful?

@bluegenes
Contributor

bluegenes commented Apr 29, 2022

> For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.

@mahmudhera @dkoslicki since we don't have HLL yet, I think we need to alert the user when the scaling factor is likely to be insufficient for the dataset. So in the example above, if the set has fewer than ~4.4x10^6 elements, we could raise a warning during sketching (and later prevent the user from using this sketch for ANI estimation, ref #2003).

Would you be able to help me with a function for this?
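One possible shape for such a check, using the Chernoff threshold from earlier in the thread and the plug-in estimate num_hashes * scaled. The function name and message are illustrative, not an actual sourmash implementation:

```python
import math
import warnings

def warn_if_sketch_too_small(num_hashes, scaled, rel_err=0.05, confidence=0.95):
    # Threshold on |A| below which the Chernoff bound no longer guarantees
    # the requested relative error at the requested confidence.
    s = 1.0 / scaled
    threshold = 3 * math.log(2 / (1 - confidence)) / (s * rel_err**2)
    estimate = num_hashes * scaled  # plug-in estimate of |A|
    if estimate < threshold:
        warnings.warn(
            f"estimated {estimate:.0f} unique k-mers is below {threshold:.0f}; "
            f"cardinality (and downstream ANI) may exceed {rel_err:.0%} relative error"
        )
        return False
    return True
```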

@dkoslicki
Collaborator

@bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (eg. knowing there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for?

@bluegenes
Contributor

bluegenes commented May 3, 2022

> @bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (eg. knowing there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for?

I've been estimating via num_hashes * scaled for all equations -- will that work? If yes, then that function would be great!

@dkoslicki
Collaborator

I see: there is a bit of circular logic if I use num_hashes * scaled as the estimate of |A|, since that's the quantity we are trying to measure the variance of with the chernoff bounds. It should be ok in many cases, but I would want to make a note that this should be fixed after the HLL is implemented. Is HLL on the roadmap?

@bluegenes
Contributor

> I see: there is a bit of circular logic if I use num_hashes * scaled as the estimate of |A|, since that's the quantity we are trying to measure the variance of with the chernoff bounds. It should be ok in many cases, but I would want to make a note that this should be fixed after the HLL is implemented. Is HLL on the roadmap?

tagging in @ctb!

@ctb
Contributor Author

ctb commented May 3, 2022

I have to dig into how to best add such information into the signature format. So it's definitely not a next-week kind of change, but it is on my list.

After that, it's mostly a versioning challenge, and should be annoying and detail oriented, but not that hard ;).

@dkoslicki
Collaborator

Gotcha! Then I'll throw together a function that has the desired behavior, and you/we can plop in the HLL estimate when it eventually gets implemented

@dkoslicki
Collaborator

@bluegenes see PR #2031 for the function. Once you place it where it needs to go in the main text, I can add the de-bias term, or you can if you like. The idea will be to:

  1. Check if we can trust num_hashes * scale as an estimate of |A|
  2. Check if 1-(1-s)^|A| is sufficiently close to 1 and/or just divide by this value since it will likely be extremely close to 1 if |A| is large enough
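The two steps above might be sketched as follows, with `fracminhash_debias_factor` a hypothetical name and |A| estimated as num_hashes * scaled per step 1:

```python
def fracminhash_debias_factor(num_hashes, scaled):
    # Expected fraction of a set of true size |A| that lands in a
    # FracMinHash sketch: 1 - (1 - s)^|A| with s = 1/scaled.
    # Dividing a containment estimate by this factor debiases it.
    s = 1.0 / scaled
    n_est = num_hashes * scaled  # step 1: plug-in estimate of |A|
    return 1.0 - (1.0 - s) ** n_est

# Step 2: for large |A| the factor is essentially 1 --
# fracminhash_debias_factor(5000, 1000) is 1.0 to within floating point.
```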

@dkoslicki
Collaborator

Oh yeah, and once @mahmudhera is available again, we will want to discuss where to put the probability of corner cases (eg. high ANI and high scale factor leading to sketches not being able to distinguish between inputs)

@dkoslicki
Collaborator

@bluegenes Should we leave this open since the bias part hasn't been implemented yet? A la this comment

@dkoslicki
Collaborator

I imagine that we would just need to change this line to have the bias term in it:

return self.count_common(other, downsample) / (len(self) * (1 - (1 - 1/s)^|A|))

with the logic around that to make sure the estimate of |A| is ok. Do you want me to take a crack at that, or would you like to @bluegenes ?

@bluegenes bluegenes reopened this May 5, 2022
@bluegenes
Contributor

bluegenes commented May 5, 2022

ah, right, so something like the following --

if self.size_is_accurate():
    return self.count_common(other, downsample) / (len(self) * (1 - (1 - 1/self.scaled)**(len(self)*self.scaled)))

@ctb should we not return containment if size is inaccurate? Or return containment but also notify users with a warning?

This is a pretty big change in terms of handling of very small sketches -- might affect the majority of our tests...

@ctb
Contributor Author

ctb commented May 5, 2022

no idea! warning for sure.

@dkoslicki
Collaborator

@bluegenes yup, that looks correct! And I think a warning makes the most sense too: if self.size_is_accurate() is false, then it just means the estimate has high variance, so could be inaccurate, though not guaranteed to be inaccurate.

@bluegenes
Contributor

#2032 adds functionality to warn users about small / potentially inaccurate sketches and prevent ANI estimation from these.

However, adding the de-biasing term to our containment function would change our output for small sketches. For semantic versioning reasons it may be best to warn users for now, and make this change (/ HLL cardinality estimation) for the next major release. cc @dkoslicki @ctb
