debiasing FracMinHash - plans and progress #1798

ctb · 2022-01-18T14:45:11Z

In re this paper,

Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

the question arose, is this being implemented in sourmash? see Twitter thread and @luizirber draft PR #1270.

quoth @dkoslicki -

Planned. Should be really straightforward to implement, as sourmash already had some cardinality estimators built in, iirc

and @luizirber -

Plan was to avoid changing the sig format because it could be calculated from the MH data, instead of building a HLL during sketching and saving it too (or just the # of unique k-mers from the HLL)

also for grins, check out this paper:

The minimizer Jaccard estimator is biased and inconsistent

dkoslicki · 2022-01-18T17:25:56Z

@luizirber I recall you mentioning trying to estimate cardinality from the sketches themselves. However, in PR #1270 I see a bunch of HLL references. IMHO, the most straight-forward implementation would be to calculate the cardinality to high precision when forming the sketches, but I see how that would break backward compatibility.

Did you come up with a mathematically sound way to calculate cardinality from the sketches, or would you like @mahmudhera and I to take a swing at it? I.e. sketch_size/scale_factor is an intuitive estimate of true cardinality, but it's not immediately obvious if the variance is better/worse than HLL

dkoslicki · 2022-01-18T19:29:46Z

Ok, turns out easier than I expected: the sketch_size/scale_factor obeys a nice two-sided Chernoff bound:

where |X|/s is the sketch_size/scale_factor estimator, \delta is the relative error, and |A| the true set cardinality.
So you don't really run into issues unless the underlying cardinality |A| is small.

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.

Interestingly, this approach will always be less space efficient than HLL. HLL experiences a relative error of something like 1.04/Sqrt(m) for a sketch of size m. For FracMinHash, you end up with a relative error of (at absolute minimum, often much worse) 2.07944/Sqrt(m) for a sketch of size m.

The interesting wrinkle is that in practice, we don't know |A| a priori. Besides the aforementioned backward compatibility issues, I would then lean towards HLL unless you would want to pass over the data twice: once to get a good estimate of |A| and use this to inform an intelligent choice of scale_factor given a desired relative error

bluegenes · 2022-01-20T20:29:08Z

This would be really helpful for #1788 -- I've currently implemented using n_unique_kmers = scaled * len(minhash).

dkoslicki · 2022-01-20T22:10:24Z

@bluegenes what would be helpful in this regard? Since this basically boils down to a binomial random variable, whatever you would like can be explicitly calculated. Eg. do you want confidence intervals around the number of unique kmers? Specification of scale factor given a guess about number of unique kmers? Bound on how well you can estimate the number of unique k-mers given a scale factor? etc.

Just let me know and I can send you a python class implementing what it is you'd like

bluegenes · 2022-01-25T21:16:07Z

Thanks @dkoslicki!

I guess my ideal situation would be implementing HLL cardinality estimation going forward so we could use it for ANI estimation, at least for new sketches. To avoid backward incompatibility, we could fall back to sketch estimation when necessary. But there are design, compatibility concerns at play, so I'll leave that up to @ctb! I mostly wanted to drop my branch here as motivation for any decisions/progress there.

If that's not an option, perhaps a bound on how much this k-mer estimation affects ANI estimation would be helpful?

bluegenes · 2022-04-29T19:50:53Z

For example, using a scale factor of s=1/1000, if you want to be at least 95% sure that the FracMinHash cardinality is off by less than 5% relative error, you want to be sure that your set has more than ~4.4x10^6 elements.

@mahmudhera @dkoslicki since we don't have HLL yet, I think we need to alert the user when the scaling factor is likely to be insufficient for the dataset. So in the example above, if the set has fewer than ~4.4x10^6 elements, we could raise a warning during sketching (and later prevent the user from using this sketch for ANI estimation, ref #2003).

Would you be able to help me with a function for this?

dkoslicki · 2022-05-03T16:45:30Z

@bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (eg. knowing there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for?

bluegenes · 2022-05-03T16:49:24Z

@bluegenes Without HLL, is there some way to get an estimate of the number of k-mers/elements? That would be required (eg. knowing there are fewer than 4.4x10^6) for the above formula to work. If you do have that on hand, I can get you a function that implements this formula. Is that what you are looking for?

I've been estimating via num_hashes * scaled for all equations -- will that work? If yes, then that function would be great!

dkoslicki · 2022-05-03T18:16:38Z

I see: there is a bit of circular logic if I use num_hashes * scaled as the estimate of |A|, since that's the quantity we are trying to measure the variance of with the chernoff bounds. It should be ok in many cases, but I would want to make a note that this should be fixed after the HLL is implemented. Is HLL on the roadmap?

bluegenes · 2022-05-03T18:26:07Z

I see: there is a bit of circular logic if I use num_hashes * scaled as the estimate of |A|, since that's the quantity we are trying to measure the variance of with the chernoff bounds. It should be ok in many cases, but I would want to make a note that this should be fixed after the HLL is implemented. Is HLL on the roadmap?

tagging in @ctb!

ctb · 2022-05-03T18:29:30Z

I have to dig into how to best add such information into the signature format. So it's definitely not a next-week kind of change, but it is on my list.

After that, it's mostly a versioning challenge, and should be annoying and detail oriented, but not that hard ;).

dkoslicki · 2022-05-03T18:32:23Z

Gotcha! Then I'll throw together a function that has the desired behavior, and you/we can plop in the HLL estimate when it eventually gets implemented

dkoslicki · 2022-05-04T18:27:26Z

@bluegenes see PR #2031 for the function. Once you place it where it needs to go in the main text, I can add the de-bias term, or you can if you like. The idea will be to:

Check if we can trust num_hashes * scale as an estimate of |A|
Check if 1-(1-s)^|A| is sufficiently close to 1 and/or just divide by this value since it will likely be extremely close to 1 if |A| is large enough

dkoslicki · 2022-05-04T18:30:58Z

Oh yeah, and once @mahmudhera is available again, we will want to discuss where to put the probability of corner cases (eg. high ANI and high scale factor leading to sketches not being able to distinguish between inputs)

dkoslicki · 2022-05-05T18:23:06Z

@bluegenes Should we leave this open since the bias part hasn't been implemented yet? ala this comment

dkoslicki · 2022-05-05T18:37:55Z

I imagine that we would just need to change this line to have the bias term in it:

return self.count_common(other, downsample) / (len(self) * (1- (1-1/s)^|A|)

with the logic around that to make sure the estimate of |A| is ok. Do you want me to take a crack at that, or would you like to @bluegenes ?

bluegenes · 2022-05-05T18:57:43Z

ah, right, so something like the following --

if self.size_is_accurate(): 
    return self.count_common(other, downsample) / (len(self) * (1- (1-1/self.scaled)^(len(self)*self.scaled)))

@ctb should we not return containment if size is inaccurate? Or return containment but also notify users with a warning?

This is a pretty big change in terms of handling of very small sketches -- might affect the majority of our tests...

ctb · 2022-05-05T19:00:13Z

no idea! warning for sure.

dkoslicki · 2022-05-06T15:10:38Z

@bluegenes yup, that looks correct! And I think a warning makes the most sense too: if self.size_is_accurate() is false, then it just means the estimate has high variance, so could be inaccurate, though not guaranteed to be inaccurate.

bluegenes · 2022-05-12T22:44:52Z

#2032 adds functionality to warn users about small / potentially inaccurate sketches and prevent ANI estimation from these.

However, adding the de-biasing term to our containment function would change our output for small sketches. For semantic versioning reasons it may be best to warn users for now, and make this change (/ HLL cardinality estimation) for the next major release. cc @dkoslicki @ctb

ctb mentioned this issue Feb 9, 2022

lca summarize abundance with outfile #1833

Closed

This was referenced Apr 27, 2022

[MRG] output estimated ANI from sourmash compare, search, prefetch, and gather #1967

Merged

what to do about ANI estimate for two very small scaled sketches? #2003

Open

ctb mentioned this issue May 4, 2022

add input number of k-mers (before sketching) to signature format #2030

Open

dkoslicki mentioned this issue May 4, 2022

1798: implementation of function to test if cardinality estimate is accurate wrt scale size #2031

Merged

bluegenes closed this as completed in #2031 May 5, 2022

bluegenes reopened this May 5, 2022

bluegenes mentioned this issue May 16, 2022

add script that computes and prints the three different ways to make … mahmudhera/phylogenetic-tree-using-fracminhash#2

Merged

bluegenes mentioned this issue Jul 7, 2022

[MRG] less stringent size accuracy parameters #2074

Merged

bluegenes mentioned this issue Aug 29, 2022

[MRG] Use the bias factor for containment when estimating ANI #2057

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

debiasing FracMinHash - plans and progress #1798

debiasing FracMinHash - plans and progress #1798

ctb commented Jan 18, 2022

dkoslicki commented Jan 18, 2022

dkoslicki commented Jan 18, 2022

bluegenes commented Jan 20, 2022

dkoslicki commented Jan 20, 2022

bluegenes commented Jan 25, 2022 •

edited

Loading

bluegenes commented Apr 29, 2022 •

edited

Loading

dkoslicki commented May 3, 2022

bluegenes commented May 3, 2022 •

edited

Loading

dkoslicki commented May 3, 2022

bluegenes commented May 3, 2022

ctb commented May 3, 2022

dkoslicki commented May 3, 2022

dkoslicki commented May 4, 2022

dkoslicki commented May 4, 2022

dkoslicki commented May 5, 2022

dkoslicki commented May 5, 2022

bluegenes commented May 5, 2022 •

edited

Loading

ctb commented May 5, 2022

dkoslicki commented May 6, 2022

bluegenes commented May 12, 2022

debiasing FracMinHash - plans and progress #1798

debiasing FracMinHash - plans and progress #1798

Comments

ctb commented Jan 18, 2022

dkoslicki commented Jan 18, 2022

dkoslicki commented Jan 18, 2022

bluegenes commented Jan 20, 2022

dkoslicki commented Jan 20, 2022

bluegenes commented Jan 25, 2022 • edited Loading

bluegenes commented Apr 29, 2022 • edited Loading

dkoslicki commented May 3, 2022

bluegenes commented May 3, 2022 • edited Loading

dkoslicki commented May 3, 2022

bluegenes commented May 3, 2022

ctb commented May 3, 2022

dkoslicki commented May 3, 2022

dkoslicki commented May 4, 2022

dkoslicki commented May 4, 2022

dkoslicki commented May 5, 2022

dkoslicki commented May 5, 2022

bluegenes commented May 5, 2022 • edited Loading

ctb commented May 5, 2022

dkoslicki commented May 6, 2022

bluegenes commented May 12, 2022

bluegenes commented Jan 25, 2022 •

edited

Loading

bluegenes commented Apr 29, 2022 •

edited

Loading

bluegenes commented May 3, 2022 •

edited

Loading

bluegenes commented May 5, 2022 •

edited

Loading