Checking how full a CQF is #4

betatim · 2017-06-19T13:29:24Z

Hi, we are adding CQF support to khmer and one thing we've run into is that after inserting some number of unique kmers we hit this assertion: https://github.com/splatlab/cqf/blob/master/gqf.c#L487

For example, I create a QF with qf_init(cf, (1ULL << size), size+8, 0) and size=7, then attempt to insert 400 random 20-mers. The assert gets triggered when I try to insert kmer 242.

I think this is because the QF is "full". Is that right? I'm asking because coming from a world of bloom filters (BF) you can keep adding and adding to your BF and while eventually all bits will be set to 1, you can always add more (even if it makes little sense).

For khmer this is nice. We can guarantee to the user that the program will not run out of memory (all allocated up front) and we can warn them if we think that they made their BF too small. But you can leave your program running for hours and get "some kind of result" instead of "ah we used up all the memory and crashed please start from scratch".

So finally my question: is there a way for me to check how full the CQF is and if adding one more kmer will fail or not? Or is there a way to use the CQF so that it will let me keep adding kmers until I go blue in the face regardless of how full it is/overflow the counter?

(I think you can reproduce this with squeakr: ./main 0 10 1 test.fastq though for me that segfaults and doesn't print the assert it hit so I am not 100% sure.)

prashantpandey · 2017-06-22T00:48:23Z

Hi Tim,

You are right, this should be because the CQF is full. Right now, there is no explicit check in the code to abort insertion once the CQF is a certain percentage full.
However, the CQF keeps metadata that is updated every time you insert or delete an element. You can use this metadata to find out how full the CQF is at any instant.
For example,

typedef struct quotient_filter {
	uint64_t nslots;  // total number of slots in the CQF
	uint64_t xnslots;  // total number of slots+some extra slots to handle the overflow in the last block
	uint64_t key_bits; 
	uint64_t value_bits;
	uint64_t key_remainder_bits;
	uint64_t bits_per_slot;
	__uint128_t range;
	uint64_t nblocks;
	uint64_t nelts;  // total number of elements inserted 
	uint64_t ndistinct_elts;  // total number of distinct elements inserted
	uint64_t noccupied_slots;  // total number of occupied slots
	qfblock *blocks;
} quotient_filter;

So, at any instant the load factor of the CQF is (noccupied_slots/nslots). You can stop after the CQF is 95% full.
With a Bloom filter, you can continue inserting elements and it will never complain. However, since the false-positive rate depends on the number of inserted elements, the false-positive rate goes up as you insert more elements.
The CQF is a hash-table-like data structure. So, once it is very full (e.g., 80% or 90%), you have to resize it. You can resize the CQF by creating a bigger CQF (twice the size of the original one), iterating over the hashes in the original one, and inserting them into the bigger one.
There is an iterator interface that you can use.
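
The load-factor check described above could be sketched roughly as follows. This is only an illustration: `cqf_counts` is a hypothetical stand-in that copies the two relevant fields (`nslots`, `noccupied_slots`) from the struct shown above, and `cqf_load_factor` / `cqf_needs_resize` are made-up helper names, not part of the CQF API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding the two quotient_filter fields used here. */
typedef struct {
	uint64_t nslots;           /* total number of slots in the CQF */
	uint64_t noccupied_slots;  /* total number of occupied slots */
} cqf_counts;

/* Load factor as described in the thread: noccupied_slots / nslots. */
double cqf_load_factor(const cqf_counts *c)
{
	assert(c->nslots > 0);
	return (double)c->noccupied_slots / (double)c->nslots;
}

/* True once the load factor has reached max_load (e.g. 0.95). */
bool cqf_needs_resize(const cqf_counts *c, double max_load)
{
	return cqf_load_factor(c) >= max_load;
}
```

A caller would copy the two counters out of the filter before each insert (or every few thousand inserts) and stop, warn, or resize once `cqf_needs_resize(&c, 0.95)` returns true.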

Thanks
Prashant

rtjohnso · 2017-06-22T01:00:02Z

Here are some ideas for dealing with this problem:

First, use a sampling-based method to estimate the number D of distinct k-mers in the dataset. For example, we used Mohamadi's "ntCard: A streaming algorithm for cardinality estimation in genomics data".

Once you know D, you can size your hash function output to achieve the desired false positive rate, i.e. set the number p of output bits of your hash function so that p >= log(D / epsilon), where epsilon is the fp rate you want (all logs here are base 2). The fp rate is entirely controlled by D and p, so you can now choose the size of your CQF, or resize your CQF, handle duplicates, etc., without having to worry about exceeding your fp rate.

At this point I recommend that you select some initial estimate of the number of slots you need, and then resize the CQF if that guess turns out to be too low. For example, you could guess that you'll only need D slots (this is what Squeakr does, but more on this later). You then set q = ceiling(log D) and r = p - q.

When adding items, you can detect that the CQF is getting full by checking whether

qf->metadata.noccupied_slots >= 0.95 * qf->metadata.nslots

If this condition holds, then you can create a new empty CQF twice as big as your old one, i.e. with q' = q + 1 and r' = r - 1. Then iterate over the hashes in the old CQF and insert them into the new CQF. Then discard the old CQF and proceed adding more data to the new CQF. Repeat resizing as many times as necessary.

The annoyance with resizing is that it wastes time and requires allocating memory in the middle of counting. Therefore it is worth spending a little time up front estimating the number of slots you need. Here are some guidelines:

- You always need at least D slots.
- You never need more than N slots, where N is the number of k-mer instances in your input.
- Singletons take a single slot, doubletons take 2 slots.
- For tripletons up, it's probably ok to just treat them as if they take 3 slots. (Some could take more, but those will have high counts, and hence by definition there can't be too many of those.)

The upshot is that, in order to avoid resizing, you can just estimate the number of slots needed as something like 2 * D. Then round up to a power of 2. The justification is that the errors introduced by the read process will guarantee that a large fraction (e.g. 1/2) of the k-mers will be singletons, which will all take a single slot. The remaining k-mers will take, typically, 3 slots. So 1/2 * 1 + 1/2 * 3 = 2 slots per k-mer.

I hope this gives some insight into how to think about sizing the CQF.

Best,
Rob
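
To make the sizing arithmetic above concrete, here is a small sketch that derives p, q and r from an estimate of D, using the "about 2 * D slots, rounded up to a power of 2" rule of thumb. Everything in it is hypothetical helper code rather than CQF API: the names `cqf_params`, `cqf_size_for` and `cqf_doubled` are invented for illustration, and the fp rate is passed as the integer `inv_epsilon` = 1/epsilon (an assumption made purely to stay in integer arithmetic).

```c
#include <assert.h>
#include <stdint.h>

/* Smallest b with 2^b >= x, i.e. ceiling(log2(x)), for x >= 1. */
unsigned ceil_log2_u64(uint64_t x)
{
	unsigned b = 0;
	assert(x >= 1);
	while (b < 64 && ((uint64_t)1 << b) < x)
		b++;
	return b;
}

typedef struct {
	unsigned p;  /* hash bits */
	unsigned q;  /* quotient bits: the CQF has 2^q slots */
	unsigned r;  /* remainder bits, r = p - q */
} cqf_params;

/* d: estimated number of distinct k-mers; inv_epsilon: 1 / (target fp rate). */
cqf_params cqf_size_for(uint64_t d, uint64_t inv_epsilon)
{
	cqf_params s;
	assert(d >= 1 && inv_epsilon >= 2);
	assert(d <= UINT64_MAX / inv_epsilon);  /* keep d * inv_epsilon in range */
	s.p = ceil_log2_u64(d * inv_epsilon);   /* p >= log2(D / epsilon) */
	s.q = ceil_log2_u64(2 * d);             /* ~2 * D slots, rounded up to a power of 2 */
	s.r = s.p - s.q;
	return s;
}

/* Parameters of the twice-as-big CQF used when resizing: q' = q + 1, r' = r - 1. */
cqf_params cqf_doubled(cqf_params s)
{
	assert(s.r >= 1);
	s.q += 1;
	s.r -= 1;
	return s;
}
```

For example, with D = 1,000,000 and a target fp rate of 1/256 this gives p = 28, q = 21 and r = 7. Note that p stays fixed across resizes, which is why doubling the filter does not change the fp rate.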

ctb · 2017-06-22T13:13:55Z

Hi @rtjohnso @prashantpandey thank you for your excellent replies!

We've been through a bunch of stuff with memory size choices in khmer over the years - see dib-lab/khmer#1117 for what we finally settled on.

@betatim is correct that one thing we really value in general is the ability to guarantee no out of memory errors. Over the years this has become more nuanced because in some (most) real data processing cases, we don't run the risk of running out of memory - but we still encounter some such cases regularly.

We do have a very fast (HLL-based) approach to closely estimating the number of k-mers and feeding them into the scripts (using -U).

Also, interestingly, there are several situations in which sampling estimation does not work - RNAseq and metagenome analyses are two of them. Because of the variable coverage situation in both (equiv. unknown abundance spectrum), you cannot easily guesstimate the total number of k-mers in a data set.

Another interesting use case where sampling estimation doesn't work well is in streaming single-pass downsampling/error correction of data (normalize-by-median.py/digital normalization and trim-low-abund.py/streaming error trimming) where we are removing errors as we walk across the data. There is no simple way to estimate what is left in the remaining data (although for genomes you can start to place probabilistic upper bounds on what you're likely to see).

This will certainly not prevent us from adopting the CQF which is far more efficient (it seems :) than our Bloom filter approach, but I just wanted to share! With khmer we can implement both and let the user select, although then we run into UX challenges :)

betatim · 2017-06-22T13:29:41Z

> So, at any instant the load factor of the CQF is (noccupied_slots/nslots). You can stop after the CQF is 95% full.

That makes sense. Just tried it out and am confused (again) :-)

  QF cf;
  uint64_t qbits = 3;
  uint64_t nhashbits = qbits + 8;
  uint64_t nslots = (1ULL << qbits);

  /* Initialise the CQF */
  qf_init(&cf, nslots, nhashbits, 0);

  for (int i=0; i<16; ++i) {
    qf_insert(&cf, (i%8)%cf.range, 0, 1);
    printf("%i slots:%llu nocc:%llu nunique:%llu\n",
           i%8, cf.nslots, cf.noccupied_slots, cf.ndistinct_elts);
  }

I was expecting this to end with 8 occupied slots and 8 distinct elements. However it counts up past nslots. Repeatedly inserting the same element into the filter at first increases the number of occupied slots which also puzzles me. I was thinking each slot is used to track the count for one unique item?

@rtjohnso thanks for the detailed answer! I will think about it while trying to sort out these technical things.

betatim · 2017-06-22T13:44:05Z

This is the output of running the above snippet:

0 slots:8 nocc:1 nunique:1
1 slots:8 nocc:2 nunique:2
2 slots:8 nocc:3 nunique:3
3 slots:8 nocc:4 nunique:4
4 slots:8 nocc:5 nunique:5
5 slots:8 nocc:6 nunique:6
6 slots:8 nocc:7 nunique:7
7 slots:8 nocc:8 nunique:8
0 slots:8 nocc:9 nunique:8
1 slots:8 nocc:10 nunique:8
2 slots:8 nocc:11 nunique:8
3 slots:8 nocc:12 nunique:8
4 slots:8 nocc:13 nunique:8
5 slots:8 nocc:14 nunique:8
6 slots:8 nocc:15 nunique:8
7 slots:8 nocc:16 nunique:8

rtjohnso · 2017-06-22T17:50:11Z

Hi Tim,

That actually looks right. We do encode high counters in the slots but, for some technical reasons, we wanted to never use more than t slots to store t copies of an item, even for very small values of t. Therefore the encoding scheme is a little complicated and, as a result, for small counts (e.g. 1, 2, or 3) the number of slots used to represent that item is the same as the number of occurrences of that item. But after 3, the number of slots used to represent C copies of an item is basically 2 + (log C) / r.

Best,
Rob
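
As a rough illustration of the slot usage described above, the following sketch estimates how many slots one item with count C occupies. It is not the CQF's actual counter encoder: `cqf_slots_for_count` is a hypothetical name, and rounding (log C) / r up to whole slots is an assumption made here to obtain an integer estimate.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate number of slots used by an item that occurs `count` times,
 * given r remainder bits per slot. Counts 1..3 use one slot per occurrence;
 * larger counts use roughly 2 + (log2 count) / r slots. */
uint64_t cqf_slots_for_count(uint64_t count, unsigned r)
{
	unsigned bits = 0;
	uint64_t c;
	assert(count >= 1 && r >= 1);
	if (count <= 3)
		return count;
	for (c = count; c > 0; c >>= 1)
		bits++;                        /* bits needed to write count in binary */
	return 2 + (bits + r - 1) / r;     /* round the counter up to whole slots */
}
```

With r = 8, a count of 1000 needs about 4 slots under this estimate, and a count of 1,000,000 about 5, which is why treating tripletons and above as "3 slots" is a reasonable rule of thumb when sizing.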

prashantpandey · 2017-06-22T18:05:36Z

And the reason why you are able to insert more than 8 elements in the CQF even when the number of slots is only 8 is because of the extra slots we create to handle the overflow.

qf->xnslots = nslots + 10*sqrt((double)nslots);
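
To see what that formula yields for small filters, here is a sketch that reproduces it in integer arithmetic. The helper names `isqrt_u64` and `cqf_total_slots` are hypothetical, and replacing 10*sqrt(nslots) with the integer square root of 100*nslots is an assumption that should give the same truncated result for the sizes discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root: the largest s with s * s <= x. */
uint64_t isqrt_u64(uint64_t x)
{
	uint64_t lo = 0, hi = UINT32_MAX;  /* sqrt of a 64-bit value fits in 32 bits */
	while (lo < hi) {
		uint64_t mid = lo + (hi - lo + 1) / 2;
		if (mid * mid <= x)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}

/* Integer version of: xnslots = nslots + 10*sqrt((double)nslots), truncated. */
uint64_t cqf_total_slots(uint64_t nslots)
{
	assert(nslots <= UINT64_MAX / 100);
	return nslots + isqrt_u64(100 * nslots);
}
```

For nslots = 8 this gives 36 slots in total, so the 16 occupied slots in the output above still fit. For nslots = 128 (size=7) it gives 241, which would be consistent with the assertion firing on insert 242 in the original report.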

Thanks
Prashant

betatim · 2017-06-23T12:55:43Z

Thanks both of you! After reading your first long reply to the end I wasn't surprised by the behaviour anymore.

Just to check: singletons, doubletons and tripletons -> item that appears once, twice, thrice(?), etc right?

prashantpandey · 2017-06-23T13:20:54Z

Yes.

betatim · 2017-06-23T13:40:32Z

Thanks a lot for all the answering of naive questions 😃

ctb · 2017-06-23T13:41:05Z

yes, thank you!!

prashantpandey · 2017-06-23T23:15:22Z

Here is the link to the SIGMOD17 paper on the CQF (http://dl.acm.org/citation.cfm?id=3035963) if you want to understand the data structure in more detail.

Thanks
Prashant

betatim mentioned this issue Jun 23, 2017

[MRG] CQF based storage dib-lab/khmer#1675

Merged

8 tasks

prashantpandey closed this as completed Jun 23, 2017

betatim mentioned this issue Aug 1, 2017

Bulk loading fails with QFCounttable dib-lab/khmer#1751

Closed
