Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checking how full a CQF is #4

Closed
betatim opened this issue Jun 19, 2017 · 12 comments
Closed

Checking how full a CQF is #4

betatim opened this issue Jun 19, 2017 · 12 comments

Comments

@betatim
Copy link

betatim commented Jun 19, 2017

Hi, we are adding CQF support to khmer and one thing we've run into is that after inserting some number of unique kmers we hit this assertion: https://github.com/splatlab/cqf/blob/master/gqf.c#L487

For example when you create a QF with qf_init(cf, (1ULL << size), size+8, 0) and size=7. I then attempt to insert 400 random 20-mers. The assert gets triggered when I try to insert kmer 242.

I think this is because the QF is "full". Is that right? I'm asking because coming from a world of bloom filters (BF) you can keep adding and adding to your BF and while eventually all bits will be set to 1, you can always add more (even if it makes little sense).

For khmer this is nice. We can guarantee to the user that the program will not run out of memory (all allocated up front) and we can warn them if we think that they made their BF too small. But you can leave your program running for hours and get "some kind of result" instead of "ah we used up all the memory and crashed please start from scratch".

So finally my question: is there a way for me to check how full the CQF is and if adding one more kmer will fail or not? Or is there a way to use the CQF so that it will let me keep adding kmers until I go blue in the face regardless of how full it is/overflow the counter?


(I think you can reproduce this with squeakr: ./main 0 10 1 test.fastq though for me that segfaults and doesn't print the assert it hit so I am not 100% sure.)

@prashantpandey
Copy link
Member

prashantpandey commented Jun 22, 2017

Hi Tim,

You are right, this should be because the CQF is full. Right now, there is no explicit check in the code to abort insertion after the CQF is certain percent full.
However, we have metadata in the CQF that gets updated every time you insert or delete an element. You can use the metadata to know how full the CQF is at any instant.
For example,


typedef struct quotient_filter {
	uint64_t nslots;  // total number of slots in the CQF
	uint64_t xnslots;  // total number of slots+some extra slots to handle the overflow in the last block
	uint64_t key_bits; 
	uint64_t value_bits;
	uint64_t key_remainder_bits;
	uint64_t bits_per_slot;
	__uint128_t range;
	uint64_t nblocks;
	uint64_t nelts;  // total number of elements inserted 
	uint64_t ndistinct_elts;  // total number of distinct elements inserted
	uint64_t noccupied_slots;  // total number of occupied slots
	qfblock *blocks;
} quotient_filter;

So, at any instant the load factor of the CQF is (noccupied_slots/nslots). You can stop after the CQF is 95% full.
In the Bloom filter, you can continue inserting elements and it will never crib. However, since the false-positive rate depends on the number of inserted elements, if you insert more elements then the false-positive rate will go down.
The CQF is a hash table like data structure. So, once you are very full (e.g., 80% or 90%) then you have to resize the CQF. You can resize the CQF by creating a bigger CQF (twice the size of the original one) and iterating over the hashes in the original one and inserting them in the bigger.
There is an iterator interface that you can use.

Thanks
Prashant

@rtjohnso
Copy link
Member

rtjohnso commented Jun 22, 2017 via email

@ctb
Copy link

ctb commented Jun 22, 2017

Hi @rtjohnso @prashantpandey thank you for your excellent replies!

We've been through a bunch of stuff with memory size choices in khmer over the years - see dib-lab/khmer#1117 for what we finally settled on.

@betatim is correct that one thing we really value in general is the ability to guarantee no out of memory errors. Over the years this has become more nuanced because in some (most) real data processing cases, we don't run the risk of running out of memory - but we still encounter some such cases regularly.

We do have a very fast (HLL-based) approach to closely estimating the number of k-mers and feeding them into the scripts (using -U).

Also, interestingly, there are several situations in which sampling estimation does not work - RNAseq and metagenome analyses are two of them. Because of the variable coverage situation in both (equiv. unknown abundance spectrum), you cannot easily guesstimate the total number of k-mers in a data set.

Another interesting use case where sampling estimation doesn't work well is in streaming single-pass downsampling/error correction of data (normalize-by-median.py/digital normalization and trim-low-abund.py/streaming error trimming) where we are removing errors as we walk across the data. There is no simple way to estimate what is left in the remaining data (although for genomes you can start to place probabilistic upper bounds on what you're likely to see).

This will certainly not prevent us from adopting the CQF which is far more efficient (it seems :) than our Bloom filter approach, but I just wanted to share! With khmer we can implement both and let the user select, although then we run into UX challenges :)

@betatim
Copy link
Author

betatim commented Jun 22, 2017

So, at any instant the load factor of the CQF is (noccupied_slots/nslots). You can stop after the CQF is 95% full.

That makes sense. Just tried it out and am confused (again) :-)

  QF cf;
  uint64_t qbits = 3;
  uint64_t nhashbits = qbits + 8;
  uint64_t nslots = (1ULL << qbits);

  /* Initialise the CQF */
  qf_init(&cf, nslots, nhashbits, 0);

  for (int i=0; i<16; ++i) {
    qf_insert(&cf, (i%8)%cf.range, 0, 1);
    printf("%i slots:%llu nocc:%llu nunique:%llu\n",
           i%8, cf.nslots, cf.noccupied_slots, cf.ndistinct_elts);
  }

I was expecting this to end with 8 occupied slots and 8 distinct elements. However it counts up past nslots. Repeatedly inserting the same element into the filter at first increases the number of occupied slots which also puzzles me. I was thinking each slot is used to track the count for one unique item?


@rtjohnso thanks for the detailed answer! I will think about it while trying to sort out these technical things.

@betatim
Copy link
Author

betatim commented Jun 22, 2017

This is the output of running the above snippet:

0 slots:8 nocc:1 nunique:1
1 slots:8 nocc:2 nunique:2
2 slots:8 nocc:3 nunique:3
3 slots:8 nocc:4 nunique:4
4 slots:8 nocc:5 nunique:5
5 slots:8 nocc:6 nunique:6
6 slots:8 nocc:7 nunique:7
7 slots:8 nocc:8 nunique:8
0 slots:8 nocc:9 nunique:8
1 slots:8 nocc:10 nunique:8
2 slots:8 nocc:11 nunique:8
3 slots:8 nocc:12 nunique:8
4 slots:8 nocc:13 nunique:8
5 slots:8 nocc:14 nunique:8
6 slots:8 nocc:15 nunique:8
7 slots:8 nocc:16 nunique:8

@rtjohnso
Copy link
Member

rtjohnso commented Jun 22, 2017 via email

@prashantpandey
Copy link
Member

And the reason why you are able to insert more than 8 elements in the CQF even when the number of slots is only 8 is because of the extra slots we create to handle the overflow.


qf->xnslots = nslots + 10*sqrt((double)nslots);

Thanks
Prashant

@betatim
Copy link
Author

betatim commented Jun 23, 2017

Thanks both of you! After reading your first long reply to the end I wasn't surprised by the behaviour anymore.

Just to check: singletons, doubletons and tripletons -> item that appears once, twice, thrice(?), etc right?

@prashantpandey
Copy link
Member

Yes.

@betatim
Copy link
Author

betatim commented Jun 23, 2017

Thanks a lot for all the answering of naive questions 😃

@ctb
Copy link

ctb commented Jun 23, 2017 via email

@prashantpandey
Copy link
Member

Here is the link to the SIGMOD17 paper on the CQF (http://dl.acm.org/citation.cfm?id=3035963) if you want to understand the data structure in more detail.

Thanks
Prashant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants