Checking how full a CQF is #4
Hi Tim, you are right, this happens because the CQF is full. Right now there is no explicit check in the code that aborts insertion once the CQF is a certain percentage full.
At any instant the load factor of the CQF is (noccupied_slots/nslots), so you can stop inserting once the CQF is 95% full. Thanks |
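A minimal sketch of that check, assuming the caller can read the two counters from the filter's metadata. The helper names `cqf_load_factor` and `cqf_too_full` are hypothetical and not part of the CQF API; only the ratio noccupied_slots/nslots and the 95% threshold come from the comment above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Load factor as described above: occupied slots divided by total slots.
 * An empty (zero-slot) filter is treated as full so callers never insert. */
static double cqf_load_factor(uint64_t noccupied_slots, uint64_t nslots)
{
    return nslots == 0 ? 1.0 : (double)noccupied_slots / (double)nslots;
}

/* Hypothetical guard: true once the load factor reaches max_load (e.g. 0.95). */
static bool cqf_too_full(uint64_t noccupied_slots, uint64_t nslots, double max_load)
{
    return cqf_load_factor(noccupied_slots, nslots) >= max_load;
}
```

For a 128-slot filter this guard starts returning true at 122 occupied slots (122/128 is about 0.953), so a caller could check it before each insert and stop or warn instead of hitting the assertion.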
Here are some ideas for dealing with this problem:
First, use a sampling-based method to estimate the number D of distinct
k-mers in the dataset. For example, we used Mohamadi's
"ntcard: A streaming algorithm for cardinality estimation in genomics
data"
Once you know D, you can size your hash function output to achieve the
desired false positive rate, i.e. set the number p of output bits of your
hash function so that p >= log2(D / epsilon), where epsilon is the fp rate
you want. The fp rate is entirely controlled by D and p, so you can
now choose the size of your CQF, or resize your CQF, handle duplicates,
etc., without having to worry about exceeding your fp rate.
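As a rough illustration of this sizing rule, the sketch below finds the smallest p with 2^p * epsilon >= D. That is the same as p >= log2(D / epsilon) if the log is read as base 2 and applied to the whole ratio, which is an interpretation rather than something the thread spells out. `hash_bits_for` is a hypothetical helper name, not a CQF function.

```c
#include <stdint.h>

/* Smallest p such that 2^p * epsilon >= distinct, i.e. p >= log2(D / epsilon).
 * The loop is capped at 64 bits so a zero or negative epsilon cannot spin forever. */
static unsigned hash_bits_for(uint64_t distinct, double epsilon)
{
    unsigned p = 0;
    double range = 1.0;                 /* invariant: range == 2^p */
    while (p < 64 && range * epsilon < (double)distinct) {
        range *= 2.0;
        p++;
    }
    return p;
}
```

For D = 1,000,000 distinct k-mers and epsilon = 0.01 this gives p = 27, since 2^27 is the first power of two that reaches 10^8.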
At this point I recommend that you select some initial estimate of the
number of slots you need, and then resize the CQF if that guess turns out
to be too low. For example, you could guess that you'll only need D slots
(this is what Squeakr does, but more on this later). You then set q =
ceiling(log D) and r = p - q.
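A small sketch of that split, assuming the log is base 2 and that p has already been chosen. The struct and function names are hypothetical and only illustrate the arithmetic q = ceiling(log2 D), r = p - q.

```c
#include <stdint.h>

/* Smallest q with 2^q >= n, i.e. ceiling(log2(n)) for n >= 1. */
static unsigned ceil_log2_u64(uint64_t n)
{
    unsigned q = 0;
    while (q < 63 && (UINT64_C(1) << q) < n)
        q++;
    return q;
}

/* Hypothetical container: p hash bits split into q quotient and r remainder bits. */
struct cqf_params { unsigned p, q, r; };

static struct cqf_params initial_params(uint64_t distinct, unsigned p)
{
    struct cqf_params out;
    out.p = p;
    out.q = ceil_log2_u64(distinct);
    out.r = p > out.q ? p - out.q : 0;  /* assumption: clamp if p is too small */
    return out;
}
```

With D = 1,000,000 and p = 27 this yields q = 20 (about one million slots) and r = 7.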
When adding items, you can detect that the CQF is getting full by checking
whether
qf->metadata.noccupied_slots >= 0.95 * qf->metadata.nslots
If this condition holds, then you can create a new empty CQF twice as big
as your old one, i.e. with q' = q + 1 and r' = r - 1. Then iterate over
the hashes in the old CQF and insert them into the new CQF. Then discard
the old CQF and proceed adding more data to the new CQF. Repeat resizing as
many times as necessary.
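The parameter bookkeeping for one resize step can be sketched as follows. This covers only the q' = q + 1, r' = r - 1 arithmetic; allocating the new CQF and re-inserting the old hashes is left out because it depends on the library's iterator API. `cqf_shape` and `grow_shape` are hypothetical names, and refusing to go below one remainder bit is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a filter: 2^q slots, r remainder bits per slot. */
struct cqf_shape { unsigned q; unsigned r; };

/* One doubling step: q' = q + 1 and r' = r - 1, so p = q + r stays fixed
 * and the false positive rate is unchanged. Returns false when no remainder
 * bit can be given up (assumption: keep at least one remainder bit). */
static bool grow_shape(struct cqf_shape *s)
{
    if (s->r <= 1)
        return false;
    s->q += 1;
    s->r -= 1;
    return true;
}

/* Number of slots implied by a shape. */
static uint64_t shape_nslots(const struct cqf_shape *s)
{
    return UINT64_C(1) << s->q;
}
```

Starting from q = 20, r = 7, one step gives q = 21, r = 6: twice as many slots with the same 27 hash bits. The number of remainder bits bounds how many times the filter can be doubled this way.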
The annoyance with resizing is that it wastes time and requires allocating
memory in the middle of counting. Therefore it is worth spending a little
time up front estimating the number of slots you need. Here are some
guidelines:
- You always need at least D slots.
- You never need more than N slots, where N is the number of k-mer
instances in your input.
- Singletons take a single slot, doubletons take 2 slots.
- For tripletons up, it's probably ok to just treat them as if they take 3
slots. (Some could take more, but those will have high counts, and hence
by definition there can't be too many of those.)
The upshot is that, in order to avoid resizing, you can just estimate the
number of slots needed as something like 2 * D. Then round up to a power of
2. The justification is that the errors introduced by the read process
will guarantee that a large fraction (e.g. 1/2) of the k-mers will be
singletons, which will all take a single slot. The remaining k-mers will
take, typically, 3 slots. So 1/2 * 1 + 1/2 * 3 = 2 slots per k-mer.
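The rule of thumb can be written down as a short sketch. The function names are hypothetical; the singleton fraction is a parameter so the 1/2 assumption from the text is visible rather than baked in.

```c
#include <stdint.h>

/* Expected slots per distinct k-mer if a fraction f are singletons (1 slot)
 * and the rest are treated as taking 3 slots. */
static double expected_slots_per_kmer(double singleton_fraction)
{
    return singleton_fraction * 1.0 + (1.0 - singleton_fraction) * 3.0;
}

/* Smallest power of two >= n (returns 1 for n == 0). */
static uint64_t round_up_pow2(uint64_t n)
{
    uint64_t p = 1;
    while (p < n && p < (UINT64_C(1) << 63))
        p <<= 1;
    return p;
}

/* Rule of thumb from the text: about 2 slots per distinct k-mer,
 * rounded up to a power of two. */
static uint64_t estimate_nslots(uint64_t distinct)
{
    return round_up_pow2(2 * distinct);
}
```

For D = 1,000,000 the estimate is 2,000,000 slots, which rounds up to 2^21 = 2,097,152. Datasets with fewer singletons than assumed would need a larger multiplier, which `expected_slots_per_kmer` makes easy to recompute.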
I hope this gives some insight into how to think about sizing the CQF.
Best,
Rob
…On Mon, Jun 19, 2017 at 6:29 AM, Tim Head ***@***.***> wrote:
Hi, we are adding CQF support to khmer <https://github.com/dib-lab/khmer> and
one thing we've run into is that after inserting some number of unique
kmers we hit this assertion:
https://github.com/splatlab/cqf/blob/master/gqf.c#L487
For example when you create a QF with qf_init(cf, (1ULL << size), size+8,
0) and size=7. I then attempt to insert 400 random 20-mers. The assert
gets triggered when I try to insert kmer 242.
I think this is because the QF is "full". Is that right? I'm asking
because coming from a world of bloom filters (BF) you can keep adding and
adding to your BF and while eventually all bits will be set to 1, you can
always add more (even if it makes little sense).
For khmer this is nice. We can guarantee to the user that the program will
not run out of memory (all allocated up front) and we can warn them if we
think that they made their BF too small. But you can leave your program
running for hours and get "some kind of result" instead of "ah we used up
all the memory and crashed please start from scratch".
So finally my question: is there a way for me to check how full the CQF is
and if adding one more kmer will fail or not? Or is there a way to use the
CQF so that it will let me keep adding kmers until I go blue in the face
regardless of how full it is/overflow the counter?
------------------------------
(I think you can reproduce this with squeakr: ./main 0 10 1 test.fastq
though for me that segfaults and doesn't print the assert it hit so I am
not 100% sure.)
|
Hi @rtjohnso @prashantpandey thank you for your excellent replies!
We've been through a bunch of stuff with memory size choices in khmer over the years - see dib-lab/khmer#1117 for what we finally settled on.
@betatim is correct that one thing we really value in general is the ability to guarantee no out of memory errors. Over the years this has become more nuanced because in some (most) real data processing cases, we don't run the risk of running out of memory - but we still encounter some such cases regularly.
We do have a very fast (HLL-based) approach to closely estimating the number of k-mers and feeding them into the scripts (using
Also, interestingly, there are several situations in which sampling estimation does not work - RNAseq and metagenome analyses are two of them. Because of the variable coverage situation in both (equiv. unknown abundance spectrum), you cannot easily guesstimate the total number of k-mers in a data set.
Another interesting use case where sampling estimation doesn't work well is in streaming single-pass downsampling/error correction of data (
This will certainly not prevent us from adopting the CQF which is far more efficient (it seems :) than our Bloom filter approach, but I just wanted to share! With khmer we can implement both and let the user select, although then we run into UX challenges :) |
That makes sense. Just tried it out and am confused (again) :-)
QF cf;
uint64_t qbits = 3;
uint64_t nhashbits = qbits + 8;
uint64_t nslots = (1ULL << qbits);
/* Initialise the CQF */
qf_init(&cf, nslots, nhashbits, 0);
for (int i=0; i<16; ++i) {
  qf_insert(&cf, (i%8)%cf.range, 0, 1);
  printf("%i slots:%llu nocc:%llu nunique:%llu\n",
         i%8, cf.nslots, cf.noccupied_slots, cf.ndistinct_elts);
}
I was expecting this to end with 8 occupied slots and 8 distinct elements. However it counts up past
@rtjohnso thanks for the detailed answer! I will think about it while trying to sort out these technical things. |
This is the output of running the above snippet:
|
Hi Tim,
That actually looks right. We do encode high counters in the slots but,
for some technical reasons, we wanted to never use more than t slots to
store t copies of an item, even for very small values of t. Therefore the
encoding scheme is a little complicated and, as a result, for small counts
(e.g. 1, 2, or 3) the number of slots used to represent that item is the
same as the number of occurrences of that item. But after 3, the number of
slots used to represent C copies of an item is basically 2 + (log C) / r.
Best,
Rob
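A rough sketch of that slot-count rule, not the actual CQF counter encoding. It uses floor(log2 C) in place of log C so it needs no floating-point log, and it assumes r >= 1; `approx_slots_for_count` is a hypothetical name.

```c
#include <stdint.h>

/* floor(log2(n)) for n >= 1; returns 0 for n == 0. */
static unsigned floor_log2_u64(uint64_t n)
{
    unsigned b = 0;
    while (n >>= 1)
        b++;
    return b;
}

/* Approximate slots used by one item stored `count` times with r remainder
 * bits: counts 1..3 take that many slots, larger counts roughly
 * 2 + log2(count) / r. Assumes r >= 1. */
static double approx_slots_for_count(uint64_t count, unsigned r)
{
    if (count <= 3)
        return (double)count;
    return 2.0 + (double)floor_log2_u64(count) / (double)r;
}
```

In the posted snippet each of the 8 items is inserted twice, so this estimate predicts 8 * 2 = 16 occupied slots, in line with the final nocc:16. With r = 8, a count of 65536 would need only about 4 slots. The formula is approximate for counts just above 3, where it returns values below 3.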
…On Thu, Jun 22, 2017 at 6:44 AM, Tim Head ***@***.***> wrote:
This is the output of running the above snippet:
0 slots:8 nocc:1 nunique:1
1 slots:8 nocc:2 nunique:2
2 slots:8 nocc:3 nunique:3
3 slots:8 nocc:4 nunique:4
4 slots:8 nocc:5 nunique:5
5 slots:8 nocc:6 nunique:6
6 slots:8 nocc:7 nunique:7
7 slots:8 nocc:8 nunique:8
0 slots:8 nocc:9 nunique:8
1 slots:8 nocc:10 nunique:8
2 slots:8 nocc:11 nunique:8
3 slots:8 nocc:12 nunique:8
4 slots:8 nocc:13 nunique:8
5 slots:8 nocc:14 nunique:8
6 slots:8 nocc:15 nunique:8
7 slots:8 nocc:16 nunique:8
|
The reason you can insert more than 8 elements into the CQF even though it has only 8 slots is the extra slots we allocate to handle overflow:
qf->xnslots = nslots + 10*sqrt((double)nslots);
Thanks |
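To see what that formula gives for the sizes discussed in this thread, here is a small sketch. It assumes the result is truncated to an integer when stored, and it uses the identity floor(10 * sqrt(n)) = floor(sqrt(100 * n)) with an integer square root so it needs no math library. `total_slots` and `isqrt_u64` are hypothetical names.

```c
#include <stdint.h>

/* Integer square root: largest r with r * r <= n. A linear scan is
 * enough for illustration; it is not meant for very large n. */
static uint64_t isqrt_u64(uint64_t n)
{
    uint64_t r = 0;
    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/* nslots + floor(10 * sqrt(nslots)), computed as nslots + isqrt(100 * nslots).
 * Assumption: the original expression is truncated toward zero on assignment. */
static uint64_t total_slots(uint64_t nslots)
{
    return nslots + isqrt_u64(100 * nslots);
}
```

For nslots = 8 this gives 36 physical slots, which is why 16 occupied slots fit. For nslots = 128 it gives 241, which would be consistent with the assertion firing on the 242nd insert in the original report, assuming each of those k-mers took one slot.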
Thanks both of you! After reading your first long reply to the end I wasn't surprised by the behaviour anymore. Just to check: singletons, doubletons and tripletons -> item that appears once, twice, thrice(?), etc right? |
Yes. |
Thanks a lot for all the answering of naive questions 😃 |
yes, thank you!!
|
Here is the link to the SIGMOD17 paper on the CQF (http://dl.acm.org/citation.cfm?id=3035963) if you want to understand the data structure in more detail. Thanks |