New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
barcode hash conflict? #1741
Comments
This is the "birthday paradox". See https://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash for details of a 32-bit hash. Basically a 50/50 chance is of any two hashes colliding for bit-length N, is more or less 1 in 2^(N/2). Note that's "any 2 colliding". This is not the same as the probability of this specific barcode colliding with another, which is still 1 in 2^32 for a 32-bit hash. The main thought here is that although you likely will get the occasionally collision on any data set with enough barcodes, they are likely to be randomly distributed and so act as a tiny reduction in throughput (0.001%). |
Thanks for your explanations. I think it makes sense even I did not quite get the last 0.001% reduction out from the post data. However, I think any company/group/person who have this concern can fully avoid the collision by the following steps:
|
Correct. If you are designing barcodes rather than randomly constructing them, then a set can be designed that avoids collisions. Maybe we should add a tool to take a list of barcodes and report any collisions. |
It is really so good that samtools finally support mark duplications in terms of barcode information,
however I found that barcode are just reduced to int32_t by do_hash function without conflict check,
is it possible that two different barcodes share the same hash?
or that it is nearly impossible that only the barcodes are different with same hash value, while the other information(coords, refs, leftmost, read group, orientation) are the same for two different reads?
The text was updated successfully, but these errors were encountered: