barcode hash conflict? #1741

wulj2 · 2022-10-28T02:45:59Z

It is great that samtools finally supports marking duplicates using barcode information.
However, I found that barcodes are simply reduced to an int32_t by the do_hash function, with no collision check.
Is it possible for two different barcodes to share the same hash?
Or is it nearly impossible for two different reads to differ only in their barcodes, yet get the same hash value, while all the other information (coords, refs, leftmost, read group, orientation) is the same?

jkbonfield · 2022-10-28T08:55:21Z

This is the "birthday paradox". See https://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash for details on a 32-bit hash. Basically, for an N-bit hash you reach a roughly 50/50 chance of any two hashes colliding once you have on the order of 2^(N/2) items. Note that's "any 2 colliding". This is not the same as the probability of this specific barcode colliding with one specific other barcode, which is still 1 in 2^32 for a 32-bit hash.
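To make the two probabilities concrete, here is a minimal sketch using the standard birthday approximation. It assumes the hash output is uniformly distributed over 2^bits values, which is an idealisation and not a measured property of do_hash; the function names are made up for illustration.

```python
import math


def p_any_collision(n_items: int, bits: int = 32) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Assumes an ideal, uniformly distributed hash of the given bit width.
    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    space = 2.0 ** bits
    exponent = n_items * (n_items - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


def p_specific_pair(bits: int = 32) -> float:
    """Chance that one given barcode collides with one other given barcode."""
    return 1.0 / (2.0 ** bits)


if __name__ == "__main__":
    for n in (1_000, 65_536, 77_163, 1_000_000):
        print(f"{n:>9} barcodes: P(any collision) ~ {p_any_collision(n):.4f}")
    print(f"specific pair: {p_specific_pair():.3e}")
```

Under these assumptions a 32-bit hash crosses the 50% mark at roughly 77,000 distinct values, which is the same order of magnitude as 2^16.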

The main thought here is that although you will likely get the occasional collision on any data set with enough barcodes, the collisions are likely to be randomly distributed and so act as a tiny reduction in throughput (0.001%).

wulj2 · 2022-10-28T09:25:48Z

Thanks for the explanation. I think it makes sense, even though I did not quite see how the final 0.001% reduction follows from the data in that post.

However, I think any company/group/person who has this concern can fully avoid collisions with the following steps:

1. Design a large barcode library to use in all your NGS experiments.
2. Use the do_hash function to select a final barcode library that has no collisions.
3. Use the final barcode set in all your NGS experiments; samtools markdup will then work like a charm, with no concern about barcode hash conflicts.
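The selection step above can be sketched as follows. This is only an illustration: `placeholder_hash` uses CRC-32 as a stand-in and is not samtools' do_hash, so for a real check you would pass in a faithful port of do_hash as `hash_fn`. The helper names are hypothetical.

```python
import zlib
from typing import Callable, Dict, Iterable, List, Tuple


def placeholder_hash(barcode: str) -> int:
    """Stand-in 32-bit hash (CRC-32). NOT the do_hash used by samtools."""
    return zlib.crc32(barcode.encode("ascii")) & 0xFFFFFFFF


def select_collision_free(
    barcodes: Iterable[str],
    hash_fn: Callable[[str], int] = placeholder_hash,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Greedily keep barcodes whose hash has not been seen yet.

    Returns (kept, rejected), where each rejected entry is a pair
    (barcode, earlier_barcode_with_the_same_hash).
    """
    seen: Dict[int, str] = {}
    kept: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for bc in barcodes:
        h = hash_fn(bc)
        if h not in seen:
            seen[h] = bc
            kept.append(bc)
        elif seen[h] != bc:
            # Different barcode, same hash value: a genuine collision.
            rejected.append((bc, seen[h]))
        # An exact repeat of an already kept barcode is silently skipped.
    return kept, rejected


if __name__ == "__main__":
    # len() is used as a deliberately bad hash to force a collision.
    kept, rejected = select_collision_free(["ACGT", "TTTT", "AC"], hash_fn=len)
    print("kept:", kept)
    print("rejected:", rejected)
```

The greedy pass keeps whichever barcode of a colliding pair comes first in the input, so the order of the candidate library decides which ones survive.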

jkbonfield · 2022-10-28T10:15:54Z

Correct. If you are designing barcodes rather than randomly constructing them, then a set can be designed that avoids collisions. Maybe we should add a tool to take a list of barcodes and report any collisions.
