Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

barcode hash conflict? #1741

Open
wulj2 opened this issue Oct 28, 2022 · 3 comments
Open

barcode hash conflict? #1741

wulj2 opened this issue Oct 28, 2022 · 3 comments

Comments

@wulj2
Copy link
Contributor

wulj2 commented Oct 28, 2022

It is really so good that samtools finally support mark duplications in terms of barcode information,
however I found that barcode are just reduced to int32_t by do_hash function without conflict check,
is it possible that two different barcodes share the same hash?
or that it is nearly impossible that only the barcodes are different with same hash value, while the other information(coords, refs, leftmost, read group, orientation) are the same for two different reads?

@jkbonfield
Copy link
Contributor

This is the "birthday paradox". See https://stackoverflow.com/questions/14210298/probability-of-collision-when-using-a-32-bit-hash for details of a 32-bit hash. Basically a 50/50 chance is of any two hashes colliding for bit-length N, is more or less 1 in 2^(N/2). Note that's "any 2 colliding". This is not the same as the probability of this specific barcode colliding with another, which is still 1 in 2^32 for a 32-bit hash.

The main thought here is that although you likely will get the occasionally collision on any data set with enough barcodes, they are likely to be randomly distributed and so act as a tiny reduction in throughput (0.001%).

@wulj2
Copy link
Contributor Author

wulj2 commented Oct 28, 2022

Thanks for your explanations. I think it makes sense even I did not quite get the last 0.001% reduction out from the post data.

However, I think any company/group/person who have this concern can fully avoid the collision by the following steps:

  1. design a large barcode library to use in all your NGS experiment
  2. use the do_hash function to select a final barcode library which have not any collisions
  3. use the final barcode set in all your NGS experiment and use samtools markup will work like charm without any concern about barcode hash conflict

@jkbonfield
Copy link
Contributor

Correct. If you are designing barcodes rather than randomly constructing them, then a set can be designed that avoids collisions. Maybe we should add a tool to take a list of barcodes and report any collisions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants