Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Odds of Hash Collision for Custom Type Keyword ID's are quite High #126

Closed
alexanderkiel opened this issue Jan 10, 2020 · 3 comments
Closed

Comments

@alexanderkiel
Copy link

You use 16-bit Hashes for custom type keyword ID's. The equation to calculate hash collisions is:

(- 1 (Math/exp (* -1 (/ (* k (dec k)) (* 2 n)))))

or simpler, but only valid for small probabilities:

(/ (* k k) (* 2 n))

So assume one would like to use at least 128 custom types, as it would be possible with numeric ID's, the odds of a hash collision would be 1 out of 10. So in 1/10 of cases were I use roughly 128 random keywords to identify my custom types, I will have a hash collision.

For a small number of custom types like 12 the odds are 1 out of 1000. So I assume for most use cases the choice of a 16-bit hash is ok. But if one plans to use a large number of custom types, collisions will be a problem.

I would suggest to add some hint to the documentation that the hashing system works only for small amounts of custom types. Furthermore I would include collided keywords explicitly in the warning inside extend-thaw.

@ptaoussanis
Copy link
Member

Hi Alexander, thanks for pinging!

This is an intentional tradeoff since:

  • The common case is folks using only 1 or 2 custom types. I don't think I've ever seen anyone use more than ~8.
  • Collisions will be caught at compile-time with a warning.
  • In the event of a collision, it's often possible to change the identifier.
  • The identifiers are anyway just a convenience; you can always choose to use a byte identifier directly.

Is your concern theoretical, or have you actually run into a problem on your end?

I would suggest to add some hint to the documentation that the hashing system works only for small amounts of custom types.

Would be happy to look at a PR if you have something in mind 👍

Furthermore I would include collided keywords explicitly in the warning inside extend-thaw.

I'm not sure I follow? Collisions will already print a warning message, including the specific id.

Thanks!

@alexanderkiel
Copy link
Author

You are right, as I already said, for most use cases it is ok.

Is your concern theoretical, or have you actually run into a problem on your end?

Yes, I explore to use something like 50 to 100 records instead of normal maps in my application and plan to serialize them with nippy.

Would be happy to look at a PR if you have something in mind

Something like this:

* Keyword           - 2 byte overhead, resistent to id collisions

* Keyword           - 2 byte overhead, resistent to id collisions (for < 10 different custom types)

I'm not sure I follow? Collisions will already print a warning message, including the specific id.

I would include both, the old and the new id.

@ptaoussanis
Copy link
Member

Docstring will be updated in next release, thanks Alexander!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants