High risk of post IDs colliding #57

Closed
graue opened this Issue Dec 17, 2012 · 4 comments

3 participants

@graue

TentD calculates post IDs like this:

rand(36 ** 6).to_s(36)

A base36-encoded number, 0 <= post id < 36**6.

I was curious about the possibility of collision. I found some Python code for the generalized birthday problem here. Using it, we can confirm the original birthday paradox, namely that if you have 23 people in a room there's ~50% chance two have the same birthday:

>>> bp(23,365)
0.5072972343239854

The collision results for post IDs from 0 to (36**6)-1 are not good:

>>> bp(1000,36**6)
0.0002294408417131688
>>> bp(5000,36**6)
0.005724827003213306
>>> bp(10000,36**6)
0.02270567757151587
>>> bp(50000,36**6)
0.4368644884272206
>>> bp(100000,36**6)
0.8994379666202111
>>> bp(500000,36**6)
1.0

Even at 10,000 posts, there's a 2.3% chance of collision. At 500,000 posts a collision is so nearly certain that Python rounds the result to 100%!

How feasible would it be to simply use a cryptographic hash as a post ID? Are you optimizing for something (human memorability? DB lookup speed?) by not doing that?

Otherwise, it seems that 14-character strings might be acceptable, or 10-character strings might work if the server explicitly checks for collisions and regenerates the ID until one is unique:

>>> bp(1000000,36**10)
0.00013674607495806068
>>> bp(1000000,36**12)
1.055211606981743e-07
>>> bp(1000000,36**14)
7.317579875376623e-11
>>> bp(1000000,36**16)
0.0
@titanous
Tent member

Thanks for pointing this out. The original decision for this was made quite a while ago, and the reasons for it are irrelevant now. Luckily, it's very easy to change.

What we need is a way to generate non-sequential unique identifiers. These should be as short as is reasonably possible, I think the easiest thing to do is to use a UUID/GUID, I don't have a strong preference which type though.

@titanous
Tent member

Also, whatever UUID system we end up with should use base32 or urlsafe base64 instead of hex, as there is no reason to waste bytes.

@mfferreira

Bump

@jvatic jvatic pushed a commit that referenced this issue Dec 23, 2012
Jesse Stuart Generate longer ids
refs #57
dc8b998
@titanous titanous closed this Dec 23, 2012
@titanous
Tent member

The IDs are now 16 bytes of random data Base64 encoded with url-safe characters (22 ASCII characters). They look like this:

5Ns3iBvkUBzET4MDokbPwA
DV-J704mVXfSNwAEABM_kA
8nAfk6hCoU724kGpXmievA
BiCemc84A1w4ra8osddevg
EmwUB6_7mcGsxNXIrrBOAw
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment