# High risk of post IDs colliding #57

Closed
opened this Issue Dec 17, 2012 · 4 comments

### 3 participants

TentD calculates post IDs like this:

`rand(36 ** 6).to_s(36)`

A base36-encoded number, 0 <= post id < 36**6.

I was curious about the possibility of collision. I found some Python code for the generalized birthday problem here. Using it, we can confirm the original birthday paradox, namely that if you have 23 people in a room there's ~50% chance two have the same birthday:

``````>>> bp(23,365)
0.5072972343239854
``````

The collision results for post IDs from 0 to (36**6)-1 are not good:

``````>>> bp(1000,36**6)
0.0002294408417131688
>>> bp(5000,36**6)
0.005724827003213306
>>> bp(10000,36**6)
0.02270567757151587
>>> bp(50000,36**6)
0.4368644884272206
>>> bp(100000,36**6)
0.8994379666202111
>>> bp(500000,36**6)
1.0
``````

Even at 10,000 posts, there's a 2.3% chance of collision. At 500,000 posts a collision is so nearly certain that Python rounds the result to 100%!

How feasible would it be to simply use a cryptographic hash as a post ID? Are you optimizing for something (human memorability? DB lookup speed?) by not doing that?

Otherwise, it seems that 14-character strings might be acceptable, or 10-character strings might work if the server explicitly checks for collisions and regenerates the ID until one is unique:

``````>>> bp(1000000,36**10)
0.00013674607495806068
>>> bp(1000000,36**12)
1.055211606981743e-07
>>> bp(1000000,36**14)
7.317579875376623e-11
>>> bp(1000000,36**16)
0.0
``````
Tent member

Thanks for pointing this out. The original decision for this was made quite a while ago, and the reasons for it are irrelevant now. Luckily, it's very easy to change.

What we need is a way to generate non-sequential unique identifiers. These should be as short as is reasonably possible, I think the easiest thing to do is to use a UUID/GUID, I don't have a strong preference which type though.

Tent member

Also, whatever UUID system we end up with should use base32 or urlsafe base64 instead of hex, as there is no reason to waste bytes.

Bump

pushed a commit that referenced this issue Dec 23, 2012
 Jesse Stuart `Generate longer ids` `refs #57` `dc8b998`
referenced this issue Dec 23, 2012
Merged

#### Generate longer ids #59

closed this Dec 23, 2012
Tent member

The IDs are now 16 bytes of random data Base64 encoded with url-safe characters (22 ASCII characters). They look like this:

``````5Ns3iBvkUBzET4MDokbPwA
DV-J704mVXfSNwAEABM_kA
8nAfk6hCoU724kGpXmievA
BiCemc84A1w4ra8osddevg
EmwUB6_7mcGsxNXIrrBOAw
``````