
chore: Add basic collision probabilities script #29

Merged
merged 19 commits on Oct 11, 2022

Conversation

justinvdm (Contributor) commented Sep 22, 2022

Context

We currently have very little information on how likely collisions are for each of copycat's API methods. For some methods (e.g. bool), these are trivial or unimportant. For data types that typically need to be unique (e.g. uuid or email), this is important information to know: it allows us to know where in the API we need to improve to make collisions less likely, and lets our users know how likely collisions will be for their use case.

Approach

* For each API method in the shortlist:
   * Until the stopping conditions are met (see below):
     * Run the method with a new uuid v4 as input until the first collision is found, or until we've iterated 999,999 times
     * Record the data set size at which the first collision happened as a datapoint
   * Output the obtained stats (e.g. mean, stddev, margin of error)

The stopping conditions (see the sketch after this list):
* Both of these must be true:
  * We've run the method a minimum number of times (100); and
  * Either:
    * The margin of error is under a lower threshold (`0.05`); or
    * The margin of error is under a higher threshold (`0.10`), and the sum (the sum of first collision dataset sizes obtained so far) is over a higher threshold (999,999)
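
Putting the measurement loop and the stopping rule together, here is a minimal sketch of the idea, assuming simplified signatures - `firstCollisionSize` and `shouldStop` are illustrative names, not necessarily the script's actual identifiers:

```
import { randomUUID } from 'crypto'

const MAX_ITERATIONS = 999_999
const MIN_RUNS = 100
const LOWER_MOE_THRESHOLD = 0.05
const HIGHER_MOE_THRESHOLD = 0.1
const SUM_THRESHOLD = 999_999

// Feed the method fresh uuid v4 inputs until two inputs produce the same
// output, and return the dataset size at which that first collision happened.
function firstCollisionSize(method: (input: string) => unknown): {
  size: number
  hasCollided: boolean
} {
  const seen = new Set<unknown>()
  for (let size = 1; size <= MAX_ITERATIONS; size++) {
    const result = method(randomUUID())
    if (seen.has(result)) return { size, hasCollided: true }
    seen.add(result)
  }
  return { size: MAX_ITERATIONS, hasCollided: false }
}

// The stopping rule: always do the minimum number of runs, then require
// either the stricter margin of error, or the looser one once the total
// work done (the sum of first collision sizes) is large enough.
function shouldStop(runs: number, moe: number, sum: number): boolean {
  return (
    runs >= MIN_RUNS &&
    (moe < LOWER_MOE_THRESHOLD ||
      (moe < HIGHER_MOE_THRESHOLD && sum > SUM_THRESHOLD))
  )
}
```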

Since the task here is number-crunching, we use a worker farm to leverage multiple CPU cores and get results faster.
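
The PR doesn't spell the worker setup out here, so as a rough illustration of the fan-out using Node's built-in `worker_threads` (the script may well use a different worker-pool library; `./collision-worker.js` is a hypothetical worker module):

```
import { Worker } from 'worker_threads'
import { cpus } from 'os'

// Spawn one worker per CPU core. Each worker repeatedly measures
// first collision dataset sizes for the given method and posts each
// datapoint back to the main thread.
const workers = cpus().map(
  () =>
    new Worker('./collision-worker.js', {
      workerData: { methodName: 'email' },
    })
)

for (const worker of workers) {
  worker.on('message', (datapoint: number) => {
    // Collect datapoints, recompute the stats, and terminate the
    // workers once the stopping conditions above are met.
  })
}
```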

Why complicate the logic for the stopping conditions?

Ideally, we would only need to check two things: that our number of runs (the size of the set of first collision data set sizes) is over a minimum threshold, and that our margin of error is under a maximum threshold.

Unfortunately, this isn't computationally feasible for methods with large first collision dataset sizes - it would simply take too long to reach the lower margin of error. So we make a compromise, and allow a higher margin of error (0.10 rather than 0.05) for methods with large first collision data set sizes.

For example, the first collision dataset size for int is very large - so in this case, it isn't feasible to keep running until we reach a margin of error of 0.05. Say we ran it 100 times, and in total, all of the runs (i.e. all of the first collision data set sizes) summed to over 999,999. We then decide that we're happy enough with a margin of error of 0.10, and stop there.

Why not analyse the theoretical probabilities instead?

We're taking an empirical approach here, rather than analysing the code to work out the theoretical probabilities. The reason is that it would be difficult to do this for each and every copycat API method in question and know for certain that the calculations are accurate - there are just too many variables at play influencing the probabilities, and we wouldn't really know whether we've accounted for all of them. The only way to know how the collision probabilities look in practice is to actually run the code.
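
For a sense of what the theoretical route would involve: for an idealised generator that picks uniformly among $N$ distinct outputs, the expected dataset size at the first collision has a well-known closed form (the birthday problem):

$$
\mathbb{E}[\text{first collision}] \approx \sqrt{\frac{\pi N}{2}}
$$

The catch is that copycat's methods aren't uniform over an easily counted output space, so $N$ is exactly the quantity we don't know.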

Limitations

  • I must admit that I know very little about the areas I'm wandering around in with this PR: both hash collision probabilities, and stats. If there's something important I'm missing with the approach I'm taking, I'd like to know!
  • The approach required taking some shortcuts, since calculating the probabilities within better margins of error for some of the methods would take infeasibly long. See Why complicate the logic for the stopping conditions? above for more on this.

How to read the results

For each API method:

  • mean is roughly: the average data set size at which a collision happened (arithmetic mean)
  • stddev is the standard deviation
  • min is the worst-case run - the smallest data set size at which the first collision happened
  • max is the best-case run - the largest data set size at which the first collision happened
  • runs is the number of times we tested the API method before deciding the stats are "good enough"
  • sum is the sum of all the runs - the total number of times the method was called
  • moe is the "margin of error": we are 95% certain that the "real" average dataset size at which the first collision happens is within x % of mean. For example, a moe of 0.05 means we are 95% certain that the "real" average data set size at which the first collision happens is within 5% of mean (see the sketch after this list)
  • hasCollided is whether a collision was found before we hit the maximum dataset size that we test collisions for (999,999)
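
As a sketch of how such a relative margin of error is typically computed - the half-width of a ~95% confidence interval for the mean, expressed as a fraction of the mean (the script's exact formula may differ):

```
// Relative margin of error at ~95% confidence for a list of
// first collision dataset sizes.
function marginOfError(datapoints: number[]): number {
  const n = datapoints.length
  const mean = datapoints.reduce((a, b) => a + b, 0) / n
  // Sample variance (n - 1 denominator), then the standard error of the mean.
  const variance =
    datapoints.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  const standardError = Math.sqrt(variance) / Math.sqrt(n)
  return (1.96 * standardError) / mean // 1.96 ≈ z-score for 95% confidence
}
```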

Results

Incomplete list (I have yet to tweak the script to actually finish running in a reasonable time for the other methods):

{"methodName":"dateString","mean":"414.61","stddev":"214.23","moe":"0.05","runs":411,"min":5,"max":1412,"sum":170404,"hasCollided":true}
{"methodName":"fullName","mean":"1574.32","stddev":"784.96","moe":"0.05","runs":383,"min":75,"max":4853,"sum":602964,"hasCollided":true}
{"methodName":"streetAddress","mean":"22367.15","stddev":"12490.97","moe":"0.10","runs":122,"min":1154,"max":58605,"sum":2728792,"hasCollided":true}
{"methodName":"float","mean":"186526.06","stddev":"103245.91","moe":"0.10","runs":118,"min":12602,"max":550474,"sum":22010075,"hasCollided":true}
{"methodName":"ipv4","mean":"79714.37","stddev":"41439.56","moe":"0.10","runs":106,"min":6581,"max":205074,"sum":8449723,"hasCollided":true}
{"methodName":"email","mean":"119657.21","stddev":"59025.15","moe":"0.10","runs":100,"min":5254,"max":275462,"sum":11965721,"hasCollided":true}


## Worth noting
Methods like `scramble` are difficult to measure this for, since the probability is largely dependent on the input string. For example, a short input string will have a higher collision rate. For this reason, `scramble` is omitted from the script at the moment.

## Approach
* For each API method:
  * While we're under the minimum number of runs (100), or until the margin of error is below 0.05 (5%), provided we're still under the maximum number of runs (9999):
    * Run the method with a new uuid v4 as input until the first collision is found
  * Output the obtained stats (e.g. mean, stddev, margin of error)

Since the task here is number-crunching, we do this using a worker farm to leverage multiple CPU cores to get results faster.

## How to read the results
For each API method the `mean` is roughly: the data set size at which the first collision is likely to happen.

`runs` is the number of times we tested the API method before deciding the margin of error is "good enough" - under 5%.

"likely to happen" is pretty vague though. What the numbers are really saying, is: we are 95% certain that the "real" first collision data set size is within 5% of the `mean` displayed in these results. Unless we reached the maximum number of runs - 9999, in which case, it is whatever the `moe` stat shows instead of `0.05` (5%).

## Limitations
I must admit that I know very little about this subject (hash collision probabilities). That, and I didn't pay enough attention in my stats lectures back in the day. I don't know where these measurements suck, or how badly they do. Any input would be greatly appreciated!

## Alternatives and why I didn't go for them
An alternative approach would be to test for the probability of collisions at a particular data set size (similar to how you might calculate numbers for the birthday problem). This is arguably less helpful for our case: for each API method, we want to know _when_ the first collision happened, so that we know how useful the copycat API methods would be for some given data set size (e.g. if email's first collision is likely to happen at a data set size of 100,000 and we need uniqueness, then we have some surety that it is fine to use for a data set of size 500, but not one of size 900,000).
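
For reference, that alternative would amount to estimating, for a fixed dataset size $n$ and an idealised uniform generator with $N$ possible outputs, the probability of at least one collision via the standard birthday approximation:

$$
P(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)}
$$

This is a useful number, but it answers "how likely is a collision at size $n$" rather than the "when does the first collision happen" question we care about here.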

## Results
I haven't run this for all API methods yet - it takes a while to run. I don't know if it's broken for some methods yet, and will need to run it for longer to get results for the other API methods.

```
{"methodName":"bool","mean":"1.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"digit","mean":"6.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"oneOf","mean":"2.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"char","mean":"9.49","stddev":"4.84","moe":"0.05","runs":401}
{"methodName":"hex","mean":"4.90","stddev":"2.33","moe":"0.05","runs":349}
{"methodName":"lastName","mean":"26.77","stddev":"14.68","moe":"0.05","runs":463}
{"methodName":"timezone","mean":"12.56","stddev":"6.61","moe":"0.05","runs":427}
{"methodName":"word","mean":"19.32","stddev":"9.81","moe":"0.05","runs":396}
{"methodName":"dateString","mean":"421.42","stddev":"216.10","moe":"0.05","runs":406}
{"methodName":"email","mean":"98728.50","stddev":"3043.50","moe":"0.04","runs":2}
{"methodName":"country","mean":"19.50","stddev":"9.61","moe":"0.05","runs":374}
```

An interesting one is `email` - only 2 runs is a bit dodgy - I'll need to look into that one further.
justinvdm (Contributor, Author) commented:
I'm in the process of trying to get this script to run in a more reasonable amount of time - assuming that's possible, but in the end I'll probably have more of an idea of that too. Closing until I've got something more complete to show.

@justinvdm justinvdm closed this Sep 26, 2022
@justinvdm justinvdm reopened this Sep 29, 2022
peterp (Member) left a comment:
This is awesome!
