
chore: Add basic collision probabilities script #29

Merged
merged 19 commits on Oct 11, 2022

Conversation

justinvdm (Contributor) commented Sep 22, 2022

Context

We currently have very little information on how likely collisions are for each of copycat's API methods. For some methods (e.g. bool), these are trivial or unimportant. For data types that typically need to be unique (e.g. uuid or email), this is important information to know: it allows us to know where in the API we need to improve to make collisions less likely, and lets our users know how likely collisions will be for their use case.

Approach

* For each API method in the shortlist:
   * Until the stopping conditions are met (see below):
     * Run the method with a new uuid v4 as input until the first collision is found, or until we've iterated 999,999 times
     * Record the data set size at which the first collision happened as a datapoint
   * Output the obtained stats (e.g. mean, stddev, margin of error)

The stopping conditions (see the sketch after this list):
* Both of these must be true:
  * We've run the method a minimum number of times (100); and
  * Either:
    * The margin of error is under a lower threshold (`0.05`); or
    * The margin of error is under a higher threshold (`0.10`), and the sum (the sum of first collision dataset sizes obtained so far) is over a higher threshold (999,999)
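
Putting the measurement loop and the stopping rule together, here is a minimal sketch of the idea, assuming simplified signatures - `firstCollisionSize` and `shouldStop` are illustrative names, not necessarily the script's actual identifiers:

```
import { randomUUID } from 'crypto'

const MAX_ITERATIONS = 999_999
const MIN_RUNS = 100
const LOWER_MOE_THRESHOLD = 0.05
const HIGHER_MOE_THRESHOLD = 0.1
const SUM_THRESHOLD = 999_999

// Feed the method fresh uuid v4 inputs until two inputs produce the same
// output, and return the dataset size at which that first collision happened.
function firstCollisionSize(method: (input: string) => unknown): {
  size: number
  hasCollided: boolean
} {
  const seen = new Set<unknown>()
  for (let size = 1; size <= MAX_ITERATIONS; size++) {
    const result = method(randomUUID())
    if (seen.has(result)) return { size, hasCollided: true }
    seen.add(result)
  }
  return { size: MAX_ITERATIONS, hasCollided: false }
}

// The stopping rule: always do the minimum number of runs, then require
// either the stricter margin of error, or the looser one once the total
// work done (the sum of first collision sizes) is large enough.
function shouldStop(runs: number, moe: number, sum: number): boolean {
  return (
    runs >= MIN_RUNS &&
    (moe < LOWER_MOE_THRESHOLD ||
      (moe < HIGHER_MOE_THRESHOLD && sum > SUM_THRESHOLD))
  )
}
```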

Since the task here is number-crunching, we use a worker farm to leverage multiple CPU cores and get results faster.
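
The PR doesn't spell the worker setup out here, so as a rough illustration of the fan-out using Node's built-in `worker_threads` (the script may well use a different worker-pool library; `./collision-worker.js` is a hypothetical worker module):

```
import { Worker } from 'worker_threads'
import { cpus } from 'os'

// Spawn one worker per CPU core. Each worker repeatedly measures
// first collision dataset sizes for the given method and posts each
// datapoint back to the main thread.
const workers = cpus().map(
  () =>
    new Worker('./collision-worker.js', {
      workerData: { methodName: 'email' },
    })
)

for (const worker of workers) {
  worker.on('message', (datapoint: number) => {
    // Collect datapoints, recompute the stats, and terminate the
    // workers once the stopping conditions above are met.
  })
}
```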

Why complicate the logic for the stopping conditions?

Ideally, we would only need to check two things: that our number of runs (the size of the set of first collision data set sizes) is over a minimum threshold, and that our margin of error is under a maximum threshold.

Unfortunately, this isn't computationally feasible for methods with large first collision dataset sizes - it would simply take too long to reach the lower margin of error. So we make a compromise, and allow a higher margin of error (0.10 rather than 0.05) for methods with large first collision data set sizes.

For example, the first collision dataset size for int is very large - so in this case, it isn't feasible to keep running until we reach a margin of error of 0.05. Say we ran it 100 times, and in total, all of the runs (i.e. all of the first collision data set sizes) summed to over 999,999. We then decide that we're happy enough with a margin of error of 0.10, and stop there.

Why not analyse the theoretical probabilities instead?

We're taking an empirical approach here, rather than analysing the code to work out the theoretical probabilities. The reason is that it would be difficult to do this for each and every copycat API method in question and know for certain that the calculations are accurate - there are just too many variables at play influencing the probabilities, and we wouldn't really know whether we've accounted for all of them. The only way to know how the collision probabilities look in practice is to actually run the code.
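
For a sense of what the theoretical route would involve: for an idealised generator that picks uniformly among $N$ distinct outputs, the expected dataset size at the first collision has a well-known closed form (the birthday problem):

$$
\mathbb{E}[\text{first collision}] \approx \sqrt{\frac{\pi N}{2}}
$$

The catch is that copycat's methods aren't uniform over an easily counted output space, so $N$ is exactly the quantity we don't know.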

Limitations

  • I must admit that I know very little about the areas I'm wandering around in with this PR: both hash collision probabilities, and stats. If there's something important I'm missing with the approach I'm taking, I'd like to know!
  • The approach required taking some shortcuts, since calculating the probabilities within better margins of error for some of the methods would take infeasibly long. See Why complicate the logic for the stopping conditions? above for more on this.

How to read the results

For each API method:

  • mean is roughly: the average data set size at which a collision happened (arithmetic mean)
  • stddev is the standard deviation
  • min is the worst-case run - the smallest data set size at which the first collision happened
  • max is the best-case run - the largest data set size at which the first collision happened
  • runs is the number of times we tested the API method before deciding the stats are "good enough"
  • sum is the sum of all the runs - the total number of times the method was called
  • moe is the "margin of error": we are 95% certain that the "real" average dataset size at which the first collision happens is within x % of mean. For example, a moe of 0.05 means we are 95% certain that the "real" average data set size at which the first collision happens is within 5% of mean (see the sketch after this list)
  • hasCollided is whether a collision was found before we hit the maximum dataset size that we test collisions for (999,999)
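
As a sketch of how such a relative margin of error is typically computed - the half-width of a ~95% confidence interval for the mean, expressed as a fraction of the mean (the script's exact formula may differ):

```
// Relative margin of error at ~95% confidence for a list of
// first collision dataset sizes.
function marginOfError(datapoints: number[]): number {
  const n = datapoints.length
  const mean = datapoints.reduce((a, b) => a + b, 0) / n
  // Sample variance (n - 1 denominator), then the standard error of the mean.
  const variance =
    datapoints.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  const standardError = Math.sqrt(variance) / Math.sqrt(n)
  return (1.96 * standardError) / mean // 1.96 ≈ z-score for 95% confidence
}
```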

Results

Incomplete list (I have yet to tweak the script to actually finish running in a reasonable time for the other methods):

{"methodName":"dateString","mean":"414.61","stddev":"214.23","moe":"0.05","runs":411,"min":5,"max":1412,"sum":170404,"hasCollided":true}
{"methodName":"fullName","mean":"1574.32","stddev":"784.96","moe":"0.05","runs":383,"min":75,"max":4853,"sum":602964,"hasCollided":true}
{"methodName":"streetAddress","mean":"22367.15","stddev":"12490.97","moe":"0.10","runs":122,"min":1154,"max":58605,"sum":2728792,"hasCollided":true}
{"methodName":"float","mean":"186526.06","stddev":"103245.91","moe":"0.10","runs":118,"min":12602,"max":550474,"sum":22010075,"hasCollided":true}
{"methodName":"ipv4","mean":"79714.37","stddev":"41439.56","moe":"0.10","runs":106,"min":6581,"max":205074,"sum":8449723,"hasCollided":true}
{"methodName":"email","mean":"119657.21","stddev":"59025.15","moe":"0.10","runs":100,"min":5254,"max":275462,"sum":11965721,"hasCollided":true}


## Worth noting
Methods like `scramble` are difficult to measure this for, since the probability is largely dependent on the input string. For example, a short input string will have a higher collision rate. For this reason, `scramble` is omitted from the script at the moment.

## Approach
* For each API method:
  * While we're under the minimum number of runs (100), or until the margin of error is below 0.05 (5%), provided we're still under the maximum number of runs (9999):
    * Run the method with a new uuid v4 as input until the first collision is found
  * Output the obtained stats (e.g. mean, stddev, margin of error)

Since the task here is number-crunching, we do this using a worker farm to leverage multiple CPU cores to get results faster.

## How to read the results
For each API method the `mean` is roughly: the data set size at which the first collision is likely to happen.

`runs` is the number of times we tested the API method before deciding the margin of error is "good enough" - under 5%.

"likely to happen" is pretty vague though. What the numbers are really saying, is: we are 95% certain that the "real" first collision data set size is within 5% of the `mean` displayed in these results. Unless we reached the maximum number of runs - 9999, in which case, it is whatever the `moe` stat shows instead of `0.05` (5%).

## Limitations
I must admit that I know very little about this subject (hash collision probabilities). That, and I didn't pay enough attention in my stats lectures back in the day. I don't know where these measurements suck, or how badly they do. Any input would be greatly appreciated!

## Alternatives and why I didn't go for them
An alternative approach would be to test for the probability of collisions at a particular data set size (similar to how you might calculate numbers for the birthday problem). This is arguably less helpful for our case: for each API method, we want to know _when_ the first collision happened, so that we know how useful the copycat API methods would be for some given data set size (e.g. if email's first collision is likely to happen at a data set size of 100,000 and we need uniqueness, then we have some surety that it is fine to use for a data set of size 500, but not one of size 900,000).
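
For reference, that alternative would amount to estimating, for a fixed dataset size $n$ and an idealised uniform generator with $N$ possible outputs, the probability of at least one collision via the standard birthday approximation:

$$
P(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)}
$$

This is a useful number, but it answers "how likely is a collision at size $n$" rather than the "when does the first collision happen" question we care about here.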

## Results
I haven't run this for all API methods yet - it takes a while to run. I don't know if it's broken for some methods yet, and will need to run it for longer to get results for the other API methods.

```
{"methodName":"bool","mean":"1.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"digit","mean":"6.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"oneOf","mean":"2.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"char","mean":"9.49","stddev":"4.84","moe":"0.05","runs":401}
{"methodName":"hex","mean":"4.90","stddev":"2.33","moe":"0.05","runs":349}
{"methodName":"lastName","mean":"26.77","stddev":"14.68","moe":"0.05","runs":463}
{"methodName":"timezone","mean":"12.56","stddev":"6.61","moe":"0.05","runs":427}
{"methodName":"word","mean":"19.32","stddev":"9.81","moe":"0.05","runs":396}
{"methodName":"dateString","mean":"421.42","stddev":"216.10","moe":"0.05","runs":406}
{"methodName":"email","mean":"98728.50","stddev":"3043.50","moe":"0.04","runs":2}
{"methodName":"country","mean":"19.50","stddev":"9.61","moe":"0.05","runs":374}
```

An interesting one is `email` - only 2 runs is a bit dodgy - I'll need to look into that one further.
justinvdm (Contributor, Author) commented:
I'm in the process of trying to get this script to run in a more reasonable amount of time - assuming that's possible, but in the end I'll probably have more of an idea of that too. Closing until I've got something more complete to show.

@justinvdm justinvdm closed this Sep 26, 2022
@justinvdm justinvdm reopened this Sep 29, 2022
peterp (Member) left a comment:
This is awesome!
