chore: Add basic collision probabilities script #29
Merged
Conversation
justinvdm force-pushed the collisions branch from c52ad11 to 409f2d1 on September 22, 2022 at 15:59
## Context

We currently have very little information on how likely collisions are for each of copycat's API methods. For some methods (e.g. `bool`), these are trivial or unimportant. For data types that typically need to be unique (e.g. `uuid` or `email`), this is important information to know: it tells us where in the API we need to improve to make collisions less likely, and lets our users know how likely collisions will be for their use case.

## Worth noting

Methods like `scramble` are difficult to measure this for, since the probability depends largely on the input string. For example, a short input string will have a higher collision rate. For this reason, `scramble` is omitted from the script for now.

## Approach

* For each API method:
  * While we're under the minimum number of runs (100), or until the margin of error is below 0.05 (5%) - provided we're still under the maximum number of runs (9999):
    * Run the method with a new uuid v4 as input until the first collision is found
  * Output the resulting stats (e.g. mean, stddev, margin of error)

Since the task here is number-crunching, we do this using a worker farm to leverage multiple CPUs and get results faster. A sketch of the measurement loop follows the results below.

## How to read the results

For each API method, the `mean` is roughly the data set size at which the first collision is likely to happen. `runs` is the number of times we tested the API method before deciding the margin of error is "good enough" - under 5%.

"Likely to happen" is pretty vague, though. What the numbers are really saying is: we are 95% certain that the "real" first-collision data set size is within 5% of the `mean` displayed in these results - unless we reached the maximum number of runs (9999), in which case the bound is whatever the `moe` stat shows instead of `0.05` (5%).

## Limitations

I must admit that I know very little about this subject (hash collision probabilities). That, and I didn't pay enough attention in my stats lectures back in the day. I don't know where these measurements suck, or how badly they do. Any input would be greatly appreciated!

## Alternatives and why I didn't go for them

An alternative approach would be to test for the probability of collisions at a particular data set size (similar to how you might do this when calculating numbers for the birthday problem). This is arguably less helpful for our case, since for each API method we want to know _when_ the first collision happened, so that we know how useful the copycat API methods would be for some given data set size. For example, if `email`'s first collision is likely to happen at a data set size of 100,000 and we need uniqueness, then we have some surety that it is fine to use for a data set of size 500, but not one of size 900,000.

## Results

I haven't run this for all API methods yet - it takes a while to run. I don't know if it's broken for some methods, and will need to run it for longer to get results for the other API methods.
```
{"methodName":"bool","mean":"1.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"digit","mean":"6.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"oneOf","mean":"2.00","stddev":"0.00","moe":"0.00","runs":2}
{"methodName":"char","mean":"9.49","stddev":"4.84","moe":"0.05","runs":401}
{"methodName":"hex","mean":"4.90","stddev":"2.33","moe":"0.05","runs":349}
{"methodName":"lastName","mean":"26.77","stddev":"14.68","moe":"0.05","runs":463}
{"methodName":"timezone","mean":"12.56","stddev":"6.61","moe":"0.05","runs":427}
{"methodName":"word","mean":"19.32","stddev":"9.81","moe":"0.05","runs":396}
{"methodName":"dateString","mean":"421.42","stddev":"216.10","moe":"0.05","runs":406}
{"methodName":"email","mean":"98728.50","stddev":"3043.50","moe":"0.04","runs":2}
{"methodName":"country","mean":"19.50","stddev":"9.61","moe":"0.05","runs":374}
```

An interesting one is `email` - only 2 runs is a bit dodgy - I'll need to look into that one further.
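As a concrete illustration of the measurement loop described in the approach above - a minimal sketch, assuming the `@snaplet/copycat` package; `measureFirstCollision` is a hypothetical name, not the script's actual code:

```ts
import { randomUUID } from 'crypto'
import { copycat } from '@snaplet/copycat'

// Hypothetical sketch: run one method with fresh uuid v4 inputs until
// its output first collides, returning the data set size at that point.
function measureFirstCollision(method: (input: string) => unknown): number {
  const seen = new Set<string>()
  let size = 0
  while (true) {
    const value = JSON.stringify(method(randomUUID()))
    size++
    if (seen.has(value)) return size
    seen.add(value)
  }
}

// e.g. one sample of the first-collision data set size for `email`:
console.log(measureFirstCollision((input) => copycat.email(input)))
```

Repeating this loop per method, and aggregating the resulting sizes into the mean/stddev/moe stats, would give results shaped like the ones above.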
justinvdm force-pushed the collisions branch from 409f2d1 to 1657834 on September 23, 2022 at 08:22
I'm in the process of trying to get this script to run in a more reasonable amount of time - assuming that's possible, though by the end I should have a better idea of that too. Closing until I've got something more complete to show.
peterp approved these changes on Oct 11, 2022
This is awesome!
Context
We currently have very little information on how likely collisions are for each of copycat's API methods. For some methods (e.g. `bool`), these are trivial or unimportant. For data types that typically need to be unique (e.g. `uuid` or `email`), this is important information to know: it allows us to know where in the API we need to improve to make collisions less likely, and lets our users know how likely collisions will be for their use case.
Approach
Since the task here is number-crunching, we do this using a worker farm to leverage multiple CPU cores and get results faster. A sketch of what that could look like follows.
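The worker setup itself isn't shown in this description; as one possible shape, this sketch assumes the `worker-farm` npm package, with a hypothetical `./measure` child module that exports the measurement function:

```ts
// main.ts - hypothetical sketch: fan measurement runs out across CPU
// cores, one child-process call per API method.
import workerFarm from 'worker-farm'

// `./measure` would export: module.exports = (methodName, callback) => ...
const workers = workerFarm(require.resolve('./measure'))

const methodNames = ['bool', 'email', 'word']
let pending = methodNames.length

for (const methodName of methodNames) {
  // Each call is dispatched to a worker; stats come back via callback.
  workers(methodName, (err: Error | null, stats: unknown) => {
    if (err) throw err
    console.log(stats)
    if (--pending === 0) workerFarm.end(workers)
  })
}
```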
Why complicate the logic for the stopping conditions?
Ideally, we would check only two things: that our number of runs (the set of first-collision data set sizes) is over a minimum threshold, and that our margin of error is under a maximum threshold.
Unfortunately, this isn't very computationally feasible for methods with large first-collision data set sizes - it would just take too long to reach a lower margin of error. So we make a compromise, and allow a higher margin of error (`0.10` rather than `0.05`) for methods whose first-collision data set sizes are high.
For example, the first-collision data set size for `int` is very large - so in this case, it isn't feasible to keep running until we reach a margin of error of `0.05`. Let's say we ran it 100 times, and in total, all of the runs (i.e. all of the first-collision data set sizes) summed up to over 999,999. We then decide that we're happy enough with a margin of error of `0.10`, and stop there. These stopping conditions are sketched below.
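A sketch of those stopping conditions, using the thresholds from the description above (`shouldStop` and its shape are hypothetical, not the script's actual code):

```ts
// Hypothetical sketch of the stopping conditions described above.
const MIN_RUNS = 100
const TARGET_MOE = 0.05 // 5%
const RELAXED_MOE = 0.1 // 10%, for expensive methods
const MAX_TOTAL_SIZE = 999_999

function shouldStop(runs: number[], moe: number): boolean {
  if (runs.length < MIN_RUNS) return false

  // If the first-collision sizes sum past the cap, the method is too
  // expensive to keep sampling: accept the looser margin of error.
  const total = runs.reduce((a, b) => a + b, 0)
  if (total > MAX_TOTAL_SIZE) return moe <= RELAXED_MOE

  return moe <= TARGET_MOE
}
```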
Why not analyse for the theoretical probabilities instead?
We're taking an empirical approach here, rather than analysing the code to work out the theoretical probabilities. It is difficult to do this for each and every copycat API method and know for certain the calculations are accurate - there are just too many variables influencing the probabilities, and we wouldn't really know if we've accounted for all of them. The only way to know how the collision probabilities look in practice is to actually run the code.
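For contrast, this is the kind of closed-form estimate an analytical approach would rely on: the standard birthday approximation for a generator with `d` equally likely outputs. The uniformity it assumes is exactly what's hard to verify for each method:

```ts
// Birthday approximation: probability of at least one collision among
// n draws from d equally likely values (assumes a uniform generator).
function collisionProbability(n: number, d: number): number {
  return 1 - Math.exp((-n * (n - 1)) / (2 * d))
}

// e.g. ~39% chance of a collision after 1,000 draws from 1,000,000 values:
console.log(collisionProbability(1_000, 1_000_000))
```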
Limitations
I must admit that I know very little about this subject (hash collision probabilities). That, and I didn't pay enough attention in my stats lectures back in the day. I don't know where these measurements suck, or how badly they do. Any input would be greatly appreciated!
How to read the results
For each api method:
* `mean` is roughly the average data set size at which the first collision happened (arithmetic mean)
* `stddev` is the standard deviation
* `min` is the worst-case run - the smallest data set size at which the first collision happened
* `max` is the best-case run - the largest data set size at which the first collision happened
* `runs` is the number of times we tested the api method before deciding the stats are "good enough"
* `sum` is the sum of all the runs - the total number of times the method was called
* `moe`, the "margin of error", basically means: we are 95% certain the "real" average data set size at which the first collision happens is within x% of `mean`. For example, a `moe` of `0.05` means we are 95% certain the "real" average data set size at which the first collision happens is within 5% of `mean` (see the sketch after this list)
* `hasCollided` indicates whether a collision was found before we reached the maximum data set size that we test collisions for (999,999)
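The description doesn't spell out how `moe` is computed; a conventional reading - and only a hypothetical sketch, not necessarily the script's exact formula - is a relative 95% confidence interval, i.e. z ≈ 1.96:

```ts
// Hypothetical sketch: stats over the collected first-collision sizes.
function stats(runs: number[]) {
  const n = runs.length
  const sum = runs.reduce((a, b) => a + b, 0)
  const mean = sum / n
  const stddev = Math.sqrt(
    runs.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  )
  // z = 1.96 for 95% confidence; dividing by mean makes `moe` relative,
  // so 0.05 reads as "within 5% of the mean".
  const moe = (1.96 * stddev) / (Math.sqrt(n) * mean)
  return {
    mean,
    stddev,
    moe,
    runs: n,
    sum,
    min: Math.min(...runs), // worst case: smallest first-collision size
    max: Math.max(...runs), // best case: largest first-collision size
  }
}
```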
Results
Incomplete list (I have yet to tweak the script to actually finish running in a reasonable time for the other methods):