chore: Add basic collision probabilities script (#29)

## Context

We currently have very little information on how likely collisions are for each of copycat's API methods. For some methods (e.g. `bool`), these are trivial or unimportant. For data types that typically need to be unique (e.g. `uuid` or `email`), this is important information: it tells us where in the API we need to improve to make collisions less likely, and lets our users know how likely collisions will be for their use case.

## Approach

```
* For each API method in the shortlist:
  * Until the stopping conditions are met (see below):
    * Run the method with a new uuid v4 as input until the first collision is found, or until we've iterated 999,999 times
    * Record the data set size at which the first collision happened as a datapoint
  * Output the obtained stats (e.g. mean, stddev, margin of error)

// The stopping conditions
* Both of these must be true:
  * We've run the method a minimum number of times (100); and
  * Either
    * The margin of error is under a lower threshold (`0.05`); or
    * The margin of error is under a higher threshold (`0.10`), but the sum (sum of first collision data set sizes obtained so far) is over a threshold (999,999)
```

Since the task here is number-crunching, we do this using a worker farm to leverage multiple CPU cores and get results faster.

## Why complicate the logic for the stopping conditions?

Ideally, we would only need to check two things: that the number of runs (the set of first collision data set sizes) is over a minimum threshold, and that our margin of error is under a maximum threshold. Unfortunately, this isn't computationally feasible for methods with large first collision data set sizes - it would just take too long to reach the lower margin of error. So we make a compromise, and allow for a higher margin of error (`0.10` rather than `0.05`) for methods with large first collision data set sizes.
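The stopping rule described in this section can be sketched as a single predicate. This is a minimal illustration rather than the script itself: the function name `shouldStop` and its option names are hypothetical, and the defaults simply mirror the thresholds quoted in the text. The input values in the usage lines are made up for illustration.

```javascript
// Hypothetical helper: decides whether enough runs have been collected.
// Defaults mirror the thresholds quoted in the text; names are illustrative only.
function shouldStop({ runs, moe, sum }, opts = {}) {
  const { minRuns = 100, loMoe = 0.05, hiMoe = 0.10, maxSum = 999999 } = opts

  // Always require a minimum number of runs first.
  if (runs < minRuns) return false

  // Preferred exit: the relative margin of error is tight enough.
  if (moe <= loMoe) return true

  // Compromise exit: accept a looser margin once enough total work has been done.
  return moe <= hiMoe && sum >= maxSum
}

console.log(shouldStop({ runs: 50, moe: 0.04, sum: 20000 }))    // false: too few runs
console.log(shouldStop({ runs: 400, moe: 0.05, sum: 170000 }))  // true: lower threshold met
console.log(shouldStop({ runs: 120, moe: 0.09, sum: 2700000 })) // true: compromise exit
console.log(shouldStop({ runs: 120, moe: 0.09, sum: 500000 }))  // false: keep running
```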
For example, the first collision data set size for `int` is very large, so it isn't feasible to keep running until we reach `0.05`. Let's say we ran it 100 times, all of the runs (i.e. all of the first collision data set sizes) summed up to over 999,999, and the margin of error had dropped to `0.10` or below. We then decide that we're happy enough with a margin of error of `0.10`, and stop there.

## Why not analyse for the theoretical probabilities instead?

We're taking an empirical approach here, rather than analysing the code to work out the theoretical probabilities. The reason is that it is difficult to do this for each and every copycat API method in question and know for certain that the calculations are accurate - there are just too many variables influencing the probabilities, and we wouldn't really know if we've accounted for all of them. The only way to know how the collision probabilities look in practice is to actually run the code.

## Limitations

* I must admit that I know very little about the areas I'm wandering around in with this PR: both hash collision probabilities, and stats. If there's something important I'm missing with the approach I'm taking, I'd like to know!
* The approach required taking some shortcuts, since calculating the probabilities within better margins of error for some of the methods would take infeasibly long. See _Why complicate the logic for the stopping conditions?_ above for more on this.
## How to read the results

For each API method:

* `mean` is roughly: the average data set size at which the first collision happened (arithmetic mean)
* `stddev` is the standard deviation
* `min` is the worst-case run - the smallest data set size at which the first collision happened
* `max` is the best-case run - the largest data set size at which the first collision happened
* `runs` is the number of times we tested the API method before deciding the stats are "good enough"
* `sum` is the sum of all the runs - the total number of times the method was called
* `moe`, the "margin of error", basically means: we are 95% certain the "real" average data set size at which the first collision happens is within `x` % of `mean` - for example, a `moe` of `0.05` means we are 95% certain the "real" average data set size at which the first collision happens is within `5` % of `mean`
* `hasCollided` indicates whether a collision was found before reaching the maximum data set size that we test collisions for (999,999)

## Results

Incomplete list (have yet to tweak the script to actually finish running in a reasonable time for the other methods):

```
{"methodName":"dateString","mean":"414.61","stddev":"214.23","moe":"0.05","runs":411,"min":5,"max":1412,"sum":170404,"hasCollided":true}
{"methodName":"fullName","mean":"1574.32","stddev":"784.96","moe":"0.05","runs":383,"min":75,"max":4853,"sum":602964,"hasCollided":true}
{"methodName":"streetAddress","mean":"22367.15","stddev":"12490.97","moe":"0.10","runs":122,"min":1154,"max":58605,"sum":2728792,"hasCollided":true}
{"methodName":"float","mean":"186526.06","stddev":"103245.91","moe":"0.10","runs":118,"min":12602,"max":550474,"sum":22010075,"hasCollided":true}
{"methodName":"ipv4","mean":"79714.37","stddev":"41439.56","moe":"0.10","runs":106,"min":6581,"max":205074,"sum":8449723,"hasCollided":true}
{"methodName":"email","mean":"119657.21","stddev":"59025.15","moe":"0.10","runs":100,"min":5254,"max":275462,"sum":11965721,"hasCollided":true}
```
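To make the `moe` column concrete, the sketch below recomputes it from `mean`, `stddev` and `runs`. It assumes the usual normal-approximation 95% interval (z = 1.96), expressed as a fraction of the mean. That formula is an assumption about what the stats library computes, although it does reproduce the reported figures for the two rows checked here. `relativeMoe` is a hypothetical helper, not part of the script.

```javascript
// Hypothetical helper: 95% margin of error as a fraction of the mean,
// assuming a normal approximation (z = 1.96). Not taken from the script.
function relativeMoe(mean, stddev, runs) {
  const absoluteMoe = (1.96 * stddev) / Math.sqrt(runs)
  return absoluteMoe / mean
}

// Figures copied from the dateString and streetAddress rows above.
console.log(relativeMoe(414.61, 214.23, 411).toFixed(2))     // prints 0.05
console.log(relativeMoe(22367.15, 12490.97, 122).toFixed(2)) // prints 0.10
```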
snaplet · Oct 11, 2022 · 693e381
1 parent 745fc92
Showing 3 changed files with 141 additions and 1 deletion.
diff --git a/package.json b/package.json
@@ -28,12 +28,14 @@
     "eslint": "^8.14.0",
     "eslint-config-prettier": "^8.5.0",
     "eslint-plugin-prettier": "^4.0.0",
+    "fast-stats": "^0.0.6",
     "is-email": "^1.0.2",
     "is-mac-address": "^1.0.4",
     "jest": "^28.0.1",
     "prettier": "^2.6.2",
     "typescript": "^4.6.3",
-    "user-agent-is-browser": "^0.1.0"
+    "user-agent-is-browser": "^0.1.0",
+    "worker-farm": "^1.7.0"
   },
   "scripts": {
     "build": "yarn build:js && yarn build:types",

diff --git a/scripts/collisions.js b/scripts/collisions.js
@@ -0,0 +1,114 @@
+const workerFarm = require('worker-farm')
+const { promisify } = require('util')
+const { v4: uuid } = require('uuid')
+const Stats = require('fast-stats').Stats
+
+const { TRANSFORMATIONS } = require('../dist/testutils')
+
+const METHODS = [
+  'email',
+  'int',
+  'dateString',
+  'ipv4',
+  'mac',
+  'float',
+  'fullName',
+  'streetAddress',
+  'postalAddress',
+  'password',
+  'uuid',
+]
+
+const MAX_N = +(process.env.MAX_N ?? 999999)
+const MIN_RUNS = Math.max(2, +(process.env.MIN_RUNS ?? 100))
+const MAX_SUM = +(process.env.MAX_SUM ?? MAX_N)
+const LO_MOE = +(process.env.MOE ?? 0.05)
+const HI_MOE = +(process.env.HI_MOE ?? 0.10)
+
+const workerOptions = {
+  workerOptions: {
+    env: {
+      IS_WORKER: '1',
+    },
+  }
+}
+
+const workers = workerFarm(workerOptions, require.resolve(__filename))
+
+const runWorker = promisify(workers)
+
+const findFirstCollisionN = (methodName) => {
+  const fn = TRANSFORMATIONS[methodName]
+  let i = -1
+  const seen = new Set()
+  let firstCollisionN = null
+
+  while (++i < MAX_N && firstCollisionN == null) {
+    const result = fn(uuid()).toString()
+    if (seen.has(result)) {
+      firstCollisionN = i
+    } else {
+      seen.add(result)
+    }
+  }
+
+  return firstCollisionN
+}
+
+const worker = (methodName, done) => {
+  const stats = new Stats()
+  let hasCollided = false
+  let sum = 0
+
+  const isComplete = () => {
+    const moe = stats.length > 2
+      ? stats.moe() / stats.amean()
+      : null
+
+    return stats.length >= MIN_RUNS && (moe != null && (moe <= LO_MOE || (moe <= HI_MOE && sum >= MAX_SUM)))
+  }
+
+  while (!isComplete()) {
+    const firstCollisionN = findFirstCollisionN(methodName)
+
+    if (firstCollisionN != null) {
+      hasCollided = true
+      sum += firstCollisionN
+    } else {
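+      sum += MAX_N // no collision: the method was still called MAX_N times
+    }
+
+    stats.push(firstCollisionN ?? MAX_N) // record runs without a collision as MAX_N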
+  }
+
+  const [min, max] = stats.range()
+
+  done(null, {
+    methodName,
+    mean: stats.amean().toFixed(2),
+    stddev: stats.stddev().toFixed(2),
+    moe: (stats.moe() / stats.amean()).toFixed(2),
+    runs: stats.length,
+    min,
+    max,
+    sum,
+    hasCollided,
+  })
+}
+
+async function main() {
+  await Promise.all(
+    METHODS.map(async (methodName) => {
+      const results = await runWorker(methodName)
+      console.log(JSON.stringify(results))
+    })
+  )
+
+  workerFarm.end(workers)
+}
+
+module.exports = worker
+
+if (require.main === module && !process.env.IS_WORKER) {
+  main()
+}
diff --git a/yarn.lock b/yarn.lock
@@ -1436,6 +1436,13 @@ end-of-stream@^1.1.0:
   dependencies:
     once "^1.4.0"
 
+errno@~0.1.7:
+  version "0.1.8"
+  resolved "https://registry.yarnpkg.com/errno/-/errno-0.1.8.tgz#8bb3e9c7d463be4976ff888f76b4809ebc2e811f"
+  integrity sha512-dJ6oBr5SQ1VSd9qkk7ByRgb/1SH4JZjCHSW/mr63/QcXO9zLVxvJ6Oy13nio03rxpSnVDDjFor75SjVeZWPW/A==
+  dependencies:
+    prr "~1.0.1"
+
 error-ex@^1.3.1:
   version "1.3.2"
   resolved "https://registry.yarnpkg.com/error-ex/-/error-ex-1.3.2.tgz#b4ac40648107fdcdcfae242f428bea8a14d4f1bf"
@@ -1849,6 +1856,11 @@ fast-levenshtein@^2.0.6:
   resolved "https://registry.yarnpkg.com/fast-levenshtein/-/fast-levenshtein-2.0.6.tgz#3d8a5c66883a16a30ca8643e851f19baa7797917"
   integrity sha1-PYpcZog6FqMMqGQ+hR8Zuqd5eRc=
 
+fast-stats@^0.0.6:
+  version "0.0.6"
+  resolved "https://registry.yarnpkg.com/fast-stats/-/fast-stats-0.0.6.tgz#949e7a97ef12effba710c6322a1fad3fcc8609e7"
+  integrity sha512-m0zkwa7Z07Wc4xm1YtcrCHmhzNxiYRrrfUyhkdhSZPzaAH/Ewbocdaq7EPVBFz19GWfIyyPcLfRHjHJYe83jlg==
+
 fastq@^1.6.0:
   version "1.13.0"
   resolved "https://registry.yarnpkg.com/fastq/-/fastq-1.13.0.tgz#616760f88a7526bdfc596b7cab8c18938c36b98c"
@@ -3307,6 +3319,11 @@ prompts@^2.0.1:
     kleur "^3.0.3"
     sisteransi "^1.0.5"
 
+prr@~1.0.1:
+  version "1.0.1"
+  resolved "https://registry.yarnpkg.com/prr/-/prr-1.0.1.tgz#d3fc114ba06995a45ec6893f484ceb1d78f5f476"
+  integrity sha512-yPw4Sng1gWghHQWj0B3ZggWUm4qVbPwPFcRG8KyxiU7J2OHFSoEHKS+EZ3fv5l1t9CyCiop6l/ZYeWbrgoQejw==
+
 pump@^3.0.0:
   version "3.0.0"
   resolved "https://registry.yarnpkg.com/pump/-/pump-3.0.0.tgz#b4a2116815bde2f4e1ea602354e8c75565107a64"
@@ -3895,6 +3912,13 @@ word-wrap@^1.2.3:
   resolved "https://registry.yarnpkg.com/word-wrap/-/word-wrap-1.2.3.tgz#610636f6b1f703891bd34771ccb17fb93b47079c"
   integrity sha512-Hz/mrNwitNRh/HUAtM/VT/5VH+ygD6DV7mYKZAtHOrbs8U7lvPS6xf7EJKMF0uW1KJCl0H701g3ZGus+muE5vQ==
 
+worker-farm@^1.7.0:
+  version "1.7.0"
+  resolved "https://registry.yarnpkg.com/worker-farm/-/worker-farm-1.7.0.tgz#26a94c5391bbca926152002f69b84a4bf772e5a8"
+  integrity sha512-rvw3QTZc8lAxyVrqcSGVm5yP/IJ2UcB3U0graE3LCFoZ0Yn2x4EoVSqJKdB/T5M+FLcRPjz4TDacRf3OCfNUzw==
+  dependencies:
+    errno "~0.1.7"
+
 wrap-ansi@^7.0.0:
   version "7.0.0"
   resolved "https://registry.yarnpkg.com/wrap-ansi/-/wrap-ansi-7.0.0.tgz#67e145cff510a6a6984bdf1152911d69d2eb9e43"