Less collisions and better { limit } handling for email and username #32

justinvdm · 2022-10-13T14:58:33Z

Context

username and email need to be improved as far as collisions (different inputs returning the same outputs) are concerned:

- From measurements (see chore: Add basic collision probabilities script #29 and the stats below), the smallest dataset size at which a collision was observed was around 9000 for username and 8700 for email. Collisions can therefore occur in datasets of roughly 9000 values or more, which is not an uncommon size for production databases.
- username and email are used for values that usually need to be unique in a database
  - uuid also of course, though since that is basically proxying straight through to a uuid v5, I'm less worried about it.
  - Things like fullName are less likely to need to be unique. It may look a little weird if people see the same names appearing multiple times, so it is still worth fixing, but it is lower priority than values that typically need to be unique.
  - There are probably still others to improve (dateString maybe?) for the same reasons, but this is a start.
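
For readers unfamiliar with the measurement, the "dataset size at which a collision was observed" can be found roughly as sketched below. This is a minimal illustration and not the actual scripts/collisions.js: the function name, the `generate` callback and the use of sequential integer inputs are all assumptions made for the example.

```typescript
// Hypothetical helper: returns how many values had been generated when the
// first duplicate output appeared, or null if maxN values were all distinct.
const firstCollisionN = (
  generate: (input: number) => string,
  maxN: number
): number | null => {
  const seen = new Set<string>()
  for (let i = 0; i < maxN; i++) {
    const value = generate(i)
    // A repeated output for a new input is a collision.
    if (seen.has(value)) return i + 1
    seen.add(value)
  }
  return null
}

// Toy generator with only three possible outputs: the fourth value collides.
console.log(firstCollisionN((i) => String(i % 3), 100)) // 4
```

A null result corresponds to a run that reached the maximum dataset size without any collision.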

Approach

Add more possibilities to the return values for username and email to allow for a larger output range, then measure with the collision script again.
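
As a rough guide to how much a larger output range helps, the birthday bound can be used: if outputs were drawn uniformly from N possible values, the expected number of draws before the first repeat is about sqrt(pi/2 * N). The sketch below assumes that uniform model, which copycat's generators only approximate, and the function name is made up for illustration.

```typescript
// Birthday-bound estimate under an assumed uniform distribution of outputs.
const expectedFirstCollision = (outputs: number): number =>
  Math.sqrt((Math.PI / 2) * outputs)

// Growing the output space 100x pushes the expected first collision out
// only about 10x, because of the square root.
console.log(Math.round(expectedFirstCollision(1e6))) // 1253
console.log(Math.round(expectedFirstCollision(1e8))) // 12533
```

This is why the output range has to grow by orders of magnitude before collisions become rare at production dataset sizes.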

While I was there, I also added limit support for username (see #30 for context), and some improvements to limit logic. I also made some improvements to the collisions script. I've added PR comments for these things for more context.

Measurements

before:

{"methodName":"username","mean":"59674.56","stddev":"29524.62","moe":"0.14","runs":50,"n":50,"min":9005,"max":145998,"sum":2983728,"hasCollided":true,"duration":379735}
{"methodName":"email","mean":"276483.20","stddev":"179478.58","moe":"0.18","runs":50,"n":50,"min":8706,"max":849407,"sum":13824160,"hasCollided":true,"duration":2543995}

after:

{"methodName":"username","mean":"800847.73","stddev":"138023.51","moe":"0.10","runs":50,"n":11,"min":485308,"max":994067,"sum":999999,"hasCollided":true,"duration":7921728}
{"methodName":"email","mean":null,"stddev":null,"moe":null,"runs":50,"n":0,"min":null,"max":null,"sum":999999,"hasCollided":false,"duration":19710157}

Note how there were only 11 runs for username that had collisions - for the rest, we generated 999999 values without finding any collision.

email's stats look strange, but that's because no collisions were found at all: in each of the 50 runs, we generated 999999 values without finding any collision.
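
A note on reading the moe field: the reported values are consistent with a relative margin of error at 95% confidence, that is 1.96 * stddev / sqrt(n), divided by the mean. This is inferred from the numbers above rather than taken from the script, so treat the formula and the names below as assumptions.

```typescript
// Assumption: normal-approximation z-score for a 95% confidence level.
const Z_95 = 1.96

// Margin of error expressed as a fraction of the mean.
const relativeMoe = (mean: number, stddev: number, n: number): number =>
  (Z_95 * stddev) / Math.sqrt(n) / mean

// Inputs copied from the measurement lines above.
console.log(relativeMoe(59674.56, 29524.62, 50).toFixed(2)) // "0.14" (username, before)
console.log(relativeMoe(276483.2, 179478.58, 50).toFixed(2)) // "0.18" (email, before)
console.log(relativeMoe(800847.73, 138023.51, 11).toFixed(2)) // "0.10" (username, after)
```

Note that n here is the number of runs that actually collided, which is why the "after" username line uses 11 rather than 50.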

justinvdm · 2022-10-13T15:00:56Z

scripts/collisions.js

    } else {
      sum = MAX_N
    }

-    stats.push(firstCollisionN)


We shouldn't push null results (which happen when we find no collision and reach MAX_N), since there isn't a datapoint to work with in that case. This also means we now need to maintain our own separate counter (numRuns).

justinvdm · 2022-10-13T15:01:57Z

scripts/collisions.js

-    runs: stats.length,
+    moe: computeMoe().toFixed(2),
+    runs: numRuns,
+    n: stats.length,


n and runs would differ in cases where no collisions were found: https://github.com/snaplet/copycat/pull/32/files#r994766064

n is the number of collision data set size datapoints

runs is the total number of runs - including the runs where there were no collisions and MAX_N was reached
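
The distinction between n and runs can be sketched as follows. This is a simplified illustration and not the script itself: `summarizeRuns` and its input shape are hypothetical, and only the names stats, numRuns, n, runs and hasCollided come from the diff and the output above.

```typescript
// Each entry is the dataset size at the first collision of one run,
// or null when the run reached MAX_N without a collision.
const summarizeRuns = (results: Array<number | null>) => {
  const stats: number[] = []
  let numRuns = 0
  for (const firstCollisionN of results) {
    // Every run counts towards `runs`, collided or not.
    numRuns += 1
    // Only runs that collided contribute a datapoint.
    if (firstCollisionN !== null) stats.push(firstCollisionN)
  }
  return { runs: numRuns, n: stats.length, hasCollided: stats.length > 0 }
}

console.log(summarizeRuns([9005, null, 145998, null]))
// { runs: 4, n: 2, hasCollided: true }
```

With no collisions in any run, n is 0 and there is nothing to compute mean, stddev or moe from, which matches the null fields in the email line of the measurements.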

justinvdm · 2022-10-13T15:02:34Z

scripts/collisions.js

+
+    const moe = computeMoe()
+    return numRuns >= MIN_RUNS &&
+      ((duration >= MAX_DURATION) ||


If we've reached the minimum number of runs, and it took longer than MAX_DURATION to get there, we don't worry about the other conditions.
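
The shape of that stopping rule, sketched under assumptions: the diff above is truncated, so the remaining conditions are not visible. Below they are replaced by a single hypothetical check (`moe <= TARGET_MOE`), and all constant values are made up for illustration.

```typescript
// Hypothetical values; the real constants live in scripts/collisions.js.
const MIN_RUNS = 50
const MAX_DURATION = 60000 // milliseconds
const TARGET_MOE = 0.05 // stand-in for the conditions not shown in the diff

const shouldStop = (numRuns: number, duration: number, moe: number): boolean =>
  // Never stop before the minimum number of runs.
  numRuns >= MIN_RUNS &&
  // After that, running out of time is enough on its own to stop.
  (duration >= MAX_DURATION || moe <= TARGET_MOE)

console.log(shouldStop(49, 999999, 0)) // false: minimum runs not reached
console.log(shouldStop(50, 60000, 1)) // true: time budget used up
```

The point of the ordering is that MIN_RUNS always applies, while MAX_DURATION short-circuits whatever precision checks follow it.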

justinvdm · 2022-10-13T15:03:05Z

src/despace.ts

@@ -0,0 +1 @@
+export const despace = (s: string) => s.replace(/\W/g, '')


Some faker data sets contain spaces, which usernames and emails cannot have.
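
Worth noting for reviewers: `\W` matches every character outside `[A-Za-z0-9_]`, so despace removes more than spaces. The example below reuses the one-liner from the diff; the sample inputs are made up.

```typescript
const despace = (s: string) => s.replace(/\W/g, '')

console.log(despace('Mary Ann')) // "MaryAnn"
console.log(despace("O'Neil-Smith")) // "ONeilSmith": apostrophe and hyphen are stripped too
console.log(despace('first_last')) // "first_last": underscore counts as a word character
console.log(despace('José')) // "Jos": \w is ASCII-only in JavaScript
```

That is probably fine for usernames and email local parts, but it does mean accented letters are dropped rather than transliterated.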

justinvdm · 2022-10-13T15:05:06Z

src/join.ts

+export const joinCurried =
+  (joiner: string, segments: Transform[]) =>
+  (input: Input, options?: JoinOptions) =>
+    joinMain(input, joiner, segments, options)


Without a curried form of join, some of the usages were just becoming very awkward and cumbersome, for example: https://github.com/snaplet/copycat/pull/32/files#diff-2d82b91fbfca95ee52a36b8f41125b2b304ecaba2f7d5d039f6837969e0c3bdeR18

join isn't exposed as a public api though, in case we don't want such an api in copycat
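
To illustrate why the curried form reads better at call sites, here is a self-contained sketch. Only joinCurried matches the diff; the types and joinMain below are simplified stand-ins written for this example, and the real limit handling distributes the budget across segments rather than slicing the joined result.

```typescript
// Simplified stand-ins for copycat's internal types (assumptions).
type Input = string | number
type Transform = (input: Input) => string
interface JoinOptions {
  limit?: number
}

// Hypothetical joinMain: run each segment, drop empty ones, join, then truncate.
const joinMain = (
  input: Input,
  joiner: string,
  segments: Transform[],
  options?: JoinOptions
): string => {
  const joined = segments
    .map((segment) => segment(input))
    .filter((part) => part !== '')
    .join(joiner)
  return options?.limit === undefined ? joined : joined.slice(0, options.limit)
}

// Same shape as in the diff: configure once, get back a Transform-like function.
const joinCurried =
  (joiner: string, segments: Transform[]) =>
  (input: Input, options?: JoinOptions) =>
    joinMain(input, joiner, segments, options)

const handle = joinCurried('_', [() => 'ada', (input) => String(input)])
console.log(handle(42)) // "ada_42"
console.log(handle(42, { limit: 5 })) // "ada_4"
```

Because the curried result takes (input, options) like any other generator, it can be nested inside another join or passed wherever a transform is expected.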

justinvdm · 2022-10-13T15:09:25Z

src/join.ts

+  // valid, rather than the segments that follow. For example, without this, we could end up with an
+  // invalid email like `@b.c` for small `limit`s, or usernames starting with numbers rather than letters
+  return Math.max(
+    index === 0 ? 1 : 0,


What the comment says :)

justinvdm · 2022-10-13T15:14:04Z

src/oneOfString.ts

+          ? fallback
+          : fallback([input, 'copycat:oneOfString'] as JSONSerializable)
+
+      return fallbackResult.toString().slice(0, limit)


For cases where the fallback isn't a function but just a string, e.g. ''
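
A minimal sketch of that branch, assuming the elided condition is a typeof check on the fallback. The `Fallback` type and the `resolveFallback` name are hypothetical; the `'copycat:oneOfString'` tag and the `toString().slice(0, limit)` call come from the diff.

```typescript
// Assumption: a fallback is either a literal string or a generator function.
type Fallback = string | ((input: unknown) => string | number)

const resolveFallback = (
  fallback: Fallback,
  input: unknown,
  limit: number
): string => {
  const fallbackResult =
    typeof fallback === 'string'
      ? fallback
      : fallback([input, 'copycat:oneOfString'])

  // Whatever the fallback produced, never exceed the requested limit.
  return fallbackResult.toString().slice(0, limit)
}

console.log(resolveFallback('', 'some-input', 5)) // ""
console.log(resolveFallback(() => 'generated-value', 'some-input', 9)) // "generated"
```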

justinvdm · 2022-10-19T08:21:19Z

I'm going to take the liberty of merging this one - it is quite an improvement over what we currently have for collisions. Please feel free to still review this though, happy to discuss and make changes in new PRs.

…32)

* Change collisions script to allow specifying methods as env vars
* Omit empty join segments
* Improve collisions for email()
* Fix passing through of env vars for collisions script
* Fix stopping logic for no collision case
* Improve email() and username() collisions and limits
* Add max duration to collision script
* Respect min runs in collision script
* Count number of runs separately
* Add explanatory comment for preferring single char for first index for join
* Fix reporting for cases where there were no collisions across all runs

justinvdm added 9 commits October 13, 2022 00:05

Change collisions script to allow specifying methods as env vars

2403630

Omit empty join segments

4bd8f9c

Improve collisions for email()

fe2faca

Fix passing through of env vars for collisions script

93ff61d

Fix stopping logic for no collision case

a7db462

Improve email() and username() collisions and limits

8bf9a32

Add max duration to collision script

e62e59b

Respect min runs in collision script

24c1442

Count number of runs separately

56edcee

Add explanatory comment for preferring single char for first index for join

d72cb77

justinvdm requested review from CarelFdeWaal, jgoux and peterp October 13, 2022 15:13

Fix reporting for cases where there were no collisions across all runs

2ef952b

justinvdm merged commit 0938742 into main Oct 19, 2022
