New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Less collisions and better { limit } handling for email and username #32
Conversation
} else { | ||
sum = MAX_N | ||
} | ||
|
||
stats.push(firstCollisionN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be pushing null
results (which happens if we found no collisions and reached MAX_N
), since there isn't a datapoint to work with in this case. This also means we now need to maintain our own separate counter (numRuns
)
runs: stats.length, | ||
moe: computeMoe().toFixed(2), | ||
runs: numRuns, | ||
n: stats.length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n
and runs
would differ in cases where no collisions were found: https://github.com/snaplet/copycat/pull/32/files#r994766064
n
is the number of collision data set size datapointsruns
is the total number of runs - including the runs where there were no collisions andMAX_N
was reached
|
||
const moe = computeMoe() | ||
return numRuns >= MIN_RUNS && | ||
((duration >= MAX_DURATION) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we've reached the minimum number of runs, and it took longer than MAX_DURATION
to get there, we don't worry about the other conditions.
@@ -0,0 +1 @@ | |||
export const despace = (s: string) => s.replace(/\W/g, '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some faker data sets have spaces in them, username and email cannot have these
export const joinCurried = | ||
(joiner: string, segments: Transform[]) => | ||
(input: Input, options?: JoinOptions) => | ||
joinMain(input, joiner, segments, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without a curried form of join
, some of the usages were just becoming very awkward and cumbersome, for example: https://github.com/snaplet/copycat/pull/32/files#diff-2d82b91fbfca95ee52a36b8f41125b2b304ecaba2f7d5d039f6837969e0c3bdeR18
join
isn't exposed as a public api though, in case we don't want such an api in copycat
// valid, rather than the segments that follow. For example, without this, we could end up with an | ||
// invalid email like `@b.c` for small `limit`s, or usernames starting with numbers rather than letters | ||
return Math.max( | ||
index === 0 ? 1 : 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the comment says :)
? fallback | ||
: fallback([input, 'copycat:oneOfString'] as JSONSerializable) | ||
|
||
return fallbackResult.toString().slice(0, limit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cases where the fallback isn't a function but just a string, e.g, ''
I'm going to take the liberty of merging this one - it is quite an improvement over what we currently have for collisions. Please feel free to still review this though, happy to discuss and make changes in new PRs. |
…32) * Change collisions script to allow specifying methods as env vars * Omit empty join segments * Improve collisions for email() * Fix passing through of env vars for collisions script * Fix stopping logic for no collision case * Improve email() and username() collisions and limits * Add max duration to collision script * Respect min runs in collision script * Count number of runs separately * Add explanatory comment for preferring single char for first index for join * Fix reporting for cases where there were no collisions across all runs
Context
username
andemail
need to be improved as far as collisions (different inputs returning the same outputs) are concerned:9000
forusername
and8700
foremail
. This means collisions might happen at datasets larger than9000
, which is not a very uncommon size for production databases.username
andemail
are used for values that usually need to be unique in a databaseuuid
also of course, though since that is basically proxying straight through to a uuid v5, I'm less worried about it.fullName
are less likely to need to be unique - it may look a little weird if people see the names appearing multiple times, so still something worth fixing, but lower priority than values that typically need to be uniquedateString
maybe?), for the same reasons, but this is a startApproach
Add more possibilities to the return values for
username
andemail
to allow for a larger output range, then measure with the collision script again.While I was there, I also added
limit
support forusername
(see #30 for context), and some improvements tolimit
logic. I also made some improvements to the collisions script. I've added PR comments for these things for more context.Measurements
before:
after:
Note how there were only 11 runs for username that had collisions - for the rest, we generated
999999
values without finding any collision.email
's stats look strange, but that's because no collisions were found at all - for 50 runs, where in of those runs we generated999999
values without finding any collision.