Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Less collisions and better { limit } handling for email and username #32

Merged
merged 11 commits into from Oct 19, 2022

Conversation

justinvdm
Copy link
Contributor

@justinvdm justinvdm commented Oct 13, 2022

Context

username and email need to be improved as far as collisions (different inputs returning the same outputs) are concerned:

  • from measurements done (see chore: Add basic collision probabilities script #29 and stats below), the worst case dataset size at which collisions were observed was around 9000 for username and 8700 for email. This means collisions might happen at datasets larger than 9000, which is not a very uncommon size for production databases.
  • username and email are used for values that usually need to be unique in a database
    • uuid also of course, though since that is basically proxying straight through to a uuid v5, I'm less worried about it.
    • Things like fullName are less likely to need to be unique - it may look a little weird if people see the names appearing multiple times, so still something worth fixing, but lower priority than values that typically need to be unique
    • There's probably still others to improve (dateString maybe?), for the same reasons, but this is a start

Approach

Add more possibilities to the return values for username and email to allow for a larger output range, then measure with the collision script again.

While I was there, I also added limit support for username (see #30 for context), and some improvements to limit logic. I also made some improvements to the collisions script. I've added PR comments for these things for more context.

Measurements

before:

{"methodName":"username","mean":"59674.56","stddev":"29524.62","moe":"0.14","runs":50,"n":50,"min":9005,"max":145998,"sum":2983728,"hasCollided":true,"duration":379735}
{"methodName":"email","mean":"276483.20","stddev":"179478.58","moe":"0.18","runs":50,"n":50,"min":8706,"max":849407,"sum":13824160,"hasCollided":true,"duration":2543995}

after:

{"methodName":"username","mean":"800847.73","stddev":"138023.51","moe":"0.10","runs":50,"n":11,"min":485308,"max":994067,"sum":999999,"hasCollided":true,"duration":7921728}
{"methodName":"email","mean":null,"stddev":null,"moe":null,"runs":50,"n":0,"min":null,"max":null,"sum":999999,"hasCollided":false,"duration":19710157}

Note how there were only 11 runs for username that had collisions - for the rest, we generated 999999 values without finding any collision.

email's stats look strange, but that's because no collisions were found at all - for 50 runs, where in of those runs we generated 999999 values without finding any collision.

} else {
sum = MAX_N
}

stats.push(firstCollisionN)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be pushing null results (which happens if we found no collisions and reached MAX_N), since there isn't a datapoint to work with in this case. This also means we now need to maintain our own separate counter (numRuns)

runs: stats.length,
moe: computeMoe().toFixed(2),
runs: numRuns,
n: stats.length,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n and runs would differ in cases where no collisions were found: https://github.com/snaplet/copycat/pull/32/files#r994766064

  • n is the number of collision data set size datapoints
  • runs is the total number of runs - including the runs where there were no collisions and MAX_N was reached


const moe = computeMoe()
return numRuns >= MIN_RUNS &&
((duration >= MAX_DURATION) ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we've reached the minimum number of runs, and it took longer than MAX_DURATION to get there, we don't worry about the other conditions.

@@ -0,0 +1 @@
export const despace = (s: string) => s.replace(/\W/g, '')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some faker data sets have spaces in them, username and email cannot have these

export const joinCurried =
(joiner: string, segments: Transform[]) =>
(input: Input, options?: JoinOptions) =>
joinMain(input, joiner, segments, options)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a curried form of join, some of the usages were just becoming very awkward and cumbersome, for example: https://github.com/snaplet/copycat/pull/32/files#diff-2d82b91fbfca95ee52a36b8f41125b2b304ecaba2f7d5d039f6837969e0c3bdeR18

join isn't exposed as a public api though, in case we don't want such an api in copycat

// valid, rather than the segments that follow. For example, without this, we could end up with an
// invalid email like `@b.c` for small `limit`s, or usernames starting with numbers rather than letters
return Math.max(
index === 0 ? 1 : 0,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the comment says :)

? fallback
: fallback([input, 'copycat:oneOfString'] as JSONSerializable)

return fallbackResult.toString().slice(0, limit)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases where the fallback isn't a function but just a string, e.g, ''

@justinvdm
Copy link
Contributor Author

I'm going to take the liberty of merging this one - it is quite an improvement over what we currently have for collisions. Please feel free to still review this though, happy to discuss and make changes in new PRs.

@justinvdm justinvdm merged commit 0938742 into main Oct 19, 2022
justinvdm added a commit that referenced this pull request Oct 19, 2022
…32)

* Change collisions script to allow specifying methods as env vars

* Omit empty join segments

* Improve collisions for email()

* Fix passing through of env vars for collisions script

* Fix stopping logic for no collision case

* Improve email() and username() collisions and limits

* Add max duration to collision script

* Respect min runs in collision script

* Count number of runs separately

* Add explanatory comment for preferring single char for first index for join

* Fix reporting for cases where there were no collisions across all runs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant