Allow uniqueness for any type, not just emails and usernames #86
Comments
Hi! In short, pynonymizer's whole process is currently about leveraging the database engine to make bulk updates on big datasets much more efficient. The cost of this efficiency is non-uniqueness. The reason I'm dragging my feet with this one is that it's not going to be a trivial feature to implement, since it will need to be its own separate update process that updates on a row-by-row basis. The "official exit trapdoor" for this behaviour is to write something as a

We can definitely use this thread to brainstorm and talk about how we see that row-by-row uniqueness being implemented, which I think would be helpful.
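As one starting point for that brainstorm, here is a minimal sketch of what a separate row-by-row pass could look like. It is not pynonymizer code: it uses Python's built-in `sqlite3` as a stand-in database, borrows the `phone_numbers` table from the issue's example, and the sample values are made up. The fake values are drawn without replacement, and any that match an original value are skipped so that no single `UPDATE` can trip the unique constraint partway through.

```python
import random
import sqlite3

# Stand-in database with the unique column from the issue's example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phone_numbers (id INTEGER PRIMARY KEY, number VARCHAR(10) UNIQUE)"
)
originals = ["5550000001", "5550000002", "5550000003", "5550000004"]  # made-up data
conn.executemany(
    "INSERT INTO phone_numbers (number) VALUES (?)", [(n,) for n in originals]
)

ids = [row[0] for row in conn.execute("SELECT id FROM phone_numbers ORDER BY id")]

# Sample without replacement, oversampling by len(originals) so that enough
# candidates remain after dropping any that equal a value still in the table.
rng = random.Random(0)
candidates = rng.sample(range(10**10), len(ids) + len(originals))
taken = set(originals)
fakes = [s for s in (f"{c:010d}" for c in candidates) if s not in taken][: len(ids)]

# One UPDATE per row: this is the slow path the bulk approach avoids.
with conn:
    conn.executemany(
        "UPDATE phone_numbers SET number = ? WHERE id = ?", list(zip(fakes, ids))
    )

anonymized = [
    row[0] for row in conn.execute("SELECT number FROM phone_numbers ORDER BY id")
]
```

The trade-off is the one described above: every row costs its own statement, so this would only make sense for the columns that actually carry a unique constraint.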
Understood. I can probably make the workaround work, thanks for sharing! Another potential solution would be to allow setting the size of the seed data pool: if I can choose for it to be 10x larger than the number of rows I have to anonymize, then the chances of a collision are lower. Not sure if that exists already.
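To put rough numbers on the pool-size idea, the sketch below computes the chance of at least one duplicate when each row independently picks a value uniformly at random from the pool. That sampling model is an assumption made for illustration, not a statement about how pynonymizer assigns seed values; `collision_probability` is a hypothetical helper.

```python
import math


def collision_probability(rows, pool_size):
    """Probability that at least two of `rows` uniform draws from a pool match."""
    if rows > pool_size:
        return 1.0  # pigeonhole: more rows than distinct values
    # P(no collision) = product over i of (1 - i / pool_size); sum logs for stability.
    log_no_collision = sum(math.log1p(-i / pool_size) for i in range(rows))
    return 1.0 - math.exp(log_no_collision)


# Classic birthday problem as a sanity check: 23 draws from 365 values, roughly 0.507.
birthday = collision_probability(23, 365)

# 1,000 rows drawn from a pool 10x larger.
ten_x = collision_probability(1_000, 10_000)
```

Under this model a 10x pool lowers the number of duplicates, but with 1,000 rows at least one duplicate is still all but certain, so a larger pool alone would not reliably satisfy a unique constraint.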
@ajgrover you can do this already 😇 check out the
awesome. i think the
Issue
Something I've run into while adding `pynonymizer` into my company's workflow is the inability to have table-level unique fake data. It's possible for emails and usernames, but not for anything else (like random strings/`pystr`, for instance).

Concrete example: we have a `phone_numbers` table with a column `number: unique varchar<10>`. It's currently not possible to fill this column with fake, unique number strings. If you use `pystr` then you will sometimes run into unique constraint violations (and, assuming the table is larger than the seed data set, you may always run into issues).

For columns without length constraints you can kind of hack it together by using `unique_email` or `unique_login`, but there are a couple limitations:

Solution I'd like
Ideally, being able to provide an argument like `unique: true` to a column strategy. If this is more difficult to implement, expanding `unique_email` and `unique_login` to include more types might be sufficient for most use cases as well, although it would be much more useful to be able to pass arguments to the fakes as well as guaranteeing their uniqueness.

Not sure of any workable alternatives, besides maybe changing your schema to remove the unique constraint or allow nulls, but neither of those seems like an ideal solution to the problem.
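To illustrate the kind of behaviour a `unique: true` option could have, here is a minimal sketch that wraps any value generator so it never returns the same value twice. All names here (`make_unique`, `UniquenessError`, `digit_string`) are hypothetical and not part of pynonymizer; it only shows the retry-until-unused idea, applied to 10-character number strings like the `phone_numbers` example.

```python
import random
import string


class UniquenessError(Exception):
    """Raised when no unused value turns up within the attempt budget."""


def make_unique(generate, max_attempts=1000):
    """Wrap a zero-argument generator so each returned value is new."""
    seen = set()

    def unique_generate():
        for _ in range(max_attempts):
            value = generate()
            if value not in seen:
                seen.add(value)
                return value
        raise UniquenessError(f"no unused value after {max_attempts} attempts")

    return unique_generate


rng = random.Random(42)


def digit_string(length=10):
    """Random string of decimal digits, e.g. for a varchar(10) number column."""
    return "".join(rng.choice(string.digits) for _ in range(length))


# Arguments to the underlying fake are passed through a lambda.
unique_number = make_unique(lambda: digit_string(length=10))
numbers = [unique_number() for _ in range(5000)]
```

The wrapper has to remember every value it has handed out, which is why this kind of guarantee fits a row-by-row process better than a single bulk update.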