Allow uniqueness for any type, not just emails and usernames #86

Closed
alex-grover opened this issue Sep 16, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@alex-grover
Contributor

alex-grover commented Sep 16, 2021

Issue

Something I've run into while adding pynonymizer into my company's workflow is the inability to have table-level unique fake data. It's possible for emails and usernames, but not for anything else (like random strings/pystr for instance).

Concrete example: we have a phone_numbers table with a column number: unique varchar<10>. It's currently not possible to fill this column with fake, unique number strings. If you use pystr then you will sometimes run into unique constraint violations (and assuming the table is larger than the seed data set you may always run into issues).

For columns without length constraints you can kind of hack it together by using unique_email or unique_login but there are a couple limitations:

  • Things that expect a phone number will be broken at runtime
  • The addition of the character limit breaks this solution

Solution I'd like

Ideally, being able to provide an argument like unique: true to a column strategy. If this is more difficult to implement, expanding unique_email and unique_login to include more types might be sufficient for most use cases as well, although it would be much more useful to be able to pass arguments to the fakes as well as guaranteeing their uniqueness.

I'm not sure of any workable alternatives, besides maybe changing your schema to remove the unique constraint or allow nulls, but neither of those seems like an ideal solution to the problem.

@rwnx
Owner

rwnx commented Sep 16, 2021

Hi!

In short, at the current time pynonymizer's whole process is about leveraging the database engine to make bulk updates on big datasets much more efficient. The cost of this efficiency is the non-uniqueness.

The reason I'm dragging my feet with this one is that it's not going to be a trivial feature to implement, since it will need its own separate update process that works on a row-by-row basis.

The "official exit trapdoor" for this behaviour is to write something as a literal in sql that meets your data format, in the same way that unique_login does. e.g. MD5(FLOOR((NOW() + RAND()) * (RAND() * RAND() / RAND()) + RAND())). Does that workaround work for you?
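For a length-limited column such as the varchar number column above, the literal would presumably need truncating, for example by wrapping it in LEFT(..., 10) on MySQL. The rough Python simulation below mimics that idea outside the database. It is not pynonymizer code: random_hex_token and fill_column are hypothetical names, and the 10-character length is taken from the example column. Note that the values are random rather than guaranteed unique, and that MD5 output is hexadecimal, so the result satisfies a length limit but will not look like a phone number.

```python
import hashlib
import random


def random_hex_token(length: int, rng: random.Random) -> str:
    """Mimic LEFT(MD5(<random number>), length): hash a random float, keep a prefix."""
    digest = hashlib.md5(str(rng.random()).encode("ascii")).hexdigest()
    return digest[:length]


def fill_column(row_count: int, length: int = 10, seed: int = 0) -> list[str]:
    """Generate one token per row, as a per-row SQL literal would (hypothetical helper)."""
    rng = random.Random(seed)
    return [random_hex_token(length, rng) for _ in range(row_count)]


values = fill_column(1000)
# Compare the number of rows with the number of distinct values generated.
print(len(values), len(set(values)))
```

With 10 hexadecimal characters there are 16^10 possible values, so duplicates are unlikely for tables of modest size but become more likely as the row count grows.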

We can definitely use this thread to brainstorm and talk about how we see that row-by-row uniqueness being implemented, which I think would be helpful.

@rwnx rwnx added the enhancement New feature or request label Sep 16, 2021
@alex-grover
Contributor Author

Understood. I can probably make the workaround work, thanks for sharing!

Another potential solution would be to allow setting the size of the seed data pool: if I can choose for it to be 10x larger than the number of rows I have to anonymize, then the chances of a collision are lower. Not sure if that exists already.
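To put rough numbers on that, here is a small sketch of the chance that no two rows get the same value. It assumes each row picks its value uniformly and independently from the seed pool, which is an assumption about how the seed data is applied, not something confirmed in this thread. The function name no_collision_probability is made up for illustration.

```python
import math


def no_collision_probability(rows: int, pool_size: int) -> float:
    """Probability that `rows` independent uniform draws from `pool_size` values are all distinct."""
    if rows > pool_size:
        # Pigeonhole: more rows than distinct seed values always collides.
        return 0.0
    # Sum logarithms instead of multiplying to avoid floating-point underflow.
    log_p = sum(math.log1p(-i / pool_size) for i in range(rows))
    return math.exp(log_p)


# Pool 10x the row count, for a small and a larger table, then a much bigger pool.
for rows, pool in [(10, 100), (1000, 10_000), (1000, 100_000_000)]:
    print(rows, pool, no_collision_probability(rows, pool))
```

Under that assumption a 10x pool does lower the odds of a collision, but for a table of around a thousand rows a collision is still close to certain. The pool has to grow roughly with the square of the row count before a unique constraint is likely to hold.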

@rwnx
Owner

rwnx commented Sep 16, 2021

@ajgrover You can do this already 😇 Check out the --seed-rows SEED_ROWS option in the CLI, or the seed_rows kwarg of the main function.

@alex-grover
Contributor Author

Awesome. I think the random literal approach can solve nearly all cases (it works for my setup at least), so I'm going to close this for now. Thanks for the quick response!


2 participants