Allow uniqueness for any type, not just emails and usernames #86

alex-grover · 2021-09-16T01:37:16Z

Issue

Something I've run into while adding pynonymizer into my company's workflow is the inability to have table-level unique fake data. It's possible for emails and usernames, but not for anything else (like random strings/pystr for instance).

Concrete example: we have a phone_numbers table with a column number: unique varchar<10>. It's currently not possible to fill this column with fake, unique number strings. If you use pystr then you will sometimes run into unique constraint violations (and assuming the table is larger than the seed data set you may always run into issues).

For columns without length constraints you can kind of hack it together by using unique_email or unique_login but there are a couple limitations:

Things that expect a phone number will be broken at runtime
The addition of the character limit breaks this solution

Solution I'd like

Ideally, being able to provide an argument like unique: true to a column strategy. If this is more difficult to implement, expanding unique_email and unique_login to include more types might be sufficient for most use cases as well, although it would be much more useful to be able to pass arguments to the fakes as well as guaranteeing their uniqueness.

Not sure of any workable alternatives, besides maybe changing your schema to remove the unique constraint or allow nulls, but those both don't seem like ideal solutions to the problem.

The text was updated successfully, but these errors were encountered:

rwnx · 2021-09-16T14:37:30Z

Hi!

In short, at the current time pynonymizer's whole process is about leveraging the database engine to make bulk updates on big datasets much more efficient. The cost of this efficiency is the non-uniqueness.

The reason i'm dragging my feet with this one is that it's not going to be a trivial feature to implement, since it will need to be it's own seperate update process that updates as a row-by-row basis.

The "official exit trapdoor" for this behaviour is to write something as a literal in sql that meets your data format, in the same way that unique_login does. e.g. MD5(FLOOR((NOW() + RAND()) * (RAND() * RAND() / RAND()) + RAND())). Does that workaround work for you?

We can definitely use this thread to brainstorm and talk about how we see that row-by-row uniqueness being implemented, which i think would be helpful.

alex-grover · 2021-09-16T14:44:35Z

understood. I can probably make the workaround work, thanks for sharing!

Another potential solution would be to allow setting the size of the seed data pool - if i can choose for it to be 10x larger than the number of rows i have to anonymize then the chances of a collision are lower. Not sure if that exists already

rwnx · 2021-09-16T14:46:28Z

@ajgrover you can do this already 😇 check out the --seed-rows SEED_ROWS option in the CLI. / seed_rows kwarg from the main fn.

alex-grover · 2021-09-16T17:03:45Z

awesome. i think the random literal approach can solve nearly all cases (it works for my setup at least) so going to close this for now. thanks for the quick response!

rwnx added the enhancement New feature or request label Sep 16, 2021

alex-grover closed this as completed Sep 16, 2021

rwnx mentioned this issue Feb 10, 2022

Generate random string values to not encounter duplicate entry issues #106

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow uniqueness for any type, not just emails and usernames #86

Allow uniqueness for any type, not just emails and usernames #86

alex-grover commented Sep 16, 2021 •

edited

Loading

rwnx commented Sep 16, 2021

alex-grover commented Sep 16, 2021

rwnx commented Sep 16, 2021

alex-grover commented Sep 16, 2021

Allow uniqueness for any type, not just emails and usernames #86

Allow uniqueness for any type, not just emails and usernames #86

Comments

alex-grover commented Sep 16, 2021 • edited Loading

Issue

Solution I'd like

rwnx commented Sep 16, 2021

alex-grover commented Sep 16, 2021

rwnx commented Sep 16, 2021

alex-grover commented Sep 16, 2021

alex-grover commented Sep 16, 2021 •

edited

Loading