Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numerical unknowns should not be converted to sdv-pii-???? #2089

Merged
merged 13 commits into from
Jul 3, 2024

Conversation

lajohn4747
Copy link
Contributor

@lajohn4747 lajohn4747 commented Jun 25, 2024

resolves #2064
CU-86b0wh849

PII Type adds the prefix sdv-pii. If the sdtype is unknown, the transformers will auto-assign but the reverse transform for a pii sdtype should always have the dtype of object as it will contain a prefix. If a numerical column is detected we will add change the Faker function to use numerify and cap at the max amount of digits.

Adjusted some unit test to avoid test failures due to mocking

@lajohn4747 lajohn4747 requested a review from a team as a code owner June 25, 2024 16:27
@sdv-team
Copy link
Contributor

@lajohn4747 lajohn4747 requested review from R-Palazzo and removed request for frances-h June 28, 2024 16:16
Copy link
Contributor

@R-Palazzo R-Palazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just checking, it it an issue for HMA only? @lajohn4747, @amontanez24

@lajohn4747
Copy link
Contributor Author

LGTM! Just checking, it it an issue for HMA only? @lajohn4747, @amontanez24

It appears to be that way after some local testing

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should force the reversed type to be object. The reversed type should match whatever the input was. In the case where the original type was numerical, I don't think we should be adding that prefix

@lajohn4747 lajohn4747 changed the title Add condition to not default to transformer type for pii Numerical unknowns should not be converted to sdv-pii-???? Jul 1, 2024
sdv/sampling/hierarchical_sampler.py Outdated Show resolved Hide resolved
sdv/data_processing/data_processor.py Outdated Show resolved Hide resolved
sdv/data_processing/data_processor.py Outdated Show resolved Hide resolved
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add an integration test with the example in the issue

sdv/sampling/hierarchical_sampler.py Show resolved Hide resolved
@lajohn4747 lajohn4747 merged commit 462812d into main Jul 3, 2024
39 checks passed
@lajohn4747 lajohn4747 deleted the issue_2064_sampling_unknown_sdtype_error branch July 3, 2024 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HMA sampling crashes when unknown sdtype detected for numerical column
5 participants