Problem Description
The CategoricalCAP metric is currently available in SDMetrics. It is based on a well-researched and cited methodology for measuring the risk of data disclosure. It acts as a great measure of privacy for particular sensitive columns.

However, there are a number of issues that make this metric hard for users to use and interpret. In #675, we are addressing many of the quality-related issues via a new metric called `DisclosureProtection`. However, performance issues still remain.

In this issue, we will address the performance issues in `DisclosureProtection` by creating a new metric called `DisclosureProtectionEstimate`.
Expected behavior

The `DisclosureProtectionEstimate` metric should wrap around the `DisclosureProtection` metric. It should estimate the original metric's score by subsetting the data, iterating over many subsets, and returning the average score.

Parameters: All the same parameters as `DisclosureProtection`, plus:
- `num_rows_subsample`: An integer describing the number of rows to subsample in each of the real and synthetic datasets. This subsampling occurs with replacement during every iteration (that is to say, every iteration should start subsampling from the same, original dataset)
  - (default) `1000`: Subsample 1000 rows in both the real and synthetic data
  - `<int>`: Subsample the number of rows provided
  - `None`: Do not subsample
- `num_iterations`: The number of iterations to do for different subsamples. The final score will be the average
  - (default) `10`: Do 10 iterations of the subsamples
  - `<int>`: Perform the number of iterations provided
- `verbose`: A boolean describing whether to show the progress
  - (default) `True`: Print the steps being run and the progress bar
  - `False`: Do not print anything out
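The subsampling behavior described by `num_rows_subsample` could be sketched as follows. This is a minimal illustration using pandas; the helper name `subsample_pair` is hypothetical, not part of SDMetrics:

```python
import pandas as pd

def subsample_pair(real_data, synthetic_data, num_rows_subsample=1000, random_state=None):
    """Draw one subsample (with replacement) from each original dataset."""
    # None means: do not subsample at all.
    if num_rows_subsample is None:
        return real_data, synthetic_data

    # Sample with replacement from the *original* data every time, so that
    # each iteration starts fresh rather than re-sampling a prior subsample.
    real_sub = real_data.sample(n=num_rows_subsample, replace=True, random_state=random_state)
    synth_sub = synthetic_data.sample(n=num_rows_subsample, replace=True, random_state=random_state)
    return real_sub, synth_sub
```

Sampling with replacement is what allows `num_rows_subsample` to exceed the size of the original data.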
Computation:
- The `baseline_protection` score computation is exactly the same as in `DisclosureProtection`, and it only needs to be computed once.
- The `cap_protection` score will instead be a `cap_protection_estimate`, computed by running through the desired number of iterations and averaging the results.
- The final score will then be `score = min(cap_protection_estimate/baseline_protection, 1)`
Compute breakdown:
```python
from sdmetrics.single_table import DisclosureProtectionEstimate

DisclosureProtectionEstimate.compute_breakdown(
    real_data=real_table,
    synthetic_data=synthetic_table,
    known_columns=['age', 'gender'],
    sensitive_column=['political_affiliation'],
    columns_to_discretize=['age'],
)
```
```
{
    'score': 0.912731436159061,
    'cap_protection_estimate': 0.782341231,
    'baseline_protection': 0.85714285715
}
```
Verbosity/Progress Bar:

If verbosity is turned on, it should show a progress bar that increments per iteration. The progress bar should be updated with the overall score (using the updated average), rounded to 4 decimal places.
```python
>>> DisclosureProtectionEstimate.compute(real_data, synthetic_data, verbose=True)
Estimating Disclosure Protection (Score=0.8744): 100%|██████████| 10/10 [00:34<00:00,  3.42s/it]
```
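A progress bar like the one above could be produced with tqdm, updating the description with the running average each iteration. This is a sketch under the assumption that tqdm is used; the function name and signature are illustrative, not the real implementation:

```python
def run_with_progress(iteration_scores, baseline_protection, verbose=True):
    """Consume per-iteration cap_protection scores, showing a running score."""
    scores = []
    if verbose:
        # Import lazily so the non-verbose path has no tqdm dependency.
        from tqdm import tqdm
        iterator = tqdm(iteration_scores, desc='Estimating Disclosure Protection')
    else:
        iterator = iteration_scores

    for cap_score in iterator:
        scores.append(cap_score)
        running = min((sum(scores) / len(scores)) / baseline_protection, 1)
        if verbose:
            # Show the overall score so far, rounded to 4 decimal places.
            iterator.set_description(
                f'Estimating Disclosure Protection (Score={round(running, 4)})')

    return min((sum(scores) / len(scores)) / baseline_protection, 1)
```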