# **YData Synthetic Data generation privacy controls**

YData synthesizers now include a privacy layer that can provide differential privacy. The user selects one of three levels:
- High fidelity - the default behavior, which yields synthetic data with higher fidelity and utility but less privacy.
- High privacy - generates synthetic data with higher privacy at the cost of fidelity and utility.
- Balanced fidelity/privacy - seeks a compromise between fidelity, utility, and privacy, aiming for good enough results on all three.
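
This notebook does not cover how the privacy layer works internally. As a generic illustration of the trade-off behind these levels, the sketch below uses the Laplace mechanism, a textbook way to make a numeric query differentially private. It is not YData's method: the function names, the clipping bounds `[0, 30]`, the sample values, and the `epsilon` settings are all assumptions chosen for illustration. A smaller `epsilon` means stronger privacy and a noisier (lower-fidelity) answer.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with epsilon-differential privacy."""
    # Clip so that a single record can only move the mean by a bounded amount.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record changes the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(values, epsilon, trials=2000, seed=0):
    """Average distance between the private mean and the exact mean."""
    rng = random.Random(seed)
    exact = sum(values) / len(values)
    errors = [
        abs(private_mean(values, 0.0, 30.0, epsilon, rng) - exact)
        for _ in range(trials)
    ]
    return sum(errors) / trials


sample = [17.99, 20.57, 19.69, 11.42, 20.29]  # illustrative values only
for epsilon in (0.1, 1.0, 10.0):
    error = mean_abs_error(sample, epsilon)
    print(f"epsilon={epsilon:>4}: mean abs error = {error:.2f}")
```

The printed error shrinks as `epsilon` grows, which mirrors the choice between the high-privacy and high-fidelity levels.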

This notebook describes how to use the privacy layer with the regular synthesizer. The same logic applies equally to the time series synthesizer.

We will use the Breast Cancer Wisconsin dataset to demonstrate how to take advantage of the privacy layer. This dataset contains features computed from digitized images of fine needle aspirates (FNA) of breast masses. Each row has the diagnosis (M for malignant, B for benign) and 30 real-valued features computed for the cell nuclei. The diagnosis is the target variable.

In [11]:
from ydata.synthesizers.regular.model import RegularSynthesizer
from ydata.metadata import Metadata
from ydata.dataset import Dataset
from ydata.labs import DataSources
from ydata.report import SyntheticDataProfile
from ydata.report.reports.report_type import ReportType
from ydata.synthesizers.privacy import PrivacyLevel

In [12]:
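# Replace the placeholder with the UID of your own data source
datasource = DataSources.get(uid='{datasource-uid}')
data = datasource.read()
# The record identifier is not a feature, so drop it before modeling
data = data.drop_columns(columns=["id"])
metadata = Metadata(data)
_target = "diagnosis"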



In [13]:
data.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## **High Fidelity**

The synthesizer's `fit` method accepts an optional `privacy_level` parameter, which defaults to the high-fidelity setting. We can also specify this level explicitly: import the `PrivacyLevel` enumeration and choose the `HIGH_FIDELITY` option.

In [14]:
synthesizer = RegularSynthesizer()
synthesizer.fit(data, metadata=metadata, privacy_level=PrivacyLevel.HIGH_FIDELITY)
# Retrieve the holdout and training splits through the synthesizer's private attributes
holdout_dataset = Dataset(synthesizer._holdout._data.compute())
train_dataset = Dataset(synthesizer._holdout._train_data.compute())
# Generate as many synthetic rows as the holdout set contains
synthetic_dataset = synthesizer.sample(n_samples=len(holdout_dataset))

INFO: 2023-04-11 10:06:35,092 [SYNTHESIZER] - Number columns considered for synth: 31
INFO: 2023-04-11 10:06:35,749 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-04-11 10:06:35,752 [SYNTHESIZER] - Preprocess segment
INFO: 2023-04-11 10:06:35,759 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-04-11 10:06:35,760 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-04-11 10:06:37,468 [SYNTHESIZER] - Start generating model samples.


In [15]:
%%capture
data_types = {k: v.datatype for k, v in metadata.columns.items()}
sdf = SyntheticDataProfile(real=holdout_dataset,
                           synth=synthetic_dataset,
                           metadata=metadata,
                           report_type=ReportType.TABULAR,
                           target=_target,
                           data_types=data_types,
                           training_data=train_dataset)
summary_metrics = sdf.get_summary()

INFO: 2023-04-11 10:06:38,497 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2023-04-11 10:06:38,613 [PROFILEREPORT] - Synthetic data quality report selected target variable: diagnosis
INFO: 2023-04-11 10:06:38,614 [PROFILEREPORT] - preparing data format.
INFO: 2023-04-11 10:06:38,837 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2023-04-11 10:06:38,946 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2023-04-11 10:06:38,955 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2023-04-11 10:06:38,961 [PROFILEREPORT] - Metric [Exact Matches] took 0.00s.
INFO: 2023-04-11 10:06:38,965 [PROFILEREPORT] - Calculating metric [Neighbours Privacy].
INFO: 2023-04-11 10:06:38,975 [PROFILEREPORT] - Metric [Neighbours Privacy] took 0.01s.
INFO: 2023-04-11 10:06:38,979 [PROFILEREPORT] - Calculating metric [Identifiability].
INFO: 2023-04-11 10:06:38,992 [PROFILEREPORT] - Metric [Identifiability] took 0.01s.
INFO: 2023-04-11 10:06:38,999 [PROFILEREPORT] - Calcul

In [16]:
print(f"\033[1m{PrivacyLevel.HIGH_FIDELITY.name}")
print(f"\033[1mFidelity: {summary_metrics['fidelity']:.2f}")
print(f"\033[1mUtility: {summary_metrics['utility']:.2f}")
print(f"\033[1mPrivacy: {summary_metrics['privacy']:.2f}")

HIGH_FIDELITY
Fidelity: 0.87
Utility: 0.70
Privacy: 0.58
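
Because the level is an ordinary argument, it can also be chosen at run time, for example from a configuration value. The following is a minimal sketch, assuming `PrivacyLevel` behaves like a standard Python `enum.Enum` (this notebook calls it an enumeration and reads its `.name`). The stand-in class and the `parse_level` helper are hypothetical and only mirror the three member names used in this notebook.

```python
from enum import Enum, auto


class PrivacyLevelStandIn(Enum):
    """Hypothetical stand-in that mirrors the member names used in this notebook."""

    HIGH_FIDELITY = auto()
    HIGH_PRIVACY = auto()
    BALANCED_PRIVACY_FIDELITY = auto()


def parse_level(name, enum_cls=PrivacyLevelStandIn, default="HIGH_FIDELITY"):
    """Map a configuration string to an enum member, falling back to the default."""
    key = (name or default).strip().upper()
    try:
        # Enum classes support lookup by member name with square brackets.
        return enum_cls[key]
    except KeyError:
        valid = ", ".join(member.name for member in enum_cls)
        raise ValueError(
            f"Unknown privacy level {name!r}; expected one of: {valid}"
        ) from None


print(parse_level("high_privacy").name)
print(parse_level(None).name)
```

If the assumption holds, passing `enum_cls=PrivacyLevel` would return a value suitable for the `privacy_level` argument of `fit`.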


## **High Privacy**

To achieve high privacy, set the `privacy_level` parameter to the `HIGH_PRIVACY` option of the `PrivacyLevel` enumeration.

In [17]:
synthesizer = RegularSynthesizer()
synthesizer.fit(data, metadata=metadata, privacy_level=PrivacyLevel.HIGH_PRIVACY)
holdout_dataset = Dataset(synthesizer._holdout._data.compute())
train_dataset = Dataset(synthesizer._holdout._train_data.compute())
synthetic_dataset = synthesizer.sample(n_samples=len(holdout_dataset))

INFO: 2023-04-11 10:06:52,838 [SYNTHESIZER] - Number columns considered for synth: 31
INFO: 2023-04-11 10:07:13,238 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-04-11 10:07:13,242 [SYNTHESIZER] - Preprocess segment
INFO: 2023-04-11 10:07:13,249 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-04-11 10:07:13,250 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-04-11 10:07:14,740 [SYNTHESIZER] - Start generating model samples.


In [18]:
%%capture
data_types = {k: v.datatype for k, v in metadata.columns.items()}
sdf = SyntheticDataProfile(real=holdout_dataset,
                           synth=synthetic_dataset,
                           metadata=metadata,
                           report_type=ReportType.TABULAR,
                           target=_target,
                           data_types=data_types,
                           training_data=train_dataset)
summary_metrics = sdf.get_summary()

INFO: 2023-04-11 10:07:15,968 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2023-04-11 10:07:16,090 [PROFILEREPORT] - Synthetic data quality report selected target variable: diagnosis
INFO: 2023-04-11 10:07:16,091 [PROFILEREPORT] - preparing data format.
INFO: 2023-04-11 10:07:16,278 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2023-04-11 10:07:16,418 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2023-04-11 10:07:16,423 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2023-04-11 10:07:16,431 [PROFILEREPORT] - Metric [Exact Matches] took 0.01s.
INFO: 2023-04-11 10:07:16,434 [PROFILEREPORT] - Calculating metric [Neighbours Privacy].
INFO: 2023-04-11 10:07:16,444 [PROFILEREPORT] - Metric [Neighbours Privacy] took 0.01s.
INFO: 2023-04-11 10:07:16,448 [PROFILEREPORT] - Calculating metric [Identifiability].
INFO: 2023-04-11 10:07:16,458 [PROFILEREPORT] - Metric [Identifiability] took 0.01s.
INFO: 2023-04-11 10:07:16,471 [PROFILEREPORT] - Calcul

In [19]:
print(f"\033[1m{PrivacyLevel.HIGH_PRIVACY.name}")
print(f"\033[1mFidelity: {summary_metrics['fidelity']:.2f}")
print(f"\033[1mUtility: {summary_metrics['utility']:.2f}")
print(f"\033[1mPrivacy: {summary_metrics['privacy']:.2f}")

HIGH_PRIVACY
Fidelity: 0.66
Utility: 0.35
Privacy: 0.99
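
The fit, sample, and profile cells differ between levels only in the `privacy_level` argument, so they can be wrapped in one helper. The sketch below reuses only the calls shown in this notebook. The function name `evaluate_privacy_level` is introduced here for illustration, and the private `_holdout` attributes are accessed exactly as in the cells above; by Python convention such attributes are not part of a public API.

```python
def evaluate_privacy_level(level, data, metadata, target):
    """Fit, sample, and profile once for the given privacy level."""
    # Imported lazily so that defining the helper does not itself require ydata.
    from ydata.dataset import Dataset
    from ydata.report import SyntheticDataProfile
    from ydata.report.reports.report_type import ReportType
    from ydata.synthesizers.regular.model import RegularSynthesizer

    synthesizer = RegularSynthesizer()
    synthesizer.fit(data, metadata=metadata, privacy_level=level)

    # Same private attributes as in the notebook cells.
    holdout = Dataset(synthesizer._holdout._data.compute())
    train = Dataset(synthesizer._holdout._train_data.compute())
    synthetic = synthesizer.sample(n_samples=len(holdout))

    profile = SyntheticDataProfile(
        real=holdout,
        synth=synthetic,
        metadata=metadata,
        report_type=ReportType.TABULAR,
        target=target,
        data_types={k: v.datatype for k, v in metadata.columns.items()},
        training_data=train,
    )
    summary = profile.get_summary()
    # Keep only the three headline scores printed in this notebook.
    return {name: summary[name] for name in ("fidelity", "utility", "privacy")}


# Example call, assuming `data`, `metadata`, and `_target` from the cells above:
# scores = evaluate_privacy_level(PrivacyLevel.HIGH_PRIVACY, data, metadata, _target)
```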


## **Balanced Fidelity/Privacy**

To balance fidelity, utility, and privacy, set the `privacy_level` parameter to the `BALANCED_PRIVACY_FIDELITY` option of the `PrivacyLevel` enumeration.

In [20]:
synthesizer = RegularSynthesizer()
synthesizer.fit(data, metadata=metadata, privacy_level=PrivacyLevel.BALANCED_PRIVACY_FIDELITY)
holdout_dataset = Dataset(synthesizer._holdout._data.compute())
train_dataset = Dataset(synthesizer._holdout._train_data.compute())
synthetic_dataset = synthesizer.sample(n_samples=len(holdout_dataset))

INFO: 2023-04-11 10:07:31,534 [SYNTHESIZER] - Number columns considered for synth: 31
INFO: 2023-04-11 10:07:50,798 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-04-11 10:07:50,803 [SYNTHESIZER] - Preprocess segment
INFO: 2023-04-11 10:07:50,810 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-04-11 10:07:50,811 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-04-11 10:07:52,553 [SYNTHESIZER] - Start generating model samples.


In [21]:
%%capture
data_types = {k: v.datatype for k, v in metadata.columns.items()}
sdf = SyntheticDataProfile(real=holdout_dataset,
                           synth=synthetic_dataset,
                           metadata=metadata,
                           report_type=ReportType.TABULAR,
                           target=_target,
                           data_types=data_types,
                           training_data=train_dataset)
summary_metrics = sdf.get_summary()

INFO: 2023-04-11 10:07:53,513 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2023-04-11 10:07:53,653 [PROFILEREPORT] - Synthetic data quality report selected target variable: diagnosis
INFO: 2023-04-11 10:07:53,658 [PROFILEREPORT] - preparing data format.
INFO: 2023-04-11 10:07:53,893 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2023-04-11 10:07:54,044 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2023-04-11 10:07:54,053 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2023-04-11 10:07:54,062 [PROFILEREPORT] - Metric [Exact Matches] took 0.01s.
INFO: 2023-04-11 10:07:54,066 [PROFILEREPORT] - Calculating metric [Neighbours Privacy].
INFO: 2023-04-11 10:07:54,078 [PROFILEREPORT] - Metric [Neighbours Privacy] took 0.01s.
INFO: 2023-04-11 10:07:54,082 [PROFILEREPORT] - Calculating metric [Identifiability].
INFO: 2023-04-11 10:07:54,092 [PROFILEREPORT] - Metric [Identifiability] took 0.01s.
INFO: 2023-04-11 10:07:54,104 [PROFILEREPORT] - Calcul

In [22]:
print(f"\033[1m{PrivacyLevel.BALANCED_PRIVACY_FIDELITY.name}")
print(f"\033[1mFidelity: {summary_metrics['fidelity']:.2f}")
print(f"\033[1mUtility: {summary_metrics['utility']:.2f}")
print(f"\033[1mPrivacy: {summary_metrics['privacy']:.2f}")

BALANCED_PRIVACY_FIDELITY
Fidelity: 0.72
Utility: 0.64
Privacy: 0.96
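
To compare the three runs side by side, the printed scores can be gathered into one table. The sketch below hardcodes the values reported in this notebook, which come from a single run on one dataset and may differ between runs. `format_table` and `best_level` are helper names introduced here for illustration.

```python
# Summary scores copied from the outputs printed in this notebook.
results = {
    "HIGH_FIDELITY": {"fidelity": 0.87, "utility": 0.70, "privacy": 0.58},
    "HIGH_PRIVACY": {"fidelity": 0.66, "utility": 0.35, "privacy": 0.99},
    "BALANCED_PRIVACY_FIDELITY": {"fidelity": 0.72, "utility": 0.64, "privacy": 0.96},
}
METRICS = ("fidelity", "utility", "privacy")


def format_table(results):
    """Render one row per privacy level with the three summary scores."""
    header = f"{'level':<28}" + "".join(f"{metric:>10}" for metric in METRICS)
    rows = [
        f"{level:<28}" + "".join(f"{scores[metric]:>10.2f}" for metric in METRICS)
        for level, scores in results.items()
    ]
    return "\n".join([header, *rows])


def best_level(results, metric):
    """Return the level name with the highest score for `metric`."""
    return max(results, key=lambda level: results[level][metric])


print(format_table(results))
for metric in METRICS:
    print(f"best {metric}: {best_level(results, metric)}")
```

In this run, the balanced level gives up 0.15 fidelity and 0.06 utility relative to `HIGH_FIDELITY` in exchange for a 0.38 gain in privacy, and its privacy score lands within 0.03 of `HIGH_PRIVACY`.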
