# Tutorial on metasyn-disclosure

In this tutorial, we will show how to use the metasyn disclosure control plugin to enhance the privacy of generative metadata (MetaFrames and exported GMF files). 

This tutorial assumes you are familiar with the basic workflow of `metasyn`. If you are not, please check out our [metasyn tutorials](https://metasynth.readthedocs.io/en/latest/usage/interactive_tutorials.html) first.

## Setup

### Installation
The first step is to install the metasyn-disclosure package. You can do this by uncommenting the following line and running it.

In [42]:
# !pip install git+https://github.com/sodascience/metasyn-disclosure-control.git

### Importing Packages
Then we import the necessary packages.

In [43]:
import polars as pl
from metasyn import MetaFrame, demo_file
from metasyn.distribution import (
    DiscreteUniformDistribution,
    FakerDistribution,
    RegexDistribution,
)

from metasyncontrib.disclosure import DisclosurePrivacy

### Preparing a dataset

The first step in creating the metadata is reading and converting your dataset to a polars DataFrame. 

In [44]:
titanic_path = demo_file()
df = pl.read_csv(
    source=titanic_path,
    try_parse_dates=True,
    dtypes={"Sex": pl.Categorical, "Embarked": pl.Categorical},
)
df.head()

| PassengerId | Name | Sex | Age | Parch | Ticket | Fare | Cabin | Embarked | Birthday | Board time | Married since | all_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | cat | i64 | i64 | str | f64 | str | cat | date | time | datetime[μs] | str |
| 1 | "Braund, Mr. Owen Harris" | "male" | 22 | 0 | "A/5 21171" | 7.25 | | "S" | 1937-10-28 | 15:53:04 | 2022-08-05 04:43:34 | |
| 2 | "Cumings, Mrs. John Bradley (Fl… | "female" | 38 | 0 | "PC 17599" | 71.2833 | "C85" | "C" | | 12:26:00 | 2022-08-07 01:56:33 | |
| 3 | "Heikkinen, Miss. Laina" | "female" | 26 | 0 | "STON/O2. 3101282" | 7.925 | | "S" | 1931-09-24 | 16:08:25 | 2022-08-04 20:27:37 | |
| 4 | "Futrelle, Mrs. Jacques Heath (… | "female" | 35 | 0 | "113803" | 53.1 | "C123" | "S" | 1936-11-30 | | 2022-08-07 07:05:55 | |
| 5 | "Allen, Mr. William Henry" | "male" | 35 | 0 | "373450" | 8.05 | | "S" | 1918-11-07 | 10:59:08 | 2022-08-02 15:13:34 | |


Then we prepare a variable specification for the dataset.

In [45]:
specs = [
    # We set PassengerId to unique
    {"name": "PassengerId", "distribution": {"unique": True}},
    # We create new fake names for the name column
    {"name": "Name", "distribution": FakerDistribution("name")},
    # Fit an exponential distribution based on the data for fare
    {"name": "Fare", "distribution": {"implements": "core.exponential"}},
    # For age we enforce a specific uniform distribution
    {"name": "Age", "distribution": DiscreteUniformDistribution(20, 40)},
    # We know cabin has a specific regular expression
    {"name": "Cabin", "distribution": RegexDistribution(r"[ABCDEF][0-9]{2,3}")},
]

## Using Disclosure Control

First, let's look at what happens when we fit a MetaFrame to the data without the disclosure plugin.

### Without Disclosure Control

In [46]:
mf = MetaFrame.fit_dataframe(
    df=df,
    var_specs=specs
)

print(
    f"Original Data vs Generative MetaData (for 'Married since' column):\n\n"
    f"Lowest value in source DataFrame: {df['Married since'].min()}\n"
    f"Lowest bound in the fitted MetaFrame:  {mf['Married since'].distribution.lower}\n\n"
    f"Highest value in source DataFrame: {df['Married since'].max()}\n"
    f"Highest bound in the fitted MetaFrame: {mf['Married since'].distribution.upper}"
)



Original Data vs Generative MetaData (for 'Married since' column):

Lowest value in source DataFrame: 2022-07-15 12:21:15
Lowest bound in the fitted MetaFrame:  2022-07-15 12:21:15

Highest value in source DataFrame: 2022-08-15 10:32:05
Highest bound in the fitted MetaFrame: 2022-08-15 10:32:05





Comparing the original data to the fitted MetaFrame shows that the lower bound of the distribution equals the lowest value in the source DataFrame, and the upper bound equals the highest value.

This means that the extreme values of the source data can be read directly from the generative metadata.
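To make the concern concrete, here is a minimal sketch in plain Python that does not use metasyn. The function `fit_uniform_naive` and the `ages` list are hypothetical and only illustrate the idea of fitting bounds from the observed minimum and maximum; this is not metasyn's actual fitting code.

```python
def fit_uniform_naive(values):
    """Fit uniform bounds by taking the observed extremes (no disclosure control)."""
    observed = [v for v in values if v is not None]
    return min(observed), max(observed)


# Hypothetical ages, made up for illustration (not the Titanic demo data)
ages = [22, 38, 26, 35, 35, 54, 2]
lower, upper = fit_uniform_naive(ages)

# Both bounds are actual records from the source data, so anyone who
# receives the fitted parameters learns two real values.
print(lower in ages, upper in ages)
```

Any fitting procedure that copies the extremes into its parameters has this property, regardless of the data type of the column.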


### With Disclosure Control

By using the disclosure plugin, we can prevent this privacy concern. 

We can use the disclosure plugin for the entire MetaFrame by:
- Setting the `dist_providers` parameter to `"metasyn-disclosure"` to use the plugin.
- Setting the `privacy` parameter to `DisclosurePrivacy()` to enable disclosure control for the entire MetaFrame.

Be aware that the disclosure plugin does not implement every distribution. As a result, when privacy is set globally, some distributions will not be fit by default, but they can still be set manually.



In [47]:
mf_disclosure = MetaFrame.fit_dataframe(
    df=df,
    var_specs=specs,
    dist_providers="metasyn-disclosure",  # Make the source of the distributions the disclosure plugin
    privacy=DisclosurePrivacy(),  # Make the entire MetaFrame use disclosure control
)

print(
    f"Original Data vs Generative MetaData (for 'Married since' column):\n\n"
    f"Lowest value in source DataFrame: {df['Married since'].min()}\n"
    f"Lowest bound in the fitted MetaFrame:  {mf_disclosure['Married since'].distribution.lower}\n\n"
    f"Highest value in source DataFrame: {df['Married since'].max()}\n"
    f"Highest bound in the fitted MetaFrame: {mf_disclosure['Married since'].distribution.upper}"
)

mf_disclosure.to_json("temp.json")



Original Data vs Generative MetaData (for 'Married since' column):

Lowest value in source DataFrame: 2022-07-15 12:21:15
Lowest bound in the fitted MetaFrame:  2022-07-15 17:12:24

Highest value in source DataFrame: 2022-08-15 10:32:05
Highest bound in the fitted MetaFrame: 2022-08-15 07:56:54


As you can see, the disclosure plugin addresses the privacy concern present in the base metasyn package: the distribution bounds are no longer equal to the extreme values in the source data.
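One common statistical disclosure control technique that produces this kind of inward shift is micro-aggregation, where each bound is estimated from the average of several extreme values instead of a single record. The sketch below illustrates that idea in plain Python. It is an assumption for illustration, not the plugin's actual implementation, and the function name `microaggregated_bounds`, its `partition_size` argument, and the sample values are all hypothetical.

```python
def microaggregated_bounds(values, partition_size=3):
    """Estimate bounds from the mean of the k smallest and the k largest values."""
    observed = sorted(v for v in values if v is not None)
    if len(observed) < 2 * partition_size:
        raise ValueError("too few values for the requested partition size")
    # Average the extreme groups so that no single record determines a bound
    lower = sum(observed[:partition_size]) / partition_size
    upper = sum(observed[-partition_size:]) / partition_size
    return lower, upper


# Hypothetical values, made up for illustration
values = [2, 22, 27, 35, 35, 38, 54, 61, 71]
lower, upper = microaggregated_bounds(values, partition_size=3)

# The bounds now lie strictly inside the observed range
print(min(values) < lower, upper < max(values))
```

The trade-off is that the fitted range is slightly narrower than the true range, which matches the direction of the shift seen in the output above.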

It is also possible to use disclosure control for only a specific variable. To do so, we have to specify the `privacy` parameter for the variable in the variable specification.

We can do this as follows:

In [48]:
specs_disclosure = [
    # Same as previous
    {"name": "PassengerId", "distribution": {"unique": True}},
    {"name": "Name", "distribution": FakerDistribution("name")},
    {"name": "Fare", "distribution": {"implements": "core.exponential"}},
    {"name": "Age", "distribution": DiscreteUniformDistribution(20, 40)},
    {"name": "Cabin", "distribution": RegexDistribution(r"[ABCDEF][0-9]{2,3}")},
    
    # Use disclosure control for the 'Married since' column
    {"name": "Married since", "privacy": DisclosurePrivacy()}
]

Then, when fitting the MetaFrame, we have to set the `dist_providers` parameter to include both the "builtin" distributions and the disclosure variants.
Both are needed because the variables that are not set to use the disclosure plugin in the variable specification would otherwise raise an error.

In [49]:
mf_disclosure_alt = MetaFrame.fit_dataframe(
    df=df,
    var_specs=specs_disclosure,
    dist_providers=["builtin", "metasyn-disclosure"]  # Allow for distributions from both the builtin and disclosure plugin
)

print(
    f"Original Data vs Generative MetaData (for 'Married since' column):\n\n"
    f"Lowest value in source DataFrame: {df['Married since'].min()}\n"
    f"Lowest bound in the fitted MetaFrame:  {mf_disclosure_alt['Married since'].distribution.lower}\n\n"
    f"Highest value in source DataFrame: {df['Married since'].max()}\n"
    f"Highest bound in the fitted MetaFrame: {mf_disclosure_alt['Married since'].distribution.upper}"
)


Original Data vs Generative MetaData (for 'Married since' column):

Lowest value in source DataFrame: 2022-07-15 12:21:15
Lowest bound in the fitted MetaFrame:  2022-07-15 17:12:24

Highest value in source DataFrame: 2022-08-15 10:32:05
Highest bound in the fitted MetaFrame: 2022-08-15 07:56:54





As you can see, the 'Married since' column now has distribution bounds that differ from the source data, while the other columns are fitted the same way as before.
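To quantify how far the bounds moved, the timestamps printed above can be subtracted using the standard library. The four values below are copied from the cell output. The variable names are chosen for this sketch, and nothing here calls metasyn.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the printed output for the 'Married since' column
source_min = datetime.strptime("2022-07-15 12:21:15", FMT)
source_max = datetime.strptime("2022-08-15 10:32:05", FMT)
fitted_lower = datetime.strptime("2022-07-15 17:12:24", FMT)
fitted_upper = datetime.strptime("2022-08-15 07:56:54", FMT)

# How far each bound moved inward relative to the source extremes
lower_shift = fitted_lower - source_min
upper_shift = source_max - fitted_upper

print(f"Lower bound moved inward by {lower_shift}")
print(f"Upper bound moved inward by {upper_shift}")

# Both shifts are positive, so neither bound coincides with a source record
print(lower_shift > timedelta(0) and upper_shift > timedelta(0))
```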