# Tutorial on metasyn-disclosure

In this tutorial, we will show how to use the metasyn disclosure control plugin to enhance the privacy of generative metadata (MetaFrames and exported GMF files). 

This tutorial assumes you are familiar with the basic workflow of `metasyn`. If you're not, please first check out our [metasyn tutorials](https://metasynth.readthedocs.io/en/latest/usage/interactive_tutorials.html).

## Setup

### Installation
The first step is to install the metasyn-disclosure package. You can do this by uncommenting the following line and running it.

In [64]:
# !pip install git+https://github.com/sodascience/metasyn-disclosure-control.git

### Importing Packages
Then we import the necessary packages.

In [65]:
import polars as pl
from metasyn import MetaFrame, demo_file
from metasyn.distribution import (
    DiscreteUniformDistribution,
    FakerDistribution,
    RegexDistribution,
)

from metasyncontrib.disclosure import DisclosurePrivacy

### Preparing a dataset

The first step in creating the metadata is reading and converting your dataset to a polars DataFrame. 

In [66]:
titanic_path = demo_file()
df = pl.read_csv(
    source=titanic_path,
    try_parse_dates=True,
    dtypes={"Sex": pl.Categorical, "Embarked": pl.Categorical},
)
df.head()

| PassengerId | Name | Sex | Age | Parch | Ticket | Fare | Cabin | Embarked | Birthday | Board time | Married since | all_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | cat | i64 | i64 | str | f64 | str | cat | date | time | datetime[μs] | str |
| 1 | "Braund, Mr. Owen Harris" | "male" | 22 | 0 | "A/5 21171" | 7.25 | | "S" | 1937-10-28 | 15:53:04 | 2022-08-05 04:43:34 | |
| 2 | "Cumings, Mrs. John Bradley (Fl… | "female" | 38 | 0 | "PC 17599" | 71.2833 | "C85" | "C" | | 12:26:00 | 2022-08-07 01:56:33 | |
| 3 | "Heikkinen, Miss. Laina" | "female" | 26 | 0 | "STON/O2. 3101282" | 7.925 | | "S" | 1931-09-24 | 16:08:25 | 2022-08-04 20:27:37 | |
| 4 | "Futrelle, Mrs. Jacques Heath (… | "female" | 35 | 0 | "113803" | 53.1 | "C123" | "S" | 1936-11-30 | | 2022-08-07 07:05:55 | |
| 5 | "Allen, Mr. William Henry" | "male" | 35 | 0 | "373450" | 8.05 | | "S" | 1918-11-07 | 10:59:08 | 2022-08-02 15:13:34 | |


## Using Disclosure Control

First, let's look at what happens when we fit a MetaFrame to the data without the disclosure plugin.

### Without Disclosure Control

In [67]:
mf = MetaFrame.fit_dataframe(
    df=df,
)

print(
    f"Original Data vs Generative MetaData (for 'Married since' column):\n\n"
    f"Lowest value in source DataFrame: {df['Married since'].min()}\n"
    f"Lowest bound in the fitted MetaFrame:  {mf['Married since'].distribution.lower}\n\n"
    f"Highest value in source DataFrame: {df['Married since'].max()}\n"
    f"Highest bound in the fitted MetaFrame: {mf['Married since'].distribution.upper}"
)

100%|██████████| 13/13 [00:01<00:00, 12.55it/s]

Original Data vs Generative MetaData (for 'Married since' column):

Lowest value in source DataFrame: 2022-07-15 12:21:15
Lowest bound in the fitted MetaFrame:  2022-07-15 12:21:15

Highest value in source DataFrame: 2022-08-15 10:32:05
Highest bound in the fitted MetaFrame: 2022-08-15 10:32:05





Comparing the original data to the fitted MetaFrame, we see that the lower bound of the distribution equals the lowest value in the source DataFrame, and the upper bound equals the highest value.

This means that the extreme values of the source data can be read directly from the generative metadata.
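
The risk can be illustrated outside of metasyn. The following is a minimal sketch in plain Python, assuming a naive fit that takes the observed minimum and maximum as bounds. The helper `fit_bounds` and the sample values are hypothetical and not part of the metasyn API. It shows that each bound points back at an actual record.

```python
from datetime import datetime

# Made-up sample standing in for a sensitive datetime column.
married_since = [
    datetime(2022, 7, 15, 12, 21, 15),
    datetime(2022, 7, 28, 9, 5, 0),
    datetime(2022, 8, 2, 18, 40, 30),
    datetime(2022, 8, 15, 10, 32, 5),
]


def fit_bounds(values):
    """Hypothetical naive fit: the bounds are the observed extremes."""
    return min(values), max(values)


lower, upper = fit_bounds(married_since)

# Anyone holding only the bounds can point at rows of the source data.
exposed_rows = [i for i, v in enumerate(married_since) if v in (lower, upper)]
print(exposed_rows)
```

In this toy sample, the first and last rows are exposed even though only the two bounds were shared.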


### With Disclosure Control

By using the disclosure plugin, we can prevent this privacy concern. 

We can use the disclosure plugin for the entire MetaFrame by:
- Setting the `dist_providers` parameter to `"metasyn-disclosure"` to use the plugin.
- Setting the `privacy` parameter to `DisclosurePrivacy()` to enable disclosure control for the entire MetaFrame.

In [68]:
mf_disclosure = MetaFrame.fit_dataframe(
    df=df,
    dist_providers="metasyn-disclosure",  # Make the source of the distributions the disclosure plugin
    privacy=DisclosurePrivacy(),  # Make the entire MetaFrame use disclosure control
)

print(
    f"Original Data vs Generative MetaData (for 'Married since' column):\n\n"
    f"Lowest value in source DataFrame: {df['Married since'].min()}\n"
    f"Lowest bound in the fitted MetaFrame:  {mf_disclosure['Married since'].distribution.lower}\n\n"
    f"Highest value in source DataFrame: {df['Married since'].max()}\n"
    f"Highest bound in the fitted MetaFrame: {mf_disclosure['Married since'].distribution.upper}"
)

100%|██████████| 13/13 [00:00<00:00, 29.49it/s]

Original Data vs Generative MetaData (for 'Married since' column):

Lowest value in source DataFrame: 2022-07-15 12:21:15
Lowest bound in the fitted MetaFrame:  2022-07-15 17:12:24

Highest value in source DataFrame: 2022-08-15 10:32:05
Highest bound in the fitted MetaFrame: 2022-08-15 07:56:54





As you can see, the disclosure plugin addresses the privacy concern present in the base metasyn package: the distribution bounds are no longer equal to the extreme values in the source data.
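
To quantify the effect, the bounds printed above can be compared directly. The snippet below is a small sketch using only the standard library. The datetimes are copied from the output above, and `reveals_extremes` is a hypothetical helper, not part of metasyn.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the printed output above.
data_min = datetime.strptime("2022-07-15 12:21:15", FMT)
data_max = datetime.strptime("2022-08-15 10:32:05", FMT)
fitted_lower = datetime.strptime("2022-07-15 17:12:24", FMT)
fitted_upper = datetime.strptime("2022-08-15 07:56:54", FMT)


def reveals_extremes(lower, upper, observed_min, observed_max):
    """Return True if either bound coincides with an observed extreme."""
    return lower == observed_min or upper == observed_max


lower_shift = fitted_lower - data_min  # how far the lower bound moved inward
upper_shift = data_max - fitted_upper  # how far the upper bound moved inward

print(reveals_extremes(fitted_lower, fitted_upper, data_min, data_max))
print(lower_shift, upper_shift)
```

For these values, the lower bound moved inward by 4 hours, 51 minutes and 9 seconds, and the upper bound by 2 hours, 35 minutes and 11 seconds.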

### Manually Specifying Distributions
However, there is still a problem. The disclosure plugin does not implement every type of distribution that the base metasyn package does. This means that when no suitable distribution can be found, the plugin defaults to an NA distribution (returning only NA values).
We can see this happening if we print the generative metadata. For example, notice how the 'Cabin' column has an NA distribution.

In [69]:
print(mf_disclosure)

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.normal
	- Provenance: metasyn-disclosure
	- Parameters:
		- mean: 446.0
		- sd: 257.18994277900265
	

Column 2: "Name"
- Variable Type: string
- Data Type: String
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.freetext
	- Provenance: metasyn-disclosure
	- Parameters:
		- locale: EN
		- avg_sentences: 2.4691358024691357
		- avg_words: 4.093153759820426
	

Column 3: "Sex"
- Variable Type: categorical
- Data Type: Categorical(ordering='physical')
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.multinoulli
	- Provenance: metasyn-disclosure
	- Parameters:
		- labels: ['female' 'male']
		- probs: [0.35241302 0.64758698]
	

Column 4: "Age"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.1987
- Distribution:
	- Type: core.normal
	- Provenance: metasyn-disclos

We can fix this by manually specifying the distributions for the variables that the disclosure plugin does not support. For example, we can set the 'Cabin' column to have a regex distribution.

In [70]:
specs = [
    # Specify cabin to have a RegexDistribution
    {"name": "Cabin", "distribution": RegexDistribution(r"[ABCDEF][0-9]{2,3}")},
]

mf_disclosure = MetaFrame.fit_dataframe(
    df=df,
    var_specs=specs, 
    dist_providers="metasyn-disclosure",
    privacy=DisclosurePrivacy(),  
)

print(mf_disclosure)


100%|██████████| 13/13 [00:00<00:00, 35.13it/s]

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.normal
	- Provenance: metasyn-disclosure
	- Parameters:
		- mean: 446.0
		- sd: 257.18994277900265
	

Column 2: "Name"
- Variable Type: string
- Data Type: String
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.freetext
	- Provenance: metasyn-disclosure
	- Parameters:
		- locale: EN
		- avg_sentences: 2.4691358024691357
		- avg_words: 4.093153759820426
	

Column 3: "Sex"
- Variable Type: categorical
- Data Type: Categorical(ordering='physical')
- Proportion of Missing Values: 0.0000
- Distribution:
	- Type: core.multinoulli
	- Provenance: metasyn-disclosure
	- Parameters:
		- labels: ['female' 'male']
		- probs: [0.35241302 0.64758698]
	

Column 4: "Age"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.1987
- Distribution:
	- Type: core.normal
	- Provenance: metasyn-disclos




As you can see, the 'Cabin' column now has a RegexDistribution, and is no longer NA.
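
When writing a pattern by hand, it can help to check it against a few example values first. The sketch below uses Python's built-in `re` module as a stand-in. Whether `RegexDistribution` interprets the pattern exactly like `re` does is an assumption, and the `sample` values and the `match_share` helper are hypothetical.

```python
import re

# Same pattern as in the spec above, compiled with Python's `re` module.
CABIN_PATTERN = re.compile(r"[ABCDEF][0-9]{2,3}")


def match_share(values, pattern):
    """Return the share of non-missing values that fully match the pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if pattern.fullmatch(v))
    return hits / len(non_null)


# Made-up sample of cabin labels; None stands for a missing value.
sample = ["C85", "C123", None, "E46", "G6", None]
share = match_share(sample, CABIN_PATTERN)
print(share)
```

A low share suggests the pattern is too narrow for the column. Here "G6" falls outside the pattern, so three of the four non-missing values match.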

## Privacy Control for Individual Columns

Instead of using disclosure control for the entire MetaFrame, we can also use it for individual columns.  This way, other columns (which might not be supported by the disclosure plugin) can still use the base metasyn distributions.

To do so, we have to specify the `privacy` parameter for the variable in the variable specification.

We can do this as follows:

In [71]:
specs_disclosure = [
    # Use disclosure control for the 'Married since' column
    {"name": "Married since", "privacy": DisclosurePrivacy()}
]

Then, when fitting the MetaFrame, we have to set the `dist_providers` parameter to include both the "builtin" distributions and the disclosure variants.
We have to include both, because otherwise the columns that are not specified in `var_specs` to use the disclosure plugin will give an error.

In [72]:
mf_disclosure_alt = MetaFrame.fit_dataframe(
    df=df,
    var_specs=specs_disclosure,
    dist_providers=["builtin", "metasyn-disclosure"]  # Allow for distributions from both the builtin and disclosure plugin
)

print(
    f"Original Data vs Generative MetaData (for 'Married since' column):\n\n"
    f"Lowest value in source DataFrame: {df['Married since'].min()}\n"
    f"Lowest bound in the fitted MetaFrame:  {mf_disclosure_alt['Married since'].distribution.lower}\n\n"
    f"Highest value in source DataFrame: {df['Married since'].max()}\n"
    f"Highest bound in the fitted MetaFrame: {mf_disclosure_alt['Married since'].distribution.upper}"
)

mf_disclosure_alt.synthesize(5)

100%|██████████| 13/13 [00:01<00:00, 11.87it/s]

Original Data vs Generative MetaData (for 'Married since' column):

Lowest value in source DataFrame: 2022-07-15 12:21:15
Lowest bound in the fitted MetaFrame:  2022-07-15 17:12:24

Highest value in source DataFrame: 2022-08-15 10:32:05
Highest bound in the fitted MetaFrame: 2022-08-15 07:56:54





| PassengerId | Name | Sex | Age | Parch | Ticket | Fare | Cabin | Embarked | Birthday | Board time | Married since | all_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | cat | i64 | i64 | str | f64 | null | cat | date | time | datetime[μs] | null |
| 671 | "Score. Level." | "female" | 27 | 0 | "949904" | 18.701995 | | "C" | 1910-10-17 | 13:32:43 | 2022-08-06 02:00:34 | |
| 376 | "Candidate middle. Reflect." | "female" | 16 | 0 | "0719" | 50.066777 | | "S" | 1932-03-10 | 17:42:53 | 2022-07-24 19:18:59 | |
| 528 | "Recognize." | "male" | 34 | 0 | "51255" | 10.739662 | | "S" | 1904-11-20 | 17:19:15 | 2022-07-30 21:26:37 | |
| 58 | "Become. School. Night. Leave. … | "male" | 22 | 0 | "35068" | 10.523629 | | "S" | 1932-04-29 | | 2022-08-01 20:56:25 | |
| 512 | "Commercial forward able. Simil… | "male" | 25 | 0 | "2976" | 3.724113 | | "S" | 1913-06-07 | 11:31:47 | 2022-07-23 13:55:52 | |


As the bounds show, disclosure control has been applied to the 'Married since' column.

## Changing the Privacy Level

By default, the disclosure privacy partition size is set to 11.  This partition size determines how data is grouped for privacy protection. Data is grouped into partitions of (at least) the specified size, and the mean of each group is then used. A larger partition size increases privacy but may reduce data utility, while a smaller partition size may increase utility but decrease privacy.
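
The grouping step can be sketched in a few lines of plain Python. This is an illustration of the idea described above (often called micro-aggregation), not the plugin's actual implementation. The function `partition_means` and the toy data are hypothetical.

```python
from statistics import mean


def partition_means(values, partition_size=11):
    """Sort the values, split them into groups of at least
    `partition_size` items, and return the mean of each group."""
    ordered = sorted(values)
    # With fewer values than partition_size, everything ends up in one group.
    n_groups = max(1, len(ordered) // partition_size)
    means = []
    for i in range(n_groups):
        start = i * partition_size
        # The last group absorbs the remainder, so no group is too small.
        end = (i + 1) * partition_size if i < n_groups - 1 else len(ordered)
        means.append(mean(ordered[start:end]))
    return means


values = list(range(1, 31))  # toy data: 1, 2, ..., 30

means_5 = partition_means(values, partition_size=5)
means_11 = partition_means(values, partition_size=11)

# The lowest and highest group means could serve as protected bounds.
print(means_5[0], means_5[-1])
print(means_11[0], means_11[-1])
```

With this toy data, a partition size of 5 gives a lowest group mean of 3 and a highest of 28, while a partition size of 11 gives 6 and 21. The larger partition moves both values further from the true extremes of 1 and 30, which is the privacy and utility trade-off described above.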

The partition size can be changed by specifying the `partition_size` parameter in the `DisclosurePrivacy` class. This can be done either for the entire MetaFrame, or for individual columns.

When using the plugin for the entire MetaFrame, the syntax is as follows:

In [73]:
mf = MetaFrame.fit_dataframe(
    df=df,
    dist_providers="metasyn-disclosure",  
    privacy=DisclosurePrivacy(partition_size=5),  # Set the partition size to 5
)

100%|██████████| 13/13 [00:00<00:00, 27.34it/s]


When using the plugin for individual columns, the syntax is as follows:

In [74]:
specs = [
    {"name": "Married since", "privacy": DisclosurePrivacy(partition_size=5)},
]

mf = MetaFrame.fit_dataframe(
    df=df,
    var_specs=specs,
    dist_providers=["builtin", "metasyn-disclosure"]  # Allow for distributions from both the builtin and disclosure plugin
)

100%|██████████| 13/13 [00:01<00:00, 12.89it/s]


## Conclusion
That covers the basics of using the disclosure control plugin. If you have any questions or encounter an issue, please feel free to reach out to us on our [GitHub page](https://github.com/sodascience/metasyn-disclosure-control).