# Tutorial on metasyn-disclosure

In this tutorial, we show how to use the metasyn disclosure control plugin. It follows the same procedure as the base metasyn [package](https://github.com/sodascience/metasyn/blob/main/examples/advanced_tutorial.ipynb), and the output format is the same: a GMF file with the same parameter types as the base package produces. One difference is that the disclosure plugin does not implement every distribution. Distributions it lacks are not fit by default, but they can be set manually.

In [None]:
# import required packages
from collections import defaultdict
import datetime as dt

import numpy as np
import polars as pl
from matplotlib import pyplot as plt

from metasyn import MetaFrame, demo_file
from metasyncontrib.disclosure import DisclosurePrivacy
from metasyn.provider import DistributionProviderList

## Transforming your data into a polars DataFrame

The first step in creating the metadata is reading and converting your dataset to a polars DataFrame. 

In [None]:
demonstration_fp = demo_file()
df = pl.read_csv(
    source=demonstration_fp, 
    try_parse_dates=True,
    dtypes={
        "Sex": pl.Categorical,
        "Embarked": pl.Categorical
    }
)
df.head()

### A full example with the base package

Below we first run the synthesis with the base metasyn package, so that we can see a potential privacy problem with it. For a more detailed explanation of the base package, see our base [tutorial](https://github.com/sodascience/metasyn/blob/main/examples/advanced_tutorial.ipynb).

In [None]:
from metasyn.distribution import RegexDistribution, FakerDistribution
from metasyn.distribution import DiscreteUniformDistribution

cabin_distribution = RegexDistribution(r"[ABCDEF][0-9]{2,3}")
var_spec = {
    "PassengerId": {"unique": True}, 
    "Name":        {"distribution": FakerDistribution("name")},
    "Fare":        {"distribution": "exponential"}, # Fit an exponential distribution based on the data
    "Age":         {"distribution": DiscreteUniformDistribution(20, 40)},
    "Cabin":       {"distribution": cabin_distribution}
}

meta_frame = MetaFrame.fit_dataframe(df, spec=var_spec)
print(f"Lower bound distribution:  {meta_frame['Married since'].distribution.start}\n"
      f"Lowest value in dataframe: {df['Married since'].min()}")
meta_frame.synthesize(5)

The output above shows a problem that can occur with the base package: the earliest datetime in the "Married since" column is recorded verbatim as the lower bound of the distribution, and therefore also ends up in the resulting GMF file.

### A full example with disclosure

Below is the same example with the metasyn disclosure plugin.

In [None]:
meta_frame = MetaFrame.fit_dataframe(
    df=df, 
    spec=var_spec,
    dist_providers="metasyn-disclosure",  # Use the metasyn-disclosure plugin
    privacy=DisclosurePrivacy()             # Use disclosure control
) 
print(f"Lower bound distribution:  {meta_frame['Married since'].distribution.start}\n"
      f"Lowest value in dataframe: {df['Married since'].min()}")
meta_frame.synthesize(5)

As you can see, the disclosure plugin fixes the privacy concern present in the base metasyn package: the lower bound of the distribution is no longer equal to the lowest value in the dataframe.
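One common disclosure-control technique that produces this kind of behaviour is micro-aggregation: instead of storing the single smallest value, store the mean of the `k` smallest values. The sketch below illustrates the idea on plain Python dates. It is not necessarily how the plugin computes its bounds; the helper name `aggregated_lower_bound`, the group size `k=5`, and the sample dates are all assumptions made for illustration.

```python
import datetime as dt


def aggregated_lower_bound(dates, k=5):
    """Return the mean of the k earliest dates instead of the true minimum.

    Hypothetical helper for illustration only; not part of metasyn.
    """
    if len(dates) < k:
        raise ValueError(f"Need at least {k} values to aggregate.")
    earliest = sorted(dates)[:k]
    # Average the offsets from the first date, then add the mean offset back.
    origin = earliest[0]
    mean_offset = sum((d - origin for d in earliest), dt.timedelta(0)) / k
    return origin + mean_offset


# Made-up sample dates, not taken from the demo dataset.
dates = [dt.date(2022, 7, 15) + dt.timedelta(days=3 * i) for i in range(20)]

true_min = min(dates)
safe_min = aggregated_lower_bound(dates, k=5)
print(f"True minimum:       {true_min}")
print(f"Aggregated minimum: {safe_min}")
```

Because the reported bound is an average over several records, no single record's value can be read directly from it.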

## Single outliers

Below we look at what happens to the fitted parameters when we add a single new value (an outlier) to the data. We do this for both the base metasyn implementation and the disclosure control implementation. Because disclosure control rules are designed to prevent individual records from dominating the output, we expect a single outlier to have a smaller (and limited) effect on the results than it has with the base metasyn implementation.
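To make this expectation concrete, here is a minimal sketch in plain Python of the kind of comparison we are about to plot. It uses two stand-in estimators for an upper bound: the raw maximum, and the mean of the `k` largest values. Both functions, the value `k=5`, and the data are hypothetical and chosen only for illustration; the plugin's actual fitting code may differ and may apply further safeguards.

```python
def naive_upper_bound(values):
    """Use the largest observed value directly."""
    return max(values)


def averaged_upper_bound(values, k=5):
    """Hypothetical stand-in: mean of the k largest values."""
    return sum(sorted(values)[-k:]) / k


def shift_from_outlier(estimator, values, outlier):
    """Change in the estimate after appending one extra value."""
    return estimator(values + [outlier]) - estimator(values)


# Made-up data: 0.0, 1.0, ..., 49.0
data = [float(i) for i in range(50)]

naive_shift = shift_from_outlier(naive_upper_bound, data, 100.0)
averaged_shift = shift_from_outlier(averaged_upper_bound, data, 100.0)
print(f"Shift of naive upper bound:    {naive_shift}")
print(f"Shift of averaged upper bound: {averaged_shift}")
```

In this sketch the naive bound moves all the way to the outlier, while the averaged bound moves by only a fraction of that distance, since the outlier is just one of the `k` values being averaged.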

Define the plotting function.

In [None]:
from metasyn.distribution import MultinoulliDistribution

def plot_outliers(dist_type, series_size=50):
    """Plot how each fitted parameter shifts when one extra value is appended.

    For every disclosure distribution of the given variable type, the shift is
    compared with that of the base distribution it implements.
    """
    dist_providers = DistributionProviderList(["builtin", "metasyn-disclosure"])
    disc_distributions = dist_providers.get_distributions(var_type=dist_type, privacy=DisclosurePrivacy())
    
    for disc_class in disc_distributions:
        if issubclass(disc_class, MultinoulliDistribution):
            continue
        base_class = dist_providers.find_distribution(disc_class.implements, disc_class.var_type)

        dist = base_class.default_distribution()
        series = pl.Series([dist.draw() for _ in range(series_size)])
        clean_base_param = base_class.fit(series).to_dict()["parameters"]
        clean_disc_param = disc_class.fit(series).to_dict()["parameters"]

        base_param = defaultdict(list)
        disc_param = defaultdict(list)
        def _add(parameters, param, new_val):
            for key, val in param.items():
                parameters[key].append(val)
            parameters["new_val"].append(new_val)

        for new_val in np.linspace(-100, 100, 51):
            new_series = series.extend_constant(new_val, 1)
            base_dist = base_class.fit(new_series)
            disc_dist = disc_class.fit(new_series)
            _add(base_param, base_dist.to_dict()["parameters"], new_val)
            _add(disc_param, disc_dist.to_dict()["parameters"], new_val)

        for param in base_param:
            if param == "new_val":
                continue
            plt.plot(base_param["new_val"], np.array(base_param[param]) - clean_base_param[param], label="base")
            plt.plot(disc_param["new_val"], np.array(disc_param[param]) - clean_disc_param[param], label="disclosure")
            plt.title(f"{disc_class.__name__}: {param}")
            plt.ylabel("Difference between dist with and without outlier")
            plt.xlabel("Value of the outlier")
            plt.legend()
            plt.show()


### Graphs for all continuous distributions

In [None]:
plot_outliers("continuous")

As we can see, the effect of an outlier is much smaller than with the baseline implementation: for all of the continuous distributions, the fitted parameters change far less when an outlier is added.

### Graphs for all discrete distributions

In [None]:
plot_outliers("discrete")

The same is true for the discrete distributions: only a localized effect is present.