# Case Study 3 - Privacy

## The Task
Create a private version of the Brazil COVID-19 dataset that anyone could safely use to build a COVID-19 survival analysis model, without the risk of (re-)identifying individuals.

### Imports
Let's get the imports out of the way. We import the required standard and 3rd party libraries and the relevant Synthcity modules. We can also set the logging level here, using Synthcity's bespoke logger.

In [None]:
# Standard
import sys
import warnings
from pathlib import Path

# 3rd party
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# synthcity
import synthcity.logger as log
from synthcity.utils import serialization
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import (GenericDataLoader, SurvivalAnalysisDataLoader)
from synthcity.metrics import Metrics

# Configure warnings and logging
warnings.filterwarnings("ignore")

# Set the level for the logging
# log.add(sink=sys.stderr, level="DEBUG")
log.remove()

### Load the data

Load the data from file into a `SurvivalAnalysisDataLoader` object. For this we need to pass the names of our `target_column` and our `time_to_event_column` to the data loader. We can then view the data by calling `loader.dataframe()` and get information about the data loader object with `loader.info()`.

In [None]:
X = pd.read_csv("../data/Brazil_COVID/covid_normalised_numericalised.csv")
loader = SurvivalAnalysisDataLoader(
    X,
    target_column="is_dead",
    time_to_event_column="Days_hospital_to_outcome",
    sensitive_features=["Age", "Sex", "Ethnicity", "Region"],
    random_state=42,
)

display(loader.dataframe())
# display(loader.info())

## Synthetic generators

We can list the available synthetic generators by calling `list()` on the `Plugins` object. Here we restrict the list to the "generic" and "survival_analysis" categories.

In [None]:
print(Plugins(categories=["generic", "survival_analysis"]).list())

### Create synthetic datasets

From the above list we are going to select synthetic generation models designed for privacy. Candidates include "privbayes", "dpgan", "adsgan", and "pategan"; the cell below uses "dpgan", "adsgan", and "pategan". For each one we create and fit the model, generate a synthetic dataset, and save the fitted model to disk.

In [None]:
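outdir = Path("saved_models")
outdir.mkdir(parents=True, exist_ok=True)  # make sure the save directory exists
prefix = "privacy"
n_iter = 100
models = [
    "dpgan",
    "adsgan",
    "pategan",
]
for model in models:
    save_file = outdir / f"{prefix}.{model}_numericalised_n_iter={n_iter}_4.bkp"
    # Only train when no saved model exists for this configuration
    if not save_file.exists():
        print(model)
        # Pass n_iter so the trained model matches the value recorded in the file name
        syn_model = Plugins().get(model, n_iter=n_iter)
        syn_model.fit(loader)
        # Generate once as a check that the fitted model can sample
        syn_model.generate(count=6882).dataframe()
        serialization.save_to_file(save_file, syn_model)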

### Evaluate the generated synthetic dataset in terms of privacy

Next we choose the metrics to evaluate. The full list of available metrics can be seen by calling `Metrics().list()`. We are going to use the metrics associated with detection of the synthetic data and with data privacy, and then display them in a dataframe to look at the results.

In [None]:
eval_results = {}
for model in models:
    print(model)
    save_file = outdir / f"{prefix}.{model}_numericalised_n_iter={n_iter}_4.bkp"
    if Path(save_file).exists():
        syn_model = serialization.load_from_file(save_file)
        selected_metrics = {
            "detection": ["detection_xgb", "detection_mlp", "detection_gmm"],
            "privacy": ["delta-presence", "k-anonymization", "k-map", "distinct l-diversity", "identifiability_score"],
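            "performance": ["linear_model", "mlp", "xgb", "feat_rank_distance"],
        }
        my_metrics = Metrics()
        # Keep only the categories named above. Note that this evaluates every metric
        # Synthcity lists in those categories, not just the names in selected_metrics.
        selected_metrics_in_my_metrics = {
            k: my_metrics.list()[k]
            for k in my_metrics.list().keys() & selected_metrics.keys()
        }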
        X_syn = syn_model.generate(count=6882)
        evaluation = my_metrics.evaluate(
            loader,
            X_syn,
            task_type="survival_analysis",
            metrics=selected_metrics_in_my_metrics,
            workspace="workspace",
        )
        display(evaluation)
        eval_results[model] = evaluation

The above tables contain all the information we need to evaluate the methods, but let's convert them to a format in which the methods are easier to compare.

In [None]:
means = []
for plugin in eval_results:
    data = eval_results[plugin]["mean"]
    directions = eval_results[plugin]["direction"].to_dict()
    means.append(data)

out = pd.concat(means, axis=1)
# Name each column after its plugin (set_axis(..., inplace=True) was removed in pandas 2.0)
out.columns = list(eval_results.keys())

bad_highlight = "background-color: lightcoral;"
ok_highlight = "background-color: green;"
default = ""


def highlights(row):
    metric = row.name
    if directions[metric] == "minimize":
        best_val = np.min(row.values)
        worst_val = np.max(row)
    else:
        best_val = np.max(row.values)
        worst_val = np.min(row)

    styles = []
    for val in row.values:
        if val == best_val:
            styles.append(ok_highlight)
        elif val == worst_val:
            styles.append(bad_highlight)
        else:
            styles.append(default)

    return styles


out.style.apply(highlights, axis=1)

### Results of evaluation

We use two types of metric here to discuss privacy: detection and privacy. Detection metrics measure how well the real data can be told apart from the synthetic data. This matters for privacy because an attacker who can pick out the real patients in a dataset can then focus on that subset of real records when trying to re-identify individuals. In other words, the ability to identify the real records reduces the chance of a patient being lost in a crowd of similar synthetic records. The privacy metrics measure how easy it would be to re-identify a patient given the quasi-identifying fields in the dataset.
Generally, ADSGAN performs best in the synthetic data detection tasks, followed by PATEGAN, while DPGAN tends to perform very poorly.
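
To make the idea behind the detection metrics concrete, here is a minimal sketch. It is not Synthcity's implementation: the `detection_auc` helper, the toy Gaussian arrays, and the choice of logistic regression as the detector are all illustrative assumptions. A classifier is trained to separate real rows from synthetic rows and is scored by AUROC on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def detection_auc(real, synthetic, seed=0):
    """AUROC of a classifier separating real rows (label 0) from synthetic rows (label 1)."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])


# Toy data (assumption): "good" synthetic data shares the real distribution,
# "poor" synthetic data is clearly shifted away from it.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 4))
good_synthetic = rng.normal(0.0, 1.0, size=(500, 4))
poor_synthetic = rng.normal(3.0, 1.0, size=(500, 4))

auc_good = detection_auc(real, good_synthetic)
auc_poor = detection_auc(real, poor_synthetic)
print(f"hard to detect: {auc_good:.2f}, easy to detect: {auc_poor:.2f}")
```

An AUROC near 0.5 means the detector does no better than random chance, while a value near 1 means the synthetic rows are trivially separable from the real ones.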

k-anonymization - the risk of re-identification is approximately 1/k according to [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528029/). Therefore the risk of re-identification is < 2% for DPGAN or ADSGAN and < 3% for PATEGAN. In every case there is a huge improvement over the ground truth, where k=3. For PATEGAN there is a >11-fold increase in k.
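
As a rough illustration of the 1/k rule, the sketch below computes k by exact matching on quasi-identifiers: k is the size of the smallest group of rows that share the same quasi-identifier values. The `k_anonymity` helper and the toy table are assumptions made for this example, and Synthcity's own metric may be computed differently.

```python
import pandas as pd


def k_anonymity(df, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    return int(df.groupby(quasi_identifiers).size().min())


# Toy table (assumption): two quasi-identifier groups, of sizes 3 and 4.
toy = pd.DataFrame(
    {
        "Age": [30, 30, 30, 45, 45, 45, 45],
        "Sex": [0, 0, 0, 1, 1, 1, 1],
        "is_dead": [0, 1, 0, 0, 0, 1, 1],
    }
)

k = k_anonymity(toy, ["Age", "Sex"])
risk = 1 / k
print(f"k = {k}, approximate re-identification risk = {risk:.1%}")
# prints: k = 3, approximate re-identification risk = 33.3%
```

By the same arithmetic, a risk below 2% corresponds to k above 50.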

k-map - a metric that requires every combination of values for the quasi-identifiers to appear at least k times in the synthetic dataset. ADSGAN performs worse than PATEGAN, but DPGAN comes out on top.

l-diversity - a similar metric to k-anonymization, but it is also concerned with the diversity of the generalized block. We see the same pattern as for k-anonymization.
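
The difference from k-anonymization can be seen in a small sketch of distinct l-diversity, taken here as the smallest number of distinct sensitive values found in any quasi-identifier group. The `distinct_l_diversity` helper, the exact-matching grouping, and the hypothetical `Diagnosis` column are assumptions for illustration only, not Synthcity's implementation.

```python
import pandas as pd


def distinct_l_diversity(df, quasi_identifiers, sensitive_column):
    """Smallest number of distinct sensitive values in any quasi-identifier group."""
    return int(df.groupby(quasi_identifiers)[sensitive_column].nunique().min())


# Toy table (assumption): both groups contain 3 rows, so the table is 3-anonymous,
# but every row in the second group has the same diagnosis.
toy = pd.DataFrame(
    {
        "Age": [30, 30, 30, 45, 45, 45],
        "Sex": [0, 0, 0, 1, 1, 1],
        "Diagnosis": ["a", "b", "c", "a", "a", "a"],
    }
)

l_value = distinct_l_diversity(toy, ["Age", "Sex"], "Diagnosis")
print(f"distinct l-diversity = {l_value}")
# prints: distinct l-diversity = 1
```

Here an attacker who knows a patient is aged 45 with `Sex` equal to 1 learns the diagnosis without singling out a row, which is the kind of leak that k-anonymization alone does not capture.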

identifiability_score - the risk of re-identification as defined in [this paper](https://ieeexplore.ieee.org/document/9034117). DPGAN scores best here; ADSGAN and PATEGAN perform worse.
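
One simplified way to picture a score of this kind is the fraction of real rows that lie closer to some synthetic row than to any other real row. The sketch below uses plain, unweighted Euclidean distance and a hypothetical `identifiability_sketch` helper, so it may differ from both the paper's definition and Synthcity's implementation.

```python
import numpy as np


def identifiability_sketch(real, synthetic):
    """Fraction of real rows whose nearest synthetic row is closer than their nearest other real row."""
    # Pairwise real-to-real distances, with self-distances masked out.
    d_real = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=2)
    np.fill_diagonal(d_real, np.inf)
    nearest_real = d_real.min(axis=1)
    # Pairwise real-to-synthetic distances.
    d_syn = np.linalg.norm(real[:, None, :] - synthetic[None, :, :], axis=2)
    nearest_syn = d_syn.min(axis=1)
    return float(np.mean(nearest_syn < nearest_real))


# Toy data (assumption): "leaky" rows are near-copies of real rows,
# "distant" rows are far away from every real row.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
leaky = real + rng.normal(scale=1e-3, size=real.shape)
distant = rng.normal(loc=10.0, size=(200, 3))

score_leaky = identifiability_sketch(real, leaky)
score_distant = identifiability_sketch(real, distant)
print(f"near-copies: {score_leaky:.2f}, distant samples: {score_distant:.2f}")
```

Lower is better: near-copies of real records push the score towards 1, while synthetic rows that never sit closer to a real record than its real neighbours keep it near 0.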

### Conclusion
Generally, it seems DPGAN performs best in the privacy metrics, but its synthetic data is completely distinguishable from the real data by multiple detection algorithms, significantly reducing its utility. ADSGAN performs best in the detection metrics, to the point that detection is not much better than random chance, with PATEGAN second best. Both, however, perform worse than DPGAN on privacy. These trade-offs need balancing to find the best solution for your use case.

## Synthetic Data Quality

To get a good sense of the quality of the synthetic datasets and validate our previous conclusion, let's plot the correlation/strength-of-association of features in a dataset with both categorical and continuous features using:
- Pearson's R for continuous-continuous cases
- Correlation Ratio for categorical-continuous cases
- Cramer's V or Theil's U for categorical-categorical cases

In each of the following plots we are looking for the synthetic data to be as similar to the real data as possible. That means minimal values for the Jensen-Shannon distance and the pairwise correlation distance, and t-SNE plots with similar-looking distributions in the representation space.
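
For reference, the Jensen-Shannon distance between two histograms can be sketched in a few lines of NumPy. The `jensen_shannon_distance` helper, the base-2 logarithm, and the toy samples are assumptions for illustration and do not reproduce what the Synthcity plots compute internally.

```python
import numpy as np


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))  # guard against tiny negative rounding


# Toy marginals (assumption): one synthetic feature matches the real one, one is shifted.
rng = np.random.default_rng(0)
bins = np.linspace(-4, 8, 31)
real_hist, _ = np.histogram(rng.normal(0, 1, 5000), bins=bins)
close_hist, _ = np.histogram(rng.normal(0, 1, 5000), bins=bins)
far_hist, _ = np.histogram(rng.normal(4, 1, 5000), bins=bins)

js_close = jensen_shannon_distance(real_hist, close_hist)
js_far = jensen_shannon_distance(real_hist, far_hist)
print(f"matching marginal: {js_close:.3f}, shifted marginal: {js_far:.3f}")
```

With base-2 logarithms the distance ranges from 0 for identical distributions to 1 for distributions with no overlap.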

In [None]:
import matplotlib.pyplot as plt
for model in models:
    print(model)
    # Use the same file name pattern as when the models were saved
    save_file = outdir / f"{prefix}.{model}_numericalised_n_iter={n_iter}_4.bkp"
    if save_file.exists():
        syn_model = serialization.load_from_file(save_file)
        syn_model.plot(plt, loader, plots=["associations", "marginal", "tsne"])
        plt.show()

## Training models on both sets of data

Please now train your own model on both the original dataset and each of the private datasets we have generated to see if you reach the same conclusion. Which privacy method provides the best performance and what are the trade-offs?
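
As one possible starting point, the sketch below shows a train-on-synthetic, test-on-real comparison. Everything in it is an assumption for illustration: the `train_and_score` and `make_toy` helpers, the toy tables standing in for the real and synthetic data, and the choice of logistic regression with AUROC. It also treats `is_dead` as a plain binary target and ignores the time-to-event column, so a proper survival model would need a different setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_and_score(train_df, test_df, target):
    """Fit on train_df and report AUROC on test_df, which should be held-out real data."""
    model = LogisticRegression(max_iter=1000)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    scores = model.predict_proba(test_df.drop(columns=[target]))[:, 1]
    return roc_auc_score(test_df[target], scores)


def make_toy(n, seed):
    """Hypothetical stand-in for a real or synthetic table with a binary outcome."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return pd.DataFrame(X, columns=["f0", "f1", "f2"]).assign(is_dead=y)


real = make_toy(1000, seed=0)
synthetic = make_toy(1000, seed=1)  # in the notebook this would come from syn_model.generate(...)

# Always evaluate on real rows that neither model saw during training.
real_train, real_test = train_test_split(real, test_size=0.3, random_state=0)
auc_real = train_and_score(real_train, real_test, "is_dead")
auc_syn = train_and_score(synthetic, real_test, "is_dead")
print(f"trained on real: {auc_real:.2f}, trained on synthetic: {auc_syn:.2f}")
```

The gap between the two scores gives a rough view of how much predictive utility each privacy method costs.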