# Case Study 3 - Privacy
These notebooks are also available on Google Colab. This enables you to run the notebooks without having to set up an environment locally and gives you access to GPUs to run the notebooks on.
 
[![Run in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JOstMJmhI2wcufyBqZ1iV3YqOdThJ-_U?usp=sharing#scrollTo=05sK9fusnBp8)

## 1. Introduction
Machine learning (ML) is empowering more and more communities by using their historical datasets. Unfortunately, some sectors and use cases have been precluded from the benefits of ML, due to the requirement of their data to remain private. In this case study we will look at methods that aim to solve this problem by creating synthetic datasets that are not bound by the constraints of privacy.

### 1.1 The Task
Make a private version of the Brazil COVID-19 dataset that could safely be used by anyone to create a COVID-19 survival analysis model, without the risk of (re-)identification of individuals.

## 2. Imports
Let's get the imports out of the way. We import the required standard and third-party libraries and the relevant Synthcity modules. We can also set the logging level here, using Synthcity's bespoke logger.

In [1]:
# Standard
import sys
import warnings
from pathlib import Path

# 3rd party
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# synthcity
import synthcity.logger as log
from synthcity.utils import serialization
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import (GenericDataLoader, SurvivalAnalysisDataLoader)
from synthcity.metrics import Metrics

# Configure warnings and logging
warnings.filterwarnings("ignore")

# Set the level for the logging
# log.add(sink=sys.stderr, level="DEBUG")
log.remove()


## 3. Load the data

Load the data from file into a `SurvivalAnalysisDataLoader` object. For this we need to pass the names of our `target_column` and our `time_to_event_column` to the data loader. We can then view the data by calling `loader.dataframe()` and get information about the data loader object with `loader.info()`.

In [2]:
X = pd.read_csv("../data/Brazil_COVID/covid_normalised_numericalised.csv")
loader = SurvivalAnalysisDataLoader(
    X,
    target_column="is_dead",
    time_to_event_column="Days_hospital_to_outcome",
    sensitive_features=["Age", "Sex", "Ethnicity", "Region"],
    random_state=42,
)

print(loader.info())
display(loader.dataframe())

{'data_type': 'survival_analysis', 'len': 6569, 'static_features': ['is_dead', 'Days_hospital_to_outcome', 'Age', 'Sex', 'Ethnicity', 'Region', 'Fever', 'Cough', 'Sore_throat', 'Shortness_of_breath', 'Respiratory_discomfort', 'SPO2', 'Dihareea', 'Vomitting', 'Cardiovascular', 'Asthma', 'Diabetis', 'Pulmonary', 'Immunosuppresion', 'Obesity', 'Liver', 'Neurologic', 'Renal'], 'sensitive_features': ['Age', 'Sex', 'Ethnicity', 'Region'], 'important_features': [], 'outcome_features': ['is_dead'], 'target_column': 'is_dead', 'time_to_event_column': 'Days_hospital_to_outcome', 'time_horizons': [16.25, 32.5, 48.75], 'train_size': 0.8}


Unnamed: 0,is_dead,Days_hospital_to_outcome,Age,Sex,Ethnicity,Region,Fever,Cough,Sore_throat,Shortness_of_breath,...,Vomitting,Cardiovascular,Asthma,Diabetis,Pulmonary,Immunosuppresion,Obesity,Liver,Neurologic,Renal
0,0,3,1,0,0,2,1,1,0,1,...,0,0,1,0,0,0,0,0,0,0
1,0,7,75,0,0,2,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,6,81,1,0,2,1,1,0,1,...,0,1,0,1,0,0,0,0,0,0
5,1,7,64,1,1,4,0,0,0,1,...,0,1,0,1,0,0,0,0,0,0
6,1,5,62,1,1,4,1,1,0,1,...,0,1,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6877,0,5,52,0,0,4,1,1,0,1,...,0,0,0,1,0,0,0,0,0,0
6878,0,2,34,1,1,4,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
6879,0,18,44,1,1,4,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
6880,0,17,23,1,1,4,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4. Load/Create synthetic datasets

We can list the available synthetic generators by calling list() on the Plugins object.

In [3]:
print(Plugins().list())

['fflows', 'survival_nflow', 'ctgan', 'pategan', 'dpgan', 'adsgan', 'bayesian_network', 'timevae', 'rtvae', 'decaf', 'timegan', 'tvae', 'privbayes', 'survival_ctgan', 'radialgan', 'survival_gan', 'nflow', 'survae']


From the above list we are going to select the synthetic generation models for privacy: "dpgan", "adsgan", and "pategan". Then we will create and fit the synthetic model before using it to generate a synthetic dataset.

In [4]:
outdir = Path("saved_models")
prefix = "privacy"
n_iter = 100
random_state=42
models=[
    "dpgan",
    "adsgan",
    "pategan",
]

For each model, check whether a saved version already exists. If not, use `get()` and `fit()` to create and train one, then save it to file.

In [5]:
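# Note: n_iter is only used in the file name below; it is not passed to the plugin.
outdir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists

for model in models:
    save_file = outdir / f"{prefix}.{model}_numericalised_n_iter={n_iter}_rnd={random_state}.bkp"

    # Train and save only if no saved model exists for this configuration
    if not save_file.exists():
        print(model)
        syn_model = Plugins().get(model, random_state=random_state)
        syn_model.fit(loader)
        serialization.save_to_file(save_file, syn_model)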
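
The loop above follows a train-once, reuse-later pattern. The sketch below shows that pattern on its own, without Synthcity. The names `load_or_train` and `train_toy_model` are hypothetical, and `pickle` is used here only as a stand-in for Synthcity's `serialization` helpers.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(save_file, train_fn):
    """Return (model, was_trained): load from save_file if it exists, else train and save."""
    save_file = Path(save_file)
    if save_file.exists():
        with save_file.open("rb") as f:
            return pickle.load(f), False
    model = train_fn()
    save_file.parent.mkdir(parents=True, exist_ok=True)
    with save_file.open("wb") as f:
        pickle.dump(model, f)
    return model, True


# Hypothetical stand-in for an expensive fit() call.
def train_toy_model():
    return {"name": "toy", "weights": [0.1, 0.2, 0.3]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "saved_models" / "privacy.toy.bkp"
    first_model, first_trained = load_or_train(path, train_toy_model)     # trains and saves
    second_model, second_trained = load_or_train(path, train_toy_model)   # loads from disk
```

Because the file name encodes the configuration, changing a setting such as the random seed produces a new file name and therefore triggers a fresh training run.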

## 5. Evaluate the generated synthetic dataset in terms of privacy

Next we select the metrics to evaluate. The full list of available metrics can be seen by calling `Metrics().list()`. We are going to use the metrics associated with detection of the synthetic data and with data privacy, then display the results as a dataframe.

In [6]:
eval_results = {}
for model in models:
    print(model)
    save_file = outdir / f"{prefix}.{model}_numericalised_n_iter={n_iter}_rnd={random_state}.bkp"
    syn_model = serialization.load_from_file(save_file)
    selected_metrics = {
        "detection": ["detection_xgb", "detection_mlp", "detection_gmm"],
        "privacy": ["delta-presence", "k-anonymization", "k-map", "distinct l-diversity", "identifiability_score"],
        "performance": ["linear_model", "mlp", "xgb"],
    }
    my_metrics = Metrics()
    # Note: only the category names are matched here, so every available metric
    # in each selected category is evaluated, not just the ones named above.
    selected_metrics_in_my_metrics = {k: my_metrics.list()[k] for k in my_metrics.list().keys() & selected_metrics.keys()}
    X_syn = syn_model.generate(count=6882, random_state=random_state)
    evaluation = my_metrics.evaluate(
        loader,
        X_syn,
        task_type="survival_analysis",
        metrics=selected_metrics_in_my_metrics,
        workspace="workspace",
    )
    # Keep only the metrics we want to display
    display_metrics = [
        "performance.xgb.syn_ood.c_index",
        "performance.linear_model.syn_ood.c_index",
        "performance.mlp.syn_ood.c_index",
        "detection.detection_xgb.mean",
        "detection.detection_mlp.mean",
        "detection.detection_gmm.mean",
        "detection.detection_linear.mean",
        "privacy.k-anonymization.syn",
        "privacy.k-map.score",
        "privacy.distinct l-diversity.syn",
        "privacy.identifiability_score.score",
    ]
    evaluation = evaluation.loc[display_metrics]
    display(evaluation)
    eval_results[model] = evaluation

dpgan


Unnamed: 0,min,max,mean,stddev,median,iqr,rounds,errors,durations,direction
performance.linear_model.gt.c_index,0.652484,0.652484,0.652484,0.0,0.652484,0.0,1,0,2.38,maximize
performance.linear_model.gt.brier_score,0.226254,0.226254,0.226254,0.0,0.226254,0.0,1,0,2.38,maximize
performance.linear_model.syn_id.c_index,0.458031,0.458031,0.458031,0.0,0.458031,0.0,1,0,2.38,maximize
performance.linear_model.syn_id.brier_score,0.891738,0.891738,0.891738,0.0,0.891738,0.0,1,0,2.38,maximize
performance.linear_model.syn_ood.c_index,0.438443,0.438443,0.438443,0.0,0.438443,0.0,1,0,2.38,maximize
performance.linear_model.syn_ood.brier_score,0.85051,0.85051,0.85051,0.0,0.85051,0.0,1,0,2.38,maximize
performance.mlp.gt.c_index,0.63667,0.63667,0.63667,0.0,0.63667,0.0,1,0,18.03,maximize
performance.mlp.gt.brier_score,0.116824,0.116824,0.116824,0.0,0.116824,0.0,1,0,18.03,maximize
performance.mlp.syn_id.c_index,0.466759,0.466759,0.466759,0.0,0.466759,0.0,1,0,18.03,maximize
performance.mlp.syn_id.brier_score,0.204943,0.204943,0.204943,0.0,0.204943,0.0,1,0,18.03,maximize


adsgan


Unnamed: 0,min,max,mean,stddev,median,iqr,rounds,errors,durations,direction
performance.linear_model.gt.c_index,0.652484,0.652484,0.652484,0.0,0.652484,0.0,1,0,2.36,maximize
performance.linear_model.gt.brier_score,0.2262538,0.2262538,0.2262538,0.0,0.2262538,0.0,1,0,2.36,maximize
performance.linear_model.syn_id.c_index,0.6509711,0.6509711,0.6509711,0.0,0.6509711,0.0,1,0,2.36,maximize
performance.linear_model.syn_id.brier_score,0.3871931,0.3871931,0.3871931,0.0,0.3871931,0.0,1,0,2.36,maximize
performance.linear_model.syn_ood.c_index,0.6677291,0.6677291,0.6677291,0.0,0.6677291,0.0,1,0,2.36,maximize
performance.linear_model.syn_ood.brier_score,0.3754579,0.3754579,0.3754579,0.0,0.3754579,0.0,1,0,2.36,maximize
performance.mlp.gt.c_index,0.6366697,0.6366697,0.6366697,0.0,0.6366697,0.0,1,0,18.29,maximize
performance.mlp.gt.brier_score,0.1168239,0.1168239,0.1168239,0.0,0.1168239,0.0,1,0,18.29,maximize
performance.mlp.syn_id.c_index,0.5028734,0.5028734,0.5028734,0.0,0.5028734,0.0,1,0,18.29,maximize
performance.mlp.syn_id.brier_score,0.122484,0.122484,0.122484,0.0,0.122484,0.0,1,0,18.29,maximize


pategan


Unnamed: 0,min,max,mean,stddev,median,iqr,rounds,errors,durations,direction
performance.linear_model.gt.c_index,0.652484,0.652484,0.652484,0.0,0.652484,0.0,1,0,2.27,maximize
performance.linear_model.gt.brier_score,0.226254,0.226254,0.226254,0.0,0.226254,0.0,1,0,2.27,maximize
performance.linear_model.syn_id.c_index,0.480786,0.480786,0.480786,0.0,0.480786,0.0,1,0,2.27,maximize
performance.linear_model.syn_id.brier_score,0.138065,0.138065,0.138065,0.0,0.138065,0.0,1,0,2.27,maximize
performance.linear_model.syn_ood.c_index,0.455805,0.455805,0.455805,0.0,0.455805,0.0,1,0,2.27,maximize
performance.linear_model.syn_ood.brier_score,0.188725,0.188725,0.188725,0.0,0.188725,0.0,1,0,2.27,maximize
performance.mlp.gt.c_index,0.63667,0.63667,0.63667,0.0,0.63667,0.0,1,0,16.4,maximize
performance.mlp.gt.brier_score,0.116824,0.116824,0.116824,0.0,0.116824,0.0,1,0,16.4,maximize
performance.mlp.syn_id.c_index,0.499014,0.499014,0.499014,0.0,0.499014,0.0,1,0,16.4,maximize
performance.mlp.syn_id.brier_score,0.126214,0.126214,0.126214,0.0,0.126214,0.0,1,0,16.4,maximize


### 5.1 Display the evaluation results
The tables above contain all the information we need to evaluate the methods, but let's convert them to a format in which the methods are easier to compare.

In [7]:
means = []
for plugin in eval_results:
    data = eval_results[plugin]["mean"]
    directions = eval_results[plugin]["direction"].to_dict()
    means.append(data)

out = pd.concat(means, axis=1)
# set_axis no longer accepts inplace in pandas >= 2.0, so reassign the result instead
out = out.set_axis(list(eval_results.keys()), axis=1)

bad_highlight = "background-color: lightcoral;"
ok_highlight = "background-color: green;"
default = ""


def highlights(row):
    metric = row.name
    if directions[metric] == "minimize":
        best_val = np.min(row.values)
        worst_val = np.max(row)
    else:
        best_val = np.max(row.values)
        worst_val = np.min(row)

    styles = []
    for val in row.values:
        if val == best_val:
            styles.append(ok_highlight)
        elif val == worst_val:
            styles.append(bad_highlight)
        else:
            styles.append(default)

    return styles


out.style.apply(highlights, axis=1)

Unnamed: 0,dpgan,adsgan,pategan
performance.linear_model.gt.c_index,0.652484,0.652484,0.652484
performance.linear_model.gt.brier_score,0.226254,0.226254,0.226254
performance.linear_model.syn_id.c_index,0.458031,0.650971,0.480786
performance.linear_model.syn_id.brier_score,0.891738,0.387193,0.138065
performance.linear_model.syn_ood.c_index,0.438443,0.667729,0.455805
performance.linear_model.syn_ood.brier_score,0.85051,0.375458,0.188725
performance.mlp.gt.c_index,0.63667,0.63667,0.63667
performance.mlp.gt.brier_score,0.116824,0.116824,0.116824
performance.mlp.syn_id.c_index,0.466759,0.502873,0.499014
performance.mlp.syn_id.brier_score,0.204943,0.122484,0.126214


### 5.2 Results of evaluation

We are using two types of metric here to discuss privacy: detection and privacy. Detection metrics measure how well a classifier can tell the real data apart from the synthetic data. The privacy metrics measure how easy it would be to re-identify a patient given the quasi-identifying fields in the dataset.
Generally, ADSGAN's synthetic data is the hardest to distinguish from the real data, followed by PATEGAN's, while DPGAN's synthetic data tends to be detected very easily.
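
To make the detection idea concrete, here is a minimal sketch of how such a score can be read. It assumes the detection score behaves like an area under the ROC curve (AUROC), where 0.5 means the detector does no better than chance and 1.0 means the synthetic rows are perfectly separable. This is an assumption for illustration; the detector scores below are hypothetical and no classifier is actually trained.

```python
def auroc(real_scores, synthetic_scores):
    """Probability that a random synthetic row is scored above a random real row (ties count half)."""
    wins = 0.0
    for s in synthetic_scores:
        for r in real_scores:
            if s > r:
                wins += 1.0
            elif s == r:
                wins += 0.5
    return wins / (len(real_scores) * len(synthetic_scores))


# Hypothetical detector outputs: a higher score means "looks synthetic".
real = [0.1, 0.4, 0.6, 0.9]
syn_indistinguishable = [0.1, 0.4, 0.6, 0.9]  # same score distribution as the real rows
syn_obvious = [1.5, 1.7, 2.0, 2.2]            # always scored above the real rows

auc_indistinguishable = auroc(real, syn_indistinguishable)  # 0.5: chance level
auc_obvious = auroc(real, syn_obvious)                      # 1.0: fully detectable
```

Under this reading, a lower detection score is better for the synthetic data, because it means the detector struggles to separate it from the real data.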

k-anonymization - The risk of re-identification is approximately 1/k according to [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2528029/). The risk is therefore < 2% for DPGAN and ADSGAN, and < 3% for PATEGAN. In every case this is a huge improvement on the ground truth, where k=3 (a risk of roughly 33%). For PATEGAN, this is a >11-fold increase in k.
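
As a small illustration of the 1/k rule, the sketch below computes k for a toy table using the textbook definition: k is the size of the smallest group of records that share the same quasi-identifier values. The records and the helper name `k_anonymity` are hypothetical, and Synthcity's own implementation of this metric may differ.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(groups.values())


# Hypothetical toy records, not taken from the Brazil COVID-19 data.
toy = [
    {"Age": 60, "Sex": 1, "Region": 2, "is_dead": 0},
    {"Age": 60, "Sex": 1, "Region": 2, "is_dead": 1},
    {"Age": 60, "Sex": 1, "Region": 2, "is_dead": 0},
    {"Age": 34, "Sex": 0, "Region": 4, "is_dead": 0},
    {"Age": 34, "Sex": 0, "Region": 4, "is_dead": 0},
]

k = k_anonymity(toy, ["Age", "Sex", "Region"])  # the smallest group has 2 records
risk = 1 / k  # approximate re-identification risk under the 1/k rule
```

By the same arithmetic, a risk below 2% requires k above 50, and a risk below 3% requires k above 33.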

k-map - A metric that requires every combination of values for the quasi-identifiers to appear at least k times in the synthetic dataset. ADSGAN performs worse than PATEGAN, but DPGAN comes out on top.

l-diversity - A metric similar to k-anonymization, but it is also concerned with the diversity of the generalized block. We see the same pattern as for k-anonymization.
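
The difference from k-anonymization is easiest to see on a toy example. The sketch below uses the textbook definition of distinct l-diversity: l is the smallest number of distinct sensitive values found within any quasi-identifier group. The records, the choice of `is_dead` as the sensitive value, and the helper name are illustrative assumptions; Synthcity's implementation may differ.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any quasi-identifier group."""
    values_per_group = defaultdict(set)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        values_per_group[key].add(rec[sensitive])
    return min(len(values) for values in values_per_group.values())


# Hypothetical toy records, not taken from the Brazil COVID-19 data.
toy = [
    {"Age": 60, "Sex": 1, "Region": 2, "is_dead": 0},
    {"Age": 60, "Sex": 1, "Region": 2, "is_dead": 1},
    {"Age": 60, "Sex": 1, "Region": 2, "is_dead": 0},
    {"Age": 34, "Sex": 0, "Region": 4, "is_dead": 0},
    {"Age": 34, "Sex": 0, "Region": 4, "is_dead": 0},
]

l = distinct_l_diversity(toy, ["Age", "Sex", "Region"], "is_dead")
```

Here every group contains at least two records, yet `l` is 1: everyone in the second group shares the same outcome, so knowing that a person belongs to that group reveals their outcome even though they cannot be singled out.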

identifiability_score - The risk of re-identification as defined in [this paper](https://ieeexplore.ieee.org/document/9034117). DPGAN scores best here; ADSGAN and PATEGAN perform worse.

**Conclusion**<br/>
Generally, DPGAN performs best on the privacy metrics, but its synthetic data is completely distinguishable from the real data by multiple detection algorithms, which significantly reduces its utility. ADSGAN performs best on the detection metrics, with detection not much better than random chance, and PATEGAN is second best. Both, however, perform worse than DPGAN on privacy. These factors need to be balanced to find the best solution for your use case.

## 6. Synthetic Data Quality

To get a good sense of the quality of the synthetic datasets and to validate our previous conclusion, let's plot the correlation/strength of association between features in the dataset, which contains both categorical and continuous features, using:
- Pearson's R for continuous-continuous cases
- Correlation Ratio for categorical-continuous cases
- Cramer's V or Theil's U for categorical-categorical cases
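
To give a feel for one of these measures, here is a minimal sketch of Cramér's V for two categorical columns, computed from the chi-squared statistic without bias correction. The helper name and the toy columns are hypothetical, and the plotting code used later may compute its associations differently.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V (no bias correction) between two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n              # count expected under independence
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(x_counts), len(y_counts)) - 1
    if min_dim == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * min_dim))


# Hypothetical toy columns.
sex = [0, 0, 1, 1]
copy_of_sex = [0, 0, 1, 1]   # perfectly associated with sex
unrelated = [0, 1, 0, 1]     # independent of sex

v_perfect = cramers_v(sex, copy_of_sex)
v_independent = cramers_v(sex, unrelated)
```

Values range from 0 (no association) to 1 (one column fully determines the other), which is what makes the real and synthetic association heatmaps directly comparable.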

In each of the following plots, we want the synthetic data to be as similar to the real data as possible. That means minimal values for the Jensen-Shannon distance and the pairwise correlation distance, and t-SNE plots with similar-looking distributions in the representation space.
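
For reference, the sketch below computes a Jensen-Shannon distance between the empirical distributions of one categorical feature in a real and a synthetic sample. It assumes base-2 logarithms, so the distance lies between 0 (identical marginals) and 1 (no overlap). The helper name and the toy values are hypothetical, and the plots may use a different binning or log base.

```python
import math
from collections import Counter


def js_distance(real_values, synthetic_values):
    """Jensen-Shannon distance (base-2 logs) between two empirical categorical distributions."""
    categories = set(real_values) | set(synthetic_values)
    p_counts, q_counts = Counter(real_values), Counter(synthetic_values)
    p = {c: p_counts[c] / len(real_values) for c in categories}
    q = {c: q_counts[c] / len(synthetic_values) for c in categories}
    m = {c: 0.5 * (p[c] + q[c]) for c in categories}  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a[c] == 0 contribute nothing
        return sum(a[c] * math.log2(a[c] / b[c]) for c in categories if a[c] > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical toy feature values.
real_region = [2, 2, 4, 4]
same_marginal = [4, 2, 4, 2]   # same proportions as the real data
disjoint = [1, 1, 3, 3]        # no category in common with the real data

d_same = js_distance(real_region, same_marginal)
d_disjoint = js_distance(real_region, disjoint)
```

A synthetic dataset that reproduces a feature's marginal distribution well gives a distance close to 0 for that feature.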

In [None]:
import matplotlib.pyplot as plt
for model in models:
    print(model)
    save_file = outdir / f"{prefix}.{model}_numericalised_n_iter={n_iter}_rnd={random_state}.bkp"
    if Path(save_file).exists():
        syn_model = serialization.load_from_file(save_file)
        syn_model.plot(plt, loader, plots=["associations", "marginal", "tsne"])
        plt.show()

## 7. Extension
TODO:
Use the code block below as a space to complete the following extension exercises.

### 7.1 Training models on both sets of data
Compute the marginal distribution/correlation etc... Utility

Please now train your own model on both the original dataset and each of the private datasets we have generated to see if you reach the same conclusion. Which privacy method provides the best performance and what are the trade-offs?