<a href="https://colab.research.google.com/github/wendyZhang98/DS-GA-1008-DeepLearning/blob/main/lab_9_wz2164.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/ryan112358/private-pgm.git
%cd private-pgm
! pip install -r requirements.txt
! python setup.py install
import os, sys
sys.path.append(os.getcwd())

!git clone https://github.com/lurosenb/host_mst_wrapper

Cloning into 'private-pgm'...
remote: Enumerating objects: 485, done.
remote: Counting objects: 100% (314/314), done.
remote: Compressing objects: 100% (203/203), done.
remote: Total 485 (delta 131), reused 260 (delta 95), pack-reused 171
Receiving objects: 100% (485/485), 2.14 MiB | 12.77 MiB/s, done.
Resolving deltas: 100% (205/205), done.
/content/private-pgm
Collecting nose
  Downloading nose-1.3.7-py3-none-any.whl (154 kB)
     |████████████████████████████████| 154 kB 7.3 MB/s
Collecting disjoint-set
  Downloading disjoint_set-0.7.3-py3-none-any.whl (5.2 kB)
Installing collected packages: nose, disjoint-set
Successfully installed disjoint-set-0.7.3 nose-1.3.7
running install
running bdist_egg
running egg_info
creating src/private_pgm.egg-info
writing src/private_pgm.egg-info/PKG-INFO
writing dependency_links to src/private_pgm.egg-info/dependency_links.txt
writing requirements to src/private_pgm.egg-info/requires.txt
writing top-level names to src/private_pg

In [None]:
from IPython.display import clear_output
from scipy.stats import entropy, ks_2samp
from scipy.spatial.distance import euclidean
from sklearn.metrics import mutual_info_score
from random import randint
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Packages for reading csv file into Colaboratory:
!pip install -U -q PyDrive==1.3.1

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client. 
# Please follow the steps as instructed when you run the following commands. 

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

fileid_compas = '1kgSIBkOM9y0nz_l8LI8ze9TAhF5gbb64'    
real_data_file = 'hw_compas.csv'

downloaded = drive.CreateFile({'id':fileid_compas}) 
downloaded.GetContentFile(real_data_file)  
df_real = pd.read_csv(real_data_file)

In [None]:
import sys
sys.path.insert(1, "/content/private-pgm/src")
from host_mst_wrapper.mst.mst import MSTSynthesizer
from host_mst_wrapper.mst.pmse import pmse_ratio

df_real['sex'] = df_real['sex'].astype('category')
df_real['race'] = df_real['race'].astype('category')
df_real['score'] = df_real['score'].astype('category')
categorical = df_real.select_dtypes(['category']).columns
df_real[categorical] = df_real[categorical].apply(lambda x: x.cat.codes)

## MST
We will generate differentially private synthetic data using MST. Using some adversarial examples, we will evaluate how the choice of attribute pairs affects the quality of our synthetic data, and think about why building a maximum spanning tree over pairwise mutual information is a good way to choose those pairs.

What do we mean here by "evaluate," though? How can we tell when our synthetic data is performing well, and when it's not? We will explore this today, starting with an example of a metric that tries to capture synthetic data quality with an elegant and simple approach.

## Example Metric: pMSE
We will evaluate the performance of MST here using pMSE (propensity mean squared error), a recent similarity metric for assessing the quality of synthetic data.

### Intuition
The intuition behind this metric is simple: label each sample with a binary indicator, 0 for real and 1 for synthetic, then train a regression model to predict that label. The better the model separates the two, the worse the synthetic data. The metric is normalized so that the ideal value is 1; values well above 1 are common when the synthetic data fits poorly. Values closer to 1 (i.e. lower values) are better.
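
To make the normalization concrete, here is a minimal sketch of one common formulation of the metric. It assumes a logistic model with `n_params` coefficients (intercept included) and a synthetic share `c`, and it takes the predicted propensities as given instead of fitting a model. The function names are hypothetical, and the `pmse_ratio` helper used in this notebook may be implemented differently.

```python
def pmse(propensities, synthetic_share):
    """Mean squared deviation of the propensity scores from the synthetic share c."""
    n = len(propensities)
    return sum((p - synthetic_share) ** 2 for p in propensities) / n


def expected_null_pmse(n_rows, n_params, synthetic_share):
    """Expected pMSE of a logistic model when real and synthetic rows
    come from the same distribution (assumed null formula)."""
    c = synthetic_share
    return (n_params - 1) * (1 - c) ** 2 * c / n_rows


def pmse_ratio_sketch(propensities, n_params, synthetic_share=0.5):
    """Observed pMSE divided by its expected value under the null."""
    n = len(propensities)
    return pmse(propensities, synthetic_share) / expected_null_pmse(
        n, n_params, synthetic_share
    )


# Toy propensities (hypothetical values, not model output).
indistinguishable = [0.5] * 1000       # the model cannot tell the sets apart
separable = [0.0] * 500 + [1.0] * 500  # the model separates them perfectly

print(pmse(indistinguishable, 0.5))  # 0.0
print(pmse(separable, 0.5))          # 0.25
print(pmse_ratio_sketch(separable, n_params=5))
```

A fitted model never returns exactly 0.5 everywhere, which is why the raw pMSE is divided by its expected value under the null: a ratio near 1 means the model does no better than chance fitting would.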

### Example of pMSE
So, if we compare the real data to itself using pMSE, we would expect to see a score of ~1.0.

In [None]:
pmse = pmse_ratio(df_real, df_real, seed=3)
print('pMSE ratio:',round(pmse,3))

pMSE ratio: 1.35


A single run is noisy because the score depends on which rows are sampled for classification. A more reliable estimate averages the score across a set of seeds.

In [None]:
# Average the pMSE ratio over several seeds to smooth out sampling noise.
seeds = [0, 1, 2, 3, 4, 5]
avg_pmse = 0
for s in seeds:
  avg_pmse += pmse_ratio(df_real, df_real, seed=s)

print('Average pMSE:', round(avg_pmse / len(seeds), 3))

Average pMSE: 1.319


MST always uses some privacy budget to measure each attribute independently of the other attributes as a starting point. So, for example, with the score attribute, we simply count up how many individuals received each score, and MST incorporates this information into its private distribution. These count queries are relatively inexpensive to perform in a private manner.
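
The counting step described above can be sketched as a histogram plus noise. This is only an illustration: it assumes Gaussian noise with an arbitrary scale `sigma`, it does not show how that scale would be calibrated to a privacy budget, and `noisy_one_way_marginal` is a hypothetical helper, not part of the MST code.

```python
import random
from collections import Counter


def noisy_one_way_marginal(values, domain_size, sigma, rng):
    """Count how often each code 0..domain_size-1 occurs, then add Gaussian noise."""
    counts = Counter(values)
    return [
        counts.get(code, 0) + rng.gauss(0.0, sigma)
        for code in range(domain_size)
    ]


rng = random.Random(0)
# Hypothetical score column with codes 0..10 (domain size 11).
scores = [rng.randrange(11) for _ in range(5000)]
noisy = noisy_one_way_marginal(scores, domain_size=11, sigma=10.0, rng=rng)
print([round(x, 1) for x in noisy])
```

Each person contributes to exactly one cell of the histogram, so a modest amount of noise per cell hides any individual while leaving the overall shape of the distribution intact.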

### Sanity check
Below, we use pMSE to check that the synthetic data improves as we improve the 2-way marginals that we feed MST. We start with no 2-way marginals, then provide an uninformative set (pairs among the identity attributes, which are largely uncorrelated), and finally provide a highly correlated set (each identity attribute paired with score).

In [None]:
cliques_to_try = [
    [],                                                      # no 2-way marginals
    [('sex', 'age'), ('age', 'race')],                       # weakly correlated pairs
    [('sex', 'score'), ('age', 'score'), ('race', 'score')]  # identity -> score pairs
]

for cl in cliques_to_try:
    # Fit MST with a fixed set of 2-way marginals instead of letting it choose.
    synth = MSTSynthesizer(epsilon=1.0,
                           domain_path="host_mst_wrapper/mst/compas-domain.json",
                           custom_cliques=True,
                           cliques_set=cl)
    synth.fit(df_real)
    mst_fake_data = synth.sample(samples=len(df_real))

    # Average the pMSE ratio over the same seeds as before.
    avg_pmse = 0
    for s in seeds:
        avg_pmse += pmse_ratio(df_real, mst_fake_data, seed=s)
    print('Average pMSE:', round(avg_pmse / len(seeds), 3))
    print('')

Domain(sex: 2, age: 101, race: 6, score: 11)
Index(['sex', 'age', 'race', 'score'], dtype='object')
[]
Average pMSE: 2.881

Domain(sex: 2, age: 101, race: 6, score: 11)
Index(['sex', 'age', 'race', 'score'], dtype='object')
[('sex', 'age'), ('age', 'race')]
Average pMSE: 2.305

Domain(sex: 2, age: 101, race: 6, score: 11)
Index(['sex', 'age', 'race', 'score'], dtype='object')
[('sex', 'score'), ('age', 'score'), ('race', 'score')]
Average pMSE: 1.407
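
The results above suggest why ranking attribute pairs by mutual information is useful: the pairs involving score carry the most signal. The following non-private sketch computes empirical mutual information for every pair and keeps a maximum spanning tree with Kruskal's algorithm. It runs on toy data, the helper names are hypothetical, and it ignores the noise a private algorithm would need when selecting edges, so it is not the library's implementation.

```python
import math
import random
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def max_spanning_tree(columns):
    """Kruskal's algorithm on the complete graph weighted by mutual information."""
    parent = {name: name for name in columns}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Heaviest edges first.
    edges = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(columns, 2)),
        reverse=True,
    )
    tree = []
    for _, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the edge only if it joins two components
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Toy data (hypothetical): score mostly copies age; sex and race are independent.
rng = random.Random(0)
n = 2000
age = [rng.randrange(4) for _ in range(n)]
columns = {
    "sex": [rng.randrange(2) for _ in range(n)],
    "age": age,
    "race": [rng.randrange(3) for _ in range(n)],
    "score": [a if rng.random() < 0.8 else rng.randrange(4) for a in age],
}
tree = max_spanning_tree(columns)
print(tree)
```

A tree over four attributes always has three edges, so the choice is about which three pairs to spend the budget on, not how many.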



## Evaluate using a new (or your own) metric!
Your turn! Let's try to determine the quality of private synthetic data, in comparison to the real data.

Feel free to scrape the internet for a metric that makes sense for comparing real to synthetic data, or come up with your own! What are you trying to evaluate? Specific queries? Predictive tasks? Be creative!

Some examples of interesting tasks include:
- Predictive: how good is synthetic data at **retaining predictive utility** of a classifier?
  - Train classifier on MST data and real data, compare performance. Vary settings!
- Statistical: how good is synthetic data at maintaining **correlations beyond 2-way**?
  - How good is the synthetic data at maintaining different distributions (normal, power, bi-modal, etc.)?
- Fairness: how **fair** is the synthetic data with respect to some protected class?
  - How do we define fair here?
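
As one possible starting point for the statistical direction, the sketch below compares a single column's distribution in the real and synthetic data using total variation distance. It is an illustrative baseline on toy values, not a recommended answer, and `total_variation_distance` is a hypothetical helper.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two empirical distributions
    (0 = identical, 1 = disjoint support)."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    support = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[v] / n_real - synth_counts[v] / n_synth)
        for v in support
    )


# Toy columns (hypothetical values); with the notebook's data one could pass
# df_real["score"].tolist() and mst_fake_data["score"].tolist() instead.
real_scores = [1, 1, 2, 3, 3, 3, 8, 9]
synthetic_scores = [1, 2, 2, 3, 3, 8, 8, 9]
print(total_variation_distance(real_scores, synthetic_scores))  # 0.25
```

A per-column distance like this only checks 1-way marginals, so it says nothing about whether relationships between attributes survive; that gap is one reason to look for richer metrics.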

Feel free to grab your own data for synthesizing here. We expect that you will turn in a notebook in which you discuss at least one meaningful metric and show at least one meaningful plot concerning that metric, but the exercise is designed to be otherwise unconstrained.

In [None]:
synth = MSTSynthesizer(epsilon=1.0, 
                        domain_path="host_mst_wrapper/mst/compas-domain.json")
synth.fit(df_real)
mst_fake_data = synth.sample(samples=len(df_real))

Domain(sex: 2, age: 101, race: 6, score: 11)
Index(['sex', 'age', 'race', 'score'], dtype='object')
[('sex', 'age'), ('age', 'score'), ('race', 'score')]


In [None]:
mst_fake_data.head()

Unnamed: 0,sex,age,race,score
0,1,32,0,9
1,1,24,0,8
2,1,22,0,2
3,1,42,0,8
4,0,27,0,8


In [None]:
df_real.head()

Unnamed: 0,sex,age,race,score
0,1,69,5,1
1,1,31,2,5
2,1,34,0,3
3,1,24,0,4
4,1,23,0,8


### Include brief description of your metric here:

In [None]:
#INCLUDE YOUR CODE FOR PROPOSED METRIC HERE