# Using ABSEHRD package for synthetic data challenge
This notebook demonstrates the functionality of the Automated Brewering Synthetic Electronic Health Record Data (ABSEHRD) package on the SAT GPA dataset supplied in the HLG-MOS Synthetic Data Challenge. The ABSEHRD package adapts the implementation of CorGAN (Torfi et al. 2020. CorGAN: Correlation-Capturing Convolutional Generative Adversarial Networks for Generating Synthetic Healthcare Records. https://arxiv.org/abs/2001.09346).


ABSEHRD installation instructions and additional information are available on GitHub: https://github.com/hhunterzinck/absehrd

## Setup

### Import and parameters

In [1]:
import numpy as np
import pandas as pd

In [2]:
!mkdir -p data output


In [3]:
!wget https://www.openintro.org/data/csv/satgpa.csv
!mv satgpa.csv data/

--2022-01-27 09:16:41--  https://www.openintro.org/data/csv/satgpa.csv
Resolving www.openintro.org (www.openintro.org)... 192.185.65.127
Connecting to www.openintro.org (www.openintro.org)|192.185.65.127|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20106 (20K) [text/csv]
Saving to: ‘satgpa.csv’


2022-01-27 09:16:41 (294 KB/s) - ‘satgpa.csv’ saved [20106/20106]



In [4]:
dir_data = 'data'
dir_output = 'output'
prefix = 'satgpa'

### Read and explore dataset

In [5]:
df = pd.read_csv(f'{dir_data}/{prefix}.csv', delimiter=',')

In [6]:
print(df.shape)

(1000, 6)


In [7]:
print(df)

     sex  sat_v  sat_m  sat_sum  hs_gpa  fy_gpa
0      1     65     62      127    3.40    3.18
1      2     58     64      122    4.00    3.33
2      2     56     60      116    3.75    3.25
3      1     42     53       95    3.75    2.42
4      1     55     52      107    4.00    2.63
..   ...    ...    ...      ...     ...     ...
995    2     50     50      100    3.70    2.19
996    1     54     54      108    3.30    1.50
997    1     56     58      114    3.50    3.17
998    1     55     65      120    2.30    1.94
999    1     49     44       93    2.70    2.38

[1000 rows x 6 columns]


### Split into training and testing set

In [8]:
x = df.to_numpy()

In [9]:
# Use 75% of the rows for training, sampled without replacement
n_subset = round(len(x) * 0.75)
idx_trn = np.random.choice(len(x), n_subset, replace=False)
# The testing set is every row that was not selected for training
idx_tst = np.setdiff1d(np.arange(len(x)), idx_trn)
x_trn = x[idx_trn, :]
x_tst = x[idx_tst, :]

print(f'Number of training samples: {len(x_trn)}')
print(f'Number of testing samples: {len(x_tst)}')

Number of training samples: 750
Number of testing samples: 250
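
The split above draws a fresh random sample on every run, so the downstream realism and privacy metrics will vary between runs. A minimal sketch of a seeded variant follows. The seed value `42` and the stand-in array `x_demo` are illustrative assumptions and were not used to produce the results in this notebook.

```python
import numpy as np

# Hypothetical seed, chosen only for illustration.
rng = np.random.default_rng(42)

# Stand-in for the notebook's `x` (1000 rows, 6 columns).
x_demo = np.arange(6000, dtype=float).reshape(1000, 6)

# Shuffle the row indices once, then cut the permutation at 75%.
n_subset = round(len(x_demo) * 0.75)
perm = rng.permutation(len(x_demo))
idx_trn_demo, idx_tst_demo = perm[:n_subset], perm[n_subset:]

x_trn_demo = x_demo[idx_trn_demo, :]
x_tst_demo = x_demo[idx_tst_demo, :]
print(len(x_trn_demo), len(x_tst_demo))
```

Cutting a single permutation guarantees that the two index sets are disjoint and together cover every row.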


In [10]:
# Write the training and testing splits to file, keeping the original column names
df_trn = pd.DataFrame(x_trn, columns=df.columns)
df_tst = pd.DataFrame(x_tst, columns=df.columns)

df_trn.to_csv(f'{dir_output}/{prefix}_trn.csv', index = False)
df_tst.to_csv(f'{dir_output}/{prefix}_tst.csv', index = False)

## Generation

### Train CorGAN model 

In [11]:
!absehrd train \
    --file_data output/satgpa_trn.csv \
    --outprefix_train output/satgpa \
    --n_epoch 100 \
    --frac_train 1 \
    --verbose \
    --n_cpu_train 1


Pre-training: 100%|█████| 100/100 [00:01<00:00, 57.61 epochs/s, [A loss: 3.141]]
Training: 100%|█| 100/100 [00:14<00:00,  4.37 epochs/s, TRAIN: [Loss_D: -0.014] [Loss_G: 0.023] [Loss_D_real: 0.033] [Loss_D_fake 0.019] | TEST: [A loss: 3.18] [real accuracy: 98.67] [fake accuracy: 21.33]]  


### Generate synthetic samples

In [12]:
!absehrd generate \
    --file_model output/satgpa.pkl \
    --outprefix_generate output/satgpa_syn \
    --generate_size 1000 \
    --n_cpu_generate 1

In [13]:
!mkdir -p ../synthetic_datasets
!cp output/satgpa_syn.csv ../synthetic_datasets

### Preview synthetic dataset

In [14]:
df_syn = pd.read_csv(f'{dir_output}/{prefix}_syn.csv')

In [15]:
print(df_syn)

     sex  sat_v  sat_m  sat_sum    hs_gpa    fy_gpa
0    2.0   74.0   63.0    118.0  3.591259  3.784474
1    1.0   74.0   47.0    117.0  1.887216  3.981789
2    2.0   45.0   71.0     82.0  3.967049  0.874094
3    1.0   44.0   76.0    143.0  2.984431  3.930748
4    1.0   75.0   64.0     69.0  2.128014  2.150660
..   ...    ...    ...      ...       ...       ...
995  2.0   70.0   73.0    138.0  3.518326  1.666179
996  2.0   65.0   35.0     56.0  2.768295  0.023680
997  2.0   66.0   32.0     63.0  2.265948  3.219861
998  1.0   42.0   63.0     87.0  1.871731  3.889534
999  1.0   68.0   73.0    117.0  3.971414  3.987265

[1000 rows x 6 columns]
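
In the rows of the real dataset printed earlier, `sat_sum` equals `sat_v + sat_m`, while the synthetic rows printed above do not satisfy that relationship. The sketch below checks this on the first five rows of each preview. That the relationship holds for the entire real dataset is an assumption based on the displayed rows and the column name; the helper `count_sum_violations` is hypothetical and not part of ABSEHRD.

```python
import numpy as np

# First five rows of each preview, columns: sat_v, sat_m, sat_sum.
real_rows = np.array([[65, 62, 127], [58, 64, 122], [56, 60, 116],
                      [42, 53, 95], [55, 52, 107]])
syn_rows = np.array([[74, 63, 118], [74, 47, 117], [45, 71, 82],
                     [44, 76, 143], [75, 64, 69]])

def count_sum_violations(rows):
    """Count rows where sat_v + sat_m differs from sat_sum."""
    return int(np.sum(rows[:, 0] + rows[:, 1] != rows[:, 2]))

print(count_sum_violations(real_rows))  # 0
print(count_sum_violations(syn_rows))   # 5
```

The same function can be applied to the full arrays, for example to `df[['sat_v', 'sat_m', 'sat_sum']].to_numpy()`.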


## Realism

### Compare univariate frequency for real and synthetic features

In [16]:
!absehrd realism \
    --outprefix_realism output/satgpa \
    --file_realism_real_train output/satgpa_trn.csv \
    --file_realism_real_test output/satgpa_tst.csv \
    --file_realism_synth output/satgpa_syn.csv \
    --outcome sex \
    --analysis_realism feature_frequency \
    --output_realism all


Summary of feature_frequency:
  > Frequency correlation (train): 1.0
  > Frequency correlation (test): 0.99


### Compare effect sizes
Compare the effect sizes of a logistic regression model for a binary outcome when the model is trained on real data versus synthetic data.

In [17]:
!absehrd realism \
    --outprefix_realism output/satgpa \
    --file_realism_real_train output/satgpa_trn.csv \
    --file_realism_real_test output/satgpa_tst.csv \
    --file_realism_synth output/satgpa_syn.csv \
    --outcome sex \
    --analysis_realism feature_effect \
    --output_realism all


Summary of feature_effect:
  > Importance correlation (train): 0.7
  > Importance correlation (test): 0.77
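
The effect-size comparison can be sketched as follows: fit a logistic regression on each dataset and correlate the two coefficient vectors. This is a minimal illustration using plain gradient descent in NumPy, not ABSEHRD's confirmed implementation. The function names and simulated data are hypothetical, and the outcome is assumed to be coded 0/1 (the `sex` column in this notebook is coded 1/2 and would need recoding first).

```python
import numpy as np

def fit_logistic(x, y, n_iter=2000, lr=0.1):
    """Fit logistic regression by batch gradient descent on standardised
    features; return the coefficient vector (intercept excluded)."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        w -= lr * (z.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w

def effect_size_correlation(x_real, y_real, x_synth, y_synth):
    """Pearson correlation between coefficients fitted on each dataset."""
    w_real = fit_logistic(x_real, y_real)
    w_synth = fit_logistic(x_synth, y_synth)
    return float(np.corrcoef(w_real, w_synth)[0, 1])

# Simulated data from one known logistic model (illustrative only).
rng = np.random.default_rng(0)
true_w = np.array([1.5, -1.0, 0.5, 0.0])

def simulate(n):
    x = rng.normal(size=(n, 4))
    p = 1.0 / (1.0 + np.exp(-(x @ true_w)))
    return x, (rng.random(n) < p).astype(float)

x_real, y_real = simulate(2000)
x_synth, y_synth = simulate(2000)
corr_effect = effect_size_correlation(x_real, y_real, x_synth, y_synth)
print(round(corr_effect, 2))
```

Because both simulated datasets come from the same model, the correlation is high here. A generator that distorts the relationship between features and outcome would lower it.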


### Compare predictive performance
* Real: use real dataset to train predictive model and test on a separate real dataset
* GAN-train: use synthetic dataset to train predictive model and test on a real dataset
* GAN-test: use real dataset to train predictive model and test on the synthetic dataset

In [18]:
!absehrd realism \
    --outprefix_realism output/satgpa \
    --file_realism_real_train output/satgpa_trn.csv \
    --file_realism_real_test output/satgpa_tst.csv \
    --file_realism_synth output/satgpa_syn.csv \
    --outcome sex \
    --analysis_realism gan_train_test \
    --output_realism all


Summary of gan_train_test:
  > Real AUC: 0.66
  > GAN-train AUC: 0.64
  > GAN-test AUC: 0.64
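
The three scenarios (Real, GAN-train, GAN-test) differ only in which dataset trains the classifier and which one scores it. The sketch below spells that out with a small NumPy logistic regression and a pairwise AUC. It is an illustration under stated assumptions, not ABSEHRD's confirmed implementation: the function names and simulated data are hypothetical, and outcomes are assumed to be coded 0/1.

```python
import numpy as np

def fit_logistic(x, y, n_iter=2000, lr=0.1):
    """Gradient-descent logistic regression; returns scaling and weights."""
    mu, sd = x.mean(axis=0), x.std(axis=0)
    z = (x - mu) / sd
    w, b = np.zeros(z.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        w -= lr * (z.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return mu, sd, w, b

def predict(model, x):
    """Predicted probability of the positive class."""
    mu, sd, w, b = model
    return 1.0 / (1.0 + np.exp(-(((x - mu) / sd) @ w + b)))

def auc(y_true, score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    diff = score[y_true == 1][:, None] - score[y_true == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Simulated stand-ins for the real train, real test, and synthetic sets.
rng = np.random.default_rng(0)
true_w = np.array([1.5, -1.0, 0.5])

def simulate(n):
    x = rng.normal(size=(n, 3))
    p = 1.0 / (1.0 + np.exp(-(x @ true_w)))
    return x, (rng.random(n) < p).astype(float)

x_trn_s, y_trn_s = simulate(750)
x_tst_s, y_tst_s = simulate(250)
x_syn_s, y_syn_s = simulate(1000)

auc_real = auc(y_tst_s, predict(fit_logistic(x_trn_s, y_trn_s), x_tst_s))
auc_gan_train = auc(y_tst_s, predict(fit_logistic(x_syn_s, y_syn_s), x_tst_s))
auc_gan_test = auc(y_syn_s, predict(fit_logistic(x_trn_s, y_trn_s), x_syn_s))
print(round(auc_real, 2), round(auc_gan_train, 2), round(auc_gan_test, 2))
```

When the synthetic data preserve the feature-outcome relationship, the GAN-train and GAN-test values stay close to the Real value.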


## Privacy

### Nearest neighbors
Ensure that the synthetic dataset is not a copy of the real dataset by comparing distances between pairs of real and synthetic samples:
* Real-real: distance between randomly selected pairs of real samples
* Real-synthetic: distance between pairs of real and synthetic samples
* Real-probabilistic: distance between a real sample and a sampled binary vector in which each column is drawn from a binomial distribution whose success probability equals that feature's frequency in the real training set
* Real-random: distance between a real sample and a randomly sampled binary vector

Calculate, summarize, and plot nearest neighbor distributions:

In [19]:
!absehrd privacy \
    --outprefix_privacy output/satgpa \
    --file_privacy_real_train output/satgpa_trn.csv \
    --file_privacy_real_test output/satgpa_tst.csv \
    --file_privacy_synth output/satgpa_syn.csv \
    --analysis_privacy nearest_neighbors \
    --output_privacy all


Summary of nearest_neighbors:
(note: average nearest neighbor distance)
  > Real-real:             0.1
  > Real-synthetic:        0.26
  > Real-probabilistic:    0.79
  > Real-random:           0.93


### Membership inference
Membership inference refers to the ability to determine whether a given data sample was used to train a model of interest. For synthetic data, the risk that an attacker holding a sample of synthetic data can accurately infer membership provides a metric for the privacy risk that the synthetic dataset poses to individuals whose data were used to train the synthetic data generator.

Risk of membership inference can be assessed in multiple scenarios with differing assumptions about which data are available to the attacker.

#### Distance-based thresholding
Choi et al. (2017) and Torfi et al. (2020) calculated the distance between synthetic and real samples. Real samples were drawn from the training dataset for the synthetic data generator and from a separate testing set. A pair of synthetic and real samples was predicted to be a match if the distance between them was within a specified threshold. Predictions and labels were then compared to derive performance metrics for the membership inference attack.

In [20]:
!absehrd privacy \
    --outprefix_privacy output/satgpa \
    --file_privacy_real_train output/satgpa_trn.csv \
    --file_privacy_real_test output/satgpa_tst.csv \
    --file_privacy_synth output/satgpa_syn.csv \
    --analysis_privacy membership_inference \
    --output_privacy all


Summary of membership_inference:
  > Attack AUC: 0.56
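
The distance-based attack described above can be sketched as follows. Each real sample is labeled 1 if it came from the training set and 0 if it came from the testing set, and is predicted to be a member when its nearest synthetic sample lies within the threshold. The Euclidean metric, the threshold value, the function names, and the simulated data are assumptions for illustration, not confirmed details of ABSEHRD or of the cited papers.

```python
import numpy as np

def nearest_synth_distance(real, synth):
    """Euclidean distance from each real row to its nearest synthetic row."""
    d = np.sqrt(((real[:, None, :] - synth[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1)

def auc(y_true, score):
    """Fraction of (member, non-member) pairs ranked correctly; ties count half."""
    diff = score[y_true == 1][:, None] - score[y_true == 0][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def membership_attack(train, test, synth, threshold):
    """Return (accuracy at `threshold`, AUC over all thresholds)."""
    real = np.vstack([train, test])
    labels = np.r_[np.ones(len(train)), np.zeros(len(test))]
    dist = nearest_synth_distance(real, synth)
    pred = (dist <= threshold).astype(float)
    accuracy = float(np.mean(pred == labels))
    return accuracy, auc(labels, -dist)  # smaller distance = more suspicious

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
train_mi = rng.random((150, 6))
test_mi = rng.random((150, 6))
leaky_synth = train_mi + rng.normal(scale=0.01, size=train_mi.shape)
safe_synth = rng.random((300, 6))

acc_leaky, auc_leaky = membership_attack(train_mi, test_mi, leaky_synth, 0.1)
acc_safe, auc_safe = membership_attack(train_mi, test_mi, safe_synth, 0.1)
print(round(auc_leaky, 2), round(auc_safe, 2))
```

An attack AUC near 0.5 means the attacker does no better than guessing, while a value near 1 means training membership is easy to infer from the synthetic data.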
