Data Privacy Final Project by Sage Hahn
Implementation based on "Plausible Deniability for Privacy-Preserving Data Synthesis" (https://arxiv.org/pdf/1708.07975.pdf)

Using data from the NIST 2018 Differential Privacy Synthetic Data Challenge: https://www.nist.gov/ctl/pscr/funding-opportunities/prizes-challenges/2018-differential-privacy-synthetic-data-challenge

The goal of this project is to generate a differentially private synthetic version of this dataset.

In [1]:
from utils import load_data, flip, convert
from sklearn.model_selection import train_test_split
import numpy as np

from config import config
from structure_learn import learn_structure
from param_learn import learn_cond_marginals
from generate_data import generate_data

First, load the data according to the parameters specified in config.py. Notably, this step simply buckets all of the data.

In [2]:
data, names, encoders = load_data()

# Collect the unique values of each column over the whole dataset, before any splits
unique_vals = [list(np.unique(data[i])) for i in range(len(data))]
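To make the bucketing step concrete, here is a minimal sketch of how one numeric column could be mapped to bucket indices. The function `bucketize`, the `ages` column, and the cut points in `age_edges` are all hypothetical; the actual `load_data` in utils.py may bucket differently. The sketch uses only the standard library so it runs on its own.

```python
from bisect import bisect_right

def bucketize(values, edges):
    """Map each numeric value to the index of the bucket it falls in.

    `edges` are the sorted interior cut points, so there are
    len(edges) + 1 buckets in total.
    """
    return [bisect_right(edges, v) for v in values]

# Hypothetical column of ages and hypothetical cut points
ages = [3, 17, 18, 42, 65, 90]
age_edges = [18, 40, 65]

age_buckets = bucketize(ages, age_edges)
print(age_buckets)  # [0, 0, 1, 2, 3, 3]

# The distinct bucket ids play the role of one entry of unique_vals
age_unique = sorted(set(age_buckets))
```

After bucketing, every column takes values in a small finite set, which is what makes counting-based structure and parameter learning feasible.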

Next, split the data into three disjoint parts: one used to learn the structure, one used to learn the distribution, and one used as seeds to generate new samples.

In [3]:
data = np.swapaxes(data,0,1)
seed_data, train_data = train_test_split(data, test_size=config['seed_split'], random_state=config['ran_state'])
struct_data, param_data = train_test_split(train_data, test_size=.5, random_state=config['ran_state'])

Learn the DAG structure of the data with a privacy-preserving version of FCBF (Fast Correlation-Based Filter).

In [4]:
parents, order = learn_structure(flip(struct_data), unique_vals)
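FCBF ranks attribute pairs by symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes that score on two discrete columns. It is non-private: a privacy-preserving variant would perturb the estimates with noise, and how `learn_structure` does that is not shown in this notebook. The function names here are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(column):
    """Empirical Shannon entropy (in bits) of a discrete column."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a score in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both columns are constant, so there is nothing to share
        return 0.0
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)

# Toy columns: `same` duplicates `base`, `indep` is independent of it
base = [0, 0, 1, 1]
same = [0, 0, 1, 1]
indep = [0, 1, 0, 1]

su_same = symmetric_uncertainty(base, same)
su_indep = symmetric_uncertainty(base, indep)
```

A score near 1 suggests one attribute is a strong candidate parent of the other in the DAG, while a score near 0 suggests no edge is needed.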

Currently, noisy conditional marginals are learned as the underlying distribution of the data.

In [5]:
count_dicts = learn_cond_marginals(flip(param_data), parents, unique_vals)
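As an illustration of what a noisy conditional marginal can look like, the sketch below counts (parent, child) value pairs, adds Laplace noise to each count, clips at zero, and normalizes each row. It assumes a single parent, a per-count sensitivity of 1, and a hypothetical privacy parameter `epsilon`; the real `learn_cond_marginals` may differ in all of these. Laplace noise is drawn as the difference of two exponentials so that only the standard library is needed.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditional(child, parent, child_vals, parent_vals, epsilon, rng):
    """Estimate P(child | parent) from Laplace-perturbed counts."""
    counts = Counter(zip(parent, child))
    table = {}
    for p in parent_vals:
        # Perturb each count, then clip at zero so the row can be normalized
        noisy = [max(counts[(p, c)] + laplace_noise(1.0 / epsilon, rng), 0.0)
                 for c in child_vals]
        total = sum(noisy)
        if total == 0.0:
            # Every noisy count was clipped: fall back to a uniform row
            table[p] = {c: 1.0 / len(child_vals) for c in child_vals}
        else:
            table[p] = {c: n / total for c, n in zip(child_vals, noisy)}
    return table

# Toy bucketed columns (hypothetical data)
rng = random.Random(0)
parent_col = [0, 0, 0, 1, 1, 1, 1, 0]
child_col = [1, 1, 0, 0, 0, 1, 0, 1]
cond_table = noisy_conditional(child_col, parent_col, [0, 1], [0, 1],
                               epsilon=1.0, rng=rng)
```

Iterating over the full lists of possible values, rather than only the observed pairs, matters here: unseen combinations still receive noise, so the released table does not reveal which combinations were absent.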

Generate raw samples based on the learned structure and conditional probabilities, satisfying the privacy constraints specified in config.

In [6]:
to_release = generate_data(flip(seed_data), order, parents, count_dicts, unique_vals)
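The privacy constraint in the cited paper is a plausible deniability test: a synthetic record is released only if enough input records could have produced it with similar probability. The sketch below is a simplified reading of that test, assuming the generation probabilities Pr{y = M(d)} have already been computed for every record d. The names `k`, `gamma`, and `epsilon0` are illustrative parameters, and `generate_data` may implement the check differently.

```python
import math
import random

def bucket_index(prob, gamma):
    """Integer i >= 0 with gamma**-(i + 1) < prob <= gamma**-i."""
    return math.floor(-math.log(prob) / math.log(gamma))

def passes_privacy_test(seed_prob, all_probs, k, gamma, epsilon0=None, rng=None):
    """Return True if the synthetic record may be released.

    seed_prob: probability of generating the record from its actual seed.
    all_probs: the same probability for every record in the dataset
               (including the seed itself).
    epsilon0:  if given, the threshold k is perturbed with Laplace noise
               (the randomized form of the test).
    """
    if seed_prob <= 0.0:
        return False
    target = bucket_index(seed_prob, gamma)
    # Count records whose probability lands in the same geometric bucket
    plausible = sum(1 for p in all_probs
                    if p > 0.0 and bucket_index(p, gamma) == target)
    threshold = k
    if epsilon0 is not None:
        rng = rng or random.Random()
        # Laplace(0, 1/epsilon0) as a difference of two exponentials
        threshold += rng.expovariate(epsilon0) - rng.expovariate(epsilon0)
    return plausible >= threshold

# Hypothetical probabilities for six input records
probs = [0.3, 0.4, 0.26, 0.1, 0.0, 0.45]
released = passes_privacy_test(0.3, probs, k=3, gamma=2.0)
rejected = passes_privacy_test(0.3, probs, k=5, gamma=2.0)
```

With `gamma=2.0`, four of the six records fall in the bucket (0.25, 0.5], so the record passes for `k=3` and fails for `k=5`.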

Lastly, un-bucketize (not a word) the data, converting it back into its original form.

In [7]:
fake_data = convert(to_release, names, encoders)
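One plausible way to un-bucketize a numeric column is to draw a value uniformly from the range each bucket covers. The sketch below assumes that approach, with hypothetical cut points and outer bounds; the actual `convert` relies on the `encoders` returned by `load_data`, whose form is not shown here.

```python
import random

def unbucketize(bucket_ids, edges, low, high, rng):
    """Replace each bucket id with a value drawn uniformly from its range.

    `edges` are the interior cut points; `low` and `high` close the
    first and last bucket.
    """
    bounds = [low] + list(edges) + [high]
    return [rng.uniform(bounds[b], bounds[b + 1]) for b in bucket_ids]

# Hypothetical bucketed ages with cut points at 18, 40 and 65
rng = random.Random(0)
bucket_ids = [0, 1, 3, 2]
restored = unbucketize(bucket_ids, [18, 40, 65], low=0, high=100, rng=rng)
```

Sampling within the bucket gives continuous-looking values, at the cost of discarding whatever within-bucket shape the original data had.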

In [8]:
fake_data

[[1,
  1,
  0,
  1,
  1,
  0.0,
  2,
  6.0,
  7.0,
  1.0,
  4,
  11,
  1,
  11,
  7,
  1476412240,
  1478507235,
  1452373658,
  1476875376,
  -122.41993872854846,
  1,
  3,
  37.74522872513122,
  1476716318,
  35,
  381,
  520]]