Data Privacy Final Project by Sage Hahn
Implementation based on "Plausible Deniability for Privacy-Preserving Data Synthesis" (https://arxiv.org/pdf/1708.07975.pdf)

Data: the NIST 2018 Differential Privacy Synthetic Data Challenge (https://www.nist.gov/ctl/pscr/funding-opportunities/prizes-challenges/2018-differential-privacy-synthetic-data-challenge)

Project goal: generate differentially private synthetic data from an original dataset

**This notebook is an *older* implementation of the code, written to follow the paper listed above. It may not run correctly, because files it depends on may have changed since; it is kept as a reference.**

In [1]:
from utils import load_data, flip, convert
from sklearn.model_selection import train_test_split
import numpy as np

from config import config
from structure_learn import learn_structure
from param_learn import learn_cond_marginals
from generate_data import generate_data

First, load the data according to the parameters specified in config.py. Notably, this step simply buckets all of the data.

In [18]:
data, names, encoders = load_data()

# Collect each feature's unique values over the whole dataset, before any splits
unique_vals = [list(np.unique(data[i])) for i in range(len(data))]
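
To make the bucketing step concrete, here is a minimal sketch of equal-width binning followed by the same per-feature unique-value pass. The helper `bucketize` and the toy columns are hypothetical; how `load_data` actually buckets is not shown in this notebook, so equal-width bins are an assumption.

```python
def bucketize(values, bins):
    """Map each numeric value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall into bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Toy data laid out feature-major (one list per feature), like `data` above.
columns = [
    [0.0, 2.5, 5.0, 7.5, 10.0],
    [1.0, 1.0, 3.0, 3.0, 9.0],
]
bucketed = [bucketize(col, bins=4) for col in columns]

# Domain of each feature, taken over the full dataset before splitting.
unique_vals = [sorted(set(col)) for col in bucketed]
```

Computing the domains before the split means every later stage sees the same set of possible values per feature, even if a value happens to be missing from one split.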

Next, split the data into three parts: one used to learn the structure, one used to learn the distribution, and one used as seeds to generate new samples.

In [3]:
data = np.swapaxes(data,0,1)
seed_data, train_data = train_test_split(data, test_size=config['seed_split'], random_state=config['ran_state'])
struct_data, param_data = train_test_split(train_data, test_size=.5, random_state=config['ran_state'])
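
The proportions produced by the two splits can be sketched without scikit-learn. `train_test_split` returns the larger "train" portion first, so with `seed_split` at 0.3 the seed set receives 70% of the rows and the remaining 30% is halved between structure and parameter learning. The helper `split_rows` below is hypothetical and only mimics the proportions, not scikit-learn's exact shuffling.

```python
import random

def split_rows(rows, second_fraction, seed):
    """Shuffle a copy of rows and return (first_part, second_part)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_second = round(len(shuffled) * second_fraction)
    return shuffled[n_second:], shuffled[:n_second]

rows = list(range(1000))  # stand-in for 1000 records
seed_rows, train_rows = split_rows(rows, second_fraction=0.3, seed=42)
struct_rows, param_rows = split_rows(train_rows, second_fraction=0.5, seed=42)

sizes = (len(seed_rows), len(struct_rows), len(param_rows))
```

The three parts are disjoint, so no record is used both to fit the model and to seed a synthetic record.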

Learn the DAG structure of the data with privacy-preserving FCBF.

In [4]:
parents, order = learn_structure(flip(struct_data), unique_vals)
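
FCBF ranks features by symmetrical uncertainty, a normalized form of mutual information. The sketch below computes it for discrete columns and is deliberately non-private: a privacy-preserving variant would have to add noise to these quantities, and how `learn_structure` does that is not shown here. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x + h_y == 0:
        return 0.0  # both columns are constant
    h_xy = entropy(list(zip(xs, ys)))
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)

a = [0, 0, 1, 1]
b = [0, 0, 1, 1]  # identical to a
c = [0, 1, 0, 1]  # independent of a

su_ab = symmetrical_uncertainty(a, b)
su_ac = symmetrical_uncertainty(a, c)
```

A value near 1 marks a strong candidate parent, and a value near 0 marks a feature that carries no information about the other.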

Currently, noisy conditional marginals are learned as the underlying distribution of the data.

In [5]:
count_dicts = learn_cond_marginals(flip(param_data), parents, unique_vals)
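
One common way to obtain noisy conditional marginals is to count each (parent value, child value) pair, add Laplace noise to every count, clamp negatives to zero, and normalize. The sketch below does that for a single parent. The Laplace scale of `1 / epsilon`, the function names, and the returned dictionary layout are all assumptions; the actual format of `count_dicts` is not shown in this notebook.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditional(child, parent, child_domain, parent_domain, epsilon, rng):
    """Return {parent_value: {child_value: probability}} built from noisy counts."""
    counts = Counter(zip(parent, child))
    table = {}
    for p in parent_domain:
        noisy = [max(counts[(p, c)] + laplace_noise(1.0 / epsilon, rng), 0.0)
                 for c in child_domain]
        total = sum(noisy)
        if total == 0.0:
            # Every noisy count was clamped to zero: fall back to uniform.
            table[p] = {c: 1.0 / len(child_domain) for c in child_domain}
        else:
            table[p] = {c: v / total for c, v in zip(child_domain, noisy)}
    return table

rng = random.Random(42)
parent = [0, 0, 0, 1, 1, 1, 1, 1]
child = [0, 0, 1, 1, 1, 1, 0, 1]
table = noisy_conditional(child, parent, [0, 1], [0, 1], epsilon=1.0, rng=rng)
```

Iterating over the full domains rather than the observed values is why the unique values are collected before the split: unseen combinations still get a noisy count.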

Generate raw samples based on the learned structure and conditional probabilities, satisfying the privacy constraints specified in config.

In [6]:
to_release = generate_data(flip(seed_data), order, parents, count_dicts, unique_vals)
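
The release decision in the plausible deniability approach can be sketched as follows: a candidate generated from a seed is released only if at least `k` records in the dataset could have produced it with a probability in the same `gamma`-sized bucket as the seed's. The code below is a simplified reading of that test, not the code in `generate_data`; the function names, the argument layout, and the optional Laplace-randomized threshold with scale `1 / epsilon_0` are assumptions.

```python
import math
import random

def partition_index(prob, gamma):
    """Return i >= 0 such that gamma**-(i + 1) < prob <= gamma**-i."""
    return math.floor(-math.log(prob) / math.log(gamma))

def passes_privacy_test(candidate_prob, dataset_probs, k, gamma,
                        epsilon_0=None, rng=None):
    """candidate_prob: probability that the seed generates the candidate.
    dataset_probs: the same probability for every record (seed included)."""
    target = partition_index(candidate_prob, gamma)
    plausible = sum(1 for p in dataset_probs
                    if p > 0 and partition_index(p, gamma) == target)
    threshold = k
    if epsilon_0 is not None:
        # Randomized variant: perturb the threshold with Laplace(1 / epsilon_0) noise.
        scale = 1.0 / epsilon_0
        threshold = k + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return plausible >= threshold

# With gamma = 2, the seed's probability 0.3 lies in the bucket (0.25, 0.5].
probs = [0.3, 0.4, 0.26, 0.1, 0.6, 0.0]
released = passes_privacy_test(0.3, probs, k=3, gamma=2.0)
rejected = passes_privacy_test(0.3, probs, k=4, gamma=2.0)
```

Three records (0.3, 0.4, 0.26) share the seed's bucket, so the candidate passes with `k=3` and fails with `k=4`.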

Lastly, un-bucketize (not a word) the data back into its original form.

In [7]:
fake_data = convert(to_release, names, encoders)
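
One plausible way to undo bucketing for a numeric feature is to draw a value uniformly from the range covered by the bin. The sketch below assumes equal-width bins over a known `[lo, hi]` range; the helper `unbucketize` is hypothetical, and `convert` may well work differently (for instance through the `encoders` it receives).

```python
import random

def unbucketize(bin_index, lo, hi, bins, rng):
    """Draw a value uniformly from the range covered by one equal-width bin."""
    width = (hi - lo) / bins
    left = lo + bin_index * width
    return rng.uniform(left, left + width)

rng = random.Random(42)
# Bin 3 of 20 equal-width bins over [0, 100] covers [15, 20].
value = unbucketize(3, lo=0.0, hi=100.0, bins=20, rng=rng)
```

Sampling inside the bin yields non-round floating-point values rather than a fixed bin midpoint.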

In [8]:
fake_data

[[1,
  0.0,
  1,
  0.0,
  0.0,
  0.0,
  1,
  6.0,
  7,
  6,
  0.0,
  7,
  0,
  6,
  8,
  1461381113,
  1462508066,
  1461573935,
  1461362898,
  -122.43342768372231,
  1.0,
  2,
  37.78670801107178,
  1461138972,
  339,
  160,
  6844],
 [1,
  1.0,
  0,
  1.0,
  1.0,
  0.0,
  1,
  3.0,
  3,
  9,
  0.0,
  12,
  0,
  3,
  26,
  1452145538,
  1452694317,
  1451882593,
  1452634038,
  -122.47260684430995,
  1.0,
  2,
  37.71891225301083,
  1453098538,
  348,
  1037,
  8023],
 [1,
  1.0,
  2,
  1.0,
  1.0,
  0.0,
  2,
  9.0,
  10,
  10,
  3.0,
  19,
  14,
  23,
  37,
  1451915495,
  1452922210,
  1452266504,
  1452458581,
  -122.45991795603173,
  1.0,
  4,
  37.71109355834513,
  1452843897,
  67,
  498,
  3798],
 [1,
  1.0,
  0,
  1.0,
  1.0,
  0.0,
  1,
  0.0,
  0,
  4,
  0.0,
  0,
  0,
  0,
  2,
  1463141489,
  1463708625,
  1462827317,
  1464230856,
  -122.41239232812957,
  1.0,
  2,
  37.79093842811821,
  1464101306,
  337,
  544,
  648],
 [1,
  1.0,
  3,
  1.0,
  1.0,
  0.0,
  5,
  4.0,

In [9]:
print(config)

{'ran_state': 42, 'csv_file': 'fire-data.csv', 'specs': 'fire-data-specs.json', 'drop_na': True, 'bins': 20, 'seed_split': 0.3, 'struct_threshold': 0.1, 'max_struct_cost': 500, 'n': 305133, 'delta': 1.0740429335990503e-11, 'epsilon_nt': 0.005, 'epsilon': 41, 'num_features': 27, 'omega': 15, 'num_to_generate': 100, 'lam': 1.0000001, 'k': 150, 'epsilon_p': 1.5185185185185186, 't': 32, 'epsilon_per': 0.2845567551674008, 'delta_per': 1.0740429335990503e-13, 'epsilon_0': 0.2140424241714298}
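
Several of the printed values appear to be derived from others. The check below tests three relations that are inferred purely from the numbers in the output above, not from `config.py`: `delta` looks like `1 / n**2`, `epsilon_p` like `epsilon / num_features`, and `delta_per` like `delta / num_to_generate`. Treat these as observations about this particular printout.

```python
import math

# Values copied from the printed config.
printed = {
    'n': 305133,
    'delta': 1.0740429335990503e-11,
    'epsilon': 41,
    'num_features': 27,
    'num_to_generate': 100,
    'epsilon_p': 1.5185185185185186,
    'delta_per': 1.0740429335990503e-13,
}

# Candidate relations (assumptions inferred from the numbers).
derived = {
    'delta': 1 / printed['n'] ** 2,
    'epsilon_p': printed['epsilon'] / printed['num_features'],
    'delta_per': printed['delta'] / printed['num_to_generate'],
}

matches = {name: math.isclose(value, printed[name], rel_tol=1e-6)
           for name, value in derived.items()}
```

The remaining derived-looking entries (`epsilon_per`, `epsilon_0`) are not checked here, since their formulas cannot be read off the printout.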


In [None]:
'''
import pymc3 as pm
import numpy as np

c = np.array([round(14955.5), round(28783.5)])
n = np.sum(c)
alphas = np.array([1, 1])

model = pm.Model()

with model:
    parameters = pm.Dirichlet('parameters', a=alphas, shape=2)
    observed_data = pm.Multinomial(
            'observed_data', n=n, p=parameters, shape=np.shape(c)[0], observed=c)  
    
    trace = pm.sample(draws=1000, chains=2, tune=1500, 
                      discard_tuned_samples=True)
    

with model:
    samples = pm.sample_ppc(trace, samples = 1000)
    
'''
