# Synthetic Data - CS3110 Final Project

### Main Goal
Our team consists of Teddy Ruth, Joe Brennan, and Jordan Gottlieb. Our goal in this project is to look more closely at the accuracy (or lack thereof) of synthetic data. How does accuracy compare across different marginals? What techniques can we use to maximize accuracy? What are some of the negative consequences of increasing accuracy?

TODO: 

1. Establish accuracy of 4-way marginal (started)
2. Develop overlapping marginals (3-4 columns?); code from HW 10 is included at the bottom of this document
3. Establish accuracy of overlapping marginals
4. Run a series of queries on overlapping marginals
5. Run the same queries using the Laplace/Gaussian mechanisms on the original data
6. Present findings

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    return [v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)
            for v in vec]

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

def pct_error_vec(orig, priv):
    errors = []
    for i in range(len(orig)):
        pct_err = np.abs(orig[i]-priv[i])/orig[i] * 100.0
        errors.append(pct_err)

    return errors


evs = pd.read_csv("Electric_Vehicle_Population_Data.csv")


### dp_marginal

The function below computes a differentially private marginal for a given column and epsilon value. It adds Laplace noise to the count of each value, clamps negative noisy counts to 0, and normalizes the result. It returns a dictionary whose keys are the column's values and whose values are the estimated probability of each value across the whole dataset. For example, passing in the `County` column returns every county in the dataset as a key, mapped to that county's noisy share of all rows.

In [8]:
def dp_marginal(col, epsilon):
    
    data = evs[col].value_counts()
    results = [x for x in data]
    noisy = [laplace_mech(v, 1, epsilon) for v in results]
    labels = evs[col].value_counts().index.to_list()
    syn_rep = {}
    
    for x in range(len(labels)):
        syn_rep[labels[x]] = max(0, noisy[x])
    
    total = sum(syn_rep.values())
    
    marginal = {}
    for x in labels:
        marginal[x] = syn_rep[x] / total
    return marginal


marginal = dp_marginal('County', 1.0)
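
To make the return value concrete, here is a minimal sketch of the same steps (count, add noise, clamp, normalize) on a small made-up series. The helper name `toy_dp_marginal` and the toy data are assumptions for illustration only; they are not part of the project code or the EV dataset.

```python
import numpy as np
import pandas as pd

def toy_dp_marginal(series, epsilon, rng):
    # True counts per distinct value
    counts = series.value_counts()
    # Laplace noise with scale 1/epsilon (sensitivity 1 for a counting query)
    noisy = counts + rng.laplace(loc=0, scale=1 / epsilon, size=len(counts))
    # Clamp negative noisy counts to 0, then normalize to probabilities
    noisy = noisy.clip(lower=0)
    return (noisy / noisy.sum()).to_dict()

rng = np.random.default_rng(0)
# Hypothetical toy column: 60 'King', 30 'Pierce', 10 'Clark'
toy = pd.Series(['King'] * 60 + ['Pierce'] * 30 + ['Clark'] * 10)
toy_marginal = toy_dp_marginal(toy, 1.0, rng)
print(toy_marginal)  # keys are the column values; the probabilities sum to 1
```

Because the result is a probability distribution rather than raw counts, it can be passed directly as the `p` argument when sampling.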

### dp_synthetic_data

This function generates synthetic data by computing a marginal for every column passed to it, sampling each column independently from its marginal, and combining the samples into a single dataframe. It takes the list of columns, the number of rows to generate, and the epsilon value.

In [9]:
data = evs['County'].value_counts().index.to_list()
print (data)


['King', 'Snohomish', 'Pierce', 'Clark', 'Thurston', 'Kitsap', 'Spokane', 'Whatcom', 'Benton', 'Skagit', 'Island', 'Clallam', 'Chelan', 'Jefferson', 'Yakima', 'San Juan', 'Cowlitz', 'Mason', 'Lewis', 'Grays Harbor', 'Kittitas', 'Franklin', 'Grant', 'Walla Walla', 'Douglas', 'Whitman', 'Klickitat', 'Okanogan', 'Stevens', 'Pacific', 'Skamania', 'Asotin', 'Wahkiakum', 'Pend Oreille', 'Adams', 'Lincoln', 'San Diego', 'Ferry', 'Columbia', 'Orange', 'Santa Clara', 'Fairfax', 'Anne Arundel', 'Los Angeles', 'Maricopa', 'El Paso', 'Montgomery', 'Honolulu', 'Virginia Beach', 'Cumberland', 'New London', 'Lake', 'Garfield', 'Solano', 'Harnett', 'Burlington', 'Sacramento', 'Alameda', 'Ventura', 'Riverside', 'Cook', "Prince George's", 'Kings', 'Bexar', 'San Bernardino', 'Multnomah', 'Charleston', 'Goochland', 'Hillsborough', 'Middlesex', 'Stafford', 'Monterey', 'Alexandria', 'District of Columbia', 'Loudoun', 'Harford', 'Contra Costa', 'Polk', 'Kern', 'Hoke', 'New Haven', 'Berkeley', 'Collin', 'Rich

In [4]:
def dp_synthetic_data(cols, n, epsilon):
    df_data = {}
    for col in cols:
        df_data[col] = []
        data = evs[col].value_counts().index.to_list()
        results = [x for x in data]
        marginal = list(dp_marginal(col, epsilon).values())
        synthetic = np.random.choice(results, size=n, p=marginal)
        
        for x in synthetic:
            df_data[col].append(x)
            
    dp_df = pd.DataFrame.from_dict(df_data)
    
    return dp_df

dp_synthetic_data(['County', 'Model'], 100, 1.0)

Unnamed: 0,County,Model
0,Pierce,MODEL 3
1,King,MODEL Y
2,Snohomish,LEAF
3,Island,MODEL S
4,King,MODEL Y
...,...,...
95,Kitsap,MODEL Y
96,King,MODEL Y
97,Kitsap,MODEL 3
98,King,MODEL 3


The issue with `dp_synthetic_data` is that it does not preserve any of the correlations in the dataset. It samples each column from its own marginal and stitches the results together, so the value in one column of a row tells us nothing about the value in another. This is fine when correlations between columns do not matter, but it does a poor job when they do. To address this, we can generate synthetic data from 2-way marginals, which preserve the correlation between the two columns involved.
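
The loss of correlation can be seen on a small made-up example. The sketch below is an illustration under stated assumptions: the toy dataframe, the helper `sample_column`, and the choice to sample from exact (non-private) marginals are all hypothetical and are used only to isolate the effect of independent sampling.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data in which Model is fully determined by Make
toy = pd.DataFrame({
    'Make':  ['TESLA'] * 500 + ['NISSAN'] * 500,
    'Model': ['MODEL 3'] * 500 + ['LEAF'] * 500,
})

def sample_column(series, n):
    # Sample n values from the column's own (non-private) marginal
    probs = series.value_counts(normalize=True)
    return rng.choice(probs.index.to_numpy(), size=n, p=probs.to_numpy())

# Each column is sampled independently, as in dp_synthetic_data
independent = pd.DataFrame({
    'Make': sample_column(toy['Make'], 1000),
    'Model': sample_column(toy['Model'], 1000),
})

# Rows that pair a make with the other make's model never occur in the toy data
impossible = (
    ((independent['Make'] == 'TESLA') & (independent['Model'] == 'LEAF'))
    | ((independent['Make'] == 'NISSAN') & (independent['Model'] == 'MODEL 3'))
)
mismatch_rate = impossible.mean()
print(mismatch_rate)  # roughly 0.5, while the toy data contains no such rows
```

Each column's distribution is reproduced well, yet about half of the sampled rows describe a combination that does not exist in the source data.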

In [5]:
def dp_two_marginal(col1, col2, epsilon):
    # Histogram over every observed (col1, col2) pair
    hist = evs[[col1, col2]].value_counts()
    # Add Laplace noise to each count (sensitivity 1), then clamp negatives to 0
    dp_hist = hist.apply(lambda x: laplace_mech(x, 1, epsilon))
    dp_hist = dp_hist.clip(lower=0)

    # Normalize the noisy counts so they sum to 1
    dp_hist = dp_hist / dp_hist.sum()

    # Name the series explicitly so the column is called 'probability'
    # regardless of how the installed pandas version names value_counts()
    return dp_hist.rename('probability').reset_index()

dp_two_marginal('County', 'Make', 1.0)

Unnamed: 0,County,Make,probability
0,King,TESLA,2.582890e-01
1,Snohomish,TESLA,6.107783e-02
2,King,NISSAN,4.190134e-02
3,Pierce,TESLA,3.254177e-02
4,King,CHEVROLET,3.245864e-02
...,...,...,...
1183,Meade,VOLKSWAGEN,6.902003e-06
1184,Mercer,CHEVROLET,4.359329e-06
1185,Miami-Dade,TESLA,1.593276e-05
1186,Middlesex,CHEVROLET,8.294070e-06


A 2-way marginal may not look problematic on its own, but what happens when we construct a 4-way marginal? Below, the number of rows increases significantly (7,662 rows versus 1,187 for the 2-way marginal above). With a 4-way marginal we group the data by 4 columns, which leads to more specific grouping and therefore more, smaller groups. Each group receives noise of the same scale, so smaller groups have a lower signal relative to the noise, and the resulting marginal is distorted more.

In [63]:
def dp_four_marginal(col1, col2, col3, col4, epsilon):
    # Histogram over every observed combination of the four columns
    hist = evs[[col1, col2, col3, col4]].value_counts()
    # Add Laplace noise to each count (sensitivity 1), then clamp negatives to 0
    dp_hist = hist.apply(lambda x: laplace_mech(x, 1, epsilon))
    dp_hist = dp_hist.clip(lower=0)

    # Normalize the noisy counts so they sum to 1
    dp_hist = dp_hist / dp_hist.sum()

    # Name the series explicitly so the column is called 'probability'
    # regardless of how the installed pandas version names value_counts()
    return dp_hist.rename('probability').reset_index()

dp_four_marginal('County', 'Electric Range', 'Model', 'Model Year', 1.0)

Unnamed: 0,County,Electric Range,Model,Model Year,probability
0,King,0,MODEL Y,2023,0.052188
1,King,0,MODEL Y,2022,0.027399
2,King,0,MODEL Y,2021,0.023117
3,King,215,MODEL 3,2018,0.022486
4,King,0,MODEL 3,2023,0.020811
...,...,...,...,...,...
7658,Lincoln,35,VOLT,2012,0.000001
7659,Lincoln,38,VOLT,2013,0.000013
7660,Lincoln,38,VOLT,2015,0.000008
7661,Lincoln,39,X5,2024,0.000000
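
The signal-to-noise argument can be checked numerically. The sketch below is illustrative only: the group sizes 5000 and 2 are made-up stand-ins for a large cell and a small cell, and `mean_relative_noise` is a hypothetical helper, not part of the project code.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0

def mean_relative_noise(true_count, trials=10_000):
    # Average |noise| / count for Laplace noise with scale 1/epsilon
    noise = rng.laplace(loc=0, scale=1 / epsilon, size=trials)
    return np.mean(np.abs(noise)) / true_count

large_group = mean_relative_noise(5000)  # stand-in for a cell with many rows
small_group = mean_relative_noise(2)     # stand-in for a cell with very few rows
print(large_group, small_group)
```

The expected absolute value of Laplace noise with scale 1/epsilon is 1/epsilon, so at epsilon = 1.0 the noise is about one count on average. That is negligible for a group of thousands but about half the size of a group of two.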


In [41]:
def mean_pct_error(column, epsilon=1.0):
    # Sample one synthetic column with as many rows as the original dataset
    syn_data = dp_synthetic_data([column], len(evs), epsilon).value_counts()
    ev_makes = evs[column].value_counts()

    # Note: both count lists are sorted by frequency, so they are compared
    # rank by rank (most common vs. most common), not label by label.
    # The error is taken relative to the synthetic count, and only the
    # first len(syn_data) ranks are compared.
    errors = pct_error_vec(list(syn_data), list(ev_makes))

    return sum(errors) / len(errors)


column = 'Make'
mean_error = mean_pct_error(column)
print(f"Mean Error for Column '{column}': " + str(mean_error))


Mean Error for Column 'Make': 6.787410649783726


The cell above measures the mean percent error for a single run of generating one column of synthetic data. This number is useful, but it varies from run to run, so we also average the mean percent error over 50 iterations.
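
One caveat about this measurement: `mean_pct_error` compares the two `value_counts()` lists position by position, so counts are matched by rank rather than by label. A label-aligned variant is sketched here as an alternative, not as the method used for the numbers reported in this notebook. The helper `aligned_pct_errors` and the toy counts are assumptions for illustration.

```python
import pandas as pd

def aligned_pct_errors(true_counts, syn_counts):
    # Reindex the synthetic counts by the original labels; labels that were
    # never sampled get a count of 0
    syn_aligned = syn_counts.reindex(true_counts.index, fill_value=0)
    # Percent error relative to the original count, label by label
    return (true_counts - syn_aligned).abs() / true_counts * 100.0

# Hypothetical counts, not taken from the EV dataset
true_counts = pd.Series({'TESLA': 100, 'NISSAN': 50, 'KIA': 10})
syn_counts = pd.Series({'NISSAN': 55, 'TESLA': 90})  # 'KIA' was never sampled
errors = aligned_pct_errors(true_counts, syn_counts)
print(errors.to_dict())  # {'TESLA': 10.0, 'NISSAN': 10.0, 'KIA': 100.0}
```

Aligning by label makes rare values that are never sampled show up as 100% error, so this variant would likely report a higher mean than the rank-based comparison.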

In [48]:
column = 'Make'
errors = []

for x in range(50):
    error = mean_pct_error(column)
    errors.append(error)

mean_error = sum(errors) / len(errors)

print(mean_error)

7.350739159317558


Averaged over 50 iterations, the mean percent error is about 7-8% relative to the original dataset. While this number is not especially high, it covers only a single column and says nothing about the error across multiple columns. In the next cells we will determine the percent error over multiple columns, starting with a 4-way marginal.

In [64]:
def fourWay_err():
    fourWay = dp_four_marginal('County', 'Make', 'Electric Range', 'Model Year', 1.0)

    error_dict = {'County':[], 'Make':[], 'Electric Range':[], 'Model Year':[]}

    syn_counties = fourWay['County']
    syn_makes = fourWay['Make']
    syn_ranges = fourWay['Electric Range']
    syn_years = fourWay['Model Year']
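
`fourWay_err` is unfinished. One possible way to continue, sketched here under assumptions rather than as the intended design, is to collapse the joint table back to a single-column marginal and compare it with the true proportions. The helper `column_marginal_from_joint`, the two-column toy table, and the toy "true" proportions are all hypothetical.

```python
import pandas as pd

def column_marginal_from_joint(joint, col):
    # Sum the joint probabilities over all other columns
    return joint.groupby(col)['probability'].sum()

# Toy joint table in the same shape dp_four_marginal returns (two columns here)
joint = pd.DataFrame({
    'County': ['King', 'King', 'Pierce'],
    'Make': ['TESLA', 'NISSAN', 'TESLA'],
    'probability': [0.5, 0.3, 0.2],
})
# Toy "true" proportions for the County column
true_county = pd.Series({'King': 0.75, 'Pierce': 0.25})

syn_county = column_marginal_from_joint(joint, 'County')
# Percent error per county, aligned by label
county_err = (true_county - syn_county).abs() / true_county * 100.0
print(county_err.round(2).to_dict())
```

On the real data, the true proportions could come from `evs[col].value_counts(normalize=True)`, and the same comparison could be repeated for each of the four columns to fill `error_dict`.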


    



In [None]:
""" CODE FOR OVERLAPPING MARGINALS FROM HW 10 USING ADULT DATASET, MUST BE REFACTORED FOR EV DATA """

def dp_synthetic_data_two_marginal(n, epsilon):
    # generate age based off work class and education
    # generate work class based off occupation and age
    # keys must match the columns appended below (Adult columns until the EV refactor is done)
    syn_data = {'Age':[], 'Workclass':[], 'Occupation':[], 'Education':[]}
    
    
    for x in range(n):
        age = dp_synthetic_data(['Age'], 1, epsilon).at[0, 'Age']

        workclass_marginal = a_given_b(age, 'Workclass', 'Age', epsilon)
        workclass = np.random.choice(list(workclass_marginal.keys()), size=1, p=workclass_marginal)[0]

        occupation_marginal = a_given_b(workclass, 'Occupation', 'Workclass', epsilon)
        occupation = np.random.choice(list(occupation_marginal.keys()), size=1, p=occupation_marginal)[0]

        education_marginal = a_given_b(occupation, 'Education', 'Occupation', epsilon)
        education = np.random.choice(list(education_marginal.keys()), size=1, p=education_marginal)[0]

        syn_data['Age'].append(age)
        syn_data['Workclass'].append(workclass)
        syn_data['Occupation'].append(occupation)
        syn_data['Education'].append(education)
        
    synthetic_dataframe = pd.DataFrame.from_dict(syn_data)
    return synthetic_dataframe
    
    
def a_given_b(b, a_col, b_col, epsilon):
    temp = adult.copy()

    
    temp = temp[temp[b_col] == b]
    
    data = temp[a_col].value_counts()
    results = [x for x in data]
    noisy = [laplace_mech(v, 1, epsilon) for v in results]
    labels = temp[a_col].value_counts().index.to_list()
    syn_rep = {}
    
    for x in range(len(labels)):
        syn_rep[labels[x]] = max(0, noisy[x])
    
    total = sum(syn_rep.values())
    
    marginal = {}
    for x in labels:
        marginal[x] = syn_rep[x] / total
        
    marginal = pd.Series(marginal)
    
    return marginal
    

dp_synthetic_data_two_marginal(100, 1.0)
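
As a starting point for the refactor, the chained sampling idea can be sketched on a toy dataframe with EV-style column names. This is a minimal sketch under assumptions: the toy data, the two-column chain (`Make`, then `Model` given `Make`), and the helper names `toy_laplace`, `noisy_conditional`, and `toy_overlapping_synthetic` are all hypothetical. Like the HW 10 code, it redraws noise on every call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def toy_laplace(v, sensitivity, epsilon):
    # Laplace mechanism using a seeded generator
    return v + rng.laplace(loc=0, scale=sensitivity / epsilon)

def noisy_conditional(df, b, a_col, b_col, epsilon):
    # Noisy distribution of a_col among rows where b_col == b
    counts = df[df[b_col] == b][a_col].value_counts()
    noisy = counts.apply(lambda v: max(0.0, toy_laplace(v, 1, epsilon)))
    if noisy.sum() == 0:
        # Fall back to uniform if every noisy count was clamped to 0
        noisy[:] = 1.0
    return noisy / noisy.sum()

def toy_overlapping_synthetic(df, n, epsilon):
    rows = []
    # Noisy one-way marginal for the first column in the chain
    make_counts = df['Make'].value_counts()
    make_probs = make_counts.apply(lambda v: max(0.0, toy_laplace(v, 1, epsilon)))
    make_probs = make_probs / make_probs.sum()
    for _ in range(n):
        # Sample Make first, then Model conditioned on the sampled Make
        make = rng.choice(make_probs.index.to_numpy(), p=make_probs.to_numpy())
        model_probs = noisy_conditional(df, make, 'Model', 'Make', epsilon)
        model = rng.choice(model_probs.index.to_numpy(), p=model_probs.to_numpy())
        rows.append({'Make': make, 'Model': model})
    return pd.DataFrame(rows)

# Hypothetical toy data, not taken from the EV dataset
toy = pd.DataFrame({
    'Make':  ['TESLA'] * 60 + ['NISSAN'] * 40,
    'Model': ['MODEL 3'] * 30 + ['MODEL Y'] * 30 + ['LEAF'] * 40,
})
syn = toy_overlapping_synthetic(toy, 20, 1.0)
print(syn.head())
```

Because each model is drawn only from rows matching the sampled make, every synthetic row is a make/model combination that occurs in the toy data, which is the property independent sampling loses.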