# Synthetic Data - CS3110 Final Project

### Main Goal
Our team consists of Teddy Ruth, Joe Brennan, and Jordan Gottlieb. Our goal in this project is to examine in more depth the accuracy (or lack thereof) of synthetic data. How does accuracy compare across different marginals? What techniques can we use to maximize accuracy? What are some of the negative consequences of increasing accuracy?

TODO: 

1. Establish accuracy of 4-way marginal - done
2. Develop overlapping marginals - done
3. Establish accuracy of overlapping marginals - done
4. Run a series of queries on overlapping marginals
5. Run the same queries using the Laplace/Gaussian mechanism on the original data
6. Present findings

In [14]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    return [v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)
            for v in vec]

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

def pct_error_vec(orig, priv):
    errors = []
    for i in range(len(orig)):
        pct_err = np.abs(orig[i]-priv[i])/orig[i] * 100.0
        errors.append(pct_err)

    return errors


evs = pd.read_csv("Electric_Vehicle_Population_Data.csv")


### dp_marginal

The function below computes a differentially private marginal for a given column and epsilon value. It returns a dictionary whose keys are the column's values and whose values are the noisy, normalized probability of each value over the whole dataset. For example, when passed the `County` column, the function returns each county in the dataset as a key, mapped to the estimated fraction of rows in that county.

In [15]:
def dp_marginal(col, epsilon):
    
    data = evs[col].value_counts()
    results = [x for x in data]
    noisy = [laplace_mech(v, 1, epsilon) for v in results]
    labels = evs[col].value_counts().index.to_list()
    syn_rep = {}
    
    for x in range(len(labels)):
        syn_rep[labels[x]] = max(0, noisy[x])
    
    total = sum(syn_rep.values())
    
    marginal = {}
    for x in labels:
        marginal[x] = syn_rep[x] / total
    return marginal


marginal = dp_marginal('County', 1.0)
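
The three steps inside `dp_marginal` (add Laplace noise to each count, clip negatives to zero, normalize) can be seen on a small example. The following is a minimal sketch on a made-up toy series, not the notebook's own code: `noisy_marginal`, the toy data, and the seeded generator from `np.random.default_rng` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def noisy_marginal(series, epsilon, rng):
    """Return a noisy, normalized histogram of `series` as a pandas Series."""
    counts = series.value_counts()
    # One Laplace draw per category, with scale 1/epsilon (sensitivity 1).
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    # Clip negative noisy counts to zero, then normalize so the values sum to 1.
    clipped = noisy.clip(lower=0)
    return clipped / clipped.sum()

# Hypothetical toy column: 60 rows of King, 30 of Pierce, 10 of Clark.
rng = np.random.default_rng(0)
toy = pd.Series(["King"] * 60 + ["Pierce"] * 30 + ["Clark"] * 10)
probs = noisy_marginal(toy, epsilon=1.0, rng=rng)
print(probs)
```

The result is a probability vector indexed by label, which is the form `np.random.choice` expects for its `p` argument.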

### dp_synthetic_data

This function generates synthetic data by computing a marginal for every column passed to it, sampling each column independently from its marginal, and combining the samples into a single dataframe. It takes the list of columns, the number of rows to generate, and the epsilon value.

In [16]:
data = evs['County'].value_counts().index.to_list()
print (data)


['King', 'Snohomish', 'Pierce', 'Clark', 'Thurston', 'Kitsap', 'Spokane', 'Whatcom', 'Benton', 'Skagit', 'Island', 'Clallam', 'Chelan', 'Jefferson', 'Yakima', 'San Juan', 'Cowlitz', 'Mason', 'Lewis', 'Grays Harbor', 'Kittitas', 'Franklin', 'Grant', 'Walla Walla', 'Douglas', 'Whitman', 'Klickitat', 'Okanogan', 'Stevens', 'Pacific', 'Skamania', 'Asotin', 'Wahkiakum', 'Pend Oreille', 'Adams', 'Lincoln', 'San Diego', 'Ferry', 'Columbia', 'Orange', 'Santa Clara', 'Fairfax', 'Anne Arundel', 'Los Angeles', 'Maricopa', 'El Paso', 'Montgomery', 'Honolulu', 'Virginia Beach', 'Cumberland', 'New London', 'Lake', 'Garfield', 'Solano', 'Harnett', 'Burlington', 'Sacramento', 'Alameda', 'Ventura', 'Riverside', 'Cook', "Prince George's", 'Kings', 'Bexar', 'San Bernardino', 'Multnomah', 'Charleston', 'Goochland', 'Hillsborough', 'Middlesex', 'Stafford', 'Monterey', 'Alexandria', 'District of Columbia', 'Loudoun', 'Harford', 'Contra Costa', 'Polk', 'Kern', 'Hoke', 'New Haven', 'Berkeley', 'Collin', 'Rich

In [17]:
def dp_synthetic_data(cols, n, epsilon):
    df_data = {}
    for col in cols:
        df_data[col] = []
        data = evs[col].value_counts().index.to_list()
        results = [x for x in data]
        marginal = list(dp_marginal(col, epsilon).values())
        synthetic = np.random.choice(results, size=n, p=marginal)
        
        for x in synthetic:
            df_data[col].append(x)
            
    dp_df = pd.DataFrame.from_dict(df_data)
    
    return dp_df

dp_synthetic_data(['County', 'Model'], 100, 1.0)

Unnamed: 0,County,Model
0,King,MODEL 3
1,King,LEAF
2,Pierce,MODEL Y
3,King,F-150
4,Thurston,WRANGLER
...,...,...
95,King,LEAF
96,Franklin,VOLT
97,Kitsap,VOLT
98,Whatcom,BOLT EV


The issue with `dp_synthetic_data` is that it does not preserve any of the correlations in the dataset: it merely samples from two one-way marginals and stitches the results together. This is acceptable when correlations between columns do not matter, but it performs poorly when they must be preserved. To address this, we can generate synthetic data from 2-way marginals, which preserve the correlation between a pair of columns.

In [18]:
def dp_two_marginal(col1, col2, epsilon):
    # Count each (col1, col2) pair, then add Laplace noise to every count
    hist = evs[[col1, col2]].value_counts()
    dp_hist = hist.apply(lambda x: laplace_mech(x, 1, epsilon))
    # Clip negative noisy counts to zero and normalize to probabilities
    dp_hist = dp_hist.clip(lower=0)
    dp_hist = dp_hist / dp_hist.sum()

    # Name the values explicitly so the column is 'probability' regardless of
    # the default name pandas gives to value_counts() results
    return dp_hist.rename('probability').reset_index()

dp_two_marginal('County', 'Make', 1.0)

Unnamed: 0,County,Make,probability
0,King,TESLA,0.258222
1,Snohomish,TESLA,0.061065
2,King,NISSAN,0.041888
3,Pierce,TESLA,0.032538
4,King,CHEVROLET,0.032410
...,...,...,...
1183,Meade,VOLKSWAGEN,0.000004
1184,Mercer,CHEVROLET,0.000006
1185,Miami-Dade,TESLA,0.000000
1186,Middlesex,CHEVROLET,0.000012
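
A table like the one above can be sampled row by row, so that each synthetic record keeps its `County` and `Make` values together. The following is a minimal sketch, assuming a table whose `probability` column sums to 1. `sample_from_joint` and the toy table are hypothetical names for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def sample_from_joint(joint, n, rng):
    """Draw n rows from a table whose 'probability' column sums to 1."""
    # Pick row positions according to the probabilities, so every sampled
    # row keeps its combination of column values intact.
    idx = rng.choice(len(joint), size=n, p=joint["probability"].to_numpy())
    return joint.drop(columns="probability").iloc[idx].reset_index(drop=True)

# Hypothetical toy table in the same shape as the output of dp_two_marginal.
joint = pd.DataFrame({
    "County": ["King", "King", "Pierce"],
    "Make": ["TESLA", "NISSAN", "TESLA"],
    "probability": [0.6, 0.3, 0.1],
})
rng = np.random.default_rng(0)
synthetic = sample_from_joint(joint, n=1000, rng=rng)
print(synthetic.value_counts(normalize=True))
```

Because whole rows are sampled, a pair that has no row in the table (here Pierce with NISSAN) can never appear in the synthetic data, unlike with independent per-column sampling.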


While nothing stands out when constructing a 2-way marginal, what happens when we construct a 4-way marginal? Below, the number of rows increases significantly. This is because a 4-way marginal groups the data by 4 columns, which makes the groups more specific: there are more groups, each with fewer rows. Smaller counts mean a weaker signal, so the same amount of noise distorts the marginal more.

In [19]:
def dp_four_marginal(col1, col2, col3, col4, epsilon):
    # Count each combination of the four columns, then add Laplace noise
    hist = evs[[col1, col2, col3, col4]].value_counts()
    dp_hist = hist.apply(lambda x: laplace_mech(x, 1, epsilon))
    # Clip negative noisy counts to zero and normalize to probabilities
    dp_hist = dp_hist.clip(lower=0)
    dp_hist = dp_hist / dp_hist.sum()

    # Name the values explicitly so the column is 'probability' regardless of
    # the default name pandas gives to value_counts() results
    return dp_hist.rename('probability').reset_index()

dp_four_marginal('County', 'Make', 'Electric Range', 'Model Year', 1.0)

Unnamed: 0,County,Make,Electric Range,Model Year,probability
0,King,TESLA,0,2023,7.883860e-02
1,King,TESLA,0,2022,5.111608e-02
2,King,TESLA,0,2021,3.788425e-02
3,Snohomish,TESLA,0,2023,2.257345e-02
4,King,TESLA,215,2018,2.249652e-02
...,...,...,...,...,...
6702,Lincoln,TESLA,291,2020,3.361502e-06
6703,Lincoln,TOYOTA,25,2017,2.035670e-05
6704,Lincoln,TOYOTA,25,2018,6.171635e-07
6705,Lincoln,TOYOTA,42,2023,6.292098e-06
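
To see why smaller groups suffer more from the same noise, note that a Laplace variable with scale b has expected absolute value b, so the expected relative error of a noisy count is roughly (sensitivity / epsilon) / count. The following is a minimal sketch of that relationship. `expected_relative_error` and the chosen counts are assumptions for illustration, and the simulation is only a sanity check of the formula.

```python
import numpy as np

def expected_relative_error(count, epsilon, sensitivity=1.0):
    """Expected |noise| / count for the Laplace mechanism.

    A Laplace variable with scale b has expected absolute value b,
    so the expected relative error is (sensitivity / epsilon) / count.
    """
    return (sensitivity / epsilon) / count

rng = np.random.default_rng(0)
epsilon = 1.0
for count in [1000, 100, 10, 1]:
    # Empirical check: average absolute noise divided by the true count.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=100_000)
    simulated = np.mean(np.abs(noise)) / count
    print(count, expected_relative_error(count, epsilon), round(float(simulated), 4))
```

At a fixed epsilon, dividing a group's count by ten multiplies its expected relative error by ten, which is the effect of grouping by more columns.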


In [20]:
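def pct_error_1Dmarginal(marginal):
    # NOTE: the `marginal` argument is not used in the body. The function
    # reads the global `column` and regenerates synthetic data for it.
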
    def gen_samples(cols, n, epsilon):
        df_data = {}
        for col in cols:
            df_data[col] = []
            data = evs[col].value_counts().index.to_list()
            results = [x for x in data]
            marginal = list(dp_marginal(col, epsilon).values())
            synthetic = np.random.choice(results, size=n, p=marginal)
            
            for x in synthetic:
                df_data[col].append(x)
                
        dp_df = pd.DataFrame.from_dict(df_data)

        return dp_df
    
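    # Counts are compared by rank position (value_counts sorts descending),
    # not by label, and the synthetic counts serve as the denominator.
    syn_data = gen_samples([column], len(evs), 1.0).value_counts()
    ev_makes = evs[column].value_counts()

    errors = pct_error_vec(list(syn_data), list(ev_makes))

    return sum(errors) / len(errors)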


column = 'Make'
marginal = dp_marginal(column, 1.0)
mean_error = pct_error_1Dmarginal(marginal)
print(f"Mean Error for Column '{column}': " + str(mean_error))


Mean Error for Column 'Make': 10.197103388453499


In the cell above, we determine the mean percent error for a single run of generating one column of synthetic data. This figure is useful but volatile, so below we establish the mean percent error over 50 iterations.

In [21]:
def mean_pct_error_1Dmarginal(column):
    errors = []

    for x in range(50):
        marginal = dp_marginal(column, 1.0)
        error = pct_error_1Dmarginal(marginal)
        errors.append(error)

    mean_error = sum(errors) / len(errors)

    return (mean_error)

mean_pct_error_1Dmarginal('Make')

7.606510839200354

Averaging over 50 iterations yields a mean percent error of roughly 7-8% relative to the original dataset. While this number is not especially high, it says nothing about errors across multiple columns. In the next cells we determine the percent error over multiple columns, starting with the 4-way marginal from earlier.

The function below generates the percent error of a 4D marginal generated synthetic dataset by comparing each of its column values to the column values of the original dataset.

In [22]:
def pct_error_4Dmarginal(col1, col2, col3, col4):
    fourWay = dp_four_marginal(col1, col2, col3, col4, 1.0)
    # NOTE: pct_error_1Dmarginal does not use its argument; each call below
    # measures the error for the global `column` rather than for col1..col4.
    error_col1 = pct_error_1Dmarginal(fourWay[col1])
    error_col2 = pct_error_1Dmarginal(fourWay[col2])
    error_col3 = pct_error_1Dmarginal(fourWay[col3])
    error_col4 = pct_error_1Dmarginal(fourWay[col4])

    return error_col1, error_col2, error_col3, error_col4
    
pct_error_4Dmarginal('County', 'Make', 'Electric Range', 'Model Year')


(8.407071963301144, 2.522331254981834, 5.73207438499442, 5.776952649034274)

To obtain a more stable estimate, we compute 50 percent errors for each column and then take the mean of each.

In [25]:
def mean_pct_error_4Dmarginal(col1, col2, col3, col4):
    errors1 = []
    errors2 = []
    errors3 = []
    errors4 = []

    for x in range(50):
        error1, error2, error3, error4 = pct_error_4Dmarginal(col1, col2, col3, col4)
        errors1.append(error1)
        errors2.append(error2)
        errors3.append(error3)
        errors4.append(error4)

    mean_error1 = sum(errors1) / len(errors1)
    mean_error2 = sum(errors2) / len(errors2)
    mean_error3 = sum(errors3) / len(errors3)
    mean_error4 = sum(errors4) / len(errors4)
    
    return (mean_error1, mean_error2, mean_error3, mean_error4)

mean_pct_error_4Dmarginal('County', 'Make', 'Electric Range', 'Model Year')

(7.346586172835307, 7.527030159769923, 7.912793896866559, 7.251490276038537)
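
A column-wise comparison like the one described above can also be made directly from a joint marginal table, by summing its probabilities over the other columns and aligning the result with the original frequencies by label. The following is a minimal sketch on hypothetical toy tables. `column_pct_error` and both toy dataframes are illustrative names, not part of the notebook.

```python
import pandas as pd

def column_pct_error(joint, original, col):
    """Mean percent error of the 1-D marginal of `col` implied by `joint`."""
    # Sum the joint probabilities over all other columns.
    implied = joint.groupby(col)["probability"].sum()
    # True relative frequencies in the original data.
    true = original[col].value_counts(normalize=True)
    # Align by label; categories missing from the joint table count as 0.
    implied = implied.reindex(true.index, fill_value=0.0)
    return float(((implied - true).abs() / true * 100.0).mean())

# Hypothetical original data and a joint table with made-up probabilities.
original = pd.DataFrame({
    "County": ["King", "King", "King", "Pierce"],
    "Make": ["TESLA", "NISSAN", "TESLA", "TESLA"],
})
joint = pd.DataFrame({
    "County": ["King", "King", "Pierce"],
    "Make": ["TESLA", "NISSAN", "TESLA"],
    "probability": [0.5, 0.2, 0.3],
})
county_error = column_pct_error(joint, original, "County")
make_error = column_pct_error(joint, original, "Make")
print(county_error, make_error)
```

Aligning by label rather than by rank means a category is always compared with itself, even when noise changes the ordering of the counts.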

Our last method for generating accurate multidimensional synthetic data uses overlapping marginals. These preserve the correlations we deem important without resorting to high-dimensional groupings, which would severely lower the signal in the data and hurt accuracy.

In [26]:
def dp_marginal_df(col, epsilon, df):
    f = lambda x: x + np.random.laplace(loc=0, scale=1/epsilon)
    hist_noisy = df[col].value_counts().apply(f)
     
    non_negative_syn_rep = np.clip(hist_noisy, 0, None)
    
    total = np.sum(non_negative_syn_rep)
    h = lambda x: x/total #normalized
    
    return non_negative_syn_rep.apply(h)

def gen_sample(marginal, col, df):
    keys = df[col].value_counts().keys()
    return np.random.choice(keys, size=1, p=marginal)

In [27]:
def dp_synthetic_data_two_marginal(n, epsilon):
    synth_df = pd.DataFrame(columns=['County','Make','Electric Range','Model Year'])
    
    for x in range(n):
        # generate marginal of County in evs
        m_County = dp_marginal_df('County', epsilon, evs)
        # get single synthetic County using marginal and evs
        County_syn = gen_sample(m_County, 'County', evs)
        
        # filter evs to only contain rows with that County
        filtered_County = evs[evs['County']==County_syn[0]]
        
        # generate marginal of Make in filtered dataframe
        m_Make = dp_marginal_df('Make', epsilon, filtered_County)
        # get single synthetic Make using marginal and filtered evs
        Make_syn = gen_sample(m_Make, 'Make', filtered_County)
        
        # filter evs to only contain rows with that Make
        filtered_Make = evs[evs['Make']==Make_syn[0]]
        
        # repeat for Electric Range given Make, then Model Year given Electric Range
        m_Range = dp_marginal_df('Electric Range', epsilon, filtered_Make)
        Range_syn = gen_sample(m_Range, 'Electric Range', filtered_Make)
        
        filtered_Range = evs[evs['Electric Range']==Range_syn[0]]
        
        m_Year = dp_marginal_df('Model Year', epsilon, filtered_Range)
        Year_syn = gen_sample(m_Year, 'Model Year', filtered_Range)
    
        synth_df.loc[x] = [County_syn[0], Make_syn[0], Range_syn[0], Year_syn[0]]
    return synth_df




    """syn_data = {'County':[], 'Make':[], 'Electric Range':[], 'Model Year':[]}
    
    
    for x in range(n):
        age = dp_synthetic_data(['County'], 1, epsilon).at[0, 'County']

        workclass_marginal = a_given_b(age, 'Make', 'County', epsilon)
        workclass = np.random.choice(list(workclass_marginal.keys()), size=1, p=workclass_marginal)[0]

        occupation_marginal = a_given_b(workclass, 'Electric Range', 'Make', epsilon)
        occupation = np.random.choice(list(occupation_marginal.keys()), size=1, p=occupation_marginal)[0]

        education_marginal = a_given_b(occupation, 'Model Year', 'Electric Range', epsilon)
        education = np.random.choice(list(education_marginal.keys()), size=1, p=education_marginal)[0]

        syn_data['County'].append(age)
        syn_data['Make'].append(workclass)
        syn_data['Electric Range'].append(occupation)
        syn_data['Model Year'].append(education)
        
    synthetic_dataframe = pd.DataFrame.from_dict(syn_data)
    return synthetic_dataframe
    
    
def a_given_b(b, a_col, b_col, epsilon):
    temp = evs.copy()

    
    temp = temp[temp[b_col] == b]
    
    data = temp[a_col].value_counts()
    results = [x for x in data]
    noisy = [laplace_mech(v, 1, epsilon) for v in results]
    labels = temp[a_col].value_counts().index.to_list()
    syn_rep = {}
    
    for x in range(len(labels)):
        syn_rep[labels[x]] = max(0, noisy[x])
    
    total = sum(syn_rep.values())
    
    marginal = {}
    for x in labels:
        marginal[x] = syn_rep[x] / total
        
    marginal = pd.Series(marginal)
    
    return marginal"""
    

dp_synthetic_data_two_marginal(100, 1.0)

Unnamed: 0,County,Make,Electric Range,Model Year
0,Clark,TESLA,0,2021
1,King,TESLA,0,2023
2,King,CHEVROLET,238,2017
3,King,NISSAN,150,2019
4,King,NISSAN,75,2013
...,...,...,...,...
95,Kitsap,HONDA,47,2018
96,Chelan,TESLA,0,2021
97,King,MAZDA,26,2024
98,Skagit,CHEVROLET,38,2015


In [29]:
def pct_error_overlapping_marginals(col1, col2, col3, col4):
    marginal = dp_synthetic_data_two_marginal(2000,1)
    # NOTE: pct_error_1Dmarginal does not use its argument; each call below
    # measures the error for the global `column` rather than for col1..col4.
    error_col1 = pct_error_1Dmarginal(marginal[col1])
    error_col2 = pct_error_1Dmarginal(marginal[col2])
    error_col3 = pct_error_1Dmarginal(marginal[col3])
    error_col4 = pct_error_1Dmarginal(marginal[col4])

    return error_col1, error_col2, error_col3, error_col4
    
pct_error_overlapping_marginals('County', 'Make', 'Electric Range', 'Model Year')

(5.703830165617258, 6.432888148225146, 7.300792303410722, 5.936484171912751)

It is possible to evaluate this multiple times and take the average; however, generating each synthetic dataset is slow, so the full run takes a significant amount of time (10 to 15 minutes). We have provided the code to compute the error of overlapping marginals over multiple runs for those interested.

In [None]:
def mean_pct_error_overlapping_marginal(col1, col2, col3, col4):
    errors1 = []
    errors2 = []
    errors3 = []
    errors4 = []

    for x in range(10):
        error1, error2, error3, error4 = pct_error_overlapping_marginals(col1, col2, col3, col4)
        errors1.append(error1)
        errors2.append(error2)
        errors3.append(error3)
        errors4.append(error4)

    mean_error1 = sum(errors1) / len(errors1)
    mean_error2 = sum(errors2) / len(errors2)
    mean_error3 = sum(errors3) / len(errors3)
    mean_error4 = sum(errors4) / len(errors4)
    
    return (mean_error1, mean_error2, mean_error3, mean_error4)

mean_pct_error_overlapping_marginal('County', 'Make', 'Electric Range', 'Model Year')

(9.196959193888604, 7.136894440226206, 8.494631905781915, 6.384757144998363)
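
Much of the runtime mentioned above likely comes from `dp_synthetic_data_two_marginal` recomputing a noisy marginal for every generated row. One possible alternative, sketched here under assumptions, is to compute each noisy conditional marginal once and reuse it. `sample_chain`, `noisy_marginal`, and the toy dataframe are hypothetical names for illustration, and the uniform fallback for an all-zero histogram is an assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd

def noisy_marginal(series, epsilon, rng):
    """Noisy, clipped, normalized histogram of `series`."""
    counts = series.value_counts()
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    clipped = noisy.clip(lower=0)
    total = clipped.sum()
    if total == 0:
        # Assumption: if every noisy count was clipped to zero, use uniform.
        return pd.Series(1.0 / len(clipped), index=clipped.index)
    return clipped / total

def sample_chain(df, cols, n, epsilon, rng):
    """Sample n rows along the chain cols[0] -> cols[1] -> ..."""
    cache = {}  # (column, conditioning value) -> noisy marginal

    def marginal(col, prev_col, prev_val):
        key = (col, prev_val)
        if key not in cache:
            # Condition on the previous column's sampled value, if any.
            subset = df if prev_col is None else df[df[prev_col] == prev_val]
            cache[key] = noisy_marginal(subset[col], epsilon, rng)
        return cache[key]

    rows = []
    for _ in range(n):
        row, prev_col, prev_val = [], None, None
        for col in cols:
            m = marginal(col, prev_col, prev_val)
            val = rng.choice(m.index.to_numpy(), p=m.to_numpy())
            row.append(val)
            prev_col, prev_val = col, val
        rows.append(row)
    return pd.DataFrame(rows, columns=cols)

# Hypothetical toy data standing in for evs.
toy = pd.DataFrame({
    "County": ["King"] * 60 + ["Pierce"] * 40,
    "Make": ["TESLA"] * 40 + ["NISSAN"] * 20 + ["TESLA"] * 30 + ["FORD"] * 10,
})
rng = np.random.default_rng(0)
out = sample_chain(toy, ["County", "Make"], n=50, epsilon=1.0, rng=rng)
print(out.head())
```

With the cache, each noisy histogram is computed once per conditioning value instead of once per row, and the rows are collected in a list before building the dataframe rather than appended one at a time.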