# Synthetic Data - CS3110 Final Project

### Main Goal
Our team consists of Teddy Ruth, Joe Brennan, and Jordan Gottlieb. Our goal in this project is to examine in more depth the accuracy (or lack thereof) of synthetic data. How does accuracy compare across different marginals? What techniques can we use to maximize accuracy? What are some of the negative consequences of increasing accuracy?

TODO: 

1. Establish accuracy of 4-way marginal - done
2. Develop overlapping marginals - done
3. Establish accuracy of overlapping marginals - done
4. Run a series of queries on overlapping marginals
5. Run the same queries using the Laplace/Gaussian mechanism on the original data
6. Present findings

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    return [v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)
            for v in vec]

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

def pct_error_vec(orig, priv):
    errors = []
    for i in range(len(orig)):
        pct_err = np.abs(orig[i]-priv[i])/orig[i] * 100.0
        errors.append(pct_err)

    return errors


evs = pd.read_csv("Electric_Vehicle_Population_Data.csv")


## 1. Generating synthetic representations and data

### dp_marginal

The function below computes a differentially private marginal for a given column and epsilon value. It returns a dictionary whose keys are the values found in that column and whose values are the noisy probability of each value occurring in the dataset. For example, when passed the `County` column, the function returns the counties in the dataset as keys and, for each county, the noisy fraction of rows in which it occurs.

In [3]:
def dp_marginal(col, epsilon):
    
    data = evs[col].value_counts()
    results = [x for x in data]
    noisy = [laplace_mech(v, 1, epsilon) for v in results]
    labels = evs[col].value_counts().index.to_list()
    syn_rep = {}
    
    for x in range(len(labels)):
        syn_rep[labels[x]] = max(0, noisy[x])
    
    total = sum(syn_rep.values())
    
    marginal = {}
    for x in labels:
        marginal[x] = syn_rep[x] / total
    return marginal


marginal = dp_marginal('County', 1.0)

### dp_synthetic_data_1way

This function generates synthetic data by computing a marginal for every column passed to it, sampling each column independently from its marginal, and combining the samples into a single dataframe. It takes the list of columns, the number of rows to generate, and the epsilon value, which is applied to each column's marginal.

In [55]:
def dp_synthetic_data_1way(cols, n, epsilon):
    df_data = {}
    for col in cols:
        df_data[col] = []
        data = evs[col].value_counts().index.to_list()
        results = [x for x in data]
        marginal = list(dp_marginal(col, epsilon).values())
        synthetic = np.random.choice(results, size=n, p=marginal)
        
        for x in synthetic:
            df_data[col].append(x)
            
    dp_df = pd.DataFrame.from_dict(df_data)
    
    return dp_df

### syn_data1

Synthetic data of just the `County` column.

In [61]:
syn_data1 = dp_synthetic_data_1way(['County'], 100, 1.0)
print(syn_data1)

         County
0   Santa Clara
1          King
2          King
3          King
4          King
..          ...
95         King
96         King
97         King
98       Pierce
99         King

[100 rows x 1 columns]


### syn_data2
Synthetic data of `County` and `Make`, produced by stitching together two 1-way marginals.

In [62]:
syn_data2 = dp_synthetic_data_1way(['County', 'Make'], 100, 1.0)
print(syn_data2)

       County        Make
0        King       TESLA
1        King        MINI
2   Snohomish       TESLA
3     Spokane       TESLA
4        King         KIA
..        ...         ...
95       King  VOLKSWAGEN
96      Clark      NISSAN
97   Thurston   CHEVROLET
98       King   CHEVROLET
99       King       TESLA

[100 rows x 2 columns]


The issue with `dp_synthetic_data_1way` is that it does not preserve any of the correlations in the dataset. It merely takes two marginals and stitches them together. This can produce good data when correlations between columns do not matter, but it does a poor job when they need to be preserved. To address this, we can generate a new set of synthetic data from a 2-way marginal, which preserves the correlation between its two columns.
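To make the loss of correlation concrete, here is a minimal sketch on a hypothetical toy dataset (not the EV data) in which `Make` is fully determined by `County`. The helper `sample_independent` is an illustrative name, and noise is left out so that the effect of independent sampling is the only thing on display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: every King row is a TESLA, every Pierce row a NISSAN.
toy = pd.DataFrame({
    "County": ["King"] * 500 + ["Pierce"] * 500,
    "Make":   ["TESLA"] * 500 + ["NISSAN"] * 500,
})

def sample_independent(df, cols, n):
    """Sample each column on its own from its (non-private) 1-way marginal."""
    out = {}
    for col in cols:
        probs = df[col].value_counts(normalize=True)
        out[col] = rng.choice(probs.index.to_numpy(), size=n, p=probs.to_numpy())
    return pd.DataFrame(out)

stitched = sample_independent(toy, ["County", "Make"], 1000)

# The original data contains only two (County, Make) combinations.
print(toy.value_counts())
# Independent sampling also produces combinations such as (King, NISSAN),
# which never occur in the original data.
print(stitched.value_counts())
```

Each column of `stitched` still follows its own marginal closely, but the pairing between the columns is lost.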

### dp_two_marginal

In [50]:
def dp_two_marginal(col1, col2, epsilon):
    # Count every (col1, col2) combination, then add Laplace noise to each count
    hist = evs[[col1, col2]].value_counts()
    dp_hist = hist.apply(lambda x: laplace_mech(x, 1, epsilon))
    # Noisy counts can be negative; clip them to zero
    dp_hist = dp_hist.clip(lower=0)
    
    # Normalize the noisy counts so they sum to 1
    dp_hist = dp_hist / dp_hist.sum()
    
    # Name the column explicitly, since value_counts() labels its result
    # differently across pandas versions
    dp_hist = dp_hist.to_frame(name='probability').reset_index()
    
    return dp_hist

marginal2 = dp_two_marginal('County', 'Make', 1.0)

### dp_synthetic_data

Generates synthetic data of n rows when given a marginal

In [51]:
def dp_synthetic_data(n, marginal):
    samples = marginal.sample(n=n, replace=True, weights='probability')
    return samples

### syn_data3
Synthetic data generated using 2-way marginal of `County` and `Make`

In [65]:
syn_data3 = dp_synthetic_data(100, marginal2)
print(syn_data3)

        County       Make  probability
0         King      TESLA     0.258317
0         King      TESLA     0.258317
135   San Juan  CHEVROLET     0.000731
4         King  CHEVROLET     0.032422
19        King   CHRYSLER     0.009847
..         ...        ...          ...
0         King      TESLA     0.258317
83   Snohomish       AUDI     0.001531
0         King      TESLA     0.258317
0         King      TESLA     0.258317
0         King      TESLA     0.258317

[100 rows x 3 columns]


### Note:
Nothing looks unusual when we construct a 2-way marginal, but what happens when we construct a 4-way marginal? Below, the number of rows in the marginal increases significantly. Grouping the data by 4 columns makes each group more specific, which produces many more, and much smaller, groups. Because the Laplace noise added to each count has the same scale regardless of the group's size, small groups carry less signal relative to the noise, and the resulting marginal is less accurate.
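The effect of group size on accuracy can be sketched numerically. The cell counts below are hypothetical, chosen only to span large and tiny groups; the noise is Laplace with scale `1 / epsilon`, matching a sensitivity-1 count.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0
scale = 1.0 / epsilon  # Laplace scale for a sensitivity-1 count

# Hypothetical true cell counts, from a large group down to a single row.
true_counts = np.array([10000, 1000, 100, 10, 1])

# The expected absolute value of Laplace noise equals its scale,
# no matter how large the underlying count is.
trials = 10000
noise = rng.laplace(loc=0.0, scale=scale, size=(trials, len(true_counts)))
mean_abs_noise = np.abs(noise).mean(axis=0)

# The same amount of noise is a much larger share of a small count.
relative_error_pct = mean_abs_noise / true_counts * 100.0
for count, err in zip(true_counts, relative_error_pct):
    print(f"count={count:>6}  mean relative error = {err:.3f}%")
```

With these assumed counts, the relative error rises by roughly a factor of ten each time the group shrinks by a factor of ten.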

### dp_four_marginal

In [53]:
def dp_four_marginal(col1, col2, col3, col4, epsilon):
    # Count every combination of the four columns, then add Laplace noise
    hist = evs[[col1, col2, col3, col4]].value_counts()
    dp_hist = hist.apply(lambda x: laplace_mech(x, 1, epsilon))
    # Noisy counts can be negative; clip them to zero
    dp_hist = dp_hist.clip(lower=0)
    
    # Normalize the noisy counts so they sum to 1
    dp_hist = dp_hist / dp_hist.sum()
    
    # Name the column explicitly, since value_counts() labels its result
    # differently across pandas versions
    dp_hist = dp_hist.to_frame(name='probability').reset_index()
    
    return dp_hist

marginal4 = dp_four_marginal('County', 'Make', 'Electric Range', 'Model Year', 1.0)

### syn_data4
Synthetic data generated using 4-way marginal of `County`, `Make`, `Electric Range`, and `Model Year`.

In [67]:
syn_data4 = dp_synthetic_data(100, marginal4)
print(syn_data4)

         County       Make  Electric Range  Model Year  probability
37         King     NISSAN               0        2023     0.003387
109        King     NISSAN              73        2012     0.001280
1524     Chelan     NISSAN             107        2017     0.000062
948      Pierce     TOYOTA               6        2013     0.000135
285   Snohomish  CHEVROLET              38        2014     0.000518
...         ...        ...             ...         ...          ...
809       Clark     NISSAN              84        2014     0.000163
1          King      TESLA               0        2022     0.051126
77         King       AUDI             204        2019     0.001827
81    Snohomish    HYUNDAI               0        2023     0.001772
1144     Pierce      TESLA             289        2020     0.000101

[100 rows x 5 columns]


Our last method for generating accurate multidimensional synthetic data uses overlapping marginals. This preserves the correlations that we deem important without resorting to high-dimensional groupings, which would severely lower the signal in each group and hurt accuracy.
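As a rough way to read this, and assuming the chain order County, Make, Electric Range, Model Year that `dp_synthetic_data_overlapping_marginals` follows, the approach approximates the full 4-way joint distribution by a product of lower-dimensional pieces. The notation $\tilde{P}$ for a noisy (differentially private) marginal is introduced here for illustration only:

$$
P(\text{County}, \text{Make}, \text{Range}, \text{Year}) \approx \tilde{P}(\text{County}) \cdot \tilde{P}(\text{Make} \mid \text{County}) \cdot \tilde{P}(\text{Range} \mid \text{Make}) \cdot \tilde{P}(\text{Year} \mid \text{Range})
$$

Each factor involves at most two columns, so every noisy count is taken over a larger group than in the 4-way marginal. The trade-off is that correlations between columns that are not adjacent in the chain (for example County and Model Year) are only captured indirectly.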

### dp_marginal_df and gen_sample

Below are two helper functions for our `dp_synthetic_data_overlapping_marginals` function.

In [13]:
def dp_marginal_df(col, epsilon, df):
    f = lambda x: x + np.random.laplace(loc=0, scale=1/epsilon)
    hist_noisy = df[col].value_counts().apply(f)
     
    non_negative_syn_rep = np.clip(hist_noisy, 0, None)
    
    total = np.sum(non_negative_syn_rep)
    h = lambda x: x/total #normalized
    
    return non_negative_syn_rep.apply(h)

def gen_sample(marginal, cols, df):
    keys = df[cols].value_counts().keys()
    return np.random.choice(keys, size=1, p=marginal)

### dp_synthetic_data_overlapping_marginals

In [19]:
def dp_synthetic_data_overlapping_marginals(n, epsilon):
    synth_df = pd.DataFrame(columns=['County','Make','Electric Range','Model Year'])
    
    for x in range(n):
        # generate marginal of County in evs
        m_County = dp_marginal_df('County', epsilon, evs)
        # get single synthetic County using marginal and evs
        County_syn = gen_sample(m_County, 'County', evs)
        
        # filter evs to only contain vehicles in that County
        filtered_County = evs[evs['County']==County_syn[0]]
        
        # generate marginal of Make in filtered dataframe
        m_Make = dp_marginal_df('Make', epsilon, filtered_County)
        # get single synthetic Make using marginal and filtered evs
        Make_syn = gen_sample(m_Make, 'Make', filtered_County)
        
        # filter evs to only contain vehicles of that Make
        filtered_Make = evs[evs['Make']==Make_syn[0]]
        
        # repeat for Electric Range (given Make) and Model Year (given Electric Range)
        m_Range = dp_marginal_df('Electric Range', epsilon, filtered_Make)
        Range_syn = gen_sample(m_Range, 'Electric Range', filtered_Make)
        
        filtered_Range = evs[evs['Electric Range']==Range_syn[0]]
        
        m_Year = dp_marginal_df('Model Year', epsilon, filtered_Range)
        Year_syn = gen_sample(m_Year, 'Model Year', filtered_Range)
    
        synth_df.loc[x] = [County_syn[0], Make_syn[0], Range_syn[0], Year_syn[0]]
    return synth_df

syn_data = dp_synthetic_data_overlapping_marginals(2000, 1.0)

       County       Make  Electric Range  Model Year
0        King  CHEVROLET              53        2017
1   Snohomish      TESLA             208        2014
2   Snohomish      TESLA               0        2023
3        King      TESLA             330        2020
4      Pierce      TESLA               0        2023
..        ...        ...             ...         ...
95  Snohomish      TESLA               0        2022
96       King    PORSCHE              14        2017
97  Snohomish      TESLA             208        2015
98    Spokane        KIA               0        2022


## 2. Evaluation of percent errors of synthetic data vs. original data

### pct_error_1Dmarginal

In [71]:
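def pct_error_1Dmarginal(marginal):
    # NOTE: the `marginal` argument is not used in the body. The column to
    # evaluate is read from the global variable `column`, which must be set
    # before this function is called.

    def gen_samples(cols, n, epsilon):
        # Same sampling procedure as dp_synthetic_data_1way
        df_data = {}
        for col in cols:
            df_data[col] = []
            data = evs[col].value_counts().index.to_list()
            results = [x for x in data]
            marginal = list(dp_marginal(col, epsilon).values())
            synthetic = np.random.choice(results, size=n, p=marginal)
            
            for x in synthetic:
                df_data[col].append(x)
                
        dp_df = pd.DataFrame.from_dict(df_data)

        return dp_df
    
    # Value counts of a synthetic column with as many rows as evs
    syn_data = gen_samples([column], len(evs), 1.0).value_counts()
    # Value counts of the same column in the original data
    ev_makes = evs[column].value_counts()

    # Both lists are sorted by descending count, so entries are paired by rank.
    # pct_error_vec divides by its first argument (here the synthetic counts).
    errors = pct_error_vec(list(syn_data), list(ev_makes))

    errors_sum = sum(errors)

    return((errors_sum / len(errors)))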


column = 'Make'
marginal = dp_marginal(column, 1.0)
mean_error = pct_error_1Dmarginal(marginal)
print(f"Mean Error for Column '{column}': " + str(mean_error))


Mean Error for Column 'Make': 8.884915813085707


In the above cell we determine the mean percent error for a single run of generating one column of synthetic data. This figure is useful but volatile, so below we compute the mean percent error over 50 iterations.

In [72]:
def mean_pct_error_1Dmarginal(column):
    errors = []

    for x in range(50):
        marginal = dp_marginal(column, 1.0)
        error = pct_error_1Dmarginal(marginal)
        errors.append(error)

    mean_error = sum(errors) / len(errors)

    return (mean_error)

mean_pct_error_1Dmarginal('Make')

7.733500137423709

Averaging over 50 iterations gives a mean percent error of roughly 7-8% relative to the original dataset. While this number is not especially high, it says nothing about the errors across multiple columns. In the next cells we determine the percent error over multiple columns, starting with the 4-way marginal from above.

### pct_error_4Dmarginal
The function below reports a percent error for each of the four columns of a synthetic dataset generated from a 4-way marginal, comparing each column's value counts against those of the original dataset.

In [73]:
def pct_error_4Dmarginal(col1, col2, col3, col4):
    fourWay = dp_four_marginal(col1, col2, col3, col4, 1.0)
    # NOTE: pct_error_1Dmarginal ignores its argument and evaluates the global
    # `column`, so each call below measures that column rather than col1..col4.
    error_col1 = pct_error_1Dmarginal(fourWay[col1])
    error_col2 = pct_error_1Dmarginal(fourWay[col2])
    error_col3 = pct_error_1Dmarginal(fourWay[col3])
    error_col4 = pct_error_1Dmarginal(fourWay[col4])

    return error_col1, error_col2, error_col3, error_col4
    
pct_error_4Dmarginal('County', 'Make', 'Electric Range', 'Model Year')

(9.294698825579147, 8.531031222384305, 8.140073828065582, 8.978874845300655)

To obtain a more stable estimate, we compute the percent error of each column 50 times and then take the mean for each column.

In [74]:
def mean_pct_error_4Dmarginal(col1, col2, col3, col4):
    errors1 = []
    errors2 = []
    errors3 = []
    errors4 = []

    for x in range(50):
        error1, error2, error3, error4 = pct_error_4Dmarginal(col1, col2, col3, col4)
        errors1.append(error1)
        errors2.append(error2)
        errors3.append(error3)
        errors4.append(error4)

    mean_error1 = sum(errors1) / len(errors1)
    mean_error2 = sum(errors2) / len(errors2)
    mean_error3 = sum(errors3) / len(errors3)
    mean_error4 = sum(errors4) / len(errors4)
    
    return (mean_error1, mean_error2, mean_error3, mean_error4)

mean_pct_error_4Dmarginal('County', 'Make', 'Electric Range', 'Model Year')

(6.90853366440221, 7.378668659455775, 8.135875099474562, 7.5798876411004015)
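The percent-error functions above pair up value counts by rank (the i-th most common value in one dataset against the i-th most common in the other) rather than by label. As an alternative, here is a hedged sketch of a label-aligned comparison. The function name `pct_error_by_label` and the toy columns are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def pct_error_by_label(orig_counts, syn_counts):
    """Percent error per label, relative to the original count."""
    # Align synthetic counts to the original labels; a label that never
    # appears in the synthetic data is treated as a count of 0.
    aligned = syn_counts.reindex(orig_counts.index, fill_value=0)
    return (orig_counts - aligned).abs() / orig_counts * 100.0

# Hypothetical toy columns standing in for an original and a synthetic 'Make'.
orig = pd.Series(["TESLA"] * 60 + ["NISSAN"] * 30 + ["KIA"] * 10)
syn = pd.Series(["TESLA"] * 54 + ["NISSAN"] * 36 + ["FORD"] * 10)

errors = pct_error_by_label(orig.value_counts(), syn.value_counts())
print(errors)         # one percent error per original label
print(errors.mean())  # mean percent error across labels
```

In this sketch, labels that appear only in the synthetic data (`FORD` here) are ignored, since there is no original count to divide by. Whether that is the right choice depends on what the error is meant to capture.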

### pct_error_overlapping_marginals

In [75]:
def pct_error_overlapping_marginals(col1, col2, col3, col4):
    marginal = dp_synthetic_data_overlapping_marginals(2000,1)
    error_col1 = pct_error_1Dmarginal(marginal[col1])
    error_col2 = pct_error_1Dmarginal(marginal[col2])
    error_col3 = pct_error_1Dmarginal(marginal[col3])
    error_col4 = pct_error_1Dmarginal(marginal[col4])

    return error_col1, error_col2, error_col3, error_col4
    
pct_error_overlapping_marginals('County', 'Make', 'Electric Range', 'Model Year')

(4.802018009994064, 9.055459966583419, 6.468841529642848, 8.477605204435296)

It is possible to evaluate this multiple times and take the average; however, because the synthetic dataset is rebuilt row by row on every run, this takes a significant amount of time (10-15 minutes). For those interested, we provide the code that averages the error of overlapping marginals over multiple runs.

In [None]:
def mean_pct_error_overlapping_marginal(col1, col2, col3, col4):
    errors1 = []
    errors2 = []
    errors3 = []
    errors4 = []

    for x in range(10):
        error1, error2, error3, error4 = pct_error_overlapping_marginals(col1, col2, col3, col4)
        errors1.append(error1)
        errors2.append(error2)
        errors3.append(error3)
        errors4.append(error4)

    mean_error1 = sum(errors1) / len(errors1)
    mean_error2 = sum(errors2) / len(errors2)
    mean_error3 = sum(errors3) / len(errors3)
    mean_error4 = sum(errors4) / len(errors4)
    
    return (mean_error1, mean_error2, mean_error3, mean_error4)

mean_pct_error_overlapping_marginal('County', 'Make', 'Electric Range', 'Model Year')

(9.196959193888604, 7.136894440226206, 8.494631905781915, 6.384757144998363)
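Much of the runtime noted above comes from rebuilding a noisy marginal for every generated row. One possible speed-up, sketched here on hypothetical toy data, is to compute each noisy conditional marginal once and reuse it. The names `noisy_marginal` and `CachedConditional` are illustrative assumptions, not part of the notebook. Note that caching also changes how often noise is drawn, so the privacy accounting would need to be revisited.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def noisy_marginal(series, epsilon):
    """Noisy, clipped, normalized 1-way marginal of a pandas Series."""
    counts = series.value_counts()
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    noisy = noisy.clip(lower=0)
    return noisy / noisy.sum()

class CachedConditional:
    """Caches the noisy marginal of `target` for each value of `given`."""

    def __init__(self, df, given, target, epsilon):
        self.df = df
        self.given = given
        self.target = target
        self.epsilon = epsilon
        self.cache = {}

    def sample(self, given_value):
        # Build the noisy conditional marginal only on first use.
        if given_value not in self.cache:
            mask = self.df[self.given] == given_value
            subset = self.df.loc[mask, self.target]
            self.cache[given_value] = noisy_marginal(subset, self.epsilon)
        marginal = self.cache[given_value]
        return rng.choice(marginal.index.to_numpy(), p=marginal.to_numpy())

# Hypothetical toy data: King has TESLA and KIA, Pierce has TESLA and NISSAN.
toy = pd.DataFrame({
    "County": ["King"] * 600 + ["Pierce"] * 400,
    "Make": ["TESLA"] * 400 + ["KIA"] * 200 + ["TESLA"] * 100 + ["NISSAN"] * 300,
})

make_given_county = CachedConditional(toy, "County", "Make", epsilon=1.0)
makes = [make_given_county.sample("King") for _ in range(1000)]

# Only one marginal was built, even though 1000 samples were drawn.
print(len(make_given_county.cache))
```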

In [None]:
""" COUNT QUERIES """
def count_query(df, col, equals):
    return len(df[df[col]==equals])


# Compare against the first len(syn_data) rows of the original data, kept in
# a separate variable so that evs itself is not overwritten
evs_sub = evs[0:len(syn_data)]
syn_count = count_query(syn_data, 'Make', 'TESLA')
lp_count = laplace_mech(count_query(evs_sub, 'Make', 'TESLA'), 1, 1.0)
count = count_query(evs_sub, 'Make', 'TESLA')


# pct_error(orig, priv) divides by its first argument, so the true value goes first
syn_count_err = pct_error(count, syn_count)
lp_count_err = pct_error(count, lp_count)


""" MEAN QUERIES """


def mean_query(df, col):
    df_sum = df[col].sum()
    df_len = len(df)
    return df_sum / df_len


def dp_mean_query(df, col, epsilon):
    # Clip to [0, 400] so that the sum really has sensitivity 400; half of
    # epsilon goes to the noisy sum and half to the noisy count
    df_sum = laplace_mech(df[col].clip(lower=0, upper=400).sum(), 400, epsilon / 2)
    df_len = laplace_mech(len(df), 1, epsilon / 2)
    return df_sum / df_len


syn_mean = mean_query(syn_data, 'Electric Range')
mech_mean = dp_mean_query(evs_sub, 'Electric Range', 1.0)
ev_mean = mean_query(evs_sub, 'Electric Range')


syn_mean_err = pct_error(ev_mean, syn_mean)
mech_mean_err = pct_error(ev_mean, mech_mean)


print("SYNTHETIC")
print(syn_count_err)
print(syn_mean_err)

print("LAPLACE MECHANISM")
print(lp_count_err)
print(mech_mean_err)