In [1]:
from diffprivlib import tools
import pandas as pd
import random

data = pd.read_csv("FPM_toy_dataset_Slider.csv")
data['how_often_do_you_have_a_drink_of_alcohol'] = 9-data['how_often_do_you_have_a_drink_of_alcohol']


In [23]:
data.columns

Index(['Unnamed: 0', 'participant_id', 'past_month_illicit_drug_use',
       'past_year_marijuana_use', 'past_month_marijuana_use',
       'perceptions_of_great_risk_from_smoking_marijuana_once_a_month',
       'first_use_of_marijuana',
       'past_month_illicit_drug_use_other_than_marijuana',
       'past_year_cocaine_use',
       'perceptions_of_great_risk_from_trying_heroin_once_or_twice',
       'past_year_methamphetamine_use', 'past_year_misuse_of_pain_relievers',
       'past_month_alcohol_use', 'past_month_bing_alcohol_use',
       'perceptions_of_great_risk_from_having_five_or_more_drinks_once_or_twice_a_week',
       'past_month_alcohol_use.1', 'past_month_bing_alcohol',
       'past_month_tobacco_product_use', 'past_month_cigarette_use',
       'perceptions_of_great_risk_from_smoking_one_or_more_packs_per_day',
       'illicit_drug_use_disorder', 'pain_reliever_use_disorder',
       'alcohol_use_disorder', 'substance_use_disorder',
       'needing_but_not_retrieving_treatmen

In [2]:
data[(data['zip_code'] == 0) & (data['race'] == 'Black or African American') & (data['gender'] == 'female') & (data['survey_taken_date'] == 4) & (data['state'] == 1)]

Unnamed: 0.1,Unnamed: 0,participant_id,past_month_illicit_drug_use,past_year_marijuana_use,past_month_marijuana_use,perceptions_of_great_risk_from_smoking_marijuana_once_a_month,first_use_of_marijuana,past_month_illicit_drug_use_other_than_marijuana,past_year_cocaine_use,perceptions_of_great_risk_from_trying_heroin_once_or_twice,...,how_often_do_you_have_a_drink_of_alcohol,birth_year,race,gender,household_income,school_grade,state,zip_code,survey_taken_date,past_year_cocain_use
1561,1561,1562,0,0,0,2,0,0,1,2,...,9,2000,Black or African American,female,5,10,1,0,4,0


In [3]:
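# Neighboring dataset: every row except the individual selected above (exact negation of that filter).
d_without = data[(data['zip_code'] != 0) | (data['race'] != 'Black or African American') | (data['gender'] != 'female') | (data['survey_taken_date'] != 4) | (data['state'] != 1)]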

In [4]:
# Uses the IBM differential privacy library (diffprivlib) to compute a differentially private
# mean of a column of values for a given epsilon.
# The DP mean query is repeated n_repeats times to build a distribution of the outputted values.
def get_mean_distr(data_column, data_min, data_max, epsilon=1, n_repeats=1000):
    output = []
    for _ in range(n_repeats):
        # Values are clipped to these bounds, which also determine the noise scale.
        temp_mean = tools.mean(data_column, bounds=(data_min, data_max), epsilon=epsilon)
        output.append(temp_mean)
    return output

In [22]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import statistics

float_slider = widgets.FloatSlider(
    value=1,
    min=0.001,
    max=2.1,
    step=0.1,
    description='epsilon:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.3f',
)
output = widgets.Output()

display(float_slider, output)
def on_value_change(change):
    with output:
        output.clear_output()
        data_min = data['how_often_do_you_have_a_drink_of_alcohol'].min()
        data_max = data['how_often_do_you_have_a_drink_of_alcohol'].max()

        data1 = get_mean_distr(data['how_often_do_you_have_a_drink_of_alcohol'], data_min, data_max, epsilon= float_slider.value)
        data2 = get_mean_distr(d_without['how_often_do_you_have_a_drink_of_alcohol'], data_min, data_max, epsilon= float_slider.value)


        bound_1 = min(data1)
        bound_2 = max(data1)
        d1_df = pd.DataFrame(data1)
        d1_density = gaussian_kde(d1_df[0])
        d2_df = pd.DataFrame(data2)
        d2_density = gaussian_kde(d2_df[0])

        xs = np.linspace(bound_1,bound_2,200)
        plt.plot(xs,d1_density(xs), label = 'data w/ individual')
        plt.plot(xs, d2_density(xs), label = 'data w/o individual')
        plt.legend(loc='upper right')
        plt.title("Distribution of Query Outputs for neighboring datasets")
        plt.show()
        print(statistics.stdev(data1))
        
float_slider.observe(on_value_change, names='value')


In [19]:
data_min = data['how_often_do_you_have_a_drink_of_alcohol'].min()
data_max = data['how_often_do_you_have_a_drink_of_alcohol'].max()

data1 = get_mean_distr(data['how_often_do_you_have_a_drink_of_alcohol'], data_min, data_max, epsilon= 0.101)
data2 = get_mean_distr(d_without['how_often_do_you_have_a_drink_of_alcohol'], data_min, data_max, epsilon= 0.101)

import scipy.stats as stats

stats.ttest_ind(data1, data2)

Ttest_indResult(statistic=0.4408188410183241, pvalue=0.6593918435485799)

## The Epsilon Parameter

In our dataset, the variable `how_often_do_you_have_a_drink_of_alcohol` is a quantitative score representing how often an individual drinks alcohol. A score of 0 represents an individual who does not drink and a score of 9 represents an individual who drinks multiple times every day.

The slider cell above plots the distribution of differentially private mean-query results on `how_often_do_you_have_a_drink_of_alcohol` for two neighboring datasets. By neighboring datasets we mean that they differ in only one individual of interest, who was selected earlier. Moving the slider changes the epsilon parameter and shows the effect of choosing an epsilon that is too large or too small.

If the privacy implementation uses too large an epsilon, say `epsilon = 2`, the two distributions of query results are visibly different. An attacker can then infer not only that the individual of interest is in the dataset, but also that the individual's `how_often_do_you_have_a_drink_of_alcohol` score is higher than average.
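For reference, the standard definition of epsilon-differential privacy makes this precise. A mechanism $M$ (here, the noisy mean query) satisfies it if, for every pair of neighboring datasets $D$ and $D'$ and every set of possible outputs $S$,

$$\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S].$$

With a small epsilon, $e^{\varepsilon}$ is close to 1 and the two curves in the plot must nearly coincide. With a larger epsilon they are allowed to drift apart, which is what the slider shows.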

On the flip side, as we lower the value of epsilon the query results become less accurate. With an epsilon of 2 the query results have a standard deviation of around 0.0006, while with an epsilon of 0.1 they have a standard deviation of around 0.01. For this dataset that is still not much noise, but in general a lower epsilon means less accurate query results.

Thus there is always a trade-off between privacy and accuracy to keep in mind. How accurate do you need your query results to be? Is there a certain amount of privacy that you're willing to risk in return for more accurate results for the people using your system? It is important to think about these questions and to figure out what the goals of your data system are and what the potential consequences of various implementations of differential privacy might be.
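The standard deviations quoted above are roughly in line with noise that scales as 1/epsilon. The sketch below assumes the textbook Laplace mechanism for a bounded mean, where the sensitivity is `(upper - lower) / n` and the noise scale is `sensitivity / epsilon`. The helper `laplace_mean_std` and the row count `n_rows` are made up for illustration; `n_rows` is not the size of the survey dataset, and diffprivlib's exact mechanism may differ in detail.

```python
import math

def laplace_mean_std(lower, upper, n, epsilon):
    """Standard deviation of Laplace noise added to a mean of n values in [lower, upper]."""
    sensitivity = (upper - lower) / n   # largest change in the mean when one row changes
    scale = sensitivity / epsilon       # Laplace scale parameter b
    return math.sqrt(2) * scale         # the std of Laplace(0, b) is sqrt(2) * b

# Hypothetical row count, chosen only for illustration.
n_rows = 10_000
std_eps_2 = laplace_mean_std(0, 9, n_rows, epsilon=2)
std_eps_01 = laplace_mean_std(0, 9, n_rows, epsilon=0.1)
print(std_eps_01 / std_eps_2)  # the ratio is 2 / 0.1 = 20, up to floating-point rounding
```

Under these assumptions, cutting epsilon by a factor of 20 makes the noise 20 times wider regardless of the dataset size, while a larger dataset shrinks the noise at every epsilon.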

## Privacy Budget

In the previous example we saw that if an inappropriate epsilon is chosen, an attacker can use a large number of arbitrary queries to construct the noise distribution for any given query result. With a conservative enough epsilon, privacy can still be preserved to a point, but giving users unlimited query power weakens the privacy of a dataset, and compensating for it may require so much noise that the query results stop being useful.
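The risk of unlimited querying can be illustrated with a small simulation. This is a minimal sketch, not the notebook's own code: `noisy_answer`, the secret value, and the noise scale are all hypothetical, and Laplace noise is generated with the standard library as the difference of two exponential draws.

```python
import random

def noisy_answer(true_value, scale, rng):
    """Return true_value plus Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return true_value + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

rng = random.Random(0)   # fixed seed so the run is repeatable
true_mean = 4.2          # hypothetical secret value
scale = 0.5              # hypothetical noise scale for a single query

one_query = noisy_answer(true_mean, scale, rng)
many = [noisy_answer(true_mean, scale, rng) for _ in range(100_000)]
averaged = sum(many) / len(many)

print(abs(one_query - true_mean))  # error of a single noisy answer
print(abs(averaged - true_mean))   # error after averaging, typically far smaller
```

Averaging cancels the noise, which is why repeated queries have to be paid for: under basic sequential composition, k queries at a given epsilon cost k times that epsilon in total.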

To combat the privacy loss associated with repeated querying, we can implement a privacy budget. The idea is that you fix a total epsilon budget beforehand. This budget acts as a ceiling on how much querying can be done. Each time a query is made, its epsilon is added to a running epsilon counter. Users can perform a query only if adding its epsilon does not push the counter over the budget.

For example, if you were to set the IBM privacy accountant to have an epsilon budget of 2, users could only make queries whose epsilon values add up to 2 or less. They could ask for the mean of one sample using an epsilon of 2, or for the means of two different samples using an epsilon of 1 each, and so on. Once the privacy budget is exhausted, no user can ask any more questions.
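The bookkeeping behind this example can be sketched in a few lines. The `EpsilonBudget` class below is a hypothetical stand-in written for illustration; diffprivlib ships its own `BudgetAccountant` for this purpose, and the code here does not reproduce that API.

```python
class EpsilonBudget:
    """Hypothetical running epsilon counter with a fixed ceiling."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def can_spend(self, epsilon):
        # A query is allowed only if it keeps the counter at or below the ceiling.
        return self.spent + epsilon <= self.total_epsilon

    def spend(self, epsilon):
        if not self.can_spend(epsilon):
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon

budget = EpsilonBudget(total_epsilon=2)
budget.spend(1)   # first mean query at epsilon = 1
budget.spend(1)   # second mean query at epsilon = 1; the budget is now used up
print(budget.can_spend(0.1))  # False: nothing is left for further queries
```

A real accountant would sit between the user and the query function, so that no query can run without first passing the check.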

Cutting users off from all information once the privacy budget is spent may not be ideal. To alleviate this and let people querying the data still get some information, a cache can be implemented. The cache keeps track of previous query results and simply returns the stored value if the same query is run again. This slows down how quickly the privacy budget is spent, since redundant queries do not consume any more of it, and it means that users can still retrieve the results of previously made queries after the budget has run out.
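One possible shape for such a cache is sketched below. Everything here is an assumption made for illustration: `CachedQueryEngine` and `fake_dp_mean` are hypothetical names, and `fake_dp_mean` only imitates a noisy query with made-up numbers rather than calling diffprivlib.

```python
import random

class CachedQueryEngine:
    """Hypothetical query front end: a budget check plus a cache of past results."""

    def __init__(self, total_epsilon, run_dp_query):
        self.total_epsilon = total_epsilon
        self.run_dp_query = run_dp_query
        self.spent = 0.0
        self.cache = {}  # (query name, epsilon) -> previously released result

    def query(self, name, epsilon):
        key = (name, epsilon)
        if key in self.cache:
            return self.cache[key]  # repeated query: reuse the result, spend nothing
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon
        self.cache[key] = self.run_dp_query(name, epsilon)
        return self.cache[key]

rng = random.Random(0)

def fake_dp_mean(name, epsilon):
    # Stand-in for a real DP query; returns a made-up value with Laplace-style noise.
    scale = 0.01 / epsilon
    return 4.2 + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

engine = CachedQueryEngine(total_epsilon=2, run_dp_query=fake_dp_mean)
first = engine.query("mean_alcohol", epsilon=2)   # spends the whole budget
again = engine.query("mean_alcohol", epsilon=2)   # served from the cache
print(first == again, engine.spent)               # True 2.0
```

Returning a stored result costs no additional privacy because it reveals nothing beyond what was already released, whereas recomputing the query would draw fresh noise and spend more of the budget.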