# Example for Differential Privacy
This notebook illustrates the use of differential privacy in a (hypothetical) survey among students to find out whether they have ever cheated in an exam.
Inspired by https://towardsdatascience.com/a-differential-privacy-example-for-beginners-ef3c23f69401

In [None]:
import matplotlib.pyplot as plt  # plotting
import numpy as np
from scipy.stats import norm  # used to fit a normal distribution to the estimates

np.random.seed(42)  # fix the seed so the simulated data is reproducible

## Generate "True" Responses
We simulate 200 students and randomly choose for each of them whether they have ever cheated:

In [None]:
nStudents = 200
# a student counts as a cheater (1) if a standard normal draw exceeds 0.5, otherwise not (0)
rawData = [int(x) for x in np.random.randn(nStudents) > 0.5]

We also define an auxiliary function to generate a bar plot with labels:

In [None]:
def make_barplot(plotData, dataDesc=''):
    labels = list(plotData.keys())
    values = list(plotData.values())

    plt.figure()
    graph = plt.bar(labels, values)
    # write each bar's height as a white label just below the top of the bar
    for bar in graph:
        plt.annotate(bar.get_height(), xy=(bar.get_x()+0.33, bar.get_height()-10), fontsize=10, color="white")

    # prefix the title with the data description, if one was given
    title_str_root = '' if dataDesc=='' else dataDesc + ': '
    title_str = title_str_root + "Number of students who reported 'cheating' vs 'never cheated'"
    plt.title(title_str)
    plt.show()

Now we use this function to plot how many cheaters and non-cheaters the sensitive data contains:

In [None]:
# count 1 ("cheated") and 0 ("never cheated")
plotData = {'cheated': rawData.count(1), 'never cheated': rawData.count(0)}

# make plot
make_barplot(plotData, "No DP")

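Given this hypothetical (synthetic) dataset, roughly 30% of all students have cheated at least once in an exam (a standard normal draw exceeds the threshold 0.5 with probability of about 0.31; the exact rate depends on the seed):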

In [None]:
true_cheat_rate = sum(rawData)/len(rawData)
true_cheat_rate

## The Impact of an Additional Student
Let's assume we add one more student to our dataset, and that the newly added student did actually cheat:

In [None]:
rawData.append(1)  # indicates Student n+1 did actually cheat

Let's now look at the overall distribution of cheaters in the dataset including the additional student:

In [None]:
# count 1 ("cheated") and 0 ("never cheated")
plotData = {'cheated': rawData.count(1), 'never cheated': rawData.count(0)}

# make plot
make_barplot(plotData, "No DP")

Unsurprisingly, the number of cheaters has increased by exactly one. Anyone who compares the exact counts before and after adding the student therefore knows for sure that the additional student (the last one we added) did actually cheat.
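
The inference described here can be reproduced in a few lines. The following is a minimal, self-contained sketch: the names `answers_before`, `answers_after` and `secret_answer` are made up for this illustration, and the data is a fresh synthetic sample rather than the notebook's `rawData`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Hypothetical stand-in for the survey data: 1 = cheated, 0 = never cheated
answers_before = [int(x) for x in rng.standard_normal(200) > 0.5]
secret_answer = 1  # the new student's true answer, unknown to the observer
answers_after = answers_before + [secret_answer]

# The observer only sees the two exact counts that were published ...
count_before = sum(answers_before)
count_after = sum(answers_after)

# ... yet their difference is exactly the new student's answer
inferred_answer = count_after - count_before
print("Inferred answer of the new student:", inferred_answer)
```

Because the counts are exact, the difference between them always equals the new student's answer, whatever that answer is.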

## Defining Differential Privacy
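Now, we define the measurement function as described in the slides. This is a randomized function: each student reports their true answer with probability `p_tell_truth`, and otherwise gives a random answer that is "cheated" with probability `p_cheat_random`. The function returns the resulting counts of self-declared cheaters and non-cheaters: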

In [None]:
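p_tell_truth = 0.5    # probability that a student reports their true answer
p_cheat_random = 0.5  # probability that a random answer is "cheated"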

In [None]:
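def get_counts_dp(rawData):
    """Return randomized counts of reported cheaters and non-cheaters."""
    dpData = []
    for each in rawData:
        # flip == 1 means "answer randomly"; this happens with probability 1 - p_tell_truth
        flip = int(np.random.uniform(0, 1)>p_tell_truth)
        if flip == 1:
            # random answer: report "cheated" with probability p_cheat_random
            rand_cheat = int(np.random.uniform(0, 1)<p_cheat_random)
            dpData.append(rand_cheat)
        else:
            # truthful answer
            dpData.append(each)
    
    dpCountsData = {'cheated': dpData.count(1), 'never cheated': dpData.count(0)}

    return dpCountsData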
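
The same mechanism can also be written without an explicit loop. The following is a sketch of a vectorized variant; the function name `get_counts_dp_vectorized` and its `rng` parameter are assumptions made for this illustration and are not part of the original notebook. It draws its random numbers in a different order than the loop above, so it will not reproduce the same counts for the same seed.

```python
import numpy as np

def get_counts_dp_vectorized(raw_data, p_tell_truth=0.5, p_cheat_random=0.5, rng=None):
    """Randomized response applied to all answers at once (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    raw = np.asarray(raw_data, dtype=int)

    # True where the student reports their true answer
    tells_truth = rng.random(raw.size) < p_tell_truth
    # Random answers, used only where tells_truth is False
    random_answer = (rng.random(raw.size) < p_cheat_random).astype(int)

    reported = np.where(tells_truth, raw, random_answer)
    n_cheated = int(reported.sum())
    return {'cheated': n_cheated, 'never cheated': int(raw.size - n_cheated)}

# Usage on a small made-up dataset (120 cheaters, 80 non-cheaters)
example_data = [1, 0, 1, 1, 0] * 40
example_counts = get_counts_dp_vectorized(example_data, rng=np.random.default_rng(1))
print(example_counts)
```

Passing a seeded generator via `rng` keeps the result reproducible without touching NumPy's global random state.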

Let's look at the counts we get when running the stochastic counting function a first time:

In [None]:
# randomized counts of reported 1 ("cheated") and 0 ("never cheated")
dpCountsData = get_counts_dp(rawData)

# make plot
make_barplot(dpCountsData, "With DP")

Let's run the same counting function again:

In [None]:
# randomized counts of reported 1 ("cheated") and 0 ("never cheated")
dpCountsData = get_counts_dp(rawData)

# make plot
make_barplot(dpCountsData, "With DP")

The differences we see between the two runs are typically larger than 1, so the contribution of any single student can no longer be read off the totals, and every student can plausibly deny that their reported answer is their true one. This is exactly the idea of differential privacy.
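
How strong this plausible deniability is can be quantified. The notebook itself does not state a privacy parameter, so the following is only a sketch under the standard definition of ε-differential privacy for a single randomized answer: ε is the largest absolute log-ratio between the probabilities of any reported answer given the two possible true answers. The variable names are chosen for this illustration.

```python
import math

p_tell_truth = 0.5
p_cheat_random = 0.5

# Probability of reporting "cheated" for each possible true answer
p_yes_given_cheated = p_tell_truth + (1 - p_tell_truth) * p_cheat_random
p_yes_given_honest = (1 - p_tell_truth) * p_cheat_random

# Probability of reporting "never cheated" for each possible true answer
p_no_given_cheated = 1 - p_yes_given_cheated
p_no_given_honest = 1 - p_yes_given_honest

# epsilon: worst-case log-ratio over both possible reported answers
epsilon = max(
    abs(math.log(p_yes_given_cheated / p_yes_given_honest)),
    abs(math.log(p_no_given_cheated / p_no_given_honest)),
)
print("P(report 'cheated' | cheated)      =", p_yes_given_cheated)
print("P(report 'cheated' | never cheated) =", p_yes_given_honest)
print("epsilon =", epsilon)
```

With both probabilities set to 0.5, a reported "cheated" is three times as likely to come from a cheater as from a non-cheater, so ε equals ln 3.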

## Understanding the DP Estimator
Next, we want to look at how the estimate we get using differential privacy is distributed. To do so, we perform 500 runs of the stochastic counting function and, for each run, compute the corresponding cheater rate according to the formula derived on the slides:

In [None]:
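np.random.seed(123)
cheaterRate = []
for trial in range(500):
    dpCountsData = get_counts_dp(rawData)
    # fraction of students who reported "cheated" (including random answers)
    dpCheatRate = dpCountsData['cheated'] / len(rawData)
    # correct for the random answers to estimate the true cheat rate
    estTrueCheatRate = (dpCheatRate-p_cheat_random)/p_tell_truth + p_cheat_random
    cheaterRate.append(estTrueCheatRate)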
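
The slides are not part of this notebook, so here is a sketch of how the correction in `estTrueCheatRate` can be derived from the code of `get_counts_dp`. The symbols are introduced only for this derivation: $\pi$ is the true cheat rate, $r$ the observed rate of "cheated" reports (`dpCheatRate`), $p_t$ stands for `p_tell_truth` and $p_c$ for `p_cheat_random`. A student reports "cheated" either truthfully or as a random answer, which gives the expected observed rate; solving for $\pi$ and replacing the expectation by the observed $r$ yields the estimator $\hat{\pi}$:

```latex
\begin{align*}
\mathbb{E}[r] &= p_t \, \pi + (1 - p_t) \, p_c \\
\hat{\pi} &= \frac{r - (1 - p_t) \, p_c}{p_t} = \frac{r - p_c}{p_t} + p_c
\end{align*}
```

The last expression is the form used in the code above.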

Next, we plot a histogram of the estimated cheater rates as obtained in the 500 trials:

In [None]:
plt.hist(cheaterRate, bins=[ 0.01*x for x in range(0, 51) ], label='Estimated Cheat Rate')
plt.vlines(true_cheat_rate, 0, 40, linestyles='dashed', color='r', label='True Cheat Rate')
plt.grid()
plt.legend()
plt.title('Distribution of Cheat Rate Estimators\nResults obtained using Differential Privacy')
plt.xlabel('Estimated Cheat Rate')
plt.ylabel('Frequency')
plt.show()

In [None]:
(mu, sigma) = norm.fit(cheaterRate)
print('Estimated Cheat Rate:')
print('- Mean: ', mu)
print('- St. Dev.:', sigma)

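So, with only about 200 respondents, a single estimate of the cheater rate still scatters noticeably around the true value. Running more trials does not shrink this spread; it only makes the distribution more clearly visible. This is the price we have to pay for the plausible deniability.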
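
The standard deviation fitted above can be cross-checked analytically. The following sketch assumes that the true answers are fixed and that the reports are independent, so the variance of the reported count is the sum of the individual Bernoulli variances. The helper `estimator_std` and the sample sizes in the loop are made up for this illustration.

```python
import math

p_tell_truth = 0.5
p_cheat_random = 0.5

def estimator_std(n_students, true_rate):
    """Approximate standard deviation of the estimated cheat rate (sketch)."""
    # Probability of reporting "cheated" for cheaters and for non-cheaters
    p_yes_cheated = p_tell_truth + (1 - p_tell_truth) * p_cheat_random
    p_yes_honest = (1 - p_tell_truth) * p_cheat_random

    # Average Bernoulli variance of a single report
    var_per_student = (true_rate * p_yes_cheated * (1 - p_yes_cheated)
                       + (1 - true_rate) * p_yes_honest * (1 - p_yes_honest))

    # Dividing the observed rate by p_tell_truth scales its standard deviation by 1/p_tell_truth
    return math.sqrt(var_per_student / n_students) / p_tell_truth

# Spread of the estimate for the notebook's sample size and two larger hypothetical surveys
for n in (201, 2_000, 20_000):
    print(n, "students -> std of estimate:", round(estimator_std(n, 0.3), 4))
```

For 201 students this gives a standard deviation of roughly 0.06, and the value falls with the square root of the number of respondents, so the spread is reduced by surveying more students rather than by repeating the randomization.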