# Differential Privacy Jupyter Lab Lesson 1
Welcome to the Differential Privacy Jupyter Lab Lesson #1. 

In this lab, we'll see how the Laplace and Geometric mechanisms can be used in private data analysis.

In [1]:
import numpy
import numpy.random

# TODO: The differential privacy Laplace mechanism uses the Laplace distribution.
# The original paper used the Laplace mechanism because it made the math easier.
# Graph the Gaussian and the Laplace distributions.

# TODO: Redo this so that we just have a single run
# Then build it to multiple runs.


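def dp_laplace(*, private_x, sensitivity, epsilon):
    """Return private_x with Laplace noise of scale sensitivity/epsilon added.

    private_x may be a scalar or a sequence; each element gets its own noise draw.
    """
    return numpy.random.laplace(private_x, sensitivity / epsilon)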
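
The scale passed to numpy.random.laplace is what ties the noise to the privacy parameters. As a rough sketch (the helper name `laplace_scale` and the seeded generator are assumptions made for illustration, not part of the lab), the scale is sensitivity/epsilon, and the standard deviation of the resulting noise is sqrt(2) times that scale:

```python
import math
import numpy as np

def laplace_scale(sensitivity, epsilon):
    """Hypothetical helper: scale b of the Laplace noise for a given sensitivity."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
b = laplace_scale(sensitivity=1.0, epsilon=2.0)

# Draw many noise values and compare their spread with the theoretical value.
noise = rng.laplace(loc=0.0, scale=b, size=200_000)
theoretical_std = math.sqrt(2) * b
empirical_std = float(noise.std())
print(f"scale b = {b}")
print(f"theoretical std = {theoretical_std:.4f}, empirical std = {empirical_std:.4f}")
```

A smaller epsilon gives a larger scale, and therefore more noise and stronger protection.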


In [2]:
"""
Let's assume a hypothetical survey to which 100 people respond.
We want to protect the number of respondents with differential privacy.
The sensitivity is 1 because adding or removing one person changes that
number by 1.  Here is such a computation, with an epsilon of 2.0:
"""
dp_laplace(private_x=100, sensitivity=1, epsilon=2.0)

99.78990855774366

In [3]:
"""
Now we will run this experiment 10 times to show the range of the protected values:
"""
runs=10
for i in range(runs):
    display(dp_laplace(private_x=100, sensitivity=1, epsilon=2.0))




100.01284546494115

99.694837662643

102.22375374526727

100.04583801374753

99.71922741912032

100.07810186038466

99.77171050015554

100.59777189754632

99.88483535889821

99.60947660891654

In [4]:
"""
Because our dp_laplace mechanism is built with numpy.random.laplace,
we can repeat that experiment with a single operation.

REMEMBER -- this is just for demonstration purposes. If we were *actually*
using differential privacy, we would just run it once.

We can get integer counts by rounding afterwards, or by using a different
mechanism (the geometric mechanism gives integers).
"""
private_data = [100,100,100,100,100,100,100,100,100,100]
display( dp_laplace(private_x = private_data, sensitivity=1.0, epsilon=2.0) )

array([100.35729877, 100.55875696,  99.67556729, 100.57051232,
       100.65316795, 101.46853687, 100.51663244, 101.01694994,
        99.95664489, 100.5017077 ])
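
The geometric mechanism mentioned above can be sketched as follows. This is a minimal illustration, not the lab's implementation: the function name `dp_geometric`, its `rng` parameter, and the seeded generator are all assumptions. It adds the difference of two independent geometric draws, which is a two-sided geometric variable with P(noise = k) proportional to exp(-epsilon * |k| / sensitivity), so integer inputs stay integers.

```python
import math
import numpy as np

def dp_geometric(*, private_x, sensitivity, epsilon, rng):
    """Hypothetical helper: add two-sided geometric noise to integer counts."""
    alpha = math.exp(-epsilon / sensitivity)
    p = 1.0 - alpha  # success probability of each one-sided geometric draw
    counts = np.asarray(private_x)
    # The difference of two i.i.d. geometric draws is symmetric around 0 and
    # takes the value k with probability proportional to alpha ** abs(k).
    noise = rng.geometric(p, size=counts.shape) - rng.geometric(p, size=counts.shape)
    return counts + noise

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
protected = dp_geometric(private_x=[100] * 10, sensitivity=1, epsilon=2.0, rng=rng)
print(protected)  # ten integers scattered around 100
```

Unlike rounding a Laplace draw afterwards, this approach never produces a non-integer intermediate value.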

In [5]:
"""
Here we introduce a nifty tool for displaying tables that's part of the ctools package.
We will re-run the experiment and display the results as a table.

TODO: redo this with pandas
"""
from ctools.tydoc import jupyter_display_table
private_data = [100] * 10
public_data  = dp_laplace(private_x = private_data, sensitivity=1.0, epsilon=2.0)
jupyter_display_table({'epsilon 2.0':public_data}, float_format='{:.4f}')

epsilon 2.0
100.4974
99.9354
100.3602
100.043
100.1967
99.8954
100.4098
99.2838
100.3214
99.8367


In [6]:
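"""Releasing the 10 draws above, each with an epsilon of 2.0, spends the same total
privacy budget (by sequential composition) as a single draw with an epsilon of 20.
Let's compare the average of the 10 draws with one such draw; both look pretty
accurate, although the average is the noisier of the two."""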
import statistics

display("Average of the {} epsilon 2.0 runs: {}".
        format(len(public_data), statistics.mean(public_data)))
display("Private query with a single epsilon 20.0 run: {}".format(dp_laplace(private_x = 100.0, sensitivity=1.0, epsilon=20.0)))

'Average of the 10 epsilon 2.0 runs: 100.07796734339607'

'Private query with a single epsilon 20.0 run: 99.93285145707846'
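
To see how the two options compare beyond a single sample, the following sketch repeats both many times and measures the spread. It is an illustrative simulation under assumed settings (the seed and number of repeats are arbitrary). For Laplace noise of scale b the variance is 2b², so averaging 10 draws at epsilon 2.0 gives a variance of 2(0.5)²/10 = 0.05, whereas one draw at epsilon 20 gives 2(0.05)² = 0.005.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable
repeats = 100_000
true_value = 100.0

# Option 1: average ten independent epsilon=2.0 releases (scale 1/2 each).
avg_of_ten = rng.laplace(true_value, 1.0 / 2.0, size=(repeats, 10)).mean(axis=1)

# Option 2: a single epsilon=20.0 release (scale 1/20).
single_draw = rng.laplace(true_value, 1.0 / 20.0, size=repeats)

var_avg = float(avg_of_ten.var())
var_single = float(single_draw.var())
print(f"variance of the 10-draw average:   {var_avg:.4f}")
print(f"variance of the single eps=20 draw: {var_single:.4f}")
```

Under these assumptions the averaged release has about ten times the variance of the single draw, even though both spend a total epsilon of 20 under sequential composition.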

In [7]:
"""Here we observe the impact of epsilon by comparing the noise added to a count of 100
for epsilon values of 0.01, 0.1, 1.0, and 2.0."""

def run_experiment(epsilon):
    private_data = [100] * 10
    return {f"epsilon {epsilon}": 
            dp_laplace(private_x = private_data, sensitivity=1.0, epsilon=epsilon)}

trials = {"Trial":[f"trial #{i}" for i in range(1,11)]}

jupyter_display_table( {**trials, 
                        **run_experiment(0.01),
                        **run_experiment(0.1), 
                        **run_experiment(1.0), 
                        **run_experiment(2.0)} )

Trial,epsilon 0.01,epsilon 0.1,epsilon 1.0,epsilon 2.0
trial #1,162.32371,52.24147,100.45223,100.37677
trial #2,-71.06996,126.70577,99.54509,99.65678
trial #3,98.30786,105.64513,99.91781,99.59219
trial #4,50.34816,104.04263,100.20413,100.02424
trial #5,104.23652,103.62971,100.88855,99.33968
trial #6,157.81133,90.76929,99.48144,99.19592
trial #7,260.45953,113.55517,99.79914,100.97379
trial #8,33.38402,98.06752,100.16099,99.86632
trial #9,-18.07944,92.67601,100.78996,101.503
trial #10,123.01359,107.32679,99.72126,99.73756
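
The spread in each column follows directly from the noise scale. The sketch below is illustrative only (`laplace_error_bound` is a hypothetical helper, not part of the lab). It uses two standard facts about Laplace noise with scale b = sensitivity/epsilon: the expected absolute error is b, and P(|noise| > t) = exp(-t/b), so 95% of draws fall within b * ln(20) of the true value.

```python
import math

def laplace_error_bound(sensitivity, epsilon, confidence=0.95):
    """Hypothetical helper: half-width t with P(|noise| <= t) = confidence."""
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / (1.0 - confidence))

for epsilon in (0.01, 0.1, 1.0, 2.0):
    expected_abs_error = 1.0 / epsilon  # sensitivity is 1 in the table above
    bound = laplace_error_bound(1.0, epsilon)
    print(f"epsilon {epsilon}: expected |error| = {expected_abs_error:.2f}, "
          f"95% of draws within +/- {bound:.2f}")
```

For epsilon 0.01 the 95% bound is roughly 300, which is consistent with the wide swings in the first column of the table.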


In [8]:
"""
Instead of protecting 10 independent trials, the approach that we take above
could be used to protect 10 independent measurements of a single population.
Let's protect the population numbers for the District of Columbia from the 2010 Census.

Here we round the counts. That's post-processing, so it's totally okay to do.

We'll be using an epsilon of 0.01 so that we can see some differences.
"""

categories = ["Under 5 years"] + [f"{age} to {age+4} years" for age in range(5,90,5)]+ ["90 years and over"]
true_counts = [32613, 26147, 25041, 39919, 64110, 69649, 55096, 42925, 37734, 
                38539, 37164, 34274, 29703, 21488, 15481, 11820, 9705, 6496, 3819]
protected_counts = [int(round(x)) for x in   # round to nearest; int(x) alone would truncate
                        dp_laplace(private_x = true_counts, sensitivity=1.0, epsilon=0.01 )]
jupyter_display_table( {"Age":categories, 
                        "True Counts":true_counts, 
                        "Protected Counts":protected_counts} )


Age,True Counts,Protected Counts
Under 5 years,32613,32963
5 to 9 years,26147,25978
10 to 14 years,25041,24973
15 to 19 years,39919,40075
20 to 24 years,64110,64036
25 to 29 years,69649,69453
30 to 34 years,55096,55117
35 to 39 years,42925,43001
40 to 44 years,37734,37505
45 to 49 years,38539,38520


In [9]:
"""By comparing the differences between the counts above and the true counts,
we can see the overall impact of differential privacy for epsilon=0.01.
NOTE --- comparing the protected counts to the true counts is something you cannot
typically do in differential privacy, because that's making a comparison across the noise barrier.
But it's useful for learning DP and understanding how mechanisms work.
"""
diff_counts = [p-t for (p,t) in zip(protected_counts,true_counts)]
jupyter_display_table( {"Age":categories + ['total'], 
                        "True Counts":true_counts + [sum(true_counts)], 
                        "Protected Counts":protected_counts + [sum(protected_counts)],
                        "Difference":diff_counts + [sum(diff_counts)]
                       } )



Age,True Counts,Protected Counts,Difference
Under 5 years,32613,32963,350
5 to 9 years,26147,25978,-169
10 to 14 years,25041,24973,-68
15 to 19 years,39919,40075,156
20 to 24 years,64110,64036,-74
25 to 29 years,69649,69453,-196
30 to 34 years,55096,55117,21
35 to 39 years,42925,43001,76
40 to 44 years,37734,37505,-229
45 to 49 years,38539,38520,-19
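
The size of these differences can be predicted from the mechanism. A rough sketch, assuming 19 independently noised age groups as in the cell above (the seed and trial count are arbitrary choices): each Laplace draw has an expected absolute error of sensitivity/epsilon = 100, and the error in the total has a standard deviation of sqrt(2 * 19) * 100, about 616. The simulation ignores the integer conversion applied to the protected counts.

```python
import math
import numpy as np

num_categories = 19          # number of age groups in the table
epsilon = 0.01
sensitivity = 1.0
scale = sensitivity / epsilon

rng = np.random.default_rng(2)  # seeded only so the sketch is repeatable
trials = 20_000

# Each row is one simulated release: one noise value per age group.
noise = rng.laplace(0.0, scale, size=(trials, num_categories))

per_cell_mae = float(np.abs(noise).mean())          # typical per-row difference
total_error_std = float(noise.sum(axis=1).std())    # spread of the 'total' difference
predicted_total_std = math.sqrt(2 * num_categories) * scale

print(f"mean |difference| per age group: {per_cell_mae:.1f}")
print(f"std of the difference in the total: {total_error_std:.1f} "
      f"(predicted {predicted_total_std:.1f})")
```

Because the per-group errors are independent and centered on zero, they partly cancel in the total, so its error grows with the square root of the number of groups rather than linearly.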
