In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import pandas as pd
import numpy as np

# Some useful utilities

def laplace_mech(v, sensitivity, epsilon):
    return v + np.random.laplace(loc=0, scale=sensitivity / epsilon)

def gaussian_mech(v, sensitivity, epsilon, delta):
    return v + np.random.normal(loc=0, scale=sensitivity * np.sqrt(2*np.log(1.25/delta)) / epsilon)

def pct_error(orig, priv):
    return np.abs(orig - priv)/orig * 100.0

def z_clip(xs, b):
    return [min(x, b) for x in xs]

def clip(xs, upper, lower):
    return [max(min(x, upper), lower) for x in xs]

def your_code_here():
    return 1

def test(msg, value, expected):
    if value == expected:
        print(f"{msg}: {value}, as expected")
    else:
        print(f"{msg}: OH NO! Got {value}, but expected {expected}.")

In [3]:
adult_data = pd.read_csv("adult_with_pii.csv", parse_dates=['DOB'])

## Question 1 (20 points)

Implement a function `above_10000` that releases the **value** of the first query in a sequence of queries whose value is above 10000. Your function should have a **total** privacy cost equal to the privacy parameter $\epsilon$ passed in when it is called.

**Note**: this function (and the others you'll define in this assignment) takes a list of *query results* rather than the queries themselves (as we saw in class). This simplification keeps your code a little shorter.
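For reference, here is a minimal sketch of the index-only threshold search applied to a list of query results. It assumes every query has sensitivity 1 and uses only the standard library, so the names `laplace_noise` and `above_threshold_index` and the counts are illustrative, not part of the assignment's utilities.

```python
import random

def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def above_threshold_index(query_results, threshold, epsilon, rng=random):
    """Return the index of the first result whose noisy value reaches the
    noisy threshold, or None if no result does.
    Assumption: each query has sensitivity 1."""
    t_hat = threshold + laplace_noise(2 / epsilon, rng)
    for idx, q in enumerate(query_results):
        if q + laplace_noise(4 / epsilon, rng) >= t_hat:
            return idx
    return None

counts = [120, 9500, 14976, 10683, 4443]  # made-up query results
print(above_threshold_index(counts, 10000, epsilon=1000.0))
```

Only the index is released here. Releasing the value at that index is a separate step with its own privacy cost.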

In [4]:
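def above_10000(query_results, epsilon):
    # Spend half the budget on the threshold search (AboveThreshold) and
    # half on releasing the value, so the total cost is epsilon.
    # Assumes each query has sensitivity 1 (true for counting queries).
    svt_epsilon = epsilon / 2
    t_hat = 10000 + np.random.laplace(loc=0, scale=2/svt_epsilon)

    for q in query_results:
        noise = np.random.laplace(loc=0, scale=4/svt_epsilon)
        if q + noise >= t_hat:
            # Release the value with fresh noise; the comparison noise
            # must not be reused for the released value
            return laplace_mech(q, 1, epsilon / 2)

    # No query exceeded the threshold
    return -1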

queries = adult_data['Martial Status'].value_counts()
print(f"above_10000 #1: {above_10000(queries, 100)}")
print(f"above_10000 #2: {above_10000(queries, 1)}")
print(f"above_10000 #3: {above_10000(queries, .01)}")

above_10000 #1: 14976.158697205135
above_10000 #2: 14969.10347649895
above_10000 #3: 14921.056929301492


## Question 2 (10 points)
In 2-3 sentences, argue informally (via the definition of the sparse vector technique, post-processing, and sequential composition), that your implementation of `above_10000` has a total privacy cost of $\epsilon$.

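- This algorithm has a total privacy cost of $\epsilon$. The sparse vector technique pays a fixed cost to find the index of the first query whose noisy value exceeds the noisy threshold, no matter how many queries it examines, because only that single index is revealed. Releasing the query's value is a second release that needs its own fresh Laplace noise, so by sequential composition the two costs add, and splitting the budget between the two steps keeps the total at $\epsilon/2 + \epsilon/2 = \epsilon$. Anything computed afterwards from the released value is free by post-processing.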

## Question 3 (20 points)

Implement a function `bounded_all_above_10000` that releases the **value** of **$c$ queries** in a sequence of queries whose value is above 10000 (where $c$ is an analyst-provided parameter limiting the number of returned results, as in the `Sparse` algorithm in Dwork & Roth). Your function should have a **total privacy cost** bounded by its parameter $\epsilon$.

In [5]:
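def bounded_all_above_10000(query_results, c, epsilon):
    results = []
    i = 0
    # Each of the at most c rounds gets epsilon/c: half to find the next
    # above-threshold index, half to release its value with fresh noise.
    # Assumes each query has sensitivity 1.
    epsilon_i = epsilon / c

    while i < len(query_results) and len(results) < c:
        # Restart the threshold search at position i with a fresh noisy threshold
        t_hat = 10000 + np.random.laplace(loc=0, scale=4/epsilon_i)
        found = None
        for idx in range(i, len(query_results)):
            noise = np.random.laplace(loc=0, scale=8/epsilon_i)
            if query_results[idx] + noise >= t_hat:
                found = idx
                break

        if found is None:
            return results

        results.append(laplace_mech(query_results[found], 1, epsilon_i / 2))
        # Resume after the query just released so it cannot be reported twice
        i = found + 1

    return results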

# Note: the official solution also returns the total budget used.
# This is always <= ε, and is often strictly less than ε.
queries = list(adult_data['Martial Status'].value_counts())
print(f"bounded_all_above_10000 #1: {bounded_all_above_10000(queries, 3, 100)}")
print(f"bounded_all_above_10000 #2: {bounded_all_above_10000(queries, 3, 1)}")
print(f"bounded_all_above_10000 #3: {bounded_all_above_10000(queries, 3, .01)}")

bounded_all_above_10000 #1: [14975.95470694663, 10683.108689918843]
bounded_all_above_10000 #2: [14978.959170525623, 10671.201982140936]
bounded_all_above_10000 #3: [16028.440147448458, 10784.091575607856]


## Question 4 (10 points)

In 2-3 sentences, argue informally that your implementation of `bounded_all_above_10000` has privacy cost bounded by $\epsilon$.

- Without a cap, the threshold search in `above_10000` could be restarted up to $n$ times (where `n = len(query_results)`), and by sequential composition $n$ runs at $\epsilon$ each would cost $n\epsilon$. Here each run gets only $\epsilon_i = \epsilon/c$, and the loop stops once $c$ values have been released or a run finds nothing, so at most $c$ runs happen. The total privacy cost is therefore at most $c \cdot \frac{\epsilon}{c} = \epsilon$.
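The accounting in this argument can be checked with a small sketch. `budget_spent` is a hypothetical helper, not part of the assignment; it assumes every round is charged exactly $\epsilon/c$ and that no more than $c$ rounds are allowed.

```python
def budget_spent(rounds_run, c, epsilon):
    """Privacy cost by sequential composition when each round costs epsilon / c."""
    if rounds_run > c:
        raise ValueError("more than c rounds would exceed the budget")
    return rounds_run * (epsilon / c)

epsilon, c = 1.0, 4
for rounds in range(c + 1):
    # The amount spent never exceeds epsilon and is smaller when fewer rounds run
    print(rounds, budget_spent(rounds, c, epsilon))
```

When fewer than $c$ rounds run, the amount spent is strictly below $\epsilon$, which is why returning the budget actually used can be informative.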

## Question 5 (30 points)

Implement a function `mean_age` that computes the mean age of participants in the `adult_data` dataset. Your function should have a **total** privacy cost of $\epsilon$. It should work as follows:

1. Compute an *upper* clipping parameter based on the data
2. Compute a *lower* clipping parameter based on the data
3. Clip the data using the lower and upper clipping parameters
4. Use `laplace_mech` to release a differentially private mean of the clipped data

*Hint*: Use the sparse vector technique to compute the clipping parameters. Consider using a sequence of queries that looks like `np.sum(clip(df, b, 0)) - np.sum(clip(df, b+1, 0))`.

*Hint*: Be careful of sensitivities and set the scale of the noise accordingly!
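To see why the hinted query is a good fit for the sparse vector technique, the sketch below checks on made-up integer ages (not `adult_data`) that adding one person changes the clipped-sum difference by at most 1. The helper name `clipped_sum_difference` is illustrative; `clip` mirrors the utility defined at the top of the notebook.

```python
def clip(xs, upper, lower):
    return [max(min(x, upper), lower) for x in xs]

def clipped_sum_difference(xs, b):
    # How much the clipped sum changes when the bound moves from b to b + 1
    return sum(clip(xs, b, 0)) - sum(clip(xs, b + 1, 0))

ages = [17, 25, 38, 38, 52, 90]  # made-up ages
neighbor = ages + [64]           # neighboring dataset with one extra person

for b in range(0, 100, 10):
    q = clipped_sum_difference(ages, b)
    q_neighbor = clipped_sum_difference(neighbor, b)
    # Each person contributes either 0 or -1, so the results differ by at most 1
    assert abs(q - q_neighbor) <= 1
    print(b, q, q_neighbor)
```

For integer data the query equals minus the number of values greater than `b`, so it rises to 0 once `b` reaches the largest value in the data.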

In [6]:
bs = list(range(0, 200, 10))
df = adult_data['Age']

def sens_one_query(b):
    # Return a lambda that computes a sensitivity-1 query: the difference between
    # the sums of adult_data['Age'] clipped at b and at b+1
    return lambda df: df.clip(lower=0, upper=b).sum() - df.clip(lower=0, upper=b+1).sum()

def above_0(queries, epsilon):
    # Define the noisy threshold (the true threshold is 0)
    t_hat = np.random.laplace(loc=0, scale = 2/epsilon)

    # Return the index of the first query whose noisy value reaches the noisy threshold.
    # At that index, raising the clipping bound no longer changes the sum noticeably.
    for idx, q in enumerate(queries):
        noise = np.random.laplace(loc=0, scale = 4/epsilon)
        if q(df) + noise >= t_hat:
            return idx

    # If no query reached the threshold, return -1, which indexes the last
    # (largest) element when used to look up a bound in 'bs'
    return -1 

def mean_age(epsilon):
    # Define constants
    lower_bound = 0
    epsilon_fraction = epsilon/3
    
    # Generate queries
    sens_one_queries = [sens_one_query(b) for b in bs]
    
    # Find the smallest b in bs at which raising the bound to b+1 no longer changes the clipped sum
    upper_bound = bs[above_0(sens_one_queries, epsilon_fraction)]
    
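    # Noisy clipped sum: sensitivity is upper_bound because ages are clipped to [0, upper_bound]
    noisy_sum = laplace_mech(df.clip(lower=lower_bound, upper=upper_bound).sum(), upper_bound, epsilon_fraction)
    # Noisy count: sensitivity 1
    noisy_count = laplace_mech(len(df), 1, epsilon_fraction)

    # Return noisy mean
    return noisy_sum / noisy_count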
    
for epsilon in [0.001, 0.01, 0.1, 0.5, 1, 10]:
    print(f"epsilon: {epsilon}, mean age: {mean_age(epsilon)}")

epsilon: 0.001, mean age: 54.81847089651001
epsilon: 0.01, mean age: 39.17962688302697
epsilon: 0.1, mean age: 38.51168278029223
epsilon: 0.5, mean age: 38.510669957462746
epsilon: 1, mean age: 38.59243118757453
epsilon: 10, mean age: 38.58022182685648


## Question 6 (10 points)

In 3-5 sentences, describe your approach to implementing `mean_age` and argue informally that your implementation has privacy cost bounded by $\epsilon$.

- This implementation splits the privacy budget into three equal parts, builds a sequence of sensitivity-1 queries over the sum of `adult_data['Age']`, and uses them to choose the upper clipping parameter (via `above_0`, a modified version of `above_threshold`); the lower clipping parameter is fixed at 0. It then computes the mean of the clipped data as a noisy sum divided by a noisy count. When designing the implementation, I referenced the notes on the sparse vector technique to understand how it selects a good clipping parameter. This was quite confusing to me at first, but after I figured it out, the rest came naturally.
- This implementation has a total privacy cost of $\epsilon$ because it splits the privacy budget into thirds: one third for `above_0` to choose the clipping parameter, one third for the noisy sum of the clipped data, and one third for the noisy count of rows. By sequential composition these add up to $\epsilon/3 + \epsilon/3 + \epsilon/3 = \epsilon$, and dividing the noisy sum by the noisy count is post-processing, so it costs nothing extra.
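The sum-and-count step described above can be sketched on made-up data as follows. This is an illustration under stated assumptions rather than the notebook's implementation: the upper bound is taken as already chosen, so the budget is split in two instead of three, the lower bound is 0, and `laplace_noise` and `dp_mean` are hypothetical standard-library helpers.

```python
import random

def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_mean(xs, upper, epsilon, rng=random):
    """Noisy clipped sum divided by noisy count, each using epsilon / 2.
    Assumption: values are clipped to [0, upper], so the sum has sensitivity
    `upper` and the count has sensitivity 1."""
    clipped = [max(min(x, upper), 0) for x in xs]
    noisy_sum = sum(clipped) + laplace_noise(upper / (epsilon / 2), rng)
    noisy_count = len(xs) + laplace_noise(1 / (epsilon / 2), rng)
    # Dividing two private releases is post-processing
    return noisy_sum / noisy_count

ages = [17, 25, 38, 38, 52, 90]  # made-up ages
rng = random.Random(0)
for epsilon in [0.1, 1.0, 10.0]:
    print(epsilon, dp_mean(ages, 100, epsilon, rng))
```

With so few rows the noise dominates at small $\epsilon$, which matches the larger error seen for $\epsilon = 0.001$ in the recorded output above.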