***
**Adapted by**: Peter Lu \
CS/STAT108: Data Science Ethics (UCR - Winter 2024)
***

# Data Privacy
This week, we'll be taking a look at de-identification, re-identification, and $\varepsilon$-differential privacy with continuous values!

In [None]:
# Load the data and libraries
import pandas as pd
import numpy as np

adult = pd.read_csv('adult_with_pii.csv')

In [None]:
adult.head()

## Question 1

How many individuals in this dataset are uniquely identified by their ZIP code? How many are uniquely identified by their age?

*Hints*:
1. The number of *unique ZIP codes* is **different** from the number of *individuals uniquely identified by ZIP code*.
2. You can use the `value_counts` method to count the number of occurrences of each value in a series.
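
To illustrate the distinction drawn in the first hint, here is a minimal sketch on a made-up table. The names, ZIP codes, and variable names are hypothetical and are not taken from `adult_with_pii.csv`.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve"],
    "Zip": ["92501", "92501", "92507", "92521", "92521"],
})

# value_counts maps each ZIP code to the number of rows that carry it.
counts = toy["Zip"].value_counts()

n_distinct_zips = len(counts)             # how many different ZIP codes appear
n_singletons = int((counts == 1).sum())   # ZIP codes held by exactly one person

print(n_distinct_zips, n_singletons)      # prints: 3 1
```

Three distinct ZIP codes appear, but only one person (the one in `92507`) is uniquely identified by ZIP code.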

In [None]:
def num_unique_id(data: pd.DataFrame, feature: str) -> int:
    '''
    Returns the number of individuals uniquely identifiable by `feature`.
    -------------------
    data: pd.DataFrame data matrix
    feature: string of variable to group individuals by
    '''

    # Enter code below

    raise NotImplementedError()

### Test Cases

In [None]:
assert num_unique_id(adult, 'Zip') == 23513
assert num_unique_id(adult, 'Age') == 2

## Question 2

Write code to determine the `Education-Num` of any individual by performing a differencing attack. Your code should *only* use aggregate data to find the education number.

Assume you can look up aggregate data about the dataset, but no one's specific education number.
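
As a rough illustration of the idea, the sketch below runs a differencing attack on a made-up table. The column `Score`, the names, and the helper `toy_sum` are all hypothetical; the only access assumed is an aggregate (sum) query over a filtered subset.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Score": [7, 12, 9],
})

def toy_sum(df: pd.DataFrame, column: str) -> int:
    """Aggregate query: the only kind of access the attacker is assumed to have."""
    return int(df[column].sum())

# Query 1: sum over everyone. Query 2: sum over everyone except the target.
total_with = toy_sum(toy, "Score")
total_without = toy_sum(toy[toy["Name"] != "Bob"], "Score")

# The difference of the two aggregates is exactly the target's value.
bob_score = total_with - total_without
print(bob_score)  # prints: 12
```

Neither query reveals an individual value on its own; the leak comes from combining them.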

In [None]:
def differencing_attack(data: pd.DataFrame, name: str, feature: str) -> int:
    '''
    Returns the `feature` of `name` using differencing attack on `feature`.
    Only works with numeric variables.
    -----------------
    data: pd.DataFrame data matrix
    name: string of user whose feature is of interest
    feature: string of variable of interest
    '''

    # Enter code below

    raise NotImplementedError()

### Test Cases

In [None]:
assert differencing_attack(adult, 'Ardyce Golby', 'Education-Num') == 12
assert differencing_attack(adult, 'Reuben Skrzynski', 'Education-Num') == 9

## Differential Privacy
As seen in lecture, a randomized algorithm ***M*** provides **$\boldsymbol{\varepsilon}$-differential privacy** if, for all neighboring databases $\boldsymbol{D}_1$ and $\boldsymbol{D}_2$, and for any set of outputs $\boldsymbol{S}$:
$$P[M(D_1) \in S] \leq e^{\varepsilon}P[M(D_2) \in S].$$
$\varepsilon$ is the privacy parameter. Smaller values of $\varepsilon$ provide stronger privacy.

The two-coin algorithm is a useful ***M*** in the context of categorical or discrete variables, but what about continuous variables? One useful randomized algorithm is the **Laplace Mechanism**.
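
To connect the definition to the two-coin algorithm, the sketch below assumes it is the usual randomized-response scheme: flip a coin, answer truthfully on heads, and on tails flip a second coin and answer "yes" on heads or "no" on tails. If the version from lecture differs, the probabilities below change accordingly. The helper `p_yes` is a hypothetical name.

```python
import math

def p_yes(truth: bool) -> float:
    """Probability of reporting 'yes' under the assumed two-coin scheme."""
    truthful = 1.0 if truth else 0.0
    # First coin heads (prob 0.5): report the truth.
    # First coin tails (prob 0.5): second coin reports 'yes' with prob 0.5.
    return 0.5 * truthful + 0.5 * 0.5

# Worst-case probability ratio over both possible outputs.
ratio_yes = p_yes(True) / p_yes(False)             # 0.75 / 0.25
ratio_no = (1 - p_yes(False)) / (1 - p_yes(True))  # 0.75 / 0.25

# The smallest eps that satisfies the inequality is the log of the worst ratio.
eps = math.log(max(ratio_yes, ratio_no))
print(ratio_yes, ratio_no, eps)
```

Under this assumption both ratios equal 3, so the scheme satisfies $\varepsilon$-differential privacy with $\varepsilon = \ln 3$.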


## Laplace Distribution
Before discussing the Laplace Mechanism, we must discuss the Laplace distribution, which is used to generate our noise. \
\
The **Laplace distribution** $\text{Lap}(\mu,  b)$ with mean $\mu$ and scale parameter $b$ has the following probability density function:
$$f(x; \mu, b) = \frac{1}{2b}\exp{\bigg(-\frac{|x - \mu|}{b}\bigg)}.$$
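
As a quick sanity check of this formula, the sketch below evaluates the density numerically and draws samples with NumPy's `Generator.laplace`. The helper `laplace_pdf`, the seed, and the parameter values are illustrative choices.

```python
import numpy as np

def laplace_pdf(x, mu=0.0, b=1.0):
    """Density of Lap(mu, b), written directly from the formula above."""
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

# Draw samples from Lap(0, 2); the seed is fixed only for reproducibility.
rng = np.random.default_rng(0)
samples = rng.laplace(loc=0.0, scale=2.0, size=200_000)

# Lap(mu, b) has variance 2 * b**2, so the sample variance should be near 8.
sample_var = samples.var()

# A Riemann sum of the density over a wide grid should be close to 1.
grid = np.linspace(-60, 60, 120_001)
area = np.sum(laplace_pdf(grid, 0.0, 2.0)) * (grid[1] - grid[0])

print(sample_var, area)
```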
When noise is drawn from this distribution with a scale chosen from the privacy parameter $\varepsilon$ and the sensitivity of the query, adding that noise produces differentially private data.\
\
Essentially, the **Laplace Mechanism** is a function that inserts noise into numeric data. The noise can be inserted when the data is collected or when the data is queried. This can be seen in the following formulation:
$$M_{L}(x, f, \varepsilon) = f(x) + \text{noise}$$
where
* $f$ is either the identity $f(x) = x$ or a query function (like a histogram)
* $\text{noise} \sim \text{Lap}(0, \frac{s}{\varepsilon})$
* $s$ is the sensitivity of $f$: the largest amount by which $f$ can change when a single individual's record in your data changes
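
The formulation above can be sketched for a single counting query as follows. The function name `laplace_mechanism`, the made-up ages, and the choice of sensitivity $1$ for a count are assumptions made for illustration; this is not the notebook's reference solution.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, eps: float,
                      rng: np.random.Generator) -> float:
    """Return true_value plus Lap(0, sensitivity / eps) noise."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps)
    return true_value + noise

rng = np.random.default_rng(42)  # seed fixed only for reproducibility

# Hypothetical data: the query counts how many people are 40 or older.
ages = np.array([23, 35, 41, 52, 67, 29, 44])
true_count = int((ages >= 40).sum())

# Adding or removing one person changes a count by at most 1.
for eps in (0.1, 1.0, 10.0):
    print(eps, laplace_mechanism(true_count, sensitivity=1.0, eps=eps, rng=rng))
```

Because the noise scale is $s / \varepsilon$, smaller values of $\varepsilon$ give noisier answers on average, which is the privacy and accuracy trade-off.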



## Question 3

Suppose we want to query the `Education-Num` data using a histogram. Use the Laplace Mechanism to introduce $\varepsilon$-differential privacy into the data. Test your function with $\varepsilon = 1, 2, 3$. Compare the histograms before and after adding noise.

*Hints*:  
1. `numpy` may have a useful module for generating values from probability distributions.
2. The sensitivity of a histogram is $1$.
3. The query function $f$ is the histogram counts.
4. Check out the data with the `better_hist` function! After calling it, you can still add labels and whatnot.
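
To see why the second hint holds when neighboring datasets differ by adding or removing one record, here is a small sketch with `np.histogram` on made-up values. The bin edges and the data are hypothetical.

```python
import numpy as np

# Hypothetical bins: one bin per integer value from 1 to 16.
edges = np.arange(0.5, 17.5, 1.0)

# Made-up values and a neighboring dataset with one extra record.
values = np.array([9, 9, 10, 13, 13, 13])
neighbor = np.append(values, 10)

counts, _ = np.histogram(values, bins=edges)
counts_neighbor, _ = np.histogram(neighbor, bins=edges)

# Total change across all bins between the two histograms.
l1_change = int(np.abs(counts_neighbor - counts).sum())
print(l1_change)  # prints: 1
```

The extra record lands in exactly one bin, so the vector of counts changes by $1$ in total.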



In [4]:
import matplotlib.pyplot as plt

# use instead of plt's histogram
def better_hist(data: pd.Series, bins: int = 10, color: str = None) -> None:
  '''
  Plots a histogram of `data` with each bar labeled by its count.
  '''
  _, _, patches = plt.hist(data, bins = bins, edgecolor = 'black', color = color)
  # Write each bar's height just above its center
  for i in range(len(patches)):
    plt.text(patches[i].get_x() + patches[i].get_width() / 2, patches[i].get_height(),
             str(int(patches[i].get_height())), ha='center', va='bottom')

In [None]:
# Plot initial histogram
# Enter code below


In [None]:
def eps_diff_privacy_hist(data: pd.Series, eps: float) -> pd.Series:
  '''
  Introduces `eps`-differential privacy into `data` using Laplace Mechanism.
  Returns data with noise added.
  -------------
  data: pd.Series column data to be noisified
  eps: privacy parameter
  '''

  # Enter code below

  return noisy_data

In [5]:
# Plot histogram of un-noisified and noisified data
# Enter code below
