# Privacy Preserving Machine Learning

Course taught by Aurélien Bellet

Course page: http://researchers.lille.inria.fr/abellet/teaching/private_machine_learning_course.html

# Practical session 5: Local Differential Privacy and Federated Learning

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

## Dataset

We will be working again with the US Census dataset. You can read about the dataset [here](https://archive.ics.uci.edu/ml/datasets/census+income).

The following cell loads the dataset from [OpenML](https://www.openml.org/) with the `fetch_openml` function of `sklearn`. The option `as_frame=True` loads the dataset as a pandas `DataFrame`: this keeps the attributes in their original form and will be more convenient to work with. If you prefer working with a numpy array (not recommended for the first part of the practical), set `as_frame=False`.

In [2]:
dataset_handle = fetch_openml(name='adult', version=2, as_frame=True)
dataset = dataset_handle.frame

In [3]:
n, d = dataset.shape
print(n, d)
dataset.head(10)

48842 15


|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | Private | 226802.0 | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 38.0 | Private | 89814.0 | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | <=50K |
| 2 | 28.0 | Local-gov | 336951.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K |
| 3 | 44.0 | Private | 160323.0 | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States | >50K |
| 4 | 18.0 |  | 103497.0 | Some-college | 10.0 | Never-married |  | Own-child | White | Female | 0.0 | 0.0 | 30.0 | United-States | <=50K |
| 5 | 34.0 | Private | 198693.0 | 10th | 6.0 | Never-married | Other-service | Not-in-family | White | Male | 0.0 | 0.0 | 30.0 | United-States | <=50K |
| 6 | 29.0 |  | 227026.0 | HS-grad | 9.0 | Never-married |  | Unmarried | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 7 | 63.0 | Self-emp-not-inc | 104626.0 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 3103.0 | 0.0 | 32.0 | United-States | >50K |
| 8 | 24.0 | Private | 369667.0 | Some-college | 10.0 | Never-married | Other-service | Unmarried | White | Female | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 9 | 55.0 | Private | 104996.0 | 7th-8th | 4.0 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0.0 | 0.0 | 10.0 | United-States | <=50K |


## Question 1 (K-ary randomized response)

Consider an individual who has a certain value of an attribute. This attribute can take $K$ possible discrete values. The individual would like to report his/her value in a differentially private way. As we have seen in the lecture, $K$-ary randomized response ($K$-RR) is a standard technique for this problem. 

Write a function that implements $K$-RR: it takes as input a private value, a list of possible values, the desired value of $\epsilon$ and a random seed (for reproducibility), and returns a randomized value in a way that satisfies $\epsilon$-LDP.

In [4]:
def RR(v, value_list, eps, random_state=None):
    rng = np.random.RandomState(random_state)
    K = len(value_list)
    if v not in value_list:
        raise ValueError("The input value " + repr(v) + " is not in the list " + str(value_list) + ".")
    # TO COMPLETE
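
One possible way to fill in the randomizer is sketched here; it is not the official solution. The name `k_rr_sketch` is hypothetical, and unlike the template it takes an already-created `rng` object, on the assumption that you want successive calls to use different random draws (re-creating a `RandomState` from the same integer seed at every call would repeat the same draw). The true value is kept with probability $e^\epsilon/(e^\epsilon+K-1)$, and otherwise one of the $K-1$ other values is returned uniformly at random.

```python
import numpy as np

def k_rr_sketch(v, value_list, eps, rng):
    """K-ary randomized response on one value (hypothetical helper, a sketch)."""
    K = len(value_list)
    # list.index raises ValueError when v is not a possible value.
    idx = value_list.index(v)
    # Probability of reporting the true value.
    p_keep = np.exp(eps) / (np.exp(eps) + K - 1)
    if rng.random_sample() < p_keep:
        return v
    # Otherwise pick uniformly among the K - 1 other positions.
    j = rng.randint(K - 1)
    if j >= idx:
        j += 1
    return value_list[j]

rng = np.random.RandomState(0)
reports = [k_rr_sketch("b", ["a", "b", "c"], 1.0, rng) for _ in range(5)]
print(reports)
```

Working on list positions rather than comparing values avoids surprises with `NaN`, which is not equal to itself.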

## Question 2 (unbiased histogram with randomized response)

We are now interested in estimating the frequency of each value taken by the attribute in a given population (i.e., constructing a histogram). Without any privacy constraint, we could get the true histogram, as computed by the function below.

In [5]:
def histogram_query(df, attribute):
    return df[attribute].value_counts(dropna=False, sort=False, normalize=True)

q = histogram_query(dataset, 'workclass')
print(q)

Private             0.694198
Self-emp-not-inc    0.079071
Self-emp-inc        0.034704
Federal-gov         0.029319
Local-gov           0.064207
State-gov           0.040559
Without-pay         0.000430
Never-worked        0.000205
NaN                 0.057307
Name: workclass, dtype: float64


If we have privacy constraints but we assume the presence of a trusted curator, we would simply use the Laplace or Gaussian mechanisms to perturb the true histogram. However, when there is no trusted curator, each individual response must be collected in a private manner: this is the local model of differential privacy.

Write a function that computes a private and *unbiased* estimate of the true histogram from responses obtained from the $K$-RR local randomizer implemented in Question 1. This function should take as input the dataset, the attribute of interest, the desired value of $\epsilon$ and a random seed, and return an unbiased estimate of the histogram in a way that satisfies $\epsilon$-LDP.

*Advice:* it will be convenient to return the histogram as a pandas `Series`, as done by the function `histogram_query` above.

In [6]:
def histogram_with_RR(df, attribute, eps, random_state=None):
    rng = np.random.RandomState(random_state)
    value_list = list(histogram_query(df, attribute).index)
    K = len(value_list)
    hist = np.zeros(K)
    
    # TO COMPLETE
    
    return pd.Series(hist, index=value_list)

In [7]:
private_q_rr = histogram_with_RR(dataset, 'workclass', 1)
print(private_q_rr)

Private             0.0
Self-emp-not-inc    0.0
Self-emp-inc        0.0
Federal-gov         0.0
Local-gov           0.0
State-gov           0.0
Without-pay         0.0
Never-worked        0.0
NaN                 0.0
dtype: float64
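
The all-zero output above simply reflects the uncompleted template. As a hedged sketch of one way to debias the reports (not the official solution): if $p = e^\epsilon/(e^\epsilon+K-1)$ and $q = 1/(e^\epsilon+K-1)$, the expected frequency of reports equal to value $k$ is $q + (p-q)h_k$, where $h_k$ is the true frequency, so $(\hat{f}_k - q)/(p-q)$ is unbiased. The function name `rr_histogram_sketch` is hypothetical, and for simplicity it reads the list of possible values from the data, whereas in a real deployment that list is assumed to be public.

```python
import numpy as np
import pandas as pd

def rr_histogram_sketch(values, eps, random_state=None):
    """Unbiased eps-LDP histogram estimate from K-RR reports (a sketch)."""
    rng = np.random.RandomState(random_state)
    # Treat missing values as one more category, like dropna=False above.
    s = pd.Series(values).astype(object)
    s = s.where(s.notna(), "NaN")
    codes, uniques = pd.factorize(s)
    n, K = len(codes), len(uniques)
    if K < 2 or eps <= 0:
        raise ValueError("need at least two categories and eps > 0")
    p = np.exp(eps) / (np.exp(eps) + K - 1)  # prob. of reporting the truth
    q = 1.0 / (np.exp(eps) + K - 1)          # prob. of each other value
    # Vectorized K-RR: keep the code w.p. p, else shift it by 1..K-1 mod K,
    # which is uniform over the K - 1 other values.
    keep = rng.random_sample(n) < p
    shift = rng.randint(1, K, size=n)
    reports = np.where(keep, codes, (codes + shift) % K)
    observed = np.bincount(reports, minlength=K) / n
    # Invert E[observed_k] = q + (p - q) * h_k.
    estimate = (observed - q) / (p - q)
    return pd.Series(estimate, index=uniques)

demo = np.random.RandomState(0).choice(["a", "b", "c"], size=50000, p=[0.6, 0.3, 0.1])
print(rr_histogram_sketch(demo, eps=1.0, random_state=1))
```

The estimate always sums to 1, but individual entries can be negative for rare values; that is the price of unbiasedness. On the notebook's data, the call would be `rr_histogram_sketch(dataset['workclass'], eps=1.0, random_state=0)`.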


## Question 3 (utility of randomized response)

Compare the utility (as measured by the $\ell_1$ error) of the histogram obtained with $K$-RR in the *local* model of differential privacy with that of the histogram obtained with the Laplace mechanism in the *centralized* model of differential privacy. For convenience, the functions `Laplace_mechanism` and `l1_error` are provided below.

Study the effect of $\epsilon$, the number of possible values $K$, the number of data points $n$, and discuss your results. What are the relative merits of the centralized and local DP models? In which situations do you expect the local model to be useful?

In [8]:
def l1_error(a, b):
    if not(hasattr(a, 'shape')):
        return np.abs(a - b)
    else:
        return np.linalg.norm(a - b, ord=1)

def Laplace_mechanism(q, s1, eps, random_state=None):
    rng = np.random.RandomState(random_state)
    f = lambda x: x + rng.laplace(scale=s1 / eps)
    if hasattr(q, 'shape'):
        # 1d vector
        if q.ndim == 1:
            return q.apply(f)
        # k-way table
        else:
            return q.applymap(f)
    else:
        # scalar
        return f(q) 
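
As a back-of-the-envelope guide for the comparison (a sketch under stated assumptions, not a full answer): assume the replace-one neighboring relation, so that the normalized histogram $h$ has $\ell_1$ sensitivity $s_1 = 2/n$, and assume the $K$-RR estimator $(\hat{f}_k - q)/(p - q)$ with $p = e^\epsilon/(e^\epsilon+K-1)$ and $q = 1/(e^\epsilon+K-1)$, where $\hat{f}_k$ is the observed frequency of reports equal to $k$. Since $\mathbb{E}|\mathrm{Lap}(b)| = b$:

$$
\mathbb{E}\,\|\hat{h}_{\mathrm{Lap}} - h\|_1 = K \cdot \frac{s_1}{\epsilon} = \frac{2K}{n\epsilon},
\qquad
\operatorname{Var}\big(\hat{h}_{\mathrm{RR},k}\big) = \frac{\pi_k (1 - \pi_k)}{n\,(p - q)^2},
\quad \pi_k = q + (p - q)\,h_k .
$$

For small $\epsilon$ we have $p - q \approx \epsilon/K$ and $\pi_k \approx 1/K$, which suggests an $\ell_1$ error on the order of $K^{3/2}/(\epsilon\sqrt{n})$ for $K$-RR, against $K/(n\epsilon)$ for the centralized Laplace mechanism. Your experiments can check how closely the measured errors follow these rates.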

## Question 4 (local Laplace mechanism)

We would now like to compute an averaging query in the local model of differential privacy. How can we do this using the Laplace mechanism? What are the differences with the centralized (trusted curator) version?

Implement your solution to privately estimate the average age of the individuals in the dataset, and compare the utility of the local and centralized solutions.
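
A minimal sketch of the two variants follows, assuming values are clipped to a public range `[lower, upper]` chosen independently of the data (the range 17 to 90 used for ages here is an assumption) and the replace-one neighboring relation. The function names are hypothetical. In the local version every individual perturbs their own value with noise of scale `(upper - lower) / eps`; in the centralized version the curator perturbs the exact mean, whose sensitivity is `(upper - lower) / n`.

```python
import numpy as np

def local_laplace_mean(x, lower, upper, eps, random_state=None):
    """eps-LDP mean: each individual adds Laplace noise to their own value."""
    rng = np.random.RandomState(random_state)
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    # Sensitivity of a single clipped value is the width of the range.
    reports = x + rng.laplace(scale=(upper - lower) / eps, size=x.shape[0])
    return reports.mean()

def central_laplace_mean(x, lower, upper, eps, random_state=None):
    """eps-DP mean with a trusted curator: noise is added once, to the mean."""
    rng = np.random.RandomState(random_state)
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    n = x.shape[0]
    # Replacing one record moves the mean by at most (upper - lower) / n.
    return x.mean() + rng.laplace(scale=(upper - lower) / (n * eps))

# Synthetic ages stand in for the real column in this illustration.
ages = np.random.RandomState(0).uniform(17, 90, size=10000)
print("true mean   :", ages.mean())
print("local DP    :", local_laplace_mean(ages, 17, 90, eps=1.0, random_state=1))
print("central DP  :", central_laplace_mean(ages, 17, 90, eps=1.0, random_state=1))
```

The noise on the local estimate has standard deviation proportional to $1/(\epsilon\sqrt{n})$, against $1/(\epsilon n)$ in the centralized case, so the gap grows like $\sqrt{n}$.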

## Question 5 (federated learning with DP: DP-SGD in the distributed model)

Enough with these simple queries! Let's train a machine learning model in the federated learning setting, in which $n$ participants with their own datasets collaborate to train a joint model. Each participant $i$ wants to ensure that the algorithm satisfies $(\epsilon,\delta)$-DP with respect to his/her own dataset $D_i$. This is sometimes referred to as the distributed model of DP. Note that if each participant has a dataset of size 1 then this is exactly local DP. However, the privacy-utility trade-off will be better when participants have more data points, which is what we consider below.

The following code loads the US Census dataset in one-hot encoded version. Feel free to use another binary classification dataset of your choice instead.

In [9]:
X, y = fetch_openml(name='a9a', version=1, return_X_y=True, as_frame=False)
normalizer = Normalizer()
X = normalizer.transform(X)
m, d = X.shape
print(m, d)

48842 123


Consider a setting with $n=5$ participants. To simulate the federated learning setting, we will split the dataset in $n$ local datasets of roughly equal size. To do this, we use `sklearn.model_selection.KFold`.

In [10]:
n = 5
features = {}
labels = {}
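# Each KFold split yields (train_idx, test_idx); the held-out folds are disjoint,
# so participant i receives the i-th held-out fold as its local dataset.
for i, (_, fold_idx) in enumerate(KFold(n_splits=n, shuffle=True).split(X)):
    features[i] = X[fold_idx, :]
    labels[i] = y[fold_idx]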
    
for i in range(n):
    print("Dataset of participant " + str(i) + ":", features[i].shape, labels[i].shape)

Dataset of participant 0: (9769, 123) (9769,)
Dataset of participant 1: (9769, 123) (9769,)
Dataset of participant 2: (9768, 123) (9768,)
Dataset of participant 3: (9768, 123) (9768,)
Dataset of participant 4: (9768, 123) (9768,)


We would now like to train a logistic regression classifier with DP-SGD in the federated setting. For simplicity of exposition, assume the presence of an *untrusted* aggregator. The algorithm follows an iterative process, where each iteration consists of the following steps:
1. The untrusted aggregator sends the current parameters of the model to the participants.
2. Each participant $i$ computes a stochastic gradient using a mini-batch from his/her local dataset $D_i$, adds Gaussian noise locally to ensure DP, and sends it to the untrusted aggregator.
3. The untrusted aggregator averages these gradients and uses the result to update the model with a gradient step.

How much Gaussian noise should each participant add at each iteration to ensure an $(\epsilon,\delta)$-DP guarantee for the entire algorithm?

Adapt your centralized DP-SGD code from the previous practical to simulate this federated learning version. Compare the utility with the centralized version, studying in particular the effect of the number of participants.

Suppose that the local dataset sizes are imbalanced across participants. How does this affect the Gaussian noise added by each participant? How does this affect the utility? Suggest an appropriate weighted aggregation scheme to mitigate this.
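
The three steps of the protocol can be simulated as in the sketch below. It is deliberately conservative and is not the official solution. It assumes record-level privacy within each $D_i$ under the replace-one relation, per-example gradient clipping to norm `clip`, the classical Gaussian mechanism calibration (valid for a per-step $\epsilon < 1$), and basic composition, so that each of the $T$ iterations gets a budget of $(\epsilon/T, \delta/T)$. It ignores amplification by subsampling, so tighter accounting would allow less noise. All function names are hypothetical; the data is assumed to be dense NumPy arrays with labels in $\{-1, +1\}$.

```python
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian mechanism noise scale (assumes eps < 1)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

def clipped_grads(w, X, y, clip):
    """Per-example logistic-loss gradients, each clipped to L2 norm <= clip."""
    g = (-y / (1.0 + np.exp(y * (X @ w))))[:, None] * X
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g * np.minimum(1.0, clip / np.maximum(norms, 1e-12))

def federated_dp_sgd_sketch(features, labels, eps, delta, n_iter=50,
                            batch_size=64, clip=1.0, lr=1.0, random_state=None):
    rng = np.random.RandomState(random_state)
    n = len(features)
    w = np.zeros(features[0].shape[1])
    # Basic composition: split the total budget evenly over the iterations.
    eps_t, delta_t = eps / n_iter, delta / n_iter
    if eps_t >= 1:
        raise ValueError("per-step eps must be < 1 for this calibration")
    # Replacing one record changes a mini-batch mean by at most 2 * clip / B.
    sigma = gaussian_sigma(2.0 * clip / batch_size, eps_t, delta_t)
    for _ in range(n_iter):
        noisy_grads = []
        for i in range(n):
            # Step 2: local mini-batch gradient plus local Gaussian noise.
            idx = rng.choice(features[i].shape[0], size=batch_size, replace=False)
            g = clipped_grads(w, features[i][idx], labels[i][idx], clip).mean(axis=0)
            noisy_grads.append(g + rng.normal(scale=sigma, size=w.shape[0]))
        # Step 3: the untrusted aggregator only ever sees noisy gradients.
        w -= lr * np.mean(noisy_grads, axis=0)
    return w

# Tiny synthetic example with 3 participants.
rng = np.random.RandomState(0)
feats = {i: rng.normal(size=(200, 5)) for i in range(3)}
labs = {i: np.sign(rng.normal(size=200)) for i in range(3)}
print(federated_dp_sgd_sketch(feats, labs, eps=1.0, delta=1e-5, n_iter=10, random_state=0))
```

Because the aggregator averages $n$ independently perturbed gradients, the noise variance in the update shrinks by a factor $n$, which is one angle from which to study the effect of the number of participants.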

## Bonus Question 1 (recover trusted curator accuracy)

We have seen in the previous questions that the utility cost of the local model can be high. As discussed in the lecture, we can rely on secure multi-party computation primitives like secure aggregation and secure shuffling to recover the utility of the trusted curator model. However, these primitives pose some implementation challenges.

Here we consider a simple alternative described in [Sabater et al. (2020)](https://arxiv.org/pdf/2006.07218.pdf). We consider an averaging query over $n$ participants, where each participant $i$ holds a bounded real value $X_i$ (we can assume $X_i\in[0,1]$ for simplicity). We consider that all participants are honest-but-curious, which corresponds to $\rho=1$ in the paper. The proposed approach relies on the Gaussian mechanism. As can be seen from Algorithm 1 in the paper, the idea is that each participant will add Gaussian noise $\eta_i\sim \mathcal{N}(0,\sigma_\eta^2)$ to its local value, but also some *correlated* Gaussian noise terms $\Delta_{i,j}\sim\mathcal{N}(0,\sigma_\Delta^2)$ with other participants $j$ such that $\Delta_{i,j}=-\Delta_{j,i}$. We consider for simplicity that all pairs of participants will exchange such correlated noise (corresponding to a complete graph of exchanges). Corollary 1 gives the values of the variances $\sigma_\eta^2$ and $\sigma_\Delta^2$ to use to achieve a desired DP guarantee.

Implement this protocol and compare its privacy-utility trade-off to that of the Gaussian mechanism in the trusted curator setting.
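
The mechanics of the pairwise-canceling noise can be sketched as follows, for a complete graph of exchanges. This only illustrates how the noise terms are generated and why they cancel in the average. The values of `sigma_eta` and `sigma_delta` are free parameters in this sketch: they are placeholders, not the values prescribed by Corollary 1, which you still need to plug in to obtain a DP guarantee. The function name is hypothetical.

```python
import numpy as np

def correlated_noise_average(x, sigma_eta, sigma_delta, random_state=None):
    """Average of reports perturbed by independent and pairwise-canceling noise."""
    rng = np.random.RandomState(random_state)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Antisymmetric matrix: Delta[i, j] = -Delta[j, i], zero diagonal.
    upper = np.triu(rng.normal(scale=sigma_delta, size=(n, n)), k=1)
    Delta = upper - upper.T
    # Independent noise eta_i that does not cancel.
    eta = rng.normal(scale=sigma_eta, size=n)
    # Each participant i reports x_i + sum_j Delta[i, j] + eta_i.
    reports = x + Delta.sum(axis=1) + eta
    return reports.mean(), reports

x = np.random.RandomState(0).uniform(0, 1, size=50)
avg, reports = correlated_noise_average(x, sigma_eta=0.05, sigma_delta=10.0, random_state=1)
print("true average :", x.mean())
print("noisy average:", avg)
print("largest individual perturbation:", np.abs(reports - x).max())
```

Individual reports are heavily perturbed by the $\Delta_{i,j}$ terms, but since these sum to zero over all participants, the error on the average comes only from the $\eta_i$ terms.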

## Bonus Question 2 (federated learning with TensorFlow Federated)

[TensorFlow Federated](https://www.tensorflow.org/federated/) (TFF) is a framework to experiment with federated learning algorithms. You can [install it using `pip`](https://www.tensorflow.org/federated/install#install_tensorflow_federated_using_pip).

Use TFF to train a neural network with the FedAvg algorithm seen in the lecture on one of the benchmark datasets provided (without privacy constraints). Differential privacy cannot be added natively, but TFF is interoperable with [TensorFlow Privacy](https://github.com/tensorflow/privacy); see [here for details](https://www.tensorflow.org/federated/tff_for_research#differential_privacy).