<a href="https://colab.research.google.com/github/woodyx218/Deep-Learning-with-GDP-Tensorflow/blob/master/GDP_NN_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Differentially Private Deep Learning Accounted by Gaussian Differential Privacy

### This tutorial shows how to train deep neural networks with differential privacy, a mathematically rigorous privacy definition that defends against privacy leakage such as the [membership inference attack](https://github.com/tensorflow/privacy/tree/master/tensorflow_privacy/privacy/membership_inference_attack).

### Our tutorial is based on the TensorFlow Privacy library, [**tensorflow-privacy**](https://github.com/tensorflow/privacy). First, we install tensorflow-privacy.

In [1]:
!pip install tensorflow-privacy

Collecting tensorflow-privacy
  Downloading https://files.pythonhosted.org/packages/5e/53/31a388f82202a155f248f75cc0f45bd0b85a0ef020a2472e600dc19d38d6/tensorflow_privacy-0.5.2-py3-none-any.whl (192kB)

### We also need the privacy accountants, including our GDP accountant.

In [2]:
!git clone https://github.com/woodyx218/Deep-Learning-with-GDP-Tensorflow.git

Cloning into 'Deep-Learning-with-GDP-Tensorflow'...
remote: Enumerating objects: 145, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 145 (delta 13), reused 0 (delta 0), pack-reused 119
Receiving objects: 100% (145/145), 6.03 MiB | 12.33 MiB/s, done.
Resolving deltas: 100% (65/65), done.


In [3]:
cd /content/Deep-Learning-with-GDP-Tensorflow

/content/Deep-Learning-with-GDP-Tensorflow


### Import necessary packages.

In [4]:
import numpy as np
import tensorflow as tf
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy
# Import only the DP optimizer used below instead of a wildcard import.
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


### We use MNIST here, similar to the [TensorFlow Privacy classification tutorial](https://github.com/tensorflow/privacy/blob/master/tutorials/Classification_Privacy.ipynb).

In [5]:
train, test = tf.keras.datasets.mnist.load_data()
train_data, train_labels = train
test_data, test_labels = test

train_data = np.array(train_data, dtype=np.float32) / 255
test_data = np.array(test_data, dtype=np.float32) / 255

train_labels = np.array(train_labels, dtype=np.int32).flatten()
test_labels = np.array(test_labels, dtype=np.int32).flatten()


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


### Define and tune learning model hyperparameters

In [6]:
epochs = 15
batch_size = 250

l2_norm_clip = 1.5
noise_multiplier = 1.3
num_microbatches = 250
learning_rate = 0.25

if batch_size % num_microbatches != 0:
  raise ValueError('Batch size should be an integer multiple of the number of microbatches')

# Set to False to train without differential privacy (plain SGD).
dpsgd = True

TensorFlow Privacy uses a variant of the DP optimizer known as the microbatch technique. The explanation below is copied from their tutorial:

---

**microbatches (int)** - Each batch of data is split in smaller units called microbatches. By default, each microbatch should contain a single training example. This allows us to clip gradients on a per-example basis rather than after they have been averaged across the minibatch. This in turn decreases the (negative) effect of clipping on signal found in the gradient and typically maximizes utility. However, computational overhead can be reduced by increasing the size of microbatches to include more than one training examples. The average gradient across these multiple training examples is then clipped. The total number of examples consumed in a batch, i.e., one step of gradient descent, remains the same. The number of microbatches should evenly divide the batch size.

---

If num_microbatches == batch_size, this is the vanilla DP optimizer (e.g. DP-SGD) with per-sample clipping. Otherwise, training is faster because per-sample clipping is no longer performed. E.g. with batch_size = 250 and num_microbatches = 50, you clip 50 averaged gradients of 5 samples each, instead of clipping 250 per-sample gradients. However, this usually worsens the accuracy since the noise is larger relative to the gradient signal.
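
To make the microbatch mechanics concrete, here is a minimal pure-Python sketch of one noisy, clipped gradient average. It is not the TensorFlow Privacy implementation: the helper name `noisy_clipped_average` is hypothetical, gradients are plain lists of floats, and the noise convention (Gaussian noise with standard deviation `l2_norm_clip * noise_multiplier` added to the sum of clipped microbatch gradients, which is then divided by the number of microbatches) is an assumption about how the library behaves.

```python
import math
import random


def noisy_clipped_average(per_example_grads, num_microbatches, l2_norm_clip,
                          noise_multiplier, rng):
    """Illustrative microbatch clipping; hypothetical helper, not a library API."""
    batch_size = len(per_example_grads)
    if batch_size % num_microbatches != 0:
        raise ValueError("num_microbatches must evenly divide the batch size")
    size = batch_size // num_microbatches  # examples per microbatch
    dim = len(per_example_grads[0])

    clipped_sum = [0.0] * dim
    for start in range(0, batch_size, size):
        group = per_example_grads[start:start + size]
        # Average the gradients inside one microbatch.
        mean = [sum(g[j] for g in group) / size for j in range(dim)]
        # Rescale so the microbatch gradient has L2 norm at most l2_norm_clip.
        norm = math.sqrt(sum(v * v for v in mean))
        scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
        for j in range(dim):
            clipped_sum[j] += mean[j] * scale

    # Assumed convention: noise is added to the sum of clipped gradients,
    # and the result is normalized by the number of microbatches.
    std = l2_norm_clip * noise_multiplier
    return [(s + rng.gauss(0.0, std)) / num_microbatches for s in clipped_sum]


# Toy data: 250 random 4-dimensional "gradients".
data_rng = random.Random(0)
grads = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(250)]

# Per-sample clipping (250 microbatches) vs. 5-sample clipping (50 microbatches).
per_sample = noisy_clipped_average(grads, 250, 1.5, 1.3, random.Random(1))
five_sample = noisy_clipped_average(grads, 50, 1.5, 1.3, random.Random(1))
print(per_sample)
print(five_sample)
```

Under this convention, the noise in the averaged gradient has standard deviation `l2_norm_clip * noise_multiplier / num_microbatches`, so using 50 microbatches instead of 250 makes it five times larger. That is one way to read the accuracy loss mentioned above.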


### We define a simple CNN, almost the same as the [TensorFlow Privacy example](https://github.com/tensorflow/privacy/blob/master/tutorials/Classification_Privacy.ipynb) and the [Opacus example](https://github.com/pytorch/opacus/blob/master/examples/mnist.py), but we use **tanh** activation instead of **relu** for better performance (see https://arxiv.org/abs/2007.14191).

### Unlike in PyTorch, we need to write a custom model function that wraps the model, the optimizer, and the loss. You can find other optimizers [here](https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/optimizers/dp_optimizer.py). Most standard optimizers can be turned into DP versions!

In [7]:
def cnn_model_fn(features, labels, mode):
  """Model function for a CNN."""

  # Define CNN architecture using tf.keras.layers.
  input_layer = tf.reshape(features['x'], [-1, 28, 28, 1])
  y = tf.keras.layers.Conv2D(16, 8,
                           strides=2,
                           padding='same',
                           activation='tanh',
                           input_shape=(28, 28, 1))(input_layer)
  y = tf.keras.layers.MaxPool2D(2, 1)(y)
  y = tf.keras.layers.Conv2D(32, 4,
                           strides=2,
                           padding='valid',
                           activation='tanh')(y)
  y = tf.keras.layers.MaxPool2D(2, 1)(y)
  y = tf.keras.layers.Flatten()(y)
  y = tf.keras.layers.Dense(32, activation='tanh')(y)
  logits = tf.keras.layers.Dense(10)(y)

  # Calculate loss as a vector (to support microbatches in DP-SGD).
  vector_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
      labels=labels, logits=logits)
  # Define mean of loss across minibatch (for reporting through tf.Estimator).
  scalar_loss = tf.reduce_mean(vector_loss)

  # Configure the training op (for TRAIN mode).
  if mode == tf.estimator.ModeKeys.TRAIN:

    if dpsgd:
      # Use DP version of GradientDescentOptimizer. Other optimizers are
      # available in dp_optimizer. Most optimizers inheriting from
      # tf.compat.v1.train.Optimizer should be wrappable in differentially
      # private counterparts by calling
      # dp_optimizer.make_gaussian_optimizer_class().
      optimizer = DPGradientDescentGaussianOptimizer(
          l2_norm_clip=l2_norm_clip,
          noise_multiplier=noise_multiplier,
          num_microbatches=num_microbatches,
          learning_rate=learning_rate)
      opt_loss = vector_loss
    else:
      optimizer = tf.compat.v1.train.GradientDescentOptimizer(
          learning_rate=learning_rate)
      opt_loss = scalar_loss
    global_step = tf.compat.v1.train.get_global_step()
    train_op = optimizer.minimize(loss=opt_loss, global_step=global_step)
    # In the following, we pass the mean of the loss (scalar_loss) rather than
    # the vector_loss because tf.estimator requires a scalar loss. This is only
    # used for evaluation and debugging by tf.estimator. The actual loss being
    # minimized is opt_loss defined above and passed to optimizer.minimize().
    return tf.estimator.EstimatorSpec(mode=mode,
                                      loss=scalar_loss,
                                      train_op=train_op)

  # Add evaluation metrics (for EVAL mode).
  elif mode == tf.estimator.ModeKeys.EVAL:
    eval_metric_ops = {
        'accuracy':
            tf.compat.v1.metrics.accuracy(
                labels=labels,
                predictions=tf.argmax(input=logits, axis=1))
    }
    return tf.estimator.EstimatorSpec(mode=mode,
                                      loss=scalar_loss,
                                      eval_metric_ops=eval_metric_ops)


### Now that we have the data preprocessed and the model and optimizer defined, we move on to set up the training process.

### It is important to understand that DP-SGD is implemented differently in TensorFlow and PyTorch. TensorFlow is significantly slower (e.g. for MNIST, more than 1 min/epoch, while the PyTorch Opacus package takes about 10 sec/epoch). However, Opacus uses much more memory and is thus less suitable for moderately large models. Opacus is also less general (e.g. it [cannot handle a two-layer LSTM](https://github.com/pytorch/opacus/blob/master/tutorials/building_lstm_name_classifier.ipynb)).

In [8]:
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)  # or any {DEBUG, INFO, WARN, ERROR, FATAL}

mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn)

# Create tf.Estimator input functions for the training and test data.
eval_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': test_data},
    y=test_labels,
    num_epochs=1,
    shuffle=False)
train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={'x': train_data},
    y=train_labels,
    batch_size=batch_size,
    num_epochs=epochs,
    shuffle=True)
    
# Training loop.
steps_per_epoch = len(train_data) // batch_size
test_accuracy_list = []
for epoch in range(1, epochs + 1):
    np.random.seed(epoch)
    # Train the model for one epoch (steps_per_epoch gradient steps).
    mnist_classifier.train(input_fn=train_input_fn, steps=steps_per_epoch)
    
    # Evaluate the model and print results
    eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
    test_accuracy = eval_results['accuracy']
    test_accuracy_list.append(test_accuracy)
    print('Test accuracy after %d epochs is: %.3f' % (epoch, test_accuracy))

Test accuracy after 1 epochs is: 0.908
Test accuracy after 2 epochs is: 0.938
Test accuracy after 3 epochs is: 0.948
Test accuracy after 4 epochs is: 0.955
Test accuracy after 5 epochs is: 0.960
Test accuracy after 6 epochs is: 0.962
Test accuracy after 7 epochs is: 0.964
Test accuracy after 8 epochs is: 0.964
Test accuracy after 9 epochs is: 0.966
Test accuracy after 10 epochs is: 0.968
Test accuracy after 11 epochs is: 0.968
Test accuracy after 12 epochs is: 0.970
Test accuracy after 13 epochs is: 0.968
Test accuracy after 14 epochs is: 0.968
Test accuracy after 15 epochs is: 0.968


### There exist many privacy accountants that give you an upper bound on epsilon. The most popular ones are [Moments Accountant](https://arxiv.org/abs/1607.00133) (MA) and Gaussian DP (GDP; promoted by this paper).

### There are many variants of MA. E.g. Opacus uses a version of MA that gives epsilon = 1.32. Below, TensorFlow Privacy uses another MA that gives epsilon = 0.94, and our GDP accountant gives epsilon = 0.82. Hence, among these three accountants, GDP gives the tightest privacy bound.

In [9]:
print('eps_MA by Tensorflow Privacy: ',
      compute_dp_sgd_privacy.compute_dp_sgd_privacy(
          n=60000, batch_size=250, noise_multiplier=1.3, epochs=15, delta=1e-5))
# Or equivalently, using the accountants shipped in this repository
# (gdp_accountant relies on the tensorflow-privacy package).
from gdp_accountant import compute_epsP, compute_epsilon
print('eps_MA=%.6g' % compute_epsilon(15, 1.3, 60000, 250, 1e-5))
print('eps_GDP=%.6g' % compute_epsP(15, 1.3, 60000, 250, 1e-5))


DP-SGD with sampling rate = 0.417% and noise_multiplier = 1.3 iterated over 3600 steps satisfies differential privacy with eps = 0.942 and delta = 1e-05.
The optimal RDP order is 17.0.
eps_MA by Tensorflow Privacy:  (0.9422002181627345, 17.0)
eps_MA=0.9422
eps_GDP=0.823696
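
For intuition about where the GDP number comes from, the sketch below recomputes it in pure Python. It assumes that `compute_epsP` uses the central-limit-theorem approximation for Poisson subsampling, mu = (batch_size / n) * sqrt(steps * (exp(1 / sigma^2) - 1)), and then converts mu-GDP into (epsilon, delta)-DP through the standard duality formula. The function names here are hypothetical stand-ins, and the bisection solver replaces whatever root finder the accountant actually uses.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def mu_poisson(epochs, noise_multiplier, n, batch_size):
    """CLT approximation of the GDP parameter mu (assumed formula)."""
    steps = epochs * n / batch_size
    return (math.sqrt(math.exp(noise_multiplier ** -2) - 1.0)
            * math.sqrt(steps) * batch_size / n)


def delta_from_eps(eps, mu):
    """delta achieved at a given eps by a mu-GDP mechanism."""
    return (normal_cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0))


def eps_from_mu(mu, delta, lo=0.0, hi=100.0, iters=200):
    """Bisection on eps; delta_from_eps is decreasing in eps."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if delta_from_eps(mid, mu) > delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


mu = mu_poisson(epochs=15, noise_multiplier=1.3, n=60000, batch_size=250)
eps = eps_from_mu(mu, delta=1e-5)
print('mu=%.4f, eps_GDP=%.4f' % (mu, eps))
```

If the assumptions hold, the printed epsilon should land close to the `eps_GDP` value reported above.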


## Some Tips:

1. Larger epsilon and delta mean the model is less private.
2. With the other hyperparameters fixed, more iterations/epochs make the model less private.
3. The learning rate and the clipping norm do not affect the privacy guarantee, but they do affect convergence.
4. The number of iterations, sigma (the noise multiplier), the batch size, and the sample size all affect the privacy guarantee.
5. A single training run corresponds to infinitely many (epsilon, delta) pairs; usually we choose delta = 1/sample size.
6. In practice, one first tunes the clipping norm with sigma = 0, then fixes the clipping norm and tunes sigma under a pre-specified privacy budget (e.g. epsilon < 8).
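
As a quick illustration of tips 3 to 5, the snippet below derives the quantities that enter the privacy accounting from the hyperparameters used in this tutorial. The variable names are chosen for illustration, and the statement that the Gaussian noise has standard deviation `l2_norm_clip * noise_multiplier` is an assumption about the optimizer's convention. Under that assumption the noise scales with the clipping norm, which is why the clipping norm itself does not enter the accountant.

```python
# Hyperparameters from this tutorial.
epochs, batch_size, n = 15, 250, 60000
l2_norm_clip, noise_multiplier, num_microbatches = 1.5, 1.3, 250

# Quantities that enter the privacy accounting (tip 4).
sampling_rate = batch_size / n      # fraction of the data used per step
steps = epochs * n // batch_size    # total number of noisy gradient steps

# Assumed noise convention: the noise scale is proportional to the clipping
# norm, so only the ratio (noise_multiplier) matters for privacy (tip 3).
noise_std_on_sum = l2_norm_clip * noise_multiplier
noise_std_on_average = noise_std_on_sum / num_microbatches

# Common choice of delta (tip 5).
delta = 1 / n

print('sampling rate = %.3f%%, steps = %d' % (100 * sampling_rate, steps))
print('noise std on summed gradient   = %.4f' % noise_std_on_sum)
print('noise std on averaged gradient = %.4f' % noise_std_on_average)
print('delta = 1/n = %.3g' % delta)
```

The sampling rate and step count computed this way match the 0.417% and 3600 steps reported by the TensorFlow Privacy accountant output earlier.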
