<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Building-an-Image-Classifier-with-Differential-Privacy" data-toc-modified-id="Building-an-Image-Classifier-with-Differential-Privacy-1">Building an Image Classifier with Differential Privacy</a></span><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1.1">Overview</a></span></li><li><span><a href="#Hyper-parameters" data-toc-modified-id="Hyper-parameters-1.2">Hyper-parameters</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-1.3">Data</a></span></li><li><span><a href="#Model" data-toc-modified-id="Model-1.4">Model</a></span></li><li><span><a href="#Prepare-for-Training" data-toc-modified-id="Prepare-for-Training-1.5">Prepare for Training</a></span></li><li><span><a href="#Train-the-network" data-toc-modified-id="Train-the-network-1.6">Train the network</a></span></li><li><span><a href="#Test-the-network-on-test-data" data-toc-modified-id="Test-the-network-on-test-data-1.7">Test the network on test data</a></span></li><li><span><a href="#Tips-and-Tricks" data-toc-modified-id="Tips-and-Tricks-1.8">Tips and Tricks</a></span></li><li><span><a href="#Private-Model-vs-Non-Private-Model-Performance" data-toc-modified-id="Private-Model-vs-Non-Private-Model-Performance-1.9">Private Model vs Non-Private Model Performance</a></span></li></ul></li></ul></div>

# Building an Image Classifier with Differential Privacy

## Overview

In this tutorial we will:
  1. Learn about the privacy-specific hyper-parameters of DP-SGD.
  2. Learn about `DPModelInspector` and incompatible layers, and use the model rewriting utility.
  3. Train a differentially private ResNet18 for image classification.

## Hyper-parameters

To train a model with Opacus there are three privacy-specific hyper-parameters that must be tuned for better performance:

* Max Grad Norm: The maximum L2 norm of per-sample gradients before they are aggregated by the averaging step.
* Noise Multiplier: The amount of Gaussian noise added to the aggregated per-sample gradients of a batch, given as the ratio of the noise standard deviation to Max Grad Norm.
* Delta: The target δ of the (ϵ,δ)-differential privacy guarantee. Generally, it should be set to less than the inverse of the size of the training dataset. In this tutorial, it is set to $10^{-5}$, since the CIFAR10 dataset has 50,000 training points and $1/50{,}000 = 2 \times 10^{-5}$.
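
To make the roles of Max Grad Norm and Noise Multiplier concrete, here is a minimal pure-Python sketch of one DP-SGD aggregation step. It follows the usual textbook recipe (clip each per-sample gradient, sum, add Gaussian noise, average) and is not Opacus' actual implementation; the function name `clip_and_noise` and the toy gradients are assumptions made for illustration.

```python
import math
import random

def clip_and_noise(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """One DP-SGD aggregation: clip each sample's gradient, sum, add noise, average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its L2 norm exceeds the bound.
        scale = min(1.0, max_grad_norm / (norm + 1e-12))
        for j, g in enumerate(grad):
            total[j] += g * scale
    # Gaussian noise whose standard deviation is noise_multiplier * max_grad_norm.
    std = noise_multiplier * max_grad_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [v / len(per_sample_grads) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy gradients with L2 norms 5.0 and 0.5
no_noise = clip_and_noise(grads, max_grad_norm=1.2, noise_multiplier=0.0, rng=rng)
with_noise = clip_and_noise(grads, max_grad_norm=1.2, noise_multiplier=0.38, rng=rng)
print(no_noise)
print(with_noise)
```

Only the first gradient is clipped here, because only its norm exceeds 1.2. Raising the noise multiplier widens the gap between the two printed vectors.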

We use the hyper-parameter values below to obtain results in the last section:

In [1]:
MAX_GRAD_NORM = 1.2
NOISE_MULTIPLIER = .38
DELTA = 1e-5

LR = 1e-3
NUM_WORKERS = 2

There's another constraint we should be mindful of: memory. To balance the peak memory requirement, which is proportional to `batch_size^2`, against training performance, we use virtual batches. Virtual batches let us separate physical steps (gradient computation) from logical steps (noise addition and parameter updates), so we can train with larger logical batches while keeping the memory footprint low. Below we specify two constants:

In [2]:
BATCH_SIZE = 128
VIRTUAL_BATCH_SIZE = 512
assert VIRTUAL_BATCH_SIZE % BATCH_SIZE == 0 # VIRTUAL_BATCH_SIZE should be divisible by BATCH_SIZE
N_ACCUMULATION_STEPS = VIRTUAL_BATCH_SIZE // BATCH_SIZE
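
The idea behind virtual batches can be checked on a toy problem: summing gradients over several small physical batches gives the same total as one large logical batch. The sketch below uses a hypothetical one-parameter least-squares model and small stand-in sizes (4 and 16 instead of 128 and 512). It leaves out clipping and noise, which are applied per sample and per logical step respectively.

```python
def grad_sum(w, batch):
    """Sum of per-sample gradients of 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch)

def chunks(data, size):
    """Split data into consecutive physical batches of the given size."""
    return [data[i:i + size] for i in range(0, len(data), size)]

PHYSICAL = 4   # stands in for BATCH_SIZE
LOGICAL = 16   # stands in for VIRTUAL_BATCH_SIZE
data = [(x, 2 * x + 1) for x in range(LOGICAL)]  # integer toy data, so sums are exact
w = 3

accumulated = 0
for physical_batch in chunks(data, PHYSICAL):
    accumulated += grad_sum(w, physical_batch)  # one accumulation ("virtual") step

full = grad_sum(w, data)  # gradient sum over the whole logical batch at once
print(accumulated, full, accumulated == full)
```

Only one physical batch of gradients has to be held in memory at a time, which is why accumulation lowers the peak footprint.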

## Data

Now, let's load the CIFAR10 dataset. We don't use data augmentation here because, in our experiments, we found that data augmentation lowers utility when training with DP.

In [3]:
import torch
import torchvision
import torchvision.transforms as transforms

# These values, specific to the CIFAR10 dataset, are assumed to be known.
# If necessary, they can be computed with modest privacy budget.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD_DEV = (0.2023, 0.1994, 0.2010)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD_DEV),
])


Using torchvision datasets, we can load CIFAR10 and convert the PIL images to tensors that are normalized per channel with the mean and standard deviation defined above.
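
As a quick sanity check on the value range, the sketch below computes what `(x - mean) / std` yields at the two extremes, assuming `ToTensor` produces values in [0, 1]. The helper `normalized_range` is a hypothetical name used only for this illustration.

```python
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD_DEV = (0.2023, 0.1994, 0.2010)

def normalized_range(mean, std):
    """Range of (x - mean) / std for x in [0, 1], the assumed output range of ToTensor."""
    return ((0.0 - mean) / std, (1.0 - mean) / std)

for channel, (m, s) in enumerate(zip(CIFAR10_MEAN, CIFAR10_STD_DEV)):
    low, high = normalized_range(m, s)
    print(f"channel {channel}: [{low:.2f}, {high:.2f}]")
```

With these constants each channel is centered near zero, and its extremes fall outside [-1, 1].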

In [4]:
from torchvision.datasets import CIFAR10
from opacus.utils.uniform_sampler import UniformWithReplacementSampler

DATA_ROOT = '../cifar10'

train_dataset = CIFAR10(
    root=DATA_ROOT, train=True, download=True, transform=transform)

SAMPLE_RATE = BATCH_SIZE / len(train_dataset)

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    num_workers=NUM_WORKERS,
    batch_sampler=UniformWithReplacementSampler(
        num_samples=len(train_dataset),
        sample_rate=SAMPLE_RATE,
    ),
)

test_dataset = CIFAR10(
    root=DATA_ROOT, train=False, download=True, transform=transform)

test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=NUM_WORKERS,
)

Files already downloaded and verified
Files already downloaded and verified
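
The training loader above does not use fixed-size batches. As a rough sketch of what we understand the sampler to do (an assumption about its behavior, not its source code), each example joins a batch independently with probability `sample_rate`, so batch sizes vary around the expected value. The function `poisson_batches` and the reduced dataset size of 5,000 are hypothetical.

```python
import random

def poisson_batches(num_samples, sample_rate, rng):
    """Yield index batches; each index joins a batch independently with prob. sample_rate."""
    num_batches = int(1 / sample_rate)
    for _ in range(num_batches):
        yield [i for i in range(num_samples) if rng.random() < sample_rate]

rng = random.Random(0)
batches = list(poisson_batches(num_samples=5000, sample_rate=128 / 5000, rng=rng))
sizes = [len(b) for b in batches]
# Number of batches, smallest and largest batch, and the average batch size.
print(len(batches), min(sizes), max(sizes), sum(sizes) / len(sizes))
```

The average batch size stays close to 128, but individual batches are larger or smaller. Code that consumes such a loader should not assume a constant batch size.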


## Model

In [5]:
from torchvision import models

model = models.resnet18(num_classes=10)

Now, let’s check if the model is compatible with Opacus. Opacus does not support all types of PyTorch layers. To check if your model is compatible with the privacy engine, we have provided a utility class to validate your model.

If you run these commands, you will get the following error:

In [6]:
from opacus.dp_model_inspector import DPModelInspector

inspector = DPModelInspector()
inspector.validate(model)

IncompatibleModuleException: ignored

Let us modify the model to work with Opacus. From the output above, you can see that the BatchNorm layers are not supported because they compute the mean and variance across the batch, creating a dependency between samples in a batch, a privacy violation. One way to modify our model is to replace all the BatchNorm layers with [GroupNorm](https://arxiv.org/pdf/1803.08494.pdf) using the `convert_batchnorm_modules` util function.

In [7]:
from opacus.utils import module_modification

model = module_modification.convert_batchnorm_modules(model)
inspector = DPModelInspector()
print(f"Is the model valid? {inspector.validate(model)}")

Is the model valid? True
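
To make the cross-sample dependency concrete, the following pure-Python sketch compares normalizing across the batch with normalizing within each sample. It is a simplified illustration with no learnable parameters; `per_sample_norm` mimics GroupNorm with a single group, which is an assumption chosen for brevity rather than the configuration the conversion utility produces.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature across the batch, so statistics are shared by samples."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def per_sample_norm(batch):
    """Normalize each sample using only its own features."""
    return [normalize(sample) for sample in batch]

batch_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
batch_b = [[1.0, 2.0, 3.0], [9.0, 0.0, 7.0]]  # only the second sample differs

# Output for the first sample under each scheme.
print(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
print(per_sample_norm(batch_a)[0], per_sample_norm(batch_b)[0])
```

Under batch normalization the first sample's output changes when only the second sample changes. Under per-sample normalization it does not, so each per-sample gradient depends on one example only.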


For maximal speed, we can check if CUDA is available and supported by the PyTorch installation. If GPU is available, set the `device` variable to your CUDA-compatible device. We can then transfer the neural network onto that device.

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)

We then define our optimizer and loss function. Opacus’ privacy engine can attach to any (first-order) optimizer. You can use your favorite (Adam, Adagrad, RMSprop) as long as it has an implementation derived from [torch.optim.Optimizer](https://pytorch.org/docs/stable/optim.html). In this tutorial, we're going to use [RMSprop](https://pytorch.org/docs/stable/optim.html).

In [9]:
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.RMSprop(model.parameters(), lr=LR)

## Prepare for Training

We will define a utility function to calculate accuracy:

In [10]:
def accuracy(preds, labels):
    return (preds == labels).mean()

We now attach the privacy engine initialized with the privacy hyperparameters defined earlier. There’s also the enigmatic-looking parameter `alphas`, which we won’t touch for the time being.

In [11]:
from opacus import PrivacyEngine

print(f"Using sigma={NOISE_MULTIPLIER} and C={MAX_GRAD_NORM}")

privacy_engine = PrivacyEngine(
    model,
    sample_rate=SAMPLE_RATE * N_ACCUMULATION_STEPS,
    alphas=[1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64)),
    noise_multiplier=NOISE_MULTIPLIER,
    max_grad_norm=MAX_GRAD_NORM,
)
privacy_engine.attach(optimizer)

Using sigma=0.38 and C=1.2
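
For readers curious about `alphas`: they are the orders at which Rényi differential privacy (RDP) is tracked, and the reported ε is the tightest bound over those orders. The sketch below illustrates this with the plain Gaussian mechanism and the classic RDP-to-(ε,δ) conversion. It deliberately ignores subsampling, which the real accountant takes into account, so it will not reproduce the ε values printed in this tutorial; the function names are hypothetical.

```python
import math

def gaussian_rdp(alpha, noise_multiplier, steps):
    """RDP at order alpha for `steps` compositions of the Gaussian mechanism (no subsampling)."""
    return steps * alpha / (2.0 * noise_multiplier ** 2)

def epsilon_from_rdp(alphas, noise_multiplier, steps, delta):
    """Convert RDP at each order to (epsilon, delta)-DP and keep the tightest bound."""
    candidates = [
        (gaussian_rdp(a, noise_multiplier, steps) + math.log(1.0 / delta) / (a - 1.0), a)
        for a in alphas
    ]
    return min(candidates)  # smallest epsilon, together with the order that achieved it

alphas = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
eps, best_alpha = epsilon_from_rdp(alphas, noise_multiplier=1.0, steps=1, delta=1e-5)
print(f"epsilon = {eps:.3f} at alpha = {best_alpha}")
```

A denser or wider list of orders can only tighten the reported ε, since the minimum is taken over more candidates.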


We will then define our train function. This function will train the model for one epoch. 

In [12]:
import numpy as np

def train(model, train_loader, optimizer, epoch, device):
    model.train()
    criterion = nn.CrossEntropyLoss()

    losses = []
    top1_acc = []

    for i, (images, target) in enumerate(train_loader):        
        images = images.to(device)
        target = target.to(device)

        # compute output
        output = model(images)
        loss = criterion(output, target)
        
        preds = np.argmax(output.detach().cpu().numpy(), axis=1)
        labels = target.detach().cpu().numpy()
        
        # measure accuracy and record loss
        acc = accuracy(preds, labels)

        losses.append(loss.item())
        top1_acc.append(acc)
        
        loss.backward()
        	
        # take a real optimizer step after every N_ACCUMULATION_STEPS physical batches
        if ((i + 1) % N_ACCUMULATION_STEPS == 0) or ((i + 1) == len(train_loader)):
            optimizer.step()
        else:
            optimizer.virtual_step() # take a virtual step

        if i % 200 == 0:
            epsilon, best_alpha = optimizer.privacy_engine.get_privacy_spent(DELTA)
            print(
                f"\tTrain Epoch: {epoch} \t"
                f"Loss: {np.mean(losses):.6f} "
                f"Acc@1: {np.mean(top1_acc) * 100:.6f} "
                f"(ε = {epsilon:.2f}, δ = {DELTA})"
            )
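
The stepping rule in `train` can be isolated and inspected on its own. The sketch below labels each physical batch as either a real step or a virtual step, using the same condition as the loop above; `step_schedule` is a hypothetical helper that does not touch the model or the optimizer.

```python
def step_schedule(num_batches, n_accumulation_steps):
    """Label each physical batch as a real optimizer step or a virtual step."""
    schedule = []
    for i in range(num_batches):
        is_boundary = (i + 1) % n_accumulation_steps == 0
        is_last = (i + 1) == num_batches
        schedule.append("step" if is_boundary or is_last else "virtual_step")
    return schedule

schedule = step_schedule(num_batches=10, n_accumulation_steps=4)
print(schedule)
```

With 10 batches and 4 accumulation steps, real steps happen at batches 4, 8, and 10. The final step covers a shorter logical batch, so accumulated gradients are not carried over into the next epoch.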

Next, we will define our test function to validate our model on our test dataset. 

In [13]:
def test(model, test_loader, device):
    model.eval()
    criterion = nn.CrossEntropyLoss()
    losses = []
    top1_acc = []

    with torch.no_grad():
        for images, target in test_loader:
            images = images.to(device)
            target = target.to(device)

            output = model(images)
            loss = criterion(output, target)
            preds = np.argmax(output.detach().cpu().numpy(), axis=1)
            labels = target.detach().cpu().numpy()
            acc = accuracy(preds, labels)

            losses.append(loss.item())
            top1_acc.append(acc)

    top1_avg = np.mean(top1_acc)

    print(
        f"\tTest set:"
        f"Loss: {np.mean(losses):.6f} "
        f"Acc: {top1_avg * 100:.6f} "
    )
    return top1_avg
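
One subtlety of averaging per-batch accuracies is that every batch gets equal weight, even when the last batch is smaller. With `BATCH_SIZE = 128`, the 10,000 CIFAR10 test images split into 78 full batches and one batch of 16. The sketch below contrasts the two ways of averaging on toy data; both function names are hypothetical, and the effect on the numbers reported in this tutorial is expected to be small.

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, giving every batch equal weight."""
    per_batch = [sum(p == y for p, y in b) / len(b) for b in batches]
    return sum(per_batch) / len(per_batch)

def sample_weighted_accuracy(batches):
    """Count correct predictions over all samples, giving every sample equal weight."""
    correct = sum(p == y for b in batches for p, y in b)
    total = sum(len(b) for b in batches)
    return correct / total

# Each batch is a list of (prediction, label) pairs; the last batch is smaller.
batches = [
    [(0, 0), (1, 1), (2, 2), (3, 0)],  # 3 of 4 correct
    [(1, 1), (2, 0)],                  # 1 of 2 correct
]
print(mean_of_batch_accuracies(batches), sample_weighted_accuracy(batches))
```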

## Train the network

In [14]:
from tqdm import tqdm

for epoch in tqdm(range(20), desc="Epoch", unit="epoch"):
    train(model, train_loader, optimizer, epoch + 1, device)

Epoch:   0%|          | 0/20 [00:00<?, ?epoch/s]

	Train Epoch: 1 	Loss: 2.493264 Acc@1: 12.096774 (ε = 0.10, δ = 1e-05)
	Train Epoch: 1 	Loss: 2.761356 Acc@1: 14.456262 (ε = 14.58, δ = 1e-05)


Epoch:   5%|▌         | 1/20 [01:05<20:38, 65.21s/epoch]

	Train Epoch: 2 	Loss: 1.955629 Acc@1: 25.641026 (ε = 17.11, δ = 1e-05)
	Train Epoch: 2 	Loss: 1.767638 Acc@1: 37.946047 (ε = 19.29, δ = 1e-05)


Epoch:  10%|█         | 2/20 [02:10<19:32, 65.11s/epoch]

	Train Epoch: 3 	Loss: 1.648330 Acc@1: 46.835443 (ε = 20.79, δ = 1e-05)
	Train Epoch: 3 	Loss: 1.723294 Acc@1: 46.021707 (ε = 22.31, δ = 1e-05)


Epoch:  15%|█▌        | 3/20 [03:15<18:28, 65.22s/epoch]

	Train Epoch: 4 	Loss: 1.556890 Acc@1: 50.769231 (ε = 23.78, δ = 1e-05)
	Train Epoch: 4 	Loss: 1.730266 Acc@1: 48.891415 (ε = 25.08, δ = 1e-05)


Epoch:  20%|██        | 4/20 [04:20<17:22, 65.18s/epoch]

	Train Epoch: 5 	Loss: 1.592981 Acc@1: 53.968254 (ε = 26.15, δ = 1e-05)
	Train Epoch: 5 	Loss: 1.692724 Acc@1: 51.628847 (ε = 27.26, δ = 1e-05)


Epoch:  25%|██▌       | 5/20 [05:26<16:20, 65.35s/epoch]

	Train Epoch: 6 	Loss: 1.790667 Acc@1: 50.000000 (ε = 28.33, δ = 1e-05)
	Train Epoch: 6 	Loss: 1.691323 Acc@1: 52.989614 (ε = 29.45, δ = 1e-05)


Epoch:  30%|███       | 6/20 [06:31<15:15, 65.41s/epoch]

	Train Epoch: 7 	Loss: 1.557785 Acc@1: 54.014599 (ε = 30.52, δ = 1e-05)
	Train Epoch: 7 	Loss: 1.674198 Acc@1: 54.527030 (ε = 31.63, δ = 1e-05)


Epoch:  35%|███▌      | 7/20 [07:36<14:07, 65.17s/epoch]

	Train Epoch: 8 	Loss: 1.935706 Acc@1: 50.000000 (ε = 32.61, δ = 1e-05)
	Train Epoch: 8 	Loss: 1.673220 Acc@1: 56.032113 (ε = 33.45, δ = 1e-05)


Epoch:  40%|████      | 8/20 [08:41<13:00, 65.06s/epoch]

	Train Epoch: 9 	Loss: 1.574659 Acc@1: 52.830189 (ε = 34.26, δ = 1e-05)
	Train Epoch: 9 	Loss: 1.657451 Acc@1: 57.054291 (ε = 35.10, δ = 1e-05)


Epoch:  45%|████▌     | 9/20 [09:46<11:56, 65.16s/epoch]

	Train Epoch: 10 	Loss: 1.855417 Acc@1: 57.037037 (ε = 35.90, δ = 1e-05)
	Train Epoch: 10 	Loss: 1.649551 Acc@1: 58.181273 (ε = 36.74, δ = 1e-05)


Epoch:  50%|█████     | 10/20 [10:51<10:49, 64.98s/epoch]

	Train Epoch: 11 	Loss: 1.901847 Acc@1: 58.823529 (ε = 37.54, δ = 1e-05)
	Train Epoch: 11 	Loss: 1.637535 Acc@1: 58.539230 (ε = 38.38, δ = 1e-05)


Epoch:  55%|█████▌    | 11/20 [11:56<09:44, 65.00s/epoch]

	Train Epoch: 12 	Loss: 1.689841 Acc@1: 56.349206 (ε = 39.19, δ = 1e-05)
	Train Epoch: 12 	Loss: 1.627441 Acc@1: 59.299434 (ε = 40.02, δ = 1e-05)


Epoch:  60%|██████    | 12/20 [13:01<08:40, 65.10s/epoch]

	Train Epoch: 13 	Loss: 1.931030 Acc@1: 57.983193 (ε = 40.83, δ = 1e-05)
	Train Epoch: 13 	Loss: 1.631124 Acc@1: 59.830398 (ε = 41.67, δ = 1e-05)


Epoch:  65%|██████▌   | 13/20 [14:06<07:36, 65.15s/epoch]

	Train Epoch: 14 	Loss: 1.670289 Acc@1: 58.741259 (ε = 42.47, δ = 1e-05)
	Train Epoch: 14 	Loss: 1.601755 Acc@1: 60.857855 (ε = 43.31, δ = 1e-05)


Epoch:  70%|███████   | 14/20 [15:11<06:30, 65.07s/epoch]

	Train Epoch: 15 	Loss: 1.663538 Acc@1: 61.206897 (ε = 44.11, δ = 1e-05)
	Train Epoch: 15 	Loss: 1.578390 Acc@1: 61.596206 (ε = 44.95, δ = 1e-05)


Epoch:  75%|███████▌  | 15/20 [16:16<05:24, 64.91s/epoch]

	Train Epoch: 16 	Loss: 1.691741 Acc@1: 61.946903 (ε = 45.72, δ = 1e-05)
	Train Epoch: 16 	Loss: 1.591795 Acc@1: 61.336092 (ε = 46.37, δ = 1e-05)


Epoch:  80%|████████  | 16/20 [17:21<04:19, 64.91s/epoch]

	Train Epoch: 17 	Loss: 1.885407 Acc@1: 55.384615 (ε = 46.99, δ = 1e-05)
	Train Epoch: 17 	Loss: 1.612365 Acc@1: 61.411545 (ε = 47.64, δ = 1e-05)


Epoch:  85%|████████▌ | 17/20 [18:26<03:15, 65.04s/epoch]

	Train Epoch: 18 	Loss: 1.843095 Acc@1: 61.744966 (ε = 48.26, δ = 1e-05)
	Train Epoch: 18 	Loss: 1.617456 Acc@1: 61.773505 (ε = 48.91, δ = 1e-05)


Epoch:  90%|█████████ | 18/20 [19:31<02:10, 65.12s/epoch]

	Train Epoch: 19 	Loss: 1.120436 Acc@1: 69.354839 (ε = 49.53, δ = 1e-05)
	Train Epoch: 19 	Loss: 1.571179 Acc@1: 62.916583 (ε = 50.18, δ = 1e-05)


Epoch:  95%|█████████▌| 19/20 [20:36<01:04, 64.96s/epoch]

	Train Epoch: 20 	Loss: 1.641913 Acc@1: 62.015504 (ε = 50.80, δ = 1e-05)
	Train Epoch: 20 	Loss: 1.562555 Acc@1: 62.999958 (ε = 51.45, δ = 1e-05)


Epoch: 100%|██████████| 20/20 [21:41<00:00, 65.07s/epoch]


## Test the network on test data

In [15]:
top1_acc = test(model, test_loader, device)

	Test set:Loss: 1.783089 Acc: 59.612342 


## Tips and Tricks

1. Generally speaking, differentially private training is enough of a regularizer by itself. Adding any more regularization (such as dropout or data augmentation) is unnecessary and typically hurts performance.
2. Tuning MAX_GRAD_NORM is very important. Start with a low noise multiplier such as 0.1, which should give performance comparable to a non-private model. Then do a grid search for the optimal MAX_GRAD_NORM value. The grid can be in the range [0.1, 10].
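
A grid over [0.1, 10] spans two orders of magnitude, so log spacing is a natural choice. The sketch below builds such a grid and picks the best candidate. `toy_score` is a hypothetical stand-in for "train with this MAX_GRAD_NORM and return validation accuracy" and has nothing to do with real training results.

```python
def log_grid(low, high, num_points):
    """Return num_points values spaced evenly on a log scale from low to high."""
    ratio = (high / low) ** (1.0 / (num_points - 1))
    return [low * ratio ** k for k in range(num_points)]

def grid_search(candidates, evaluate):
    """Return the candidate with the highest score under `evaluate`."""
    return max(candidates, key=evaluate)

def toy_score(max_grad_norm):
    # Hypothetical placeholder: a real run would train a model and return its accuracy.
    return -abs(max_grad_norm - 1.2)

candidates = log_grid(0.1, 10.0, num_points=9)
best = grid_search(candidates, toy_score)
print([round(c, 3) for c in candidates], round(best, 3))
```

In practice each evaluation is a full training run, so a coarse grid followed by a finer one around the best value keeps the cost manageable.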

## Private Model vs Non-Private Model Performance

Now let us see how our private model compares with the non-private ResNet18.

We trained a non-private ResNet18 model for 20 epochs using the same hyper-parameters as above and with BatchNorm replaced with GroupNorm. The results of that training and the training that is discussed in this tutorial are summarized in the table below:

| Model          | Top 1 Accuracy (%) | ϵ     |
|----------------|--------------------|-------|
| ResNet         | 76                 | ∞     |
| Private ResNet | 59.61              | 51.45 |