##### Copyright 2020 Google LLC

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Adversarial Regularization for Image Classification

The core idea of adversarial learning is to train a model with
adversarially-perturbed data (called adversarial examples) in addition to the
organic training data. The adversarial examples are constructed to intentionally
mislead the model into making wrong predictions or classifications. By training
with such examples, the model learns to be robust against adversarial
perturbation when making predictions.
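
To make the idea concrete, here is a minimal sketch of a single signed-gradient perturbation (the step used by the FGSM attack that appears later in this tutorial) on a toy logistic-regression classifier. The weights, the input, and the helper `fgsm_perturb` are hypothetical and chosen only for illustration; this is not how NSL builds adversarial examples internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, epsilon):
    """Takes one signed-gradient step that increases the logistic loss at x."""
    p = sigmoid(np.dot(w, x) + b)   # predicted probability of class 1
    grad_x = (p - y) * w            # gradient of the cross-entropy w.r.t. x
    return x + epsilon * np.sign(grad_x)

# Hypothetical toy classifier and input (not taken from the tutorial).
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.3, 0.1, 0.2])
y = 1.0

x_adv = fgsm_perturb(x, y, w, b, epsilon=0.2)
p_clean = sigmoid(np.dot(w, x) + b)
p_adv = sigmoid(np.dot(w, x_adv) + b)
print('clean p(y=1): %.3f, adversarial p(y=1): %.3f' % (p_clean, p_adv))
```

Each coordinate moves by at most `epsilon`, yet in this toy setting the predicted probability of the true class drops below 0.5, so the prediction flips.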

In this tutorial, we illustrate the following procedure of applying adversarial
learning to obtain robust models using the Neural Structured Learning framework:

1.  Create a neural network as a base model. In this tutorial, the base model is
    created with the `tf.keras` functional API; this procedure is compatible
    with models created by `tf.keras` sequential and subclassing APIs as well.
2.  Wrap the base model with the **`AdversarialRegularization`** wrapper class,
    which is provided by the NSL framework, to create a new `tf.keras.Model`
    instance. This new model will include the adversarial loss as a
    regularization term in its training objective.
3.  Convert examples in the training data to feature dictionaries.
4.  Train and evaluate the new model.

Both the base and the new model will be evaluated against natural and
adversarial inputs.

## Setup

Install the Neural Structured Learning package.

In [None]:
!pip install --quiet neural-structured-learning

In [None]:
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import neural_structured_learning as nsl

## Hyperparameters

We collect and explain the hyperparameters (in an `HParams` object) for model
training and evaluation.

Input/Output:

*   **`input_shape`**: The shape of the input tensor. Each image is 28-by-28
pixels with 1 channel.
*   **`num_classes`**: There are a total of 10 classes, corresponding to 10
digits [0-9].

Model architecture:

*   **`conv_filters`**: A list of numbers, each specifying the number of
filters in a convolutional layer.
*   **`kernel_size`**: The size of 2D convolution window, shared by all
convolutional layers.
*   **`pool_size`**: Factors to downscale the image in each max-pooling layer.
*   **`num_fc_units`**: The number of units (i.e., width) of each
fully-connected layer.

Training and evaluation:

*  **`batch_size`**: Batch size used for training and evaluation.
*  **`epochs`**: The number of training epochs.

Adversarial learning:

*   **`adv_multiplier`**: The weight of adversarial loss in the training
    objective, relative to the labeled loss.
*   **`adv_step_size`**: The magnitude of adversarial perturbation.
*  **`adv_grad_norm`**: The norm to measure the magnitude of adversarial
   perturbation.
*  **`pgd_iterations`**: The number of iterative steps to take when using PGD.
*  **`pgd_epsilon`**: The bounds of the perturbation. PGD will project back to
   this epsilon ball when generating the adversary.
*  **`clip_value_min`**: Clips the final adversary to be at least as large as
   this value. This keeps the perturbed pixel values in a valid domain.
*  **`clip_value_max`**: Clips the final adversary to be no larger than this
   value. This also keeps the perturbed pixel values in a valid domain.
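
The interplay of `adv_step_size`, `pgd_iterations`, `pgd_epsilon`, and the clip values can be sketched in plain NumPy. This is a generic illustration of projected gradient descent under the infinity norm, not NSL's implementation; `pgd_linf` and the constant `grad_fn` are hypothetical stand-ins. With the values used in this tutorial, 40 steps of size 0.01 could travel up to 0.4, so the projection onto the 0.2 ball is what actually bounds the perturbation.

```python
import numpy as np

def pgd_linf(x, grad_fn, step_size, epsilon, iterations, clip_min, clip_max):
    """Iterated signed-gradient ascent, projected onto an L-infinity ball."""
    x_adv = x.copy()
    for _ in range(iterations):
        x_adv = x_adv + step_size * np.sign(grad_fn(x_adv))
        # Project back into the epsilon ball around the original input.
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
    # Clip the final adversary so that values stay in the valid domain.
    return np.clip(x_adv, clip_min, clip_max)

# Hypothetical loss gradient: the loss always grows along `direction`.
direction = np.array([1.0, -1.0, 1.0, -1.0])
grad_fn = lambda x_in: direction

x = np.array([0.05, 0.5, 0.95, 0.1])
x_adv = pgd_linf(x, grad_fn, step_size=0.01, epsilon=0.2, iterations=40,
                 clip_min=0.0, clip_max=1.0)
print(x_adv)
```

The first two values end up exactly `epsilon` away from their starting points, while the last two are stopped earlier by `clip_value_max` and `clip_value_min`.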


In [None]:
class HParams(object):
  def __init__(self):
    self.input_shape = [28, 28, 1]
    self.num_classes = 10
    self.conv_filters = [32, 64, 64]
    self.kernel_size = (3, 3)
    self.pool_size = (2, 2)
    self.num_fc_units = [64]
    self.batch_size = 32
    self.epochs = 5
    self.adv_multiplier = 0.2
    self.adv_step_size = 0.01
    self.adv_grad_norm = 'infinity'
    self.pgd_iterations = 40
    self.pgd_epsilon = 0.2
    self.clip_value_min = 0.0
    self.clip_value_max = 1.0

HPARAMS = HParams()

## MNIST dataset

The [MNIST dataset](http://yann.lecun.com/exdb/mnist/) contains grayscale
images of handwritten digits (from '0' to '9'). Each image shows one digit at
low resolution (28-by-28 pixels). The task involved is to classify images into
10 categories, one per digit.

Here we load the MNIST dataset from
[TensorFlow Datasets](https://www.tensorflow.org/datasets). It handles
downloading the data and constructing a `tf.data.Dataset`. The loaded dataset
has two subsets:

*   `train` with 60,000 examples, and
*   `test` with 10,000 examples.

Examples in both subsets are stored in feature dictionaries with the following
two keys:

*   `image`: Array of pixel values, ranging from 0 to 255.
*   `label`: Groundtruth label, ranging from 0 to 9.

In [None]:
datasets = tfds.load('mnist')

train_dataset = datasets['train']
test_dataset = datasets['test']

IMAGE_INPUT_NAME = 'image'
LABEL_INPUT_NAME = 'label'

To make the model numerically stable, we normalize the pixel values to [0, 1]
by mapping the dataset over the `normalize` function. After shuffling the
training set and batching, we convert the examples to feature tuples
`(image, label)` for training the base model. We also provide a function to
convert from tuples to dictionaries for later use.
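
As a quick sanity check on the batching step, the sketch below computes how many batches each subset yields with a batch size of 32. It assumes that `batch()` keeps the final partial batch, which is the default behavior of `tf.data.Dataset.batch`; the helper `num_batches` is hypothetical.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)

batch_size = 32
train_batches = num_batches(60000, batch_size)
test_batches = num_batches(10000, batch_size)
# Size of the last (partial) test batch.
last_test_batch = 10000 - (test_batches - 1) * batch_size
print(train_batches, test_batches, last_test_batch)
```

The training set divides evenly, while the last test batch is smaller than the others, which is worth keeping in mind for any code that assumes a fixed batch size.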

In [None]:
def normalize(features):
  features[IMAGE_INPUT_NAME] = tf.cast(
      features[IMAGE_INPUT_NAME], dtype=tf.float32) / 255.0
  return features

def convert_to_tuples(features):
  return features[IMAGE_INPUT_NAME], features[LABEL_INPUT_NAME]

def convert_to_dictionaries(image, label):
  return {IMAGE_INPUT_NAME: image, LABEL_INPUT_NAME: label}

train_dataset = train_dataset.map(normalize).shuffle(10000).batch(HPARAMS.batch_size).map(convert_to_tuples)
test_dataset = test_dataset.map(normalize).batch(HPARAMS.batch_size).map(convert_to_tuples)

## Base model

Our base model will be a neural network consisting of 3 convolutional layers
followed by 2 fully-connected layers (as defined in `HPARAMS`). Here we define
it using the Keras functional API. Feel free to try other APIs or model
architectures.
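
To see what this architecture does to a 28-by-28 input, the sketch below traces the spatial size through the convolution and pooling stack. It assumes the Keras defaults (unpadded, stride-1 convolutions and non-overlapping pooling) and pooling only between convolutional layers; the helper `conv_stack_output_size` is hypothetical.

```python
def conv_stack_output_size(size, num_conv_layers, kernel, pool):
    """Spatial size after unpadded stride-1 convolutions with pooling between."""
    for i in range(num_conv_layers):
        size = size - kernel + 1      # unpadded ('valid') convolution
        if i < num_conv_layers - 1:
            size = size // pool       # non-overlapping max pooling
    return size

conv_filters = [32, 64, 64]
side = conv_stack_output_size(28, len(conv_filters), kernel=3, pool=2)
# Number of features fed to the first fully-connected layer after flattening.
flattened = side * side * conv_filters[-1]
print(side, flattened)
```

The sizes go 28, 26, 13, 11, 5, 3, so the flattened vector passed to the fully-connected layers has 3 * 3 * 64 entries.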

In [None]:
def build_base_model(hparams):
  """Builds a model according to the architecture defined in `hparams`."""
  inputs = tf.keras.Input(
      shape=hparams.input_shape, dtype=tf.float32, name=IMAGE_INPUT_NAME)

  x = inputs
  for i, num_filters in enumerate(hparams.conv_filters):
    x = tf.keras.layers.Conv2D(
        num_filters, hparams.kernel_size, activation='relu')(
            x)
    if i < len(hparams.conv_filters) - 1:
      # max pooling between convolutional layers
      x = tf.keras.layers.MaxPooling2D(hparams.pool_size)(x)
  x = tf.keras.layers.Flatten()(x)
  for num_units in hparams.num_fc_units:
    x = tf.keras.layers.Dense(num_units, activation='relu')(x)
  pred = tf.keras.layers.Dense(hparams.num_classes, activation=None)(x)
  # pred = tf.keras.layers.Dense(hparams.num_classes, activation='softmax')(x)
  model = tf.keras.Model(inputs=inputs, outputs=pred)
  return model

In [None]:
base_model = build_base_model(HPARAMS)
base_model.summary()

Next we train and evaluate the base model.

In [None]:
base_model.compile(optimizer='adam',
                   loss=tf.keras.losses.SparseCategoricalCrossentropy(
                       from_logits=True),
                   metrics=['acc'])
base_model.fit(train_dataset, epochs=HPARAMS.epochs)

In [None]:
results = base_model.evaluate(test_dataset)
named_results = dict(zip(base_model.metrics_names, results))
print('\naccuracy:', named_results['acc'])

## Adversarial-regularized model

Here we show how to incorporate adversarial training into a Keras model with a
few lines of code, using the NSL framework. The base model is wrapped to create
a new `tf.keras.Model`, whose training objective includes adversarial
regularization.

We will train one using the FGSM adversary and one using a stronger PGD
adversary.

First, we create config objects with relevant hyperparameters.

In [None]:
fgsm_adv_config = nsl.configs.make_adv_reg_config(
    multiplier=HPARAMS.adv_multiplier,
    # With FGSM, we want to take a single step equal to the epsilon ball size,
    # to get the largest allowable perturbation.
    adv_step_size=HPARAMS.pgd_epsilon,
    adv_grad_norm=HPARAMS.adv_grad_norm,
    clip_value_min=HPARAMS.clip_value_min,
    clip_value_max=HPARAMS.clip_value_max
)

pgd_adv_config = nsl.configs.make_adv_reg_config(
    multiplier=HPARAMS.adv_multiplier,
    adv_step_size=HPARAMS.adv_step_size,
    adv_grad_norm=HPARAMS.adv_grad_norm,
    pgd_iterations=HPARAMS.pgd_iterations,
    pgd_epsilon=HPARAMS.pgd_epsilon,
    clip_value_min=HPARAMS.clip_value_min,
    clip_value_max=HPARAMS.clip_value_max
)

Now we can wrap a base model with `AdversarialRegularization`. Here we create 
new base models (`base_fgsm_model`, `base_pgd_model`), so that the existing one
(`base_model`) can be used in later comparison.

The wrapper returns a `tf.keras.Model` object whose training objective
includes a regularization term for the adversarial loss. To compute that loss,
the model needs access to the label information (feature `label`) in
addition to the regular input (feature `image`). For this reason, we convert the
examples in the datasets from tuples back to dictionaries, and we tell the
model which feature contains the label information via the `label_keys`
parameter.
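
Schematically, and using our own notation rather than the library's, the training objective described above can be written as the labeled loss plus a weighted loss on the adversarial input. Here $f$ is the base model, $\alpha$ corresponds to `adv_multiplier`, and $x_{\text{adv}}$ is the perturbed input; the exact form used inside NSL may differ in details.

```latex
\mathcal{L}_{\text{total}}(x, y)
  = \mathcal{L}\big(f(x),\, y\big)
  + \alpha \, \mathcal{L}\big(f(x_{\text{adv}}),\, y\big)
```

With $\alpha = 0.2$, as set in `HPARAMS`, the adversarial term contributes one fifth of the weight given to the labeled loss.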

We will create two adversarially regularized models: `fgsm_adv_model`
(regularized with FGSM) and `pgd_adv_model` (regularized with PGD).

In [None]:
# Create model for FGSM.
base_fgsm_model = build_base_model(HPARAMS)
# Create FGSM-regularized model.
fgsm_adv_model = nsl.keras.AdversarialRegularization(
    base_fgsm_model,
    label_keys=[LABEL_INPUT_NAME],
    adv_config=fgsm_adv_config
)

In [None]:
# Create model for PGD.
base_pgd_model = build_base_model(HPARAMS)
# Create PGD-regularized model.
pgd_adv_model = nsl.keras.AdversarialRegularization(
    base_pgd_model,
    label_keys=[LABEL_INPUT_NAME],
    adv_config=pgd_adv_config
)

In [None]:
# Data for training.
train_set_for_adv_model = train_dataset.map(convert_to_dictionaries)
test_set_for_adv_model = test_dataset.map(convert_to_dictionaries)

Next we compile, train, and evaluate the adversarial-regularized models. There
might be warnings like "Output missing from loss dictionary," which is fine
because the wrapped models don't rely on the base implementation to calculate
the total loss.

In [None]:
fgsm_adv_model.compile(optimizer='adam',
                       loss=tf.keras.losses.SparseCategoricalCrossentropy(
                           from_logits=True),
                       metrics=['acc'])
fgsm_adv_model.fit(train_set_for_adv_model, epochs=HPARAMS.epochs)

In [None]:
results = fgsm_adv_model.evaluate(test_set_for_adv_model)
named_results = dict(zip(fgsm_adv_model.metrics_names, results))
print('\naccuracy:', named_results['sparse_categorical_accuracy'])

In [None]:
pgd_adv_model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(
                          from_logits=True),
                      metrics=['acc'])
pgd_adv_model.fit(train_set_for_adv_model, epochs=HPARAMS.epochs)

In [None]:
results = pgd_adv_model.evaluate(test_set_for_adv_model)
named_results = dict(zip(pgd_adv_model.metrics_names, results))
print('\naccuracy:', named_results['sparse_categorical_accuracy'])

Both adversarially regularized models perform well on the test set.

## Robustness under Adversarial Perturbations

Now we compare the base model and the adversarial-regularized models for
robustness under adversarial perturbation.

We will show that the base model is vulnerable to attacks from both FGSM and
PGD, that the FGSM-regularized model can resist FGSM attacks but is vulnerable
to PGD, and that the PGD-regularized model is able to resist both forms of
attack.

We use `gen_adv_neighbor` to generate adversaries for our models.

### Attacking the Base Model

In [None]:
# Set up the neighbor config for FGSM.
fgsm_nbr_config = nsl.configs.AdvNeighborConfig(
    adv_grad_norm=HPARAMS.adv_grad_norm,
    adv_step_size=HPARAMS.pgd_epsilon,
    clip_value_min=0.0,
    clip_value_max=1.0,
)

# The labeled loss function provides the loss for each sample we pass in. This
# will be used to calculate the gradient.
labeled_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
)


In [None]:
%%time
# Generate adversarial images using FGSM on the base model.
perturbed_images, labels, predictions = [], [], []

# We want to record the accuracy.
metric = tf.keras.metrics.SparseCategoricalAccuracy()

for batch in test_set_for_adv_model:
  # Record the loss calculation to get the gradient.
  with tf.GradientTape() as tape:
    tape.watch(batch)
    losses = labeled_loss_fn(batch[LABEL_INPUT_NAME],
                             base_model(batch[IMAGE_INPUT_NAME]))
    
  # Generate the adversarial example.
  fgsm_images, _ = nsl.lib.adversarial_neighbor.gen_adv_neighbor(
      batch[IMAGE_INPUT_NAME],
      losses,
      fgsm_nbr_config,
      gradient_tape=tape
  )

  # Update our accuracy metric.
  y_true = batch['label']
  y_pred = base_model(fgsm_images)
  metric(y_true, y_pred)

  # Store images for later use.
  perturbed_images.append(fgsm_images)
  labels.append(y_true.numpy())
  predictions.append(tf.argmax(y_pred, axis=-1).numpy())

print('%s model accuracy: %f' % ('base', metric.result().numpy()))

Let's examine what some of these images look like.

In [None]:
def examine_images(perturbed_images, labels, predictions, model_key):
  batch_index = 0

  batch_image = perturbed_images[batch_index]
  batch_label = labels[batch_index]
  batch_pred = predictions[batch_index]

  batch_size = HPARAMS.batch_size
  n_col = 4
  # Integer division: plt.subplot expects integer grid dimensions.
  n_row = (batch_size + n_col - 1) // n_col

  print('accuracy in batch %d:' % batch_index)
  print('%s model: %d / %d' %
        (model_key, np.sum(batch_label == batch_pred), batch_size))

  plt.figure(figsize=(15, 15))
  for i, (image, y) in enumerate(zip(batch_image, batch_label)):
    y_base = batch_pred[i]
    plt.subplot(n_row, n_col, i+1)
    plt.title('true: %d, %s: %d' % (y, model_key, y_base), color='r'
      if y != y_base else 'k')
    plt.imshow(tf.keras.preprocessing.image.array_to_img(image), cmap='gray')
    plt.axis('off')

  plt.show()

In [None]:
examine_images(perturbed_images, labels, predictions, 'base')

Our perturbation budget of 0.2 is quite large, but even so, the perturbed
numbers are clearly recognizable to the human eye. On the other hand, our
network is fooled into misclassifying several examples.

As we can see, the FGSM attack is already highly effective and quick to
execute, heavily reducing the model accuracy. We will see below that the PGD
attack is even more effective, even with the same perturbation budget.
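
To put the perturbation budget of 0.2 in perspective, the short sketch below converts it to 8-bit gray levels and shows how clipping to [0, 1] reduces the effective change for pixels that are already near the top of the range. The pixel values are hypothetical.

```python
import numpy as np

epsilon = 0.2
# The budget expressed in gray levels of the original 0-255 pixel scale.
gray_levels = round(epsilon * 255)
print('budget in 8-bit gray levels:', gray_levels)

# Hypothetical normalized pixel values, each pushed up by the full budget.
pixels = np.array([0.0, 0.5, 0.9, 1.0])
perturbed = np.clip(pixels + epsilon, 0.0, 1.0)
effective_change = perturbed - pixels
print(effective_change)
```

A pixel that is already white cannot be pushed any higher, so the attack only has room to move it downward.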

In [None]:
# Set up the neighbor config for PGD.
pgd_nbr_config = nsl.configs.AdvNeighborConfig(
    adv_grad_norm=HPARAMS.adv_grad_norm,
    adv_step_size=HPARAMS.adv_step_size,
    pgd_iterations=HPARAMS.pgd_iterations,
    pgd_epsilon=HPARAMS.pgd_epsilon,
    clip_value_min=HPARAMS.clip_value_min,
    clip_value_max=HPARAMS.clip_value_max,
)

# pgd_model_fn generates a prediction from which we calculate the loss, and the
# gradient for a given iteration.
pgd_model_fn = base_model

# We need to pass in the loss function for repeated calculation of the gradient.
pgd_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, 
)
labeled_loss_fn = pgd_loss_fn

In [None]:
%%time
# Generate adversarial images using PGD on the base model.
perturbed_images, labels, predictions = [], [], []

# Record the accuracy.
metric = tf.keras.metrics.SparseCategoricalAccuracy()

for batch in test_set_for_adv_model:
  # Gradient tape to calculate the loss on the first iteration.
  with tf.GradientTape() as tape:
    tape.watch(batch)
    losses = labeled_loss_fn(batch[LABEL_INPUT_NAME],
                             base_model(batch[IMAGE_INPUT_NAME]))
    
  # Generate the adversarial examples.
  pgd_images, _ = nsl.lib.adversarial_neighbor.gen_adv_neighbor(
      batch[IMAGE_INPUT_NAME],
      losses,
      pgd_nbr_config,
      gradient_tape=tape,
      pgd_model_fn=pgd_model_fn,
      pgd_loss_fn=pgd_loss_fn,
      pgd_labels=batch[LABEL_INPUT_NAME],
  )

  # Update our accuracy metric.
  y_true = batch['label']
  y_pred = base_model(pgd_images)
  metric(y_true, y_pred)

  # Store images for visualization.
  perturbed_images.append(pgd_images)
  labels.append(y_true.numpy())
  predictions.append(tf.argmax(y_pred, axis=-1).numpy())

print('%s model accuracy: %f' % ('base', metric.result().numpy()))

In [None]:
examine_images(perturbed_images, labels, predictions, 'base')

The PGD attack is much stronger, but it also takes longer to run.

### Attacking the FGSM Regularized Model

In [None]:
# Set up the neighbor config.
fgsm_nbr_config = nsl.configs.AdvNeighborConfig(
    adv_grad_norm=HPARAMS.adv_grad_norm,
    adv_step_size=HPARAMS.pgd_epsilon,
    clip_value_min=0.0,
    clip_value_max=1.0,
)

# The labeled loss function provides the loss for each sample we pass in. This
# will be used to calculate the gradient.
labeled_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
)

In [None]:
%%time
# Generate adversarial images using FGSM on the FGSM-regularized model.
perturbed_images, labels, predictions = [], [], []

# Record the accuracy.
metric = tf.keras.metrics.SparseCategoricalAccuracy()

for batch in test_set_for_adv_model:
  # Record the loss calculation to get its gradients.
  with tf.GradientTape() as tape:
    tape.watch(batch)
    # We attack the adversarially regularized model.
    losses = labeled_loss_fn(batch[LABEL_INPUT_NAME],
                             fgsm_adv_model.base_model(batch[IMAGE_INPUT_NAME]))
    
  # Generate the adversarial examples.
  fgsm_images, _ = nsl.lib.adversarial_neighbor.gen_adv_neighbor(
      batch[IMAGE_INPUT_NAME],
      losses,
      fgsm_nbr_config,
      gradient_tape=tape
  )

  # Update our accuracy metric.
  y_true = batch['label']
  y_pred = fgsm_adv_model.base_model(fgsm_images)
  metric(y_true, y_pred)

  # Store images for visualization.
  perturbed_images.append(fgsm_images)
  labels.append(y_true.numpy())
  predictions.append(tf.argmax(y_pred, axis=-1).numpy())

print('%s model accuracy: %f' % ('fgsm_reg', metric.result().numpy()))

In [None]:
examine_images(perturbed_images, labels, predictions, 'fgsm_reg')

As we can see, the FGSM-regularized model performs much better than the base
model on images perturbed by FGSM. How does it do against PGD?

In [None]:
# Set up the neighbor config for PGD.
pgd_nbr_config = nsl.configs.AdvNeighborConfig(
    adv_grad_norm=HPARAMS.adv_grad_norm,
    adv_step_size=HPARAMS.adv_step_size,
    pgd_iterations=HPARAMS.pgd_iterations,
    pgd_epsilon=HPARAMS.pgd_epsilon,
    clip_value_min=HPARAMS.clip_value_min,
    clip_value_max=HPARAMS.clip_value_max,
)

# pgd_model_fn generates a prediction from which we calculate the loss, and the
# gradient for a given iteration.
pgd_model_fn = fgsm_adv_model.base_model

# We need to pass in the loss function for repeated calculation of the gradient.
pgd_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, 
)
labeled_loss_fn = pgd_loss_fn

In [None]:
%%time
# Generate adversarial images using PGD on the FGSM-regularized model.
perturbed_images, labels, predictions = [], [], []

metric = tf.keras.metrics.SparseCategoricalAccuracy()

for batch in test_set_for_adv_model:
  # Gradient tape to calculate the loss on the first iteration.
  with tf.GradientTape() as tape:
    tape.watch(batch)
    losses = labeled_loss_fn(batch[LABEL_INPUT_NAME],
                             fgsm_adv_model.base_model(batch[IMAGE_INPUT_NAME]))
    
  # Generate the adversarial examples.
  pgd_images, _ = nsl.lib.adversarial_neighbor.gen_adv_neighbor(
      batch[IMAGE_INPUT_NAME],
      losses,
      pgd_nbr_config,
      gradient_tape=tape,
      pgd_model_fn=pgd_model_fn,
      pgd_loss_fn=pgd_loss_fn,
      pgd_labels=batch[LABEL_INPUT_NAME],
  )
  
  # Update our accuracy metric.
  y_true = batch['label']
  y_pred = fgsm_adv_model.base_model(pgd_images)
  metric(y_true, y_pred)

  # Store images for visualization.
  perturbed_images.append(pgd_images)
  labels.append(y_true.numpy())
  predictions.append(tf.argmax(y_pred, axis=-1).numpy())

print('%s model accuracy: %f' % ('fgsm_reg', metric.result().numpy()))

In [None]:
examine_images(perturbed_images, labels, predictions, 'fgsm_reg')

While the FGSM regularized model was robust to attacks via FGSM, as we can see
it is still vulnerable to attacks from PGD, which is a stronger attack mechanism
than FGSM.


### Attacking the PGD Regularized Model

In [None]:
# Set up the neighbor config.
fgsm_nbr_config = nsl.configs.AdvNeighborConfig(
    adv_grad_norm=HPARAMS.adv_grad_norm,
    adv_step_size=HPARAMS.pgd_epsilon,
    clip_value_min=0.0,
    clip_value_max=1.0,
)

# The labeled loss function provides the loss for each sample we pass in. This
# will be used to calculate the gradient.
labeled_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
)

In [None]:
%%time
# Generate adversarial images using FGSM on the PGD-regularized model.
perturbed_images, labels, predictions = [], [], []

# Record the accuracy.
metric = tf.keras.metrics.SparseCategoricalAccuracy()

for batch in test_set_for_adv_model:
  # Record the loss calculation to get its gradients.
  with tf.GradientTape() as tape:
    tape.watch(batch)
    # We attack the adversarially regularized model.
    losses = labeled_loss_fn(batch[LABEL_INPUT_NAME],
                             pgd_adv_model.base_model(batch[IMAGE_INPUT_NAME]))

  # Generate the adversarial examples.
  fgsm_images, _ = nsl.lib.adversarial_neighbor.gen_adv_neighbor(
      batch[IMAGE_INPUT_NAME],
      losses,
      fgsm_nbr_config,
      gradient_tape=tape
  )

  # Update our accuracy metric.
  y_true = batch['label']
  y_pred = pgd_adv_model.base_model(fgsm_images)
  metric(y_true, y_pred)

  # Store images for visualization.
  perturbed_images.append(fgsm_images)
  labels.append(y_true.numpy())
  predictions.append(tf.argmax(y_pred, axis=-1).numpy())

print('%s model accuracy: %f' % ('pgd_reg', metric.result().numpy()))

In [None]:
examine_images(perturbed_images, labels, predictions, 'pgd_reg')

In [None]:
# Set up the neighbor config for PGD.
pgd_nbr_config = nsl.configs.AdvNeighborConfig(
    adv_grad_norm=HPARAMS.adv_grad_norm,
    adv_step_size=HPARAMS.adv_step_size,
    pgd_iterations=HPARAMS.pgd_iterations,
    pgd_epsilon=HPARAMS.pgd_epsilon,
    clip_value_min=HPARAMS.clip_value_min,
    clip_value_max=HPARAMS.clip_value_max,
)

# pgd_model_fn generates a prediction from which we calculate the loss, and the
# gradient for a given iteration.
pgd_model_fn = pgd_adv_model.base_model

# We need to pass in the loss function for repeated calculation of the gradient.
pgd_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, 
)
labeled_loss_fn = pgd_loss_fn

In [None]:
%%time
# Generate adversarial images using PGD on the PGD-regularized model.
perturbed_images, labels, predictions = [], [], []

metric = tf.keras.metrics.SparseCategoricalAccuracy()

for batch in test_set_for_adv_model:
  # Gradient tape to calculate the loss on the first iteration.
  with tf.GradientTape() as tape:
    tape.watch(batch)
    losses = labeled_loss_fn(batch[LABEL_INPUT_NAME],
                             pgd_adv_model.base_model(batch[IMAGE_INPUT_NAME]))
  
  # Generate the adversarial examples.
  pgd_images, _ = nsl.lib.adversarial_neighbor.gen_adv_neighbor(
      batch[IMAGE_INPUT_NAME],
      losses,
      pgd_nbr_config,
      gradient_tape=tape,
      pgd_model_fn=pgd_model_fn,
      pgd_loss_fn=pgd_loss_fn,
      pgd_labels=batch[LABEL_INPUT_NAME],
  )
  
  # Update our accuracy metric.
  y_true = batch['label']
  y_pred = pgd_adv_model.base_model(pgd_images)
  metric(y_true, y_pred)

  # Store images for visualization.
  perturbed_images.append(pgd_images)
  labels.append(y_true.numpy())
  predictions.append(tf.argmax(y_pred, axis=-1).numpy())

print('%s model accuracy: %f' % ('pgd_reg', metric.result().numpy()))

In [None]:
examine_images(perturbed_images, labels, predictions, 'pgd_reg')

The PGD-regularized model is strong against both attack types.

# Conclusion

In this colab, we've explored two gradient-based attack methods: FGSM and its
stronger, iterative variant PGD. We have seen that neural networks not trained
to defend against these attacks are vulnerable to them, and how to use
adversarial regularization in the Neural Structured Learning framework to
improve robustness.