## Motivation

In this notebook, we examine the idea introduced in section 1.5.5 on the Fashion MNIST data, a classification task.

The model follows: https://www.kaggle.com/code/m0hand/fashion-mnist-cnn

In [1]:
import numpy as np
import tensorflow as tf
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Sequential
from keras.losses import MSE, CategoricalCrossentropy
from sklearn.metrics import accuracy_score
from tqdm import tqdm

from utils import get_gradient_loss_fn

tf.random.set_seed(42)


## The Fashion MNIST Data

In [2]:
mnist = tf.keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = tf.convert_to_tensor(x_train / 255.0, 'float32')
x_test = tf.convert_to_tensor(x_test / 255.0, 'float32')
y_train = tf.one_hot(y_train, 10, dtype='float32')
y_test = tf.one_hot(y_test, 10, dtype='float32')

x_train.shape, y_train.shape

(TensorShape([60000, 28, 28]), TensorShape([60000, 10]))
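
As a framework-free illustration of the preprocessing above, the following NumPy sketch scales pixel values to [0, 1] and one-hot encodes integer labels. The toy arrays and the helper name `one_hot` are made up for this example; the notebook itself uses `tf.one_hot` on the real data.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a float32 one-hot matrix of shape (len(labels), num_classes)."""
    return np.eye(num_classes, dtype="float32")[labels]

# Toy stand-ins for the real data: two 28x28 uint8 images and their labels.
images = np.zeros((2, 28, 28), dtype="uint8")
images[1] = 255
labels = np.array([3, 9])

x = (images / 255.0).astype("float32")  # pixel values now lie in [0, 1]
y = one_hot(labels, 10)

print(x.shape, y.shape)  # (2, 28, 28) (2, 10)
```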

## Train a Model with Gradient Loss

In [3]:
model = Sequential([ 
    Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    Flatten(),
    Dense(64, 'relu'),
    Dense(10, 'softmax')
])

gradient_loss_fn = get_gradient_loss_fn(
    lambda inputs: MSE(inputs[1], model(inputs[0]))
)

In [4]:
optimizer = tf.optimizers.Adam()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = gradient_loss_fn((x, y))
    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables))
    return loss

In [5]:
def evaluate(model):
    return accuracy_score(
        tf.argmax(y_test, axis=1),
        tf.argmax(model(x_test), axis=1),
    )
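
For readers who want to see what `evaluate` measures without TensorFlow or scikit-learn, here is a minimal NumPy sketch of the same arg-max accuracy. The function name `accuracy_from_one_hot` and the toy labels and probabilities are assumptions made for illustration only.

```python
import numpy as np

def accuracy_from_one_hot(y_true, y_prob):
    """Fraction of rows where the arg-max class of y_prob matches y_true."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

# Four toy examples with three classes each.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype="float32")
y_prob = np.array([
    [0.7, 0.2, 0.1],  # correct
    [0.1, 0.8, 0.1],  # correct
    [0.5, 0.3, 0.2],  # wrong: predicts class 0, true class is 2
    [0.2, 0.6, 0.2],  # correct
], dtype="float32")

acc = accuracy_from_one_hot(y_true, y_prob)
print(acc)  # 0.75
```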

In [6]:
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.batch(100)

In [7]:
for epoch in range(20):
    for x, y in tqdm(ds):
        loss = train_step(x, y)
    print(epoch, loss.numpy(), evaluate(model))

100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.40it/s]


0 0.00072131696 0.8606


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.68it/s]


1 0.0005959251 0.8781


100%|██████████████████████████████████████████████████████████| 600/600 [00:55<00:00, 10.75it/s]


2 0.0005444534 0.8884


100%|██████████████████████████████████████████████████████████| 600/600 [00:55<00:00, 10.76it/s]


3 0.00049051974 0.8939


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.68it/s]


4 0.00044895208 0.8962


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.51it/s]


5 0.00045449394 0.8974


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.30it/s]


6 0.0004537671 0.8982


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.40it/s]


7 0.00041171588 0.898


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.38it/s]


8 0.0003834727 0.8996


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.49it/s]


9 0.00039342765 0.8937


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.54it/s]


10 0.0003597469 0.8958


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.69it/s]


11 0.00035220583 0.897


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.67it/s]


12 0.00035183047 0.8967


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.70it/s]


13 0.00024195388 0.9007


100%|██████████████████████████████████████████████████████████| 600/600 [00:55<00:00, 10.72it/s]


14 0.00034668285 0.8917


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.62it/s]


15 0.0002702052 0.8948


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.57it/s]


16 0.0001871549 0.8984


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.58it/s]


17 0.0002421062 0.8986


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.59it/s]


18 0.00020391753 0.8991


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.62it/s]


19 0.000275143 0.895


In [8]:
evaluate(model)

0.895

## Baseline Model with Usual Loss

In [9]:
baseline_model = Sequential([ 
    Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    Flatten(),
    Dense(64, 'relu'),
    Dense(10, 'softmax')
])

In [10]:
baseline_model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(),
    # loss=MSE,
    metrics=['accuracy'],
)

In [11]:
baseline_model.fit(
    x_train, y_train,
    epochs=20,
    validation_data=(x_test, y_test),
)

<keras.src.callbacks.History at 0x7faef1935350>

In [12]:
evaluate(baseline_model)

0.9079

## Model Robustness

Now we compare the robustness of the model trained with the gradient loss against the baseline. To do so, we add Gaussian noise to the test data and check how the accuracy changes.

In [27]:
stddev = 1e-1
noise = tf.random.normal(tf.shape(x_test)) * stddev
x_test_noised = x_test + noise

Alternatively, we can use non-Gaussian "pixel-flipping" noise.

In [28]:
# flip_ratio = 0.01
# x_test_noised = tf.where(tf.random.uniform(tf.shape(x_test)) < flip_ratio, 1 - x_test, x_test)
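
To make the difference between the two kinds of noise concrete, the sketch below applies both to a toy array in NumPy and compares how far individual pixels move. The random stand-in for `x_test`, the seed, and the printed summary statistics are assumptions for illustration; only `stddev = 0.1` and `flip_ratio = 0.01` come from the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Toy stand-in for x_test: 1000 images with pixel values in [0, 1].
x = rng.uniform(0.0, 1.0, size=(1000, 28, 28)).astype("float32")

# Gaussian noise: every pixel moves by a small, continuous amount.
stddev = 0.1
x_gauss = x + rng.normal(0.0, stddev, size=x.shape).astype("float32")

# Pixel flipping: a random fraction of pixels is replaced by 1 - value.
flip_ratio = 0.01
mask = rng.uniform(size=x.shape) < flip_ratio
x_flip = np.where(mask, 1.0 - x, x)

print("fraction of flipped pixels:", mask.mean())
print("mean |change|, Gaussian:", np.abs(x_gauss - x).mean())
print("mean |change| on flipped pixels:", np.abs(x_flip - x)[mask].mean())
```

Gaussian noise touches every pixel a little, while flipping leaves most pixels untouched and moves the few affected ones by a large amount.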

In [29]:
def evaluate_robustness(model):
    accuracy = accuracy_score(
        tf.argmax(y_test, axis=1), tf.argmax(model(x_test), axis=1))
    accuracy_noised = accuracy_score(
        tf.argmax(y_test, axis=1), tf.argmax(model(x_test_noised), axis=1))
    print(f'Accuracy: {accuracy} -> {accuracy_noised}')

In [30]:
evaluate_robustness(model)

Accuracy: 0.895 -> 0.8612


In [31]:
evaluate_robustness(baseline_model)

Accuracy: 0.9079 -> 0.8206


## Conclusion

- By simply using the "gradient loss", we obtained a test accuracy that approaches the baseline (0.895 versus 0.9079). Its robustness to Gaussian noise, however, greatly outperforms the baseline: accuracy drops from 0.895 to 0.8612 (about 3.4 points) for the gradient-loss model, versus 0.9079 to 0.8206 (about 8.7 points) for the baseline.
- Because the "gradient loss" computes gradients twice (once with respect to x and y, and once with respect to the model variables), training takes about twice as long.
- We also tested non-Gaussian noise, such as flipping a given ratio of pixels. In that case the baseline model turns out to be more robust than the model trained with the "gradient loss". The difference is that pixel-flipping noise is not continuous: the noised images jump to other local minima of the loss function.
- We also trained the baseline model with MSE as the loss. The final performance is slightly worse than with categorical cross-entropy, as expected. The conclusions above remain unchanged.
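
The "gradients computed twice" point in the second bullet can be illustrated on a model small enough to differentiate by hand. The sketch below is not the implementation of `get_gradient_loss_fn` (that helper lives in `utils` and is not shown here). It assumes a hypothetical objective, the mean squared norm of the gradient of a per-example MSE loss with respect to (x, y), for a linear model `p = x @ w`. The function names and toy data are made up for the example.

```python
import numpy as np

def input_gradient_penalty(w, x, y):
    """Hypothetical objective: mean squared norm of d(loss)/d(x, y).

    Model: p = x @ w, per-example loss = (y - p)**2, so
    d(loss)/dx = -2 * r * w and d(loss)/dy = 2 * r with r = y - p.
    """
    r = y - x @ w
    return np.mean(4.0 * r**2 * (w @ w + 1.0))

def penalty_grad_w(w, x, y):
    """Second differentiation: gradient of the penalty w.r.t. the weights."""
    r = y - x @ w
    s = w @ w + 1.0
    # d/dw [4 r^2 s] = 8 r * (-x) * s + 4 r^2 * 2 w
    per_example = -8.0 * (r * s)[:, None] * x + 8.0 * (r**2)[:, None] * w
    return per_example.mean(axis=0)

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)

# Check the analytic gradient against central finite differences.
eps = 1e-6
numeric = np.array([
    (input_gradient_penalty(w + eps * e, x, y)
     - input_gradient_penalty(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(5)
])
analytic = penalty_grad_w(w, x, y)
print(np.allclose(numeric, analytic, rtol=1e-4, atol=1e-6))
```

The first differentiation (with respect to the inputs) is baked into `input_gradient_penalty`; the optimizer then needs a second one, with respect to the weights. In a deep network both passes are done by automatic differentiation, which is consistent with the extra training cost noted above.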