## Motivation

In this notebook, we examine the idea described in section 1.5.5 on the Fashion MNIST data, a classification task.

The model follows: https://www.kaggle.com/code/m0hand/fashion-mnist-cnn

In [1]:
import numpy as np
import tensorflow as tf
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Sequential
from keras.losses import MSE, CategoricalCrossentropy
from sklearn.metrics import accuracy_score
from tqdm import tqdm

from utils import get_gradient_loss_fn

tf.random.set_seed(42)



## The Fashion MNIST Data

In [2]:
mnist = tf.keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = tf.convert_to_tensor(x_train / 255.0, 'float32')
x_test = tf.convert_to_tensor(x_test / 255.0, 'float32')
y_train = tf.one_hot(y_train, 10, dtype='float32')
y_test = tf.one_hot(y_test, 10, dtype='float32')

x_train.shape, y_train.shape

(TensorShape([60000, 28, 28]), TensorShape([60000, 10]))

## Train a Model with Gradient Loss

In [3]:
model = Sequential([ 
    Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    Flatten(),
    Dense(64, 'relu'),
    Dense(10, 'softmax')
])

gradient_loss_fn = get_gradient_loss_fn(
    lambda inputs: MSE(inputs[1], model(inputs[0]))
)
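
`get_gradient_loss_fn` comes from a local `utils` module that is not shown in this notebook. The sketch below is only a guess at what such a helper might look like, assuming the "gradient loss" is the batch-averaged squared gradient of the wrapped loss with respect to its inputs `(x, y)`. The name `gradient_loss_fn_sketch`, the toy loss, and the normalization are all hypothetical, and the real helper may differ.

```python
import tensorflow as tf


def gradient_loss_fn_sketch(loss_fn):
    """Hypothetical stand-in for utils.get_gradient_loss_fn (assumed behaviour)."""

    def gradient_loss_fn(inputs):
        inputs = [tf.convert_to_tensor(t) for t in inputs]
        with tf.GradientTape() as tape:
            # Inputs are plain tensors, so they must be watched explicitly.
            tape.watch(inputs)
            per_sample_loss = loss_fn(inputs)
        # First gradient: with respect to the inputs (x and y).
        grads = tape.gradient(per_sample_loss, inputs)
        batch_size = tf.cast(tf.shape(inputs[0])[0], grads[0].dtype)
        # Assumption: penalize the squared gradient norm, averaged over the batch.
        total = tf.add_n([tf.reduce_sum(tf.square(g)) for g in grads])
        return total / batch_size

    return gradient_loss_fn


# Toy check with a linear "model" so the numbers are easy to follow.
demo_w = tf.Variable([[1.0], [2.0]])


def toy_loss(inputs):
    x, y = inputs
    return tf.reduce_mean(tf.square(y - tf.matmul(x, demo_w)), axis=-1)


demo_fn = gradient_loss_fn_sketch(toy_loss)
demo_x = tf.constant([[1.0, 0.0], [0.0, 1.0]])
demo_y = tf.constant([[0.0], [0.0]])

# Second gradient: with respect to the trainable variable, through the first one.
with tf.GradientTape() as outer_tape:
    demo_loss = demo_fn((demo_x, demo_y))
demo_grad_w = outer_tape.gradient(demo_loss, demo_w)
```

Under this assumption, the outer tape in a training step differentiates through the inner gradient, which is why two gradient passes are needed per step.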

In [4]:
optimizer = tf.optimizers.Adam()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = gradient_loss_fn((x, y))
    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables))
    return loss

In [5]:
def evaluate(model):
    return accuracy_score(
        tf.argmax(y_test, axis=1),
        tf.argmax(model(x_test), axis=1),
    )
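
Calling the model on all 10,000 test images at once materializes every intermediate activation for the whole test set. A minimal sketch of a batched alternative follows. `batched_accuracy`, `predict_fn`, and `batch_size` are hypothetical names, and `predict_fn` stands for any callable that maps a batch of images to class scores; a toy predictor is used here so the block runs on its own.

```python
import numpy as np


def batched_accuracy(predict_fn, x, y_onehot, batch_size=1000):
    """Accuracy of predict_fn on (x, y_onehot), evaluated batch by batch."""
    n = len(x)
    correct = 0
    for start in range(0, n, batch_size):
        stop = start + batch_size
        scores = np.asarray(predict_fn(x[start:stop]))
        labels = np.argmax(np.asarray(y_onehot[start:stop]), axis=1)
        correct += int(np.sum(np.argmax(scores, axis=1) == labels))
    return correct / n


# Toy data: the label says whether the mean brightness exceeds 0.5.
toy_rng = np.random.default_rng(0)
toy_x = toy_rng.random((250, 28, 28)).astype(np.float32)
toy_labels = (toy_x.mean(axis=(1, 2)) > 0.5).astype(int)
toy_y = np.eye(2, dtype=np.float32)[toy_labels]


def toy_predict(batch):
    """Toy two-class scorer based on mean brightness (not a trained model)."""
    mean_brightness = batch.mean(axis=(1, 2))
    return np.stack([1.0 - mean_brightness, mean_brightness], axis=1)


toy_accuracy = batched_accuracy(toy_predict, toy_x, toy_y, batch_size=100)
```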

In [6]:
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.batch(100)

In [7]:
for epoch in range(20):
    for x, y in tqdm(ds):
        loss = train_step(x, y)
    print(epoch, loss.numpy(), evaluate(model))

100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.42it/s]
2024-03-21 11:54:18.114825: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.
2024-03-21 11:54:18.580471: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.
2024-03-21 11:54:19.019070: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.


0 0.0007949429 0.8572


100%|██████████████████████████████████████████████████████████| 600/600 [00:59<00:00, 10.05it/s]
2024-03-21 11:55:20.010328: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.
2024-03-21 11:55:20.529951: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.


1 0.00067475165 0.8771


100%|██████████████████████████████████████████████████████████| 600/600 [01:05<00:00,  9.10it/s]


2 0.0005717427 0.8892


100%|██████████████████████████████████████████████████████████| 600/600 [01:05<00:00,  9.11it/s]


3 0.00048587602 0.8928


100%|██████████████████████████████████████████████████████████| 600/600 [01:04<00:00,  9.24it/s]


4 0.00044733993 0.8953


100%|██████████████████████████████████████████████████████████| 600/600 [01:03<00:00,  9.43it/s]


5 0.00040259212 0.897


100%|██████████████████████████████████████████████████████████| 600/600 [01:01<00:00,  9.77it/s]


6 0.0003850808 0.8999


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.42it/s]


7 0.0004129472 0.9038


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.39it/s]


8 0.00047251425 0.9026


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.41it/s]


9 0.00040916956 0.9006


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.24it/s]


10 0.00038313595 0.901


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.26it/s]


11 0.0004143991 0.9021


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.32it/s]


12 0.0003378162 0.9061


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.35it/s]


13 0.00037573426 0.902


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.29it/s]


14 0.0003549408 0.9033


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.37it/s]


15 0.00030916827 0.905


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.32it/s]


16 0.0004225934 0.8983


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.49it/s]


17 0.00047631082 0.8913


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.53it/s]


18 0.00039022867 0.8891


100%|██████████████████████████████████████████████████████████| 600/600 [00:56<00:00, 10.54it/s]


19 0.00033668807 0.8996


In [8]:
evaluate(model)

0.8996

## Baseline Model with Usual Loss

In [9]:
baseline_model = Sequential([ 
    Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    Flatten(),
    Dense(64, 'relu'),
    Dense(10, 'softmax')
])

In [10]:
baseline_model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(),
    # loss=MSE,
    metrics=['accuracy'],
)

In [11]:
baseline_model.fit(
    x_train, y_train,
    epochs=20,
    validation_data=(x_test, y_test),
    verbose=2,
)

Epoch 1/20
1875/1875 - 27s - loss: 0.4514 - accuracy: 0.8369 - val_loss: 0.3497 - val_accuracy: 0.8718 - 27s/epoch - 14ms/step
Epoch 2/20
1875/1875 - 26s - loss: 0.3023 - accuracy: 0.8900 - val_loss: 0.3574 - val_accuracy: 0.8675 - 26s/epoch - 14ms/step
Epoch 3/20
1875/1875 - 26s - loss: 0.2603 - accuracy: 0.9042 - val_loss: 0.2963 - val_accuracy: 0.8909 - 26s/epoch - 14ms/step
Epoch 4/20
1875/1875 - 26s - loss: 0.2294 - accuracy: 0.9149 - val_loss: 0.2675 - val_accuracy: 0.9024 - 26s/epoch - 14ms/step
Epoch 5/20
1875/1875 - 26s - loss: 0.2056 - accuracy: 0.9237 - val_loss: 0.2706 - val_accuracy: 0.9025 - 26s/epoch - 14ms/step
Epoch 6/20
1875/1875 - 26s - loss: 0.1843 - accuracy: 0.9313 - val_loss: 0.2874 - val_accuracy: 0.8994 - 26s/epoch - 14ms/step
Epoch 7/20
1875/1875 - 26s - loss: 0.1650 - accuracy: 0.9377 - val_loss: 0.2527 - val_accuracy: 0.9148 - 26s/epoch - 14ms/step
Epoch 8/20
1875/1875 - 26s - loss: 0.1485 - accuracy: 0.9441 - val_loss: 0.2687 - val_accuracy: 0.9102 - 26s/ep

<keras.src.callbacks.History at 0x7f76ac141d10>

In [12]:
evaluate(baseline_model)

0.9097

## Model Robustness

Now, we compare the robustness of the model trained with the gradient loss against the baseline. To do so, we add Gaussian noise to the test data and check how much the accuracy drops.

In [13]:
stddev = 1e-1
noise = tf.random.normal(tf.shape(x_test)) * stddev
x_test_noised = x_test + noise

Alternatively, use non-Gaussian "pixel-flipping" noise.

In [14]:
# flip_ratio = 0.01
# x_test_noised = tf.where(tf.random.uniform(tf.shape(x_test)) < flip_ratio, 1 - x_test, x_test)
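
To make the pixel-flipping noise concrete, here is a small NumPy sketch of the same idea, assuming pixel values in [0, 1]. The helper name `flip_pixels` and its `seed` parameter are hypothetical. The sketch also measures how large the change to a flipped pixel can be, which is what separates this noise from small Gaussian perturbations.

```python
import numpy as np


def flip_pixels(images, flip_ratio, seed=0):
    """Replace a random fraction `flip_ratio` of pixels p by 1 - p."""
    rng = np.random.default_rng(seed)
    images = np.asarray(images, dtype=np.float32)
    # Each pixel is flipped independently with probability flip_ratio.
    mask = rng.random(images.shape) < flip_ratio
    return np.where(mask, 1.0 - images, images), mask


# Demo on all-black images, where a flipped pixel changes by the full range.
demo_images = np.zeros((100, 28, 28), dtype=np.float32)
demo_flipped, demo_mask = flip_pixels(demo_images, flip_ratio=0.01)

flipped_fraction = float(demo_mask.mean())
max_change = float(np.abs(demo_flipped - demo_images).max())
```

Only about one pixel in a hundred is touched, but each touched pixel can move across the whole value range rather than by a small amount.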

In [15]:
def evaluate_robustness(model):
    accuracy = accuracy_score(
        tf.argmax(y_test, axis=1), tf.argmax(model(x_test), axis=1))
    accuracy_noised = accuracy_score(
        tf.argmax(y_test, axis=1), tf.argmax(model(x_test_noised), axis=1))
    print(f'Accuracy: {accuracy} -> {accuracy_noised}')

In [16]:
evaluate_robustness(model)

Accuracy: 0.8996 -> 0.8681


In [17]:
evaluate_robustness(baseline_model)

Accuracy: 0.9097 -> 0.7995
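
The comparison above uses a single noise level. One way to extend it is to sweep several standard deviations, as in the sketch below. `accuracy_under_noise` and `predict_fn` are hypothetical names, `predict_fn` is any callable that maps images to class scores, and the toy predictor is a deliberately fragile stand-in rather than either model from this notebook. As in the cell above, noised pixels are not clipped to [0, 1].

```python
import numpy as np


def accuracy_under_noise(predict_fn, x, y_onehot, stddevs, seed=0):
    """Return {stddev: accuracy} after adding Gaussian noise to x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float32)
    labels = np.argmax(np.asarray(y_onehot), axis=1)
    results = {}
    for stddev in stddevs:
        noise = rng.normal(0.0, stddev, size=x.shape).astype(np.float32)
        predictions = np.argmax(np.asarray(predict_fn(x + noise)), axis=1)
        results[stddev] = float(np.mean(predictions == labels))
    return results


# Toy data: class 0 images are uniformly dark, class 1 images uniformly bright.
sweep_labels = np.repeat([0, 1], 100)
sweep_x = np.where(sweep_labels[:, None, None] == 1, 0.8, 0.2) * np.ones((200, 28, 28))
sweep_y = np.eye(2, dtype=np.float32)[sweep_labels]


def single_pixel_predict(batch):
    """Toy scorer that looks at one pixel only, so it is sensitive to noise."""
    pixel = batch[:, 0, 0]
    return np.stack([1.0 - pixel, pixel], axis=1)


sweep_results = accuracy_under_noise(
    single_pixel_predict, sweep_x, sweep_y, stddevs=[0.0, 0.1, 0.5]
)
```

Passing each of the two models in place of the toy predictor would give one accuracy curve per model over the same noise levels.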


## Conclusion

- Using only the "gradient loss", we obtain a test accuracy close to the baseline (0.8996 versus 0.9097), while the robustness to Gaussian noise greatly outperforms the baseline: with noise of standard deviation 0.1, accuracy falls to 0.8681 for the gradient-loss model and to 0.7995 for the baseline.
- Because the "gradient loss" computes gradients twice (once with respect to x and y, and once with respect to the model variables), the training duration is roughly doubled.
- We also tested non-Gaussian noise, such as flipping a given ratio of pixels. In that case the baseline model turns out to be more robust than the model trained with the "gradient loss". The difference is that pixel-flipping noise is not continuous: the noised images jump to other local minima of the loss function.
- We also trained the baseline model with MSE as the loss. Its final performance is slightly worse than with categorical cross-entropy, as expected, and the conclusions above are unchanged.