## Motivation

In this notebook, we examine the idea introduced in section 1.5.5 on the Fashion MNIST data, a classification task.

The model follows: https://www.kaggle.com/code/m0hand/fashion-mnist-cnn

In [1]:
import numpy as np
import tensorflow as tf
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Sequential
from keras.losses import MSE, CategoricalCrossentropy
from sklearn.metrics import accuracy_score
from tqdm import tqdm

from utils import get_gradient_loss_fn

tf.random.set_seed(42)



## The Fashion MNIST Data

In [2]:
mnist = tf.keras.datasets.fashion_mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = tf.convert_to_tensor(x_train / 255.0, 'float32')
x_test = tf.convert_to_tensor(x_test / 255.0, 'float32')
y_train = tf.one_hot(y_train, 10, dtype='float32')
y_test = tf.one_hot(y_test, 10, dtype='float32')

x_train.shape, y_train.shape

(TensorShape([60000, 28, 28]), TensorShape([60000, 10]))
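
The tensors above have shape (N, 28, 28), while the convolutional models in this notebook declare `input_shape=(28,28,1)`. The notebook ran as shown, so the Keras version used evidently accepted the rank-3 input; adding the channel axis explicitly removes the ambiguity. Below is a minimal NumPy sketch of the same preprocessing on stand-in data. The random `images` and `labels` arrays are assumptions for illustration, not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)  # stand-in for raw pixels
labels = rng.integers(0, 10, size=8)                             # stand-in for class ids

# Scale pixels to [0, 1] and append an explicit channel axis.
x = (images / 255.0).astype('float32')[..., np.newaxis]
# One-hot encode the labels, mirroring tf.one_hot(labels, 10).
y = np.eye(10, dtype='float32')[labels]

print(x.shape, y.shape)  # (8, 28, 28, 1) (8, 10)
```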

## Train a Model with Gradient Loss

In [3]:
model = Sequential([ 
    Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    Flatten(),
    Dense(64, 'relu'),
    Dense(10, 'softmax')
])

gradient_loss_fn = get_gradient_loss_fn(
    lambda inputs: MSE(inputs[1], model(inputs[0]))
)

In [4]:
optimizer = tf.optimizers.Adam()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = gradient_loss_fn((x, y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

In [5]:
def evaluate(model):
    return accuracy_score(
        tf.argmax(y_test, axis=1),
        tf.argmax(model(x_test), axis=1),
    )

In [6]:
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.batch(100)

In [7]:
for epoch in range(10):
    for x, y in tqdm(ds):
        loss = train_step(x, y)
    print(epoch, loss)

100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.34it/s]


0 tf.Tensor(0.0008826909, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.24it/s]


1 tf.Tensor(0.00066508434, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.39it/s]


2 tf.Tensor(0.0005956925, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.25it/s]


3 tf.Tensor(0.000531269, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.50it/s]


4 tf.Tensor(0.00045660275, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.35it/s]


5 tf.Tensor(0.00045052278, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.20it/s]


6 tf.Tensor(0.0004704871, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.26it/s]


7 tf.Tensor(0.00045560423, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:58<00:00, 10.32it/s]


8 tf.Tensor(0.00042795483, shape=(), dtype=float32)


100%|██████████████████████████████████████████████████████████| 600/600 [00:57<00:00, 10.38it/s]

9 tf.Tensor(0.0004996294, shape=(), dtype=float32)





In [8]:
evaluate(model)

2024-03-20 09:41:55.163094: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.
2024-03-20 09:41:55.628086: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.
2024-03-20 09:41:56.098282: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.
2024-03-20 09:41:56.524981: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 432640000 exceeds 10% of free system memory.


0.9033
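
The allocator warnings above are consistent with pushing all 10,000 test images through the network in a single call: the first convolution's output holds 10000 × 26 × 26 × 64 float32 values, which is exactly the 1,730,560,000 bytes reported, and 432,640,000 bytes matches the pooled 13 × 13 × 64 output. Evaluating in batches keeps only one batch of activations alive at a time. The sketch below illustrates this with NumPy; `batched_accuracy` and `predict_fn` are hypothetical names, and the toy classifier stands in for the real model.

```python
import numpy as np

def batched_accuracy(predict_fn, x, labels, batch_size=1000):
    """Accuracy computed batch by batch.

    predict_fn is a hypothetical callable that maps a batch of inputs
    to integer class ids.
    """
    correct = 0
    for start in range(0, len(x), batch_size):
        batch = slice(start, start + batch_size)
        correct += int(np.sum(predict_fn(x[batch]) == labels[batch]))
    return correct / len(x)

# Size in bytes of the first Conv2D output for the full test set (float32 = 4 bytes).
first_conv_bytes = 10000 * 26 * 26 * 64 * 4

# Toy check with a stand-in classifier: the sign of the feature sum.
rng = np.random.default_rng(0)
toy_x = rng.standard_normal((2500, 4))
toy_labels = (toy_x.sum(axis=1) > 0).astype(int)
toy_predict = lambda batch: (batch.sum(axis=1) > 0).astype(int)

print(first_conv_bytes)                                      # 1730560000
print(batched_accuracy(toy_predict, toy_x, toy_labels))      # 1.0
```

In the notebook itself, `model.predict(x_test, batch_size=...)` followed by an argmax would serve the same purpose.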

## Baseline Model with Usual Loss

In [9]:
baseline_model = Sequential([ 
    Conv2D(64, (3, 3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2, 2)),
    
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),

    Flatten(),
    Dense(64, 'relu'),
    Dense(10, 'softmax')
])

In [10]:
baseline_model.compile(
    optimizer='adam',
    loss=CategoricalCrossentropy(),
    # loss=MSE,
    metrics=['accuracy'],
)

In [11]:
baseline_model.fit(
    x_train, y_train,
    epochs=20,
    validation_data=(x_test, y_test),
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7f919c2c4750>

In [12]:
evaluate(baseline_model)

2024-03-20 09:50:47.587941: W tensorflow/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1730560000 exceeds 10% of free system memory.


0.907

## Model Robustness

Now we compare the robustness of the gradient-loss model and the baseline. To do so, we add zero-mean Gaussian noise (standard deviation 0.1, on pixel values scaled to [0, 1]) to the test images and measure the accuracy again.

In [17]:
stddev = 1e-1
noise = tf.random.normal(tf.shape(x_test)) * stddev

In [18]:
def evaluate_robustness(model):
    accuracy = accuracy_score(
        tf.argmax(y_test, axis=1), tf.argmax(model(x_test), axis=1))
    noised_accuracy = accuracy_score(
        tf.argmax(y_test, axis=1), tf.argmax(model(x_test+noise), axis=1))
    print(f'Accuracy: {accuracy} -> {noised_accuracy}')

In [19]:
evaluate_robustness(model)

Accuracy: 0.9033 -> 0.8619


In [20]:
evaluate_robustness(baseline_model)

Accuracy: 0.907 -> 0.8154
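
The comparison above uses a single noise level (`stddev = 1e-1`). A natural extension is to sweep several levels and compare the resulting accuracy curves. The sketch below shows the idea with NumPy on a toy nearest-centroid classifier; `accuracy_under_noise`, `predict_fn`, and the toy data are all hypothetical stand-ins, not part of the original experiment.

```python
import numpy as np

def accuracy_under_noise(predict_fn, x, labels, stddevs, seed=0):
    """Accuracy of predict_fn on x plus Gaussian noise, for each stddev.

    The same unit-variance noise pattern is rescaled for every level,
    so the levels differ only in magnitude.
    """
    rng = np.random.default_rng(seed)
    unit_noise = rng.standard_normal(x.shape)
    results = {}
    for stddev in stddevs:
        preds = predict_fn(x + stddev * unit_noise)
        results[stddev] = float(np.mean(preds == labels))
    return results

# Toy setup: two classes located exactly at their centroids.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_labels = np.random.default_rng(1).integers(0, 2, size=200)
toy_x = centroids[toy_labels]

def toy_predict(batch):
    # Assign each point to the nearest centroid.
    dists = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

curve = accuracy_under_noise(toy_predict, toy_x, toy_labels, [0.0, 0.1, 1.0])
for stddev, acc in curve.items():
    print(f'stddev={stddev}: accuracy={acc:.3f}')
```

For the models in this notebook, `predict_fn` could be something like `lambda batch: tf.argmax(model(batch), axis=1).numpy()`, with `x_test` converted to a NumPy array first.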


## Conclusion

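By simply training with the "gradient loss", we obtain a clean test accuracy close to the baseline (0.9033 versus 0.907). The gradient-loss model, however, is markedly more robust: under Gaussian noise with standard deviation 0.1, its accuracy falls to 0.8619 (a drop of about 4.1 points), while the baseline falls to 0.8154 (a drop of about 9.2 points). The price is training time: because the "gradient loss" computes gradients twice (once with respect to x and y, and once with respect to the model variables), the training duration is doubled.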
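
The two differentiation passes mentioned above can be made concrete with nested gradient tapes. The source of `utils.get_gradient_loss_fn` is not shown in this notebook, so the sketch below is only one plausible reading: it assumes the "gradient loss" is the batch-averaged squared gradient of the loss with respect to the inputs. `make_input_gradient_loss` and the toy model are hypothetical.

```python
import tensorflow as tf

def make_input_gradient_loss(loss_fn):
    """Hypothetical stand-in for utils.get_gradient_loss_fn (real source not shown).

    Assumption: the penalty is the batch-averaged squared gradient of the
    loss with respect to the inputs x and y.
    """
    def gradient_loss(inputs):
        x, y = inputs
        with tf.GradientTape() as inner:
            inner.watch([x, y])  # plain tensors must be watched explicitly
            loss = tf.reduce_sum(loss_fn((x, y)))
        # First differentiation: with respect to the inputs.
        grad_x, grad_y = inner.gradient(loss, [x, y])
        squared = tf.reduce_sum(tf.square(grad_x)) + tf.reduce_sum(tf.square(grad_y))
        return squared / tf.cast(tf.shape(x)[0], squared.dtype)
    return gradient_loss

tf.random.set_seed(0)
toy_model = tf.keras.Sequential([tf.keras.layers.Dense(3, activation='softmax')])
x = tf.random.normal((4, 5))
y = tf.one_hot([0, 1, 2, 0], 3)
toy_model(x)  # build the weights before taping

toy_loss_fn = make_input_gradient_loss(
    lambda inputs: tf.reduce_mean(tf.square(inputs[1] - toy_model(inputs[0])), axis=-1)
)

with tf.GradientTape() as outer:
    value = toy_loss_fn((x, y))
# Second differentiation: with respect to the model variables,
# back through the first gradient computation.
grads = outer.gradient(value, toy_model.trainable_variables)
print(value.shape, [g.shape for g in grads])
```

Because the outer tape has to backpropagate through the inner backward pass, each training step performs roughly two rounds of differentiation, which matches the cost described in the conclusion.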