# Demo 0: Example and usage

To keep things simple, the following rules were followed during
development:

-   `deel-lip` follows the `keras` package structure.
-   All elements (layers, activations, initializers, ...) are compatible
    with the standard `keras` elements.
-   When a k-Lipschitz layer overrides a standard keras layer, it uses
    the same interface and the same parameters. The only difference is a
    new parameter to control the Lipschitz constant of a layer.
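
To make the notion of a Lipschitz constant concrete, here is a minimal sketch in plain Python. It does not use the `deel-lip` API: `lipschitz_lower_bound` and `make_scaled_rotation` are hypothetical helpers written for this illustration. A rotation preserves distances (it is exactly 1-Lipschitz), so scaling it by `k` gives a map whose Lipschitz constant is `k`.

```python
import math
import random


def lipschitz_lower_bound(f, dim, n_pairs=2000, seed=0):
    """Largest observed ratio ||f(x) - f(y)|| / ||x - y|| over random pairs.

    This is only an empirical lower bound of the true Lipschitz constant.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        y = [rng.gauss(0, 1) for _ in range(dim)]
        best = max(best, math.dist(f(x), f(y)) / math.dist(x, y))
    return best


def make_scaled_rotation(k):
    """Hypothetical toy layer: a 2D rotation scaled by the factor k."""
    c, s = math.cos(0.3), math.sin(0.3)
    return lambda v: [k * (c * v[0] - s * v[1]), k * (s * v[0] + c * v[1])]


for k in (1.0, 0.5, 3.0):
    estimate = lipschitz_lower_bound(make_scaled_rotation(k), dim=2)
    print(f"k = {k}: estimated Lipschitz constant = {estimate:.6f}")
```

For a generic network the sampled ratio only bounds the constant from below, which is why constrained layers are needed to obtain a guarantee.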

## Which layers are safe to use?

The following table indicates which layers are safe to use in a Lipschitz
network, and which are not.

| layer                                                                                         | 1-lip? | deel-lip equivalent                                                                                         | comments                                                                          |
|-----------------------------------------------------------------------------------------------|--------|-------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| `Dense`                                                                                       | no     | `SpectralDense`<br>`FrobeniusDense`                       | `SpectralDense` and `FrobeniusDense` are similar when there is a single output. |
| `Conv2D`                                                                                      | no     | `SpectralConv2D`<br>`FrobeniusConv2D`                     | `SpectralConv2D` also implements Björck normalization.                           |
| `MaxPooling`<br>`GlobalMaxPooling`             | yes    | n/a                                                                                                         |                                                                                   |
| `AveragePooling2D`<br>`GlobalAveragePooling2D` | no     | `ScaledAveragePooling2D`<br>`ScaledGlobalAveragePooling2D` | The Lipschitz constant is bounded by `sqrt(pool_h * pool_w)`.                     |
| `Flatten`                                                                                     | yes    | n/a                                                                                                         |                                                                                   |
| `Dropout`                                                                                     | no     | None                                                                                                        | The Lipschitz constant is bounded by the dropout factor.                          |
| `BatchNormalization`                                                                          | no     | None                                                                                                        | We suspect that layer normalization already limits internal covariate shift.      |
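
The `Dense` row can be illustrated with a small sketch in plain Python. It is not the `deel-lip` implementation: `spectral_norm`, `frobenius_norm` and `matvec` are hypothetical helpers, the weight matrix is stored with one row per output, and the power iteration shown here is just one common way to estimate the largest singular value. The sketch shows that an unconstrained weight matrix can have a spectral norm above 1, and that the spectral and Frobenius norms coincide when there is a single output.

```python
import math


def frobenius_norm(w):
    """Square root of the sum of squared entries of w (a list of rows)."""
    return math.sqrt(sum(v * v for row in w for v in row))


def matvec(w, u):
    """Matrix-vector product w @ u."""
    return [sum(a * b for a, b in zip(row, u)) for row in w]


def spectral_norm(w, n_iter=200):
    """Largest singular value of w, estimated by power iteration on w^T w."""
    # The start vector must not be orthogonal to the top singular vector.
    u = [1.0] * len(w[0])
    for _ in range(n_iter):
        wu = matvec(w, u)
        # v = w^T (w u)
        v = [sum(w[i][j] * wu[i] for i in range(len(w))) for j in range(len(u))]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return math.sqrt(sum(x * x for x in matvec(w, u)))


w = [[3.0, 0.0], [0.0, 4.0]]  # two outputs, singular values 3 and 4
single = [[3.0, 4.0]]         # a single output

print(spectral_norm(w), frobenius_norm(w))            # the two norms differ
print(spectral_norm(single), frobenius_norm(single))  # the two norms agree

# Dividing by the spectral norm yields a 1-Lipschitz linear map.
sigma = spectral_norm(w)
w_normalized = [[v / sigma for v in row] for row in w]
print(spectral_norm(w_normalized))
```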

## Design tips

Lipschitz networks must be designed carefully in order to avoid
vanishing or exploding gradients.

Choosing pooling layers:

| layer                                                                            | advantages                                                                   | disadvantages                                                                      |
|----------------------------------------------------------------------------------|------------------------------------------------------------------------------|------------------------------------------------------------------------------------|
| `ScaledAveragePooling2D` and `MaxPooling2D`                                      | very similar to original implementation (just add a scaling factor for avg). | not norm preserving nor gradient norm preserving.                                  |
| `InvertibleDownSampling`                                                         | norm preserving and gradient norm preserving.                                | increases the number of channels (and the number of parameters of the next layer). |
| `ScaledL2NormPooling2D` (_sqrt(avgpool(x\*\*2))_) | norm preserving.                                                             | lower numerical stability of the gradient when inputs are close to zero.           |
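
The norm-preservation column can be checked numerically. The sketch below is a one-dimensional toy in plain Python, not the `deel-lip` layers. It assumes non-overlapping windows and that "scaled" means multiplying by `sqrt(pool)`, so that L2-norm pooling returns the L2 norm of each window. The helper names are hypothetical.

```python
import math


def l2(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def scaled_l2_norm_pool_1d(x, pool):
    """sqrt(pool) * sqrt(avgpool(x**2)): the L2 norm of each window."""
    assert len(x) % pool == 0
    return [
        math.sqrt(pool) * math.sqrt(sum(v * v for v in x[i:i + pool]) / pool)
        for i in range(0, len(x), pool)
    ]


def scaled_avg_pool_1d(x, pool):
    """Average pooling rescaled by sqrt(pool) (assumed scaling factor)."""
    assert len(x) % pool == 0
    return [
        math.sqrt(pool) * sum(x[i:i + pool]) / pool
        for i in range(0, len(x), pool)
    ]


x = [3.0, -4.0, 1.0, 2.0, 0.0, -2.0]
print("input norm:          ", l2(x))
print("L2-norm pooling norm:", l2(scaled_l2_norm_pool_1d(x, 2)))  # same as input
print("scaled avg pool norm:", l2(scaled_avg_pool_1d(x, 2)))      # smaller
```

The square root in the L2-norm pooling has a gradient of the form `x / ||x||`, which is why the table warns about numerical stability when a whole window is close to zero.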

Choosing activations:

| layer                                                                  | advantages                                                                                   | disadvantages                                                                                  |
|------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| `ReLU`                                                                 |                                                                                              | creates a strong vanishing gradient effect. If you manage to learn with it, please call 911.   |
| `MaxMin` (_stack(\[ReLU(x), ReLU(-x)\])_) | has properties similar to ReLU, but is norm and gradient norm preserving.                    | doubles the number of outputs.                                                                 |
| `GroupSort`                                                            | input norm and gradient norm preserving. Also limits the need for biases (as it is shift invariant). | more computationally expensive (when its parameter _n_ is large). |

Please note that when learning with the `HKR_loss` and `HKR_multiclass_loss`, no
activation is required on the last layer.
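
The properties listed in the activation table can be checked on a small vector. This is a sketch in plain Python rather than the `deel-lip` activations: `max_min` follows the formula given in the table, while `group_sort` assumes an ascending sort inside consecutive groups of size `n` (the ordering inside a group does not affect the norm).

```python
import math


def l2(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def relu(x):
    return [max(v, 0.0) for v in x]


def max_min(x):
    """stack([ReLU(x), ReLU(-x)]): twice as many outputs as inputs."""
    return relu(x) + relu([-v for v in x])


def group_sort(x, n=2):
    """Sort the values inside consecutive groups of size n (assumed ascending)."""
    assert len(x) % n == 0
    out = []
    for i in range(0, len(x), n):
        out.extend(sorted(x[i:i + n]))
    return out


x = [3.0, -4.0, -1.0, 2.0]
print("input     ", l2(x))
print("ReLU      ", l2(relu(x)))        # negative entries are lost
print("MaxMin    ", l2(max_min(x)))     # same norm as the input
print("GroupSort ", l2(group_sort(x)))  # same norm as the input

# Adding a constant to every input adds the same constant to every output.
shifted = group_sort([v + 10.0 for v in x])
print(shifted == [v + 10.0 for v in group_sort(x)])
```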


### How to use it?
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deel-ai/deel-lip/blob/master/docs/notebooks/demo0.ipynb)

Here is an example of a 1-Lipschitz network trained on MNIST:

In [1]:
from deel.lip.layers import (
    SpectralDense,
    SpectralConv2D,
    ScaledL2NormPooling2D,
    FrobeniusDense,
)
from deel.lip.model import Sequential
from deel.lip.activations import GroupSort
from deel.lip.losses import MulticlassHKR, MulticlassKR
from keras.layers import Input, Flatten
from keras.optimizers import Adam
from keras.datasets import mnist
from keras.utils import to_categorical
import numpy as np




In [2]:
# load data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# add a channel axis and rescale the pixel values to [0, 1]
x_train = np.expand_dims(x_train, -1) / 255
x_test = np.expand_dims(x_test, -1) / 255
# one hot encode the labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

In [3]:
# move the channel axis first: (N, 28, 28, 1) -> (N, 1, 28, 28)
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

In [4]:
x_train.shape

(60000, 1, 28, 28)

In [6]:
np.min(x_train)

0.0

In [7]:
# Sequential (resp. Model) from deel.lip.model has the same properties as any Lipschitz model.
# It acts only as a container, with features specific to Lipschitz
# functions (condensation, vanilla export, ...), and its layers are fully compatible
# with the standard keras Sequential/Model classes.
model = Sequential(
    [
        Input(shape=x_train.shape[1:]),
        # Lipschitz layers preserve the API of their superclass (here Conv2D);
        # an optional parameter, k_coef_lip, controls the Lipschitz
        # constant of the layer
        SpectralConv2D(
            filters=16,
            kernel_size=(3, 3),
            activation=GroupSort(2),
            use_bias=True,
            kernel_initializer="orthogonal",
        ),
        # the usual pooling layers are implemented (avg, max, ...), and new layers are also available
        ScaledL2NormPooling2D(pool_size=(2, 2), data_format="channels_first"),
        SpectralConv2D(
            filters=16,
            kernel_size=(3, 3),
            activation=GroupSort(2),
            use_bias=True,
            kernel_initializer="orthogonal",
        ),
        ScaledL2NormPooling2D(pool_size=(2, 2), data_format="channels_first"),
        # our layers are fully interoperable with existing keras layers
        Flatten(),
        SpectralDense(
            32,
            activation=GroupSort(2),
            use_bias=True,
            kernel_initializer="orthogonal",
        ),
        SpectralDense(
            10, activation=None, use_bias=False, kernel_initializer="orthogonal"
        ),
    ],
    # similarly, the model has a k_coef_lip parameter that automatically
    # sets the Lipschitz constant of each layer
    k_coef_lip=1.0,
    name="hkr_model",
)
model.summary()



In [8]:
# HKR (Hinge-Kantorovich-Rubinstein) optimizes robustness along with accuracy
model.compile(
    # decreasing alpha and increasing min_margin improve robustness (at the cost of accuracy)
    # note also that, for Lipschitz networks, more robustness requires more parameters.
    loss=MulticlassHKR(alpha=50, min_margin=0.05),
    optimizer=Adam(1e-3),
    metrics=["accuracy", MulticlassKR()],
)
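
To give an intuition for the roles of `alpha` and `min_margin`, here is a simplified binary sketch in plain Python. It is not the `MulticlassHKR` implementation: it assumes labels in {+1, -1} and a loss of the form `alpha * hinge - KR`, and the exact weighting and the multiclass formulation may differ between `deel-lip` versions. All function names are hypothetical.

```python
def kr_term(y_true, y_pred):
    """Kantorovich-Rubinstein term: mean score of class +1 minus mean score of class -1.

    Both classes must be present in the batch.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t > 0]
    neg = [p for t, p in zip(y_true, y_pred) if t <= 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def hinge_term(y_true, y_pred, min_margin):
    """Mean hinge penalty for samples whose signed score is below min_margin."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        sign = 1.0 if t > 0 else -1.0
        total += max(0.0, min_margin - sign * p)
    return total / len(y_true)


def hkr(y_true, y_pred, alpha, min_margin):
    """Assumed combination: alpha * hinge - KR (lower is better)."""
    return alpha * hinge_term(y_true, y_pred, min_margin) - kr_term(y_true, y_pred)


y_true = [1, 1, -1, -1]
y_pred = [0.8, 0.02, -0.5, -0.3]
print(hinge_term(y_true, y_pred, min_margin=0.05))
print(kr_term(y_true, y_pred))
print(hkr(y_true, y_pred, alpha=50, min_margin=0.05))
```

Under this assumed form, a smaller `alpha` gives more weight to the KR term, which pushes the scores of the two classes apart, and the loss becomes negative once the KR term dominates the hinge penalty.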

In [9]:
# fit the model
model.fit(
    x_train,
    y_train,
    batch_size=2048,
    epochs=100,
    validation_data=(x_test, y_test),
    shuffle=True,
)

Epoch 1/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 17s 173ms/step - MulticlassKR: 0.0759 - accuracy: 0.4678 - loss: 1.7830 - val_MulticlassKR: 0.2074 - val_accuracy: 0.8879 - val_loss: 0.1568
Epoch 2/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - MulticlassKR: 0.2320 - accuracy: 0.8972 - loss: 0.1052 - val_MulticlassKR: 0.3192 - val_accuracy: 0.9296 - val_loss: -0.0712
Epoch 3/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - MulticlassKR: 0.3485 - accuracy: 0.9305 - loss: -0.0981 - val_MulticlassKR: 0.4753 - val_accuracy: 0.9446 - val_loss: -0.2501
Epoch 4/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - MulticlassKR: 0.5230 - accuracy: 0.9433 - loss: -0.2786 - val_MulticlassKR: 0.7108 - val_accuracy: 0.9473 - val_loss: -0.4603
Epoch 5/100
30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - MulticlassKR: 0.7692 - accuracy: 0.9466 - loss: -0.5050 - val_Multicl


In [10]:
# once training is finished, you can convert SpectralDense layers into
# Dense layers and SpectralConv2D layers into Conv2D layers,
# which improves inference performance
vanilla_model = model.vanilla_export()

In [13]:
# returns [loss, accuracy, MulticlassKR], in the order given to model.compile
vanilla_model.evaluate(x_test, y_test)

313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - MulticlassKR: 2.4034 - accuracy: 0.9600 - loss: -1.9218

[-2.073857307434082, 0.9672999978065491, 2.5064444541931152]
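
The idea behind exporting to plain layers can be pictured with a toy example in plain Python. This is only an illustration under stated assumptions, not a description of how `vanilla_export` works internally: `ToyNormalizedDense` and `ToyDense` are hypothetical classes, and the constraint is a simple division by the weight norm of a single-output layer.

```python
import math


class ToyDense:
    """Plain single-output linear layer with fixed weights."""

    def __init__(self, weights):
        self.weights = list(weights)

    def __call__(self, x):
        return sum(w * v for w, v in zip(self.weights, x))


class ToyNormalizedDense:
    """Hypothetical constrained layer: weights are re-normalized on every call."""

    def __init__(self, raw_weights):
        self.raw_weights = list(raw_weights)

    def effective_weights(self):
        norm = math.sqrt(sum(w * w for w in self.raw_weights))
        return [w / norm for w in self.raw_weights]

    def __call__(self, x):
        return sum(w * v for w, v in zip(self.effective_weights(), x))

    def export(self):
        # Compute the normalized weights once and store them in a plain layer.
        return ToyDense(self.effective_weights())


layer = ToyNormalizedDense([3.0, 4.0])
plain = layer.export()
x = [1.0, 2.0]
print(layer(x), plain(x))  # same prediction, no normalization at inference time
```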

In [11]:
model.save('/home/aws_install/robustess_project/lip_models/demo0_MNIST_channelfirst_False_disj_Neurons.keras')
vanilla_model.save("/home/aws_install/robustess_project/lip_models/demo0_vanilla_MNIST_channelfirst_False_disj_Neurons.keras")