# Convert tf.keras model to ONNX

This tutorial shows:
- how to convert a tf.keras model to ONNX, either directly from the source code or from a saved model file.
- how CPU inference time compares between the tf.keras model and the converted ONNX model.

## Install ONNX dependencies
- `tf2onnx` provides a tool to convert TensorFlow models to ONNX.
- `onnxruntime` is used to run inference on a saved ONNX model.

In [None]:
!pip install -Uqq tf2onnx
!pip install -Uqq onnxruntime

### Imports

In [1]:
import tf2onnx
import pandas as pd
import tensorflow as tf
import numpy as np

### Get a sample model 

In [2]:
model = tf.keras.applications.ResNet50(include_top=True, input_shape=(224, 224, 3))

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5


## Convert to ONNX

In [3]:
num_layers = len(model.layers)
print(f'first layer name: {model.layers[0].name}')
print(f'last layer name: {model.layers[num_layers-1].name}')

first layer name: input_1
last layer name: predictions


### Conversion

`opset` in `tf2onnx.convert.from_keras` is the ONNX opset version to target. The full list of TF ops that can be converted to ONNX ops is available [here](https://github.com/onnx/tensorflow-onnx/blob/master/support_status.md).

In [4]:
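import onnx

# The name given here ('input_1') is the input name used with `sess.run` below.
input_signature = [tf.TensorSpec([None, 224, 224, 3], tf.float32, name='input_1')]
# from_keras returns the ONNX model and an external tensor storage object (unused here).
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=15)
onnx.save(onnx_model, "my_model.onnx")

# Alternative: save the model to disk and convert it with the tf2onnx command line.
# model.save('my_model')
# !python -m tf2onnx.convert --saved-model my_model --output my_model.onnx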

Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
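After saving, it can be useful to confirm that the file is a structurally valid ONNX model and to see which opset and input names it ended up with. The following is a minimal sketch, not part of the original notebook; the helper name `summarize_onnx_file` is hypothetical, and it assumes the `onnx` package is installed.

```python
def summarize_onnx_file(path):
    """Validate an ONNX file and return its opset versions and graph input names.

    `summarize_onnx_file` is a hypothetical helper name used for illustration.
    """
    import onnx  # imported lazily so that defining the helper does not require onnx

    model_proto = onnx.load(path)
    # Raises an exception if the model is structurally invalid.
    onnx.checker.check_model(model_proto)
    opsets = [opset.version for opset in model_proto.opset_import]
    input_names = [graph_input.name for graph_input in model_proto.graph.input]
    return {"opsets": opsets, "inputs": input_names}
```

In this notebook it could be called as `summarize_onnx_file("my_model.onnx")`.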


## Test TF vs ONNX model with dummy data

### Generate dummy data 

In [5]:
dummy_inputs = tf.random.normal((32, 224, 224, 3))

### Test original TF model with dummy data

In [7]:
%%timeit
model(dummy_inputs)

1 loop, best of 5: 5.43 s per loop


In [11]:
tf_preds = model(dummy_inputs)
print(tf_preds)
print(tf.argmax(tf_preds, axis=1))

tf.Tensor(
[[7.7266595e-06 1.2607347e-04 3.5493996e-04 ... 3.9009101e-06
  6.4691136e-05 2.9123765e-03]
 [1.0709026e-05 1.3441169e-04 3.6555584e-04 ... 5.0727599e-06
  6.6014669e-05 3.2316628e-03]
 [8.6340870e-06 1.3047195e-04 3.3897307e-04 ... 4.2910137e-06
  6.6816639e-05 3.1956623e-03]
 ...
 [1.0132569e-05 1.4086883e-04 3.5038366e-04 ... 5.2608348e-06
  6.4157386e-05 3.2055001e-03]
 [6.7479391e-06 1.2649213e-04 2.8330638e-04 ... 4.4393978e-06
  6.4049018e-05 3.0844919e-03]
 [8.6630262e-06 1.2813542e-04 2.9517489e-04 ... 5.8516707e-06
  7.2770934e-05 3.3875057e-03]], shape=(32, 1000), dtype=float32)
tf.Tensor(
[783 783 783 783 783 783 783 783 783 783 783 783 783 783 783 783 783 783
 783 783 783 783 783 783 783 783 783 783 783 783 783 783], shape=(32,), dtype=int64)


### Test converted ONNX model with dummy data

To run inference on a GPU, pass `providers=["CUDAExecutionProvider"]` to `ort.InferenceSession` (this requires the `onnxruntime-gpu` package).

Passing `None` as the first argument of `sess.run` returns all of the model's outputs.

In [12]:
import onnxruntime as ort
import numpy as np

# For GPU inference, pass providers=["CUDAExecutionProvider"] as a second argument.
sess = ort.InferenceSession("my_model.onnx")
np_dummy_inputs = dummy_inputs.numpy()
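The key of the feed dictionary passed to `sess.run` has to be an input name of the ONNX model. As a small sketch (the helper name `describe_io` is an assumption, not part of the original notebook), the session's `get_inputs()` and `get_outputs()` methods can be used to list those names together with shapes and types:

```python
def describe_io(session):
    """Return (inputs, outputs) as lists of (name, shape, type) tuples.

    `session` is expected to behave like an onnxruntime.InferenceSession,
    i.e. to provide get_inputs() and get_outputs().
    """
    inputs = [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()]
    outputs = [(arg.name, arg.shape, arg.type) for arg in session.get_outputs()]
    return inputs, outputs
```

In this notebook, `describe_io(sess)` would be expected to list `input_1` among the inputs, since that is the name used in the conversion step.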

In [13]:
%%timeit 
sess.run(None, {"input_1": np_dummy_inputs})

1 loop, best of 5: 3.97 s per loop


In [14]:
ort_preds = sess.run(None, {"input_1": np_dummy_inputs})
print(ort_preds)
print(np.argmax(ort_preds[0], axis=1))

[array([[7.7266932e-06, 1.2607401e-04, 3.5494129e-04, ..., 3.9009115e-06,
        6.4691383e-05, 2.9123744e-03],
       [1.0709014e-05, 1.3441198e-04, 3.6555488e-04, ..., 5.0727685e-06,
        6.6014931e-05, 3.2316626e-03],
       [8.6341424e-06, 1.3047228e-04, 3.3897266e-04, ..., 4.2910538e-06,
        6.6816901e-05, 3.1956607e-03],
       ...,
       [1.0132635e-05, 1.4086922e-04, 3.5038497e-04, ..., 5.2608593e-06,
        6.4157597e-05, 3.2054954e-03],
       [6.7478845e-06, 1.2649220e-04, 2.8330489e-04, ..., 4.4393792e-06,
        6.4049040e-05, 3.0844780e-03],
       [8.6630653e-06, 1.2813594e-04, 2.9517466e-04, ..., 5.8516916e-06,
        7.2771159e-05, 3.3875043e-03]], dtype=float32)]
[783 783 783 783 783 783 783 783 783 783 783 783 783 783 783 783 783 783
 783 783 783 783 783 783 783 783 783 783 783 783 783 783]
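The two printed outputs agree to several significant digits, and both models predict class 783 for every sample. Rather than comparing by eye, the agreement can be checked numerically. The sketch below uses only the six visible values of the first row copied from the outputs above; the tolerances are assumptions chosen for float32 results.

```python
import numpy as np

# Visible values of the first output row, copied from the printed
# TF and ONNX Runtime results above.
tf_row = np.array([7.7266595e-06, 1.2607347e-04, 3.5493996e-04,
                   3.9009101e-06, 6.4691136e-05, 2.9123765e-03], dtype=np.float32)
ort_row = np.array([7.7266932e-06, 1.2607401e-04, 3.5494129e-04,
                    3.9009115e-06, 6.4691383e-05, 2.9123744e-03], dtype=np.float32)

# Largest element-wise difference between the two runtimes.
max_abs_diff = float(np.max(np.abs(tf_row - ort_row)))
# Tolerances are assumptions; adjust them to the precision you need.
outputs_match = bool(np.allclose(tf_row, ort_row, rtol=1e-4, atol=1e-6))
print(f"max abs diff: {max_abs_diff:.2e}, match: {outputs_match}")
```

In the notebook itself, the same check can be applied to the full arrays with `np.allclose(tf_preds.numpy(), ort_preds[0], rtol=1e-4, atol=1e-6)`.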


## Conclusion

We ran a simple experiment on dummy data with a batch size of 32. Note that the `%%timeit` output above reports the best of 5 runs ("1 loop, best of 5"), not an average; newer IPython versions instead report the mean and standard deviation over 7 runs by default ([`timeit`'s default behaviour](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit)).
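Because the statistic reported by `%%timeit` depends on the IPython version, it can help to measure both the best and the mean time explicitly. The following sketch uses the standard library `timeit` module; the helper name `time_call` and the stand-in workload are assumptions for illustration.

```python
import timeit

def time_call(fn, repeat=5, number=1):
    """Time fn() and return the best and mean seconds per call."""
    # timeit.repeat returns one total time per repetition of `number` calls.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return {"best": min(per_call), "mean": sum(per_call) / len(per_call)}

# Stand-in workload so that the example runs anywhere.
stats = time_call(lambda: sum(range(10_000)))
print(stats)
```

In the notebook, the workloads would be `lambda: model(dummy_inputs)` and `lambda: sess.run(None, {"input_1": np_dummy_inputs})`.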


In this run, the TF implementation of ResNet50 took about 5.43 s for the batch, while the converted ONNX model took about 3.97 s (best of 5 in both cases). The ONNX model was therefore about 1.37 times faster on CPU, a reduction of roughly 27% in inference time for this workload.
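The size of the difference can be worked out directly from the two timings reported above. This is plain arithmetic on the measured numbers, not a new measurement:

```python
tf_seconds = 5.43    # tf.keras model, from the timeit output above
onnx_seconds = 3.97  # converted ONNX model, from the timeit output above

# Ratio of the two times and relative reduction in inference time.
speedup = tf_seconds / onnx_seconds
time_saved_pct = (tf_seconds - onnx_seconds) / tf_seconds * 100
print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
```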