# Convert tf.keras model to ONNX

This tutorial shows:
- how to convert a tf.keras model to ONNX, either from a saved model file or directly from source code.
- how the CPU inference time of the tf.keras model compares with that of the converted ONNX model.

## Install ONNX dependencies
- `tf2onnx` provides a tool to convert TensorFlow models to ONNX.
- `onnxruntime` is used to run inference on a saved ONNX model.

In [None]:
!pip install -Uqq tf2onnx
!pip install -Uqq onnxruntime

### Imports

In [3]:
import tf2onnx
import pandas as pd
import tensorflow as tf
import numpy as np

### Get a sample model 

In [10]:
core = tf.keras.applications.ResNet50(include_top=True, input_shape=(224, 224, 3))

inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="image_input")
preprocess = tf.keras.applications.resnet50.preprocess_input(inputs)
outputs = core(preprocess, training=False)
model = tf.keras.Model(inputs=[inputs], outputs=[outputs])

## Convert to ONNX

In [11]:
num_layers = len(model.layers)
print(f'first layer name: {model.layers[0].name}')
print(f'last layer name: {model.layers[num_layers-1].name}')

first layer name: image_input
last layer name: resnet50


### Conversion

`opset` in `tf2onnx.convert.from_keras` is the ONNX opset version that the converted model targets. The full list of TF ops that can be converted to ONNX ops is [here](https://github.com/onnx/tensorflow-onnx/blob/master/support_status.md).

There are two ways to convert a TensorFlow model to ONNX:
- `tf2onnx.convert.from_keras` to convert programmatically
- the `tf2onnx.convert` CLI to convert a saved TensorFlow model

In [12]:
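import onnx

# The name given here becomes the ONNX input name used later in sess.run;
# None leaves the batch dimension dynamic.
input_signature = [tf.TensorSpec([None, 224, 224, 3], tf.float32, name='image_input')]
# from_keras returns the ONNX model proto and an external tensor storage object (unused here).
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=15)
onnx.save(onnx_model, "my_model.onnx")

# Alternative: save the model and convert it with the CLI.
# model.save('my_model')
# !python -m tf2onnx.convert --saved-model my_model --output my_model.onnx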

## Test TF vs ONNX model with dummy data

### Generate dummy data 

In [13]:
dummy_inputs = tf.random.normal((32, 224, 224, 3))

### Test original TF model with dummy data

In [14]:
%%timeit
model.predict(dummy_inputs)

1 loop, best of 5: 4.32 s per loop


In [15]:
tf_preds = model.predict(dummy_inputs)
print(tf_preds)
print(tf.argmax(tf_preds, axis=1))

[[3.15686048e-05 3.87393957e-04 1.79542520e-04 ... 1.30086582e-05
  9.79110773e-05 3.81577667e-03]
 [3.48336362e-05 4.16526280e-04 1.93989443e-04 ... 1.55881808e-05
  1.13529946e-04 4.16510738e-03]
 [3.33409043e-05 4.33546113e-04 1.97361689e-04 ... 1.53988676e-05
  1.12037051e-04 4.24840674e-03]
 ...
 [2.89980180e-05 3.86187923e-04 1.73962238e-04 ... 1.27737330e-05
  1.00923535e-04 4.02137777e-03]
 [3.22057713e-05 3.88624350e-04 1.79953189e-04 ... 1.43914149e-05
  1.02170088e-04 3.93211702e-03]
 [3.32495474e-05 4.18914686e-04 1.89737257e-04 ... 1.49391553e-05
  1.15772651e-04 4.33536153e-03]]
tf.Tensor(
[664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664
 851 664 664 664 664 664 664 851 664 664 664 664 664 664], shape=(32,), dtype=int64)


### Test converted ONNX model with dummy data

To run inference on a GPU, pass `providers=["CUDAExecutionProvider"]` to `ort.InferenceSession` (this requires the `onnxruntime-gpu` package rather than `onnxruntime`).

The first parameter of `sess.run` is set to `None`, which means that all of the model's outputs are returned.

In [16]:
import onnxruntime as ort
import numpy as np

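# Runs on CPU here; pass providers=["CUDAExecutionProvider"] to use a GPU.
sess = ort.InferenceSession("my_model.onnx")
# onnxruntime takes NumPy arrays, so convert the TF tensor first.
np_dummy_inputs = dummy_inputs.numpy()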
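
The feed dictionary passed to `sess.run` is keyed by input name, and `None` asks for every output. As a minimal sketch, the helper below lists the names, shapes, and types a session exposes through `get_inputs()` and `get_outputs()`, which is a quick way to confirm that the input is called `image_input`. The name `describe_io` is hypothetical and not part of onnxruntime.

```python
def describe_io(session):
    """Return input and output metadata of an onnxruntime InferenceSession.

    `describe_io` is an illustrative helper, not an onnxruntime API. It relies
    only on `get_inputs()` / `get_outputs()`, whose items expose `.name`,
    `.shape`, and `.type`.
    """
    def as_rows(node_args):
        return [(arg.name, arg.shape, arg.type) for arg in node_args]

    return {
        "inputs": as_rows(session.get_inputs()),
        "outputs": as_rows(session.get_outputs()),
    }

# In the notebook, with the session created above:
#   io = describe_io(sess)
#   output_names = [name for name, _, _ in io["outputs"]]
#   sess.run(output_names, {"image_input": np_dummy_inputs})  # same result as passing None
```

Naming the outputs explicitly is mainly useful when a model has several outputs and only some of them are needed.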

In [18]:
%%timeit 
sess.run(None, {"image_input": np_dummy_inputs})

1 loop, best of 5: 3.64 s per loop


In [20]:
ort_preds = sess.run(None, {"image_input": np_dummy_inputs})
print(ort_preds)
print(np.argmax(ort_preds[0], axis=1))

[array([[3.1568543e-05, 3.8739358e-04, 1.7954242e-04, ..., 1.3008658e-05,
        9.7910888e-05, 3.8157741e-03],
       [3.4834065e-05, 4.1653065e-04, 1.9399117e-04, ..., 1.5588314e-05,
        1.1353097e-04, 4.1651078e-03],
       [3.3341144e-05, 4.3354733e-04, 1.9736330e-04, ..., 1.5398955e-05,
        1.1203749e-04, 4.2484212e-03],
       ...,
       [2.8998174e-05, 3.8618816e-04, 1.7396275e-04, ..., 1.2773765e-05,
        1.0092379e-04, 4.0213722e-03],
       [3.2205990e-05, 3.8862476e-04, 1.7995379e-04, ..., 1.4391498e-05,
        1.0217058e-04, 3.9321054e-03],
       [3.3249766e-05, 4.1891521e-04, 1.8973851e-04, ..., 1.4939254e-05,
        1.1577292e-04, 4.3353694e-03]], dtype=float32)]
[664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664 664
 851 664 664 664 664 664 664 851 664 664 664 664 664 664]
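
The two printed arrays agree to roughly six significant digits, and the predicted classes are identical. A minimal sketch of checking this numerically is shown below, using the six first-row values visible in the printed outputs above. The tolerance `rtol=1e-5` is an assumption chosen for float32 results, not a value from the original experiment.

```python
import numpy as np

# First-row values copied from the printed outputs above (truncated view).
tf_row = np.array([3.15686048e-05, 3.87393957e-04, 1.79542520e-04,
                   1.30086582e-05, 9.79110773e-05, 3.81577667e-03])
ort_row = np.array([3.1568543e-05, 3.8739358e-04, 1.7954242e-04,
                    1.3008658e-05, 9.7910888e-05, 3.8157741e-03])

max_abs_diff = np.max(np.abs(tf_row - ort_row))
# rtol=1e-5 is an assumed tolerance for float32 outputs.
outputs_match = np.allclose(tf_row, ort_row, rtol=1e-5, atol=0.0)
print(f"max abs diff: {max_abs_diff:.2e}, match: {outputs_match}")

# In the notebook, the full arrays could be compared instead
# (tolerances are assumptions and may need loosening):
#   np.testing.assert_allclose(tf_preds, ort_preds[0], rtol=1e-5, atol=1e-7)
```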


## Conclusion

We ran a simple experiment on a single dummy batch of 32 random images. As the cell outputs show, `%%timeit` ran each cell for 5 repeats of 1 loop and reported the best time. The repeat count and the reported statistic depend on the IPython version; see the [`timeit` documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) for the current defaults.
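
Outside a notebook, the "1 loop, best of 5" measurement shown in the cell outputs can be reproduced with the standard-library `timeit` module. The sketch below assumes a hypothetical `workload` function standing in for `model.predict(dummy_inputs)` or `sess.run(...)`.

```python
import timeit

def workload():
    # Hypothetical stand-in for model.predict(dummy_inputs) or sess.run(...).
    return sum(i * i for i in range(100_000))

# 5 repeats of 1 loop each, matching the "1 loop, best of 5" cell outputs.
times = timeit.repeat(workload, repeat=5, number=1)
best = min(times)
print(f"1 loop, best of {len(times)}: {best:.4f} s per loop")
```

Taking the minimum rather than the mean reduces the influence of background activity on the machine.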


The TF implementation of ResNet50 took about 4.32 s, while the converted ONNX model took about 3.64 s for the same inference job (best of 5 runs each). In this test the ONNX model was therefore about 16% faster on CPU.
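
The relative difference between the two timings reported above can be worked out directly. A small sketch of the arithmetic:

```python
tf_seconds = 4.32    # model.predict, from the %%timeit output above
onnx_seconds = 3.64  # sess.run, from the %%timeit output above

speedup = tf_seconds / onnx_seconds
time_saved = (tf_seconds - onnx_seconds) / tf_seconds
print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
# prints "speedup: 1.19x, time saved: 15.7%"
```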