# Convert `tf.keras` model to ONNX

This tutorial shows:
- how to convert a tf.keras model to ONNX, either from a saved model file or directly from source code.
- how the CPU inference time of the tf.keras model compares with that of the converted ONNX model.

## Install ONNX dependencies
- `tf2onnx` provides tools to convert TensorFlow models to ONNX.
- `onnxruntime` is used to run inference on a saved ONNX model.

In [1]:
!pip install -Uqq tf2onnx
!pip install -Uqq onnxruntime


### Imports

In [2]:
import tf2onnx
import pandas as pd
import tensorflow as tf
import numpy as np

### Get a sample model 

In [3]:
core = tf.keras.applications.ResNet50(include_top=True, input_shape=(224, 224, 3))

inputs = tf.keras.layers.Input(shape=(224, 224, 3), name="image_input")
preprocess = tf.keras.applications.resnet50.preprocess_input(inputs)
outputs = core(preprocess, training=False)
model = tf.keras.Model(inputs=[inputs], outputs=[outputs])

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5


Note that the preprocessing step is included in the `model` object. An image loaded from disk can therefore be fed to the model directly, with no separate model-specific preprocessing at serving time. This reduces training/serving skew.
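
To make "model-specific preprocessing" concrete, here is a minimal NumPy sketch of what `tf.keras.applications.resnet50.preprocess_input` is understood to do by default ("caffe" mode: flip RGB to BGR, then subtract the per-channel ImageNet means, with no rescaling). The helper name `resnet50_preprocess_numpy` is hypothetical and exists only for illustration; in this tutorial the equivalent logic lives inside `model`.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by the "caffe" preprocessing mode.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def resnet50_preprocess_numpy(images_rgb):
    """Illustrative NumPy equivalent of ResNet50 preprocessing (hypothetical helper).

    images_rgb: array of shape (batch, height, width, 3), RGB, values in [0, 255].
    """
    x = np.asarray(images_rgb, dtype=np.float32)
    x = x[..., ::-1]              # RGB -> BGR
    return x - IMAGENET_BGR_MEAN  # zero-center each channel, no rescaling

# A single pure-red pixel: R=255, G=0, B=0.
red = np.array([[[[255.0, 0.0, 0.0]]]], dtype=np.float32)
print(resnet50_preprocess_numpy(red))
```

If this step were left outside the model, every serving client would have to reimplement it exactly, which is where training/serving skew tends to creep in.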

## Convert to ONNX

In [4]:
num_layers = len(model.layers)
print(f'first layer name: {model.layers[0].name}')
print(f'last layer name: {model.layers[num_layers-1].name}')

first layer name: image_input
last layer name: resnet50


### Conversion

`opset` in `tf2onnx.convert.from_keras` is the ONNX opset version to target. The full list of TensorFlow (TF) ops that can be converted to ONNX ops is available [here](https://github.com/onnx/tensorflow-onnx/blob/master/support_status.md).

There are two ways to convert a TensorFlow model to ONNX:
- `tf2onnx.convert.from_keras` to convert programmatically
- the `tf2onnx.convert` CLI to convert a saved TensorFlow model
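
The CLI route can also be driven from a Python script rather than a notebook `!` command. The sketch below only assembles the command line; `build_convert_command` is a hypothetical helper, and the paths `my_model` and `my_model.onnx` are placeholders. It assumes the `--saved-model`, `--output`, and `--opset` flags of `tf2onnx.convert`.

```python
import sys

def build_convert_command(saved_model_dir, output_path, opset=15):
    """Build the argument list for the tf2onnx CLI (hypothetical helper)."""
    return [
        sys.executable, "-m", "tf2onnx.convert",
        "--saved-model", saved_model_dir,
        "--output", output_path,
        "--opset", str(opset),
    ]

cmd = build_convert_command("my_model", "my_model.onnx")
print(" ".join(cmd))
# To actually run the conversion (requires tf2onnx and a SavedModel at `my_model`):
# import subprocess; subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the conversion in the same Python environment where `tf2onnx` was installed.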

In [5]:
import onnx

# `image_input` is also the key used for the input feed in `sess.run` later on.
input_signature = [tf.TensorSpec([None, 224, 224, 3], tf.float32, name='image_input')]
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature, opset=15)
onnx.save(onnx_model, "resnet50_w_preprocessing.onnx")

# Alternative: export a SavedModel and convert it with the CLI.
# model.save('my_model')
# !python -m tf2onnx.convert --saved-model my_model --output my_model.onnx

Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`


## Test TF vs ONNX model with dummy data

### Generate dummy data 

In [6]:
dummy_inputs = tf.random.normal((32, 224, 224, 3))

### Test original TF model with dummy data

In [7]:
%%timeit
model.predict(dummy_inputs)

1 loop, best of 5: 3.41 s per loop


In [8]:
tf_preds = model.predict(dummy_inputs)

### Test converted ONNX model with dummy data

To run inference on a GPU, pass `providers=["CUDAExecutionProvider"]` to `ort.InferenceSession` (this requires the `onnxruntime-gpu` package).

Passing `None` as the first argument of `sess.run` means that all of the model's outputs are returned.
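
One possible way to choose between the CPU and CUDA providers at runtime is sketched below. `pick_providers` and `make_session` are hypothetical helpers, not part of ONNX Runtime; the sketch assumes `ort.get_available_providers()` and the `providers` argument of `ort.InferenceSession`.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are available, in priority order."""
    return [p for p in preferred if p in available]

def make_session(model_path):
    """Create an InferenceSession on the best available provider (sketch)."""
    import onnxruntime as ort  # imported here so the helper above has no dependencies
    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)

# The selection logic can be checked without ONNX Runtime installed.
print(pick_providers(["CPUExecutionProvider"]))
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

Listing the CPU provider last keeps it as a fallback when CUDA is not available.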

In [10]:
import onnxruntime as ort

# For GPU inference, add providers=["CUDAExecutionProvider"] to the call below.
sess = ort.InferenceSession("resnet50_w_preprocessing.onnx")
np_dummy_inputs = dummy_inputs.numpy()  # sess.run takes NumPy arrays, not TF tensors

In [11]:
%%timeit 
sess.run(None, {"image_input": np_dummy_inputs})

1 loop, best of 5: 3.06 s per loop


In [12]:
ort_preds = sess.run(None, {"image_input": np_dummy_inputs})

## Check if the TF and ONNX outputs match

In [16]:
np.testing.assert_allclose(tf_preds, ort_preds[0], atol=1e-4)
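
`assert_allclose` only reports pass or fail. When it helps to see how close the two outputs actually are, a small summary like the following sketch can be used. `compare_predictions` is a hypothetical helper, and the toy arrays stand in for `tf_preds` and `ort_preds[0]` with made-up numbers.

```python
import numpy as np

def compare_predictions(reference, candidate):
    """Summarize how closely two (batch, classes) prediction arrays agree."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    return {
        # Largest element-wise deviation between the two outputs.
        "max_abs_diff": float(np.max(np.abs(reference - candidate))),
        # Fraction of samples where both models pick the same top class.
        "top1_agreement": float(np.mean(reference.argmax(axis=1) == candidate.argmax(axis=1))),
    }

# Toy stand-ins for `tf_preds` and `ort_preds[0]` (made-up numbers).
a = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
b = a + np.array([[1e-5, -1e-5, 0.0], [0.0, 2e-5, -2e-5]])
report = compare_predictions(a, b)
print(report)
```

In the notebook, the call would be `compare_predictions(tf_preds, ort_preds[0])`.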

## Conclusion

We ran a simple experiment on a dummy batch of 32 images. The timings above come from `%%timeit`, which in this run reported the best of 5 repeats: 3.41 s for the tf.keras model and 3.06 s for the ONNX model. Recent IPython versions instead default to 7 repeats and report the mean and standard deviation ([`timeit`'s default behaviour](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit)).
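
Outside a notebook, a similar measurement can be made with the standard-library `timeit` module. The sketch below reports both the best and the mean time per call, since different `%timeit` versions report one or the other. `benchmark` is a hypothetical helper, and the workload is a stand-in for the real inference call.

```python
import statistics
import timeit

def benchmark(fn, repeat=5, number=1):
    """Time `fn` and return both the best and the mean time per call, in seconds."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    return {"best": min(per_call), "mean": statistics.mean(per_call), "runs": per_call}

# Stand-in workload; in the notebook this would be e.g.
# lambda: sess.run(None, {"image_input": np_dummy_inputs})
result = benchmark(lambda: sum(range(10_000)), repeat=5, number=10)
print(f"best: {result['best']:.6f} s, mean: {result['mean']:.6f} s")
```

The best time is less sensitive to background noise on the machine, while the mean gives a better idea of typical latency.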


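In this run, the ONNX model was roughly 10% faster than the tf.keras model on CPU. ONNX Runtime will often give lower inference latency on a CPU server, but the size of the gap depends on the model, batch size, and hardware, so measure on your own setup.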