Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# ONNX Runtime: Tutorial for Nuphar execution provider
**Accelerating model inference via compiler, using Docker Images for ONNX Runtime with Nuphar**

This example shows how to accelerate model inference using Nuphar, an execution provider that leverages just-in-time compilation to generate optimized executables.

For more background about Nuphar, please check [Nuphar-ExecutionProvider.md](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/Nuphar-ExecutionProvider.md) and its [build instructions](https://github.com/microsoft/onnxruntime/blob/master/BUILD.md#nuphar).

#### Tutorial Roadmap:
0. Prerequisites
1. Create and run inference on a simple ONNX model, and understand how ***compilation*** works in Nuphar.
2. Create and run inference on a model using ***LSTM***, run symbolic shape inference, edit LSTM ops to Scan, and check Nuphar speedup.
3. ***Quantize*** the LSTM model and check speedup in Nuphar (a CPU with AVX2 support is recommended).
4. Working on a real model: ***Bidirectional Attention Flow ([BiDAF](https://arxiv.org/pdf/1611.01603))*** from the ONNX Model Zoo.
5. ***Ahead-Of-Time (AOT) compilation*** to save just-in-time compilation cost on model load.


## 0. Prerequisites
Please make sure you have installed the following Python packages. In addition, a C++ compiler/linker is required for ahead-of-time compilation: g++ on Linux, or Visual Studio 2017 on Windows.
 

In [1]:
import cpufeature
import numpy as np
import onnx
from onnx import helper, numpy_helper
import os
from timeit import default_timer as timer
import shutil
import subprocess
import sys
import tarfile
import urllib.request
def is_windows():
  return sys.platform.startswith('win')
if is_windows():
  assert shutil.which('cl.exe'), 'Please make sure MSVC compiler and linker are in PATH.'
else:
  assert shutil.which('g++'), 'Please make sure g++ is installed.'

The Nuphar package in onnxruntime is also required. Please make sure you are using a Nuphar-enabled build.

In [2]:
import onnxruntime
from onnxruntime.nuphar.model_editor import convert_to_scan_model
from onnxruntime.nuphar.model_quantizer import convert_matmul_model
from onnxruntime.nuphar.rnn_benchmark import generate_model
from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference

## 1. Create and run inference on a simple ONNX model
Let's start with a simple model: Y = ((X + X) * X + X) * X + X

In [3]:
model = onnx.ModelProto()
opset = model.opset_import.add()
opset.domain = '' # the empty string denotes the default ONNX operator domain
opset.version = 7 # ONNX opset 7 is required for LSTM op later

graph = model.graph
X = 'input'
Y = 'output'

# declare graph input/output with shape [seq, batch, 1024]
dim = 1024
model.graph.input.add().CopyFrom(helper.make_tensor_value_info(X, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))
model.graph.output.add().CopyFrom(helper.make_tensor_value_info(Y, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))

# create nodes: Y = ((X + X) * X + X) * X + X
num_nodes = 5
for i in range(num_nodes):
  n = helper.make_node('Mul' if i % 2 else 'Add',
                       [X, X if i == 0 else 'out_'+str(i-1)],
                       ['out_'+str(i) if i < num_nodes - 1 else Y],
                       'node'+str(i))
  model.graph.node.add().CopyFrom(n)

# save the model
simple_model_name = 'simple.onnx'
onnx.save(model, simple_model_name)
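
As a quick sanity check on the node loop above, the same chain can be replayed in plain Python without ONNX. This is a minimal sketch: `build_chain` and `evaluate` are hypothetical helpers written for illustration. Only the op names, tensor names, and wiring are taken from the loop.

```python
def build_chain(num_nodes=5, x='input', y='output'):
    """Return (op, inputs, output) tuples wired like the node loop above."""
    nodes = []
    for i in range(num_nodes):
        op = 'Mul' if i % 2 else 'Add'
        inputs = [x, x if i == 0 else 'out_' + str(i - 1)]
        output = 'out_' + str(i) if i < num_nodes - 1 else y
        nodes.append((op, inputs, output))
    return nodes


def evaluate(nodes, value, x='input', y='output'):
    """Evaluate the chain on a scalar (or anything that supports + and *)."""
    values = {x: value}
    for op, (a, b), out in nodes:
        values[out] = values[a] * values[b] if op == 'Mul' else values[a] + values[b]
    return values[y]


chain = build_chain()
print([op for op, _, _ in chain])  # op sequence of the five nodes
print(evaluate(chain, 3.0))        # compare with ((3 + 3) * 3 + 3) * 3 + 3
```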

We will use the Nuphar execution provider to run inference on the model created above, and use a settings string to dump the generated code.

Because the dump is written to the process's output streams, we create the session in a subprocess and redirect its stdout/stderr to a log file:

In [4]:
code_to_run = '''
import onnxruntime
s = 'codegen_dump_lower:verbose'
onnxruntime.capi._pybind_state.set_nuphar_settings(s)
sess = onnxruntime.InferenceSession('simple.onnx')
'''

log_file = 'simple_lower.log' 
with open(log_file, "w") as f:
  subprocess.run([sys.executable, '-c', code_to_run], stdout=f, stderr=f)

The lowered log is similar to C source code, but the whole file is lengthy to show here. Let's just check the last few lines that are most important:

In [5]:
with open(log_file) as f:
    log_lines = f.readlines()

log_lines[-10:]

['produce node4 {\n',
 '  for (ax0, 0, seq) {\n',
 '    for (ax1, 0, batch) {\n',
 '      for (ax2.outer, 0, 64) {\n',
 '        node4[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] = (input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)] + input[ramp((((((ax0*batch) + ax1)*64) + ax2.outer)*16), 1, 16)])))))\n',
 '      }\n',
 '    }\n',
 '  }\n',
 '}\n',
 '\n']

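The lowered code shows that the Add/Mul nodes were fused into a single function and that the innermost loop was vectorized: each iteration of `ax2.outer` processes 16 contiguous floats (`ramp(..., 1, 16)`), so 64 iterations cover the 1024-wide last dimension. The fusion was done automatically by the compiler in the Nuphar execution provider and did not require any manual model editing.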
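
To make the lowered loop easier to read, here is a NumPy sketch that mirrors its structure: three nested loops, a flat base index, and a 16-element slice standing in for `ramp(base, 1, 16)`. `fused_add_mul` is a hypothetical name, and the sketch only illustrates the indexing and the fused expression. It says nothing about how Nuphar actually emits machine code.

```python
import numpy as np

LANES = 16  # vector width seen in ramp(..., 1, 16)


def fused_add_mul(x):
    """Compute input + input*(input + input*(input + input)) in 16-wide chunks."""
    seq, batch, dim = x.shape
    assert dim % LANES == 0
    outer = dim // LANES  # 64 when dim == 1024
    flat_in = np.ascontiguousarray(x).reshape(-1)
    flat_out = np.empty_like(flat_in)
    for ax0 in range(seq):
        for ax1 in range(batch):
            for ax2_outer in range(outer):
                # same flat index as (((ax0*batch) + ax1)*64 + ax2.outer)*16
                base = (((ax0 * batch) + ax1) * outer + ax2_outer) * LANES
                v = flat_in[base:base + LANES]
                flat_out[base:base + LANES] = v + v * (v + v * (v + v))
    return flat_out.reshape(x.shape)


# Small seq/batch so the Python loops stay fast
x = np.random.rand(2, 3, 1024).astype(np.float32)
reference = ((((x + x) * x) + x) * x) + x
assert np.allclose(fused_add_mul(x), reference)
```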

Next, let's run inference on the model and compare the accuracy and performance with numpy:

In [6]:
seq = 128
batch = 16
input_data = np.random.rand(seq, batch, dim).astype(np.float32)
sess = onnxruntime.InferenceSession(simple_model_name)
feed = {X:input_data}
output = sess.run([], feed)
np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data
assert np.allclose(output[0], np_output)

repeats = 100
start_ort = timer()
for i in range(repeats):
    output = sess.run([], feed)
end_ort = timer()
start_np = timer()
for i in range(repeats):
    np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data
end_np = timer()
'onnxruntime: {0:.3f} seconds, numpy: {1:.3f} seconds'.format(end_ort - start_ort, end_np - start_np)

'onnxruntime: 0.315 seconds, numpy: 0.728 seconds'
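
The timing loops in this tutorial all follow the same pattern, so they could be factored into a small helper. The sketch below is one possible shape for it. `time_calls` is a hypothetical name, and the untimed warm-up calls are an addition that the cells here do not perform.

```python
from timeit import default_timer as timer


def time_calls(fn, repeats=100, warmup=3):
    """Return total seconds spent in `repeats` calls of fn, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = timer()
    for _ in range(repeats):
        fn()
    return timer() - start


# Stand-in workload so the sketch runs on its own
elapsed = time_calls(lambda: sum(range(1000)), repeats=100)
print('{0:.3f} seconds'.format(elapsed))
```

With the session from the cell above, the call would look like `time_calls(lambda: sess.run([], feed), repeats)`.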

## 2. Create and run inference on a model using LSTM
Now, let's go one step further and work on a 4-layer LSTM model, created with the onnxruntime.nuphar.rnn_benchmark module.

In [7]:
lstm_model = 'LSTMx4.onnx'
input_dim = 256
hidden_dim = 1024
generate_model('lstm', input_dim, hidden_dim, bidirectional=False, layers=4, model_name=lstm_model)

**IMPORTANT**: Nuphar generates code before knowing shapes of input data, unlike other execution providers that do runtime shape inference. Thus, shape inference information is critical for compiler optimizations in Nuphar. To do that, we run symbolic shape inference on the model. Symbolic shape inference is based on the ONNX shape inference, and enhanced by sympy to better handle Shape/ConstantOfShape/etc. ops using symbolic computation.

In [8]:
SymbolicShapeInference.infer_shapes(input_model=lstm_model, output_model=lstm_model)
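
To give a feel for what "symbolic" means here, the toy sketch below propagates shapes whose dimensions are either integers or symbol names such as `'seq'`. It is purely illustrative: `matmul_shape` and `concat_shape` are hypothetical helpers, strings stand in for sympy expressions, and none of this reflects how `SymbolicShapeInference` is implemented.

```python
def matmul_shape(a, b):
    """Shape of A @ B for an N-D A and a 2-D B; dims are ints or symbol names."""
    if a[-1] != b[0]:
        raise ValueError('inner dimensions differ: {} vs {}'.format(a[-1], b[0]))
    return a[:-1] + b[1:]


def concat_shape(a, b, axis):
    """Shape of Concat(a, b) along axis; a symbolic sum is kept as a string."""
    if len(a) != len(b):
        raise ValueError('rank mismatch')
    for i, (da, db) in enumerate(zip(a, b)):
        if i != axis and da != db:
            raise ValueError('dimension {} differs: {} vs {}'.format(i, da, db))
    out = list(a)
    da, db = a[axis], b[axis]
    if isinstance(da, int) and isinstance(db, int):
        out[axis] = da + db
    else:
        out[axis] = '{}+{}'.format(da, db)
    return out


x = ['seq', 'batch', 256]             # symbolic sequence and batch dims
print(matmul_shape(x, [256, 4096]))   # last dim becomes concrete, others stay symbolic
print(concat_shape(['seq', 1, 8], ['past', 1, 8], axis=0))
```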

Now, let's check baseline performance on the generated model, using the CPU execution provider.

In [9]:
sess_baseline = onnxruntime.InferenceSession(lstm_model)
sess_baseline.set_providers(['CPUExecutionProvider']) # default provider in this container is Nuphar, this overrides to CPU EP
seq = 128
input_data = np.random.rand(seq, 1, input_dim).astype(np.float32)
feed = {sess_baseline.get_inputs()[0].name:input_data}
output = sess_baseline.run([], feed)

To run RNN models efficiently in the Nuphar execution provider, LSTM/GRU/RNN ops need to be converted to Scan ops. This is because Scan is more flexible, and supports quantized RNNs.

In [10]:
scan_model = 'Scan_LSTMx4.onnx'
convert_to_scan_model(lstm_model, scan_model)
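
Conceptually, a Scan op applies a per-step body along the sequence axis while carrying state between steps, which is why an LSTM can be expressed with it. The NumPy sketch below shows that structure only. The helper names, the gate ordering, and the weight layout are assumptions made for illustration and are not taken from what `convert_to_scan_model` generates.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def scan(step, state, xs):
    """Apply step(state, x) along axis 0 of xs, stacking the per-step outputs."""
    outputs = []
    for x in xs:
        state, y = step(state, x)
        outputs.append(y)
    return state, np.stack(outputs)


def make_lstm_step(W, R, b):
    """W: [input, 4*hidden], R: [hidden, 4*hidden], b: [4*hidden] (assumed layout)."""
    def step(state, x):
        h, c = state
        gates = x @ W + h @ R + b
        i, o, f, g = np.split(gates, 4, axis=-1)  # gate order is an assumption
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return (h, c), h
    return step


rng = np.random.default_rng(0)
seq, batch, input_dim, hidden_dim = 5, 1, 8, 4
W = rng.standard_normal((input_dim, 4 * hidden_dim)).astype(np.float32)
R = rng.standard_normal((hidden_dim, 4 * hidden_dim)).astype(np.float32)
b = np.zeros(4 * hidden_dim, dtype=np.float32)
xs = rng.standard_normal((seq, batch, input_dim)).astype(np.float32)
zeros = np.zeros((batch, hidden_dim), dtype=np.float32)
(h_last, c_last), ys = scan(make_lstm_step(W, R, b), (zeros, zeros), xs)
print(ys.shape)  # (seq, batch, hidden_dim)
```

Because the loop body is an ordinary subgraph, individual operations inside it (such as the matrix products) can be rewritten independently, which is harder with a monolithic LSTM op.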

After conversion, let's compare performance and accuracy with baseline:

In [11]:
sess_nuphar = onnxruntime.InferenceSession(scan_model)
output_nuphar = sess_nuphar.run([], feed)
assert np.allclose(output[0], output_nuphar[0])

repeats = 10
start_baseline = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end_baseline = timer()

start_nuphar = timer()
for i in range(repeats):
    output = sess_nuphar.run([], feed)
end_nuphar = timer()

'nuphar: {0:.3f} seconds, baseline: {1:.3f} seconds'.format(end_nuphar - start_nuphar, end_baseline - start_baseline)

'nuphar: 2.899 seconds, baseline: 2.911 seconds'

## 3. Quantize the LSTM model
Let's get more speed-ups from Nuphar by quantizing the floating point GEMM/GEMV in LSTM model to int8 GEMM/GEMV.

**NOTE:** For inference speed of the quantized model, a CPU with AVX2 instructions is preferred.

In [12]:
cpufeature.CPUFeature['AVX2'] or 'No AVX2, quantized model might be slow'

True

We can use onnxruntime.nuphar.model_quantizer to quantize floating point GEMM/GEMVs. Assuming a GEMM/GEMV takes the form input * weights, the weights are statically quantized per column, and the inputs are dynamically quantized per row.

In [13]:
quantized_model = 'Scan_LSTMx4_int8.onnx'
convert_matmul_model(scan_model, quantized_model)
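
The per-column/per-row scheme described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the quantizer's actual arithmetic: it uses symmetric int8 quantization with a range of ±127, and `quantize_symmetric` and `quantized_matmul` are hypothetical names.

```python
import numpy as np


def quantize_symmetric(a, axis):
    """Quantize to int8 with one scale per slice along `axis` (assumed symmetric scheme)."""
    max_abs = np.max(np.abs(a), axis=axis, keepdims=True)
    scale = np.where(max_abs == 0, 1.0, max_abs / 127.0).astype(np.float32)
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale


def quantized_matmul(x, w_q, w_scale):
    """input * weights with dynamic per-row input scales and static per-column weight scales."""
    x_q, x_scale = quantize_symmetric(x, axis=1)       # one scale per input row
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)  # integer accumulation
    return acc.astype(np.float32) * x_scale * w_scale  # rescale back to float


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(4, 64)).astype(np.float32)
w = rng.uniform(-1, 1, size=(64, 32)).astype(np.float32)
w_q, w_scale = quantize_symmetric(w, axis=0)           # one scale per weight column, computed once
approx = quantized_matmul(x, w_q, w_scale)
exact = x @ w
print('max abs error:', float(np.max(np.abs(approx - exact))))
```

The weight scales can be computed ahead of time because the weights are constants, while the input scales have to be computed at run time from the data that arrives.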

Now run the quantized model, and check accuracy. Please note that quantization may cause accuracy loss, so we relax the comparison threshold a bit.

In [14]:
sess_quantized = onnxruntime.InferenceSession(quantized_model)
output_quantized = sess_quantized.run([], feed)
assert np.allclose(output[0], output_quantized[0], rtol=1e-3, atol=1e-3)
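
For reference, `np.allclose(a, b, rtol, atol)` passes when `|a - b| <= atol + rtol * |b|` holds element-wise. When the assertion fails, it can help to see how far off the outputs are. The helper below is a small sketch for that purpose. `closeness_report` is a hypothetical name, and the sample arrays are made up.

```python
import numpy as np


def closeness_report(a, b, rtol=1e-3, atol=1e-3):
    """Summarize how a and b compare under the bound that np.allclose uses."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    abs_err = np.abs(a - b)
    allowed = atol + rtol * np.abs(b)
    return {
        'max_abs_err': float(abs_err.max()),
        'num_violations': int(np.count_nonzero(abs_err > allowed)),
        'all_close': bool(np.all(abs_err <= allowed)),
    }


# Made-up values: the last element is off by more than the tolerance allows
a = np.array([1.0, 2.0, 3.0])
b = a + np.array([0.0005, 0.0, 0.01])
print(closeness_report(a, b))
```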

Now check quantized model performance:

In [15]:
start_quantized = timer()
for i in range(repeats):
    output = sess_quantized.run([], feed)
end_quantized = timer()

'quantized: {0:.3f} seconds, non-quantized: {1:.3f} seconds'.format(end_quantized - start_quantized, end_nuphar - start_nuphar)

'quantized: 0.768 seconds, non-quantized: 2.899 seconds'

## 4. Working on a real model: Bidirectional Attention Flow (BiDAF)
BiDAF is a machine comprehension model that uses LSTMs. The inputs to this model are paragraphs of contexts and queries, and the outputs are start/end indices of words in the contexts that answer the queries.

First let's download the model:

In [16]:
# download BiDAF model
cwd = os.getcwd()
bidaf_url = 'https://onnxzoo.blob.core.windows.net/models/opset_9/bidaf/bidaf.tar.gz'
bidaf_local = os.path.join(cwd, 'bidaf.tar.gz')
if not os.path.exists(bidaf_local):
  urllib.request.urlretrieve(bidaf_url, bidaf_local)
with tarfile.open(bidaf_local, 'r') as f:
  f.extractall(cwd)

Now let's check the performance of the CPU provider:

In [17]:
bidaf = os.path.join(cwd, 'bidaf', 'bidaf.onnx')
sess_baseline = onnxruntime.InferenceSession(bidaf)
sess_baseline.set_providers(['CPUExecutionProvider'])
# load test data
test_data_dir = os.path.join(cwd, 'bidaf', 'test_data_set_3')
tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]
feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}
output_baseline = sess_baseline.run([], feed)

The context in this test data:

In [18]:
' '.join(list(feed['context_word'].reshape(-1)))

"with 4:51 left in regulation , carolina got the ball on their own 24 - yard line with a chance to mount a game - winning drive , and soon faced 3rd - and - 9 . on the next play , miller stripped the ball away from newton , and after several players dove for it , it took a long bounce backwards and was recovered by ward , who returned it five yards to the panthers 4 - yard line . although several players dove into the pile to attempt to recover it , newton did not and his lack of aggression later earned him heavy criticism . meanwhile , denver  ' s offense was kept out of the end zone for three plays , but a holding penalty on cornerback josh norman gave the broncos a new set of downs . then anderson scored on a 2 - yard touchdown run and manning completed a pass to bennie fowler for a 2 - point conversion , giving denver a 24 – 10 lead with 3:08 left and essentially putting the game away . carolina had two more drives , but failed to get a first down on each one ."

The query:

In [19]:
' '.join(list(feed['query_word'].reshape(-1)))

'who recovered the strip ball ?'

And the answer:

In [20]:
' '.join(list(feed['context_word'][output_baseline[0][0]:output_baseline[1][0]+1].reshape(-1)))

'ward'
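
The answer is recovered by slicing the context words with the predicted start and end indices, where the end index is inclusive. A self-contained sketch of that step, using a hypothetical helper `extract_answer` and a made-up four-word context:

```python
import numpy as np


def extract_answer(context_words, start, end):
    """Join the context words from start to end (inclusive)."""
    words = np.asarray(context_words).reshape(-1)
    return ' '.join(words[int(start):int(end) + 1])


# Made-up context, shaped as a column of words
context = np.array([['recovered'], ['by'], ['ward'], [',']])
print(extract_answer(context, 2, 2))
print(extract_answer(context, 0, 2))
```

In the cell above, `start` and `end` correspond to `output_baseline[0][0]` and `output_baseline[1][0]`.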

Now put all steps together:

In [21]:
# editing
bidaf_converted = 'bidaf_mod.onnx'
SymbolicShapeInference.infer_shapes(bidaf, bidaf_converted)
convert_to_scan_model(bidaf_converted, bidaf_converted)
# When quantizing, there's an only_for_scan option to quantize only the GEMV inside Scan ops.
# This is useful when the input dims of the LSTM are much bigger than the hidden dims.
# BiDAF has several LSTMs with input dims of 800/1400/etc., while the hidden dim is 100.
# So unlike the LSTMx4 model above, we use only_for_scan here
convert_matmul_model(bidaf_converted, bidaf_converted, only_for_scan=True)

# inference and verify accuracy
sess = onnxruntime.InferenceSession(bidaf_converted)
output = sess.run([], feed)
assert all([np.allclose(o, ob) for o, ob in zip(output, output_baseline)])

Check performance after all these steps:

In [22]:
start_baseline = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end_baseline = timer()

start_nuphar = timer()
for i in range(repeats):
    output = sess.run([], feed)
end_nuphar = timer()

'nuphar: {0:.3f} seconds, baseline: {1:.3f} seconds'.format(end_nuphar - start_nuphar, end_baseline - start_baseline)

'nuphar: 0.128 seconds, baseline: 0.177 seconds'

The benefit of quantization in BiDAF is not as great as in the LSTM sample above, because BiDAF has relatively small hidden dimensions, which limits the gain from optimization inside Scan ops. However, this model still benefits from fusion/vectorization/etc.

## 5. Ahead-Of-Time (AOT) compilation
Nuphar runs just-in-time (JIT) compilation when loading models, which may lead to a slow cold start. We can use the create_shared script to build a dll from the JIT code and accelerate model loading.

In [23]:
start_jit = timer()
sess = onnxruntime.InferenceSession(bidaf_converted)
end_jit = timer()
'JIT took {0:.3f} seconds'.format(end_jit - start_jit)

'JIT took 3.163 seconds'

In [24]:
# create a folder for JIT cache
cache_dir = os.path.join(cwd, 'bidaf_cache')
# remove any stale cache files
if os.path.exists(cache_dir):
  shutil.rmtree(cache_dir)
os.makedirs(cache_dir, exist_ok=True)
# use settings to enable JIT cache
settings = 'nuphar_cache_path:{}'.format(cache_dir)
onnxruntime.capi._pybind_state.set_nuphar_settings(settings)
sess = onnxruntime.InferenceSession(bidaf_converted)
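
The settings strings used in this tutorial are `key:value` pairs. If several settings are needed at once, a small helper can assemble the string. The sketch below assumes that multiple pairs are joined with commas, which this tutorial does not show, so please check the Nuphar documentation before relying on it. `make_nuphar_settings` is a hypothetical name, and the path is a placeholder.

```python
def make_nuphar_settings(options):
    """Join key/value pairs into a settings string (comma separator is an assumption)."""
    return ', '.join('{}:{}'.format(key, value) for key, value in options.items())


# Both keys appear earlier in this tutorial; the cache path is a placeholder
settings = make_nuphar_settings({
    'nuphar_cache_path': '/tmp/bidaf_cache',
    'codegen_dump_lower': 'verbose',
})
print(settings)
```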

Now that the object files of the JIT code are stored in cache_dir, let's link them into a dll:

In [25]:
cache_versioned_dir = os.path.join(cache_dir, os.listdir(cache_dir)[0])
# use onnxruntime.nuphar.create_shared module to create dll
onnxruntime_dir = os.path.split(os.path.abspath(onnxruntime.__file__))[0]
subprocess.run([sys.executable, '-m', 'onnxruntime.nuphar.create_shared', '--input_dir', cache_versioned_dir], check=True)
os.listdir(cache_versioned_dir)

['jit.so']

Check the model loading speed-up with AOT dll:

In [26]:
start_aot = timer()
# NOTE: Nuphar settings string is not sticky. It needs to be reset before creating InferenceSession
settings = 'nuphar_cache_path:{}'.format(cache_dir)
onnxruntime.capi._pybind_state.set_nuphar_settings(settings)
sess = onnxruntime.InferenceSession(bidaf_converted)
end_aot = timer()
'AOT took {0:.3f} seconds'.format(end_aot - start_aot)

'AOT took 0.464 seconds'
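
To put the measurements in this tutorial side by side, the sketch below turns each before/after pair into a speedup ratio. The timings are copied from the cell outputs above and will differ on other machines. `speedup` is a hypothetical helper.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline time to optimized time; values above 1 mean faster."""
    return baseline_seconds / optimized_seconds


# (before, after) timings in seconds, copied from the outputs in this tutorial
timings = {
    'simple model (numpy vs onnxruntime)': (0.728, 0.315),
    'LSTMx4 (CPU EP vs Nuphar Scan)': (2.911, 2.899),
    'LSTMx4 (float Scan vs int8 Scan)': (2.899, 0.768),
    'BiDAF (CPU EP vs Nuphar)': (0.177, 0.128),
    'BiDAF model load (JIT vs AOT)': (3.163, 0.464),
}
for name, (before, after) in timings.items():
    print('{}: {:.2f}x'.format(name, speedup(before, after)))
```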