Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# ONNX Runtime: Tutorial for Nuphar execution provider
**Accelerating model inference via compiler, using Docker Images for ONNX Runtime with Nuphar**

This example shows how to accelerate model inference using Nuphar, an execution provider that leverages just-in-time compilation to generate optimized executables.

For more background about Nuphar, please check [Nuphar-ExecutionProvider.md](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/Nuphar-ExecutionProvider.md) and its [build instructions](https://github.com/microsoft/onnxruntime/blob/master/BUILD.md#nuphar).

#### Tutorial Roadmap:
1. Prerequistes
2. Create and run inference on a simple ONNX model, and understand how ***compilation*** works in Nuphar.
3. Create and run inference on a model using ***LSTM***, run symbolic shape inference, edit LSTM ops to Scan, and check Nuphar speedup.
4. ***Quantize*** the LSTM model and check speedup in Nuphar (CPU with AVX2 support is required).
5. Working on real models from onnx model zoo: ***BERT squad***, ***GPT-2*** and ***Bidirectional Attention Flow ([BiDAF](https://arxiv.org/pdf/1611.01603))***.
6. ***Ahead-Of-Time (AOT) compilation*** to save just-in-time compilation cost on model load.
7. Performance tuning for single thread inference.


## 1. Prerequistes
Please make sure you have installed following Python packages. Besides, C++ compiler/linker is required for ahead-of-time compilation. Please make sure you have g++ if running on Linux, or Visual Studio 2017 on Windows.

For simplicity, you may use [Nuphar docker image](https://github.com/microsoft/onnxruntime/blob/master/dockerfiles/README.md) from Microsoft Container Registry.


In [1]:
import cpufeature
import numpy as np
import onnx
from onnx import helper, numpy_helper
import os
from timeit import default_timer as timer
import shutil
import subprocess
import sys
import tarfile
import urllib.request

def is_windows():
  return sys.platform.startswith('win')

if is_windows():
  assert shutil.which('cl.exe'), 'Please make sure MSVC compiler and liner are in PATH.'
else:
  assert shutil.which('g++'), 'Please make sure g++ is installed.'

def print_speedup(name, delta_baseline, delta):
    print("{} speed-up {:.2f}%".format(name, 100*(delta_baseline/delta - 1)))
    print("    Baseline: {:.3f} s, Current: {:.3f} s".format(delta_baseline, delta))

And Nuphar package in onnxruntime is required too. Please make sure you are using Nuphar enabled build.

In [2]:
import onnxruntime
from onnxruntime.nuphar.model_editor import convert_to_scan_model
from onnxruntime.nuphar.model_quantizer import convert_matmul_model
from onnxruntime.nuphar.rnn_benchmark import generate_model
from onnxruntime.nuphar.symbolic_shape_infer import SymbolicShapeInference

## 2. Create and run inference on a simple ONNX model
Let's start with a simple model: Y = ((X + X) * X + X) * X + X

In [3]:
model = onnx.ModelProto()
opset = model.opset_import.add()
opset.domain == 'onnx'
opset.version = 7 # ONNX opset 7 is required for LSTM op later

graph = model.graph
X = 'input'
Y = 'output'

# declare graph input/ouput with shape [seq, batch, 1024]
dim = 1024
model.graph.input.add().CopyFrom(helper.make_tensor_value_info(X, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))
model.graph.output.add().CopyFrom(helper.make_tensor_value_info(Y, onnx.TensorProto.FLOAT, ['seq', 'batch', dim]))

# create nodes: Y = ((X + X) * X + X) * X + X
num_nodes = 5
for i in range(num_nodes):
  n = helper.make_node('Mul' if i % 2 else 'Add',
                       [X, X if i == 0 else 'out_'+str(i-1)],
                       ['out_'+str(i) if i < num_nodes - 1 else Y],
                       'node'+str(i))
  model.graph.node.add().CopyFrom(n)

# save the model
simple_model_name = 'simple.onnx'
onnx.save(model, simple_model_name)

We will use nuphar execution provider to run the inference for the model that we created above, and use settings string to check the generated code.

Because of the redirection of output, we dump the lowered code from a subprocess to a log file:

In [4]:
code_to_run = '''
import onnxruntime
s = 'codegen_dump_lower:verbose'
onnxruntime.capi._pybind_state.set_nuphar_settings(s)
sess = onnxruntime.InferenceSession('simple.onnx')
'''

log_file = 'simple_lower.log' 
with open(log_file, "w") as f:
  subprocess.run([sys.executable, '-c', code_to_run], stdout=f, stderr=f)

The lowered log is similar to C source code, but the whole file is lengthy to show here. Let's just check the last few lines that are most important:

In [5]:
with open(log_file) as f:
    log_lines = f.readlines()

log_lines[-10:]

['    for (ax2.outer, 0, 64) {\n',
 '      if ((0 <= (ax0.ax1.fused/batch))) {\n',
 '        if (((ax0.ax1.fused/batch) < seq)) {\n',
 '          node4[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] = (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + (input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)]*(input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)] + input[ramp((((ax0.ax1.fused*64) + ax2.outer)*16), 1, 16)])))))\n',
 '        }\n',
 '      }\n',
 '    }\n',
 '  }\n',
 '}\n',
 '\n']

The compiled code showed that the nodes of Add/Mul were fused into a single function, and vectorization was applied in the loop. The fusion was automatically done by the compiler in the Nuphar execution provider, and did not require any manual model editing.

Next, let's run inference on the model and compare the accuracy and performance with numpy:

In [6]:
seq = 128
batch = 16
input_data = np.random.rand(seq, batch, dim).astype(np.float32)
sess = onnxruntime.InferenceSession(simple_model_name)
feed = {X:input_data}
output = sess.run([], feed)
np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data
assert np.allclose(output[0], np_output)

repeats = 100
start_ort = timer()
for i in range(repeats):
    output = sess.run([], feed)
end_ort = timer()
start_np = timer()
for i in range(repeats):
    np_output = ((((input_data + input_data) * input_data) + input_data) * input_data) + input_data
end_np = timer()
print_speedup('Fusion', end_np - start_np, end_ort - start_ort)

Fusion speed-up 436.78%
    Baseline: 0.725 s, Current: 0.135 s


## 3. Create and run inference on a model using LSTM
Now, let's take one step further to work on a 4-layer LSTM model, created from onnxruntime.nuphar.rnn_benchmark module.

In [7]:
lstm_model = 'LSTMx4.onnx'
input_dim = 256
hidden_dim = 1024
generate_model('lstm', input_dim, hidden_dim, bidirectional=False, layers=4, model_name=lstm_model)

**IMPORTANT**: Nuphar generates code before knowing shapes of input data, unlike other execution providers that do runtime shape inference. Thus, shape inference information is critical for compiler optimizations in Nuphar. To do that, we run symbolic shape inference on the model. Symbolic shape inference is based on the ONNX shape inference, and enhanced by sympy to better handle Shape/ConstantOfShape/etc. ops using symbolic computation.

In [8]:
SymbolicShapeInference.infer_shapes(input_model=lstm_model, output_model=lstm_model)

Now, let's check baseline performance on the generated model, using CPU execution provider.

In [9]:
sess_baseline = onnxruntime.InferenceSession(lstm_model)
sess_baseline.set_providers(['CPUExecutionProvider']) # default provider in this container is Nuphar, this overrides to CPU EP
seq = 128
input_data = np.random.rand(seq, 1, input_dim).astype(np.float32)
feed = {sess_baseline.get_inputs()[0].name:input_data}
output = sess_baseline.run([], feed)

To run RNN models in Nuphar execution provider efficiently, LSTM/GRU/RNN ops need to be converted to Scan ops. This is because Scan is more flexible, and supports quantized RNNs.

In [10]:
scan_model = 'Scan_LSTMx4.onnx'
convert_to_scan_model(lstm_model, scan_model)

After conversion, let's compare performance and accuracy with baseline:

In [11]:
sess_nuphar = onnxruntime.InferenceSession(scan_model)
output_nuphar = sess_nuphar.run([], feed)
assert np.allclose(output[0], output_nuphar[0])

repeats = 10
start_baseline = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end_baseline = timer()

start_nuphar = timer()
for i in range(repeats):
    output = sess_nuphar.run([], feed)
end_nuphar = timer()

print_speedup('Nuphar Scan', end_baseline - start_baseline, end_nuphar - start_nuphar)

Nuphar Scan speed-up 3.37%
    Baseline: 3.099 s, Current: 2.998 s


## 4. Quantize the LSTM model
Let's get more speed-ups from Nuphar by quantizing the floating point GEMM/GEMV in LSTM model to int8 GEMM/GEMV.

**NOTE:** For inference speed of quantizated model, a CPU with AVX2 instructions is preferred.

In [12]:
cpufeature.CPUFeature['AVX2'] or 'No AVX2, quantization model might be slow'

True

We can use onnxruntime.nuphar.model_quantizer to quantize floating point GEMM/GEMVs. Assuming GEMM/GEMV takes form of input * weights, weights are statically quantized per-column, and inputs are dynamically quantized per-row.

In [13]:
quantized_model = 'Scan_LSTMx4_int8.onnx'
convert_matmul_model(scan_model, quantized_model)

Now run the quantized model, and check accuracy. Please note that quantization may cause accuracy loss, so we relax the comparison threshold a bit.

In [14]:
sess_quantized = onnxruntime.InferenceSession(quantized_model)
output_quantized = sess_quantized.run([], feed)
assert np.allclose(output[0], output_quantized[0], rtol=1e-3, atol=1e-3)

Now check quantized model performance:

In [15]:
start_quantized = timer()
for i in range(repeats):
    output = sess_quantized.run([], feed)
end_quantized = timer()

print_speedup('Quantization', end_nuphar - start_nuphar, end_quantized - start_quantized)

Quantization speed-up 299.66%
    Baseline: 2.998 s, Current: 0.750 s


## 5. Working on real models

### BERT Squad

BERT (Bidirectional Encoder Representations from Transformers) applies Transformers to language modelling. With Nuphar, we may fuse and compile the model to accelerate inference on CPU.

#### Download model and test data

In [16]:
# download BERT squad model
cwd = os.getcwd()
model_url = 'https://onnxzoo.blob.core.windows.net/models/opset_10/bert_squad/download_sample_10.tar.gz'
model_local = os.path.join(cwd, 'download_sample_10.tar.gz')
if not os.path.exists(model_local):
  urllib.request.urlretrieve(model_url, model_local)
with tarfile.open(model_local, 'r') as f:
  f.extractall(cwd)

#### Run symbolic shape inference
Note that this model has computations like `min(100000, seq_len)` which could be simplified to `seq_len` if we know `seq_len` is not going to be too big. We can do this by setting int_max. Besides, auto_merge is used to make sure the all nodes in the entire model could have shape inferenced by merging symbolic dims when broadcasting.

In [17]:
model_dir = os.path.join(cwd, 'download_sample_10')
model = os.path.join(model_dir, 'bertsquad10.onnx')
model_with_shape_inference = os.path.join(model_dir, 'bertsquad10_shaped.onnx')

# run symbolic shape inference
SymbolicShapeInference.infer_shapes(model, model_with_shape_inference, auto_merge=True, int_max=100000)

#### Run inference on original model, using CPU execution provider, with maximum optimization

In [18]:
sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_baseline = onnxruntime.InferenceSession(model, sess_options)
sess_baseline.set_providers(['CPUExecutionProvider'])

# load test data
test_data_dir = os.path.join(model_dir, 'test_data_set_1')
tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]
feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}
output_baseline = sess_baseline.run([], feed)

repeats = 20
start_baseline = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end_baseline = timer()

#### Run inference on the model with symbolic shape inference, using Nuphar execution provider
First let's check accuracy:

In [19]:
sess = onnxruntime.InferenceSession(model_with_shape_inference)
output = sess.run([], feed)
assert all([np.allclose(o, ob, atol=1e-4) for o, ob in zip(output, output_baseline)])

Then check speed:

In [20]:
start_nuphar = timer()
for i in range(repeats):
    output = sess.run([], feed)
end_nuphar = timer()

print_speedup('Nuphar BERT squad', end_baseline - start_baseline, end_nuphar - start_nuphar)

Nuphar BERT squad speed-up 60.13%
    Baseline: 4.928 s, Current: 3.077 s


### GPT-2 with fixed batch size
GPT-2 is a language model using Generative Pre-Trained Transformer for text generation. With Nuphar, we may fuse and compile the model to accelerate inference on CPU.

#### Download model and test data

In [21]:
# download GPT-2 model
cwd = os.getcwd()
model_url = 'https://onnxzoo.blob.core.windows.net/models/opset_10/GPT2/GPT-2.tar.gz'
model_local = os.path.join(cwd, 'GPT-2.tar.gz')
if not os.path.exists(model_local):
  urllib.request.urlretrieve(model_url, model_local)
with tarfile.open(model_local, 'r') as f:
  f.extractall(cwd)

#### Change batch dimension to fixed value, and run symbolic shape inference
The GPT-2 model from model zoo has a symbolic batch dimension. By replacing it with a fixed value, compiler would be able to generate better code.

In [22]:
model_dir = os.path.join(cwd, 'GPT2')
model = os.path.join(model_dir, 'model.onnx')

# edit batch dimension from symbolic to int value for better codegen
mp = onnx.load(model)
mp.graph.input[0].type.tensor_type.shape.dim[0].dim_value = 1
onnx.save(mp, model)

model_with_shape_inference = os.path.join(model_dir, 'model_shaped.onnx')

# run symbolic shape inference
SymbolicShapeInference.infer_shapes(model, model_with_shape_inference, auto_merge=True)

#### Run inference and compare accuracy/performance to CPU provider

In [23]:
sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_baseline = onnxruntime.InferenceSession(model, sess_options)
sess_baseline.set_providers(['CPUExecutionProvider'])

# load test data, note the tensor proto name in data does not match model, so override it in feed
input_name = [i.name for i in sess_baseline.get_inputs()][0] # This model only has one input
test_data_dir = os.path.join(model_dir, 'test_data_set_0')
tp = onnx.load_tensor(os.path.join(test_data_dir, 'input_0.pb'))
feed = {input_name:numpy_helper.to_array(tp).reshape(1,-1)} # the test data missed batch dimension
output_baseline = sess_baseline.run([], feed)

repeats = 100
start_baseline = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end_baseline = timer()

sess = onnxruntime.InferenceSession(model_with_shape_inference)
output = sess.run([], feed)
assert all([np.allclose(o, ob, atol=1e-4) for o, ob in zip(output, output_baseline)])

start_nuphar = timer()
for i in range(repeats):
    output = sess.run([], feed)
end_nuphar = timer()

print_speedup('Nuphar GPT-2', end_baseline - start_baseline, end_nuphar - start_nuphar)

Nuphar GPT-2 speed-up 13.48%
    Baseline: 2.535 s, Current: 2.234 s


### BiDAF with quantization

BiDAF is a machine comprehension model that uses LSTMs. The inputs to this model are paragraphs of contexts and queries, and the outputs are start/end indices of words in the contexts that answers the queries.

First let's download the model:

In [24]:
# download BiDAF model
cwd = os.getcwd()
bidaf_url = 'https://onnxzoo.blob.core.windows.net/models/opset_9/bidaf/bidaf.tar.gz'
bidaf_local = os.path.join(cwd, 'bidaf.tar.gz')
if not os.path.exists(bidaf_local):
  urllib.request.urlretrieve(bidaf_url, bidaf_local)
with tarfile.open(bidaf_local, 'r') as f:
  f.extractall(cwd)

Now let's check the performance of the CPU provider:

In [25]:
bidaf = os.path.join(cwd, 'bidaf', 'bidaf.onnx')
sess_baseline = onnxruntime.InferenceSession(bidaf)
sess_baseline.set_providers(['CPUExecutionProvider'])
# load test data
test_data_dir = os.path.join(cwd, 'bidaf', 'test_data_set_3')
tps = [onnx.load_tensor(os.path.join(test_data_dir, 'input_{}.pb'.format(i))) for i in range(len(sess_baseline.get_inputs()))]
feed = {tp.name:numpy_helper.to_array(tp) for tp in tps}
output_baseline = sess_baseline.run([], feed)

The context in this test data:

In [26]:
' '.join(list(feed['context_word'].reshape(-1)))

"with 4:51 left in regulation , carolina got the ball on their own 24 - yard line with a chance to mount a game - winning drive , and soon faced 3rd - and - 9 . on the next play , miller stripped the ball away from newton , and after several players dove for it , it took a long bounce backwards and was recovered by ward , who returned it five yards to the panthers 4 - yard line . although several players dove into the pile to attempt to recover it , newton did not and his lack of aggression later earned him heavy criticism . meanwhile , denver  ' s offense was kept out of the end zone for three plays , but a holding penalty on cornerback josh norman gave the broncos a new set of downs . then anderson scored on a 2 - yard touchdown run and manning completed a pass to bennie fowler for a 2 - point conversion , giving denver a 24 – 10 lead with 3:08 left and essentially putting the game away . carolina had two more drives , but failed to get a first down on each one ."

The query:

In [27]:
' '.join(list(feed['query_word'].reshape(-1)))

'who recovered the strip ball ?'

And the answer:

In [28]:
' '.join(list(feed['context_word'][output_baseline[0][0]:output_baseline[1][0]+1].reshape(-1)))

'ward'

Now put all steps together:

In [29]:
# editing
bidaf_converted = 'bidaf_mod.onnx'
SymbolicShapeInference.infer_shapes(bidaf, bidaf_converted)
convert_to_scan_model(bidaf_converted, bidaf_converted)
# When quantizing, there's an only_for_scan option to quantize only the GEMV inside Scan ops.
# This is useful when the input dims of LSTM being much bigger than hidden dims.
# BiDAF has several LSTMs with input dim being 800/1400/etc, while hidden dim is 100.
# So unlike the LSTMx4 model above, we use only_for_scan here
convert_matmul_model(bidaf_converted, bidaf_converted, only_for_scan=True)

# inference and verify accuracy
sess = onnxruntime.InferenceSession(bidaf_converted)
output = sess.run([], feed)
assert all([np.allclose(o, ob) for o, ob in zip(output, output_baseline)])

Check performance after all these steps:

In [30]:
start_baseline = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end_baseline = timer()

start_nuphar = timer()
for i in range(repeats):
    output = sess.run([], feed)
end_nuphar = timer()

print_speedup('Nuphar quantized BiDAF', end_baseline - start_baseline, end_nuphar - start_nuphar)

Nuphar quantized BiDAF speed-up 47.77%
    Baseline: 1.564 s, Current: 1.058 s


The benefit of quantization in BiDAF is not as great as in the LSTM sample above, because BiDAF has relatively small hidden dimensions, which limited the gain from optimization inside Scan ops. However, this model still benefits from fusion/vectorization/etc.

## 6. Ahead-Of-Time (AOT) compilation
Nuphar runs Just-in-time (JIT) compilation when loading models. The compilation may lead to slow cold start. We can use create_shared script to build dll from JIT code and accelerate model loading.

In [31]:
start_jit = timer()
sess = onnxruntime.InferenceSession(bidaf_converted)
end_jit = timer()
'JIT took {:.3f} seconds'.format(end_jit - start_jit)

'JIT took 4.721 seconds'

In [32]:
# create a folder for JIT cache
cache_dir = os.path.join(cwd, 'bidaf_cache')
# remove any stale cache files
if os.path.exists(cache_dir):
  shutil.rmtree(cache_dir)
os.makedirs(cache_dir, exist_ok=True)
# use settings to enable JIT cache
settings = 'nuphar_cache_path:{}'.format(cache_dir)
onnxruntime.capi._pybind_state.set_nuphar_settings(settings)
sess = onnxruntime.InferenceSession(bidaf_converted)

Now object files of JIT code is stored in cache_dir, let's link them into dll:

In [33]:
cache_versioned_dir = os.path.join(cache_dir, os.listdir(cache_dir)[0])
# use onnxruntime.nuphar.create_shared module to create dll
onnxruntime_dir = os.path.split(os.path.abspath(onnxruntime.__file__))[0]
subprocess.run([sys.executable, '-m', 'onnxruntime.nuphar.create_shared', '--input_dir', cache_versioned_dir], check=True)
os.listdir(cache_versioned_dir)

['jit.so']

Check the model loading speed-up with AOT dll:

In [34]:
start_aot = timer()
# NOTE: Nuphar settings string is not sticky. It needs to be reset before creating InferenceSession
settings = 'nuphar_cache_path:{}'.format(cache_dir)
onnxruntime.capi._pybind_state.set_nuphar_settings(settings)
sess = onnxruntime.InferenceSession(bidaf_converted)
end_aot = timer()
print_speedup('AOT', end_jit - start_jit, end_aot - start_aot)

AOT speed-up 764.62%
    Baseline: 4.721 s, Current: 0.546 s


## 7. Performance tuning for single thread inference.
By default, Nuphar enables parallel schedule for lower inference latency with multiple threads, when building with MKLML or OpenMP. For some models, user may want to run single-thread inference for better throughput with multiple concurrent inference threads, and turning off parallel schedule may make inference a bit faster in single thread.

In [35]:
# set OMP_NUM_THREADS to 1 for single thread inference
# this would mak
os.environ['OMP_NUM_THREADS'] = '1'

sess = onnxruntime.InferenceSession(bidaf_converted)
start_baseline = timer()
for i in range(repeats):
    output_baseline = sess_baseline.run([], feed)
end_baseline = timer()

# use NUPHAR_PARALLEL_MIN_WORKLOADS=0 to turn off parallel schedule, using settings string
# it can be set from environment variable too: os.environ['NUPHAR_PARALLEL_MIN_WORKLOADS'] = '0'
settings = 'nuphar_parallel_min_workloads:0'
onnxruntime.capi._pybind_state.set_nuphar_settings(settings)
sess = onnxruntime.InferenceSession(bidaf_converted)

start = timer()
for i in range(repeats):
    output = sess_baseline.run([], feed)
end = timer()
print_speedup('Single thread perf w/o parallel schedule', end_baseline - start_baseline, end - start)

Single thread perf w/o parallel schedule speed-up 1.05%
    Baseline: 1.542 s, Current: 1.526 s
