# Infery Asynchronous Predict Example

This notebook demonstrates the `predict_async` method of infery's TensorRT (TRT) inferencer. It showcases the performance boost you may gain by running preprocessing, postprocessing and data transfer in parallel with inference on the GPU. Begin by importing infery and a few basic libraries used throughout the notebook.

In [1]:
import time
import infery
import numpy as np
from typing import List, Dict, Union, Callable

__init__ -INFO- Infery was successfully imported with 8 CPUS and 1 GPUS.


## Import and model loading

Next we load our TRT engine or pickle to the GPU. Notice the `concurrency` argument. It determines how many async predictions we can launch at a time before their outputs may start being overwritten by subsequent predictions. This point will become clearer later in this notebook.

In [2]:
# Prep some notebook globals
WARMUP_ITERATIONS = 200
BENCHMARK_ITERATIONS = 1000
CONCURRENCY = 10
ENGINE_PATH = 'ENGINE_PATH'

# Load model
model = infery.load(ENGINE_PATH, concurrency=CONCURRENCY)

Detecting framework...
Detected framework for /home/naveassaf/Desktop/yolox_s.engine: FrameworkType.TENSORRT
infery_manager -INFO- Loading model /home/naveassaf/Desktop/yolox_s.engine to the GPU




infery_manager -INFO- Successfully loaded /home/naveassaf/Desktop/yolox_s.engine to the GPU.
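
To make the idea of outputs being overrun more concrete, here is a toy model in plain Python. It is only a sketch under an assumption: it supposes that a fixed pool of `concurrency` output slots is reused in round-robin order. The notebook does not describe how infery manages its buffers, so `ToyOutputRing`, `launch` and `read` are hypothetical names for illustration and are not part of infery.

```python
from typing import List


class ToyOutputRing:
    """Toy stand-in for a fixed pool of output buffers (hypothetical, NOT infery's implementation)."""

    def __init__(self, concurrency: int) -> None:
        self.slots: List[int] = [-1] * concurrency  # -1 marks an empty slot
        self.next_request = 0

    def launch(self) -> int:
        """Pretend to run request N and write its result into slot N % concurrency."""
        request_id = self.next_request
        self.slots[request_id % len(self.slots)] = request_id
        self.next_request += 1
        return request_id

    def read(self, request_id: int) -> int:
        """Return whatever currently sits in the slot that request_id was written to."""
        return self.slots[request_id % len(self.slots)]


ring = ToyOutputRing(concurrency=3)
first = ring.launch()                    # request 0 lands in slot 0
ring.launch()                            # request 1 lands in slot 1
ring.launch()                            # request 2 lands in slot 2
still_valid = ring.read(first) == first  # nothing has reused slot 0 yet
ring.launch()                            # request 3 wraps around and reuses slot 0
overrun = ring.read(first) != first      # request 0's output has been replaced
print(still_valid, overrun)
```

Under this assumption, a result has to be collected before `concurrency` further predictions are launched, which is the constraint the sliding-window loops later in the notebook respect.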


## Basic Asynchronous Predict Usage

`predict_async` behaves similarly to the normal infery `predict`, except that it returns an `AsyncExecutionHandle` object. On this handle you may call either `completed()`, to check whether the execution has finished, or `get()`, to wait until the execution completes and retrieve the result. For example:

<br /><br />
**Basic Usage**

In [3]:
# Use infery's example_inputs to get a random np.ndarray input tensor
test_input = model.example_inputs

# Perform an asynchronous predict, demonstrating the get() and completed() functionality
execution_handle = model.predict_async(test_input)
print(f'BEFORE EXECUTION --- COMPLETED: {execution_handle.completed()} --- TIME: {round(time.time_ns()/1e6)} [ms]')
async_predict_result = execution_handle.get()
print(f'AFTER EXECUTION  --- COMPLETED: {execution_handle.completed()}  --- TIME: {round(time.time_ns()/1e6)} [ms]')

# Ensure we received the same result as a normal predict. The index [0] here accesses the first output of the model.
print(f'NORMAL == ASYNC: {(model.predict(test_input)[0] == async_predict_result[0]).all()}')

BEFORE EXECUTION --- COMPLETED: False --- TIME: 1667405990426 [ms]
AFTER EXECUTION  --- COMPLETED: True  --- TIME: 1667405990449 [ms]
NORMAL == ASYNC: True
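
For readers who know Python's standard library, the handle behaves much like a `concurrent.futures.Future`. The sketch below is an analogy only and does not use infery: `Future.done()` plays the role of `completed()` and `Future.result()` plays the role of `get()`. The `gated_square` helper and the `release` event are hypothetical, added so the "not yet completed" state is deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()


def gated_square(value: int) -> int:
    """Stand-in for a long-running job: waits until the main thread releases it."""
    release.wait()
    return value * value


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(gated_square, 7)  # returns immediately, like predict_async
    done_before = future.done()            # False: the worker is still waiting on the event
    release.set()                          # let the job finish
    result = future.result()               # blocks until finished, like get()
    done_after = future.done()             # True once the result is available

print(done_before, result, done_after)
```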


## Utilizing Asynchronous Predict Within Your Application

So far we have seen the basic usage of `predict_async`. In the cells below we demonstrate how to hide preprocessing and postprocessing latencies by running inference on the GPU, pre/postprocessing on the CPU and data transfer between them in parallel. We begin by declaring our pre/postprocessing functions. Here they simply spin for `SLEEP_TIME` seconds to simulate work on the CPU (a busy loop keeps the CPU occupied, which mimics real work more accurately than sleeping).


<br /><br />
**Define Mock Pre/Postprocessing Callbacks**

In [4]:
SLEEP_TIME = 0.01

def preprocess(x=None, **kwargs) -> Union[List[np.ndarray], Dict]:
    # Mimic data fetching, augmentation, ...
    current_time = time.time()
    while time.time() < current_time + SLEEP_TIME: pass

    return x

def postprocess(x: List[np.ndarray], **kwargs):
    # Render boxes, store results, ...
    current_time = time.time()
    while time.time() < current_time + SLEEP_TIME: pass

    # Here we choose to return an output. This is not mandatory for postprocessing
    return x
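
The difference between spinning and sleeping can be checked with a small standalone measurement. This is a sketch and is not part of the original notebook: `spin` and `cpu_seconds` are hypothetical helpers, and `time.process_time()` is used because it counts only the CPU time consumed by the process.

```python
import time
from typing import Callable


def spin(duration_s: float) -> None:
    """Busy-wait for duration_s seconds, keeping the CPU occupied."""
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        pass


def cpu_seconds(func: Callable[[float], None], duration_s: float) -> float:
    """Return the CPU time this process consumed while running func(duration_s)."""
    start = time.process_time()
    func(duration_s)
    return time.process_time() - start


spin_cpu = cpu_seconds(spin, 0.05)         # roughly the full duration on an idle machine
sleep_cpu = cpu_seconds(time.sleep, 0.05)  # close to zero: a sleeping thread uses no CPU
print(f'spin: {spin_cpu:.4f} s CPU, sleep: {sleep_cpu:.4f} s CPU')
```

Both calls take about the same wall-clock time, but only the spinning version competes for the CPU the way real preprocessing would.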

We will now iterate over the provided inputs in a sliding window: at each step we postprocess the inference result that was enqueued `CONCURRENCY` iterations earlier, preprocess the "current" input and send it to the GPU using `predict_async`. We first define a `predict_multi` function which receives multiple inputs together with a `preprocessing_callback` and a `postprocessing_callback` to apply to them.

<br /><br />
**Predict_Multi - Example Use of Asynchronous Predict To Speed Up Inference Over Multiple Inputs**


In [5]:
def predict_multi(multi_x: List, *, preprocessing_callback: Callable, postprocessing_callback: Callable) -> List[object]:
    execution_handles = []
    postprocessed_results = []

    for x in multi_x:
        # Once CONCURRENCY predictions are in flight, collect the oldest one before enqueuing another
        if len(execution_handles) >= CONCURRENCY:
            inference_result = execution_handles.pop(0).get()
            postprocessed_results.append(postprocessing_callback(inference_result))

        # Preprocess and enqueue the current input
        x = preprocessing_callback(x)
        execution_handles.append(model.predict_async(x))

    # Drain the predictions still in flight so that every input yields a result
    while execution_handles:
        inference_result = execution_handles.pop(0).get()
        postprocessed_results.append(postprocessing_callback(inference_result))

    return postprocessed_results
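
The same sliding-window pattern can be tried without a GPU. The sketch below is an illustration under assumptions, not infery code: a single-worker `ThreadPoolExecutor` stands in for the device, `submit` stands in for `predict_async`, and `Future.result()` stands in for `get()`. The function name `pipelined_map` and its parameters are hypothetical.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Deque, Iterable, List


def pipelined_map(inputs: Iterable[Any], *, infer: Callable, preprocess: Callable,
                  postprocess: Callable, window: int) -> List[Any]:
    """Keep up to `window` inferences in flight while pre/postprocessing on the caller's thread."""
    results: List[Any] = []
    pending: Deque[Future] = deque()

    with ThreadPoolExecutor(max_workers=1) as device:  # mock "GPU": runs one job at a time
        for item in inputs:
            # Window is full: collect and postprocess the oldest in-flight result first
            if len(pending) >= window:
                results.append(postprocess(pending.popleft().result()))
            # Preprocess and enqueue the current input
            pending.append(device.submit(infer, preprocess(item)))

        # Drain whatever is still in flight, preserving input order
        while pending:
            results.append(postprocess(pending.popleft().result()))

    return results


outputs = pipelined_map(
    range(8),
    infer=lambda v: v * 2,        # mock inference
    preprocess=lambda v: v + 1,   # mock preprocessing
    postprocess=lambda v: v - 1,  # mock postprocessing
    window=3,
)
print(outputs)  # each element equals (v + 1) * 2 - 1 for v in 0..7
```

A `deque` is used here because `popleft()` runs in constant time, whereas `list.pop(0)` shifts every remaining element; with a window of ten handles the difference is negligible either way.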

Now let's check the performance difference.

<br /><br />
**Benchmarking Predict_Multi**

In [6]:
# Prep benchmarking resources and warm up the GPU
test_input_list = [model.example_inputs] * BENCHMARK_ITERATIONS
for _ in range(WARMUP_ITERATIONS):
    model.predict(test_input)

# Benchmark predict_async's throughput.
start = time.perf_counter()
results = predict_multi(test_input_list, preprocessing_callback=preprocess, postprocessing_callback=postprocess)
print(f'PREDICT MULTI TOOK: {round(time.perf_counter() - start, 7) * 1000} [ms]')

# Roughly benchmark normal predict throughput
start = time.perf_counter()
results = []
for x in test_input_list:
    x = preprocess(x)
    outputs = model.predict(x)
    # Postprocess the model outputs, mirroring what predict_multi does
    results.append(postprocess(outputs))
print(f'PREDICT NORMAL TOOK: {round(time.perf_counter() - start, 7) * 1000} [ms]')

PREDICT MULTI TOOK: 32493.3111 [ms]
PREDICT NORMAL TOOK: 50207.6232 [ms]
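
A quick back-of-the-envelope reading of these two timings helps show where the gain comes from. The arithmetic below uses only numbers that appear in this notebook (`BENCHMARK_ITERATIONS`, `SLEEP_TIME` and the two measurements). The interpretation is a rough estimate that ignores Python overhead and warm-up effects.

```python
BENCHMARK_ITERATIONS = 1000
SLEEP_TIME = 0.01                   # seconds spent in each of preprocess and postprocess
predict_multi_ms = 32493.3111       # measured above
predict_normal_ms = 50207.6232      # measured above

speedup = predict_normal_ms / predict_multi_ms

# Per-input cost of the sequential loop, and the share of it spent in the mock CPU work
normal_ms_per_input = predict_normal_ms / BENCHMARK_ITERATIONS
cpu_ms_per_input = 2 * SLEEP_TIME * 1000

# What remains is a rough estimate of a blocking predict (inference plus data transfer)
estimated_predict_ms = normal_ms_per_input - cpu_ms_per_input

# Per-input cost of the pipelined loop
multi_ms_per_input = predict_multi_ms / BENCHMARK_ITERATIONS

print(f'speedup: {speedup:.2f}x')
print(f'estimated blocking predict: {estimated_predict_ms:.1f} ms per input')
print(f'pipelined cost: {multi_ms_per_input:.1f} ms per input')
```

The pipelined cost per input lands close to the estimated cost of the blocking predict alone, which is consistent with most of the 20 ms of CPU work per input being overlapped with GPU execution.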


`predict_async` may also be driven by a generator. If the input source is provided as an iterator, fetching the next input may itself block (for example on a read from the network), so the loop acts as a sort of async run loop that blocks either on the busy GPU or on the input source. For example:

<br /><br />
**Predict_Iter - Example Use of Asynchronous Predict To Speed Up Inference Over an Iterator**

In [7]:
def predict_iter(x_iterator, preprocessing_callback: Callable, postprocessing_callback: Callable):
    execution_handles = []

    for current_x in x_iterator():
        # Once CONCURRENCY predictions are in flight, collect the oldest one before enqueuing another
        if len(execution_handles) >= CONCURRENCY:
            inference_result = execution_handles.pop(0).get()
            postprocessing_callback(inference_result)

        # Preprocess and enqueue the current input
        current_x = preprocessing_callback(current_x)
        execution_handles.append(model.predict_async(current_x))

def data_generator():
    # Transient problem - `predict_iter` takes CONCURRENCY iterations to fill up, so we yield
    # CONCURRENCY extra inputs to make sure BENCHMARK_ITERATIONS results get postprocessed.
    test_input_list = [model.example_inputs] * (BENCHMARK_ITERATIONS + CONCURRENCY)

    for repetition in range(BENCHMARK_ITERATIONS + CONCURRENCY):
        yield test_input_list[repetition]

Now, let's run our generator. In many applications this call could iterate over the entire dataset or never return. Notice that the return value of `postprocessing_callback` is not used here; any necessary handling should be performed inside the callback itself.

<br /><br />
**Benchmarking Predict_Iter**

In [8]:
# Prep benchmarking resources and warm up the GPU
test_input_list = [model.example_inputs] * BENCHMARK_ITERATIONS
for _ in range(WARMUP_ITERATIONS):
    model.predict(test_input)

# Benchmark predict_async's throughput.
start = time.perf_counter()
predict_iter(data_generator, preprocessing_callback=preprocess, postprocessing_callback=postprocess)
print(f'PREDICT ITER TOOK: {round(time.perf_counter() - start, 7) * 1000} [ms]')

# Roughly benchmark normal predict throughput
start = time.perf_counter()
for x in test_input_list:
    x = preprocess(x)
    outputs = model.predict(x)
    # Postprocess the model outputs, mirroring what predict_iter does
    postprocess(outputs)
print(f'PREDICT NORMAL TOOK: {round(time.perf_counter() - start, 7) * 1000} [ms]')

PREDICT ITER TOOK: 32571.0046 [ms]
PREDICT NORMAL TOOK: 50935.396 [ms]
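
If the caller does need the results, one possible variation is to turn the loop itself into a generator that yields each result as it becomes available. The sketch below is a GPU-free illustration under assumptions and is not an infery API: a single-worker `ThreadPoolExecutor` stands in for the device, and `iter_pipelined` with its parameters are hypothetical names.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import count, islice
from typing import Any, Callable, Iterable, Iterator


def iter_pipelined(source: Iterable[Any], infer: Callable, window: int) -> Iterator[Any]:
    """Lazily yield inference results in input order, keeping up to `window` jobs in flight."""
    pending = deque()
    with ThreadPoolExecutor(max_workers=1) as device:  # mock "GPU"
        for item in source:
            # Window is full: hand the oldest result to the consumer before enqueuing more
            if len(pending) >= window:
                yield pending.popleft().result()
            pending.append(device.submit(infer, item))

        # The source is exhausted: drain the remaining in-flight jobs
        while pending:
            yield pending.popleft().result()


# Works with an endless source because results are produced lazily
stream = iter_pipelined(count(), infer=lambda v: v * v, window=4)
first_five = list(islice(stream, 5))
stream.close()  # stop the generator so the mock device shuts down cleanly
print(first_five)

# A finite source is drained completely, so no extra padding inputs are needed
all_squares = list(iter_pipelined(range(6), infer=lambda v: v * v, window=4))
print(all_squares)
```

Because the generator drains its pending jobs when the source ends, a finite input yields exactly one result per input.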
