# Relax BYOC Tutorial

Bring-Your-Own-Codegen (BYOC) is the interface TVM provides for integrating external libraries such as TensorRT, CUTLASS, and DNNL. This document gives a high-level overview of how to use BYOC in Relax, how to integrate a new library, and how Relax BYOC differs from Relay BYOC.

## User-level Guide

### Setup

Enable your BYOC backend in `config.cmake` and build TVM.
For example, to use TensorRT:

```cmake
set(USE_TENSORRT_CODEGEN ON)
set(USE_TENSORRT_RUNTIME ON)
```
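
After rebuilding, it can be worth confirming that the flags made it into the library you are importing. The sketch below assumes `tvm.support.libinfo()` reports build options under the same names used in `config.cmake`. `missing_byoc_flags` is a hypothetical helper, not part of TVM, and the fallback dictionary is made-up sample data used only when TVM cannot be imported.

```python
def missing_byoc_flags(
    build_flags, required=("USE_TENSORRT_CODEGEN", "USE_TENSORRT_RUNTIME")
):
    """Return the required flags that are absent or switched off in `build_flags`."""
    off_values = ("OFF", "0", "FALSE", "")
    return [
        flag
        for flag in required
        if str(build_flags.get(flag, "OFF")).upper() in off_values
    ]


try:
    import tvm.support

    # Assumption: libinfo() exposes the config.cmake options as a dict of strings.
    build_flags = dict(tvm.support.libinfo())
except ImportError:
    # TVM is not installed here; use made-up sample data to show the helper.
    build_flags = {"USE_TENSORRT_CODEGEN": "ON", "USE_TENSORRT_RUNTIME": "OFF"}

# An empty list means every required flag is enabled.
print(missing_byoc_flags(build_flags))
```

A flag set to a path (for example, a TensorRT install directory) counts as enabled, since only off-like values are treated as missing.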

### Basic workflow

Relax BYOC offloads computation at the granularity of a Relax function. If you want to offload part of a graph through BYOC, the end-to-end workflow is as follows:

(1) Partition the graph into a set of Relax functions and annotate each function you want to offload with a target codegen and its global symbol. This example assumes TensorRT. Partitioning can be done manually or by another pass (e.g., pattern matching; see "Offload with pattern matching" for details). Note that the target codegen must be able to handle every operator in the annotated function; otherwise, compilation fails with an error.

```python
import tvm
from tvm import relax
from tvm.script.parser import relax as R, tir as T

# This example wants to offload `byoc_func` to TensorRT
@tvm.script.ir_module
class InputModule:
    @R.function
    def byoc_func(
        x: R.Tensor((16, 16), "float32"), y: R.Tensor((16, 16), "float32")
    ) -> R.Tensor((16, 16), "float32"):
        # Annotate a function you want to offload
        R.func_attr({"Codegen": "tensorrt", "global_symbol": "byoc_func"})
        z1 = R.multiply(x, y)
        z2 = R.add(z1, z1)
        z3 = R.add(z1, z2)
        return z3

    @R.function
    def tvm_func(
        x: R.Tensor((16, 16), "float32"), w: R.Tensor((16, 16), "float32")
    ) -> R.Tensor((16, 16), "float32"):
        gv0 = R.multiply(x, w)
        gv1 = R.add(x, gv0)
        return gv1

    @R.function
    def main(
        x: R.Tensor((16, 16), "float32"), y: R.Tensor((16, 16), "float32")
    ) -> R.Tensor((16, 16), "float32"):
        lv0 = byoc_func(x, y)
        lv1 = tvm_func(x, lv0)
        return lv1

mod = InputModule
```

(2) Run the `RunCodegen` pass to produce the external runtime module for the annotated Relax functions. Internally, it invokes the codegen named in each annotation and attaches the generated BYOC runtime module to the IRModule as an attribute so that the executor can link it in. It then rewrites every call to an annotated function from a Relax function call into a packed function call on the attached global symbol, so that the call goes into the BYOC runtime. Once the BYOC functions have been consumed, they are no longer needed, so `RunCodegen` removes them by running the `RemoveUnusedFunctions` pass at the end.

```python
new_mod = relax.transform.RunCodegen()(mod)
print(new_mod)
```

```
@R.function
def main(x: R.Tensor((16, 16), dtype="float32"), y: R.Tensor((16, 16), dtype="float32")) -> R.Tensor((16, 16), dtype="float32"):
    # block 0
    lv0 = R.call_tir("byoc_func", (x, y), (16, 16), dtype="float32")
    lv1: R.Tensor((16, 16), dtype="float32") = tvm_func(x, lv0)
    return lv1


@R.function
def tvm_func(x1: R.Tensor((16, 16), dtype="float32"), w: R.Tensor((16, 16), dtype="float32")) -> R.Tensor((16, 16), dtype="float32"):
    # block 0
    gv0: R.Tensor((16, 16), dtype="float32") = R.multiply(x1, w)
    gv1: R.Tensor((16, 16), dtype="float32") = R.add(x1, gv0)
    return gv1
```

As you can see, the function annotated for TensorRT has been consumed, and its callers now invoke a packed function through the given global symbol instead of calling a Relax function.

```python
print(new_mod.attrs)
```

```
{"external_mods": [runtime.Module(0x35149e8)]}
```


Note that the runtime module generated by the TensorRT codegen is attached to the IRModule under the `external_mods` attribute.
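
To make the effects of step (2) concrete, here is a toy model in plain Python. It does not use TVM and is not how `RunCodegen` is implemented; `ToyFunction`, `SUPPORTED_OPS`, and `toy_run_codegen` are hypothetical names invented for illustration. It only mirrors the behavior described above: check that the backend supports every operator, attach an external module, rewrite callers to packed calls, and drop the consumed function.

```python
from dataclasses import dataclass, field

# Hypothetical table of the operators each toy backend can handle.
SUPPORTED_OPS = {"tensorrt": {"multiply", "add"}}


@dataclass
class ToyFunction:
    ops: list  # operator names, or ("call", callee_name) tuples
    attrs: dict = field(default_factory=dict)


def toy_run_codegen(functions):
    """Mimic the described effects of RunCodegen on a dict of name -> ToyFunction."""
    external_mods = []
    offloaded = {}  # function name -> global symbol of its external implementation
    for name, func in functions.items():
        backend = func.attrs.get("Codegen")
        if backend is None:
            continue  # not annotated, stays with TVM
        supported = SUPPORTED_OPS.get(backend, set())
        unsupported = [op for op in func.ops if op not in supported]
        if unsupported:
            raise ValueError(f"{backend} cannot compile {name}: {unsupported}")
        symbol = func.attrs["global_symbol"]
        # Stand-in for the runtime module a real codegen would produce.
        external_mods.append(f"<{backend} module for {symbol}>")
        offloaded[name] = symbol

    new_functions = {}
    for name, func in functions.items():
        if name in offloaded:
            continue  # consumed functions are removed
        # Rewrite calls to offloaded functions into packed calls on the symbol.
        new_ops = [
            ("call_packed", offloaded[op[1]])
            if isinstance(op, tuple) and op[0] == "call" and op[1] in offloaded
            else op
            for op in func.ops
        ]
        new_functions[name] = ToyFunction(new_ops, dict(func.attrs))
    return new_functions, {"external_mods": external_mods}


module = {
    "byoc_func": ToyFunction(
        ["multiply", "add", "add"],
        {"Codegen": "tensorrt", "global_symbol": "byoc_func"},
    ),
    "tvm_func": ToyFunction(["multiply", "add"]),
    "main": ToyFunction([("call", "byoc_func"), ("call", "tvm_func")]),
}
new_module, attrs = toy_run_codegen(module)
print(sorted(new_module))      # ['main', 'tvm_func']
print(new_module["main"].ops)  # [('call_packed', 'byoc_func'), ('call', 'tvm_func')]
print(attrs)
```

In the toy, annotating a function that contains an operator missing from `SUPPORTED_OPS` raises an error before anything is rewritten, which corresponds to the compilation error mentioned in step (1).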

(3) Apply the rest of the pass sequence (e.g., lowering, MetaSchedule tuning) and run the final IRModule with the selected executor.

```python
from tvm.relax.testing import transform
import tempfile
from tvm.relax.transform.tuning_api import Trace
import numpy as np

# Target gpu
target_str = "nvidia/geforce-rtx-3070"
target = tvm.target.Target(target_str)
dev = tvm.cuda()

with tempfile.TemporaryDirectory() as work_dir:
    with target, tvm.transform.PassContext(trace=Trace(mod), opt_level=3):
        # Apply the rest of pass seq
        rest_seq = tvm.transform.Sequential([
            transform.LowerWithRelayOpStrategyPass(target),
            relax.transform.MetaScheduleTuneIRMod(params={}, work_dir=work_dir, max_trials_global=8),
            relax.transform.MetaScheduleApplyDatabase(work_dir),
        ])
        final_mod = rest_seq(new_mod)

# Build and run `final_mod` with the chosen executor
ex = relax.vm.build(final_mod, target, params={})
vm = relax.VirtualMachine(ex, dev)

np0 = np.random.rand(16, 16).astype(np.float32)
np1 = np.random.rand(16, 16).astype(np.float32)
data0 = tvm.nd.array(np0, dev)
data1 = tvm.nd.array(np1, dev)
inputs = [data0, data1]
out = vm["main"](*inputs)
```

```
2023-02-09 16:22:23 [INFO] [task_scheduler.cc:260] Task #1 has finished. Remaining task(s): 0
2023-02-09 16:22:23 [DEBUG] [task_scheduler.cc:318] 
 ID |     Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------
  0 | multiply |  256 |      1 |         0.1631 |       1.5697 |                1.5697 |      4 |    Y 
  1 |      add |  256 |      1 |         0.1632 |       1.5683 |                1.5683 |      4 |    Y 
-------------------------------------------------------------------------------------------------------
Total trials: 8
Total latency (us): 3.13802
```
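
The workflow above produces `out` but never checks it. One way to sanity-check the offloaded result is to compare it against a reference that follows the same arithmetic as `byoc_func` and `tvm_func`. The sketch below is an assumption-laden illustration: `reference_main` is a hypothetical helper, and it uses only `*` and `+`, so it works on Python floats as well as on NumPy arrays.

```python
def reference_main(x, y):
    """Element-wise reference for `main`: tvm_func(x, byoc_func(x, y))."""
    # byoc_func(x, y)
    z1 = x * y
    z2 = z1 + z1
    z3 = z1 + z2
    # tvm_func(x, z3)
    gv0 = x * z3
    gv1 = x + gv0
    return gv1


# Scalar spot check: z1 = 1.0, z3 = 3.0, gv0 = 6.0, gv1 = 8.0
print(reference_main(2.0, 0.5))  # 8.0
```

With the arrays from step (3), a comparison could look like `np.testing.assert_allclose(out.numpy(), reference_main(np0, np1), rtol=1e-5)`, assuming the VM returns a TVM NDArray that exposes `.numpy()`.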



## Offload with pattern matching

Following the same principle, Relax offers two convenient passes that let you easily group operators and annotate the resulting functions for your target.