# Using External Libraries in Relay

**Original authors**: [Masahiro Masuda](https://github.com/masahi), [Truman Tian](https://github.com/SiNZeRo)

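This is a short tutorial on how to use external libraries, such as cuDNN or cuBLAS, with Relay.

Relay uses TVM internally to generate target-specific code. For example, with the cuda backend, TVM generates cuda kernels for all layers in the user-provided network. Sometimes, however, it is also helpful to incorporate external libraries developed by various vendors into Relay. Luckily, TVM has a mechanism to call into these libraries transparently. For Relay users, all that is needed is to set the target string appropriately.

Before you can use external libraries from Relay, TVM must be built with the libraries you want to use. For example, to use cuDNN, enable the `USE_CUDNN` option in `cmake/config.cmake` and, if necessary, specify the cuDNN include and library directories.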
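
Before building a network, it can help to confirm that the option was actually enabled. The helper below is a minimal sketch that reads a build-option value the way a CMake-style switch is usually interpreted. `library_enabled` is a hypothetical helper, not part of TVM, and it assumes the build options are available as a plain mapping from option name to string. In a TVM install, `tvm.support.libinfo()` should provide such a mapping, but the exact values it reports are an assumption here.

```python
# Values that conventionally mean "disabled" for a CMake-style option.
_FALSE_VALUES = {"", "0", "OFF", "NO", "FALSE", "N", "NOTFOUND"}


def library_enabled(build_info, option):
    """Return True if `option` looks enabled in `build_info`.

    `build_info` is any mapping of option names to strings. A missing
    option counts as disabled; any other value, such as a path to an
    install directory, counts as enabled.
    """
    value = str(build_info.get(option, "OFF")).strip().upper()
    return value not in _FALSE_VALUES


# Hand-written mappings standing in for a real build configuration.
with_cudnn = {"USE_CUDA": "ON", "USE_CUDNN": "ON"}
without_cudnn = {"USE_CUDA": "ON", "USE_CUDNN": "OFF"}

print(library_enabled(with_cudnn, "USE_CUDNN"))     # True
print(library_enabled(without_cudnn, "USE_CUDNN"))  # False
```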

To begin with, import Relay and TVM.

In [1]:
import tvm
from tvm import te
import numpy as np
from tvm.contrib import graph_executor as runtime
from tvm import relay
from tvm.relay import testing
import tvm.testing

## Create a simple network

Let's create a very simple network for demonstration. It consists of a convolution, batch normalization, and ReLU activation.

In [2]:
out_channels = 16
batch_size = 1

data = relay.var("data", relay.TensorType((batch_size, 3, 224, 224), "float32"))
weight = relay.var("weight")
bn_gamma = relay.var("bn_gamma")
bn_beta = relay.var("bn_beta")
bn_mmean = relay.var("bn_mean")
bn_mvar = relay.var("bn_var")

simple_net = relay.nn.conv2d(
    data=data, weight=weight, kernel_size=(3, 3), channels=out_channels, padding=(1, 1)
)
simple_net = relay.nn.batch_norm(simple_net, bn_gamma, bn_beta, bn_mmean, bn_mvar)[0]
simple_net = relay.nn.relu(simple_net)
simple_net = relay.Function(relay.analysis.free_vars(simple_net), simple_net)

data_shape = (batch_size, 3, 224, 224)
net, params = testing.create_workload(simple_net)
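
The output shape used in the later cells, `(batch_size, out_channels, 224, 224)`, follows from the usual convolution size formula. The sketch below works it through for this network. `conv2d_out_size` is a hypothetical helper, and it assumes stride 1 and dilation 1, since neither is passed to `relay.nn.conv2d` above.

```python
def conv2d_out_size(in_size, kernel, pad, stride=1, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (in_size + 2 * pad - effective_kernel) // stride + 1


batch_size, in_channels, height, width = 1, 3, 224, 224
out_channels = 16

out_h = conv2d_out_size(height, kernel=3, pad=1)
out_w = conv2d_out_size(width, kernel=3, pad=1)

# A 3x3 kernel with padding 1 preserves the spatial size.
out_shape = (batch_size, out_channels, out_h, out_w)
# Weight shape for NCHW data: (out_channels, in_channels, kh, kw).
weight_shape = (out_channels, in_channels, 3, 3)

print(out_shape)     # (1, 16, 224, 224)
print(weight_shape)  # (16, 3, 3, 3)
```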

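## Build and run with the cuda backend

Build and run this network with the cuda backend, as usual. Setting the logging level to `DEBUG` dumps the result of Relay graph compilation as pseudo code.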

In [3]:
import logging

logging.basicConfig(level=logging.DEBUG)  # to dump TVM IR after fusion

target = "cuda"
lib = relay.build_module.build(net, target, params=params)

dev = tvm.device(target, 0)
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
module = runtime.GraphModule(lib["default"](dev))
module.set_input("data", data)
module.run()
out_shape = (batch_size, out_channels, 224, 224)
out = module.get_output(0, tvm.nd.empty(out_shape))
out_cuda = out.numpy()

DEBUG:autotvm:Finish loading 825 records
INFO:te_compiler:Using injective.cpu for add based on highest priority (10)
INFO:te_compiler:Using injective.cpu for sqrt based on highest priority (10)
INFO:te_compiler:Using injective.cpu for divide based on highest priority (10)
INFO:te_compiler:Using injective.cpu for multiply based on highest priority (10)
INFO:te_compiler:Using injective.cpu for expand_dims based on highest priority (10)
INFO:te_compiler:Using injective.cpu for negative based on highest priority (10)
INFO:te_compiler:Using injective.cpu for multiply based on highest priority (10)
INFO:te_compiler:Using injective.cpu for add based on highest priority (10)
INFO:te_compiler:Using injective.cpu for expand_dims based on highest priority (10)
DEBUG:autotvm:Cannot find tuning records for:
    target=cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32
    key=('conv2d_nchw.cuda', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (16, 3, 3, 3), 'float32'), 

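The generated pseudo code should look something like the output below.

```{tip}
Note how batch normalization and ReLU activation are fused with the convolution: the function metadata below lists a single function, `tvmgen_default_fused_nn_conv2d_multiply_add_nn_relu`.
```

TVM generates a single fused kernel from this representation.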

In [4]:
print(lib.ir_mod["main"])

fn (%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %weight: Tensor[(16, 3, 3, 3), float32] /* ty=Tensor[(16, 3, 3, 3), float32] */, %bn_gamma: Tensor[(16), float32] /* ty=Tensor[(16), float32] */, %bn_beta: Tensor[(16), float32] /* ty=Tensor[(16), float32] */, %bn_mean: Tensor[(16), float32] /* ty=Tensor[(16), float32] */, %bn_var: Tensor[(16), float32] /* ty=Tensor[(16), float32] */) -> Tensor[(1, 16, 224, 224), float32] {
  %0 = nn.conv2d(%data, %weight, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %1 = nn.batch_norm(%0, %bn_gamma, %bn_beta, %bn_mean, %bn_var) /* ty=(Tensor[(1, 16, 224, 224), float32], Tensor[(16), float32], Tensor[(16), float32]) */;
  %2 = %1.0 /* ty=Tensor[(1, 16, 224, 224), float32] */;
  nn.relu(%2) /* ty=Tensor[(1, 16, 224, 224), float32] */
} /* ty=fn (Tensor[(1, 3, 224, 224), float32], Tensor[(16, 3, 3, 3), float32], Tensor[(16), float32], Tensor[(16), float32], Ten

In [5]:
lib.function_metadata

{"tvmgen_default_fused_nn_conv2d_multiply_add_nn_relu": FunctionInfoNode(
workspace_sizes={cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32: 768},
  io_sizes={cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32: 3211264},
  constant_sizes={cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32: 0},
  tir_primfuncs={cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32: PrimFunc([placeholder, placeholder, placeholder, placeholder, T_relu]) attrs={"from_legacy_te_schedule": (bool)1, "global_symbol": "tvmgen_default_fused_nn_conv2d_multiply_add_nn_relu", "tir.noalias": (bool)1, "hash": "97c4f8c60220fadf"} {
  // attr [iter_var(blockIdx.z, , blockIdx.z)] thread_extent = 1
  allocate conv2d_nchw[float32 * 28], storage_scope = local
  allocate pad_temp.shared[float32 * 114], storage_scope = shared
  allocate placeholder.shared[float32 * 48], storage_scope = shared
  // attr [iter_var(blockIdx.y, , block
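
The fused function name above contains `multiply_add` rather than `batch_norm`. That is consistent with inference-time batch normalization being rewritten into a per-channel scale and shift that are computed ahead of time, which would also explain the `sqrt`, `divide`, `negative`, `multiply`, and `add` entries in the log. The sketch below checks the algebra in plain Python. The numbers are made up, `eps = 1e-5` is an assumed value, and the code illustrates the identity only; it is not TVM's actual pass.

```python
import math

eps = 1e-5  # assumed epsilon, for illustration only

# Made-up per-channel parameters and one activation value per channel.
gamma = [1.5, 0.5]
beta = [0.1, -0.2]
mean = [0.3, -1.0]
var = [4.0, 0.25]
x = [2.0, -3.0]

# Direct batch norm: gamma * (x - mean) / sqrt(var + eps) + beta
direct = [
    g * (xi - m) / math.sqrt(v + eps) + b
    for g, b, m, v, xi in zip(gamma, beta, mean, var, x)
]

# Folded form: precompute scale and shift once, then x * scale + shift.
scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
shift = [b + (-m) * s for b, m, s in zip(beta, mean, scale)]
folded = [xi * s + sh for xi, s, sh in zip(x, scale, shift)]

for d, f in zip(direct, folded):
    assert math.isclose(d, f, rel_tol=1e-9, abs_tol=1e-12)
print("folded batch norm matches the direct formula")
```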

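## Use cuDNN for a convolutional layer

cuDNN convolution kernels can be used in place of the ones TVM generates. To do that, all that is needed is to append the option `" -libs=cudnn"` to the target string.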

In [6]:
net, params = testing.create_workload(simple_net)
target = "cuda -libs=cudnn"  # use cudnn for convolution
lib = relay.build_module.build(net, target, params=params)

dev = tvm.device(target, 0)
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
module = runtime.GraphModule(lib["default"](dev))
module.set_input("data", data)
module.run()
out_shape = (batch_size, out_channels, 224, 224)
out = module.get_output(0, tvm.nd.empty(out_shape))
out_cudnn = out.numpy()

DEBUG:autotvm:Finish loading 825 records
INFO:te_compiler:Using injective.cpu for add based on highest priority (10)
INFO:te_compiler:Using injective.cpu for sqrt based on highest priority (10)
INFO:te_compiler:Using injective.cpu for divide based on highest priority (10)
INFO:te_compiler:Using injective.cpu for multiply based on highest priority (10)
INFO:te_compiler:Using injective.cpu for expand_dims based on highest priority (10)
INFO:te_compiler:Using injective.cpu for negative based on highest priority (10)
INFO:te_compiler:Using injective.cpu for multiply based on highest priority (10)
INFO:te_compiler:Using injective.cpu for add based on highest priority (10)
INFO:te_compiler:Using injective.cpu for expand_dims based on highest priority (10)
[09:19:32] /media/pc/data/4tb/lxw/books/tvm/src/runtime/contrib/cudnn/conv_forward.cc:135: 	CUDNN Found 8 fwd algorithms, choosing CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
[09:19:32] /media/pc/data/4tb/lxw/books/tvm/src/runtime/contrib/cudn

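```{note}
If you use cuDNN, Relay cannot fuse the convolution with the layers that follow it. This is because layer fusion happens at the level of TVM's internal representation (IR). Relay treats external libraries as black boxes, so there is no way to fuse them with TVM IR.
```

The pseudo code below shows that cuDNN convolution + batch norm + ReLU is split into two stages of computation, one for the cuDNN call and the other for the rest of the operations.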

In [19]:
lib.ir_mod

#[version = "0.0.5"]
def @main(%data: Tensor[(1, 3, 224, 224), float32] /* ty=Tensor[(1, 3, 224, 224), float32] */, %weight: Tensor[(16, 3, 3, 3), float32] /* ty=Tensor[(16, 3, 3, 3), float32] */, %bn_gamma: Tensor[(16), float32] /* ty=Tensor[(16), float32] */, %bn_beta: Tensor[(16), float32] /* ty=Tensor[(16), float32] */, %bn_mean: Tensor[(16), float32] /* ty=Tensor[(16), float32] */, %bn_var: Tensor[(16), float32] /* ty=Tensor[(16), float32] */) -> Tensor[(1, 16, 224, 224), float32] {
  %0 = nn.conv2d(%data, %weight, padding=[1, 1, 1, 1], channels=16, kernel_size=[3, 3]) /* ty=Tensor[(1, 16, 224, 224), float32] */;
  %1 = nn.batch_norm(%0, %bn_gamma, %bn_beta, %bn_mean, %bn_var) /* ty=(Tensor[(1, 16, 224, 224), float32], Tensor[(16), float32], Tensor[(16), float32]) */;
  %2 = %1.0 /* ty=Tensor[(1, 16, 224, 224), float32] */;
  nn.relu(%2) /* ty=Tensor[(1, 16, 224, 224), float32] */
}

## Verify the result

We can check that the results of the two runs match.

In [7]:
tvm.testing.assert_allclose(out_cuda, out_cudnn, rtol=1e-5)
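
The two backends can differ slightly in floating-point rounding, so the outputs are compared with a tolerance rather than for exact equality. The sketch below spells out the element-wise criterion that NumPy-style closeness checks use, `|actual - desired| <= atol + rtol * |desired|`. `all_close` is a hypothetical pure-Python stand-in, and the `atol` value is an assumption chosen for illustration.

```python
def all_close(actual, desired, rtol=1e-5, atol=1e-7):
    """Element-wise tolerance check over two equal-length sequences."""
    if len(actual) != len(desired):
        return False
    return all(
        abs(a - d) <= atol + rtol * abs(d)
        for a, d in zip(actual, desired)
    )


reference = [1.0, -2.0, 0.5]
tiny_drift = [1.000001, -2.000001, 0.5000001]  # errors of about 1e-6 or less
large_drift = [1.001, -2.0, 0.5]               # error of about 1e-3

print(all_close(tiny_drift, reference))   # True
print(all_close(large_drift, reference))  # False
```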

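## Conclusion

This tutorial covered the usage of cuDNN with Relay. TVM also supports cuBLAS. If cuBLAS is enabled, it is used inside fully connected layers (`relay.nn.dense`). To use cuBLAS, set the target string to `"cuda -libs=cublas"`.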

cuDNN and cuBLAS can also be used together: `"cuda -libs=cudnn,cublas"`.

For the ROCm backend, MIOpen and rocBLAS are supported. They can be enabled with the target `"rocm -libs=miopen,rocblas"`.
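
The target strings above all follow the same pattern: a backend name followed by `-libs=` and a comma-separated list of libraries. A small helper can assemble them. `target_with_libs` is a hypothetical name introduced here, not part of TVM, and it only builds the string.

```python
def target_with_libs(backend, libs=()):
    """Build a target string such as "cuda -libs=cudnn,cublas"."""
    libs = [lib.strip() for lib in libs if lib.strip()]
    if not libs:
        return backend
    return f"{backend} -libs={','.join(libs)}"


print(target_with_libs("cuda"))                         # cuda
print(target_with_libs("cuda", ["cudnn"]))              # cuda -libs=cudnn
print(target_with_libs("cuda", ["cudnn", "cublas"]))    # cuda -libs=cudnn,cublas
print(target_with_libs("rocm", ["miopen", "rocblas"]))  # rocm -libs=miopen,rocblas
```

The resulting string can be passed as `target` to `relay.build_module.build`, in the same way as the literal strings in the cells above.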

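Being able to use external libraries is great, but there are some caveats to keep in mind.

- First, using external libraries may restrict how you can use TVM and Relay.
    
    For example, MIOpen currently supports only the NCHW layout and the fp32 data type, so other layouts or data types cannot be used in TVM when it is enabled.

- Second, and more importantly, external libraries restrict the possibility of operator fusion during graph compilation, as shown above.

    TVM and Relay aim to achieve the best performance on a variety of hardware through joint operator-level and graph-level optimization.
    To achieve this goal, we should continue developing better optimizations for TVM and Relay, while using external libraries as a convenient way to fall back to existing implementations when necessary.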