[DISCUSS] Multi-backend Dispatching in Relax #46
CC: @sunggg
We adopted an approach similar to P1 at Meta, and we have an approach similar to P2 on our roadmap, although we haven't worked out a detailed plan for P2 since there is no urgent requirement nor an obvious performance gain. One challenge for the pattern in P1 is variants. For example, we have a case where TVM's GeLU can only run in FP32 due to an accuracy issue.

Such an IR cannot match the pattern specified for CUTLASS, which only considers a single dtype. Other than that, the overall design looks good to me. As I mentioned, we have a very similar design already implemented at Meta.
@comaniac Thanks for the comment! For the AMP case, I think it might be addressable by enhancing the pattern matcher to support more general patterns (e.g., patterns with dtype constraints). Glad to hear that many parties are adopting the same approach. I also just learned that @sunggg has explored a similar approach before. I do think it is a good opportunity to converge on a global solution in Relax with everyone's effort.
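As a rough illustration of such dtype-constrained variants, here is a sketch using TVM's existing Relay dataflow pattern language (which this proposal plans to reuse); the conv2d+add+relu pattern is illustrative, not the actual CUTLASS pattern:

from tvm.relay.dataflow_pattern import is_op, wildcard

conv = is_op("nn.conv2d")(wildcard(), wildcard())
fused = is_op("nn.relu")(is_op("add")(conv, wildcard()))
# restrict the match to float16 so that FP32-only variants (e.g. a GeLU kept in
# FP32 for accuracy) are left to other backends
pat = fused.has_dtype("float16")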
Hi @ZihengJiang, I'm also glad to see that we are looking in the same direction. My collaborators and I developed an automated solution to handle such problems and observed decent improvements compared to conventional approaches. It would be great if we could investigate this direction together with Relax!
Thanks @ZihengJiang! We proposed a minimum build pipeline design in #49. This minimum build enables flexible and customized compilation pipelines, and helps explore ideas such as the one you suggested in P2B. For example, users can write feedback loops in Python to explore search-based customized optimization, which requires measuring the execution cost of candidates during the exploration. The custom rewrite pass is outside of the minimum build, so users can write a Python script like the one below:

# mod: input relax program (computational graph)
def search_based_opt(mod, target, search_budget):
    best_candidate, best_time = mod, float("inf")
    while search_budget > 0:
        # propose a rewritten candidate with a custom (user-defined) pass
        new_mod = relax.transform.SomeCustomRewritePass()(mod)
        rt_mod = tvm.build(new_mod, target)
        vm = relax.VirtualMachine(rt_mod, tvm.cpu())
        execution_time = measure_perf(vm.run())  # user-defined measurement
        if execution_time < best_time:
            best_candidate, best_time = new_mod, execution_time
        search_budget -= 1
    return best_candidate
Thanks @ZihengJiang! The proposal is helpful for us; with it, it becomes possible to do heterogeneous graph planning and execution across CPU, GPU, and NPU. BTW, some input from our NPU requirements: we need to do buffer fusion over subgraphs that are as large as possible, which means our NPU needs more flexibility over whole-graph compilation. The design here provides a fine-grained heterogeneous solution; do we plan to have a coarser-grained one?
Hi @sunggg and @YuchenJin, thanks for the comments. I did not have a concrete proposal for the automatic searching part before; here is a proposal after thinking more based on your comments. The general question can be expressed as: given the constraints (the patterns supported by different backends, considering priority), how can we find the optimal graph partition strategy to achieve the best performance for the whole graph? There are two design points worth considering in terms of the compilation infrastructure:

The pseudo code looks like the for-loop proposed by Yuchen (see the rough sketch below), but one unclear decision is whether we should put
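A rough sketch (not a proposed API) of the enumerate-and-benchmark idea above: try each registered (backend, target, pattern, fcodegen) entry, rewrite the matched subgraphs, and keep the best-performing whole-graph candidate. Here rewrite and evaluate are hypothetical helpers supplied by the user:

def search_partition(mod, registry, rewrite, evaluate):
    # registry: list of (backend, target, pattern, fcodegen) tuples from register()
    best_mod, best_cost = mod, evaluate(mod)
    for backend, target, pattern, fcodegen in registry:
        # replace every match of `pattern` with a call into the kernel from fcodegen
        candidate = rewrite(mod, pattern, fcodegen)
        cost = evaluate(candidate)
        if cost < best_cost:
            best_mod, best_cost = candidate, cost
    return best_mod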
Some thoughts: the
I agree that overall it is useful to have a pattern language infrastructure that matches and enumerates the matched possibilities. The particular codegen signature (e.g. fcodegen) could use a bit more thinking; ideally we want to preserve the IRModule=>IRModule property to make it composable (as a rewrite of Expr=>Expr), but still have proper follow-up benchmark and feedback information available to guide such a decision. My guess is that we will end up wanting to enable a collection of different strategies. So focusing on the foundational infrastructure (pattern matching and enumeration, automated benchmarking of a subgraph decision) would be a solid step towards that, which can then be used to quickly prototype the strategies we talked about.
Hey Ziheng, thanks for putting up this proposal!
Also, just curious, could you explain a bit about the motivation for doing this in Relax instead of in main? |
Hey @electriclilies, thanks for the comment! For Q1, I would expect that the number of supported patterns is limited, as we see in the contrib module (https://github.com/apache/tvm/tree/main/src/runtime/contrib) and the external codegen module (https://github.com/apache/tvm/tree/main/src/relay/backend/contrib) in current TVM. We can organize them in separate folders and reduce repetitive code with helper functions. For Q2, the need for tuning comes up in some backends in real deployment; for example, in the CUTLASS BYOC, some parameters of the CUTLASS kernels are tunable, and such frameworks usually have their own internal tuning function. We can also consider making our tuner more general so that it can be reused by other projects easily. For Q3, yes, we definitely want to make this part automatic and smart; you can refer to the discussion above that @sunggg and @YuchenJin have raised. But our first goal is to build the underlying infra as a playground where we can easily experiment with different search algorithms in the future. For Q4, yes, I will reuse Relay's pattern matcher, and would love to discuss with you more when we have some progress! I prefer to do this on the Relax development branch since: 1. we can utilize new features in Relax's design, such as "unify the abstraction for cross-layer optimizations", "incremental compilation", and support for dynamic shape; 2. on the other hand, this part can also show us how to improve Relax's design, which is easier to iterate on than upstream.
Motivation
To run a deep learning model, the mainstream approach in TVM is to do code generation with an auto-tuner, which makes the performance optimization process easier for developers without domain knowledge. But the auto-tuner approach also comes with some limitations:
For these reasons, an automatic dispatching mechanism across multiple backends (code generation, vendor libraries, manually written kernels) is required in TVM, assuming that those approaches need to coexist in the short term.
Existing Approaches
There are already some efforts to bridge the gap between auto-tuner and other approaches in TVM:
These approaches have demonstrated their feasibility on different applications, but they lack a global mechanism to dispatch between different backends automatically for different devices. For example, although BOLT allows users to specify which parts are computed by CUTLASS, from a user's perspective it is friendlier to let the framework decide whether to use code generation or CUTLASS.
Design Goals
In summary, there are two design goals:
Proposal
P1: Pattern Specification for Target
The core API would be register(backend, target, pattern, fcodegen).

- backend specifies the name of the backend, for example "cudnn", "cutlass", "tensorrt", "autotir", etc. This name can be used as a key for priority configuration.
- target specifies the targeted hardware, which can be "cuda -lib=cudnn", "arm-cpu -lib=ethos-n", etc.
- pattern specifies the dataflow pattern that can be converted to such a backend. For a kernel library like cuDNN, the pattern may be only a single operator; for other, stronger backends, the pattern can be a subgraph like conv2d + arbitrary_elementwise_operator. As we state in D0, the pattern needs to be able to represent a subgraph, as well as properties like specific shape/dtype.
- fcodegen is a callback function whose signature should be fcodegen(target, pattern) -> packedfunc: given the target and pattern, it returns the optimal kernel for that backend. The callback design makes it possible to do kernel fusion and tuning inside fcodegen.

Here are some examples (target and pattern would be complicated data structures; I use strings here for simplification):

register("cudnn", "cuda -lib=cudnn", "conv2d", lambda target, op: packed_func("cudnn.conv2d"))
register("cutlass", "cuda -lib=cutlass", "conv2d+add+relu", fusion_and_tune_in_cutlass)
P2: Automatic Evaluation and Dispatching
With P1, we can specify many graph patterns that can be converted to specific backends. The next question is how to choose the best backend for the final model.
P2A: priority configuration for different backends
First, we can allow users to specify the priority for different backends. For example:
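A hypothetical configuration (the concrete format is not part of this proposal); larger numbers are tried first, and the backend names match those used in the register() examples above:

backend_priority = {
    "tensorrt": 3,  # prefer the vendor whole-graph backend when it matches
    "cutlass": 2,
    "cudnn": 2,     # same priority: the evaluation in P2B decides between them
    "autotir": 1,   # fallback: code generation with auto-tuning
}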
The framework will use the high-priority backend (if it is enabled on the target hardware) to replace patterns in the model first, then try lower-priority backends. This is also useful when we want to force lowering some patterns to an accelerator.
P2B: evaluation for same priority backends
If multiple backends hold the same priority and some pattern can be converted to more than one of them, the framework will evaluate the generated packed functions and choose the better one according to their performance.
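A minimal sketch (not the proposed API) of this evaluation step; each candidate is the PackedFunc produced by a backend's fcodegen, and evaluate is any timing routine (e.g. built on TVM's time_evaluator) supplied by the framework:

def pick_fastest(candidates, evaluate):
    # candidates: dict mapping backend name -> PackedFunc for the same pattern
    best_name, best_time = None, float("inf")
    for name, func in candidates.items():
        t = evaluate(func)
        if t < best_time:
            best_name, best_time = name, t
    return best_name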
Benefits from Relax's New Design
The proposal can benefit from features brought by Relax's new design:
These two new features make it possible to lower some parts of the graph to a specific backend (represented by a call_packed function), while still lowering the other parts with a different backend such as AutoTIR.
Blocker
cc @Laurawly @xqdan @comaniac @junrushao1994 @tqchen