[RFC] Taichi's Ahead of Time (AOT) Module #3642
Nice summary! I really like the idea of the API to load the kernels from C++! I have a question related to the arguments of a Taichi kernel in the example: for access to the fields, would it help to make the root buffer readable from the host? I know that on mobile targets everything is in unified memory, so performance is generally not impacted, but I'm not sure about desktop targets.

GLuint x_ssbo;
glGenBuffers(1, &x_ssbo);
char buffer[N][M];
auto x_field = program.GetField("x");
x_field.CopyTo(x_ssbo);
// Or
x_field.CopyTo(buffer);

This can be useful, I guess, when a computation is done on Vulkan/OpenGL but the results might be used on a different processor/backend (CPU?). We could still glMapBuffer the SSBO, I guess, but this would just simplify the code for the users of the library :).

One thing I have in mind too is a configuration of the 'target' backend for the AOT modules. For example, we can generate the OpenGL/Vulkan module on our desktop machine, which might have a high-end GPU and a recent driver with plenty of extensions, but the execution of those AOT modules might be on a more limited machine that doesn't have those extensions. Could we provide a configuration of the extensions through Python, so that when we initialize the Program Runtime, instead of checking the system, we also filter against the ones provided in this list?

def run_aot():
    mod = ti.aot.Module(arch=ti.opengl, target_ext_config='config.txt')
    mod.add_kernel(init)
    mod.add_kernel(substep)
    mod.save('/path/to/dir', 'opengl.tcb')  # .tcb for "taichi binary"

with config.txt:
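For illustration, config.txt could simply list the extensions the target device is known to support, one per line (the extension names below are just examples, not a proposed format):

GL_ARB_compute_shader
GL_ARB_gpu_shader_int64
GL_NV_shader_atomic_float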
Random thought: in the future, it may make sense to wrap kernels and fields into a class for more modularity. This may end up with a syntax similar to ...

@ti.aot.module
class ParticleSystem:
    def __init__(self, n_particles):
        self.n_particles = n_particles
        # Persistent states
        with self.export_fields:
            self.x = ti.Vector.field(2, float, n_particles)
            self.v = ti.Vector.field(2, float, n_particles)

    @ti.kernel(export=True)
    def substep(self, dt: ti.f32):
        for i in self.x:
            self.x[i] += dt * self.v[i]

    @ti.kernel(export=True)
    def fetch_x(self, x_output: ti.types.Vector.ndarray(n=2, dtype=ti.f32, dim=1)):
        for i in self.x:
            x_output[i] = self.x[i]

    @ti.kernel(export=True)
    def compute_total_distance(self, dist_sum: ti.types.ndarray(dtype=ti.f32, dim=1)):
        # Temporary field, destroyed when the kernel exits
        distance = ti.field(...)
        for i in self.x:
            for j in range(self.n_particles):
                distance[i, j] = (self.x[i] - self.x[j]).norm()
        for i in self.x:
            dist_sum[i] = 0.0
            for j in range(self.n_particles):
                dist_sum[i] += distance[i, j]
Thanks for the reply!

Variables like …

Yep. From a technical perspective, this should be quite straightforward :-) Updated the C++ API part.

Yeah, the extensions have introduced some trouble for us previously. Passing in a configuration list of the extensions the targeted platform can support sounds like a nice solution :+1:
Just realized that we need a way to support more than one dumped module. For example, if I have two Taichi scripts 'a.py' and 'b.py', each of them saving an AOT module, we need a way to support using the saved kernels from both scripts at the same time; see the sketch below.
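A minimal sketch of what this could look like on the app side, assuming the taichi::aot::Module::load API from the runtime example later in this thread (paths, kArch and mod_params as in that example):

// Load the AOT modules produced by a.py and b.py side by side.
auto mod_a = taichi::aot::Module::load("/path/to/a_module", kArch, mod_params);
auto mod_b = taichi::aot::Module::load("/path/to/b_module", kArch, mod_params);
// Kernel names only need to be unique within their own module, so both
// scripts may export an "init" kernel without clashing.
auto init_a = mod_a->get_kernel("init");
auto init_b = mod_b->get_kernel("init");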
Additional features we should consider supporting:
I think we can have a …
More updates on the technical direction.

What We Have So Far
Runtime API

Example usage

I hope we can simplify the above usage and make it generic for other backends. Below is what I imagine to be a slightly better API:

constexpr auto kArch = taichi::lang::Arch::vulkan;
// ... same as above (`evd_params`, `result_buffer` and `host_ctx` are set up
// as in the full example).

// This gives us the flexibility to plug in the user's own VkDevice, if
// they already have one in their pipeline.
auto embedded_device =
    std::make_unique<taichi::vulkan::VulkanDeviceCreator>(evd_params);

taichi::vulkan::VkRuntime::Params params;
params.host_result_buffer = result_buffer;
params.device = embedded_device->device();
auto vulkan_runtime =
    std::make_unique<taichi::vulkan::VkRuntime>(std::move(params));

std::any mod_params = vulkan_runtime.get();
std::unique_ptr<taichi::aot::Module> vk_module =
    taichi::aot::Module::load("/path/to/aot_module", kArch, mod_params);
if (!vk_module) {
  printf("Cannot load Vulkan AOT module\n");
  return -1;
}

// Retrieve kernels/fields/etc. from the AOT module so we can initialize our
// runtime.
auto root_size = vk_module->get_root_size();
printf("root buffer size=%d\n", root_size);
vulkan_runtime->add_root_buffer(root_size);

auto substep_kernel = vk_module->get_kernel("substep");
if (!substep_kernel) {
  printf("Cannot find 'substep' kernel\n");
  return -1;
}

// Run `substep_kernel`
int n_particles = 8192;
std::vector<float> x(n_particles * 2);
for (int i = 0; i < 50; i++) {
  substep_kernel->launch(&host_ctx);
}
vulkan_runtime->synchronize();

auto x_field = vk_module->get_field("x");
if (!x_field) {
  printf("Cannot find 'x' field\n");
  return -1;
}
// Device-to-host copy; the size is stored in `x_field` already.
x_field->copy_to(/*dst=*/x.data());

In order to achieve this, we need a few new APIs and refactors.
class LaunchContextBuilder {
 public:
  LaunchContextBuilder(Kernel *kernel, RuntimeContext *ctx);
  explicit LaunchContextBuilder(Kernel *kernel);

  LaunchContextBuilder(LaunchContextBuilder &&) = default;
  LaunchContextBuilder &operator=(LaunchContextBuilder &&) = default;
  LaunchContextBuilder(const LaunchContextBuilder &) = delete;
  LaunchContextBuilder &operator=(const LaunchContextBuilder &) = delete;

  void set_arg_float(int arg_id, float64 d);
  void set_arg_int(int arg_id, int64 d);
  void set_extra_arg_int(int i, int j, int32 d);
  void set_arg_external_array(int arg_id,
                              uintptr_t ptr,
                              uint64 size,
                              bool is_device_allocation);
  void set_arg_external_array_with_shape(int arg_id,
                                         uintptr_t ptr,
                                         uint64 size,
                                         const std::vector<int64> &shape);
  void set_arg_ndarray(int arg_id, const Ndarray &arr);

  // Sets the |arg_id|-th arg in the context to the bits stored in |d|.
  // This ignores the underlying kernel's |arg_id|-th arg type.
  void set_arg_raw(int arg_id, uint64 d);

  RuntimeContext &get_context();

 private:
  Kernel *kernel_;
  std::unique_ptr<RuntimeContext> owned_ctx_;
  // |ctx_| *almost* always points to |owned_ctx_|. However, it is possible
  // that the caller passes a RuntimeContext pointer externally. In that case,
  // |owned_ctx_| will be nullptr.
  // Invariant: |ctx_| will never be nullptr.
  RuntimeContext *ctx_;
};
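A quick sketch of how this builder might be driven from the AOT example above (the `kernel` handle, arg indices and values are illustrative assumptions, not part of the design yet):

// Build a RuntimeContext for an AOT kernel launch.
LaunchContextBuilder builder(kernel);            // kernel: a Kernel* handle
builder.set_arg_float(/*arg_id=*/0, 0.016);      // e.g. dt
builder.set_arg_int(/*arg_id=*/1, 8192);         // e.g. n_particles
substep_kernel->launch(&builder.get_context());  // feed the built context
vulkan_runtime->synchronize();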
Problems
- As seen from the sample usage, users still have to manually call synchronize(). Ideally, this information can be encoded inside aot::Kernel::launch().
- We lack a way to configure a kernel's grid/dim settings. This will be particularly important for kernels iterating over sparse fields.
- To support merging multiple modules (of the same arch), we might need to invent the concept of namespaces within a module. So a kernel or a field belongs to a specific namespace.
- Should get rid of the heavy boilerplate, like setting up CompileConfig, MemoryPool, etc.
- Unify the AOT data structure for different backends.
Compile-time API
Much like what we are doing now. One tweak is that we should really, really group the files into a single package file, instead of a folder.
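For illustration, the single package could be an archive with a layout along these lines (purely hypothetical):

opengl.tcb
├── metadata.json   # kernels, fields, root buffer size, required caps, version
└── shaders/        # one compiled artifact per kernel / template instantiation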
Device Capabilities
There should be a way for us to control exactly which API extensions we want to enable when running the codegen. This is not a Vulkan-specific problem; it applies to OpenGL and Apple Metal as well.
Build the Runtime Library
We can draw ideas from TVM around https://github.com/apache/tvm/blob/2e32f36fecaa3d5025705a98594a9f4a4f6d9f74/CMakeLists.txt#L401-L406.
There will be a libtaichi_runtime.so, which includes all the runtime stuff, AOT included. Then the current libtaichi_core.so becomes libtaichi_runtime.so + many codegens + CHI IR infra + pybind + ...
@k-ye That looks awesome! Wow! I really like the different ideas and changes to the AOT API! :)
We have some basic heuristics to decide which Taichi kernels need to synchronize. For example, if a kernel takes in an Ndarray, we will call synchronize, because we don't yet know whether the kernel writes to that Ndarray. Another case is where the kernel returns a value. We can encode such info into the JSON as well. But I feel like manually controlling when to sync is good enough? A sketch of the metadata-driven option is below.
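If we did encode it, launch() could consult per-kernel metadata along these lines (the struct and field names are hypothetical):

// Hypothetical per-kernel metadata, deserialized from the module's JSON.
struct KernelSyncMeta {
  bool may_write_ndarray = false;
  bool has_return_value = false;
};

// A launch() variant that hides synchronize() when the metadata requires it.
void launch_with_auto_sync(taichi::aot::Kernel *kernel, RuntimeContext *ctx,
                           const KernelSyncMeta &meta,
                           taichi::vulkan::VkRuntime *runtime) {
  kernel->launch(ctx);
  if (meta.may_write_ndarray || meta.has_return_value) {
    runtime->synchronize();
  }
}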
When can we have a complete C++ API reference?
Hi @KishinZW, wondering if you have checked out https://liong.work/taichi-aot-by-examples.github.io/? We have just released an initial version of the C API, and haven't officially supported a C++ one yet.
In general, Taichi's AOT API recommends that you exchange data with the kernels via Ndarray; a sketch of that flow is below.
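In the C++ runtime terms quoted earlier in this thread (the released C API has its own equivalents; `kernel`, `aot_kernel`, `x_arr` and `runtime` are assumed handles):

// Bind the Ndarray, launch, then sync before reading it back on the host.
LaunchContextBuilder builder(kernel);
builder.set_arg_ndarray(/*arg_id=*/0, x_arr);
aot_kernel->launch(&builder.get_context());
runtime->synchronize();  // now x_arr's contents are safe to read back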
Showing images is done via Taichi's GUI component. However, Taichi's C API only focuses on the core concepts like kernels and data containers. That said, the Taichi AOT examples do come with a demo framework that includes a renderer. You can start from these materials:
We'd like to share our ideas on how to implement the AOT feature in Taichi. AOT refers to the process of using Taichi as a GPU compute shader/kernel compiler: The AOT users can compile their Taichi kernels into an AOT module, package this module alongside their apps (likely without a Python environment, e.g. an Android app), and load/execute the compiled shaders from this module in the app.
Note that AOT is already an ongoing work, hence some of the tasks have already been implemented. For a quick peek at the Taichi AOT workflow, please check out this test.
Goals
API Proposal
Taichi provides a utility, taichi.aot.Module, for compiling the Taichi kernels and fields information into a module file. It provides these APIs:

- add_kernel(kernel_fn): Adds a Taichi kernel to the AOT module.
- add_kernel_template(kernel_templ_fn): Adds a Taichi kernel template to the AOT module.
- add_field(name, field): Adds a Taichi field to the AOT module. However, we hope that Ndarray can serve as a more convenient dense data container in the AOT use cases.
- save(filepath, filename): Saves this AOT module to filepath/filename.

We will walk through the Module usage with the following example:

1. Create the module mod, targeted for the GL/ES shading language.
2. Define the kernels init and substep. This step adds both kernels to mod.
3. Define the fields x and v. Both are added to mod, too.
4. Save mod to /path/to/dir/opengl.tcb.

This completes the work required on the Taichi/Python side.

Assuming that we then want to deploy this to an Android app, and have added opengl.tcb to the app project, we imagine the following set of C++ APIs to be useful. Note that the language implementing the API is mostly irrelevant, and should be chosen according to the targeted platform's suitability (e.g. ObjC/Swift for iOS, Java/Kotlin for Android). We choose C++ here just for developer familiarity (although at a very low level, C++ is suitable for both mobile platforms).

C++ API

We can then use the above API in the following manner:
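A minimal sketch of the imagined API surface and its usage (all names here are illustrative, not final):

// Imagined C++ API surface (illustrative only):
//   class AotModule {
//    public:
//     static std::unique_ptr<AotModule> Load(const std::string &path);
//     AotKernel *GetKernel(const std::string &name);
//     AotField *GetField(const std::string &name);
//   };

auto module = AotModule::Load("/path/to/dir/opengl.tcb");
auto init = module->GetKernel("init");
auto substep = module->GetKernel("substep");
init->Launch();
for (int frame = 0; frame < 60; frame++) {
  substep->Launch();
}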
Taichi kernel template
So far, we have only talked about regular Taichi kernels. However, there is a special kind of kernel: a Taichi kernel with at least one ti.template parameter. The special part about this is that Taichi will instantiate a separate kernel body for different input arguments. Readers coming from a C++ background can relate this to the C++ function template: it is not until you invoke a function template with the actual type arguments filled in that the compiler instantiates a function definition for you. As a result, one cannot identify a compiled Taichi kernel just by its name. Instead, it is the combination of a string (the kernel template name) and the template args. Module.add_kernel_template() is for handling this situation.

Then on the app side, we can retrieve and run these instantiated kernels with the code below.
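A minimal sketch, assuming a hypothetical template-lookup API where an instantiation is identified by the (template name, template args) pair; the "fill" kernel and the "x" arg are made-up examples:

// Retrieve the instantiation of kernel template "fill" for template arg "x".
auto fill_templ = module->GetKernelTemplate("fill");
auto fill_for_x = fill_templ->Instantiate(/*template_args=*/{"x"});
fill_for_x->Launch();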
Ndarray: making data containers more flexible
Currently, Taichi field is the official way for passing data between the kernel side and the host side. However, it comes with a few restrictions:

- All the fields live inside the root buffer, which is backed by a single device buffer (an SSBO for OpenGL, an MTLBuffer for Apple Metal, etc.). So with an SSBO x_ssbo holding the particles' position in my particle system, we have to run the Taichi kernels, then copy the data from the root buffer to x_ssbo. Ideally, we can achieve zero-copy here by just binding x_ssbo to the GL shaders generated by Taichi.

To overcome these disadvantages, we have been prototyping a new data container called Ndarray. Ndarray can be viewed as a more flexible and systematic implementation of Taichi's external array.
Say we'd like to pass a 2-D array of vec2 into a Taichi kernel; the kernel can be rewritten to take an Ndarray parameter instead of reading a field. If our app already has an SSBO x_ssbo of the matching traits, we can pass it to the compiled kernel in this way:
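A minimal sketch in terms of the LaunchContextBuilder API quoted earlier in this thread; whether a raw GL buffer handle can be passed through like this is an open design question, and `kernel`, `aot_kernel`, `x_ssbo` and `n_particles` are assumed handles:

// Bind the app-owned SSBO to the kernel's Ndarray/external-array parameter.
LaunchContextBuilder builder(kernel);
builder.set_arg_external_array(/*arg_id=*/0,
                               /*ptr=*/static_cast<uintptr_t>(x_ssbo),
                               /*size=*/n_particles * 2 * sizeof(float),
                               /*is_device_allocation=*/true);
aot_kernel->launch(&builder.get_context());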
Implementation Roadmap

Q & A
What Taichi features do you plan to support?
What Taichi features are currently out of the scope?
- @ti.data_oriented
Other limitations?
The logic to invoke these kernels will still need to be re-written in the users' app (e.g. the run_jit() body in the above example). We may consider adding a compute graph in the future. Welcome discussion & proposals!

How to locate a Taichi kernel?
For a regular kernel, the kernel name (a string) is enough as the identifier. For a kernel template, it is a combination of the kernel name and the instantiating template args.
How to support upgrading?
We can include a version into the AOT module.
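For instance, the module's metadata could carry a version entry along these lines (format purely illustrative):

{
  "aot_format_version": 1,
  "arch": "opengl"
}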