[aoti-et] Enable multimodal runner for Voxtral on CUDA #14980
```diff
@@ -165,6 +165,14 @@ class ET_EXPERIMENTAL CudaBackend final
       Span<EValue*> args) const override {
     AOTIDelegateHandle* handle = (AOTIDelegateHandle*)handle_;
 
+    // Need to re-register all the symbols from the so_handle hosted by this
+    // CudaBackend instance. The reason is that these symbols are
+    // static/singleton across the whole process. When we share multiple methods
+    // (meaning multiple so_handle) in the same process, we need to re-register
+    // the symbols from the so_handle that is being used in this execution.
+    ET_CHECK_OK_OR_RETURN_ERROR(
+        register_shared_library_functions(handle->so_handle));
+
```
Comment on lines +168 to +175

If we're loading the model once and doing execute/inference multiple times, it will register multiple times, no? Can you do something like this?
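The reviewer's suggested snippet is not shown in the thread. As a rough illustration of the idea (a minimal sketch; the guard variable is hypothetical, not code from this PR), the backend could remember which so_handle's symbols are currently bound and only re-register on a change:

```cpp
// Hypothetical guard, not code from this PR: skip re-registration when the
// symbols from this so_handle are already the ones bound process-wide.
static void* last_registered_so_handle = nullptr;

if (handle->so_handle != last_registered_so_handle) {
  ET_CHECK_OK_OR_RETURN_ERROR(
      register_shared_library_functions(handle->so_handle));
  last_registered_so_handle = handle->so_handle;
}
```

A process-wide guard like this still assumes executes are serialized; two methods running concurrently would contend on the shared symbols either way.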
Can you store the AOTInductorModelContainerRunFunc inside AOTIDelegateHandle?
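A sketch of that direction, with the field set abbreviated and assuming the stock AOTInductor C interface (this is not the struct from this PR): bind the run entry point once when the .so is loaded and call through the handle, so execute() no longer depends on process-wide function pointers.

```cpp
#include <dlfcn.h>

// Hypothetical layout, fields abbreviated: per-method state carries its own
// entry points, resolved once at load time from its own .so.
using AOTInductorModelContainerRunFunc =
    decltype(&AOTInductorModelContainerRun);

struct AOTIDelegateHandle {
  void* so_handle;                                   // from dlopen()
  AOTInductorModelContainerHandle container_handle;  // created at load
  AOTInductorModelContainerRunFunc run_fn;           // dlsym'd at load
};

// At load time (error handling omitted):
//   handle->run_fn = reinterpret_cast<AOTInductorModelContainerRunFunc>(
//       dlsym(handle->so_handle, "AOTInductorModelContainerRun"));
// At execute() time:
//   handle->run_fn(handle->container_handle, /* ... */);
```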
```diff
     size_t n_inputs;
     AOTInductorModelContainerGetNumInputs(handle->container_handle, &n_inputs);
 
```
```diff
@@ -223,7 +231,6 @@ class ET_EXPERIMENTAL CudaBackend final
           "Failed to copy input %d from CPU to GPU",
           i);
     }
-    ET_LOG(Info, "Inputs copied to GPU");
     // Process output tensors: create GPU counterparts for ExecuTorch CPU
     // tensors
     for (int i = 0; i < n_outputs; i++) {
```
```diff
@@ -253,7 +260,6 @@ class ET_EXPERIMENTAL CudaBackend final
 
       gpu_outputs[i] = gpu_output_handle;
     }
-    ET_LOG(Info, "Outputs created on GPU");
     // Run AOTI container with GPU tensors
     AOTIRuntimeError error = AOTInductorModelContainerRun(
         handle->container_handle,
```
Can you make the docblock something like this?

```cpp
// CRITICAL: Multimodal models reuse tensors with different shapes across
// executions (e.g., variable-length audio). We MUST validate cached metadata
// matches current tensor state, or CUDA kernels will receive incorrect shapes,
// leading to memory corruption and segfaults.
```
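A sketch of the validation that docblock calls for (the CachedTensorMeta type and helper are hypothetical, not from this PR; the tensor accessors assumed here are ExecuTorch's dim/size/const_data_ptr):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not from this PR: decide whether a cached GPU-side
// view of a tensor is still valid, since multimodal inputs (e.g.,
// variable-length audio) change shape between executions.
struct CachedTensorMeta {
  std::vector<int64_t> sizes;
  const void* data_ptr;
};

bool metadata_matches(
    const CachedTensorMeta& cached,
    const executorch::aten::Tensor& t) {
  if (cached.data_ptr != t.const_data_ptr()) {
    return false;  // storage moved; the cached handle is stale
  }
  if (cached.sizes.size() != static_cast<size_t>(t.dim())) {
    return false;
  }
  for (size_t d = 0; d < cached.sizes.size(); ++d) {
    if (cached.sizes[d] != t.size(d)) {
      return false;  // e.g., a variable-length audio dim changed
    }
  }
  return true;
}
```

On a mismatch, the GPU-side shape and stride metadata would be refreshed instead of reused, avoiding exactly the corruption the proposed docblock warns about.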