Add LLM support for cuda backend (#17316)
Conversation
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17316.
Note: links to docs will display an error until the docs builds have completed.
❌ As of commit 870cdc6 with merge base 1f4ad07: 7 new failures, 8 cancelled jobs, 6 unrelated failures.
- NEW FAILURES: the following jobs have failed.
- CANCELLED JOBS: the following jobs were cancelled; please retry.
- FLAKY: the following jobs failed, likely due to flakiness already present on trunk.
- BROKEN TRUNK: the following jobs also failed on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 7e6928e to d649817, then d649817 to 2560270, then 2560270 to eed98eb.
| "", | ||
| "Data files for the model. If multiple files are provided, they should be comma separated."); | ||
|
|
||
| DEFINE_string( |
why do we need this? Can we just have data_paths above?
Oh, it can be removed. The argument is slightly different from the other runners (data_path vs data_paths), but I guess it's not good to have both.
Ah, I should remove data_path. All the APIs support data_paths now. It should be experimental, so we don't need a deprecation cycle.
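For context, the comma-separated `--data_paths` handling this thread settles on amounts to a simple split of the flag value into individual PTD file paths. A minimal sketch; the helper name is illustrative, not the PR's actual code:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated --data_paths value
// ("a.ptd,b.ptd") into individual file paths. Empty segments from
// stray commas are skipped.
std::vector<std::string> split_data_paths(const std::string& arg) {
  std::vector<std::string> paths;
  size_t start = 0;
  while (start <= arg.size()) {
    size_t comma = arg.find(',', start);
    if (comma == std::string::npos) {
      if (start < arg.size()) paths.push_back(arg.substr(start));
      break;
    }
    if (comma > start) paths.push_back(arg.substr(start, comma - start));
    start = comma + 1;
  }
  return paths;
}
```

With a single path the result is a one-element vector, so a lone `--data_paths foo.ptd` behaves like the old single-file flag.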
#include <cmath>
#include <cstring>
#include <fstream>
#include <string>
Why do we need to update the gemma3 runner to support qwen3 with pybinding?
Not needed; I can put this into another PR.
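The `<fstream>` include above relates to the `--prompt_file` option this PR adds to the llama runner: the prompt is read from a file instead of being passed on the command line. A minimal sketch, with the helper name assumed rather than taken from the PR:

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical sketch of file-based prompt loading for --prompt_file:
// read the whole file and use its contents as the prompt text.
std::string read_prompt_file(const std::string& path) {
  std::ifstream in(path);
  if (!in) {
    throw std::runtime_error("Cannot open prompt file: " + path);
  }
  std::ostringstream buf;
  buf << in.rdbuf();  // slurp the entire file
  return buf.str();
}
```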
Summary
This PR extends CUDA support for text-only LLM workflows and adds CI coverage for Qwen3-0.6B artifacts and pybind execution.
Why
We already validate CUDA multimodal paths, but text-generation CUDA coverage (especially Qwen3) was incomplete.
This change adds export/run support and CI wiring so CUDA text-generation artifacts are exercised in automated tests.
What changed
CUDA LLM runner/build support
- Add `llama-cuda` and `llama-cuda-debug` Makefile targets.
- Update `examples/models/llama/CMakePresets.json`.
- Update `examples/models/llama/CMakeLists.txt` to link the CUDA backend when `EXECUTORCH_BUILD_CUDA=ON`.
- `examples/models/llama/main.cpp`: add a `--data_path` convenience flag (single PTD path) and `--prompt_file` support for file-based prompts.

Gemma3 runner usability
- `examples/models/gemma3/e2e_runner.cpp`: add `--max_new_tokens` and `--stop_sequence` early-stop behavior.

Optimum exporter integration and CI pin
- Pin the optimum exporter to commit `a9592258daacad7423fd5f39aaa59c6e36471520`.
- Add `Qwen/Qwen3-0.6B` handling in `.ci/scripts/export_model_artifact.sh` for text-generation.

HuggingFace optimum CUDA test path
- `.ci/scripts/test_huggingface_optimum_model.py` (`test_text_generation`): add a `recipe=cuda` export path (`--device cuda --dtype bfloat16`) with quantization flags `--qlinear 4w --qlinear_packing_format tile_packed_to_4d --qembedding 8w`, load the `aoti_cuda_blob.ptd` data file, and run generation through `TextLLMRunner`.

CUDA workflow updates
- `.github/workflows/cuda.yml`: add `Qwen/Qwen3-0.6B` to the CUDA export matrix, switch the `test-cuda-pybind` matrix to an explicit artifact mapping, and point `download-artifact` at the matrix-provided artifact name.

Validation
Rely on the new CI jobs.
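As a usage note, the Gemma3 runner's `--stop_sequence` early-stop behavior described above amounts to checking, after each decoded token, whether the generated text now ends with the stop string. A minimal sketch; the function name is assumed, not the PR's implementation:

```cpp
#include <string>

// Hypothetical check for --stop_sequence: stop generation once the
// decoded text ends with the stop string. An empty stop sequence
// never triggers a stop.
bool hit_stop_sequence(const std::string& generated, const std::string& stop) {
  if (stop.empty() || generated.size() < stop.size()) {
    return false;
  }
  return generated.compare(
             generated.size() - stop.size(), stop.size(), stop) == 0;
}
```

A generation loop would call this after appending each newly decoded piece and break out early on a match, which is what makes bounded CI runs cheap.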