Add LLM support for cuda backend #17316

Merged
larryliu0820 merged 3 commits into main from llm_cuda on Feb 28, 2026

Conversation

larryliu0820 (Contributor) commented Feb 9, 2026

Summary

This PR extends CUDA support for text-only LLM workflows and adds CI coverage for Qwen3-0.6B artifacts and pybind execution.

Why

We already validate CUDA multimodal paths, but CUDA text-generation coverage (especially Qwen3) was incomplete.
This change adds export/run support and CI wiring so that CUDA text-generation artifacts are exercised in automated tests.

What changed

CUDA LLM runner/build support

  • Added llama-cuda and llama-cuda-debug Makefile targets.
  • Added CUDA presets/workflow presets in examples/models/llama/CMakePresets.json.
  • Updated examples/models/llama/CMakeLists.txt to link CUDA backend when EXECUTORCH_BUILD_CUDA=ON.
  • Updated examples/models/llama/main.cpp:
    • Added --data_path convenience flag (single PTD path).
    • Added --prompt_file support for file-based prompts.
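The `--prompt_file` flag lets the runner read its prompt from a file instead of the command line. A minimal sketch of the file-reading step, assuming a hypothetical `load_prompt_file` helper (the actual flag handling in `main.cpp` goes through gflags and is not shown here):

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: resolve a --prompt_file path into the prompt string
// handed to the runner. Only sketches the file-reading step, not the real
// gflags-based flag parsing in examples/models/llama/main.cpp.
std::string load_prompt_file(const std::string& path) {
  std::ifstream in(path);
  if (!in) {
    throw std::runtime_error("cannot open prompt file: " + path);
  }
  std::ostringstream buf;
  buf << in.rdbuf();  // slurp the whole file, newlines included
  return buf.str();
}
```

Reading the whole file verbatim (rather than line by line) preserves multi-line prompts and trailing newlines, which matters for chat-template formatted inputs.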

Gemma3 runner usability

  • Updated examples/models/gemma3/e2e_runner.cpp:
    • Added --max_new_tokens.
    • Added --stop_sequence early-stop behavior.
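The `--stop_sequence` flag presumably ends generation once the decoded output terminates with the configured stop string. A minimal sketch of the suffix check such a loop might run after each decoded token (hypothetical `ends_with_stop` helper, not the actual `e2e_runner.cpp` code):

```cpp
#include <string>

// Hypothetical helper: does the accumulated output end with the stop
// sequence? An early-stop generation loop would call this after appending
// each newly decoded token and break out of decoding when it returns true.
bool ends_with_stop(const std::string& generated, const std::string& stop) {
  if (stop.empty() || generated.size() < stop.size()) {
    return false;
  }
  // Compare only the trailing stop.size() characters of the output.
  return generated.compare(
             generated.size() - stop.size(), stop.size(), stop) == 0;
}
```

In practice the loop would be bounded by both conditions: break when the stop sequence appears or when `--max_new_tokens` is reached, whichever comes first.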

Optimum exporter integration and CI pin

  • Bumped optimum-executorch CI pin to:
    • a9592258daacad7423fd5f39aaa59c6e36471520
  • Added Qwen/Qwen3-0.6B handling in .ci/scripts/export_model_artifact.sh for text-generation.

HuggingFace optimum CUDA test path

  • Updated .ci/scripts/test_huggingface_optimum_model.py (test_text_generation):
    • Supports recipe=cuda export (--device cuda --dtype bfloat16).
    • Supports CUDA quantization for this path:
      • --qlinear 4w
      • --qlinear_packing_format tile_packed_to_4d
      • --qembedding 8w
    • Validates presence of aoti_cuda_blob.ptd.
    • Passes blob path into TextLLMRunner.

CUDA workflow updates

  • Updated .github/workflows/cuda.yml:
    • Added Qwen/Qwen3-0.6B to CUDA export matrix.
  • Updated the test-cuda-pybind matrix to use an explicit artifact mapping.
    • Added Qwen non-quantized and quantized-int4-tile-packed artifact runs in pybind test.
    • Switched download-artifact to matrix-provided artifact name.

Validation

Validated by the new CI jobs added in this PR.

pytorch-bot (Bot) commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17316

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 8 Cancelled Jobs, 6 Unrelated Failures

As of commit 870cdc6 with merge base 1f4ad07.


👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 9, 2026
@larryliu0820 larryliu0820 added the release notes: desktop for desktop/laptop workstream label Feb 27, 2026
@larryliu0820 larryliu0820 marked this pull request as ready for review February 27, 2026 22:51
Comment thread examples/models/llama/main.cpp Outdated
"",
"Data files for the model. If multiple files are provided, they should be comma separated.");

DEFINE_string(
Contributor:

why do we need this? Can we just have data_paths above?

Contributor Author:

oh. can be removed, the argument is slightly different from the other runners (data_path vs data_paths) but I guess it's not good to have both.

lucylq (Contributor) commented Feb 27, 2026:

Ahh I should remove data_path. All the APIs support data_paths now. Should be experimental so don't need to deprecate
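The thread above concludes that a single-file `--data_path` flag is redundant because a comma-separated `--data_paths` already covers the one-file case. A minimal sketch of that splitting step, assuming a hypothetical `split_data_paths` helper (not the actual `main.cpp` implementation):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated --data_paths value into
// individual PTD file paths. A single path with no commas yields a
// one-element vector, which is why a separate --data_path flag adds nothing.
std::vector<std::string> split_data_paths(const std::string& csv) {
  std::vector<std::string> paths;
  std::stringstream ss(csv);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) {  // skip empty segments from stray commas
      paths.push_back(item);
    }
  }
  return paths;
}
```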

Comment thread examples/models/gemma3/e2e_runner.cpp Outdated
#include <cmath>
#include <cstring>
#include <fstream>
#include <string>
Contributor:

why do we need to update the gemma3 runner to support qwen3 with pybinding?

Contributor Author:

not needed. I can put this into another PR

@larryliu0820 larryliu0820 had a problem deploying to upload-benchmark-results February 28, 2026 00:07 — with GitHub Actions Failure
@larryliu0820 larryliu0820 merged commit 6bb983b into main Feb 28, 2026
349 of 370 checks passed
@larryliu0820 larryliu0820 deleted the llm_cuda branch February 28, 2026 23:05

Labels

  • CLA Signed: managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed.
  • release notes: desktop (desktop/laptop workstream)


3 participants