Add LLM support for cuda backend #17316

Merged
larryliu0820 merged 3 commits into main from llm_cuda on Feb 28, 2026

Conversation

larryliu0820 (Contributor) commented Feb 9, 2026

Summary

This PR extends CUDA support for text-only LLM workflows and adds CI coverage for Qwen3-0.6B artifacts and pybind execution.

Why

We already validate CUDA multimodal paths, but CUDA text-generation coverage (especially Qwen3) was incomplete.
This change adds export/run support and CI wiring so that CUDA text-generation artifacts are exercised in automated tests.

What changed

CUDA LLM runner/build support

  • Added llama-cuda and llama-cuda-debug Makefile targets.
  • Added CUDA presets/workflow presets in examples/models/llama/CMakePresets.json.
  • Updated examples/models/llama/CMakeLists.txt to link CUDA backend when EXECUTORCH_BUILD_CUDA=ON.
  • Updated examples/models/llama/main.cpp:
    • Added --data_path convenience flag (single PTD path).
    • Added --prompt_file support for file-based prompts.
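The `--prompt_file` flag lets the runner read its prompt from a file instead of the command line. A minimal sketch of the file-reading step, assuming a hypothetical `load_prompt_file` helper (the actual flag handling in `main.cpp` goes through gflags and is not shown here):

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: resolve a --prompt_file path into the prompt string
// handed to the runner. Only sketches the file-reading step, not the real
// gflags-based flag parsing in examples/models/llama/main.cpp.
std::string load_prompt_file(const std::string& path) {
  std::ifstream in(path);
  if (!in) {
    throw std::runtime_error("cannot open prompt file: " + path);
  }
  std::ostringstream buf;
  buf << in.rdbuf();  // slurp the whole file, newlines included
  return buf.str();
}
```

Reading the whole file verbatim (rather than line by line) preserves multi-line prompts and trailing newlines, which matters for chat-template formatted inputs.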

Gemma3 runner usability

  • Updated examples/models/gemma3/e2e_runner.cpp:
    • Added --max_new_tokens.
    • Added --stop_sequence early-stop behavior.
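The `--stop_sequence` flag presumably ends generation once the decoded output terminates with the configured stop string. A minimal sketch of the suffix check such a loop might run after each decoded token (hypothetical `ends_with_stop` helper, not the actual `e2e_runner.cpp` code):

```cpp
#include <string>

// Hypothetical helper: does the accumulated output end with the stop
// sequence? An early-stop generation loop would call this after appending
// each newly decoded token and break out of decoding when it returns true.
bool ends_with_stop(const std::string& generated, const std::string& stop) {
  if (stop.empty() || generated.size() < stop.size()) {
    return false;
  }
  // Compare only the trailing stop.size() characters of the output.
  return generated.compare(
             generated.size() - stop.size(), stop.size(), stop) == 0;
}
```

In practice the loop would be bounded by both conditions: break when the stop sequence appears or when `--max_new_tokens` is reached, whichever comes first.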

Optimum exporter integration and CI pin

  • Bumped optimum-executorch CI pin to:
    • a9592258daacad7423fd5f39aaa59c6e36471520
  • Added Qwen/Qwen3-0.6B handling in .ci/scripts/export_model_artifact.sh for text-generation.

HuggingFace optimum CUDA test path

  • Updated .ci/scripts/test_huggingface_optimum_model.py (test_text_generation):
    • Supports recipe=cuda export (--device cuda --dtype bfloat16).
    • Supports CUDA quantization for this path:
      • --qlinear 4w
      • --qlinear_packing_format tile_packed_to_4d
      • --qembedding 8w
    • Validates presence of aoti_cuda_blob.ptd.
    • Passes blob path into TextLLMRunner.

CUDA workflow updates

  • Updated .github/workflows/cuda.yml:
    • Added Qwen/Qwen3-0.6B to CUDA export matrix.
  • Updated the test-cuda-pybind matrix to use an explicit artifact mapping.
    • Added Qwen non-quantized and quantized-int4-tile-packed artifact runs in pybind test.
    • Switched download-artifact to matrix-provided artifact name.

Validation

Validated by the new CI jobs added in this PR.

pytorch-bot (Bot) commented Feb 9, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17316

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 8 Cancelled Jobs, 6 Unrelated Failures

As of commit 870cdc6 with merge base 1f4ad07.


👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 9, 2026
@larryliu0820 larryliu0820 added the release notes: desktop for desktop/laptop workstream label Feb 27, 2026
@larryliu0820 larryliu0820 marked this pull request as ready for review February 27, 2026 22:51
Comment thread examples/models/llama/main.cpp Outdated
"",
"Data files for the model. If multiple files are provided, they should be comma separated.");

DEFINE_string(
Contributor:

why do we need this? Can we just have data_paths above?

Contributor Author:

oh. can be removed, the argument is slightly different from the other runners (data_path vs data_paths) but I guess it's not good to have both.

lucylq (Contributor) commented Feb 27, 2026:

Ahh I should remove data_path. All the APIs support data_paths now. Should be experimental so don't need to deprecate
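The thread above concludes that a single-file `--data_path` flag is redundant because a comma-separated `--data_paths` already covers the one-file case. A minimal sketch of that splitting step, assuming a hypothetical `split_data_paths` helper (not the actual `main.cpp` implementation):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated --data_paths value into
// individual PTD file paths. A single path with no commas yields a
// one-element vector, which is why a separate --data_path flag adds nothing.
std::vector<std::string> split_data_paths(const std::string& csv) {
  std::vector<std::string> paths;
  std::stringstream ss(csv);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) {  // skip empty segments from stray commas
      paths.push_back(item);
    }
  }
  return paths;
}
```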

Comment thread examples/models/gemma3/e2e_runner.cpp Outdated
#include <cmath>
#include <cstring>
#include <fstream>
#include <string>
Contributor:

why do we need to update the gemma3 runner to support qwen3 with pybinding?

Contributor Author:

not needed. I can put this into another PR

@larryliu0820 larryliu0820 had a problem deploying to upload-benchmark-results February 28, 2026 00:07 — with GitHub Actions Failure
@larryliu0820 larryliu0820 merged commit 6bb983b into main Feb 28, 2026
349 of 370 checks passed
@larryliu0820 larryliu0820 deleted the llm_cuda branch February 28, 2026 23:05

Labels

  • CLA Signed: managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed.
  • release notes: desktop (desktop/laptop workstream)


3 participants