
Add Qwen 3.5 MoE to cuda-perf CI and add prefill throughput tracking#18903

Merged
Gasoonjia merged 8 commits into main from cuda-perf-qwen35-moe
Apr 23, 2026

Conversation

@Gasoonjia (Contributor) commented Apr 15, 2026

  • Add PyTorchObserver stats output to qwen3_5_moe runner (enables cuda_benchmark.py parsing), --prompt_file flag, and GPU memory stats
  • Add prefill_throughput metric to cuda_benchmark.py (prefill tok/s alongside existing decode tok/s)
  • Add Qwen3.5-35B-A3B-HQQ-INT4 to cuda-perf.yml with >1000 token prompt and 512 output tokens, on linux.aws.a100
  • Create a new ciflow tag for the cuda-perf CI
  • Remove random model selection and schedule trigger; always run all models when triggered
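The prefill_throughput metric described above splits generation time at the first emitted token; a minimal sketch of that computation (the stats field names here are illustrative, not the actual PyTorchObserver schema):

```python
def compute_throughputs(stats: dict) -> dict:
    """Compute prefill and decode tok/s from runner-reported timings.

    Field names below are hypothetical placeholders for whatever the
    runner actually reports; the split itself is the standard one:
    prefill covers prompt start to first token, decode covers the rest.
    """
    prefill_s = (stats["first_token_ms"] - stats["prompt_start_ms"]) / 1000.0
    decode_s = (stats["end_ms"] - stats["first_token_ms"]) / 1000.0
    return {
        # prompt tokens processed per second during prefill
        "prefill_throughput": stats["num_prompt_tokens"] / prefill_s,
        # generated tokens per second during decode
        "decode_throughput": stats["num_generated_tokens"] / decode_s,
    }
```

With a >1000-token prompt and 512 output tokens, this reports the two numbers the benchmark tracks side by side.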

@pytorch-bot Bot commented Apr 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18903

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 2 Cancelled Jobs, 4 Unrelated Failures

As of commit 9e72e8b with merge base 656850a:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed, but the failures were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 15, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 15, 2026 11:33 — with GitHub Actions Inactive
- Add PyTorchObserver stats output to qwen3_5_moe runner (enables
  cuda_benchmark.py parsing), --prompt_file flag, and GPU memory stats
- Add prefill_throughput metric to cuda_benchmark.py (prefill tok/s
  alongside existing decode tok/s)
- Add Qwen3.5-35B-A3B-HQQ-INT4 to cuda-perf.yml with >1000 token
  prompt and 512 output tokens, on linux.aws.a100
- Align cuda-perf.yml triggers with cuda.yml (push main/release,
  ciflow/cuda tags, PR on backends/cuda and backends/aoti paths)
- Remove random model selection and schedule trigger; always run all
  models when triggered
@Gasoonjia Gasoonjia force-pushed the cuda-perf-qwen35-moe branch from a23b5b1 to 49d0aa1 Compare April 15, 2026 18:46
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 15, 2026 21:00 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia force-pushed the cuda-perf-qwen35-moe branch from e50e3fa to 36213f9 Compare April 15, 2026 22:15
@Gasoonjia Gasoonjia marked this pull request as ready for review April 15, 2026 22:15
@Gasoonjia Gasoonjia requested a review from lucylq as a code owner April 15, 2026 22:15
@Gasoonjia Gasoonjia requested a review from digantdesai April 15, 2026 22:17
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 16, 2026 00:50 — with GitHub Actions Inactive
Comment thread on .github/workflows/cuda-perf.yml (Outdated)
TOKENIZER="model_artifacts/tokenizer.json"
# Generate a >1000 token prompt for benchmarking
python3 -c "
text = 'The quick brown fox jumps over the lazy dog. ' * 200
@digantdesai (Contributor) commented Apr 21, 2026


Can we just check in a file with different large prompts instead of creating one on the fly, and use those everywhere? Such repetition can sometimes lead to weird responses. For pure performance benchmarking this is OK, but since this is CI, if there's a suspect diff we will come look at the output, and it had better not be garbage because of this prompt. :)

@Gasoonjia (Contributor, Author) replied:
I have created a specific file for the long prompt.
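For reference, checking in such a prompt file once (instead of regenerating it inline in the workflow) could look like the sketch below; the sentence, repetition count, and filename are illustrative, not the values actually committed:

```python
# One-off script to materialize a >1000-token benchmark prompt as a
# checked-in file, per the review suggestion above. The sentence and
# path are hypothetical examples, not the actual CI artifacts.
sentence = "The quick brown fox jumps over the lazy dog. "
# 9 words x 200 repetitions = 1800 words, comfortably over 1000 tokens
# for typical BPE tokenizers.
prompt = sentence * 200
with open("long_prompt.txt", "w") as f:
    f.write(prompt)
```

The runner would then consume this via the new --prompt_file flag rather than an inline python3 -c invocation.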

DEFINE_string(data_path, "", "Data file (.ptd) for CUDA backend.");
DEFINE_string(tokenizer_path, "", "HuggingFace tokenizer.json path.");
DEFINE_string(prompt, "Hello", "Prompt text.");
DEFINE_string(
Contributor:
Nice!

stats.gpu_peak_usage_mb =
(stats.gpu_total_bytes - gpu_free_bytes) / 1024.0 / 1024.0;

llm::print_report(stats);
Contributor:

👍
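The stats the runner prints (including the GPU peak usage above) are what cuda_benchmark.py picks up via the PyTorchObserver line. A minimal sketch of that kind of parsing, assuming the runner emits a line of the form `PyTorchObserver {...json...}` (the marker placement and field names are assumptions, not the exact format):

```python
import json


def parse_observer_line(line: str):
    """Extract the JSON payload from a PyTorchObserver log line.

    Returns the decoded dict, or None if the line is not an
    observer line. The 'PyTorchObserver ' marker is assumed to
    prefix a single JSON object on the same line.
    """
    marker = "PyTorchObserver "
    idx = line.find(marker)
    if idx == -1:
        return None
    return json.loads(line[idx + len(marker):])
```

Keeping the payload as one JSON object per line lets the benchmark script grep the runner's stdout without any coupling to the rest of its logging.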

@Gasoonjia Gasoonjia force-pushed the cuda-perf-qwen35-moe branch from df2a4bf to 69b5042 Compare April 22, 2026 07:54
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 22, 2026 12:33 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 22, 2026 22:15 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 23, 2026 00:18 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia merged commit c48ea12 into main Apr 23, 2026
472 of 481 checks passed
@Gasoonjia Gasoonjia deleted the cuda-perf-qwen35-moe branch April 23, 2026 03:13
Gasoonjia added a commit that referenced this pull request Apr 23, 2026
…18903)

- Add PyTorchObserver stats output to qwen3_5_moe runner (enables
cuda_benchmark.py parsing), --prompt_file flag, and GPU memory stats
- Add prefill_throughput metric to cuda_benchmark.py (prefill tok/s
alongside existing decode tok/s)
- Add Qwen3.5-35B-A3B-HQQ-INT4 to cuda-perf.yml with >1000 token prompt
and 512 output tokens, on linux.aws.a100
- Create a new ciflow tag for the cuda-perf CI
- Remove random model selection and schedule trigger; always run all
models when triggered

---------

Co-authored-by: gasoonjia <gasoonjia@fb.com>

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.


2 participants