
Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner#19228

Merged
Gasoonjia merged 4 commits into main from gh/digantdesai/53/orig on Apr 30, 2026
Conversation

@pytorchbot (Collaborator)

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #19190 by @digantdesai
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/53/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/53/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/51/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/53/orig

@diff-train-skip-merge

Add INT8 tensor core variants of the batched MoE GEMM kernels that
dynamically quantize bf16 activations to INT8 per row, per tile, and
dequantize INT4 weights directly to INT8 (skipping the bf16 conversion).
Accumulation uses tl.dot(int8, int8) → int32 with a per-tile float32
rescale. This yields a 1.7× MoE speedup on A100 at M=1024 with 0.9998
cosine similarity against the bf16 baseline.
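
For intuition, here is a minimal CPU reference sketch of the per-row quantize / int8 dot / float rescale scheme described above. It is plain C++ rather than Triton, and the function name, symmetric absmax scaling, and single-column shape are illustrative assumptions, not the kernel's actual code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: quantize one activation row to int8 with a per-row
// absmax scale, accumulate against int8 weights in int32, then rescale.
float int8_row_dot(const std::vector<float>& act_row,   // bf16 activations (float stand-in)
                   const std::vector<int8_t>& w_col,    // weights already dequantized INT4 -> INT8
                   float w_scale) {                     // per-tile weight scale
  // Per-row dynamic quantization: scale = absmax / 127.
  float absmax = 0.f;
  for (float a : act_row) absmax = std::max(absmax, std::fabs(a));
  float a_scale = absmax > 0.f ? absmax / 127.f : 1.f;

  // int8 x int8 -> int32 accumulation (what tl.dot(int8, int8) does per tile).
  int32_t acc = 0;
  for (size_t k = 0; k < act_row.size(); ++k) {
    int8_t aq = static_cast<int8_t>(std::lround(act_row[k] / a_scale));
    acc += static_cast<int32_t>(aq) * static_cast<int32_t>(w_col[k]);
  }

  // Per-tile float32 rescale back to real units.
  return static_cast<float>(acc) * a_scale * w_scale;
}
```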

Co-authored-by: Claude <noreply@anthropic.com>

ghstack-source-id: 809c2cc
Pull Request resolved: #19187
Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):

- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.
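
As a rough illustration of the [N, K//2] packed layout referenced above, a host-side unpack sketch follows; the low-nibble-first order and the "subtract 8" zero point are assumptions made for the sketch, not the repo's actual packing convention:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: expand a [N, K/2] byte buffer holding two 4-bit weights
// per byte into signed int8 values.
std::vector<int8_t> unpack_w4(const std::vector<uint8_t>& packed, size_t N, size_t K) {
  std::vector<int8_t> out(N * K);
  for (size_t n = 0; n < N; ++n) {
    for (size_t k2 = 0; k2 < K / 2; ++k2) {
      uint8_t byte = packed[n * (K / 2) + k2];
      out[n * K + 2 * k2 + 0] = static_cast<int8_t>(byte & 0x0F) - 8;  // low nibble (assumed order)
      out[n * K + 2 * k2 + 1] = static_cast<int8_t>(byte >> 4) - 8;    // high nibble
    }
  }
  return out;
}
```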

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreply@anthropic.com>

ghstack-source-id: 89acc9b
Pull Request resolved: #19188
Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner

The runner now uses llm::Stats with proper timestamps for model load, prefill,
and decode, and records GPU memory via cudaMemGetInfo. Output matches the
stats.h print_report format: a PyTorchObserver JSON line plus a human-readable
table.
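
A hedged sketch of how such free-memory snapshots can be taken with cudaMemGetInfo follows; the helper name and the way peak usage is derived are assumptions for illustration, not the runner's exact code:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative only: query how much device memory is currently free.
// Returns -1 as a sentinel if the CUDA runtime call fails.
static int64_t gpu_free_bytes() {
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
    return static_cast<int64_t>(free_bytes);
  }
  return -1;
}

// Usage idea: snapshot before load, after load, and after generation, then
// report the drop from the first snapshot as an approximate peak, e.g.
//   int64_t before = gpu_free_bytes();
//   /* load model, run prefill + decode */
//   int64_t after = gpu_free_bytes();
//   double peak_mb = static_cast<double>(before - after) / (1024.0 * 1024.0);
```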

This commit was authored with the assistance of Claude Code.

ghstack-source-id: 9227519
Pull Request resolved: #19190
pytorchbot requested a review from lucylq as a code owner on April 30, 2026 15:05
@pytorch-bot Bot commented Apr 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19228

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 30, 2026
Base automatically changed from gh/digantdesai/51/orig to main April 30, 2026 17:34
@github-actions Bot

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia merged commit e84a418 into main on Apr 30, 2026
194 of 201 checks passed
Gasoonjia deleted the gh/digantdesai/53/orig branch on April 30, 2026 17:38
rascani added a commit that referenced this pull request May 1, 2026
…CUDA (#19265)

### Summary

#19228 added structured GPU memory tracking to the qwen3_5_moe runner
but did not wrap the new cudaMemGetInfo blocks in the existing
EXECUTORCH_BUILD_CUDA guard that the rest of the file uses for CUDA-only
APIs. The same main.cpp is built for the Metal target where the CUDA
runtime headers are not available, so the new blocks failed to compile
on macOS:

    error: use of undeclared identifier 'cudaMemGetInfo'
        if (cudaMemGetInfo(&free, &total) == cudaSuccess) {

Wrap the three new scoped blocks in #ifdef EXECUTORCH_BUILD_CUDA,
matching the existing guard pattern at lines 27, 68, 113, 168, and 184.
The stats struct fields they would have populated
(gpu_free_before_load_bytes, gpu_free_after_load_bytes,
gpu_free_after_generate_bytes, gpu_peak_usage_mb) default to their
sentinel values on non-CUDA builds, so the rest of the runner's stats
reporting tolerates their absence.
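
A sketch of the guard shape this fix describes is below; the `Stats` struct and helper are simplified stand-ins, with only the field name, sentinel behavior, and macro taken from the description above:

```cpp
#include <cstdint>
#ifdef EXECUTORCH_BUILD_CUDA
#include <cuda_runtime.h>
#endif

// Simplified stand-in for the runner's stats fields.
struct Stats {
  int64_t gpu_free_before_load_bytes = -1;  // sentinel on non-CUDA builds
};

void record_free_before_load(Stats& stats) {
#ifdef EXECUTORCH_BUILD_CUDA
  // Scoped block compiled only when the CUDA backend is built, so the
  // Metal build never sees the CUDA runtime API.
  {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
      stats.gpu_free_before_load_bytes = static_cast<int64_t>(free_bytes);
    }
  }
#else
  // Without CUDA the field keeps its sentinel, and the stats reporting
  // tolerates the missing value, as noted above.
  (void)stats;
#endif  // EXECUTORCH_BUILD_CUDA
}
```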

Authored with Claude Code.

### Test plan
CI
