
Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner#19228

Merged
Gasoonjia merged 4 commits into main from gh/digantdesai/53/orig on Apr 30, 2026
Conversation

@pytorchbot (Collaborator)

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #19190 by @digantdesai
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/53/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/53/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/51/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/53/orig

@diff-train-skip-merge

Add INT8 tensor core variants of the batched MoE GEMM kernels that
dynamically quantize bf16 activations to INT8 per row, per tile, and
dequantize INT4 weights directly to INT8 (skipping the bf16 conversion).
Accumulation uses tl.dot(int8, int8) → int32 with a per-tile float32
rescale. This yields a 1.7× MoE speedup on A100 at M=1024 with 0.9998
cosine similarity against the bf16 baseline.
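
For intuition, here is a minimal CPU reference sketch of the per-row quantize / int8 dot / float rescale scheme described above. It is plain C++ rather than Triton, and the function name, symmetric absmax scaling, and single-column shape are illustrative assumptions, not the kernel's actual code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative only: quantize one activation row to int8 with a per-row
// absmax scale, accumulate against int8 weights in int32, then rescale.
float int8_row_dot(const std::vector<float>& act_row,   // bf16 activations (float stand-in)
                   const std::vector<int8_t>& w_col,    // weights already dequantized INT4 -> INT8
                   float w_scale) {                     // per-tile weight scale
  // Per-row dynamic quantization: scale = absmax / 127.
  float absmax = 0.f;
  for (float a : act_row) absmax = std::max(absmax, std::fabs(a));
  float a_scale = absmax > 0.f ? absmax / 127.f : 1.f;

  // int8 x int8 -> int32 accumulation (what tl.dot(int8, int8) does per tile).
  int32_t acc = 0;
  for (size_t k = 0; k < act_row.size(); ++k) {
    int8_t aq = static_cast<int8_t>(std::lround(act_row[k] / a_scale));
    acc += static_cast<int32_t>(aq) * static_cast<int32_t>(w_col[k]);
  }

  // Per-tile float32 rescale back to real units.
  return static_cast<float>(acc) * a_scale * w_scale;
}
```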

Co-authored-by: Claude <noreply@anthropic.com>

ghstack-source-id: 809c2cc
Pull Request resolved: #19187
Add three new Triton kernels for dense W4A16 linear projections that
replace tinygemm's tiled INT4 format with simple [N, K//2] packed weights
(same format as MoE experts):

- int4_matmul: fused dequant+tl.dot GEMM for medium M (prefill crossover)
- int4_matvec: bandwidth-optimized vec-mat for M=1 decode
- dequant_w4_to_bf16: weight dequant for large-M prefill via Inductor mm

W4DequantLinear wraps these with dual decode/prefill dispatch:
- Decode (M=1): int4_matvec (73 tok/s, ~35% slower than tinygemm)
- Prefill (M>1): dequant+F.linear via Inductor (3400 tok/s at 3K tokens,
  +67% over tinygemm baseline)

Single 18GB weight blob (no duplication). Decode perf regression is a
known trade-off for uniform weight format — to be revisited with a
CUDA C++ matvec kernel.
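
As a rough illustration of the [N, K//2] packed layout referenced above, a host-side unpack sketch follows; the low-nibble-first order and the "subtract 8" zero point are assumptions made for the sketch, not the repo's actual packing convention:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: expand a [N, K/2] byte buffer holding two 4-bit weights
// per byte into signed int8 values.
std::vector<int8_t> unpack_w4(const std::vector<uint8_t>& packed, size_t N, size_t K) {
  std::vector<int8_t> out(N * K);
  for (size_t n = 0; n < N; ++n) {
    for (size_t k2 = 0; k2 < K / 2; ++k2) {
      uint8_t byte = packed[n * (K / 2) + k2];
      out[n * K + 2 * k2 + 0] = static_cast<int8_t>(byte & 0x0F) - 8;  // low nibble (assumed order)
      out[n * K + 2 * k2 + 1] = static_cast<int8_t>(byte >> 4) - 8;    // high nibble
    }
  }
  return out;
}
```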

Also adds INT8 dynamic-activation MoE tests and comprehensive correctness
tests (48 tests, all passing at rtol=0.01).

Co-authored-by: Claude <noreply@anthropic.com>

ghstack-source-id: 89acc9b
Pull Request resolved: #19188
Add structured stats reporting and GPU memory tracking to Qwen3.5 MoE runner

The runner now uses llm::Stats with proper timestamps for model load, prefill,
and decode, and records GPU memory via cudaMemGetInfo. Output matches the
stats.h print_report format: a PyTorchObserver JSON line plus a human-readable
table.
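
A hedged sketch of how such free-memory snapshots can be taken with cudaMemGetInfo follows; the helper name and the way peak usage is derived are assumptions for illustration, not the runner's exact code:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Illustrative only: query how much device memory is currently free.
// Returns -1 as a sentinel if the CUDA runtime call fails.
static int64_t gpu_free_bytes() {
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
    return static_cast<int64_t>(free_bytes);
  }
  return -1;
}

// Usage idea: snapshot before load, after load, and after generation, then
// report the drop from the first snapshot as an approximate peak, e.g.
//   int64_t before = gpu_free_bytes();
//   /* load model, run prefill + decode */
//   int64_t after = gpu_free_bytes();
//   double peak_mb = static_cast<double>(before - after) / (1024.0 * 1024.0);
```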

This commit was authored with the assistance of Claude Code.

ghstack-source-id: 9227519
Pull Request resolved: #19190
pytorchbot requested a review from lucylq as a code owner on April 30, 2026 15:05
@pytorch-bot Bot commented Apr 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19228

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 30, 2026
Base automatically changed from gh/digantdesai/51/orig to main April 30, 2026 17:34
@github-actions Bot

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia merged commit e84a418 into main on Apr 30, 2026
194 of 201 checks passed
Gasoonjia deleted the gh/digantdesai/53/orig branch on April 30, 2026 17:38
rascani added a commit that referenced this pull request May 1, 2026
…CUDA (#19265)

### Summary

#19228 added structured GPU memory tracking to the qwen3_5_moe runner
but did not wrap the new cudaMemGetInfo blocks in the existing
EXECUTORCH_BUILD_CUDA guard that the rest of the file uses for CUDA-only
APIs. The same main.cpp is built for the Metal target where the CUDA
runtime headers are not available, so the new blocks failed to compile
on macOS:

    error: use of undeclared identifier 'cudaMemGetInfo'
        if (cudaMemGetInfo(&free, &total) == cudaSuccess) {

Wrap the three new scoped blocks in #ifdef EXECUTORCH_BUILD_CUDA,
matching the existing guard pattern at lines 27, 68, 113, 168, and 184.
The stats struct fields they would have populated
(gpu_free_before_load_bytes, gpu_free_after_load_bytes,
gpu_free_after_generate_bytes, gpu_peak_usage_mb) default to their
sentinel values on non-CUDA builds, so the rest of the runner's stats
reporting tolerates their absence.
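
A sketch of the guard shape this fix describes is below; the `Stats` struct and helper are simplified stand-ins, with only the field name, sentinel behavior, and macro taken from the description above:

```cpp
#include <cstdint>
#ifdef EXECUTORCH_BUILD_CUDA
#include <cuda_runtime.h>
#endif

// Simplified stand-in for the runner's stats fields.
struct Stats {
  int64_t gpu_free_before_load_bytes = -1;  // sentinel on non-CUDA builds
};

void record_free_before_load(Stats& stats) {
#ifdef EXECUTORCH_BUILD_CUDA
  // Scoped block compiled only when the CUDA backend is built, so the
  // Metal build never sees the CUDA runtime API.
  {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
      stats.gpu_free_before_load_bytes = static_cast<int64_t>(free_bytes);
    }
  }
#else
  // Without CUDA the field keeps its sentinel, and the stats reporting
  // tolerates the missing value, as noted above.
  (void)stats;
#endif  // EXECUTORCH_BUILD_CUDA
}
```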

Authored with Claude Code.

### Test plan
CI
