Add W4A8 INT8 activation kernels for batched MoE prefill #19226
Conversation
Adds INT8 tensor core variants of the batched MoE GEMM kernels that dynamically quantize bf16 activations to INT8 per-row per-tile and dequantize INT4 weights directly to INT8 (skipping bf16 conversion). Uses tl.dot(int8, int8) → int32 accumulation with per-tile float32 rescale. 1.7× MoE speedup on A100 at M=1024 with 0.9998 cosine similarity vs bf16 baseline.

Co-authored-by: Claude <noreply@anthropic.com>
ghstack-source-id: 809c2cc
Pull Request resolved: #19187
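For illustration, here is a minimal PyTorch sketch of the numerics described above, not the actual Triton kernels: per-row per-tile dynamic INT8 quantization of the activations, INT8 weight operands, integer accumulation, and a per-tile float32 rescale. The tile size, per-column weight scale layout, and the helper name `w4a8_tile_matmul` are assumptions made for the example.

```python
# Illustrative emulation of the W4A8 math on CPU; the real kernels are Triton
# and accumulate in int32 on tensor cores via tl.dot(int8, int8).
import torch

TILE_K = 128  # assumed K-tile size for the example

def w4a8_tile_matmul(x_bf16, w_int4, w_scale):
    """x_bf16: [M, K] bf16 activations; w_int4: [K, N] INT4 values stored as int8 in [-8, 7];
    w_scale: [N] fp32 per-column weight scales (layout assumed). Returns [M, N] bf16."""
    M, K = x_bf16.shape
    out = torch.zeros(M, w_int4.shape[1], dtype=torch.float32)
    for k0 in range(0, K, TILE_K):
        x_tile = x_bf16[:, k0:k0 + TILE_K].to(torch.float32)
        # Dynamic per-row quantization of this activation tile to int8.
        a_scale = x_tile.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        x_q = torch.clamp(torch.round(x_tile / a_scale), -127, 127).to(torch.int8)
        # INT4 weights are used directly as int8 operands (no bf16 dequant step).
        w_q = w_int4[k0:k0 + TILE_K, :].to(torch.int8)
        # Integer accumulation (int64 here only because CPU matmul needs it;
        # the kernel uses int32 accumulators).
        acc = x_q.long() @ w_q.long()
        # Per-tile float32 rescale back to real values.
        out += acc.to(torch.float32) * a_scale * w_scale
    return out.to(torch.bfloat16)

# Quick numerical check against a bf16-style reference, reporting cosine similarity.
M, K, N = 64, 512, 256
x = torch.randn(M, K, dtype=torch.bfloat16)
w_int4 = torch.randint(-8, 8, (K, N), dtype=torch.int8)
w_scale = torch.rand(N) * 0.05
ref = (x.to(torch.float32) @ (w_int4.to(torch.float32) * w_scale)).to(torch.bfloat16)
approx = w4a8_tile_matmul(x, w_int4, w_scale)
cos = torch.nn.functional.cosine_similarity(ref.float().flatten(), approx.float().flatten(), dim=0)
print(f"cosine similarity vs reference: {cos.item():.4f}")
```

The rescale works because each int32 partial sum equals (x / a_scale) · w, so multiplying by the activation scale and the weight scale per tile recovers the real-valued product before the tiles are summed.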
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19226
Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 5 Pending, 4 Unrelated Failures as of commit cfd120d with merge base cb4e5ae.

NEW FAILURES - The following jobs have failed.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
BROKEN TRUNK - The following jobs failed but were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #19187 by @digantdesai
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/digantdesai/50/orig
@diff-train-skip-merge