[ROCM] Float8 deepseekv3_671b IntOverflow in triton kernels during training by alex-minooka · Pull Request #4016 · pytorch/ao

alex-minooka · 2026-03-06T18:39:33Z

Issue
When training deepseekv3_671b on a single node with a reduced number of layers, the Triton kernels raised an integer overflow exception (OverflowError) after a few iterations.

    File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 744, in run
      kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
    File "/opt/conda/lib/python3.11/site-packages/triton/backends/amd/driver.py", line 828, in call
      self.launch(self.launch_cooperative_grid, gridX, gridY, gridZ, stream, function, profile_scratch, *args)
    OverflowError: signed integer is greater than maximum

Changes:
Two fixes applied to both _triton_fp8_per_group_rowwise_scales_kernel and _triton_fp8_per_group_colwise_scales_kernel:

Removed the unused num_elements parameter: it was computed as hp_tensor.numel() and passed to the Triton kernels but never referenced in the kernel bodies. For DeepSeek V3's large MoE tensors (256 experts, dim=7168), numel() can exceed 2^31 - 1. The AMD Triton driver packs int kernel args as signed 32-bit integers, so the oversized value directly caused the OverflowError: signed integer is greater than maximum at kernel launch.

Cast pointer arithmetic to tl.int64: all stride multiplications and offset computations inside both kernels now use 64-bit integers (block_row_offs.to(tl.int64), stride_input_row.to(tl.int64), etc.). This prevents potential int32 overflow in pointer calculations such as block_row_offs * stride_input_row for large activation tensors (e.g., ~1M routed tokens × stride 7168 ≈ 7B, which exceeds the int32 maximum of about 2.1B).
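The arithmetic behind both fixes can be checked with a minimal pure-Python sketch. It does not use Triton or the torchao kernels; the helper names (`fits_int32`, `wrap_int32`, `max_rows_fitting_int32`) are hypothetical and exist only for illustration, and the 1M-row figure is the approximate example from the description above, not a measured shape.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def fits_int32(value: int) -> bool:
    """Return True if value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of value to signed 32 bits."""
    return (value + 2**31) % 2**32 - 2**31


def max_rows_fitting_int32(row_width: int) -> int:
    """Largest row count whose element count (rows * row_width) still fits in int32."""
    return INT32_MAX // row_width


if __name__ == "__main__":
    dim = 7168          # hidden dimension quoted in the PR description
    rows = 1_000_000    # assumed: roughly 1M routed tokens, as in the PR example

    # Fix 1: an element count passed as a signed 32-bit kernel argument.
    numel = rows * dim
    print("numel:", numel, "fits in int32:", fits_int32(numel))
    print("max rows before numel overflows:", max_rows_fitting_int32(dim))

    # Fix 2: a row offset times a row stride, computed in 32 vs 64 bits.
    last_row_offset = rows * dim  # Python ints are unbounded, so this is the exact value
    print("exact offset:", last_row_offset)
    print("offset after int32 wraparound:", wrap_int32(last_row_offset))
```

With dim=7168, the element count stops fitting in a signed 32-bit integer once the tensor has more than 299,593 rows, which is why both the launch argument and the in-kernel offsets needed attention for shapes of this size.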

pytorch-bot · 2026-03-06T18:39:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4016

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cad25ac with merge base 42bcdc4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2026-03-06T18:41:36Z

torchao/prototype/moe_training/kernels/jagged_float8_scales.py

        block_row_offs = block_row_id * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)

+        # cast to int64 to avoid overflow in pointer arithmetic for large tensors
+        block_row_offs_i64 = block_row_offs.to(tl.int64)


I think you can just update the type annotation in the kernel signature to int64. Can you try that?

deepseekv3_671b was seeing int overflows on full model due to moe sizes

f283a21

pytorch-bot bot added the topic: rocm label Mar 6, 2026

meta-cla bot added the CLA Signed label Mar 6, 2026

danielvegamyhre self-requested a review March 6, 2026 18:40

danielvegamyhre reviewed Mar 6, 2026

danielvegamyhre self-requested a review March 6, 2026 18:41

changing headers

cad25ac

danielvegamyhre merged commit 5045d76 into pytorch:main Mar 7, 2026
19 checks passed

brucechanglongxu mentioned this pull request Mar 7, 2026

Revert expanded MoE FP8 autotune configs that regress DeepSeek V3 shapes #4024

Merged

4 tasks

danielvegamyhre merged 2 commits into pytorch:main from alex-minooka:float8-deepseek-int-overflow
