Build mxfp4 kernel for sm120a #2285
The first thing that comes to mind is that the example is doing NVFP4, whereas all our recipes are doing MXFP4, e.g. https://github.com/pytorch/ao/pull/2285/files#diff-e155558499c3b1fbab1b5d3b60f032bf1e636908a8ef50a1de33bff518107019R240-R241 needs to change as well. For inference we have MXFP8 and MXFP4 support, and I am planning to add an NVFP4 scaling recipe next. That being said, I would imagine that MXFP4 is supported on 5090. cc @syed-ahmed
I noticed that as well
😭
Per cutlass docs, I believe MXFP4 is supported on 5090: https://github.com/NVIDIA/cutlass/blob/9d165a3b8ef446a7ff3db198413f82bcb83f46fe/media/docs/cpp/blackwell_functionality.md#blackwell-sm120-gemms However, note the section that talks about the differences with sm100, so it's possible we need more changes to the kernel in torchao. Also, what CUDA version are you using? I'd assume you'd need a fairly recent CUDA version. I'll try to guide more next week.
@syed-ahmed I'm using CUDA 12.9. The strange thing is that the cutlass example works, but the one in torchao doesn't. I carefully compared the two, and I can't spot any difference in the template arguments.
How about the test? Are the inputs similar to the cutlass example?
Overall looks really good. Could you add a test to test/prototype/mx_formats/test_mx_mm.py, even if it wouldn't be exercised in CI? Also, if you have any perf numbers, that would be great.
benchmarks/mx_formats/mm_bench.py (outdated)
    plot_tflops_comparison(df, save_path)


    if __name__ == "__main__":
Can we unify this code with https://github.com/pytorch/ao/blob/main/benchmarks/float8/bench_matmul.py instead? I know the path says float8, but it would be good to have it all in one place.
I have added mxfp4 to that script. You can review it. Thank you
thank you!
Any changes you want to make? If not, I will merge 🙏
benchmarks/float8/bench_matmul.py (outdated)
    A = torch.zeros(M, K, device=device, dtype=d1)
    B = torch.zeros(K, N, device=device, dtype=d2).t().contiguous().t()
    if use_fp4:
        A = torch.zeros(M, K // 2, device=device, dtype=torch.int8).view(
Just now looking at this, but the performance on Nvidia with zero-filled vs. non-zero-filled data is very different.
I think in a follow-up PR we should make zero-filled inputs an option alongside randn-distributed inputs.
can even be in this PR, I'd vote for randn as default
- Should there be an option to choose between randn and zeros? Or just use randn for everything?
- For FP4, should I do randn in FP32/BF16 then convert to FP4, or just do `randint(0, 255, dtype=uint8).view(float4_e2m1fn_x2)`?
^ I think the option for zeros can come later.
- I would do the first: randn in higher precision, then cast.
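To make the "randn in higher precision, then cast" option concrete, here is a minimal pure-Python sketch of rounding normally distributed values to the FP4 E2M1 grid and packing two 4-bit codes per byte. The helper names (`quantize_e2m1`, `pack_fp4`, `random_fp4_row`) are hypothetical, the nibble order is an assumption, and no MX block scaling is applied. The benchmark itself would presumably do this on torch tensors rather than Python lists.

```python
import random

# Magnitudes representable by FP4 E2M1, indexed by the 3-bit (exponent, mantissa) code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_e2m1(x: float) -> int:
    """Return the 4-bit code (sign bit + 3-bit magnitude code) nearest to x.

    Ties go to the smaller magnitude here, which is a simplification of
    round-to-nearest-even.
    """
    sign = 0x8 if x < 0 else 0x0
    mag = min(abs(x), E2M1_VALUES[-1])  # saturate at the largest magnitude (6.0)
    code = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
    return sign | code


def pack_fp4(codes: list[int]) -> bytes:
    """Pack two 4-bit codes per byte.

    Assumption: the first element of each pair goes in the low nibble.
    """
    assert len(codes) % 2 == 0, "need an even number of FP4 values"
    return bytes((hi << 4) | lo for lo, hi in zip(codes[::2], codes[1::2]))


def random_fp4_row(k: int, seed: int = 0) -> bytes:
    """Sample k standard-normal values, quantize to E2M1, and pack into k // 2 bytes."""
    rng = random.Random(seed)
    return pack_fp4([quantize_e2m1(rng.gauss(0.0, 1.0)) for _ in range(k)])
```

Compared with viewing random bytes as packed FP4, this keeps the value distribution close to what randn would produce: random bytes give a uniform distribution over the 16 codes instead.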
Do you mind taking a final look and then merging? The failing tests seem unrelated.
Sorry, this breaks internal CI; we'll revert first to unblock the diff train.
This reverts commit e73a142.
Yea, I guess we need to do something like this if we don't use cmake to build each source file with different flags.
Revert "Build mxfp4 kernel for sm120a (pytorch#2285)" This reverts commit e73a142.
Update (2025/06/14)
Added a benchmark script, courtesy of @drisspg. Perhaps you can run it for B200 as well.
At 575W (default)
At 400W - this is just for me since I'm running this card at 400W
For reference (5090 is last column)
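For readers comparing these results against a theoretical peak, the sketch below shows the usual way an achieved-TFLOPS figure for a matmul is derived from a measured runtime. The function names are hypothetical, and the peak value is left as a caller-supplied parameter because it depends on the card, dtype, and power limit. This is not taken from the benchmark script itself.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an (m x k) @ (k x n) matmul.

    Counts 2 floating-point operations (one multiply, one add) per
    multiply-accumulate, which is the usual convention for GEMM benchmarks.
    """
    return 2 * m * n * k / seconds / 1e12


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Ratio of achieved to theoretical peak throughput (1.0 means peak)."""
    return achieved_tflops / peak_tflops
```

For example, a 1000 x 1000 x 1000 matmul that takes 2 ms corresponds to about 1 TFLOPS.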
At the stock power limit, the card can comfortably reach speed of light. The interesting bit is in MX-FP8: it looks like the kernel from `torch._scaled_mm()` (I guess it's from CuDNN/CuBLAS?) is using FP16 accumulate 🤔

Update (2025/06/11)
I narrowed down the issue to templates: if the kernel is inside a templated function, even if I don't use any template arguments, I will get the runtime error below (`cudaFuncSetAttribute() returned error: invalid resource handle`). It might be an issue with cutlass or my environment (nvcc version, compiler...).

Hence, the solution is to create a separate source file for sm120a, without any templated functions. When we support nvfp4 in the future, we can either manually duplicate the code again, use a macro, or have a Python script to codegen the cutlass kernel creation.
Other details of this PR:
- `sm120a` extension

Other alternatives that I have considered for the torch library loading logic:
- `setuptools.Extension`'s limitation: sm100a and sm120a kernels must stay in separate shared library files. This eliminates the option of doing a runtime check in C++.
- [...] (`mx_fp4_bf16_sm100a` and `mx_fp4_bf16_sm120a`), and dispatch the correct op in Python

Original (2025/05/31)
Just making some quick changes here to see if I can build the mxfp4 kernel on 5090 (sm120). Eventually this will be put under `torchao._C_cutlass_120a`?

Setting `-DCUTLASS_DEBUG_TRACE_LEVEL=1` so I can see the debug trace.

To build (using `torch==2.8.0.dev20250530+cu128`)

Running `pytest test/prototype/mx_formats/test_mx_mm.py -v`

`cudaFuncSetAttribute() returned error: invalid resource handle` means that the function is invalid? https://github.com/NVIDIA/cutlass/blob/ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e/include/cutlass/gemm/device/gemm_universal_adapter.h#L338, which is quite strange...

For reference, I can build and run the example from Cutlass here https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu. The changes in this PR have been taken from this example. When building with `CUTLASS_DEBUG_TRACE_LEVEL=1`, there are also warnings in `sm90_gemm_tma_warpspecialized_cooperative.hpp`, so that is probably not the issue.

@drisspg
cc @alexsamardzic in case you faced this error with Cutlass before
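
As an illustration of the "dispatch the correct op in Python" alternative listed in the 2025/06/11 update, here is a minimal sketch that picks an op name from the device's compute capability. The function `select_mxfp4_op` is hypothetical and only returns the op name as a string. In practice the capability tuple would come from something like `torch.cuda.get_device_capability()`, and the returned name would be looked up in whichever extension library was loaded. This is not necessarily how the PR implements it.

```python
def select_mxfp4_op(capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to an MXFP4 op name.

    (10, 0) corresponds to sm100 (e.g. B200) and (12, 0) to sm120 (e.g. 5090).
    The op names are the two mentioned in the PR description; everything
    else in this function is an illustrative assumption.
    """
    ops = {
        (10, 0): "mx_fp4_bf16_sm100a",
        (12, 0): "mx_fp4_bf16_sm120a",
    }
    try:
        return ops[capability]
    except KeyError:
        major, minor = capability
        raise NotImplementedError(
            f"no MXFP4 kernel available for compute capability {major}.{minor}"
        ) from None
```

Keeping the mapping in Python sidesteps the constraint that the sm100a and sm120a kernels live in separate shared libraries, since only the op that matches the current device needs to be resolved.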