[nvfp4] Make per_tensor_scale optional for triton kernel path #4188
jerryzh168 merged 15 commits into main
Conversation
Summary: MSLK now supports an optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a `per_tensor_scale` (single-level block-wise scaling only).

Changes:
- Remove `assert per_tensor_scale is not None` from the `to_nvfp4` triton branch
- Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as `global_scale=1.0`)
- Relax `_addmm_nvfp4_dispatch` to allow mixed `per_tensor_scale` states between operands (treating `None` as 1.0) instead of asserting both-or-neither

Test Plan: Requires an SM100+ GPU with MSLK installed.

```
python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v
python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v
python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "use_triton_kernel_True-use_dynamic_per_tensor_scale_False" -v
```
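To make "single-level block-wise scaling only" concrete, here is a minimal sketch of the scale semantics in plain Python. The helper name `nvfp4_block_scale` and the exact formula are illustrative assumptions, not torchao's actual implementation; the point is only that a missing per-tensor scale is treated as `1.0`, so the block scale alone carries the quantization range:

```python
F4E2M1_MAX = 6.0  # max representable magnitude in fp4 e2m1

def nvfp4_block_scale(block, per_tensor_scale=None):
    """Illustrative block-scale computation for one block of values.

    per_tensor_scale=None means "no global scale" and is treated as 1.0,
    mirroring how the relaxed triton path forwards None to MSLK.
    """
    amax = max(abs(x) for x in block)
    global_scale = 1.0 if per_tensor_scale is None else per_tensor_scale
    # Map the block's amax onto the fp4 range, pre-divided by the
    # (possibly absent) global scale.
    return (amax / F4E2M1_MAX) / global_scale

blk = [0.5, -3.0, 1.25, 6.0]
s_single = nvfp4_block_scale(blk)       # single-level: block scale only
s_two    = nvfp4_block_scale(blk, 2.0)  # two-level: block + global scale
assert s_single == 1.0
assert s_two == 0.5
```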
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4188. Note: links to docs will display an error until the docs builds have been completed.

✅ No failures as of commit a205345 with merge base 0c29e81. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
```diff
- data_scaled = data_hp / block_scale_fp32.unsqueeze(-1)
+ # Multiply by reciprocal instead of dividing to match MSLK triton kernel
+ # numerics (global_scale=None treated as 1.0): x * (1.0 / fp8_scale)
+ data_scaled = data_hp * (1.0 / block_scale_fp32).unsqueeze(-1)
```
is there evidence that this is needed?
yeah, it's because the kernel equivalence numerical test fails
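The divide-vs-reciprocal discrepancy above is a genuine double-rounding effect: `x / s` rounds once, while `x * (1.0 / s)` rounds twice (once for the reciprocal, once for the product), so the two can disagree in the last bit. A quick illustrative check with plain Python float64 values (not the kernel's fp32/fp8 path, but the same phenomenon):

```python
import random

random.seed(0)
mismatches = 0
for _ in range(10_000):
    x = random.uniform(-4.0, 4.0)
    s = random.uniform(0.5, 2.0)
    # Division rounds once; reciprocal-then-multiply rounds twice,
    # so the two results can differ by one ulp.
    if x / s != x * (1.0 / s):
        mismatches += 1

print(mismatches)  # a nontrivial fraction of the samples mismatch
assert mismatches > 0
```

This is why matching the MSLK kernel bit-for-bit requires the reference path to use the same reciprocal-multiply formulation.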
…ath" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy 
Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
@claude review |
Claude finished @jerryzh168's task in 5m 11s. Code Review: [nvfp4] Make per_tensor_scale optional for triton kernel path
Overall this is a clean, well-scoped change. The core logic is straightforward: remove the assertion that blocked the triton kernel path when `per_tensor_scale` is `None`.
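The other relaxation in this PR, in `_addmm_nvfp4_dispatch`, can be sketched the same way. The helper name `nvfp4_mm_output_scale` is hypothetical; it only illustrates the new contract, where each operand's missing per-tensor scale is independently treated as 1.0 instead of requiring both operands to agree:

```python
def nvfp4_mm_output_scale(a_scale=None, b_scale=None):
    # Relaxed dispatch sketch: a missing per_tensor_scale on either
    # operand is treated as 1.0 rather than asserting both-or-neither.
    a = 1.0 if a_scale is None else a_scale
    b = 1.0 if b_scale is None else b_scale
    return a * b

assert nvfp4_mm_output_scale() == 1.0           # neither operand has one
assert nvfp4_mm_output_scale(2.0) == 2.0        # mixed states now allowed
assert nvfp4_mm_output_scale(2.0, 0.25) == 0.5  # both present, as before
```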
|
… triton kernel path" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) 
fp8_ovhd_time_sympy Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
…ath" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy 
Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "use_triton_kernel_True-use_dynamic_per_tensor_scale_False" -v ``` ghstack-source-id: 53081a1 Pull Request resolved: #4188
… triton kernel path" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) 
fp8_ovhd_time_sympy Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
…ath" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy 
Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
@claude review again
Claude finished @jerryzh168's task in 3m 51s.

Code Review: [nvfp4] Make per_tensor_scale optional for triton kernel path

This revision addresses the previous feedback well: the version guard for MSLK was added, and vkuzo's comment about removing the
```python
)

if global_scale is None and not is_mslk_version_at_least("1.1.0"):
    import mslk
```

mslk.__version__ that's used in L1224 I think
```python
    import mslk

    raise RuntimeError(
        f"Optional global_scale support requires MSLK >= 1.1.0, "
```

is this version matching the MSLK release corresponding to PyTorch 2.11.0, or is this a later version?

Yes, this is the matching version for torch 2.11.0, just released recently
I think it would be less confusing to do it as follows

```python
if global_scale is None:
    assert is_mslk_version_at_least("1.1.0"), "unsupported"
    ...
```

makes sense, updated
```python
def is_mslk_version_at_least(min_version: str) -> bool:
    if not _is_mslk_available():
        return False
    import mslk
```

yeah `_is_mslk_available()` checks the availability of mslk library:

Lines 1270 to 1277 in 79159f2

(we should remove import there actually, I can put up a follow up PR)
… triton kernel path" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) 
fp8_ovhd_time_sympy Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
…ath" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy 
Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "use_triton_kernel_True-use_dynamic_per_tensor_scale_False" -v ``` ghstack-source-id: f9a4c1f Pull Request resolved: #4188
… triton kernel path" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) 
fp8_ovhd_time_sympy Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
…ath" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy 
Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "use_triton_kernel_True-use_dynamic_per_tensor_scale_False" -v ``` ghstack-source-id: 6f4f6b4 Pull Request resolved: #4188
… triton kernel path" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) 
fp8_ovhd_time_sympy Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
…ath" Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK nightly installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v ``` Performance: with global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+git95281b63b recipe_name nvfp4 do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy 
Max(2.0e-6, 6.11413043478261e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.45 1 2048 2048 2048 2.39 0.66 2 4096 4096 4096 2.92 1.29 3 8192 8192 8192 3.34 1.74 4 16384 16384 16384 3.63 2.84 ``` without global scale: ``` python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True Parameter Value ---------------------- ------------------------ GPU NVIDIA GB200 torch version 2.12.0.dev20260316+cu128 torchao version 0.17.0+gitabb103d3b recipe_name nvfp4_no_global_scale do_benchmarks True shape_gen_name pow2 enable_fusion_modeling True op_name linear MKN None None None DHW None None None kernel_size stride 1 padding 0 bf16_gemm_time_sympy Max(2.0e-6, 1.13960113960114e-15*K*M*N, 2.71739130434783e-13*K*M + 2.71739130434783e-13*K*N + 2.71739130434783e-13*M*N) bf16_ovhd_time_sympy Max(2.0e-6, 5.43478260869565e-13*K*M) fp8_gemm_time_sympy Max(2.0e-6, 2.84900284900285e-16*K*M*N, 6.79347826086956e-14*K*M + 6.79347826086956e-14*K*N + 2.71739130434783e-13*M*N + 6.79347826086956e-14*floor(K*M/16 + K*N/16)) fp8_ovhd_time_sympy Max(2.0e-6, 3.39673913043478e-13*K*M + 1.35869565217391e-13*M*floor(K/16)) fwd_M fwd_K fwd_N r_fp8_gemm_and_ovhd_spdp b_fp8_e2e_spdp 0 1024 1024 1024 1.00 0.73 1 2048 2048 2048 2.71 1.09 2 4096 4096 4096 3.44 2.22 3 8192 8192 8192 3.68 2.82 4 16384 16384 16384 3.83 3.65 ``` [ghstack-poisoned]
Summary: MSLK now supports optional global scale in its triton quantize kernel (MSLK#233, commit c01f06c). This change relaxes the corresponding constraint in torchao so the triton kernel path can be used without a per_tensor_scale (single-level block-wise scaling only). Changes: - Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch - Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0) - Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither Test Plan: Requires SM100+ GPU with MSLK installed. ``` python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "use_triton_kernel_True-use_dynamic_per_tensor_scale_False" -v ``` ghstack-source-id: 547bfc9 Pull Request resolved: #4188
Stack from ghstack (oldest at bottom):
Summary:
MSLK now supports optional global scale in its triton quantize kernel
(MSLK#233, commit c01f06c). This change relaxes the corresponding
constraint in torchao so the triton kernel path can be used without
a per_tensor_scale (single-level block-wise scaling only).
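As a hedged numeric sketch of the two scaling modes (illustrative only; the real kernels operate on packed fp4 codes and e4m3 block scales, and `nvfp4_dequantize` is a hypothetical name), the modes differ only in whether a per-tensor factor participates in dequantization:

```python
def nvfp4_dequantize(fp4_value, block_scale, per_tensor_scale=None):
    # Two-level scaling: value ~= fp4_value * block_scale * per_tensor_scale.
    # Single-level block-wise scaling (the newly allowed path):
    # per_tensor_scale is None and acts as 1.0.
    s = 1.0 if per_tensor_scale is None else per_tensor_scale
    return fp4_value * block_scale * s
```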
Changes:
- Remove `assert per_tensor_scale is not None` from `to_nvfp4` triton branch
- Update `mslk_quantize_nvfp4` and its custom op to accept `Optional[torch.Tensor]`, passing `None` through to MSLK (which treats it as global_scale=1.0)
- Relax `_addmm_nvfp4_dispatch` to allow mixed per_tensor_scale states between operands (treat None as 1.0) instead of asserting both-or-neither
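The relaxed dispatch rule can be sketched as follows. This is an illustration of the None-as-1.0 semantics, not torchao's actual `_addmm_nvfp4_dispatch` code; plain floats stand in for scalar scale tensors, and `combined_global_scale` is a hypothetical helper.

```python
from typing import Optional

def combined_global_scale(
    a_scale: Optional[float], b_scale: Optional[float]
) -> Optional[float]:
    # Neither operand carries a per-tensor scale: no global rescale needed.
    if a_scale is None and b_scale is None:
        return None
    # Mixed states are allowed: a missing scale behaves as 1.0.
    a = 1.0 if a_scale is None else a_scale
    b = 1.0 if b_scale is None else b_scale
    return a * b
```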
Test Plan:
Requires SM100+ GPU with MSLK nightly installed.

```
python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_triton_nvfp4_quantize_equivalence -v
python -m pytest test/prototype/mx_formats/test_nvfp4_tensor.py::test_nvfp4_matmul_optional_per_tensor_scale -v
python -m pytest test/prototype/mx_formats/test_inference_workflow.py::test_inference_workflow_nvfp4 -k "test_inference_workflow_nvfp4" -v
```
Performance:

with global scale:

```
python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4 --enable_fusion_modeling True --skip_printing_detailed_metrics True

   fwd_M  fwd_K  fwd_N  r_fp8_gemm_and_ovhd_spdp  b_fp8_e2e_spdp
0   1024   1024   1024                      1.00            0.45
1   2048   2048   2048                      2.39            0.66
2   4096   4096   4096                      2.92            1.29
3   8192   8192   8192                      3.34            1.74
4  16384  16384  16384                      3.63            2.84
```

without global scale:

```
python benchmarks/float8/float8_inference_roofline.py --recipe_name nvfp4_no_global_scale --enable_fusion_modeling True --skip_printing_detailed_metrics True

   fwd_M  fwd_K  fwd_N  r_fp8_gemm_and_ovhd_spdp  b_fp8_e2e_spdp
0   1024   1024   1024                      1.00            0.73
1   2048   2048   2048                      2.71            1.09
2   4096   4096   4096                      3.44            2.22
3   8192   8192   8192                      3.68            2.82
4  16384  16384  16384                      3.83            3.65
```