Add support for flashinfer quantize kernel option for nvfp4 by jerryzh168 · Pull Request #3912 · pytorch/ao

jerryzh168 · 2026-02-17T23:28:48Z

Stack from ghstack (oldest at bottom):

Summary:
Added a flashinfer option for better performance on some of the workflows
we are interested in, and added a numerical equivalence test between the
different nvfp4_quantize_kernel_choice options.

Test Plan:

pip install flashinfer-python

pytest test/prototype/mx_formats/test_nvfp4_tensor.py -k test_kernel_preference_numerical_equivalence

perf test: #4031

We'll test speedup a bit later
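Since the test plan depends on an optional package, a test module may want to detect whether it is installed before exercising the flashinfer path. The following is a minimal sketch of such a check, not code from this PR: `has_module` and `FLASHINFER_AVAILABLE` are hypothetical names, and it assumes the `flashinfer-python` package is imported as `flashinfer`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Assumption: the flashinfer-python wheel is imported as `flashinfer`.
FLASHINFER_AVAILABLE = has_module("flashinfer")

if not FLASHINFER_AVAILABLE:
    print("flashinfer not installed; flashinfer test cases would be skipped")
```

In a pytest file, a flag like this could be passed to `pytest.mark.skipif(not FLASHINFER_AVAILABLE, reason="flashinfer not installed")` so the remaining kernel choices are still tested on machines without the package.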

Reviewers:

Subscribers:

Tasks:

Tags:

Summary: Added the flashinfer option for better performance on some of the workflow we are interested in, also added numerical equivalence test between different quantize_kernel_preference options Test Plan: pytest test/prototype/mx_formats/test_nvfp4_tensor.py -k test_kernel_preference_numerical_equivalence Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]

pytorch-bot · 2026-02-17T23:28:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3912

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit 452f27a with merge base 15df843:

CANCELLED JOB - The following job was cancelled. Please retry:

Run 1xH100 Tests / test (H100, linux.aws.h100, --pre torch torchvision torchaudio mslk --index-url https://download.... / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary: Added the flashinfer option for better performance on some of the workflow we are interested in, also added numerical equivalence test between different quantize_kernel_preference options Test Plan: pytest test/prototype/mx_formats/test_nvfp4_tensor.py -k test_kernel_preference_numerical_equivalence Reviewers: Subscribers: Tasks: Tags: ghstack-source-id: 7ec4b65 Pull Request resolved: #3912

updated

vkuzo · 2026-02-19T17:34:37Z

 @torch.no_grad()
-def test_triton_nvfp4_quantize_equivalence(M, N, use_per_tensor_scale, dtype):
-    """Test that Triton and PyTorch NVFP4 quantization produce equivalent results."""
+def test_kernel_choice_numerical_equivalence(M, N, use_per_tensor_scale, dtype):


Clarify that this is for quantization from bf16 to nvfp4.

jerryzh168 · 2026-02-19

Will rename to test_quantize_to_nvfp4_kernel_numerical_equivalence.

The original test covers both fp32 and bf16; please let me know if you feel we should remove the fp32 test case as well.

vkuzo · 2026-02-19T17:35:27Z

+        # For kernel choices that use the same quantization algorithm as TORCH
+        # (TRITON should be bitwise identical), verify internal data matches exactly
+        if kc == NVFP4QuantizeKernelChoice.TRITON:
+            torch.testing.assert_close(


Add a check to ensure a bitwise match.
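The distinction behind this request can be sketched in plain Python (not torch, so it runs anywhere): a tolerance-based comparison can pass while the underlying bit patterns differ. This is an illustration only, not the PR's test; `bits_of`, `close_enough`, and `bitwise_equal` are hypothetical helpers, and the values are made up.

```python
import math
import struct


def bits_of(values):
    """Pack the values as little-endian IEEE-754 float32 and return the raw bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def close_enough(a, b, rel_tol=1e-5):
    """Elementwise tolerance-based comparison."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


def bitwise_equal(a, b):
    """Exact comparison of the float32 bit patterns."""
    return bits_of(a) == bits_of(b)


reference = [0.5, 1.5, 6.0]
candidate = [0.5, 1.5000001, 6.0]  # middle element rounds to a different float32

print(close_enough(reference, candidate))   # tolerance check passes
print(bitwise_equal(reference, candidate))  # exact check fails
```

With tensors, one way to get the stricter behavior is `torch.testing.assert_close(actual, expected, rtol=0, atol=0)` or `torch.equal(actual, expected)`; which of these fits the test here is left to the author.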

vkuzo · 2026-02-19T17:38:31Z

Can we include performance benchmarking in this PR?

Roofline script with a sweep over shapes: https://github.com/pytorch/ao/blob/main/benchmarks/float8/float8_inference_roofline.py

nvfp4 individual cast bench: ao/benchmarks/mx_formats/cast_bench.py, line 96 in f8bbce5 (`def run(`)

jerryzh168 · 2026-04-15T19:40:28Z

It seems the mslk kernels can give us similar performance to the flashinfer kernel, so this is no longer needed.

jerryzh168 mentioned this pull request Feb 17, 2026

Add flashinfer-python to CI #3910

Merged

jerryzh168 mentioned this pull request Feb 17, 2026

Refactor use_triton_kernel to use nvfp4_quantize_kernel_choice #3911

Closed

meta-cla Bot added the CLA Signed label Feb 17, 2026

jerryzh168 added the module: not user facing label Feb 17, 2026

vkuzo previously requested changes Feb 18, 2026

Comment thread torchao/quantization/quantize_/common/kernel_preference.py Outdated

jerryzh168 requested a review from vkuzo February 18, 2026 22:37

vkuzo reviewed Feb 19, 2026

jerryzh168 marked this pull request as draft March 14, 2026 07:53

jerryzh168 closed this Apr 15, 2026

jerryzh168 wants to merge 21 commits into gh/jerryzh168/43/base from gh/jerryzh168/43/head
