Add support for flashinfer quantize kernel option for nvfp4#3912

Closed
jerryzh168 wants to merge 21 commits into gh/jerryzh168/43/base from gh/jerryzh168/43/head

@jerryzh168
Contributor

@jerryzh168 jerryzh168 commented Feb 17, 2026

Stack from ghstack (oldest at bottom):

Summary:
Adds a flashinfer option for the NVFP4 quantize kernel to improve performance on
some of the workflows we are interested in. Also adds a numerical equivalence
test across the different nvfp4_quantize_kernel_choice options.

Test Plan:

pip install flashinfer-python

pytest test/prototype/mx_formats/test_nvfp4_tensor.py -k test_kernel_preference_numerical_equivalence

perf test: #4031

We'll test speedup a bit later
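
As a rough illustration of how a kernel option like this might be selected, here is a minimal sketch that uses only the standard library. It is not the torchao implementation: the enum below is a stand-in, the `FLASHINFER` member name and the `resolve_kernel_choice` helper are assumptions made for this example, and it assumes that `pip install flashinfer-python` exposes a module named `flashinfer`.

```python
import importlib.util
from enum import Enum


class NVFP4QuantizeKernelChoice(Enum):
    """Illustrative stand-in; the FLASHINFER member name is an assumption."""

    TORCH = "torch"
    TRITON = "triton"
    FLASHINFER = "flashinfer"


def resolve_kernel_choice(requested, module_name="flashinfer"):
    """Return `requested`, failing early if its optional dependency is missing.

    `module_name` is a parameter only so the check can be exercised without
    flashinfer installed; it is a hypothetical helper argument.
    """
    if requested is NVFP4QuantizeKernelChoice.FLASHINFER:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(module_name) is None:
            raise ImportError(
                f"{module_name!r} is required for the FLASHINFER kernel choice"
            )
    return requested


if __name__ == "__main__":
    print(resolve_kernel_choice(NVFP4QuantizeKernelChoice.TORCH))
```

Checking for the optional dependency up front means a missing install surfaces as a clear error at configuration time rather than in the middle of a quantization call.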

Reviewers:

Subscribers:

Tasks:

Tags:

@pytorch-bot

pytorch-bot Bot commented Feb 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3912

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit 452f27a with merge base 15df843 (image):

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168 added a commit that referenced this pull request Feb 17, 2026
Summary:
Added the flashinfer option for better performance on some of the workflow
we are interested in, also added numerical equivalence test between different
quantize_kernel_preference options

Test Plan:
pytest test/prototype/mx_formats/test_nvfp4_tensor.py -k test_kernel_preference_numerical_equivalence

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 7ec4b65
Pull Request resolved: #3912
@meta-cla meta-cla Bot added the CLA Signed label Feb 17, 2026
@jerryzh168 jerryzh168 added the module: not user facing label Feb 17, 2026
jerryzh168 added a commit that referenced this pull request Feb 17, 2026
ghstack-source-id: 39bdea0
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: 834531d
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: 1a9b7b1
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: 0ea6062
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: 5480f76
Pull Request resolved: #3912
vkuzo previously requested changes Feb 18, 2026
Comment thread torchao/quantization/quantize_/common/kernel_preference.py Outdated
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: 51072c2
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: cb5cda5
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 18, 2026
ghstack-source-id: 6d8af1c
Pull Request resolved: #3912
@jerryzh168 jerryzh168 requested a review from vkuzo February 18, 2026 22:37
jerryzh168 added a commit that referenced this pull request Feb 19, 2026
ghstack-source-id: 2d70cb7
Pull Request resolved: #3912
@torch.no_grad()
def test_triton_nvfp4_quantize_equivalence(M, N, use_per_tensor_scale, dtype):
"""Test that Triton and PyTorch NVFP4 quantization produce equivalent results."""
def test_kernel_choice_numerical_equivalence(M, N, use_per_tensor_scale, dtype):
Contributor

clarify this is for quantization from bf16 to nvfp4

Contributor Author

@jerryzh168 jerryzh168 Feb 19, 2026

Will rename to test_quantize_to_nvfp4_kernel_numerical_equivalence.

The original test covers both fp32 and bf16; please let me know if you think we should remove the fp32 test case as well.
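
For context on what an equivalence test like this compares, here is a simplified pure-Python sketch of blockwise quantization onto the FP4 E2M1 grid, with a check that two implementations produce identical codes. It is not torchao's implementation: every function name is hypothetical, the block scale stays a Python float instead of being rounded to FP8, the optional per-tensor scale is omitted, and ties go to the smaller magnitude for simplicity.

```python
import bisect
import random

# Magnitudes representable in FP4 E2M1 (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 scales elements in blocks of 16


def block_scale(block):
    """Scale that maps the block's largest magnitude onto 6.0, the E2M1 maximum."""
    amax = max(abs(x) for x in block)
    return amax / E2M1_VALUES[-1] if amax > 0 else 1.0


def nearest_linear(mag):
    """Reference: scan every grid point; ties go to the smaller index."""
    return min(range(len(E2M1_VALUES)), key=lambda i: (abs(E2M1_VALUES[i] - mag), i))


def nearest_bisect(mag):
    """Alternative: binary search, with the same tie rule as the reference."""
    pos = bisect.bisect_left(E2M1_VALUES, mag)
    if pos == 0:
        return 0
    if pos == len(E2M1_VALUES):
        return len(E2M1_VALUES) - 1
    lo, hi = pos - 1, pos
    return lo if abs(E2M1_VALUES[lo] - mag) <= abs(E2M1_VALUES[hi] - mag) else hi


def quantize_block(block, nearest):
    """Return (scale, [(is_negative, grid_index), ...]) for one block."""
    scale = block_scale(block)
    return scale, [(x < 0, nearest(abs(x) / scale)) for x in block]


def implementations_agree(values):
    """True if both lookups yield identical scales and codes for every block."""
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        if quantize_block(block, nearest_linear) != quantize_block(block, nearest_bisect):
            return False
    return True


rng = random.Random(0)
data = [rng.uniform(-8.0, 8.0) for _ in range(4 * BLOCK_SIZE)]
print(implementations_agree(data))
```

The comparison is on the quantized codes and scales themselves, so it only passes when the two code paths round every element the same way.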

# For kernel choices that use the same quantization algorithm as TORCH
# (TRITON should be bitwise identical), verify internal data matches exactly
if kc == NVFP4QuantizeKernelChoice.TRITON:
    torch.testing.assert_close(
Contributor

add check to ensure bitwise match
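
To illustrate why a bitwise check is stricter than a tolerance-based comparison, here is a small standard-library sketch. The helper names are hypothetical and this is not the test code from the PR. In a torch test the same intent is usually expressed by comparing with zero tolerances (`rtol=0, atol=0`) or by comparing an integer view of the data.

```python
import math
import struct


def float_bits(x):
    """Raw IEEE 754 single-precision bytes of x."""
    return struct.pack("<f", x)


def bitwise_equal(a, b):
    """True only if every element has an identical float32 bit pattern."""
    return len(a) == len(b) and all(float_bits(x) == float_bits(y) for x, y in zip(a, b))


def close(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Tolerance-based comparison, similar in spirit to an allclose check."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


ref = [0.0, 1.5, -3.0]
signed_zero = [-0.0, 1.5, -3.0]   # differs from ref only in the sign bit of zero
nearly = [0.0, 1.5000005, -3.0]   # differs from ref by a few float32 ulps

for name, other in [("signed_zero", signed_zero), ("nearly", nearly)]:
    print(name, "close:", close(ref, other), "bitwise:", bitwise_equal(ref, other))
```

Both perturbed inputs pass the tolerance check but fail the bitwise one, which is the kind of divergence a bitwise assertion is meant to catch.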

@vkuzo
Contributor

vkuzo commented Feb 19, 2026

Can we include performance benchmarking in this PR?

  1. roofline script with sweep over shapes: https://github.com/pytorch/ao/blob/main/benchmarks/float8/float8_inference_roofline.py
  2. nvfp4 individual cast bench:
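
The cast benchmark referenced in item 2 is not shown in this thread. Independently of it, a generic timing harness for comparing kernel choices could look like the sketch below. The `benchmark` helper and its parameters are hypothetical, and the `sync` hook is an assumption for asynchronous backends (for CUDA kernels one would typically pass `torch.cuda.synchronize`).

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20, sync=None):
    """Return the median wall-clock seconds of fn(*args) over `iters` runs.

    `sync`, if given, is called after each run so that asynchronous work
    has finished before the timer stops.
    """
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)
    return statistics.median(times)


if __name__ == "__main__":
    # Stand-in workloads; real candidates would be the quantize kernels.
    candidates = {"sorted": sorted, "sum": sum}
    data = list(range(10_000))
    for name, fn in candidates.items():
        print(f"{name}: {benchmark(fn, data) * 1e6:.1f} us")
```

Warmup runs keep one-time costs such as compilation or cache population out of the measurement, and the median is less sensitive to outliers than the mean.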

jerryzh168 added a commit that referenced this pull request Feb 19, 2026
ghstack-source-id: d0132c3
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 19, 2026
ghstack-source-id: f281c57
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 19, 2026
ghstack-source-id: fe20eba
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 20, 2026
ghstack-source-id: b1c5919
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Feb 20, 2026
ghstack-source-id: d86b009
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 0db4b3f
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 5cd280a
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 50ce441
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 08be951
Pull Request resolved: #3912
@jerryzh168 jerryzh168 marked this pull request as draft March 14, 2026 07:53
jerryzh168 added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 4e6c022
Pull Request resolved: #3912
jerryzh168 added a commit that referenced this pull request Mar 17, 2026
ghstack-source-id: 3577f51
Pull Request resolved: #3912
@jerryzh168
Contributor Author

It seems the mslk kernels can give us performance similar to the flashinfer kernel, so this PR is no longer needed.

@jerryzh168 jerryzh168 closed this Apr 15, 2026

Labels

CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
module: not user facing (Use this tag if you don't want this PR to show up in release notes.)

2 participants