[FP8] Fix Benchmarking for certain Priors #155722
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155722. ✅ No failures as of commit 08ad237 with merge base ffac0de. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Summary:
Pull Request resolved: pytorch#155722

For priors like layer norm, the weight quantization kernel may be generated in a different order and therefore carry a different suffix, so we match kernel names with a regular expression instead of an exact match.
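The matching change is easy to picture with a minimal sketch; the kernel names below are hypothetical stand-ins, not the actual kernels the benchmark inspects:

```python
import re

# Hypothetical kernel names as they might appear in a benchmark trace.
# The numeric suffix reflects the order in which kernels were generated,
# so a prior such as layer norm can shift it from one model to the next.
kernels = [
    "triton_per_fused_native_layer_norm_2",
    "triton_poi_fused_quantize_weight_5",
]

# Exact matching breaks as soon as the suffix changes:
print("triton_poi_fused_quantize_weight_3" in kernels)  # False

# A regular expression tolerates any suffix:
pattern = re.compile(r"triton_poi_fused_quantize_weight_\d+")
print([k for k in kernels if pattern.fullmatch(k)])
# ['triton_poi_fused_quantize_weight_5']
```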
Test Plan:
Running this on model id 737772166 with

```
buck2 run mode/opt mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 \
  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- \
  --lower-backend=AOT_INDUCTOR --model-snapshot-id=737772166_0 \
  --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 \
  --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}}"
```

allows more linears to be correctly replaced with fp8.
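The `--node_replacement_dict` value is a Python dict literal passed as a string. How the benchmark interprets the keys is not spelled out here; the reading in the comment below is an assumption, not documented behavior:

```python
import ast

# Assumed reading: under the '(autotune)' selector, linears whose shapes
# match '(1000+,1000+)' (both dimensions 1000 or larger) are replaced with
# the named FP8 rowwise dynamic-quantization implementation.
raw = "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}}"
replacement = ast.literal_eval(raw)
print(replacement["(autotune)"]["(1000+,1000+)"])
# fp8_float_model_dynamic_quantization_rowwise
```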
An example of the GPU trace can be found at https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces.

Rollback Plan:

Reviewed By: frank-wei

Differential Revision: D76092551
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov