
Conversation

lakshayg (Collaborator) commented Sep 27, 2025

This commit simplifies the precision lookup and setting logic by reducing the number of branches and using a custom hash function. Fixes #161822. The issue described in #163709 still persists; this is meant as a short-term fix.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela
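A minimal sketch of the approach the description outlines: a pair-keyed table with a custom hash, so a single hash lookup replaces a cascade of branches. The key and value types here are assumptions for illustration, not the actual declarations in aten/src/ATen/Context.h.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical pair key standing in for ATen's (backend, operator) identifiers.
struct KeyHash {
  size_t operator()(const std::pair<uint32_t, uint32_t>& k) const {
    // Pack both halves into one 64-bit value and hash it once.
    const uint64_t packed = (static_cast<uint64_t>(k.first) << 32) | k.second;
    return std::hash<uint64_t>{}(packed);
  }
};

// One hash lookup instead of nested if/else chains over backend and op names.
using PrecisionMap =
    std::unordered_map<std::pair<uint32_t, uint32_t>, std::string, KeyHash>;
```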

pytorch-bot bot commented Sep 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164044

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 100120d with merge base 2a7c486:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lakshayg (Collaborator Author) commented:

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing (topic category) label Sep 27, 2025
@lakshayg lakshayg force-pushed the fp_precision_settting_perf_fix branch 2 times, most recently from b315aff to 3bf567d on September 28, 2025 16:18
@eqy eqy requested a review from ngimel September 29, 2025 16:04
eqy (Collaborator) left a comment:

Thanks! What's the estimated performance difference between the old float32Precision and the new version? IIRC this was the main issue, as all matmuls (even ones not affected by the TF32 setting) would call it.

@jerryzh168 jerryzh168 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Sep 29, 2025
lakshayg (Collaborator Author) commented Sep 29, 2025

> What's the estimated performance difference between the old float32Precision and new version?

@eqy I used the benchmarking script you shared in #161822 and ran it 100 times each for this PR and the base branch. Here's what the run-time distribution looks like; the distribution has simply shifted left by ~0.5 ticks.

It's roughly a 5% speedup, but I think it's better to think of it as a constant 0.5-tick improvement, since it doesn't depend on the sizes of the matrices involved.

(image: run-time distribution histogram for this PR vs. the base branch)

@pytorch-bot pytorch-bot bot added the ciflow/inductor, module: cpu (CPU specific problem (e.g., perf, algorithm)), and module: dynamo labels Sep 30, 2025
@lakshayg lakshayg force-pushed the fp_precision_settting_perf_fix branch from e6ebdf3 to e658958 on September 30, 2025 19:17
@lakshayg lakshayg self-assigned this Oct 1, 2025
@lakshayg lakshayg moved this to In Progress in PyTorch + CUDA Oct 1, 2025
eqy (Collaborator) commented Oct 1, 2025

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Oct 1, 2025
pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator):

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

cyyever (Collaborator) commented Oct 2, 2025

@pytorchmergebot merge

@github-project-automation github-project-automation bot moved this from In Progress to Done in PyTorch + CUDA Oct 2, 2025
(Inline review comment on the std::unordered_map declaration in Context.h)

swolchok (Contributor) commented Oct 2, 2025

this can be improved further by using nested arrays as in #164387

lakshayg (Collaborator Author) replied:

Yeah, I considered doing that but decided not to, since the hashtable approach is immune to someone accidentally changing the enum order or forgetting to update the mapping when they add or remove backends.

I also think that this design needs to be revisited (#163709) and wanted to avoid potentially throw-away work.

It's a valid comment though. Feel free to take it up in a separate PR.
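For context, a minimal sketch of the nested-array alternative suggested above, with hypothetical enums standing in for the real backend/operator identifiers (the actual types in #164387 may differ):

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical enums; the trailing COUNT member sizes the arrays.
enum class Backend { CUDA, MKLDNN, COUNT };
enum class Op { MATMUL, CONV, RNN, COUNT };

// Nested arrays make the lookup pure indexing with no hashing at all.
// The trade-off raised in the reply above: entries are positional, so
// reordering an enum or adding a backend silently shifts the table.
using PrecisionTable =
    std::array<std::array<std::string, static_cast<std::size_t>(Op::COUNT)>,
               static_cast<std::size_t>(Backend::COUNT)>;

std::string lookup(const PrecisionTable& t, Backend b, Op o) {
  return t[static_cast<std::size_t>(b)][static_cast<std::size_t>(o)];
}
```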

yangw-dev (Contributor) commented:

@pytorchbot revert -c ghfirst -m "broke internal build In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1:
xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow]
502 | return std::hash<size_t>{}((k1 << 32) | k2);"

pytorch-bot bot commented Oct 2, 2025

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string

Try `@pytorchbot --help` for more info.

yangw-dev (Contributor) commented Oct 2, 2025

@pytorchbot revert -c ghfirst -m "broke internal build In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1: xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow] 502 | return std::hash<size_t>{}((k1 << 32) | k2);"

pytorchmergebot (Collaborator):

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 2, 2025
This reverts commit 723ba21.

Reverted #164044 on behalf of https://github.com/yangw-dev due to a broken internal build: In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1: xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow] 502 | return std::hash<size_t>{}((k1 << 32) | k2);
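To spell out the revert reason: the flagged line shifts a size_t left by 32 bits. On the failing internal build size_t is 32 bits wide, so the shift count equals the type's width and the expression is undefined behavior, which Clang rejects under -Werror. One portable way to combine two keys without a wide shift is a boost-style hash_combine; this is an illustrative sketch, not necessarily the fix that landed in 100120d:

```cpp
#include <cstddef>
#include <functional>

// Combines two keys without shifting by the full word width, so it is
// well-defined whether size_t is 32 or 64 bits wide.
std::size_t combine_keys(std::size_t k1, std::size_t k2) {
  std::size_t seed = std::hash<std::size_t>{}(k1);
  // boost::hash_combine's mixing step: a golden-ratio constant plus small,
  // always-in-range shifts of 6 and 2.
  seed ^= std::hash<std::size_t>{}(k2) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  return seed;
}
```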
pytorchmergebot (Collaborator):

@lakshayg your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Oct 2, 2025
This commit simplifies the precision lookup and setting logic by reducing the number of branches and using a custom hash function. The buggy implementation could return "bf16" for the "cuda" backend, which is an unsupported combination.
@lakshayg lakshayg force-pushed the fp_precision_settting_perf_fix branch from e658958 to 100120d on October 2, 2025 21:43
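As a gloss on the bug described in the updated commit message, here is a hedged sketch of the kind of guard that keeps an unsupported backend/precision pair from escaping the lookup. The function name and plain-string interface are illustrative assumptions, not the actual ATen API:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical guard: the "cuda" backend does not support "bf16" as a
// float32 precision, so a lookup result must never hand that pair back.
void validate_precision(const std::string& backend,
                        const std::string& precision) {
  if (backend == "cuda" && precision == "bf16") {
    throw std::invalid_argument(
        "bf16 is not a supported float32 precision for the cuda backend");
  }
}
```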
ngimel (Collaborator) commented Oct 3, 2025

@pytorchbot merge

pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

cyyever pushed a commit to cyyever/pytorch that referenced this pull request Oct 4, 2025
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes pytorch#161822. The issue described in pytorch#163709 still
persists. This is meant as a short term fix.

Pull Request resolved: pytorch#164044
Approved by: https://github.com/ngimel, https://github.com/eqy
Labels
ci-no-td (Do not run TD on this PR)
ciflow/inductor
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: cpu (CPU specific problem (e.g., perf, algorithm))
module: dynamo
open source
Reverted
topic: not user facing (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[CUDA][cuBLAS] #125888 introduces measurable CPU overhead for matmuls
9 participants