Speed up FP precision lookup #164044
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164044
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 100120d with merge base 2a7c486. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Force-pushed from b315aff to 3bf567d (Compare)
Thanks! What's the estimated performance difference between the old float32Precision and the new version? IIRC this was the main issue, as all matmuls (even ones that were not affected by the TF32 setting) would call it.
@eqy I used the benchmarking script you shared in #161822 and ran it 100 times for this PR and the base branch. The run time distribution has shifted left by roughly 0.5 ticks. That works out to about a 5% speedup, but I think it's better to think of it as a constant 0.5-tick improvement, since it does not depend on the sizes of the matrices involved.
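For reference, a minimal sketch of the kind of matmul microbenchmark that can surface this constant per-call overhead. This is not the script from #161822; it assumes libtorch/ATen is available, and the sizes and iteration count are illustrative only.

```cpp
// Minimal matmul timing loop (sketch). Small matrices keep the kernel cheap,
// so the fixed cost of the per-call precision lookup is a visible fraction
// of the total run time.
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  at::Tensor a = at::randn({64, 64});
  at::Tensor b = at::randn({64, 64});

  constexpr int kIters = 100000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    at::Tensor c = at::matmul(a, b);  // each call goes through the precision lookup
  }
  auto end = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(end - start).count();
  std::printf("avg per-call time: %.3f us\n", us / kIters);
  return 0;
}
```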
Force-pushed from e6ebdf3 to e658958 (Compare)
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchmergebot merge
aten/src/ATen/Context.h (Outdated)
}
};

std::unordered_map<
This can be improved further by using nested arrays, as in #164387.
Yeah, I considered doing that but decided not to, since the hashtable approach is immune to someone accidentally changing the enum order or forgetting to update the mapping when they add/remove backends.
I also think that this design needs to be revisited (#163709) and wanted to avoid potentially throwaway work.
It's a valid comment though. Feel free to take it up in a separate PR.
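For illustration, here is a hedged sketch of the two layouts being discussed. The enum names, value type, and table names are hypothetical stand-ins, not the actual Context.h declarations.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical enums standing in for the backend/op dimensions.
enum class Backend : uint8_t { CUDA, MKLDNN, kCount };
enum class MatmulOp : uint8_t { MATMUL, CONV, RNN, kCount };

// Option A (this PR): hash map keyed by a packed (backend, op) key. Robust to
// enum reordering and to adding/removing backends, at the cost of a hash lookup.
std::unordered_map<uint64_t, std::string> precision_map;

// Option B (#164387-style): nested fixed-size arrays indexed directly by the
// enum values. No hashing, but every entry must stay in sync with the enums.
std::array<std::array<std::string, static_cast<size_t>(MatmulOp::kCount)>,
           static_cast<size_t>(Backend::kCount)>
    precision_table;

const std::string& lookup(Backend b, MatmulOp op) {
  return precision_table[static_cast<size_t>(b)][static_cast<size_t>(op)];
}
```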
@pytorchbot revert -c ghfirst -m "broke internal build In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1:
❌ 🤖 pytorchbot command failed:
@pytorchbot revert -c ghfirst -m "broke internal build In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1: xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow] 502 | return std::hash<size_t>{}((k1 << 32) | k2);"
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 723ba21. Reverted #164044 on behalf of https://github.com/yangw-dev due to broke internal build: In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1: xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow] 502 | return std::hash<size_t>{}((k1 << 32) | k2);
@lakshayg your PR has been successfully reverted.
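For context, the `-Wshift-count-overflow` error above occurs when the left operand of `<< 32` is only 32 bits wide. A minimal sketch of the pattern and a portable fix follows; the variable names mirror the error message, but the surrounding hash functor is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

struct PairHash {
  size_t operator()(uint32_t k1, uint32_t k2) const {
    // Buggy when k1 is a 32-bit type: "k1 << 32" shifts by the full width of
    // the type, which is undefined behavior and triggers
    // -Wshift-count-overflow under -Werror.
    // return std::hash<size_t>{}((k1 << 32) | k2);

    // Fix: widen to 64 bits before shifting so the packed key is well-defined.
    uint64_t packed = (static_cast<uint64_t>(k1) << 32) | k2;
    return std::hash<uint64_t>{}(packed);
  }
};
```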
This commit simplifies the precision lookup and setting logic by reducing the number of branches and using a custom hash function.
The buggy implementation could return "bf16" for the "cuda" backend, which is an unsupported combination.
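To illustrate the kind of bug being fixed, here is a hedged sketch of guarding against unsupported backend/precision combinations. The table contents and helper name are hypothetical and only illustrative; they are not the actual ATen logic.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical whitelist of (backend, precision) pairs. The buggy path could
// hand back "bf16" for "cuda", which would not appear in a table like this.
static const std::set<std::pair<std::string, std::string>> kSupported = {
    {"cuda", "tf32"}, {"cuda", "ieee"},
    {"mkldnn", "bf16"}, {"mkldnn", "ieee"},
};

void check_supported(const std::string& backend, const std::string& precision) {
  if (kSupported.find({backend, precision}) == kSupported.end()) {
    throw std::invalid_argument("unsupported fp32 precision '" + precision +
                                "' for backend '" + backend + "'");
  }
}
```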
Force-pushed from e658958 to 100120d (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This commit simplifies the precision lookup and setting logic by reducing the number of branches and using a custom hash function. Fixes pytorch#161822. The issue described in pytorch#163709 still persists. This is meant as a short term fix. Pull Request resolved: pytorch#164044 Approved by: https://github.com/ngimel, https://github.com/eqy
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes #161822. The issue described in #163709 still
persists. This is meant as a short term fix.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela