Speedup bincount and histc on CUDA #97090
This PR speeds up torch.bincount and torch.histc on CUDA.

Fixes #96626

After the speedup, the time cost reported in #96626 would be:
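For readers unfamiliar with the two ops, here is a pure-Python sketch of their semantics (reference only, not part of this PR; the PR optimizes the CUDA kernels that compute these results, and the real ops run on tensors, not lists):

```python
def bincount(xs, minlength=0):
    # Count occurrences of each non-negative integer in xs,
    # padding the output to at least `minlength` bins.
    out = [0] * max(minlength, (max(xs) + 1) if xs else 0)
    for x in xs:
        out[x] += 1
    return out

def histc(xs, bins, lo, hi):
    # Histogram of xs over `bins` equal-width buckets spanning [lo, hi];
    # values outside the range are ignored, and values equal to `hi`
    # fall into the last bucket (matching torch.histc's edge handling).
    out = [0] * bins
    width = (hi - lo) / bins
    for x in xs:
        if lo <= x <= hi:
            out[min(int((x - lo) / width), bins - 1)] += 1
    return out

print(bincount([1, 1, 3]))                              # → [0, 2, 0, 1]
print(histc([0.5, 1.5, 2.5], bins=3, lo=0.0, hi=3.0))   # → [1, 1, 1]
```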
Note that in "case 1 CUDA", the max op takes most of the time; see pytorch/aten/src/ATen/native/cuda/SummaryOps.cu, lines 334 to 335 (at commit 5ee5a16).
Benchmark
Time is measured on an i7-10700 + RTX 3080 under Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117; my dev build of PyTorch is compiled with CUDA 11.8. Each case is run 15 times and the median is reported.
torch.bincount
torch.histc (float32)
torch.histc (int64)
Benchmark code
Here is the benchmark code:
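(The original benchmark script did not survive the page scrape. Below is a minimal harness sketch following the methodology stated above, i.e., 15 runs per case with the median reported. The `workload` callable is a stand-in; the actual benchmark calls torch.bincount / torch.histc on CUDA tensors and would need to synchronize the device before reading the clock.)

```python
import time
import statistics

def bench(workload, repeats=15):
    """Run `workload` `repeats` times and return the median wall-clock time (s)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        # For a real CUDA benchmark, call torch.cuda.synchronize() here
        # so the kernel actually finishes before the clock is read.
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Stand-in workload; the real script would time e.g.
# lambda: torch.bincount(cuda_int_tensor) for each benchmark case.
median_s = bench(lambda: sum(range(10_000)))
print(f"median: {median_s:.6f} s")
```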