migrate sort from TH to ATen (CUDA) #50887
Conversation
💊 CI failures summary and remediations

As of commit e1e9ca9 (more details on the Dr. CI page and at hud.pytorch.org/pr/50887):

2 failures not recognized by patterns.

This comment was automatically generated by Dr. CI.
Codecov Report

```
@@            Coverage Diff             @@
##           master   #50887      +/-   ##
===========================================
  Coverage    77.42%   77.42%
===========================================
  Files         1891     1891
  Lines       187534   187546     +12
===========================================
+ Hits        145197   145209     +12
  Misses       42337    42337
```
Hey, @zsef123! Thank you very much for your work! I just skimmed through, and for sorting with …
@nikitaved Thanks for the comments! I read that PR, but I think I missed some work. In #39744, the … But the … I tried using … And I also have an idea to mix … Have I missed something here?
@nikitaved Benchmarks
Scripts
This commit
1.7.1
This means that TensorIterator iterates over each dimension but the dimension being sorted. But I think you are right: with such an implementation, parallelization is mostly done over the elements …

Could you show benchmarks similar to what you did with the StridedRandomAccessor? Thank you, @zsef123!
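As a side note for readers (my own illustration, not code from this PR): sorting along one dimension is the same as independently sorting every 1-D slice taken along that dimension, which is what iterating over all the other dimensions amounts to. A minimal sketch with `torch`:

```python
import torch

t = torch.randn(3, 4, 5)
dim = 1
n = t.size(dim)

# Reference: one fused call that sorts along `dim`.
ref_values, _ = torch.sort(t, dim=dim)

# Equivalent loop: view the tensor as a batch of 1-D slices along `dim`
# (i.e. iterate over every other dimension) and sort each slice on its own.
slices = t.movedim(dim, -1).reshape(-1, n)            # one row per slice
sorted_rows = torch.stack([row.sort().values for row in slices])
values = sorted_rows.reshape(t.movedim(dim, -1).shape).movedim(-1, dim)

assert torch.equal(values, ref_values)
```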
@nikitaved I missed `aten/native/cuda/CompositeRandomAccessor.h`. That code uses Thrust, but the struct is named `TupleInfoCPU`, and `CompositeRandomAccessorCPU` as well.
Which placement of the code would be better: in `native/cuda`, or integrated into a single file?
It is better to use `aten/native/cuda/CompositeRandomAccessor.h` for sure. Yes, it was a mistake to leave CPU there; it should be CUDA instead.
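For readers unfamiliar with the accessor: the effect it gives the Thrust kernel is essentially a key/value sort, where one permutation derived from the keys reorders both the values and their indices. A rough Python analogue of that idea (my own sketch, not the actual C++ accessor):

```python
import torch

keys = torch.tensor([3.0, 1.0, 2.0, 1.0])
payload = torch.arange(keys.numel())          # e.g. the original indices

# Sorting the "composite" (key, payload) pairs amounts to computing one
# permutation from the keys and gathering both sequences with it.
perm = torch.argsort(keys, stable=True)
sorted_keys = keys[perm]
sorted_payload = payload[perm]

print(sorted_keys)     # tensor([1., 1., 2., 3.])
print(sorted_payload)  # tensor([1, 3, 2, 0])
```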
Thanks for the answer. I already pushed a commit applying `CompositeRandomAccessor` to the Thrust kernel.
I think I confused combining the Comparator with the Stride. And in your answer, did you mean more benchmark data? (#50887 (comment))
I mean, could you please run your benchmark tests on commits prior to applying the StridedAccessor?
I am not sure you can do that for non-contiguous tensors, unless you make a contiguous copy, which, I guess, you do, right?
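As a small aside (my own example, not from this PR), a transposed view is a typical non-contiguous input, and `.contiguous()` is the kind of copy being referred to:

```python
import torch

t = torch.randn(4, 3).t()        # transposed view: same storage, swapped strides
print(t.is_contiguous())         # False

c = t.contiguous()               # materialize a contiguous copy before sorting
values, indices = torch.sort(c, dim=-1)

# Sorting the view directly gives the same values; the question above is whether
# the underlying kernel needs the contiguous copy to do it.
assert torch.equal(values, torch.sort(t, dim=-1).values)
```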
At the first commit 14277f4, …
@zsef123, then I apologize for wasting your time. Let's keep the fastest version!
What is the status of this PR? I am rewriting the thrust path with cub in #54626.
@zsef123 Are you still interested in working on this?
Sure, I'm still interested, but I'm just waiting for other reviews.
@zasdfgbnm @VitalyFedyunin @ngimel
@zsef123 #54626 will go in first, as it is more limited, ditches thrust (which we need to do for other reasons anyway), and significantly improves performance. I'll let you know when it's merged so you can rebase on top of it.
… cub (#54626)

Summary: The thrust path of `torch.sort` in THC is rewritten and replaced with cub in ATen. The original algorithm is followed, but since cub does not offer a custom compare operator, I have to change it a bit to 2 sort + gather.

Note: tensors larger than 2^31 elements are supported, but the dimension being sorted can not go beyond 2^31.

Related: #50887 #24637

Benchmark:

```python
import torch
import itertools

for i in range(1000):
    torch.arange(100000, device='cuda')

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

for i, j in itertools.product([512, 4096, 8192], repeat=2):
    print(i, j)
    t = torch.randn(i, j, device='cuda')
    torch.cuda.synchronize()
    %timeit run50_sync(lambda: torch.sort(t))
    torch.cuda.synchronize()
    %timeit run50_sync(lambda: torch.sort(t, dim=0))
    print()
```

Before

```
512 512
3.91 ms ± 8.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.87 ms ± 5.06 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

512 4096
70.5 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.7 ms ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

512 8192
142 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
64.4 ms ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 512
26.8 ms ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
82.2 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 4096
606 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
722 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

4096 8192
1.28 s ± 157 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.54 s ± 500 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 512
53.5 ms ± 73.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
168 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8192 4096
1.28 s ± 236 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.54 s ± 272 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 8192
2.69 s ± 741 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.28 s ± 549 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

After

```
512 512
4.02 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

512 4096
40.7 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
33.9 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

512 8192
71.7 ms ± 636 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
66.4 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 512
27.6 ms ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
46.6 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

4096 4096
262 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
321 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4096 8192
520 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
661 ms ± 853 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 512
54.6 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
83.2 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8192 4096
521 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
645 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

8192 8192
1.04 s ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 541 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Pull Request resolved: #54626
Reviewed By: VitalyFedyunin
Differential Revision: D27396078
Pulled By: ngimel
fbshipit-source-id: 4a23b9355e3542e49233b4b4328e43947ec17efd
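The "2 sort + gather" mentioned in the summary can be illustrated outside of cub. The sketch below is a Python analogue of the general idea, two stable sorts plus a gather yielding a segmented (per-row) sort, and is not the actual CUDA implementation:

```python
import torch

# Sort each row of a 2-D tensor using only flat, comparator-free stable sorts.
x = torch.randn(4, 5)
flat = x.reshape(-1)
seg = torch.arange(4).repeat_interleave(5)       # row (segment) id of each element

# 1st sort: order every element by value, carrying its flat position along.
by_value = torch.argsort(flat, stable=True)
# 2nd sort: stable sort by segment id; within a segment the value order survives.
by_segment = torch.argsort(seg[by_value], stable=True)
perm = by_value[by_segment]

# Gather: read the values (and per-row indices) through the final permutation.
values = flat[perm].reshape(4, 5)
indices = (perm % 5).reshape(4, 5)

assert torch.equal(values, torch.sort(x, dim=1).values)
```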
Fixes #24637

Changes

- Merging into just `sort` and `sort_out` for both CPU and CUDA
- Replace `THCudaLongTensor_fillSliceWithIndex` with `_fill_indices` (and move it to `Sorting.cpp`)
- Replace `THCTensor_(copy)` in `THCTensor_(sort)`; in this commit it is handled in `sort_out`
- Changed TH functions to ATen ones
  -- inline functions like `nDimensionLegacyNoScalars`
  -- remove THCNumerics
  -- replace TH macros (`THArgCheck`, ...) with ATen macros
- Changed the macro in `sortViaThrust` to a lambda
- Added a test case for `sortViaThrust`
  -- changed `assertIsOrdered` to handle dynamic tensor sizes

Benchmark
Script
This Commit
1.7.1
@VitalyFedyunin
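For reference, a minimal usage sketch (my own example, not code from this PR) of the two call forms behind the merged `sort` / `sort_out` entry points listed in the changes above:

```python
import torch

t = torch.randn(3, 4)

# Functional form: allocates the outputs (the plain `sort` path).
values, indices = torch.sort(t, dim=1)

# out= form: writes into preallocated tensors (roughly, the `sort_out` path).
v = torch.empty_like(t)
i = torch.empty(t.shape, dtype=torch.long)
torch.sort(t, dim=1, out=(v, i))

assert torch.equal(v, values) and torch.equal(i, indices)
```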