Faster `index_select` for sparse COO tensors on CPU (#72710)

Conversation
❌ 1 new failure, 1 flaky failure as of commit e1a0978 (more details on the Dr. CI page). 1 new failure recognized by patterns; these CI failures do not appear to be due to upstream breakages.
Force-pushed branch …taved/coo_index_select from 6d25840 to 90fceb0.

Changed the title to: Faster `index_select` for sparse COO tensors.
@cpuhrsch, could you please have a look? I will also put up some benchmarking results, but it is much faster than the previous implementation.
@nikitaved - Thanks for sending this! Looks great! Do you have some timings for some sample inputs to verify the perf gains, not just analytically?
Hey @nikitaved.
@pytorchbot revert this, as it breaks internal builds by introducing an unused capture:
@pytorchbot revert this as it breaks internal builds
I can reproduce the failure by passing
This reverts commit ce3857e. Reverted #72710 on behalf of https://github.com/malfet
Tentative fix for internal builds
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Importing to check for internal build failures with the suggested changes; will reland if it passes.
@pytorchbot merge this (Initiating merge automatically since Phabricator Diff has merged)
Summary: Fixes #72212. This PR improves on the previous algorithm's asymptotic complexity. It also exploits the structure of the problem and parallelizes computations when possible. Benchmark results:

<details>
<summary>Testing script</summary>

```python
import torch
import math
from IPython import get_ipython
from itertools import product
import pickle
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(13)
#torch.set_num_threads(1)

ipython = get_ipython()

index_sizes = (100, 1000, 10000)
# specifies (n, nnz)
problem_dims = (
    # n > nnz
    (10000, 100),
    (100000, 1000),
    (1000000, 10000),
    # n < nnz
    (10, 100),
    (10, 1000),
    (10, 10000),
    (100, 1000),
    (100, 10000),
    (1000, 10000),
    (1000, 100000),
    (1000, 1000000),
    #(1000000, 1000000000),
)

def f(t, d, index):
    s = torch_sparse.SparseTensor.from_torch_sparse_coo_tensor(t)
    ss = s.index_select(d, index)
    return ss.coo()

name = "PR"

results = []
for (n, nnz), m in product(problem_dims, index_sizes):
    for d in (0, 1):
        if nnz < n:
            shape = (n, n)
        else:
            shape = (n, nnz // n) if d == 0 else (nnz // n, n)
        nrows, ncols = shape
        rowidx = torch.randint(low=0, high=nrows, size=(nnz,))
        colidx = torch.randint(low=0, high=ncols, size=(nnz,))
        itemidx = torch.vstack((rowidx, colidx))
        xvalues = torch.randn(nnz)
        index = torch.randint(low=0, high=n, size=(m,))

        SparseX = torch.sparse_coo_tensor(itemidx, xvalues, size=shape).coalesce()
        smtp = "SparseX.index_select(d, index)"
        timer = Timer(smtp,
                      globals=globals(),
                      label="coo.index_select",
                      description=f"{name}: coo.index_select",
                      sub_label=f"n={n}, nnz={nnz}, index_len={m}, dim={d}",
                      num_threads=torch.get_num_threads())
        results.append(timer.blocked_autorange())

compare = Compare(results)
compare.trim_significant_figures()
compare.print()

with open(f"{name}_index_select.pickle", 'wb') as f:
    pickle.dump(results, f)
```
</details>

<details>
<summary>Gather results</summary>

```python
import pickle
from torch.utils.benchmark import Timer, Compare

files = [
    "PR",
    "torch_sparse",
    "master"
]

timers = []
for name in files:
    with open("{}_index_select.pickle".format(name), 'rb') as f:
        timers += pickle.load(f)

compare = Compare(timers)
compare.trim_significant_figures()
compare.print()
```
</details>

<details>
<summary>PR/torch_sparse/master runtime comparison</summary>

```
[----------------------------------- coo.index_select ----------------------------------]
                                                 |  PR   | torch_sparse | master
32 threads: -----------------------------------------------------------------------------
n=10000, nnz=100, index_len=100, dim=0           |    14 |    140 |     10
n=10000, nnz=100, index_len=100, dim=1           |    14 |    200 |     10
n=10000, nnz=100, index_len=1000, dim=0          |    30 |    180 |     38
n=10000, nnz=100, index_len=1000, dim=1          |    34 |    240 |     38
n=10000, nnz=100, index_len=10000, dim=0         |   278 |    460 |    330
n=10000, nnz=100, index_len=10000, dim=1         |   275 |    516 |    330
n=100000, nnz=1000, index_len=100, dim=0         |    16 |    290 |     31
n=100000, nnz=1000, index_len=100, dim=1         |    26 |    390 |     31
n=100000, nnz=1000, index_len=1000, dim=0        |    45 |    405 |    263
n=100000, nnz=1000, index_len=1000, dim=1        |    73 |    500 |    261
n=100000, nnz=1000, index_len=10000, dim=0       |   444 |    783 |   2570
n=100000, nnz=1000, index_len=10000, dim=1       |   470 |    890 |   2590
n=1000000, nnz=10000, index_len=100, dim=0       |    25 |   2400 |    270
n=1000000, nnz=10000, index_len=100, dim=1       |   270 |   4000 |    269
n=1000000, nnz=10000, index_len=1000, dim=0      |    74 |   2600 |   2620
n=1000000, nnz=10000, index_len=1000, dim=1      |   464 |   3600 |   2640
n=1000000, nnz=10000, index_len=10000, dim=0     |   635 |   3300 |  26400
n=1000000, nnz=10000, index_len=10000, dim=1     |  1000 |   3960 |  26400
n=10, nnz=100, index_len=100, dim=0              |    16 |    137 |     16
n=10, nnz=100, index_len=100, dim=1              |    16 |    220 |     16
n=10, nnz=100, index_len=1000, dim=0             |    63 |    238 |     81
n=10, nnz=100, index_len=1000, dim=1             |    60 |    698 |     78
n=10, nnz=100, index_len=10000, dim=0            |   480 |    940 |    862
n=10, nnz=100, index_len=10000, dim=1            |   330 |   4930 |   1070
n=10, nnz=1000, index_len=100, dim=0             |    60 |    200 |     73
n=10, nnz=1000, index_len=100, dim=1             |    56 |    683 |     70
n=10, nnz=1000, index_len=1000, dim=0            |   480 |    530 |   1050
n=10, nnz=1000, index_len=1000, dim=1            |   330 |   4550 |   1368
n=10, nnz=1000, index_len=10000, dim=0           |  3100 |   2900 |   9300
n=10, nnz=1000, index_len=10000, dim=1           |  3400 |  46000 |   9100
n=10, nnz=10000, index_len=100, dim=0            |   400 |    453 |    857
n=10, nnz=10000, index_len=100, dim=1            |   400 |   4070 |   1730
n=10, nnz=10000, index_len=1000, dim=0           |  2840 |   2600 |  13900
n=10, nnz=10000, index_len=1000, dim=1           |  3700 |  40600 |  16000
n=10, nnz=10000, index_len=10000, dim=0          | 83200 |  67400 | 160000
n=10, nnz=10000, index_len=10000, dim=1          | 68000 | 528000 | 190000
n=100, nnz=1000, index_len=100, dim=0            |    46 |    148 |     31
n=100, nnz=1000, index_len=100, dim=1            |    45 |    242 |     37
n=100, nnz=1000, index_len=1000, dim=0           |    68 |    248 |    240
n=100, nnz=1000, index_len=1000, dim=1           |    66 |    755 |    290
n=100, nnz=1000, index_len=10000, dim=0          |   370 |    802 |   2250
n=100, nnz=1000, index_len=10000, dim=1          |   372 |   5430 |   2770
n=100, nnz=10000, index_len=100, dim=0           |    82 |    210 |    224
n=100, nnz=10000, index_len=100, dim=1           |    74 |    986 |    270
n=100, nnz=10000, index_len=1000, dim=0          |   350 |    618 |   2600
n=100, nnz=10000, index_len=1000, dim=1          |   370 |   4660 |   4560
n=100, nnz=10000, index_len=10000, dim=0         |  3000 |   3400 |  41680
n=100, nnz=10000, index_len=10000, dim=1         |  5000 |  47500 |  30400
n=1000, nnz=10000, index_len=100, dim=0          |    71 |    160 |    185
n=1000, nnz=10000, index_len=100, dim=1          |    64 |    516 |    190
n=1000, nnz=10000, index_len=1000, dim=0         |   100 |    249 |   1740
n=1000, nnz=10000, index_len=1000, dim=1         |    98 |   1030 |   1770
n=1000, nnz=10000, index_len=10000, dim=0        |   600 |    808 |  18300
n=1000, nnz=10000, index_len=10000, dim=1        |   663 |   5300 |  18500
n=1000, nnz=100000, index_len=100, dim=0         |   160 |    258 |   1890
n=1000, nnz=100000, index_len=100, dim=1         |   200 |   3620 |   2050
n=1000, nnz=100000, index_len=1000, dim=0        |   500 |    580 |  18700
n=1000, nnz=100000, index_len=1000, dim=1        |   640 |   7550 |  30000
n=1000, nnz=100000, index_len=10000, dim=0       |  3400 |   3260 | 186000
n=1000, nnz=100000, index_len=10000, dim=1       |  3600 |  49600 | 194000
n=1000, nnz=1000000, index_len=100, dim=0        |   517 |    957 |  18700
n=1000, nnz=1000000, index_len=100, dim=1        |   680 |  39600 |  37600
n=1000, nnz=1000000, index_len=1000, dim=0       |  3600 |   4500 | 186000
n=1000, nnz=1000000, index_len=1000, dim=1       |  5800 |  76400 | 190000
n=1000, nnz=1000000, index_len=10000, dim=0      | 50000 |  67900 | 1800000
n=1000, nnz=1000000, index_len=10000, dim=1      | 45000 | 570000 | 1900000

Times are in microseconds (us).
```
</details>

Pull Request resolved: #72710
Reviewed By: samdow
Differential Revision: D36282349
Pulled By: malfet
fbshipit-source-id: 3679ea4ebeeda4d200a441aef6d45b98303bc0c0
Summary: Brings a native CUDA implementation for `index_select` on sparse COO tensors. On master, CUDA support is provided by silently converting tensors to CPU. The `nnz >> size` case could be optimized along the same lines as #72710. Some benchmarks:

<details>
<summary>PR/torch_sparse/master</summary>

```
[------------------------------- cuda coo.index_select -------------------------------]
                                               |  PR  | torch_sparse | master
32 threads: ---------------------------------------------------------------------------
n=10000, nnz=100, index_len=100, dim=0         |   96 |    327 |    70
n=10000, nnz=100, index_len=100, dim=1         |  120 |    505 |    74
n=10000, nnz=100, index_len=1000, dim=0        |   90 |    333 |    93
n=10000, nnz=100, index_len=1000, dim=1        |  120 |    499 |    98
n=10000, nnz=100, index_len=10000, dim=0       |   92 |    331 |   350
n=10000, nnz=100, index_len=10000, dim=1       |  100 |    506 |   352
n=100000, nnz=1000, index_len=100, dim=0       |   53 |    274 |    60
n=100000, nnz=1000, index_len=100, dim=1       |   90 |    368 |    71
n=100000, nnz=1000, index_len=1000, dim=0      |   93 |    332 |   100
n=100000, nnz=1000, index_len=1000, dim=1      |  130 |    501 |   140
n=100000, nnz=1000, index_len=10000, dim=0     |  100 |    341 |   522
n=100000, nnz=1000, index_len=10000, dim=1     |  130 |    530 |   549
n=1000000, nnz=10000, index_len=100, dim=0     |   90 |    429 |   110
n=1000000, nnz=10000, index_len=100, dim=1     |  296 |    810 |   355
n=1000000, nnz=10000, index_len=1000, dim=0    |  100 |    435 |   170
n=1000000, nnz=10000, index_len=1000, dim=1    |  309 |    830 |   548
n=1000000, nnz=10000, index_len=10000, dim=0   |  110 |    446 |   750
n=1000000, nnz=10000, index_len=10000, dim=1   |  310 |    830 |  1000
n=10, nnz=100, index_len=100, dim=0            |   90 |    333 |    74
n=10, nnz=100, index_len=100, dim=1            |  100 |    497 |    78
n=10, nnz=100, index_len=1000, dim=0           |   90 |    329 |   140
n=10, nnz=100, index_len=1000, dim=1           |  100 |    800 |   100
n=10, nnz=100, index_len=10000, dim=0          |   93 |    340 |   900
n=10, nnz=100, index_len=10000, dim=1          |  120 |    800 |   489
n=10, nnz=1000, index_len=100, dim=0           |   90 |    321 |   140
n=10, nnz=1000, index_len=100, dim=1           |  100 |    680 |   140
n=10, nnz=1000, index_len=1000, dim=0          |  110 |    349 |   670
n=10, nnz=1000, index_len=1000, dim=1          |  130 |    740 |   800
n=10, nnz=1000, index_len=10000, dim=0         |  302 |    503 |  4882
n=10, nnz=1000, index_len=10000, dim=1         |  325 |   2257 |  5262
n=10, nnz=10000, index_len=100, dim=0          |  229 |    349 |   810
n=10, nnz=10000, index_len=100, dim=1          |  433 |    870 |   700
n=10, nnz=10000, index_len=1000, dim=0         |  666 |    502 |  5581
n=10, nnz=10000, index_len=1000, dim=1         |  826 |   2379 |  4820
n=10, nnz=10000, index_len=10000, dim=0        | 2534 |   2700 | 80000
n=10, nnz=10000, index_len=10000, dim=1        | 2723 |  18540 | 80000
n=100, nnz=1000, index_len=100, dim=0          |   94 |    324 |   110
n=100, nnz=1000, index_len=100, dim=1          |  100 |    499 |   110
n=100, nnz=1000, index_len=1000, dim=0         |   96 |    337 |   150
n=100, nnz=1000, index_len=1000, dim=1         |  130 |    800 |   140
n=100, nnz=1000, index_len=10000, dim=0        |  100 |    346 |   900
n=100, nnz=1000, index_len=10000, dim=1        |  130 |    760 |   900
n=100, nnz=10000, index_len=100, dim=0         |   90 |    323 |   190
n=100, nnz=10000, index_len=100, dim=1         |  279 |    800 |   180
n=100, nnz=10000, index_len=1000, dim=0        |  110 |    339 |   781
n=100, nnz=10000, index_len=1000, dim=1        |  294 |    870 |   800
n=100, nnz=10000, index_len=10000, dim=0       |  315 |    505 |  6264
n=100, nnz=10000, index_len=10000, dim=1       |  497 |   2398 |  5404
n=1000, nnz=10000, index_len=100, dim=0        |   90 |    333 |   160
n=1000, nnz=10000, index_len=100, dim=1        |  279 |    635 |   150
n=1000, nnz=10000, index_len=1000, dim=0       |  100 |    328 |   215
n=1000, nnz=10000, index_len=1000, dim=1       |  287 |    810 |   207
n=1000, nnz=10000, index_len=10000, dim=0      |  100 |    339 |   900
n=1000, nnz=10000, index_len=10000, dim=1      |  291 |    880 |  1000
n=1000, nnz=100000, index_len=100, dim=0       |   92 |    358 |   435
n=1000, nnz=100000, index_len=100, dim=1       |  302 |    900 |   530
n=1000, nnz=100000, index_len=1000, dim=0      |  130 |    360 |  1000
n=1000, nnz=100000, index_len=1000, dim=1      |  329 |    930 |  1200
n=1000, nnz=100000, index_len=10000, dim=0     |  343 |    530 |  7000
n=1000, nnz=100000, index_len=10000, dim=1     |  545 |   2446 |  6100
n=1000, nnz=1000000, index_len=100, dim=0      |  355 |    394 |  2210
n=1000, nnz=1000000, index_len=100, dim=1      | 1660 |   2276 |  2674
n=1000, nnz=1000000, index_len=1000, dim=0     |  877 |    574 |  6700
n=1000, nnz=1000000, index_len=1000, dim=1     | 2449 |   3782 |  9000
n=1000, nnz=1000000, index_len=10000, dim=0    | 3112 |   2931 | 57000
n=1000, nnz=1000000, index_len=10000, dim=1    | 7340 |  20220 | 65700

Times are in microseconds (us).
```
</details>

Pull Request resolved: #77551
Approved by: https://github.com/cpuhrsch
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/03cf01bdc03a631a1ab521e27b6523bca1a57f0d
Reviewed By: b0noI
Differential Revision: D36854233
Pulled By: b0noI
fbshipit-source-id: 9c665baf72fbb5530b450af0d768d0761b1a5c73
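One way the `nnz >> size` optimization mentioned above can work for a coalesced COO tensor: the indices along `dim=0` are sorted, so the nonzeros of any given row form a contiguous range that binary search finds in O(log nnz), instead of an O(nnz) scan per query. A minimal, hypothetical pure-Python sketch (not the PR's actual kernel code):

```python
import bisect

def select_rows_sorted(sorted_rows, index):
    """For a coalesced 2-D COO tensor, row indices along dim 0 are sorted.
    For each requested row r in `index`, return the half-open range
    (start, end) of positions in the nnz arrays holding row r's nonzeros,
    located via two binary searches. Illustrative only.
    """
    out = []
    for r in index:
        lo = bisect.bisect_left(sorted_rows, r)   # first position >= r
        hi = bisect.bisect_right(sorted_rows, r)  # first position > r
        out.append((lo, hi))                      # empty range if r absent
    return out
```

For example, with sorted row indices `[0, 0, 2, 2, 2, 5]`, selecting rows `[2, 1]` yields ranges `(2, 5)` and `(2, 2)` (row 1 has no nonzeros, so its range is empty). Each query is independent, which is what makes the real implementation easy to parallelize.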
Fixes #72212.
This PR improves on the previous algorithm's asymptotic complexity. It also exploits the structure of the problem and parallelizes computations when possible.
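To make the operation concrete, here is a minimal, hypothetical pure-Python sketch of what `index_select` means for a 2-D COO tensor (function name and structure are illustrative only; this is not the PR's implementation, which is written in C++ and parallelized):

```python
from collections import defaultdict

def coo_index_select(indices, values, dim, index):
    """Select slices `index` along `dim` of a 2-D COO tensor.

    indices: (row_list, col_list); values: parallel list of nonzeros.
    Output slot k along `dim` holds the slice picked by index[k];
    duplicates in `index` are allowed and duplicate their slices.
    """
    sel, other = indices[dim], indices[1 - dim]
    # Bucket nonzero positions by their coordinate along `dim`:
    # one O(nnz) pass instead of an O(nnz) scan per index entry.
    buckets = defaultdict(list)
    for pos, coord in enumerate(sel):
        buckets[coord].append(pos)
    out_rows, out_cols, out_vals = [], [], []
    for k, coord in enumerate(index):
        for pos in buckets.get(coord, ()):
            r, c = (k, other[pos]) if dim == 0 else (other[pos], k)
            out_rows.append(r)
            out_cols.append(c)
            out_vals.append(values[pos])
    return (out_rows, out_cols), out_vals
```

For instance, a 3x2 tensor with nonzeros (0,1)=1.0 and (2,0)=2.0, selected along `dim=0` with `index=[2, 0]`, yields nonzeros (0,0)=2.0 and (1,1)=1.0. The hashing pass gives roughly O(nnz + nnz_out) work, which is the kind of complexity improvement the description refers to.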
Benchmark results.