
Sparse softmax support (CUDA) #42307

Closed
wants to merge 11 commits

Conversation

@aocsa (Contributor) commented Jul 30, 2020

This PR implements softmax support for sparse tensors.

Resolves gh-23651 for CUDA.

  • sparse softmax
    • CUDA C++ implementation
    • unittests
    • update softmax documentation
    • autograd support
  • sparse log_softmax
    • CUDA C++ implementation
    • unittests
    • update log_softmax documentation
    • autograd support
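For context, here is a minimal usage sketch of the API this PR targets (shapes and values are illustrative, not taken from the PR): torch.sparse.softmax normalizes along dim while treating unspecified entries as negative infinity, so they receive zero probability.

```python
# Hedged usage sketch (illustrative values); this PR adds the CUDA path.
import torch

indices = torch.tensor([[0, 0, 1, 1], [0, 2, 1, 3]])
values = torch.tensor([1.0, 2.0, 0.5, 3.0], dtype=torch.float64)
s = torch.sparse_coo_tensor(indices, values, (2, 4)).coalesce()

if torch.cuda.is_available():
    s = s.cuda()                         # exercise the new CUDA kernels

out = torch.sparse.softmax(s, dim=1)     # sparse result, same sparsity pattern
log_out = torch.sparse.log_softmax(s, dim=1)
print(out.to_dense())
```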

Here are some benchmark results (script is here) for torch.sparse.softmax and torch.softmax on CPU and GPU; values are float64 scalars and each timing is repeated 1000 times:

| size         | density | sparse CUDA | sparse CPU |
|--------------|---------|-------------|------------|
| (32, 10000)  | 0.01    | 380.2       | 687.5      |
| (32, 10000)  | 0.05    | 404.3       | 2357.9     |
| (32, 10000)  | 0.1     | 405.9       | 3677.2     |
| (512, 10000) | 0.01    | 438.0       | 5443.4     |
| (512, 10000) | 0.05    | 888.1       | 24485.0    |
| (512, 10000) | 0.1     | 1921.3      | 45340.5    |

| size         | density | dense CUDA | dense CPU |
|--------------|---------|------------|-----------|
| (32, 10000)  | 0.01    | 23.6       | 1943.2    |
| (32, 10000)  | 0.05    | 23.6       | 1954.0    |
| (32, 10000)  | 0.1     | 23.5       | 1950.0    |
| (512, 10000) | 0.01    | 639.3      | 39797.9   |
| (512, 10000) | 0.05    | 640.3      | 39374.4   |
| (512, 10000) | 0.1     | 639.6      | 39192.3   |

Times are in microseconds (us).

Quick note: I updated the performance test again.
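The linked benchmark script is not reproduced here; a hypothetical timing sketch along the same lines (not the author's script) would look like the following, synchronizing the device so GPU work is included in each measurement.

```python
# Hypothetical benchmark sketch: time torch.sparse.softmax on CUDA for one
# size/density pair, reporting microseconds per call over 1000 repeats.
import timeit
import torch

def make_sparse(size, density, device):
    dense = torch.rand(size, dtype=torch.float64, device=device)
    dense[dense > density] = 0          # keep roughly `density` of the entries
    return dense.to_sparse()

s = make_sparse((512, 10000), 0.01, "cuda")   # assumes a CUDA device is present

def run():
    torch.sparse.softmax(s, dim=1)
    torch.cuda.synchronize()

repeats = 1000
print(timeit.timeit(run, number=repeats) / repeats * 1e6, "us per call")
```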

@dr-ci (bot) commented Jul 30, 2020

💊 CI failures summary and remediations

As of commit 6d3fb8e (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 21 05:20:09 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 21 05:20:09 At: 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 21 05:20:09  
Sep 21 05:20:09 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 21 05:20:09  
Sep 21 05:20:09 At: 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 21 05:20:09  
Sep 21 05:20:09 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 21 05:20:09  
Sep 21 05:20:09 At: 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 21 05:20:09  
Sep 21 05:20:10 ok (1.438s) 
Sep 21 05:20:11   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.439s) 
Sep 21 05:20:13   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.540s) 
Sep 21 05:20:14   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ERROR:root:Caught exception:  
Sep 21 05:20:14 Traceback (most recent call last): 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 124 times.

@rgommers added the module: sparse label Jul 30, 2020
@rgommers requested a review from pearu July 30, 2020 14:32
@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch 2 times, most recently from 12ec528 to fa80f14 on July 30, 2020 14:46
@aocsa self-assigned this Jul 30, 2020
@mruberry added the triaged label Jul 31, 2020
@mruberry self-requested a review July 31, 2020 21:00
@mruberry added the module: cuda label Jul 31, 2020
test/test_sparse.py (resolved review thread)
@mruberry (Collaborator) commented:

Hey @aocsa, please ping me on this when it's ready for review.

@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from 9171f8e to 973428c on August 13, 2020 20:23
@aocsa (Contributor Author) commented Aug 13, 2020

> Hey @aocsa, please ping me on this when it's ready for review.

@mruberry I updated this PR with autograd support; I think it is ready for review. cc @pearu

@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from 973428c to 1ccdc91 on August 13, 2020 20:31
@pearu (Collaborator) left a comment:

Thanks, @aocsa for implementing the CUDA support for softmax functions!

In my review, I noticed that the code size can be reduced considerably by eliminating code duplication in a number of places. For instance, get_pools and compute_pool_max can be merged. Also, the functions softmax_sparse_cuda, log_softmax_sparse_cuda, softmax_backward_sparse_cuda, and log_softmax_backward_sparse_cuda can be deleted after updating the corresponding functions in aten/src/ATen/native/sparse/SoftMax.cpp.

In addition, I am curious about the implementation complexity of compute_pool_max for computing mx_rows, given that the CPU version requires only 6 lines of code.

aten/src/ATen/native/sparse/cuda/SoftMax.cu (resolved review threads)
@aocsa (Contributor Author) commented Aug 18, 2020

> Thanks, @aocsa for implementing the CUDA support for softmax functions!
>
> In my review, I noticed that the code size can be reduced considerably by eliminating code duplication in a number of places. For instance, get_pools and compute_pool_max can be merged. Also, the functions softmax_sparse_cuda, log_softmax_sparse_cuda, softmax_backward_sparse_cuda, and log_softmax_backward_sparse_cuda can be deleted after updating the corresponding functions in aten/src/ATen/native/sparse/SoftMax.cpp.

Thanks for the review, @pearu. I reduced the code, refactoring it where required. I tried to use the device dispatcher to share common code between the CPU and CUDA paths, but it seems to work only with specific paths (see this TODO), so I just did the simple refactor you suggested.

> In addition, I am curious about the implementation complexity of compute_pool_max for computing mx_rows, given that the CPU version requires only 6 lines of code.

The compute_pool_max function should in fact be faster, as it computes the pools and maxima in parallel using the thrust::reduce_by_key API. However, it pays for this with a pre-processing step that sorts the entries (Thrust sorting) and the corresponding temporary memory use.

cc @mruberry
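For intuition, the sort-then-reduce-by-key pattern described above can be sketched in NumPy (this is only an analogue of the Thrust calls, not the PR's CUDA code): entries are first sorted by pool id, then one maximum is taken per contiguous run of ids, which is what thrust::sort_by_key followed by thrust::reduce_by_key with a maximum functor computes in parallel.

```python
# NumPy analogue of sort_by_key + reduce_by_key(max): one max per pool.
import numpy as np

def pool_max(pool_ids, values):
    # Pre-processing sort so entries of the same pool are contiguous
    # (the step that costs extra time and temporary memory on the GPU).
    order = np.argsort(pool_ids, kind="stable")
    keys, vals = pool_ids[order], values[order]
    # Boundaries where a new pool starts, then one maximum per segment.
    starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
    return keys[starts], np.maximum.reduceat(vals, starts)

keys, maxes = pool_max(np.array([1, 0, 1, 0, 2]),
                       np.array([3.0, 5.0, 7.0, 2.0, 4.0]))
# keys -> [0, 1, 2], maxes -> [5.0, 7.0, 4.0]
```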

@pearu (Collaborator) left a comment:

@aocsa thanks for addressing my comments! Great work!

I believe there is a bug in the computation of log_softmax_max_buffer, and it does not appear to be used, IIRC.

aten/src/ATen/native/sparse/cuda/SoftMax.cu (resolved review threads)
@pearu (Collaborator) left a comment:

@aocsa thanks, we are almost there... but the code computing the max values is unfortunately incorrect. See the given example and diagnostics below.

aten/src/ATen/native/sparse/cuda/SoftMax.cu (resolved review thread)
@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from 001e2fb to c4a4733 on August 20, 2020 17:53
@ngimel (Collaborator) commented Sep 15, 2020

> I am not sure that this can be optimized, in general.
>
> That said, when grad_offsets == out_offsets, the loop over range(out_nnz) could be collapsed into a single dense softmax call.

Yes! That's what I meant by "It would make sense to have a special case where gradient sparsity pattern is the same as output sparsity pattern"
With the tests for that special case.

@aocsa (Contributor Author) commented Sep 15, 2020

> I am not sure that this can be optimized, in general.
>
> That said, when grad_offsets == out_offsets, the loop over range(out_nnz) could be collapsed into a single dense softmax call.
>
> Yes! That's what I meant by "It would make sense to have a special case where gradient sparsity pattern is the same as output sparsity pattern"

All right, so this optimization applies to both SoftMax.cpp and SoftMax.cu, plus new tests for the special case. @pearu, could you please detail the case here, as I am not so familiar with all the cases of the backward algorithm? Thanks for your comments; I appreciate your support here. cc @ngimel
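For reference, the dense softmax backward that the grad_offsets == out_offsets special case can fall back to is sketched below (a hedged illustration over the shared nonzero values, not the PR's code): when the gradient and the output have the same sparsity pattern, each pool's backward reduces to the usual dense formula dX = Y * (dY - sum(dY * Y)).

```python
# Hedged sketch of the standard softmax backward applied to the values of a
# pool when the gradient and the output share the same sparsity pattern.
import torch

def softmax_backward_values(grad_values, out_values, dim=-1):
    # dX = Y * (dY - sum(dY * Y, dim, keepdim=True))
    s = (grad_values * out_values).sum(dim=dim, keepdim=True)
    return out_values * (grad_values - s)
```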

@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from f505a57 to ebb1f60 on September 16, 2020 20:35
@aocsa (Contributor Author) commented Sep 16, 2020

Quick update: I pushed changes to the PR adding support for the case where the gradient sparsity pattern is the same as the output sparsity pattern (grad_offsets == out_offsets). I also double-checked that the current tests cover this case, so there is no need to create new ones unless we want to break the tests into more modular pieces. Let me know if you have any other requests, and thanks for the feedback @pearu, @ngimel.

@mruberry (Collaborator) commented:

> Quick update: I pushed changes to the PR adding support for the case where the gradient sparsity pattern is the same as the output sparsity pattern (grad_offsets == out_offsets). I also double-checked that the current tests cover this case, so there is no need to create new ones unless we want to break the tests into more modular pieces. Let me know if you have any other requests, and thanks for the feedback @pearu, @ngimel.

Thanks @aocsa! We'll take a look soon.

@mruberry mruberry self-requested a review September 18, 2020 00:17
@mruberry (Collaborator) left a comment:

Thanks for all the updates, @aocsa. @ngimel and I had the chance to review and we're happy with how this looks.

In the future it'll be interesting to improve our test infrastructure so we don't need to write so many custom components (like the Python softmax implementation or the custom jacobian implementation). That should make adding new operators much easier.
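For illustration, a reference check of the kind such tests rely on could be sketched as below (hypothetical, not the PR's test code): the sparse result is compared against a dense softmax in which unspecified positions are masked to negative infinity.

```python
# Hedged sketch of a dense reference for sparse softmax (2-D COO case).
import torch

def reference_sparse_softmax(s, dim):
    s = s.coalesce()
    # Mask of positions that are explicitly specified in the sparse tensor.
    mask = torch.sparse_coo_tensor(
        s.indices(), torch.ones_like(s.values()), s.shape
    ).to_dense() != 0
    dense = s.to_dense().masked_fill(~mask, float("-inf"))
    out = torch.softmax(dense, dim)
    # Rows with no specified entries become NaN in the dense path; zero them
    # (and all unspecified positions) to match the sparse semantics.
    return torch.where(mask, out, torch.zeros_like(out))
```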

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator) commented:

Unfortunately this is triggering our arcane internal build system, which is complaining about an inability to find <ATen/native/sparse/utils/ParamUtils.h>. The issue is, I think, that it's in a new folder that I'd have to register internally in a few places. A more immediate fix, however, would be to refactor it into the current sparse folder instead of putting it under utils. Would that be OK for now, @aocsa? We can revisit creating a sparse utils folder later if we have organizational issues.

@aocsa (Contributor Author) commented Sep 19, 2020

> Unfortunately this is triggering our arcane internal build system, which is complaining about an inability to find <ATen/native/sparse/utils/ParamUtils.h>. The issue is, I think, that it's in a new folder that I'd have to register internally in a few places. A more immediate fix, however, would be to refactor it into the current sparse folder instead of putting it under utils. Would that be OK for now, @aocsa? We can revisit creating a sparse utils folder later if we have organizational issues.

Sure, I updated the PR with this minor refactor. cc @mruberry

@mruberry (Collaborator) commented:

Thank you!

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator) commented:

Ah, sorry, forgot to mention you'll also have to move the function definitions (implementations) to a cpp file. Only the declarations go in the header.

Commits:
  • minor fix
  • Initial commit: cuda_sparse_softmax and cuda_sparse_log_softmax
  • minor redo
  • minor fix
  • Closes pytorchgh-23651
  • update with huge performance improvement!
  • autograd support
  • minor fix to autograd support
  • minor refactor
  • update with minor refactor
  • optimize LogSoftMax mx buffer
  • minor bug solved
  • fix max_values error
  • updates based on change requests
  • updates based on change requests
  • minor update
@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from ebcd70c to fae90a5 on September 21, 2020 02:14
@aocsa (Contributor Author) commented Sep 21, 2020

> Ah, sorry, forgot to mention you'll also have to move the function definitions (implementations) to a cpp file. Only the declarations go in the header.

Sure. It's done.

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Labels
  • module: cuda (Related to torch.cuda, and CUDA support in general)
  • module: sparse (Related to torch.sparse)
  • open source
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Development

Successfully merging this pull request may close these issues.

Add sparse softmax/log_softmax functionality (ignore zero entries)
8 participants