
Sparse softmax support (CUDA) #42307

Closed
wants to merge 11 commits

Conversation

@aocsa (Contributor) commented Jul 30, 2020

This PR implements softmax support for sparse tensors.

Resolves gh-23651 for CUDA.

  • sparse softmax
    • CUDA C++ implementation
    • unittests
    • update softmax documentation
    • autograd support
  • sparse log_softmax
    • CUDA C++ implementation
    • unittests
    • update log_softmax documentation
    • autograd support
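For context, here is a minimal usage sketch of the API this PR targets (shapes and values are illustrative, not taken from the PR): torch.sparse.softmax normalizes along dim while treating unspecified entries as negative infinity, so they receive zero probability.

```python
# Hedged usage sketch (illustrative values); this PR adds the CUDA path.
import torch

indices = torch.tensor([[0, 0, 1, 1], [0, 2, 1, 3]])
values = torch.tensor([1.0, 2.0, 0.5, 3.0], dtype=torch.float64)
s = torch.sparse_coo_tensor(indices, values, (2, 4)).coalesce()

if torch.cuda.is_available():
    s = s.cuda()                         # exercise the new CUDA kernels

out = torch.sparse.softmax(s, dim=1)     # sparse result, same sparsity pattern
log_out = torch.sparse.log_softmax(s, dim=1)
print(out.to_dense())
```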

Here are some benchmark results (script is here) for torch.sparse.softmax and torch.softmax on CPU and GPU; values are float64 scalars and each timing is repeated 1000 times:

| size         | density | sparse CUDA | sparse CPU |
|--------------|---------|-------------|------------|
| (32, 10000)  | 0.01    | 380.2       | 687.5      |
| (32, 10000)  | 0.05    | 404.3       | 2357.9     |
| (32, 10000)  | 0.1     | 405.9       | 3677.2     |
| (512, 10000) | 0.01    | 438.0       | 5443.4     |
| (512, 10000) | 0.05    | 888.1       | 24485.0    |
| (512, 10000) | 0.1     | 1921.3      | 45340.5    |

| size         | density | dense CUDA | dense CPU |
|--------------|---------|------------|-----------|
| (32, 10000)  | 0.01    | 23.6       | 1943.2    |
| (32, 10000)  | 0.05    | 23.6       | 1954.0    |
| (32, 10000)  | 0.1     | 23.5       | 1950.0    |
| (512, 10000) | 0.01    | 639.3      | 39797.9   |
| (512, 10000) | 0.05    | 640.3      | 39374.4   |
| (512, 10000) | 0.1     | 639.6      | 39192.3   |

Times are in microseconds (us).

Quick note: I updated the performance test again.
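The linked benchmark script is not reproduced here; a hypothetical timing sketch along the same lines (not the author's script) would look like the following, synchronizing the device so GPU work is included in each measurement.

```python
# Hypothetical benchmark sketch: time torch.sparse.softmax on CUDA for one
# size/density pair, reporting microseconds per call over 1000 repeats.
import timeit
import torch

def make_sparse(size, density, device):
    dense = torch.rand(size, dtype=torch.float64, device=device)
    dense[dense > density] = 0          # keep roughly `density` of the entries
    return dense.to_sparse()

s = make_sparse((512, 10000), 0.01, "cuda")   # assumes a CUDA device is present

def run():
    torch.sparse.softmax(s, dim=1)
    torch.cuda.synchronize()

repeats = 1000
print(timeit.timeit(run, number=repeats) / repeats * 1e6, "us per call")
```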

@dr-ci (bot) commented Jul 30, 2020

💊 CI failures summary and remediations

As of commit 6d3fb8e (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 21 05:20:09 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 21 05:20:09 At: 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 21 05:20:09  
Sep 21 05:20:09 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 21 05:20:09  
Sep 21 05:20:09 At: 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 21 05:20:09  
Sep 21 05:20:09 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 21 05:20:09  
Sep 21 05:20:09 At: 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(94): serialize 
Sep 21 05:20:09   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(146): serialize 
Sep 21 05:20:09  
Sep 21 05:20:10 ok (1.438s) 
Sep 21 05:20:11   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.439s) 
Sep 21 05:20:13   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.540s) 
Sep 21 05:20:14   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ERROR:root:Caught exception:  
Sep 21 05:20:14 Traceback (most recent call last): 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 124 times.

@rgommers added the module: sparse label Jul 30, 2020
@rgommers requested a review from pearu July 30, 2020 14:32
@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch 2 times, most recently from 12ec528 to fa80f14 on July 30, 2020 14:46
@aocsa self-assigned this Jul 30, 2020
@mruberry added the triaged label Jul 31, 2020
@mruberry self-requested a review July 31, 2020 21:00
@mruberry added the module: cuda label Jul 31, 2020
test/test_sparse.py (resolved review thread)
@mruberry (Collaborator) commented:

Hey @aocsa, please ping me on this when it's ready for review.

@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from 9171f8e to 973428c on August 13, 2020 20:23
@aocsa (Contributor Author) commented Aug 13, 2020

> Hey @aocsa, please ping me on this when it's ready for review.

@mruberry I updated this PR with autograd support; I think it is ready for review. cc @pearu

@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from 973428c to 1ccdc91 on August 13, 2020 20:31
@pearu (Collaborator) left a comment:

Thanks, @aocsa for implementing the CUDA support for softmax functions!

In my review, I noticed that the code size can be reduced considerably by eliminating code duplication in a number of places. For instance, get_pools and compute_pool_max can be merged. Also, the functions softmax_sparse_cuda, log_softmax_sparse_cuda, softmax_backward_sparse_cuda, and log_softmax_backward_sparse_cuda can be deleted after updating the corresponding functions in aten/src/ATen/native/sparse/SoftMax.cpp.

In addition, I am curious about the implementation complexity of compute_pool_max for computing mx_rows, given that the CPU version requires only 6 lines of code.

aten/src/ATen/native/sparse/cuda/SoftMax.cu (resolved review threads)
@aocsa (Contributor Author) commented Aug 18, 2020

> Thanks, @aocsa for implementing the CUDA support for softmax functions!
>
> In my review, I noticed that the code size can be reduced considerably by eliminating code duplication in a number of places. For instance, get_pools and compute_pool_max can be merged. Also, the functions softmax_sparse_cuda, log_softmax_sparse_cuda, softmax_backward_sparse_cuda, and log_softmax_backward_sparse_cuda can be deleted after updating the corresponding functions in aten/src/ATen/native/sparse/SoftMax.cpp.

Thanks for the review, @pearu. I reduced the code, refactoring it where required. I tried to use the device dispatcher to share common code between the CPU and CUDA paths, but it seems to work only with specific paths (see this TODO), so I just did the simple refactor you suggested.

> In addition, I am curious about the implementation complexity of compute_pool_max for computing mx_rows, given that the CPU version requires only 6 lines of code.

The compute_pool_max function should in fact be faster, as it computes the pools and maxima in parallel using the thrust::reduce_by_key API. However, it pays for this with a pre-processing step that sorts the entries (Thrust sorting) and the corresponding temporary memory use.

cc @mruberry
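For intuition, the sort-then-reduce-by-key pattern described above can be sketched in NumPy (this is only an analogue of the Thrust calls, not the PR's CUDA code): entries are first sorted by pool id, then one maximum is taken per contiguous run of ids, which is what thrust::sort_by_key followed by thrust::reduce_by_key with a maximum functor computes in parallel.

```python
# NumPy analogue of sort_by_key + reduce_by_key(max): one max per pool.
import numpy as np

def pool_max(pool_ids, values):
    # Pre-processing sort so entries of the same pool are contiguous
    # (the step that costs extra time and temporary memory on the GPU).
    order = np.argsort(pool_ids, kind="stable")
    keys, vals = pool_ids[order], values[order]
    # Boundaries where a new pool starts, then one maximum per segment.
    starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
    return keys[starts], np.maximum.reduceat(vals, starts)

keys, maxes = pool_max(np.array([1, 0, 1, 0, 2]),
                       np.array([3.0, 5.0, 7.0, 2.0, 4.0]))
# keys -> [0, 1, 2], maxes -> [5.0, 7.0, 4.0]
```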

@pearu (Collaborator) left a comment:

@aocsa thanks for addressing my comments! Great work!

I believe there is a bug in the computation of log_softmax_max_buffer, and it does not appear to be used, IIRC.

aten/src/ATen/native/sparse/cuda/SoftMax.cu (resolved review threads)
@pearu (Collaborator) left a comment:

@aocsa thanks, we are almost there... but the code computing the max values is unfortunately incorrect. See the given example and diagnostics below.

aten/src/ATen/native/sparse/cuda/SoftMax.cu (resolved review thread)
@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from 001e2fb to c4a4733 on August 20, 2020 17:53
@ngimel (Collaborator) commented Sep 15, 2020

> I am not sure that this can be optimized, in general.
>
> That said, when grad_offsets == out_offsets, the loop over range(out_nnz) could be collapsed into a single dense softmax call.

Yes! That's what I meant by "It would make sense to have a special case where gradient sparsity pattern is the same as output sparsity pattern"
With the tests for that special case.

@aocsa (Contributor Author) commented Sep 15, 2020

> I am not sure that this can be optimized, in general.
>
> That said, when grad_offsets == out_offsets, the loop over range(out_nnz) could be collapsed into a single dense softmax call.
>
> Yes! That's what I meant by "It would make sense to have a special case where gradient sparsity pattern is the same as output sparsity pattern"

All right, so this optimization applies to both SoftMax.cpp and SoftMax.cu, plus new tests for the special case. @pearu, could you please detail the case here, as I am not so familiar with all the cases of the backward algorithm? Thanks for your comments; I appreciate your support here. cc @ngimel
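For reference, the dense softmax backward that the grad_offsets == out_offsets special case can fall back to is sketched below (a hedged illustration over the shared nonzero values, not the PR's code): when the gradient and the output have the same sparsity pattern, each pool's backward reduces to the usual dense formula dX = Y * (dY - sum(dY * Y)).

```python
# Hedged sketch of the standard softmax backward applied to the values of a
# pool when the gradient and the output share the same sparsity pattern.
import torch

def softmax_backward_values(grad_values, out_values, dim=-1):
    # dX = Y * (dY - sum(dY * Y, dim, keepdim=True))
    s = (grad_values * out_values).sum(dim=dim, keepdim=True)
    return out_values * (grad_values - s)
```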

@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from f505a57 to ebb1f60 on September 16, 2020 20:35
@aocsa (Contributor Author) commented Sep 16, 2020

Quick update: I pushed changes to the PR adding support for the case where the gradient sparsity pattern is the same as the output sparsity pattern (grad_offsets == out_offsets). I also double-checked that the current tests cover this case, so there is no need to create new ones unless we want to break the tests into more modular pieces. Let me know if you have any other requests, and thanks for the feedback @pearu, @ngimel.

@mruberry (Collaborator) commented:

> Quick update: I pushed changes to the PR adding support for the case where the gradient sparsity pattern is the same as the output sparsity pattern (grad_offsets == out_offsets). I also double-checked that the current tests cover this case, so there is no need to create new ones unless we want to break the tests into more modular pieces. Let me know if you have any other requests, and thanks for the feedback @pearu, @ngimel.

Thanks @aocsa! We'll take a look soon.

@mruberry mruberry self-requested a review September 18, 2020 00:17
@mruberry (Collaborator) left a comment:

Thanks for all the updates, @aocsa. @ngimel and I had the chance to review and we're happy with how this looks.

In the future it'll be interesting to improve our test infrastructure so we don't need to write so many custom components (like the Python softmax implementation or the custom jacobian implementation). That should make adding new operators much easier.
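For illustration, a reference check of the kind such tests rely on could be sketched as below (hypothetical, not the PR's test code): the sparse result is compared against a dense softmax in which unspecified positions are masked to negative infinity.

```python
# Hedged sketch of a dense reference for sparse softmax (2-D COO case).
import torch

def reference_sparse_softmax(s, dim):
    s = s.coalesce()
    # Mask of positions that are explicitly specified in the sparse tensor.
    mask = torch.sparse_coo_tensor(
        s.indices(), torch.ones_like(s.values()), s.shape
    ).to_dense() != 0
    dense = s.to_dense().masked_fill(~mask, float("-inf"))
    out = torch.softmax(dense, dim)
    # Rows with no specified entries become NaN in the dense path; zero them
    # (and all unspecified positions) to match the sparse semantics.
    return torch.where(mask, out, torch.zeros_like(out))
```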

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator) commented:

Unfortunately this is triggering our arcane internal build system, which is complaining about an inability to find <ATen/native/sparse/utils/ParamUtils.h>. The issue is, I think, that it's in a new folder that I'd have to register internally in a few places. A more immediate fix, however, would be to refactor it into the current sparse folder instead of putting it under utils. Would that be OK for now, @aocsa? We can revisit creating a sparse utils folder later if we have organizational issues.

@aocsa (Contributor Author) commented Sep 19, 2020

> Unfortunately this is triggering our arcane internal build system, which is complaining about an inability to find <ATen/native/sparse/utils/ParamUtils.h>. The issue is, I think, that it's in a new folder that I'd have to register internally in a few places. A more immediate fix, however, would be to refactor it into the current sparse folder instead of putting it under utils. Would that be OK for now, @aocsa? We can revisit creating a sparse utils folder later if we have organizational issues.

Sure, I updated the PR with this minor refactor. cc @mruberry

@mruberry (Collaborator) commented:

Thank you!

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator) commented:

Ah, sorry, forgot to mention you'll also have to move the function definitions (implementations) to a cpp file. Only the declarations go in the header.

Commits:
  • minor fix
  • Initial commit: cuda_sparse_softmax and cuda_sparse_log_softmax
  • minor redo
  • minor fix
  • Closes pytorchgh-23651
  • update with huge performance improvement!
  • autograd support
  • minor fix to autograd support
  • minor refactor
  • update with minor refactor
  • optimize LogSoftMax mx buffer
  • minor bug solved
  • fix max_values error
  • updates based on change requests
  • updates based on change requests
  • minor update
@aocsa force-pushed the aocsa/23651-cuda-softmax-sparse branch from ebcd70c to fae90a5 on September 21, 2020 02:14
@aocsa (Contributor Author) commented Sep 21, 2020

> Ah, sorry, forgot to mention you'll also have to move the function definitions (implementations) to a cpp file. Only the declarations go in the header.

Sure. It's done.

@facebook-github-bot (Contributor) left a comment:

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Labels
  • module: cuda (Related to torch.cuda, and CUDA support in general)
  • module: sparse (Related to torch.sparse)
  • open source
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Development

Successfully merging this pull request may close these issues.

Add sparse softmax/log_softmax functionality (ignore zero entries)
8 participants