Sparse softmax support (CUDA) #42307
Conversation
💊 CI failures summary (Dr. CI, as of commit 6d3fb8e): 1 new failure recognized by patterns, not due to upstream breakages: pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test, step "Run tests".
Force-pushed from 12ec528 to fa80f14.
Hey @aocsa, please ping me on this when it's ready for review.
Force-pushed from 9171f8e to 973428c.
Force-pushed from 973428c to 1ccdc91.
Thanks, @aocsa, for implementing CUDA support for the softmax functions!
In my review, I noticed that the code size can be reduced considerably by eliminating replication in a number of places. For instance, `get_pools` and `compute_pool_max` can be merged. Also, the functions `softmax_sparse_cuda`, `log_softmax_sparse_cuda`, `softmax_backward_sparse_cuda`, and `log_softmax_backward_sparse_cuda` can be deleted after updating the corresponding functions in `aten/src/ATen/native/sparse/SoftMax.cpp`.
In addition, I am curious about the implementation complexity of `compute_pool_max` for computing `mx_rows`, given that the CPU implementation requires only 6 lines of code.
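For context, the "pools" discussed here group the nonzero elements of a sparse COO tensor that share all indices except the softmax dimension; softmax is then applied within each pool. The sketch below is illustrative only (the name `get_pools` mirrors the ATen helper mentioned above, but this is not the actual implementation):

```python
import torch

def get_pools(indices, dim):
    # indices: (ndim, nnz) COO index tensor; elements that agree on
    # every index except `dim` belong to the same softmax pool.
    pools = {}
    for j, key in enumerate(tuple(idx) for idx in indices.t().tolist()):
        pool_key = key[:dim] + key[dim + 1:]  # drop the softmax dim
        pools.setdefault(pool_key, []).append(j)
    return list(pools.values())

indices = torch.tensor([[0, 0, 1, 1], [0, 1, 0, 2]])
pools = get_pools(indices, dim=1)  # group by row for a dim=1 softmax
print(pools)  # → [[0, 1], [2, 3]]
```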
Thanks for the review @pearu, I reduced the code, refactoring it where it was required. I tried to use the …
The `compute_pool_max` function should in fact be faster, as it computes the … cc @mruberry
@aocsa thanks for addressing my comments! Great work!
I believe there is a bug in the computation of `log_softmax_max_buffer`, and it does not appear to be used, IIRC.
@aocsa thanks, we are almost there... but the code computing the max values is unfortunately incorrect. See the given example and diagnostics below.
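To illustrate what the per-pool max computation needs to produce (a sketch of the idea, not the actual ATen kernel): for numerical stability, softmax subtracts the maximum value within each pool before exponentiating.

```python
import torch

values = torch.tensor([1.0, 3.0, 2.0, 5.0])
pools = [[0, 1], [2, 3]]  # value positions grouped by pool

# per-pool maxima (the role of compute_pool_max / mx_rows)
mx = torch.tensor([values[p].max() for p in pools])
print(mx)  # tensor([3., 5.])

# numerically stable softmax within each pool
out = torch.empty_like(values)
for p, m in zip(pools, mx):
    e = (values[p] - m).exp()
    out[p] = e / e.sum()
print(out)
```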
Force-pushed from 001e2fb to c4a4733.
Yes! That's what I meant by "It would make sense to have a special case where gradient sparsity pattern is the same as output sparsity pattern"
All right, so this optimization applies to both Softmax.cpp and Softmax.cu, plus new tests for this special case. @pearu, could you please detail the case here? I am not so familiar with all cases of the backward algorithm. Thanks for your comments, I appreciate your support here. cc @ngimel
Force-pushed from f505a57 to ebb1f60.
Quick update: I pushed my changes in the PR with support for the case where the gradient sparsity pattern is the same as the output sparsity pattern, when …
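The special case discussed above can be sketched per pool as follows. When the gradient shares the output's sparsity pattern, the standard softmax backward formula can be applied directly to each pool's values (`softmax_backward_pool` is an illustrative name, not an ATen function):

```python
import torch

def softmax_backward_pool(out_vals, grad_vals):
    # grad_in = y * (dy - sum(dy * y)), the softmax Jacobian-vector
    # product restricted to one pool's values
    s = (grad_vals * out_vals).sum()
    return out_vals * (grad_vals - s)

out = torch.softmax(torch.tensor([1.0, 2.0, 3.0]), dim=0)
grad_out = torch.tensor([0.1, 0.2, 0.3])
grad_in = softmax_backward_pool(out, grad_out)

# cross-check against autograd on the dense softmax
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
torch.softmax(x, dim=0).backward(grad_out)
print(torch.allclose(grad_in, x.grad))  # True
```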
Thanks @aocsa! We'll take a look soon.
Thanks for all the updates, @aocsa. @ngimel and I had the chance to review and we're happy with how this looks.
In the future it'll be interesting to improve our test infrastructure so we don't need to write so many custom components (like the Python softmax implementation or the custom jacobian implementation). That should make adding new operators much easier.
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Unfortunately this is triggering our arcane internal build system, which is complaining about an inability to find <ATen/native/sparse/utils/ParamUtils.h>. The issue is, I think, that it's in a new folder that I'd have to register internally in a few places. A more immediate fix, however, would be to refactor it into the current sparse folder instead of putting it under utils. Would that be OK for now, @aocsa? We can revisit creating a sparse utils folder later if we have organizational issues.
Sure, I updated the PR with this minor refactor. cc @mruberry
Thank you!
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Ah, sorry, forgot to mention you'll also have to move the function definitions (implementations) to a cpp file. Only the declarations go in the header. |
Commit history (squashed):
- minor fix
- Initial commit: cuda_sparse_softmax and cuda_sparse_log_softmax
- minor redo
- minor fix
- Closes pytorchgh-23651
- update with huge performance improvement!
- autograd support
- minor fix to autograd support
- minor refactor
- update with minor refactor
- optimize LogSoftMax mx buffer
- minor bug solved
- fix max_values error
- updates based on change requests
- updates based on change requests
- minor update
- minor refactor
Force-pushed from ebcd70c to fae90a5.
Sure. It's done.
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This PR implements softmax support for sparse tensors.
Resolves gh-23651 for CUDA.
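For reference, the user-facing API this PR accelerates is `torch.sparse.softmax`. Per its documented semantics, unspecified entries are treated as negative infinity, so they remain zero in the result:

```python
import torch

# a 2x3 sparse COO tensor with three nonzero entries
i = torch.tensor([[0, 0, 1], [0, 2, 1]])
v = torch.tensor([1.0, 2.0, 3.0])
s = torch.sparse_coo_tensor(i, v, (2, 3))

# softmax over dim=1: each row's nonzeros are normalized together;
# missing entries stay zero
out = torch.sparse.softmax(s, dim=1)
print(out.to_dense())
```

Row 0 normalizes only its two specified entries (they sum to 1), the unspecified entry at (0, 1) stays 0, and row 1's single entry becomes 1.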
Here are some benchmark results (script is here) for `torch.sparse.softmax` and `torch.softmax`, using CPU and GPU. Values are float64 scalars and the timing repeat is 1000. Times are in microseconds (us).
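A minimal timing sketch in the spirit of the benchmark described above (the actual script is linked from the PR and is not reproduced here; sizes and repeat counts below are placeholders, not the PR's settings):

```python
import timeit
import torch

torch.manual_seed(0)
dense = torch.randn(200, 200, dtype=torch.float64)
sparse = dense.relu().to_sparse()  # roughly half the entries are zero

t_dense = timeit.timeit(lambda: torch.softmax(dense, dim=1), number=50)
t_sparse = timeit.timeit(lambda: torch.sparse.softmax(sparse, dim=1), number=50)
print(f"dense softmax:  {t_dense * 1e6 / 50:.1f} us/iter")
print(f"sparse softmax: {t_sparse * 1e6 / 50:.1f} us/iter")
```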
Quick note: I updated the performance test again.