LU solve uses cuBLAS and cuSOLVER for matrices with dim > 1024 #61815
Conversation
@IvanYashchuk, @xwang233 can you please review this PR and ping me to land it?
```diff
 #ifdef USE_CUSOLVER
-if (batch_size == 1 && m > 512) {
+if ((batch_size == 1 && m > 512) || (batch_size <= 8 && (m > 1024 || b1 > 1024 || b2 > 1024))) {
```
I think this PR overall looks good. Since this boolean `(m > 1024 || b1 > 1024 || b2 > 1024)` is reused below, we can assign it to a variable just below the `auto b2 = ...` line.
Thanks, updated.
Hey, @rkansal47! Thank you for fixing this long-standing problem!

Could you please add a test to `test/test_linalg.py` that would fail on master but pass with this PR? Add a link to the issue in the test description and use the `@onlyCUDA` and `@skipCUDAIfNoMagma` decorators.
```diff
 auto b1 = b.size(-2);
 auto b2 = b.size(-1);
+// MAGMA's implementation of LU solve cannot handle tensors with dimensions > 1024
+bool over_magma_dim_limit = m > 1024 || b1 > 1024 || b2 > 1024;
```
MAGMA doesn't work only when `b.size(-1) > 1024`; all other conditions are unnecessary.
```diff
 #ifdef USE_CUSOLVER
-if (batch_size == 1 && m > 512) {
+if ((batch_size == 1 && m > 512) || (batch_size <= 8 && over_magma_dim_limit)) {
```
Where do the conditions `batch_size <= 8` and `batch_size > 8` come from?
They were chosen based on the benchmarks shown here: #59148 (comment). cuBLAS is faster for batch sizes > 8 and cuSOLVER for batch sizes <= 8. Should I mention this somewhere in the PR?
It's fine as it is, thank you for including it here.
I think it'd be good to leave a comment in the code pointing this out.
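Taken together, the thresholds in this thread can be sketched as a small dispatch function. This is a hypothetical pure-Python rendering for illustration only: the names `choose_backend` and `MAGMA_DIM_LIMIT` are invented here, and the exact cuBLAS/cuSOLVER routing is inferred from the benchmark discussion above, not copied from PyTorch's source.

```python
# Illustrative sketch of the dispatch heuristic discussed above; the names
# choose_backend and MAGMA_DIM_LIMIT are invented for this example.
MAGMA_DIM_LIMIT = 1024  # MAGMA's batched LU solve misbehaves past this dimension

def choose_backend(batch_size, m, b1, b2):
    """Pick an LU-solve backend for an m x m system with a (b1, b2) right-hand side."""
    over_magma_dim_limit = m > MAGMA_DIM_LIMIT or b1 > MAGMA_DIM_LIMIT or b2 > MAGMA_DIM_LIMIT
    if (batch_size == 1 and m > 512) or (batch_size <= 8 and over_magma_dim_limit):
        return "cusolver"  # small batches: cuSOLVER benchmarked faster (#59148)
    if batch_size > 8 and over_magma_dim_limit:
        return "cublas"    # large batches: batched cuBLAS benchmarked faster
    return "magma"

# The shape from issue #36921: batch of 1, a 1x1 LU, b with last dim 1025.
assert choose_backend(1, 1, 1, 1025) == "cusolver"
assert choose_backend(16, 4, 4, 2048) == "cublas"
assert choose_backend(4, 64, 64, 64) == "magma"
```

The point of factoring out `over_magma_dim_limit` (as suggested earlier in the review) is that the same predicate gates both the single-batch and large-batch escape hatches away from MAGMA.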
The testing script only considers a single dtype, `float64`; could you please confirm the same outcome for `float32` and the complex dtypes?
@lezcano added.
@nikitaved thanks, I agree that confirming the heuristics for different dtypes is important, but isn't that an issue to raise for the original PR that chose them (#59148 (comment))? The only contribution I'm proposing in this PR is directing `lu_solve` to not use MAGMA for large `b` matrices. Please let me know if you disagree.
@IvanYashchuk thanks! Updated with your suggestions.
Thank you, @rkansal47, for this fix! I am sorry for the delay. Unfortunately, a merge conflict accumulated; I hope you don't mind that I fixed it.
```python
# this tests https://github.com/pytorch/pytorch/issues/36921
def test_lu_solve_large_matrices(self, device, dtype):
    def run_test(A_dims, b_dims):
```
I think the name is a bit confusing. `run_test` gets `A_dims = (1, 1)`, `b_dims = (1, 1, 1025)`. So, effectively, the matrices are small!
Fair, how's `test_lu_solve_large_b_second_dim`?
The referenced issue actually considers large batched matrices. Why wouldn't we test all the cases, i.e., large batch with small size and small batch with large size?
The only remaining issue in the current master branch is with `b` tensors whose last dimension is > 1024, due to the linked bug in MAGMA. @IvanYashchuk asked for a test that fails on the master branch but passes after this PR. Those other tests would pass on the master branch as well, but I can add them anyway if you'd like.
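For reference, the shape scenario under discussion — a tiny coefficient matrix with a right-hand side whose last dimension just exceeds 1024 — can be reproduced on CPU with NumPy. This is a hedged stand-in, not the PR's test: `np.linalg.solve` replaces PyTorch's `lu_solve`, and the concrete values are invented for illustration.

```python
import numpy as np

# A 1x1 system (as in A_dims = (1, 1)) with a right-hand side whose last
# dimension is 1025, just past MAGMA's 1024 limit. On CPU with NumPy this
# is unremarkable; the bug was specific to MAGMA's batched GPU kernels.
A = np.array([[2.0]])
b = np.arange(1025, dtype=np.float64).reshape(1, 1025)
x = np.linalg.solve(A, b)

assert x.shape == (1, 1025)
assert np.allclose(A @ x, b)   # residual check: A x should reproduce b
assert np.allclose(x, b / 2.0) # for this A, the solution is simply b / 2
```

The PR's actual regression test runs the analogous shapes through `torch.lu_solve` on CUDA, where the MAGMA path used to produce wrong results.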
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This PR builds off of #59148 and modifies the `lu_solve` routine to avoid MAGMA for `b` or `lu_data` matrices with any dimension > 1024, since MAGMA has a bug when dealing with such matrices (https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for).

Fixes #36921
Fixes #61929

Pull Request resolved: #61815
Reviewed By: anjali411
Differential Revision: D30199618
Pulled By: ngimel
fbshipit-source-id: 06870793f697e9c35aaaa8254b8a8b1a38bd3aa9