
LU solve uses cuBLAS and cuSOLVER for matrices with dim > 1024 #61815

Closed · wants to merge 6 commits

Conversation

@rkansal47 (Contributor) commented Jul 18, 2021

This PR builds on #59148 and modifies the `lu_solve` routine to avoid MAGMA for `b` or `lu_data` matrices with any dimension > 1024, since MAGMA has a bug when handling such matrices (https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for).
Fixes #36921
Fixes #61929
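
For context, a minimal sketch of the failure mode this PR targets (the shapes are illustrative, chosen to cross the 1024 limit; it assumes a CUDA build where `lu_solve` previously dispatched to MAGMA):

```python
import torch

A = torch.randn(1, 4, 4, dtype=torch.float64, device="cuda")
b = torch.randn(1, 4, 1025, dtype=torch.float64, device="cuda")  # last dim > 1024

LU, pivots = A.lu()                # LU-factorize A on the GPU
x = torch.lu_solve(b, LU, pivots)  # solve A x = b from the factorization

# On affected builds this check fails or the call errors out; with this PR
# the solve is routed to cuSOLVER/cuBLAS instead and the check passes.
print(torch.allclose(A @ x, b, atol=1e-8))
```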

@facebook-github-bot (Contributor):

Hi @rkansal47!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot (Contributor) commented Jul 18, 2021

💊 CI failures summary and remediations

As of commit 08f8d31 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .bazelrc
Auto-merging .bazelrc
CONFLICT (add/add): Merge conflict in .azure_pipelines/pytorch-tests-pipeline.yml
Auto-merging .azure_pipelines/pytorch-tests-pipeline.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/nightly-pytorch-tests-pipeline.yml
Auto-merging .azure_pipelines/nightly-pytorch-tests-pipeline.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/wheel-wait-job-template.yml
Auto-merging .azure_pipelines/job_templates/wheel-wait-job-template.yml
CONFLICT (add/add): Merge conflict in .azure_pipelines/job_templates/pytorch-template-unix.yml
Auto-merging .azure_pipelines/job_templates/pytorch-template-unix.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (2/2)

Step: "(Optional) Merge target branch"

(Same merge-conflict log and exit status as the build above.)


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@facebook-github-bot (Contributor):

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@mrshenli added labels on Jul 19, 2021: module: cublas (Problem related to cublas support), module: cuda (Related to torch.cuda, and CUDA support in general), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
@mrshenli requested a review from ngimel on July 19, 2021 at 21:47
@ngimel (Collaborator) commented Jul 20, 2021

@IvanYashchuk, @xwang233 can you please review this PR and ping me to land it?

```diff
 #ifdef USE_CUSOLVER
-  if (batch_size == 1 && m > 512) {
+  if ((batch_size == 1 && m > 512) || (batch_size <= 8 && (m > 1024 || b1 > 1024 || b2 > 1024))) {
```
Collaborator:

I think this PR overall looks good. Since this boolean `(m > 1024 || b1 > 1024 || b2 > 1024)` is reused below, we can assign it to a variable just below the `auto b2 = ...` line.

@rkansal47 (Author):

Thanks, updated.

@IvanYashchuk (Collaborator) left a comment:

Hey, @rkansal47! Thank you for fixing this long-standing problem!
Could you please add a test to test/test_linalg.py that fails on master but passes with this PR. Add a link to the issue in the test description and use the @onlyCUDA and @skipCUDAIfNoMagma decorators.
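
As a rough illustration of what is being requested, here is a sketch of such a regression test as a method of the TestLinalg class in test/test_linalg.py (the decorators exist in torch.testing._internal.common_device_type; the method name and shapes here are illustrative, and the PR's actual test appears in a later review hunk):

```python
import torch
from torch.testing._internal.common_device_type import onlyCUDA, skipCUDAIfNoMagma, dtypes

@skipCUDAIfNoMagma
@onlyCUDA
@dtypes(torch.double)
def test_lu_solve_large_rhs(self, device, dtype):
    # Regression test for https://github.com/pytorch/pytorch/issues/36921:
    # fails on master (MAGMA path), passes with this PR (cuSOLVER/cuBLAS path).
    A = torch.randn(1, 1, 1, dtype=dtype, device=device)
    b = torch.randn(1, 1, 1025, dtype=dtype, device=device)  # last dim > 1024
    LU, pivots = A.lu()
    x = torch.lu_solve(b, LU, pivots)
    self.assertEqual(b, A.matmul(x))
```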

Comment on lines 2877 to 2879:

```cpp
auto b1 = b.size(-2);
auto b2 = b.size(-1);
bool over_magma_dim_limit = m > 1024 || b1 > 1024 || b2 > 1024; // magma implementation of LU solve cannot handle tensors with dimensions > 1024
```
Collaborator:

MAGMA fails only when b.size(-1) > 1024; all the other conditions are unnecessary.

```diff
 #ifdef USE_CUSOLVER
-  if (batch_size == 1 && m > 512) {
+  if ((batch_size == 1 && m > 512) || (batch_size <= 8 && over_magma_dim_limit)) {
```
Collaborator:

Where do the conditions `batch_size <= 8` and `batch_size > 8` come from?

@rkansal47 (Author) commented Jul 21, 2021:

They were chosen based on the benchmarks shown in #59148 (comment): cuBLAS is faster for batch sizes > 8 and cuSOLVER for batch sizes <= 8. Should I mention this somewhere in the PR?

Collaborator:

It's fine as it is, thank you for including it here.

Collaborator:

I think it'd be good to leave a comment in the code pointing this out.
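
Putting this sub-thread together, here is a hedged Python pseudocode sketch of the backend choice under discussion (the real logic is C++ in aten/src/ATen/native/cuda/BatchLinearAlgebra.cu; the cuBLAS branch for batch_size > 8 is inferred from the benchmarks cited above, not shown in the hunks here):

```python
def pick_lu_solve_backend(batch_size: int, m: int, b1: int, b2: int) -> str:
    # MAGMA's LU solve cannot handle tensors with dimensions > 1024
    over_magma_dim_limit = m > 1024 or b1 > 1024 or b2 > 1024
    # cuSOLVER: single large system, or a small batch exceeding MAGMA's limit
    if (batch_size == 1 and m > 512) or (batch_size <= 8 and over_magma_dim_limit):
        return "cusolver"
    # cuBLAS benchmarked faster than cuSOLVER for larger batches over the limit
    if batch_size > 8 and over_magma_dim_limit:
        return "cublas"
    # otherwise keep the default MAGMA path
    return "magma"
```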

@nikitaved (Collaborator) commented Jul 29, 2021:

The testing script only considers a single dtype, float64. Could you please confirm the same outcome for float32 and the complex dtypes?

@rkansal47 (Author):

@lezcano added.

@nikitaved thanks, I agree confirming the heuristics for different types is important, but is that not an issue to raise for the original PR which chose them (#59148 (comment))? The only contribution I'm proposing in this PR is directing lu_solve to not use MAGMA for large b matrices. Please let me know if you disagree.
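
(As an aside, a quick illustrative way to spot-check nikitaved's question across dtypes; this snippet is not part of the PR and assumes a CUDA build:)

```python
import torch

for dtype in (torch.float32, torch.float64, torch.complex64, torch.complex128):
    A = torch.randn(1, 4, 4, dtype=dtype, device="cuda")
    b = torch.randn(1, 4, 1025, dtype=dtype, device="cuda")  # last dim > 1024
    LU, pivots = A.lu()
    x = torch.lu_solve(b, LU, pivots)
    # loose tolerance so single-precision round-off doesn't mask real failures
    assert torch.allclose(A @ x, b, atol=1e-3), f"mismatch for {dtype}"
```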

aten/src/ATen/native/cuda/BatchLinearAlgebra.cu (outdated; resolved)
@rkansal47 (Author):
@IvanYashchuk thanks! Updated with your suggestions.

@IvanYashchuk (Collaborator) left a comment:

Thank you, @rkansal47, for this fix! I am sorry for the delay. Unfortunately, a merge conflict accumulated; I hope you don't mind that I fixed it.

Comment on lines +7491 to +7493:

```python
# this tests https://github.com/pytorch/pytorch/issues/36921
def test_lu_solve_large_matrices(self, device, dtype):
    def run_test(A_dims, b_dims):
```
Collaborator:

I think the name is a bit confusing. run_test gets A_dims = (1, 1), b_dims = (1, 1, 1025). So, effectively, the matrices are small!

@rkansal47 (Author):

Fair, how's test_lu_solve_large_b_second_dim?

@nikitaved (Collaborator) commented Jul 29, 2021:

The referenced issue actually considers large batched matrices. Why wouldn't we test all the cases? I.e., large batch small size, small batch large size?

@rkansal47 (Author):

The only remaining issue in the current master branch is with b tensors with a last dimension > 1024 due to the linked bug in MAGMA. IvanYashchuk asked for a test which fails on the master branch but passes after this PR. These other tests would pass in the master branch as well, but I can add them anyway if you'd like.

test/test_linalg.py (outdated; resolved)
@facebook-github-bot (Contributor):

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@ngimel merged this pull request in 6d21e36.

alanwaketan pushed a commit that referenced this pull request Aug 17, 2021
Summary:
This PR builds off of #59148 and modifies the `lu_solve` routine to avoid MAGMA for `b` or `lu_data` matrices with any dimension > 1024, since MAGMA has a bug when dealing with such matrices (https://bitbucket.org/icl/magma/issues/19/dgesv_batched-dgetrs_batched-fails-for).
Fixes #36921
Fixes #61929

Pull Request resolved: #61815

Reviewed By: anjali411

Differential Revision: D30199618

Pulled By: ngimel

fbshipit-source-id: 06870793f697e9c35aaaa8254b8a8b1a38bd3aa9
Labels: cla signed · Merged · module: cublas · module: cuda · open source · triaged
Successfully merging this pull request may close these issues:

- Torch.linalg.solve large matrix cuda error
- Error in lu_solve for batched large matrices on the GPU