Skip to content

Conversation

ignasa007
Copy link
Contributor

@ignasa007 ignasa007 commented May 4, 2025

Fixes #101075

This PR fixes an issue with the computation of residuals in the LOBPCG algorithm.

Bug: Line 788 is supposed to compute the denominator in Equation 9 of Duersch et al., 2018, as also suggested in line 776, but it uses the raw eigenvalue-estimates instead of their absolute values.

Consequence: This made the algorithm's success sensitive to initialization of eigenvectors.

Tests:

  • I have tested @jtorde's script, and I did NOT run into any assertion errors for a few minutes (as opposed to the original implementation, which fails after a few seconds).
  • I have also tried @pearu's specific test case, which also executes successfully - the residuals remain positive, and the final output is the same as one returned by SciPy (with and without enforcing the use of LOBPCG).
  • I extracted the relevant test cases from test/test_autograd.py and test/test_linalg.py, and they ran successfully.

Let me know if further test cases or benchmarks are needed.

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @lezcano

Copy link

pytorch-bot bot commented May 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152789

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0ca3bd9 with merge base 56879f6 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link

linux-foundation-easycla bot commented May 4, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link

pytorch-bot bot commented May 4, 2025

Didn't find following labels among repository labels: release notes: bug fix

@ignasa007
Copy link
Contributor Author

@pytorchbot label "module: linear algebra"

@pytorch-bot pytorch-bot bot added the module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul label May 4, 2025
Copy link
Collaborator

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Can you add a regression test?

@lezcano lezcano removed the module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul label May 5, 2025
@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend release notes category label May 5, 2025
@ignasa007
Copy link
Contributor Author

A few comments:

  • I think the bug was not being picked up because A and B were being sampled as SPD matrices, which restricts the eigenvalues to be positive. The problem arises when the eigenvalue estimates are negative which are somehow not being encountered.1
  • Moreover, A only needs to be symmetric in SPD generalized eigenvalue problems. By changing the tests to use symmetric A, the tests immediately pick on the error in the original implementation – 4 out of 5 tests fail, each with 100% mismatched elements.
  • After updating the implementation, 2 tests fail, but I think that's due to imprecise numerics:
python test_linalg.py TestLinalgCPU.test_lobpcg_ortho_cpu_float64
Mismatched elements: 2 / 3000 (0.1%)
Greatest absolute difference: 0.0005147071740729814 at index (1, 2, 46, 0) (up to 0.0005 allowed)
Greatest relative difference: 6.17151121231244e-05 at index (1, 2, 46, 0) (up to 0 allowed)

python test_linalg.py TestLinalgCUDA.test_lobpcg_ortho_cuda_float64
Mismatched elements: 1 / 108 (0.9%)
Greatest absolute difference: 0.0006575654327463099 at index (0, 1, 2, 0) (up to 0.0005 allowed)
Greatest relative difference: 3.5065780611903477e-05 at index (0, 1, 2, 0) (up to 0 allowed)

I have also added a check in test_linalg.py for residuals to stay non-negative over the iterations, which the previous implementation was failing to satisfy in 4 of the 5 tests.

[1] This is actually a weird one – using random_symmetric_pd_matrix for sampling A and B does not fail any tests. However, replacing both with make_symmetric_pd_matrices – as suggested – fails the following test:

python test_linalg.py TestLinalgCUDA.test_lobpcg_ortho_cuda_float64
Mismatched elements: 2 / 3000 (0.1%)
Greatest absolute difference: 0.0007664745520128396 at index (1, 2, 95, 0) (up to 0.0005 allowed)
Greatest relative difference: 0.39717887328301443 at index (1, 2, 25, 0) (up to 0 allowed)

Mainly, the relative error is non-negligible. The difference between the two is just that the former uses Gaussian sampling, while the latter uses Uniform sampling in [-9, 9] (as defined in make_tensor). I tried changing the scale of the Gaussian sampler in random_symmetric_pd_matrix to 9, but the test still passes.

I am not sure what is going on there – if it's an issue with one of the sampling schemes, or with the LOBPCG implementation. For now, I skipped on changing the sampling to make_symmetric_pd_matrices.

@ignasa007 ignasa007 marked this pull request as ready for review May 5, 2025 19:36
Copy link
Collaborator

@pearu pearu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a minor nit, otherwise LGTM!

Thanks, @ignasa007!

@pearu
Copy link
Collaborator

pearu commented May 6, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix-lobpcg-rerr onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix-lobpcg-rerr && git pull --rebase)

Copy link
Collaborator

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use random_symmetric_pd_matrix here. A gaussian matrix has rather well behaved eigenvalues, while a matrix with uniform values on [-9,9] may have very large eigenvalues really. I think that's the issue you are seeing.

@ignasa007
Copy link
Contributor Author

Mm, alright. Should I add a comment warning against changing the sampling scheme, and/or removing the random_symmetric_pd_matrix function altogether, as is suggested in the TODO?

@pearu
Copy link
Collaborator

pearu commented May 6, 2025

Mm, alright. Should I add a comment warning against changing the sampling scheme, and/or removing the random_symmetric_pd_matrix function altogether, as is suggested in the TODO?

Changing the sampling scheme is OT for this PR, so I suggest to use the original sampling function per @lezcano suggestion.

Re the TODO item: I believe it was created with the assumption that using make_symmetric_pd_matrices instead of random_symmetric_pd_matrix (as both have the same goals) will not cause accuracy problems as seen here. The problem here indicates that make_tensor or make_symmetric_pd_matrices ought to be generalized to support selecting random distribution, or similar, to enable removing the TODO item. I suggest leave it as a follow-up task.

@pearu
Copy link
Collaborator

pearu commented May 6, 2025

To make the PR ready to land, please address the lint failures.

@ignasa007
Copy link
Contributor Author

From what I understand, the 11 errors seem to come from test/test_linalg.py, and are to do with numerical imprecision – they pass with atol=2e-3, rtol=2e-3 for generalized eigenvalue problem with smallest eigenvalues.

Copy link
Collaborator

@pearu pearu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a suggestion to fix CI failures and I think we still miss (read: it is not obvious that we have) the regression tests that reproduce the original issue.

@ignasa007
Copy link
Contributor Author

The faulty SciPy version checkscipy.__version__ < '1.4.1' – is keeping the test_linalg.py script from testing against the SciPy implementation (eg., mine skips the test because '1.13.1' < '1.4.1'). A simple fix is from packaging.version import Version and checking Version(scipy.__version__) < Version('1.4.1'). I can push a commit for this, along with the regression test.

@pearu
Copy link
Collaborator

pearu commented May 7, 2025

The faulty SciPy version checkscipy.__version__ < '1.4.1' – is keeping the test_linalg.py script from testing against the SciPy implementation (eg., mine skips the test because '1.13.1' < '1.4.1'). A simple fix is from packaging.version import Version and checking Version(scipy.__version__) < Version('1.4.1'). I can push a commit for this, along with the regression test.

Use

version.parse(scipy.__version__) < version.parse("1.4.1")

as it is used elsewhere in pytorch/test/.

@ignasa007 ignasa007 force-pushed the fix-lobpcg-rerr branch 2 times, most recently from d4d3059 to 27e380c Compare May 7, 2025 10:44
@ignasa007
Copy link
Contributor Author

Use

version.parse(scipy.__version__) < version.parse("1.4.1")

as it is used elsewhere in pytorch/test/.

Done!

Copy link
Collaborator

@pearu pearu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks, @ignasa007!

@pearu pearu requested a review from lezcano May 7, 2025 19:28
@pearu
Copy link
Collaborator

pearu commented May 8, 2025

@lezcano @nikitaved if the PR looks good to you, could you trigger the merging process?

Copy link
Collaborator

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

@lezcano
Copy link
Collaborator

lezcano commented May 8, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 8, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@ignasa007
Copy link
Contributor Author

Alright, thanks everyone for guiding me through my first contribution to open-source! :)

@nikitaved
Copy link
Collaborator

Congrats, @ignasa007 ! Hope you had fun! :)

@ignasa007
Copy link
Contributor Author

Sure did! Looking forward to more collaborations :')

@ignasa007 ignasa007 deleted the fix-lobpcg-rerr branch July 2, 2025 22:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: linalg_frontend release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torch.lobpcg producing different largest eigenvalue than scipy and np.linalg.eig
6 participants