-
-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: test failure in test_x0_equals_Mb
with bicgstab
#15533
Comments
Note: it's possible that this is an MKL bug. In that case. If people think that's the most realistic (and/or there's no way to workaround), then we can raise it with Intel some way, e.g. at https://github.com/conda-forge/intel_repack-feedstock/ |
I think this might actually be showing up in the doctests... I had a failed CI run with
which happens in the same |
Forgot that I can do that 😅 |
@h-vetinari, that sporadic doctest failure should be fixed by #16276 |
Ah, thanks a lot for tracking that down @WarrenWeckesser! Guess I was just fooled by seeing the failure affect the same exact function. |
No movement and not a blocker for 1.9.0, so bumping the milestone. |
I bumped the milestone for now, but I'd be interested in fixing this (because it's a persistent failure in conda-forge for MKL). The problem is that I just don't know that code at all, so I'd probably need some pointers. I did have a look now nevertheless, and the error code
|
This is also interesting from my side since I also had some issues with this on OpenBLAS side. Is it possible to have the CPU type exposed to see whether they are Tiger Lake? |
Not quite as detailed, but we have the following that numpy picks up:
In contrast, it passes when it's on an agent with the following:
I'm not sure how many types of agents there are on azure, but from the results it mostly looks like 2. I can also add some debug output to determine the processor type if that's desired/helpful. |
Bumping milestone, branching is imminent (May 29th), and linalg-backend specific test failures are nothing new of course. If a fix is merged before I branch, feel free to adjust back. |
This is going to be fixed together with #18488 so 1.12 is fine. I'll make a round the issues that PR closes later. |
About the error, this is a breakdown case where the algorithm hits a point that can't proceed further due to numerical problems. But because the tests are/were too tight it was sporadic. Now all tests pass. |
Amazing news, thanks a lot! |
Essentially unchanged for 1.13.0:
|
@h-vetinari Do you have any bandwidth or tools to play with the failing test and find anything that passes? The test numbers are really not important but the mechanism we are testing here. |
If you could provide me with a patch that adds a couple of relevant debug-statements, I'm happy to throw it at our infrastructure and post the results. Would that help? |
Yes, give me some time and I'll provide something concrete. |
Gentle ping on some print statements that would help debug this. :) |
@ilayn, I'll fire up the BLAS testing again after 1.13.1 - do you think I could get some of those debug prints that would allow debugging this? Doesn't have to be perfect or complete, we can iterate... |
@ilayn, this is basically the last widely occurring issue in conda-forge across all platforms and BLAS flavours. I'd be very happy if we could debug it together. 🙃 Are you still willing/able to provide some basic debug statements so we can start digging? |
Yes so sorry about this. I ghosted more than once and I truly apologize. I guess something in my head is resisting to look at this code again somehow. I'll do it today. |
Is there any possibility to inject this instead of the current code? I basically sprinkled some print values to check where it goes awry def bicgstab(A, b, x0=None, *, rtol=1e-5, atol=0., maxiter=None, M=None,
callback=None):
"""Use BIConjugate Gradient STABilized iteration to solve ``Ax = b``.
Parameters
----------
A : {sparse matrix, ndarray, LinearOperator}
The real or complex N-by-N matrix of the linear system.
Alternatively, ``A`` can be a linear operator which can
produce ``Ax`` and ``A^T x`` using, e.g.,
``scipy.sparse.linalg.LinearOperator``.
b : ndarray
Right hand side of the linear system. Has shape (N,) or (N,1).
x0 : ndarray
Starting guess for the solution.
rtol, atol : float, optional
Parameters for the convergence test. For convergence,
``norm(b - A @ x) <= max(rtol*norm(b), atol)`` should be satisfied.
The default is ``atol=0.`` and ``rtol=1e-5``.
maxiter : integer
Maximum number of iterations. Iteration will stop after maxiter
steps even if the specified tolerance has not been achieved.
M : {sparse matrix, ndarray, LinearOperator}
Preconditioner for A. The preconditioner should approximate the
inverse of A. Effective preconditioning dramatically improves the
rate of convergence, which implies that fewer iterations are needed
to reach a given error tolerance.
callback : function
User-supplied function to call after each iteration. It is called
as callback(xk), where xk is the current solution vector.
Returns
-------
x : ndarray
The converged solution.
info : integer
Provides convergence information:
0 : successful exit
>0 : convergence to tolerance not achieved, number of iterations
<0 : parameter breakdown
Examples
--------
>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> from scipy.sparse.linalg import bicgstab
>>> R = np.array([[4, 2, 0, 1],
... [3, 0, 0, 2],
... [0, 1, 1, 1],
... [0, 2, 1, 0]])
>>> A = csc_matrix(R)
>>> b = np.array([-1, -0.5, -1, 2])
>>> x, exit_code = bicgstab(A, b, atol=1e-5)
>>> print(exit_code) # 0 indicates successful convergence
0
>>> np.allclose(A.dot(x), b)
True
"""
A, M, x, b, postprocess = make_system(A, M, x0, b)
bnrm2 = np.linalg.norm(b)
print(f"Starting bnrm2 : {bnrm2}")
atol, _ = _get_atol_rtol('bicgstab', bnrm2, atol, rtol)
print(f"Starting atol: {atol}")
if bnrm2 == 0:
return postprocess(b), 0
n = len(b)
dotprod = np.vdot if np.iscomplexobj(x) else np.dot
if maxiter is None:
maxiter = n*10
matvec = A.matvec
psolve = M.matvec
# These values make no sense but coming from original Fortran code
# sqrt might have been meant instead.
rhotol = np.finfo(x.dtype.char).eps**2
omegatol = rhotol
print(f"Starting paramtols rho, omega : {rhotol} / {omegatol}")
# Dummy values to initialize vars, silence linter warnings
rho_prev, omega, alpha, p, v = None, None, None, None, None
r = b - matvec(x) if x.any() else b.copy()
rtilde = r.copy()
print(f"Starting residual: {r}")
for iteration in range(maxiter):
print(f"=============== Iteration {iteration} ========")
if np.linalg.norm(r) < atol: # Are we done?
return postprocess(x), 0
rho = dotprod(rtilde, r)
print(f"Current rho: {rho}")
if np.abs(rho) < rhotol: # rho breakdown
return postprocess(x), -10
if iteration > 0:
print(f"Omega check : {np.abs(omega)} < {omegatol}")
if np.abs(omega) < omegatol: # omega breakdown
return postprocess(x), -11
beta = (rho / rho_prev) * (alpha / omega)
print(f"Current beta: {beta}")
p -= omega*v
p *= beta
p += r
else: # First spin
s = np.empty_like(r)
p = r.copy()
phat = psolve(p)
v = matvec(phat)
rv = dotprod(rtilde, v)
print(f"Current rv : {rv}")
if rv == 0:
return postprocess(x), -11
alpha = rho / rv
print(f"Current alpha : {alpha}")
r -= alpha*v
s[:] = r[:]
if np.linalg.norm(s) < atol:
x += alpha*phat
return postprocess(x), 0
shat = psolve(s)
t = matvec(shat)
omega = dotprod(t, s) / dotprod(t, t)
x += alpha*phat
x += omega*shat
r -= omega*t
rho_prev = rho
if callback:
callback(x)
else: # for loop exhausted
# Return incomplete progress
return postprocess(x), maxiter Somehow we have to inject |
Thank you! :) Here's the output from the failing test (on CI agents without
This then somehow leads to
|
OK on my machine which always gets a pass on this test, I got
In your case a small miracle happens and we hit If I may, can I ask also to add the following lines just above the print("p", p)
print("phat", phat)
print("v", v)
print("rtilde", rtilde) My suspicion is that one of these arrays become exactly 0.0 |
The numbers are very funky on this one anyways. It has passed already in sufficient times in sufficient machines and we did not hear any screams about the rewrites so far. So if need be we can add |
I restarted a new build with that.
This issue predates the rewrite, so it appears the rewrite was too good in that it kept bug-for-bug compatibility with the Fortran implementation, at least in this regard. ;-)
Can you clarify which file you're talking about? |
Ah it is the |
With updated logging statements:
|
Since This seems it boils down to a dot product difference between architectures because this is what I get
So that |
Looks like lots and lots of cancellation. Perhaps it really is perfectly orthogonal in this very well-conditioned example. But that makes me kind of wary to skip this test. Even if scipy/scipy/sparse/linalg/_isolve/iterative.py Lines 256 to 259 in e984c95
to do something reasonable if other quantities (rho?) are small. Especially since the rho-condition sounds completely unattainable for me, as we're taking eps ** 2 , which is tiiiiiiiny:scipy/scipy/sparse/linalg/_isolve/iterative.py Lines 223 to 226 in e984c95
My conclusion is the opposite. Such a well-behaved example should not fail to converge, especially for an algorithm that advertises itself as stabilized. |
Yes but this method is known to breakdown in many cases for example relatively large imaginary parts compared to real parts or problems where search direction and rtilde comes very close. Another type of system from the literature is which is quite close to our example. Because we don't use the randomized |
Closing this as fixed by #20973 |
In conda-forge, we ran into new test failure for scipy 1.8, which appears only with MKL (which is available in cf only for x86), and only if the processor supports AVX512 (which azure CI only has for linux & windows), see conda-forge/scipy-feedstock#199.
To not further delay the release of 1.8, we skipped that test for now, but it should IMO be fixed, especially as such processors are becoming more and more common (in fact, it's getting harder and harder to purposefully catch a non-AVX512 windows CI agent on azure)
The failure is in
test_x0_equals_Mb[bicgstab]
and looks as follows:Also, at the end of the test suite, some (seemingly delayed) log output appears that is perhaps relevant:
The text was updated successfully, but these errors were encountered: