Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: test failure in test_x0_equals_Mb with bicgstab #15533

Closed
h-vetinari opened this issue Feb 6, 2022 · 43 comments · Fixed by #18488
Closed

BUG: test failure in test_x0_equals_Mb with bicgstab #15533

h-vetinari opened this issue Feb 6, 2022 · 43 comments · Fixed by #18488
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse.linalg
Milestone

Comments

@h-vetinari
Copy link
Member

h-vetinari commented Feb 6, 2022

In conda-forge, we ran into new test failure for scipy 1.8, which appears only with MKL (which is available in cf only for x86), and only if the processor supports AVX512 (which azure CI only has for linux & windows), see conda-forge/scipy-feedstock#199.

To not further delay the release of 1.8, we skipped that test for now, but it should IMO be fixed, especially as such processors are becoming more and more common (in fact, it's getting harder and harder to purposefully catch a non-AVX512 windows CI agent on azure)

The failure is in test_x0_equals_Mb[bicgstab] and looks as follows:

=================================== FAILURES ===================================
_________________________ test_x0_equals_Mb[bicgstab] __________________________
[...]/lib/python3.8/site-packages/scipy/sparse/linalg/_isolve/tests/test_iterative.py:538: in test_x0_equals_Mb
    assert_equal(info, 0)
E   AssertionError: 
E   Items are not equal:
E    ACTUAL: -11
E    DESIRED: 0
        A          = <10x10 sparse matrix of type '<class 'numpy.complex64'>'
	with 19 stored elements in Compressed Sparse Row format>
        b          = array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
        case       = <nonsymposdef>
        info       = -11
        solver     = <function bicgstab at 0x7fc6545a4ee0>
        sup        = <numpy.testing._private.utils.suppress_warnings object at 0x7fc62555d040>
        tol        = 1e-08
        x          = array([0.        +0.j, 0.50000358+0.j, 1.25016941+0.j, 2.12653797+0.j,
       3.0674186 +0.j, 4.03716976+0.j, 5.0158146 +0.j, 6.00288608+0.j,
       7.        +0.j, 8.        +0.j])
        x0         = 'Mb'

Also, at the end of the test suite, some (seemingly delayed) log output appears that is perhaps relevant:


   Normal return from subroutine COBYLA

   NFVALS =   50   F = 2.485185E+01    MAXCV = 1.999965E-10
   X = 4.955358E+00   6.666553E-01

 NNLS quitting on iteration count.
@h-vetinari h-vetinari added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Feb 6, 2022
@h-vetinari
Copy link
Member Author

h-vetinari commented Feb 6, 2022

Note: it's possible that this is an MKL bug. In that case. If people think that's the most realistic (and/or there's no way to workaround), then we can raise it with Intel some way, e.g. at https://github.com/conda-forge/intel_repack-feedstock/

@tylerjereddy tylerjereddy added this to the 1.8.1 milestone Feb 6, 2022
@h-vetinari
Copy link
Member Author

I think this might actually be showing up in the doctests... I had a failed CI run with

scipy.sparse.linalg.bicgstab
----------------------------

File "build/testenv/lib/python3.9/site-packages/scipy/sparse/linalg/_isolve/iterative.py", line 62, in bicgstab
Failed example:
    np.allclose(A.dot(x), b)
Expected:
    True
Got:
    False

which happens in the same bicgstab method, and could be flaky based on whether the assigned agent supports AVX512 or not.

@h-vetinari
Copy link
Member Author

h-vetinari commented May 24, 2022

This should be remilestoned as 1.8.1 has been released...

Forgot that I can do that 😅
Done...

@WarrenWeckesser
Copy link
Member

I think this might actually be showing up in the doctests...

@h-vetinari, that sporadic doctest failure should be fixed by #16276

@h-vetinari
Copy link
Member Author

Ah, thanks a lot for tracking that down @WarrenWeckesser! Guess I was just fooled by seeing the failure affect the same exact function.

@rgommers rgommers modified the milestones: 1.9.0, 1.10.0 Jul 5, 2022
@rgommers
Copy link
Member

rgommers commented Jul 5, 2022

No movement and not a blocker for 1.9.0, so bumping the milestone.

@h-vetinari
Copy link
Member Author

I bumped the milestone for now, but I'd be interested in fixing this (because it's a persistent failure in conda-forge for MKL). The problem is that I just don't know that code at all, so I'd probably need some pointers.

I did have a look now nevertheless, and the error code -11 stands for:

*                BREAKDOWN: If parameters RHO or OMEGA become smaller
*                   than some tolerance, the program will terminate.
*                   Here we check against tolerance BREAKTOL.
*
*                  -10: RHO  .ls.  BREAKTOL: RHO and RTLD have become 
*                                       orthogonal.
*                  -11: OMEGA  .ls.  BREAKTOL: S and T have become 
*                                         orthogonal relative to T'*T.

@ilayn
Copy link
Member

ilayn commented Jan 27, 2023

This is also interesting from my side since I also had some issues with this on OpenBLAS side. Is it possible to have the CPU type exposed to see whether they are Tiger Lake?

@h-vetinari
Copy link
Member Author

Is it possible to have the CPU type exposed to see whether they are Tiger Lake?

Not quite as detailed, but we have the following that numpy picks up:

Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CLX,AVX512_CNL,AVX512_ICL

In contrast, it passes when it's on an agent with the following:

Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

I'm not sure how many types of agents there are on azure, but from the results it mostly looks like 2. I can also add some debug output to determine the processor type if that's desired/helpful.

@tylerjereddy
Copy link
Contributor

Bumping milestone, branching is imminent (May 29th), and linalg-backend specific test failures are nothing new of course. If a fix is merged before I branch, feel free to adjust back.

@tylerjereddy tylerjereddy modified the milestones: 1.11.0, 1.12.0 May 29, 2023
@ilayn
Copy link
Member

ilayn commented May 29, 2023

This is going to be fixed together with #18488 so 1.12 is fine. I'll make a round the issues that PR closes later.

@ilayn
Copy link
Member

ilayn commented May 29, 2023

About the error, this is a breakdown case where the algorithm hits a point that can't proceed further due to numerical problems. But because the tests are/were too tight it was sporadic. Now all tests pass.

@h-vetinari
Copy link
Member Author

Amazing news, thanks a lot!

@h-vetinari
Copy link
Member Author

Essentially unchanged for 1.13.0:

1.11.0 1.11.4 1.12.0 1.13.0
linux + x64 + blis + avx2 ✔️ ✔️ ✔️ ✔️
linux + x64 + blis + avx512 ✔️ ✔️ ✔️ ✔️
linux + x64 + mkl + avx2 ✔️ ✔️
linux + x64 + mkl + avx512 ✔️ ✔️
linux + x64 + netlib + avx2 ✔️ ✔️ ✔️ ✔️
linux + x64 + netlib + avx512 ✔️ ✔️ ✔️ ✔️
linux + x64 + openblas + avx2 ✔️ ✔️ ✔️ ✔️
linux + x64 + openblas + avx512 ✔️ ✔️ ✔️
linux + aarch64 + netlib ✔️ ✔️ ✔️ ✔️
linux + aarch64 + openblas ✔️ ✔️
osx + x64 + accelerate ✔️ ✔️ ✔️/❌ ✔️
win + x64 + blis + avx2 ✔️ ✔️ ✔️
win + x64 + blis + avx512 ✔️ ✔️ ✔️
win + x64 + mkl + avx2 ✔️ ✔️
win + x64 + mkl + avx512 ✔️ ✔️ ✔️
win + x64 + openblas + avx2 ✔️ ✔️ ✔️ ✔️
win + x64 + openblas + avx512 ✔️ ✔️ ✔️ ✔️

@ilayn
Copy link
Member

ilayn commented Apr 14, 2024

@h-vetinari Do you have any bandwidth or tools to play with the failing test and find anything that passes? The test numbers are really not important but the mechanism we are testing here.

@h-vetinari
Copy link
Member Author

@h-vetinari Do you have any bandwidth or tools to play with the failing test and find anything that passes? The test numbers are really not important but the mechanism we are testing here.

If you could provide me with a patch that adds a couple of relevant debug-statements, I'm happy to throw it at our infrastructure and post the results. Would that help?

@ilayn
Copy link
Member

ilayn commented Apr 14, 2024

Yes, give me some time and I'll provide something concrete.

@h-vetinari
Copy link
Member Author

Gentle ping on some print statements that would help debug this. :)

@h-vetinari
Copy link
Member Author

@ilayn, I'll fire up the BLAS testing again after 1.13.1 - do you think I could get some of those debug prints that would allow debugging this? Doesn't have to be perfect or complete, we can iterate...

@h-vetinari h-vetinari modified the milestones: 1.13.1, 1.14.0, 1.15.0 May 23, 2024
@h-vetinari
Copy link
Member Author

@ilayn, this is basically the last widely occurring issue in conda-forge across all platforms and BLAS flavours. I'd be very happy if we could debug it together. 🙃 Are you still willing/able to provide some basic debug statements so we can start digging?

@ilayn
Copy link
Member

ilayn commented Jun 15, 2024

Yes so sorry about this. I ghosted more than once and I truly apologize. I guess something in my head is resisting to look at this code again somehow. I'll do it today.

@ilayn
Copy link
Member

ilayn commented Jun 15, 2024

Is there any possibility to inject this instead of the current code? I basically sprinkled some print values to check where it goes awry

def bicgstab(A, b, x0=None, *, rtol=1e-5, atol=0., maxiter=None, M=None,
             callback=None):
    """Use BIConjugate Gradient STABilized iteration to solve ``Ax = b``.

    Parameters
    ----------
    A : {sparse matrix, ndarray, LinearOperator}
        The real or complex N-by-N matrix of the linear system.
        Alternatively, ``A`` can be a linear operator which can
        produce ``Ax`` and ``A^T x`` using, e.g.,
        ``scipy.sparse.linalg.LinearOperator``.
    b : ndarray
        Right hand side of the linear system. Has shape (N,) or (N,1).
    x0 : ndarray
        Starting guess for the solution.
    rtol, atol : float, optional
        Parameters for the convergence test. For convergence,
        ``norm(b - A @ x) <= max(rtol*norm(b), atol)`` should be satisfied.
        The default is ``atol=0.`` and ``rtol=1e-5``.
    maxiter : integer
        Maximum number of iterations.  Iteration will stop after maxiter
        steps even if the specified tolerance has not been achieved.
    M : {sparse matrix, ndarray, LinearOperator}
        Preconditioner for A.  The preconditioner should approximate the
        inverse of A.  Effective preconditioning dramatically improves the
        rate of convergence, which implies that fewer iterations are needed
        to reach a given error tolerance.
    callback : function
        User-supplied function to call after each iteration.  It is called
        as callback(xk), where xk is the current solution vector.

    Returns
    -------
    x : ndarray
        The converged solution.
    info : integer
        Provides convergence information:
            0  : successful exit
            >0 : convergence to tolerance not achieved, number of iterations
            <0 : parameter breakdown

    Examples
    --------
    >>> import numpy as np
    >>> from scipy.sparse import csc_matrix
    >>> from scipy.sparse.linalg import bicgstab
    >>> R = np.array([[4, 2, 0, 1],
    ...               [3, 0, 0, 2],
    ...               [0, 1, 1, 1],
    ...               [0, 2, 1, 0]])
    >>> A = csc_matrix(R)
    >>> b = np.array([-1, -0.5, -1, 2])
    >>> x, exit_code = bicgstab(A, b, atol=1e-5)
    >>> print(exit_code)  # 0 indicates successful convergence
    0
    >>> np.allclose(A.dot(x), b)
    True

    """
    A, M, x, b, postprocess = make_system(A, M, x0, b)
    bnrm2 = np.linalg.norm(b)
    print(f"Starting bnrm2 : {bnrm2}")
    atol, _ = _get_atol_rtol('bicgstab', bnrm2, atol, rtol)
    print(f"Starting atol: {atol}")
    if bnrm2 == 0:
        return postprocess(b), 0

    n = len(b)

    dotprod = np.vdot if np.iscomplexobj(x) else np.dot

    if maxiter is None:
        maxiter = n*10

    matvec = A.matvec
    psolve = M.matvec

    # These values make no sense but coming from original Fortran code
    # sqrt might have been meant instead.
    rhotol = np.finfo(x.dtype.char).eps**2
    omegatol = rhotol
    print(f"Starting paramtols rho, omega : {rhotol} / {omegatol}")
    # Dummy values to initialize vars, silence linter warnings
    rho_prev, omega, alpha, p, v = None, None, None, None, None

    r = b - matvec(x) if x.any() else b.copy()
    rtilde = r.copy()
    print(f"Starting residual: {r}")
    for iteration in range(maxiter):
        print(f"=============== Iteration {iteration} ========")
        if np.linalg.norm(r) < atol:  # Are we done?
            return postprocess(x), 0

        rho = dotprod(rtilde, r)
        print(f"Current rho: {rho}")
        if np.abs(rho) < rhotol:  # rho breakdown
            return postprocess(x), -10

        if iteration > 0:
            print(f"Omega check : {np.abs(omega)} < {omegatol}")
            if np.abs(omega) < omegatol:  # omega breakdown
                return postprocess(x), -11

            beta = (rho / rho_prev) * (alpha / omega)
            print(f"Current beta: {beta}")
            p -= omega*v
            p *= beta
            p += r
        else:  # First spin
            s = np.empty_like(r)
            p = r.copy()

        phat = psolve(p)
        v = matvec(phat)
        rv = dotprod(rtilde, v)
        print(f"Current rv : {rv}")
        if rv == 0:
            return postprocess(x), -11
        alpha = rho / rv
        print(f"Current alpha : {alpha}")
        r -= alpha*v
        s[:] = r[:]

        if np.linalg.norm(s) < atol:
            x += alpha*phat
            return postprocess(x), 0

        shat = psolve(s)
        t = matvec(shat)
        omega = dotprod(t, s) / dotprod(t, t)
        x += alpha*phat
        x += omega*shat
        r -= omega*t
        rho_prev = rho

        if callback:
            callback(x)

    else:  # for loop exhausted
        # Return incomplete progress
        return postprocess(x), maxiter

Somehow we have to inject -- --capture=sys to the test execution to get the results though. Not sure if you have that capacity. Conda is a bit of a mystery to me. But if you have a local machine to replicate then it would be awesome.

@h-vetinari
Copy link
Member Author

Thank you! :)

Here's the output from the failing test (on CI agents without AVX512F/AVX512CD instructions, i.e. "only" AVX2) - identical on both linux & win:

----------------------------- Captured stdout call -----------------------------
Starting bnrm2 : 16.881943016134134
Starting atol: 1e-08
Starting paramtols rho, omega : 4.930380657631324e-32 / 4.930380657631324e-32
Starting residual: [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
=============== Iteration 0 ========
Current rho: 9.0
Current rv : 10.0
Current alpha : 0.9
=============== Iteration 1 ========
Current rho: -0.0399449035812673
Omega check : 0.3994490358126722 < 4.930380657631324e-32
Current beta: -0.010000000000000023
Current rv : -0.03595041322314066
Current alpha : 1.1111111111111085
=============== Iteration 2 ========
Current rho: -2.7755575615628914e-17
Omega check : 0.3825675630233262 < 4.930380657631324e-32
Current beta: 2.0180792053039126e-15
Current rv : 0.0

This then somehow leads to info = -11, c.f.

__________________ test_x0_equals_Mb[nonsymposdef-bicgstab] ___________________
[gw1] win32 -- Python 3.12.3 %PREFIX%\python.exe
..\_test_env\Lib\site-packages\scipy\sparse\linalg\_isolve\tests\test_iterative.py:520: in test_x0_equals_Mb
    assert info == 0
E   assert -11 == 0
        A          = <10x10 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>
        b          = array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
        case       = <nonsymposdef-bicgstab>
        info       = -11
        rtol       = 1e-08
        x          = array([0.        , 0.52361589, 1.32356787, 2.15281624, 3.        ,
       4.        , 5.        , 6.        , 7.        , 8.        ])
        x0         = 'Mb'

@ilayn
Copy link
Member

ilayn commented Jun 16, 2024

OK on my machine which always gets a pass on this test, I got

Starting bnrm2 : 16.881943016134134
Starting atol: 0.00016881943016134135
Starting paramtols rho, omega : 4.930380657631324e-32 / 4.930380657631324e-32
Starting residual: [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
=============== Iteration 0 ========
Current rho: 9.0
Current rv : 10.0
Current alpha : 0.9
=============== Iteration 1 ========
Current rho: -0.0399449035812674
Omega check : 0.3994490358126724 < 4.930380657631324e-32
Current beta: -0.010000000000000042
Current rv : -0.035950413223140916
Current alpha : 1.1111111111111032
=============== Iteration 2 ========
Current rho: -2.023390733630768e-16
Omega check : 0.3825675630233261 < 4.930380657631324e-32
Current beta: 1.4711864816975902e-14
Current rv : -4.96177387315249e-16
Current alpha : 0.40779583781096335
=============== Iteration 3 ========
Current rho: 1.9545000710325743e-17
Omega check : 0.4502470191790134 < 4.930380657631324e-32
Current beta: -0.08748787666655233
Current rv : 5.1525628801392976e-17
Current alpha : 0.3793258066905406
=============== Iteration 4 ========
Current rho: -2.512269613955742e-18
Omega check : 0.5300027929977077 < 4.930380657631324e-32
Current beta: -0.09199511928154214
Current rv : -2.6020852139652106e-18
Current alpha : 0.9654832210999723
=============== Iteration 5 ========
Current rho: 0.002706428175909658
Omega check : 0.49693326579407965 < 4.930380657631324e-32
Current beta: -2093037073226866.5
Current rv : 5867170329227.434
Current alpha : 4.612833826261236e-16
=============== Iteration 6 ========
Current rho: 0.0016019143903419333
Omega check : 0.571462224373975 < 4.930380657631324e-32
Current beta: 4.77774623675576e-16
Current rv : -0.00038850060507100736
Current alpha : -4.123325342180991
=============== Iteration 7 ========
Current rho: 0.002070293928641972
Omega check : 0.4902358823313919 < 4.930380657631324e-32
Current beta: -10.870141892415262
Current rv : -0.0004441575982355629
Current alpha : -4.6611696768586475
=============== Iteration 8 ========
Current rho: 0.0023866725790334187
Omega check : 0.47011447528419964 < 4.930380657631324e-32
Current beta: -11.430155103786058
Current rv : 1.2440308056737618e-15
Current alpha : 1918499580676.2253

In your case a small miracle happens and we hit rv exactly 0.0 which is quite suspicious for a dot product to be. Then quits because we will divide things with rv. That's why it is giving -11.

If I may, can I ask also to add the following lines just above the print(f"Current rv : {rv}") line

print("p", p)
print("phat", phat)
print("v", v)
print("rtilde", rtilde)

My suspicion is that one of these arrays become exactly 0.0

@ilayn
Copy link
Member

ilayn commented Jun 16, 2024

The numbers are very funky on this one anyways. It has passed already in sufficient times in sufficient machines and we did not hear any screams about the rewrites so far.

So if need be we can add bicgstab to lists on lines 174 and 176 for this specific problem and have some peace of mind. If you want to remove the manual skip on your side.

@h-vetinari
Copy link
Member Author

If I may, can I ask also to add the following lines just above the print(f"Current rv : {rv}") line

I restarted a new build with that.

The numbers are very funky on this one anyways. It has passed already in sufficient times in sufficient machines and we did not hear any screams about the rewrites so far.

This issue predates the rewrite, so it appears the rewrite was too good in that it kept bug-for-bug compatibility with the Fortran implementation, at least in this regard. ;-)

So if need be we can add bicgstab to lists on lines 174 and 176 for this specific problem and have some peace of mind. If you want to remove the manual skip on your side.

Can you clarify which file you're talking about?

@ilayn
Copy link
Member

ilayn commented Jun 16, 2024

Can you clarify which file you're talking about?

Ah it is the test_iterative.py the test file itself has a way to turn on and off solvers for each case at the top. The lines 174-176 can have this solver to skip it for this particular case.

@h-vetinari
Copy link
Member Author

With updated logging statements:

Starting bnrm2 : 16.881943016134134
Starting atol: 1.6881943016134135e-07
Starting paramtols rho, omega : 4.930380657631324e-32 / 4.930380657631324e-32
Starting residual: [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
=============== Iteration 0 ========
Current rho: 9.0
p [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
phat [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
v [ 0. -2. -1. -1. -1. -1. -1. -1. -1. -1.]
rtilde [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Current rv : 10.0
Current alpha : 0.9
=============== Iteration 1 ========
Current rho: -0.0399449035812673
Omega check : 0.3994490358126722 < 4.930380657631324e-32
Current beta: -0.010000000000000023
p [ 0.          0.16289256  0.30545455 -0.05404959 -0.05404959 -0.05404959
 -0.05404959 -0.05404959 -0.05404959 -0.05404959]
phat [ 0.          0.16289256  0.30545455 -0.05404959 -0.05404959 -0.05404959
 -0.05404959 -0.05404959 -0.05404959 -0.05404959]
v [ 0.          0.32578512  0.44801653 -0.41355372 -0.05404959 -0.05404959
 -0.05404959 -0.05404959 -0.05404959 -0.05404959]
rtilde [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Current rv : -0.03595041322314066
Current alpha : 1.1111111111111085
=============== Iteration 2 ========
Current rho: -2.7755575615628914e-17
Omega check : 0.3825675630233262 < 4.930380657631324e-32
Current beta: 2.0180792053039126e-15
p [ 0.00000000e+00 -4.72317791e-02 -1.23519843e-01  1.79353779e-02
  1.52816244e-01 -1.61601835e-16 -1.61601835e-16 -1.61601835e-16
 -1.61601835e-16 -1.61601835e-16]
phat [ 0.00000000e+00 -4.72317791e-02 -1.23519843e-01  1.79353779e-02
  1.52816244e-01 -1.61601835e-16 -1.61601835e-16 -1.61601835e-16
 -1.61601835e-16 -1.61601835e-16]
v [ 0.00000000e+00 -9.44635581e-02 -1.99807907e-01  1.59390599e-01
  2.87697110e-01 -1.52816244e-01 -1.61601835e-16 -1.61601835e-16
 -1.61601835e-16 -1.61601835e-16]
rtilde [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Current rv : 0.0

@ilayn
Copy link
Member

ilayn commented Jun 17, 2024

Since rv is the dot product of rtilde and v I can't understand how rv can get to 0.0. But maybe it is the truncated precision that numpy does not print.

This seems it boils down to a dot product difference between architectures because this is what I get

=============== Iteration 2 ========
Current rho: -2.023390733630768e-16
Omega check : 0.3825675630233261 < 4.930380657631324e-32
Current beta: 1.4711864816975902e-14
p [ 0.00000000e+00 -4.72317791e-02 -1.23519843e-01  1.79353779e-02
  1.52816244e-01 -7.78011861e-16 -7.78011861e-16 -7.78011861e-16
 -7.78011861e-16 -7.78011861e-16]
phat [ 0.00000000e+00 -4.72317791e-02 -1.23519843e-01  1.79353779e-02
  1.52816244e-01 -7.78011861e-16 -7.78011861e-16 -7.78011861e-16
 -7.78011861e-16 -7.78011861e-16]
v [ 0.00000000e+00 -9.44635581e-02 -1.99807907e-01  1.59390599e-01
  2.87697110e-01 -1.52816244e-01 -7.78011861e-16 -7.78011861e-16
 -7.78011861e-16 -7.78011861e-16]
rtilde [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
Current rv : -4.96177387315249e-16
Current alpha : 0.40779583781096335

So that rv is pretty much zero anyways so it is an accident that things go through on other architectures. Since the array is as well-conditioned (a bidiagonal array with positive diagonal) as it can be, I'll open a PR to skip this solver in that particular test and hopefully we can leave this behind 😄

@h-vetinari
Copy link
Member Author

h-vetinari commented Jun 17, 2024

Since rv is the dot product of rtilde and v I can't understand how rv can get to 0.0

Looks like lots and lots of cancellation. Perhaps it really is perfectly orthogonal in this very well-conditioned example. But that makes me kind of wary to skip this test. Even if rv == 0 is incredibly lucky, it seems like a valid condition to hit, so we might want to change:

rv = dotprod(rtilde, v)
if rv == 0:
return postprocess(x), -11
alpha = rho / rv

to do something reasonable if other quantities (rho?) are small. Especially since the rho-condition sounds completely unattainable for me, as we're taking eps ** 2, which is tiiiiiiiny:
# These values make no sense but coming from original Fortran code
# sqrt might have been meant instead.
rhotol = np.finfo(x.dtype.char).eps**2
omegatol = rhotol

Since the array is as well-conditioned (a bidiagonal array with positive diagonal) as it can be, [...]

My conclusion is the opposite. Such a well-behaved example should not fail to converge, especially for an algorithm that advertises itself as stabilized.

@ilayn
Copy link
Member

ilayn commented Jun 17, 2024

Yes but this method is known to breakdown in many cases for example relatively large imaginary parts compared to real parts or problems where search direction and rtilde comes very close.

Another type of system from the literature is

image

which is quite close to our example.

Because we don't use the randomized rtilde to start but use actual r this is sometimes unavoidable. It is still better than bicg but as you can see the number of solvers in sparse.linalg it is not the end of the story.

@h-vetinari
Copy link
Member Author

Closing this as fixed by #20973

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.sparse.linalg
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants