Failing tests for truncated QR routine in coverage build #956

ACSimon33 · 2023-12-01T14:33:56Z

Description

Our coverage build broke after the upgrade to 3.12.0, which led me to the bug in the lapack_testing.py script (see #954). After fixing that bug, it turned out that some tests fail only in the coverage build:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1328283         36885   (2.777%)        0       (0.000%)
DOUBLE PRECISION        1329105         36885   (2.775%)        0       (0.000%)
COMPLEX                 788035          36885   (4.681%)        0       (0.000%)
COMPLEX16               1029705         1       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      4475128         110656  (2.473%)        0       (0.000%)

These tests are all related to the truncated QR routines:

testing_results.txt: SQK:  36885 out of 241365 tests failed to pass the threshold
testing_results.txt: DQK:  36885 out of 241365 tests failed to pass the threshold
testing_results.txt: CQK:  36885 out of 241695 tests failed to pass the threshold

Test ratios:
    1: 2-norm(svd(A) - svd(R)) / ( max(M,N) * 2-norm(svd(R)) * EPS )
    2: 1-norm( A*P - Q*R ) / ( max(M,N) * 1-norm(A) * EPS )
    3: 1-norm( I - Q'*Q ) / ( M * EPS )
    4: Returns 1.0D+100, if abs(R(K+1,K+1)) > abs(R(K,K)), where K=1:KFACT-1
    5: 1-norm(Q**T * B - Q**T * B ) / ( M * EPS )
 Messages:
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   1, NX =   1, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    3, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   1, NX =   1, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   0, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    3, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   0, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   5, type  2, test  4, ratio = 0.15179E+73

It is always the 4th test that fails, for all kinds of matrices. Weirdly, the COMPLEX16 routines don't have that issue, and if I build without LAPACKE the COMPLEX tests are also fine. To reproduce this issue, just build with -DCMAKE_BUILD_TYPE=coverage.

Hi @scr2016, I guess you know the most about these routines. Do you have any ideas about what might go wrong here?


dklyuchinskiy · 2023-12-14T06:12:46Z

@ACSimon33 In my environment, I reproduced these failures even in a non-coverage build.

It looks like the root cause is the uninitialized variable RESULT( 4 ) inside the test routine, for example in TESTING/LIN/dchkqp3rk.f. It initially contains garbage, and it stays unassigned because the condition

              IF( DTEMP.LT.ZERO ) THEN
                  RESULT( 4 ) = BIGNUM
              END IF

is false in the normal case.

That is why the final check against the threshold

IF( RESULT( 4 ).GE.THRESH ) THEN

can be true even though nothing went wrong, which leads to the spurious test failures.

Somewhere above we should set

RESULT( 4 ) = ZERO
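
The failure mode described above can be reproduced in miniature. The following is a minimal Python sketch, not the LAPACK test code: the function `run_check`, the value of `THRESH`, and the `stale` list standing in for uninitialized memory are all assumptions made for illustration.

```python
BIGNUM = 1.0e100   # value test 4 reports on a genuine failure
THRESH = 30.0      # hypothetical threshold, chosen only for this sketch
ZERO = 0.0


def run_check(result, dtemp, initialize):
    """Mimic the conditional assignment to RESULT(4) and the threshold check.

    `result` plays the role of the RESULT array (index 3 is RESULT(4)).
    Returns True if the test would be reported as failed.
    """
    if initialize:
        result[3] = ZERO      # the proposed fix: set RESULT(4) before the check
    if dtemp < ZERO:
        result[3] = BIGNUM    # only assigned when the diagonal increases
    return result[3] >= THRESH


# Stale contents standing in for whatever an uninitialized array may hold.
stale = [0.0, 0.0, 0.0, 0.15179e73]

# dtemp >= 0 means the diagonal is fine, yet the stale value trips the check.
print(run_check(list(stale), dtemp=0.5, initialize=False))   # True (spurious failure)
print(run_check(list(stale), dtemp=0.5, initialize=True))    # False
print(run_check(list(stale), dtemp=-0.5, initialize=True))   # True (genuine failure)
```

Whether real uninitialized memory happens to hold a large value depends on the compiler and build flags, which would be consistent with the failures showing up only in some builds.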

ACSimon33 · 2023-12-15T08:28:50Z

@dklyuchinskiy Nice catch! Should I create a MR or do you want to do that?

dklyuchinskiy · 2023-12-15T09:17:11Z

> @dklyuchinskiy Nice catch! Should I create a MR or do you want to do that?

@ACSimon33 I would be glad if you created the MR and checked the fix with the coverage build. I have not worked with it before.

Also, I am confused by some other places in the test.

According to the documentation, condition 4 is

Returns 1.0D+100 if abs(R(K+1,K+1)) > abs(R(K,K)),  K=1:KFACT-1
The elements on the diagonal of R should be non-increasing.

But after that we check the condition

                        DTEMP = (( ABS( A( (J-1)*M+J ) ) -
     $                          ABS( A( (J)*M+J+1 ) ) ) /
     $                          ABS( A(1) ) )

The indices seem to point to sub-diagonal elements of A (or R). Does that match the documentation?

In the formula above we should use LDA instead of M, I guess.

Please correct me if I am wrong.

ACSimon33 · 2023-12-15T18:06:20Z

@dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.

I agree that we should use LDA even if it doesn't make a difference for the test (LDA=max(1,M)) because the test is only executed if the matrix rank is greater than 2.
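
The index argument above can be checked mechanically. This is a small Python sketch under the assumption that A is stored column-major with 1-based indices, as in Fortran; the helper names `linear_index` and `indices_used_by_test` and the value `LDA = 12` are made up for this illustration.

```python
def linear_index(i, j, lda):
    """1-based column-major position of element (i, j) with leading dimension lda."""
    return (j - 1) * lda + i


def indices_used_by_test(j, m):
    """The two linear indices that appear in the DTEMP expression."""
    return (j - 1) * m + j, j * m + j + 1


M = 10
for J in range(1, M):
    first, second = indices_used_by_test(J, M)
    # With LDA == M both indices land on consecutive diagonal elements.
    assert first == linear_index(J, J, lda=M)
    assert second == linear_index(J + 1, J + 1, lda=M)

print(indices_used_by_test(1, M))  # (1, 12)

# With a hypothetical leading dimension larger than M, the M-based formula
# no longer points at the diagonal element (3, 3).
LDA = 12
print(indices_used_by_test(2, M)[1], linear_index(3, 3, LDA))  # 23 27
```

So the formula is correct as long as LDA equals M, and using LDA in the expression would keep it correct if the two ever differed.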

scr2016 · 2023-12-15T20:16:27Z

A fix for this bug is currently in progress. This is test number 4, which currently does not affect the correctness of the routine code or its results. The test should check (with some care) whether the absolute values of the diagonal elements are non-increasing.

@ACSimon33 could you please provide:

the information about your system environment;
the full output, if the failing test report that you provided in your original bug report is not complete (i.e. truncated).

Thank you.

ACSimon33 · 2023-12-15T20:59:39Z

Hi @scr2016,
please have a look at the PR which is linked in this issue. The problem was just an uninitialized RESULT vector as far as I can tell. At least it fixed the issue on my side and all tests are passing now.

I can reproduce the old errors tomorrow if you think that it’s still necessary.

scr2016 · 2023-12-15T21:22:10Z

@ACSimon33. The complete test error output and the environment information would help to check the issue thoroughly. Thank you in advance.

dklyuchinskiy · 2023-12-18T07:50:47Z

> @dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.

@ACSimon33 Yep, thank you for the explanation! You are right, my mistake :)

ACSimon33 · 2023-12-18T12:29:57Z

> @ACSimon33. The complete test error output and the environment information would help to check the issue thoroughly. Thank you in advance.

@scr2016 Here are the complete test results:
LAPACK_test_results.txt

I compiled with GCC 13.2 on CentOS Linux 7 (Core). The issues only appear in the coverage build for me:

mkdir build && cd build
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_Fortran_COMPILER=gfortran -DCMAKE_BUILD_TYPE=coverage ..
make -j8
ctest -j8

ACSimon33 · 2023-12-18T12:51:29Z

@scr2016 I tried some more GCC versions (4.8.5, 5.5.0, 6.5.0, 7.5.0, 8.4.0, 9.3.0, 10.3.0, 12.2.0, 13.2.0). The issue only exists for GCC >= 7.5.0.

ACSimon33 added the Type: Bug label Dec 1, 2023

ACSimon33 mentioned this issue Dec 15, 2023

Fixed usage of uninitialized variables in TESTING #961

Merged

1 task

scr2016 self-assigned this Dec 15, 2023

langou closed this as completed in #961 Jun 21, 2024
