Failing tests for truncated QR routine in coverage build #956

Closed
ACSimon33 opened this issue Dec 1, 2023 · 10 comments · Fixed by #961

@ACSimon33
Contributor

Description

Our coverage build was broken after the upgrade to 3.12.0 which led me to the bug in the lapack_testing.py script (see #954). After fixing that bug it was revealed that there are some tests which only fail in the coverage build:

                        -->   LAPACK TESTING SUMMARY  <--
                Processing LAPACK Testing output found in the TESTING directory
SUMMARY                 nb test run     numerical error         other error  
================        ===========     =================       ================  
REAL                    1328283         36885   (2.777%)        0       (0.000%)
DOUBLE PRECISION        1329105         36885   (2.775%)        0       (0.000%)
COMPLEX                 788035          36885   (4.681%)        0       (0.000%)
COMPLEX16               1029705         1       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      4475128         110656  (2.473%)        0       (0.000%)

These tests are all related to the truncated QR routines:

testing_results.txt: SQK:  36885 out of 241365 tests failed to pass the threshold
testing_results.txt: DQK:  36885 out of 241365 tests failed to pass the threshold
testing_results.txt: CQK:  36885 out of 241695 tests failed to pass the threshold
Test ratios:
    1: 2-norm(svd(A) - svd(R)) / ( max(M,N) * 2-norm(svd(R)) * EPS )
    2: 1-norm( A*P - Q*R ) / ( max(M,N) * 1-norm(A) * EPS )
    3: 1-norm( I - Q'*Q ) / ( M * EPS )
    4: Returns 1.0D+100, if abs(R(K+1,K+1)) > abs(R(K,K)), where K=1:KFACT-1
    5: 1-norm(Q**T * B - Q**T * B ) / ( M * EPS )
 Messages:
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   1, NX =   1, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    3, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   1, NX =   1, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   0, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    3, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   0, type  2, test  4, ratio = 0.15179E+73
 DGEQP3RK M =    2, N =    2, NRHS =    1, KMAX =    2, ABSTOL = -1.0000    , RELTOL = -1.0000    , NB =   3, NX =   5, type  2, test  4, ratio = 0.15179E+73

It is always the 4th test that fails, for all kinds of matrices. Weirdly, the COMPLEX16 routines don't have that issue, and if I build without LAPACKE, the COMPLEX tests are also fine. To reproduce this issue, just build with -DCMAKE_BUILD_TYPE=coverage.


Hi @scr2016, I guess you know the most about these routines. Do you have any ideas about what might go wrong here?

@dklyuchinskiy
Contributor

@ACSimon33 In my environment, I reproduced these failures even without the coverage build.

The root cause looks like an uninitialized variable, RESULT( 4 ), inside the test routine, for example in TESTING/LIN/dchkqp3rk.f. It initially contains garbage, and it stays uninitialized in the normal case because the condition in

              IF( DTEMP.LT.ZERO ) THEN
                  RESULT( 4 ) = BIGNUM
              END IF

is false.

That is why the final threshold check

IF( RESULT( 4 ).GE.THRESH ) THEN

evaluates to true on the garbage value, which leads to the test failures.

Somewhere above we should set

RESULT( 4 ) = ZERO

@ACSimon33
Contributor Author

ACSimon33 commented Dec 15, 2023

@dklyuchinskiy Nice catch! Should I create a MR or do you want to do that?

@dklyuchinskiy
Contributor

dklyuchinskiy commented Dec 15, 2023

> @dklyuchinskiy Nice catch! Should I create a MR or do you want to do that?

@ACSimon33 I would be glad if you created the MR and checked the fix with the coverage build. I have not worked with it before.

Also, I am confused by some other places in the test.

  1. According to the documentation, condition 4 is
Returns 1.0D+100 if abs(R(K+1,K+1)) > abs(R(K,K)),  K=1:KFACT-1
The elements on the diagonal of R should be non-increasing.

But after that we check the condition

                        DTEMP = (( ABS( A( (J-1)*M+J ) ) -
     $                          ABS( A( (J)*M+J+1 ) ) ) /
     $                          ABS( A(1) ) )

The indices point to sub-diagonal elements of A (or R). Does this match the documentation?

  2. In the formula above we should use LDA instead of M, I guess.

Please correct me if I am wrong.

@ACSimon33
Contributor Author

ACSimon33 commented Dec 15, 2023

@dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.
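
The index arithmetic can be checked with a short sketch, assuming 1-based column-major storage as in Fortran. It is written in Python for illustration only; `flat_index` and `diagonal_pair` are hypothetical helper names, not LAPACK routines.

```python
def flat_index(i, j, lda):
    """1-based position of A(i, j) in column-major storage with leading dimension lda."""
    return (j - 1) * lda + i


def diagonal_pair(j, m):
    """The two indices from the formula in the test: (J-1)*M+J and J*M+J+1."""
    return (j - 1) * m + j, j * m + j + 1


# Example from the comment above: M = 10, J = 1 gives A(1) and A(12).
first, second = diagonal_pair(1, 10)
```

When the leading dimension equals M, the pair is exactly the positions of A(J,J) and A(J+1,J+1). With a leading dimension larger than M, `flat_index(2, 2, lda)` no longer matches the second index, which is why using LDA in the formula is the more robust choice.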

I agree that we should use LDA even if it doesn't make a difference for the test (LDA=max(1,M)) because the test is only executed if the matrix rank is greater than 2.

@scr2016
Contributor

scr2016 commented Dec 15, 2023

This bug is currently being fixed. It concerns test number 4, which currently does not affect the correctness of the routine's code or results. The test should check (with some care) whether the absolute values of the diagonal elements are non-increasing.

@ACSimon33 could you please provide:

  1. the information about your system environment;
  2. If the failing-test report that you provided in your original bug report is not complete (i.e. truncated), please provide the full output.

Thank you.

@ACSimon33
Contributor Author

Hi @scr2016,
please have a look at the PR that is linked in this issue. The problem was just an uninitialized RESULT vector, as far as I can tell. At least it fixed the issue on my side, and all tests are passing now.

I can reproduce the old errors tomorrow if you think that it’s still necessary.

@scr2016
Contributor

scr2016 commented Dec 15, 2023 via email

@dklyuchinskiy
Contributor

> @dklyuchinskiy I think the indices are actually pointing to the diagonal since Fortran is 1-indexed. So, for example if M=10 and J=1 it will be (A(1) - A(12))/A(1), which is the first diagonal element minus the second one scaled by the first. So, the test itself is correct.

@ACSimon33 Yep, thank you for the explanation! You are right! My fault :)

@ACSimon33
Contributor Author

ACSimon33 commented Dec 18, 2023

> @ACSimon33. The complete test error output and the environment information would help to check the issue thoroughly. Thank you in advance.

@scr2016 Here are the complete test results:
LAPACK_test_results.txt

I compiled with GCC 13.2 on CentOS Linux 7 (Core). The issues only appear in the coverage build for me:

mkdir build && cd build
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_Fortran_COMPILER=gfortran -DCMAKE_BUILD_TYPE=coverage ..
make -j8
ctest -j8

@ACSimon33
Contributor Author

ACSimon33 commented Dec 18, 2023

@scr2016 I tried some more GCC versions (4.8.5, 5.5.0, 6.5.0, 7.5.0, 8.4.0, 9.3.0, 10.3.0, 12.2.0, 13.2.0). The issue only exists for GCC >= 7.5.0.
