Possible problem either in `NM_csc` or `NM_dense_to_sparse` #467

radarsat1 · 2022-09-25T21:02:24Z

This is a bit hard to describe so please bear with me.

Background: a "benefit" of having the package building on Debian is that it gets tested on a variety of CPU architectures. After the recent update to the package to Siconos 4.4.0, a bug has been reported with the test NM_test failing on architecture s390x.

Now, you could chalk this up to numerical differences on a platform we don't care about, but I think there is actually a problem revealed here, so I'll give all the details. Feel free to skip to tldr section at the bottom to reproduce.

The corresponding log can be found here, where you can see the following error message:

NM_test: ./numerics/src/tools/NumericsMatrix.c:3052: NM_csc: Assertion `A->matrix2->csc->m == A->size0 && "inconsistent size of csc storage"' failed.

I've had a chance this weekend to look into it, and the crash is where NM_norm_1 is called in the function test_NM_LU_solve_matrix_rhs_unit here.

It's a bit hard to debug because of having to run it under qemu, where it seems that gdb is useless.

But, after a lot of printf-tracing, it seems that what is happening is the following call stack:

NM_norm_1 -> NM_csc -> NM_triplet -> NM_dense_to_sparse

Then, in NM_csc, it checks that the resulting CSC matrix has the same size as the original dense matrix. This is what is failing.

But, looking at how NM_dense_to_sparse works, it just adds all non-zero values to an empty CSC array, and counts up the number of rows and columns that have any non-zero values.

Here is the difference. It seems that due to numerical architecture differences, on x86_64 this is the matrix that is given to NM_dense_to_sparse:

 2.2e-16,  1.5e-16,  3.6e-16,        0, -8.3e-17,  1.1e-16, -1.3e-16, -1.5e-16
-6.9e-18,        0, -3.6e-16, -2.8e-17, -1.4e-16,  1.1e-16,  2.1e-17,  4.2e-17
 5.9e-17, -1.2e-16,  4.4e-16,  2.4e-17,  2.8e-16, -1.6e-17, -8.3e-17, -8.3e-17
 6.9e-17, -1.4e-16,    5e-16,        0,  5.6e-17,  2.2e-16, -9.7e-17, -2.5e-16
 6.9e-18, -1.4e-17,  2.8e-17,        0,        0, -5.2e-18,  6.9e-18,  1.4e-17
       0,        0,        0,        0,        0,        0,        0,        0
       0,        0, -5.6e-17,  6.9e-18,        0,  3.5e-18,        0,  2.8e-17
       0,        0, -1.1e-16,        0, -1.1e-16,  3.5e-18, -4.2e-17,  4.4e-16

whereas on s390x, this is the matrix:

2.2e-16, -2.2e-16,  3.6e-16, 5.6e-17, -5.6e-17,  1.1e-16,    9e-17,  1.8e-16
1.1e-16,        0, -3.6e-16, 8.3e-17, -1.9e-16,  1.1e-16,  2.1e-17,  4.2e-17
5.2e-17,   -1e-16,  4.4e-16, 3.5e-17,  1.4e-16, -1.9e-17, -8.3e-17, -1.1e-16
      0,        0,  1.4e-15,       0,  1.7e-16, -2.2e-16, -9.7e-17, -1.9e-16
      0,        0,  2.8e-17,       0,        0, -5.2e-18,  6.9e-18,  1.4e-17
      0,        0,        0,       0,        0,        0,        0,        0
      0,        0,  5.6e-17,       0,        0,        0,        0,  2.8e-17
4.2e-17, -8.3e-17,  1.7e-16, 1.4e-17,  1.1e-16, -1.4e-17, -4.2e-17, -1.1e-16

Now, the important thing is to know that DBL_EPSILON is the threshold used for determining sparse values in NM_dense_to_sparse, which is about 2.2e-16.

You can see that in the case of x86_64, the last value is 4.4e-16, which is > DBL_EPSILON. However, on s390x, all values in the last 4 rows and columns are < DBL_EPISLON.

So in the x86_64 case, csc->m = 8, but in the s390x case, csc->m = 4.

Now, the assertion in NM_csc seems to make sense:

assert(A->matrix2->csc->m == A->size0 && "inconsistent size of csc storage");

The loop that populates the CSC matrix in NM_dense_to_sparse is the following:

for(int i = 0; i < A->size0; ++i)
  {
    for(int j = 0; j < A->size1; ++j)
    {
      CHECK_RETURN(CSparseMatrix_zentry(B->matrix2->triplet, i, j, A->matrix0[i + A->size0*j], threshold));
    }
  }

which increments m and n whenever i+1 or j+1 are greater, but only when the value is < threshold! There is no follow-up code to set m and n to the dense matrix size.

Therefore, csc->m will only contain the number of rows containing non-zero values, NOT necessarily the same size as the original dense, it depends on what values are in the dense matrix!

So, the reason I don't supply a patch with this bug report, is that I am not sure which is correct. Either NM_dense_to_sparse is correct in sort of "dropping" the last zero-valued rows and columns, and therefore the assertion in NM_csc is incorrect, OR, NM_dense_to_sparse should contain some code setting n and m to the values of size0 and size1.

Which do you think it is?

tldr

If anyone wants to reproduce this I can provide instructions on how to set up qemu, but you can also just copy and paste the values I give above and witness that NM_csc asserts in the second case, even on x86_64:

  double data[] = {
    2.2e-16, -2.2e-16,  3.6e-16, 5.6e-17, -5.6e-17,  1.1e-16,    9e-17,  1.8e-16,
    1.1e-16,        0, -3.6e-16, 8.3e-17, -1.9e-16,  1.1e-16,  2.1e-17,  4.2e-17,
    5.2e-17,   -1e-16,  4.4e-16, 3.5e-17,  1.4e-16, -1.9e-17, -8.3e-17, -1.1e-16,
          0,        0,  1.4e-15,       0,  1.7e-16, -2.2e-16, -9.7e-17, -1.9e-16,
          0,        0,  2.8e-17,       0,        0, -5.2e-18,  6.9e-18,  1.4e-17,
          0,        0,        0,       0,        0,        0,        0,        0,
          0,        0,  5.6e-17,       0,        0,        0,        0,  2.8e-17,
    4.2e-17, -8.3e-17,  1.7e-16, 1.4e-17,  1.1e-16, -1.4e-17, -4.2e-17, -1.1e-16
  };
  NumericsMatrix *r = NM_create_from_data(NM_DENSE, 8, 8, data);
  double res = NM_norm_1(r);

The text was updated successfully, but these errors were encountered:

vacary · 2022-09-29T09:12:44Z

Hi Steve,

I already had this problem with matrices that vanishes in the iteration of an interior point method. In that case, NM_dense_to_sparse remove entire rows and columns of the matrices.

In csparse, if the last column is empty, the size m is reduced and the matrix can become a non square matrix. In the latter case, some methods like LU and Cholesky may fail. It is somehow logical that a non square matrix is not invertible.

I think we must separate the size of the NumericsMatrix from the size in the csparse format csc->m = 4. I want to say that the assertion is wrong. At least, we can set a warning message if the size is reduced.

radarsat1 · 2022-09-29T12:29:58Z

Yes, I think that's my diagnosis as well. I was not sure if that's what is going on, but it seems maybe the issue is that the n and m of csc have a double role here to track the size of the information as well as the size of the matrix. Don't know how complicated this would make the solution unfortunately, as it seems like it might change the interface to CSC somewhat. Hopefully it's possible to adjust in a low-impact way.

vacary closed this as completed Dec 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible problem either in `NM_csc` or `NM_dense_to_sparse` #467

Possible problem either in `NM_csc` or `NM_dense_to_sparse` #467

radarsat1 commented Sep 25, 2022

vacary commented Sep 29, 2022

radarsat1 commented Sep 29, 2022

Possible problem either in NM_csc or NM_dense_to_sparse #467

Possible problem either in NM_csc or NM_dense_to_sparse #467

Comments

radarsat1 commented Sep 25, 2022

vacary commented Sep 29, 2022

radarsat1 commented Sep 29, 2022

Possible problem either in `NM_csc` or `NM_dense_to_sparse` #467

Possible problem either in `NM_csc` or `NM_dense_to_sparse` #467