This is a bit hard to describe so please bear with me.
Background: a "benefit" of having the package building on Debian is that it gets tested on a variety of CPU architectures. After the recent update of the package to Siconos 4.4.0, a bug was reported: the test NM_test fails on the s390x architecture.
Now, you could chalk this up to numerical differences on a platform we don't care about, but I think there is actually a problem revealed here, so I'll give all the details. Feel free to skip to the tldr section at the bottom to reproduce.
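The corresponding log can be found here, where you can see the following error message:

I've had a chance this weekend to look into it, and the crash is where NM_norm_1 is called in the function test_NM_LU_solve_matrix_rhs_unit here. It's a bit hard to debug because it has to run under qemu, where gdb seems to be useless. But, after a lot of printf-tracing, it seems that what is happening is the following call stack:

Then, in NM_csc, it checks that the resulting CSC matrix has the same size as the original dense matrix. This is what is failing.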
But looking at how NM_dense_to_sparse works, it just adds all non-zero values to an initially empty triplet matrix (later compressed to CSC), and the row and column counts only grow to cover the largest row and column index that holds a non-zero value.
Here is the difference. Apparently due to floating-point differences between the architectures, on x86_64 this is the matrix that is given to NM_dense_to_sparse:

whereas on s390x, this is the matrix:
Now, the important thing is to know that DBL_EPSILON is the threshold used for determining sparse values in NM_dense_to_sparse, which is about 2.2e-16.
You can see that in the case of x86_64, the last value is 4.4e-16, which is > DBL_EPSILON. However, on s390x, all values in the last 4 rows and columns are < DBL_EPSILON.
So in the x86_64 case, csc->m = 8, but in the s390x case, csc->m = 4.
Now, the assertion in NM_csc seems to make sense:

```c
assert(A->matrix2->csc->m == A->size0 && "inconsistent size of csc storage");
```
The loop that populates the CSC matrix in NM_dense_to_sparse is the following:
```c
for(int i = 0; i < A->size0; ++i)
{
  for(int j = 0; j < A->size1; ++j)
  {
    CHECK_RETURN(CSparseMatrix_zentry(B->matrix2->triplet, i, j, A->matrix0[i + A->size0*j], threshold));
  }
}
```
Each CSparseMatrix_zentry call increases m and n whenever i+1 or j+1 exceeds them, but only when the value is not below threshold! Values under the threshold are skipped entirely, and there is no follow-up code to set m and n to the dense matrix size.
Therefore, csc->m only reflects the rows up to the last one containing a value above the threshold, NOT necessarily the size of the original dense matrix; it depends on what values are in the dense matrix!
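The size bookkeeping described above can be reproduced without Siconos. The following is a minimal C sketch, not Siconos code: stored_extent is a hypothetical helper that scans a column-major dense array in the same order as the loop above and grows m and n the way CSparse's cs_entry does (m = max row index + 1, n = max column index + 1). The filter |x| >= threshold is an assumption about CSparseMatrix_zentry's exact comparison.

```c
#include <assert.h>
#include <float.h> /* DBL_EPSILON, the threshold quoted in this report */
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not Siconos code: scans a column-major dense
 * size0 x size1 array in the same order as the NM_dense_to_sparse loop and
 * returns the dimensions a cs_entry-style triplet would end up with.
 * Assumption: an entry is stored only when |x| >= threshold. */
static void stored_extent(const double *dense, int size0, int size1,
                          double threshold, int *m, int *n)
{
  assert(dense != NULL && m != NULL && n != NULL);
  *m = 0; /* the triplet starts out empty (0 x 0) */
  *n = 0;
  for (int i = 0; i < size0; ++i)
  {
    for (int j = 0; j < size1; ++j)
    {
      if (fabs(dense[i + size0 * j]) >= threshold)
      {
        /* like cs_entry: grow m and n to cover the largest index stored */
        if (i + 1 > *m) *m = i + 1;
        if (j + 1 > *n) *n = j + 1;
      }
    }
  }
}
```

For a hypothetical 8x8 matrix whose top-left 4x4 block is the identity, setting entry (7, 7) to 4.4e-16 gives m = n = 8 with threshold DBL_EPSILON, while 1e-17 gives m = n = 4, mirroring the x86_64 and s390x results. Zeroing an interior diagonal entry such as (2, 2) leaves m at 4: only trailing rows and columns disappear.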
So the reason I don't supply a patch with this bug report is that I am not sure which behaviour is correct. Either NM_dense_to_sparse is correct in sort of "dropping" the trailing zero-valued rows and columns, and therefore the assertion in NM_csc is incorrect, OR NM_dense_to_sparse should contain some code setting m and n to size0 and size1.
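The second option can be sketched with a toy triplet whose dimensions follow the same grow-only rule as cs_entry. toy_triplet, toy_entry and toy_dense_to_triplet are hypothetical names, and the |x| >= threshold filter is again assumed; the only change is whether the dimensions start from zero or from the dense size.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical toy triplet: only the bookkeeping relevant to this issue. */
typedef struct
{
  int m, n, nz;
} toy_triplet;

/* Same growth rule as cs_entry: dimensions only ever increase. */
static void toy_entry(toy_triplet *T, int i, int j)
{
  assert(i >= 0 && j >= 0);
  if (i + 1 > T->m) T->m = i + 1;
  if (j + 1 > T->n) T->n = j + 1;
  T->nz++;
}

/* keep_dense_size = 0 reproduces the current behaviour (start from 0 x 0);
 * keep_dense_size = 1 is the second option: start from size0 x size1 so
 * trailing rows/columns that fall below the threshold still count. */
static toy_triplet toy_dense_to_triplet(const double *dense, int size0, int size1,
                                        double threshold, int keep_dense_size)
{
  toy_triplet T = {0, 0, 0};
  if (keep_dense_size)
  {
    T.m = size0;
    T.n = size1;
  }
  for (int i = 0; i < size0; ++i)
    for (int j = 0; j < size1; ++j)
      if (fabs(dense[i + size0 * j]) >= threshold)
        toy_entry(&T, i, j);
  return T;
}
```

With a hypothetical 8x8 input that is the identity in its top-left 4x4 block and zero elsewhere, keep_dense_size = 0 yields a 4x4 triplet with 4 entries, while keep_dense_size = 1 yields an 8x8 triplet with the same 4 entries. Because entries can only grow m and n, starting from the dense size is equivalent to overwriting m and n after the loop; in CSparse terms this would presumably amount to allocating the triplet with cs_spalloc(size0, size1, ...) rather than as an empty matrix.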
Which do you think it is?
tldr
If anyone wants to reproduce this, I can provide instructions on how to set up qemu, but you can also just copy and paste the values I give above and witness that NM_csc asserts in the second case, even on x86_64:
I already had this problem with matrices that vanish during the iterations of an interior point method. In that case, NM_dense_to_sparse removes entire rows and columns of the matrices.
In csparse, if the trailing rows (or columns) are empty, the size m (or n) is reduced and the matrix can become non-square. In that case, some methods like LU and Cholesky may fail, which is somewhat logical since a non-square matrix is not invertible.
I think we must separate the size of the NumericsMatrix from the size stored in the csparse format (here csc->m = 4). I would say the assertion is wrong. At the very least, we could emit a warning when the size is reduced.
Yes, I think that's my diagnosis as well. I was not sure that's what was going on, but it seems the issue is that the n and m of csc play a double role here: they track the extent of the stored entries as well as the dimensions of the matrix. I don't know how complicated the fix would be, unfortunately, since it might change the CSC interface somewhat. Hopefully it's possible to adjust it in a low-impact way.
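Separating the logical size from the storage extent, as suggested above, could take roughly the following shape. This is a hypothetical sketch (check_csc_size and csc_size_status are not Siconos names) of replacing the hard assertion in NM_csc with a check that only warns when trailing rows or columns were dropped, while still rejecting storage that is larger than the matrix.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes for comparing the logical size of a
 * NumericsMatrix (size0 x size1) with the extent of its csc storage (m x n). */
typedef enum
{
  CSC_SIZE_OK,      /* storage extent matches the logical size */
  CSC_SIZE_REDUCED, /* trailing all-zero rows/columns were dropped */
  CSC_SIZE_INVALID  /* storage claims more rows/columns than the matrix */
} csc_size_status;

static csc_size_status check_csc_size(int size0, int size1, int m, int n)
{
  assert(size0 >= 0 && size1 >= 0);
  if (m == size0 && n == size1)
    return CSC_SIZE_OK;
  if (m > size0 || n > size1)
    return CSC_SIZE_INVALID; /* entries outside the logical matrix: a real bug */
  /* smaller extent: warn instead of aborting */
  fprintf(stderr, "warning: csc storage is %d x %d but the matrix is %d x %d\n",
          m, n, size0, size1);
  return CSC_SIZE_REDUCED;
}
```

For the s390x case, check_csc_size(8, 8, 4, 4) prints the warning and returns CSC_SIZE_REDUCED. Solvers such as LU or Cholesky would still need the logical size0 x size1 dimensions, so a warning alone does not make the reduced storage usable; it only stops the assertion from firing.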