Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory fault during unsymmetric scaling of singular matrix with Hungarian algorithm #200

Open
chrhansk opened this issue Jun 8, 2024 · 10 comments · May be fixed by #205
Open

Memory fault during unsymmetric scaling of singular matrix with Hungarian algorithm #200

chrhansk opened this issue Jun 8, 2024 · 10 comments · May be fixed by #205
Labels

Comments

@chrhansk
Copy link
Contributor

chrhansk commented Jun 8, 2024

Consider the following example problem:

#include <stdlib.h>
#include <stdio.h>
#include "spral.h"

int main(void) {
   /* Derived types */
   struct spral_scaling_hungarian_options options;
   struct spral_scaling_hungarian_inform inform;

   /* Other variables */
   int match[5];
   double rscaling[5], cscaling[5];

   /* Data for unsymmetric matrix:
    * ( 2,  1,  0,  0,  0)
    * ( 1,  4,  1,  0,  1)
    * ( 0,  0,  0,  0,  0)
    */
   int m = 3, n = 5;
   int ptr[]    = { 1, 3, 5, 6, 6, 7};
   int row[]    = { 1, 2, 1, 2, 2, 2};
   double val[] = { 2., 1., 1., 4., 1., 1. };
   printf("Initial matrix:\n");
   spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 1);

   /* Perform unsymmetric scaling */
   spral_scaling_hungarian_default_options(&options);

   options.array_base = 1;
   options.scale_if_singular = true;

   spral_scaling_hungarian_unsym(m, n, ptr, row, val, rscaling, cscaling, match,
         &options, &inform);

   if(inform.flag<0) {
      printf("spral_scaling_hungarian_unsym() returned with error %5d", inform.flag);
      exit(1);
   }

   /* Print scaling and matching */
   printf("Matching:");
   for(int i=0; i<n; i++) printf(" %10d", match[i]);
   printf("\nRow Scaling: ");
   for(int i=0; i<m; i++) printf(" %10.2le", rscaling[i]);
   printf("\nCol Scaling: ");
   for(int i=0; i<n; i++) printf(" %10.2le", cscaling[i]);
   printf("\n");

   return 0;
}

This outputs the following:

Initial matrix:
Real unsymmetric matrix, dimension 3x5 with 6 entries.
1:   2.0000E+00   1.0000E+00                                       
2:   1.0000E+00   4.0000E+00   1.0000E+00                1.0000E+00
3:                                                                 
Matching:          1          5         -3      32522         -4
Row Scaling:    8.41e-01   8.41e-01   8.41e-01
Col Scaling:    5.95e-01   2.97e-01   1.19e+00   1.00e+00   1.19e+00

The matching has the clearly invalid entry 32522 (random, changing at each iteration). There is also a valgrind error:

==213605== Invalid write of size 8
==213605==    at 0x49AF0A5: __spral_scaling_MOD_match_postproc (scaling.f90:1675)
==213605==    by 0x49B59AB: __spral_scaling_MOD_hungarian_wrapper (scaling.f90:684)
==213605==    by 0x49B9679: __spral_scaling_MOD_hungarian_scale_unsym_int64 (scaling.f90:225)
==213605==    by 0x49B9C02: __spral_scaling_MOD_hungarian_scale_unsym_int32 (scaling.f90:202)
==213605==    by 0x48EA1D8: spral_scaling_hungarian_unsym (scaling.f90:729)
==213605==    by 0x10936A: main (hungarian_unsym.c:33)
==213605==  Address 0x7260180 is 32 bytes before a block of size 48 in arena "client"

The offending line attempts to access the rscaling variable at the match(i) index, where i = 3 and match = (1, 5, -3). It is clearly infeasible to access the array at a negative index, which is causing the initial memory fault. On the other hand, I don't know why the index becomes negative in the first place...

@mjacobse
Copy link
Collaborator

mjacobse commented Jun 8, 2024

The matching used in this scaling is documented to be based on MC64

spral/src/scaling.f90

Lines 936 to 938 in 662c7ac

! This code is adapted from MC64 v 1.6.0
!
subroutine hungarian_match(m,n,ptr,row,val,iperm,num,dualu,dualv,st)

which itself documents that "If the matrix is structurally singular [...], then the entries of [match] for which there is
no entry on the diagonal of the permuted matrix are set negative" (https://www.hsl.rl.ac.uk/specs/mc64.pdf). So in the example above, row 3 is causing the matrix to be structurally singular and supposedly that's why -3 is returned.

I'm not quite sure how the scaling has to handle such cases. Perhaps the negative entries of match should just be set to 0 to signal "unmatched", similar to what the auction algorithm does here

spral/src/scaling.f90

Lines 1487 to 1488 in 662c7ac

! We expect unmatched columns to have match(col) = 0
where(match(:) .eq. -1) match(:) = 0

such that they are skipped in match_postproc, e.g. the aforementioned offending line.

@chrhansk
Copy link
Contributor Author

chrhansk commented Jun 8, 2024

That is probably a good idea. Seeing as the spral_scaling_hungarian_options struct has an entry scale_if_singular, a user would expect that singular matrices can be scaled (albeit with a warning)...

@jfowkes
Copy link
Contributor

jfowkes commented Jun 10, 2024

It has been implemented in this way in MC64 so that the rows which cause the matrix to be structurally singular can be identified, but I agree that it probably makes sense to set the negative entries of match to 0. Happy to accept a PR.

@mjacobse
Copy link
Collaborator

mjacobse commented Jun 18, 2024

I noticed that this actually already goes wrong before what was mentioned. This was hidden with match being allocated larger than necessary in the example (see #207). When allocating with size m (on the heap instead of the stack to make it easier to detect with valgrind) like so

/* examples/C/scaling/hungarian_unsym.c - Example code for SPRAL_SCALING */
#include <stdlib.h>
#include <stdio.h>
#include "spral.h"

int main(void) {
   /* Derived types */
   struct spral_scaling_hungarian_options options;
   struct spral_scaling_hungarian_inform inform;

   /* Data for unsymmetric matrix:
    * ( 2  1         )
    * ( 1  4  1    1 )
    * (              ) */
   int m = 3, n = 5;
   int ptr[]    = { 0,        2,        4,   5,   5,  6 };
   int row[]    = { 0,   1,   0,   1,   1,   1   };
   double val[] = { 2.0, 1.0, 1.0, 4.0, 1.0, 1.0 };
   printf("Initial matrix:\n");
   spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 0);

   /* Perform symmetric scaling */
   int* match = malloc(sizeof(int)*m);
   double* rscaling = malloc(sizeof(double)*m);
   double* cscaling = malloc(sizeof(double)*n);
   spral_scaling_hungarian_default_options(&options);
   spral_scaling_hungarian_unsym(m, n, ptr, row, val, rscaling, cscaling, match,
         &options, &inform);
   if(inform.flag<0) {
      printf("spral_scaling_hungarian_unsym() returned with error %5d", inform.flag);
      exit(1);
   }

   /* Print scaling and matching */
   printf("Matching:");
   for(int i=0; i<m; i++) printf(" %10d", match[i]);
   printf("\nRow Scaling: ");
   for(int i=0; i<m; i++) printf(" %10.2le", rscaling[i]);
   printf("\nCol Scaling: ");
   for(int i=0; i<n; i++) printf(" %10.2le", cscaling[i]);
   printf("\n");

   /* Calculate scaled matrix and print it */
   for(int i=0; i<n; i++) {
      for(int j=ptr[i]; j<ptr[i+1]; j++)
         val[j] = rscaling[row[j]] * val[j] * cscaling[i];
   }
   printf("Scaled matrix:\n");
   spral_print_matrix(-1, SPRAL_MATRIX_REAL_UNSYM, m, n, ptr, row, val, 0);

   free(rscaling);
   free(cscaling);
   free(match);
   /* Success */
   return 0;
}

we already get an error inside of hungarian_match, without even setting options.scale_if_singular to true:

Initial matrix:
Real unsymmetric matrix, dimension 3x5 with 6 entries.
0:   2.0000E+00   1.0000E+00                                       
1:   1.0000E+00   4.0000E+00   1.0000E+00                1.0000E+00
2:                                                                 
==29266== Invalid write of size 4
==29266==    at 0x124E1B: __spral_scaling_MOD_hungarian_match (scaling.f90:1192)
==29266==    by 0x126BA1: __spral_scaling_MOD_hungarian_wrapper (scaling.f90:660)
==29266==    by 0x12AAFF: __spral_scaling_MOD_hungarian_scale_unsym_int64 (scaling.f90:225)
==29266==    by 0x12B0B6: __spral_scaling_MOD_hungarian_scale_unsym_int32 (scaling.f90:201)
==29266==    by 0x10F7F5: spral_scaling_hungarian_unsym (scaling.f90:723)
==29266==    by 0x1092AB: main (hungarian_unsym.c:27)
==29266==  Address 0x4ee02d0 is 4 bytes after a block of size 12 alloc'd
==29266==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==29266==    by 0x109258: main (hungarian_unsym.c:23)

The offending code

spral/src/scaling.f90

Lines 1174 to 1193 in 7fc5ec3

! Otherwise, matrix is structurally singular, complete iperm.
! jperm, out are work arrays
jperm(1:n) = 0
k = 0
do i = 1, m
if (iperm(i) .eq. 0) then
k = k + 1
out(k) = i
else
j = iperm(i)
jperm(j) = i
end if
end do
k = 0
do j = 1, n
if (jperm(j) .ne. 0) cycle
k = k + 1
jdum = int(out(k))
iperm(jdum) = -j
end do
looks dubious to me; for this example with one unmatched and two matched rows, one index is stored in out and two in jperm. That leaves three 0-entries in jperm, causing three indices to be read from out - despite it only containing one. So we essentially overflow and read what happens to remain in the work array from earlier, which just so happens to be out of bounds for iperm. I am not sure right now how this is supposed to work.

This "inner" issue definitely has to be fixed before fixing the "outer" issue can then also allow using options.scale_if_singular == true.

@jfowkes
Copy link
Contributor

jfowkes commented Jun 18, 2024

@nimgould any thoughts on what was intended here: #200 (comment)

@nimgould
Copy link
Contributor

nimgould commented Jun 19, 2024

It is unclear to me what the purpose of this last 1174-1193 section is. The documentation implies that either iperm(i) > 0 gives a match of row i to column iperm(i) or iperm(i) = 0, and there is no match. But this last loop seems to be setting negative values to unmatched rows; what does this mean? And as @mjacobse says there is an inconsistency between the counts for the variable k when marching through the rows (first loop) and columns (second loop). The iperm array becomes the match array in hungarian_scale_unsym & hungarian_scale_sym, and I see no attempt there to correct negative components of match. And hungarian_wrapper is not used anywhere else in spral.

I would be tempted to comment out this segment and see what happens. Perhaps Jonathan had something else in mind?

@mjacobse
Copy link
Collaborator

mjacobse commented Jun 19, 2024

Reassuring to hear that this section seems strange to you too, many thanks.

My understanding is that this function is from MC64, so while it is true that the SPRAL documentation says that 0 is returned for unmatched rows, MC64 itself returns negative entries according to its documentation (#200 (comment)). And indeed, as it stands the SPRAL documentation is incorrect, the hungarian_scale_[un]sym functions do return those negative entries as noticed in #205 (review).

Just returning 0 from hungarian_match leads to other problems in other SPRAL-internal uses of hungarian_match that do expect the negative entries, as we noticed in #205 (comment).

But I tend to agree that it might be easiest to just return 0 and adapt all uses to work with that instead of with negative entries. But to be honest I still do not really understand what further information the negative indices in MC64 provide, so perhaps I am missing how they are crucially important for some usecase.

@jfowkes
Copy link
Contributor

jfowkes commented Jun 19, 2024

According to @isduff the negative entries in MC64 are intended to signal which of the rows are problematic as the negative entry is in fact minus the index of the problematic row. Now the question is, does SSIDS actually use this information anywhere? If not, then I agree with @mjacobse that we should just standardise on returning zero for unmatched entries.

@nimgould
Copy link
Contributor

All of this makes sense for square matrices, but for rectangular ones a 0 is simply a reflection that one cannot possibly match all rows to columns. But, to me anyway, rank defficiency is more than simply not having enough rows or columns, but that there are zero singular values. I do not see how the scheme before the offending lines (the bulk of the subroutine) distinguishes between the two, and thus the offending lines add nothing extra (except for memory trouble!). Thus, as I said earlier, I would simply comment this out.

@jfowkes
Copy link
Contributor

jfowkes commented Jun 19, 2024

Thanks @nimgould, it looks like we are converging on a solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants