Deadlock in gesvx when FACT='E' #119

Closed
TLCFEM opened this issue Mar 13, 2025 · 2 comments

TLCFEM commented Mar 13, 2025

Running the attached program with two processes, for example

mpirun -n 2 ./deadlock

results in a deadlock.

According to the backtrace, one process is blocked in pdcopy, waiting to receive data from the other process.

libmpi.so.12!MPIDI_NM_progress.constprop.0 (Unknown Source:0)
libmpi.so.12!MPIDI_progress_test (Unknown Source:0)
libmpi.so.12!MPIR_Wait_state (Unknown Source:0)
libmpi.so.12!MPIR_Wait (Unknown Source:0)
libmpi.so.12!PMPI_Recv (Unknown Source:0)
BI_Srecv(BLACSCONTEXT * ctxt, int src, int msgid, BLACBUFF * bp) (\scalapack-2.2.2\BLACS\SRC\BI_Srecv.c:8)
Cdgerv2d(int ConTxt, int m, int n, double * A, int lda, int rsrc, int csrc) (\scalapack-2.2.2\BLACS\SRC\dgerv2d_.c:79)
PB_CpaxpbyDN(PBTYP_T * TYPE, char * CONJUG, int M, int N, char * ALPHA, char * A, int IA, int JA, int * DESCA, char * AROC, char * BETA, char * B, int IB, int JB, int * DESCB, char * BROC) (\scalapack-2.2.2\PBLAS\SRC\PTOOLS\PB_CpaxpbyDN.c:656)
PB_Cpaxpby(PBTYP_T * TYPE, char * CONJUG, int M, int N, char * ALPHA, char * A, int IA, int JA, int * DESCA, char * AROC, char * BETA, char * B, int IB, int JB, int * DESCB, char * BROC) (\scalapack-2.2.2\PBLAS\SRC\PTOOLS\PB_Cpaxpby.c:754)
pdcopy_(int * N, double * X, int * IX, int * JX, int * DESCX, int * INCX, double * Y, int * IY, int * JY, int * DESCY, int * INCY) (\scalapack-2.2.2\PBLAS\SRC\pdcopy_.c:217)
pdgesvx(character*1 fact, character*1 trans, integer(kind=4) n, integer(kind=4) nrhs, real(kind=8) (*) a, integer(kind=4) ia, integer(kind=4) ja, integer(kind=4) (*) desca, real(kind=8) (*) af, integer(kind=4) iaf, integer(kind=4) jaf, integer(kind=4) (*) descaf, integer(kind=4) (*) ipiv, character*1 equed, real(kind=8) (*) r, real(kind=8) (*) c, real(kind=8) (*) b, integer(kind=4) ib, integer(kind=4) jb, integer(kind=4) (*) descb, real(kind=8) (*) x, integer(kind=4) ix, integer(kind=4) jx, integer(kind=4) (*) descx, real(kind=8) rcond, real(kind=8) (*) ferr, real(kind=8) (*) berr, real(kind=8) (*) work, integer(kind=4) lwork, integer(kind=4) (*) iwork, integer(kind=4) liwork, integer(kind=4) info, integer(kind=8) _fact, integer(kind=8) _trans, integer(kind=8) _equed) (\scalapack-2.2.2\SRC\pdgesvx.f:791)
run() (\deadlock.cpp:92)
main() (\deadlock.cpp:100)

The attached program does not produce a correct solution; it was written only to reproduce the deadlock.

deadlock.cpp.txt

TLCFEM commented Mar 14, 2025

Further debugging shows that the local variable COLEQU, which indicates whether to perform column equilibration, is set on line 672 from the EQUED value returned by the subroutine PDLAQGE.

scalapack/SRC/pdgesvx.f

Lines 665 to 673 in a23c2cd

         IF( INFEQU.EQ.0 ) THEN
*
*           Equilibrate the matrix.
*
            CALL PDLAQGE( N, N, A, IA, JA, DESCA, R, C, ROWCND, COLCND,
     $                    AMAX, EQUED )
            ROWEQU = LSAME( EQUED, 'R' ) .OR. LSAME( EQUED, 'B' )
            COLEQU = LSAME( EQUED, 'C' ) .OR. LSAME( EQUED, 'B' )
         END IF

Thus, with two processes, it is possible for process A to have COLEQU=.TRUE. while process B has COLEQU=.FALSE.

However, after the solution is obtained via PDGERFS, it must be scaled back.

scalapack/SRC/pdgesvx.f

Lines 783 to 810 in a23c2cd

         IF( NOTRAN ) THEN
            IF( COLEQU ) THEN
*
*              Transpose the column scaling factors
*
               CALL DESCSET( CDESC, 1, N+ICOFFA, 1, DESCA( NB_ ), MYROW,
     $                       IACOL, ICTXT, 1 )
               CALL PDCOPY( N, C, 1, JA, CDESC, CDESC( LLD_ ), WORK, IX,
     $                      JX, DESCX, 1 )
               IF( MYCOL.EQ.IBCOL ) THEN
                  CALL DGEBS2D( ICTXT, 'Rowwise', ' ', NP, 1,
     $                        WORK( IIX ), DESCX( LLD_ ) )
               ELSE
                  CALL DGEBR2D( ICTXT, 'Rowwise', ' ', NP, 1,
     $                        WORK( IIX ), DESCX( LLD_ ), MYROW, IBCOL )
               END IF
*
               DO 80 J = JJX, JJX+NRHSQ-1
                  DO 70 I = IIX, IIX+NP-1
                     X( I+( J-1 )*DESCX( LLD_ ) ) = WORK( I )*
     $                  X( I+( J-1 )*DESCX( LLD_ ) )
   70             CONTINUE
   80          CONTINUE
               DO 90 J = JJX, JJX+NRHSQ-1
                  FERR( J ) = FERR( J ) / COLCND
   90          CONTINUE
            END IF
         ELSE IF( ROWEQU ) THEN
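
For the NOTRAN case, the loops labelled 70 and 80 above multiply each row of X by the matching column scale factor. The serial sketch below illustrates that step only. `scale_back` and its parameters are hypothetical names, and the sketch ignores the block-cyclic distribution that makes the PDCOPY and the row-wise broadcast necessary in the parallel routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for loops 70/80 in pdgesvx.f.
// x holds nrhs solution columns in column-major order with leading
// dimension ld; c holds one column scale factor per row of x.
void scale_back(std::vector<double>& x, const std::vector<double>& c,
                std::size_t nrhs, std::size_t ld) {
    assert(ld >= c.size() && x.size() >= ld * nrhs);
    for (std::size_t j = 0; j < nrhs; ++j) {
        for (std::size_t i = 0; i < c.size(); ++i) {
            // Row i of every right-hand side is scaled by c[i].
            x[i + j * ld] *= c[i];
        }
    }
}
```

In the distributed routine each process only holds part of C, which is why the factors first have to be copied into WORK with the layout of X.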

Process B skips the inner IF statement on line 784 because its COLEQU=.FALSE.
Process A enters it and calls PDCOPY to collect the column scaling factors C, but hangs because process B never makes the matching call.
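
The hang can be modelled without MPI. The sketch below is a simplified model, not ScaLAPACK code: it assumes that a communication step guarded by a per-process flag only completes when every process takes the same branch. The function name `guarded_collective_completes` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model: colequ[p] is the COLEQU value held by process p.
// A call such as PDCOPY that is guarded by IF( COLEQU ) completes only if
// all processes agree on the flag: either everyone makes the call, or
// nobody does. A mix leaves the callers waiting for the others.
bool guarded_collective_completes(const std::vector<bool>& colequ) {
    if (colequ.empty()) {
        return true;
    }
    const bool first = colequ.front();
    return std::all_of(colequ.begin(), colequ.end(),
                       [first](bool flag) { return flag == first; });
}
```

With the flags from this report, `guarded_collective_completes({true, false})` returns false, which corresponds to process A blocking in PDCOPY.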

Thus, C must be distributed whenever any process in the grid needs to perform equilibration, so that processes with COLEQU=.FALSE. still take part in the communication instead of skipping it.
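
One way to picture this idea is to have all processes agree on a single flag before the guarded branch, by combining the local values with a logical OR. The sketch below simulates that agreement in plain C++. It is an illustration under that assumption and not necessarily how the eventual fix is implemented; `agree_on_flag` is a hypothetical name, and in a real program the combination would be a grid-wide reduction rather than a loop over a vector.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: local_colequ[p] is the COLEQU value on process p.
// Returns the flags after every process adopts the logical OR of all
// local values, so that all of them take the same branch afterwards.
std::vector<bool> agree_on_flag(const std::vector<bool>& local_colequ) {
    const bool any_set = std::any_of(local_colequ.begin(), local_colequ.end(),
                                     [](bool flag) { return flag; });
    return std::vector<bool>(local_colequ.size(), any_set);
}
```

After this step the flags `{true, false}` become `{true, true}`, so both processes would enter the branch that distributes C.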

langou commented Mar 24, 2025

Fixed by #120
Thanks @TLCFEM

langou closed this as completed Mar 24, 2025