
mpirun -np 2 python -Wd setup.py test --trilinos hanging on sandbox under buildbot #305

Closed

wd15 opened this issue Sep 19, 2014 · 13 comments

@wd15 (Contributor) commented Sep 19, 2014

Hangs at the chemotaxis tests.

The problem seems to be that when PYTHONPATH is explicitly set to "." it hangs; otherwise it runs through without issue. The crontab on sandbox sets PYTHONPATH. I'm worried about simply removing that, as Python could pick up the system version of FiPy during buildbot runs. Maybe use a virtualenv.
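A virtualenv would isolate the build from the system FiPy without needing PYTHONPATH at all. A minimal sketch of what the crontab entry might look like (the virtualenv path and slave directory are hypothetical, not the actual sandbox layout):

```shell
# Hypothetical crontab entry: activate a dedicated virtualenv instead of
# exporting PYTHONPATH=., so the buildslave picks up the virtualenv's FiPy
# rather than the system install. Paths are illustrative only.
@reboot . /home/buildbot/fipy-env/bin/activate && buildslave start /home/buildbot/fipy-slave
```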

Also reported in issue #264

Imported from trac ticket #425, created by wd15 on 01-20-2012 at 13:30, last modified: 01-30-2013 at 13:42

@wd15 (Contributor, Author) commented Sep 19, 2014

This is so weird. I can no longer reproduce the hang on sandbox, even when setting PYTHONPATH. It only ever hanged at the command line after I set PYTHONPATH there, so it is probably just intermittent. I messed with PYTHONPATH in the crontab, relaunched the slave, and restarted the buildbot build. It didn't hang: http://build.cmi.kent.edu:8010/builders/Ubuntu_x86_64~trunk~full/builds/12/steps/trial_2/logs/stdio

Trac comment by wd15 on 01-20-2012 at 14:42

@wd15 (Contributor, Author) commented Sep 19, 2014

Killed the buildslave and removed PYTHONPATH from the crontab, restoring the original configuration. A new buildslave has started up; let's see what happens.

Trac comment by wd15 on 01-20-2012 at 15:02

@guyer (Member) commented Sep 19, 2014

It fails on Mac OS X 10.6 with python 2.7 even if PYTHONPATH is unset completely.

Could it be that the two (or more) processes are getting different PYTHONPATHs?

A separate question: what is setting PYTHONPATH in the first place? As seen in http://build.cmi.kent.edu:8010/builders/Mac_OS_X%7Etrunk%7Efull/builds/8/steps/trial_2/logs/stdio, PYTHONPATH=.:, but this is not set in master.cfg or in the LaunchDaemon that launches paco's buildbot slave process.

Trac comment by guyer on 01-20-2012 at 18:20

@wd15 (Contributor, Author) commented Sep 19, 2014

Using the following debugging patch in trilinosMatrix.py, it seems to always fail in the rowMap construction:

```diff
Index: fipy/matrices/trilinosMatrix.py
===================================================================
--- fipy/matrices/trilinosMatrix.py     (revision 5115)
+++ fipy/matrices/trilinosMatrix.py     (working copy)
@@ -590,9 +590,18 @@
         self.numberOfEquations = numberOfEquations
 
         comm = mesh.communicator.epetra_comm
+
+        import sys
+        print >>sys.stderr, mesh.communicator.procID, 'trilinosMatrix 0'
+
         rowMap = Epetra.Map(-1, list(self._globalNonOverlappingRowIDs), 0, comm)
+        print >>sys.stderr, mesh.communicator.procID, 'trilinosMatrix 1'
+
         colMap = Epetra.Map(-1, list(self._globalOverlappingColIDs), 0, comm)
+        print >>sys.stderr, mesh.communicator.procID, 'trilinosMatrix 2'
+
         domainMap = Epetra.Map(-1, list(self._globalNonOverlappingColIDs), 0, comm)
+        print >>sys.stderr, mesh.communicator.procID, 'trilinosMatrix 3'
 
         _TrilinosMatrix.__init__(self, 
                                  rows=self.numberOfEquations * self.mesh.globalNumberOfCells, 
```

The tests actually run fairly rapidly with these debug statements; the slowness with the PRINT statements is caused by the sleep command. What to do next?
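For context, the three maps being instrumented describe how the rows and columns of the global matrix are distributed across processes. A toy sketch of the kind of ID lists that feed them, in plain Python with no MPI or Trilinos (the helper function names here are made up for illustration; only `_globalNonOverlappingRowIDs` and `_globalOverlappingColIDs` come from FiPy):

```python
# Toy illustration: how the global ID lists feeding the Epetra.Map
# constructors might partition a 1D mesh of 10 cells across 2 processes,
# with one ghost cell of overlap on each side. Function names are
# hypothetical, invented for this sketch.

def non_overlapping_ids(proc_id, n_procs, n_cells):
    """Cells owned exclusively by this process (the matrix rows)."""
    chunk = n_cells // n_procs
    start = proc_id * chunk
    stop = n_cells if proc_id == n_procs - 1 else start + chunk
    return list(range(start, stop))

def overlapping_ids(proc_id, n_procs, n_cells, ghosts=1):
    """Owned cells plus neighboring ghost cells (the matrix columns)."""
    owned = non_overlapping_ids(proc_id, n_procs, n_cells)
    lo = max(owned[0] - ghosts, 0)
    hi = min(owned[-1] + ghosts, n_cells - 1)
    return list(range(lo, hi + 1))

if __name__ == "__main__":
    for p in (0, 1):
        print(p, non_overlapping_ids(p, 2, 10), overlapping_ids(p, 2, 10))
```

The point of the toy is that each process passes a *different* element list to what is a collective constructor, which is why a mismatch between ranks can deadlock inside the first map construction.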

Trac comment by wd15 on 01-25-2012 at 16:05

@wd15 (Contributor, Author) commented Sep 19, 2014

Going further, it stops hanging when domainMap is set equal to rowMap. I can reliably toggle between hanging and passing by making this change. Swapping the order in which the maps are constructed doesn't move the hang; it always seems to occur in the first of the three map constructions.

```diff
Index: fipy/matrices/trilinosMatrix.py
===================================================================
--- fipy/matrices/trilinosMatrix.py     (revision 5115)
+++ fipy/matrices/trilinosMatrix.py     (working copy)
@@ -588,11 +588,17 @@
         self.mesh = mesh
         self.numberOfVariables = numberOfVariables
         self.numberOfEquations = numberOfEquations
-
+        from fipy.tools.debug import PRINT
         comm = mesh.communicator.epetra_comm
+        PRINT('trilinosMatrix 0', stall=False)
         rowMap = Epetra.Map(-1, list(self._globalNonOverlappingRowIDs), 0, comm)
+        PRINT('trilinosMatrix 1', stall=False)
         colMap = Epetra.Map(-1, list(self._globalOverlappingColIDs), 0, comm)
-        domainMap = Epetra.Map(-1, list(self._globalNonOverlappingColIDs), 0, comm)
+##        domainMap = rowMap
+        PRINT('trilinosMatrix 2', stall=False)
+##        domainMap = Epetra.Map(-1, list(self._globalNonOverlappingColIDs), 0, comm)
+        domainMap = rowMap
+        PRINT('trilinosMatrix 3', stall=False)
 
         _TrilinosMatrix.__init__(self, 
                                  rows=self.numberOfEquations * self.mesh.globalNumberOfCells, 
```

I just wanted to see what would happen if domainMap was equal to rowMap. They are the same map anyway, as we no longer use non-square matrices: the number of equations is always equal to the number of variables. I'm going to commit this out of curiosity, to see how it fares on the other test boxes (all tests pass when domainMap = rowMap on bunter, as one would expect).
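The squareness argument can be made concrete with a two-line sketch (plain Python, not FiPy code): the global matrix is (equations × cells) by (variables × cells), so when the counts match, the row space and the domain (column) space have the same global size and the two maps can describe the same distribution.

```python
# Sketch of the squareness argument: the global matrix has
# numberOfEquations * nCells rows and numberOfVariables * nCells columns,
# so equations == variables implies a square matrix whose row map and
# domain map can coincide.
def global_shape(n_equations, n_variables, n_cells):
    return (n_equations * n_cells, n_variables * n_cells)

rows, cols = global_shape(2, 2, 100)
assert rows == cols  # square: domainMap = rowMap is consistent
```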

Trac comment by wd15 on 01-25-2012 at 16:52

@wd15 (Contributor, Author) commented Sep 19, 2014

As of 5:45, the tests on sandbox and on Mac OS X seem to have gotten through the critical chemotaxis tests, which bodes well.

Trac comment by wd15 on 01-25-2012 at 17:45

@wd15 (Contributor, Author) commented Sep 19, 2014

It seems to have passed all the parallel tests. Jon needs to test this on his laptop.

Trac comment by wd15 on 01-25-2012 at 17:57

@wd15 (Contributor, Author) commented Sep 19, 2014

The change was r5117. Unfortunately it looks like Ubuntu i686 may be hanging :-(.

Trac comment by wd15 on 01-25-2012 at 18:03

@wd15 (Contributor, Author) commented Sep 19, 2014

Replying to wd15:

The change was r5117. Unfortunately it looks like Ubuntu i686 may be hanging :-(.

It all seems to have worked. We have to think about the following:

  • Is this an adequate change? (Obviously not.) If we stick with it, we need to get rid of references to domainMap and non-square matrices altogether.
  • Almost certainly, this didn't really fix the problem. My instinct is to leave it for a while and see if anything recurs, implement a test box with different numbers of processors, and then revisit.

Trac comment by wd15 on 01-26-2012 at 10:37

@guyer (Member) commented Sep 19, 2014

Replying to wd15:

Using the following debugging patch in trilinosMatrix.py seems to always fail in the rowMap construction

On my Mac OS X laptop, it stalls before the rowMap construction, but only if I use 7 processors. After dumping stack traces, I ended up modifying

```diff
--- fipy/matrices/pysparseMatrix.py (revision 5116)
+++ fipy/matrices/pysparseMatrix.py (working copy)
@@ -357,6 +357,9 @@
                                           numberOfVariables=self.numberOfVariables,
                                           numberOfEquations=self.numberOfEquations)
 
+        from fipy.tools import parallel
+        parallel.Barrier()
+
         self.trilinosMatrix.addAt(values, irow, jcol)
         self.trilinosMatrix.finalize()
```

which seems to prevent the stall in chemotaxis, but then it stalls later in examples.cahnHilliard.mesh2D-coupled.
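For context, `parallel.Barrier()` is a collective synchronization point: no process proceeds until all have arrived, which can mask one rank racing ahead into a mismatched collective call. A rough sketch of the semantics using threads in place of MPI ranks (an analogy only, not FiPy's implementation):

```python
import threading

# Thread-based analogy (assumption: parallel.Barrier() is a collective
# barrier over MPI ranks; threads stand in for ranks here).
N = 4
barrier = threading.Barrier(N)
order = []
lock = threading.Lock()

def rank(i):
    # work before the synchronization point
    with lock:
        order.append(("before", i))
    barrier.wait()  # like parallel.Barrier(): block until all N arrive
    # no rank reaches this line until every rank has logged "before"
    with lock:
        order.append(("after", i))

threads = [threading.Thread(target=rank, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# every "before" event precedes every "after" event
assert [tag for tag, _ in order[:N]] == ["before"] * N
```

This also hints at why a barrier can merely move a hang rather than fix it: it forces alignment at one point but does nothing about a later mismatched collective, consistent with the stall reappearing in examples.cahnHilliard.mesh2D-coupled.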

Trac comment by guyer on 01-26-2012 at 11:14

@wd15 (Contributor, Author) commented Sep 19, 2014

The following changes to trunk@5117 still hang on chemotaxis with mpirun -np 2 python setup.py test --trilinos --examples:

```diff
Index: fipy/matrices/trilinosMatrix.py
===================================================================
--- fipy/matrices/trilinosMatrix.py     (revision 5117)
+++ fipy/matrices/trilinosMatrix.py     (working copy)
@@ -592,7 +592,8 @@
         comm = mesh.communicator.epetra_comm
         rowMap = Epetra.Map(-1, list(self._globalNonOverlappingRowIDs), 0, comm)
         colMap = Epetra.Map(-1, list(self._globalOverlappingColIDs), 0, comm)
-        domainMap = rowMap
+        domainMap = Epetra.Map(-1, list(self._globalNonOverlappingColIDs), 0, comm)
+##        domainMap = rowMap
 
         _TrilinosMatrix.__init__(self, 
                                  rows=self.numberOfEquations * self.mesh.globalNumberOfCells, 
Index: fipy/matrices/pysparseMatrix.py
===================================================================
--- fipy/matrices/pysparseMatrix.py     (revision 5115)
+++ fipy/matrices/pysparseMatrix.py     (working copy)
@@ -357,6 +357,9 @@
                                           numberOfVariables=self.numberOfVariables,
                                           numberOfEquations=self.numberOfEquations)
 
+        from fipy.tools import parallel
+        parallel.Barrier()
+
         self.trilinosMatrix.addAt(values, irow, jcol)
         self.trilinosMatrix.finalize()
Index: examples/test.py
===================================================================
--- examples/test.py    (revision 5115)
+++ examples/test.py    (working copy)
@@ -42,15 +42,15 @@
     return _LateImportTestSuite(testModuleNames = (
         'diffusion.test',
         'chemotaxis.test',  
-        'phase.test',
-        'convection.test',
-        'elphf.test',
-        'levelSet.test',
-        'cahnHilliard.test',
-        'flow.test',
-        'meshing.test',
-        'reactiveWetting.test',
-        'riemann.test'
+        # 'phase.test',
+        # 'convection.test',
+        # 'elphf.test',
+        # 'levelSet.test',
+        # 'cahnHilliard.test',
+        # 'flow.test',
+        # 'meshing.test',
+        # 'reactiveWetting.test',
+        # 'riemann.test'
         ), base = __name__)
 
 if __name__ == '__main__':
```

Trac comment by wd15 on 01-26-2012 at 12:27

@guyer (Member) commented Sep 19, 2014

This intermittent freeze has been documented in r5149, so I am removing it from milestone:3.0.

Trac comment by guyer on 01-31-2012 at 17:54

@guyer (Member) commented Oct 8, 2017

As of #524, there appear to be no parallel hangs in the tests.

@guyer guyer closed this Oct 8, 2017