HANG in parallel test of examples/chemotaxis/input2D.py on some configurations #264

Closed
guyer opened this issue Sep 19, 2014 · 14 comments
Member

@guyer guyer commented Sep 19, 2014

  • mpirun -np 2 python setup.py test hangs indefinitely at Doctest: examples.chemotaxis.input2D ....
    • mpirun -np 2 python examples/chemotaxis/input2D.py runs successfully.
    • mpirun -np 2 python examples/chemotaxis/test.py runs successfully.
    • Doctest: examples.chemotaxis.input succeeds, but if examples/chemotaxis/input2D is removed from the test suite, then Doctest: examples.chemotaxis.input hangs.
    • If examples/chemotaxis is removed from the test suite, all other tests run to completion.
    • Adding print >>sys.stderr, i after trunk/examples/chemotaxis/input.py@4131#L44, after the solve loop, causes Doctest: examples.chemotaxis.input to hang after 60 to 70 steps. Ditto for input2D.py.
    • Adding print >>sys.stderr, i at trunk/examples/chemotaxis/input.py@4131#L44, within the solve loop, causes Doctest: examples.chemotaxis.input to run to completion. Ditto for input2D.py.
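Because the symptom is an indefinite hang rather than a failure, each reproduction attempt needs a wall-clock limit. A minimal sketch of such a check follows. It is written for Python 3 (not the Python 2.7 of this report), the helper name hangs and the timeout values are hypothetical, and it assumes that killing mpirun also takes down its child processes, which may not hold for every MPI launcher.

```python
import subprocess
import sys


def hangs(cmd, timeout):
    """Return True if cmd is still running after `timeout` seconds.

    subprocess.run() kills the child when the timeout expires and then
    raises TimeoutExpired, which is treated here as "the run hung".
    """
    try:
        subprocess.run(cmd,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout)
    except subprocess.TimeoutExpired:
        return True
    return False


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without MPI installed.
    # For the case in this report the command would be something like
    # ["mpirun", "-np", "2", "python", "setup.py", "test"] with a
    # generous, hypothetical limit such as timeout=600.
    print(hangs([sys.executable, "-c", "import time; time.sleep(30)"], timeout=1))
    print(hangs([sys.executable, "-c", "pass"], timeout=30))
```

A slow run and a hung run look the same to this check, so the limit should be well above the normal duration of the suite.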

Both processes are hung in Epetra.Map

pystack
Current language:  auto; currently c++
Current language:  auto; currently c
${PREFIX}/lib/python2.7/site-packages/PyTrilinos/Epetra.py (7160): __init__
${FIPYPATH}/fipy/matrices/trilinosMatrix.py (589): __init__
${FIPYPATH}/fipy/matrices/trilinosMatrix.py (589): __init__
${FIPYPATH}/fipy/matrices/trilinosMatrix.py (589): __init__
${FIPYPATH}/fipy/matrices/pysparseMatrix.py (357): asTrilinosMeshMatrix
${FIPYPATH}/fipy/matrices/pysparseMatrix.py (357): asTrilinosMeshMatrix
${FIPYPATH}/fipy/matrices/pysparseMatrix.py (357): asTrilinosMeshMatrix
${FIPYPATH}/fipy/solvers/trilinos/trilinosSolver.py (72): _globalMatrixAndVectors
${FIPYPATH}/fipy/solvers/trilinos/trilinosSolver.py (113): _solve
${FIPYPATH}/fipy/terms/term.py (217): solve
<doctest examples.chemotaxis.input2D[0]> (5): <module>
<doctest examples.chemotaxis.input2D[0]> (5): <module>
${PREFIX}/lib/python2.7/doctest.py (Cannot access memory at address 0x0

but the backtraces https://raw.githubusercontent.com/wd15/fipy-attachments/master/raw-attachment/ticket/360/bt0 and https://raw.githubusercontent.com/wd15/fipy-attachments/master/raw-attachment/ticket/360/bt1 show that they diverge in Epetra_BlockMap::Epetra_BlockMap (line 264 vs 252) and seem to hang in a race between Epetra_MpiComm::GatherAll/MPI_Allgather and Epetra_MpiComm::MaxAll/MPI_Allreduce.
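To illustrate why two ranks sitting in different collectives never finish, here is a toy simulation that uses threads in place of MPI processes. It is only a sketch of the suspected failure mode, not of Epetra itself: each named collective is modelled as a threading.Barrier, the timeout stands in for "hangs forever", and run_ranks and the call sequences are made up for the example.

```python
import threading


def run_ranks(call_orders, timeout=0.5):
    """Simulate ranks that each issue a sequence of named collectives.

    A collective only completes when every rank has entered the same one,
    so each name gets its own barrier. A rank that times out waiting is
    reported as "hung" (a real MPI collective would simply block).
    """
    n = len(call_orders)
    names = {name for order in call_orders for name in order}
    barriers = {name: threading.Barrier(n) for name in names}
    status = ["hung"] * n

    def rank(i):
        try:
            for name in call_orders[i]:
                barriers[name].wait(timeout=timeout)
        except threading.BrokenBarrierError:
            return
        status[i] = "done"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


# Both ranks issue the collectives in the same order.
matched = [["GatherAll", "MaxAll"], ["GatherAll", "MaxAll"]]
# The ranks have diverged: one is in GatherAll while the other is in MaxAll.
mismatched = [["GatherAll", "MaxAll"], ["MaxAll", "GatherAll"]]

if __name__ == "__main__":
    print(run_ranks(matched))     # ['done', 'done']
    print(run_ranks(mismatched))  # ['hung', 'hung']
```

This does not explain why the ranks take different branches in the Epetra_BlockMap constructor; it only shows that once they do, neither call can return.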

Why does this only happen for chemotaxis?

Configuration

Mac OS X Snow Leopard 10.6.8

I have seen this with a variety of builds, but the current installation was done with wiki:InstallFiPy/MacOSX/HomeBrew and wiki:InstallFiPy/PipInstallsPython

fipy version 2.2-dev4702
numpy version 1.6.1
pysparse version 1.2-dev
PyTrilinos version 4.4 (Trilinos 10.6.4)
scipy version 0.9.0
matplotlib version 1.1.0
gist is not installed
mpi4py version 1.2.2
enthought.mayavi is not installed

Imported from trac ticket #360, created by guyer on 09-14-2011 at 09:22, last modified: 01-31-2012 at 13:47

Member Author

@guyer guyer commented Sep 19, 2014

guyer attached bt1 on 09-14-2011 at 09:23

Member Author

@guyer guyer commented Sep 19, 2014

guyer attached bt0 on 09-14-2011 at 09:23

Contributor

@wd15 wd15 commented Sep 19, 2014

See issue #275 as it happened on bunter, but is no longer occurring on bunter.

Trac comment by wd15 on 11-16-2011 at 17:04

Member Author

@guyer guyer commented Sep 19, 2014

--no-pysparse not required on dogbert, but apparently is on zizou (issue #275)

Trac comment by guyer on 12-05-2011 at 14:57

Member Author

@guyer guyer commented Sep 19, 2014

Replying to guyer:

--no-pysparse not required on dogbert, but apparently is on zizou (issue #275)

correction: but apparently is on bunter (issue #275)

Trac comment by guyer on 12-05-2011 at 14:58

Member Author

@guyer guyer commented Sep 19, 2014

dogbert and bunter both experience freezes (issue #275), but buildbot shows that zizou (http://build.cmi.kent.edu:8010/builders/Ubuntu-trunk-full/builds/0/steps/trial_2/logs/stdio) and paco (http://build.cmi.kent.edu:8010/builders/OS%20X-trunk-full/builds/0/steps/trial_2/logs/stdio) don't, even though the full test suite is now being run instead of being split into --modules and --examples as before.

Trac comment by guyer on 12-05-2011 at 15:03

Member Author

@guyer guyer commented Sep 19, 2014

stripped examples/tests.py@5004 down to

def _suite():
    return _LateImportTestSuite(testModuleNames = (
#         'diffusion.test',
        'chemotaxis.test',
#         'phase.test',
#         'convection.test',
#         'elphf.test',
#         'levelSet.test',
#         'cahnHilliard.test',
#         'flow.test',
#         'meshing.test',
#         'reactiveWetting.test',
#         'riemann.test'
    ), base = __name__)

and fipy/tests.py@5004 down to

def _suite():
    return _LateImportTestSuite(testModuleNames = (
#         'solvers.test',
#         'models.test',
#         'terms.test',
#         'tools.test',
#         'matrices.test',
#         'meshes.test',
        'variables.test',
#         'viewers.test',
#   'boundaryConditions.test',
    ), base = __name__)

and still get the freeze in chemotaxis, whereas Wheeler's results on bunter (issue #275) seemed more ambiguous.

Trac comment by guyer on 12-05-2011 at 15:07

Member Author

@guyer guyer commented Sep 19, 2014

Reduced trunk/fipy/variables/test.py@5004 to

def _suite():
    return _LateImportDocTestSuite(
        docTestModuleNames = (
#             'fipy.variables.variable',
#             'fipy.variables.meshVariable',
            'fipy.variables.cellVariable',
#             'fipy.variables.faceVariable',
#             'fipy.variables.operatorVariable',
#             'fipy.variables.betaNoiseVariable',
#             'fipy.variables.exponentialNoiseVariable',
            'fipy.variables.gammaNoiseVariable',
#             'fipy.variables.gaussianNoiseVariable',
#             'fipy.variables.uniformNoiseVariable',
#             'fipy.variables.cellVolumeAverageVariable',
#             'fipy.variables.modularVariable',
#             'fipy.variables.binaryOperatorVariable',
#             'fipy.variables.coupledCellVariable',
#             'fipy.variables.cellToFaceVariable',
#             'fipy.variables.faceGradVariable',
#             'fipy.variables.gaussCellGradVariable',
#             'fipy.variables.faceGradContributionsVariable'
        ))

and still get the freeze. Through process of elimination, found that commenting out trunk/fipy/variables/cellVariable.py@5004#L198 was enough to remove the freeze.

Restoring all other tests in trunk/fipy/variables/test.py@5004 did not resurrect the freeze, but restoring all tests in trunk/fipy/test.py@5004 did bring the freeze back.

Trac comment by guyer on 12-05-2011 at 19:38

Member Author

@guyer guyer commented Sep 19, 2014

Reducing trunk/fipy/test.py@5004 to

def _suite():
    return _LateImportTestSuite(testModuleNames = (
#         'solvers.test',
#         'models.test',
#         'terms.test',
#         'tools.test',
        'matrices.test',
        'meshes.test',
#         'variables.test',
#         'viewers.test',
#   'boundaryConditions.test',
    ), base = __name__)

produces the freeze, but commenting out either matrices.test or meshes.test makes it go away.
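The by-hand elimination in the comments above can be sketched as a greedy reduction: drop one module at a time and keep it dropped only if the freeze persists. The function reduce_modules and the predicate fake_hang are hypothetical; fake_hang merely encodes the observation that the freeze needs both matrices.test and meshes.test, whereas a real predicate would have to launch the parallel test run with a timeout.

```python
def reduce_modules(modules, still_hangs):
    """Greedily remove modules one at a time while the hang persists.

    `still_hangs` takes a list of module names and returns True if the
    test run with only those modules enabled still freezes.
    """
    kept = list(modules)
    for name in list(modules):
        trial = [m for m in kept if m != name]
        if still_hangs(trial):
            kept = trial  # the freeze does not need this module
    return kept


# Module names as listed in the _suite() above.
ALL_MODULES = [
    "solvers.test", "models.test", "terms.test", "tools.test",
    "matrices.test", "meshes.test", "variables.test", "viewers.test",
    "boundaryConditions.test",
]


def fake_hang(mods):
    # Stand-in for a real run: mimics "freeze needs both modules".
    return "matrices.test" in mods and "meshes.test" in mods


if __name__ == "__main__":
    print(reduce_modules(ALL_MODULES, fake_hang))
    # ['matrices.test', 'meshes.test']
```

A single greedy pass finds a subset from which no one module can be removed, but not necessarily the smallest subset, and it assumes the freeze is reproducible from run to run, which the rest of this thread suggests may not always be the case.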

Trac comment by guyer on 12-06-2011 at 09:53

Member Author

@guyer guyer commented Sep 19, 2014

Commenting out just about any one doctest line in trunk/fipy/matrices/trilinosMatrix.py@5004 appears to be enough to remove the freeze, e.g., trunk/fipy/matrices/trilinosMatrix.py@5004#L193 or trunk/fipy/matrices/trilinosMatrix.py@5004#L194 (but not both), trunk/fipy/matrices/trilinosMatrix.py@5004#L229, trunk/fipy/matrices/trilinosMatrix.py@5004#L231, etc. Even trunk/fipy/matrices/trilinosMatrix.py@5004#L918.

Sometimes, commenting out an .addAt() produces its own deadlock, for reasons I don't understand.

Trac comment by guyer on 12-06-2011 at 10:54

Member Author

@guyer guyer commented Sep 19, 2014

This also hangs on bunter (issue #275), but apparently only with --no-pysparse and possibly requiring different combinations of tests before chemotaxis.

Trac comment by guyer on 12-08-2011 at 09:43

Member Author

@guyer guyer commented Sep 19, 2014

deadlock occurs on both dogbert (Mac OS X Snow Leopard) and bunter (Debian squeeze), but not on the buildbot slaves paco (Mac OS X Snow Leopard) and zizou (Ubuntu something).

Is this because buildbot slaves are different (stderr and stdout are captured, so different buffering?) or because there's something different about those installations?

Trac comment by guyer on 12-08-2011 at 09:59

Member Author

@guyer guyer commented Sep 19, 2014

Removing from milestone 3.0

This hang is too idiosyncratic and Wheeler and I have both spent too much time trying to isolate and debug it.

Trac comment by guyer on 12-09-2011 at 12:45

Member Author

@guyer guyer commented Sep 19, 2014

Redundant with issue #305

Trac comment by guyer on 01-31-2012 at 13:47
