Hangs/errors with CPU mpi test run #5

@RAMitchell

Running the following command either hangs or fails with one of the errors below:

legate --launcher mpirun --ranks-per-node 2 --module pytest legateboost/test -svx

These runs included the fix from nv-legate/legate#778.
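
Because the outcome varies between a hang and a crash from run to run, it may help to tally outcomes over repeated runs. The following is a minimal sketch using only the Python standard library. The helper names `classify_run` and `tally`, the run count, and the timeout are hypothetical choices for illustration, not part of legate or legateboost.

```python
import subprocess
import sys
from collections import Counter


def classify_run(cmd, timeout_s):
    """Run cmd once and label the outcome 'ok', 'error', or 'hang'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # discard output; only the outcome matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The direct child is killed on timeout; treat the run as a hang.
        return "hang"
    return "ok" if proc.returncode == 0 else "error"


def tally(cmd, runs, timeout_s):
    """Repeat the command and count how often each outcome occurs."""
    counts = Counter()
    for i in range(runs):
        outcome = classify_run(cmd, timeout_s)
        print(f"run {i}: {outcome}", file=sys.stderr)
        counts[outcome] += 1
    return counts


# Example call (assumed values: 5 runs, 600 second timeout):
# tally(["legate", "--launcher", "mpirun", "--ranks-per-node", "2",
#        "--module", "pytest", "legateboost/test", "-svx"],
#       runs=5, timeout_s=600)
```

Note that `subprocess.run` only kills the process it started when the timeout expires, so ranks spawned by `mpirun` may need to be cleaned up separately after a hang.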

legion_python: /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-src/runtime/realm/runtime_impl.cc:2539: Realm::GenEventImpl* Realm::RuntimeImpl::get_genevent_impl(Realm::Event): Assertion `id.is_event()' failed.
Fatal Python error: Aborted

or

Fatal Python error: Segmentation fault

Thread 0x00007fa6b45d8000 (most recent call first):
  File "/home/nfs/rorym/legate.core/legate/core/_legion/future.py", line 157 in get_buffer
  File "/home/nfs/rorym/cunumeric/cunumeric/deferred.py", line 382 in get_scalar_array
  File "/home/nfs/rorym/cunumeric/cunumeric/deferred.py", line 291 in __numpy_array__
  File "/home/nfs/rorym/cunumeric/cunumeric/array.py", line 839 in __array__
  File "/home/nfs/rorym/cunumeric/cunumeric/coverage.py", line 119 in wrapper
  File "/home/nfs/rorym/legate.core/legate/core/runtime.py", line 2086 in wrapper
  File "/home/nfs/rorym/cunumeric/cunumeric/array.py", line 176 in maybe_convert_to_np_ndarray
  File "/home/nfs/rorym/cunumeric/cunumeric/utils.py", line 230 in deep_apply
  File "/home/nfs/rorym/cunumeric/cunumeric/utils.py", line 226 in <genexpr>
  File "/home/nfs/rorym/cunumeric/cunumeric/utils.py", line 226 in deep_apply
  File "/home/nfs/rorym/cunumeric/cunumeric/array.py", line 431 in __array_function__
  File "/home/nfs/rorym/cunumeric/cunumeric/coverage.py", line 119 in wrapper
  File "/home/nfs/rorym/legate.core/legate/core/runtime.py", line 2086 in wrapper
  File "<__array_function__ internals>", line 180 in result_type
  File "/home/nfs/rorym/cunumeric/cunumeric/_ufunc/ufunc.py", line 569 in _find_common_type
  File "/home/nfs/rorym/cunumeric/cunumeric/_ufunc/ufunc.py", line 581 in _resolve_dtype
  File "/home/nfs/rorym/cunumeric/cunumeric/_ufunc/ufunc.py", line 669 in __call__
  File "/home/nfs/rorym/cunumeric/cunumeric/array.py", line 1256 in __itruediv__
  File "/home/nfs/rorym/cunumeric/cunumeric/coverage.py", line 119 in wrapper
  File "/home/nfs/rorym/legate.core/legate/core/runtime.py", line 2086 in wrapper
  File "/home/nfs/rorym/cunumeric/cunumeric/array.py", line 3142 in mean
  File "/home/nfs/rorym/cunumeric/cunumeric/array.py", line 142 in wrapper
  File "/home/nfs/rorym/cunumeric/cunumeric/coverage.py", line 119 in wrapper
  File "/home/nfs/rorym/legate.core/legate/core/runtime.py", line 2083 in wrapper
  File "/home/nfs/rorym/LegateGBM/legateboost/metrics.py", line 19 in metric
  File "/home/nfs/rorym/LegateGBM/legateboost/legateboost.py", line 321 in fit
  File "/home/nfs/rorym/LegateGBM/legateboost/legateboost.py", line 389 in fit
  File "/home/nfs/rorym/LegateGBM/legateboost/test/test_estimator.py", line 56 in test_regressor_improving_with_depth
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/python.py", line 1799 in runtest
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/main.py", line 323 in _main
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/main.py", line 269 in wrap_session
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/config/__init__.py", line 166 in main
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/home/nfs/rorym/anaconda3/envs/legate-test/lib/python3.10/site-packages/pytest/__main__.py", line 5 in <module>
  File "/home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-src/bindings/python/build/lib/legion_top.py", line 295 in run_path
  File "/home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-src/bindings/python/build/lib/legion_top.py", line 463 in legion_python_main

Extension modules: _cffi_backend, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, sklearn.utils.murmurhash, psutil._psutil_linux, psutil._psutil_posix, numpy.linalg.lapack_lite, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, 
scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, beta_ufunc, scipy.stats._boost.beta_ufunc, binom_ufunc, scipy.stats._boost.binom_ufunc, nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, ncf_ufunc, scipy.stats._boost.ncf_ufunc, ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, nct_ufunc, scipy.stats._boost.nct_ufunc, skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._mvn, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._rcont.rcont, sklearn.utils._isfinite, sklearn.utils._openmp_helpers, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.utils._random, sklearn.utils._seq_dataset, sklearn.utils._cython_blas, sklearn.utils.arrayfuncs, sklearn.utils._typedefs, sklearn.utils._readonly_array_wrapper, sklearn.metrics._dist_metrics, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.utils._vector_sentinel, 
sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast, sklearn.linear_model._cd_fast, sklearn._loss._loss, sklearn.utils._weight_vector, sklearn.linear_model._sgd_fast, sklearn.linear_model._sag_fast, sklearn.svm._libsvm, sklearn.svm._liblinear, sklearn.svm._libsvm_sparse, sklearn.neighbors._partition_nodes, sklearn.neighbors._ball_tree, sklearn.neighbors._kd_tree, sklearn.decomposition._cdnmf_fast, sklearn.decomposition._online_lda_fast, sklearn.feature_extraction._hashing_fast, sklearn.datasets._svmlight_format_fast, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, legate.core._lib.types, context, legate.core._lib.context (total: 161)
Signal 11 received by node 0, process 472669 (thread 7fa6d00af000) - obtaining backtrace
Signal 11 received by process 472669 (thread 7fa6d00af000) at: stack trace: 17 frames
  [0] = /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7fa6d5d7300b]
  [1] = /lib/x86_64-linux-gnu/libc.so.6(+0x4308f) [0x7fa6d5d7308f]
  [2] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::CollectiveViewRendezvous::unpack_collective(Legion::Deserializer&)+0x548) [0x7fa6d7b247f8]
  [3] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::GatherCollective::handle_collective_message(Legion::Deserializer&)+0x43) [0x7fa6d7af1a43]
  [4] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::ReplicateContext::register_collective(Legion::Internal::ShardCollective*)+0x266) [0x7fa6d79c43e6]
  [5] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::GatherCollective::perform_collective_async(Legion::Internal::RtEvent)+0x3d) [0x7fa6d7af186d]
  [6] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::CollectiveViewCreator<Legion::Internal::AttachOp>::rendezvous_collective_mapping(unsigned int, unsigned int, Legion::LogicalRegion, Legion::Internal::CollectiveViewCreatorBase::RendezvousResult*, unsigned int, std::vector<std::pair<unsigned long long, AVXBitMask<256u> >, Legion::Internal::LegionAllocator<std::pair<unsigned long long, AVXBitMask<256u> >, (Legion::Internal::AllocationType)106> > const&)+0x349) [0x7fa6d7a9c4c9]
  [7] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::CollectiveViewCreator<Legion::Internal::AttachOp>::convert_collective_views(unsigned int, unsigned int, Legion::LogicalRegion, Legion::Internal::InstanceSet const&, Legion::Internal::InnerContext*, Legion::Internal::CollectiveMapping*&, bool&, std::vector<Legion::Internal::FieldMaskSet<Legion::Internal::InstanceView, (Legion::Internal::AllocationType)106, false>, Legion::Internal::LegionAllocator<Legion::Internal::FieldMaskSet<Legion::Internal::InstanceView, (Legion::Internal::AllocationType)106, false>, (Legion::Internal::AllocationType)106> >&, std::map<Legion::Internal::InstanceView*, unsigned long, std::less<Legion::Internal::InstanceView*>, std::allocator<std::pair<Legion::Internal::InstanceView* const, unsigned long> > >&)+0xf9) [0x7fa6d7aa2ff9]
  [8] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::OverwriteAnalysis::convert_views(Legion::LogicalRegion, Legion::Internal::InstanceSet const&, unsigned int)+0x2c5) [0x7fa6d78b5695]
  [9] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::RegionTreeForest::attach_external(Legion::Internal::AttachOp*, unsigned int, Legion::RegionRequirement const&, Legion::Internal::InstanceSet const&, Legion::Internal::VersionInfo const&, Legion::Internal::ApEvent, Legion::Internal::PhysicalTraceInfo const&, std::set<Legion::Internal::RtEvent, std::less<Legion::Internal::RtEvent>, std::allocator<Legion::Internal::RtEvent> >&, bool)+0x13b) [0x7fa6d7c5863b]
  [10] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::AttachOp::trigger_mapping()+0xa9) [0x7fa6d7a4e4a9]
  [11] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/liblegion.so.1(Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x771) [0x7fa6d7d846b1]
  [12] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/librealm.so.1(+0x510a01) [0x7fa6d660fa01]
  [13] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/librealm.so.1(+0x510ab5) [0x7fa6d660fab5]
  [14] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/librealm.so.1(+0x514595) [0x7fa6d6613595]
  [15] = /home/nfs/rorym/legate.core/_skbuild/linux-x86_64-3.10/cmake-build/_deps/legion-build/lib/librealm.so.1(+0x518d22) [0x7fa6d6617d22]
  [16] = /lib/x86_64-linux-gnu/libc.so.6(+0x5b4df) [0x7fa6d5d8b4df]
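
Most of the Python traceback above consists of pytest and pluggy frames. As a reading aid, here is a small sketch that extracts only the frames outside `site-packages` from a faulthandler-style dump. The function name `project_frames` and the default skip list are assumptions made for illustration.

```python
import re

# Matches faulthandler frame lines such as:
#   File "/path/to/file.py", line 157 in get_buffer
FRAME_RE = re.compile(
    r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>.+)$'
)


def project_frames(dump, skip=("site-packages", "legion_top.py")):
    """Return (path, line, function) tuples for frames whose path
    contains none of the substrings in skip, in dump order."""
    frames = []
    for raw in dump.splitlines():
        match = FRAME_RE.match(raw)
        if match is None:
            continue  # not a frame line (thread header, blank line, ...)
        if any(part in match["path"] for part in skip):
            continue  # test-runner or launcher frame
        frames.append((match["path"], int(match["line"]), match["func"]))
    return frames
```

Applied to the dump in this report, the first remaining frame is `get_buffer` in `legate/core/_legion/future.py`, reached from `mean` in `cunumeric/array.py` via `metric` in `legateboost/metrics.py`.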
