Merge updates for HCLIB #57

Matthew-Whitlock · 2020-10-21T18:19:09Z

A set of changes for:

- A few minor bug fixes and memory management improvements
- Adds a function to query a list of failed processes
- Replaces MPI overrides with a custom error handler
  - TODO: This breaks the MPI Comm store from Fenix, which removes our ability to automatically destroy old communicators. We can leave this as a user responsibility, or we can do partial MPI overrides for all comm creation functions.
- Removes the Fenix request store, leaving it up to MPI to revoke any old MPI_Requests
- Adds a few tests which use asynchronous functions
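To illustrate the difference between wrapping every call and installing one error handler, here is a minimal sketch in plain C. It does not use MPI or Fenix: `toy_comm`, `toy_send`, `toy_record_failure`, and `toy_fail_list` are hypothetical stand-ins invented for this example, and the real implementation may differ. The communicator carries a handler callback, the unwrapped send invokes it when the peer is dead, and the handler builds the list of failed ranks that a query function then exposes.

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from MPI or Fenix. */
enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1 };
#define TOY_MAX_RANKS 8

typedef struct toy_comm toy_comm;
typedef void (*toy_errhandler)(toy_comm *comm, int err, int failed_rank);

struct toy_comm {
    int size;                   /* number of ranks in the communicator */
    int alive[TOY_MAX_RANKS];   /* 1 while the rank is running, 0 once it has failed */
    toy_errhandler handler;     /* invoked by the library on any error */
    int failed[TOY_MAX_RANKS];  /* ranks the handler has seen fail */
    int num_failed;
};

void toy_comm_init(toy_comm *c, int size, toy_errhandler handler) {
    if (size > TOY_MAX_RANKS) size = TOY_MAX_RANKS;
    c->size = size;
    for (int i = 0; i < TOY_MAX_RANKS; i++) c->alive[i] = (i < size);
    c->handler = handler;
    c->num_failed = 0;
}

/* The single error handler: record each failed rank once. */
void toy_record_failure(toy_comm *comm, int err, int failed_rank) {
    if (err != TOY_ERR_PROC_FAILED) return;
    for (int i = 0; i < comm->num_failed; i++)
        if (comm->failed[i] == failed_rank) return;
    if (comm->num_failed < TOY_MAX_RANKS)
        comm->failed[comm->num_failed++] = failed_rank;
}

/* An unwrapped "send": the library itself calls the handler on failure,
   so no per-function override is needed. */
int toy_send(toy_comm *comm, int dest) {
    if (dest < 0 || dest >= comm->size || !comm->alive[dest]) {
        if (comm->handler) comm->handler(comm, TOY_ERR_PROC_FAILED, dest);
        return TOY_ERR_PROC_FAILED;
    }
    return TOY_SUCCESS;
}

/* Query the failed ranks recorded so far; returns how many there are. */
int toy_fail_list(const toy_comm *comm, const int **list) {
    *list = comm->failed;
    return comm->num_failed;
}
```

In this model, new communication functions need no extra wrapper code, but the library also no longer sees communicator creation, which mirrors the trade-off noted in the TODO above.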

Fix some Travis/testing issues. Travis now pulls from the ULFM master branch when it needs to rebuild ULFM. Travis sets an environment variable enabling oversubscription during the tests, instead of oversubscription being enabled on all platforms when running make test. Tests that involve failure have their timeouts individually set to 1s, so they don't take 10+ seconds each with the default timeout of 10s. Simplified Travis scripts (no more .travis_helpers directory).

New function, "Fenix_test_cancelled", for checking whether pre-failure requests completed or were cancelled. One open problem still needs a solution: if a failure is found during an MPI_Test, that request has already been removed from MPI internals and replaced with MPI_REQUEST_NULL, so Fenix_test_cancelled will report that the request completed.
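The ambiguity described above can be modeled in a few lines of plain C. This is only a sketch under stated assumptions: it uses no MPI or Fenix code, every name (`toyreq`, `toyreq_test`, `toyreq_test_cancelled`) is hypothetical, and treating a still-live pre-failure handle as cancelled is an assumption made for the model. It shows why a check that can only inspect the handle cannot tell a request that completed from one whose test hit a failure.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from MPI or Fenix. */
#define TOYREQ_NULL (-1)
enum { TOYREQ_SUCCESS = 0, TOYREQ_ERR_PROC_FAILED = 1 };
typedef int toyreq;

/* Model of a test call: whether the operation completed or the peer
   failed, the handle is released and overwritten with the null value.
   Only the return code distinguishes the two outcomes. */
int toyreq_test(toyreq *req, bool peer_failed, bool *done) {
    *done = true;
    *req = TOYREQ_NULL;
    return peer_failed ? TOYREQ_ERR_PROC_FAILED : TOYREQ_SUCCESS;
}

/* Model of a cancelled check that only sees the handle.
   A null handle carries no history, so it is reported as completed
   (not cancelled). A handle that is still live after recovery is
   treated as cancelled; that rule is an assumption of this model. */
bool toyreq_test_cancelled(const toyreq *req) {
    if (*req == TOYREQ_NULL) return false;
    return true;
}
```

In the model, the only place the failure is still visible is the return code of the test call, so a caller that wants to distinguish the two cases has to keep that code itself.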

Implementing the custom error handler includes removing the option for comm_replace: users now must provide a comm pointer to fenix_init and cannot rely on Fenix to automatically replace their input comm with the resilient comm.

Merge in Issend test

* Travis fixes (sandialabs#55)

  Fix some Travis/testing issues. Travis now pulls from the ULFM master branch when it needs to rebuild ULFM. Travis sets an environment variable enabling oversubscription during the tests, instead of oversubscription being enabled on all platforms when running make test. Tests that involve failure have their timeouts individually set to 1s, so they don't take 10+ seconds each with the default timeout of 10s. Simplified Travis scripts (no more .travis_helpers directory).

* Revert "Travis fixes (sandialabs#55)" (sandialabs#56)

  Reverting un-reviewed PR, it was meant to be in my fork. This reverts commit a41fd3b.

* Update README.md
* Merge updates for HCLIB (sandialabs#57)
* Add ability to query which processes failed
* Add support for MPI_Test
* Add support for testing pre-failure requests
* Fix bug when ERR_PROC_FAILED/ERR_REVOKED discovered in MPI_Test
* Fix MPI_Wait w/ cancelled requests
* Add missing file to commit
* Fix bug with MPI_STATUS_IGNORE
* Fix another bug with MPI_Test
* Add no-jump recovery option
* Travis fixes (#2)

  Fix some Travis/testing issues. Travis now pulls from the ULFM master branch when it needs to rebuild ULFM. Travis sets an environment variable enabling oversubscription during the tests, instead of oversubscription being enabled on all platforms when running make test. Tests that involve failure have their timeouts individually set to 1s, so they don't take 10+ seconds each with the default timeout of 10s. Simplified Travis scripts (no more .travis_helpers directory).

* First pass at removing the request store

  New function, "Fenix_test_cancelled", for checking whether pre-failure requests completed or were cancelled. One open problem still needs a solution: if a failure is found during an MPI_Test, that request has already been removed from MPI internals and replaced with MPI_REQUEST_NULL, so Fenix_test_cancelled will report that the request completed.

* Implement custom errhandler

  This includes removing the option for comm_replace: users now must provide a comm pointer to fenix_init and cannot rely on Fenix to automatically replace their input comm with the resilient comm.

* Fenix comms are stack-allocated now, instead of malloced
* Cleanup redundant set_errhandler calls
* Fix data recovery bug
* Add usage instructions to all examples/tests
* Add support for MPI_Issend and MPI_Ssend (#3)

  Merge in Issend test

Co-authored-by: mwhitlo@sandia.gov <mwhitlo@sandia.gov>
Co-authored-by: sriraj <srirajpaul@gmail.com>
Co-authored-by: Keita Teranishi <knteran@sandia.gov>

Matthew-Whitlock and others added 17 commits February 12, 2020 10:22

Add ability to query which processes failed

0798fd3

Add support for MPI_Test

3582c7e

Add support for testing pre-failure requests

dfc7d58

Fix bug when ERR_PROC_FAILED/ERR_REVOKED discovered in MPI_Test

cee633d

Fix MPI_Wait w/ cancelled requests

e4c6a3f

Add missing file to commit

42814f6

Fix bug with MPI_STATUS_IGNORE

58e02c8

Fix another bug with MPI_Test

3071f29

Add no-jump recovery option

813540a

Implement custom errhandler

9badba6

This includes removing the option for comm_replace - users now must provide a comm pointer to fenix_init and cannot rely on fenix to automatically replace their input comm with the resilient comm.

Fenix comms are stack-allocated now, instead of malloced

3b69294

Cleanup redundant set_errhandler calls

7f56ade

Fix data recovery bug

fa569a4

Add usage instructions to all examples/tests

4c2539d

Add support for MPI_Issend and MPI_Ssend (#3)

68eea1c

Merge in Issend test

keitat approved these changes Oct 21, 2020

Matthew-Whitlock merged commit a6d6647 into sandialabs:master Nov 3, 2020
