Tpetra: Improve thread scalability of Import/Export #797
@trilinos/tpetra, @mhoemmen

This commit is a combination of several commits that address several issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802. A summary of the changes follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1483,notpassed=0 (79.62 min)
…aits

@trilinos/tpetra, @mhoemmen

This single commit is a rebase of several commits that address the following issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

Summary
-------
- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix row to *not* unpack directly into the local CrsMatrix but to return the unpacked data. This required allocating enough scratch space into which data could be unpacked. We used Kokkos::UniqueToken to allocate the scratch space and to grab a unique (to each thread) subview of the scratch space.

Build/Test Case Summaries
-------------------------

Linux/SEMS, gcc 4.8.3, openmpi 1.8.7
------------------------------------
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1487,notpassed=0 (102.26 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1490,notpassed=0 (104.52 min)

CUDA, gcc 5.4, openmpi 1.10.2, cuda 8.0.44
------------------------------------------
Enabled Packages: TpetraCore
Disabled all Forward Packages
0) MPI_RELEASE_CUDA => passed: passed=124,notpassed=0
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 16 tests failed out of 257

Label Time Summary:
MueLu = 1690.79 sec (69 tests)
Stokhos = 496.32 sec (63 tests)
Tpetra = 404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)
240 - Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1 (Failed)
242 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)

According to @mhoemmen, the Stokhos failures are known failures.

All of the MueLu tests failed with the following error:
MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed
We don't need all these things for FY17 Q4, so I'm moving this to the backlog.
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and their results amended to this commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699. The failing Stokhos tests mentioned in trilinos#1655 were fixed with commit e97e37b.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
This issue was closed due to inactivity for 395 days.
@trilinos/tpetra
Epic: #796
[mfh edit 13 Jul 2017: promote the transferAndFillComplete task, #802, into its own story]
This involves several tasks, not all of which have been fully identified:
For #800, #801, and #802, we need to do performance tests to make sure that the changes thread-scale without sacrificing performance in the MPI-only case. It may make sense to have a non-threaded implementation if the number of threads is 1. (Some users may use Tpetra's OpenMP back-end without realizing it, but run with 1 thread per MPI process. That's why this should be a run-time decision rather than a decision based on the back-end type.)