Tpetra: Make CrsMatrix::transferAndFillComplete do thread-parallel pack & unpack #802
Task breakdown, discussed today (13 Jul 2017) with @tjfulle, with questions for @csiefer2 and perhaps @jhux2 if he wishes to help:
NOTE: If we only care about transferAndFillComplete, then the target graph / matrix needs to be fill complete on return anyway. This means we don't actually need to solve (5) and (6). Instead, we can count the number of entries needed in each row, allocate the final fill-complete data structure (…
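A minimal sketch of that count-then-allocate idea in Kokkos (names like countsToRowOffsets are hypothetical, not Tpetra's actual code): count the entries each target row needs, prefix-sum the counts into row offsets, then allocate and fill the final arrays.

```cpp
#include <Kokkos_Core.hpp>

using device_view = Kokkos::View<size_t*>;

// Turn per-row entry counts into CRS row offsets via a prefix sum.
device_view countsToRowOffsets (const Kokkos::View<const size_t*>& counts)
{
  const int numRows = static_cast<int> (counts.extent (0));
  device_view offsets ("rowOffsets", numRows + 1);
  // Exclusive prefix sum: offsets(i) holds the sum of counts(0..i-1),
  // and offsets(numRows) holds the total number of entries to allocate.
  Kokkos::parallel_scan ("counts to offsets", numRows,
    KOKKOS_LAMBDA (const int i, size_t& update, const bool final) {
      if (final) {
        offsets(i) = update;
      }
      update += counts(i);
      if (final && i + 1 == numRows) {
        offsets(numRows) = update;
      }
    });
  return offsets;
}
```

With the offsets in hand, each row's entries can then be filled into the half-open range [offsets(row), offsets(row+1)) with no further coordination between rows.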
Does that answer your questions?
Thanks @csiefer2 for the quick reply! :-D
Yes, the target graph / matrix of the Import / Export.
Thanks for reminding us of that distinction -- it means that the target Crs{Graph,Matrix} doesn't even need to exist until TAFC returns it, so there is no such thing as "DynamicProfile" or "StaticProfile" for it. This implies that we can use whatever data structure we want for the target, since users only see the fill complete version of it.
Do you expect interface or pack format changes?
@trilinos/tpetra, @mhoemmen

Comments
--------
This commit is a combination of several commits that address several issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802. A summary of the changes follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1483,notpassed=0 (79.62 min)
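As a hypothetical illustration of the PackTraits change listed above (assumed names, not the actual Tpetra code): device code cannot call TEUCHOS_TEST_FOR_EXCEPTION, so a pack routine is marked KOKKOS_INLINE_FUNCTION and reports failure through a return code that the caller can reduce over, instead of throwing.

```cpp
#include <Kokkos_Core.hpp>
#include <cstring>

template <class T>
KOKKOS_INLINE_FUNCTION
int packValue (char outBuf[], const size_t bufSize, const T& inVal)
{
  if (bufSize < sizeof (T)) {
    return 1; // error code instead of an exception; the caller checks it
  }
  // Bitwise copy, as is typical for trivially copyable packed types.
  memcpy (outBuf, &inVal, sizeof (T));
  return 0;
}
```

A pack loop can then sum these return codes in a Kokkos::parallel_reduce and test the total on the host, preserving error reporting without throwing on the device.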
The packing portion of TAFC now uses …
Adding: #1569 addresses the packing portion.
…aits

@trilinos/tpetra, @mhoemmen

This single commit is a rebase of several commits that address the following issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

Summary
-------
- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix row to *not* unpack directly into the local CrsMatrix but to return the unpacked data. This required allocating enough scratch space into which data could be unpacked. We used Kokkos::UniqueToken to allocate the scratch space and to grab a unique (to each thread) subview of the scratch space.

Build/Test Case Summaries
-------------------------

Linux/SEMS, gcc 4.8.3, openmpi 1.8.7
------------------------------------
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1487,notpassed=0 (102.26 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1490,notpassed=0 (104.52 min)

CUDA, gcc 5.4, openmpi 1.10.2, cuda 8.0.44
------------------------------------------
Enabled Packages: TpetraCore
Disabled all Forward Packages
0) MPI_RELEASE_CUDA => passed: passed=124,notpassed=0
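A minimal sketch of the Kokkos::UniqueToken scratch-space pattern the commit message describes (assumed names and sizes; not the actual Tpetra code): preallocate one scratch row per concurrently executing thread, and have each thread claim a unique slice while it unpacks.

```cpp
#include <Kokkos_Core.hpp>

void unpackAllRows (const size_t numRows, const size_t maxRowNumEnt)
{
  using exec_space = Kokkos::DefaultExecutionSpace;
  Kokkos::Experimental::UniqueToken<exec_space> token;
  // token.size() bounds the number of concurrently active threads.
  Kokkos::View<double**> scratch ("unpack scratch", token.size (), maxRowNumEnt);

  Kokkos::parallel_for ("unpack rows",
    Kokkos::RangePolicy<exec_space> (0, numRows),
    KOKKOS_LAMBDA (const size_t row) {
      const auto id = token.acquire (); // id is unique among live threads
      auto myScratch = Kokkos::subview (scratch, id, Kokkos::ALL ());
      // ... unpack this row's bytes into myScratch, then combine the
      // unpacked data into the target matrix ...
      token.release (id);
    });
}
```

Because acquire() returns an id that no other live thread holds, each thread has exclusive use of its scratch subview until it calls release().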
Fixed in develop. See PR #1569.
@mhoemmen, it's not totally fixed. The packing portion is, but …
@tjfulle thanks for the clarification! If you like, you may either open up a new issue for the work left to do, or reopen this issue.
I've "completed" thread parallelizing I say "completed" because there are two parts of the algorithm that are not easily thread parallelizable and are still serial. These parts deal with local matrix rows that have contributions from multiple other processors. Thread parallelizing the current algorithm/data structures results in local row quantities being touched and updated by concurrent threads, leading to clashes and failures. I've got to think about whether a |
@tjfulle Awesome!!! :-D btw watch out for potential merge conflicts with my #1088 fix, coming in soon (possibly today). btw @tjfulle wrote:
Could you please clarify? As long as the target matrix's structure does not change, it sounds like you could just use atomic updates to resolve these thread conflicts.
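For instance, a minimal sketch of such an atomic update with Kokkos (hypothetical names; not Tpetra's actual code): if the target structure is fixed, concurrent contributions to the same entry can be combined with an atomic add rather than serialized.

```cpp
#include <Kokkos_Core.hpp>

KOKKOS_INLINE_FUNCTION
void combineIntoEntry (const Kokkos::View<double*>& values,
                       const size_t offset,
                       const double contribution)
{
  // Safe even when several threads update the same matrix entry.
  Kokkos::atomic_add (&values(offset), contribution);
}
```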
@mhoemmen wrote:
Perhaps I can clarify, but it would probably be easier to show you. Let me try. The difficulty is that data for a row is packed sequentially as …
After unpacking …

Did that make any sense?
@mhoemmen wrote:
Now you tell me! I'm running into other …
Could you send me a meeting invite, say for tomorrow afternoon? That might be easier. Thanks!
@mhoemmen, what, that didn't make sense? Meeting invite sent.
Just got my …
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 16 tests failed out of 257

Label Time Summary:
MueLu   = 1690.79 sec (69 tests)
Stokhos =  496.32 sec (63 tests)
Tpetra  =  404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)
240 - Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1 (Failed)
242 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)

According to @mhoemmen, the Stokhos failures are known failures. All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 14 tests failed out of 257

Label Time Summary:
MueLu   = 1690.79 sec (69 tests)
Stokhos =  496.32 sec (63 tests)
Tpetra  =  404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and the results amended to this commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699. The failing Stokhos tests mentioned in trilinos#1655 were fixed with commit e97e37b.
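A hedged sketch of the Kokkos::atomic_fetch_add pattern the commit messages mention (hypothetical names, not the actual Tpetra code): when several threads insert entries into the same local row, each atomically bumps that row's fill counter and thereby claims a private slot, so no two threads write the same position.

```cpp
#include <Kokkos_Core.hpp>

KOKKOS_INLINE_FUNCTION
size_t claimSlotInRow (const Kokkos::View<size_t*>& rowFillCount,
                       const Kokkos::View<const size_t*>& rowOffsets,
                       const size_t lclRow)
{
  // atomic_fetch_add returns the old counter value, which becomes this
  // thread's private offset within the row.
  const size_t k = Kokkos::atomic_fetch_add (&rowFillCount(lclRow), size_t (1));
  return rowOffsets(lclRow) + k;
}
```

The returned index can then be written without further synchronization, which is exactly how per-row clashes between concurrent contributors get resolved.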
@mhoemmen, can this issue be closed? Or is there more to be done?
@tjfulle It's done -- thanks! :-D Great work btw!
@trilinos/tpetra
"Superstory": #797
Tpetra::CrsMatrix::transferAndFillComplete implements a specialized pack and unpack for CrsMatrix. Tpetra's sparse matrix-matrix multiply uses this.
Try to share as much code with #800 as possible. See, e.g., packRow in Trilinos/packages/tpetra/core/src/Tpetra_Import_Util2.hpp. It would make sense to adapt PackTraits methods for use inside Kokkos::parallel_*. That would call for changes to Stokhos and perhaps also Sacado.
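A rough sketch, under assumed names (packRowDevice, rowOffsets, and exports are illustrative, not the actual Tpetra API), of what device-callable PackTraits methods would enable: packing every row of the source matrix in a single Kokkos::parallel_for.

```cpp
#include <Kokkos_Core.hpp>

template <class LocalMatrix, class PackRowFunc>
void packAllRows (const LocalMatrix& lclMatrix,
                  const Kokkos::View<char*>& exports,
                  const Kokkos::View<const size_t*>& rowOffsets,
                  PackRowFunc packRowDevice)
{
  const size_t numRows = rowOffsets.extent (0) - 1;
  Kokkos::parallel_for ("pack all rows (sketch)",
    Kokkos::RangePolicy<> (0, numRows),
    KOKKOS_LAMBDA (const size_t row) {
      // Each row writes into its own disjoint byte range of 'exports',
      // so all rows can be packed concurrently without synchronization.
      packRowDevice (lclMatrix, exports, rowOffsets(row), row);
    });
}
```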