Tpetra: Improve thread scalability of Import/Export #797

Closed
3 of 5 tasks
mhoemmen opened this issue Nov 9, 2016 · 3 comments
Assignees: mhoemmen
Labels: CLOSED_DUE_TO_INACTIVITY (issue or PR closed by the GitHub Actions bot due to inactivity), MARKED_FOR_CLOSURE (issue or PR marked for auto-closure by the GitHub Actions bot), pkg: Tpetra, story (corresponds to a Kanban Story, vs. Epic or Task), TpetraRF

Comments

mhoemmen (Contributor) commented Nov 9, 2016

@trilinos/tpetra
Epic: #796

[mfh edit 13 Jul 2017: promote the transferAndFillComplete task, #802, into its own story]

This involves several tasks, not all fully identified:

For #800, #801, and #802, we need to do performance tests to make sure that the changes thread-scale without sacrificing performance in the MPI-only case. It may make sense to have a non-threaded implementation if the number of threads is 1. (Some users may use Tpetra's OpenMP back-end without realizing it, but run with 1 thread per MPI process. That's why this should be a run-time decision rather than a decision based on the back-end type.)
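For instance, a run-time dispatch on the thread count might look like the following sketch. This is illustrative only: unpackRows is a hypothetical helper, not an actual Tpetra function, and it assumes the work runs in the host execution space.

```c++
#include <Kokkos_Core.hpp>

// Hypothetical helper (not Tpetra's actual code): choose the serial or
// threaded path at run time, based on how many threads the execution
// space actually has, rather than on the back-end type.
template <class Functor>
void unpackRows (const int numRows, const Functor& f) {
  if (Kokkos::DefaultHostExecutionSpace ().concurrency () == 1) {
    // 1 thread per MPI process (e.g., an OpenMP build run with
    // OMP_NUM_THREADS=1): plain loop, no parallel dispatch or atomics.
    for (int i = 0; i < numRows; ++i) {
      f (i);
    }
  }
  else {
    // Genuinely threaded: let Kokkos dispatch in parallel.
    Kokkos::parallel_for ("unpackRows (sketch)",
      Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace> (0, numRows), f);
  }
}
```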

@mhoemmen mhoemmen added story The issue corresponds to a Kanban Story (vs. Epic or Task) pkg: Tpetra labels Nov 9, 2016
@mhoemmen mhoemmen added this to the Tpetra-FY17-Q4 milestone Nov 9, 2016
@mhoemmen mhoemmen self-assigned this Nov 9, 2016
@mhoemmen mhoemmen changed the title from "Tpetra: Improve thread scalability of Import/Export & transferAndFillComplete" to "Tpetra: Improve thread scalability of Import/Export" Jul 13, 2017
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 2, 2017
@trilinos/tpetra, @mhoemmen

Comments
--------
This commit is a combination of several commits that address several
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

A summary of the changes follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION
  and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
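As an editorial aside on that pattern (a minimal sketch, not the real PackTraits interface): device code cannot call TEUCHOS_TEST_FOR_EXCEPTION, so a device-callable pack routine is marked KOKKOS_INLINE_FUNCTION and reports failure through an error code that the host checks after the kernel finishes. Something like:

```c++
#include <Kokkos_Core.hpp>

// Hypothetical packValue, not the actual PackTraits API: callable on host
// and device, and signals errors by return code instead of throwing.
template <class T>
KOKKOS_INLINE_FUNCTION
int packValue (char outBuf[], const size_t bufSize, const T& inVal) {
  if (bufSize < sizeof (T)) {
    return 1; // on the host, this would have been TEUCHOS_TEST_FOR_EXCEPTION
  }
  // Byte-wise copy; safe in device code, unlike most of the C library.
  const char* src = reinterpret_cast<const char*> (&inVal);
  for (size_t k = 0; k < sizeof (T); ++k) {
    outBuf[k] = src[k];
  }
  return 0; // success
}
```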

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1483,notpassed=0 (79.62 min)
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 16, 2017
…aits

@trilinos/tpetra, @mhoemmen

Comments
--------
This single commit is a rebase of several commits that address the following
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

A summary of the changes follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION
  and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix row to *not* unpack directly into the local CrsMatrix
  but to return the unpacked data.  This required allocating enough scratch
  space into which data could be unpacked.  We used Kokkos::UniqueToken to
  allocate the scratch space and to grab a unique (to each thread) subview of
  the scratch space.
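To illustrate the Kokkos::UniqueToken pattern in that last bullet (a minimal sketch with made-up names, not the actual Tpetra code): the scratch View is sized by the number of concurrent threads, and each thread acquires a token that selects its private row of scratch.

```c++
#include <Kokkos_Core.hpp>

// Sketch only: unpack each row into thread-private scratch, then combine.
void unpackRowsWithScratch (const int numRows, const int maxRowNumEnt) {
  using exec_space = Kokkos::DefaultExecutionSpace;
  Kokkos::Experimental::UniqueToken<exec_space> token;

  // One scratch row per *concurrent thread*, not per matrix row.
  Kokkos::View<double**> scratch ("scratch", token.size (), maxRowNumEnt);

  Kokkos::parallel_for ("unpack (sketch)",
    Kokkos::RangePolicy<exec_space> (0, numRows),
    KOKKOS_LAMBDA (const int row) {
      const auto id = token.acquire (); // unique among running threads
      auto myScratch = Kokkos::subview (scratch, id, Kokkos::ALL ());
      // ... unpack this row's packed bytes into myScratch, then combine
      //     the result into the target matrix ...
      (void) myScratch; (void) row;
      token.release (id); // let another thread reuse this scratch slot
    });
}
```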
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 16, 2017
…aits

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 16, 2017
…aits

@trilinos/tpetra, @mhoemmen

This single commit is a rebase of several commits that address the following
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

Summary
-------

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure
- Modify PackTraits to run on threads by removing calls to TEUCHOS_TEST_FOR_EXCEPTION
  and decorating device code with KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix row to *not* unpack directly into the local CrsMatrix
  but to return the unpacked data.  This required allocating enough scratch
  space into which data could be unpacked.  We used Kokkos::UniqueToken to
  allocate the scratch space and to grab a unique (to each thread) subview of
  the scratch space.

Build/Test Case Summaries
-------------------------

Linux/SEMS, gcc 4.8.3, openmpi 1.8.7
------------------------------------

Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages

0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1487,notpassed=0 (102.26 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1490,notpassed=0 (104.52 min)

CUDA, gcc 5.4, openmpi 1.10.2, cuda 8.0.44
------------------------------------------

Enabled Packages: TpetraCore
Disabled all Forward Packages

0) MPI_RELEASE_CUDA => passed: passed=124,notpassed=0
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 21, 2017
…aits

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 6, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- Broke unpackAndCombineIntoCrsArrays up into many smaller functions (it
  was previously one large monolithic function); each of the smaller functions
  was refactored to be thread parallel.
- Identified and resolved race conditions, mostly by using
  Kokkos::atomic_fetch_add where appropriate.
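
As a hedged illustration of that last bullet (a made-up example, not the actual Tpetra kernel): when several threads may increment the same shared counter, say while counting the entries destined for each local row, Kokkos::atomic_fetch_add makes the update race-free.

```c++
#include <Kokkos_Core.hpp>

// Sketch only: count how many incoming entries target each local row.
// Multiple threads may hit the same row, so the increment must be atomic.
void countEntriesPerRow (Kokkos::View<const int*> rowOfEntry,
                         Kokkos::View<int*> numEntPerRow) {
  Kokkos::parallel_for ("countEntries (sketch)", rowOfEntry.extent (0),
    KOKKOS_LAMBDA (const int k) {
      // atomic_fetch_add returns the previous value (unused here) and
      // guarantees no increments are lost when threads collide on a row.
      Kokkos::atomic_fetch_add (&numEntPerRow (rowOfEntry (k)), 1);
    });
}
```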

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 16 tests failed out of 257

Label Time Summary:
MueLu      = 1690.79 sec (69 tests)
Stokhos    = 496.32 sec (63 tests)
Tpetra     = 404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)
240 - Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1 (Failed)
242 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)

According to @mhoemmen, the Stokhos failures are known failures.

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed
mhoemmen (Contributor, Author) commented Sep 6, 2017

We don't need all these things for FY17 Q4, so I'm moving this to the backlog.

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 7, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- Broke unpackAndCombineIntoCrsArrays up into many smaller functions (it
  was previously one large monolithic function); each of the smaller functions
  was refactored to be thread parallel.
- Identified and resolved race conditions, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 14 tests failed out of 257

Label Time Summary:
MueLu      = 1690.79 sec (69 tests)
Stokhos    = 496.32 sec (63 tests)
Tpetra     = 404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- Broke unpackAndCombineIntoCrsArrays up into many smaller functions (it
  was previously one large monolithic function); each of the smaller functions
  was refactored to be thread parallel.
- Identified and resolved race conditions, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- Broke unpackAndCombineIntoCrsArrays up into many smaller functions (it
  was previously one large monolithic function); each of the smaller functions
  was refactored to be thread parallel.
- Identified and resolved race conditions, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and their results amended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699

The failing Stokhos tests mentioned in trilinos#1655 were fixed with
commit e97e37b
mhoemmen pushed a commit that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- Broke unpackAndCombineIntoCrsArrays up into many smaller functions (it
  was previously one large monolithic function); each of the smaller functions
  was refactored to be thread parallel.
- Identified and resolved race conditions, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: #797, #800, #802
Review: @mhoemmen

Tests were run on two different machines and their results amended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see #1699

The failing Stokhos tests mentioned in #1655 were fixed with
commit e97e37b
@kddevin kddevin added this to Backlog in Tpetra Oct 26, 2017
@crtrott crtrott added this to Backlog in TpetraRF FY18 Oct 26, 2017
github-actions (bot) commented:

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Feb 20, 2021
github-actions (bot) commented:

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Mar 24, 2021
TpetraRF FY18 automation moved this from Backlog to Done Mar 24, 2021
Tpetra automation moved this from Backlog to Done Mar 24, 2021