Tpetra: Make CrsMatrix do thread-parallel pack & unpack #800

Closed · 3 tasks
mhoemmen opened this issue Nov 9, 2016 · 5 comments

mhoemmen commented Nov 9, 2016

@trilinos/tpetra
Story: #797

See notes on #802. Try to share as much code with that solution as possible. For example, it would make sense to have a single pack function, and let callers decide whether they want to pack PIDs.
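As a purely hypothetical sketch of that shared-pack idea (none of these
names or signatures are the actual Tpetra interface), one entry point
could treat an empty PIDs view as "don't pack PIDs":

    #include <Kokkos_Core.hpp>

    // Hypothetical sketch: a single pack routine; callers opt in to
    // packing owning-process ranks (PIDs) via a nonempty sourcePids view.
    template <class LocalMatrix>
    void packCrsMatrix(const LocalMatrix& lclMatrix,
                       const Kokkos::View<char*>& exports,
                       const Kokkos::View<const int*>& exportLIDs,
                       const Kokkos::View<const int*>& sourcePids)
    {
      if (sourcePids.extent(0) == 0) {
        // ... pack each exported row's column indices and values only ...
      }
      else {
        // ... additionally pack the owning PID alongside each entry ...
      }
      (void) lclMatrix; (void) exports; (void) exportLIDs; // bodies elided
    }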

Steps:

It's much harder to do this for a dynamic graph, so we can skip that for now.

Thread parallelization of unpack should be over rows, so we should not need atomic updates when updating values in the matrix.

For the host-only thread parallelization of unpack, that's a single parallel_scan over local (row) indices to get offsets from byte counts of the unpack buffer. In the 'final' pass of the scan, actually unpack the data.
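A minimal Kokkos sketch of that pattern, assuming hypothetical flat views
for the receive buffer and the per-row byte counts (the real Tpetra
routine differs in detail):

    #include <Kokkos_Core.hpp>

    // One parallel_scan turns per-row byte counts into offsets; the
    // 'final' pass unpacks.  Threads own disjoint rows, so no atomics.
    void unpackAllRows(const Kokkos::View<const char*>& imports,
                       const Kokkos::View<const size_t*>& numBytesPerRow)
    {
      using host_exec = Kokkos::DefaultHostExecutionSpace;
      const size_t numRows = numBytesPerRow.extent(0);
      Kokkos::parallel_scan("unpack-rows",
        Kokkos::RangePolicy<host_exec>(0, numRows),
        KOKKOS_LAMBDA(const size_t row, size_t& offset, const bool finalPass) {
          if (finalPass) {
            // offset is the exclusive prefix sum: where this row's
            // bytes begin.  Unpack numBytesPerRow(row) bytes starting
            // at imports(offset) into local row 'row' of the matrix.
          }
          offset += numBytesPerRow(row);
        });
      (void) imports; // sketch: the byte-level unpacking itself is elided
    }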

For pack, we first have to change packRow so that it goes directly to the KokkosSparse::CrsMatrix if that exists, rather than going through the "generic" getLocalRowView / getGlobalRowView interfaces that return Teuchos::ArrayView instances.
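For illustration, packRowDirect below is a hypothetical stand-in for that
direct route; rowConst, length, colidx, and value are the KokkosSparse
row-view accessors, while the Packer callback is invented for the sketch:

    #include <KokkosSparse_CrsMatrix.hpp>

    // Pack one row by reading the KokkosSparse::CrsMatrix directly,
    // instead of via getLocalRowView / Teuchos::ArrayView (which is not
    // thread safe in debug mode and unusable on a GPU).
    template <class LocalMatrix, class Packer>
    KOKKOS_INLINE_FUNCTION
    void packRowDirect(const LocalMatrix& lclMatrix,
                       const typename LocalMatrix::ordinal_type lclRow,
                       const Packer& pack)
    {
      auto rowView = lclMatrix.rowConst(lclRow);
      for (typename LocalMatrix::ordinal_type k = 0; k < rowView.length; ++k) {
        pack(rowView.colidx(k), rowView.value(k));
      }
    }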

@mhoemmen mhoemmen self-assigned this Nov 9, 2016
@mhoemmen mhoemmen added this to the Tpetra-FY17-Q4 milestone Nov 9, 2016
mhoemmen pushed a commit that referenced this issue Mar 23, 2017
@trilinos/tpetra The versions of replaceGlobalValues and
sumIntoGlobalValues that take "raw" input arrays used to convert them
to Teuchos::ArrayView and then call the versions of those methods that
take Teuchos::ArrayView.  The latter in turn would convert _back_ to
raw arrays and then to Kokkos::View.  This commit removes the
intermediate step of conversion to Teuchos::ArrayView.

This is a small step towards #800.  Thread-parallelizing CrsMatrix
pack and unpack must begin with basic thread safety.  Since
Teuchos::ArrayView is not thread-safe in debug mode (see #229), and
since it will never work on a GPU, the first step is to stop using
Teuchos::ArrayView in pack and unpack.  Instead, we must use pointers
/ Kokkos::View objects all the way through.
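To illustrate the end state this commit works toward, here is a sketch of
wrapping caller-provided raw arrays in unmanaged Kokkos::Views (the
free-function signature is illustrative, not the actual method):

    #include <Kokkos_Core.hpp>

    // Raw input arrays wrapped directly in unmanaged host Views: no
    // copies, no Teuchos::ArrayView, safe to use from threaded code.
    void sumIntoValuesRaw(const long long gblColInds[],
                          const double vals[],
                          const int numEnt)
    {
      Kokkos::View<const long long*, Kokkos::HostSpace,
        Kokkos::MemoryTraits<Kokkos::Unmanaged>> inds(gblColInds, numEnt);
      Kokkos::View<const double*, Kokkos::HostSpace,
        Kokkos::MemoryTraits<Kokkos::Unmanaged>> values(vals, numEnt);
      // ... hand inds and values straight to a Kokkos-based combine kernel ...
      (void) inds; (void) values; // sketch: kernel call elided
    }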
mhoemmen pushed a commit that referenced this issue Mar 23, 2017
@trilinos/tpetra Add a new nonpublic method, combineGlobalValuesRaw,
to Tpetra::CrsMatrix.  This method bypasses Teuchos::ArrayView, and is
thus thread safe (see also #229) under the following conditions:

  1. The matrix has a static graph
  2. The CombineMode argument is ADD or REPLACE

We now use this method in unpackRow.  This is the first step of
thread-parallel unpack (see #800).
mhoemmen pushed a commit that referenced this issue Mar 23, 2017
@trilinos/tpetra Tpetra::CrsMatrix's unpackRow method no longer uses
Teuchos::Array* (which is not thread safe; see #229), under the
following conditions:

  1. The matrix has a static graph
  2. The CombineMode is ADD or REPLACE

Thus, under these conditions, CrsMatrix's unpackAndCombine method no
longer uses Teuchos::Array* either.  This brings us one step closer to
thread-parallel CrsMatrix unpack (#800).

Build/Test Cases Summary
Enabled Packages: TpetraCore, Belos, Zoltan2, Ifpack2, Amesos2, Xpetra, MueLu, Stokhos
Disabled Packages: FEI,PyTrilinos,Moertel,STK,SEACAS,ThreadPool,OptiPack,Rythmos,Intrepid,ROL,Panzer
0) MPI_RELEASE_DEBUG_SHARED_PT => Test case MPI_RELEASE_DEBUG_SHARED_PT was not run! => Does not affect push readiness! (-1.00 min)
1) MPI_DEBUG => passed: passed=485,notpassed=0 (63.01 min)
2) SERIAL_RELEASE => passed: passed=432,notpassed=0 (38.21 min)
Other local commits for this build/test group: acd76d8, 71725af
mhoemmen pushed a commit that referenced this issue Mar 26, 2017
@trilinos/tpetra This is related to #800.  See my comments there, in
particular "we first have to change packRow so that it goes directly
to the KokkosSparse::CrsMatrix if that exists, rather than going
through the 'generic' getLocalRowView / getGlobalRowView interfaces
that return Teuchos::ArrayView instances."  This commit is the first
step to accomplish that subgoal.

Tpetra::CrsMatrix::pack now uses packRowStatic when the graph is
static.  Otherwise, it falls back to packRow.  This ensures that the
new method gets tested.
mhoemmen commented:

See PR #1321. @tjfulle wrote it and I'm done improving it; just need to test downstream and push.

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Jul 15, 2017
Initial implementation of CrsMatrix threaded unpack.

Addresses: trilinos#800
Review: @mhoemmen

Test Summary:

Fri Jul 14 16:13:28 MDT 2017

Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages

Build: Passed (58.86 min)
Test: Passed (11.80 min)

100% tests passed, 0 tests failed out of 1475

Label Time Summary:
Amesos               =  20.83 sec (14 tests)
Amesos2              =  10.79 sec (8 tests)
Anasazi              = 121.33 sec (71 tests)
Belos                = 110.76 sec (70 tests)
Domi                 = 174.66 sec (125 tests)
FEI                  =  46.87 sec (43 tests)
Galeri               =   4.77 sec (9 tests)
Ifpack               =  65.05 sec (53 tests)
Ifpack2              =  44.34 sec (33 tests)
ML                   =  49.09 sec (34 tests)
MueLu                = 311.30 sec (56 tests)
NOX                  = 175.08 sec (106 tests)
OptiPack             =   6.90 sec (5 tests)
Panzer               = 316.53 sec (129 tests)
Pike                 =   4.30 sec (7 tests)
Piro                 =  30.97 sec (12 tests)
ROL                  = 1038.91 sec (133 tests)
Rythmos              = 222.97 sec (83 tests)
ShyLU                =   8.68 sec (5 tests)
Stokhos              = 131.40 sec (75 tests)
Stratimikos          =  42.14 sec (39 tests)
Teko                 = 107.08 sec (19 tests)
Tempus               = 741.42 sec (9 tests)
Thyra                =  67.89 sec (80 tests)
Tpetra               = 159.26 sec (132 tests)
TrilinosCouplings    =  53.78 sec (19 tests)
Xpetra               =  51.11 sec (17 tests)
Zoltan2              = 141.10 sec (97 tests)

Total Test time (real) = 708.18 sec

Total time for MPI_RELEASE_DEBUG_SHARED_PT = 70.66 min
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Jul 15, 2017
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Jul 15, 2017
Addresses: trilinos#800
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1476,notpassed=0 (36.23 min)
mhoemmen commented:

See #1503. I will squash those three commits into an atomic unit and run tests.

@mhoemmen mhoemmen added the "stage: in progress" label Jul 18, 2017
tjfulle commented Jul 18, 2017

@mhoemmen wrote:

See #1503. I will squash those three commits into an atomic unit and run tests.

Good idea - two of the intermediate commits don't build/run on CUDA :)

mhoemmen commented:

Argh, I can't build CUDA RELEASE without nvlink crashing....

mhoemmen pushed a commit that referenced this issue Jul 19, 2017
@trilinos/tpetra
Addresses: #800
Written by: @tjfulle
Review: @mhoemmen

@mhoemmen formed this commit by squashing the three commits in PR

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1476,notpassed=0 (36.23 min)

Tpetra CUDA tests also pass.
mhoemmen commented Jul 19, 2017

OK, I pushed this to develop. Thanks!

@mhoemmen mhoemmen removed the "stage: in progress" label Jul 19, 2017
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 2, 2017
@trilinos/tpetra, @mhoemmen

Comments
--------
This commit is a combination of several commits that address several
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

A summary of changes is as follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure.
- Modify PackTraits to run on threads by removing calls to
  TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with
  KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1483,notpassed=0 (79.62 min)
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 16, 2017
…aits

@trilinos/tpetra, @mhoemmen

Comments
--------
This single commit is a rebase of several commits that address the following
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

A summary of changes is as follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure.
- Modify PackTraits to run on threads by removing calls to
  TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with
  KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix's row unpack to *not* unpack directly into the
  local CrsMatrix, but to return the unpacked data instead.  This required
  allocating enough scratch space into which data could be unpacked.  We
  used Kokkos::UniqueToken to allocate the scratch space and to grab a
  subview of the scratch space unique to each thread.
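A minimal sketch of that Kokkos::UniqueToken pattern, with illustrative
element type and extents (the real code sizes the scratch by the largest
packed row):

    #include <Kokkos_Core.hpp>

    // Each concurrently running thread acquires a unique ID, claims its
    // own slice of a shared scratch view, and releases the ID when done.
    template <class ExecSpace>
    void unpackWithScratch(const size_t numRows, const size_t maxRowLength)
    {
      Kokkos::Experimental::UniqueToken<ExecSpace> token;
      // One scratch row per concurrently active thread, not per matrix row.
      Kokkos::View<double**, ExecSpace> scratch("scratch", token.size(),
                                                maxRowLength);
      Kokkos::parallel_for("unpack-with-scratch",
        Kokkos::RangePolicy<ExecSpace>(0, numRows),
        KOKKOS_LAMBDA(const size_t row) {
          const auto id = token.acquire(); // unique among running threads
          auto myScratch = Kokkos::subview(scratch, id, Kokkos::ALL());
          // ... unpack row 'row' into myScratch, then combine the
          //     results into the local CrsMatrix ...
          (void) myScratch;
          token.release(id);
        });
    }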
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 16, 2017
…aits

@trilinos/tpetra, @mhoemmen

This single commit is a rebase of several commits that address the following
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

A summary of changes is as follows:

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure.
- Modify PackTraits to run on threads by removing calls to
  TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with
  KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix's row unpack to *not* unpack directly into the
  local CrsMatrix, but to return the unpacked data instead.  This required
  allocating enough scratch space into which data could be unpacked.  We
  used Kokkos::UniqueToken to allocate the scratch space and to grab a
  subview of the scratch space unique to each thread.

Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages

0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1487,notpassed=0 (102.26 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1490,notpassed=0 (104.52 min)

Enabled Packages: TpetraCore
Disabled all Forward Packages

0) MPI_RELEASE_CUDA => passed: passed=124,notpassed=0
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Aug 16, 2017
…aits

@trilinos/tpetra, @mhoemmen

This single commit is a rebase of several commits that address the following
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

Summary
-------

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure.
- Modify PackTraits to run on threads by removing calls to
  TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with
  KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix's row unpack to *not* unpack directly into the
  local CrsMatrix, but to return the unpacked data instead.  This required
  allocating enough scratch space into which data could be unpacked.  We
  used Kokkos::UniqueToken to allocate the scratch space and to grab a
  subview of the scratch space unique to each thread.

Build/Test Case Summaries
-------------------------

Linux/SEMS, gcc 4.8.3, openmpi 1.8.7
------------------------------------

Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages

0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1487,notpassed=0 (102.26 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1490,notpassed=0 (104.52 min)

CUDA, gcc 5.4, openmpi 1.10.2, cuda 8.0.44
------------------------------------------

Enabled Packages: TpetraCore
Disabled all Forward Packages

0) MPI_RELEASE_CUDA => passed: passed=124,notpassed=0
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 21, 2017
…aits

@trilinos/tpetra, @mhoemmen

This single commit is a rebase of several commits that address the following
issues: trilinos#797, trilinos#798, trilinos#800, trilinos#802

Summary
-------

- Refactor CrsMatrix pack/unpack procedures to use PackTraits (for both static
  and dynamic profile matrices)
- Refactor packCrsMatrix to pack (optional) PIDs
- Remove the existing packAndPrepareWithOwningPIDs and instead use the
  aforementioned packCrsMatrix procedure.
- Modify PackTraits to run on threads by removing calls to
  TEUCHOS_TEST_FOR_EXCEPTION and decorating device code with
  KOKKOS_INLINE_FUNCTION
- Ditto for Stokhos' specialization of PackTraits
- Modify unpackCrsMatrix's row unpack to *not* unpack directly into the
  local CrsMatrix, but to return the unpacked data instead.  This required
  allocating enough scratch space into which data could be unpacked.  We
  used Kokkos::UniqueToken to allocate the scratch space and to grab a
  subview of the scratch space unique to each thread.

Build/Test Case Summaries
-------------------------

Linux/SEMS, gcc 4.8.3, openmpi 1.8.7
------------------------------------

Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages

0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1487,notpassed=0 (102.26 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1490,notpassed=0 (104.52 min)

CUDA, gcc 5.4, openmpi 1.10.2, cuda 8.0.44
------------------------------------------

Enabled Packages: TpetraCore
Disabled all Forward Packages

0) MPI_RELEASE_CUDA => passed: passed=124,notpassed=0
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 6, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller
  functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 16 tests failed out of 257

Label Time Summary:
MueLu      = 1690.79 sec (69 tests)
Stokhos    = 496.32 sec (63 tests)
Tpetra     = 404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)
240 - Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1 (Failed)
242 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)

According to @mhoemmen, the Stokhos failures are known failures.

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed
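A minimal sketch of the Kokkos::atomic_fetch_add usage this commit
describes; all names here are illustrative.  Several import entries may
target the same local row, so per-row counters must be updated atomically:

    #include <Kokkos_Core.hpp>

    // Count how many incoming entries land in each local row.  Distinct
    // iterations can hit the same row, hence the atomic increment.
    void countEntriesPerRow(const Kokkos::View<const int*>& targetRows,
                            const Kokkos::View<size_t*>& numEntPerRow)
    {
      Kokkos::parallel_for("count-entries",
        Kokkos::RangePolicy<>(0, targetRows.extent(0)),
        KOKKOS_LAMBDA(const size_t k) {
          Kokkos::atomic_fetch_add(&numEntPerRow(targetRows(k)), size_t(1));
        });
    }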
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 7, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller
  functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 14 tests failed out of 257

Label Time Summary:
MueLu      = 1690.79 sec (69 tests)
Stokhos    = 496.32 sec (63 tests)
Tpetra     = 404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller
  functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

All of the MueLu tests failed with the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller
  functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and their results amended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699

The failing Stokhos tests mentioned in trilinos#1655 were fixed with
commit e97e37b
mhoemmen pushed a commit that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller
  functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: #797, #800, #802
Review: @mhoemmen

Tests were run on two different machines and their results amended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see #1699

The failing Stokhos tests mentioned in #1655 were fixed with
commit e97e37b