MueLu: Test failures on CUDA #1699

Closed
tjfulle opened this issue Sep 6, 2017 · 5 comments
tjfulle (Contributor) commented Sep 6, 2017

@trilinos/muelu

The following tests fail on CUDA (ride.sandia.gov, gcc 5.4, openmpi 1.10.4, cuda 8.0.44):

 158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
 159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
 160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
 161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
 162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
 163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
 164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
 165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
 166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
 167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
 168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
 171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
 172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
 173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)

All of the failing MueLu tests reported the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed
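For context, this error means the matrix handed to MueLu::EpetraOperator is not actually Epetra-backed, so the downcast fails. The sketch below is a minimal illustration of that failure mode; the simplified class hierarchy and the asEpetra helper are assumptions for illustration, not the actual MueLu/Xpetra code.

```cpp
// Minimal sketch (not the actual MueLu/Xpetra source): Comm() can only be
// forwarded to Epetra when the Xpetra matrix really wraps an Epetra_CrsMatrix.
// If the matrix was built on a different backend (e.g. Tpetra/Kokkos in a CUDA
// build), the dynamic_cast returns null and the exception below is thrown.
#include <stdexcept>

namespace Xpetra {
  struct CrsMatrix { virtual ~CrsMatrix() = default; };  // simplified base class
  struct EpetraCrsMatrix : CrsMatrix {};                  // simplified Epetra-backed specialization
}

// Hypothetical helper illustrating the failing cast.
const Xpetra::EpetraCrsMatrix& asEpetra(const Xpetra::CrsMatrix& A) {
  const auto* eA = dynamic_cast<const Xpetra::EpetraCrsMatrix*>(&A);
  if (eA == nullptr) {
    throw std::runtime_error("MueLu::EpetraOperator::Comm(): Cast from "
                             "Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed");
  }
  return *eA;
}
```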
jhux2 added the pkg: MueLu and impacting: tests labels Sep 6, 2017
jhux2 (Member) commented Sep 6, 2017

@tjfulle These have been failing for some time now. If they are blocking your check-ins or testing, it is safe to disable or ignore them.

tjfulle (Contributor, Author) commented Sep 6, 2017

Thanks @jhux2! I figured the failures were not due to my work, but thought I'd post them to be sure.

tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 7, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller functions
  was refactored to be thread-parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
94% tests passed, 14 tests failed out of 257

Label Time Summary:
MueLu      = 1690.79 sec (69 tests)
Stokhos    = 496.32 sec (63 tests)
Tpetra     = 404.65 sec (126 tests)

The following tests FAILED:
158 - MueLu_Navier2DBlocked_Epetra_MPI_4 (Failed)
159 - MueLu_Navier2DBlocked_xml_format_MPI_4 (Failed)
160 - MueLu_Navier2DBlocked_xml_format2_MPI_4 (Failed)
161 - MueLu_Navier2DBlocked_xml_blockdirect_MPI_4 (Failed)
162 - MueLu_Navier2DBlocked_xml_bgs1_MPI_4 (Failed)
163 - MueLu_Navier2DBlocked_xml_bs1_MPI_4 (Failed)
164 - MueLu_Navier2DBlocked_xml_bs2_MPI_4 (Failed)
165 - MueLu_Navier2DBlocked_xml_sim1_MPI_4 (Failed)
166 - MueLu_Navier2DBlocked_xml_sim2_MPI_4 (Failed)
167 - MueLu_Navier2DBlocked_xml_uzawa1_MPI_4 (Failed)
168 - MueLu_Navier2DBlocked_xml_indef1_MPI_4 (Failed)
171 - MueLu_Navier2DBlocked_BraessSarazin_MPI_4 (Failed)
172 - MueLu_Navier2DBlockedReuseAggs_MPI_4 (Failed)
173 - MueLu_Navier2DBlocked_Simple_MPI_4 (Failed)

All of the failing MueLu tests reported the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
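The commit message above mentions resolving race conditions with Kokkos::atomic_fetch_add; the following is a minimal sketch of that pattern (the counting kernel and the names below are illustrative only, not the actual Tpetra unpack code).

```cpp
// Illustrative only -- not the actual Tpetra unpack code.  Several threads may
// update the same per-row counter, so a plain "+=" would race; Kokkos'
// atomic_fetch_add makes the increment safe on both host and CUDA backends.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int numRows    = 10;
    const int numEntries = 1000;
    Kokkos::View<int*> rowCounts("rowCounts", numRows);   // zero-initialized

    Kokkos::parallel_for("count entries per row", numEntries,
      KOKKOS_LAMBDA(const int i) {
        const int row = i % numRows;                       // many entries map to one row
        Kokkos::atomic_fetch_add(&rowCounts(row), 1);      // thread-safe increment
      });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```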
tawiesn (Contributor) commented Sep 7, 2017

I can reproduce the problem on geminga.

tawiesn (Contributor) commented Sep 7, 2017

OK, I found the problem and fixed it. I will push the fix tomorrow.

tawiesn self-assigned this Sep 7, 2017
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller functions
  was refactored to be thread-parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Build/Test Cases Summary
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

All of the failing MueLu tests reported the following error:

MueLu::EpetraOperator::Comm(): Cast from Xpetra::CrsMatrix to Xpetra::EpetraCrsMatrix failed

These tests can be ignored, see trilinos#1699
tjfulle added a commit to tjfulle/Trilinos that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller functions
  was refactored to be thread-parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and their results appended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see trilinos#1699

The failing Stokhos tests mentioned in trilinos#1655 were fixed with
commit e97e37b
mhoemmen pushed a commit that referenced this issue Sep 8, 2017
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp
  to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it
  was previously one large, monolithic function).  Each of the smaller functions
  was refactored to be thread-parallel.
- Race conditions were identified and resolved, mostly by using
  Kokkos::atomic_fetch_add where appropriate.

Addresses: #797, #800, #802
Review: @mhoemmen

Tests were run on two different machines and their results appended to this
commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored, see #1699

The failing Stokhos tests mentioned in #1655 were fixed with
commit e97e37b
tawiesn added a commit that referenced this issue Sep 8, 2017
…t,serial)

This fixes issue #1699

Build/Test Cases Summary
Enabled Packages: MueLu
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=446,notpassed=0 (146.33 min)
jhux2 added this to To do in MueLu May 18, 2018
jhux2 (Member) commented Sep 17, 2020

Closing, as the code base has changed significantly.

jhux2 closed this as completed Sep 17, 2020
MueLu automation moved this from To do to Done Sep 17, 2020