
Framework: Please remove problematic Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' with mass random test failures #11109

Closed
bartlettroscoe opened this issue Oct 5, 2022 · 19 comments
Labels: Framework tasks (used internally by Framework team), PA: Framework (Trilinos Framework Product Area), type: bug (the primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Oct 5, 2022

Bug Report

@trilinos/framework, @rppawlo, @ccober6, @e10harvey

Internal issues

Description

The current Trilinos PR ats2 CUDA build, which uses cuda-10.1.243 on 'vortex', is problematic because it is randomly taking out PR test iterations with mass test failures.

Given the move to C++17 and the requirement of CUDA 11, and the long-known problems on this system killing PR builds:

...

Given (as shown in this SPARC CDash query and this SPARC CDash query) that SPARC has switched to using cuda-11.2.152 on 2022-04-26 (and EMPIRE is switching as well) ...

Given that a Trilinos PR CUDA 11 build has been running for some time on 'ascicgpu' machines ...

Can we ask that this problematic Trilinos PR build be removed from Trilinos PR testing? If you want to keep it for develop-to-master merge testing (and perhaps just run it as an informational build), that is fine. But please stop it from blocking the merges of PRs.

@bartlettroscoe bartlettroscoe added the type: bug label Oct 5, 2022
@bartlettroscoe bartlettroscoe added this to ToDo in Trilinos TriBITS Refactor via automation Oct 5, 2022
@e10harvey
Contributor

CUDA 10 will be removed as part of the C++17 transition. CUDA 10 and CUDA 11 currently result in a different set of package enables. In order to preserve code coverage for the cuda toolchain, I suggest waiting until the C++17 transition is complete before disabling CUDA 10. That being said, we can certainly disable CUDA 10 now if the product leaders want that.

@bartlettroscoe bartlettroscoe added the Framework tasks and PA: Framework labels Oct 5, 2022
@bartlettroscoe
Member Author

What important Trilinos customer currently needs CUDA 10 support from Trilinos?

@e10harvey
Contributor

What important Trilinos customer currently needs CUDA 10 support from Trilinos?

@ccober6 ?

From my perspective, we don't want to lose code coverage with the CUDA toolchain by disabling CUDA 10 before the C++17 transition.

@bartlettroscoe
Member Author

Vortex never should have been used to run PR builds blocking people's PRs from merging. That ATS2 system has too many known problems to subject PR testing to this system.

Plus, it is very difficult for regular developers to reproduce full builds and test runs on that system.

@rppawlo
Contributor

rppawlo commented Oct 5, 2022

Since there are a bunch of packages disabled in the cuda 11 build (STK, NOX, Panzer at least), we can't drop the cuda 10 build until they are covered. Currently empire is still using cuda 10 for all of its builds and testing. They are working on changing to cuda 11 now.

#11087

@bartlettroscoe
Member Author

Currently empire is still using cuda 10 for all of its builds and testing. They are working on changing to cuda 11 now.

@rppawlo, what is going to happen to EMPIRE when Trilinos upgrades to the Kokkos version that requires C++17 and Trilinos switches to requiring C++17? Will EMPIRE no longer be able to use Trilinos 'develop'?

Does the Kokkos version that requires C++17 need to be integrated before Trilinos 14.0 is branched?

I am almost 100% certain that the mass test failures on vortex blocking the merge of my PRs #11093 and #11099 have nothing to do with the changes in those PRs (other than triggering the enable of all Trilinos packages).

@cgcgcg cgcgcg changed the title Framework: Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' is irrelevant, customers now using cuda-1.2.152 Framework: Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' is irrelevant, customers now using cuda-11.2.152 Oct 5, 2022
@DaveBeCoding DaveBeCoding pinned this issue Oct 5, 2022
@rppawlo
Contributor

rppawlo commented Oct 5, 2022

@bartlettroscoe - David is working on updating the empire builds to c++17 and the new TPL stacks as we speak. The plan is to have it done near the end of October (hopefully earlier). We will continue to use the develop branch, but we may have a small gap in daily updating depending on the timing.

Trilinos 14.0 will include the Kokkos 4.0 release and the c++17 requirement; they were timed to happen together.

@jhux2
Member

jhux2 commented Oct 6, 2022

Vortex never should have been used to run PR builds blocking people's PRs from merging. That ATS2 system has too many known problems to subject PR testing to this system.

Plus, it is very difficult for regular developers to reproduce full builds and test runs on that system.

Is there another PR testing machine that is a good proxy for ATS2?

@bartlettroscoe
Member Author

bartlettroscoe commented Oct 6, 2022

Is there another PR testing machine that is a good proxy for ATS2?

@jhux2, not that I know of.

But to be clear, I am not saying don't run automated builds on ATS2 'vortex'. I am saying don't block PR merges based on results from this flaky system; use it for nightly testing instead. The stability of ATS2 'vortex' is okay for nightly testing in my experience. (For example, the ATDM nightly build monitoring scripts filter out the failing tests from the mass failures described in #6861 and #7122 so they don't hamper the ability to effectively monitor those builds and detect new test failures.) But that does not work for PR testing: you can't just ignore 3K tests that crash due to these system issues in a PR build, because that means you have no feedback from those 3K tests. As long as this does not happen too frequently in the nightly builds, though, it is quite workable.

@rppawlo
Contributor

rppawlo commented Oct 6, 2022

@jhux2 The current use of vortex was a stopgap measure to recover PR testing. When weaver burned out, vortex was the only option at the time. It was intended to be temporary until the new asc machines were set up. I believe the framework team is working on moving the builds over, but the GenConfig builds on the new asc platforms are currently disabling too many packages. STK, Panzer and a number of other packages are disabled. Once we can reenable all packages in the new cuda 11 builds, we should be able to drop vortex from PR testing. We all know it is not the right platform for PR testing - it was a desperation move because PR testing was down for almost 2 weeks at that time and that was the only other working cuda autotester build. The transition to cuda 11 should come at the end of this month, so hopefully we will be in a better place soon...

@jhux2
Member

jhux2 commented Oct 6, 2022

@bartlettroscoe @rppawlo Thanks, that makes a lot of sense.

@csiefer2
Member

csiefer2 commented Oct 7, 2022

I would also add that there are a bunch of individual tests disabled on CUDA 11 that really need to get resolved before we can say with confidence that Trilinos really works with CUDA 11, in addition to half the packages being disabled as @rppawlo noted :)

@bartlettroscoe
Member Author

Why can't we set up CUDA 10.1.243 builds on the 'ascicgpu' machines? There are ATDM Trilinos builds running just fine on 'ascicgpu' machines as shown here:

@rppawlo, note that these ATDM RHEL7 CUDA 10.1.243 builds are all passing the configure and build when using find_package(CUDAToolkit), as you can see, for example, in the build Trilinos-atdm-sems-rhel7-cuda-10.1-Volta70-complex-shared-release-debug, which shows:

Processing enabled external package/TPL: CUDA (enabled explicitly, disable with -DTPL_ENABLE_CUDA=OFF)
-- Found CUDAToolkit: /projects/sems/install/rhel7-x86_64/sems/compiler/cuda/10.1/base/include (found version "10.1.243") 
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  

and in the build Trilinos-atdm-cee-rhel7_cuda-10.1.243_gnu-7.2.0_openmpi-4.0.3_shared_dbg, which shows:

Processing enabled external package/TPL: CUDA (enabled explicitly, disable with -DTPL_ENABLE_CUDA=OFF)
-- Found CUDAToolkit: /projects/sierra/linux_rh7/SDK/compilers/nvidia/cuda_10.1.243/include (found version "10.1.243") 
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  

So that is two completely different stacks of CUDA 10.1.243 working with updated Trilinos 'develop'.
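
For context on the CMake mechanism referenced above, here is a minimal, hypothetical sketch (not the TriBITS/Trilinos TPL logic itself) of how a CMake project locates the CUDA toolkit through find_package(CUDAToolkit) and links its runtime; the project and target names are made up for illustration:

# Minimal standalone sketch; assumes CMake >= 3.17, where the FindCUDAToolkit module was introduced.
cmake_minimum_required(VERSION 3.17)
project(cuda_toolkit_probe LANGUAGES CXX)  # hypothetical project name

# Locate the CUDA toolkit (e.g. cuda-10.1.243 or cuda-11.2.152 on the systems discussed here).
find_package(CUDAToolkit 10.1 REQUIRED)

message(STATUS "CUDAToolkit version:  ${CUDAToolkit_VERSION}")
message(STATUS "CUDAToolkit includes: ${CUDAToolkit_INCLUDE_DIRS}")

# Link against the CUDA runtime via the imported target provided by FindCUDAToolkit.
add_executable(probe main.cpp)                    # hypothetical target and source file
target_link_libraries(probe PRIVATE CUDA::cudart)

The "-- Found CUDAToolkit: ..." lines in the configure output above correspond to what this kind of find_package() call prints when it succeeds.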

@bartlettroscoe
Member Author

bartlettroscoe commented Oct 7, 2022

I would also add that there are a bunch of individual tests disabled on CUDA 11 that really need to get resolved before we can say with confidence that Trilinos really works with CUDA 11, in addition to half the packages being disabled as @rppawlo noted :)

Then someone needs to fix that ASAP.

@rppawlo
Contributor

rppawlo commented Oct 7, 2022

@bartlettroscoe @csiefer2 - I've got the cuda 11 errors figured out. The cuda 11 build is linking against a blas/lapack that is threaded and built for openmp. The new build uses this library for both blas and lapack:

libopenblas.so.0 => /projects/sems/install/rhel7-x86_64/sems/v2/tpl/openblas/0.3.10/gcc/10.1.0/base/ceyagsr/lib/libopenblas.so.0 (0x00007f35d7b4c000)

while the old cuda 10 build points to the machine-installed blas/lapack:

liblapack.so.3 => /lib64/liblapack.so.3 (0x00007f954a753000)
libblas.so.3 => /lib64/libblas.so.3 (0x00007f954a4fa000)

In the genconfig ini files, we have this (note the required openmp library, -lgomp):

set(TPL_BLAS_LIBRARY_DIRS $ENV{CBLAS_ROOT}/lib CACHE STRING "from .ini configuration" FORCE)
set(TPL_BLAS_LIBRARIES $ENV{CBLAS_ROOT}/lib/libblas.a;-L$ENV{CBLAS_ROOT}/lib;-lgfortran;-lgomp;-lm CACHE STRING "from .ini configuration" FORCE)
set(TPL_LAPACK_LIBRARY_DIRS $ENV{LAPACK_ROOT}/lib CACHE STRING "from .ini configuration" FORCE)
set(TPL_LAPACK_LIBRARIES -L$ENV{LAPACK_ROOT}/lib;-lgfortran;-lgomp;$ENV{LAPACK_ROOT}/lib/liblapack.a CACHE STRING "from .ini configuration" FORCE)

If I comment out the lines above and use the two lines below, then all NOX tests are passing with the new cuda 11 genconfig build. Looks like we can probably reenable everything.

set(TPL_BLAS_LIBRARIES /lib64/libblas.so.3 CACHE STRING "from .ini configuration" FORCE)
set(TPL_LAPACK_LIBRARIES /lib64/liblapack.so.3 CACHE STRING "from .ini configuration" FORCE)

This explains why NOX tests that don't use anything but lapack (no kokkos, no tpetra, no epetra) were failing. The lapack linear solves are threaded and generate race conditions in our build. I got a clue when valgrind slowed the tests down and all of a sudden they started passing :) The nox examples were generating garbage directions from the lapack linear solve.

@trilinos/framework @e10harvey @jwillenbring - you either need to find a normal non-openmp module for blas and lapack or point to the default machine-installed one in RHEL as I did above.
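
As a rough editorial illustration of the mismatch described above (a hypothetical standalone sketch, not part of the GenConfig ini files or the Trilinos build system), a small CMake project can surface the problem at configure time by inspecting what FindBLAS/FindLAPACK picked up and warning when the link line looks like a threaded OpenBLAS/OpenMP build:

# Hypothetical sketch; the project name and warning heuristic are illustrative only.
cmake_minimum_required(VERSION 3.17)
project(blas_threading_probe LANGUAGES C)

find_package(BLAS REQUIRED)
find_package(LAPACK REQUIRED)

message(STATUS "BLAS libraries:   ${BLAS_LIBRARIES}")
message(STATUS "LAPACK libraries: ${LAPACK_LIBRARIES}")

# Heuristic: flag OpenBLAS or an explicit OpenMP runtime (gomp) in the link line,
# which is the combination that produced the race conditions in the serial NOX/LAPACK tests.
if("${BLAS_LIBRARIES};${LAPACK_LIBRARIES}" MATCHES "openblas|gomp")
  message(WARNING
    "BLAS/LAPACK appear to be threaded (OpenMP) builds; consider pointing "
    "TPL_BLAS_LIBRARIES and TPL_LAPACK_LIBRARIES at a serial BLAS/LAPACK "
    "(e.g. the system /lib64 libraries) as suggested above.")
endif()

In the PR configuration itself, the simpler fix is the one shown above: point TPL_BLAS_LIBRARIES and TPL_LAPACK_LIBRARIES at a serial BLAS/LAPACK.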

@csiefer2
Member

csiefer2 commented Oct 8, 2022

That might explain the Ifpack (not Ifpack2) failures as well.

@rppawlo I'd recommend just submitting a PR for changing the ini files

@bartlettroscoe
Member Author

@rppawlo I'd recommend just submitting a PR for changing the ini files

@rppawlo, let me know if you want any help with that. It might even be good to pair program a quick session with this.

srbdev added commits to srbdev/Trilinos that referenced this issue on Oct 14, Oct 17, and Oct 18, 2022, each with the message:
Also made changes suggested in trilinos#11109 (comment) to fix failing NOX tests. I was able to successfully build and test the CUDA 11 build with these changes.
@bartlettroscoe bartlettroscoe unpinned this issue Oct 23, 2022
srbdev added further commits to srbdev/Trilinos that referenced this issue with the same message on Oct 24 and Oct 25, 2022.
@bartlettroscoe bartlettroscoe changed the title Framework: Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' is irrelevant, customers now using cuda-11.2.152 Framework: Please remove Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' with mass random test failures Oct 26, 2022
@bartlettroscoe bartlettroscoe changed the title Framework: Please remove Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' with mass random test failures Framework: Please remove problematic Trilinos PR ats2 build using cuda-10.1.243 on 'vortex' with mass random test failures Oct 26, 2022
@bartlettroscoe
Member Author

Now that I have rephrased this Issue, I will use this as the pinned issue to alert Trilinos developers to this problem (to try to help avoid periodic inquiries like #11175 (comment)).

@bartlettroscoe bartlettroscoe pinned this issue Oct 26, 2022
srbdev added another commit to srbdev/Trilinos that referenced this issue with the same message on Nov 15, 2022.
@bartlettroscoe
Member Author

Looks like the 'ats2' builds on 'vortex' are gone after the PR/MM builds upgrade that completed on the 29th. For example, you can see 'ats2' builds on 'vortex' nodes up until Nov 28 in this query, but there are no 'ats2' builds in the last three days as shown here, so this problematic build is now gone!

Closing as complete!

@bartlettroscoe bartlettroscoe unpinned this issue Dec 1, 2022