
e4s: add upcxx +cuda #32157

Closed · wants to merge 2 commits

Conversation

eugeneswalker (Contributor)

Add upcxx +cuda to E4S stack

@wspear @bonachea

spackbot-app bot added the `core` (PR affects Spack core functionality) and `gitlab` (Issues related to gitlab integration) labels on Aug 15, 2022
bonachea (Contributor) previously approved these changes on Aug 15, 2022 and left a comment:

I have no easy way to test and no detailed knowledge of E4S internals, but looks right to me.

wspear previously approved these changes on Aug 15, 2022
tldahlgren previously approved these changes on Aug 16, 2022
eugeneswalker (Contributor, Author) commented on Aug 16, 2022:

Looks like we can't build this in our standard container environment due to lack of NVIDIA driver:

```text
1 error found in build log:
     36    nvcc: NVIDIA (R) Cuda compiler driver
     37    Copyright (c) 2005-2022 NVIDIA Corporation
     38    Built on Tue_May__3_18:49:52_PDT_2022
     39    Cuda compilation tools, release 11.7, V11.7.64
     40    Build cuda_11.7.r11.7/compiler.31294372_0
     41
  >> 42    failed to link against CUDA Driver API
```

tldahlgren (Contributor)

The package fails to build. See https://gitlab.spack.io/spack/spack/-/jobs/3029172.

bonachea (Contributor)

> Looks like we can't build this in our standard container environment due to lack of NVIDIA driver: […]

This sounds like the expected behavior: the CUDA support in UPC++ (and the underlying GASNet-EX communication layer) exists to allow GPUDirect RDMA offload communication. As such, the libraries require the CUDA Driver Library when using the +cuda variant. We don't see this as a problem in practice, since the CUDA support is not usable at runtime without a working CUDA driver.
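
For context, here is a minimal sketch of the kind of probe that fails in such a container. The paths and compiler invocation are illustrative assumptions, not taken from the PR or the UPC++ build:

```sh
# Illustrative probe: link a trivial program against the CUDA Driver API
# and run it. Paths are assumptions; adjust for the actual CUDA install.
cat > conftest.cpp <<'EOF'
#include <cuda.h>
int main() { return cuInit(0) == CUDA_SUCCESS ? 0 : 1; }
EOF
# In a container without the NVIDIA driver, either the link step fails
# (no libcuda.so to resolve -lcuda) or the run step fails (cuInit()
# cannot find a driver/device), mirroring the CI error above.
g++ conftest.cpp -I/usr/local/cuda/include -lcuda -o conftest.exe && ./conftest.exe
```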

eugeneswalker (Contributor, Author)

> This sounds like the expected behavior: the CUDA support in UPC++ (and the underlying GASNet-EX communication layer) exists to allow GPUDirect RDMA offload communication. […] We don't see this as a problem in practice, since the CUDA support is not usable at runtime without a working CUDA driver.

Of course! We have so many packages in E4S that I sometimes forget which ones require the driver. We can adapt our CI build environment in the future so that upcxx and other such packages are built in containers scheduled onto machines where the GPU driver is available. Thank you!

eugeneswalker (Contributor, Author) commented on Oct 3, 2022:

> This sounds like the expected behavior: […] the libraries require the CUDA Driver Library when using the +cuda variant. […]

Looking at the UPC++ configure script, I see:

```sh
1070   # check that the CUDA Driver API is linkable, adding explicit link flags if needed
1071   echo -e "#include <cuda.h>\n#include <cuda_runtime_api.h>\nint main() { cuInit(0); return 0; }" >| conftest.cpp
1072   for ldextra in '' '-lcuda' '-framework CUDA' 'FAIL'; do
1073     eval $CXX $UPCXX_CUDA_CPPFLAGS conftest.cpp -o conftest.exe $UPCXX_CUDA_LIBFLAGS $ldextra &> /dev/null
1074     if [[ $? -eq 0 ]] && ( ./conftest.exe 2>/dev/null ); then
1075       [[ -n "$ldextra" ]] && UPCXX_CUDA_LIBFLAGS+=" $ldextra"
1076       break
1077     fi
1078   done
```

Line 1073 above can exit OK using just the stub driver library, but the additional check on the exit code of ./conftest.exe does not pass without the real driver. For the purpose of generating binaries in an environment without the actual driver, and then installing those pre-built binaries into the intended run environment where the driver is available, could the exit-code check on ./conftest.exe be skipped? What do you think?
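
For illustration only, here is one shape that relaxation could take; the UPCXX_CUDA_SKIP_RUN_CHECK variable is hypothetical, not an existing UPC++ configure option:

```sh
# Hypothetical variant of the configure loop quoted above: a builder-set
# flag (invented name, not part of UPC++) accepts a successful link
# without the runtime probe, for build-only environments with no driver.
for ldextra in '' '-lcuda' '-framework CUDA' 'FAIL'; do
  eval $CXX $UPCXX_CUDA_CPPFLAGS conftest.cpp -o conftest.exe $UPCXX_CUDA_LIBFLAGS $ldextra &> /dev/null
  if [[ $? -eq 0 ]] && { [[ -n "$UPCXX_CUDA_SKIP_RUN_CHECK" ]] || ./conftest.exe 2>/dev/null; }; then
    [[ -n "$ldextra" ]] && UPCXX_CUDA_LIBFLAGS+=" $ldextra"
    break
  fi
done
```

Accepting a link-only probe would defer any failure to application runtime, so a flag like this would need to be an explicit opt-in by the builder.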

PHHargrove

We have a number of things going on at our end right now, including getting new releases of UPC++ and GASNet-EX out the door, with corresponding updates to their Spack packages to follow. Is there an impending cutoff date for a Spack and/or E4S release before which we need to address the inability to test this PR in an NVIDIA-driverless environment?

PHHargrove

Independent of any other points in the existing discussion thread, I want to respond to some things @eugeneswalker said:

> We can adapt our CI build environment in the future so that upcxx and other such packages are built in containers scheduled onto machines where the GPU driver is available.

AND

> For the purpose of generating binaries in an environment without the actual driver, and then installing those pre-built binaries into the intended run environment where the driver is available [...]

There are reasons other than just the CUDA driver library to be concerned about the prospect of configuring and building UPC++ and its underlying GASNet-EX communications library in one environment for later deployment in a different environment. So this "future adaptation" is something we should potentially discuss elsewhere (when @bonachea and I are not quite as pressed for time as we are this week).

PHHargrove

In response to what I think is the main sticking point for this pull request:

I believe the current configuration logic accurately reflects a probe for a supported environment. We link the driver API as a normal library and do not currently have all of the necessary logic to allow reliable substitution of the stub libs. Therefore, removing the return-code check would likely allow one to proceed past configuration only to fail at application runtime.

I am not sure how to best address the fact that "your standard container environment" wants to build the +cuda variant of UPC++ without the CUDA driver library installed. That doesn't correspond to an environment we currently support.
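
To illustrate the stub-library pitfall described above, a sketch under assumed paths (the stubs directory location varies by CUDA installation):

```sh
# Linking against the stub libcuda satisfies the linker in a driverless
# container, but the stub only provides symbols: at run time the binary
# still needs the real driver's libcuda.so.1, and with only the stub
# available cuInit() returns an error instead of initializing a device.
cat > conftest.cpp <<'EOF'
#include <cuda.h>
int main() { return cuInit(0) == CUDA_SUCCESS ? 0 : 1; }
EOF
g++ conftest.cpp -I/usr/local/cuda/include \
    -L/usr/local/cuda/lib64/stubs -lcuda -o conftest.exe   # link succeeds
./conftest.exe   # still fails without a working NVIDIA driver
```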

eugeneswalker (Contributor, Author)

> Is there an impending cutoff date for a Spack and/or E4S release before which we need to address the inability to test this PR in an NVIDIA-driverless environment?

No, not at all. It would be convenient if the driver weren't required, so that we could keep verifying that upcxx +cuda remains buildable as Spack changes, but it is definitely not urgent and is not a blocker for our deployments at facilities.

alalazo (Member) commented on Dec 12, 2022:

@spackbot run pipeline

spackbot-app bot commented on Dec 12, 2022:

I've started that pipeline for you!

auto-merge was automatically disabled February 24, 2024 18:26

Pull request was closed

Labels: core (PR affects Spack core functionality), cuda, e4s, gitlab (Issues related to gitlab integration)

6 participants