
e4s: add upcxx +cuda #32157

Closed · wants to merge 2 commits

Conversation

eugeneswalker (Contributor)

Add upcxx +cuda to E4S stack

@wspear @bonachea

spackbot-app bot added the `core` (PR affects Spack core functionality) and `gitlab` (Issues related to gitlab integration) labels on Aug 15, 2022
bonachea (Contributor) previously approved these changes on Aug 15, 2022 and left a comment:

I have no easy way to test and no detailed knowledge of E4S internals, but looks right to me.

wspear previously approved these changes on Aug 15, 2022
tldahlgren previously approved these changes on Aug 16, 2022
eugeneswalker (Contributor, Author) commented on Aug 16, 2022:

Looks like we can't build this in our standard container environment due to lack of NVIDIA driver:

```text
1 error found in build log:
     36    nvcc: NVIDIA (R) Cuda compiler driver
     37    Copyright (c) 2005-2022 NVIDIA Corporation
     38    Built on Tue_May__3_18:49:52_PDT_2022
     39    Cuda compilation tools, release 11.7, V11.7.64
     40    Build cuda_11.7.r11.7/compiler.31294372_0
     41
  >> 42    failed to link against CUDA Driver API
```

tldahlgren (Contributor)

The package fails to build. See https://gitlab.spack.io/spack/spack/-/jobs/3029172.

bonachea (Contributor)

> Looks like we can't build this in our standard container environment due to lack of NVIDIA driver: […]

This sounds like the expected behavior: the CUDA support in UPC++ (and the underlying GASNet-EX communication layer) exists to allow GPUDirect RDMA offload communication. As such, the libraries require the CUDA Driver Library when using the +cuda variant. We don't see this as a problem in practice, since the CUDA support is not usable at runtime without a working CUDA driver.
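
For context, here is a minimal sketch of the kind of probe that fails in such a container. The paths and compiler invocation are illustrative assumptions, not taken from the PR or the UPC++ build:

```sh
# Illustrative probe: link a trivial program against the CUDA Driver API
# and run it. Paths are assumptions; adjust for the actual CUDA install.
cat > conftest.cpp <<'EOF'
#include <cuda.h>
int main() { return cuInit(0) == CUDA_SUCCESS ? 0 : 1; }
EOF
# In a container without the NVIDIA driver, either the link step fails
# (no libcuda.so to resolve -lcuda) or the run step fails (cuInit()
# cannot find a driver/device), mirroring the CI error above.
g++ conftest.cpp -I/usr/local/cuda/include -lcuda -o conftest.exe && ./conftest.exe
```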

eugeneswalker (Contributor, Author)

> This sounds like the expected behavior: the CUDA support in UPC++ (and the underlying GASNet-EX communication layer) exists to allow GPUDirect RDMA offload communication. […] We don't see this as a problem in practice, since the CUDA support is not usable at runtime without a working CUDA driver.

Of course! We have so many packages in E4S that I sometimes forget which ones require the driver. We can adapt our CI build environment in the future so that upcxx and other such packages are built in containers scheduled onto machines where the GPU driver is available. Thank you!

eugeneswalker (Contributor, Author) commented on Oct 3, 2022:

> This sounds like the expected behavior: […] the libraries require the CUDA Driver Library when using the +cuda variant. […]

Looking at the UPC++ configure script, I see:

```sh
1070   # check that the CUDA Driver API is linkable, adding explicit link flags if needed
1071   echo -e "#include <cuda.h>\n#include <cuda_runtime_api.h>\nint main() { cuInit(0); return 0; }" >| conftest.cpp
1072   for ldextra in '' '-lcuda' '-framework CUDA' 'FAIL'; do
1073     eval $CXX $UPCXX_CUDA_CPPFLAGS conftest.cpp -o conftest.exe $UPCXX_CUDA_LIBFLAGS $ldextra &> /dev/null
1074     if [[ $? -eq 0 ]] && ( ./conftest.exe 2>/dev/null ); then
1075       [[ -n "$ldextra" ]] && UPCXX_CUDA_LIBFLAGS+=" $ldextra"
1076       break
1077     fi
1078   done
```

Line 1073 above can exit OK using just the stub driver library, but the additional check on the exit code of ./conftest.exe does not pass without the real driver. For the purpose of generating binaries in an environment without the actual driver, and then installing those pre-built binaries into the intended run environment where the driver is available, could the exit-code check on ./conftest.exe be skipped? What do you think?
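
For illustration only, here is one shape that relaxation could take; the UPCXX_CUDA_SKIP_RUN_CHECK variable is hypothetical, not an existing UPC++ configure option:

```sh
# Hypothetical variant of the configure loop quoted above: a builder-set
# flag (invented name, not part of UPC++) accepts a successful link
# without the runtime probe, for build-only environments with no driver.
for ldextra in '' '-lcuda' '-framework CUDA' 'FAIL'; do
  eval $CXX $UPCXX_CUDA_CPPFLAGS conftest.cpp -o conftest.exe $UPCXX_CUDA_LIBFLAGS $ldextra &> /dev/null
  if [[ $? -eq 0 ]] && { [[ -n "$UPCXX_CUDA_SKIP_RUN_CHECK" ]] || ./conftest.exe 2>/dev/null; }; then
    [[ -n "$ldextra" ]] && UPCXX_CUDA_LIBFLAGS+=" $ldextra"
    break
  fi
done
```

Accepting a link-only probe would defer any failure to application runtime, so a flag like this would need to be an explicit opt-in by the builder.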

PHHargrove

We have a number of things going on at our end right now, including getting new releases of UPC++ and GASNet-EX out the door, with corresponding updates to their Spack packages to follow. Is there an impending cutoff date for a Spack and/or E4S release before which we need to address the inability to test this PR in an NVIDIA-driverless environment?

PHHargrove

Independent of any other points in the existing discussion thread, I want to respond to some things @eugeneswalker said:

> We can adapt our CI build environment in the future so that upcxx and other such packages are built in containers scheduled onto machines where the GPU driver is available.

AND

> For the purpose of generating binaries in an environment without the actual driver, and then installing those pre-built binaries into the intended run environment where the driver is available [...]

There are reasons other than just the CUDA driver library to be concerned about the prospect of configuring and building UPC++ and its underlying GASNet-EX communications library in one environment for later deployment in a different environment. So this "future adaptation" is something we should potentially discuss elsewhere (when @bonachea and I are not quite as pressed for time as we are this week).

PHHargrove

In response to what I think is the main sticking point for this pull request:

I believe the current configuration logic accurately reflects a probe for a supported environment. We link the driver API as a normal library and do not currently have all of the necessary logic to allow reliable substitution of the stub libs. Therefore, removing the return-code check would likely allow one to proceed past configuration only to fail at application runtime.

I am not sure how to best address the fact that "your standard container environment" wants to build the +cuda variant of UPC++ without the CUDA driver library installed. That doesn't correspond to an environment we currently support.
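
To illustrate the stub-library pitfall described above, a sketch under assumed paths (the stubs directory location varies by CUDA installation):

```sh
# Linking against the stub libcuda satisfies the linker in a driverless
# container, but the stub only provides symbols: at run time the binary
# still needs the real driver's libcuda.so.1, and with only the stub
# available cuInit() returns an error instead of initializing a device.
cat > conftest.cpp <<'EOF'
#include <cuda.h>
int main() { return cuInit(0) == CUDA_SUCCESS ? 0 : 1; }
EOF
g++ conftest.cpp -I/usr/local/cuda/include \
    -L/usr/local/cuda/lib64/stubs -lcuda -o conftest.exe   # link succeeds
./conftest.exe   # still fails without a working NVIDIA driver
```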

eugeneswalker (Contributor, Author)

> Is there an impending cutoff date for a Spack and/or E4S release before which we need to address the inability to test this PR in an NVIDIA-driverless environment?

No, not at all. It would be convenient if the driver weren't required, so that we could keep verifying that upcxx +cuda remains buildable as Spack changes, but it is definitely not urgent and is not a blocker for our deployments at facilities.

alalazo (Member) commented on Dec 12, 2022:

@spackbot run pipeline

spackbot-app bot commented on Dec 12, 2022:

I've started that pipeline for you!

auto-merge was automatically disabled February 24, 2024 18:26

Pull request was closed

Labels: core (PR affects Spack core functionality), cuda, e4s, gitlab (Issues related to gitlab integration)

6 participants