Alternate CUDA provider #19365
I also think solution 2 is the way to go. |
Agreed. |
I reached out to one of the NVHPC architects, @brycelelbach, and got the following detailed info on nvhpc and the cuda toolkit:
|
This has come up a couple times very recently (SDKs including implementations of packages) and I am considering an alternative approach:
The goal of this would be to avoid the work of converting packages to virtuals when an SDK provides them (so I agree that once packages are converted to virtuals, Spack should be able to resolve these sorts of issues, but I think it would be ideal to avoid the need for that conversion). I'm curious what you all think of that. |
My product manager pointed out another caveat I didn't fully cover: while NVHPC packages specific existing versions of the CUDA toolkit, the CUDA libraries (cuBLAS, etc.) are independently semantically versioned, and the versions in an NVHPC release may differ from the versions in the corresponding CUDA toolkit release. There are also libraries in the NVHPC SDK that are not in the CTK. So, to summarize:
So, my suggestions:
Given all the above, I'd still encourage something like option 2. |
Now that I've given this some more thought, I think it would be useful for me to understand how y'all package all the NVIDIA-related stuff. E.g. what are the packages you have today, and what version schemes are associated with them. We, NVIDIA, could potentially take on some of the work here in defining Spack packages, if that would be helpful. |
We would love to get contributions directly from NVIDIA. Another thing you can do is add the GitHub handles of any NVIDIA employees who would like to be listed as official maintainers for the build recipe. This gives us someone to ping when we review PRs or get reports of build issues. |
Ah, it seems there was some confusion. @samcmill is my coworker at NVIDIA and the relevant person. I think neither I nor Axel realized he filed this bug! |
In case folks don't realize it, I am employed by NVIDIA. We recently contributed support for the NVIDIA HPC SDK (#19294). Please go ahead and ping me if there are any NV software issues. |
This sounds fantastic, yes, we would love to add your GitHub handles as co-maintainers, e.g. to the `maintainers` list of the `nvhpc` recipe. We currently ship a package called `cuda`. Spack has a pretty on-point Python DSL in its `package.py` recipes. Another neat little CUDA thing that we do is that we provide a mixin class, `CudaPackage`, for packages that (optionally) depend on CUDA. |
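As a sketch of the mixin mentioned above, this is the general shape of a Spack recipe that opts into `CudaPackage` (the package name, URL, and build options here are illustrative, and the file only has meaning inside a Spack repository, so take it as a sketch rather than a working recipe):

```python
# Illustrative Spack recipe, not a real package. The CudaPackage mixin
# contributes a +cuda variant, a multi-valued cuda_arch variant, and a
# conditional depends_on("cuda") so the recipe does not declare them itself.
from spack.package import CMakePackage, CudaPackage, depends_on, version


class MySolver(CMakePackage, CudaPackage):
    """Hypothetical solver that can optionally be built against CUDA."""

    homepage = "https://example.com/my-solver"
    url = "https://example.com/my-solver-1.0.tar.gz"

    version("1.0", sha256="...")

    depends_on("mpi")

    def cmake_args(self):
        args = []
        if self.spec.satisfies("+cuda"):
            # cuda_arch values come from the mixin's variant
            archs = self.spec.variants["cuda_arch"].value
            args.append("-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(archs))
        return args
```

The point of the mixin is that every CUDA-consuming recipe gets a consistent `+cuda` / `cuda_arch` interface without re-declaring it.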
@brycelelbach Could you mention an example? I also have a few other questions:
|
NCCL, cuTENSOR.
Typically, yes, there's a way to obtain just the library from our website or otherwise.
Uh, can you elaborate on what you want? The documentation for both packages should list the contents.
The overlapping libraries are provided by the NVHPC SDK, and may not be the same version that was packaged with the associated CTK versions. |
I think the Spack `nccl` package is the relevant one here. Does the NVHPC SDK provide a distinct version of NCCL, or is it a compiled instance of an archive available at https://github.com/NVIDIA/nccl/archive/?
For the NVHPC SDK I see https://docs.nvidia.com/hpc-sdk/index.html linked at https://developer.nvidia.com/hpc-sdk, which leads me to https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html. That table includes mapped versions of e.g. NCCL for each NVHPC SDK version. Likewise, for a CTK release I see https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html. Based on #19365 (comment), it sounds like any CTK library also mentioned in the NVHPC SDK release is overridden by the NVHPC SDK.
|
Btw, I'm currently listed as a maintainer for our `nvhpc` package.
|
Can those more familiar with the Spack internals please comment on @scheibelp's alternative proposal above? It is not clear to me how much effort it would require to architect and implement? Otherwise, the consensus pretty clearly is option 2. |
Based on your comment and looking at https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html, my impression is that
The first suggestion would be easy (IMO): you would just add logic for locating each library inside of the `nvhpc` install prefix.
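A minimal sketch of what such locating logic might look like, in plain Python rather than Spack's actual `find_libraries` helper (the subdirectory names are assumptions about the SDK layout, not verified against a real install):

```python
# Sketch: search a few assumed SDK subdirectories for a bundled shared
# library. Spack recipes would do something similar in a `libs` property.
from pathlib import Path


def find_bundled_libs(prefix, libname,
                      subdirs=("lib", "lib64", "math_libs/lib64")):
    """Return sorted paths to lib<libname>.so* under known subdirs of prefix."""
    found = []
    for sub in subdirs:
        root = Path(prefix) / sub
        if root.is_dir():
            found.extend(root.glob(f"lib{libname}.so*"))
    return sorted(found)
```

The awkward part the thread keeps circling back to is exactly the `subdirs` tuple: the monolithic CTK and the NVHPC SDK scatter the same libraries across different directory layouts.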
Commenting to get emails on this. |
It would be good if GitHub provided a "subscribe to discussion" button. |
I've just created PR #29155 to specifically try to add |
Revisiting #19294 (comment), the concretizer now correctly picks up CUDA:
Not sure if it was due to the switch to the new (clingo-based) concretizer.
@ax3l / @scheibelp , can we move on here?
See #29782 for an implementation. Peter's alternative proposal of having sdk-type of installations where
The build system has nothing to do with concretization |
I would like to check this internally first. |
Is it possible to build with an existing or Spack-installed CUDA? I'm not able to build a package that needs CUDA using an NVIDIA base container. I tried letting Spack install the whole thing with this environment:

```yaml
spack:
  specs:
  - openmpi@4.1.4 fabrics=ofi +legacylaunchers +cuda cuda_arch=70
  - libfabric@1.19.0 fabrics=efa,tcp,udp,sockets,verbs,shm,mrail,rxd,rxm
  - flux-sched
  - flux-core
  - pmix@4.2.2
  - flux-pmix@0.4.0
  - amg2023 +mpi +cuda cuda_arch=70
```

but I get the message:

```
# spack install --reuse --fail-fast
==> Error: Invalid environment configuration detected: error parsing YAML: near /opt/spack-environment/spack.yaml, 0, 0: expected <block end>, but found '<block mapping start>'
```

And the error when I don't set anything:

```
c-11.4.0/openmpi-4.1.4-btvkmo2ljfbtxjdjjax2zu2zoklloz5j/include -c general.c -o general.obj
212 nvcc fatal : Unsupported gpu architecture 'compute_none'
>> 213 make[1]: *** [../config/Makefile.config:66: device_utils.obj] Error 1
214 nvcc fatal : Unsupported gpu architecture 'compute_none'
215 make[1]: *** Waiting for unfinished jobs....
>> 216 make[1]: *** [../config/Makefile.config:66: general.obj] Error 1
217 make[1]: Leaving directory '/tmp/root/spack-stage/spack-stage-hypre-2.29.0-jnuklmbah7zuqbwbdxbg45wngzighkxf/spack-src/src/utilities'
>> 218 make: *** [Makefile:86: all] Error 1
```

I am not great with Spack, hopefully just doing something dumb. :LP |
@vsoch Generally that should be possible, this issue is about making it possible for Spack to use the NVHPC SDK as a provider of CUDA, so I don't see how it has anything to do with your problem other than the very thin connection that the NVHPC SDK also comes with its own OpenMPI. |
What should the spack environment file look like so it works? I really just want any solution that will work - I want compute_70 (and not compute_none). |
@vsoch I have no idea why the given file does not work. But I'm pretty sure it's completely off-topic for this thread. How about filing a new issue or asking the devs on Slack? |
The main reason why this hasn't been able to progress at all is vendor packaging. Some of the libraries that come with the monolithic CUDA tarball are scattered in different subdirectories in the NVHPC tarball, notably the math (cuBLAS, etc.) libraries and the comms (MPI, NCCL, etc.) libraries.
|
I figured it out - there was an underlying dependency that also needed to be provided in specs with cuda_arch=70, otherwise it looks like it defaulted to none. |
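For readers hitting the same thing, a sketch of the workaround described above: explicitly add the CUDA-using dependency to the environment's specs with its own `cuda_arch`. Package names are taken from the thread (`hypre` is the package that failed in the build log); the exact spec used isn't shown in the thread, so this is an assumed reconstruction:

```yaml
spack:
  specs:
  - amg2023 +mpi +cuda cuda_arch=70
  # the dependency that was concretizing with cuda_arch=none;
  # listing it explicitly pins the right value
  - hypre +cuda cuda_arch=70
```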
I wonder if `cuda_arch==70` will propagate down. Or perhaps that only works for "normal" variants that are explicitly declared in the recipes, instead of in the build system?
|
It didn't seem to - I did try that variant with the double equals. |
The `nvhpc` package provides CUDA, but there is currently no way to use it as a `cuda` provider. Continues discussion started in #19294 (comment).
Description
The NVIDIA HPC SDK is a comprehensive set of compilers, libraries, and tools. The `nvhpc` package currently exposes the compilers, the CPU math libraries (`+blas`, `+lapack`), and MPI (`+mpi`). While the HPC SDK includes CUDA and the CUDA math libraries, they are not currently exposed (there is no `+cuda`). The included CUDA may be used with other compilers and is not limited to the NV compilers.
CUDA is currently provided by the `cuda` package, and a virtual package cannot exist with the same name as a real package.
Potential solutions:
1. Introduce a new virtual package, e.g. `cuda-virtual` (packages would have to change their `depends_on` declarations to indicate that any provider of `cuda-virtual` is acceptable).
2. Rename the existing `cuda` package, for instance to `cuda-toolkit`, and have it provide a `cuda` virtual. The `nvhpc` package could then also provide `cuda`.
3. Have packages `depends_on('nvhpc')` directly to use the CUDA bundled with the HPC SDK.
The same issue also applies to `nccl`: the HPC SDK includes NCCL, but it is already provided by the `nccl` package.
cc @scheibelp
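A toy model of solution 2, in plain Python rather than Spack internals, to illustrate why a virtual `cuda` with multiple providers resolves the naming conflict: dependents ask for the virtual name, and the resolver is free to pick either concrete package. The function and table below are invented for illustration only:

```python
# Toy provider resolution, not Spack's actual concretizer. "cuda" is a
# virtual name; both cuda-toolkit (the renamed real package) and nvhpc
# can satisfy it.
PROVIDERS = {"cuda": ["cuda-toolkit", "nvhpc"]}


def resolve(dep, preferred=None):
    """Return a concrete package satisfying dep, honoring a preference."""
    candidates = PROVIDERS.get(dep)
    if candidates is None:
        return dep  # already a concrete package
    if preferred in candidates:
        return preferred
    return candidates[0]  # fall back to the default provider


print(resolve("cuda"))                     # -> cuda-toolkit
print(resolve("cuda", preferred="nvhpc"))  # -> nvhpc
print(resolve("openmpi"))                  # -> openmpi (already concrete)
```

The key property is that existing `depends_on('cuda')` declarations keep working unchanged, which is why the thread favors this option over introducing a new `cuda-virtual` name.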