
Alternate CUDA provider #19365

Open
samcmill opened this issue Oct 19, 2020 · 33 comments · May be fixed by #30748

Comments

@samcmill
Contributor

The nvhpc package ships CUDA, but there is currently no way to use it as a cuda provider.

Continues discussion started in #19294 (comment).

Description

The NVIDIA HPC SDK is a comprehensive set of compilers, libraries, and tools. The nvhpc package currently exposes the compilers, CPU math libraries (+blas, +lapack), and MPI (+mpi). While the HPC SDK includes CUDA and CUDA math libraries, they are not currently exposed (no +cuda). The included CUDA may be used with other compilers and is not limited to the NV compilers.

CUDA is currently provided by the cuda package. A virtual package cannot exist with the same name as a real package.

Potential solutions:

  1. Create a new virtual package name like cuda-virtual (packages would then have to change their depends_on declarations to indicate that any provider of cuda-virtual is acceptable).
  2. Rename the cuda package, for instance to cuda-toolkit, and have it provide cuda. The nvhpc package could also provide cuda (see the sketch after this list).
  3. Have packages explicitly depends_on('nvhpc') to use the CUDA bundled with the HPC SDK.
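
For illustration, a minimal sketch of what option 2 could look like in the respective package.py files (class names, versions, and version ranges below are placeholders, not concrete proposals; Spack's package DSL is assumed via from spack import *):

# Illustrative sketch of option 2 -- names and versions are placeholders.
# var/spack/repos/builtin/packages/cuda-toolkit/package.py (today's cuda package, renamed)
class CudaToolkit(Package):
    version("11.1.0")
    provides("cuda@11.1", when="@11.1.0")

# var/spack/repos/builtin/packages/nvhpc/package.py
class Nvhpc(Package):
    version("20.9")
    provides("cuda@10.2:11.1", when="@20.9")

# Downstream packages keep depending on the (now virtual) cuda.
class SomeApp(CMakePackage):
    variant("cuda", default=False, description="Build with CUDA")
    depends_on("cuda", when="+cuda")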

The same issue also applies to nccl. The HPC SDK includes NCCL, but it is already provided by the nccl package.

cc @scheibelp

@adamjstewart
Member

I vote for solution 2: rename cuda to cuda-toolkit and have both packages provide cuda. For nccl, maybe something like nvidia-nccl?

Also pinging our official CUDA maintainers: @ax3l @Rombur

@Rombur
Contributor

Rombur commented Oct 19, 2020

I also think solution 2 is the way to go.

@ax3l
Member

ax3l commented Oct 21, 2020

Agreed.

@ax3l
Member

ax3l commented Oct 21, 2020

I reached out to one of the NVHPC architects, @brycelelbach, and got the following detailed info on nvhpc and the cuda toolkit:

  • The nvhpc SDK supports multiple versions of the CUDA toolkit. Currently, downloads provide either a bundle with the newest CUDA version or another bundle with the newest plus two previous CUDA versions, i.e. provides("cuda@10.2:11.1").
  • nvhpc packages existing releases of cuda; it will never ship a special version of its own.
  • Solution 2 seems right.

@scheibelp
Member

This has come up a couple times very recently (SDKs including implementations of packages) and I am considering an alternative approach:

  • The SDK package (nvhpc) can come with a method to define externals that reference its installation prefix
  • A new concretizer directive could be included (something like supplies) that tells the concretizer that the SDK comes with additional packages (which would fit well with the concretize-together option in environments)

The goal of this would be to avoid the work of converting packages to virtuals when an SDK provides them (so I agree that once packages are converted to virtuals Spack should be able to resolve these sorts of issues, but I think it would be ideal to avoid the need for that conversion).

This is based on the assumption that nvhpc is not providing a distinct CUDA implementation (in the sense that openmpi provides a distinct implementation of MPI) but instead is downloading the same binaries that you could get when installing the CUDA package directly. I should say that if an SDK does provide a distinct version, making it a provider could be an appropriate choice.
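
For concreteness, here is a purely hypothetical sketch of what that could look like in nvhpc's package.py; neither a supplies directive nor this externals hook exists in Spack today, and the subdirectory layout shown is only indicative:

# Hypothetical sketch only -- 'supplies' and 'sdk_externals' are not real Spack features.
class Nvhpc(Package):
    # Tell the concretizer that installing this SDK also makes these
    # packages available, without turning them into virtuals:
    supplies("cuda@11.0")
    supplies("nccl@2.7")

    def sdk_externals(self):
        # Map each supplied package to a location inside the SDK install prefix
        # (directory names are indicative only).
        sdk_root = join_path(self.prefix, "Linux_x86_64", str(self.version))
        return {
            "cuda": join_path(sdk_root, "cuda"),
            "nccl": join_path(sdk_root, "comm_libs", "nccl"),
        }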

I'm curious what you all think of that.

@brycelelbach

brycelelbach commented Oct 21, 2020

My product manager pointed out another caveat I didn't fully cover: while NVHPC packages specific existing versions of the CUDA toolkit, the CUDA libraries (CUBLAS, etc.) are independently semantically versioned, and the versions in an NVHPC release may differ from the versions in the corresponding CUDA toolkit release. There are also libraries in the NVHPC SDK that are not in the CTK.

So, to summarize:

  • There are CUDA toolkit releases, which contain NVCC, a CUDA runtime, a subset of the CUDA libraries, and an NVIDIA driver.
    • NVCC and the CUDA runtime use the same version as the CUDA toolkit.
    • The NVIDIA driver has a distinct version from the CUDA toolkit (rNNN); a minimum version is required for each CUDA toolkit.
    • The CUDA libraries have an independent semantic version scheme from the CUDA toolkit.
  • There are NVHPC SDK releases, which contain multiple CUDA toolkits, NVC++, NVFORTRAN, and all the CUDA libraries.
    • There are multiple versions of NVCC and the CUDA toolkit in an NVHPC release. These versions are always some existing release of NVCC and the CUDA toolkit; we don't introduce new versions of NVCC and the CUDA toolkit in an NVHPC release.
    • The CUDA libraries in an NVHPC release may not be the same versions as those included with the CUDA toolkit; they're independently versioned. New versions of CUDA libraries may be introduced by NVHPC SDK releases.
    • NVC++, NVFORTRAN, and some CUDA libraries are only released in the NVHPC SDK.

So, my suggestions:

  • The CUDA toolkit package should provide an NVCC package (with CUDA toolkit versioning) and a CUDA runtime package (with CUDA toolkit versioning).
  • The CUDA toolkit package should provide a package for each CUDA library (with independent semantic versioning).
  • The NVHPC SDK should provide multiple NVCC packages (with CUDA toolkit versioning) and CUDA runtime packages (with CUDA toolkit versioning).
  • The NVHPC SDK should provide a package for each CUDA library (with independent semantic versioning).
  • The NVHPC SDK should provide NVC++ and NVFORTRAN packages (with NVHPC SDK versioning).

Given all the above, I'd still encourage something like option 2.
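
To make the versioning split concrete, here is a hedged sketch of how independently versioned library virtuals could be declared; every version number below is made up purely for illustration (see also the #19269 discussion about provide("cublas")):

# Illustrative only -- virtual names and all versions are placeholders.
class CudaToolkit(Package):
    version("11.0.2")
    provides("cudart@11.0", when="@11.0.2")  # runtime tracks the CTK version
    provides("cublas@11.1", when="@11.0.2")  # libraries carry their own semver

class Nvhpc(Package):
    version("20.9")
    provides("cudart@11.0", when="@20.9")    # same cudart as the bundled CTK
    provides("cublas@11.2", when="@20.9")    # may be newer than the CTK's copy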

@brycelelbach

Now that I've given this some more thought, I think it would be useful for me to understand how y'all package all the NVIDIA-related stuff, e.g. what packages you have today and what version schemes are associated with them.

We, NVIDIA, could potentially take on some of the work here in defining Spack packages, if that would be helpful.

@adamjstewart
Member

We would love to get contributions directly from NVIDIA. Another thing you can do is add the GitHub handles of any NVIDIA employees who would like to be listed as official maintainers for the build recipe. This gives us someone to ping when we review PRs or get reports of build issues.

@brycelelbach

Ah, it seems there was some confusion. @samcmill is my coworker at NVIDIA and the relevant person. I think neither I nor Axel realized he filed this bug!

@samcmill
Contributor Author

In case folks don't realize it, I am employed by NVIDIA. We recently contributed support for the NVIDIA HPC SDK (#19294). Please go ahead and ping me if there are any NV software issues.

@ax3l
Member

ax3l commented Oct 21, 2020

This sounds fantastic! Yes, we would love to add your GitHub handles as co-maintainers, e.g. to the cuda package, so you receive pings on those packages.

We currently ship a package called cuda that provides the CTK (spack edit cuda). We also have thrust and cub packages that could be used to install a development version and/or an older/newer version.

Spack has a pretty on-point Python DSL in its package.py files, which pop up when you run spack edit <package>. The class name inside such a file maps to the package name, e.g. class Cub(Package) inside var/spack/repos/builtin/packages/cub/package.py is cub, and class NlohmannJson(CMakePackage) inside var/spack/repos/builtin/packages/nlohmann-json/package.py is nlohmann-json.

Packages can also provide("virtual-name") other packages, e.g. openmpi and mpich both provide mpi at a certain version range in their package.py. One could potentially make cuda, thrust, cub, etc. virtual packages that are provided by various packages like cuda-toolkit and nvhpc or thrust-oss. @adamjstewart et al. can definitely brief you further; let me already point you to the tutorial and the packaging guide. There is also an open discussion in #19269 about whether we should provide("cublas") et al. from the cuda (CTK) package.

Another neat little CUDA thing we do is provide a mixin class for packages that (optionally) depend on CUDA. Above, cub is a Package based on simple install logic (it downloads and runs the install phase, and optionally accepts patching, etc.), whereas thrust is a CMakePackage that additionally knows the cmake-configure, build, and install phases, a method for CMake arguments, and more. Everything that is not overridden as a method is taken from defaults (e.g. compare spack edit thrust with spack edit adios2).

The CudaPackage class, defined in lib/spack/spack/build_systems/cuda.py, maintains host-compiler conflicts and provides unified package variants (options) to select the GPU architecture. It is currently used by ~150 packages, which simply derive from it in their package.py, and it reduces duplication. PR #19038 adds analogous support.
As an example of how this looks, see spack edit paraview and spack info paraview: the latter lists cuda and cuda_arch among paraview's build variants, which are in turn inherited from CudaPackage.
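
As a hedged illustration of that mixin pattern (the package name and CMake flag below are made up; the cuda and cuda_arch variants really do come from CudaPackage):

# Minimal sketch of a package opting into the CudaPackage mixin.
class MySolver(CMakePackage, CudaPackage):
    """Hypothetical package, used only to illustrate the mixin."""

    version("1.0.0")

    def cmake_args(self):
        args = []
        if self.spec.satisfies("+cuda"):
            # cuda_arch is a multi-valued variant provided by CudaPackage
            archs = self.spec.variants["cuda_arch"].value
            args.append("-DMY_CUDA_ARCHS=" + ";".join(archs))  # flag name is made up
        return args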

@scheibelp
Member

There are also libraries in the NVHPC SDK that are not in the CTK.

@brycelelbach Could you mention an example?

I also have a few other questions:

  • Is there a means to obtain the CUDA libraries other than via the CUDA toolkit or by the NVHPC SDK?
  • Would you have a means to identify the different libraries provided by the CUDA toolkit and the NVHPC SDK (as well as their versions)?
  • It is mentioned that NVHPC SDK provides all the libraries that the CTK does: are the libraries that overlap provided by the CTK instance that is bundled with the NVHPC SDK, or are there distinct libraries provided (in which case I assume the CTK is useful for the driver/runtime/nvcc)?

@brycelelbach

brycelelbach commented Oct 22, 2020

@brycelelbach Could you mention an example?

NCCL, cuTENSOR.

Is there a means to obtain the CUDA libraries other than via the CUDA toolkit or by the NVHPC SDK?

Typically, yes, there's a way to obtain just the library from our website or otherwise.

Would you have a means to identify the different libraries provided by the CUDA toolkit and the NVHPC SDK (as well as their versions)?

Uh, can you elaborate on what you want? The documentation for both packages should list the contents.

It is mentioned that NVHPC SDK provides all the libraries that the CTK does: are the libraries that overlap provided by the CTK instance that is bundled with the NVHPC SDK, or are there distinct libraries provided (in which case I assume the CTK is useful for the driver/runtime/nvcc)?

The overlapping libraries are provided by the NVHPC SDK, and may not be the same version that was packaged with the associated CTK versions.

@scheibelp
Member

Could you mention an example?

NCCL, cuTENSOR.

Is there a means to obtain the CUDA libraries other than via the CUDA toolkit or by the NVHPC SDK?

Typically, yes, there's a way to obtain just the library from our website or otherwise.

I think the Spack nccl package is an example of this: it downloads from (e.g.) https://github.com/NVIDIA/nccl/archive/v2.7.3-1.tar.gz

Does the NVHPC SDK provide a distinct version of NCCL, or is it a compiled instance of an archive available at https://github.com/NVIDIA/nccl/archive/?

Would you have a means to identify the different libraries provided by the CUDA toolkit and the NVHPC SDK (as well as their versions)?

can you elaborate on what you want? The documentation for both packages should list the contents.

For NVHPC SDK I see https://docs.nvidia.com/hpc-sdk/index.html at https://developer.nvidia.com/hpc-sdk which leads me to https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html. That table includes mapped versions of e.g. NCCL for the NVHPC SDK version 20.9.

Likewise for a CTK release I see https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Based on #19365 (comment), it sounds like any CTK library also mentioned in the NVHPC SDK release is overridden by NVHPC SDK.

cudart is an example of a library provided only in CTK. Unlike NCCL, I do not see a separate download option, so is it the case that the cudart library is only available via the CTK?

@adamjstewart
Member

Btw, I'm currently listed as a maintainer for our cudnn and nccl packages, but I honestly don't know much about them. The only reason I've been trying to keep them up-to-date is because I'm a DL researcher and I use PyTorch pretty heavily. I would love for NVIDIA to officially take them over and add any features others might be interested in.

@costat

costat commented Oct 22, 2020

cudart is an example of a library provided only in CTK. Unlike NCCL, I do not see a separate download option, so is it the case that the cudart library is only available via the CTK?

cudart is also provided in the HPC SDK. Since this is versioned directly in lock step with CUDA (e.g. libcudart.so.11.0 is included for CUDA 11.0) we did not list it out separately in the docs. If that is confusing, we can enhance the docs.

@samcmill
Contributor Author

This has come up a couple times very recently (SDKs including implementations of packages) and I am considering an alternative approach:

  • The SDK package (nvhpc) can come with a method to define externals that reference its installation prefix
  • A new concretizer directive could be included (something like supplies) that tells the concretizer that the SDK comes with additional packages (which would fit well with the concretize-together option in environments)

The goal of this would be to avoid the work of converting packages to virtuals when an SDK provides them (so I agree that once packages are converted to virtuals that Spack should be able to resolve these sorts of issues, but I think it would be ideal to avoid the need for that conversion).

Can those more familiar with the Spack internals please comment on @scheibelp's alternative proposal above? It is not clear to me how much effort it would require to architect and implement.

Otherwise, the consensus pretty clearly is option 2.

@scheibelp
Member

cudart is an example of a library provided only in CTK. Unlike NCCL, I do not see a separate download option, so is it the case that the cudart library is only available via the CTK?

cudart is also provided in the HPC SDK. Since this is versioned directly in lock step with CUDA (e.g. libcudart.so.11.0 is included for CUDA 11.0) we did not list it out separately in the docs. If that is confusing, we can enhance the docs.

Based on your comment and looking at https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html, my impression is that cudart is supplied via the CTK that comes with the NVHPC SDK: I assume the labels CUDA 10.1 | CUDA 10.2 | CUDA 11.0 at the top of table 1 refer to CUDA toolkit releases. When I say "only available via the CTK" I also include the CTK supplied by the NVHPC SDK; another way to put it is that every cudart library provided by the NVHPC SDK appears in some release of the CTK - is that correct?

The SDK package (nvhpc) can come with a method to define externals that reference its installation prefix

A new concretizer directive could be included (something like supplies) that tells the concretizer that the SDK comes with additional packages (which would fit well with the concretize-together option in environments)

Can those more familiar with the Spack internals please comment on @scheibelp's alternative proposal above? It is not clear to me how much effort it would require to architect and implement.

The first suggestion would be easy (IMO): you would just add logic for locating each library inside of the nvhpc installation prefix. The second would require some work on my end to create the directives but also would not be much more difficult than adding provides declarations. The definite advantage of using provides here is that it integrates seamlessly into Spack's current concretizer. The only hangup is confusion/effort related to conversion of existing implementation packages (e.g. spack edit nccl) into virtuals.
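
As a rough sketch of that first suggestion (directory and library names are indicative only; find_libraries and join_path are existing Spack helpers):

# Rough sketch: locate bundled CUDA math libraries inside the nvhpc install prefix.
class Nvhpc(Package):
    @property
    def libs(self):
        math_root = join_path(self.prefix, "Linux_x86_64", str(self.version), "math_libs")
        return find_libraries(["libcublas", "libcufft"], root=math_root, recursive=True)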

@tgamblin
Member

Commenting to get emails on this.

@alalazo
Member

alalazo commented Nov 14, 2020

Commenting to get emails on this

It would be good if GitHub provided a "subscribe to discussion" button.

@adamjstewart
Member

It would be good if GitHub provided a "subscribe to discussion" button.

You mean this?
[Screenshot of GitHub's "Notifications: Subscribe" button]

@wyphan
Contributor

wyphan commented Feb 23, 2022

While the HPC SDK includes CUDA and CUDA math libraries, they are not currently exposed (no +cuda).

I've just created PR #29155 to specifically try to add a +cuda variant to nvhpc. Let's see how this goes...

@wyphan
Contributor

wyphan commented Feb 23, 2022

Revisiting #19294 (comment), the concretizer now correctly picks up CUDA:

$ spack spec magma cuda_arch=70 ^nvhpc
Input spec
--------------------------------
magma cuda_arch=70
    ^nvhpc

Concretized
--------------------------------
magma@2.6.1%gcc@12.0.1+cuda+fortran~ipo~rocm+shared build_type=RelWithDebInfo cuda_arch=70 arch=linux-ubuntu20.04-zen2
    ^cmake@3.22.2%gcc@12.0.1~doc+ncurses+openssl+ownlibs~qt build_type=Release arch=linux-ubuntu20.04-zen2
    ^nvhpc@22.2%gcc@12.0.1+blas+cuda+lapack~mpi install_type=single arch=linux-ubuntu20.04-zen2

Not sure if it was due to the switch to clingo in 0.17, or because I specifically added the CUDA_HOME environment variable to the nvhpc +cuda variant, which the MAGMA build system looks for.

@haampie
Member

haampie commented Mar 29, 2022

@ax3l / @scheibelp , can we move on here?

  • Get @wyphan's PR in that renames cuda -> cuda-toolkit and makes cuda virtual.
  • I think nvhpc can provide cuda unconditionally; it doesn't need this +cuda toggle, since the variant changes nothing about how the package is installed, and with packages of 16 GB+ it's very annoying if Spack doesn't pick up what you've already installed.
  • Have nvhpc provide a version list, e.g. provides('cuda@10.2.x,11.0.y,11.6.z') instead of a version range of CUDA toolkits.
  • [Potentially: Have two nvhpc packages, e.g. nvhpc with 3x CUDA toolkit, and nvhpc-slim with 1x latest CUDA toolkit. There's been some concern about moving from the 3x CUDA nvhpc -> 1x CUDA nvhpc by default, so I'm happy to stick to 3x CUDA by default.]

See #29782 for an implementation.


Peter's alternative proposal of having SDK-type installations, where spack install nvhpc effectively installs 1x nvhpc and 3x cuda as individual packages in Spack's database, was discussed yesterday, but it's complicated by the fact that you can't have a unique prefix for a CUDA provided by nvhpc, thanks to the unfortunate decision to split the directory structure into .../{cuda,math_libs}/<cuda version>/, which differs from standalone CUDA toolkit installs. Further, NVHPC can be used as an individual package, since it provides the CMake config file cmake/NVHPCConfig.cmake, which CMake projects may depend on.


Not sure if it was due to the switch to clingo in 0.17, or because I specifically added the CUDA_HOME environment variable to the nvhpc +cuda variant, which the MAGMA build system looks for.

The build system has nothing to do with concretization.

@fspiga
Contributor

fspiga commented Mar 29, 2022

  • I think nvhpc can provide cuda unconditionally

I would like to check this internally first.

@haampie haampie mentioned this issue Mar 29, 2022
@wyphan wyphan linked a pull request May 19, 2022 that will close this issue
@alalazo alalazo added this to To do in Spack v0.19.0 release via automation Jul 5, 2022
@tgamblin tgamblin added this to the v0.20.0 milestone Nov 7, 2022
@tgamblin tgamblin removed this from To do in Spack v0.19.0 release Nov 7, 2022
@alalazo alalazo removed this from the v0.20.0 milestone May 2, 2023
@alalazo alalazo added the revisit label May 2, 2023
@vsoch
Member

vsoch commented May 8, 2024

Is it possible to build with an existing CUDA or a Spack-installed cuda? I'm not able to build a package that needs cuda using an NVIDIA base container. I tried letting Spack install the whole thing, but I get compute_none for the CUDA arch. I've been trying to figure out how to set that value, but no luck so far. Here is what I'm trying:

spack:
  specs:
  - openmpi@4.1.4 fabrics=ofi +legacylaunchers +cuda cuda_arch=70
  - libfabric@1.19.0 fabrics=efa,tcp,udp,sockets,verbs,shm,mrail,rxd,rxm
  - flux-sched
  - flux-core
  - pmix@4.2.2
  - flux-pmix@0.4.0
  - amg2023 +mpi +cuda cuda_arch=70
...

and the message:

# spack install --reuse --fail-fast
==> Error: Invalid environment configuration detected: error parsing YAML: near /opt/spack-environment/spack.yaml, 0, 0: expected <block end>, but found '<block mapping start>'

And the error when I don't set anything:

            c-11.4.0/openmpi-4.1.4-btvkmo2ljfbtxjdjjax2zu2zoklloz5j/include  -c general.c -o general.obj
     212    nvcc fatal   : Unsupported gpu architecture 'compute_none'
  >> 213    make[1]: *** [../config/Makefile.config:66: device_utils.obj] Error 1
     214    nvcc fatal   : Unsupported gpu architecture 'compute_none'
     215    make[1]: *** Waiting for unfinished jobs....
  >> 216    make[1]: *** [../config/Makefile.config:66: general.obj] Error 1
     217    make[1]: Leaving directory '/tmp/root/spack-stage/spack-stage-hypre-2.29.0-jnuklmbah7zuqbwbdxbg45wngzighkxf/spack-src/src/utilities'
  >> 218    make: *** [Makefile:86: all] Error 1

I am not great with Spack; hopefully I'm just doing something dumb. :P

@pauleonix
Contributor

@vsoch Generally that should be possible. This issue is about making it possible for Spack to use the NVHPC SDK as a provider of CUDA, so I don't see how it relates to your problem, other than the very thin connection that the NVHPC SDK also comes with its own OpenMPI.

@vsoch
Member

vsoch commented May 8, 2024

What should the spack environment file look like so it works? I really just want any solution that will work - I want compute_70 (and not compute_none).

@pauleonix
Contributor

@vsoch I have no idea why the given file does not work. But I'm pretty sure it's completely off-topic for this thread. How about filing a new issue or asking the devs on Slack?

@wyphan
Contributor

wyphan commented May 9, 2024 via email

@vsoch
Member

vsoch commented May 9, 2024

@vsoch I have no idea why the given file does not work. But I'm pretty sure it's completely off-topic for this thread. How about filing a new issue or asking the devs on Slack?

I figured it out - there was an underlying dependency that also needed to be provided in specs with cuda_arch=70, otherwise it looks like it defaulted to none.

@wyphan
Contributor

wyphan commented May 9, 2024 via email

@vsoch
Member

vsoch commented May 9, 2024

It didn't seem to - I did try that variant with the double equals.
