
[feature request] Support Intel GPU #1094

Closed
houyushan opened this issue Oct 28, 2022 · 23 comments
Labels
enhancement (New feature or request)

Comments

@houyushan

Hello, a question about using Intel GPUs (XPU):

My product needs to use singularity/apptainer with the Intel XPU (Intel's latest GPU for AI training). Is there an option like --nv to support it at present, or will future versions provide a similar option? If so, when?

houyushan added the enhancement label on Oct 28, 2022
@dtrudg
Member

dtrudg commented Oct 31, 2022

Hi @houyushan - we'll have to look into this a little more. From looking at the relevant Intel container images, it appears that the intention is to ship the libraries in the container image and make the /dev/dri devices available in the container.

This means that you should be able to run without any additional options, unless you are using --contain / --containall, in which case you will have to add -B /dev/dri to ensure the devices are available.
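A minimal sketch of the two cases, assuming a hypothetical image file oneapi.sif built from one of the Intel container images:

# default native mode: /dev is mounted from the host, so the DRI nodes are already visible
singularity exec oneapi.sif ls -l /dev/dri

# with --contain / --containall, bind the device nodes explicitly
singularity exec --containall -B /dev/dri oneapi.sif ls -l /dev/dri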

@houyushan
Author

Okay, thank you.

I will also continue to research and test.

@elezar
Contributor

elezar commented Feb 20, 2023

Just a note: assuming CDI support in OCI mode, a CDI spec generated for the Intel devices would allow them to be injected.

See #813

@dtrudg
Member

dtrudg commented Feb 20, 2023

@elezar - yep, thanks. This was a hope at the back of my mind :-)

@pzehner

pzehner commented Mar 12, 2024

Hello, any news on this issue?

Using SingularityCE version 4.0.2 with an Intel GPU Max 1550, I don't have access to the GPU, even though the card is listed in /dev/dri.

@dtrudg
Member

dtrudg commented Mar 13, 2024

As mentioned in a comment above, Singularity's OCI mode supports CDI (Container Device Interface) configuration for access to GPUs, which would include Intel GPUs if a CDI configuration is available.

With regard to adding a direct Intel GPU flag for the default native (non-OCI) mode: generally, adding this kind of hardware-specific support to SingularityCE is dependent on either:

  1. The vendor, or a user, contributing the functionality as a pull request that they will also be able to assist with maintaining.
  2. The vendor, or a 3rd party, providing us (as a project) with access to the relevant GPU hardware on an ongoing basis so that we can develop and maintain the requested functionality.

NVIDIA GPU support comes under (2), as we have had significant contributions from NVIDIA, and it is also trivial to access Tesla GPUs at reasonable cost via public cloud providers.

What we wish to avoid, when adding Intel GPU support, is the situation we find ourselves in with AMD GPUs / ROCm. The lack of access to data center AMD GPUs (capable of running the latest ROCm) in the cloud, or by other means, makes maintaining ROCm support difficult and costly.

If you are able to, we would suggest that you indicate to Intel that support integrated into SingularityCE is important to you.

Without access to hardware, the minimum information required for us to add an experimental flag, without commitment that it will be well maintained, would be:

  • A comprehensive list of /dev entries required to use GPU functionality.
  • A comprehensive list of any libraries and binaries that need to be present in a container to use GPU functionality.
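For reference, a rough sketch of how that information might be gathered on an Intel GPU host (the library names in the grep pattern are illustrative, not a definitive list):

# device nodes used by the GPU stack
ls -l /dev/dri /dev/dri/by-path

# user-space libraries from the Intel compute runtime / Level Zero / OpenCL stack
ldconfig -p | grep -Ei 'libze|libOpenCL|igdrcl|igc'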

@elezar
Contributor

elezar commented Mar 14, 2024

I would strongly recommend following the CDI route here instead of relying on vendor-specific logic in Singularity. If effort is to be spent, I would recommend adding (experimental) CDI support to the native mode of Singularity (see #1395) if support is required there.

@kad do you have any visibility on the generation of CDI specification for Intel devices?

@kad

kad commented Mar 14, 2024

I don't, but @byako and @tkatila would be good candidates to chime in here.

@pzehner

pzehner commented Mar 19, 2024

I checked the OCI mode and CDI, but I cannot access the GPU out of the box. I assume I should point to a CDI spec with --device. The documentation states that the usual lookup directories are /etc/cdi and /var/run/cdi, but neither exists on my system. I tried guessing intel.com/gpu=all, but that was obviously incorrect.

It would be nice to have more documentation about this.

@byako

byako commented Mar 19, 2024

At the moment, CDI specs are generated automatically only by the kubelet plugin of the DRA resource driver.

If you don't need dynamic creation of the specs, it's possible to create them manually; they are quite simple.

There is a chance, however, that they will need to be fixed after a reboot if you have multiple different GPUs, or if you have an integrated GPU that is also enabled in DRM, because DRM device indexes are not persistent across reboots. For instance, /dev/dri/card0 can become card1, and card1 might become card0.
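The current mapping can always be re-checked from the by-path symlinks, whose names encode the (stable) PCI address:

# e.g. pci-0000:03:00.0-card -> ../card1
ls -al /dev/dri/by-path/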

@byako

byako commented Mar 19, 2024

Here's an example of CDI spec:
sudo cat /etc/cdi/intel.com-gpu.yaml

cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dri/card1
      type: c
    - path: /dev/dri/renderD129
      type: c
  name: 0000:03:00.0-0x56a0
- containerEdits:
    deviceNodes:
    - path: /dev/dri/card0
      type: c
    - path: /dev/dri/renderD128
      type: c
  name: 0000:00:02.0-0x4680
kind: intel.com/gpu

The name field can be somewhat arbitrary, albeit with spelling restrictions. If you create the /etc/cdi folder and paste the contents of the above snippet into a file inside that folder, it should work, given that your runtime supports CDI.

sudo mkdir /etc/cdi
sudo vim /etc/cdi/mygpus.yaml

then run with --device intel.com/gpu=0000:03:00.0-0x56a0
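A full invocation would then look something like this (a sketch, with oneapi.sif standing in for whatever OCI image or docker:// URI you actually run):

singularity exec --oci --device intel.com/gpu=0000:03:00.0-0x56a0 oneapi.sif sycl-ls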

@pzehner

pzehner commented Mar 19, 2024

I see. Is there a way to get these configuration files without writing them by hand? When I googled "Intel GPU Container Device Interface", I couldn't find anything like that. How is the user supposed to know this?

@byako

byako commented Mar 19, 2024

> Hello, any news on this issue?
>
> Using SingularityCE version 4.0.2 with an Intel GPU Max 1550, I don't have access to the GPU, even though the card is listed in /dev/dri.

Could you please add more details about this case: what command line and options did you use?

@pzehner

pzehner commented Mar 19, 2024

My bad, I missed one of your answers. Hmm, I'm not sure I understand this line:

> At the moment, CDI specs are generated automatically only by the kubelet plugin of the DRA resource driver.

Should I install Kubernetes as well? Noob question here.

> Could you please add more details about this case: what command line and options did you use?

In my case, I have a machine with four Intel Data Center GPU Max 1550s, and I want to run code within an Intel oneAPI image. For the demonstration, I just use sycl-ls to list the SYCL-compatible devices (note that I'm not using the manual CDI file yet):

$ singularity run --oci docker://intel/oneapi-basekit:2024.0.1-devel-ubuntu20.04 sycl-ls   
Getting image source signatures
Copying blob 521f275cc58b done   | 
Copying blob 565c40052dc3 done   | 
Copying blob afcec6bc5983 done   | 
Copying blob 93b1720de081 done   | 
Copying blob bcd9c7c8e2dd done   | 
Copying blob 3c86603e9f04 done   | 
Copying blob 45a1c23aa4e7 done   | 
Copying config ba41f6c638 done   | 
Writing manifest to image destination
INFO:    Converting OCI image to OCI-SIF format
INFO:    Squashing image to single layer
INFO:    Writing OCI-SIF image
INFO:    Cleaning up.
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9460 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]

As you can see, only the CPU is detected. This is what I should see:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9460 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]

@byako

byako commented Mar 19, 2024

There is no need to install Kubernetes; I meant that automated generation of the CDI specs is currently only available in K8s.

Once you have created the /etc/cdi dir and saved the YAML file into it, the devices described in that YAML can be used by Singularity.

You have to use the --device parameter in the command as I mentioned above; that tells Singularity to use the device that it finds in the CDI spec. See https://docs.sylabs.io/guides/latest/user-guide/oci_runtime.html#sec-cdi.

The YAML file I quoted above is just an example. Check the DRM index of the GPU, for instance with ls -al /dev/dri/by-path/, and see which /dev/dri/cardX is linked to the Max 1550. You can see which PCI device the Max 1550 is by running lspci | grep Display. When you know which /dev/dri/cardX is the Max 1550, use that in /etc/cdi/mygpus.yaml. The renderD node is not needed for the Max 1550, only cardX.

We'll work on finding a way to generate CDI specs, or at least on documenting the process.

@pzehner

pzehner commented Mar 20, 2024

OK, I see. I think it would be nice to have a better way to generate these CDI specs. The logic from the Kubernetes plugin could be extracted.

If I'm not wrong, they can be deduced entirely from the structure in /dev/dri, right?
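For illustration, a rough, untested sketch of what such a generator could look like, using only the /dev/dri/by-path symlinks (the output path and device naming here are my own assumptions, not an existing tool):

#!/bin/sh
# Sketch: derive a minimal CDI spec from the PCI-addressed DRM symlinks.
# Render nodes are omitted, as suggested above for the Max 1550.
sudo mkdir -p /etc/cdi
{
  echo "cdiVersion: 0.5.0"
  echo "kind: intel.com/gpu"
  echo "containerEdits: {}"
  echo "devices:"
  for link in /dev/dri/by-path/pci-*-card; do
    [ -e "$link" ] || continue
    pci=$(basename "$link")          # e.g. pci-0000:03:00.0-card
    pci=${pci#pci-}; pci=${pci%-card}
    card=$(readlink -f "$link")      # e.g. /dev/dri/card1
    echo "- name: $pci"
    echo "  containerEdits:"
    echo "    deviceNodes:"
    echo "    - path: $card"
    echo "      type: c"
  done
} | sudo tee /etc/cdi/intel.com-gpu.yaml > /dev/null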

@pzehner

pzehner commented Mar 21, 2024

So, I tried with the example CDI spec file that I adapted for my hardware, but the GPU is still not visible from within the container:

$ singularity run --oci --device intel.com/gpu=0000:29:00.0 docker://intel/oneapi-basekit:2024.0.1-devel-ubuntu20.04 sycl-ls 
INFO:    Using cached OCI-SIF image
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9460 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]

Where the CDI spec looks like:

cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dri/card1
      type: c
    - path: /dev/dri/renderD128
      type: c
  name: 0000:29:00.0
...
kind: intel.com/gpu
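One way to check whether the CDI edits are applied at all, independently of sycl-ls, would be to list the device nodes inside the container (a suggestion, not something I have verified yet):

singularity exec --oci --device intel.com/gpu=0000:29:00.0 docker://intel/oneapi-basekit:2024.0.1-devel-ubuntu20.04 ls -l /dev/dri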

@tkatila

tkatila commented Mar 21, 2024

@pzehner can you check whether /dev/dri/ has card and renderD devices? If they are there, it might be an access-rights issue with the actual devices.

@pzehner

pzehner commented Mar 21, 2024

Yes, I have the correct devices listed in /dev/dri, and I can access them outside of the container.

@tkatila

tkatila commented Mar 21, 2024

Roger. I downloaded the same image and tried it within Docker; sycl-ls didn't list GPUs for me either. I'll try to figure out what is going on with it.

@tkatila

tkatila commented Mar 21, 2024

I don't know exactly why sycl-ls doesn't detect the GPUs. What I did notice is that the 2024.0.1-devel-ubuntu22.04 version does detect them. Comparing the images didn't reveal anything obvious, nor could I make the 20.04 variant functional by installing packages.

I'd use the 22.04 variant as a workaround, if that suits you.
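For example, the same run as above against the 22.04 tag (assuming the CDI spec you already created):

singularity run --oci --device intel.com/gpu=0000:29:00.0 docker://intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04 sycl-ls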

@pzehner

pzehner commented Apr 4, 2024

I think using an up-to-date image is acceptable.

@dtrudg
Member

dtrudg commented Jun 14, 2024

Closing this issue. CDI support is available in --oci mode, and appears to work with the correct image.

Support for Intel GPUs in native mode would come via #1395; however, this is not firmly on the development roadmap at this time.

dtrudg closed this as completed on Jun 14, 2024