GPU Usage with Kubernetes

This document provides notes on using GPUs with Ray on Kubernetes.

To use GPUs on Kubernetes, you need to both configure your Kubernetes setup and add values to your Ray cluster configuration.

For cloud-specific documentation on GPU usage, see the instructions for GKE, EKS, and AKS.

The Ray Docker Hub hosts CUDA-based images packaged with Ray for use in Kubernetes pods. For example, the image rayproject/ray-ml:nightly-gpu is well suited to running GPU-based ML workloads with the most recent nightly build of Ray. See the Ray Docker images documentation for further details on Ray images.

Using NVIDIA GPUs requires specifying the relevant resource limits in the container fields of your Kubernetes configurations. (If you specify only a GPU limit, Kubernetes sets the GPU request equal to that limit.) The configuration for a pod running a Ray GPU image and using one NVIDIA GPU looks like this:

apiVersion: v1
kind: Pod
metadata:
  generateName: example-cluster-ray-worker
spec:
  ...
  containers:
  - name: ray-node
    image: rayproject/ray:nightly-gpu
    ...
    resources:
      requests:
        cpu: 1000m
        memory: 512Mi
      limits:
        memory: 512Mi
        # Kubernetes uses this limit as the GPU request.
        nvidia.com/gpu: 1
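
The relationship between GPU requests and limits can be sketched in a few lines of Python. This is only an illustration of the rule Kubernetes applies (a GPU limit with no request is used as the request; if both are given they must be equal; a GPU request without a limit is rejected), not code from Ray or Kubernetes. The helper name `effective_gpu_request` and the dictionary layout are assumptions made for this example; the resource values are taken from the configuration above.

```python
def effective_gpu_request(resources, gpu_key="nvidia.com/gpu"):
    """Return the GPU count that would be treated as requested.

    `resources` mirrors a container's `resources` mapping, e.g.
    {"requests": {...}, "limits": {...}}. Hypothetical helper for
    illustration only.
    """
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if gpu_key not in limits:
        if gpu_key in requests:
            # A GPU request without a GPU limit is not allowed.
            raise ValueError("GPU request given without a GPU limit")
        return 0
    limit = int(limits[gpu_key])
    # When no GPU request is given, the limit is used as the request.
    request = int(requests.get(gpu_key, limit))
    if request != limit:
        raise ValueError("GPU request and limit must be equal")
    return request


worker_resources = {
    "requests": {"cpu": "1000m", "memory": "512Mi"},
    "limits": {"memory": "512Mi", "nvidia.com/gpu": 1},
}
print(effective_gpu_request(worker_resources))  # prints 1
```

In practice this means it is enough to set nvidia.com/gpu under limits, as the pod configuration above does.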

GPU taints and tolerations

Note

If you use a managed Kubernetes service, you probably don't need to worry about this section.

The NVIDIA GPU plugin for Kubernetes applies taints to GPU nodes; these taints prevent non-GPU pods from being scheduled on GPU nodes. Managed Kubernetes services like GKE, EKS, and AKS automatically apply matching tolerations to pods requesting GPU resources. Tolerations are applied by means of Kubernetes's ExtendedResourceToleration admission controller. If this admission controller is not enabled for your Kubernetes cluster, you may need to manually add a GPU toleration to each of your GPU pod configurations. For example,

apiVersion: v1
kind: Pod
metadata:
  generateName: example-cluster-ray-worker
spec:
  ...
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  ...
  containers:
  - name: ray-node
    image: rayproject/ray:nightly-gpu
    ...
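
To see why the toleration above lets a pod onto a tainted GPU node, the following Python sketch models the Kubernetes matching rule: a toleration matches a taint when the effects agree (an empty toleration effect matches any effect) and either the operator is Exists with the same key, or the operator is Equal with the same key and value. The function name `tolerates` and the example taint value `present` are assumptions chosen for illustration; check the taints on your own nodes for the actual values.

```python
def tolerates(toleration, taint):
    """Return True if `toleration` matches `taint` (both plain dicts).

    Hypothetical helper that mirrors the Kubernetes matching rule.
    """
    # An empty or missing effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # With Exists, values are ignored; an empty key matches every key.
        return not key or key == taint.get("key")
    # With Equal (the default), both the key and the value must match.
    return (key == taint.get("key")
            and toleration.get("value", "") == taint.get("value", ""))


# Assumed example taint on a GPU node; the value "present" is illustrative.
gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
# The toleration from the pod configuration above.
gpu_toleration = {"effect": "NoSchedule", "key": "nvidia.com/gpu", "operator": "Exists"}

print(tolerates(gpu_toleration, gpu_taint))  # prints True
```

A pod with no tolerations matches no taint, which is why non-GPU pods stay off the tainted GPU nodes.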

Further reference and discussion

For background, see the Kubernetes documentation on device plugins and on GPU plugins, and the documentation for NVIDIA's GPU plugin for Kubernetes.

If you run into problems setting up GPUs for your Ray cluster on Kubernetes, please reach out to us at https://discuss.ray.io.
