[Enhancement] GPU RayCluster doesn't work on GKE Autopilot #1349

architkulkarni · 2023-08-17T23:51:38Z

GKE Autopilot is a streamlined, user-friendly way to set up a cluster.

However, trying to use KubeRay with a GPU results in the RayCluster showing its status as failed, with the following error message:

GPU-using init containers are not supported in Autopilot.

The workaround is to not use Autopilot and instead manually create the GKE cluster with a GPU node pool, install the NVIDIA drivers, set up taints and tolerations, etc. If we can fix the issue in KubeRay, the user won't have to think about any of this.

architkulkarni · 2023-08-17T23:53:49Z

To reproduce the issue, here is the YAML file I used; a much more minimal YAML would likely reproduce it as well.

kubectl apply -f job.yaml

apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false
  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10
  # Runtime env decoded to:
  # {
  # "pip": [
  #   "torch",
  #   "torchvision",
  #   "Pillow",
  #   "transformers"
  # ]
  # }
  runtimeEnv: ewogICJwaXAiOiBbCiAgICAidG9yY2giLAogICAgInRvcmNodmlzaW9uIiwKICAgICJQaWxsb3ciLAogICAgInRyYW5zZm9ybWVycyIKICBdCn0=
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.6.3' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.6.3-gpu
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                requests:
                  cpu: 2
                  memory: 8Gi
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # Number of worker pod replicas in this group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name (here, small-group); it can also describe the group's function
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams:
          resources: '"{\"accelerator_type_cpu\": 48, \"accelerator_type_a10\": 2, \"accelerator_type_a100\": 2}"'
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray-ml:2.6.3-gpu
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "48"
                    memory: "192G"
                    nvidia.com/gpu: 4
                  requests:
                    cpu: "36"
                    memory: "128G"
                    nvidia.com/gpu: 4
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-tesla-t4
  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  # submitterPodTemplate:
  #   spec:
  #     restartPolicy: Never
  #     containers:
  #       - name: my-custom-rayjob-submitter-pod
  #         image: rayproject/ray:2.6.3
  #         # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
  #         # Specifying Command is not recommended.
  #         # command: ["ray job submit --address=http://rayjob-sample-raycluster-v6qcq-head-svc.default.svc.cluster.local:8265 -- echo hello world"]
      

######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray

    s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"

    ds = ray.data.read_images(
        s3_uri, mode="RGB"
    )
    ds
    # TODO(archit) need to install Pillow, pytorch or tf or flax (pip install torch torchvision torchaudio)
    from typing import Dict
    import numpy as np

    from transformers import pipeline
    from PIL import Image

    # Pick the largest batch size that can fit on our GPUs
    BATCH_SIZE = 1024

    # TODO(archit) basic step

    # single_batch = ds.take_batch(10)

    # from PIL import Image

    # img = Image.fromarray(single_batch["image"][0])
    # # display image
    # img.show()
    # from transformers import pipeline
    # from PIL import Image

    # # If doing CPU inference, set device="cpu" instead.
    # classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device="cuda:0")
    # outputs = classifier([Image.fromarray(image_array) for image_array in single_batch["image"]], top_k=1, batch_size=10)
    # del classifier # Delete the classifier to free up GPU memory.
    # print(outputs)

    @ray.remote(num_gpus=1)
    def do_single_batch():
        single_batch = ds.take_batch(10)

        from PIL import Image

        img = Image.fromarray(single_batch["image"][0])
        # display image
        img.show()
        from transformers import pipeline
        from PIL import Image

        # If doing CPU inference, set device="cpu" instead.
        classifier = pipeline("image-classification", model="google/vit-base-patch16-224", device="cuda:0")
        outputs = classifier([Image.fromarray(image_array) for image_array in single_batch["image"]], top_k=1, batch_size=10)
        del classifier # Delete the classifier to free up GPU memory.
        print(outputs)
        return outputs

    print(ray.get(do_single_batch.remote()))
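
The `runtimeEnv` value in the RayJob spec above is a base64-encoded JSON document. As a quick sanity check that it matches the commented-out description, here is a minimal sketch that decodes it with the Python standard library. The helper name `decode_runtime_env` is illustrative only and is not part of KubeRay or Ray.

```python
import base64
import json

# Value copied verbatim from the `runtimeEnv` field of the RayJob above.
RUNTIME_ENV_B64 = (
    "ewogICJwaXAiOiBbCiAgICAidG9yY2giLAogICAgInRvcmNodmlzaW9uIiwKICAgICJQ"
    "aWxsb3ciLAogICAgInRyYW5zZm9ybWVycyIKICBdCn0="
)


def decode_runtime_env(encoded: str) -> dict:
    """Decode a base64-encoded JSON runtime environment into a dict.

    Hypothetical helper for inspection only; not a KubeRay API.
    """
    raw = base64.b64decode(encoded)
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    # Pretty-print the decoded runtime environment.
    print(json.dumps(decode_runtime_env(RUNTIME_ENV_B64), indent=2))
```

Running the script prints the decoded JSON, which should list the same pip packages as the comment in the YAML.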

jrosti · 2023-08-24T13:20:39Z

I'm curious whether Autopilot can be made to work by disabling init container injection (#1069 (comment)) and doing the GCS-ready check in the main container.

richardsliu · 2023-09-21T22:34:15Z

I'm curious whether Autopilot can be made to work by disabling init container injection (#1069 (comment)) and doing the GCS-ready check in the main container.

Yes, that should work.
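
To make the idea concrete, here is a rough sketch of what a GCS-ready check in the main container could look like. It assumes the worker only needs to wait until the head's GCS port (6379 in the YAML above) accepts TCP connections. The function name `wait_for_gcs` and its parameters are hypothetical. This is not how KubeRay implements the check, and a real check might rely on Ray's own health check rather than a raw socket probe.

```python
import socket
import time


def wait_for_gcs(host: str, port: int = 6379,
                 timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Hypothetical helper: a successful TCP connect is treated as "GCS ready",
    which is an assumption of this sketch, not a guarantee from Ray.
    Returns True if a connection was established, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection to the GCS port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, DNS failure, or connect timeout: retry.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

In a worker container, something like this could run before `ray start`, with `host` set to the DNS name of the head service, so that the wait happens in the main container and no init container is needed.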

kevin85421 · 2023-10-06T07:07:30Z

I just realized that GKE Autopilot and the node pool's autoscaling are different. I reproduced this issue successfully by:

gcloud container clusters create-auto kuberay-gpu-cluster --region=us-west1
helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0-rc.0
# Create a RayCluster where the workers require a GPU.

architkulkarni added the enhancement and P2 labels Aug 17, 2023

kevin85421 added the 1.0 label Sep 15, 2023

kevin85421 self-assigned this Sep 21, 2023

kevin85421 removed the P2 label Oct 5, 2023

kevin85421 mentioned this issue Oct 6, 2023

[Enhancement] GPU RayCluster doesn't work on GKE Autopilot #1470

Merged

kevin85421 closed this as completed in #1470 Oct 6, 2023
