
Added support for ephemeral volumes and ingress creation support #1312

Closed

Conversation

blublinsky
Contributor

Why are these changes needed?

The current implementation of the API server is still not completely on par with the operator's capabilities. Two important things that are missing:

  1. Ability to request Ingress creation for Ray cluster
  2. Ability to add specific PVCs for individual head/worker pods
    This PR adds support for both. For individual PVCs, it leverages generic ephemeral volumes (https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes), which are similar to PVCs but are completely controlled by the pod. The downside is that they do not survive pod crashes, which is probably acceptable, as they are mostly used for scratch data in the pod, for example disk spilling. The upside of this approach is that it is not necessary to manage the PVC lifecycle.
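
For context, a hedged sketch of what a generic ephemeral volume looks like in a rendered pod spec (the names, image, and size here are illustrative, not taken from this PR):

```yaml
# Illustrative pod spec fragment: a generic ephemeral volume is declared inline,
# and Kubernetes creates (and garbage-collects) the backing PVC with the pod.
apiVersion: v1
kind: Pod
metadata:
  name: ray-head            # hypothetical name
spec:
  containers:
    - name: ray-head
      image: rayproject/ray:2.5.0
      volumeMounts:
        - name: scratch
          mountPath: /data  # e.g. used for disk spilling
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi
```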

Related issue number

closes #1087

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • Manual tests
    • This PR is not tested :(

@blublinsky
Contributor Author

@psschwei This extends your PR by adding ephemeral volumes, which are, in effect, ephemeral PVCs. Please take a look.

architkulkarni and others added 12 commits August 15, 2023 08:38
Closes ray-project#1309. I don't know why, but after downgrading the kind version to 0.11.1, I never observed the issue again, whereas with 0.20.0 the issue is consistently reproducible.

I haven't investigated which kind version between 0.11.1 and 0.20.0 is the first one that caused the issue.

The reason for choosing 0.11.1 is that this is the version used in Ray CI, and we haven't observed the issue in Ray CI. https://github.com/architkulkarni/ray/blob/5cb837dbaf1e5875f4f365e67cec6b09d90bf710/ci/k8s/prep-k8s-environment.sh#L8

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Do not update pod labels if they haven't changed
Sets up the Buildkite CI pipeline to test the RayJob sample YAML files using kind.

Related issue number
Closes ray-project#1246

---------

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
apiserver/pkg/model/volumes_test.go
proto/cluster.proto
blublinsky and others added 4 commits August 17, 2023 15:15
Signed-off-by: Boris Lublinsky <blublinsky@hotmail.com>
…t#1342)

Bump the golangci-lint version in the api server makefile
…ay-project#1340)

* add service yaml for nlp

* Documentation fixes

* Fix instructions

* Apply suggestions from code review

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Praveen <gorthypraveen@gmail.com>

* Fix tolerations comment

* review comments

* Update docs/guidance/stable-diffusion-rayservice.md

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>

---------

Signed-off-by: Praveen <gorthypraveen@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
@@ -221,22 +221,22 @@ message HeadGroupSpec {
 string service_type = 3;
 // Optional. Enable Ingress
 // if Ingress is enabled, we might have to specify annotation IngressClassAnnotationKey, for the cluster itself, defining Ingress class
-bool enableIngress = 4;
+bool enableIngress = 11;
Contributor

Nitpick: it's kind of odd to have a higher field number so high up in the code. Can you please move lines 222-224 past line 239?

z103cb and others added 2 commits August 18, 2023 15:44
…1336)

Removed use of BUILD_FLAGS in the apiserver makefile
…istening to Kubernetes events (ray-project#1341)

Redefine the behavior for deleting Pods and stop listening to Kubernetes events
@tedhtchang
Contributor

tedhtchang commented Aug 21, 2023

@blublinsky Could you give an example of an ephemeral volume REST API request? I have tried, but the API server just crashes.

curl -X POST 'localhost:31888/apis/v1alpha2/namespaces/default/compute_templates' \
--data '{
  "name": "default-template",
  "namespace": "default",
  "cpu": 1,
  "memory": 1
}'

curl -X POST 'http://localhost:8888/apis/v1alpha2/namespaces/default/clusters' \
--data '{
  "name": "myraycluster",
  "namespace": "default",
  "user": "tedchang",
  "version": "2.5.0",
  "clusterSpec": {
    "headGroupSpec": {
      "computeTemplate": "default-template",
      "image": "rayproject/ray:2.5.0",
      "serviceType": "NodePort",
      "rayStartParams": {"dashboard-host": "0.0.0.0"},
      "volumes": [{
        "mountPath": "/data",
        "name": "test-vol",
        "volumeType": "EPHEMERAL"
        }
      ]
    },
    "workerGroupSpec": [
      {
        "groupName": "small-wg",
        "computeTemplate": "default-template",
        "image": "rayproject/ray:2.5.0",
        "replicas": 1,
        "minReplicas": 0,
        "maxReplicas": 1,
        "rayStartParams": {"metrics-export-port": "8080"}
      }
    ]
  }
}'

Error:

I0821 15:53:20.544800 2931087 client_manager.go:52] Initializing client manager
I0821 15:53:20.590550 2931087 client_manager.go:71] Client manager initialized successfully
I0821 15:53:20.590775 2931087 main.go:89] Starting Http Proxy
I0821 15:53:20.590939 2931087 main.go:58] Starting gRPC server
I0821 15:55:06.238512 2931087 interceptor.go:14] /proto.ClusterService/CreateCluster handler starting
panic: cannot parse '': quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$'

See the entire error message in debug_ephemeral_vol.txt

@@ -197,7 +200,16 @@ message Volume {
 HOSTTOCONTAINER = 1;
 BIDIRECTIONAL = 2;
 }
 MountPropagationMode mount_propagation_mode = 7;
+// If indicate ephemeral, we need to let user specify volumeClaimTemplate
Contributor

Should there be a volumeClaimTemplate message ?

Contributor Author

We are specifying here the volumeClaimTemplate parameters: storage class, storage size, and access mode. The actual volumeClaimTemplate is built based on them.

Contributor Author

Changed the comment to make it clear

Contributor Author

Also @tedhtchang, if you use ephemeral volumes, you need to specify at least the size. See this:

var testEphemeralVolume = &api.Volume{
	Name:       "test-ephemeral",
	VolumeType: api.Volume_EPHEMERAL,
	MountPath:  "/ephimeral/dir",
	Storage:    "10Gi",
}

from cluster_test.go

blublinsky and others added 5 commits August 22, 2023 08:42
Adds the field RuntimeEnvYAML to the RayJob CRD, which accepts a multi-line YAML string. This format is preferred for two reasons:

1. Consistency with the ServeConfigV2 format, which is also a Ray configuration specified as a multi-line YAML string
2. (Related to the above) It allows using snake_case fields without modification

We preserve the older field RuntimeEnv, which accepts a base64-encoded string of the runtime env, and mark it as deprecated in the documentation.

We raise an error if both fields are specified.

Related issue number
Closes ray-project#1195

---------

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@tedhtchang
Contributor

@blublinsky The error handling worked for me. Thanks.
For the ingress feature, do you have instructions on how to:

  1. manually test a raycluster with the "enableIngress": true on a KinD cluster and then
  2. send request to the ingress url with curl/browser

@blublinsky
Contributor Author

blublinsky commented Aug 29, 2023

Sure @tedhtchang.

  1. Install NGINX ingress on kind https://kind.sigs.k8s.io/docs/user/ingress/
  2. Set annotation, see https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/ingress_test.go#L22
  3. Enable ingress
  4. You should see an ingress created along with the ray cluster
  5. You can use this ingress to get to the dashboard
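
For step 2, the annotation set is of roughly this shape (a sketch; the authoritative keys and values are in the linked ingress_test.go):

```yaml
# Illustrative cluster annotation requesting the NGINX ingress class.
annotations:
  kubernetes.io/ingress.class: nginx
```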

@tedhtchang
Contributor

I still couldn't get it to work. Here is my setup:

kind delete cluster --name ray-test
cat <<EOF | kind create cluster --name ray-test --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        listenAddress: "0.0.0.0"
        protocol: tcp
EOF

Nginx ingress controller

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml

Kuberay

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0 --set image.repository=quay.io/kuberay/operator

Apiserver run on separate terminal

make build && make run

Compute template

curl -X POST 'localhost:8888/apis/v1alpha2/namespaces/default/compute_templates' \
--data '{
  "name": "default-template",
  "namespace": "default",
  "cpu": 1,
  "memory": 1
}'

raycluster with ingress

curl -X POST 'http://localhost:8888/apis/v1alpha2/namespaces/default/clusters' --data '{
  "name": "myraycluster",
  "namespace": "default",
  "user": "tedchang",
  "version": "2.5.0",
  "clusterSpec": {
    "headGroupSpec": {
      "enableIngress": true,
      "computeTemplate": "default-template",
      "image": "quay.io/project-codeflare/ray:2.5.0-py38-cu116",
      "rayStartParams": {"dashboard-host": "0.0.0.0"},
      "volumes": [{
        "mountPath": "/data",
        "name": "test-vol",
        "volumeType": "EPHEMERAL",
        "storage": "1Gi"
        }
      ]
    },
    "workerGroupSpec": [
      {
        "groupName": "small-wg",
        "computeTemplate": "default-template",
        "image": "quay.io/project-codeflare/ray:2.5.0-py38-cu116",
        "replicas": 1,
        "minReplicas": 0,
        "maxReplicas": 1,
        "rayStartParams": {"metrics-export-port": "8080"}
      }
    ]
  }
}'

kubectl get ingress myraycluster-head-ingress -oyaml

  ...
    rules:
    - http:
        paths:
        - backend:
            service:
              name: myraycluster-head-svc
              port:
                number: 8265
          path: /myraycluster/(.*)
          pathType: Exact
  status:
    loadBalancer:
      ingress:
      - hostname: localhost

curl 127.0.0.1/myraycluster/

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

@blublinsky
Contributor Author

blublinsky commented Aug 30, 2023

@tedhtchang
You missed the annotations:

curl -X POST 'http://localhost:8888/apis/v1alpha2/namespaces/default/clusters' --data '{
  "name": "myraycluster",
  "namespace": "default",
  "user": "tedchang",
  "version": "2.5.0",
  "annotations": {
    "kubernetes.io/ingress.class": "nginx",
    "nginx.ingress.kubernetes.io/rewrite-target": "/$2"
  },
  "clusterSpec": {
    "headGroupSpec": {
      "enableIngress": true,
      "computeTemplate": "default-template",
      "image": "docker.io/rayproject/ray:2.6.3-py310",
      "rayStartParams": {
        "dashboard-host": "0.0.0.0",
        "metrics-export-port": "8080",
        "num-cpus": "0"
      }
    },
    "workerGroupSpec": [
      {
        "groupName": "small-wg",
        "computeTemplate": "default-template",
        "image": "docker.io/rayproject/ray:2.6.3-py310",
        "rayStartParams": {
          "node-ip-address": "$MY_POD_IP"
        },
        "replicas": 1,
        "minReplicas": 0,
        "maxReplicas": 1
      }
    ]
  }
}'

With these annotations in place, the correct Ingress is created, and you can use the browser to access the dashboard.

@blublinsky
Contributor Author

@tedhtchang can you please approve this so that @kevin85421 can merge it?

@tedhtchang
Contributor

@blublinsky The browser just opened a blank page.

However, curl http://localhost/myraycluster/ worked:

<!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="./favicon.ico"/><meta name="viewport" content="width=device-width,initial-scale=1"/><title>Ray Dashboard</title><script defer="defer" src="./static/js/main.4ff4f4db.js"></script><link href="./static/css/main.388a904b.css" rel="stylesheet"></head><body><noscript>You need to enable JavaScript to run this app.</noscript><div id="root"></div></body></html>

@blublinsky
Contributor Author

@tedhtchang
do you have JavaScript enabled in your browser?

You need to enable JavaScript to run this app

@tedhtchang
Contributor

Yes, I think so... Testing with kubectl port-forward svc/myraycluster-head-svc 8265, I can visit http://localhost:8265 and see the dashboard in my browser.

@tedhtchang
Contributor

tedhtchang commented Sep 9, 2023

Changing to "nginx.ingress.kubernetes.io/rewrite-target": "/$1" worked... This example needs to be documented, lol.

curl -X POST 'http://localhost:8888/apis/v1alpha2/namespaces/default/clusters' --data '{
  "name": "myraycluster",
  "namespace": "default",
  "user": "tedchang",
  "version": "2.5.0",
  "annotations": {
    "kubernetes.io/ingress.class": "nginx",
    "nginx.ingress.kubernetes.io/rewrite-target": "/$1"
  },
  "clusterSpec": {
    "headGroupSpec": {
      "enableIngress": true,
      "computeTemplate": "default-template",
      "image": "docker.io/rayproject/ray:2.6.3-py310",
      "rayStartParams": {
        "dashboard-host": "0.0.0.0",
        "metrics-export-port": "8080",
        "num-cpus": "0"
      }
    },
    "workerGroupSpec": [
      {
        "groupName": "small-wg",
        "computeTemplate": "default-template",
        "image": "docker.io/rayproject/ray:2.6.3-py310",
        "rayStartParams": {
          "node-ip-address": "$MY_POD_IP"
        },
        "replicas": 1,
        "minReplicas": 0,
        "maxReplicas": 1
      }
    ]
  }
}'
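
A likely explanation for why $1 works here: the generated path /myraycluster/(.*) (see the kubectl get ingress output earlier in this thread) contains exactly one capture group, so $1 holds the sub-path while $2 is empty. A sketch of the matching pair:

```yaml
# Sketch: the rewrite target must reference a capture group defined in the path.
# With one group, $1 carries the sub-path; $2 is empty, so rewrite-target: /$2
# would rewrite every request to "/", which likely explains the blank page.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  rules:
    - http:
        paths:
          - path: /myraycluster/(.*)
            backend:
              service:
                name: myraycluster-head-svc
                port:
                  number: 8265
```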

Contributor

@tedhtchang left a comment

LGTM. We should add some examples to the README.md later.

@kevin85421
Member

The screenshot shows that this PR changes 45 files, and most of the changes come from the master branch's commits, not from this PR. Would you mind updating this PR or opening a new PR with only the necessary changes? Typically, I use the following commands to rebase my local branch onto upstream/master:

git fetch upstream
git rebase upstream/master

@blublinsky
Contributor Author

@kevin85421 you can use #1409 in place of this. It only has the required files.

@kevin85421
Member

Close this PR because the replacement #1409 is merged.

@kevin85421 kevin85421 closed this Sep 11, 2023
@blublinsky blublinsky deleted the api_server_volumes_ingress branch September 12, 2023 06:42
Successfully merging this pull request may close these issues.

Enable support for PVCs in Ray cluster nodes