Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validating webhook #1584

Merged
merged 1 commit into from
Nov 27, 2023
Merged

Conversation

davidxia
Copy link
Contributor

@davidxia davidxia commented Oct 30, 2023

and validate worker group names are unique.

Add new Makefile targets install-with-webhooks, uninstall-with-webhooks,
deploy-with-webhooks, and undeploy-with-webhooks to be backwards compatible
and opt-in.

Much of the code, especially the YAML files, was generated by running the
command below as documented in kubebuilder.

kubebuilder create webhook \
  --group ray \
  --version v1 \
  --kind RayCluster \
  --defaulting \
  --programmatic-validation

How to use locally

make manifests generate
make install-with-webhooks
IMG=kuberay/operator:test make docker-build
kind load docker-image kuberay/operator:test
IMG=kuberay/operator:test make deploy-with-webhooks

Example RayCluster that has duplicate worker group names

cat dupe-worker-group-name.yaml

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: dupe-worker-group-name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.7.0
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: group1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.7.0
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: group1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.7.0

Before

kubectl apply -f dupe-worker-group-name.yaml
raycluster.ray.io/raycluster-dupe-worker-name created

After

kubectl --context kind-kind apply -f config/samples/ray-cluster-dupe-worker-name.yaml

The RayCluster "raycluster-dupe-worker-name" is invalid: spec.workerGroupSpecs[1]: Invalid value: v1.WorkerGroupSpec{GroupName:"group1", Replicas:(*int32)(0x40006e63cc), MinReplicas:(*int32)(0x40006e63c8), MaxReplicas:(*int32)(0x40006e63c0), RayStartParams:map[string]string{}, Template:v1.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.PodSpec{Volumes:[]v1.Volume(nil), InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"ray-worker", Image:"rayproject/ray:2.7.0", Command:[]string(nil), Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount(nil), VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), StartupProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"", TerminationMessagePolicy:"", ImagePullPolicy:"", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]v1.EphemeralContainer(nil), RestartPolicy:"", TerminationGracePeriodSeconds:(*int64)(nil), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"", NodeSelector:map[string]string(nil), ServiceAccountName:"", DeprecatedServiceAccount:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(nil), SecurityContext:(*v1.PodSecurityContext)(nil), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"", Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil), PreemptionPolicy:(*v1.PreemptionPolicy)(nil), Overhead:v1.ResourceList(nil), TopologySpreadConstraints:[]v1.TopologySpreadConstraint(nil), SetHostnameAsFQDN:(*bool)(nil), OS:(*v1.PodOS)(nil)}}, ScaleStrategy:v1.ScaleStrategy{WorkersToDelete:[]string(nil)}}: worker group names must be unique

Backwards compatibility

Just use original Makefile targets of install, uninstall, deploy, and
undeploy.

IMG=kuberay/operator:test make docker-build
kind load docker-image kuberay/operator:test
IMG=kuberay/operator:test make deploy

kubectl apply -f dupe-worker-group-name.yaml
raycluster.ray.io/raycluster-dupe-worker-name created

closes #718
closes #736

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

ray-operator/Makefile Outdated Show resolved Hide resolved
Comment on lines +70 to +82
func (r *RayCluster) validateWorkerGroups() *field.Error {
workerGroupNames := make(map[string]bool)

for i, workerGroup := range r.Spec.WorkerGroupSpecs {
if _, ok := workerGroupNames[workerGroup.GroupName]; ok {
return field.Invalid(field.NewPath("spec").Child("workerGroupSpecs").Index(i), workerGroup, "worker group names must be unique")
}
workerGroupNames[workerGroup.GroupName] = true
}

return nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important part

err := k8sClient.Create(context.TODO(), &rayCluster)
Expect(err).To(HaveOccurred())

Expect(err.Error()).To(ContainSubstring("worker group names must be unique"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relevant test, rest of this file mostly generated by kubebuilder

ray-operator/main.go Outdated Show resolved Hide resolved
@@ -0,0 +1,39 @@
# The following manifests contain a self-signed issuer CR and a certificate CR.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the YAML below was either generated by kubebuilder or cargo-culted from a fresh kubebuilder-generated repo that had a lot of original boilerplate for webhooks and certmanager preserved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any documentation for YAML generation in case I need to upgrade them in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevin85421
Copy link
Member

Cool! Can users determine whether to install this webhook or not?

@davidxia
Copy link
Contributor Author

I need to find a way to make it opt-in. Lmk if you have some examples.

@davidxia davidxia force-pushed the validating-webhook branch 3 times, most recently from 450e681 to b407473 Compare October 31, 2023 15:25
Comment on lines +122 to +136
certmanager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any compatibility limitations between cert-manager v1.13.2 and Kubernetes, controller-runtime, or other dependencies? Is it necessary to deploy cert-manager if users want to use the webhook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any compatibility limitations between cert-manager v1.13.2 and Kubernetes, controller-runtime, or other dependencies?

cert-manager 1.13 is the latest release and support K8s versions 1.23 — 1.28. Does this work?

Is it necessary to deploy cert-manager if users want to use the webhook?

I think it's possible to run webhooks with insecure communication (something like insecure-skip-tls-verify) between K8s control plane and the webhook server, but it's not recommended. I don't even see it mentioned in https://book.kubebuilder.io/. We should probably stick with the guidelines here. wdyt?

@davidxia davidxia force-pushed the validating-webhook branch 2 times, most recently from f842bf3 to 1ad2074 Compare October 31, 2023 15:48
ray-operator/Makefile Outdated Show resolved Hide resolved
Comment on lines +5 to +7
- ../crd/bases/ray.io_rayclusters.yaml
- ../crd/bases/ray.io_rayservices.yaml
- ../crd/bases/ray.io_rayjobs.yaml
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relative paths so we don't have to duplicate these large files

Comment on lines +110 to +118
($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl create -f -) || ($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl replace -f -)

uninstall-with-webhooks: manifests kustomize ## Uninstall CRDs with webhooks from the K8s cluster specified in ~/.kube/config.
$(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl delete -f -
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--load-restrictor LoadRestrictionsNone is needed to allow kustomize to load files from outside root of its context. Otherwise, we get the error

make uninstall-with-webhooks

test -s /Users/dxia/src/github.com/ray-project/kuberay/ray-operator/bin/controller-gen || GOBIN=/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/bin/controller-gen/.. go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.6.0
/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/bin/controller-gen "crd:maxDescLen=0,trivialVersions=true,preserveUnknownFields=false,generateEmbeddedObjectMeta=true,allowDangerousTypes=true" rbac:roleName=kuberay-operator webhook paths="./..." output:crd:artifacts:config=config/crd/bases
test -s /Users/dxia/src/github.com/ray-project/kuberay/ray-operator/bin/kustomize ||  GOBIN=/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/bin/kustomize/.. go install sigs.k8s.io/kustomize/kustomize/v5@latest
/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/bin/kustomize build config/crd-with-webhooks | grep -v 'creationTimestamp: "null"' | kubectl delete -f -
Error: accumulating resources: accumulation err='accumulating resources from 'bases/ray.io_rayclusters.yaml': security; file '/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/config/crd-with-webhooks/bases/ray.io_rayclusters.yaml' is not in or below '/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/config/crd-with-webhooks'': must build at directory: '/Users/dxia/src/github.com/ray-project/kuberay/ray-operator/config/crd-with-webhooks/bases/ray.io_rayclusters.yaml': file is not directory
No resources found
make: *** [uninstall-with-webhooks] Error 1

Comment on lines 171 to 176
if os.Getenv("ENABLE_WEBHOOKS") == "true" {
if err = (&rayv1.RayCluster{}).SetupWebhookWithManager(mgr); err != nil {
setupLog.Error(err, "unable to create webhook", "webhook", "RayCluster")
os.Exit(1)
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +12 to +13
- name: ENABLE_WEBHOOKS
value: "true"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this env var enables the webhooks

Comment on lines +109 to +118
install-with-webhooks: manifests kustomize ## Install CRDs with webhooks into the K8s cluster specified in ~/.kube/config.
($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl create -f -) || ($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl replace -f -)

uninstall-with-webhooks: manifests kustomize ## Uninstall CRDs with webhooks from the K8s cluster specified in ~/.kube/config.
$(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl delete -f -
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created new targets to be backwards compatible and so users can opt-in

@davidxia

This comment was marked as outdated.

@davidxia davidxia marked this pull request as ready for review November 15, 2023 23:20
@kevin85421 kevin85421 self-assigned this Nov 15, 2023
@kevin85421
Copy link
Member

@davidxia Would you mind fixing the CI errors? Thanks!

@kevin85421
Copy link
Member

The lint check in CI keeps failing.

@davidxia
Copy link
Contributor Author

The lint check in CI keeps failing.

should be fixed now!

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contributions! Your contributions are always very high quality. Would you mind resolving the conflicts with the master branch? It looks good to me, but I will also try to test it manually before I merge it.

@@ -112,6 +112,12 @@ install: manifests kustomize ## Install CRDs into the K8s cluster specified in ~
uninstall: manifests kustomize ## Uninstall CRDs from the K8s cluster specified in ~/.kube/config.
$(KUSTOMIZE) build config/crd | kubectl delete -f -

install-with-webhooks: manifests kustomize ## Install CRDs with webhooks into the K8s cluster specified in ~/.kube/config.
($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl create -f -) || ($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl replace -f -)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use ($(KUSTOMIZE) build --load-restrictor LoadRestrictionsNone config/crd-with-webhooks | kubectl apply -f -) instead of using both kubectl create and kubectl replace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the pattern from the install: target. I can change it, but maybe it's better to keep it consistent?

Comment on lines +122 to +136
certmanager:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.2/cert-manager.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any compatibility limitations between cert-manager v1.13.2 and Kubernetes, controller-runtime, or other dependencies? Is it necessary to deploy cert-manager if users want to use the webhook?


deploy-with-webhooks: manifests kustomize certmanager ## Deploy controller with webhooks to the K8s cluster specified in ~/.kube/config.
cd config/default-with-webhooks && $(KUSTOMIZE) edit set image kuberay/operator=${IMG}
($(KUSTOMIZE) build config/default-with-webhooks | kubectl create -f -) || ($(KUSTOMIZE) build config/default-with-webhooks | kubectl replace -f -)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubectl apply -f -?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,3 +1,7 @@
# Code generated by tool. DO NOT EDIT.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, which tools update this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the YAMLs and this were created or modified by (noted in PR description)

kubebuilder create webhook \
  --group ray \
  --version v1 \
  --kind RayCluster \
  --defaulting \
  --programmatic-validation

ray-operator/main.go Outdated Show resolved Hide resolved

//+kubebuilder:webhook:path=/mutate-ray-io-v1-raycluster,mutating=true,failurePolicy=fail,sideEffects=None,groups=ray.io,resources=rayclusters,verbs=create;update,versions=v1,name=mraycluster.kb.io,admissionReviewVersions=v1

var _ webhook.Defaulter = &RayCluster{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation

We use the webhook.Defaulter interface to set defaults to our CRD. A webhook will automatically be served that calls this defaulting. The Default method is expected to mutate the receiver, setting the defaults.

// TODO(user): change verbs to "verbs=create;update;delete" if you want to enable deletion validation.
//+kubebuilder:webhook:path=/validate-ray-io-v1-raycluster,mutating=false,failurePolicy=fail,sideEffects=None,groups=ray.io,resources=rayclusters,verbs=create;update,versions=v1,name=vraycluster.kb.io,admissionReviewVersions=v1

var _ webhook.Validator = &RayCluster{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation

If webhook.Validator interface is implemented, a webhook will automatically be served that calls the validation. The ValidateCreate, ValidateUpdate and ValidateDelete methods are expected to validate its receiver upon creation, update and deletion respectively.

// TODO(user): fill in your defaulting logic.
}

// TODO(user): change verbs to "verbs=create;update;delete" if you want to enable deletion validation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field mutating: false is different from mutating: true in L22. What's the meaning of the field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means this webhook endpoint is for validation instead of mutation.

marks this as a mutating webhook (it's validating only if false)

— https://book.kubebuilder.io/reference/markers/webhook.html

@@ -0,0 +1,39 @@
# The following manifests contain a self-signed issuer CR and a certificate CR.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any documentation for YAML generation in case I need to upgrade them in the future?

@@ -0,0 +1,177 @@
package v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this test suite be executed by running make test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, look for Running Suite: Webhook Suite. I also checked by putting Fail() in the test and seeing make test fails.

@davidxia davidxia force-pushed the validating-webhook branch 2 times, most recently from 6a4986c to db8604c Compare November 23, 2023 17:24
@davidxia
Copy link
Contributor Author

Should I update docs here in a subsequent PR?

and validate worker group names are unique.

Add new Makefile targets `install-with-webhooks`, `uninstall-with-webhooks`,
`deploy-with-webhooks`, and `undeploy-with-webhooks` to be backwards compatible
and opt-in.

Much of the code, especially the YAML files, was generated by running the
command below as [documented in kubebuilder][1].

```
kubebuilder create webhook \
  --group ray \
  --version v1 \
  --kind RayCluster \
  --defaulting \
  --programmatic-validation
```

## How to use locally

```shell
make manifests generate
make install-with-webhooks
IMG=kuberay/operator:test make docker-build
kind load docker-image kuberay/operator:test
IMG=kuberay/operator:test make deploy-with-webhooks
```

## Example RayCluster that has duplicate worker group names

```shell
cat dupe-worker-group-name.yaml

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: dupe-worker-group-name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.7.0
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: group1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.7.0
  - replicas: 1
    minReplicas: 1
    maxReplicas: 10
    groupName: group1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.7.0
```

## Before

```
kubectl apply -f dupe-worker-group-name.yaml
raycluster.ray.io/raycluster-dupe-worker-name created
```

## After

```
kubectl --context kind-kind apply -f config/samples/ray-cluster-dupe-worker-name.yaml
```

`The RayCluster "raycluster-dupe-worker-name" is invalid:
spec.workerGroupSpecs[1]: Invalid value: v1.WorkerGroupSpec{GroupName:"group1",
Replicas:(*int32)(0x40006e63cc), MinReplicas:(*int32)(0x40006e63c8),
MaxReplicas:(*int32)(0x40006e63c0), RayStartParams:map[string]string{},
Template:v1.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"",
Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0,
CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC),
DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil),
Labels:map[string]string(nil), Annotations:map[string]string(nil),
OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil),
ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)},
Spec:v1.PodSpec{Volumes:[]v1.Volume(nil), InitContainers:[]v1.Container(nil),
Containers:[]v1.Container{v1.Container{Name:"ray-worker",
Image:"rayproject/ray:2.7.0", Command:[]string(nil), Args:[]string(nil),
WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil),
Env:[]v1.EnvVar(nil),
Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil),
Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount(nil),
VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil),
ReadinessProbe:(*v1.Probe)(nil), StartupProbe:(*v1.Probe)(nil),
Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"",
TerminationMessagePolicy:"", ImagePullPolicy:"",
SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false,
TTY:false}}, EphemeralContainers:[]v1.EphemeralContainer(nil),
RestartPolicy:"", TerminationGracePeriodSeconds:(*int64)(nil),
ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"",
NodeSelector:map[string]string(nil), ServiceAccountName:"",
DeprecatedServiceAccount:"", AutomountServiceAccountToken:(*bool)(nil),
NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false,
ShareProcessNamespace:(*bool)(nil),
SecurityContext:(*v1.PodSecurityContext)(nil),
ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"",
Affinity:(*v1.Affinity)(nil), SchedulerName:"",
Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil),
PriorityClassName:"", Priority:(*int32)(nil),
DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil),
RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil),
PreemptionPolicy:(*v1.PreemptionPolicy)(nil), Overhead:v1.ResourceList(nil),
TopologySpreadConstraints:[]v1.TopologySpreadConstraint(nil),
SetHostnameAsFQDN:(*bool)(nil), OS:(*v1.PodOS)(nil)}},
ScaleStrategy:v1.ScaleStrategy{WorkersToDelete:[]string(nil)}}: worker group
names must be unique`

## Backwards compatibility

Just use original Makefile targets of `install`, `uninstall`, `deploy`, and
`undeploy`.

```shell
IMG=kuberay/operator:test make docker-build
kind load docker-image kuberay/operator:test
IMG=kuberay/operator:test make deploy

kubectl apply -f dupe-worker-group-name.yaml
raycluster.ray.io/raycluster-dupe-worker-name created
```
closes ray-project#718
closes ray-project#736

[1]: https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation
@kevin85421
Copy link
Member

Should I update docs here in a subsequent PR?

We currently move almost all docs to the Ray website. Maybe we can add a doc for webhook under User Guides instead?

@kevin85421
Copy link
Member

Thanks, I will merge it after I manually test it.

@davidxia
Copy link
Contributor Author

Manual tests include

  • make install-with-webhooks && make uninstall-with-webhooks exits 0
  • make deploy-with-webhooks && make undeploy-with-webhooks exits 0
  • make install && make install-with-webhooks exits 0
  • make deploy && make deploy-with-webhooks exits 0
  • make install-with-webhooks deploy-with-webhooks && kubectl apply -f dupe-worker-group-name.yaml (see above for file) results in error with message worker group names must be unique

Should we write proper tests for these? Is it easy to do so?

@kevin85421
Copy link
Member

I tested this PR manually by running the following

kind create cluster --image=kindest/node:v1.26.0
IMG=kuberay/operator:test make docker-image
kind load docker-image kuberay/operator:test

# If this command reports any errors, rerun it.
IMG=kuberay/operator:test make deploy-with-webhooks

kubectl apply -f dupe-worker-group-name.yaml
# The RayCluster "dupe-worker-group-name" is invalid: spec.workerGroupSpecs[1]: Invalid value: ...... : worker group names must be unique

The command make deploy-with-webhooks always reports errors about cert-manager in the first run. I need to rerun the command when the cert-manager Pods are ready. Is there any way to make the first try succeed?

Screen Shot 2023-11-27 at 12 48 10 PM

@davidxia
Copy link
Contributor Author

Do serving-cert and selfsigned-issuer take some time to appear on the cluster?

@kevin85421
Copy link
Member

kevin85421 commented Nov 27, 2023

Do serving-cert and selfsigned-issuer take some time to appear on the cluster?

No. I do not dig into the root cause, but, in my observation, both serving-cert and selfsigned-issuer can only be created after cert-manager Pods are ready. If I do not rerun the command, the KubeRay operator Pod will hang forever because Warning FailedMount 11s (x6 over 27s) kubelet MountVolume.SetUp failed for volume "cert" : secret "webhook-server-cert" not found. After I run the command again, the certificate will be created immediately.

I will merge this PR before addressing this issue. We can have a follow-up PR to improve this later.

@kevin85421
Copy link
Member

Compatibility:

  • cert-manager v1.13.2 supports Kubernetes 1.23 - 1.28. (Add validating webhook #1584 (comment))
  • KubeRay currently supports Kubernetes 1.20 and above. However, it would be acceptable to only support versions 1.23 and higher, as the Kubernetes community has not supported version 1.23 since February 2023. (https://endoflife.date/kubernetes).
  • I tried to test this PR with webhook disabled on Kubernetes 1.20.7. It works.

To conclude, it is fine for KubeRay to support v1.23+.

@kevin85421 kevin85421 merged commit f8ed876 into ray-project:master Nov 27, 2023
25 checks passed
@kevin85421
Copy link
Member

Open an issue to track the progress: #1692

@davidxia davidxia deleted the validating-webhook branch November 27, 2023 23:00
@davidxia
Copy link
Contributor Author

thanks so much for merging, i'll take a look at that new issue when I get a chance.

I'll first make a doc PR to the User Guides. I'm guessing you'd like a new page and item in that list for it?

@kevin85421
Copy link
Member

I'll first make a doc PR to the User Guides. I'm guessing you'd like a new page and item in that list for it?

Yes. Thank you!

@kevin85421 kevin85421 mentioned this pull request Nov 28, 2023
4 tasks
davidxia added a commit to davidxia/ray that referenced this pull request Nov 30, 2023
follow up to ray-project/kuberay#1584

Signed-off-by: David Xia <dxia@spotify.com>
davidxia added a commit to davidxia/ray that referenced this pull request Dec 2, 2023
follow up to ray-project/kuberay#1584

Signed-off-by: David Xia <dxia@spotify.com>
davidxia added a commit to davidxia/ray that referenced this pull request Dec 2, 2023
follow up to ray-project/kuberay#1584

Signed-off-by: David Xia <dxia@spotify.com>
davidxia added a commit to davidxia/ray that referenced this pull request Dec 2, 2023
follow up to ray-project/kuberay#1584

Signed-off-by: David Xia <dxia@spotify.com>
@cermeng
Copy link

cermeng commented Apr 30, 2024

Good work! I have one question: it seems that helm chart installation is inconsistent with this pr and only kustomize installation for webhook is supported?

@davidxia
Copy link
Contributor Author

@cermeng yes we didn't add that yet. But I think PRs are welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants