
Update securityContext values.yaml for kuberay-operator to safe defaults. #1896

Merged 3 commits into ray-project:master on Feb 21, 2024

Conversation

@vinayakankugoyal (Contributor) commented Feb 1, 2024

Why are these changes needed?

I noticed that the securityContext of the kuberay-operator is empty by default, requiring users to set secure values manually. We should try to be secure by default where possible.
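In values.yaml terms, the hardened defaults look like this (a sketch reconstructed from the securityContext verified later in this thread; the exact chart formatting may differ):

securityContext:
  allowPrivilegeEscalation: false  # block setuid binaries and other privilege gains
  capabilities:
    drop:
    - ALL                          # drop every Linux capability
  readOnlyRootFilesystem: true     # make the container filesystem immutable
  runAsNonRoot: true               # refuse to start the container as root
  seccompProfile:
    type: RuntimeDefault           # apply the runtime's default seccomp filter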

Related issue number

N/A

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Update securityContext values.yaml for kuberay-operator to safe defaults.

I noticed that the securityContext of kube-ray operator by default is empty and requires it to be set to secure defaults manually. We should try to be secure by default where possible.

Signed-off-by: Vinayak Goyal <vinayakankugoyal@gmail.com>
@kevin85421 self-assigned this Feb 2, 2024
@kevin85421 self-requested a review February 2, 2024 01:22
@kevin85421 (Member) left a comment

Thank you for the contribution! I am not familiar with securityContext. How did you manually test it? In my experience, under the restricted Pod security standard, the spec.securityContext.seccompProfile setting is necessary. See #750 (comment) for more details.

In addition, different Kubernetes distributions (e.g. OpenShift) seem to have different PSS. For example, OpenShift appears to report an error if we set the seccompProfile (#750 (comment)).

In those cases, adding securityContext might cause KubeRay to fail to launch for some users if different Kubernetes distributions have varying PSS. Security is very important, but the primary goal of the default Helm chart values is to enable users to try KubeRay with ease. My suggestion is to leave the securityContext empty and include comments with suggested secure configurations. WDYT?
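For illustration, that commented-out approach could look something like this in values.yaml (a hypothetical sketch, not what was ultimately merged):

securityContext: {}
# Suggested secure configuration; uncomment (and remove the empty value above)
# if your distribution's Pod Security Standards allow it. OpenShift, for
# example, may reject the seccompProfile setting:
# securityContext:
#   allowPrivilegeEscalation: false
#   capabilities:
#     drop:
#     - ALL
#   runAsNonRoot: true
#   seccompProfile:
#     type: RuntimeDefault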

@vinayakankugoyal (Contributor, Author) commented Feb 2, 2024

Ah, I meant to add seccompProfile RuntimeDefault as well; thanks for reminding me!

PSS is defined by Kubernetes (https://kubernetes.io/docs/concepts/security/pod-security-standards/), so it's alarming that they have something different.

Setting securityContext.seccompProfile to RuntimeDefault is a security best practice, and we should start with that so people are secure by default. We could add a comment saying that some of these settings might need to be removed to support certain cloud providers' environments. In terms of risk, it's an operator and does not need any permissions on the host AFAICT, but please correct me if I am wrong.
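Concretely, the fragment in question is just this in values.yaml (per the follow-up commit "Set seccompProfile.type to RuntimeDefault."):

securityContext:
  seccompProfile:
    # Run under the container runtime's default seccomp filter instead of
    # unconfined; the restricted PSS level requires RuntimeDefault or Localhost.
    type: RuntimeDefault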

In terms of how I tested, I deployed the ray-operator daemonset to a kind cluster and a GKE cluster and checked that it came up. However, I didn't really schedule a job or anything like that. I am happy to test more scenarios if you'd like.

Signed-off-by: Vinayak Goyal <vinayakankugoyal@gmail.com>
@kevin85421 (Member) left a comment

> In terms of how I tested, I deployed the ray-operator daemonset to a kind cluster and a GKE cluster and checked that it came up.

Would you mind providing more details about this? Which security level did you test, restricted?

@vinayakankugoyal (Contributor, Author) commented Feb 2, 2024

Here is the setup I used for kind:

  1. Create a kind cluster:
kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋

  2. Set enforce restricted in all namespaces:
kubectl label --overwrite ns --all \
  pod-security.kubernetes.io/enforce=restricted
namespace/default labeled
namespace/kube-node-lease labeled
namespace/kube-public labeled
Warning: existing pods in namespace "kube-system" violate the new PodSecurity enforce level "restricted:latest"
Warning: coredns-5d78c9869d-4d5vn (and 1 other pod): unrestricted capabilities, runAsNonRoot != true, seccompProfile
Warning: etcd-kind-control-plane (and 3 other pods): host namespaces, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true
Warning: kindnet-k7sll: host namespaces, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile
Warning: kube-proxy-hhqk2: host namespaces, privileged, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile
namespace/kube-system labeled
Warning: existing pods in namespace "local-path-storage" violate the new PodSecurity enforce level "restricted:latest"
Warning: local-path-provisioner-6bc4bddd6b-ks6n8: allowPrivilegeEscalation != false, unrestricted capabilities, runAsNonRoot != true, seccompProfile
namespace/local-path-storage labeled

  3. Try to run a bad pod in the default namespace:
kubectl run my-shell --image nginx
Error from server (Forbidden): pods "my-shell" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "my-shell" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "my-shell" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "my-shell" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "my-shell" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Note: see how the error requires securityContext.seccompProfile.type to be set to RuntimeDefault? (A minimal pod spec that passes all of these checks is sketched after this list.)

  4. Switch to my branch:
git checkout patch-1
  5. Check what changes are there:
git --no-pager log -n 3
commit bdae08c1302f0403226729abe9c2b099f92ce9bc (HEAD -> patch-1, origin/patch-1)
Author: Vinayak Goyal <vinayakankugoyal@gmail.com>
Date:   Fri Feb 2 09:09:35 2024 -0800

    Set seccompProfile.type to RuntimeDefault.
    
    Signed-off-by: Vinayak Goyal <vinayakankugoyal@gmail.com>

commit c6aa69f8281585d65b6860e92860daa04fdb9bad
Author: Vinayak Goyal <vinayakankugoyal@gmail.com>
Date:   Thu Feb 1 15:20:54 2024 -0800

    Update securityContext values.yaml for kuberay-operator to safe defaults.
    
    I noticed that the securityContext of kube-ray operator by default  is empty and requires it to be set to secure defaults manually. We should try to be secure by default where possible.
    
    Signed-off-by: Vinayak Goyal <vinayakankugoyal@gmail.com>

commit f53b42a77aa748995661a89df4a2ecb1e78df55a (origin/master, origin/HEAD, master)
Author: Andrew Sy Kim <andrewsy@google.com>
Date:   Thu Feb 1 13:41:14 2024 -0500

    [RayJob] Add additional print columns for RayJob (#1895)

  6. Install kube-ray operator:
helm install kuberay-operator -n default ./helm-chart/kuberay-operator
NAME: kuberay-operator
LAST DEPLOYED: Fri Feb  2 20:43:46 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
  7. Check the status of the pod:
kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
kuberay-operator-8fd7c9499-th984   1/1     Running   0          13m
  8. Check the logs:
kubectl logs -l app.kubernetes.io/component=kuberay-operator -n default
2024-02-02T20:43:54.507Z	INFO	Starting EventSource	{"controller": "rayjob", "controllerGroup": "ray.io", "controllerKind": "RayJob", "source": "kind source: *v1.Job"}
2024-02-02T20:43:54.507Z	INFO	Starting Controller	{"controller": "rayjob", "controllerGroup": "ray.io", "controllerKind": "RayJob"}
2024-02-02T20:43:54.507Z	INFO	Starting EventSource	{"controller": "rayservice", "controllerGroup": "ray.io", "controllerKind": "RayService", "source": "kind source: *v1.RayService"}
2024-02-02T20:43:54.507Z	INFO	Starting EventSource	{"controller": "rayservice", "controllerGroup": "ray.io", "controllerKind": "RayService", "source": "kind source: *v1.RayCluster"}
2024-02-02T20:43:54.507Z	INFO	Starting EventSource	{"controller": "rayservice", "controllerGroup": "ray.io", "controllerKind": "RayService", "source": "kind source: *v1.Service"}
2024-02-02T20:43:54.507Z	INFO	Starting EventSource	{"controller": "rayservice", "controllerGroup": "ray.io", "controllerKind": "RayService", "source": "kind source: *v1.Ingress"}
2024-02-02T20:43:54.507Z	INFO	Starting Controller	{"controller": "rayservice", "controllerGroup": "ray.io", "controllerKind": "RayService"}
2024-02-02T20:43:54.816Z	INFO	Starting workers	{"controller": "rayjob", "controllerGroup": "ray.io", "controllerKind": "RayJob", "worker count": 1}
2024-02-02T20:43:54.816Z	INFO	Starting workers	{"controller": "rayservice", "controllerGroup": "ray.io", "controllerKind": "RayService", "worker count": 1}
2024-02-02T20:43:54.816Z	INFO	Starting workers	{"controller": "raycluster-controller", "controllerGroup": "ray.io", "controllerKind": "RayCluster", "worker count": 1}
  9. Check that my securityContext changes apply:
kubectl get pods -o yaml -o jsonpath={".items[0].spec.containers[0].securityContext"}
{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"readOnlyRootFilesystem":true,"runAsNonRoot":true,"seccompProfile":{"type":"RuntimeDefault"}}

@kevin85421 (Member)

Thank you for the detailed explanation #1896 (comment)! Would you mind checking the CI failures? You can check this doc for more details about how to run "end-to-end tests" or "configuration tests" locally.

@vinayakankugoyal (Contributor, Author)

@kevin85421 - that issue seems related to docker-library/golang#467. Any way for us to bump the version with the fix available?

@kevin85421 (Member)

@vinayakankugoyal Sorry for the late response. I am currently at our company's offsite event.

> Any way for us to bump the version with the fix available?

I am not sure which version you are actually referring to (Golang, Docker, or the image?), but you can try it.

@andrewsykim (Contributor)

/cc

@vinayakankugoyal (Contributor, Author)

@kevin85421 - the Docker version that is running in the CI. As explained in this comment, we would need to bump the Docker versions now because of the Golang version we are using to build this image. (Maybe we should clarify that in the docs as well.)

@kevin85421 (Member)

@vinayakankugoyal would you mind opening an issue to track the progress of adbc593? Thanks!

@kevin85421 merged commit 2e35bff into ray-project:master Feb 21, 2024
23 checks passed
@vinayakankugoyal (Contributor, Author)

> @vinayakankugoyal would you mind opening an issue to track the progress of adbc593? Thanks!

Thanks for the approval! Done! #1935

bjornsen added a commit to GoogleCloudPlatform/ai-on-gke that referenced this pull request Mar 20, 2024
bjornsen added a commit to GoogleCloudPlatform/ai-on-gke that referenced this pull request Mar 21, 2024