Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration #750

Merged
merged 21 commits into from
Dec 8, 2022

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Nov 21, 2022

Why are these changes needed?

Kubernetes defines three different Pod Security Standards, including privileged, baseline, and restricted, to broadly
cover the security spectrum. The privileged standard allows users to do known privilege escalations, and thus it is not
safe enough for security-critical applications.

This PR describes how to configure RayCluster YAML file to apply restricted Pod security standard. The following
references can help you understand this document better:

Related issue number

ray-project/ray#29665

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Screenshot for Step 5.2

Screen Shot 2022-11-21 at 4 33 51 PM

Screen Shot 2022-11-21 at 4 34 46 PM

Followup issues

  • Set default securityContext in KubeRay operator (X)
  • Add a test in CI to make sure KubeRay works as expected with the restricted security standard.
  • Configure securityContext properly on Pods created by Autoscaler. (Done: exposed related interfaces)
  • Configure securityContext properly on Pods specified by RayCluster YAML file. (X)

@juliusvonkohout
Copy link

juliusvonkohout commented Nov 21, 2022

This documentation looks very good. Two things from are missing so far.

  1. @DmitriGekhtman said that that there is a dynamic auto scaler container in some environments that is attached to the head pod and must be checked too

  2. Not only the cluster but also the operator should be installed in the pod-security namespace and have a proper security context

@juliusvonkohout
Copy link

We also checked that kuberay properly has its own serviceaccount https://github.com/ray-project/kuberay/tree/master/ray-operator/config/rbac

@kevin85421 kevin85421 changed the title WIP [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration Nov 22, 2022
@kevin85421 kevin85421 changed the title [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration (DO NOT MERGE)[Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration Nov 22, 2022
@kevin85421 kevin85421 changed the title (DO NOT MERGE)[Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration Nov 22, 2022
@kevin85421
Copy link
Member Author

cc @DmitriGekhtman @juliusvonkohout I updated the PR. Can you help me review this PR? Thank you!

@kevin85421 kevin85421 marked this pull request as ready for review November 22, 2022 00:48
@DmitriGekhtman
Copy link
Collaborator

I've exposed the (optional) autoscaler sidecar's security context configuration in this PR:
#752

In a separate PR, we should also tweak the Ray cluster Helm chart to expose Ray containers' securityContext fields.

Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.

docs/guidance/pod-security.md Outdated Show resolved Hide resolved
docs/guidance/pod-security.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@Jeffwan Jeffwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense and looks good to me

@juliusvonkohout
Copy link

Looks good so far. I will try to test it tomorrow. One thing you might should test is installing additional python dependencies in the head and workers during runtime. Just to be sure that the file system permissions are all right and that you cann really roll this out to all users.

@DmitriGekhtman
Copy link
Collaborator

One thing you might should test is installing additional python dependencies in the head and workers during runtime.

@juliusvonkohout could you clarify how the dependencies should be installed and what behavior you would expect?

@juliusvonkohout
Copy link

One thing you might should test is installing additional python dependencies in the head and workers during runtime.

@juliusvonkohout could you clarify how the dependencies should be installed and what behavior you would expect?

I would just like to see pip install tested as described here https://docs.ray.io/en/latest/ray-core/handling-dependencies.html

@DmitriGekhtman
Copy link
Collaborator

Got it, we can verify that a runtime env with a pip dependency works!

@juliusvonkohout
Copy link

juliusvonkohout commented Nov 28, 2022

I just tried

Step 4: Install the KubeRay operator

# Path: helm-chart/kuberay-operator
helm install -n pod-security kuberay-operator .

on an openshift cluster. We might need to remove the seccomp stuff or we will get an error.

Error creating: pods "kuberay-operator-5d7d9fc944-h7fcr" is forbidden: unable to validate against any security context constraint: [pod.metadata.annotations.container.seccomp.security.alpha.kubernetes.io/kuberay-operator: Forbidden: seccomp may not be set
...

So i removed

            seccompProfile:
              type: RuntimeDefault

everywhere

Sadly you are using dockerhub so i will have to wait a day

"Failed to pull image "kuberay/operator:nightly": rpc error: code = Unknown desc = reading manifest nightly in docker.io/kuberay/operator: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"

With a mirror at mtr.devops.telekom.de/juliusvonkohout/kuberay:nightly i at least got the kuberay operator nightly image. I definitely do not recommend to use dockerhub. There is quay.io and other alternatives which do not have those problems. But that is not relevant for this PR.

@DmitriGekhtman
Copy link
Collaborator

I definitely do not recommend to use dockerhub. There is quay.io and other alternatives which do not have those problems. But that is not relevant for this PR.

Thanks for the suggestion @juliusvonkohout! Would you mind opening an issue discussing that?

@kevin85421
Copy link
Member Author

Thank @juliusvonkohout for your review!

on an openshift cluster. We might need to remove the seccomp stuff or we will get an error.

Error creating: pods "kuberay-operator-5d7d9fc944-h7fcr" is forbidden: unable to validate against any security context constraint: [pod.metadata.annotations.container.seccomp.security.alpha.kubernetes.io/kuberay-operator: Forbidden: seccomp may not be set
...

So I removed seccompProfile everywhere

  1. By the restricted Pod security standard, spec.securityContext.seccompProfile is necessary. Without seccompProfile, the following error message will be reported.

    seccompProfile (pod or container \"ray-head\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")
    
  2. I do not have any experience related to OpenShift. In my understanding, OpenShift plays a resource orchestrator role which is very similar to Kubernetes. What's the relationship between OpenShift and KubeFlow support? Thank you!

@juliusvonkohout
Copy link

juliusvonkohout commented Nov 28, 2022

Alright then this will have to be fixed on the openshift side. They will also switch to PSS soon. Our main target is a normal upstream kubernetes such as kind. I have updated my post. You can ignore the seccomp part then.

and on openshift the image is executed with a random non-root uid. There are filesystem permission errors

$ python3 samples/xgboost_example.py
2022-11-28 14:16:26,296 INFO worker.py:1224 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2022-11-28 14:16:26,296 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 10.131.0.155:6379...
2022-11-28 14:16:26,303 INFO worker.py:1515 -- Connected to Ray cluster. View the dashboard at http://10.131.0.155:8265 
Traceback (most recent call last):
  File "samples/xgboost_example.py", line 28, in <module>
    result = trainer.fit()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/base_trainer.py", line 343, in fit
    tuner = Tuner(trainable=trainable, run_config=self.run_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 144, in __init__
    self._local_tuner = TunerInternal(**kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 104, in __init__
    self._run_config
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 261, in _setup_create_experiment_checkpoint_dir
    os.makedirs(path)
  File "/home/ray/anaconda3/lib/python3.7/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/ray/anaconda3/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/ray/ray_results'
$ id
uid=1001190000(1001190000) gid=0(root) groups=0(root),1001190000

you need to chmod 777 some directories.

so the following has to be done

  1. We will ignore the seccomp stuff (only applicalbe to openshift) and wiat until it switches to PSS too https://connect.redhat.com/en/blog/important-openshift-changes-pod-security-standards
  2. fix the filesystem permissions for docker.io/rayproject/ray-ml:2.0.0 and make sure that your image really runs with any uid. The other option is to add runAsUser: 1000. So either make sure to specify the right user or make sure to be able to run under any userid.
  3. Please provide a workload with pip installations and virtualenvs. I was able to just to pip install seaborn when i was running as user 1000 in the head and worker pod, so it should be trivial.

here is the approach without fixing the filesystem properly and doing the runasuser workaround instead.
ray-cluster.pod-security.txt
values.txt

I strongly suggest to properly fix the filesystem instead of encorcing a particular user. These commands will be helpfull for you

RUN mkdir -p ${APPLICATION_ROOT} && chown -R 1000:0 ${APPLICATION_ROOT} && chmod -R 0777 ${APPLICATION_ROOT}
COPY --chown=1000:0 --chmod=777 app-root ${APPLICATION_ROOT}

THis also applies for whatever you do in the busybox initcontainer. If possible just use /tmp and you wont have any trouble with permissions.

@kevin85421
Copy link
Member Author

@juliusvonkohout Do you have the time to chat the details on Zoom? If so, we can communicate via Kubeflow slack workspace. Thank you!

@juliusvonkohout
Copy link

And this will fix the seccomp issue https://kubernetes.io/blog/2021/08/25/seccomp-default/ for everyone in the long term. Nothing we have to worry about now.

kevin85421 and others added 16 commits December 7, 2022 17:34
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Dmitri Gekhtman <62982571+DmitriGekhtman@users.noreply.github.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
@kevin85421
Copy link
Member Author

Resolved the conflict and opened an issue for UID in ray-project/ray#30959.

@DmitriGekhtman
Copy link
Collaborator

Okidoke, looks good to merge! Thanks.

@DmitriGekhtman DmitriGekhtman merged commit fd27b75 into ray-project:master Dec 8, 2022
kevin85421 pushed a commit that referenced this pull request Jan 18, 2023
This PR is based on #750 and #769. It adds a unit test for the document for the Pod security standard.

What does this unit test do:

* Create namespace pod-security with restricted Pod security policy
* Install the operator in the default namespace(another option is install operator in namespace pod-security)
* Apply the restricted Pod security standard to all Pods in the namespace pod-security.
* test if can create the RayCluster in the namespace pod-security.
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
… and Kubeflow integration (ray-project#750)

Kubernetes defines three different Pod Security Standards, including privileged, baseline, and restricted, to broadly
cover the security spectrum. The privileged standard allows users to do known privilege escalations, and thus it is not
safe enough for security-critical applications.

This PR describes how to configure RayCluster YAML file to apply restricted Pod security standard.

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ject#866)

This PR is based on ray-project#750 and ray-project#769. It adds a unit test for the document for the Pod security standard.

What does this unit test do:

* Create namespace pod-security with restricted Pod security policy
* Install the operator in the default namespace(another option is install operator in namespace pod-security)
* Apply the restricted Pod security standard to all Pods in the namespace pod-security.
* test if can create the RayCluster in the namespace pod-security.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kubeflow-integration P1 Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants