
[Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server #1128

Merged
merged 5 commits into from
Jun 6, 2023

Conversation

kevin85421
Member

@kevin85421 kevin85421 commented May 31, 2023

Why are these changes needed?

In #1123, the autoscaler container fails to GET the RayCluster from the Kubernetes API Server and receives error code 401 after a zero-downtime upgrade. If you are interested in the root cause, you can read the "debug process" section.

We have two solutions for this issue:

  • Solution 1: Still create a ServiceAccount for users if the specified ServiceAccountName does not exist. The only difference is that we do not set the ControllerReference for the ServiceAccount to avoid it being deleted by the Kubernetes garbage collector. (Step 5 in the section "3. Root cause of this issue")

    • (+) More backward-compatible
    • (-) Users need to delete the ServiceAccount themselves, which may introduce "unknown unknowns" for users.
  • Solution 2 (this PR): If users specify ServiceAccountName, they need to create the ServiceAccount by themselves.

    • (-) Breaks backward compatibility for users who specify ServiceAccountName without creating a ServiceAccount manually.
    • (+) It avoids "unknown unknowns" and makes it easier to identify and fix any issues. Although it breaks backward compatibility, it provides clarity and transparency in troubleshooting.

Debug process (optional)

1. Understand ServiceAccount

In the head Pod, three files related to the ServiceAccount are mounted: ca.crt, namespace, and token. You can check this article for descriptions of these three files.

ls /var/run/secrets/kubernetes.io/serviceaccount/
# example output: ca.crt  namespace  token

[Experiment for ca.crt and token]:

You can execute the following commands in the head Pod of an autoscaling RayCluster. The ServiceAccount should have sufficient permissions to perform a GET operation on the RayCluster.

# Typically, the Kubernetes API Server is exposed via the `kubernetes` Service in the default namespace, so you can use:
export APISERVER=https://kubernetes.default:443
export TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME --header "Authorization: Bearer $TOKEN" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  • [Case 1] Only use ca.crt: The request falls back to the anonymous ServiceAccount "system:anonymous", which does not have permission to GET the RayCluster => 403 Forbidden
  • [Case 2] Only use token: Use the ServiceAccount that is associated with this Pod, which has permission to GET the RayCluster => Succeed
  • [Case 3] Use ca.crt and token at the same time => Succeed (same as Case 2)
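The three cases can be reproduced by varying the curl flags. A sketch, reusing the APISERVER, TOKEN, RAYCLUSTER_NAMESPACE, and RAYCLUSTER_CR_NAME variables from the snippet above; the commands must run inside the head Pod against a live cluster:

```shell
# Case 1: ca.crt only -- no Authorization header, so the request is treated
# as the anonymous user "system:anonymous" => 403 Forbidden
curl -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# Case 2: token only -- skip server certificate verification with -k
# instead of passing ca.crt => Succeed
curl -k -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME \
  --header "Authorization: Bearer $TOKEN"

# Case 3: both ca.crt and token => Succeed (same as Case 2)
curl -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME \
  --header "Authorization: Bearer $TOKEN" \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
```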

The result is the same as in this article. I am not sure about the purpose of ca.crt; it may only matter under certain Kubernetes API Server configurations.

2. 401 Unauthorized vs 403 Forbidden

Based on this Kubernetes link,

  • When we do not configure a token and the Kubernetes API Server enables anonymous access (it is enabled by default since Kubernetes 1.6+, and you can check whether it is disabled via this link) => the request uses the anonymous ServiceAccount "system:anonymous". If that ServiceAccount does not have enough permissions to perform the operation, the error code will be 403 Forbidden.

  • If we specify an invalid token in the request, we will get a 401 Unauthorized error. For example, we can get 401 by changing

    # Typically, the Kubernetes API Server is exposed via the `kubernetes` Service in the default namespace, so you can use:
    export APISERVER=https://kubernetes.default:443
    export TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
    curl -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME --header "Authorization: Bearer $TOKEN" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

    to

    # Typically, the Kubernetes API Server is exposed via the `kubernetes` Service in the default namespace, so you can use:
    export APISERVER=https://kubernetes.default:443
    export TOKEN=WRONG_TOKEN
    curl -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME --header "Authorization: Bearer $TOKEN" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

TL;DR

  • 401 Unauthorized: the token is invalid.
  • 403 Forbidden: the Kubernetes API Server can associate the request with a ServiceAccount, but that account does not have enough permission to perform the operation.
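Whether anonymous access is enabled can be checked on the API server itself. A sketch assuming a kubeadm-style control plane (in a kind cluster, first enter the control-plane container with `docker exec`):

```shell
# Anonymous auth is on unless --anonymous-auth=false is set on the API server.
grep -- '--anonymous-auth' /etc/kubernetes/manifests/kube-apiserver.yaml \
  || echo "anonymous-auth flag not set; defaults to true on Kubernetes 1.6+"
```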

3. Root cause of this issue

Because we received a 401 error instead of a 403 error, our focus should be on why the token is invalid rather than whether the ServiceAccount has sufficient permissions or not.

  • Step 1: When users specify ServiceAccountName (e.g., my-sa) for the head Pod without manually creating the ServiceAccount, KubeRay will create a ServiceAccount for them.
  • Step 2: Users trigger a new RayCluster preparation by updating the RayService CR.
  • Step 3: The new RayCluster will use the same my-sa ServiceAccount as the old RayCluster.
  • Step 4: When the Serve deployments on the new RayCluster are ready and healthy, traffic will be redirected to the new RayCluster. Then, the old RayCluster will be deleted.
  • Step 5: Because the old RayCluster and the my-sa ServiceAccount have an owner/dependent relationship, the my-sa ServiceAccount will be deleted when the old RayCluster is deleted.
  • Step 6: Although KubeRay will create a new ServiceAccount called my-sa (referred to as new my-sa below), the new RayCluster has already mounted the old my-sa ServiceAccount in Step 3.
  • Step 7: The autoscaler of the new RayCluster will attempt to use the token of the old my-sa ServiceAccount to make a GET request to the Kubernetes API Server for the RayCluster information. However, since the ServiceAccount has been deleted, the token is no longer valid. This is why we receive a 401 Unauthorized error.
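Steps 5–7 can be reproduced manually: deleting the ServiceAccount invalidates its token even though the token file remains mounted in the Pod. A sketch, reusing the variables from the earlier curl example:

```shell
# Inside the head Pod: read the mounted token while my-sa still exists.
export TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# From a machine with kubectl access: delete the ServiceAccount.
kubectl delete serviceaccount my-sa -n $RAYCLUSTER_NAMESPACE

# Back in the head Pod: the still-mounted token is now rejected,
# and the request fails with 401 Unauthorized.
curl -X GET $APISERVER/apis/ray.io/v1alpha1/namespaces/$RAYCLUSTER_NAMESPACE/rayclusters/$RAYCLUSTER_CR_NAME \
  --header "Authorization: Bearer $TOKEN" \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
```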

Related issue number

Closes #1123

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Step 0: Create a Kind cluster
kind create cluster --image=kindest/node:v1.23.0

# Step 1: Build docker image for this PR (path: ray-operator/) and load into the Kind cluster
make docker-image
kind load docker-image controller:latest

# Step 2: Install a KubeRay operator
# path: helm-chart/kuberay-operator
helm install kuberay-operator . --set image.repository=controller,image.tag=latest

# Step 3: Create a ServiceAccount `my-sa` with the following YAML
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-sa
  namespace: default
EOF

# Step 4: Create a RayService
# Use this gist: https://gist.github.com/kevin85421/b2f2af680aff2562c540cef1f138b4b9
kubectl apply -f ray-service.autoscaler-with-ServiceAccountName.yaml

# Step 5: Edit `spec.rayClusterConfig.rayVersion` from 2.4.0 to 2.100.0 to trigger a zero-downtime upgrade.
kubectl edit rayservices.ray.io rayservice-sample

# Step 6: Check the log of the new RayCluster's autoscaler to confirm that no 401 error occurs.

@kevin85421 kevin85421 changed the title WIP [WIP][Bug] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server May 31, 2023
@kevin85421 kevin85421 changed the title [WIP][Bug] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server [WIP][Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server May 31, 2023
@kevin85421 kevin85421 changed the title [WIP][Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server [Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server May 31, 2023
@kevin85421 kevin85421 marked this pull request as ready for review May 31, 2023 07:27
@kevin85421
Member Author

cc @anshulomar @msumitjain would you mind taking a look at this PR? You can check the section "3. Root cause of this issue" in the PR description for the root cause. Thanks!

@kevin85421
Member Author

cc @davidxia would you mind providing some feedback on this PR? Thank you!

@kevin85421
Member Author

cc @Yicheng-Lu-llll

@architkulkarni architkulkarni self-assigned this Jun 1, 2023
@Yicheng-Lu-llll
Contributor

Yicheng-Lu-llll commented Jun 2, 2023

LGTM! Thank you for the detailed explanation, I learned a lot!

Leaving some notes for reference:

  1. For KubeRay, the ServiceAccount and its associated Role/RoleBinding are used to enable autoscaling. The Autoscaler needs permission to ["get", "list", "watch", "patch"] the Pods to adjust the number of replicas based on the current workload.

  2. If the user provides the ServiceAccount, KubeRay will automatically associate a Role and RoleBinding with it in reconcileAutoscalerRole and reconcileAutoscalerRoleBinding.

  3. [Bug] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server #1123 says it gets a 401 error for worker Pods. This is because the same ServiceAccountName is set for the head and the workers. KubeRay does nothing to the workers' ServiceAccount.
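As a rough illustration of note 1, the Pod permissions the autoscaler needs look like the Role printed by the command below. This is only a sketch with a hypothetical Role name; KubeRay creates the real Role automatically, and its actual rules may cover additional resources such as RayClusters:

```shell
# Print (but do not create, thanks to --dry-run=client) a Role granting
# the verbs the autoscaler needs on Pods.
kubectl create role kuberay-autoscaler-example \
  --verb=get --verb=list --verb=watch --verb=patch \
  --resource=pods \
  --dry-run=client -o yaml
```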

@davidxia
Contributor

davidxia commented Jun 2, 2023

cc @davidxia would you mind providing some feedback on this PR? Thank you!

I can try to make some time. Do you have a deadline?

@kevin85421
Member Author

@Yicheng-Lu-llll

#1123 says it gets a 401 error for worker pods.

I am not sure why the 401 error occurs on the worker nodes. I guess this is a typo based on the following reasons:

  1. The error messages in [Bug] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server #1123 are from Autoscaler, and the autoscaler is a sidecar container of the head Pod.

  2. To my knowledge, only the Autoscaler communicates with the Kubernetes API Server (i.e., kubernetes.default, a special ClusterIP Service named kubernetes in the default namespace).

This is because they set the same ServiceAccountName for the head and workers. Kuberay does nothing to workers' ServiceAccount.

This explanation is not entirely accurate. The root cause of the issue is that the new RayCluster's head Pod is using the ServiceAccount from the old RayCluster's head Pod. When the old RayCluster is deleted, its ServiceAccount is also deleted because they have an owner/dependent relationship. As a result, the new RayCluster ends up using the token of the deleted ServiceAccount to communicate with the Kubernetes API Server, leading to the 401 error.
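The owner/dependent relationship can be confirmed by inspecting the ServiceAccount's ownerReferences, which is what the Kubernetes garbage collector acts on:

```shell
# If KubeRay created my-sa, its ownerReferences should point at the
# old RayCluster; when that owner is deleted, my-sa is garbage collected.
kubectl get serviceaccount my-sa -n default \
  -o jsonpath='{.metadata.ownerReferences}'
```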

@kevin85421
Member Author

kevin85421 commented Jun 2, 2023

I can try to make some time. Do you have a deadline?

@davidxia

Thanks! No rush. I can also ping some folks to take a look too. This issue is blocking a user, so it would be ideal to have it merged by next Monday.


@anshulomar anshulomar left a comment


@kevin85421 Thanks for taking this up! Your explanation is very thorough and meticulous.

if instance.Spec.HeadGroupSpec.Template.Spec.ServiceAccountName == namespacedName.Name {
	r.Log.Error(err, fmt.Sprintf(
		"If users specify ServiceAccountName for the head Pod, they need to create a ServiceAccount themselves."+
			"However, ServiceAccount %s is not found. Please create one.", namespacedName.Name), "ServiceAccount", namespacedName)


@kevin85421 just curious, where are we checking whether the serviceaccount exists?

Member Author


[Screenshot of the code: r.Get retrieving the ServiceAccount, followed by the IsNotFound check]

The function r.Get attempts to retrieve the ServiceAccount that matches namespacedName. Additionally, we check whether the returned error is an IsNotFound error.

Contributor


Nit, we should make the call-to-action ("Please create one") in the error message as foolproof as possible so that users with varying levels of expertise can unblock themselves. Is there a link to a doc we can include, or a very brief example of how to create a service account?

Member Author


Updated 5a9e3e2.

Contributor

@architkulkarni architkulkarni left a comment


Looks good to me!

@blublinsky
Contributor

blublinsky commented Jun 5, 2023

@kevin85421 as @Yicheng-Lu-llll explains, what is happening is that when a new cluster is created, KubeRay checks whether the ServiceAccount exists, which it does. So the ServiceAccount is not created and the existing one is used. When the old cluster is deleted, it deletes this ServiceAccount, which leads to a 401 error. Note that exactly the same error would occur if the Role and RoleBinding were deleted with the initial cluster.
In my opinion, solution 1 (removing the ControllerReference) is a much better option. Additionally, the controller reference in this case should also be removed from the Role and RoleBinding. This would guarantee no 401/403 errors.
The downside of this approach, as you correctly point out, is that the ServiceAccount, Role, and RoleBinding are not garbage collected. That is not very harmful, especially considering that practically the same ServiceAccount, Role, and RoleBinding are going to be reused over and over.
The huge advantage of this approach is that the operator is self-contained: it creates all of the resources that it potentially needs. In your solution, a user will need to create several things upfront (ServiceAccount, Role, and RoleBinding), which they will either forget to do or, even worse, do incorrectly. Additionally, the backward compatibility of this approach is a huge plus.
Just my 2 cents on this.

@blublinsky
Contributor

blublinsky commented Jun 5, 2023

@kevin85421 there is also an additional inconsistency in the current implementation. If a ServiceAccount is specified for the head Pod, it will be created only when autoscaling is enabled (created when the autoscaler is enabled, not created when it is disabled). So my suggestion is to make sure that a ServiceAccount defined in either the head group or any of the worker groups exists.

@kevin85421
Member Author

Hi @blublinsky, thank you for sharing your thoughts!

Note that exactly the same error will occur if Role and RoleBinding are deleted by the initial cluster.

It is true, but the names of the Role and RoleBinding are the same as the RayCluster custom resource (CR) name. Hence, when RayService triggers a rolling upgrade, the following will happen:

[Example]

Assume:
(1) Old RayCluster: rayservice-sample-raycluster-4p798
(2) New RayCluster: rayservice-sample-raycluster-ab123
(3) serviceAccountName: my-sa

# T=0: (Only old RayCluster):
ServiceAccount1 (`my-sa`) <-- RoleBinding1 (`rayservice-sample-raycluster-4p798`) --> Role1 (`rayservice-sample-raycluster-4p798`)

# T=1: (Old RayCluster + new RayCluster)
                
`my-sa` <-- RoleBinding1 --> Role1
     ^
     |
RoleBinding2 (`rayservice-sample-raycluster-ab123`)
     |
     v
   Role2 (`rayservice-sample-raycluster-ab123`)

# T=2: The old RayCluster will be deleted after Serve deployments on the new RayCluster are ready and healthy.
# That is, RoleBinding1 and Role1 will be deleted.

`my-sa`
     ^
     |
RoleBinding2 (`rayservice-sample-raycluster-ab123`)
     |
     v
   Role2 (`rayservice-sample-raycluster-ab123`)

The ServiceAccount will still have enough permissions after T=2. I also conducted an experiment to test whether a Role/RoleBinding can be created or deleted while the associated Pods are running. The experiment showed that, unlike a ServiceAccount, a Role and RoleBinding can be updated dynamically. Hence, in my opinion, garbage collection for the Role and RoleBinding is fine.
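The wiring at T=2 can be checked directly with kubectl; the resource names below are the hypothetical ones from the example above:

```shell
# Role1/RoleBinding1 (named after the old RayCluster) should be gone;
# only the new RayCluster's Role/RoleBinding should remain.
kubectl get role,rolebinding | grep rayservice-sample-raycluster

# RoleBinding2 should still bind `my-sa` to Role2.
kubectl describe rolebinding rayservice-sample-raycluster-ab123
```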

The downside of this approach, as you are correctly specifying is that ServiceAccount, Role, and RoleBinding are not garbage collected, which is not that harmful, especially considering that practically the same ServiceAccount, Role, and RoleBinding are going to be reused over and over.

[1] It would be beneficial if it could be reused by other custom resources managed by KubeRay. However, I am concerned about the possibility of other applications also attaching to this ServiceAccount. Typically, it is highly likely to happen because users often use simple names, such as my-sa, for their ServiceAccounts.

[2] The huge advantage of this approach is that an operator is self-contained.

Agree.

The bugs caused by [1] will be very difficult for maintainers to debug. Unlike an in-house product, KubeRay maintainers do not have access to users' Kubernetes clusters. In the case of an in-house product, I completely agree with your point because I have access to all Kubernetes logs, allowing me to focus more on user experience (e.g. self-contained [2]). However, for open-source projects, it is better to surface the issue as early as possible. It is not that painful for users to create the ServiceAccount by themselves.

@kevin85421
Member Author

Hi @blublinsky, I will merge this PR tonight because some users are blocked by it. If necessary, we can open another issue to discuss it further. Here is a possible solution for achieving a balance between user experience and "unknown unknowns":

(1) By default, we won't create a ServiceAccount for users if they specify a serviceAccountName.
(2) We can open a new PR with a feature gate (disabled by default) to create RBAC without controller references for users.

Is it OK for you?

@Yicheng-Lu-llll
Contributor

Nit: we can call out this breaking change in the documentation, which currently says:

The service account, role, and role binding needed by the autoscaler will be created by the operator out-of-box.

@kevin85421 kevin85421 mentioned this pull request May 18, 2023
2 tasks
@kevin85421
Member Author

nit, We can have this breaking change in the document:

The service account, role, and role binding needed by the autoscaler will be created by the operator out-of-box.

Added in #1071.

@kevin85421
Member Author

Hi @blublinsky, I will merge this PR. If necessary, we can discuss the points mentioned in #1128 (comment) in a separate issue.

@kevin85421 kevin85421 merged commit 72ca169 into ray-project:master Jun 6, 2023
19 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…Resources from K8s API server (ray-project#1128)

Unauthorized 401 error on fetching Ray Custom Resources from K8s API server