Add serving service for users traffic with health check #367
Conversation
DefaultRedisPortName:                DefaultRedisPort,
DefaultDashboardName:                DefaultDashboardPort,
DefaultMetricsName:                  DefaultMetricsPort,
DefaultDashboardAgentListenPortName: DefaultDashboardAgentListenPort,
Remove DefaultDashboardAgentListenPortName, since it is controlled by an annotation.
Sorry, I'm not very familiar with k8s. Which component will send the GET request? Will this component become a bottleneck if we have a lot of nodes?
The scheme proposed in this PR is for the KubeRay operator to send the GET. I'm not familiar with the implementation of the endpoint -- is it a cheap ping? Say we have a total of 2000 Ray pods managed by the operator. Any concerns at that scale? Under "normal" circumstances, maybe this health check would be implemented as a readiness probe, but I guess doing that would conflict with other parts of the HA design.
Also, how often do we expect to trigger these GETs? Once per reconciler iteration, which at steady state would be whatever the controller's requeue duration is -- I forget the duration. My vague intuition tells me there isn't a scalability concern here, but could you provide some numerical estimates to justify that?
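For concreteness, here is a minimal sketch of what the per-pod check being discussed could look like on the operator side. The endpoint path and port are assumptions for illustration, not necessarily the ones this PR uses; the 2-second timeout matches the value mentioned later in this thread.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// isServeProxyHealthy sends a single GET with a bounded timeout so one slow
// pod cannot stall the reconcile loop indefinitely.
func isServeProxyHealthy(podIP string) bool {
	client := http.Client{Timeout: 2 * time.Second} // the timeout used in this PR
	// The port and "/-/healthz" path below are assumptions for illustration;
	// substitute the actual HTTP proxy health route.
	resp, err := client.Get(fmt.Sprintf("http://%s:8000/-/healthz", podIP))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```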
We'll need to figure out how to test this stuff soon.
@DmitriGekhtman The reason is that the scheduling of the HttpProxy is handled by Ray, not k8s. The flow is like: k8s schedules a raylet ---> serve starts the http proxy actor ---> ray schedules the http proxy actor. The serve HTTP endpoint is the http proxy. So what we want to do here is add the nodes which have an http proxy to some svc. This is a very dynamic thing. The check is cheap, but we can't guarantee there is no bug. Either we add a lot of testing code to ensure this, or we make the duration of the request unimportant for the serve operator's availability (maybe if it takes >1s and fails 5 times, we consider it not ready?).
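A hedged sketch of the kind of Service this comment describes: traffic is routed only to pods carrying a health label that the operator toggles, so membership changes dynamically as checks pass or fail. The label keys/values, Service name, and port here are illustrative assumptions, not necessarily the names this PR introduces.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildServeService returns a Service whose selector matches only pods the
// operator has labeled as healthy.
func buildServeService(namespace, clusterName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-serve-svc",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{
				"ray.io/cluster": clusterName, // assumed cluster label
				"ray.io/serve":   "true",      // assumed health label toggled by the operator
			},
			Ports: []corev1.ServicePort{{
				Name:       "serve",
				Port:       8000,
				TargetPort: intstr.FromInt(8000),
			}},
		},
	}
}
```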
@iycheng Is the concern that the request might block for a long time? The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster, in the worst case that could be bad.
Good point. I think the timeout had better be small, like 20ms? The better design would be to use goroutines to run the checks in parallel. I will add unit tests in this PR soon.
Yeah, either a shorter timeout or spawning goroutines to do the health checks. The first option is definitely simpler :)
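A sketch of the goroutine alternative mentioned above: all pods are checked concurrently, so the wall-clock cost per reconcile is roughly one timeout rather than timeout × number of pods. `isServeProxyHealthy` is the hypothetical helper from the earlier sketch.

```go
package main

import "sync"

// checkAllPods fans the health checks out across goroutines and collects the
// results into a map keyed by pod IP.
func checkAllPods(podIPs []string) map[string]bool {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		healthy = make(map[string]bool, len(podIPs))
	)
	for _, ip := range podIPs {
		wg.Add(1)
		go func(ip string) {
			defer wg.Done()
			ok := isServeProxyHealthy(ip) // each call is bounded by its own timeout
			mu.Lock()
			healthy[ip] = ok
			mu.Unlock()
		}(ip)
	}
	wg.Wait()
	return healthy
}
```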
nit/personal opinion: |
lgtm
Notes
- looks like there's a failing test at the moment
- should probably aim to be consistent on one of "serve" or "serving"
Why are these changes needed?
This PR tries to make the HTTP proxy HA; namely, it adds a traffic-serving service with HA.
Ray provides an HTTP proxy health check.
We need to list all the pods and run the health check to decide whether each pod can serve user traffic.
Then we update the label on each pod accordingly.
The serving service will dynamically pick the healthy pods with a label selector.
Manual testing has been done.
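As a hedged illustration of the list-check-label loop described above, the sketch below flips an assumed "ray.io/serve" label based on the health-check result, so the Service selector picks up only healthy pods. The label key and the `isServeProxyHealthy` helper are hypothetical, carried over from the sketches earlier in the thread; this is not necessarily how the PR implements it.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileServeLabels runs the health check for each pod and updates the
// serve label when it disagrees with the check result.
func reconcileServeLabels(ctx context.Context, c client.Client, pods []corev1.Pod) error {
	for i := range pods {
		pod := &pods[i]
		want := "false"
		if isServeProxyHealthy(pod.Status.PodIP) {
			want = "true"
		}
		if pod.Labels["ray.io/serve"] == want {
			continue // label already up to date; avoids needless API writes
		}
		if pod.Labels == nil {
			pod.Labels = map[string]string{}
		}
		pod.Labels["ray.io/serve"] = want
		if err := c.Update(ctx, pod); err != nil {
			return err
		}
	}
	return nil
}
```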
Related issue number
Checks