Add serving service for users traffic with health check #367

Merged: 10 commits into master from brucez/httpProxyService on Jul 12, 2022

Conversation

brucez-anyscale (Contributor) commented Jul 9, 2022

Why are these changes needed?

This PR makes the HTTP proxy highly available (HA); that is, it adds a traffic-serving service with HA.
Ray provides an HTTP proxy health check.
We need to list all the pods and run the health check against each one to decide whether a pod can serve user traffic.
Then we update a label on the pod accordingly.
The serving service dynamically picks the healthy pods via a label selector.

I have done manual testing (screenshot attached: Screen Shot 2022-07-08 at 9 43 10 PM).
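A minimal sketch of the health-check-and-relabel loop described above, written against client-go. The label keys (`ray.io/cluster`, `ray.io/serve`), the health endpoint path (`/-/healthz`), and the proxy port (8000) are illustrative assumptions and may not match what this PR actually uses:

```go
// Sketch only (not the PR's actual code): list the cluster's Ray pods, probe
// each proxy's health endpoint, and patch a label so a label-selector Service
// only routes user traffic to healthy pods.
package servehealth

import (
	"context"
	"fmt"
	"net/http"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	clusterLabelKey  = "ray.io/cluster" // assumed cluster label key
	serveLabelKey    = "ray.io/serve"   // hypothetical "can serve traffic" label
	healthCheckPath  = "/-/healthz"     // assumed Serve proxy health endpoint
	serveProxyPort   = 8000             // assumed Serve HTTP proxy port
	healthCheckLimit = 2 * time.Second  // timeout discussed in this PR
)

// reconcileServeLabels labels each Ray pod according to its proxy health.
func reconcileServeLabels(ctx context.Context, cs kubernetes.Interface, namespace, clusterName string) error {
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: clusterLabelKey + "=" + clusterName,
	})
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: healthCheckLimit}
	for _, pod := range pods.Items {
		healthy := proxyIsHealthy(client, pod)
		patch := fmt.Sprintf(`{"metadata":{"labels":{"%s":"%t"}}}`, serveLabelKey, healthy)
		if _, err := cs.CoreV1().Pods(namespace).Patch(ctx, pod.Name,
			types.MergePatchType, []byte(patch), metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// proxyIsHealthy sends one GET to the pod's proxy health endpoint.
func proxyIsHealthy(client *http.Client, pod corev1.Pod) bool {
	url := fmt.Sprintf("http://%s:%d%s", pod.Status.PodIP, serveProxyPort, healthCheckPath)
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```

The serving service itself would then simply select on the serve label, so its endpoints update automatically as pods are relabeled.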

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

DefaultRedisPortName: DefaultRedisPort,
DefaultDashboardName: DefaultDashboardPort,
DefaultMetricsName: DefaultMetricsPort,
DefaultDashboardAgentListenPortName: DefaultDashboardAgentListenPort,
brucez-anyscale (Contributor, Author) commented on this code:

Remove DefaultDashboardAgentListenPortName since it is controlled by an annotation.

fishbone (Contributor) commented Jul 9, 2022

Sorry, I'm not very familiar with k8s. Which component will send the GET request? Will this component become a bottleneck if we have a lot of nodes?

DmitriGekhtman (Collaborator) commented Jul 9, 2022

> Which component will send the GET request? Will this component become a bottleneck if we have a lot of nodes?

The scheme proposed in this PR is for the KubeRay operator to send the GET. I'm not familiar with the implementation of the endpoint -- is it a cheap ping? Say we have a total of 2000 Ray pods managed by the operator. Any concerns at that scale?

Under "normal" circumstances, maybe this health check would be implemented as a readiness probe, but I guess doing that would conflict with other parts of the HA design.
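For reference, the readiness-probe alternative mentioned above would look roughly like the following when building the pod spec in Go (sketch only; the path, port, and thresholds are assumptions, and the PR does not take this route):

```go
// Sketch of the readiness-probe alternative (not used by this PR). Assumes a
// recent k8s.io/api where the embedded probe field is named ProbeHandler.
package serveprobe

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// serveReadinessProbe probes an assumed health endpoint on the Serve proxy.
func serveReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/-/healthz",         // assumed health endpoint
				Port: intstr.FromInt(8000), // assumed Serve HTTP proxy port
			},
		},
		PeriodSeconds:    5, // illustrative values
		FailureThreshold: 3,
	}
}
```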

DmitriGekhtman (Collaborator):

> is it a cheap ping? Say we have a total of 2000 Ray pods managed by the operator. Any concerns at that scale?

Also, how often do we expect to trigger these GETs? Once per reconciler iteration, which at steady state would be whatever the controller's requeue duration is -- I forget the duration.

My vague intuition tells me there isn't a scalability concern here, but could you provide some numerical estimates to justify that?

DmitriGekhtman (Collaborator):

We'll need to figure out how to test this stuff soon.

fishbone (Contributor) commented Jul 9, 2022

> Under "normal" circumstances, maybe this health check would be implemented as a readiness probe, but I guess doing that would conflict with other parts of the HA design.

@DmitriGekhtman The reason is that scheduling of the HttpProxy is handled by Ray, not k8s. The flow is:

k8s schedules a raylet ---> Serve starts the HTTP proxy actor ---> Ray schedules the HTTP proxy actor

The Serve HTTP endpoint is the HTTP proxy. So what we want to do here is add the nodes that have an HTTP proxy to some svc, which is a very dynamic thing.

The check is cheap, but we can't guarantee there are no bugs. Either we add a lot of testing code to ensure this, or we make the duration of the request unimportant for the serve operator's availability (maybe: if it takes >1s and fails 5 times in a row, we consider it not ready?).
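A sketch of that "N consecutive failures" idea, using the illustrative numbers floated above (1s timeout, 5 failures); this is not what the PR implements:

```go
// Sketch of a "not ready only after N consecutive failures" health probe, so
// a single slow or flaky request does not immediately flip a pod's status.
package healthtrack

import (
	"net/http"
	"time"
)

const (
	failureThreshold = 5               // consecutive failures before "not ready"
	requestTimeout   = 1 * time.Second // treat a request slower than 1s as failed
)

// healthTracker remembers consecutive failure counts per endpoint URL.
type healthTracker struct {
	failures map[string]int
	client   *http.Client
}

func newHealthTracker() *healthTracker {
	return &healthTracker{
		failures: map[string]int{},
		client:   &http.Client{Timeout: requestTimeout},
	}
}

// probe returns false only once the endpoint has failed failureThreshold
// times in a row; any success resets the counter.
func (t *healthTracker) probe(url string) bool {
	resp, err := t.client.Get(url)
	if err != nil || resp.StatusCode != http.StatusOK {
		if resp != nil {
			resp.Body.Close()
		}
		t.failures[url]++
		return t.failures[url] < failureThreshold
	}
	resp.Body.Close()
	t.failures[url] = 0
	return true
}
```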

DmitriGekhtman (Collaborator) commented Jul 9, 2022

@iycheng Is the concern that the request might block for a long time?

The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster and in the worst case, that could be bad.

brucez-anyscale (Contributor, Author):

> @iycheng Is the concern that the request might block for a long time?
>
> The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster and in the worst case, that could be bad.

Good point. I think the timeout had better be small, like 20ms?
The operator reconcile loop can have parallelism, which should help with scalability.

A better design would be push instead of pull, but I don't think it is easy for Ray Core to push health status to the operator. So for now, we use the pull model.

I will add unit tests to this PR soon.

DmitriGekhtman (Collaborator):

> > @iycheng Is the concern that the request might block for a long time?
> >
> > The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster and in the worst case, that could be bad.
>
> Good point. I think the timeout had better be small, like 20ms?
>
> The operator reconcile loop can have parallelism, which should help with scalability.
>
> A better design would be push instead of pull, but I don't think it is easy for Ray Core to push health status to the operator. So for now, we use the pull model.
>
> I will add unit tests to this PR soon.

Yeah, either a shorter timeout or spawning goroutines to do the health checks. The first option is definitely simpler :)
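A minimal sketch of the goroutine option: probe all pods concurrently so a reconcile pass is bounded by the slowest single request rather than the sum of all of them. `checkOne` is a placeholder for whatever per-pod check the operator performs.

```go
// Sketch: run all health checks concurrently and collect the results.
package parallelcheck

import "sync"

// checkAllConcurrently probes every URL in its own goroutine and returns a
// map from URL to health result. checkOne stands in for the actual per-pod
// check (e.g. an HTTP GET with a short timeout).
func checkAllConcurrently(urls []string, checkOne func(string) bool) map[string]bool {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]bool, len(urls))
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			ok := checkOne(u)
			mu.Lock()
			results[u] = ok
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}
```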

DmitriGekhtman (Collaborator) commented Jul 11, 2022

nit/personal opinion:
In variable names, change Serving to Serve to be consistent with Ray Serve branding :)

DmitriGekhtman (Collaborator) left a comment

lgtm

Notes

  • looks like there's a failing test at the moment
  • should probably aim to be consistent on one of "serve" or "serving"

brucez-anyscale merged commit 9c062c5 into master on Jul 12, 2022
DmitriGekhtman deleted the brucez/httpProxyService branch on December 3, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

* draft for serving service

* add http proxy health check and http proxy service

* address comments

* add unit tests

* fix ut

* update ut

* update

* address comment and add TODO

Co-authored-by: Bruce Zhang <waynegates0@gmail.com>