Add serving service for users traffic with health check #367

Merged: 10 commits into master from brucez/httpProxyService on Jul 12, 2022

Conversation

brucez-anyscale (Contributor) commented Jul 9, 2022

Why are these changes needed?

This PR makes the HTTP proxy highly available (HA); that is, it adds a traffic-serving service with HA.
Ray provides an HTTP proxy health check.
We need to list all the pods and run the health check against each one to decide whether a pod can serve user traffic.
Then we update a label on the pod accordingly.
The serving service dynamically picks the healthy pods via a label selector.

I have done manual testing (screenshot attached: Screen Shot 2022-07-08 at 9 43 10 PM).
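A minimal sketch of the health-check-and-relabel loop described above, written against client-go. The label keys (`ray.io/cluster`, `ray.io/serve`), the health endpoint path (`/-/healthz`), and the proxy port (8000) are illustrative assumptions and may not match what this PR actually uses:

```go
// Sketch only (not the PR's actual code): list the cluster's Ray pods, probe
// each proxy's health endpoint, and patch a label so a label-selector Service
// only routes user traffic to healthy pods.
package servehealth

import (
	"context"
	"fmt"
	"net/http"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	clusterLabelKey  = "ray.io/cluster" // assumed cluster label key
	serveLabelKey    = "ray.io/serve"   // hypothetical "can serve traffic" label
	healthCheckPath  = "/-/healthz"     // assumed Serve proxy health endpoint
	serveProxyPort   = 8000             // assumed Serve HTTP proxy port
	healthCheckLimit = 2 * time.Second  // timeout discussed in this PR
)

// reconcileServeLabels labels each Ray pod according to its proxy health.
func reconcileServeLabels(ctx context.Context, cs kubernetes.Interface, namespace, clusterName string) error {
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: clusterLabelKey + "=" + clusterName,
	})
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: healthCheckLimit}
	for _, pod := range pods.Items {
		healthy := proxyIsHealthy(client, pod)
		patch := fmt.Sprintf(`{"metadata":{"labels":{"%s":"%t"}}}`, serveLabelKey, healthy)
		if _, err := cs.CoreV1().Pods(namespace).Patch(ctx, pod.Name,
			types.MergePatchType, []byte(patch), metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// proxyIsHealthy sends one GET to the pod's proxy health endpoint.
func proxyIsHealthy(client *http.Client, pod corev1.Pod) bool {
	url := fmt.Sprintf("http://%s:%d%s", pod.Status.PodIP, serveProxyPort, healthCheckPath)
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```

The serving service itself would then simply select on the serve label, so its endpoints update automatically as pods are relabeled.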

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

DefaultRedisPortName: DefaultRedisPort,
DefaultDashboardName: DefaultDashboardPort,
DefaultMetricsName: DefaultMetricsPort,
DefaultDashboardAgentListenPortName: DefaultDashboardAgentListenPort,
brucez-anyscale (Contributor, Author) commented on this code:

Remove DefaultDashboardAgentListenPortName since it is controlled by an annotation.

fishbone (Contributor) commented Jul 9, 2022

Sorry, I'm not very familiar with k8s. Which component will send the GET request? Will this component become a bottleneck if we have a lot of nodes?

DmitriGekhtman (Collaborator) commented Jul 9, 2022

> Which component will send the GET request? Will this component become a bottleneck if we have a lot of nodes?

The scheme proposed in this PR is for the KubeRay operator to send the GET. I'm not familiar with the implementation of the endpoint -- is it a cheap ping? Say we have a total of 2000 Ray pods managed by the operator. Any concerns at that scale?

Under "normal" circumstances, maybe this health check would be implemented as a readiness probe, but I guess doing that would conflict with other parts of the HA design.
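For reference, the readiness-probe alternative mentioned above would look roughly like the following when building the pod spec in Go (sketch only; the path, port, and thresholds are assumptions, and the PR does not take this route):

```go
// Sketch of the readiness-probe alternative (not used by this PR). Assumes a
// recent k8s.io/api where the embedded probe field is named ProbeHandler.
package serveprobe

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// serveReadinessProbe probes an assumed health endpoint on the Serve proxy.
func serveReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/-/healthz",         // assumed health endpoint
				Port: intstr.FromInt(8000), // assumed Serve HTTP proxy port
			},
		},
		PeriodSeconds:    5, // illustrative values
		FailureThreshold: 3,
	}
}
```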

DmitriGekhtman (Collaborator):

> is it a cheap ping? Say we have a total of 2000 Ray pods managed by the operator. Any concerns at that scale?

Also, how often do we expect to trigger these GETs? Once per reconciler iteration, which at steady state would be whatever the controller's requeue duration is -- I forget the duration.

My vague intuition tells me there isn't a scalability concern here, but could you provide some numerical estimates to justify that?

DmitriGekhtman (Collaborator):

We'll need to figure out how to test this stuff soon.

fishbone (Contributor) commented Jul 9, 2022

> Under "normal" circumstances, maybe this health check would be implemented as a readiness probe, but I guess doing that would conflict with other parts of the HA design.

@DmitriGekhtman The reason is that scheduling of the HttpProxy is handled by Ray, not k8s. The flow is:

k8s schedules a raylet ---> Serve starts the HTTP proxy actor ---> Ray schedules the HTTP proxy actor

The Serve HTTP endpoint is the HTTP proxy. So what we want to do here is add the nodes that have an HTTP proxy to some svc, which is a very dynamic thing.

The check is cheap, but we can't guarantee there are no bugs. Either we add a lot of testing code to ensure this, or we make the duration of the request unimportant for the serve operator's availability (maybe: if it takes >1s and fails 5 times in a row, we consider it not ready?).
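A sketch of that "N consecutive failures" idea, using the illustrative numbers floated above (1s timeout, 5 failures); this is not what the PR implements:

```go
// Sketch of a "not ready only after N consecutive failures" health probe, so
// a single slow or flaky request does not immediately flip a pod's status.
package healthtrack

import (
	"net/http"
	"time"
)

const (
	failureThreshold = 5               // consecutive failures before "not ready"
	requestTimeout   = 1 * time.Second // treat a request slower than 1s as failed
)

// healthTracker remembers consecutive failure counts per endpoint URL.
type healthTracker struct {
	failures map[string]int
	client   *http.Client
}

func newHealthTracker() *healthTracker {
	return &healthTracker{
		failures: map[string]int{},
		client:   &http.Client{Timeout: requestTimeout},
	}
}

// probe returns false only once the endpoint has failed failureThreshold
// times in a row; any success resets the counter.
func (t *healthTracker) probe(url string) bool {
	resp, err := t.client.Get(url)
	if err != nil || resp.StatusCode != http.StatusOK {
		if resp != nil {
			resp.Body.Close()
		}
		t.failures[url]++
		return t.failures[url] < failureThreshold
	}
	resp.Body.Close()
	t.failures[url] = 0
	return true
}
```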

DmitriGekhtman (Collaborator) commented Jul 9, 2022

@iycheng Is the concern that the request might block for a long time?

The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster and in the worst case, that could be bad.

brucez-anyscale (Contributor, Author):

> @iycheng Is the concern that the request might block for a long time?
>
> The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster and in the worst case, that could be bad.

Good point. I think the timeout had better be small, like 20ms?
The operator reconcile loop can have parallelism, which should help with scalability.

A better design would be push instead of pull, but I don't think it is easy for Ray Core to push health status to the operator. So for now, we use the pull model.

I will add unit tests to this PR soon.

DmitriGekhtman (Collaborator):

> > @iycheng Is the concern that the request might block for a long time?
> >
> > The timeout in this PR is 2 sec, and the code loops sequentially through the Ray nodes, making the request for each one in series. For a large cluster and in the worst case, that could be bad.
>
> Good point. I think the timeout had better be small, like 20ms?
>
> The operator reconcile loop can have parallelism, which should help with scalability.
>
> A better design would be push instead of pull, but I don't think it is easy for Ray Core to push health status to the operator. So for now, we use the pull model.
>
> I will add unit tests to this PR soon.

Yeah, either a shorter timeout or spawning goroutines to do the health checks. The first option is definitely simpler :)
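A minimal sketch of the goroutine option: probe all pods concurrently so a reconcile pass is bounded by the slowest single request rather than the sum of all of them. `checkOne` is a placeholder for whatever per-pod check the operator performs.

```go
// Sketch: run all health checks concurrently and collect the results.
package parallelcheck

import "sync"

// checkAllConcurrently probes every URL in its own goroutine and returns a
// map from URL to health result. checkOne stands in for the actual per-pod
// check (e.g. an HTTP GET with a short timeout).
func checkAllConcurrently(urls []string, checkOne func(string) bool) map[string]bool {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]bool, len(urls))
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			ok := checkOne(u)
			mu.Lock()
			results[u] = ok
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return results
}
```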

DmitriGekhtman (Collaborator) commented Jul 11, 2022

nit/personal opinion:
In variable names, change Serving to Serve to be consistent with Ray Serve branding :)

DmitriGekhtman (Collaborator) left a comment

lgtm

Notes

  • looks like there's a failing test at the moment
  • should probably aim to be consistent on one of "serve" or "serving"

brucez-anyscale merged commit 9c062c5 into master on Jul 12, 2022
DmitriGekhtman deleted the brucez/httpProxyService branch on December 3, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

* draft for serving service

* add http proxy health check and http proxy service

* address comments

* add unit tests

* fix ut

* update ut

* update

* address comment and add TODO

Co-authored-by: Bruce Zhang <waynegates0@gmail.com>