support using proxy subresources when connecting to Ray head node #1980

Merged
merged 2 commits into from
Apr 19, 2024

Conversation

Contributor

@andrewsykim andrewsykim commented Mar 10, 2024

Why are these changes needed?

There are some cases where KubeRay may not be able to connect directly to a Ray head node. For example, a NetworkPolicy might disallow ingress from all Pods, or KubeRay might be running on a network with no connectivity to Pods. This PR allows KubeRay to use the services/proxy subresource to proxy HTTP requests to the Ray head node, so KubeRay can make requests to the head node without ever connecting to it directly.

Here are some sample HTTP requests in apiserver from my testing using the proxy subresource:

I0310 14:30:19.596708       1 httplog.go:131] "HTTP" verb="GET" URI="/api/v1/namespaces/default/services/rayjob-sample-raycluster-phsj9-head-svc:dashboard/proxy/api/jobs/rayjob-sample-th44t" latency="7.239221ms" userAgent="manager/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="c40152de-7a81-46b9-ac4a-f2ea296e44f0" srcIP="10.244.0.6:49524" resp=200

I0310 15:20:43.105571       1 httplog.go:131] "HTTP" verb="GET" URI="/api/v1/namespaces/default/pods/rayservice-sample-raycluster-qm2m2-head-xj44d:8000/proxy/-/healthz" latency="2.915789ms" userAgent="manager/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="bad00c23-de01-45ed-9fb5-05bc8b2f6c2d" srcIP="10.244.0.6:47274" resp=200


I0310 15:21:07.446881       1 httplog.go:131] "HTTP" verb="GET" URI="/api/v1/namespaces/default/services/rayservice-sample-raycluster-qm2m2-head-svc:dashboard/proxy/api/serve/applications/" latency="15.347398ms" userAgent="manager/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="54544346-b827-4f34-b593-bcbc78199611" srcIP="10.244.0.6:47274" resp=200
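
For readers unfamiliar with the proxy subresources, the request URIs in the log lines above follow a fixed shape. The following is a minimal sketch of how such paths could be assembled; `serviceProxyPath` and `podProxyPath` are hypothetical helper names used only for illustration and are not necessarily how this PR builds its URLs.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceProxyPath (hypothetical helper) builds the API server path that
// proxies an HTTP request to a named port of a Service.
func serviceProxyPath(namespace, service, portName, requestPath string) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/services/%s:%s/proxy/%s",
		namespace, service, portName, strings.TrimPrefix(requestPath, "/"))
}

// podProxyPath (hypothetical helper) builds the API server path that
// proxies an HTTP request to a numeric port of a Pod.
func podProxyPath(namespace, pod string, port int, requestPath string) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s:%d/proxy/%s",
		namespace, pod, port, strings.TrimPrefix(requestPath, "/"))
}

func main() {
	// Names below are copied from the sample log lines in this PR description.
	fmt.Println(serviceProxyPath("default", "rayjob-sample-raycluster-phsj9-head-svc",
		"dashboard", "/api/jobs/rayjob-sample-th44t"))
	fmt.Println(podProxyPath("default", "rayservice-sample-raycluster-qm2m2-head-xj44d",
		8000, "/-/healthz"))
}
```

The two calls in `main` reproduce the first two URIs from the sample log lines.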

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim andrewsykim force-pushed the service-proxy branch 3 times, most recently from b599778 to 530e237 Compare March 10, 2024 15:30
@andrewsykim andrewsykim changed the title support using services/proxy when connecting to Ray head node support using proxy subresources when connecting to Ray head node Mar 10, 2024
@kevin85421 kevin85421 self-requested a review March 11, 2024 00:31
@kevin85421 kevin85421 self-assigned this Mar 11, 2024
@andrewsykim andrewsykim force-pushed the service-proxy branch 2 times, most recently from 0117668 to 1a03e48 Compare March 11, 2024 02:05
@@ -218,9 +222,9 @@ func main() {
ctx := ctrl.SetupSignalHandler()
exitOnError(ray.NewReconciler(ctx, mgr, rayClusterOptions).SetupWithManager(mgr, config.ReconcileConcurrency),
"unable to create controller", "controller", "RayCluster")
exitOnError(ray.NewRayServiceReconciler(ctx, mgr, utils.GetRayDashboardClient, utils.GetRayHttpProxyClient).SetupWithManager(mgr),
Member

Would it be better to pass configapi.Configuration to the NewRayServiceReconciler and NewRayJobReconciler functions?

Contributor Author

I don't think it's necessary at the moment since we're only using a single field there. But we should revisit this in the future, when the config API graduates to beta. What do you think?

Member

I recently worked on an open-source event for newbies. This would be a good first issue for the event. I will open an issue later.

r.dashboardURL = "http://" + url
}

func (r *RayDashboardClient) WithKubernetesServiceProxy(svcNamespace, svcName string) {
Member

The function initializes the dashboard client. Perhaps we could update InitClient instead of creating a new function?

Contributor Author

Good point -- I'll try to update InitClient to support both proxy / non-proxy clients and see if it makes sense to combine them.

@@ -640,6 +664,21 @@ func (r *RayJobReconciler) getOrCreateRayClusterInstance(ctx context.Context, ra
return rayClusterInstance, nil
}

func (r *RayJobReconciler) getRayClusterInstance(ctx context.Context, rayJobInstance *rayv1.RayJob) (*rayv1.RayCluster, error) {
Member

I am not sure if we need this function. Perhaps we could directly use RayJobRayClusterNamespacedName and r.Get(...)?

Contributor Author

Great catch, will update to just call RayJobRayClusterNamespacedName

Timeout: 20 * time.Millisecond,
}
}

func (r *RayHttpProxyClient) WithKubernetesPodProxy(podNamespace, podName string, port int) {
Member

The function initializes the client. Perhaps we could update InitClient instead of creating a new function?

@andrewsykim andrewsykim force-pushed the service-proxy branch 3 times, most recently from 34b43a6 to df8a3aa Compare April 11, 2024 15:33
Contributor Author

@andrewsykim andrewsykim left a comment

@kevin85421 addressed your comments, please take another look

Member

@kevin85421 kevin85421 left a comment

httpProxyURL string

mgr ctrl.Manager
useProxy bool
Member

Nit: It's better to use UseKubernetesProxy for consistency.

No need to update this in this PR. I recently worked on an open-source event for newbies. This would be a good first issue for the event. I will open an issue later.

}

func (r *RayHttpProxyClient) InitClient() {
-r.client = http.Client{
+r.client = &http.Client{
Member

Is there any reason for this change?

Contributor Author

It's mainly because GetHTTPClient() returns a pointer
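
To illustrate the point about pointers, here is a minimal sketch assuming the proxy path reuses an `*http.Client` handed out by a manager-like object, while the direct path allocates its own client. The `httpClientProvider` interface, `fakeManager`, and `proxyClient` types are stand-ins invented for this example; only the 20 ms timeout and the fact that `GetHTTPClient()` returns a pointer come from this thread.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// directTimeout mirrors the 20 ms timeout visible in the diff hunk.
const directTimeout = 20 * time.Millisecond

// httpClientProvider is a stand-in for the part of the manager used here.
type httpClientProvider interface {
	GetHTTPClient() *http.Client
}

// fakeManager is a test double; it is not a controller-runtime type.
type fakeManager struct{ client *http.Client }

func (f fakeManager) GetHTTPClient() *http.Client { return f.client }

func newFakeManager() fakeManager { return fakeManager{client: &http.Client{}} }

// proxyClient stores a pointer so both branches can assign to the same field.
type proxyClient struct {
	client   *http.Client
	useProxy bool
	provider httpClientProvider
}

func (p *proxyClient) initClient() {
	if p.useProxy {
		// Reuse the shared client instead of copying it by value.
		p.client = p.provider.GetHTTPClient()
		return
	}
	p.client = &http.Client{Timeout: directTimeout}
}

func main() {
	m := newFakeManager()
	p := &proxyClient{useProxy: true, provider: m}
	p.initClient()
	fmt.Println("shares manager client:", p.client == m.GetHTTPClient())
}
```

With a value-typed field, the proxy branch would have to dereference and copy the shared client, which is why a pointer field is the simpler fit in this sketch.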

dashboardURL string
}

-func GetRayDashboardClient() RayDashboardClientInterface {
-return &RayDashboardClient{}
+func GetRayDashboardClientFunc(mgr ctrl.Manager, useProxy bool) func() RayDashboardClientInterface {
Member

Is there any reason to change from GetRayDashboardClient() to GetRayDashboardClientFunc()?

Contributor Author

Yes, because the function signature is changed to now return a func that returns a RayDashboardClientInterface
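
The rename reflects a common Go pattern: a function that captures configuration once and returns a constructor with the original zero-argument shape. A minimal sketch, using simplified stand-in types (`dashboardClient`, `rayDashboardClient`, `getDashboardClientFunc` are illustrative names, and the manager argument is omitted):

```go
package main

import "fmt"

// dashboardClient is a simplified stand-in for RayDashboardClientInterface.
type dashboardClient interface {
	UsesProxy() bool
}

type rayDashboardClient struct {
	useProxy bool
}

func (c *rayDashboardClient) UsesProxy() bool { return c.useProxy }

// getDashboardClientFunc captures useProxy and returns a constructor, so
// callers that expect a func() dashboardClient do not need to change.
func getDashboardClientFunc(useProxy bool) func() dashboardClient {
	return func() dashboardClient {
		return &rayDashboardClient{useProxy: useProxy}
	}
}

func main() {
	newClient := getDashboardClientFunc(true)
	c := newClient()
	fmt.Println("uses proxy:", c.UsesProxy())
}
```

Each call to the returned function yields a fresh client that carries the captured setting.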

-func (r *RayDashboardClient) InitClient(url string) {
-r.client = http.Client{
+func (r *RayDashboardClient) InitClient(url, svcNamespace, svcName string) {
+if r.useProxy {
Member

Nit: It's better to use UseKubernetesProxy for consistency.

No need to update this in this PR. I recently worked on an open-source event for newbies. This would be a good first issue for the event. I will open an issue later.

@@ -1028,8 +1030,13 @@ func (r *RayServiceReconciler) updateStatusForActiveCluster(ctx context.Context,
return err
}

headSvcName, err := utils.GenerateHeadServiceName(utils.RayServiceCRD, rayClusterInstance.Spec, rayClusterInstance.Name)
Member

We should have a more elegant method to initialize the dashboard client. Calling GenerateHeadServiceName every time we want to initialize the client makes the codebase complex and hard to maintain.

Contributor Author

Agreed, I thought this was kind of ugly too. Maybe we can pass the entire RayCluster to InitClient?

Contributor Author

Maybe we can pass the entire RayCluster to InitClient?

I can make this change in this PR if you want

Member

I can make this change in this PR if you want

Thanks! It is helpful! Maybe we can consider adding the head service's name to RayCluster's status in the future.

Contributor Author

I updated InitClient to receive a RayCluster so we can call GenerateHeadServiceName from inside InitClient. This requires InitClient to now return an error
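
A rough sketch of the resulting shape, under stated assumptions: `rayCluster`, `headServiceName`, and `dashboardClient` are simplified stand-ins, not the KubeRay types. The `-head-svc` suffix and the `dashboard` port name are taken from the log lines in this PR; the real GenerateHeadServiceName may apply additional rules, and the real client may prepend the API server host.

```go
package main

import (
	"errors"
	"fmt"
)

// rayCluster is a stand-in carrying only the fields this sketch needs.
type rayCluster struct {
	Namespace string
	Name      string
}

// headServiceName is a hypothetical substitute for GenerateHeadServiceName.
func headServiceName(c rayCluster) (string, error) {
	if c.Name == "" {
		return "", errors.New("ray cluster has no name")
	}
	return c.Name + "-head-svc", nil
}

type dashboardClient struct {
	useProxy bool
	baseURL  string
}

// initClient resolves the head service itself, so it has to return an error.
func (d *dashboardClient) initClient(directURL string, c rayCluster) error {
	if !d.useProxy {
		d.baseURL = "http://" + directURL
		return nil
	}
	svc, err := headServiceName(c)
	if err != nil {
		return fmt.Errorf("resolving head service: %w", err)
	}
	d.baseURL = fmt.Sprintf("/api/v1/namespaces/%s/services/%s:dashboard/proxy",
		c.Namespace, svc)
	return nil
}

func main() {
	d := &dashboardClient{useProxy: true}
	if err := d.initClient("", rayCluster{Namespace: "default", Name: "rayjob-sample-raycluster-rxhpc"}); err != nil {
		fmt.Println("init failed:", err)
		return
	}
	fmt.Println(d.baseURL)
}
```

Callers then pass the cluster object once and no longer compute the head service name at each call site.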

…ad node

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Member

@kevin85421 kevin85421 left a comment

LGTM. Thanks! I will test it manually before I merge this PR.

@kevin85421
Member

I tested this PR manually.

  • Step 0: Create a Kind cluster with the following configurations:
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
      kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
            extraArgs:
              v: "4"
  • Step 1: Update deployment.yaml (helm-chart/kuberay-operator/templates/deployment.yaml) to enable this feature.
    {{- $argList = append $argList "--use-kubernetes-proxy" -}}
  • Step 2: Install the KubeRay operator.
  • Step 3: Create a RayJob CR.
  • Step 4: Check the API server log
    I0419 01:57:09.080550       1 httplog.go:132] "HTTP" verb="GET" URI="/api/v1/namespaces/default/services/rayjob-sample-raycluster-rxhpc-head-svc:dashboard/proxy/api/jobs/rayjob-sample-zcjbj" latency="4.394603ms" userAgent="kuberay-operator/nightly" audit-ID="db0c2d98-a5d0-479f-81f8-7e9d7e3dc38b" srcIP="10.244.0.5:47796" resp=200
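
The check in Step 4 can also be scripted. The sketch below is one possible way to confirm that a log line went through a proxy subresource; `proxiedURI` is a hypothetical helper, and the pattern only assumes the `URI="..."` field format visible in the log line above.

```go
package main

import (
	"fmt"
	"regexp"
)

// proxyURI matches the URI field of an API server httplog line when the
// request targets a services or pods proxy subresource.
var proxyURI = regexp.MustCompile(
	`URI="(/api/v1/namespaces/[^/"]+/(?:services|pods)/[^/"]+/proxy/[^"]*)"`)

// proxiedURI (hypothetical helper) returns the proxied URI, if any.
func proxiedURI(logLine string) (string, bool) {
	m := proxyURI.FindStringSubmatch(logLine)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	line := `"HTTP" verb="GET" URI="/api/v1/namespaces/default/services/rayjob-sample-raycluster-rxhpc-head-svc:dashboard/proxy/api/jobs/rayjob-sample-zcjbj" resp=200`
	if uri, ok := proxiedURI(line); ok {
		fmt.Println("proxied request:", uri)
	}
}
```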
    

@kevin85421 kevin85421 merged commit ff008a1 into ray-project:master Apr 19, 2024
24 checks passed
@andrewsykim
Contributor Author

thanks @kevin85421! I'll look into adding an e2e test as well
