
[RayService] Add support for multi-app config in yaml-string format #1156

Merged: 10 commits merged into ray-project:master on Jun 14, 2023

Conversation

zcin (Contributor) commented on Jun 9, 2023

Why are these changes needed?

The new field serveConfigV2 under a RayService spec takes a YAML-formatted string (https://yaml-multiline.info/) that matches a Ray Serve multi-app config.

Example from kubectl describe rayservice rayservice-sample:

Serve Config Type:     MULTI_APP
  serveConfigV2:         applications:
  - name: fruit_app
    import_path: fruit.deployment_graph
    route_prefix: /fruit
    runtime_env:
      working_dir: "https://github.com/ray-project/test_dag/archive/41d09119cbdf8450599f993f51318e9e27c59098.zip"
    deployments:
      - name: MangoStand
        num_replicas: 1
        user_config:
          price: 3
        ray_actor_options:
          num_cpus: 0.1
      - name: OrangeStand
        num_replicas: 1
        user_config:
          price: 2
        ray_actor_options:
          num_cpus: 0.1
      - name: PearStand
        num_replicas: 1
        user_config:
          price: 1
        ray_actor_options:
          num_cpus: 0.1
      - name: FruitMarket
        num_replicas: 1
        ray_actor_options:
          num_cpus: 0.1
      - name: DAGDriver
        num_replicas: 1
        ray_actor_options:
          num_cpus: 0.1
  - name: math_app
    import_path: conditional_dag.serve_dag
    route_prefix: /calc
    runtime_env:
      working_dir: "https://github.com/ray-project/test_dag/archive/41d09119cbdf8450599f993f51318e9e27c59098.zip"
    deployments:
      - name: Adder
        num_replicas: 1
        user_config:
          increment: 3
      - name: Multiplier
        num_replicas: 1
        user_config:
          factor: 5
      - name: Router
        num_replicas: 1
      - name: create_order
        num_replicas: 1
      - name: DAGDriver
        num_replicas: 1

  Service Unhealthy Second Threshold:  300
Status:
  Active Service Status:
    Application Statuses:
      fruit_app:
        Health Last Update Time:  2023-06-13T15:18:00Z
        Last Update Time:         2023-06-13T15:18:00Z
        Serve Deployment Statuses:
          fruit_app_DAGDriver:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          fruit_app_FruitMarket:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          fruit_app_MangoStand:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          fruit_app_OrangeStand:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          fruit_app_PearStand:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
        Status:                       RUNNING
      math_app:
        Health Last Update Time:  2023-06-13T15:18:00Z
        Last Update Time:         2023-06-13T15:18:00Z
        Serve Deployment Statuses:
          math_app_Adder:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          math_app_DAGDriver:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          math_app_Multiplier:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          math_app_Router:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
          math_app_create_order:
            Health Last Update Time:  2023-06-13T15:18:00Z
            Last Update Time:         2023-06-13T15:18:00Z
            Status:                   HEALTHY
        Status:                       RUNNING
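
For illustration, here is a minimal sketch of how a yaml-string config like the one above can be deserialized in Go. The struct fields are a hypothetical subset mirroring the keys in the output above, not the PR's exact types; it assumes sigs.k8s.io/yaml, which Kubernetes projects commonly use for YAML-to-struct conversion.

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// Illustrative subset of a multi-app Serve config; the json tags match the
// snake_case keys in the serveConfigV2 string shown above.
type ServeApplication struct {
	Name        string `json:"name"`
	ImportPath  string `json:"import_path"`
	RoutePrefix string `json:"route_prefix"`
}

type MultiAppConfig struct {
	Applications []ServeApplication `json:"applications"`
}

func main() {
	// In the controller this string would come from spec.serveConfigV2.
	serveConfigV2 := `
applications:
  - name: fruit_app
    import_path: fruit.deployment_graph
    route_prefix: /fruit
  - name: math_app
    import_path: conditional_dag.serve_dag
    route_prefix: /calc
`
	var cfg MultiAppConfig
	if err := yaml.Unmarshal([]byte(serveConfigV2), &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.Applications[0].Name, cfg.Applications[1].RoutePrefix) // fruit_app /calc
}
```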

Notes:

  • After researching online, there was no indication that Kubernetes has a maximum length for string fields. If there is one, it is likely too large to affect us; e.g., the maximum size of any etcd request is 1.5 MiB.
  • Kubernetes schemas allow oneOf, which is what we would need to ensure only one of serveConfig or serveConfigV2 is specified. However, kubebuilder doesn't support generating oneOf (see kubernetes-sigs/controller-tools#461, "Support more complex validations (oneOf/anyOf/allOf/etc)"), so we perform the validation in the service controller and consistently return errors if the user specifies both; a sketch of this check is included below.
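
A minimal sketch of this controller-side mutual-exclusion check, with illustrative field and function names rather than the PR's exact code:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the relevant RayService spec fields; the real
// types live in the rayv1alpha1 API package.
type ServeDeploymentGraphSpec struct {
	ImportPath string
}

type RayServiceSpec struct {
	ServeConfigV2            string
	ServeDeploymentGraphSpec ServeDeploymentGraphSpec
}

// validateServeConfigs rejects specs that set both the V1 and V2 serve
// configs, since kubebuilder cannot generate a oneOf constraint in the CRD.
func validateServeConfigs(spec RayServiceSpec) error {
	v1Set := spec.ServeDeploymentGraphSpec.ImportPath != ""
	v2Set := spec.ServeConfigV2 != ""
	if v1Set && v2Set {
		return errors.New("only one of serveConfig or serveConfigV2 may be specified")
	}
	return nil
}

func main() {
	spec := RayServiceSpec{ServeConfigV2: "applications: []"}
	fmt.Println(validateServeConfigs(spec)) // <nil>
}
```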

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: cindyz <cindyz@anyscale.com>
```diff
@@ -443,7 +443,7 @@ func (r *RayServiceReconciler) cleanUpServeConfigCache(rayServiceInstance *rayv1
 	configPrefix := r.generateConfigKeyPrefix(rayServiceInstance)

 	// Clean up RayCluster serve deployment configs.
-	for key := range r.ServeDeploymentConfigs.Items() {
+	for key := range r.ServeConfigs.Items() {
```
Member:

The logic of this function is quite confusing and seems to be causing a memory leak. To elaborate: when a RayService is deleted, does the cache for that RayService instance get removed or not? I need to run some experiments to verify this. This is technical debt and is not caused by this pull request, so we do not need to address it within the scope of this PR.
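
For context, a hedged sketch of the prefix-based cleanup pattern the diff above touches; the cache type and helper names here are illustrative, not KubeRay's actual implementation. If entries for a deleted RayService never match the computed prefix, they would linger in the cache, which is the suspected leak.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative in-memory cache keyed by "<namespace>/<name>/<cluster>"-style
// strings; KubeRay's real cache type differs.
type configCache map[string]string

// cleanUpServeConfigCache drops every cached entry whose key starts with the
// prefix derived from the RayService being reconciled. If the prefix ever
// fails to match stale keys, those entries are never removed.
func cleanUpServeConfigCache(cache configCache, configPrefix string) {
	for key := range cache {
		if strings.HasPrefix(key, configPrefix) {
			delete(cache, key) // deleting during range is safe in Go
		}
	}
}

func main() {
	cache := configCache{
		"default/rayservice-sample/cluster-a": "cfg1",
		"default/other-service/cluster-b":     "cfg2",
	}
	cleanUpServeConfigCache(cache, "default/rayservice-sample")
	fmt.Println(len(cache)) // 1: only the other service's entry remains
}
```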

```go
ImportPath:  configV1Spec.ImportPath,
RuntimeEnv:  applicationRuntimeEnv,
Deployments: convertedDeploymentSpecs,
Port:        configV1Spec.Port,
```
Member:

Why do we add Port here? The behavior seems to differ from the following code snippet:

```go
servingClusterDeployments := utils.ServingClusterDeployments{
	ImportPath:  rayServiceInstance.Spec.ServeDeploymentGraphSpec.ImportPath,
	RuntimeEnv:  runtimeEnv,
	Deployments: rayDashboardClient.ConvertServeConfig(rayServiceInstance.Spec.ServeDeploymentGraphSpec.ServeConfigSpecs),
}
```

zcin (Contributor, author):

It's part of the spec. The lines you linked here were only used for printing the JSON in the logs; I think they just missed the port. In UpdateDeployments the port was included before as well.

Member:

Based on my understanding, if we intend to make the port dynamically configurable, we will also need to update the configuration of Pods and Services. However, it is worth noting that RayCluster currently does not support dynamic updates to Pod and Service configuration.

zcin (Contributor, author):

I see. I believe this is existing functionality, so I don't want to change it in this PR. Can you create a separate issue for this?

Member:

I am fine with this change, but the corresponding function for this code snippet appears to be this instead of UpdateDeployments in dashboard_httpclient.go.

zcin (Contributor, author):

Yeah, if you look closely, that code snippet is only used to print a JSON blob to the logs. It's not actually sent to the Serve API.

Member:

I will open an issue later and resolve this comment. However, it seems that the ports for containers in a Pod cannot be updated dynamically (see this link for more details). In that case, the port can only be updated if users expose multiple ports up front, of which only a few will be used. I am not familiar with serving, so I'm unsure whether there are any use cases for this situation.
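
To illustrate the workaround mentioned above, here is a hedged sketch of pre-declaring multiple container ports using the standard k8s.io/api/core/v1 types; the port names, numbers, and image tag are hypothetical. Per the comment above, since port entries on a running Pod cannot be changed, only ports declared up front could later be switched between.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical head-pod container exposing both the current serve port
	// and a spare one; per the comment above, declaring spare ports up front
	// is the only way a later config change could pick a different port.
	container := corev1.Container{
		Name:  "ray-head",
		Image: "rayproject/ray:2.5.0",
		Ports: []corev1.ContainerPort{
			{Name: "serve", ContainerPort: 8000},
			{Name: "serve-alt", ContainerPort: 8001}, // spare, unused initially
		},
	}
	for _, p := range container.Ports {
		fmt.Printf("%s -> %d\n", p.Name, p.ContainerPort)
	}
}
```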

Signed-off-by: cindyz <cindyz@anyscale.com>
zcin marked this pull request as ready for review on June 12, 2023 18:58
cindyz added 5 commits June 12, 2023 19:44
Signed-off-by: cindyz <cindyz@anyscale.com>
Signed-off-by: cindyz <cindyz@anyscale.com>
fix
Signed-off-by: cindyz <cindyz@anyscale.com>
Signed-off-by: cindyz <cindyz@anyscale.com>
Signed-off-by: cindyz <cindyz@anyscale.com>
```go
cachedServeConfig, isServeConfig := cachedConfigObj.(rayv1alpha1.ServeDeploymentGraphSpec)
if !isServeConfig {
	shouldUpdate = true
	reason = fmt.Sprintf("No Serve config has been cached for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
```
Member:

The message doesn't seem precise enough. You can refer to https://go.dev/ref/spec#Type_assertions for more details about type assertions, and the section "Implementing an interface" in https://go.dev/ref/spec#Method_sets about "implements".

From the spec: if T is not an interface type, x.(T) asserts that the dynamic type of x is identical to T. In this case, T must implement the (interface) type of x; otherwise the type assertion is invalid, since it is not possible for x to store a value of type T. If T is an interface type, x.(T) asserts that the dynamic type of x implements the interface T.

In our case, cachedConfigObj (i.e. x) has an interface type and ServeDeploymentGraphSpec (i.e. T) does not, so isServeConfig=false means the dynamic type of cachedConfigObj is not ServeDeploymentGraphSpec. In other words, isServeConfig=false indicates that a serve configuration is cached, but the cached value is not a ServeDeploymentGraphSpec.
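
A minimal self-contained illustration of the comma-ok behavior described above (the stored value here is hypothetical):

```go
package main

import "fmt"

// Illustrative stand-in for the cached spec type.
type ServeDeploymentGraphSpec struct{ ImportPath string }

func main() {
	// The cache stores interface{} values; here the dynamic type is string.
	var cachedConfigObj interface{} = "a v2 yaml string, not a graph spec"

	// Comma-ok form: ok is false when the dynamic type is not
	// ServeDeploymentGraphSpec, instead of panicking like x.(T) alone would.
	cachedServeConfig, ok := cachedConfigObj.(ServeDeploymentGraphSpec)
	fmt.Println(ok)                // false: something is cached, just not this type
	fmt.Println(cachedServeConfig) // zero value: {}
}
```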

zcin (Contributor, author):

I see, I have updated the messages to include information about v1/v2 and the corresponding field types.

cindyz added 2 commits June 13, 2023 15:30
Signed-off-by: cindyz <cindyz@anyscale.com>
Signed-off-by: cindyz <cindyz@anyscale.com>
Signed-off-by: cindyz <cindyz@anyscale.com>
kevin85421 (Member) left a comment:

LGTM. I will approve this PR after I manually test it.

reason = fmt.Sprintf("No V1 Serve Config of type ServeDeploymentGraphSpec has been cached for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
} else if !utils.CompareJsonStruct(cachedServeConfig, rayServiceInstance.Spec.ServeDeploymentGraphSpec) {
shouldUpdate = true
reason = fmt.Sprintf("Current V2 Serve config doesn't match cached Serve config for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
Member:

Suggested change:

```diff
-	reason = fmt.Sprintf("Current V2 Serve config doesn't match cached Serve config for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
+	reason = fmt.Sprintf("Current V1 Serve config doesn't match cached Serve config for cluster %s with key %s", rayClusterInstance.Name, cacheKey)
```

?

kevin85421 (Member) left a comment:

```sh
# Step 0: Create a Kubernetes cluster
kind create cluster --image=kindest/node:v1.23.0

# Step 1: Install a KubeRay operator (path: helm-chart/kuberay-operator/)
helm install kuberay-operator . --set image.repository=controller,image.tag=latest

# Step 2: Create a multi-app RayService (path: ray-operator/config/samples)
kubectl apply -f ray_v1alpha1_rayservice.yaml

# Step 3: Forward the serve port after the RayService is ready.
kubectl port-forward svc/rayservice-sample-serve-svc 8000

# Step 4: Test the two applications.
curl -X POST -H 'Content-Type: application/json' localhost:8000/fruit/ -d '["PEAR", 12]'
# expected output: 12
curl -X POST -H 'Content-Type: application/json' localhost:8000/calc/ -d '["ADD", 1]'
# expected output: "4 pizzas please!"

# Step 5: Edit the `rayVersion` to 2.100.0
kubectl edit rayservices.ray.io rayservice-sample

# Step 6: After the new RayCluster is ready, repeat Step 4.
```

kevin85421 merged commit 44c0d50 into ray-project:master on Jun 14, 2023 (18 checks passed)

zcin deleted the multi-app-yaml branch on August 25, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023