[Core][Autoscaler] Refactor v2 Log Formatting by ryanaoleary · Pull Request #49350 · ray-project/ray

ryanaoleary · 2024-12-19T01:55:02Z

Why are these changes needed?

Currently the V2 Autoscaler formats logs by converting the V2 data structure ClusterStatus to the V1 structures AutoscalerSummary and LoadMetricsSummary and then passing them to the legacy format_info_string. It'd be useful for the V2 autoscaler to directly format ClusterStatus to the correct output log format. This PR refactors utils.py to directly format ClusterStatus. Additionally, this PR changes the node reports to output instance_id rather than ip_address, since the latter is not necessarily unique for failed nodes.

Related issue number

Closes #37856

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2025-02-13T02:03:01Z

cc: @rickyyx, I think I'm going to add some different unit test cases but this should be good to review

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2025-02-25T00:33:40Z

@kevin85421 This PR would be helpful to add since it moves away from referencing the legacy autoscaler from the V2 autoscaler. This refactor will also make it easier to make other improvements to V2 log formatting, since we won't have to parse it to the V1 format first.

kevin85421

reviewing

kevin85421 · 2025-02-25T01:37:13Z

 worker_node, 1 launching
 worker_node_gpu, 1 launching
- 127.0.0.3: worker_node, starting ray
+ instance4: worker_node, starting ray


I remember that the instance ID for KubeRay is the Pod name. Could you check whether the ray status result also shows the Pod name so that we can map K8s Pods to Ray instances?

I don't think ray status currently shows the Pod name, this is from my manual testing:

(base) ray@raycluster-autoscaler-head-p77pc:~$ ray status --verbose ======== Autoscaler status: 2025-02-25 22:35:58.558544 ======== GCS request time: 0.001412s Node status --------------------------------------------------------------- Active: (no active nodes) Idle: 1 headgroup Pending: : small-group, Recent failures: (no failures) Resources --------------------------------------------------------------- Total Usage: 0B/1.86GiB memory 0B/495.58MiB object_store_memory Total Demands: {'CPU': 1.0, 'TPU': 4.0}: 1+ pending tasks/actors Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93 Idle: 82253 ms Usage: 0B/1.86GiB memory 0B/495.58MiB object_store_memory

Will the key in the key-value pairs in Pending be the Pod name? That's my expectation.

In addition, have you manually tested this PR? The test below shows that either "head node" or "worker node" is appended to the end of the Node: ... line. For example,

Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)

However, the above output from your manual testing is:

Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93

I misunderstood your initial comment, I thought you were asking whether ray status currently shows the Pod name in KubeRay. The above snippet was using the ray 2.41 image. I've been running into issues building an image lately to test the new changes with the following Dockerfile:

# Use the latest Ray master as base. FROM rayproject/ray:nightly-py310 # Invalidate the cache so that fresh code is pulled in the next step. ARG BUILD_DATE # Retrieve your development code. ADD . ray # Install symlinks to your modified Python code. RUN python ray/python/ray/setup-dev.py -y

where the RayCluster Pods will immediately crash and terminate after pulling the image. Describing the RayCluster just shows:

Normal DeletedHeadPod 5m21s (x8 over 5m22s) raycluster-controller Deleted head Pod default/raycluster-autoscaler-head-ll2vs; Pod status: Running; Pod restart policy: Never; Ray container terminated status: &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2025-03-04 11:52:38 +0000 UTC,FinishedAt:2025-03-04 11:52:38 +0000 UTC,ContainerID:containerd://81a7332de2046c934ba6725cbb72eb3b228ee8aa66bc26f2db5f6741607ae82f,} Normal DeletedHeadPod 26s (x159 over 5m9s) raycluster-controller (combined from similar events): Deleted head Pod default/raycluster-autoscaler-head-5g2c4; Pod status: Running; Pod restart policy: Never; Ray container terminated status: &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2025-03-04 11:57:33 +0000 UTC,FinishedAt:2025-03-04 11:57:34 +0000 UTC,ContainerID:containerd://f423d2877a176beaecb88e6d1d8e61456233b1359c9e8b94e333ea4560e86b1c,}

The head Pod keeps immediately crashing and re-creating, so I can't get any more useful logs from the container. I tried building an image using the latest changes from master (i.e. I didn't use any of my python changes) and it still had the same issue, is this a problem you've seen before? As soon as I have a working image I can run a manual test to check for Pod name in the key-value pairs in Pending.

I was just able to manually test it with my changes, here is the output of ray status --verbose with a Pending node:

======== Autoscaler status: 2025-03-04 12:11:15.078526 ======== GCS request time: 0.001526s Node status --------------------------------------------------------------- Active: (no active nodes) Idle: 1 headgroup Pending: a4dfeafc-8a5e-47ff-9721-cdd559c00dfc: small-group, Recent failures: (no failures) Resources --------------------------------------------------------------- Total Usage: 0B/1.86GiB memory 0B/511.68MiB object_store_memory Total Demands: {'CPU': 1.0}: 1+ pending tasks/actors Node: dcb068352f72b5244cfdefaa70055d5cd51b5cd29778295b41cd0775 (headgroup) Idle: 10641 ms Usage: 0B/1.86GiB memory 0B/511.68MiB object_store_memory (base) ray@raycluster-autoscaler-head-hr8pd:~$ ray status --verbose ======== Autoscaler status: 2025-03-04 12:11:36.180572 ======== GCS request time: 0.001749s Node status --------------------------------------------------------------- Active: 1 small-group Idle: 1 headgroup Pending: (no pending nodes) Recent failures: (no failures) Resources --------------------------------------------------------------- Total Usage: 1.0/1.0 CPU 0B/2.79GiB memory 0B/781.36MiB object_store_memory Total Demands: (no resource demands) Node: 1c0ff9b00d40e332469adb4fdfacd9d0f21599bac65e50666e808a4d (small-group) Usage: 1.0/1.0 CPU 0B/953.67MiB memory 0B/269.68MiB object_store_memory Activity: Resource: CPU currently in use. Busy workers on node. Node: dcb068352f72b5244cfdefaa70055d5cd51b5cd29778295b41cd0775 (headgroup) Idle: 31744 ms Usage: 0B/1.86GiB memory 0B/511.68MiB object_store_memory Activity: (no activity)

it look like instance_id isn't set to the Pod name, but some other generated unique ID. Looking at the Autoscaler logs, if we wanted it to output Pod name here we should use cloud_instance_id:

2025-03-04 12:11:34,960 - INFO - Update instance ALLOCATED->RAY_RUNNING (id=a4dfeafc-8a5e-47ff-9721-cdd559c00dfc, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-qbd8r, ray_id=): ray node 1c0ff9b00d40e332469adb4fdfacd9d0f21599bac65e50666e808a4d is RUNNING

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2025-02-28T01:58:26Z

@kevin85421 I went through and re-factored the code to make it easier to review. I also made sure the PR is consistent in using string concatenation with .join() to format the multi-line strings. The PR is passing the tests and should be ready for another review.

kevin85421 · 2025-03-02T19:16:13Z

 worker_node, 1 launching
 worker_node_gpu, 1 launching
- 127.0.0.3: worker_node, starting ray
+ instance4: worker_node, starting ray


Will the key in the key-value pairs in Pending be the Pod name? That's my expectation.

In addition, have you manually tested this PR? The test below shows that either "head node" or "worker node" is appended to the end of the Node: ... line. For example,

Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)

However, the above output from your manual testing is:

Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2025-03-04T20:20:07Z

RayCluster manifest with Autoscaler v2 used for manual testing:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
  # Use the Ray nightly or Ray version >= 2.10.0 and KubeRay 1.1.0 or later for autoscaler v2.
  enableInTreeAutoscaling: true
  autoscalerOptions:
    upscalingMode: Default
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    # Optionally specify the Autoscaler container's securityContext.
    securityContext: {}
    env: []
    envFrom: []
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
  # Ray head pod template
  headGroupSpec:
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
    # Pod template
    template:
      spec:
        containers:
        # The Ray head container
        - name: ray-head
          image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray-logs:latest
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "1"
              memory: "2G"
            requests:
              cpu: "1"
              memory: "2G"
          env:
            - name: RAY_enable_autoscaler_v2 # Pass env var for the autoscaler v2.
              value: "1"
          volumeMounts:
            - mountPath: /home/ray/samples
              name: ray-example-configmap
        volumes:
          - name: ray-example-configmap
            configMap:
              name: ray-example
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: terminate_detached_actor.py
                  path: terminate_detached_actor.py
        restartPolicy: OnFailure # No restart to avoid reuse of pod for different ray nodes.
  workerGroupSpecs:
  # the Pod replicas in this group typed worker
  - replicas: 0
    minReplicas: 0
    maxReplicas: 10
    groupName: small-group
    rayStartParams: {}
    # Pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: us-central1-docker.pkg.dev/ryanaoleary-gke-dev/ryanaoleary-ray/ray-logs:latest
          resources:
            limits:
              cpu: "1"
              memory: "1G"
            requests:
              cpu: "1"
              memory: "1G"
        restartPolicy: OnFailure # Never restart a pod to avoid pod reuse
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray
    import sys

    @ray.remote(num_cpus=1)
    class Actor:
      pass

    ray.init(namespace="default_namespace")
    Actor.options(name=sys.argv[1], lifetime="detached").remote()

  terminate_detached_actor.py: |
    import ray
    import sys

    ray.init(namespace="default_namespace")
    detached_actor = ray.get_actor(sys.argv[1])
    ray.kill(detached_actor)

kevin85421 · 2025-03-05T02:14:27Z

@ryanaoleary would you mind fixing the CI errors?

ryanaoleary · 2025-03-06T01:43:09Z

I see it passing the test it failed in the CI (test_ray_status) locally, rebasing with master again to see if that fixes the issue:
test_ray_status_output.txt

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2025-03-06T09:25:38Z

It's now passing the CI after 2084439 cc: @kevin85421

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Currently the V2 Autoscaler formats logs by converting the V2 data structure `ClusterStatus` to the V1 structures `AutoscalerSummary` and `LoadMetricsSummary` and then passing them to the legacy `format_info_string`. It'd be useful for the V2 autoscaler to directly format `ClusterStatus` to the correct output log format. This PR refactors `utils.py` to directly format `ClusterStatus`. Additionally, this PR changes the node reports to output `instance_id` rather than `ip_address`, since the latter is not necessarily unique for failed nodes. ## Related issue number Closes ray-project#37856 --------- Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Currently the V2 Autoscaler formats logs by converting the V2 data structure `ClusterStatus` to the V1 structures `AutoscalerSummary` and `LoadMetricsSummary` and then passing them to the legacy `format_info_string`. It'd be useful for the V2 autoscaler to directly format `ClusterStatus` to the correct output log format. This PR refactors `utils.py` to directly format `ClusterStatus`. Additionally, this PR changes the node reports to output `instance_id` rather than `ip_address`, since the latter is not necessarily unique for failed nodes. ## Related issue number Closes ray-project#37856 --------- Signed-off-by: ryanaoleary <ryanaoleary@google.com> Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>

ryanaoleary changed the title ~~Initial commit to refactor v2 format code~~ [Core][Autoscaler] Refactor v2 format code Dec 19, 2024

ryanaoleary changed the title ~~[Core][Autoscaler] Refactor v2 format code~~ [Core][Autoscaler] Refactor v2 Log Formatting Dec 19, 2024

ryanaoleary force-pushed the refactor-v2-logs branch 2 times, most recently from 7589e14 to 30fe22c Compare January 3, 2025 02:37

Initial commit to refactor v2 format code

d36e447

Signed-off-by: ryanaoleary <ryanaoleary@google.com>

ryanaoleary force-pushed the refactor-v2-logs branch from 30fe22c to d36e447 Compare January 3, 2025 02:42

ryanaoleary and others added 2 commits February 11, 2025 07:24

Merge branch 'master' into refactor-v2-logs

9418577

Update refactor

56b4be8

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary marked this pull request as ready for review February 13, 2025 00:44

ryanaoleary requested review from a team and hongchaodeng as code owners February 13, 2025 00:44

ryanaoleary added 2 commits February 13, 2025 00:45

Remove fixed TODO comment

a5750a1

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Remove fixed comments and now unused methods

ff1368b

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Cover index -1 case

aed3f7a

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

kevin85421 reviewed Feb 25, 2025

View reviewed changes

Comment thread python/ray/autoscaler/v2/utils.py Outdated

ryanaoleary added 2 commits February 26, 2025 02:57

Change classmethods to staticmethods

949a295

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Remove StringIO

64776ef

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary requested a review from kevin85421 February 26, 2025 06:38

Refactor and format to make code easier to review

e6488ce

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

kevin85421 reviewed Mar 2, 2025

View reviewed changes

ryanaoleary and others added 3 commits March 4, 2025 11:01

Fix comments

d07db55

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

Merge branch 'master' into refactor-v2-logs

f047ca6

Fix tests and refactor failure_lines sorting

a7e437c

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary requested a review from kevin85421 March 4, 2025 20:18

kevin85421 approved these changes Mar 4, 2025

View reviewed changes

kevin85421 added the go add ONLY when ready to merge, run all tests label Mar 4, 2025

ryanaoleary added 2 commits March 5, 2025 23:00

Merge branch 'master' into refactor-v2-logs

d498395

Merge branch 'master' into refactor-v2-logs

f6db4ea

Fix missing newline causing 0 len seperator

2084439

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary force-pushed the refactor-v2-logs branch from 79c124c to 2084439 Compare March 6, 2025 05:20

kevin85421 reviewed Mar 6, 2025

View reviewed changes

Comment thread python/ray/autoscaler/v2/utils.py Outdated

ryanaoleary requested a review from kevin85421 March 6, 2025 09:56

Return separator length from header_info

f73f023

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

kevin85421 approved these changes Mar 6, 2025

View reviewed changes

kevin85421 assigned edoakes Mar 6, 2025

edoakes merged commit 4a6fbd9 into ray-project:master Mar 6, 2025

hainesmichaelc added the community-backlog label May 22, 2025

Conversation

ryanaoleary commented Dec 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are these changes needed?

Related issue number

Checks

Uh oh!

ryanaoleary commented Feb 13, 2025

Uh oh!

ryanaoleary commented Feb 25, 2025

Uh oh!

kevin85421 left a comment

Choose a reason for hiding this comment

Uh oh!

kevin85421 Feb 25, 2025

Choose a reason for hiding this comment

Uh oh!

ryanaoleary Feb 26, 2025

Choose a reason for hiding this comment

Uh oh!

kevin85421 Mar 2, 2025

Choose a reason for hiding this comment

Uh oh!

ryanaoleary Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryanaoleary Mar 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryanaoleary commented Feb 28, 2025

Uh oh!

kevin85421 Mar 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryanaoleary commented Mar 4, 2025

Uh oh!

kevin85421 commented Mar 5, 2025

Uh oh!

ryanaoleary commented Mar 6, 2025

Uh oh!

ryanaoleary commented Mar 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ryanaoleary commented Dec 19, 2024 •

edited

Loading

ryanaoleary Mar 4, 2025 •

edited

Loading

ryanaoleary Mar 4, 2025 •

edited

Loading