[Core] Autoscaler Node Recovery Ignores Node-Specific Docker Config #53987

Closed
@JohnsonKuan

What happened + What you expected to happen

When the Ray autoscaler recovers a worker node after heartbeat loss, it uses the top-level global docker configuration (docker.image) instead of the node-specific docker configuration (docker.worker_image) defined for that node type in available_node_types.

In python/ray/autoscaler/_private/autoscaler.py, the recover_if_needed() method uses:

docker_config=self.config.get("docker"),  # BUG: Uses top-level global config

Instead of:

docker_config=self._get_node_specific_docker_config(node_id),  # Should use node-specific config
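The intended lookup can be illustrated outside of Ray with a minimal sketch. The function name merged_docker_config and its signature are hypothetical, not Ray's API. The sketch assumes that a node type's docker section should be overlaid on the top-level docker section, so that node-specific keys win and global keys remain as defaults:

```python
import copy
from typing import Any, Dict


def merged_docker_config(config: Dict[str, Any], node_type: str) -> Dict[str, Any]:
    """Hypothetical helper: overlay a node type's docker section on the global one."""
    if "docker" not in config:
        # No docker section at all: nothing to merge.
        return {}
    # Copy first so the shared cluster config is never mutated.
    merged = copy.deepcopy(config["docker"])
    node_types = config.get("available_node_types", {})
    # Node-specific keys (for example worker_image) override or extend the globals.
    merged.update(node_types.get(node_type, {}).get("docker", {}))
    return merged


# Plain-dict version of a cluster config with per-type worker images.
config = {
    "docker": {
        "image": "head-image:version",
        "container_name": "ray-container",
        "pull_before_run": False,
    },
    "available_node_types": {
        "ray.worker.type_a": {"docker": {"worker_image": "worker-a-image:version"}},
        "ray.worker.type_b": {"docker": {"worker_image": "worker-b-image:version"}},
    },
}

type_a = merged_docker_config(config, "ray.worker.type_a")
type_b = merged_docker_config(config, "ray.worker.type_b")
```

With this merge, type_a carries both the global image and its own worker_image, which is the information the recovery path would need to pick the right image for that node.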

Configuration Example

Cluster with multiple worker node types and node-specific docker config

cluster_name: my-cluster

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: true

docker:
  # Top-level global defaults
  image: head-image:version
  container_name: ray-container
  pull_before_run: false

available_node_types:
  ray.worker.type_a:
    docker:
      worker_image: worker-a-image:version
  ray.worker.type_b:
    docker:
      worker_image: worker-b-image:version
...

Expected vs Actual

Expected: Worker nodes should use their node-specific docker.worker_image during recovery
Actual: Worker nodes download and use the global docker.image during recovery

Impact

This breaks heterogeneous clusters where different node types require different Docker images. Worker nodes end up running the wrong image after being recovered due to heartbeat loss. The recovery shows up in monitor.log for affected worker nodes as:

WARNING autoscaler.py:1235 -- StandardAutoscaler: i-xxx: No recent heartbeat, restarting Ray to recover...
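To find which nodes went through this recovery path, the warning quoted above can be matched in monitor.log. This is a rough sketch that assumes the log line format is exactly as quoted; the function name and the second sample line are placeholders for illustration:

```python
import re
from typing import Iterable, List

# Pattern built from the warning text quoted in this issue.
HEARTBEAT_RE = re.compile(
    r"StandardAutoscaler: (?P<node_id>\S+): No recent heartbeat, restarting Ray to recover"
)


def nodes_restarted_for_heartbeat(log_lines: Iterable[str]) -> List[str]:
    """Return node IDs, in first-seen order, that were restarted after heartbeat loss."""
    seen: List[str] = []
    for line in log_lines:
        match = HEARTBEAT_RE.search(line)
        if match and match.group("node_id") not in seen:
            seen.append(match.group("node_id"))
    return seen


sample = [
    "WARNING autoscaler.py:1235 -- StandardAutoscaler: i-xxx: "
    "No recent heartbeat, restarting Ray to recover...",
    "INFO placeholder line that should not match",
]
affected = nodes_restarted_for_heartbeat(sample)
```

Each node ID returned this way is a candidate for checking which Docker image its container is actually running.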

Versions / Dependencies

Ray 2.45.0

Reproduction script

NA

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), stability