### Description

**What happened + What you expected to happen**
When the Ray autoscaler recovers worker nodes after heartbeat loss, it uses the top-level global `docker.image` configuration instead of the node-specific `docker.worker_image` configuration defined in `available_node_types`.
In `python/ray/autoscaler/_private/autoscaler.py`, the `recover_if_needed()` method uses:

```python
docker_config=self.config.get("docker"),  # BUG: Uses top-level global config
```

instead of:

```python
docker_config=self._get_node_specific_docker_config(node_id),  # Should use node-specific config
```
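For context, here is a minimal sketch of the merge the node-specific path is expected to perform. `resolve_docker_config` is a hypothetical standalone helper for illustration only, not Ray's actual `_get_node_specific_docker_config` (which resolves a node's type from its `node_id` via the node provider); it simply lets a node type's `docker` keys override the top-level block:

```python
from typing import Any, Dict


def resolve_docker_config(cluster_config: Dict[str, Any],
                          node_type: str) -> Dict[str, Any]:
    """Merge the top-level docker block with a node type's docker overrides.

    Hypothetical illustration of the expected behavior -- not Ray's
    implementation. Node-type keys win over the global defaults.
    """
    merged = dict(cluster_config.get("docker", {}))
    node_types = cluster_config.get("available_node_types", {})
    merged.update(node_types.get(node_type, {}).get("docker", {}))
    return merged
```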
#### Configuration Example
Cluster with multiple worker node types and node-specific docker config:

```yaml
cluster_name: my-cluster

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: true

docker:
  # Top-level global defaults
  image: head-image:version
  container_name: ray-container
  pull_before_run: false

available_node_types:
  ray.worker.type_a:
    docker:
      worker_image: worker-a-image:version
  ray.worker.type_b:
    docker:
      worker_image: worker-b-image:version
  ...
```
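Applied to this config, the hedged `resolve_docker_config` sketch above illustrates the resolution each worker type should get:

```python
config = {
    "docker": {
        "image": "head-image:version",
        "container_name": "ray-container",
        "pull_before_run": False,
    },
    "available_node_types": {
        "ray.worker.type_a": {"docker": {"worker_image": "worker-a-image:version"}},
        "ray.worker.type_b": {"docker": {"worker_image": "worker-b-image:version"}},
    },
}

resolved = resolve_docker_config(config, "ray.worker.type_a")
assert resolved["worker_image"] == "worker-a-image:version"
assert resolved["container_name"] == "ray-container"  # still inherited from the global block
```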
#### Expected vs Actual
- **Expected:** Worker nodes use their node-specific `docker.worker_image` during recovery.
- **Actual:** Worker nodes download and use the global `docker.image` during recovery.
#### Impact
This breaks heterogeneous clusters where different node types require different Docker images: worker nodes end up running the wrong image after a heartbeat-loss recovery. The recovery shows up in `monitor.log` for the affected worker nodes as:

```
WARNING autoscaler.py:1235 -- StandardAutoscaler: i-xxx: No recent heartbeat, restarting Ray to recover...
```
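To confirm which image a recovered worker is actually running, one illustrative check (assuming the `ray-container` name from the config above; `docker inspect --format` is standard Docker CLI) is:

```python
import subprocess

# Run on an affected worker node. Assumes the container name
# "ray-container" from the cluster config above.
image = subprocess.check_output(
    ["docker", "inspect", "--format", "{{.Config.Image}}", "ray-container"],
    text=True,
).strip()
print(f"running image: {image}")
# After a heartbeat-loss recovery hitting this bug, this prints the global
# head-image:version instead of the node type's worker image.
```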
### Versions / Dependencies
Ray 2.45.0
### Reproduction script
NA
### Issue Severity
Medium: It is a significant difficulty but I can work around it.