[cluster launcher] Fix ray down not stopping Docker containers on worker nodes for local clusters #62169
Conversation
Code Review
This pull request introduces the get_all_node_ids method to the NodeProvider interface and implements it for LocalNodeProvider to ensure Docker containers are correctly stopped during teardown, even when local state is out of sync. The teardown logic is updated to use this method, and unit tests are added to verify the fix. Reviewer feedback suggests refactoring duplicated logic into a helper, using Pythonic list comprehensions for node filtering, correcting an invalid IP address in tests, and improving test robustness with set-based assertions.
```python
docker_workers = provider.get_all_node_ids(
    {TAG_RAY_NODE_KIND: NODE_KIND_WORKER}
)
if keep_min_workers:
    min_workers = config.get("min_workers", 0)
    if len(docker_workers) > min_workers:
        docker_workers = random.sample(
            docker_workers, len(docker_workers) - min_workers
        )
if workers_only:
    docker_stop_nodes = docker_workers
else:
    docker_heads = provider.get_all_node_ids(
        {TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
    )
    docker_stop_nodes = docker_heads + docker_workers
```
This logic for determining which nodes to stop is very similar to the remaining_nodes function defined earlier in this file. This duplication could be avoided by abstracting the common logic into a helper function that accepts the node-retrieval function (e.g., provider.get_all_node_ids or provider.non_terminated_nodes) as an argument. This would improve maintainability and reduce code duplication. While refactoring remaining_nodes is outside the scope of this change, creating a helper here would be a good first step.
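A minimal sketch of what such a helper could look like, assuming the tag constants from `ray.autoscaler.tags` and a node-retrieval function passed in (the name `_collect_nodes` and its exact signature are illustrative, not the code in this PR):

```python
import random

from ray.autoscaler.tags import NODE_KIND_HEAD, NODE_KIND_WORKER, TAG_RAY_NODE_KIND


def _collect_nodes(get_nodes, config, workers_only, keep_min_workers):
    """Select the head/worker node ids to act on using `get_nodes`.

    `get_nodes` can be provider.non_terminated_nodes (for remaining_nodes)
    or the new provider.get_all_node_ids (for the docker stop phase).
    """
    workers = get_nodes({TAG_RAY_NODE_KIND: NODE_KIND_WORKER})
    if keep_min_workers:
        min_workers = config.get("min_workers", 0)
        if len(workers) > min_workers:
            # Keep min_workers nodes; only the remainder are acted on.
            workers = random.sample(workers, len(workers) - min_workers)
    if workers_only:
        return workers
    heads = get_nodes({TAG_RAY_NODE_KIND: NODE_KIND_HEAD})
    return heads + workers
```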
```python
workers = self.state.get()
matching_ips = []
for worker_ip, info in workers.items():
    ok = True
    for k, v in tag_filters.items():
        if info["tags"].get(k) != v:
            ok = False
            break
    if ok:
        matching_ips.append(worker_ip)
return matching_ips
```
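The review summary above asks for a more Pythonic filter here; a minimal sketch of that rewrite (same behavior as the loop, assuming `self.state.get()` returns the ip-to-info mapping used above):

```python
def get_all_node_ids(self, tag_filters):
    # Equivalent filtering expressed as a list comprehension: keep the ips
    # whose tags match every (key, value) pair in tag_filters.
    workers = self.state.get()
    return [
        worker_ip
        for worker_ip, info in workers.items()
        if all(info["tags"].get(k) == v for k, v in tag_filters.items())
    ]
```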
| "type": "local", | ||
| "head_ip": head_ip, | ||
| "worker_ips": worker_ips, | ||
| "external_head_ip": "0.0.0.0.3", |
```python
assert provider.get_all_node_ids(
    {TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
) == [head_ip]
```
The order of nodes returned by get_all_node_ids is not guaranteed, as it depends on dictionary iteration order. To make this assertion more robust, it's better to compare sets instead of lists.
```diff
-assert provider.get_all_node_ids(
-    {TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
-) == [head_ip]
+assert set(provider.get_all_node_ids(
+    {TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
+)) == {head_ip}
```
Force-pushed from 711a74c to 54d5082.
Force-pushed from 76b4943 to c1c436f.
Force-pushed from c1c436f to af7e115.
Hi, @edoakes

@dev-miro26 thanks for the contribution. @rueian will review the PR shortly.
```python
            matching_ips.append(worker_ip)
        return matching_ips

    def get_all_node_ids(self, tag_filters):
```
I think this should be named something like nodes_for_teardown to make it clear that it is only for the teardown function.
```python
# started them and their Docker containers are still running.
# For cloud providers this adds nothing because get_all_node_ids
# delegates to non_terminated_nodes.
stale_terminated = set(provider.get_all_node_ids({})) - set(
```
Shouldn't we just fix the remaining_nodes function with our new method on the node provider? Do we really need the changes here?
Force-pushed from ca65bd7 to bd568a0.
…orker nodes for local clusters `ray down` on an SSH Docker cluster stops the head container but skips workers. The root cause is that LocalNodeProvider on the invoking machine maintains a separate state file from the head node's autoscaler — workers are never marked as running in the local file, so `non_terminated_nodes` returns an empty list and the `docker stop` loop has nothing to iterate. Add `NodeProvider.get_all_node_ids(tag_filters)` which includes terminated nodes. The base class delegates to `non_terminated_nodes()` (no change for cloud providers). `LocalNodeProvider` overrides it to skip the terminated filter. `teardown_cluster` now uses this for the Docker stop phase. Extract `_collect_nodes(node_retrieval_fn)` helper to deduplicate the worker/head selection logic between `remaining_nodes()` and the Docker stop target list. Made-with: Cursor Signed-off-by: dev-miro26 <devmiro26@gmail.com>
… assertions, avoid double-sampling Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…ress format Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…rkers by setting workers to an empty list. This ensures proper cleanup during cluster teardown. Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…Update references in the teardown_cluster function and related tests to reflect this change, ensuring functionality remains intact for node identification during teardown processes. Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…The method now reflects its purpose of including terminated nodes for teardown processes. Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Force-pushed from bd568a0 to 4c107c4.
```python
# Start with A (which already respects --keep-min-workers and
# --workers-only). On top of that, include nodes the provider
# knows about but reports as terminated. This handles
# LocalNodeProvider where the invoking machine's state file
# marks workers as terminated even though the head's autoscaler
# started them and their Docker containers are still running.
# For cloud providers this adds nothing because nodes_for_teardown
# delegates to non_terminated_nodes.
stale_terminated = set(provider.nodes_for_teardown({})) - set(
    provider.non_terminated_nodes({})
)
if workers_only:
    stale_terminated -= set(
        provider.nodes_for_teardown({TAG_RAY_NODE_KIND: NODE_KIND_HEAD})
    )
docker_stop_nodes = list(set(A) | stale_terminated)
```
Are these changes necessary? Can't we use the nodes_for_teardown in the remaining_nodes?
let me check and update asap
remaining_nodes() is also used as the exit condition in the teardown loop
Yes, but can't these be moved into the nodes_for_teardown of the local node provider?
I think we can't move this into nodes_for_teardown because the provider does not know about --keep-min-workers.
Moving it into the provider would break the abstraction.
Also, remaining_nodes() controls a while loop; if it returned terminated nodes, the loop would never stop.
Please let me know your opinion.
Thanks
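To make the exit-condition concern concrete, a simplified sketch of the loop shape being discussed (paraphrased; `provider` and `remaining_nodes` are stand-ins for the objects in `teardown_cluster`, and the poll interval is illustrative):

```python
import time


def _terminate_loop(provider, remaining_nodes, poll_interval_s=1.0):
    # If remaining_nodes() also returned already-terminated nodes, this list
    # would never become empty and the loop below would never exit.
    nodes = remaining_nodes()
    while nodes:
        provider.terminate_nodes(nodes)
        time.sleep(poll_interval_s)
        nodes = remaining_nodes()
```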
Could you also modify terminate_node to add some additional flags so that nodes_for_teardown won't return terminated nodes?
Added teardown_complete flag to prevent re-targeting terminated nodes in nodes_for_teardown
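A minimal sketch of how such a flag could be wired up, assuming the state-file access pattern of the local provider (the flag and method names follow the discussion above, but the surrounding code is illustrative, not the PR diff):

```python
class LocalNodeProviderSketch:
    """Illustrative only; mirrors the shape of LocalNodeProvider.

    self.state is assumed to be the cluster-state wrapper that maps node ip
    to an info dict with "tags" and "state" keys.
    """

    def terminate_node(self, node_id):
        info = self.state.get()[node_id]
        info["state"] = "terminated"
        # Mark the node so later teardown passes do not re-target it.
        info["teardown_complete"] = True
        self.state.put(node_id, info)

    def nodes_for_teardown(self, tag_filters):
        return [
            ip
            for ip, info in self.state.get().items()
            if not info.get("teardown_complete")
            and all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]
```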
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
… teardown Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Please review my last changes again.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e742286.
… remaining_nodes Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…n nodes_for_teardown Signed-off-by: dev-miro26 <devmiro26@gmail.com>
```diff
-    head = provider.non_terminated_nodes({TAG_RAY_NODE_KIND: NODE_KIND_HEAD})
-
-    return head + workers
+    return _nodes_to_teardown(provider.non_terminated_nodes)
```
Can't this be _nodes_to_teardown(provider.nodes_for_teardown)?
…iltering Signed-off-by: dev-miro26 <devmiro26@gmail.com>
```diff
-def remaining_nodes():
-    workers = provider.non_terminated_nodes({TAG_RAY_NODE_KIND: NODE_KIND_WORKER})
+def _nodes_to_teardown(get_nodes):
```
Can you merge this back to remaining_nodes?
```python
all_teardown = set(_nodes_to_teardown(provider.nodes_for_teardown))
docker_stop_nodes = list(set(A) | all_teardown)
```
…odes Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Hi, @rueian

Thanks

Description
`ray down` on an SSH Docker cluster stops the head container but skips all workers. Their Docker containers keep running indefinitely.

Root cause: Two separate `LocalNodeProvider` instances maintain independent state files — one on the machine invoking `ray down` and one on the head node (managed by the autoscaler). Workers are only ever marked `"running"` by the head's autoscaler in its own `/tmp/ray/cluster-<name>.state` file. The invoking machine's state file initializes workers as `"terminated"` in `ClusterState.__init__` and never receives those updates. When `teardown_cluster` calls `remaining_nodes()` → `provider.non_terminated_nodes()`, all workers are filtered out, so the `docker stop` loop has nothing to iterate.

Fix: Add `NodeProvider.get_all_node_ids(tag_filters)` that returns all known node IDs regardless of state. The base class delegates to `non_terminated_nodes()` (no behavior change for cloud providers that query live infrastructure). `LocalNodeProvider` overrides it to skip the `state == "terminated"` filter. `teardown_cluster` now uses `get_all_node_ids` to build the Docker stop target list, ensuring worker containers are stopped even when the local state file is stale.
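A condensed sketch of the two methods described above (simplified for illustration; the actual diff lives in the files listed under Additional information, and the `self.state` access shown for the local provider is an assumption about its state-file wrapper):

```python
from typing import Dict, List


class NodeProvider:
    def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]:
        raise NotImplementedError

    def get_all_node_ids(self, tag_filters: Dict[str, str]) -> List[str]:
        # Default: cloud providers query live infrastructure, so "all known
        # nodes" is the same as the non-terminated set.
        return self.non_terminated_nodes(tag_filters)


class LocalNodeProvider(NodeProvider):
    def get_all_node_ids(self, tag_filters: Dict[str, str]) -> List[str]:
        # Include nodes the local state file marks as terminated, since the
        # head node's autoscaler may have started their containers anyway.
        return [
            ip
            for ip, info in self.state.get().items()
            if all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]
```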
Related issues

Closes: #62058
Additional information
Files changed:
- `python/ray/autoscaler/node_provider.py` — Added `get_all_node_ids()` to the base `NodeProvider` class (defaults to `non_terminated_nodes`)
- `python/ray/autoscaler/_private/local/node_provider.py` — `LocalNodeProvider` override that includes terminated nodes
- `python/ray/autoscaler/_private/commands.py` — `teardown_cluster` Docker stop phase uses `get_all_node_ids` instead of `remaining_nodes()`
- `python/ray/tests/test_coordinator_server.py` — Added `testGetAllNodeIdsIncludesTerminated` with a `_make_local_provider` helper

Backward compatibility: The `terminate_nodes` loop is unchanged (still uses `non_terminated_nodes`). For cloud providers (AWS, GCP, Azure, etc.), `get_all_node_ids` delegates to `non_terminated_nodes`, so behavior is identical. The only change is that Docker stop now targets all configured local nodes during teardown.