
[cluster launcher] Fix ray down not stopping Docker containers on worker nodes for local clusters#62169

Merged
edoakes merged 21 commits into ray-project:master from dev-miro26:fix-ray-down-local-docker-worker-teardown
Apr 22, 2026

Conversation

@dev-miro26 (Contributor) commented Mar 29, 2026

Description

ray down on an SSH Docker cluster stops the head container but skips all workers. Their Docker containers keep running indefinitely.

Root cause: Two separate LocalNodeProvider instances maintain independent state files — one on the machine invoking ray down and one on the head node (managed by the autoscaler). Workers are only ever marked "running" by the head's autoscaler in its own /tmp/ray/cluster-<name>.state file. The invoking machine's state file initializes workers as "terminated" in ClusterState.__init__ and never receives those updates. When teardown_cluster calls remaining_nodes(), which goes through provider.non_terminated_nodes(), all workers are filtered out, so the docker stop loop has nothing to iterate.

Fix: Add NodeProvider.get_all_node_ids(tag_filters) that returns all known node IDs regardless of state. The base class delegates to non_terminated_nodes() (no behavior change for cloud providers that query live infrastructure). LocalNodeProvider overrides it to skip the state == "terminated" filter. teardown_cluster now uses get_all_node_ids to build the Docker stop target list, ensuring worker containers are stopped even when the local state file is stale.
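The shape of the fix can be sketched as follows. This is a minimal, self-contained approximation: the real NodeProvider and LocalNodeProvider classes carry far more machinery, and the dict-shaped state used here (mirroring the /tmp/ray/cluster-<name>.state layout) is an assumption for illustration.

```python
class NodeProvider:
    """Base provider: by default, "all node ids" just means live nodes."""

    def non_terminated_nodes(self, tag_filters):
        raise NotImplementedError

    def get_all_node_ids(self, tag_filters):
        # Default delegation: no behavior change for cloud providers,
        # which query live infrastructure anyway.
        return self.non_terminated_nodes(tag_filters)


class LocalNodeProvider(NodeProvider):
    def __init__(self, state):
        # `state` stands in for the cluster state file contents:
        # {node_ip: {"state": ..., "tags": {...}}}
        self.state = state

    def non_terminated_nodes(self, tag_filters):
        return [
            ip
            for ip, info in self.state.items()
            if info["state"] != "terminated"
            and all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]

    def get_all_node_ids(self, tag_filters):
        # Same tag filtering, but without skipping "terminated" entries,
        # so a stale local state file still yields Docker stop targets.
        return [
            ip
            for ip, info in self.state.items()
            if all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]
```

With this split, teardown code that needs every known container target calls get_all_node_ids, while code that must only see live nodes keeps calling non_terminated_nodes.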

Related issues

Closes: #62058

Additional information

Files changed:

  • python/ray/autoscaler/node_provider.py — Added get_all_node_ids() to base NodeProvider class (defaults to non_terminated_nodes)
  • python/ray/autoscaler/_private/local/node_provider.py — LocalNodeProvider override that includes terminated nodes
  • python/ray/autoscaler/_private/commands.py — teardown_cluster Docker stop phase uses get_all_node_ids instead of remaining_nodes()
  • python/ray/tests/test_coordinator_server.py — Added testGetAllNodeIdsIncludesTerminated with _make_local_provider helper

Backward compatibility: The terminate_nodes loop is unchanged (still uses non_terminated_nodes). For cloud providers (AWS, GCP, Azure, etc.), get_all_node_ids delegates to non_terminated_nodes, so behavior is identical. The only change is that Docker stop now targets all configured local nodes during teardown.

@dev-miro26 dev-miro26 requested a review from a team as a code owner March 29, 2026 05:52
@gemini-code-assist (Bot) left a comment
Code Review

This pull request introduces the get_all_node_ids method to the NodeProvider interface and implements it for LocalNodeProvider to ensure Docker containers are correctly stopped during teardown, even when local state is out of sync. The teardown logic is updated to use this method, and unit tests are added to verify the fix. Reviewer feedback suggests refactoring duplicated logic into a helper, using Pythonic list comprehensions for node filtering, correcting an invalid IP address in tests, and improving test robustness with set-based assertions.

Comment on lines +567 to +582
docker_workers = provider.get_all_node_ids(
    {TAG_RAY_NODE_KIND: NODE_KIND_WORKER}
)
if keep_min_workers:
    min_workers = config.get("min_workers", 0)
    if len(docker_workers) > min_workers:
        docker_workers = random.sample(
            docker_workers, len(docker_workers) - min_workers
        )
if workers_only:
    docker_stop_nodes = docker_workers
else:
    docker_heads = provider.get_all_node_ids(
        {TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
    )
    docker_stop_nodes = docker_heads + docker_workers
Severity: medium

This logic for determining which nodes to stop is very similar to the remaining_nodes function defined earlier in this file. This duplication could be avoided by abstracting the common logic into a helper function that accepts the node-retrieval function (e.g., provider.get_all_node_ids or provider.non_terminated_nodes) as an argument. This would improve maintainability and reduce code duplication. While refactoring remaining_nodes is outside the scope of this change, creating a helper here would be a good first step.

Comment on lines +227 to +237
workers = self.state.get()
matching_ips = []
for worker_ip, info in workers.items():
    ok = True
    for k, v in tag_filters.items():
        if info["tags"].get(k) != v:
            ok = False
            break
    if ok:
        matching_ips.append(worker_ip)
return matching_ips
Severity: medium

This logic can be simplified using a more concise and Pythonic list comprehension with all().

        workers = self.state.get()
        return [
            worker_ip
            for worker_ip, info in workers.items()
            if all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]

"type": "local",
"head_ip": head_ip,
"worker_ips": worker_ips,
"external_head_ip": "0.0.0.0.3",
Severity: medium

The IP address "0.0.0.0.3" is invalid. While it may not affect the current tests, it's incorrect and could lead to confusion or failures in the future. Please use a valid IPv4 address format.

Suggested change
"external_head_ip": "0.0.0.0.3",
"external_head_ip": "1.2.3.4",

Comment on lines +111 to +113
assert provider.get_all_node_ids(
    {TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
) == [head_ip]
Severity: medium

The order of nodes returned by get_all_node_ids is not guaranteed, as it depends on dictionary iteration order. To make this assertion more robust, it's better to compare sets instead of lists.

Suggested change
assert provider.get_all_node_ids(
{TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
) == [head_ip]
assert set(provider.get_all_node_ids(
{TAG_RAY_NODE_KIND: NODE_KIND_HEAD}
)) == {head_ip}

Comment thread python/ray/autoscaler/_private/commands.py Outdated
@dev-miro26 dev-miro26 force-pushed the fix-ray-down-local-docker-worker-teardown branch from 711a74c to 54d5082 Compare March 29, 2026 05:58
Comment thread python/ray/autoscaler/_private/commands.py Outdated
@dev-miro26 dev-miro26 force-pushed the fix-ray-down-local-docker-worker-teardown branch from 76b4943 to c1c436f Compare March 29, 2026 06:15
@dev-miro26 dev-miro26 requested review from a team as code owners March 29, 2026 06:15
@dev-miro26 dev-miro26 force-pushed the fix-ray-down-local-docker-worker-teardown branch from c1c436f to af7e115 Compare March 29, 2026 06:19
Comment thread python/ray/tests/test_coordinator_server.py Outdated
Comment thread python/ray/autoscaler/_private/commands.py Outdated
@dev-miro26
Contributor Author

Hi @edoakes,
Could you please review my first PR? Any feedback is welcome.
Thank you.

@ray-gardener (Bot) added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Mar 29, 2026
@aslonnie aslonnie removed request for a team March 30, 2026 05:34
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label Mar 31, 2026
@rueian rueian self-assigned this Apr 1, 2026
@edoakes
Collaborator

edoakes commented Apr 2, 2026

@dev-miro26 thanks for the contribution. @rueian will review the PR shortly

matching_ips.append(worker_ip)
return matching_ips

def get_all_node_ids(self, tag_filters):
Contributor:
I think this should be named as something like nodes_for_teardown to make it clear that this is only for the teardown function.

# started them and their Docker containers are still running.
# For cloud providers this adds nothing because get_all_node_ids
# delegates to non_terminated_nodes.
stale_terminated = set(provider.get_all_node_ids({})) - set(
Contributor:
Shouldn't we just fix the remaining_nodes function with our new method on the node provider? Do we really need the changes here?

@dev-miro26 dev-miro26 force-pushed the fix-ray-down-local-docker-worker-teardown branch from ca65bd7 to bd568a0 Compare April 6, 2026 01:53
…orker nodes for local clusters

`ray down` on an SSH Docker cluster stops the head container but skips
workers. The root cause is that LocalNodeProvider on the invoking machine
maintains a separate state file from the head node's autoscaler — workers
are never marked as running in the local file, so `non_terminated_nodes`
returns an empty list and the `docker stop` loop has nothing to iterate.

Add `NodeProvider.get_all_node_ids(tag_filters)` which includes terminated
nodes. The base class delegates to `non_terminated_nodes()` (no change for
cloud providers). `LocalNodeProvider` overrides it to skip the terminated
filter. `teardown_cluster` now uses this for the Docker stop phase.

Extract `_collect_nodes(node_retrieval_fn)` helper to deduplicate the
worker/head selection logic between `remaining_nodes()` and the Docker
stop target list.

Made-with: Cursor
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
… assertions, avoid double-sampling

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…ress format

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…rkers by setting workers to an empty list. This ensures proper cleanup during cluster teardown.

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…Update references in the teardown_cluster function and related tests to reflect this change, ensuring functionality remains intact for node identification during teardown processes.

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…The method now reflects its purpose of including terminated nodes for teardown processes.

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
@dev-miro26 dev-miro26 force-pushed the fix-ray-down-local-docker-worker-teardown branch from bd568a0 to 4c107c4 Compare April 6, 2026 01:58
@dev-miro26 dev-miro26 requested a review from rueian April 6, 2026 06:15
Comment on lines +563 to +578
# Start with A (which already respects --keep-min-workers and
# --workers-only). On top of that, include nodes the provider
# knows about but reports as terminated. This handles
# LocalNodeProvider where the invoking machine's state file
# marks workers as terminated even though the head's autoscaler
# started them and their Docker containers are still running.
# For cloud providers this adds nothing because nodes_for_teardown
# delegates to non_terminated_nodes.
stale_terminated = set(provider.nodes_for_teardown({})) - set(
    provider.non_terminated_nodes({})
)
if workers_only:
    stale_terminated -= set(
        provider.nodes_for_teardown({TAG_RAY_NODE_KIND: NODE_KIND_HEAD})
    )
docker_stop_nodes = list(set(A) | stale_terminated)
Contributor:
Are these changes necessary? Can't we use the nodes_for_teardown in the remaining_nodes?

Contributor (author):
let me check and update asap

Contributor (author):
remaining_nodes() is also used as the exit condition in the teardown loop

Contributor:
Yes, but can't these be moved into the nodes_for_teardown of the local node provider?

@dev-miro26 (Contributor, author) Apr 12, 2026
I think we can't move this into nodes_for_teardown because the provider does not know about --keep-min-workers.
Moving it into the provider would break the abstraction.

Also, remaining_nodes() controls a while loop; if it returned terminated nodes, the loop would never stop.

Please let me know your opinion.
Thanks

Contributor:
Could you also modify terminate_node to add some additional flags so that nodes_for_teardown won't return terminated nodes?

Contributor (author):
Added teardown_complete flag to prevent re-targeting terminated nodes in nodes_for_teardown
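A rough sketch of that idea, with field names assumed for illustration: terminate_node records a completion flag in the node's state entry, and nodes_for_teardown skips flagged nodes so a second pass cannot re-target them.

```python
class LocalProviderSketch:
    def __init__(self, state):
        # {node_ip: {"state": ..., "tags": {...}}}
        self.state = state

    def terminate_node(self, node_id):
        info = self.state[node_id]
        info["state"] = "terminated"
        # Flag that *this* teardown already handled the node, making it
        # distinguishable from a stale "terminated" entry whose Docker
        # container was never actually stopped.
        info["teardown_complete"] = True

    def nodes_for_teardown(self, tag_filters):
        return [
            ip
            for ip, info in self.state.items()
            if not info.get("teardown_complete")
            and all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]
```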

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
Comment thread python/ray/autoscaler/_private/commands.py Outdated
@dev-miro26
Contributor Author

Please review my last changes again.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit e742286.

Comment thread python/ray/autoscaler/_private/commands.py Outdated
… remaining_nodes

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
@dev-miro26 dev-miro26 requested a review from rueian April 12, 2026 17:37
…n nodes_for_teardown

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
head = provider.non_terminated_nodes({TAG_RAY_NODE_KIND: NODE_KIND_HEAD})

return head + workers
return _nodes_to_teardown(provider.non_terminated_nodes)
Contributor:
Can't this be _nodes_to_teardown(provider.nodes_for_teardown)?


def remaining_nodes():
workers = provider.non_terminated_nodes({TAG_RAY_NODE_KIND: NODE_KIND_WORKER})
def _nodes_to_teardown(get_nodes):
Contributor:
Can you merge this back to remaining_nodes?

Comment on lines +583 to +584
all_teardown = set(_nodes_to_teardown(provider.nodes_for_teardown))
docker_stop_nodes = list(set(A) | all_teardown)
@rueian (Contributor) Apr 16, 2026
Do we still need this?

@dev-miro26
Contributor Author

Hi, @rueian
Hope you had a great weekend.
Could you please review again?
Thanks

@rueian
Contributor

rueian commented Apr 22, 2026

Hi, @rueian Hope you had a great weekend. Could you please review again? Thanks

LGTM!

Hi @edoakes, please help merge this PR 🙏

@edoakes edoakes merged commit ec1779b into ray-project:master Apr 22, 2026
6 checks passed
@dev-miro26
Contributor Author

Thanks


Labels

  • community-contribution — Contributed by the community
  • core — Issues that should be addressed in Ray Core
  • go — add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Cluster] worker docker containers are not stopped when ray down is invoked for an ssh docker cluster

4 participants