Conversation

Contributor

@Sparks0219 Sparks0219 commented Oct 28, 2025

Description

To simulate transient network errors we initially planned to use AWS FIS testing, but there are a couple of problems:
1. The minimum fault duration is 60 seconds, which causes node death. We weren't able to determine whether this could be fixed by tuning ray_config env variables or whether it was due to some mechanism in Anyscale. Figuring that out would take a decent amount of time plus coordination with infra, so we decided to try the iptables route first.
2. The FIS experiment would introduce cross-AZ transient network errors across all clusters in staging.

Instead, we explored SSH-ing into each node and modifying its iptables to drop all inbound/outbound traffic except intra-node traffic, simulating cross-AZ transient failures for fault-tolerance testing. It wasn't too bad to implement and works pretty well: you can tune the blackout duration to whatever you want, and it only affects the nodes in your own cluster, so it avoids both downsides listed above. Adding the script to the repo since I think it could be helpful for future chaos testing.
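
For context, a minimal sketch of what the per-node fan-out looks like (my illustration, not the exact script): it assumes the node IPs come from the public ray.nodes() API, that passwordless SSH to each node works, and that iptables_cmd is the blackout command builder shown in the review excerpts below (a trivial placeholder is used here so the snippet stands alone).

import subprocess
from concurrent.futures import ThreadPoolExecutor

import ray


def iptables_cmd(self_ip: str) -> str:
    # Placeholder for the real builder shown in the review below, which
    # emits the nohup/iptables blackout one-liner for this node.
    return f"echo 'would blackout {self_ip}'"


def blackout_node(ip: str, cmd: str) -> int:
    # Run the blackout command on one node over SSH. BatchMode avoids
    # hanging on a password prompt if key-based auth is not set up.
    return subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "StrictHostKeyChecking=no", ip, cmd],
        capture_output=True,
    ).returncode


ray.init(address="auto")
ips = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
with ThreadPoolExecutor(max_workers=32) as pool:
    exit_codes = list(pool.map(lambda ip: blackout_node(ip, iptables_cmd(ip)), ips))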

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 28, 2025 06:01
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new script to simulate transient network failures across availability zones by manipulating iptables on Ray nodes. This is a useful tool for testing fault tolerance.

My review focuses on improving the robustness and maintainability of the script. The main points are:

  • The iptables commands should be made idempotent by using a custom chain to avoid potential issues with leftover rules.
  • The script should use public Ray APIs instead of internal ones to ensure it doesn't break with future Ray updates.
  • The default concurrency for SSH connections is very high and could be lowered to prevent resource exhaustion.

Overall, the script is well-structured and the approach is sound. The suggested changes will make it more reliable and easier to maintain.

Comment on lines 29 to 50
def iptables_cmd(self_ip: str) -> str:
return f"""\
nohup setsid bash -lc '
sudo iptables -w -A INPUT -p tcp --dport 22 -j ACCEPT
sudo iptables -w -A OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -w -A INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A INPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -A OUTPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -A INPUT -j DROP
sudo iptables -w -A OUTPUT -j DROP
sleep {SECONDS}
sudo iptables -w -D OUTPUT -j DROP
sudo iptables -w -D INPUT -j DROP
sudo iptables -w -D OUTPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -D INPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -D OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -D INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -D OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -w -D INPUT -p tcp --dport 22 -j ACCEPT
' &>/dev/null &
"""
Contributor


high

The current implementation of iptables_cmd is not idempotent. If the script is run twice on the same node without the first run completing its cleanup, it will add a second set of rules. The cleanup phase will only remove one set, potentially leaving DROP rules in place and causing a permanent network partition.

A more robust approach is to use a dedicated iptables chain. This makes the operation atomic and idempotent, as you can flush and delete the chain at the start and end of the script.

def iptables_cmd(self_ip: str) -> str:
    return f"""\
nohup setsid bash -lc '
  CHAIN_NAME="RAY_NETWORK_BLACKOUT"
  # Clean up previous runs if any, ignoring errors if chain/rules don't exist
  sudo iptables -w -D INPUT -j $CHAIN_NAME 2>/dev/null || true
  sudo iptables -w -D OUTPUT -j $CHAIN_NAME 2>/dev/null || true
  sudo iptables -w -F $CHAIN_NAME 2>/dev/null || true
  sudo iptables -w -X $CHAIN_NAME 2>/dev/null || true

  # Create a new chain
  sudo iptables -w -N $CHAIN_NAME
  # Rules for allowed traffic
  sudo iptables -w -A $CHAIN_NAME -p tcp --dport 22 -j ACCEPT
  sudo iptables -w -A $CHAIN_NAME -p tcp --sport 22 -j ACCEPT
  sudo iptables -w -A $CHAIN_NAME -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
  sudo iptables -w -A $CHAIN_NAME -s {self_ip} -d {self_ip} -j ACCEPT
  # Drop everything else in this chain
  sudo iptables -w -A $CHAIN_NAME -j DROP

  # Insert the chain at the top of INPUT and OUTPUT
  sudo iptables -w -I INPUT 1 -j $CHAIN_NAME
  sudo iptables -w -I OUTPUT 1 -j $CHAIN_NAME

  sleep {SECONDS}

  # Cleanup
  sudo iptables -w -D INPUT -j $CHAIN_NAME
  sudo iptables -w -D OUTPUT -j $CHAIN_NAME
  sudo iptables -w -F $CHAIN_NAME
  sudo iptables -w -X $CHAIN_NAME
' &>/dev/null &
"""

Contributor


didn't the other ai want chains too...

Contributor Author


I tried this and something is breaking 💀



SECONDS = 5 # blackout duration (seconds)
PARALLEL = 500 # concurrent SSH sessions
Contributor


medium

A value of 500 for PARALLEL is quite high and may cause issues. A ThreadPoolExecutor with 500 worker threads can consume a significant amount of memory (for thread stacks) on the machine running the script. Additionally, opening 500 concurrent SSH sessions might hit system limits (like file descriptors or MaxStartups on the SSH server). It would be safer to use a more conservative number, or make it configurable based on the environment. For example, a value closer to the number of CPU cores or a number like 32 or 64 is often a safer default.
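
One possible mitigation (a sketch, not part of this PR; BLACKOUT_PARALLEL is a hypothetical variable name) is to cap the worker count by the cluster size and allow an override:

import os

DEFAULT_PARALLEL = 64  # conservative default for concurrent SSH sessions


def resolve_parallelism(num_nodes: int) -> int:
    # Never spawn more workers than there are nodes, and let an env var
    # override the default for very large clusters.
    requested = int(os.environ.get("BLACKOUT_PARALLEL", DEFAULT_PARALLEL))
    return max(1, min(requested, num_nodes))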

@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 28, 2025 06:12
Contributor

@dayshah dayshah left a comment


this is amazing 🤩



SECONDS = 5 # blackout duration (seconds)
PARALLEL = 500 # concurrent SSH sessions
Contributor

@dayshah dayshah Oct 28, 2025


why 500? If you just need to keep all active at the same time just launch and join threads?

Contributor Author

@Sparks0219 Sparks0219 Nov 11, 2025


This is just to limit the maximum number of concurrent SSH connections. It should only be a problem for super large clusters.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 28, 2025
edoakes pushed a commit that referenced this pull request Oct 31, 2025
…ses (#58265)

## Description
> Briefly describe what this PR accomplishes and why it's needed.

Using the iptables script created in #58241, we found a bug in
RequestWorkerLease where a RAY_CHECK was being triggered here:
https://github.com/ray-project/ray/blob/66c08b47a195bcfac6878a234dc804142e488fc2/src/ray/raylet/lease_dependency_manager.cc#L222-L223
The issue is that transient network errors can happen ANYTIME, including
while the server logic is executing and has not yet replied to the
client. Our original testing framework used an env variable to drop the
request or reply as it was being sent, hence this case was missed.
Specifically, RequestWorkerLease can be in the middle of pulling the
lease dependencies to its local plasma store when the retry arrives and
triggers this check. Created a cpp unit test that triggers this
RAY_CHECK without this change and passes with it. I decided to store
the callbacks instead of replacing the older one with the new one
because of the possibility of message reordering, where the new request
could arrive before the old one.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
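
To illustrate the design choice in the last paragraph (a simplified Python sketch of the idea only, not Ray's actual C++ code): when a retried RequestWorkerLease arrives for a lease that is still being processed, its reply callback is appended rather than replacing the earlier one, and all stored callbacks are invoked once the lease is granted, so reordered retries can't clobber an in-flight request.

from collections import defaultdict
from typing import Callable, Dict, List


class LeaseRequestTracker:
    # Toy illustration: accumulate reply callbacks per lease ID.

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def on_request(self, lease_id: str, reply: Callable[[str], None]) -> None:
        # A retry for the same lease adds another callback instead of
        # overwriting the one from the original (possibly reordered) request.
        self._callbacks[lease_id].append(reply)

    def on_lease_granted(self, lease_id: str, worker_id: str) -> None:
        # Reply to every outstanding request (original and retries) exactly once.
        for reply in self._callbacks.pop(lease_id, []):
            reply(worker_id)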
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
Contributor

@dayshah dayshah left a comment


can you add a release test to run this with the 250 node data job

@Sparks0219
Contributor Author

> can you add a release test to run this with the 250 node data job

working on that right now, should get something up by EOD

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219
Contributor Author

@dayshah Main change is that I modified the iptables script to take the map_benchmark data release test as a parameter.
Ideally I want the iptables script to be reusable with any future release test, so it periodically runs the iptables blackout in a separate thread. I wasn't really sure where to put the script, so I stuck it in the release test directory since I think it should only be used there.
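
A sketch of that wrapper structure (illustrative only; run_blackout_once and the 60-second interval are placeholders, and the real script's names and timing may differ):

import subprocess
import sys
import threading

STOP = threading.Event()


def run_blackout_once() -> None:
    # Placeholder: the real script SSHes into every node here and applies
    # the iptables blackout for SECONDS seconds.
    pass


def blackout_loop(interval_s: float) -> None:
    # Keep injecting blackouts on a timer until the benchmark finishes.
    while not STOP.is_set():
        run_blackout_once()
        STOP.wait(interval_s)


def main() -> int:
    injector = threading.Thread(target=blackout_loop, args=(60.0,), daemon=True)
    injector.start()
    # The release test to stress (e.g. map_benchmark.py plus its args) is
    # forwarded on the command line.
    result = subprocess.run([sys.executable, *sys.argv[1:]])
    STOP.set()
    injector.join()
    # Propagate the benchmark's exit code so a failed run fails the job.
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())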

@Sparks0219 Sparks0219 requested a review from dayshah November 11, 2025 01:51
Contributor

@dayshah dayshah left a comment


i can feel the vibes from the code lol 👀


except Exception as e:
print(f"[MAIN] ERROR: {e}")
exit_code = 1
Contributor


I'm a little worried Anyscale will say the job succeeded even if it actually failed, since the actual job is this wrapper script. Can you test with it failing to make sure the release test is actually reported as failed?

Contributor Author


Weirdly enough, running map_benchmark.py by itself on an Anyscale workspace (so not even through the wrapper script) and triggering a failure doesn't mark the job as Failed. However, via the release test's anyscale_job_wrapper.py it seems to be reported correctly... 🤷

# NOTE: The script itself does not spin up a Ray cluster, it operates on the assumption that an existing
# Ray cluster is running and we are able to SSH into the nodes (like on Anyscale).

SECONDS = 5 # failure duration (seconds)
Contributor


In the past when you found the check failure, was it a 5-second or 10-second injection?

Contributor Author

@Sparks0219 Sparks0219 Nov 12, 2025


I actually just ran into the check failure in the lease dependency thingy again using 5 seconds since my image wasn't nightly. Can try bumping it up though

@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Nov 11, 2025
@Sparks0219
Contributor Author

> i can feel the vibes from the code lol 👀

Bro I vibed so hard on this that I should be called VibeCoder 😎

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from dayshah November 13, 2025 22:46
@Sparks0219 Sparks0219 added the release-test release test label Nov 14, 2025
@Sparks0219 Sparks0219 changed the title [core] Add script to simulate network transient error via ip tables [core] Add release test to simulate network transient error via ip tables Nov 14, 2025
Contributor

@dayshah dayshah left a comment


🚢

Just to confirm when the data script fails buildkite says it failed right

- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
Contributor


do you need all of these? Maybe just the gcs connect one?

Contributor Author


I think I needed to adjust these when I bumped the blackout to 15 seconds. It's probably good to leave them in so people in the future know which env variables they need to tune.
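
For context (my reading of these settings, not something stated in the PR): RAY_health_check_period_ms=10000 with RAY_health_check_failure_threshold=10 means the GCS only marks a node dead after many consecutive missed health checks, on the order of 10 × 10 s = 100 s, while RAY_health_check_timeout_ms=100000 and RAY_gcs_rpc_server_connect_timeout_s=60 keep individual health checks and the raylet-to-GCS connection from timing out mid-blackout. All of these comfortably exceed a 5-15 second injection, which is why the nodes survive it.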

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219
Contributor Author

> Just to confirm when the data script fails buildkite says it failed right

Yup, when I forgot to add the RAY_gcs_rpc_server_connect_timeout_s the test failed and it showed an error. https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_n783fmtyqs6zlhlvbcstdevqpp?job-logs-section-tabs=application_logs&job-tab=overview

@Sparks0219 Sparks0219 requested a review from dayshah November 17, 2025 18:09
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 18:24
@dayshah dayshah disabled auto-merge November 17, 2025 18:24
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 18:24
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
@github-actions github-actions bot disabled auto-merge November 17, 2025 21:43
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 21:45
@dayshah dayshah merged commit 3ff6ed3 into ray-project:master Nov 17, 2025
7 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…bles (ray-project#58241)

Signed-off-by: joshlee <joshlee@anyscale.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>