Conversation

Contributor

@Sparks0219 Sparks0219 commented Oct 28, 2025

Description

To simulate transient network errors we initially planned to use AWS FIS testing, but there are a couple of problems:
1. The minimum fault duration is 60 seconds, which causes node death. We weren't able to determine whether this could be fixed by tuning ray_config env variables or whether it was due to some mechanism in Anyscale. Figuring that out would take a decent amount of time plus coordination with infra, so we decided to try the iptables route first.
2. The FIS experiment would introduce cross-AZ transient network errors across all clusters in staging.

Instead, we explored SSH-ing into each node and modifying its iptables to drop all inbound/outbound traffic except intra-node traffic, simulating cross-AZ transient failures for fault-tolerance testing. It wasn't too bad to implement and works pretty well: you can tune the blackout duration to whatever you want, and it only affects the nodes in your own cluster, so it avoids both downsides listed above. Adding the script to the repo since I think it could be helpful for future chaos testing.
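
For context, a minimal sketch of what the per-node fan-out looks like (my illustration, not the exact script): it assumes the node IPs come from the public ray.nodes() API, that passwordless SSH to each node works, and that iptables_cmd is the blackout command builder shown in the review excerpts below (a trivial placeholder is used here so the snippet stands alone).

import subprocess
from concurrent.futures import ThreadPoolExecutor

import ray


def iptables_cmd(self_ip: str) -> str:
    # Placeholder for the real builder shown in the review below, which
    # emits the nohup/iptables blackout one-liner for this node.
    return f"echo 'would blackout {self_ip}'"


def blackout_node(ip: str, cmd: str) -> int:
    # Run the blackout command on one node over SSH. BatchMode avoids
    # hanging on a password prompt if key-based auth is not set up.
    return subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "StrictHostKeyChecking=no", ip, cmd],
        capture_output=True,
    ).returncode


ray.init(address="auto")
ips = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]
with ThreadPoolExecutor(max_workers=32) as pool:
    exit_codes = list(pool.map(lambda ip: blackout_node(ip, iptables_cmd(ip)), ips))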

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 28, 2025 06:01
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new script to simulate transient network failures across availability zones by manipulating iptables on Ray nodes. This is a useful tool for testing fault tolerance.

My review focuses on improving the robustness and maintainability of the script. The main points are:

  • The iptables commands should be made idempotent by using a custom chain to avoid potential issues with leftover rules.
  • The script should use public Ray APIs instead of internal ones to ensure it doesn't break with future Ray updates.
  • The default concurrency for SSH connections is very high and could be lowered to prevent resource exhaustion.

Overall, the script is well-structured and the approach is sound. The suggested changes will make it more reliable and easier to maintain.

Comment on lines 29 to 50
def iptables_cmd(self_ip: str) -> str:
return f"""\
nohup setsid bash -lc '
sudo iptables -w -A INPUT -p tcp --dport 22 -j ACCEPT
sudo iptables -w -A OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -w -A INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A INPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -A OUTPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -A INPUT -j DROP
sudo iptables -w -A OUTPUT -j DROP
sleep {SECONDS}
sudo iptables -w -D OUTPUT -j DROP
sudo iptables -w -D INPUT -j DROP
sudo iptables -w -D OUTPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -D INPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -D OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -D INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -D OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -w -D INPUT -p tcp --dport 22 -j ACCEPT
' &>/dev/null &
"""
Contributor


high

The current implementation of iptables_cmd is not idempotent. If the script is run twice on the same node without the first run completing its cleanup, it will add a second set of rules. The cleanup phase will only remove one set, potentially leaving DROP rules in place and causing a permanent network partition.

A more robust approach is to use a dedicated iptables chain. This makes the operation atomic and idempotent, as you can flush and delete the chain at the start and end of the script.

def iptables_cmd(self_ip: str) -> str:
    return f"""\
nohup setsid bash -lc '
  CHAIN_NAME="RAY_NETWORK_BLACKOUT"
  # Clean up previous runs if any, ignoring errors if chain/rules don't exist
  sudo iptables -w -D INPUT -j $CHAIN_NAME 2>/dev/null || true
  sudo iptables -w -D OUTPUT -j $CHAIN_NAME 2>/dev/null || true
  sudo iptables -w -F $CHAIN_NAME 2>/dev/null || true
  sudo iptables -w -X $CHAIN_NAME 2>/dev/null || true

  # Create a new chain
  sudo iptables -w -N $CHAIN_NAME
  # Rules for allowed traffic
  sudo iptables -w -A $CHAIN_NAME -p tcp --dport 22 -j ACCEPT
  sudo iptables -w -A $CHAIN_NAME -p tcp --sport 22 -j ACCEPT
  sudo iptables -w -A $CHAIN_NAME -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
  sudo iptables -w -A $CHAIN_NAME -s {self_ip} -d {self_ip} -j ACCEPT
  # Drop everything else in this chain
  sudo iptables -w -A $CHAIN_NAME -j DROP

  # Insert the chain at the top of INPUT and OUTPUT
  sudo iptables -w -I INPUT 1 -j $CHAIN_NAME
  sudo iptables -w -I OUTPUT 1 -j $CHAIN_NAME

  sleep {SECONDS}

  # Cleanup
  sudo iptables -w -D INPUT -j $CHAIN_NAME
  sudo iptables -w -D OUTPUT -j $CHAIN_NAME
  sudo iptables -w -F $CHAIN_NAME
  sudo iptables -w -X $CHAIN_NAME
' &>/dev/null &
"""

Contributor


didn't the other ai want chains too...

Contributor Author


I tried this and something is breaking 💀



SECONDS = 5 # blackout duration (seconds)
PARALLEL = 500 # concurrent SSH sessions
Contributor


medium

A value of 500 for PARALLEL is quite high and may cause issues. A ThreadPoolExecutor with 500 worker threads can consume a significant amount of memory (for thread stacks) on the machine running the script. Additionally, opening 500 concurrent SSH sessions might hit system limits (like file descriptors or MaxStartups on the SSH server). It would be safer to use a more conservative number, or make it configurable based on the environment. For example, a value closer to the number of CPU cores or a number like 32 or 64 is often a safer default.
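
One possible mitigation (a sketch, not part of this PR; BLACKOUT_PARALLEL is a hypothetical variable name) is to cap the worker count by the cluster size and allow an override:

import os

DEFAULT_PARALLEL = 64  # conservative default for concurrent SSH sessions


def resolve_parallelism(num_nodes: int) -> int:
    # Never spawn more workers than there are nodes, and let an env var
    # override the default for very large clusters.
    requested = int(os.environ.get("BLACKOUT_PARALLEL", DEFAULT_PARALLEL))
    return max(1, min(requested, num_nodes))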

@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 28, 2025 06:12
Contributor

@dayshah dayshah left a comment


this is amazing 🤩



SECONDS = 5 # blackout duration (seconds)
PARALLEL = 500 # concurrent SSH sessions
Contributor

@dayshah dayshah Oct 28, 2025


why 500? If you just need to keep all active at the same time just launch and join threads?

Contributor Author

@Sparks0219 Sparks0219 Nov 11, 2025


This is just to limit the maximum number of concurrent SSH connections. It should only be a problem for super large clusters.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 28, 2025
edoakes pushed a commit that referenced this pull request Oct 31, 2025
…ses (#58265)

## Description
> Briefly describe what this PR accomplishes and why it's needed.

Using the iptables script created in #58241, we found a bug in
RequestWorkerLease where a RAY_CHECK was being triggered here:
https://github.com/ray-project/ray/blob/66c08b47a195bcfac6878a234dc804142e488fc2/src/ray/raylet/lease_dependency_manager.cc#L222-L223
The issue is that transient network errors can happen ANYTIME, including
while the server logic is executing and has not yet replied to the
client. Our original testing framework used an env variable to drop the
request or reply as it was being sent, hence this case was missed.
Specifically, RequestWorkerLease can be in the middle of pulling the
lease dependencies to its local plasma store when the retry arrives and
triggers this check. Created a cpp unit test that triggers this
RAY_CHECK without this change and passes with it. I decided to store
the callbacks instead of replacing the older one with the new one
because of the possibility of message reordering, where the new request
could arrive before the old one.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
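
To illustrate the design choice in the last paragraph (a simplified Python sketch of the idea only, not Ray's actual C++ code): when a retried RequestWorkerLease arrives for a lease that is still being processed, its reply callback is appended rather than replacing the earlier one, and all stored callbacks are invoked once the lease is granted, so reordered retries can't clobber an in-flight request.

from collections import defaultdict
from typing import Callable, Dict, List


class LeaseRequestTracker:
    # Toy illustration: accumulate reply callbacks per lease ID.

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def on_request(self, lease_id: str, reply: Callable[[str], None]) -> None:
        # A retry for the same lease adds another callback instead of
        # overwriting the one from the original (possibly reordered) request.
        self._callbacks[lease_id].append(reply)

    def on_lease_granted(self, lease_id: str, worker_id: str) -> None:
        # Reply to every outstanding request (original and retries) exactly once.
        for reply in self._callbacks.pop(lease_id, []):
            reply(worker_id)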
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
Contributor

@dayshah dayshah left a comment


can you add a release test to run this with the 250 node data job

@Sparks0219
Contributor Author

> can you add a release test to run this with the 250 node data job

working on that right now, should get something up by EOD

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219
Contributor Author

@dayshah Main change is that I modified the iptables script to take the map_benchmark data release test as a parameter.
Ideally I want the iptables script to be reusable with any future release test, so it periodically runs the iptables blackout in a separate thread. I wasn't really sure where to put the script, so I stuck it in the release test directory since I think it should only be used there.
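
A sketch of that wrapper structure (illustrative only; run_blackout_once and the 60-second interval are placeholders, and the real script's names and timing may differ):

import subprocess
import sys
import threading

STOP = threading.Event()


def run_blackout_once() -> None:
    # Placeholder: the real script SSHes into every node here and applies
    # the iptables blackout for SECONDS seconds.
    pass


def blackout_loop(interval_s: float) -> None:
    # Keep injecting blackouts on a timer until the benchmark finishes.
    while not STOP.is_set():
        run_blackout_once()
        STOP.wait(interval_s)


def main() -> int:
    injector = threading.Thread(target=blackout_loop, args=(60.0,), daemon=True)
    injector.start()
    # The release test to stress (e.g. map_benchmark.py plus its args) is
    # forwarded on the command line.
    result = subprocess.run([sys.executable, *sys.argv[1:]])
    STOP.set()
    injector.join()
    # Propagate the benchmark's exit code so a failed run fails the job.
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())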

@Sparks0219 Sparks0219 requested a review from dayshah November 11, 2025 01:51
Contributor

@dayshah dayshah left a comment


i can feel the vibes from the code lol 👀


except Exception as e:
print(f"[MAIN] ERROR: {e}")
exit_code = 1
Contributor


I'm a little worried Anyscale will say the job succeeded even if it actually failed, since the actual job is this wrapper script. Can you test with it failing to make sure the release test is actually reported as failed?

Contributor Author


Weirdly enough, running map_benchmark.py by itself on an Anyscale workspace (so not even through the wrapper script) and triggering a failure doesn't mark the job as Failed. However, via the release test's anyscale_job_wrapper.py it seems to be reported correctly... 🤷

# NOTE: The script itself does not spin up a Ray cluster, it operates on the assumption that an existing
# Ray cluster is running and we are able to SSH into the nodes (like on Anyscale).

SECONDS = 5 # failure duration (seconds)
Contributor


In the past when you found the check failure, was it a 5-second or 10-second injection?

Contributor Author

@Sparks0219 Sparks0219 Nov 12, 2025


I actually just ran into the check failure in the lease dependency thingy again using 5 seconds since my image wasn't nightly. Can try bumping it up though

@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Nov 11, 2025
@Sparks0219
Contributor Author

> i can feel the vibes from the code lol 👀

Bro I vibed so hard on this that I should be called VibeCoder 😎

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from dayshah November 13, 2025 22:46
@Sparks0219 Sparks0219 added the release-test release test label Nov 14, 2025
@Sparks0219 Sparks0219 changed the title [core] Add script to simulate network transient error via ip tables [core] Add release test to simulate network transient error via ip tables Nov 14, 2025
Contributor

@dayshah dayshah left a comment


🚢

Just to confirm when the data script fails buildkite says it failed right

- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
Contributor


do you need all of these? Maybe just the gcs connect one?

Contributor Author


I think I needed to adjust these when I bumped the blackout to 15 seconds. It's probably good to leave them in so people in the future know which env variables they need to tune.
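
For context (my reading of these settings, not something stated in the PR): RAY_health_check_period_ms=10000 with RAY_health_check_failure_threshold=10 means the GCS only marks a node dead after many consecutive missed health checks, on the order of 10 × 10 s = 100 s, while RAY_health_check_timeout_ms=100000 and RAY_gcs_rpc_server_connect_timeout_s=60 keep individual health checks and the raylet-to-GCS connection from timing out mid-blackout. All of these comfortably exceed a 5-15 second injection, which is why the nodes survive it.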

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219
Contributor Author

> Just to confirm when the data script fails buildkite says it failed right

Yup, when I forgot to add the RAY_gcs_rpc_server_connect_timeout_s the test failed and it showed an error. https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_n783fmtyqs6zlhlvbcstdevqpp?job-logs-section-tabs=application_logs&job-tab=overview

@Sparks0219 Sparks0219 requested a review from dayshah November 17, 2025 18:09
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 18:24
@dayshah dayshah disabled auto-merge November 17, 2025 18:24
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 18:24
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
@github-actions github-actions bot disabled auto-merge November 17, 2025 21:43
@dayshah dayshah enabled auto-merge (squash) November 17, 2025 21:45
@dayshah dayshah merged commit 3ff6ed3 into ray-project:master Nov 17, 2025
7 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…bles (ray-project#58241)

Signed-off-by: joshlee <joshlee@anyscale.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>