[core] Add release test to simulate network transient error via ip tables #58241
Conversation
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Code Review
This pull request introduces a new script to simulate transient network failures across availability zones by manipulating iptables on Ray nodes. This is a useful tool for testing fault tolerance.
My review focuses on improving the robustness and maintainability of the script. The main points are:
- The iptables commands should be made idempotent by using a custom chain, to avoid potential issues with leftover rules.
- The script should use public Ray APIs instead of internal ones to ensure it doesn't break with future Ray updates (a minimal sketch follows below).
- The default concurrency for SSH connections is very high and could be lowered to prevent resource exhaustion.
Overall, the script is well-structured and the approach is sound. The suggested changes will make it more reliable and easier to maintain.
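On the public-API point, here is a minimal sketch (assuming the script runs from a node of the already-running cluster) of discovering node IPs with ray.nodes() rather than internal state; the variable names are illustrative, not taken from the PR's script:

import ray

# Attach to the already-running cluster instead of starting a new one.
ray.init(address="auto")

# ray.nodes() is a public API; each entry is a dict describing one node's state.
node_ips = [node["NodeManagerAddress"] for node in ray.nodes() if node["Alive"]]
print(node_ips)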
def iptables_cmd(self_ip: str) -> str:
    return f"""\
nohup setsid bash -lc '
sudo iptables -w -A INPUT -p tcp --dport 22 -j ACCEPT
sudo iptables -w -A OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -w -A INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A INPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -A OUTPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -A INPUT -j DROP
sudo iptables -w -A OUTPUT -j DROP
sleep {SECONDS}
sudo iptables -w -D OUTPUT -j DROP
sudo iptables -w -D INPUT -j DROP
sudo iptables -w -D OUTPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -D INPUT -s {self_ip} -d {self_ip} -j ACCEPT
sudo iptables -w -D OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -D INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -D OUTPUT -p tcp --sport 22 -j ACCEPT
sudo iptables -w -D INPUT -p tcp --dport 22 -j ACCEPT
' &>/dev/null &
"""
The current implementation of iptables_cmd is not idempotent. If the script is run twice on the same node without the first run completing its cleanup, it will add a second set of rules. The cleanup phase will only remove one set, potentially leaving DROP rules in place and causing a permanent network partition.
A more robust approach is to use a dedicated iptables chain. This makes the operation atomic and idempotent, as you can flush and delete the chain at the start and end of the script.
def iptables_cmd(self_ip: str) -> str:
return f"""\
nohup setsid bash -lc '
CHAIN_NAME="RAY_NETWORK_BLACKOUT"
# Clean up previous runs if any, ignoring errors if chain/rules don't exist
sudo iptables -w -D INPUT -j $CHAIN_NAME 2>/dev/null || true
sudo iptables -w -D OUTPUT -j $CHAIN_NAME 2>/dev/null || true
sudo iptables -w -F $CHAIN_NAME 2>/dev/null || true
sudo iptables -w -X $CHAIN_NAME 2>/dev/null || true
# Create a new chain
sudo iptables -w -N $CHAIN_NAME
# Rules for allowed traffic
sudo iptables -w -A $CHAIN_NAME -p tcp --dport 22 -j ACCEPT
sudo iptables -w -A $CHAIN_NAME -p tcp --sport 22 -j ACCEPT
sudo iptables -w -A $CHAIN_NAME -s 127.0.0.0/8 -d 127.0.0.0/8 -j ACCEPT
sudo iptables -w -A $CHAIN_NAME -s {self_ip} -d {self_ip} -j ACCEPT
# Drop everything else in this chain
sudo iptables -w -A $CHAIN_NAME -j DROP
# Insert the chain at the top of INPUT and OUTPUT
sudo iptables -w -I INPUT 1 -j $CHAIN_NAME
sudo iptables -w -I OUTPUT 1 -j $CHAIN_NAME
sleep {SECONDS}
# Cleanup
sudo iptables -w -D INPUT -j $CHAIN_NAME
sudo iptables -w -D OUTPUT -j $CHAIN_NAME
sudo iptables -w -F $CHAIN_NAME
sudo iptables -w -X $CHAIN_NAME
' &>/dev/null &
"""There was a problem hiding this comment.
didn't the other ai want chains too...
I tried this and something is breaking 💀
SECONDS = 5     # blackout duration (seconds)
PARALLEL = 500  # concurrent SSH sessions
A value of 500 for PARALLEL is quite high and may cause issues. A ThreadPoolExecutor with 500 worker threads can consume a significant amount of memory (for thread stacks) on the machine running the script. Additionally, opening 500 concurrent SSH sessions might hit system limits (like file descriptors or MaxStartups on the SSH server). It would be safer to use a more conservative number, or make it configurable based on the environment. For example, a value closer to the number of CPU cores or a number like 32 or 64 is often a safer default.
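For illustration, a minimal sketch of capping the SSH fan-out with a smaller bounded pool; run_on_node, run_everywhere, MAX_CONCURRENT_SSH, and the ssh flags are assumptions for this sketch, not taken from the PR's script:

import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT_SSH = 32  # illustrative; closer to core count than the PR's 500

def run_on_node(ip: str, cmd: str) -> int:
    # Launch the command on a remote node over SSH and return its exit code.
    result = subprocess.run(
        ["ssh", "-o", "StrictHostKeyChecking=no", ip, cmd],
        capture_output=True,
        text=True,
    )
    return result.returncode

def run_everywhere(node_ips, make_cmd):
    # A bounded pool keeps thread-stack memory and concurrent SSH sessions in check.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_SSH) as pool:
        futures = {pool.submit(run_on_node, ip, make_cmd(ip)): ip for ip in node_ips}
        return {futures[f]: f.result() for f in as_completed(futures)}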
dayshah
left a comment
this is amazing 🤩
SECONDS = 5     # blackout duration (seconds)
PARALLEL = 500  # concurrent SSH sessions
why 500? If you just need to keep all active at the same time just launch and join threads?
This is just to limit the number of max concurrent ssh connections. Should only be a problem for like super large clusters.
…ses (#58265) ## Description Using the iptables script created in #58241 we found a bug in RequestWorkerLease where a RAY_CHECK was being triggered here: https://github.com/ray-project/ray/blob/66c08b47a195bcfac6878a234dc804142e488fc2/src/ray/raylet/lease_dependency_manager.cc#L222-L223 The issue is that transient network errors can happen ANYTIME, including while the server logic is executing and has not yet replied to the client. Our original testing framework used an env variable to drop the request or reply as it was being sent, so this case was missed. Specifically, RequestWorkerLease could be in the process of pulling the lease dependencies to its local plasma store when the retry arrives and triggers this check. Created a C++ unit test that triggers this RAY_CHECK without this change and passes with it. I decided to store the callbacks instead of replacing the older one with the new one due to the possibility of message reordering, where the new one could arrive before the old one. --------- Signed-off-by: joshlee <joshlee@anyscale.com>
…s-az-iptable-scriot
dayshah
left a comment
can you add a release test to run this with the 250 node data job
working on that right now, should get something up by EOD
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
@dayshah Main change was that I modified the iptables script to take the map_benchmark data release test as a parameter.
dayshah
left a comment
i can feel the vibes from the code lol 👀
except Exception as e:
    print(f"[MAIN] ERROR: {e}")
    exit_code = 1
I'm a little worried anyscale will say the job succeeded even if it actually failed, since the actual job is this wrapper script. Can you test with it failing to make sure it actually reports the release test as failed if it does fail?
weirdly enough running the map_benchmark.py on anyscale workspace by itself (so not even through the wrapper script) and triggering a failure doesn't mark the job as Failed. However via the release test anyscale_job_wrapper.py, it seems like it's being reported correctly... 🤷
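A minimal sketch, assuming the wrapper launches the benchmark as a subprocess, of propagating the benchmark's exit status so the harness marks the release test failed; this is illustrative, not the actual anyscale_job_wrapper.py:

import subprocess
import sys

def run_benchmark(cmd):
    # Run the wrapped benchmark and surface its exit code (or 1 on wrapper-side errors).
    try:
        return subprocess.run(cmd).returncode
    except Exception as e:
        print(f"[MAIN] ERROR: {e}")
        return 1

if __name__ == "__main__":
    # Exit with the benchmark's status so the release-test harness sees failures.
    sys.exit(run_benchmark(sys.argv[1:]))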
# NOTE: The script itself does not spin up a Ray cluster; it operates on the assumption that an existing
# Ray cluster is running and we are able to SSH into the nodes (like on Anyscale).

SECONDS = 5  # failure duration (seconds)
in the past when you found the check failure, was it a 5 second or 10 second injection?
I actually just ran into the check failure in the lease dependency thingy again using 5 seconds since my image wasn't nightly. Can try bumping it up though
Bro I vibed so hard on this that I should be called VibeCoder 😎
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
🚢
Just to confirm, when the data script fails, buildkite says it failed, right?
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
do you need all of these? Maybe just the gcs connect one?
I think I needed to adjust these when I went to 15 seconds for the timeout, probably good to leave in so people in the future know what env variables they need to tune?
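A rough back-of-the-envelope check of why these values keep nodes alive through a 15 second blackout, assuming (from the setting names, not verified against Ray internals) that a node is declared dead only after health_check_failure_threshold consecutive checks fail, one every health_check_period_ms:

# Assumed semantics: node death requires `failure_threshold` consecutive failed
# checks spaced `period_ms` apart (an assumption based on the setting names).
period_ms = 10_000
failure_threshold = 10
blackout_s = 15

min_unreachable_s = failure_threshold * period_ms / 1000  # ~100 s of silence tolerated
assert blackout_s < min_unreachable_s  # a 15 s blackout should not mark the node dead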
Signed-off-by: joshlee <joshlee@anyscale.com>
Yup, when I forgot to add the RAY_gcs_rpc_server_connect_timeout_s the test failed and it showed an error. https://console.anyscale-staging.com/cld_vy7xqacrvddvbuy95auinvuqmt/prj_xqmpk8ps6civt438u1hp5pi88g/jobs/prodjob_n783fmtyqs6zlhlvbcstdevqpp?job-logs-section-tabs=application_logs&job-tab=overview
…bles (ray-project#58241) Signed-off-by: joshlee <joshlee@anyscale.com> Co-authored-by: Dhyey Shah <dhyey2019@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
To simulate transient network errors we were initially going to use AWS FIS testing, but there are a couple of problems:
1.) The minimum timeout is 60 seconds, causing node death. We weren't able to determine whether we could fix this by tuning ray_config env variables or whether this was due to some mechanism in Anyscale. It would probably take a decent amount of time to figure out, plus coordination with infra, so we decided to try the iptables route first.
2.) The FIS experiment would introduce cross-AZ transient network errors across all clusters in staging.
Hence we instead explored an approach of SSHing into each node and modifying the iptables to drop all incoming/outgoing traffic except intra-node traffic, simulating cross-AZ fault tolerance. It actually wasn't so bad to implement, and it works pretty well. You can tune the timeout to whatever you want, and it only affects the nodes in your cluster, so it avoids both of the problems listed above. Adding the script into prod since I think it could be helpful for future chaos testing.