[Object Spilling] 100GB shuffle release test #13729
Conversation
Can you also add this to the release process, and include what the success condition is and any information that needs to be recorded?
rows_per_partition = partition_size // (8 * 2)
object_store_size = 20 * 1024 * 1024 * 1024  # 20G

system_config = {
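For reference, the sizing arithmetic in the snippet above can be sketched as follows. Only `rows_per_partition` and `object_store_size` come from the diff; the 100GB total is from the test name, and the partition count and 2×8-byte row layout are assumptions for illustration:

```python
# Sizing sketch -- values marked "assumed" are illustrative, not from the PR.
TOTAL_SHUFFLE_SIZE = 100 * 1024 * 1024 * 1024   # 100GB, from the test name
NUM_PARTITIONS = 200                            # assumed partition count

partition_size = TOTAL_SHUFFLE_SIZE // NUM_PARTITIONS
rows_per_partition = partition_size // (8 * 2)  # two 8-byte fields per row (assumed)
object_store_size = 20 * 1024 * 1024 * 1024     # 20G, as in the diff
```

With these assumed numbers each partition is 512MiB, so the working set is several times larger than the 20G object store, which is what forces spilling.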
Won't these be set automatically once object spilling is turned on?
Yes. But we'd like to run this in the current release, and object spilling will be turned off for the current release. I can create another PR to fix it later.
Sounds good. Will update them tomorrow. If you'd like to run it ASAP, the success criterion for now is just that it finishes. (I will write more details soon.)
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    # - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":1,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/tmp/spill\"}}"}'
We don't need it because we will start a ray instance in the driver. If you'd like to remove this, lmk (I prefer to keep it as a reference).
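For reference, the driver-side equivalent of that commented-out command might look like the sketch below. The spill directory and worker count mirror the flags above; this is an illustration of the shape of the config, not the workload script's actual code:

```python
import json

# Build the same system config as the commented-out `ray start` flags above.
spilling_config = json.dumps(
    {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}}
)
system_config = {
    "automatic_object_spilling_enabled": True,
    "max_io_workers": 1,
    "object_spilling_config": spilling_config,
}
# In the driver this would be passed to Ray at startup, e.g.:
# ray.init(_system_config=system_config)
```

Note that `object_spilling_config` is itself a JSON string nested inside the system config, which is why the shell version needs the escaped quotes.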
- ``data_processing_tests/workloads/streaming_shuffle.py`` runs the 100GB streaming shuffle on a single node and on a fake 4-node cluster.

**IMPORTANT** Check whether the workload script has terminated. If so, please record the result (both read/write bandwidth and the shuffle result) in ``release_logs/data_processing_tests/[test_name]``.
Neither the shuffle runtime nor the read/write bandwidth should regress by more than 15% compared to the previous release.
15% might be a bit ambitious, but let's see how it looks from the next release.
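As a sketch of how that 15% budget could be checked when recording results, something like the helper below would work. The function name and sample numbers are hypothetical, not part of the test scripts:

```python
def within_regression_budget(current, previous, higher_is_better, tolerance=0.15):
    """Return True if `current` has not regressed by more than `tolerance`
    relative to `previous`, for the given metric direction."""
    if higher_is_better:  # e.g. read/write bandwidth
        return current >= previous * (1 - tolerance)
    # lower is better, e.g. shuffle runtime
    return current <= previous * (1 + tolerance)

# Bandwidth dropped 10% (within budget); runtime grew 20% (over budget).
print(within_regression_budget(0.9, 1.0, higher_is_better=True))       # True
print(within_regression_budget(120.0, 100.0, higher_is_better=False))  # False
```

The direction flag matters because runtime and bandwidth regress in opposite directions: a lower runtime is an improvement, while a lower bandwidth is a regression.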
This reverts commit e993ffb.
Why are these changes needed?
Add a single-node / 4-node streaming shuffle stress test. I made the output pretty too lol.
Related issue number
Checks
I've run ``scripts/format.sh`` to lint the changes in this PR.