-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[autoscaler] Auto detect memory resource #14567
Conversation
cc @ericl, this problem exists in both ray 1.1.0 and 1.2.0. |
Will this also override the object_store_memory settings on the node when it starts? This is a little dangerous since the safe value for object_store_memory depends on factors such as the size of /dev/shm, and is in general un-knowable until the node is actually up. Since it's not allowed to use object_store_memory as a resource request any more, I think we should drop that from this PR at least. Setting "memory" is relatively safe since it's only a scheduling hint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments. Can we also update AutoscalingConfigTest.testValidateDefaultConfig
and AutoscalingConfigTest.testValidateDefaultConfigAWSMultiNodeTypes
Hi @wuisawesome, thanks for reviewing.
You mean to add more tests? |
autodetected_resources.update( | ||
config["available_node_types"][node_type]["resources"]) | ||
config["available_node_types"][node_type][ | ||
"resources"] = autodetected_resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this!
autodetected_resources = {"CPU": cpus} | ||
if node_type != head_node_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need to autodetect the worker node type memory resources. Because the head node type memory can be updated from runtime. And also the head node type memory is total_memory * (1 - REDIS_PROPORTION - OBJECT_STORE_PROPORTION)
, however the worker node type is total_memory * (1 - OBJECT_STORE_PROPORTION)
.
@wuisawesome @ericl @DmitriGekhtman , do the new changes look good to you? |
K8s part looks good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ConeyLiu for this great contribution! We greatly value that. We look forward to more PRs from you!
thanks all. |
Why are these changes needed?
In the current autoscaler, we could auto detect cpu/gpu resources. However, we could not know the memory size when
min_workers
set to zero. This could lead to those tasks withmemory
requirements that could not be satisfied forever.This patch adds memory auto detecting for k8s and aws.
Related issue number
Closes #14553
Checks
scripts/format.sh
to lint the changes in this PR.