[autoscaler] Auto detect memory resource #14567

ConeyLiu · 2021-03-09T15:38:20Z

Why are these changes needed?

The autoscaler currently auto-detects CPU/GPU resources, but it cannot know a node type's memory size when min_workers is set to zero. As a result, tasks with memory requirements may never be satisfied.

This patch adds memory auto-detection for K8s and AWS.
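
To illustrate the failure mode described above, here is a minimal sketch of a feasibility check. It is not the code in resource_demand_scheduler.py; the function name `fits` and the node type dictionaries are hypothetical and only show why a node type that does not advertise "memory" can never satisfy a memory request.

```python
from typing import Dict


def fits(node_resources: Dict[str, float], demand: Dict[str, float]) -> bool:
    """Return True if the node type advertises enough of every requested resource.

    A resource that the node type does not advertise counts as zero.
    """
    return all(
        node_resources.get(name, 0) >= amount for name, amount in demand.items()
    )


# Hypothetical worker node type where only CPU was auto-detected.
cpu_only_node_type = {"CPU": 4}

# Hypothetical worker node type where memory was auto-detected as well.
node_type_with_memory = {"CPU": 4, "memory": 8 * 1024**3}

# A task that asks for one CPU and 2 GiB of memory.
demand = {"CPU": 1, "memory": 2 * 1024**3}

fits_without_memory = fits(cpu_only_node_type, demand)
fits_with_memory = fits(node_type_with_memory, demand)
```

With min_workers set to zero there is no running worker to report its real memory, so in this simplified model the first check fails no matter how many nodes could be launched.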

Related issue number

Closes #14553

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

ConeyLiu · 2021-03-09T15:42:05Z

cc @ericl: this problem exists in both Ray 1.1.0 and 1.2.0.

ericl · 2021-03-09T20:20:26Z

Will this also override the object_store_memory setting on the node when it starts? That is a little dangerous, since the safe value for object_store_memory depends on factors such as the size of /dev/shm and is generally unknowable until the node is actually up. Since object_store_memory can no longer be used as a resource request, I think we should at least drop it from this PR.

Setting "memory" is relatively safe since it's only a scheduling hint.

wuisawesome

Left a few comments. Can we also update AutoscalingConfigTest.testValidateDefaultConfig and AutoscalingConfigTest.testValidateDefaultConfigAWSMultiNodeTypes?

python/ray/autoscaler/_private/resource_demand_scheduler.py

python/ray/autoscaler/_private/aws/node_provider.py

python/ray/autoscaler/_private/kubernetes/config.py

ConeyLiu · 2021-03-12T03:20:34Z

Hi @wuisawesome, thanks for reviewing.

Can we also update AutoscalingConfigTest.testValidateDefaultConfig and AutoscalingConfigTest.testValidateDefaultConfigAWSMultiNodeTypes

Do you mean adding more tests? testValidateDefaultConfigAWSMultiNodeTypes has been updated. I'm not sure what should be done for testValidateDefaultConfig.

DmitriGekhtman · 2021-03-12T04:20:56Z

python/ray/autoscaler/_private/kubernetes/config.py

+        autodetected_resources.update(
+            config["available_node_types"][node_type]["resources"])
+        config["available_node_types"][node_type][
+            "resources"] = autodetected_resources


Thanks for fixing this!
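
The diff lines quoted in this comment merge two dictionaries so that values written by the user take precedence over auto-detected ones. A minimal standalone sketch of that precedence follows; the helper name `merge_resources` and the sample values are hypothetical, and only the `dict.update` ordering mirrors the snippet.

```python
from typing import Dict


def merge_resources(
    autodetected: Dict[str, float], user_specified: Dict[str, float]
) -> Dict[str, float]:
    """Start from the auto-detected values, then let user settings override them."""
    merged = dict(autodetected)  # copy so the input is not mutated
    merged.update(user_specified)  # keys set by the user win on conflict
    return merged


# Hypothetical values detected from the pod or instance spec.
autodetected = {"CPU": 2, "memory": 4 * 1024**3}

# Hypothetical "resources" field from the cluster config.
user_specified = {"CPU": 1, "custom": 5}

merged = merge_resources(autodetected, user_specified)
```

Here "CPU" keeps the user's value, "memory" is filled in from detection, and "custom" passes through unchanged.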

python/ray/autoscaler/_private/kubernetes/config.py

ConeyLiu · 2021-03-13T08:40:28Z

python/ray/autoscaler/_private/aws/node_provider.py

                autodetected_resources = {"CPU": cpus}
+                if node_type != head_node_type:


We only need to auto-detect memory for worker node types, because the head node type's memory can be updated at runtime. In addition, the head node's memory is total_memory * (1 - REDIS_PROPORTION - OBJECT_STORE_PROPORTION), whereas a worker node's is total_memory * (1 - OBJECT_STORE_PROPORTION).
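
The two formulas in this comment can be sketched as follows. The constant names come from the comment, but the numeric values below are placeholders assumed for illustration, not values confirmed by this PR, and `schedulable_memory` is a hypothetical helper name.

```python
# Placeholder proportions, assumed only for this example.
OBJECT_STORE_PROPORTION = 0.3
REDIS_PROPORTION = 0.1


def schedulable_memory(total_memory: int, is_head: bool) -> int:
    """Memory left for scheduling after reserving the object store share.

    The head node additionally reserves a share for Redis.
    """
    reserved = OBJECT_STORE_PROPORTION
    if is_head:
        reserved += REDIS_PROPORTION
    return int(total_memory * (1 - reserved))


total_memory = 16 * 1024**3  # a hypothetical 16 GiB instance

worker_memory = schedulable_memory(total_memory, is_head=False)
head_memory = schedulable_memory(total_memory, is_head=True)
```

Because the head node reserves the extra Redis share, the same instance type yields a smaller "memory" value on the head than on a worker, which is one reason the two cases are treated separately.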

AmeerHajAli · 2021-03-15T15:42:27Z

@wuisawesome @ericl @DmitriGekhtman , do the new changes look good to you?

DmitriGekhtman · 2021-03-15T16:03:09Z

@wuisawesome @ericl @DmitriGekhtman , do the new changes look good to you?

K8s part looks good.

AmeerHajAli

Thanks @ConeyLiu for this great contribution! We greatly value that. We look forward to more PRs from you!

ConeyLiu · 2021-03-16T02:09:41Z

Thanks, all.

ConeyLiu added 7 commits March 9, 2021 20:34

auto detect memory resources

4083019

format

9a8bf78

fixes

6402f3c

fixes

0043523

fixes

555735e

fixes

b3a8ffd

fixes

0202b30

ConeyLiu mentioned this pull request Mar 9, 2021

Can't init spark on multi-nodes ray cluster on k8s oap-project/raydp#76

Closed

ericl assigned ericl and AmeerHajAli Mar 9, 2021

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 9, 2021

AmeerHajAli requested a review from DmitriGekhtman March 9, 2021 20:36

AmeerHajAli assigned DmitriGekhtman Mar 9, 2021

address comments and fix ci

91df925

ConeyLiu changed the title ~~[autoscaler] Auto detect memory and object_store_memory~~ [autoscaler] Auto detect memory resource Mar 10, 2021

AmeerHajAli requested a review from wuisawesome March 10, 2021 21:05

AmeerHajAli assigned wuisawesome Mar 10, 2021

wuisawesome requested changes Mar 10, 2021

DmitriGekhtman reviewed Mar 12, 2021

python/ray/autoscaler/_private/kubernetes/config.py (outdated)

ConeyLiu added 2 commits March 13, 2021 16:28

address comments

3ac2060

address comments

c16f4f5

ConeyLiu commented Mar 13, 2021

corrent comments

269c06f

ConeyLiu mentioned this pull request Mar 13, 2021

Unable to create nodes in an autoscaling cluster oap-project/raydp#104

Closed

fixes

b6b8d69

AmeerHajAli removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 15, 2021

ericl removed their assignment Mar 15, 2021

wuisawesome approved these changes Mar 16, 2021

AmeerHajAli approved these changes Mar 16, 2021

AmeerHajAli merged commit c3d8ef1 into ray-project:master Mar 16, 2021

DmitriGekhtman mentioned this pull request Mar 16, 2021

[autoscaler][kubernetes] K8s memory detection tweak #14712

Merged

6 tasks

ConeyLiu deleted the autoscaler branch May 25, 2021 03:28
