Placement Group Support #28
Conversation
@rkooo567 if you get a chance, do you think you can give this a quick look as well to see if I'm using placement groups correctly?
Fyi @krfricke I'm planning to work on this more tomorrow. Can you take a look today and give some early feedback?
Draft looks good so far. Placement group resizing is probably the greatest blocker right now.
cpus_per_actor: int,
gpus_per_actor: int,
Should we still read those from ray_params (and not pass them here)?
These have gone through automatic detection already at this point, so they might not be the same as what's in ray_params.
xgboost_ray/main.py
pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
                             ray_params.resources_per_actor,
                             ray_params.num_actors)
The main question I have here is whether we should always use placement groups, even if we're not using Tune. Does this make a difference in practice? If not using Tune, we would probably want to spread actors as much as possible.
Right, this is one of the assumptions made in the doc: if not using Tune, the cluster is fully utilized (all CPUs are being used by xgboost actors). In that case, it doesn't matter whether we use PACK, SPREAD, or no placement groups at all.
But you're right that if we don't make that assumption, we would want the SPREAD strategy. I added that here since it was pretty straightforward.
I'd advocate for only implementing placement group support for non-resizeable workloads. Currently (and for the near future), Tune is going to be the primary beneficiary of placement group support on workers, and these worker groups will be static in size.
@richardliaw can you elaborate on this? Does this mean that Tune will support the PACK strategy on workers, and once that's implemented we would remove it from xgboost_ray?
No, I'm just saying that if you use Tune, you'll need placement groups, but you don't need elastic support for placement groups. If you don't use Tune, maybe we can get away without using placement groups entirely (for workers).
Why can't Tune be used with scale-down elastic training? In that case the worker group size won't be static.
There is currently no support for that (maybe later, but out of scope for the next 4 weeks).
When a node dies, Tune currently does not know which trial's resources are lost. The current model for global resource changes is for the trial whose resources have changed to be relaunched with appropriate resources.
But scale-down elastic training is being handled at the xgboost level, right? So as long as the Trainable is still alive, from Tune's perspective there is no failure, since it's being handled in xgboost_ray.train. There is no trial being relaunched in this case.
Tune keeps track of global cluster resources (total and allocated). Without retriggering execution of the trial, Tune will not know that one trial has reduced its resource usage. This puts the resource accounting in an inconsistent state and can lead to a variety of problems. This is, again, a limitation of Tune's resource accounting model. We are revisiting it in this sprint, but before we actually make changes to Tune, anything that doesn't work with its existing form will not work.
LGTM for the placement group!
xgboost_ray/main.py
if cpus_per_actor <= 0:
    cluster_cpus = _ray_get_cluster_cpus() or 1
    cpus_per_actor = min(
        int(_get_min_node_cpus() or 1),
        int(cluster_cpus // num_actors))
Btw, I think this should be _get_min_node_cpus, not _get_max_node_cpus. If we have a cluster with ~~3~~ 4 nodes (~~1~~ 2 with 1 CPU each, and 2 with 3 CPUs each) and we have num_actors=4, how many CPUs should we have per actor?
Previously this would give 2 CPUs per actor, but that's currently not possible with this setup. We should be bounded by the minimum number of CPUs, not the maximum, if I'm understanding this correctly.
Hmm, ideally you end up with 3 actors, but I think you'll want to do 3 CPUs per actor (resulting in 2 created actors).
The reason is that xgboost is largely memory-bound, and doubling up 2 actors onto 1 machine is less desirable (it increases the likelihood of OOM).
Right, but we currently don't automatically configure the number of actors, only the number of CPUs/GPUs per actor. So if the user passes in num_actors=4, we will always create 4 actors even if it's undesirable.
Also, the example cluster setup I gave was incorrect; it should be 4 nodes in total, 2 with 1 CPU each and 2 with 3 CPUs each.
In master, autodetecting CPUs would set num_cpus_per_actor to 2 with this setup: _get_max_node_cpus() would give 3, int(cluster_cpus // num_actors) == int(8 // 4) would give 2, and we would take the min of these to get 2 CPUs per actor. But 4 actors with 2 CPUs each cannot be scheduled with this cluster setup.
Instead we want _get_min_node_cpus, which would return 1. So we would have 4 actors with 1 CPU each, which can be scheduled.
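To make the arithmetic concrete, here is a hedged sketch of the two helpers and the resulting bound (the ray.nodes()-based bodies are an assumption for illustration; the real implementations are in this PR):

```python
import ray

def _ray_get_cluster_cpus() -> float:
    # Total CPUs registered across the whole cluster.
    return ray.cluster_resources().get("CPU", 0)

def _get_min_node_cpus() -> float:
    # CPU count of the smallest alive node; an actor requesting more
    # than this can never be scheduled on every node.
    node_cpus = [
        node["Resources"].get("CPU", 0)
        for node in ray.nodes() if node["Alive"]
    ]
    return min(node_cpus) if node_cpus else 0

# Worked example for the 4-node cluster above (2x1 CPU + 2x3 CPUs = 8 CPUs),
# with num_actors=4:
#   _get_min_node_cpus()            -> 1
#   int(cluster_cpus // num_actors) -> int(8 // 4) = 2
#   cpus_per_actor = min(1, 2)      -> 1  (4 actors x 1 CPU: schedulable)
# With _get_max_node_cpus() (3) instead, min(3, 2) = 2, and 4 actors with
# 2 CPUs each cannot fit on the two 1-CPU nodes.
```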
I see; that makes sense.
Now that #29 is merged, this is ready for review @krfricke @richardliaw
cpus_per_actor=cpus_per_actor,
gpus_per_actor=gpus_per_actor,
Maybe instead we can pass these in through an actor_options kwarg instead of 3 separate args?
I'd actually prefer to leave it as is, so it's clear what all of the arguments are. Plus, this is an internal function, so it should be fine to add these extra args. @krfricke any thoughts here?
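For illustration only, a sketch of the two signature shapes being weighed (function and key names here are hypothetical, not this PR's actual internals):

```python
from typing import Dict, Optional

# Option A (as in this PR): explicit parameters, self-documenting signature.
def _make_actor_spec(rank: int, cpus_per_actor: int,
                     gpus_per_actor: int) -> Dict:
    return {"rank": rank, "num_cpus": cpus_per_actor,
            "num_gpus": gpus_per_actor}

# Option B (suggested alternative): a single options dict keeps the
# signature short, but callers must know which keys are expected.
def _make_actor_spec_opts(rank: int,
                          actor_options: Optional[Dict] = None) -> Dict:
    return {"rank": rank, **(actor_options or {})}
```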
Just a couple comments. It's very close!
This looks really good. I have a general question about placement groups, and I would like to see this in a medium-scale release test. I'll run this tomorrow after benchmarking to make sure that it works.
if cpus_per_actor <= 0:
    cluster_cpus = _ray_get_cluster_cpus() or 1
    cpus_per_actor = min(
        int(_get_min_node_cpus() or 1),
        int(cluster_cpus // num_actors))
Changing this from max to min leads to quite different behavior, but I guess we can assume that people with heterogeneous cluster setups might set this manually anyway.
(So I'm fine with this change)
Just to be clear, for homogeneous clusters this does not make a difference, since min == max. But for heterogeneous clusters I believe using max is actually incorrect behavior.
min is definitely the safe choice. Generally, max makes sure that at least one node can schedule an actor, while min makes sure that every node can schedule an actor. I had max in as a sanity check, to make sure that the number of requested CPUs is never larger than the largest node size, just as an upper bound. For scheduling this was desirable in the case where we have a small head node and large workers (e.g. for GPU training, where I used this setup).
Generally, we should just stress setting the number of CPUs per actor manually. We might even want to log a warning if we connect to an existing cluster and cpus_per_actor is not set. Let's discuss this sometime!
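That warning could look something like this (a sketch only; the helper name and how the "existing cluster" flag is wired up are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def _maybe_warn_cpus_unset(cpus_per_actor: int,
                           connected_to_existing_cluster: bool) -> None:
    # Hypothetical helper sketching the warning suggested above: on
    # heterogeneous clusters, auto-detection may pick a poor value.
    if connected_to_existing_cluster and cpus_per_actor <= 0:
        logger.warning(
            "Connected to an existing Ray cluster, but cpus_per_actor is "
            "not set. Auto-detection may be suboptimal on heterogeneous "
            "clusters; consider setting it explicitly.")
```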
xgboost_ray/main.py
pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
                             ray_params.resources_per_actor,
                             ray_params.num_actors, strategy)
The main question I have here is whether we always want to create placement groups. I'm not familiar enough with them to know the benefits and drawbacks. I get that we would want them with Tune, but without Tune and without elastic training, what is the benefit of using placement groups? Does SPREAD give us any benefits? Do placement groups have any significant overhead?
@richardliaw @rkooo567 please feel free to add more info/correct me if I'm wrong here:
One of the benefits of placement groups is atomic resource reservation which is useful on a non-autoscaling cluster. With placement groups, if all the required resources cannot be reserved at the same time, then execution will fail (unless autoscaling is enabled). Without placement groups, some of the training workers will be scheduled, and others will not, causing execution to hang I believe. Note that if we are on an autoscaling cluster, both cases should trigger autoscaling, so there should not be a difference between the two.
As for the actual strategy, PACK vs. SPREAD vs. no placement groups should not make a difference if the entire Ray cluster is fully utilized. But if that's not the case, then SPREAD would improve fault tolerance since it minimizes the number of workers that fail when a node goes down (https://docs.ray.io/en/master/auto_examples/placement-group.html#improve-fault-tolerance). However, this comes at a cost of increased inter-node communication. So SPREAD prioritizes fault tolerance over reducing network communication. PACK does the complete opposite and prioritizes reducing network communication over fault tolerance. Not using placement groups does not make any prioritization, and the placement behavior is not deterministic.
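To make the trade-off concrete, here is a minimal sketch using Ray's public placement group API (bundle shapes and the timeout are illustrative, not this PR's exact code):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

num_actors, cpus_per_actor = 4, 1

# One bundle per training actor. strategy="SPREAD" places bundles on
# distinct nodes where possible (better fault tolerance); "PACK"
# co-locates them (less inter-node communication).
pg = placement_group(
    [{"CPU": cpus_per_actor} for _ in range(num_actors)],
    strategy="SPREAD")

# Atomic reservation: ready() resolves only once *all* bundles are placed,
# so instead of hanging with partially scheduled workers we can fail fast
# if the cluster can't fit the whole group.
ray.get(pg.ready(), timeout=60)
```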
As for additional overheads, I don't know the particulars here, but I'm sure at the very least there is some accounting overhead. But I believe that this overhead only occurs for scheduling decisions when actors are created. Once all of the training workers have been placed, actual training times should not increase.
Overall, I think for the non-Tune case SPREAD is what we want here, since it will improve fault tolerance. But I'm also OK if you think we should discuss this further, and in this PR we can disable placement groups for the non-Tune case. It's only a one-line change to toggle this.
We can also add an environment variable to toggle this, which will probably make it easier to run benchmarking with/without placement groups.
I added an environment variable to toggle this for your benchmarking tomorrow.
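Such a toggle might look like this (the variable name RXGB_USE_PLACEMENT_GROUP is hypothetical; the PR defines the actual one):

```python
import os

def _use_placement_group() -> bool:
    # Defaults to on; set e.g. RXGB_USE_PLACEMENT_GROUP=0 to disable
    # placement groups when benchmarking with/without them.
    return bool(int(os.environ.get("RXGB_USE_PLACEMENT_GROUP", "1")))
```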
Awesome, thanks for the thorough explanation. It sounds like placement groups are generally the desired default behavior. I'll try it out tomorrow. If it works, I don't have anything else to add and we should be ready to merge.
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
These are some results for running with Tune:
This does seem like PACK is working, right?
Depends on #29.
Adds placement group support for xgboost training workers.
Logic is as follows: