Placement Group Support #28

Merged: 28 commits into ray-project:master on Dec 21, 2020

Conversation

@amogkam (Contributor) commented Dec 15, 2020

Depends on #29.

Adds placement group support for the xgboost training workers.

Logic is as follows (a code sketch of this selection follows the list):

  • If not using Tune:
    • Use placement group with SPREAD strategy
  • If using Tune:
    • If elastic_training == True
      • Raise an error. Tune does not support elastic training.
    • If elastic_training == False
      • Use placement group with PACK strategy
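
A minimal sketch of this selection logic (the helper name and signature are illustrative, not from the PR; the branches mirror the list above):

```python
def _select_pg_strategy(use_tune: bool, elastic_training: bool) -> str:
    # Sketch of the strategy selection described above.
    if use_tune:
        if elastic_training:
            # Tune does not support elastic training.
            raise ValueError(
                "Elastic training cannot be used together with Tune.")
        # Co-locate the training workers with their Tune trial.
        return "PACK"
    # Outside of Tune, spread workers across nodes for fault tolerance.
    return "SPREAD"
```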

@amogkam (Contributor, Author) commented Dec 15, 2020

@rkooo567 if you get a chance, do you think you can give this a quick look as well to see if I'm using placement groups correctly?

@amogkam (Contributor, Author) commented Dec 16, 2020

FYI @krfricke, I'm planning to work on this more tomorrow. Can you take a look today and give some early feedback?

@krfricke (Collaborator) left a comment:

Draft looks good so far. Placement group resizing is probably the greatest blocker right now.

Comment on lines +557 to +558
cpus_per_actor: int,
gpus_per_actor: int,

Collaborator:
Should we still read those from ray_params (and not pass them here?)

@amogkam (Author):

These have gone through automatic detection already at this point, so they might not be the same as what's in ray_params.

Comment on lines 859 to 862:

    pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
                                 ray_params.resources_per_actor,
                                 ray_params.num_actors)

Collaborator:

The main question I have here is whether we should always use placement groups even if we're not using Tune. Does this make a difference in practice? If we're not using Tune, we would probably want to spread actors as much as possible.

@amogkam (Author):

Right, this is one of the assumptions made in the doc: if not using Tune, the cluster is fully utilized (all CPUs are being used by xgboost actors). In that case it doesn't matter whether we use PACK, SPREAD, or no placement groups at all.

But you're right that if we don't make that assumption, we would want to use the SPREAD strategy. I added that here since it was pretty straightforward.

@richardliaw (Collaborator):

I'd advocate for only implementing placement group support for non-resizeable workloads.

Currently (and for the near future), Tune is going to be the primary beneficiary of placement group support on workers, and these worker groups will be static in size.

@amogkam (Contributor, Author) commented Dec 16, 2020

> Currently (and for the near future), Tune is going to be the primary beneficiary of placement group support on workers, and these worker groups will be static in size.

@richardliaw can you elaborate on this? Does this mean that Tune will support Pack strategy on workers, and once that's implemented we would remove it from xgboost_ray?

@richardliaw (Collaborator):

No, I'm just saying that if you use Tune, you'll need placement groups, but don't need elastic support for placement groups. If you don't use Tune, maybe we can get away without using placement groups entirely (for workers).

@amogkam (Contributor, Author) commented Dec 16, 2020

Why can’t Tune be used with scale-down elastic training? In that case the worker group size won’t be static.

@richardliaw (Collaborator):

There is currently no support for that (maybe later, but out of scope for the next 4 weeks).

@richardliaw (Collaborator):

When a node dies, Tune currently does not know which trial's resources are lost. The current model for global resource changes is for the trial whose resources have changed to be relaunched with the appropriate resources.

@amogkam (Contributor, Author) commented Dec 16, 2020

But scale-down elastic training is being handled at the xgboost level, right? So as long as the Trainable is still alive, from Tune's perspective there is no failure, since it's being handled in xgboost_ray.train. There is no trial being relaunched in this case.

@richardliaw (Collaborator):

Tune keeps track of global cluster resources (total and allocated). Without retriggering execution of the trial, Tune will not know that one trial has reduced its resource usage. This puts the resource accounting in an inconsistent state and can lead to a variety of problems.

This is, again, a limitation of Tune's resource accounting model. We are revisiting it in this sprint, but until we actually make changes to Tune, anything that doesn't work with its existing form will not work.

@rkooo567 self-assigned this on Dec 17, 2020

@rkooo567 left a comment:

LGTM for the placement group!

if cpus_per_actor <= 0:
    cluster_cpus = _ray_get_cluster_cpus() or 1
    cpus_per_actor = min(
        int(_get_min_node_cpus() or 1),

@amogkam (Author) commented Dec 18, 2020:

Btw I think this should be _get_min_node_cpus, not _get_max_node_cpus. If we have a cluster with ~~3~~ 4 nodes: ~~1~~ 2 with 1 CPU each, and 2 with 3 CPUs each, and we have num_actors=4, how many CPUs should we have per actor?

Previously this would give 2 CPUs per actor, but that's currently not possible with this setup. We should be bounded by the minimum number of CPUs, not the maximum, if I'm understanding this correctly.

Collaborator:

Hmm, ideally you end up with 3 actors, but I think you'll want to do 3 CPUs each per actor (resulting in 2 created actors).

The reason is that xgboost is largely memory-bound, and doubling up 2 actors on 1 machine is less desirable (it increases the likelihood of OOM).

@amogkam (Author):

Right, but we currently don't automatically configure the number of actors, only the number of CPUs/GPUs per actor. So if the user passes in num_actors=4, we will always create 4 actors even if it's undesirable.

Also, the example cluster setup I gave was incorrect: it should be 4 nodes in total, 2 with 1 CPU each and 2 with 3 CPUs each.

In master, autodetecting CPUs would set num_cpus_per_actor to 2 with this setup:
  • _get_max_node_cpus() would give 3,
  • int(cluster_cpus // num_actors) == int(8 // 4) would give 2,
  • and we would take the min of these to get 2 CPUs per actor.
But 4 actors with 2 CPUs each cannot be scheduled with this cluster setup.

Instead we want _get_min_node_cpus, which would return 1. So we would have 4 actors with 1 CPU each, which can be scheduled.
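
A tiny worked example of the two variants for that cluster (the node CPU counts are written out by hand here; the real helpers read them from Ray's cluster state):

```python
node_cpus = [1, 1, 3, 3]          # 4 nodes: 2 with 1 CPU, 2 with 3 CPUs
num_actors = 4
cluster_cpus = sum(node_cpus)     # 8

# Old auto-detection (bounded by the *largest* node):
# min(3, 8 // 4) == 2, but 4 actors with 2 CPUs each cannot all be
# scheduled -- the two 1-CPU nodes cannot host a 2-CPU actor.
old_cpus_per_actor = min(max(node_cpus), cluster_cpus // num_actors)

# New auto-detection (bounded by the *smallest* node):
# min(1, 8 // 4) == 1, and 4 actors with 1 CPU each fit on this cluster.
new_cpus_per_actor = min(min(node_cpus), cluster_cpus // num_actors)

print(old_cpus_per_actor, new_cpus_per_actor)  # 2 1
```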

Collaborator:

I see; that makes sense.

@amogkam changed the title from "[WIP] Colocation Support" to "Colocation Support" on Dec 18, 2020
@amogkam marked this pull request as ready for review on December 18, 2020 04:16

@amogkam (Contributor, Author) commented Dec 18, 2020

Now that #29 is merged, this is ready for review @krfricke @richardliaw

Comment on lines +896 to +897
cpus_per_actor=cpus_per_actor,
gpus_per_actor=gpus_per_actor,

Collaborator:

Maybe instead we can pass these in through an actor_options kwarg instead of 3 args?

@amogkam (Author):

I'd actually prefer to leave it as-is so it's clear what all of the arguments are. Plus, this is an internal function, so it should be fine to add these extra args. @krfricke any thoughts here?

@richardliaw (Collaborator) left a comment:

Just a couple comments. It's very close!

@krfricke (Collaborator) left a comment:

This looks really good. I have a general question about placement groups, and I would like to see this in a medium-scale release test. I'll run this tomorrow after benchmarking to make sure that it works.

if cpus_per_actor <= 0:
    cluster_cpus = _ray_get_cluster_cpus() or 1
    cpus_per_actor = min(
        int(_get_min_node_cpus() or 1),

Collaborator:

Changing this from max to min leads to quite different behavior, but I guess we can assume that people with heterogeneous cluster setups might set this manually anyway.

Collaborator:

(So I'm fine with this change)

@amogkam (Author):

Just to be clear, for homogeneous clusters this does not make a difference since min == max. But for heterogeneous clusters I believe using max is actually incorrect behavior.

Collaborator:

min is definitely the safe choice. Generally, max makes sure that at least one node can schedule an actor; min makes sure that each node can schedule actors. I had max in as a sanity check, to make sure that the number of requested CPUs is never larger than the largest node size, just as an upper bound. For scheduling this was desirable in the case where we have a small head node and large workers (e.g. for GPU training, where I used this setup).

Generally we should just stress setting the number of CPUs per actor manually. We might even want to log a warning if we connect to an existing cluster and the number of CPUs per actor is not set. Let's discuss this sometime!

Comment on lines 893 to 895:

    pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
                                 ray_params.resources_per_actor,
                                 ray_params.num_actors, strategy)

Collaborator:

The main question I have here is whether we always want to create placement groups. I'm not familiar enough with them to know the benefits and drawbacks. I get that we would want them with Tune, but without Tune and without elastic training, what is the benefit of using placement groups? Does SPREAD give us any benefits? Do placement groups have any significant overhead?

@amogkam (Author):

@richardliaw @rkooo567 please feel free to add more info/correct me if I'm wrong here:

One of the benefits of placement groups is atomic resource reservation, which is useful on a non-autoscaling cluster. With placement groups, if all the required resources cannot be reserved at the same time, then execution will fail (unless autoscaling is enabled). Without placement groups, some of the training workers will be scheduled and others will not, causing execution to hang, I believe. Note that if we are on an autoscaling cluster, both cases should trigger autoscaling, so there should not be a difference between the two.

As for the actual strategy, PACK vs. SPREAD vs. no placement groups should not make a difference if the entire Ray cluster is fully utilized. But if that's not the case, then SPREAD would improve fault tolerance since it minimizes the number of workers that fail when a node goes down (https://docs.ray.io/en/master/auto_examples/placement-group.html#improve-fault-tolerance). However, this comes at the cost of increased inter-node communication. So SPREAD prioritizes fault tolerance over reducing network communication; PACK does the complete opposite and prioritizes reducing network communication over fault tolerance. Not using placement groups makes no such prioritization, and the placement behavior is not deterministic.

As for additional overheads, I don't know the particulars here, but I'm sure at the very least there is some accounting overhead. But I believe that this overhead only occurs for scheduling decisions when actors are created. Once all of the training workers have been placed, actual training times should not increase.
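
For reference, the atomic-reservation pattern with Ray's placement group API looks roughly like this (a sketch only; the bundle shapes, the timeout, and the actor call in the comment are illustrative, not the exact code in this PR):

```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

num_actors = 4
cpus_per_actor = 1

# One bundle per training actor; "SPREAD" places bundles on different
# nodes where possible, while "PACK" packs them onto as few nodes as possible.
bundles = [{"CPU": cpus_per_actor} for _ in range(num_actors)]
pg = placement_group(bundles, strategy="SPREAD")

# Atomic reservation: pg.ready() resolves only once *all* bundles can be
# reserved, so we can fail fast instead of hanging with some workers
# scheduled and others pending.
ready, _ = ray.wait([pg.ready()], timeout=100)
if not ready:
    remove_placement_group(pg)
    raise RuntimeError("Not enough cluster resources for all training actors.")

# The training actors would then be scheduled into the reserved bundles,
# e.g. SomeTrainingActor.options(placement_group=pg).remote(...)
```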

@amogkam (Author) commented Dec 20, 2020:

Overall, I think for the non-Tune case SPREAD is what we want here since it will improve fault tolerance. But I'm also OK if you think we should discuss this further; in this PR we can disable placement groups for the non-Tune case. It's only a 1-line change to toggle this.

@amogkam (Author):

We can also add an environment variable to toggle this, which will probably make it easier to run benchmarking with/without placement groups.

@amogkam (Author):

I added an environment variable to toggle this for your benchmarking tomorrow.
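
Roughly, the toggle looks like this (the variable name below is a placeholder for illustration, not necessarily the one added in this PR):

```python
import os

# Placeholder name for the toggle; placement groups are enabled by default.
_USE_PLACEMENT_GROUPS = os.environ.get("RXGB_USE_PLACEMENT_GROUPS", "1") != "0"

# In train(): only reserve a placement group when the toggle is enabled,
# so benchmarks can compare runs with and without placement groups, e.g.
# pg = _create_placement_group(...) if _USE_PLACEMENT_GROUPS else None
```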

Collaborator:

Awesome, thanks for the thorough explanation. It sounds like placement groups are generally the desired default behavior. I'll try it out tomorrow. If it works, I don't have anything else to add and we should be ready to merge.

@krfricke (Collaborator):

These are some results from running with Tune:

For IP 172.31.24.143 got trial IDs ['a159f_00000', 'a159f_00000', 'a159f_00000', 'a159f_00000', 'a159f_00001', 'a159f_00001', 'a159f_00001', 'a159f_00001']
For IP 172.31.26.125 got trial IDs ['a159f_00002']
For IP 172.31.23.54 got trial IDs ['a159f_00002', 'a159f_00002', 'a159f_00002', 'a159f_00003', 'a159f_00003', 'a159f_00003']
For IP 172.31.26.111 got trial IDs ['a159f_00003']
For IP 172.31.26.111 got trial IDs ['c961f_00000', 'c961f_00000', 'c961f_00000', 'c961f_00003', 'c961f_00003', 'c961f_00003']
For IP 172.31.23.54 got trial IDs ['c961f_00000', 'c961f_00001', 'c961f_00003']
For IP 172.31.24.143 got trial IDs ['c961f_00001', 'c961f_00001', 'c961f_00001']
For IP 172.31.26.125 got trial IDs ['c961f_00002', 'c961f_00002', 'c961f_00002', 'c961f_00002']

This does seem like PACK is working, right?
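
For context, output in this shape can be produced by having each worker report the node it runs on and grouping the reports by IP (a sketch, not necessarily the script used for this run; the trial IDs are just the first two from above):

```python
from collections import defaultdict

import ray


@ray.remote
def report_node(trial_id: str):
    # Each worker reports which node it landed on.
    return ray.util.get_node_ip_address(), trial_id


ray.init()
# In the real test these would be the xgboost workers of each Tune trial.
refs = [report_node.remote(t) for t in ["a159f_00000", "a159f_00001"]]

by_ip = defaultdict(list)
for ip, trial_id in ray.get(refs):
    by_ip[ip].append(trial_id)

for ip, trial_ids in by_ip.items():
    print(f"For IP {ip} got trial IDs {sorted(trial_ids)}")
```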

@krfricke merged commit aa1f918 into ray-project:master on Dec 21, 2020