Placement Group Support #28
Conversation
@rkooo567 if you get a chance, do you think you can give this a quick look as well to see if I'm using placement groups correctly?
Fyi @krfricke I'm planning to work on this more tomorrow. Can you take a look today and give some early feedback?
Draft looks good so far. Placement group resizing is probably the greatest blocker right now.
cpus_per_actor: int,
gpus_per_actor: int,
Should we still read those from ray_params (and not pass them here)?
These have gone through automatic detection already at this point, so they might not be the same as what's in ray_params.
xgboost_ray/main.py
pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
                             ray_params.resources_per_actor,
                             ray_params.num_actors)
The main question I have here is whether we should always use placement groups, even if we're not using Tune. Does this make a difference in practice? If not using Tune, we would probably want to spread actors as much as possible.
Right, this is one of the assumptions made in the doc: if not using Tune, the cluster is fully utilized (all CPUs are being used by xgboost actors). In that case, it doesn't matter whether we use PACK, SPREAD, or no placement groups at all.
But you're right that if we don't make that assumption, we would want the SPREAD strategy. I added that here since it was pretty straightforward.
I'd advocate for only implementing placement group support for non-resizeable workloads. Currently (and for the near future), Tune is going to be the primary beneficiary of placement group support on workers, and these worker groups will be static in size.
@richardliaw can you elaborate on this? Does this mean that Tune will support the PACK strategy on workers, and once that's implemented we would remove it from xgboost_ray?
No, I'm just saying that if you use Tune, you'll need placement groups, but you don't need elastic support for placement groups. If you don't use Tune, maybe we can get away without using placement groups entirely (for workers).
Why can't Tune be used with scale-down elastic training? In that case the worker group size won't be static.
There is currently no support for that (maybe later, but out of scope for the next 4 weeks).
When a node dies, Tune currently does not know which trial's resources are lost. The current model for global resource changes is for the trial whose resources have changed to be relaunched with appropriate resources.
But scale-down elastic training is being handled at the xgboost level, right? So as long as the Trainable is still alive, from Tune's perspective there is no failure, since it's being handled in xgboost_ray.train. There is no trial being relaunched in this case.
Tune keeps track of global cluster resources (total and allocated). Without retriggering execution of the trial, Tune will not know that one trial has reduced its resource usage. This puts the resource accounting in an inconsistent state and can lead to a variety of problems. This is, again, a limitation of Tune's resource accounting model. We are revisiting it in this sprint, but before we actually make changes to Tune, anything that doesn't work with its existing form will not work.
LGTM for the placement group!
xgboost_ray/main.py
if cpus_per_actor <= 0:
    cluster_cpus = _ray_get_cluster_cpus() or 1
    cpus_per_actor = min(
        int(_get_min_node_cpus() or 1),
        int(cluster_cpus // num_actors))
Btw, I think this should be _get_min_node_cpus, not _get_max_node_cpus. If we have a cluster with ~~3~~ 4 nodes (~~1~~ 2 with 1 CPU each, and 2 with 3 CPUs each) and we have num_actors=4, how many CPUs should we have per actor?
Previously this would give 2 CPUs per actor, but that's currently not possible with this setup. We should be bounded by the minimum number of CPUs, not the maximum, if I'm understanding this correctly.
Hmm, ideally you end up with 3 actors, but I think you'll want to do 3 CPUs per actor (resulting in 2 created actors).
The reason is that xgboost is largely memory-bound, and doubling up 2 actors onto 1 machine is less desirable (it increases the likelihood of OOM).
Right, but we currently don't automatically configure the number of actors, only the number of CPUs/GPUs per actor. So if the user passes in num_actors=4, we will always create 4 actors even if it's undesirable.
Also, the example cluster setup I gave was incorrect; it should be 4 nodes in total, 2 with 1 CPU each and 2 with 3 CPUs each.
In master, autodetecting CPUs would set num_cpus_per_actor to 2 with this setup: _get_max_node_cpus() would give 3, int(cluster_cpus // num_actors) == int(8 // 4) would give 2, and we would take the min of these to get 2 CPUs per actor. But 4 actors with 2 CPUs each cannot be scheduled with this cluster setup.
Instead we want _get_min_node_cpus, which would return 1. So we would have 4 actors with 1 CPU each, which can be scheduled.
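To make the arithmetic concrete, here is a hedged sketch of the two helpers and the resulting bound (the ray.nodes()-based bodies are an assumption for illustration; the real implementations are in this PR):

```python
import ray

def _ray_get_cluster_cpus() -> float:
    # Total CPUs registered across the whole cluster.
    return ray.cluster_resources().get("CPU", 0)

def _get_min_node_cpus() -> float:
    # CPU count of the smallest alive node; an actor requesting more
    # than this can never be scheduled on every node.
    node_cpus = [
        node["Resources"].get("CPU", 0)
        for node in ray.nodes() if node["Alive"]
    ]
    return min(node_cpus) if node_cpus else 0

# Worked example for the 4-node cluster above (2x1 CPU + 2x3 CPUs = 8 CPUs),
# with num_actors=4:
#   _get_min_node_cpus()            -> 1
#   int(cluster_cpus // num_actors) -> int(8 // 4) = 2
#   cpus_per_actor = min(1, 2)      -> 1  (4 actors x 1 CPU: schedulable)
# With _get_max_node_cpus() (3) instead, min(3, 2) = 2, and 4 actors with
# 2 CPUs each cannot fit on the two 1-CPU nodes.
```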
I see; that makes sense.
Now that #29 is merged, this is ready for review @krfricke @richardliaw
cpus_per_actor=cpus_per_actor,
gpus_per_actor=gpus_per_actor,
Maybe instead we can pass these in through an actor_options kwarg instead of 3 separate args?
I'd actually prefer to leave it as is, so it's clear what all of the arguments are. Plus, this is an internal function, so it should be fine to add these extra args. @krfricke any thoughts here?
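For illustration only, a sketch of the two signature shapes being weighed (function and key names here are hypothetical, not this PR's actual internals):

```python
from typing import Dict, Optional

# Option A (as in this PR): explicit parameters, self-documenting signature.
def _make_actor_spec(rank: int, cpus_per_actor: int,
                     gpus_per_actor: int) -> Dict:
    return {"rank": rank, "num_cpus": cpus_per_actor,
            "num_gpus": gpus_per_actor}

# Option B (suggested alternative): a single options dict keeps the
# signature short, but callers must know which keys are expected.
def _make_actor_spec_opts(rank: int,
                          actor_options: Optional[Dict] = None) -> Dict:
    return {"rank": rank, **(actor_options or {})}
```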
Just a couple comments. It's very close!
This looks really good. I have a general question about placement groups, and I would like to see this in a medium-scale release test. I'll run this tomorrow after benchmarking to make sure that it works.
if cpus_per_actor <= 0:
    cluster_cpus = _ray_get_cluster_cpus() or 1
    cpus_per_actor = min(
        int(_get_min_node_cpus() or 1),
        int(cluster_cpus // num_actors))
Changing this from max to min leads to quite different behavior, but I guess we can assume that people with heterogeneous cluster setups might set this manually anyway.
(So I'm fine with this change)
Just to be clear, for homogeneous clusters this does not make a difference, since min == max. But for heterogeneous clusters I believe using max is actually incorrect behavior.
min is definitely the safe choice. Generally, max makes sure that at least one node can schedule an actor, while min makes sure that every node can schedule an actor. I had max in as a sanity check, to make sure that the number of requested CPUs is never larger than the largest node size, just as an upper bound. For scheduling this was desirable in the case where we have a small head node and large workers (e.g. for GPU training, where I used this setup).
Generally, we should just stress setting the number of CPUs per actor manually. We might even want to log a warning if we connect to an existing cluster and cpus_per_actor is not set. Let's discuss this sometime!
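That warning could look something like this (a sketch only; the helper name and how the "existing cluster" flag is wired up are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def _maybe_warn_cpus_unset(cpus_per_actor: int,
                           connected_to_existing_cluster: bool) -> None:
    # Hypothetical helper sketching the warning suggested above: on
    # heterogeneous clusters, auto-detection may pick a poor value.
    if connected_to_existing_cluster and cpus_per_actor <= 0:
        logger.warning(
            "Connected to an existing Ray cluster, but cpus_per_actor is "
            "not set. Auto-detection may be suboptimal on heterogeneous "
            "clusters; consider setting it explicitly.")
```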
xgboost_ray/main.py
pg = _create_placement_group(cpus_per_actor, gpus_per_actor,
                             ray_params.resources_per_actor,
                             ray_params.num_actors, strategy)
The main question I have here is whether we always want to create placement groups. I'm not familiar enough with them to know the benefits and drawbacks. I get that we would want them with Tune, but without Tune and without elastic training, what is the benefit of using placement groups? Does SPREAD give us any benefits? Do placement groups have any significant overhead?
@richardliaw @rkooo567 please feel free to add more info/correct me if I'm wrong here:
One of the benefits of placement groups is atomic resource reservation which is useful on a non-autoscaling cluster. With placement groups, if all the required resources cannot be reserved at the same time, then execution will fail (unless autoscaling is enabled). Without placement groups, some of the training workers will be scheduled, and others will not, causing execution to hang I believe. Note that if we are on an autoscaling cluster, both cases should trigger autoscaling, so there should not be a difference between the two.
As for the actual strategy, PACK vs. SPREAD vs. no placement groups should not make a difference if the entire Ray cluster is fully utilized. But if that's not the case, then SPREAD would improve fault tolerance since it minimizes the number of workers that fail when a node goes down (https://docs.ray.io/en/master/auto_examples/placement-group.html#improve-fault-tolerance). However, this comes at a cost of increased inter-node communication. So SPREAD prioritizes fault tolerance over reducing network communication. PACK does the complete opposite and prioritizes reducing network communication over fault tolerance. Not using placement groups does not make any prioritization, and the placement behavior is not deterministic.
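To make the trade-off concrete, here is a minimal sketch using Ray's public placement group API (bundle shapes and the timeout are illustrative, not this PR's exact code):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

num_actors, cpus_per_actor = 4, 1

# One bundle per training actor. strategy="SPREAD" places bundles on
# distinct nodes where possible (better fault tolerance); "PACK"
# co-locates them (less inter-node communication).
pg = placement_group(
    [{"CPU": cpus_per_actor} for _ in range(num_actors)],
    strategy="SPREAD")

# Atomic reservation: ready() resolves only once *all* bundles are placed,
# so instead of hanging with partially scheduled workers we can fail fast
# if the cluster can't fit the whole group.
ray.get(pg.ready(), timeout=60)
```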
As for additional overheads, I don't know the particulars here, but I'm sure at the very least there is some accounting overhead. But I believe that this overhead only occurs for scheduling decisions when actors are created. Once all of the training workers have been placed, actual training times should not increase.
Overall, I think for the non-Tune case SPREAD is what we want here, since it will improve fault tolerance. But I'm also OK if you think we should discuss this further, and in this PR we can disable placement groups for the non-Tune case. It's only a one-line change to toggle this.
We can also add an environment variable to toggle this, which will probably make it easier to run benchmarking with/without placement groups.
I added an environment variable to toggle this for your benchmarking tomorrow.
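Such a toggle might look like this (the variable name RXGB_USE_PLACEMENT_GROUP is hypothetical; the PR defines the actual one):

```python
import os

def _use_placement_group() -> bool:
    # Defaults to on; set e.g. RXGB_USE_PLACEMENT_GROUP=0 to disable
    # placement groups when benchmarking with/without them.
    return bool(int(os.environ.get("RXGB_USE_PLACEMENT_GROUP", "1")))
```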
Awesome, thanks for the thorough explanation. It sounds like placement groups are generally the desired default behavior. I'll try it out tomorrow. If it works, I don't have anything else to add and we should be ready to merge.
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
These are some results for running with Tune:
This does seem like PACK is working, right?
Depends on #29.
Adds placement group support for xgboost training workers.
Logic is as follows: