[Train] Sort Local Train Workers by GPU id #40953
Conversation
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
# More details: https://github.com/ray-project/ray/issues/40803
def get_lowest_gpu_id(worker) -> int:
    gpu_ids = worker.metadata.resource_ids.get("GPU", [])
    return min(map(int, gpu_ids), default=0)
Converting to int won't work in the future if we support UUIDs (e.g. for MIG). Maybe we can first try to convert to int and then fall back to string?
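The fallback suggested above could be sketched as a sort key that orders numeric IDs numerically and pushes non-numeric IDs (such as MIG UUIDs) after them. This is only an illustrative sketch; `gpu_sort_key` is a hypothetical helper, not Ray's actual API.

```python
from typing import Tuple, Union

def gpu_sort_key(gpu_id: str) -> Tuple[int, Union[int, str]]:
    """Order integer device IDs numerically; fall back to string
    comparison for non-numeric IDs (e.g. MIG UUIDs)."""
    try:
        # Numeric IDs sort first, in numeric (not lexicographic) order.
        return (0, int(gpu_id))
    except ValueError:
        # Non-numeric IDs sort after all numeric ones, lexicographically.
        return (1, gpu_id)

print(sorted(["10", "2", "MIG-abc"], key=gpu_sort_key))
# ['2', '10', 'MIG-abc']
```

The two-element tuple avoids ever comparing an int against a str, which would raise a TypeError in Python 3.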
Got it, makes sense. I've made it fall back to the string ID.
Wondering if we should update the name of this method now that it's not just grouping but also sorting. At the very least we should update the docstring.
In the future we may want to start generalizing this more (e.g. have it be a generic sort function that takes in a comparator) and define the comparison logic upstream in the caller, since this IP/GPU sorting logic doesn't actually belong to the "worker group", but the "backend" layer on top of it.
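The generalization suggested above could look like the following minimal sketch, where the worker group exposes a generic sort and the caller (the backend layer) supplies the comparison logic. The class and method names here are hypothetical, not Ray's actual interfaces.

```python
from typing import Any, Callable, List

class WorkerGroup:
    """Hypothetical worker group that stays agnostic of IP/GPU semantics."""

    def __init__(self, workers: List[Any]) -> None:
        self.workers = workers

    def sort_workers(self, key: Callable[[Any], Any]) -> None:
        # The group only performs the sort; the key (comparator logic)
        # is defined upstream by the backend layer.
        self.workers.sort(key=key)

# The backend supplies the IP/GPU ordering, e.g.:
# group.sort_workers(key=lambda w: (w.ip, min(w.gpu_ids, default=0)))
```

This keeps the IP/GPU-specific ranking policy out of the worker group itself, as proposed.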
I agree. This is a temporary solution for now. We should eventually move the rank allocation logic into the backend executor, e.g. here:
def _create_rank_world_size_mappings(self) -> List[Dict]:
sort_workers_by_ip_and_gpu_id?
"pids": [0, 1, 2, 3, 4, 5, 6, 7],
"ips": ["2", "2", "1", "1", "2", "1", "1", "2"],
"gpu_ids": [None] * 8,
"expected_local_ranks": None,  # No expected ranks for CPU workers
I think we should still test default sorting behavior when there are no GPU IDs?
When using CPU actors, the GPU IDs will be empty, so the sorted order will be non-deterministic since all the workers have the same key 0.
Oh I think sort will retain the ordering when the entries have the same sort value. But to your point I don't know if this is guaranteed or just based on the implementation.
entries = [5, 4, 3, 2, 1]
print(entries)                 # [5, 4, 3, 2, 1]
entries.sort(key=lambda x: 0)  # all keys equal: original order retained
print(entries)                 # [5, 4, 3, 2, 1]
entries.sort(key=lambda x: x)
print(entries)                 # [1, 2, 3, 4, 5]
Ah, I got you. It seems that Python's sort maintains the original order when multiple entries have identical keys: https://docs.python.org/3.7/howto/sorting.html#sort-stability-and-complex-sorts
I'll add a test for it.
Why are these changes needed?
Sort the local Ray Train workers according to the GPU device id. This ensures that the allocated GPU == "cuda:{local_rank}". More details in #40803
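The idea described above can be sketched as follows: sorting workers first by node IP and then by lowest GPU ID means local rank i on each node lines up with device "cuda:i". This is an illustrative sketch with made-up worker dicts, not the actual Ray Train implementation.

```python
# Hypothetical worker metadata: two workers on node 10.0.0.1, one on 10.0.0.2.
workers = [
    {"ip": "10.0.0.1", "gpu_ids": ["1"]},
    {"ip": "10.0.0.1", "gpu_ids": ["0"]},
    {"ip": "10.0.0.2", "gpu_ids": ["0"]},
]

# Group by IP, then order by the lowest GPU ID on each worker.
workers.sort(key=lambda w: (w["ip"], min(map(int, w["gpu_ids"]), default=0)))

# Workers on each node now appear in ascending GPU-ID order, so assigning
# local ranks in list order matches the CUDA device indices.
print([(w["ip"], w["gpu_ids"]) for w in workers])
# [('10.0.0.1', ['0']), ('10.0.0.1', ['1']), ('10.0.0.2', ['0'])]
```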
Related issue number
Close #40803
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.