Description
Currently, Gloo Pairs are created as a "full mesh". On initialization, every worker connects to every other worker, building an O(n^2) mesh of connections. For larger jobs this is very expensive, and we should offer a way to build the connections asynchronously, on demand.
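To make the cost concrete, here is a minimal Python sketch of the idea, not Gloo code. `full_mesh_pair_count` counts the distinct connections a full mesh needs, and `LazyPairs` is a hypothetical cache that only connects to a peer the first time it is requested. The class name, the `connect` callback, and the method names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def full_mesh_pair_count(world_size: int) -> int:
    """Distinct rank-to-rank connections in a full mesh: n * (n - 1) / 2."""
    return world_size * (world_size - 1) // 2


class LazyPairs:
    """Hypothetical on-demand pair cache (illustrative only, not Gloo's API)."""

    def __init__(self, rank: int, world_size: int,
                 connect: Callable[[int, int], object]) -> None:
        self.rank = rank
        self.world_size = world_size
        self._connect = connect  # assumed callback that establishes one connection
        self._pairs: Dict[int, object] = {}

    def get(self, peer: int) -> object:
        """Return the pair for `peer`, connecting only on first use."""
        if peer == self.rank or not 0 <= peer < self.world_size:
            raise ValueError(f"invalid peer rank: {peer}")
        pair = self._pairs.get(peer)
        if pair is None:
            pair = self._connect(self.rank, peer)
            self._pairs[peer] = pair
        return pair

    def connected_peers(self) -> List[int]:
        """Ranks this worker has actually connected to so far."""
        return sorted(self._pairs)
```

With 1024 workers a full mesh needs 523,776 connections, while a worker using something like `LazyPairs` only pays for the peers its collectives actually touch. A real implementation would also need both sides to agree on when to connect, which this sketch leaves out.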
For large-scale jobs (1k+ GPUs), Gloo does offer very performant algorithms, such as its tree barrier implementation, but slow initialization is a problem. This is partly due to the number of connections being established, and partly due to heavy use of TCPStore, which is a single centralized store with limited QPS.
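As a rough illustration of why a tree algorithm does not need the full mesh, the sketch below lists the peers each rank talks to in a binary tree. The binary-heap layout (parent at `(rank - 1) // 2`) is an assumption chosen for simplicity; Gloo's tree barrier may use a different shape or fan-out.

```python
from typing import List


def tree_peers(rank: int, world_size: int) -> List[int]:
    """Peers of `rank` in an assumed binary-heap tree: parent, then children."""
    peers = []
    if rank > 0:
        peers.append((rank - 1) // 2)  # parent
    for child in (2 * rank + 1, 2 * rank + 2):
        if child < world_size:
            peers.append(child)
    return peers


def tree_connection_count(world_size: int) -> int:
    """Total distinct connections in the tree (each edge is counted from both ends)."""
    return sum(len(tree_peers(r, world_size)) for r in range(world_size)) // 2
```

Under this assumed layout, a 1024-rank job needs 1,023 connections in total and no rank has more than three peers, so most of a full mesh would go unused by such an algorithm.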
There's some prototype code for this at D69698406, but it has serious memory safety issues. We want to make a production-ready implementation.
#413 should help with this work, as we can manage connections asynchronously via the Loop.