Lazy creation of Pairs for better scalability #419

@d4l3k

Currently, Gloo Pairs are created as a "full mesh". On initialization, every worker connects to every other worker, building an O(n^2) mesh of connections. For larger jobs this is very expensive, and we should offer a way of building the connections asynchronously and on demand.
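To make the cost concrete, here is a minimal Python sketch of the idea (Gloo itself is C++, and none of these names come from its API). `full_mesh_connections` counts the connections a full mesh needs, and `LazyPairTable` is a hypothetical per-worker table that only calls its `connect` callback the first time a peer is requested. The callback stands in for the real connection setup, which is assumed rather than shown.

```python
from typing import Callable, Dict


def full_mesh_connections(n: int) -> int:
    """Number of distinct worker-to-worker connections in a full mesh."""
    return n * (n - 1) // 2


class LazyPairTable:
    """Hypothetical per-worker table that creates a pair on first use."""

    def __init__(self, rank: int, size: int,
                 connect: Callable[[int, int], object]):
        self.rank = rank
        self.size = size
        # Stand-in for the real connection setup (an assumption of this sketch).
        self._connect = connect
        self._pairs: Dict[int, object] = {}

    def get_pair(self, peer: int) -> object:
        """Return the pair for `peer`, connecting only if it does not exist yet."""
        if peer == self.rank or not 0 <= peer < self.size:
            raise ValueError(f"invalid peer rank {peer}")
        pair = self._pairs.get(peer)
        if pair is None:
            pair = self._connect(self.rank, peer)
            self._pairs[peer] = pair
        return pair

    def num_connected(self) -> int:
        """Number of peers this worker has actually connected to."""
        return len(self._pairs)
```

For 1024 workers, `full_mesh_connections(1024)` is 523776, whereas a worker using the lazy table only pays for the peers its algorithms actually touch. This sketch is single-threaded and ignores teardown and concurrent requests for the same peer, which a production implementation would need to handle.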

For large-scale jobs (1k+ GPUs), Gloo does offer very performant algorithms, such as its tree barrier implementation, but slow initialization is a problem. This is partly due to the number of connections being established and partly due to heavy use of TCPStore, which is a single central store with limited QPS.
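The store pressure can be estimated with a rough sketch. It assumes that each connection requires one address lookup in the store and that a tree algorithm only talks to a parent and up to two children. Both are assumptions made for illustration; neither is a statement about how Gloo or TCPStore actually behave. `CountingStore`, `tree_peers` and `count_reads` are hypothetical names.

```python
class CountingStore:
    """In-memory stand-in for a rendezvous store that counts reads."""

    def __init__(self):
        self._data = {}
        self.reads = 0

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        self.reads += 1
        return self._data[key]


def tree_peers(rank, size):
    """Parent and children of `rank` in a binary tree rooted at 0 (assumed layout)."""
    peers = []
    if rank > 0:
        peers.append((rank - 1) // 2)
    for child in (2 * rank + 1, 2 * rank + 2):
        if child < size:
            peers.append(child)
    return peers


def count_reads(size, eager):
    """Total store reads when every worker looks up one address per peer."""
    store = CountingStore()
    for rank in range(size):
        store.set(f"addr/{rank}", f"host{rank}:0")
    for rank in range(size):
        if eager:
            # Full mesh: look up every other worker at initialization.
            peers = [p for p in range(size) if p != rank]
        else:
            # Lazy: look up only the peers the tree algorithm needs.
            peers = tree_peers(rank, size)
        for peer in peers:
            store.get(f"addr/{peer}")
    return store.reads
```

Under these assumptions, 1024 workers issue 1,047,552 reads with eager full-mesh setup and 2,046 reads when only tree neighbours are looked up.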

There's some prototype code for this at D69698406, but it has serious memory safety issues. We want a production-ready implementation.

#413 should help with this work, as it lets us manage connections asynchronously via the Loop.
