
[core][gpu-objects] Driver should order all collective calls to avoid deadlock #51264

@kevin85421

Description

As in Ray Compiled Graphs, the driver should order all collective calls to avoid deadlocks.

Example 1:

  • Avoid using NCCL to pass tensors between tasks on the same actor. Instead, the consuming task should read the tensor from the in-actor store directly.
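The rule in Example 1 can be sketched as a planning step on the driver. This is a minimal sketch, not Ray's implementation: `Transfer` and `plan_transfers` are hypothetical names, and it assumes the driver knows which actor produces and which actor consumes each tensor.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transfer:
    """A tensor produced on src_actor and consumed by a task on dst_actor (hypothetical)."""
    obj_id: str
    src_actor: str
    dst_actor: str


def plan_transfers(transfers: List[Transfer]) -> Tuple[List[Transfer], List[Transfer]]:
    """Split transfers into in-actor-store reads and NCCL send/recv pairs."""
    local: List[Transfer] = []
    nccl: List[Transfer] = []
    for t in transfers:
        if t.src_actor == t.dst_actor:
            # Same actor: read the tensor from the in-actor store, so the driver
            # has no collective call to issue (or order) for it.
            local.append(t)
        else:
            # Different actors: needs a matched NCCL send on src and recv on dst.
            nccl.append(t)
    return local, nccl


if __name__ == "__main__":
    local, nccl = plan_transfers([
        Transfer("x", "actor1", "actor1"),
        Transfer("y", "actor1", "actor2"),
    ])
    print([t.obj_id for t in local], [t.obj_id for t in nccl])  # ['x'] ['y']
```

Filtering same-actor transfers first also shrinks the set of collective calls the driver has to order in the first place.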

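Example 2: Both actors are single-threaded and synchronous, and each runs its tasks in the order listed below. t1_1 is the input for t2_2 and t1_2 is the input for t2_1, and both tensors are transferred with NCCL. Actor 1 sends t1_1 before t1_2, so if Actor 2 first issues the NCCL recv for t2_1's input, that recv does not match Actor 1's first send and both actors can block. To avoid the deadlock, the driver should schedule the NCCL recv for t2_2's input before the recv for t2_1's input.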

Actor 1: t1_1, t1_2
Actor 2: t2_1, t2_2
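A minimal sketch of how the driver could derive the ordering in Example 2, assuming each consumer task reads one tensor, each tensor is sent right after its producer finishes, and a send and its matching recv both block until the two sides meet. `build_schedules`, `runs_without_deadlock`, and the op tuples are hypothetical, not Ray APIs. The simulator also simplifies NCCL by treating a send and a recv as matching only when they refer to the same tensor.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical op encoding: ("compute", task), ("send", obj, dst), ("recv", obj, src).
Op = Tuple[str, ...]


def build_schedules(task_order: Dict[str, List[str]],
                    deps: Dict[str, str]) -> Dict[str, List[Op]]:
    """Order each actor's sends/recvs so its recvs follow the peer's send order.

    task_order: actor -> its tasks, in execution order.
    deps: consumer task -> the producer task whose tensor it reads.
    """
    actor_of = {t: a for a, ts in task_order.items() for t in ts}
    consumers_of = defaultdict(list)
    for consumer, producer in deps.items():
        consumers_of[producer].append(consumer)

    # A sender sends each tensor right after producing it, so its send order
    # toward each destination follows its own task order.
    send_order = defaultdict(list)  # (src, dst) -> producer tasks, in send order
    for a, ts in task_order.items():
        for t in ts:
            for dst in dict.fromkeys(actor_of[c] for c in consumers_of[t]):
                if dst != a:
                    send_order[(a, dst)].append(t)

    sched = {}
    for a, ts in task_order.items():
        ops: List[Op] = []
        received = set()
        for t in ts:
            p = deps.get(t)
            if p is not None and actor_of[p] != a and p not in received:
                src = actor_of[p]
                # Post recvs from src in src's send order, up to and including p.
                for q in send_order[(src, a)]:
                    if q not in received:
                        ops.append(("recv", q, src))
                        received.add(q)
                    if q == p:
                        break
            ops.append(("compute", t))
            # Send t's output once to every other actor that consumes it.
            for dst in dict.fromkeys(actor_of[c] for c in consumers_of[t]):
                if dst != a:
                    ops.append(("send", t, dst))
        sched[a] = ops
    return sched


def runs_without_deadlock(sched: Dict[str, List[Op]]) -> bool:
    """Simulate single-threaded actors whose send/recv block until both sides meet."""
    pc = {a: 0 for a in sched}  # index of each actor's next op
    while not all(pc[a] == len(ops) for a, ops in sched.items()):
        progressed = False
        for a, ops in sched.items():
            if pc[a] == len(ops):
                continue
            op = ops[pc[a]]
            if op[0] == "compute":
                pc[a] += 1
                progressed = True
            elif op[0] == "send":
                _, obj, dst = op
                peer = sched[dst]
                # The send completes only if dst is waiting on the matching recv.
                if pc[dst] < len(peer) and peer[pc[dst]] == ("recv", obj, a):
                    pc[a] += 1
                    pc[dst] += 1
                    progressed = True
            # A recv advances only when its sender reaches the matching send.
        if not progressed:
            return False  # every unfinished actor is blocked: deadlock
    return True


if __name__ == "__main__":
    # Example 2: t1_1 feeds t2_2 and t1_2 feeds t2_1.
    task_order = {"A1": ["t1_1", "t1_2"], "A2": ["t2_1", "t2_2"]}
    deps = {"t2_2": "t1_1", "t2_1": "t1_2"}
    planned = build_schedules(task_order, deps)
    print(planned["A2"])
    # Naive order: Actor 2 receives each input right before the task that uses it.
    naive = dict(planned, A2=[("recv", "t1_2", "A1"), ("compute", "t2_1"),
                              ("recv", "t1_1", "A1"), ("compute", "t2_2")])
    print(runs_without_deadlock(planned), runs_without_deadlock(naive))  # True False
```

The design choice here is that each receiver posts its recvs in the sender's send order rather than in its own consumer order. That covers one-directional traffic like Example 2; when tensors flow in both directions between actors, the driver would likely also need a global order across actor pairs.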

Note: Check whether this ordering still works when only one CUDA stream is available.

Use case

No response

Metadata

Assignees

Labels

  • P0: Issues that should be fixed in short order
  • community-backlog
  • core: Issues that should be addressed in Ray Core
  • enhancement: Request for new feature and/or capability
  • gpu-objects

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests