[core][gpu-objects] Driver should order all collective calls to avoid deadlock

### Description

Similar to compiled graphs, the driver should order all collective calls to avoid deadlocks.

Example 1:
* Avoid using NCCL to pass tensors between tasks on the same actor. Instead, we should read them directly from the in-actor store.
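
A minimal sketch of the check in Example 1, assuming the driver knows the source and destination actor of each tensor transfer. `Transfer`, `choose_transport`, and the transport labels are hypothetical names used for illustration only; they are not Ray APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """Hypothetical record of one tensor moving between two tasks."""
    tensor_id: str
    src_actor: str
    dst_actor: str


def choose_transport(transfer: Transfer) -> str:
    """Pick a transport label for a transfer.

    Same-actor transfers skip NCCL and read from the in-actor store;
    cross-actor transfers go over NCCL.
    """
    if transfer.src_actor == transfer.dst_actor:
        return "in_actor_store"
    return "nccl"


local = choose_transport(Transfer("x", "actor1", "actor1"))
remote = choose_transport(Transfer("y", "actor1", "actor2"))
```

Here `local` resolves to the in-actor store and `remote` to NCCL, so a tensor never goes through a send/recv pair that a single-threaded actor would have to issue against itself.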

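Example 2: Both actors are single-threaded and synchronous, and each runs its tasks in the order listed below. `t1_1` is the input for `t2_2` and `t1_2` is the input for `t2_1`, and both transfers use NCCL. Assuming Actor 1 sends each output in task order, it sends the output of `t1_1` first; if Actor 2 posts the recv for `t2_1` first, each actor waits on an operation the other has not reached yet. In this case, we should call the NCCL recv for `t2_2` before the recv for `t2_1` to avoid deadlock.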

```
Actor 1: t1_1, t1_2
Actor 2: t2_1, t2_2
```
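
The ordering argument in Example 2 can be checked with a toy model. The sketch below is pure Python and does not call Ray or NCCL. It assumes each send/recv blocks its actor until the peer reaches the matching operation, which is a simplification of how NCCL operations behave on a single stream. The names `run`, `matching_recvs`, and the tuple layout `(kind, peer, tag)` are made up for this illustration.

```python
from collections import deque


def run(schedules):
    """Simulate blocking point-to-point ops with rendezvous semantics.

    schedules maps actor name -> list of (kind, peer, tag) in issue order.
    Returns True if every op completes, False if the actors deadlock.
    """
    queues = {actor: deque(ops) for actor, ops in schedules.items()}
    progressed = True
    while progressed:
        progressed = False
        for actor, q in queues.items():
            if not q:
                continue
            kind, peer, tag = q[0]
            # The op at the head of the peer's queue must be the mirror op.
            want = ("recv" if kind == "send" else "send", actor, tag)
            peer_q = queues[peer]
            if peer_q and peer_q[0] == want:
                q.popleft()
                peer_q.popleft()
                progressed = True
    return all(not q for q in queues.values())


def matching_recvs(sender, sends):
    """Order the receiver's recvs to mirror the sender's send order."""
    return [("recv", sender, tag) for (_kind, _peer, tag) in sends]


# Actor 1 sends its outputs in task order: t1_1, then t1_2.
sends = [("send", "actor2", "t1_1"), ("send", "actor2", "t1_2")]

# Naive: Actor 2 posts recvs in its own task order (t2_1 needs t1_2 first).
naive = [("recv", "actor1", "t1_2"), ("recv", "actor1", "t1_1")]

# Driver-ordered: recv for t2_2's input (t1_1) is posted before t2_1's.
ordered = matching_recvs("actor1", sends)

naive_ok = run({"actor1": sends, "actor2": naive})
ordered_ok = run({"actor1": sends, "actor2": ordered})
```

In this model the naive schedule stalls immediately because neither head operation has a matching peer, while the driver-ordered schedule drains both queues. Whether the same holds on real hardware depends on stream behavior, which is the open question in the note below.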

Note: Check whether this ordering still works if we only have one CUDA stream.

### Use case

_No response_

[core][gpu-objects] Driver should order all collective calls to avoid deadlock #51264

Description

Use case

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[core][gpu-objects] Driver should order all collective calls to avoid deadlock #51264

Description

Description

Use case

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions