TensorPipes RPC Agent Message Acknowledgements #33989

@osalpekar

🚀 Feature

The TensorPipes RPC Agent will support a result_placement argument, which lets the receiver send the computed result to a worker other than the sender. We must therefore ensure that the sender still receives confirmation of the message's success or failure.

For example, if CPU0 sends a message to CPU1 and indicates that the result should be sent to GPU0, then GPU0 must confirm to CPU0 that the message was received. Such messages would likely need longer timeouts, since the confirmation arrives only after the extra hop from CPU1 to GPU0.

Furthermore, if the result must be placed on multiple devices, the original sender must ensure that each machine receiving the result confirms successful completion of the message.
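
To make the requirement concrete, here is a minimal sketch of the sender-side bookkeeping, assuming each destination reports back by name. `AckTracker`, `acknowledge`, `wait`, and `missing` are hypothetical names for illustration only; they are not part of the RPC agent API, and a real implementation would sit on top of the agent's own transport and timeout handling.

```python
import threading
from typing import Dict, Iterable, List, Optional


class AckTracker:
    """Hypothetical sender-side record of which result destinations have confirmed."""

    def __init__(self, destinations: Iterable[str]) -> None:
        self._pending = set(destinations)
        self._failures: Dict[str, str] = {}
        self._lock = threading.Lock()
        self._done = threading.Event()
        if not self._pending:
            self._done.set()

    def acknowledge(self, worker: str, error: Optional[str] = None) -> None:
        """Record a confirmation (or a failure report) from one destination."""
        with self._lock:
            if worker not in self._pending:
                return  # unknown or duplicate acknowledgement; ignore it
            self._pending.discard(worker)
            if error is not None:
                self._failures[worker] = error
            if not self._pending:
                self._done.set()

    def wait(self, timeout: float) -> bool:
        """Return True only if every destination confirmed success within timeout seconds."""
        finished = self._done.wait(timeout)
        with self._lock:
            return finished and not self._failures

    def missing(self) -> List[str]:
        """Destinations that have not reported back yet."""
        with self._lock:
            return sorted(self._pending)


# CPU0 asks for the result to be placed on two devices.
tracker = AckTracker(["GPU0", "GPU1"])
tracker.acknowledge("GPU0")
partial = tracker.wait(timeout=0.01)   # GPU1 has not confirmed yet
still_missing = tracker.missing()
tracker.acknowledge("GPU1")
complete = tracker.wait(timeout=0.01)  # both destinations confirmed

# A failure report from any destination makes the whole message fail.
failed = AckTracker(["GPU0"])
failed.acknowledge("GPU0", error="placement failed")
failed_ok = failed.wait(timeout=0.01)
```

In this sketch the timeout passed to `wait` is where the longer timeout mentioned above would apply, because the sender is waiting on workers it never contacted directly.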

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar

Labels

module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
