
Ensuring RPC Reliability for Internal Messages [Umbrella Issue] #32119

@osalpekar


🚀 Feature

Ensure at-least-once semantics for RRef control messages and certain Distributed Autograd internal messages sent through the RPC framework. Currently, RPC makes no delivery guarantees for internal messages, which could lead to inconsistent behavior.

Motivation

Internal control messages for the RRef protocol and Distributed Autograd can fail due to network or other issues. This may lead to unwanted behavior; for example, the caller may incorrectly assume that an RPC was successful. To correct for this, the fault-tolerant RPC design will offer a function that supports sending RPCs with retries, as well as guarantees around idempotency for internal RRef functions. Idempotency matters here because a retried message may be delivered more than once.
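To make the retry-plus-idempotency idea concrete, here is a minimal, framework-free sketch. It is not the PyTorch RPC API: `send_with_retries`, `IdempotentHandler`, and `FlakyChannel` are hypothetical names invented for illustration, and the simulated failure (the message arrives but its ack is lost) is an assumed scenario. It shows why at-least-once delivery needs an idempotent receiver: the sender retries until it sees an ack, and the receiver deduplicates by message ID so the side effect is applied only once.

```python
import time
from typing import Callable, Set, TypeVar

T = TypeVar("T")


def send_with_retries(send: Callable[[], T], max_retries: int = 3,
                      base_delay_s: float = 0.01) -> T:
    """Call `send` until it succeeds or the retry budget is exhausted."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_retries:
                raise  # budget exhausted: surface the failure to the caller
            time.sleep(base_delay_s * (2 ** attempt))  # exponential backoff
    raise AssertionError("unreachable")


class IdempotentHandler:
    """Hypothetical receiver that applies each message ID at most once."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.applied = 0

    def handle(self, message_id: int) -> str:
        if message_id not in self.seen:
            self.seen.add(message_id)
            self.applied += 1  # side effect happens only on first delivery
        return "ack"


class FlakyChannel:
    """Simulated link: every message is delivered, but the first `drops` acks are lost."""

    def __init__(self, handler: IdempotentHandler, drops: int) -> None:
        self.handler = handler
        self.drops = drops
        self.attempts = 0

    def send(self, message_id: int) -> str:
        self.attempts += 1
        ack = self.handler.handle(message_id)  # the message reaches the receiver
        if self.drops > 0:
            self.drops -= 1
            raise ConnectionError("ack lost")  # sender cannot tell it succeeded
        return ack


handler = IdempotentHandler()
channel = FlakyChannel(handler, drops=2)
result = send_with_retries(lambda: channel.send(message_id=7))
```

In this run the sender makes three attempts before it receives an ack, yet the receiver applies the message once. Without the deduplication in `handle`, the same retries would apply the side effect three times.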

Child Issues:

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar


Labels

module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)

triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
