Description
🚀 Feature
Ensure at-least-once semantics for RRef control messages and certain Distributed Autograd internal messages sent through the RPC framework. Currently, RPC provides no delivery guarantees for internal messages, which could lead to inconsistent behavior.
Motivation
Internal control messages for the RRef protocol and Distributed Autograd can fail due to network or other issues. This may lead to unwanted behavior, for example the caller incorrectly assuming that an RPC was successful. To correct for this, the fault-tolerant RPC design will offer a function that supports sending RPCs with retries, as well as idempotency guarantees for internal RRef functions.
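To make the retry half of this proposal concrete, here is a minimal sketch of resending a message with exponential backoff until it succeeds or a retry limit is hit. Everything in it (`send_with_retries`, `FlakySender`, the parameter names and defaults) is hypothetical and for illustration only; it does not use or reflect the actual RPC framework API.

```python
import time


def send_with_retries(send_fn, message, max_retries=5,
                      initial_delay=0.01, backoff=2.0, sleep=time.sleep):
    """Call send_fn(message) until it succeeds or max_retries attempts fail.

    Hypothetical helper: only ConnectionError is treated as retryable here.
    """
    delay = initial_delay
    for attempt in range(1, max_retries + 1):
        try:
            return send_fn(message)
        except ConnectionError:
            if attempt == max_retries:
                # Out of attempts: surface the failure to the caller.
                raise
            sleep(delay)
            delay *= backoff  # Exponential backoff between attempts.


class FlakySender:
    """Stand-in transport that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, message):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network failure")
        return f"ack:{message}"
```

Because the receiver may process a message whose acknowledgement was lost, retries like this give at-least-once delivery, which is why the handlers on the receiving side need to be idempotent.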
Child Issues:
- Retry Queue for RPC Reliability #32124
- Efficient Sleeping Implementation for RPC Retries #32126
- RPC Retry Algorithm Integration and Policies: RPC Retry Algorithm Policy Tuning #32129
- Idempotency Keys for RPC Retry #32130
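
As one possible reading of the idempotency-key item above, the sketch below shows a receiver that caches results by key, so a retried message is not applied twice. `IdempotentReceiver` and `new_idempotency_key` are hypothetical names, and the unbounded in-memory cache is a simplifying assumption rather than a proposed design.

```python
import uuid


def new_idempotency_key():
    """Generate a unique key that the sender attaches to a message (hypothetical)."""
    return uuid.uuid4().hex


class IdempotentReceiver:
    """Runs the handler at most once per idempotency key."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # idempotency key -> cached handler result

    def receive(self, key, payload):
        if key in self._results:
            # Duplicate delivery (e.g. a retry): return the cached result.
            return self._results[key]
        result = self._handler(payload)
        self._results[key] = result
        return result
```

With this in place, the sender generates one key per logical message and reuses it on every retry, so the receiver's side effects happen once even if the message arrives several times.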
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar