Add client-side retry for oz starting shared sessions#12111
Conversation
ec26006 to
04224d1
Compare
|
I'm starting a first review of this pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
| } | ||
| } | ||
|
|
||
| fn log_diagnostic(&self, event: &'static str, details: impl std::fmt::Display) { |
There was a problem hiding this comment.
@captainsafia I'm removing this log because it seemed too verbose and duplicated some existing logs. I added ad hoc logs to replace some of the removals.
But I'm curious about the motivation - is the session id/source type/source task id needed on all logs for something? I'd be down to add it back or enforce these are included in the logs if needed
There was a problem hiding this comment.
I included the source ID and the task ID in the log originally to make it easier to correlate this log with the task-specific logs that are emitted from the server. I think it might be helpful at minimum to include the task ID if the session is a cloud mode session?
There was a problem hiding this comment.
I thought we would generally know the task and session ID by the point we're looking at worker logs, but I don't see a harm in adding them back. Will add back a version of this that includes those values
|
/oz-review |
|
I'm re-reviewing this pull request in response to a review request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
There was a problem hiding this comment.
Overview
This PR adds ambient-agent-only retry handling around shared-session creation, including per-attempt timeouts, retry bookkeeping, heartbeat cleanup between attempts, and an extended agent-driver wait window.
Concerns
- The startup attempt guard remains active after
SessionInitializedclearsstartup_retry, so the established websocket drops all subsequent downstream messages and skips the close/reconnect path.
Verdict
Found: 0 critical, 1 important, 0 suggestions
Request changes
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
|
/oz-review |
|
I'm re-reviewing this pull request in response to a review request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
There was a problem hiding this comment.
Overview
This PR adds client-side retry handling for ambient-agent shared-session creation, including per-attempt timeout handling and tests for retryability/state filtering.
Concerns
- The new retry/final-failure log messages drop the
source_type/source_task_iddiagnostic context that previously made shared-session failures traceable to a specific Oz task.
Verdict
Found: 0 critical, 0 important, 1 suggestions
Approve with nits
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
| if self.should_retry_startup_failure(&failure) { | ||
| if let Some(cause) = cause.as_ref() { | ||
| log::warn!( | ||
| "Shared session creation attempt failed, will retry; attempt={attempt} max_attempts={max_attempts} reason={reason} cause={cause:#}" |
There was a problem hiding this comment.
💡 [SUGGESTION] Preserve the ambient-agent source context in retry/final-failure logs so these failures can still be correlated to a specific Oz task; include startup_config.source here or keep equivalent diagnostic context on Network.
captainsafia
left a comment
There was a problem hiding this comment.
This approach seems good overall! Left one question inline about whether it's worth adding back-offs here. I think we want to optimize for a snappier connection experience in the event the connection recovers but leaving it as food for though.
| } | ||
| } | ||
|
|
||
| fn log_diagnostic(&self, event: &'static str, details: impl std::fmt::Display) { |
There was a problem hiding this comment.
I included the source ID and the task ID in the log originally to make it easier to correlate this log with the task-specific logs that are emitted from the server. I think it might be helpful at minimum to include the task ID if the session is a cloud mode session?
|
|
||
| impl StartupRetryState { | ||
| #[cfg_attr(any(test, feature = "integration_tests"), allow(dead_code))] | ||
| fn new(max_attempts: usize) -> Self { |
There was a problem hiding this comment.
We're not doing any backoff coupled with the retries here AFAICT. Is that intentional or do we want to incorporate backoffs here?
There was a problem hiding this comment.
This was intentional because my current belief is that the request gets stuck for some reason and retrying immediately would help, and we want to keep the time to session share low. I'll revisit this if needed though
|
I added |
Description
Motivation for this: https://linear.app/warpdotdev/issue/REMOTE-1802/session-sharing-server-times-out-creating-session
When we fail to start session-sharing-server for retryable reasons in an oz run, we make up to 3 attempts, with each attempt getting 5s (total 20s wait on the oz agent driver).
Testing
Tested with local session-sharing-server injecting errors. The first two succeed on the third attempt and run proceeds normally. Manual user sharing also still works
Inject two internal server errors
Inject 2 hangs (no session initialized sent)
No session-sharing-server running (expected failure after 3 attempts):