[codex] implement session bind failure recovery #465

Merged
k82cn merged 4 commits into xflops:main from k82cn:codex/session-bind-failure-recovery
May 19, 2026

@k82cn k82cn commented May 19, 2026

Summary

Implements session bind failure recovery for RFE25.

  • Adds BindExecutorCompletedRequest.result and internal FlameResult conversions.
  • Reports executor bind/install/shim/enter failures through bind-completed results.
  • Moves bind-completed result handling into executor state, with failed binding executors transitioning to Unbinding.
  • Tracks transient session-level retry_count, records session bind failure and retry-limit events, and skips not-ready sessions through SessionFilter predicates.
  • Surfaces session events through storage, SDK conversion, and flmctl view -s.
  • Adds the design doc for the recovery behavior.
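The flow in the summary above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the types `BindResult`, `Executor`, and `Session`, the function `on_bind_completed`, and the `>=` comparison against the retry limit are all hypothetical stand-ins chosen for the example.

```rust
// Hypothetical, simplified model of the bind-completed handling.
// None of these definitions are taken from the Flame codebase.

#[derive(Debug, Clone, PartialEq)]
enum ExecutorState {
    Binding,
    Bound,
    Unbinding,
}

// Stand-in for the result carried by a bind-completed request.
#[derive(Debug, Clone, PartialEq)]
enum BindResult {
    Ok,
    Failed(String),
}

#[derive(Debug, Default)]
struct Session {
    // Transient, in-memory retry counter (assumed not to be persisted).
    retry_count: u32,
    // Human-readable session events.
    events: Vec<String>,
}

#[derive(Debug)]
struct Executor {
    state: ExecutorState,
}

// Apply a bind-completed result to the executor and its session.
fn on_bind_completed(
    exec: &mut Executor,
    ssn: &mut Session,
    result: BindResult,
    retry_limit: u32,
) {
    match result {
        // Success: the executor becomes Bound; retry_count is left as-is.
        BindResult::Ok => exec.state = ExecutorState::Bound,
        // Failure: the executor moves to Unbinding and the session
        // records the failure and bumps its retry counter.
        BindResult::Failed(reason) => {
            exec.state = ExecutorState::Unbinding;
            ssn.retry_count += 1;
            ssn.events.push(format!("bind failed: {}", reason));
            // Assumption: the limit is reached when the count equals it.
            if ssn.retry_count >= retry_limit {
                ssn.events.push("retry limit reached".to_string());
            }
        }
    }
}
```

With a limit of 2 in this sketch, the first failure records one event and the second records both a failure event and a retry-limit event.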

Validation

  • cargo fmt --all
  • cargo test -p flame-session-manager model::tests::test_snapshot_find_sessions_by_predicate
  • cargo test -p flame-session-manager scheduler::tests::test_scheduler_skips_not_ready_session
  • cargo test -p flame-session-manager binding_state_tests
  • cargo test -p flame-session-manager bind_session_failed_tests
  • cargo check -p flame-session-manager
  • cargo test -p flame-session-manager
  • cargo test -p flame-executor-manager
  • cargo check -p flame-session-manager -p flame-executor-manager -p flmctl -p flame-rs
  • git diff --check

Full cargo test -p flame-rs was not run as part of the final pass because the package includes benchmark integration tests that expect a live local session-manager. The library tests passed earlier with cargo test -p flame-rs --lib.

fix #25

@k82cn k82cn marked this pull request as ready for review May 19, 2026 04:56

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a session bind failure recovery mechanism, introducing a transient retry counter for sessions and a reporting path for bind failures from the executor manager to the session manager. Key changes include the addition of session-level events, configuration for retry limits, and scheduler updates to skip sessions exceeding these limits. Feedback was provided to improve the robustness of the binding state machine by ensuring an executor cannot transition to a 'Bound' state if its associated session ID is missing during completion.

Comment on lines +33 to +37
if let Some(ssn_id) = self.bound_session_id()? {
    let ssn_ptr = self.storage.get_session_ptr(ssn_id)?;
    let mut ssn = lock_ptr!(ssn_ptr)?;
    ssn.retry_count = 0;
}

high

In bind_session_success, if bound_session_id() returns None, the executor's state is still transitioned to Bound without being associated with a session. A Bound executor should always have a session. This could lead to an invalid state where an executor is considered bound but has no session, potentially causing issues in other parts of the system that assume a bound executor has a session.

It would be more robust to return an InvalidState error if the session ID is missing, similar to how bind_session_failed handles it.

        let ssn_id = self.bound_session_id()?.ok_or_else(|| {
            let e = lock_ptr!(self.executor).unwrap();
            FlameError::InvalidState(format!(
                "Executor <{}> has no session attached in Binding state on successful completion",
                e.id
            ))
        })?;

        let ssn_ptr = self.storage.get_session_ptr(ssn_id)?;
        let mut ssn = lock_ptr!(ssn_ptr)?;
        ssn.retry_count = 0;

Addressed in 57ecd12d: bind_session_success now requires an attached session and returns InvalidState before transitioning to Bound when it is missing. I also added unit coverage for the missing-session success path.
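The validate-before-transition ordering described in this reply can be illustrated with a small standalone sketch. The `FlameError::InvalidState` name comes from the review suggestion above; everything else here (the simplified `Executor` struct, a free function instead of a method on the binding state, the exact message text) is an assumption made for illustration and does not reproduce commit 57ecd12d.

```rust
// Simplified stand-in for the project's error type; only the variant
// mentioned in the review is modelled here.
#[derive(Debug, PartialEq)]
enum FlameError {
    InvalidState(String),
}

#[derive(Debug, PartialEq)]
enum ExecutorState {
    Binding,
    Bound,
}

// Hypothetical executor with an optional attached session ID.
#[derive(Debug)]
struct Executor {
    id: String,
    ssn_id: Option<String>,
    state: ExecutorState,
}

// Check the session first, so an error leaves the state untouched.
fn bind_session_success(e: &mut Executor) -> Result<(), FlameError> {
    if e.ssn_id.is_none() {
        return Err(FlameError::InvalidState(format!(
            "Executor <{}> has no session attached in Binding state",
            e.id
        )));
    }
    // Only reached when a session is attached.
    e.state = ExecutorState::Bound;
    Ok(())
}
```

Because the check runs before the assignment, an executor without a session stays in `Binding` when the error is returned.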

@k82cn k82cn force-pushed the codex/session-bind-failure-recovery branch from 68329d2 to 260c8eb Compare May 19, 2026 05:07
@k82cn k82cn force-pushed the codex/session-bind-failure-recovery branch from 260c8eb to 57ecd12 Compare May 19, 2026 05:54
let ssn_id = self.bound_session_id()?;
let ssn_ptr = self.storage.get_session_ptr(ssn_id)?;
let mut ssn = lock_ptr!(ssn_ptr)?;
ssn.retry_count = 0;

We should not reset retry_count here; just leave it unchanged.

Addressed in b257331e: successful bind completion now validates the attached session but leaves retry_count unchanged.

Comment on lines +45 to +49
let (executor_id, node, ssn_id) = {
    let e = lock_ptr!(self.executor)?;
    let ssn_id = self.bound_session_id_from_executor(&e)?;
    (e.id.clone(), e.node.clone(), ssn_id)
};

Move these lookups into record_bind_failure and increment_session_retry_count. Those two functions should take only an executor parameter and derive the information they need themselves.

Addressed in b257331e: record_bind_failure and increment_session_retry_count now take the executor and derive the session/executor details internally.
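One way to picture the shape of this refactor is shown below. Only the two function names come from the review thread; the `Storage` and `Session` structs, the private `session_mut` helper, the `String` error type, and the event wording are hypothetical and chosen to keep the example self-contained.

```rust
use std::collections::HashMap;

// Hypothetical executor record; field names are illustrative.
struct Executor {
    id: String,
    node: String,
    ssn_id: Option<String>,
}

#[derive(Default)]
struct Session {
    retry_count: u32,
    events: Vec<String>,
}

// Hypothetical in-memory storage keyed by session ID.
struct Storage {
    sessions: HashMap<String, Session>,
}

impl Storage {
    // Shared lookup: resolve the executor's session or explain why not.
    fn session_mut(&mut self, e: &Executor) -> Result<&mut Session, String> {
        let ssn_id = e
            .ssn_id
            .as_ref()
            .ok_or_else(|| format!("Executor <{}> has no bound session", e.id))?;
        self.sessions
            .get_mut(ssn_id)
            .ok_or_else(|| format!("Session <{}> not found", ssn_id))
    }

    // Takes only the executor; ID, node, and session are derived inside.
    fn record_bind_failure(&mut self, e: &Executor, reason: &str) -> Result<(), String> {
        let event = format!(
            "executor <{}> on node <{}> failed to bind: {}",
            e.id, e.node, reason
        );
        self.session_mut(e)?.events.push(event);
        Ok(())
    }

    // Takes only the executor; returns the updated retry count.
    fn increment_session_retry_count(&mut self, e: &Executor) -> Result<u32, String> {
        let ssn = self.session_mut(e)?;
        ssn.retry_count += 1;
        Ok(ssn.retry_count)
    }
}
```

The caller no longer needs to extract `(executor_id, node, ssn_id)` up front, which is the point raised in the review comment.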

Comment on lines +63 to +75
fn bound_session_id(&self) -> Result<String, FlameError> {
    let e = lock_ptr!(self.executor)?;
    self.bound_session_id_from_executor(&e)
}

fn bound_session_id_from_executor(
    &self,
    executor: &crate::model::Executor,
) -> Result<String, FlameError> {
    executor.ssn_id.clone().ok_or_else(|| {
        FlameError::InvalidState(format!("Executor <{}> has no bound session", executor.id))
    })
}

Are these helpers still necessary after the refactor?

Addressed in b257331e: removed the redundant bound-session helper path after the binding-state refactor.

Comment thread session_manager/src/model/mod.rs Outdated

pub const OPEN_SESSION: Option<SessionFilter> = Some(SessionFilter::by_state(SessionState::Open));

pub fn open_ready_session(retry_limits: u32) -> Option<SessionFilter> {

Replace it with a READY_SESSION const, similar to OPEN_SESSION.

Addressed in b257331e: replaced the ready-session helper with READY_SESSION, implemented as a const inline lambda predicate, and updated scheduler actions to use it directly.
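A const predicate of this kind might look like the sketch below. It relies on the fact that a non-capturing closure coerces to a plain `fn` pointer, so it can be stored in a `const`. The `SessionFilter` layout, the `matches` method, the `Closed` variant, and passing the retry limit as a second argument are all assumptions for illustration; the actual `SessionFilter` in `session_manager/src/model/mod.rs` may be structured differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SessionState {
    Open,
    Closed,
}

// Hypothetical session snapshot with the transient retry counter.
struct Session {
    state: SessionState,
    retry_count: u32,
}

// Hypothetical filter that wraps a plain function pointer, which keeps
// the type usable inside a `const` item.
#[derive(Clone, Copy)]
struct SessionFilter {
    predicate: fn(&Session, u32) -> bool,
}

impl SessionFilter {
    fn matches(&self, ssn: &Session, retry_limit: u32) -> bool {
        (self.predicate)(ssn, retry_limit)
    }
}

// A non-capturing closure coerces to `fn`, so it can live in a const.
// Assumption: "ready" means open and still below the retry limit.
const READY_SESSION: Option<SessionFilter> = Some(SessionFilter {
    predicate: |ssn, retry_limit| {
        ssn.state == SessionState::Open && ssn.retry_count < retry_limit
    },
});
```

A scheduler action could then evaluate `READY_SESSION` against each session in a snapshot and skip those for which `matches` returns `false`.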

@k82cn k82cn merged commit 3df014b into xflops:main May 19, 2026
6 of 7 checks passed
@k82cn k82cn deleted the codex/session-bind-failure-recovery branch May 19, 2026 20:25

Development

Successfully merging this pull request may close these issues.

Add more error handling for Session and Task
