[codex] implement session bind failure recovery#465
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a session bind failure recovery mechanism, introducing a transient retry counter for sessions and a reporting path for bind failures from the executor manager to the session manager. Key changes include the addition of session-level events, configuration for retry limits, and scheduler updates to skip sessions exceeding these limits. Feedback was provided to improve the robustness of the binding state machine by ensuring an executor cannot transition to a 'Bound' state if its associated session ID is missing during completion.
| if let Some(ssn_id) = self.bound_session_id()? { | ||
| let ssn_ptr = self.storage.get_session_ptr(ssn_id)?; | ||
| let mut ssn = lock_ptr!(ssn_ptr)?; | ||
| ssn.retry_count = 0; | ||
| } |
There was a problem hiding this comment.
In bind_session_success, if bound_session_id() returns None, the executor's state is still transitioned to Bound without being associated with a session. A Bound executor should always have a session. This could lead to an invalid state where an executor is considered bound but has no session, potentially causing issues in other parts of the system that assume a bound executor has a session.
It would be more robust to return an InvalidState error if the session ID is missing, similar to how bind_session_failed handles it.
let ssn_id = self.bound_session_id()?.ok_or_else(|| {
let e = lock_ptr!(self.executor).unwrap();
FlameError::InvalidState(format!(
"Executor <{}> has no session attached in Binding state on successful completion",
e.id
))
})?;
let ssn_ptr = self.storage.get_session_ptr(ssn_id)?;
let mut ssn = lock_ptr!(ssn_ptr)?;
ssn.retry_count = 0;There was a problem hiding this comment.
Addressed in 57ecd12d: bind_session_success now requires an attached session and returns InvalidState before transitioning to Bound when it is missing. I also added unit coverage for the missing-session success path.
68329d2 to
260c8eb
Compare
260c8eb to
57ecd12
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
| let ssn_id = self.bound_session_id()?; | ||
| let ssn_ptr = self.storage.get_session_ptr(ssn_id)?; | ||
| let mut ssn = lock_ptr!(ssn_ptr)?; | ||
| ssn.retry_count = 0; |
There was a problem hiding this comment.
we should not reset the retry_count, just keep it there.
There was a problem hiding this comment.
Addressed in b257331e: successful bind completion now validates the attached session but leaves retry_count unchanged.
| let (executor_id, node, ssn_id) = { | ||
| let e = lock_ptr!(self.executor)?; | ||
| let ssn_id = self.bound_session_id_from_executor(&e)?; | ||
| (e.id.clone(), e.node.clone(), ssn_id) | ||
| }; |
There was a problem hiding this comment.
move those to record_bind_failure and increment_session_retry_count, those two function should only have an executor paramenter, it should get info themself accordingly.
There was a problem hiding this comment.
Addressed in b257331e: record_bind_failure and increment_session_retry_count now take the executor and derive the session/executor details internally.
| fn bound_session_id(&self) -> Result<String, FlameError> { | ||
| let e = lock_ptr!(self.executor)?; | ||
| self.bound_session_id_from_executor(&e) | ||
| } | ||
|
|
||
| fn bound_session_id_from_executor( | ||
| &self, | ||
| executor: &crate::model::Executor, | ||
| ) -> Result<String, FlameError> { | ||
| executor.ssn_id.clone().ok_or_else(|| { | ||
| FlameError::InvalidState(format!("Executor <{}> has no bound session", executor.id)) | ||
| }) | ||
| } |
There was a problem hiding this comment.
are they still necessary after refactor?
There was a problem hiding this comment.
Addressed in b257331e: removed the redundant bound-session helper path after the binding-state refactor.
|
|
||
| pub const OPEN_SESSION: Option<SessionFilter> = Some(SessionFilter::by_state(SessionState::Open)); | ||
|
|
||
| pub fn open_ready_session(retry_limits: u32) -> Option<SessionFilter> { |
There was a problem hiding this comment.
replace it with READY_SESSION const, similar to OPEN_SESSION.
There was a problem hiding this comment.
Addressed in b257331e: replaced the ready-session helper with READY_SESSION, implemented as a const inline lambda predicate, and updated scheduler actions to use it directly.
Summary
Implements session bind failure recovery for RFE25.
BindExecutorCompletedRequest.resultand internalFlameResultconversions.Unbinding.retry_count, records session bind failure and retry-limit events, and skips not-ready sessions throughSessionFilterpredicates.flmctl view -s.Validation
cargo fmt --allcargo test -p flame-session-manager model::tests::test_snapshot_find_sessions_by_predicatecargo test -p flame-session-manager scheduler::tests::test_scheduler_skips_not_ready_sessioncargo test -p flame-session-manager binding_state_testscargo test -p flame-session-manager bind_session_failed_testscargo check -p flame-session-managercargo test -p flame-session-managercargo test -p flame-executor-managercargo check -p flame-session-manager -p flame-executor-manager -p flmctl -p flame-rsgit diff --checkFull
cargo test -p flame-rswas not run as part of the final pass because the package includes benchmark integration tests that expect a live local session-manager. The library tests passed earlier withcargo test -p flame-rs --lib.fix #25