add Strix Checkpoint Feature (Custom Modification): resume/checkpoint system for interrupted scans #380

Open
Ahmex000 wants to merge 16 commits into usestrix:main from Ahmex000:main

Conversation

@Ahmex000

Strix Checkpoint Feature (Custom Modification)

I implemented a simple modification that introduces an important feature:
the ability to stop a running scan and resume it later without losing progress or restarting from scratch.

🔹 How It Works

Run Strix with a --run-name to create a checkpoint:

strix --target http://php.testinvicti.com/ --run-name my-scan
  • This saves a checkpoint under the name: my-scan.
  • You can resume the scan at any time using the same command.

🔹 Starting a New Scan with the Same Name

If you want to start a fresh scan using the same name:

strix --target http://php.testinvicti.com/ --run-name my-scan --new

🔹 Using a Different Checkpoint Name

Alternatively, you can simply use a new run name:

strix --target http://php.testinvicti.com/ --run-name my-scan-2
  • Note: This fixes the issue in the previous code (https://github.com/usestrix/strix/pull/373), where only one checkpoint could be created; with the latest update you can pause and resume more than once.

  • Note: This fixes the issue in the previous code (https://github.com/usestrix/strix/pull/378). Previously there were some issues with file locations, but the process is now easier and lighter on the system.

The current version works without any problems, and you can pause and resume scans multiple times.

Ahmex000 and others added 15 commits March 19, 2026 07:30
- New strix/telemetry/checkpoint.py: Pydantic CheckpointModel + CheckpointManager
  with atomic writes (.tmp → rename), non-fatal errors, target-hash validation
- base_agent.py: save checkpoint after every iteration (root agents only),
  delete on clean completion, guard against duplicate task message on resume
- main.py: add --run-name, --resume, --new/--force-new CLI flags;
  _setup_checkpoint_on_args() handles load/validate/corrupt-recovery
- cli.py: resume banner, history replay (previous vulns + last 3 thoughts),
  restore AgentState with fresh sandbox + extended max_iterations budget
- tui.py: pre-populate tracer from checkpoint, restore AgentState in agent_config
- README.md: add "Resuming Interrupted Scans" section with usage examples

Original scan behavior is 100% preserved when --run-name is not used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
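The ".tmp → rename" atomic-write scheme described above can be sketched as follows. This is a minimal illustration, not the PR's actual `CheckpointManager` code — the function name and JSON layout here are assumptions; only the temp-file-then-rename idea comes from the commit message.

```python
import json
import os
from pathlib import Path


def save_checkpoint_atomic(path: Path, data: dict) -> None:
    """Write checkpoint JSON atomically: write a .tmp sibling, then rename.

    os.replace() is atomic on both POSIX and Windows, so a crash mid-write
    can never leave a truncated checkpoint at the final path.
    """
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(data, indent=2))
    os.replace(tmp_path, path)  # atomic swap into place
```

A reader either sees the old complete checkpoint or the new complete one — never a partial file, which is what makes corrupt-recovery on resume tractable.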
Previously checkpoints only saved tracer.chat_messages and
tracer.vulnerability_reports, leaving tracer.agents and
tracer.tool_executions empty on resume — so all sub-agents
(both in-progress and completed) were invisible after resuming.

Changes:
- checkpoint.py: add tracer_agents, tracer_tool_executions,
  tracer_next_execution_id fields to CheckpointModel; populate
  them in CheckpointManager.save() from the live tracer
- cli.py: on resume, restore agents dict, tool_executions dict,
  and advance _next_execution_id to avoid ID collisions
- tui.py: same restore logic so TUI sidebar shows all previous
  agents and their tool results immediately on resume

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous guard `if not self.state.messages` broke sub-agents because
they can have pre-loaded context messages in their state before agent_loop
is called. This caused them to start without a task and produce no output.

Fix: only skip the initial task message when parent_id is None AND messages
is already populated (= root agent resume). Sub-agents always get their
task message regardless of whether their state has prior context.

- Fresh root agent:        parent_id=None, messages=[]   → adds task ✓
- Fresh sub-agent:         parent_id=set,  messages=[]   → adds task ✓
- Sub-agent with context:  parent_id=set,  messages=[..] → adds task ✓
- Resumed root agent:      parent_id=None, messages=[..] → skips  ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three bugs fixed:

1. Ghost sub-agents (root cause of all sub-agent issues):
   Restoring tracer.agents/tool_executions injected old sub-agent
   entries that had no live instances. The TUI showed them as
   interactive but they could not receive messages or run. Worse,
   they polluted the agent message-routing system so new sub-agents
   spawned after resume failed to communicate with the root agent.
   Fix: only restore chat_messages, vulnerability_reports, and the
   execution ID counter. The root agent's LLM context (message
   history) already knows what all sub-agents did.

2. Root agent stuck in wait state after resume:
   If the scan was interrupted while the root agent was in a wait
   state (waiting_for_input=True, stop_requested=True, etc.) the
   restored AgentState had those flags set and the loop froze
   immediately. Fix: reset all blocking flags on restore in both
   cli.py and tui.py.

3. Completed flag causing instant exit:
   If completed=True was serialised into the checkpoint (edge case)
   the loop would exit on the first should_stop() check. Fix:
   reset completed=False on restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sume

Defines the missing helper function called in cli.py and adds the equivalent
to tui.py. Injects a user message into the restored AgentState so the LLM
knows the scan was interrupted and must continue rather than call finish_scan
or agent_finish due to an abruptly-ended message history.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of resume not working:
- generate_run_name() adds a random suffix every time, so without
  --run-name the checkpoint from a previous session was never found.
- Ctrl+C during the first iteration (before any checkpoint was saved)
  left no checkpoint to resume from.

Fixes:
1. _find_checkpoint_by_target_hash(): scans strix_runs/ for the most
   recent checkpoint whose target_hash matches the current targets.
   Now running `strix --target example.com` again automatically
   resumes the last interrupted scan without needing --run-name.

2. _save_checkpoint_on_interrupt(): saves current agent state in both
   the signal handler and atexit in cli.py and tui.py, so a Ctrl+C
   mid-first-iteration still produces a valid checkpoint.

3. _setup_checkpoint_on_args() restructured: handles run_name=None,
   --force-new, explicit --run-name, and auto-detect in one place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root causes:
1. Resume message said "re-spawn sub-agents" but didn't say WHICH ones
   or that old agent IDs are dead — LLM tried to interact with old IDs
   and got confused.
2. send_message_to_agent returned unhelpful "not found" error when the
   LLM used old (dead) agent IDs after resume.

Fixes:
- _build_resume_context_message / _inject_resume_context_message now
  accept the full checkpoint_data object and extract tracer_agents to
  list every non-completed sub-agent by name and task. The LLM now
  knows exactly which agents to re-spawn.
- Message explicitly forbids interacting with any agent ID from history
  and instructs the LLM to call view_agent_graph first.
- send_message_to_agent returns an actionable error when target is not
  found: explains it may be a dead session ID and tells the LLM to use
  view_agent_graph then create_agent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… context

Root cause: create_agent passes the parent's full conversation history
(inherited_messages) to each new sub-agent. After resume, the parent's
history ends with a [SYSTEM - SCAN RESUMED] message that says:
  "ALL previous sub-agents have been terminated"
  "Do NOT call agent_finish unless all testing is genuinely complete"

Sub-agents reading this in their inherited context got confused:
- They thought they were the "terminated" agents and shouldn't be running
- They avoided calling agent_finish even when their task was done
- This caused them to hang, loop, or exit immediately without reporting

Fix: filter out any [SYSTEM - SCAN RESUMED] messages from the inherited
context before giving it to sub-agents. The resume instructions are only
relevant to the root agent — sub-agents should see normal parent context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously sub-agents were terminated on Ctrl+C and never properly restored
— only the root agent resumed. This is the full fix.

Architecture change:
- checkpoint.py: Added sub_agent_states field (dict[agent_id -> AgentState
  dump]) saved from _agent_instances at every checkpoint write. Every
  currently-running non-root agent is captured.

- base_agent.py: Replaced fragile _is_root_resume heuristic with an explicit
  is_resumed flag (set via agent config). Works for both root and sub-agents.
  Prevents duplicate task message from being added to restored agents.

- cli.py / tui.py: Added _restore_sub_agents() which, on resume, iterates
  checkpoint sub_agent_states in topological order (parents before children),
  restores each agent's full AgentState, resets blocking flags, clears the
  old sandbox, injects a [SYSTEM - SUB-AGENT RESUMED] message, and spawns
  each agent in a daemon thread — identical to how the root agent is handled.
  Sub-agents are spawned BEFORE execute_scan so root agent can communicate
  with them immediately using their original IDs.

- Root agent's resume message now says "these sub-agents are ALREADY RUNNING
  at IDs [X, Y]" instead of "re-spawn them" — prevents double-spawning.

- agents_graph_actions.py: [SYSTEM - SUB-AGENT RESUMED] filtered from
  inherited context alongside [SYSTEM - SCAN RESUMED] so freshly-spawned
  child agents never see these system markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
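The inherited-context filter from the two commits above can be sketched as a simple exclusion pass. The dict message shape with a `content` key is an assumption; only the two `[SYSTEM - ...]` marker strings come from the commit messages.

```python
RESUME_MARKERS = ("[SYSTEM - SCAN RESUMED]", "[SYSTEM - SUB-AGENT RESUMED]")


def filter_inherited_messages(messages: list[dict]) -> list[dict]:
    """Drop resume-only system messages before handing the parent's
    history to a freshly spawned sub-agent, so the child never sees
    instructions that were only meant for the resumed root agent."""
    return [
        m for m in messages
        if not any(marker in str(m.get("content", "")) for marker in RESUME_MARKERS)
    ]
```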
- Add _needs_fresh_container flag: always create a fresh container on
  first call to _get_or_create_container in a new runtime instance,
  preventing reuse of stale containers from a previous session whose
  async docker-rm hasn't completed yet.

- Add _cleanup_existing_containers(): uses subprocess.run (synchronous
  docker rm -f) instead of the SDK remove() which returns before Docker
  fully frees the container name, causing 409 Conflict on containers.run().
  Searches by both name filter and strix-scan-id label to catch containers
  in mid-removal state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multiple sub-agent threads starting simultaneously all saw
_needs_fresh_container=True and each called _create_container,
which removes the container the previous thread just created.
Result: only 1 of N sub-agents got a live container; the rest
got 'Container X not found' on every tool call, and the root
agent's sandbox init also failed when sub-agents trashed the
container underneath it.

Fix: add threading.Lock (_container_init_lock) around the
slow path in _get_or_create_container. Only the first thread
to acquire the lock creates the container; all waiting threads
re-check _scan_container inside the lock and reuse the
already-running one, paying zero extra Docker overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
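The fix described above is the classic double-checked initialisation pattern. A minimal sketch (attribute names mirror the commit message, but the surrounding class and the container dict are hypothetical stand-ins for the Docker runtime):

```python
import threading


class ContainerManager:
    """Only the first thread to acquire the lock creates the container;
    all waiting threads re-check inside the lock and reuse it."""

    def __init__(self) -> None:
        self._scan_container = None
        self._container_init_lock = threading.Lock()

    def _create_container(self) -> dict:
        # Stand-in for the expensive docker create/run call
        return {"id": "strix-scan", "creator": threading.current_thread().name}

    def get_or_create_container(self) -> dict:
        # Fast path: container already exists, no lock needed
        if self._scan_container is not None:
            return self._scan_container
        with self._container_init_lock:
            # Re-check inside the lock: another thread may have won the race
            if self._scan_container is None:
                self._scan_container = self._create_container()
            return self._scan_container
```

N threads racing through `get_or_create_container` all end up holding the same container object, paying the Docker cost exactly once.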
Root cause: checkpoint and tracer run directories used CWD-relative
paths (Path("strix_runs") and Path.cwd() / "strix_runs"). Launching
strix from different directories across sessions created separate
checkpoint files that never updated each other, so the third session
always resumed from the first session's iteration.

Fix: use Path.home() / "strix_runs" as the canonical absolute path in
both tracer.py and main.py so all sessions write to the same location
regardless of CWD.

Also includes earlier serialization robustness fixes (mode="json" +
_json_default fallback) and explicit checkpoint save in action_custom_quit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writes SAVED/FAILED entries to ~/strix_checkpoint_debug.log on every
save attempt, bypassing the suppressed warning logger. This lets us
see if saves are happening during resumed sessions and what error (if
any) is causing them to fail silently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config.py: cli-config.json LLM vars now always override shell env,
  preventing stale shell values from reverting the configured model
- checkpoint_restore.py: extract shared restore logic from cli/tui to
  eliminate code duplication
- cli.py / tui.py: use shared checkpoint_restore module, add
  double-save guard via threading.Event
- agents_graph_actions.py: add _agents_lock for thread-safe access to
  _running_agents and _agent_instances, fix mutable default arg in
  restore_sub_agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- checkpoint.py: remove debug log file (strix_checkpoint_debug.log)
  that was writing to every user's home directory on each save
- checkpoint.py: acquire _agents_lock before iterating _agent_instances
  to prevent RuntimeError on concurrent sub-agent creation/removal
- checkpoint_restore.py: register restored sub-agents in _agent_graph,
  _agent_instances, and _agent_states so send_message_to_agent can
  route to them — previously they were unreachable after resume
- tracer.py: revert Path.home() back to Path.cwd() to avoid silent
  breaking change for all users; checkpoint logic in main.py already
  uses Path.home() directly so tracer change was not needed
- cli.py / tui.py: move checkpoint_restore imports to top of file
  per PEP 8, remove noqa: E402 suppressions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR introduces a checkpoint/resume system for interrupted Strix scans, allowing users to pause a running pentest and continue it later via --run-name, --resume, and --new CLI flags. The overall architecture is well thought-out — atomic checkpoint writes, thread-safe agent registry locking, Docker container race condition fixes, and clean integration into both CLI and TUI paths. However, there are four issues that need to be resolved before merging:

  • Checkpoint re-saved after clean completion (CLI + TUI): The atexit-registered cleanup_on_exit in cli.py and the equivalent closure in tui.py's _setup_cleanup_handlers both call _save_checkpoint_on_interrupt() unconditionally on process exit. When a scan completes successfully, base_agent.py correctly deletes the checkpoint — but then the exit handler fires and writes it back. The same issue exists in tui.py's action_custom_quit. Every successful scan will leave a stale checkpoint, causing the next run to incorrectly auto-resume a finished scan.
  • Infinite recursion in _depth() (checkpoint_restore.py): The memoisation guard is written after the recursive call, so a cyclic parent_id reference (possible with a corrupted checkpoint file) causes unbounded recursion and a RecursionError. Setting _memo[aid] = 0 before recursing into the parent breaks the cycle.
  • LLM API key precedence reversed (config.py): An unrelated change makes stored LLM credentials in cli-config.json unconditionally override shell environment variables, reversing the previous behaviour where the shell always won. Users who rotate API keys via shell env (a common pattern) will silently have the old stored key applied.

Confidence Score: 2/5

  • Not safe to merge — three P1 bugs need to be fixed before the feature can be relied upon.
  • The checkpoint architecture and Docker/threading fixes are solid, but two independent logic bugs (post-completion checkpoint re-save in both CLI and TUI, and the cyclic recursion in _depth) would cause incorrect behaviour on every normal scan run. The config.py precedence reversal is an unrelated breaking change for any user who manages LLM credentials via shell env vars.
  • strix/interface/cli.py, strix/interface/tui.py (checkpoint saved after clean completion in both atexit and quit handlers), strix/interface/checkpoint_restore.py (_depth infinite recursion), strix/config/config.py (LLM env var precedence reversal)

Important Files Changed

| Filename | Overview |
| --- | --- |
| strix/telemetry/checkpoint.py | New file implementing CheckpointModel (Pydantic), CheckpointManager (atomic save/load/delete), and compute_target_hash. Design is solid: atomic writes via .tmp → rename, non-fatal errors, version field. No issues found here. |
| strix/interface/checkpoint_restore.py | New shared helpers for restoring state on resume. Contains a critical infinite recursion bug in _depth() when checkpoint data has cyclic parent_id references — the memo is written after recursing, so cycles are not detected. Also writes to _agent_graph outside the _agents_lock, inconsistent with other guarded writes in this module. |
| strix/interface/cli.py | Checkpoint/resume wiring for the CLI runner. Critical bug: cleanup_on_exit (atexit) calls _save_checkpoint_on_interrupt() unconditionally, re-creating the checkpoint after a successful scan that already deleted it. The next run will incorrectly auto-resume a completed scan. |
| strix/interface/tui.py | Checkpoint/resume wiring for the TUI. Same post-completion checkpoint re-save bug as cli.py appears in both action_custom_quit and the atexit cleanup closure in _setup_cleanup_handlers. |
| strix/config/config.py | Unrelated behavior change: LLM env vars in cli-config.json now always override shell environment variables, reversing the previous precedence. Breaks users who manage API keys via shell env — their shell key is silently ignored in favour of potentially stale stored credentials. |
| strix/interface/main.py | Adds --run-name, --resume, and --new CLI flags plus _setup_checkpoint_on_args() and _find_checkpoint_by_target_hash(). Logic is well-structured with proper fallbacks (auto-generate name, target-hash mismatch warning, corruption warning). Results path changed to ~/strix_runs consistently. |
| strix/agents/base_agent.py | Clean integration: checkpoint saved after every iteration (root agents only), deleted on clean completion, task message skipped when resuming. Non-intrusive — zero impact when checkpoint_manager is absent. |
| strix/runtime/docker_runtime.py | Adds _container_init_lock to prevent multiple resumed sub-agent threads from racing to create the same Docker container. _cleanup_existing_containers uses synchronous subprocess.run for docker rm -f to avoid Docker name-registry race. Well-reasoned changes. |
| strix/tools/agents_graph/agents_graph_actions.py | Adds _agents_lock used consistently around _running_agents and _agent_instances mutations. Filters resume marker messages from inherited context so sub-agents don't misinterpret the root's resume instructions. Improved error message for missing agent IDs. |

Comments Outside Diff (1)

  1. strix/interface/tui.py, lines 854-873

    P1 TUI cleanup handler also saves after clean completion

    The _save_checkpoint_on_interrupt() closure defined inside _setup_cleanup_handlers is also called from cleanup_on_exit (atexit). This duplicates the same race described in action_custom_quit: a successful scan deletes the checkpoint in base_agent.py, then atexit fires and re-saves it. Apply the same completed guard here so the cleanup path is consistent with the recommended fix above.

This is a comment left during a code review.
Path: strix/interface/cli.py
Line: 263-285

Comment:
**Checkpoint saved after clean scan completion**

`_save_checkpoint_on_interrupt` is called unconditionally from `cleanup_on_exit`, which is registered with `atexit`. When a scan completes successfully, `base_agent.py` calls `checkpoint_manager.delete()`, but then as the process exits normally, `atexit` triggers `cleanup_on_exit()` → `_save_checkpoint_on_interrupt()`. At that point `_checkpoint_saved` is **not** set (no interrupt occurred), so the guard passes and the checkpoint is **written back** over the deletion.

The consequence is that after every successful scan, a stale checkpoint file is left on disk. The next time the user runs the same `--run-name`, Strix will detect the checkpoint and auto-resume a scan that already finished, replaying all previous findings and re-running the agent from a "completed" state.

Add a guard that skips the save when the agent finished cleanly:

```python
def _save_checkpoint_on_interrupt() -> None:
    """Persist current agent state before exit so the scan can be resumed."""
    if not checkpoint_manager or not _agent_ref:
        return
    if _checkpoint_saved.is_set():
        return
    agent_instance = _agent_ref[0]
    # Skip if the scan already completed successfully — the checkpoint
    # was deleted in base_agent.py and there is nothing to resume.
    if getattr(agent_instance.state, "completed", False):
        return
    _checkpoint_saved.set()
    try:
        checkpoint_manager.save(
            agent_instance.state,
            tracer,
            scan_config,
            target_hash,
            agent_instance.max_iterations,
        )
    except Exception:  # noqa: BLE001
        pass  # non-fatal
```


---

Path: strix/interface/tui.py
Line: 2020-2031

Comment:
**TUI also saves checkpoint after clean scan completion**

`action_custom_quit` unconditionally calls `_mgr.save(...)` before exiting. If the scan ran to completion, `base_agent.py` already deleted the checkpoint via `checkpoint_manager.delete()`. This code then immediately re-creates it, so the next run finds a stale checkpoint for a completed scan and auto-resumes it.

The same `completed` flag guard recommended for `cli.py` should be applied here:

```python
if _mgr and _agent:
    import contextlib
    # Only save if the scan was interrupted, not if it finished cleanly.
    if not getattr(_agent.state, "completed", False):
        with contextlib.suppress(Exception):
            _mgr.save(
                _agent.state,
                self.tracer,
                self.scan_config,
                self.agent_config.get("target_hash", ""),
                _agent.max_iterations,
            )
```


---

Path: strix/interface/checkpoint_restore.py
Line: 34-43

Comment:
**Infinite recursion on cyclic `parent_id` references**

`_depth()` uses memoisation to avoid re-computation, but the memo entry for `aid` is only written **after** `_depth(parent)` returns. If the checkpoint data contains a cycle (e.g., agent A's `parent_id` is B and B's is A — possible with a corrupted checkpoint), the call stack grows unboundedly:

```
_depth("A") → _depth("B") → _depth("A") → _depth("B") → ... RecursionError
```

Fix: mark `aid` as visited (depth 0) **before** recursing, so any back-edge resolves immediately:

```python
def _depth(aid: str) -> int:
    if aid in _memo:
        return _memo[aid]
    # Mark before recursing to break any cycle
    _memo[aid] = 0
    parent = sub_agent_states.get(aid, {}).get("parent_id")
    if parent is not None and parent in sub_agent_states:
        _memo[aid] = 1 + _depth(parent)
    return _memo[aid]
```


---

Path: strix/config/config.py
Line: 144-150

Comment:
**LLM API key precedence silently reversed**

The original code treated shell environment variables as the authoritative source for LLM credentials:

```python
# old logic (simplified)
if var_name not in os.environ or force:
    os.environ[var_name] = var_value
```

The new code makes `cli-config.json` **always win** over the shell for LLM vars:

```python
if var_name in llm_vars or force or var_name not in os.environ:
    os.environ[var_name] = var_value
```

This means if a user has `ANTHROPIC_API_KEY` set in their shell (e.g., pointing to a new or rotated key) and an older key is stored in `~/.strix/cli-config.json`, the **stored** key now silently overrides the shell key. The user's scan will fail authentication with no clear indication of why.

This looks unrelated to the checkpoint feature and is a breaking change for anyone who manages LLM credentials via shell environment variables. Consider reverting this specific hunk or adding a warning when the stored key overrides the shell value.


Last reviewed commit: "Fix all critical iss..."

@Ahmex000
Author

(demo video attached: 2026-03-21.11-02-02.mp4)

Checkpoint / resume fixes (bot review on PR usestrix#380):
- cli.py: skip checkpoint save when scan completed cleanly (agent.state.completed)
  to prevent stale checkpoint re-creating after base_agent.py deletes it
- tui.py: same completed guard in both _save_checkpoint_on_interrupt and
  action_custom_quit to cover all TUI exit paths
- checkpoint_restore.py: fix infinite recursion in _depth() for cyclic
  parent_id references in corrupted checkpoints — mark node before recursing
- config.py: restore original shell-env-wins precedence for LLM vars;
  cli-config.json only applies when the shell var is absent, preventing
  silent override of rotated keys managed via shell environment

New vulnerability skills (from upstream PRs usestrix#204 and usestrix#334):
- clickjacking, cors_misconfiguration, nosql_injection, prototype_pollution,
  ssti, websocket_security (PR usestrix#204)
- mfa_bypass, edge_cases (PR usestrix#334)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>