add Strix Checkpoint Feature (Custom Modification): resume/checkpoint system for interrupted scans #380
Ahmex000 wants to merge 16 commits into usestrix:main
Conversation
- New strix/telemetry/checkpoint.py: Pydantic CheckpointModel + CheckpointManager with atomic writes (.tmp → rename), non-fatal errors, target-hash validation
- base_agent.py: save a checkpoint after every iteration (root agents only), delete it on clean completion, guard against a duplicate task message on resume
- main.py: add --run-name, --resume, --new/--force-new CLI flags; _setup_checkpoint_on_args() handles load/validate/corrupt-recovery
- cli.py: resume banner, history replay (previous vulns + last 3 thoughts), restore AgentState with a fresh sandbox and an extended max_iterations budget
- tui.py: pre-populate the tracer from the checkpoint, restore AgentState in agent_config
- README.md: add a "Resuming Interrupted Scans" section with usage examples

Original scan behavior is 100% preserved when --run-name is not used.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
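A minimal sketch of the atomic-write behaviour described above (.tmp → rename, non-fatal errors); `save_checkpoint_atomic` is a hypothetical stand-in for the real CheckpointManager.save(), not the PR's actual code:

```python
import json
from pathlib import Path


def save_checkpoint_atomic(path: Path, data: dict) -> None:
    """Best-effort atomic save: never leaves a half-written checkpoint."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    try:
        tmp.write_text(json.dumps(data, indent=2))
        # Path.replace (os.replace) is an atomic rename on POSIX: readers
        # see either the old checkpoint or the new one, never a torn file.
        tmp.replace(path)
    except OSError:
        # Checkpointing is non-fatal: a failed save must not kill the scan.
        tmp.unlink(missing_ok=True)
```

A crash between `write_text` and `replace` leaves the previous checkpoint intact, which is the property the commit relies on.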
Previously checkpoints saved only tracer.chat_messages and tracer.vulnerability_reports, leaving tracer.agents and tracer.tool_executions empty on resume — so all sub-agents (both in-progress and completed) were invisible after resuming.

Changes:
- checkpoint.py: add tracer_agents, tracer_tool_executions, and tracer_next_execution_id fields to CheckpointModel; populate them in CheckpointManager.save() from the live tracer
- cli.py: on resume, restore the agents dict and tool_executions dict, and advance _next_execution_id to avoid ID collisions
- tui.py: same restore logic, so the TUI sidebar shows all previous agents and their tool results immediately on resume

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
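The restore side might look like this sketch, assuming the tracer exposes `agents`, `tool_executions`, and `_next_execution_id` attributes as the commit message suggests (`restore_tracer_state` is a hypothetical name):

```python
def restore_tracer_state(tracer, checkpoint: dict) -> None:
    # Re-populate the dicts the earlier commit left empty on resume.
    tracer.agents.update(checkpoint.get("tracer_agents", {}))
    tracer.tool_executions.update(checkpoint.get("tracer_tool_executions", {}))
    # Advance the counter past every restored ID so new tool executions
    # never collide with ones replayed from the checkpoint.
    tracer._next_execution_id = max(
        tracer._next_execution_id,
        checkpoint.get("tracer_next_execution_id", 0),
    )
```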
The previous guard `if not self.state.messages` broke sub-agents because they can have pre-loaded context messages in their state before agent_loop is called. This caused them to start without a task and produce no output.

Fix: only skip the initial task message when parent_id is None AND messages is already populated (= root agent resume). Sub-agents always get their task message regardless of whether their state has prior context.
- Fresh root agent: parent_id=None, messages=[] → adds task ✓
- Fresh sub-agent: parent_id=set, messages=[] → adds task ✓
- Sub-agent with context: parent_id=set, messages=[..] → adds task ✓
- Resumed root agent: parent_id=None, messages=[..] → skips ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
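The four-case truth table above can be expressed as a small predicate (the function name is hypothetical; the logic is exactly the commit's guard):

```python
def should_add_task_message(parent_id, messages) -> bool:
    # Only a resumed ROOT agent (no parent, history already restored from a
    # checkpoint) skips the initial task message; every sub-agent always
    # gets its task, even when it starts with pre-loaded context.
    is_root_resume = parent_id is None and bool(messages)
    return not is_root_resume
```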
Three bugs fixed:
1. Ghost sub-agents (root cause of all sub-agent issues): restoring tracer.agents/tool_executions injected old sub-agent entries that had no live instances. The TUI showed them as interactive, but they could not receive messages or run. Worse, they polluted the agent message-routing system, so new sub-agents spawned after resume failed to communicate with the root agent. Fix: only restore chat_messages, vulnerability_reports, and the execution ID counter. The root agent's LLM context (message history) already knows what all sub-agents did.
2. Root agent stuck in a wait state after resume: if the scan was interrupted while the root agent was in a wait state (waiting_for_input=True, stop_requested=True, etc.), the restored AgentState had those flags set and the loop froze immediately. Fix: reset all blocking flags on restore in both cli.py and tui.py.
3. Completed flag causing instant exit: if completed=True was serialised into the checkpoint (edge case), the loop would exit on the first should_stop() check. Fix: reset completed=False on restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
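A sketch of the flag reset from fixes 2 and 3, assuming the flags are plain attributes on AgentState (the helper name and the exact flag list are assumptions drawn from the commit text):

```python
BLOCKING_FLAGS = ("waiting_for_input", "stop_requested", "completed")


def reset_blocking_flags(state) -> None:
    # A checkpoint taken mid-wait serializes these flags as True; restoring
    # them verbatim freezes the loop (or, for `completed`, exits it on the
    # first should_stop() check), so clear them all on resume.
    for flag in BLOCKING_FLAGS:
        if getattr(state, flag, False):
            setattr(state, flag, False)
```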
Defines the missing helper function called in cli.py and adds the equivalent to tui.py. Injects a user message into the restored AgentState so the LLM knows the scan was interrupted and must continue, rather than calling finish_scan or agent_finish because the message history ends abruptly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
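The injected message might be built roughly like this; the exact wording is illustrative — only the user role and the "interrupted, keep going" instruction come from the commit:

```python
def build_resume_context_message(iteration: int) -> dict:
    # A user-role message appended to the restored history so the LLM
    # reads the abrupt ending as an interruption, not a finished scan.
    return {
        "role": "user",
        "content": (
            "[SYSTEM - SCAN RESUMED] This scan was interrupted at iteration "
            f"{iteration} and has been resumed. The history above ends "
            "abruptly because of the interruption; continue testing and do "
            "NOT call finish_scan or agent_finish because of it."
        ),
    }
```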
Root causes of resume not working:
- generate_run_name() adds a random suffix every time, so without --run-name the checkpoint from a previous session was never found.
- Ctrl+C during the first iteration (before any checkpoint was saved) left no checkpoint to resume from.

Fixes:
1. _find_checkpoint_by_target_hash(): scans strix_runs/ for the most recent checkpoint whose target_hash matches the current targets. Running `strix --target example.com` again now automatically resumes the last interrupted scan without needing --run-name.
2. _save_checkpoint_on_interrupt(): saves the current agent state in both the signal handler and atexit in cli.py and tui.py, so a Ctrl+C mid-first-iteration still produces a valid checkpoint.
3. _setup_checkpoint_on_args() restructured: handles run_name=None, --force-new, explicit --run-name, and auto-detect in one place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root causes:
1. The resume message said "re-spawn sub-agents" but didn't say WHICH ones, or that old agent IDs are dead — the LLM tried to interact with old IDs and got confused.
2. send_message_to_agent returned an unhelpful "not found" error when the LLM used old (dead) agent IDs after resume.

Fixes:
- _build_resume_context_message / _inject_resume_context_message now accept the full checkpoint_data object and extract tracer_agents to list every non-completed sub-agent by name and task. The LLM now knows exactly which agents to re-spawn.
- The message explicitly forbids interacting with any agent ID from history and instructs the LLM to call view_agent_graph first.
- send_message_to_agent returns an actionable error when the target is not found: it explains that the ID may belong to a dead session and tells the LLM to use view_agent_graph, then create_agent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: create_agent passes the parent's full conversation history (inherited_messages) to each new sub-agent. After resume, the parent's history ends with a [SYSTEM - SCAN RESUMED] message that says:
- "ALL previous sub-agents have been terminated"
- "Do NOT call agent_finish unless all testing is genuinely complete"

Sub-agents reading this in their inherited context got confused:
- They thought they were the "terminated" agents and shouldn't be running.
- They avoided calling agent_finish even when their task was done.
- This caused them to hang, loop, or exit immediately without reporting.

Fix: filter out any [SYSTEM - SCAN RESUMED] messages from the inherited context before giving it to sub-agents. The resume instructions are only relevant to the root agent — sub-agents should see normal parent context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
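The fix amounts to a filter like the following; the message shape (dicts with a `content` key) is an assumption about the internal format:

```python
RESUME_MARKER = "[SYSTEM - SCAN RESUMED]"


def filter_inherited_messages(messages: list[dict]) -> list[dict]:
    # The resume instructions are addressed to the root agent only; a
    # sub-agent that inherits them concludes IT was "terminated" and
    # either refuses to finish or exits without reporting.
    return [m for m in messages if RESUME_MARKER not in str(m.get("content", ""))]
```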
Previously sub-agents were terminated on Ctrl+C and never properly restored — only the root agent resumed. This is the full fix.

Architecture change:
- checkpoint.py: added a sub_agent_states field (dict[agent_id -> AgentState dump]) saved from _agent_instances at every checkpoint write. Every currently-running non-root agent is captured.
- base_agent.py: replaced the fragile _is_root_resume heuristic with an explicit is_resumed flag (set via agent config). Works for both root and sub-agents, and prevents a duplicate task message from being added to restored agents.
- cli.py / tui.py: added _restore_sub_agents(), which, on resume, iterates checkpoint sub_agent_states in topological order (parents before children), restores each agent's full AgentState, resets blocking flags, clears the old sandbox, injects a [SYSTEM - SUB-AGENT RESUMED] message, and spawns each agent in a daemon thread — identical to how the root agent is handled. Sub-agents are spawned BEFORE execute_scan so the root agent can communicate with them immediately using their original IDs.
- The root agent's resume message now says "these sub-agents are ALREADY RUNNING at IDs [X, Y]" instead of "re-spawn them" — prevents double-spawning.
- agents_graph_actions.py: [SYSTEM - SUB-AGENT RESUMED] is filtered from inherited context alongside [SYSTEM - SCAN RESUMED], so freshly-spawned child agents never see these system markers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
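The parents-before-children ordering can be sketched as a depth-first walk over the serialized states; `restore_order` and the `parent_id` key are assumed names, not the PR's actual identifiers:

```python
def restore_order(sub_agent_states: dict[str, dict]) -> list[str]:
    # Emit parents before children, so each restored sub-agent attaches to
    # an already-live parent and message routing works immediately.
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(agent_id: str) -> None:
        if agent_id in seen or agent_id not in sub_agent_states:
            return
        seen.add(agent_id)  # mark before recursing: safe on corrupt cycles
        parent = sub_agent_states[agent_id].get("parent_id")
        if parent in sub_agent_states:
            visit(parent)
        ordered.append(agent_id)

    for agent_id in sub_agent_states:
        visit(agent_id)
    return ordered
```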
- Add a _needs_fresh_container flag: always create a fresh container on the first call to _get_or_create_container in a new runtime instance, preventing reuse of stale containers from a previous session whose async docker-rm hasn't completed yet.
- Add _cleanup_existing_containers(): uses subprocess.run (synchronous `docker rm -f`) instead of the SDK's remove(), which returns before Docker fully frees the container name, causing a 409 Conflict on containers.run(). Searches by both the name filter and the strix-scan-id label to catch containers in a mid-removal state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multiple sub-agent threads starting simultaneously all saw _needs_fresh_container=True, and each called _create_container, which removes the container the previous thread just created. Result: only 1 of N sub-agents got a live container; the rest got "Container X not found" on every tool call, and the root agent's sandbox init also failed when sub-agents trashed the container underneath it.

Fix: add a threading.Lock (_container_init_lock) around the slow path in _get_or_create_container. Only the first thread to acquire the lock creates the container; all waiting threads re-check _scan_container inside the lock and reuse the already-running one, paying zero extra Docker overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
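The fix is classic double-checked initialization. This self-contained sketch uses placeholder names (`ContainerManager`, an `object()` stand-in for the real Docker container) rather than the runtime's actual classes:

```python
import threading


class ContainerManager:
    def __init__(self) -> None:
        self._container = None
        self._lock = threading.Lock()
        self._create_calls = 0  # instrumentation for the sketch only

    def _create_container(self):
        self._create_calls += 1
        return object()  # stand-in for the real docker containers.run()

    def get_or_create(self):
        # Fast path: container already exists, no locking needed.
        if self._container is not None:
            return self._container
        with self._lock:
            # Re-check inside the lock: another thread may have created it
            # while we waited. Only the first thread pays the Docker cost.
            if self._container is None:
                self._container = self._create_container()
            return self._container
```

Without the re-check inside the lock, every waiting thread would still create (and clobber) a container once it acquired the lock, reproducing the original bug.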
Root cause: checkpoint and tracer run directories used CWD-relative
paths (Path("strix_runs") and Path.cwd() / "strix_runs"). Launching
strix from different directories across sessions created separate
checkpoint files that never updated each other, so the third session
always resumed from the first session's iteration.
Fix: use Path.home() / "strix_runs" as the canonical absolute path in
both tracer.py and main.py so all sessions write to the same location
regardless of CWD.
Also includes earlier serialization robustness fixes (mode="json" +
_json_default fallback) and explicit checkpoint save in action_custom_quit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
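The two fixes mentioned above, sketched together; `RUNS_DIR` and `dump_checkpoint` are hypothetical names, and the assumption is that `_json_default` is passed wherever checkpoints are serialized:

```python
import json
from datetime import datetime
from pathlib import Path

# Absolute, CWD-independent location: every session resolves the same
# directory no matter where `strix` is launched from.
RUNS_DIR = Path.home() / "strix_runs"


def _json_default(obj):
    # Fallback encoder for values model_dump(mode="json") can't handle.
    if isinstance(obj, datetime):
        return obj.isoformat()
    return str(obj)  # last resort: never let serialization abort a save


def dump_checkpoint(data: dict) -> str:
    return json.dumps(data, default=_json_default)
```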
Writes SAVED/FAILED entries to ~/strix_checkpoint_debug.log on every save attempt, bypassing the suppressed warning logger. This lets us see if saves are happening during resumed sessions and what error (if any) is causing them to fail silently. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config.py: cli-config.json LLM vars now always override the shell env, preventing stale shell values from reverting the configured model
- checkpoint_restore.py: extract shared restore logic from cli/tui to eliminate code duplication
- cli.py / tui.py: use the shared checkpoint_restore module; add a double-save guard via threading.Event
- agents_graph_actions.py: add _agents_lock for thread-safe access to _running_agents and _agent_instances; fix a mutable default arg in restore_sub_agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
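The double-save guard is essentially first-caller-wins. This sketch uses a Lock rather than the threading.Event the commit mentions, because a bare `is_set()`-then-`set()` on an Event is itself a check-then-act race; the class name is hypothetical:

```python
import threading


class CheckpointSaveGuard:
    """Both the SIGINT handler and atexit can fire on shutdown; only the
    first caller should actually write the interrupt checkpoint."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._saved = False

    def try_save(self, save_fn) -> bool:
        with self._lock:
            if self._saved:
                return False
            self._saved = True  # claim the save before the slow write
        save_fn()
        return True
```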
- checkpoint.py: remove the debug log file (strix_checkpoint_debug.log) that was being written to every user's home directory on each save
- checkpoint.py: acquire _agents_lock before iterating _agent_instances to prevent a RuntimeError on concurrent sub-agent creation/removal
- checkpoint_restore.py: register restored sub-agents in _agent_graph, _agent_instances, and _agent_states so send_message_to_agent can route to them — previously they were unreachable after resume
- tracer.py: revert Path.home() back to Path.cwd() to avoid a silent breaking change for all users; the checkpoint logic in main.py already uses Path.home() directly, so the tracer change was not needed
- cli.py / tui.py: move checkpoint_restore imports to the top of the file per PEP 8; remove the noqa: E402 suppressions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile Summary
This PR introduces a checkpoint/resume system for interrupted Strix scans, allowing users to pause a running pentest and continue it later.
Confidence Score: 2/5
Checkpoint / resume fixes (bot review on PR usestrix#380):
- cli.py: skip the checkpoint save when the scan completed cleanly (agent.state.completed) to prevent a stale checkpoint from being re-created after base_agent.py deletes it
- tui.py: same completed guard in both _save_checkpoint_on_interrupt and action_custom_quit to cover all TUI exit paths
- checkpoint_restore.py: fix infinite recursion in _depth() for cyclic parent_id references in corrupted checkpoints — mark the node before recursing
- config.py: restore the original shell-env-wins precedence for LLM vars; cli-config.json only applies when the shell var is absent, preventing silent override of rotated keys managed via the shell environment

New vulnerability skills (from upstream PRs usestrix#204 and usestrix#334):
- clickjacking, cors_misconfiguration, nosql_injection, prototype_pollution, ssti, websocket_security (PR usestrix#204)
- mfa_bypass, edge_cases (PR usestrix#334)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strix Checkpoint Feature (Custom Modification)
I implemented a simple modification that introduces an important feature:
the ability to stop a running scan and resume it later without losing progress or restarting from scratch.
🔹 How It Works
Run Strix with a `--run-name` to create a checkpoint, e.g. `my-scan`.
🔹 Starting a New Scan with the Same Name
If you want to start a fresh scan using the same name, pass the `--new` (or `--force-new`) flag.
🔹 Using a Different Checkpoint Name
Alternatively, you can simply use a new run name.
Note: I have fixed the issue in the previous code (https://github.com/usestrix/strix/pull/373) where you could only create one checkpoint; in the latest update you can checkpoint more than once.
Note: I have also fixed the issue from https://github.com/usestrix/strix/pull/378. Previously there were some problems with file locations, but the process is now simpler and lighter on the system. The current version works without any problems, and you can pause and resume scans multiple times.