Feature/ci fixer by rnagulapalle · Pull Request #1 · usephalanx/phalanx

rnagulapalle · 2026-04-12T00:08:20Z

No description provided.

…deploy

Content pages (/, /changelog, /documentation) get today's lastmod so Google always sees the freshest date. Legal pages keep a fixed date. Hooked into deploy.sh step 5b, which runs on the server after containers are healthy.
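The lastmod rule above can be sketched as follows. This is a minimal illustration, not the project's script: the legal page paths, the fixed legal date, and the function name `build_sitemap` are assumptions; only the three content paths come from the commit message.

```python
from datetime import date

# Content paths are taken from the commit message.
CONTENT_PATHS = ["/", "/changelog", "/documentation"]
# Assumed for illustration: the commit does not name the legal pages or their date.
LEGAL_PATHS = ["/privacy", "/terms"]
LEGAL_LASTMOD = date(2026, 1, 1)


def build_sitemap(base_url: str, today: date) -> str:
    """Render a sitemap where content pages get `today` and legal pages a fixed date."""
    entries = [(path, today) for path in CONTENT_PATHS]
    entries += [(path, LEGAL_LASTMOD) for path in LEGAL_PATHS]
    urls = "\n".join(
        f"  <url><loc>{base_url}{path}</loc><lastmod>{d.isoformat()}</lastmod></url>"
        for path, d in entries
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n"
        "</urlset>\n"
    )
```

Passing `today` in as a parameter, rather than calling `date.today()` inside the function, keeps the output deterministic in tests.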

…tal, ux designer

- Add AgentTrace, CIFixRun, CIIntegration, Demo ORM models to models.py
- Add _reflect() and _trace() soul methods to BaseAgent
- Add openai_client.py (sync wrapper with retry + JSON parsing)
- Register traces, ci_webhooks, demos routers in api/main.py
- Add CommandType.FIX + fix command parsing to command_parser.py
- Add _inject_ux_designer_task() to commander.py
- Add phalanx_enable_demo_deploy + buildkite_webhook_token to Settings
- Add alembic migrations for agent_traces, ci_fixer, demos tables
- Add all untracked agent files: soul, ci_fixer, ux_designer, prompt_enricher pipeline
- Add all untracked test files: soul phases 2-4, ci_fixer, traces, webhooks, front_door

…enai → _call_claude

- base.py: add _load_episode_memory, _call_claude_with_thinking, _escalate_trace_to_slack, _load_cross_run_memory, _write_cross_run_pattern, _write_complexity_calibration, _load_complexity_calibration, _decide; SOUL-008 escalation in _trace()
- builder.py: fix all _call_openai → _call_claude; add _load_reviewer_feedback, _write_handoff_note, _self_check_has_issues, _fix_self_check_issues; branch isolation in _workspace_path; extended thinking for complexity >= 4; reflexion injection in _build_prompt
- reviewer.py: fix _call_openai → _call_claude; add _load_builder_handoff, _write_cross_run_review_pattern; use module-level get_db so tests can patch
- planner.py: fix _call_openai → _call_claude; add PLANNER_SOUL reflection + complexity calibration
- commander.py, release.py: fix _call_openai → _call_claude
- models.py: remove duplicate CIIntegration, CIFixRun, Demo classes that caused the "Table already defined" error
- api/main.py: add /healthz liveness probe + ci_integrations router
- command_parser.py: parse the `fix acme/backend#42` format (split on #)
- tests: add test_sre_unit, test_memory_writer_unit, test_api_health_route; patch cleanup exception path
- Coverage: 70.03% (passes --cov-fail-under=70)
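For the `fix acme/backend#42` format, splitting on `#` could look like the sketch below. It does not reproduce command_parser.py: `FixCommand`, `parse_fix_command`, and the field names are hypothetical, and the commit does not say whether the number refers to a PR, an issue, or a build.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FixCommand:
    """Hypothetical result type; the real parser's return shape is not shown in the PR."""
    repo: str      # "owner/name"
    number: int    # the value after '#'


def parse_fix_command(text: str) -> Optional[FixCommand]:
    """Parse 'fix owner/repo#N'; return None for anything that does not fit."""
    parts = text.strip().split()
    if len(parts) != 2 or parts[0].lower() != "fix":
        return None
    # partition() splits on the first '#'; sep is empty when '#' is absent.
    repo, sep, number = parts[1].partition("#")
    if not sep or "/" not in repo:
        return None
    if not (number.isascii() and number.isdigit()):
        return None
    return FixCommand(repo=repo, number=int(number))
```

Returning `None` instead of raising lets a caller fall through to other command types when the text is not a fix command.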

…r api key middleware

- Move inline health handlers from main.py into phalanx/api/routes/health.py
- Wire health_router into app at root (no prefix)
- /healthz now also bypasses api key middleware
- Add tests: api_key_middleware rejection, health bypass, cors origins branch
- Coverage: 70.03%

…reasoning

Accidentally removed in the ci-fixer branch: demo_base_url, demo_docker_network, demo_nginx_container, demo_max_running, demo_nginx_conf_dir, openai_model_reasoning. The SRE agent crashed on prod with 'Settings has no attribute demo_docker_network'.

The root cause of "no_structured_errors" on the first testbed run.

Ruff's default output since v0.5 is the rich/diagnostic format:

    E501 Line too long (129 > 100)
      --> src/calc/formatting.py:13:101
       |
    13 |     return "long string here"

The classic one-liner `file:line:col: CODE msg` only appears with `--output-format=concise`. Most real repos (including our testbed's default `ruff check .`) emit rich format.

_RUFF_RICH_RE was defined but never called; _parse_ruff only used the classic _RUFF_RE. Dead code meant we extracted 0 errors from any real ruff log. v1's agent bailed with "no_structured_errors" and v2's fingerprint pipeline hashed an empty feature list (deterministic garbage hash `4f53cda18c2baa0c` on every v2 run hitting the same dead path).

Two coupled fixes:

1. _parse_ruff now runs BOTH regexes and dedupes on (file, line, col, code), so it is robust to logs that carry either or both.
2. Tool detection (`tool = "ruff" if _RUFF_RE.search else "eslint"`) also checks _RUFF_RICH_RE; rich-only logs were being mis-identified as eslint.

Regex subtlety: the rich regex uses `\n\s*-->` (zero-or-more whitespace), not `\n\s+-->`. The timestamp cleaner's greedy trailing \s* eats the 2-space indent before `-->` when each line is prefixed with a GitHub Actions timestamp. Without `\s*`, the regex misses exactly the case it exists to handle. A comment at the regex makes this explicit.

Regression net (tests/unit/test_log_parser_unit.py::TestParseLogRuffRich):

- Indented and unindented rich-format variants
- Autofix marker `[*]` stripping
- tool='ruff' identification for rich-only logs
- Dedupe when both classic and rich formats coexist
- **Real GitHub Actions CI log fixture** at tests/fixtures/ci_logs/github_actions_ruff_rich_e501.txt, pulled directly from the failed run of our testbed PR #1. This fixture is the TRUE regression guard: any future cleaner/regex change that breaks parsing the exact log GitHub emits fails this test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
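The two coupled fixes described in the ruff commit above (run both regexes, dedupe on (file, line, col, code), and check both patterns in tool detection) can be sketched roughly as follows. The patterns here are simplified stand-ins written for this illustration, not the project's actual _RUFF_RE and _RUFF_RICH_RE, and `RuffError`, `parse_ruff`, and `detect_tool` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class RuffError(NamedTuple):
    file: str
    line: int
    col: int
    code: str
    message: str


# Classic one-liner: path:line:col: CODE [*] message
CLASSIC_RE = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<code>[A-Z]+\d+) (?:\[\*\] )?(?P<msg>.+)$",
    re.MULTILINE,
)

# Rich format: "CODE [*] message", then a "--> path:line:col" line.
# \s* (not \s+) tolerates logs where the indent before "-->" was stripped.
RICH_RE = re.compile(
    r"^(?P<code>[A-Z]+\d+) (?:\[\*\] )?(?P<msg>[^\n]+)\n"
    r"\s*--> (?P<file>[^\s:]+):(?P<line>\d+):(?P<col>\d+)",
    re.MULTILINE,
)


def parse_ruff(log: str) -> List[RuffError]:
    """Run both patterns and keep one error per (file, line, col, code)."""
    seen = set()
    errors: List[RuffError] = []
    for pattern in (CLASSIC_RE, RICH_RE):
        for m in pattern.finditer(log):
            key = (m["file"], int(m["line"]), int(m["col"]), m["code"])
            if key in seen:
                continue
            seen.add(key)
            errors.append(RuffError(*key, message=m["msg"].strip()))
    return errors


def detect_tool(log: str) -> str:
    """Treat a log as ruff output if either format matches."""
    return "ruff" if CLASSIC_RE.search(log) or RICH_RE.search(log) else "eslint"
```

Because the optional `[*]` group sits outside the `msg` capture, the autofix marker is dropped from the message in both formats.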

Python × lint, test_fail, flake closed end-to-end on prod (PRs #1–3 on usephalanx/phalanx-ci-fixer-testbed). Running all three through the real agent + real sandbox + real GitHub CI surfaced a clear finding: identical tool sequence, identical coder flow, identical prompt, cost in the same ballpark. The architecture already handles fix_type variance implicitly because validate_cmd + env setup + target files are extracted from the CI log and manifest files, not hardcoded per class. Router / StrategyRegistry abstraction would add code without adding capability. Instead: language playbooks (deterministic env setup per stack), one coverage-specific prompt rule, two new escalation enums. Language router still exists (sandbox image + env planner), but that axis is already wired.

Addresses two pre-deploy blockers surfaced by code review:

(1) Retry idempotency (review blocker #2)

Previous: on celery retry, _create_run attempted a second INSERT with the same run_id, hit IntegrityError, and the task re-raised, leaving the Run stuck in INTAKE forever.

Fix:
- _create_run → _create_or_load_run: first SELECT by id; if present, return it; otherwise INSERT (original path).
- execute() now branches on run.status after load:
    status == INTAKE → do the full ceremony (transitions + DAG)
    status != INTAKE → skip ceremony, jump straight to poll loop
  This lets a retry resume mid-run instead of duplicating work OR raising on invalid transitions (validate_transition would reject e.g. VERIFYING → RESEARCHING).

(2) Append-then-transition race (review blocker #3)

Previous: _append_iteration_dag committed 3 PENDING tasks, then a separate _transition_run committed VERIFYING → EXECUTING. Between the two commits, a scheduled advance_run tick could observe status=VERIFYING + new PENDING tasks and dispatch a techlead task against the wrong run state.

Fix:
- New _append_iteration_and_transition() method that performs BOTH writes (task INSERTs + Run status UPDATE to EXECUTING) in the SAME DB session and commits once. advance_run's read is now always of a consistent state.
- validate_transition(VERIFYING, EXECUTING) is called at the top of the helper so an invalid transition fails BEFORE any task writes.
- Old _append_iteration_dag deleted (dead code after the rename).

(3) Iteration cap clean-up (review blocker #1)

Previous: for _ in range(_MAX_ITERATIONS + 1) gave 4 loop passes but the cap check terminates on pass 3; the +1 was dead code and the comment ("+1 = the initial pass") was misleading.

Fix: range(_MAX_ITERATIONS) exactly. Comment explains the semantics.

Verified:
- commander helper inventory correct (6 private methods; old _append_iteration_dag + _create_run removed)
- Full import matrix clean (4 v3 agents + build + v2)
- 117 regression tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
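Fixes (1) and (2) above follow two common patterns: select-before-insert for idempotent retries, and validate-then-write inside a single transaction. The sketch below shows both with the standard-library sqlite3 module. It is an illustration under assumptions, not the commander's code: the schema, the free functions, and the one-entry transition table are invented here, and the real helpers are methods running against the project's own DB session.

```python
import sqlite3

# Assumed subset of the real transition table, for illustration only.
ALLOWED_TRANSITIONS = {("VERIFYING", "EXECUTING")}


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("CREATE TABLE tasks (run_id TEXT, name TEXT, status TEXT)")
    conn.commit()
    return conn


def create_or_load_run(conn, run_id):
    """SELECT first; INSERT only when the run does not exist yet (retry-safe)."""
    row = conn.execute(
        "SELECT id, status FROM runs WHERE id = ?", (run_id,)
    ).fetchone()
    if row is not None:
        return row  # retry path: resume with whatever status is stored
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO runs (id, status) VALUES (?, 'INTAKE')", (run_id,))
    return (run_id, "INTAKE")


def validate_transition(current, target):
    if (current, target) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"invalid transition {current} -> {target}")


def append_iteration_and_transition(conn, run_id, task_names):
    """Insert PENDING tasks and flip the run to EXECUTING in one commit."""
    row = conn.execute("SELECT status FROM runs WHERE id = ?", (run_id,)).fetchone()
    if row is None:
        raise LookupError(f"run {run_id} not found")
    validate_transition(row[0], "EXECUTING")  # fail before any task writes
    with conn:  # both writes land together or not at all
        conn.executemany(
            "INSERT INTO tasks (run_id, name, status) VALUES (?, ?, 'PENDING')",
            [(run_id, name) for name in task_names],
        )
        conn.execute("UPDATE runs SET status = 'EXECUTING' WHERE id = ?", (run_id,))
```

With a single commit, a concurrent reader sees either the old state (VERIFYING, no new tasks) or the new one (EXECUTING, tasks present), never the mixed state described in blocker #3.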

Catches the bug classes that bit us during the humanize canary. Each bug = one prod deploy cycle (12 min). The harness runs in 1.1s, so the next round of v3 work pays a 600x faster feedback loop on the infrastructure-bug class.

Files:

tests/integration/v3_harness/
  fixtures/python/: pyproject.toml + workflow YAML with apt deps
  fixtures/typescript/: package.json + pnpm-lock.yaml + tsconfig.json
  fixtures/javascript/: package.json + package-lock.json + workflow
  fixtures/java/: pom.xml + workflow setting up JDK 17 + maven
  fixtures/csharp/: .csproj + global.json (SDK pin) + workflow
  test_celery_wiring.py (8 tests)
  test_dag_persist_shape.py (4 tests)
  test_env_detector_per_lang.py (21 tests, 2 xfail markers + 1 xpass)
  test_fix_spec_parser.py (18 tests covering humanize bug #4 shapes)

Per-language coverage (the lesson: Python's bugs do NOT generalize; TS/Java/C# break differently in nature):

- Python: full assertions (we have full env_detector here). base_image respects requires-python lower bound (PB6), extras group install, apt deps from workflow YAML, ruff-modern-config flagged in tool_versions.
- TypeScript: Phase-1 contract: stack='node', node-bearing image, Phase-1 notes flag incomplete detection. xfail markers document the future contract (pnpm-lock.yaml → pnpm install).
- JavaScript: same Phase-1 contract; explicit "no phantom pnpm" check.
- Java: Phase-1: stack='java', JDK-bearing image. xfail marker documents future <maven.compiler.target> pin → image tag.
- C#: Phase-1: stack='csharp', dotnet SDK image. xfail marker for global.json sdk.version → image tag.
- Cross-lang: apt-regex regression test parameterized over all 5 langs (Bug PB7: regex must stop at && | ; etc).

What this harness DOES catch (the canary's bug classes):

- Bug #1 (celery include missing for new agents): test_v3_agent_module_in_celery_include
- Bug #3 (task lifecycle persistence missing): test_v3_persist_task_completion_helper_imports
- Bug #4 (fix_spec parser too strict): test_parse_json_embedded_in_prose +14 others
- Bug #7 partial (apt regex shell-noise): test_apt_regex_does_not_swallow_shell_noise
- DAG-shape regressions (4 tasks, sre_modes, ordering, ci_context propagation)

What it does NOT catch (deferred to Tier-2):

- Bug #2 (_audit signature mismatch): needs real BaseAgent integration
- Bug #5 (tool_result API shape): needs real OpenAI Responses API call
- Bug #6 (Sonnet stub): needs real Anthropic call OR in-process integration with run_coder_subagent

Tier-2 (real Postgres + real Docker + mocked LLM) is the next harness to build, ~200 LOC follow-up. Tier-3 is the canary process we already have. The 3 layers cover ascending blast radius + cost.

Updated docs/ci-fixer-v3-canary-retro.md to reflect the harness is now built, not deferred.
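The apt-regex rule mentioned above (stop at `&&`, `|`, `;` and similar shell operators) can be illustrated with a short sketch. The pattern and the `apt_packages` helper are assumptions written for this example; the project's env_detector regex is not shown in the PR and may differ.

```python
import re
from typing import List

# Capture install arguments up to the first shell operator, redirect,
# comment marker, or newline. Hypothetical pattern, for illustration only.
_APT_INSTALL_RE = re.compile(r"apt(?:-get)?\s+install\s+(?P<args>[^&|;<>#\n]*)")


def apt_packages(run_line: str) -> List[str]:
    """Return package names from apt/apt-get install commands in a shell line."""
    packages: List[str] = []
    for match in _APT_INSTALL_RE.finditer(run_line):
        for token in match.group("args").split():
            if token.startswith("-"):  # skip flags such as -y
                continue
            packages.append(token)
    return packages
```

This sketch does not handle backslash line continuations or quoted arguments; a workflow step that splits its package list across lines would need the continuation joined first.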

Three test files, 13 tests, ~0.8s. Each one is a static or schema-level guard against a bug class that bit us during canary, without requiring real Anthropic, real OpenAI, or real Docker:

test_techlead_openai_message_shape.py (5 tests, bug #5)
  Mimics the OpenAI Responses API's input contract via a small schema validator. Re-runs cifix_techlead._tool_result_message and asserts it would be ACCEPTED. If a future refactor regresses to role='tool' or top-level tool_use_id (the actual canary failure), the validator raises ResponsesApiSchemaError before deploy.

test_engineer_wires_llm_call.py (5 tests, bug #6)
  Source-level inspection of cifix_engineer.execute(). Asserts:
  - run_coder_subagent is called
  - llm_call= is passed (not the test-only NotImplementedError stub)
  - build_sonnet_coder_callable + coder_subagent_tool_schemas + CODER_SUBAGENT_SYSTEM_PROMPT are imported
  Plus a sister check that v2's _call_sonnet_llm IS still a stub; the day someone wires it for real, this test reminds us we no longer need the explicit injection.

test_state_transition_audit.py (3 tests, bug #2)
  Asserts ALL four v3 agents inherit BaseAgent._audit unchanged (no shadowing). The signature-mismatch bug from canary #2 fails this check at import time. Plus a real-DB integration test that runs cifix_commander._transition_run('INTAKE','RESEARCHING') against a live Postgres row and verifies it doesn't TypeError. It skips cleanly if DATABASE_URL isn't reachable so dev workflow isn't blocked.

conftest.py
  Real-Postgres fixtures (db_engine module-scoped, db_session per-test with rollback) following the tests/integration/test_db_constraints pattern. Plus cifix_project + cifix_work_order fixtures with work_order_type='ci_fix' shape.

Coverage of the canary bug list now:

  Bug | Class     | Tier-1 | Tier-2 | Tier-3
  #1  | infra     | ✓      |        |
  #2  | shadowing |        | ✓      |
  #3  | infra     | ✓      |        |
  #4  | parser    | ✓      |        |
  #5  | provider  |        | ✓      |
  #6  | wiring    |        | ✓      |
  #7  | prompt    |        |        | (canary)
  #8  | prompt    |        |        | (canary)
  apt | regex     | ✓      |        |

6 of 8 humanize-canary bugs are now caught locally pre-deploy. The remaining 2 (prompt issues) require real LLM + real repo and stay in the canary process.

Combined harness runtime: 51 + 13 = 64 tests, ~2 seconds total.

Run with:
  pytest tests/integration/v3_harness/ (Tier-1, no deps)
  pytest tests/integration/v3_harness_t2/ (Tier-2, skips DB tests if Postgres absent)
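The no-shadowing guard for bug #2 can be expressed as an identity check on the inherited method. The sketch below uses stand-in classes; `CommanderAgent`, `ShadowingAgent`, and `shadowing_agents` are hypothetical names, and the real test imports the four v3 agents and the project's BaseAgent instead.

```python
class BaseAgent:
    """Stand-in for the project's BaseAgent; the real _audit signature is not shown."""

    def _audit(self, event: str, **details) -> dict:
        return {"event": event, **details}


class CommanderAgent(BaseAgent):
    """Inherits _audit unchanged."""


class ShadowingAgent(BaseAgent):
    """Redefines _audit with a narrower signature, the canary failure mode."""

    def _audit(self, event):
        return {"event": event}


def shadowing_agents(agent_classes, base=BaseAgent, method="_audit"):
    """Return names of classes whose `method` is not the base class's function object."""
    base_impl = getattr(base, method)
    return [
        cls.__name__
        for cls in agent_classes
        if getattr(cls, method) is not base_impl
    ]
```

In a test this reduces to `assert shadowing_agents(ALL_V3_AGENTS) == []`, where `ALL_V3_AGENTS` would be the list of imported agent classes. The check needs no database or LLM because it only compares function objects.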

FORGE added 5 commits April 11, 2026 07:51

rnagulapalle merged commit c3ff81f into main Apr 12, 2026
6 of 8 checks passed

rnagulapalle merged 5 commits into main from feature/ci-fixer
