Feature/ci fixer #1
Merged
rnagulapalle merged 5 commits into main on Apr 12, 2026
Added 5 commits on April 11, 2026, 07:51
…deploy
Content pages (/, /changelog, /documentation) get today's date as lastmod, so Google always sees the freshest date. Legal pages keep a fixed date. Hooked into deploy.sh step 5b; it runs on the server after the containers are healthy.
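The deploy script itself is not shown. A minimal sketch of the lastmod rule, assuming a standard sitemaps.org XML file: the three content paths come from the commit, while `BASE_URL`, the legal paths and their fixed dates are hypothetical placeholders.

```python
from datetime import date
from xml.etree import ElementTree as ET

# Hypothetical site root; the real domain is not named in the commit.
BASE_URL = "https://example.com"

# Content pages named in the commit: always stamped with the deploy date.
CONTENT_PATHS = ["/", "/changelog", "/documentation"]

# Hypothetical legal pages, each with a fixed lastmod that never changes on deploy.
LEGAL_PATHS = {"/privacy": "2026-01-01", "/terms": "2026-01-01"}

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(today: date) -> str:
    """Return sitemap XML: content pages get `today`, legal pages keep their fixed date."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    entries = [(path, today.isoformat()) for path in CONTENT_PATHS]
    entries += list(LEGAL_PATHS.items())
    for path, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = BASE_URL + path
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # A deploy step would write this into the web root once containers are healthy.
    print(build_sitemap(date.today()))
```

Regenerating on every deploy (rather than at build time) is what keeps the content-page date current without touching the legal pages.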
…tal, ux designer
- Add AgentTrace, CIFixRun, CIIntegration, Demo ORM models to models.py
- Add _reflect() and _trace() soul methods to BaseAgent
- Add openai_client.py (sync wrapper with retry + JSON parsing)
- Register traces, ci_webhooks, demos routers in api/main.py
- Add CommandType.FIX + fix command parsing to command_parser.py
- Add _inject_ux_designer_task() to commander.py
- Add phalanx_enable_demo_deploy + buildkite_webhook_token to Settings
- Add alembic migrations for agent_traces, ci_fixer, demos tables
- Add all untracked agent files: soul, ci_fixer, ux_designer, prompt_enricher pipeline
- Add all untracked test files: soul phases 2-4, ci_fixer, traces, webhooks, front_door
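openai_client.py is not included in the PR view. A provider-agnostic sketch of a sync wrapper with retry + JSON parsing, assuming the model call is injected as a plain callable returning text (so no SDK is needed); `call_json_with_retry` and `LLMJSONError` are hypothetical names.

```python
import json
import time
from typing import Any, Callable, Optional


class LLMJSONError(RuntimeError):
    """Hypothetical error raised when no attempt produced valid JSON."""


def call_json_with_retry(
    call: Callable[[], str],
    attempts: int = 3,
    backoff_s: float = 1.0,
) -> Any:
    """Call `call()` and parse its text reply as JSON.

    Retries on malformed JSON and on transient network errors, sleeping
    backoff_s * 2**attempt between tries (exponential backoff).
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return json.loads(call())
        except (json.JSONDecodeError, ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:  # no sleep after the final attempt
                time.sleep(backoff_s * 2 ** attempt)
    raise LLMJSONError(f"no valid JSON after {attempts} attempts") from last_error
```

Treating malformed JSON as retryable is a design choice: a fresh sample from the model often parses even when the previous one did not.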
…enai → _call_claude
- base.py: add _load_episode_memory, _call_claude_with_thinking, _escalate_trace_to_slack, _load_cross_run_memory, _write_cross_run_pattern, _write_complexity_calibration, _load_complexity_calibration, _decide; SOUL-008 escalation in _trace()
- builder.py: fix all _call_openai → _call_claude; add _load_reviewer_feedback, _write_handoff_note, _self_check_has_issues, _fix_self_check_issues; branch isolation in _workspace_path; extended thinking for complexity >= 4; reflexion injection in _build_prompt
- reviewer.py: fix _call_openai → _call_claude; add _load_builder_handoff, _write_cross_run_review_pattern; use module-level get_db so tests can patch
- planner.py: fix _call_openai → _call_claude; add PLANNER_SOUL reflection + complexity calibration
- commander.py, release.py: fix _call_openai → _call_claude
- models.py: remove duplicate CIIntegration, CIFixRun, Demo classes that caused a "Table already defined" error
- api/main.py: add /healthz liveness probe + ci_integrations router
- command_parser.py: parse fix acme/backend#42 format (split on #)
- tests: add test_sre_unit, test_memory_writer_unit, test_api_health_route; patch cleanup exception path
- Coverage: 70.03% (passes --cov-fail-under=70)
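The `fix acme/backend#42` format can be parsed by splitting on `#`, as the commit describes. A minimal sketch; `FixCommand`, `parse_fix_command` and the owner/name character set are assumptions, not the actual command_parser.py code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FixCommand:
    repo: str    # "owner/name"
    number: int  # the number after '#'


# Deliberately simple owner/name and number checks for the sketch.
_REPO_RE = re.compile(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")
_NUMBER_RE = re.compile(r"[0-9]+")


def parse_fix_command(text: str) -> Optional[FixCommand]:
    """Parse 'fix owner/repo#42' by splitting on '#'; None if it is not a fix command."""
    parts = text.strip().split(maxsplit=1)
    if len(parts) != 2 or parts[0].lower() != "fix":
        return None
    repo, sep, number = parts[1].strip().partition("#")
    if not sep or not _REPO_RE.fullmatch(repo) or not _NUMBER_RE.fullmatch(number):
        return None
    return FixCommand(repo=repo, number=int(number))
```

Matching the number with `[0-9]+` rather than `str.isdigit()` avoids accepting Unicode digits such as `²` that `int()` would then reject.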
…r api key middleware
- Move inline health handlers from main.py into phalanx/api/routes/health.py
- Wire health_router into app at root (no prefix)
- /healthz now also bypasses api key middleware
- Add tests: api_key_middleware rejection, health bypass, cors origins branch
- Coverage: 70.03%
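The middleware itself is not shown. A minimal pure-ASGI sketch of the bypass, assuming an `x-api-key` header and a constant-time comparison; the header name, `ApiKeyMiddleware`, and the extra `/health` path are assumptions. A framework such as FastAPI can mount a class like this as middleware, but the exact wiring in phalanx may differ.

```python
import asyncio
import hmac
from typing import Any, Awaitable, Callable, Dict, List

Scope = Dict[str, Any]
Receive = Callable[[], Awaitable[Dict[str, Any]]]
Send = Callable[[Dict[str, Any]], Awaitable[None]]
ASGIApp = Callable[[Scope, Receive, Send], Awaitable[None]]

# Liveness probes must stay reachable without a key ('/health' is an assumption).
EXEMPT_PATHS = frozenset({"/health", "/healthz"})


class ApiKeyMiddleware:
    """Reject HTTP/WebSocket requests lacking a valid x-api-key header (hypothetical)."""

    def __init__(self, app: ASGIApp, api_key: str) -> None:
        self.app = app
        self.api_key = api_key.encode()

    async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
        # Lifespan events and exempt health paths pass straight through.
        if scope["type"] not in ("http", "websocket") or scope["path"] in EXEMPT_PATHS:
            await self.app(scope, receive, send)
            return
        headers = dict(scope.get("headers", []))  # ASGI headers: lowercase byte pairs
        if hmac.compare_digest(headers.get(b"x-api-key", b""), self.api_key):
            await self.app(scope, receive, send)
            return
        if scope["type"] == "websocket":
            # Closing before accept makes the server reject the handshake.
            await send({"type": "websocket.close", "code": 1008})
            return
        await send({"type": "http.response.start", "status": 401,
                    "headers": [(b"content-type", b"text/plain")]})
        await send({"type": "http.response.body", "body": b"unauthorized"})


async def ok_app(scope: Scope, receive: Receive, send: Send) -> None:
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(path: str, headers: List[tuple] = ()) -> int:
    """Drive the middleware once and return the HTTP status it sent."""
    sent: List[Dict[str, Any]] = []

    async def send(message: Dict[str, Any]) -> None:
        sent.append(message)

    async def receive() -> Dict[str, Any]:
        return {"type": "http.request", "body": b""}

    middleware = ApiKeyMiddleware(ok_app, api_key="s3cret")
    scope = {"type": "http", "path": path, "headers": list(headers)}
    asyncio.run(middleware(scope, receive, send))
    return sent[0]["status"]
```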
…reasoning
Accidentally removed in the ci-fixer branch: demo_base_url, demo_docker_network, demo_nginx_container, demo_max_running, demo_nginx_conf_dir, openai_model_reasoning. The SRE agent crashed on prod with 'Settings has no attribute demo_docker_network'.
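A cheap guard against this class of crash is a unit test that asserts every setting the agents read still exists. A minimal sketch: the six names come from the commit, while the `Settings` dataclass and its default values are hypothetical stand-ins for the real settings class. `hasattr` is used on an instance so the same check also works for settings classes that do not expose fields as class attributes.

```python
from dataclasses import dataclass

# Settings the SRE agent reads at runtime; names taken from the commit message.
REQUIRED_SETTINGS = (
    "demo_base_url",
    "demo_docker_network",
    "demo_nginx_container",
    "demo_max_running",
    "demo_nginx_conf_dir",
    "openai_model_reasoning",
)


@dataclass
class Settings:
    """Hypothetical stand-in for the project's Settings; defaults are placeholders."""
    demo_base_url: str = "http://localhost"
    demo_docker_network: str = "demo-net"
    demo_nginx_container: str = "nginx"
    demo_max_running: int = 3
    demo_nginx_conf_dir: str = "/etc/nginx/conf.d"
    openai_model_reasoning: str = "model-name"


def missing_settings(settings: object) -> list:
    """Return required setting names the instance no longer exposes."""
    return [name for name in REQUIRED_SETTINGS if not hasattr(settings, name)]
```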
rnagulapalle added a commit that referenced this pull request on Apr 19, 2026
The root cause of "no_structured_errors" on the first testbed run.
Ruff's default output since v0.5 is the rich/diagnostic format:
    E501 Line too long (129 > 100)
      --> src/calc/formatting.py:13:101
       |
    13 | return "long string here"
The classic one-liner `file:line:col: CODE msg` only appears with
`--output-format=concise`. Most real repos (including our testbed's
default `ruff check .`) emit rich format.
_RUFF_RICH_RE was defined but never used: _parse_ruff only applied
the classic _RUFF_RE. The dead code meant we extracted 0 errors from any
real ruff log. v1's agent bailed with "no_structured_errors", and v2's
fingerprint pipeline hashed an empty feature list (the same deterministic
garbage hash `4f53cda18c2baa0c` on every v2 run that hit the dead
path).
Two coupled fixes:
1. _parse_ruff now runs BOTH regexes and dedupes on
   (file, line, col, code), so it is robust to logs that carry either format or both.
2. Tool detection (`tool = "ruff" if _RUFF_RE.search else "eslint"`)
   also checks _RUFF_RICH_RE; rich-only logs were being
   mis-identified as eslint.
Regex subtlety: the rich regex uses `\n\s*-->` (zero-or-more whitespace),
not `\n\s+-->`. The timestamp cleaner's greedy trailing \s* eats the
2-space indent before `-->` when each line is prefixed with a GitHub
Actions timestamp. Without `\s*`, the regex misses exactly the case
it exists to handle. A comment at the regex makes this explicit.
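The interaction is easy to reproduce. This sketch assumes a cleaner of the form `^<ISO timestamp>Z\s*` applied per line (the real cleaner is not shown) and two simplified arrow patterns: after cleaning, the indent before `-->` is gone, so only the `\s*` variant still matches.

```python
import re

# Assumed cleaner: strip "2026-04-19T10:00:00.1234567Z " prefixes. The trailing \s*
# is greedy, so it also consumes the indentation that follows the timestamp.
_TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\s*", re.MULTILINE)

STRICT_ARROW = re.compile(r"^[A-Z]+\d+ .+\n\s+--> ", re.MULTILINE)   # requires an indent
LENIENT_ARROW = re.compile(r"^[A-Z]+\d+ .+\n\s*--> ", re.MULTILINE)  # indent optional

RAW = (
    "2026-04-19T10:00:00.1234567Z E501 Line too long (129 > 100)\n"
    "2026-04-19T10:00:00.1234568Z   --> src/calc/formatting.py:13:101\n"
)


def clean(log: str) -> str:
    return _TIMESTAMP_RE.sub("", log)


if __name__ == "__main__":
    cleaned = clean(RAW)
    print(repr(cleaned))
    # 'E501 Line too long (129 > 100)\n--> src/calc/formatting.py:13:101\n'
    print(bool(STRICT_ARROW.search(cleaned)), bool(LENIENT_ARROW.search(cleaned)))
    # False True
```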
Regression net (tests/unit/test_log_parser_unit.py::TestParseLogRuffRich):
- Indented and unindented rich-format variants
- Autofix marker `[*]` stripping
- tool='ruff' identification for rich-only logs
- Dedupe when both classic and rich formats coexist
- **Real GitHub Actions CI log fixture** at
tests/fixtures/ci_logs/github_actions_ruff_rich_e501.txt — pulled
directly from the failed run of our testbed PR #1. This fixture
is the TRUE regression guard: any future cleaner/regex change
that breaks parsing the exact log GitHub emits fails this test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rnagulapalle added a commit that referenced this pull request on Apr 20, 2026
Python × {lint, test_fail, flake} closed end-to-end on prod (PRs #1–3 on usephalanx/phalanx-ci-fixer-testbed). Running all three through the real agent, real sandbox, and real GitHub CI surfaced a clear finding: identical tool sequence, identical coder flow, identical prompt, and cost in the same ballpark. The architecture already handles fix_type variance implicitly, because validate_cmd, env setup, and target files are extracted from the CI log and manifest files rather than hardcoded per class. A Router / StrategyRegistry abstraction would add code without adding capability. Instead: language playbooks (deterministic env setup per stack), one coverage-specific prompt rule, and two new escalation enums. The language router still exists (sandbox image + env planner), but that axis is already wired.
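The commit names language playbooks without showing them. A minimal sketch, assuming each playbook is keyed by a manifest file (the same kinds of manifests the v3 harness fixtures use) and holds only deterministic install steps; the marker table, `Playbook`, `select_playbook` and the exact commands are hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass(frozen=True)
class Playbook:
    stack: str
    install: Tuple[str, ...]  # deterministic env-setup commands, run in order


# Hypothetical marker -> playbook table, checked in order so the more specific
# lockfile wins. validate_cmd is deliberately absent: per the commit, it is
# extracted from the CI log rather than hardcoded.
PLAYBOOKS = (
    ("pnpm-lock.yaml", Playbook("node", ("pnpm install --frozen-lockfile",))),
    ("package-lock.json", Playbook("node", ("npm ci",))),
    ("pyproject.toml", Playbook("python", ("pip install -e .",))),
    ("pom.xml", Playbook("java", ("mvn -B dependency:resolve",))),
    ("*.csproj", Playbook("csharp", ("dotnet restore",))),
)


def select_playbook(repo_root: Path) -> Optional[Playbook]:
    """Return the first playbook whose marker file (or glob) exists at repo_root."""
    for marker, playbook in PLAYBOOKS:
        if any(repo_root.glob(marker)):
            return playbook
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "App.csproj").write_text("<Project />")
        print(select_playbook(root))
```

Keeping the table data-only is what lets it stay a playbook rather than a per-fix_type strategy class.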
rnagulapalle added a commit that referenced this pull request on Apr 24, 2026
Addresses three pre-deploy blockers surfaced by code review:
(1) Retry idempotency (review blocker #2)
Previous: on celery retry, _create_run attempted a second INSERT with the same run_id, hit IntegrityError, and the task re-raised, leaving the Run stuck in INTAKE forever.
Fix:
- _create_run → _create_or_load_run: first SELECT by id; if present, return it; otherwise INSERT (original path).
- execute() now branches on run.status after load:
  status == INTAKE → do the full ceremony (transitions + DAG)
  status != INTAKE → skip the ceremony and jump straight to the poll loop
This lets a retry resume mid-run instead of duplicating work OR raising on invalid transitions (validate_transition would reject e.g. VERIFYING → RESEARCHING).
(2) Append-then-transition race (review blocker #3)
Previous: _append_iteration_dag committed 3 PENDING tasks, then a separate _transition_run committed VERIFYING → EXECUTING. Between the two commits, a scheduled advance_run tick could observe status=VERIFYING + new PENDING tasks and dispatch a techlead task against the wrong run state.
Fix:
- New _append_iteration_and_transition() method that performs BOTH writes (task INSERTs + Run status UPDATE to EXECUTING) in the SAME DB session and commits once. advance_run's read is now always of a consistent state.
- validate_transition(VERIFYING, EXECUTING) is called at the top of the helper, so an invalid transition fails BEFORE any task writes.
- Old _append_iteration_dag deleted (dead code after the rename).
(3) Iteration cap clean-up (review blocker #1)
Previous: for _ in range(_MAX_ITERATIONS + 1) gave 4 loop passes, but the cap check terminates on pass 3, so the +1 was dead code and the comment ("+1 = the initial pass") was misleading.
Fix: range(_MAX_ITERATIONS) exactly. The comment explains the semantics.
Verified:
- commander helper inventory correct (6 private methods; old _append_iteration_dag + _create_run removed)
- Full import matrix clean (4 v3 agents + build + v2)
- 117 regression tests pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rnagulapalle added a commit that referenced this pull request on Apr 25, 2026
Catches the bug classes that bit us during the humanize canary. Each
bug cost one prod deploy cycle (12 min). The harness runs in 1.1s, so
the next round of v3 work gets a feedback loop over 600x faster
(720 s vs 1.1 s) on the infrastructure-bug class.
Files:
  tests/integration/v3_harness/
    fixtures/python/        pyproject.toml + workflow YAML with apt deps
    fixtures/typescript/    package.json + pnpm-lock.yaml + tsconfig.json
    fixtures/javascript/    package.json + package-lock.json + workflow
    fixtures/java/          pom.xml + workflow setting up JDK 17 + maven
    fixtures/csharp/        .csproj + global.json (SDK pin) + workflow
    test_celery_wiring.py             (8 tests)
    test_dag_persist_shape.py         (4 tests)
    test_env_detector_per_lang.py     (21 tests, 2 xfail markers + 1 xpass)
    test_fix_spec_parser.py           (18 tests covering humanize bug #4 shapes)
Per-language coverage (the lesson: Python's bugs do NOT generalize;
TS/Java/C# break in different ways):
  Python:      full assertions (we have full env_detector here)
               base_image respects requires-python lower bound (PB6),
               extras group install,
               apt deps from workflow YAML,
               ruff-modern-config flagged in tool_versions.
  TypeScript:  Phase-1 contract: stack='node', node-bearing image,
               Phase-1 notes flag incomplete detection. xfail markers
               document the future contract (pnpm-lock.yaml → pnpm install).
  JavaScript:  same Phase-1 contract; explicit "no phantom pnpm" check.
  Java:        Phase-1: stack='java', JDK-bearing image. xfail marker
               documents future <maven.compiler.target> pin → image tag.
  C#:          Phase-1: stack='csharp', dotnet SDK image. xfail marker
               for global.json sdk.version → image tag.
  Cross-lang:  apt-regex regression test parameterized over all 5 langs
               (Bug PB7: regex must stop at &&, |, ;, etc.).
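Two env-detector behaviors from the list above, sketched under assumptions: PB6 (pick the base image from the requires-python lower bound) and PB7 (apt package extraction that stops at shell operators). The `python:X.Y-slim` image naming, the 3.12 default, and both function names are hypothetical; the real env_detector is not shown.

```python
import re
from typing import List, Optional

# PB6: take the LOWER bound (>= or ~=) from requires-python.
_LOWER_BOUND_RE = re.compile(r"(?:>=|~=)\s*(\d+\.\d+)")

# PB7: capture the argument list after `apt-get install` / `apt install`, stopping
# at shell operators (&&, ||, |, ;, >) or end of line, so "&& echo done" is never
# swallowed as package names.
_APT_RE = re.compile(r"\bapt(?:-get)?\s+install\b([^&|;>\n]*)")


def base_image_for(requires_python: Optional[str], default: str = "3.12") -> str:
    """'>=3.10,<4' -> 'python:3.10-slim'; fall back to the default when no bound is found."""
    if requires_python:
        m = _LOWER_BOUND_RE.search(requires_python)
        if m:
            return f"python:{m.group(1)}-slim"
    return f"python:{default}-slim"


def apt_packages(run_script: str) -> List[str]:
    """Extract apt package names from a workflow `run:` script."""
    run_script = run_script.replace("\\\n", " ")  # join backslash line continuations
    packages: List[str] = []
    for m in _APT_RE.finditer(run_script):
        for token in m.group(1).split():
            if not token.startswith("-") and token not in packages:  # skip flags like -y
                packages.append(token)
    return packages
```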
What this harness DOES catch (the canary's bug classes):
  Bug #1 (celery include missing for new agents): test_v3_agent_module_in_celery_include
  Bug #3 (task lifecycle persistence missing): test_v3_persist_task_completion_helper_imports
  Bug #4 (fix_spec parser too strict): test_parse_json_embedded_in_prose + 14 others
  Bug #7 partial (apt regex shell-noise): test_apt_regex_does_not_swallow_shell_noise
  DAG-shape regressions (4 tasks, sre_modes, ordering, ci_context propagation)
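Bug #4 was a parser that rejected JSON embedded in prose. One tolerant approach, sketched here as an assumption rather than the project's fix_spec parser, is to try `json.JSONDecoder.raw_decode` at each `{` until one decodes; the fix_spec keys in the usage test are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded anywhere in `text`, or None.

    Tolerates prose before/after the object and ```json fences, which a
    strict json.loads(text) rejects.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)  # parse from this '{' only
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)  # try the next candidate brace
    return None
```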
What it does NOT catch (deferred to Tier-2):
  Bug #2 (_audit signature mismatch): needs real BaseAgent integration
  Bug #5 (tool_result API shape): needs a real OpenAI Responses API call
  Bug #6 (Sonnet stub): needs a real Anthropic call OR
                        in-process integration with run_coder_subagent
Tier-2 (real Postgres + real Docker + mocked LLM) is the next harness
to build, a ~200 LOC follow-up. Tier-3 is the canary process we already
have. The three tiers cover ascending blast radius and cost.
Updated docs/ci-fixer-v3-canary-retro.md to reflect that the harness is
now built, not deferred.
rnagulapalle added a commit that referenced this pull request on Apr 25, 2026
Three test files, 13 tests, ~0.8s. Each one is a static or schema-level guard against a bug class that bit us during the canary, without requiring real Anthropic, real OpenAI, or real Docker:
test_techlead_openai_message_shape.py (5 tests, bug #5)
  Mimics the OpenAI Responses API's input contract via a small schema validator. Re-runs cifix_techlead._tool_result_message and asserts its output would be ACCEPTED. If a future refactor regresses to role='tool' or a top-level tool_use_id (the actual canary failure), the validator raises ResponsesApiSchemaError before deploy.
test_engineer_wires_llm_call.py (5 tests, bug #6)
  Source-level inspection of cifix_engineer.execute(). Asserts:
  - run_coder_subagent is called
  - llm_call= is passed (not the test-only NotImplementedError stub)
  - build_sonnet_coder_callable + coder_subagent_tool_schemas + CODER_SUBAGENT_SYSTEM_PROMPT are imported
  Plus a sister check that v2's _call_sonnet_llm IS still a stub: the day someone wires it for real, this test reminds us we no longer need the explicit injection.
test_state_transition_audit.py (3 tests, bug #2)
  Asserts ALL four v3 agents inherit BaseAgent._audit unchanged (no shadowing). The signature-mismatch bug from canary #2 fails this check at import time. Plus a real-DB integration test that runs cifix_commander._transition_run('INTAKE','RESEARCHING') against a live Postgres row and verifies it doesn't raise TypeError; it skips cleanly if DATABASE_URL isn't reachable, so the dev workflow isn't blocked.
conftest.py
  Real-Postgres fixtures (db_engine module-scoped, db_session per-test with rollback) following the tests/integration/test_db_constraints pattern. Plus cifix_project + cifix_work_order fixtures with the work_order_type='ci_fix' shape.
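A sketch of the bug #5 schema guard. `ResponsesApiSchemaError` is named in the commit; the checks are an assumption based on the Responses API's `function_call_output` input item (`type`, `call_id`, `output`), limited to string output, and `tool_result_message` is a hypothetical stand-in for cifix_techlead._tool_result_message.

```python
from typing import Any, Dict


class ResponsesApiSchemaError(ValueError):
    """Raised when a tool-result item would be rejected (name from the commit)."""


def validate_tool_result_item(item: Dict[str, Any]) -> None:
    """Reject the shapes that broke the canary, then check the expected keys."""
    if item.get("role") == "tool":
        raise ResponsesApiSchemaError("role='tool' is the Chat Completions shape")
    if "tool_use_id" in item:
        raise ResponsesApiSchemaError("tool_use_id is the Anthropic shape; use call_id")
    if item.get("type") != "function_call_output":
        raise ResponsesApiSchemaError("type must be 'function_call_output'")
    call_id = item.get("call_id")
    if not isinstance(call_id, str) or not call_id:
        raise ResponsesApiSchemaError("call_id must be a non-empty string")
    if not isinstance(item.get("output"), str):  # this sketch only accepts string output
        raise ResponsesApiSchemaError("output must be a string")


def tool_result_message(call_id: str, output: str) -> Dict[str, Any]:
    """Hypothetical stand-in for cifix_techlead._tool_result_message."""
    return {"type": "function_call_output", "call_id": call_id, "output": output}
```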
Coverage of the canary bug list now:
  Bug | Class     | Tier-1 | Tier-2 | Tier-3
  #1  | infra     | ✓      |        |
  #2  | shadowing |        | ✓      |
  #3  | infra     | ✓      |        |
  #4  | parser    | ✓      |        |
  #5  | provider  |        | ✓      |
  #6  | wiring    |        | ✓      |
  #7  | prompt    |        |        | (canary)
  #8  | prompt    |        |        | (canary)
  apt | regex     | ✓      |        |
6 of 8 humanize-canary bugs are now caught locally pre-deploy. The remaining 2 (prompt issues) require a real LLM + real repo and stay in the canary process.
Combined harness runtime: 51 + 13 = 64 tests, ~2 seconds total.
Run with:
  pytest tests/integration/v3_harness/      (Tier-1, no deps)
  pytest tests/integration/v3_harness_t2/   (Tier-2, skips DB tests if Postgres absent)
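The two static guards (bug #2 shadowing, bug #6 wiring) need only the standard library: an identity comparison for `_audit`, and an `ast` walk for the `llm_call=` keyword. `BaseAgent`, the agent classes, the `_audit` signature and the source snippets are hypothetical stand-ins; a real test would pass `textwrap.dedent(inspect.getsource(fn))` instead of a literal string.

```python
import ast


class BaseAgent:
    def _audit(self, event: str, **details) -> None:  # hypothetical signature
        """Write an audit record (stubbed here)."""


class InheritingAgent(BaseAgent):
    pass


class ShadowingAgent(BaseAgent):
    def _audit(self, event):  # narrower signature: the canary-style mismatch
        pass


def shadows_audit(cls: type) -> bool:
    """True if cls resolves _audit to anything other than BaseAgent._audit."""
    return cls._audit is not BaseAgent._audit


def calls_with_keyword(source: str, callee: str, keyword: str) -> bool:
    """True if `source` contains a call to `callee(...)` that passes `keyword=`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name == callee and any(kw.arg == keyword for kw in node.keywords):
                return True
    return False


# Hypothetical execute() bodies standing in for cifix_engineer.execute().
WIRED_SRC = """
def execute(task):
    return run_coder_subagent(task, llm_call=build_sonnet_coder_callable())
"""

UNWIRED_SRC = """
def execute(task):
    return run_coder_subagent(task)
"""
```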
No description provided.