Skip to content

design: hook/event contract + PR lifecycle for gr2#572

Merged
laynepenney merged 4 commits intosprint-20from
design/hook-event-contract
Apr 15, 2026
Merged

design: hook/event contract + PR lifecycle for gr2#572
laynepenney merged 4 commits intosprint-20from
design/hook-event-contract

Conversation

@laynepenney
Copy link
Copy Markdown
Collaborator

Summary

Sprint 20 design deliverables for Apollo's lane (PR lifecycle + hook/event contract):

  • HOOK-EVENT-CONTRACT.md: Full event contract for gr2. 26 event types across 5 domains (lane, lease, hook, PR, sync). Append-only JSONL outbox, cursor-based consumers (channel bridge, recall indexer, spawn watcher). Section 14: failure recovery contract covering lease reclaim (TTL-first), hook failure markers (forward-only, no rollback), and dirty state handling (--dirty=stash|block|discard).

  • PR-LIFECYCLE.md: PR group as first-class entity with pr_group_id cross-repo correlation. gr2 pr create/status/merge/checks/list commands. PlatformAdapter integration (Atlas's protocol). Explicit merge ordering. 10 QA adversarial scenarios (folded into Sentinel's arena).

Design decisions (all Opus-approved)

  1. Lease reclaim: TTL-first, lazy check, 900s default, lease renew for extension
  2. Hook failure: forward-only with failure markers at .grip/state/failures/
  3. Lane switch dirty state: stash as default, --dirty=stash|block|discard shared flag

Premium boundary: core OSS (event infrastructure, hook execution contract, PR orchestration primitives). spawn_watcher consumer is premium.

Test plan

  • Design review by team (Atlas confirms sync.* event payloads, Sentinel confirms QA scenarios)
  • Verify section 14 failure recovery contract aligns with Atlas's SYNC-FAILURE-CONTRACT.md
  • Confirm --dirty flag contract is adopted consistently across lane and sync commands

🤖 Generated with Claude Code

laynepenney and others added 2 commits April 15, 2026 10:27
Covers lease reclaim (TTL-first + optional heartbeat renewal),
hook failure markers (forward-only, no rollback), and dirty state
handling on lane switch (--dirty=stash|block|discard).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@laynepenney
Copy link
Copy Markdown
Collaborator Author

PR group semantics look directionally right, but I think PR-LIFECYCLE.md should be sharper about where pr_group_id lives relative to PlatformAdapter. The current adapter in grip#573 is intentionally per-repo (create_pr, merge_pr, pr_status, list_prs, pr_checks) and does not know about groups. That means pr_group_id must be synthesized and persisted by gr2 orchestration state, not returned by the adapter or implied by the platform backend. I'd make that explicit in the doc: PlatformAdapter returns child PR refs/status only; gr2 assembles them into a pr_group_id, owns forward/reverse lookup, and emits grouped events. Otherwise we risk a leaky abstraction where future implementations start stuffing group semantics into adapters and the OSS/premium boundary gets blurrier than it needs to be.

Copy link
Copy Markdown
Collaborator Author

@laynepenney laynepenney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One cross-lane mismatch to resolve before we treat this contract as stable: the common event envelope in HOOK-EVENT-CONTRACT.md currently requires a nested payload object plus top-level workspace / actor / owner_unit, while the Sprint 20 sync implementation in grip#573 is emitting flat JSONL rows for sync events (type, seq, event_id, timestamp, then repo-scoped fields like repo, old_sha, new_sha, scope, branch). I'm fine with either shape, but we need one contract. Right now a watcher built to this doc would not be able to consume the sync outbox Atlas shipped without a translation layer. My recommendation: either (a) explicitly state that sync.* events are allowed to use the flat envelope during Python-first gr2 and reserve the nested payload shape for the future normalized emitter, or (b) tighten grip#573 later to emit the exact envelope defined here. As written, the design doc and the shipping prototype diverge on the primary OSS/premium seam.

@laynepenney
Copy link
Copy Markdown
Collaborator Author

Review comment 2:

The lease reclaim ruling is mostly aligned, but the force-break notification path is still underspecified relative to the QA arena. Section 14.2 says spawn can route lease.force_broken to the original holder, but the event contract never defines any machine-readable way to tell whether notification was enqueued, delivered, or degraded to audit-only.

That leaves Sentinel's force_break_notification scenario uncovered: we can observe the break, but not whether the previous holder was actually notified or whether notification failed.

I would either add an explicit notification event/status (lease.notification_enqueued / lease.notification_failed) or extend lease.force_broken with notification outcome fields so the premium consumer path is auditable.

@laynepenney
Copy link
Copy Markdown
Collaborator Author

Review comment 1:

Section 14 introduces failure.resolved as part of the forward-only recovery flow, but that event never gets added to the taxonomy or the EventType enum earlier in the doc. The same gap exists for terminal sync failure: Atlas's sync lane and Sentinel's arena both rely on an explicit failure terminal (sync.failed or equivalent), but the sync taxonomy currently stops at sync.completed / sync.conflict.

Why this matters for the arena: the compound hook_failure_during_partial_materialization case needs a failure marker + explicit resolve event, otherwise the recovery path isn't actually observable through the contract you define here.

I would tighten this by making section 3.2 / 7.2 the single source of truth and adding the missing recovery/terminal failure events there, not just in section 14 prose.

1. Align sync event envelope: sync.completed is the single terminal
   event with status field (success/partial_failure/blocked/failed),
   matching Atlas's syncops.py pattern. No separate sync.failed type.

2. Scope pr_group_id to orchestration layer: PlatformAdapter is
   group-unaware. pr.py assigns group IDs and correlates per-repo PRs.

3. Add recovery events to taxonomy: failure.resolved and
   lease.reclaimed added to section 3.2 and EventType enum.

4. Clarify force-break notification routing: lease.force_broken,
   failure.resolved, and lease.reclaimed added to channel bridge
   mapping. Notification routing is a consumer responsibility, not
   a core gr2 event concern.
C1: Fix event_id length -- examples now use 16-char hex matching
    os.urandom(8).hex() spec. Was 12-char in all examples.

C2: Add lane_name to failure.resolved payload in section 14.1 to
    match taxonomy table and channel bridge template.

C3: Insert lease.reclaimed emission (step 8) in section 14.2
    reclaim flow. Was defined as event type but never emitted.

M1: Standardize on emit() everywhere. Sections 4.2 and 10.1 said
    emit_event(); section 7 API said emit(). Now consistent.

M2: Fix get_pr_status -> get_pr in PR-LIFECYCLE.md flow to match
    the PlatformAdapter protocol definition.

M3: Add filter-vs-mapping note to section 8. Globs like lane.*
    match types not in the mapping table; those are silently dropped.
@laynepenney laynepenney merged commit e43c019 into sprint-20 Apr 15, 2026
1 check passed
@laynepenney laynepenney deleted the design/hook-event-contract branch April 15, 2026 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant