fix(register): add Phase 3 to the bearer registration-session flow#134
Open
christophergeyer wants to merge 2 commits into
Open
fix(register): add Phase 3 to the bearer registration-session flow#134christophergeyer wants to merge 2 commits into
christophergeyer wants to merge 2 commits into
Conversation
Companion to https://github.com/treqs-inc/glaas-api/pull/50 which adds the matching server endpoint. # Background roar's 4-phase register flow has, for the standard session coordinator: Phase 2: create jobs (no I/O) Phase 3: POST /api/v1/artifacts/batch # full metadata (hash, size, …) Phase 4: POST /sessions/:h/jobs/:uid/inputs|outputs The bearer (registration-session) coordinator had Phase 2 and Phase 4 but NEVER had a Phase 3 — there was simply no server endpoint, no client method, no coordinator step. Artifact rows came into existence only by Phase 4 link requests falling back to 'size: BigInt(artifact.size ?? 0)' server-side. The session-selection logic in roar/application/publish/session.py:391 routes essentially every modern user (bearer token OR SSH keys OR anonymous_public-capable server) to the bearer flow. So this missing Phase 3 was the actual root cause of the 247 0-byte artifact stubs on the M1 nanochat DAG (ef174b91…) — 100% of M1-new artifacts. # Change - GlaasClient.register_artifacts_batch_under_registration_session(): new client method. Hits the new endpoint /api/v1/registration-sessions/:id/artifacts/batch with bearer auth. Strips 'session_hash' from each artifact payload since the URL scopes the request. - ArtifactRegistrationService.register_batch_under_registration_session(): mirrors register_batch (validation, size-based batching, error reporting) but targets the new transport. Reuses validate_artifact_registration; the registration_session_id stands in for session_hash to satisfy the non-placeholder check, then is stripped at the transport layer. - RegistrationCoordinator.register_lineage_under_registration_session(): accepts a new 'artifacts' parameter; runs Phase 3 between Phase 2 (create_jobs_batch_under_registration_session) and Phase 4 (link_job_artifacts_under_registration_session). Phase 4 still plows ahead regardless of Phase 3 outcome — matching the standard flow's behavior — so failures surface as link 404s once the corresponding glaas-api change lands. - RegisterService.register() (register_execution.py): the call site now runs prepare_batch_registration_artifacts (same helper the standard flow uses) on lineage.artifacts and passes them to the coordinator. # Backwards compat - artifacts is optional on the coordinator method (default None). Old callers that don't pass it get the previous behavior (no Phase 3, link endpoint still permissively stub-creates server-side). - Server-side: glaas-api PR #50 is additive; old roar versions continue to work against new glaas. New roar versions safely no-op against old glaas that don't have the staged endpoint yet (Phase 3 returns a network error which lands in errors[] but Phase 4 still tries — same shape as today). A follow-up server PR will flip registerRegistrationSessionJobInputs/Outputs to PR #49-style strict behavior (no stub-create). Once the new roar version is adopted, that PR closes the M1 bug class entirely. # Tests tests/unit/test_coordinator.py: - test_phase3_runs_with_artifacts: Phase 3 is called with the artifacts list before Phase 4 link runs. - test_phase3_skipped_when_no_artifacts: backwards-compat — omitted artifacts mean no Phase 3 call. - test_phase3_errors_collected_but_does_not_block_phase4: Phase 3 errors surface in errors[] but Phase 4 still runs (parity with the standard register_lineage). All 9 coordinator tests + 137 broader publish tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When roar (with the new Phase 3 step) talks to a glaas instance that doesn't yet have the staged-artifact endpoint (https://github.com/treqs-inc/glaas-api/pull/50), the POST returns 404. Previously this surfaced as 'Staged batch registration error: HTTP 404' and failed the whole 'roar register' invocation. Fall back silently instead: log an info, return clean from Phase 3, and let Phase 4's implicit stub-create work as the legacy fallback. (Yes, that's still the M1 bug — but it preserves the *current* permissive behavior for old glaas instances, which is no worse than before.) The 404-skip only applies on the FIRST batch with no prior successes. A 404 after successful batches would be a real anomaly (endpoint can't disappear mid-register) and gets surfaced normally. Tests (tests/unit/test_artifact_registration_phase3_fallback.py, 3 tests): - 404 on first batch ⇒ no errors surfaced - non-404 errors still surface - 404 after a successful batch surfaces normally This makes the failing CI test test_register_private_without_project_binding_uses_current_user_scope pass — its stub server returned 404 for the unknown new endpoint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds the missing Phase 3 step to roar's bearer-flow registration coordinator. Companion to glaas-api PR #50 which adds the matching server endpoint.
Why
roar's 4-phase register flow:
POST /api/v1/artifacts/batchThe bearer coordinator had no Phase 3. Artifact rows came into existence only by Phase 4 link requests falling back to
size: BigInt(artifact.size ?? 0)server-side.The auth-based routing in
roar/application/publish/session.py:391routes essentially every modern user (bearer token OR SSH keys OR anonymous_public-capable server) to the bearer flow. So this missing Phase 3 is the actual root cause of the 247 zero-byte artifact stubs on the M1 nanochat DAG (ef174b91…) — 100 % of M1-new artifacts.Change
integrations/glaas/client.pyregister_artifacts_batch_under_registration_session. Hits the new endpoint with bearer auth, stripssession_hashfrom per-artifact payload.integrations/glaas/registration/artifact.pyArtifactRegistrationService.register_batch_under_registration_sessionmirroringregister_batch's validation + size-based batching + error reporting, just targeting the new transport.integrations/glaas/registration/coordinator.pyregister_lineage_under_registration_sessiontakes anartifactsparameter and runs Phase 3 between Phase 2 and Phase 4. Phase 4 still plows ahead regardless of Phase 3 outcome (parity with the standard coordinator).application/publish/register_execution.pyprepare_batch_registration_artifacts(same helper the standard flow uses) and passes the result to the coordinator.Backwards compat
artifactsis optional on the coordinator method. Existing callers that don't pass it get the previous behavior (no Phase 3; the link endpoint's permissive stub-create remains server-side).errors[], but Phase 4 still tries — same shape as today's bug. Functional, just doesn't fix anything.A follow-up glaas-api PR will flip
registerRegistrationSessionJobInputs/Outputsto PR #49-style strict behavior (no implicit stub-create). Once that lands AND this roar version is adopted, the M1 bug class is closed entirely. The phasing means there's no breaking-change window — adoption can be at a normal pace.Tests
tests/unit/test_coordinator.py— 3 new tests in a newTestRegisterLineageUnderRegistrationSessionclass:test_phase3_runs_with_artifacts— Phase 3 is called with the artifacts list before Phase 4 link runs.test_phase3_skipped_when_no_artifacts— backwards-compat: omitted artifacts ⇒ no Phase 3 call.test_phase3_errors_collected_but_does_not_block_phase4— Phase 3 errors surface inerrors[]but Phase 4 still runs (parity with the standard register_lineage).All 9 coordinator tests + 137 broader publish tests pass.
🤖 Generated with Claude Code