feat(ray): rework host-submit lineage around roar-worker and fragments#32

Merged
TrevorBasinger merged 8 commits into main from tb/cloud-demo-lineage-contracts on Mar 10, 2026

@TrevorBasinger TrevorBasinger commented Mar 10, 2026

Summary

This PR reworks the roar run ray job submit ... path so Ray jobs bootstrap through a dedicated roar-worker, stream lineage as encrypted fragments, and reconstitute that lineage back into the host-local .roar/roar.db after submission completes.

It also rounds out the supporting pieces: packaged preload/native tracer artifacts, new register targets for Ray-shaped lineage, and a much larger host-submit/cloud-topology test suite.
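
The reconstitution step in the summary can be illustrated with a small sketch. This is not the code in this PR: the fragment shape (a dict with an `artifacts` list of `hash`/`path` entries), the function names, and the example paths are all assumptions made for illustration, and fragment decryption and the write into `.roar/roar.db` are left out. It only shows the general idea of mapping paths from the extracted working_dir back to the host project and dropping duplicate artifact refs while merging.

```python
from pathlib import PurePosixPath


def restore_host_path(path, packaged_root, host_root):
    """Map a path under the extracted working_dir back to the host project.

    All three arguments are hypothetical POSIX-style path strings.
    """
    try:
        rel = PurePosixPath(path).relative_to(packaged_root)
    except ValueError:
        # Not under the packaged working_dir: leave the path untouched.
        return path
    return str(PurePosixPath(host_root) / rel)


def merge_fragments(fragments, packaged_root, host_root):
    """Merge artifact refs from decrypted fragments, deduping repeats.

    Assumed fragment shape: {"artifacts": [{"hash": str, "path": str}, ...]}.
    """
    merged = {}
    for fragment in fragments:
        for artifact in fragment["artifacts"]:
            host_path = restore_host_path(
                artifact["path"], packaged_root, host_root
            )
            # The same artifact may be reported more than once (for example
            # by a fallback delivery path), so key on hash plus host path.
            key = (artifact["hash"], host_path)
            merged.setdefault(key, {"hash": artifact["hash"], "path": host_path})
    return list(merged.values())


# Hypothetical example data: two fragments reporting the same artifact.
example_fragments = [
    {"artifacts": [{"hash": "abc123", "path": "/tmp/ray/pkg/data/out.csv"}]},
    {"artifacts": [{"hash": "abc123", "path": "/tmp/ray/pkg/data/out.csv"},
                   {"hash": "def456", "path": "/mnt/shared/model.bin"}]},
]
example_merged = merge_fragments(
    example_fragments, "/tmp/ray/pkg", "/home/user/project"
)
```

Keying on the restored host path rather than the cluster-side path means two refs that differ only in where the working_dir was extracted collapse into one entry.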

What Changed

  • Ray submit/bootstrap: wrap Ray job entrypoints with roar.ray.driver_entrypoint, use worker_process_setup_hook plus roar-worker instead of the old roar.ray.worker path, add fixed local proxy / node-agent bootstrap behavior, and support cluster-visible GLaaS/S3 endpoint overrides.
  • Driver/worker env handling: use RAY_JOB_ID and ROAR_JOB_INSTRUMENTED for Ray job auto-init and git-bypass behavior when drivers run from extracted working dirs rather than a repo checkout.
  • Fragment delivery: move Ray collection to fragment-first semantics, stream encrypted fragment batches to GLaaS, and treat local collection as reconstitution from fragments rather than the old actor/filesystem log path.
  • Fragment reconstitution: merge streamed fragments back into local jobs/artifacts, dedupe proxy fallback refs, restore packaged working_dir paths to host project paths, and materialize composite outputs during rebuild.
  • Tracing/packaging: package preload/native tracer artifacts with new sync/build scripts, ship roar_inject.pth, probe preload launcher readiness, add Linux GLIBC floor verification for wheels, and carry thread-aware metadata through native tracing.
  • Registration UX: extend roar register to accept artifact paths, DAG step refs like @4, and local session hash/prefixes, with better parent normalization for Ray-shaped/reconstituted jobs.
  • Docs/tests: update Ray integration docs to match the fragment-driven architecture and replace older actor/filesystem-focused Ray tests with host-submit/cloud-topology contract coverage, including S3, native tracing, timing, proxy reachability, crash/partial-fragment, and cloud-demo-like workloads.
  • Quality cleanup: remove stale coverage that still targeted the deleted roar.ray.worker contract and restore ruff, mypy, and unit-test green status after the refactor.
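
To make the three kinds of roar register target from the Registration UX item concrete, here is a rough sketch of how one argument could be classified. It is an illustration only: `classify_register_target`, the precedence order, and the rule that a session hash prefix is 4 to 64 hex characters are assumptions, not the resolution logic this PR implements.

```python
import os
import re

# Hypothetical patterns: "@4"-style DAG step refs and hex hash prefixes.
STEP_REF = re.compile(r"^@(\d+)$")
HASH_PREFIX = re.compile(r"^[0-9a-f]{4,64}$")


def classify_register_target(target, path_exists=os.path.exists):
    """Return a (kind, value) pair for one register argument.

    path_exists is injectable so the function can be exercised without
    touching the filesystem.
    """
    step = STEP_REF.match(target)
    if step:
        return ("step", int(step.group(1)))
    # Check the filesystem before the hash pattern so that a file whose
    # name happens to look like hex is still treated as an artifact.
    if path_exists(target):
        return ("artifact", target)
    if HASH_PREFIX.match(target.lower()):
        return ("session", target.lower())
    raise ValueError(f"unrecognized register target: {target!r}")
```

With this ordering, `classify_register_target("@4")` yields a step ref regardless of what is on disk, and an argument that is neither an existing path nor hex-shaped fails loudly instead of being guessed at.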

@TrevorBasinger TrevorBasinger force-pushed the tb/cloud-demo-lineage-contracts branch from b76a953 to 0236bc8 Compare March 10, 2026 18:05
@TrevorBasinger TrevorBasinger force-pushed the tb/cloud-demo-lineage-contracts branch from 0236bc8 to f07cda7 Compare March 10, 2026 18:37
@TrevorBasinger TrevorBasinger changed the title from "Tb/cloud demo lineage contracts" to "feat(ray): rework host-submit lineage around roar-worker and fragments" Mar 10, 2026
@TrevorBasinger TrevorBasinger marked this pull request as ready for review March 10, 2026 18:57
@TrevorBasinger TrevorBasinger merged commit dab4497 into main Mar 10, 2026
29 checks passed