Skip to content

feat: Ray integration#17

Merged
TrevorBasinger merged 49 commits into
mainfrom
tb/ray-io
Feb 27, 2026
Merged

feat: Ray integration#17
TrevorBasinger merged 49 commits into
mainfrom
tb/ray-io

Conversation

@TrevorBasinger
Copy link
Copy Markdown
Member

@TrevorBasinger TrevorBasinger commented Feb 26, 2026

Adds end-to-end lineage tracking for Ray workloads.

What this does

  • Per-task file I/O capture via a worker setup hook (roar-worker entrypoint + builtins.open patch)
  • S3 artifact tracking through boto3, pandas, and PyArrow filesystem patches
  • Native OS-level tracing via LD_PRELOAD on each Ray node, coordinated by a per-node RoarNodeAgent actor
  • Multi-node support with a tiered log backend (Ray actor aggregator → filesystem fallback)
  • GLaaS registration — tasks, artifacts, and lineage edges written to the lineage store
  • Clean lineage — editable-install package paths and internal worker bundle dirs filtered out

Tests

18 e2e Ray tests (Docker cluster), 733 unit tests — all green. Live S3 pipeline validated.

Docs

  • docs/developer/ray-integration.md
  • docs/end-user/ray-integration.md

Trevor Basinger added 18 commits February 26, 2026 15:17
- Multi-node Ray cluster via Docker Compose (head + 2 workers + MinIO)
- Dockerfile based on rayproject/ray:2.44.1-py312 with roar installed
- Session-scoped pytest fixture for cluster lifecycle management
- Smoke tests: reachability, multi-node distribution, MinIO, shared volume
- Sample Ray jobs: basic file I/O, S3 I/O, multi-step ETL pipeline
- .dockerignore to exclude rust/target (8.3GB) from build context
- Design docs: ray-integration.md, ray-test-harness.md
- ray_e2e marker registered; e2e/ray excluded from default test run

All 5 smoke tests passing.
Four test files defining the target behaviour — all failing until
roar instruments Ray workers:

- test_file_io_capture.py: worker file writes/reads appear in lineage DB
- test_s3_capture.py: S3 I/O from workers captured via proxy
- test_task_attribution.py: each artifact tagged with ray_task_id
- test_multi_node_capture.py: I/O captured from remote worker containers,
  merged into single job record, native tracer on remote nodes

Also adds jobs/attributed_file_io.py — 6-task job for attribution testing.

These tests are the spec. Implementation follows.
- bench_ray_startup.py: job startup overhead vs baseline
- bench_ray_actor_ipc.py: actor IPC overhead
- bench_ray_preload.py: preload/native tracing overhead
- bench_ray_proxy.py: proxy operation overhead
- bench_ray_e2e.py: end-to-end job overhead
- ray_bench_utils.py: shared helpers
- run_ray_benchmarks.sh: runner script with summary output
- docs/design/ray-benchmarks.md: benchmark design doc
- docs/design/ray-log-collection.md: log collection design doc
- docs/design/ray-native-tracing.md: native tracing design doc

Results directory excluded via .gitignore.
Running all baseline iterations before roar allowed connection and cluster
warmup to artificially reduce measured roar overhead. Interleave runs so
both conditions experience identical thermal state each iteration.
Also bump DEFAULT_ITERATIONS to 10 and QUICK_ITERATIONS to 5.
…, add --iterations flag

- bench_ray_proxy: pre-upload unique keys per GetObject iteration to avoid page-cache hits giving artificially low overhead readings for large objects
- all benchmarks: swap ensure_ray_docker_cluster_running for wait_for_cluster_readiness so benchmarks don't start before Ray head is actually ready
- run_ray_benchmarks.sh: forward --iterations N to all benchmark scripts
Trevor Basinger added 11 commits February 26, 2026 16:22
Wire ROAR_WRAP=1, ROAR_PROJECT_DIR, ROAR_JOB_ID into the subprocess env so
that sitecustomize.py's Ray patching activates automatically when a user runs
`roar run python some_ray_script.py`. Fix collector job lookup to use the
existing job row (identified by ROAR_JOB_ID) rather than creating a duplicate.
Guard the execution service artifact recorder against double-counting.

Tests: unit tests for env propagation, collector job lookup, and dedup guard;
integration smoke test passes end-to-end.
Three options analyzed against ROAR_GOALS:
- Option A: extend current architecture + content hashing (low risk)
- Option B: roar run as py_executable (max fidelity, requires max_calls=1)
- Option C: roar-worker entry point (clean long-term target)

Recommendation: A now → A+hashing → C as target architecture
Deep dive into roar-worker entry point, per-task job UIDs, parent_job_uid
schema extension, fragment model, and full GLaaS registration payload.
Includes 6 open questions for review.
…tion C)

- roar/ray/fragment.py: TaskFragment + ArtifactRef dataclasses, derive_task_uid
- roar/ray/actor.py: append_fragment/get_all_fragments alongside legacy batch API
- roar/ray/roar_worker.py: long-lived entry point with task boundary detection,
  streaming blake3 hash-on-close, S3 ETag capture, actor attribution config
- roar/ray/collector.py: collect_fragments() writes per-task job records with
  parent_job_uid; deduplicates artifacts by hash
- roar/db/schema.py: parent_job_uid column + migration
- sitecustomize.py: use bash wrapper + roar.ray.roar_worker._startup setup hook
- roar/core/models/glaas.py: parent_job_uid in RegisterJobRequest
- roar/glaas_client.py: send parent_job_uid in job registration
- roar/services/upload/lineage_collector.py: include ray_task jobs in lineage walk
- roar/cli/commands/register.py: --as-blake3 flag for S3 ETag upgrade
- pyproject.toml: roar-worker entry point
- config: ray.actor_attribution option (per_call | per_actor)
- 6 live tests covering Ray job registration, parent_job_uid linkage, nested DAG, intermediate artifact inclusion, idempotency, dry-run
- scripts/setup_live_glaas_env.sh: one-command test env setup
- include parent-linked ray_task jobs in lineage collection so register/dry-run include Ray tasks
Trevor Basinger added 19 commits February 26, 2026 20:58
…y dedup

- jobs/s3_pipeline.py: 3-shard × 3-stage pipeline (ingest→train→eval→report)

- test_s3_pipeline.py: 5 DB-level e2e tests including cross-task S3 identity dedup

- test_ray_s3_pipeline_live.py: 4 live GLaaS tests covering session jobs, depth, cross-task lineage, and S3 hashes

- add roar register support for s3:// artifact paths with unit coverage

- include parent jobs in lineage collection for parent_job_uid-safe GLaaS registration
@TrevorBasinger TrevorBasinger changed the title WIP - Ray integration feat(ray): Ray integration — distributed pipeline provenance Feb 27, 2026
@TrevorBasinger TrevorBasinger changed the title feat(ray): Ray integration — distributed pipeline provenance feat: Ray integration Feb 27, 2026
@TrevorBasinger TrevorBasinger marked this pull request as ready for review February 27, 2026 19:03
@TrevorBasinger TrevorBasinger merged commit 53a7cdd into main Feb 27, 2026
33 of 36 checks passed
TrevorBasinger added a commit that referenced this pull request Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant