feat: Ray integration#17
Merged
Merged
Conversation
added 18 commits
February 26, 2026 15:17
- Multi-node Ray cluster via Docker Compose (head + 2 workers + MinIO) - Dockerfile based on rayproject/ray:2.44.1-py312 with roar installed - Session-scoped pytest fixture for cluster lifecycle management - Smoke tests: reachability, multi-node distribution, MinIO, shared volume - Sample Ray jobs: basic file I/O, S3 I/O, multi-step ETL pipeline - .dockerignore to exclude rust/target (8.3GB) from build context - Design docs: ray-integration.md, ray-test-harness.md - ray_e2e marker registered; e2e/ray excluded from default test run All 5 smoke tests passing.
Four test files defining the target behaviour — all failing until roar instruments Ray workers: - test_file_io_capture.py: worker file writes/reads appear in lineage DB - test_s3_capture.py: S3 I/O from workers captured via proxy - test_task_attribution.py: each artifact tagged with ray_task_id - test_multi_node_capture.py: I/O captured from remote worker containers, merged into single job record, native tracer on remote nodes Also adds jobs/attributed_file_io.py — 6-task job for attribution testing. These tests are the spec. Implementation follows.
- bench_ray_startup.py: job startup overhead vs baseline - bench_ray_actor_ipc.py: actor IPC overhead - bench_ray_preload.py: preload/native tracing overhead - bench_ray_proxy.py: proxy operation overhead - bench_ray_e2e.py: end-to-end job overhead - ray_bench_utils.py: shared helpers - run_ray_benchmarks.sh: runner script with summary output - docs/design/ray-benchmarks.md: benchmark design doc - docs/design/ray-log-collection.md: log collection design doc - docs/design/ray-native-tracing.md: native tracing design doc Results directory excluded via .gitignore.
Running all baseline iterations before roar allowed connection and cluster warmup to artificially reduce measured roar overhead. Interleave runs so both conditions experience identical thermal state each iteration. Also bump DEFAULT_ITERATIONS to 10 and QUICK_ITERATIONS to 5.
…, add --iterations flag - bench_ray_proxy: pre-upload unique keys per GetObject iteration to avoid page-cache hits giving artificially low overhead readings for large objects - all benchmarks: swap ensure_ray_docker_cluster_running for wait_for_cluster_readiness so benchmarks don't start before Ray head is actually ready - run_ray_benchmarks.sh: forward --iterations N to all benchmark scripts
67833d2 to
832209f
Compare
added 11 commits
February 26, 2026 16:22
Wire ROAR_WRAP=1, ROAR_PROJECT_DIR, ROAR_JOB_ID into the subprocess env so that sitecustomize.py's Ray patching activates automatically when a user runs `roar run python some_ray_script.py`. Fix collector job lookup to use the existing job row (identified by ROAR_JOB_ID) rather than creating a duplicate. Guard the execution service artifact recorder against double-counting. Tests: unit tests for env propagation, collector job lookup, and dedup guard; integration smoke test passes end-to-end.
Three options analyzed against ROAR_GOALS: - Option A: extend current architecture + content hashing (low risk) - Option B: roar run as py_executable (max fidelity, requires max_calls=1) - Option C: roar-worker entry point (clean long-term target) Recommendation: A now → A+hashing → C as target architecture
Deep dive into roar-worker entry point, per-task job UIDs, parent_job_uid schema extension, fragment model, and full GLaaS registration payload. Includes 6 open questions for review.
…tion C) - roar/ray/fragment.py: TaskFragment + ArtifactRef dataclasses, derive_task_uid - roar/ray/actor.py: append_fragment/get_all_fragments alongside legacy batch API - roar/ray/roar_worker.py: long-lived entry point with task boundary detection, streaming blake3 hash-on-close, S3 ETag capture, actor attribution config - roar/ray/collector.py: collect_fragments() writes per-task job records with parent_job_uid; deduplicates artifacts by hash - roar/db/schema.py: parent_job_uid column + migration - sitecustomize.py: use bash wrapper + roar.ray.roar_worker._startup setup hook - roar/core/models/glaas.py: parent_job_uid in RegisterJobRequest - roar/glaas_client.py: send parent_job_uid in job registration - roar/services/upload/lineage_collector.py: include ray_task jobs in lineage walk - roar/cli/commands/register.py: --as-blake3 flag for S3 ETag upgrade - pyproject.toml: roar-worker entry point - config: ray.actor_attribution option (per_call | per_actor)
- 6 live tests covering Ray job registration, parent_job_uid linkage, nested DAG, intermediate artifact inclusion, idempotency, dry-run - scripts/setup_live_glaas_env.sh: one-command test env setup - include parent-linked ray_task jobs in lineage collection so register/dry-run include Ray tasks
added 19 commits
February 26, 2026 20:58
…y dedup - jobs/s3_pipeline.py: 3-shard × 3-stage pipeline (ingest→train→eval→report) - test_s3_pipeline.py: 5 DB-level e2e tests including cross-task S3 identity dedup - test_ray_s3_pipeline_live.py: 4 live GLaaS tests covering session jobs, depth, cross-task lineage, and S3 hashes - add roar register support for s3:// artifact paths with unit coverage - include parent jobs in lineage collection for parent_job_uid-safe GLaaS registration
…ments without ray
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds end-to-end lineage tracking for Ray workloads.
What this does
roar-workerentrypoint +builtins.openpatch)RoarNodeAgentactorTests
18 e2e Ray tests (Docker cluster), 733 unit tests — all green. Live S3 pipeline validated.
Docs
docs/developer/ray-integration.mddocs/end-user/ray-integration.md