
Conversation

@srinathk10
Contributor

@srinathk10 srinathk10 commented Apr 25, 2025

Why are these changes needed?

Train release test: enable the multiprocessing forkserver start method so that DataLoader worker processes are CUDA-compatible. With the default fork start method, workers created after CUDA has been initialized in the parent process can deadlock (see the deadlock verification later in this conversation).
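For illustration, a minimal sketch of the kind of change this implies: giving the torch DataLoader a forkserver multiprocessing context so worker processes are not forked from a parent that may already have initialized CUDA. The build_loader helper below is hypothetical and not necessarily the exact change made in this PR:

    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader

    def build_loader(dataset, batch_size=32, num_workers=4):
        # Start DataLoader workers via forkserver instead of the default fork,
        # which avoids fork-after-CUDA-init hangs in the worker processes.
        ctx = mp.get_context("forkserver")
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,
            pin_memory=True,
            multiprocessing_context=ctx,
        )

DataLoader also accepts multiprocessing_context="forkserver" as a plain string if constructing a context object is inconvenient.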

Related issue number

#52641

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 added the go (add ONLY when ready to merge, run all tests) label Apr 25, 2025
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
self._metrics["validation/step"].get()
+ self._metrics["validation/iter_first_batch"].get()
# Exclude the time it takes to get the first batch.
# + self._metrics["validation/iter_first_batch"].get()
Contributor


can you also export time-to-first-batch as a benchmark result?

Contributor Author


This information is already captured; if it's useful, we can surface it on a dashboard by adding it to the results dataset.
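For illustration, a minimal sketch of how a time-to-first-batch timer can be recorded and rolled up into an exported summary. The Timer class and batch_iterator stand-in below are hypothetical; the benchmark's real metrics live in self._metrics and are written to /tmp/release_test_output.json as shown later in this thread:

    import time

    class Timer:
        def __init__(self):
            self.values = []

        def record(self, seconds):
            self.values.append(seconds)

        def summary(self, prefix):
            n = len(self.values)
            # Empty metrics report avg/min = inf and max = 0, matching the
            # Infinity/0 pattern visible in the results JSON below.
            return {
                f"{prefix}-avg": sum(self.values) / n if n else float("inf"),
                f"{prefix}-min": min(self.values, default=float("inf")),
                f"{prefix}-max": max(self.values, default=0),
                f"{prefix}-total": sum(self.values),
            }

    # Stand-in for the real batch iterator under test.
    batch_iterator = iter(range(3))

    iter_first_batch = Timer()
    start = time.perf_counter()
    first_batch = next(batch_iterator)  # time only the first batch
    iter_first_batch.record(time.perf_counter() - start)

    # Merged into the release-test results, e.g. train/iter_first_batch-total.
    print(iter_first_batch.summary("train/iter_first_batch"))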

srinathk10 and others added 2 commits April 25, 2025 19:10
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 marked this pull request as draft April 25, 2025 19:13
@srinathk10 srinathk10 changed the title Train release test: Enable multiprocess spawn (CUDA compatability) WIP: Train release test: Enable multiprocess spawn (CUDA compatability) Apr 25, 2025
srinathk10 and others added 3 commits April 25, 2025 19:34
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 changed the title WIP: Train release test: Enable multiprocess spawn (CUDA compatability) Train Benchmark: Enable multiprocess spawn (CUDA compatability) Apr 29, 2025
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 marked this pull request as ready for review April 29, 2025 18:37
@srinathk10
Contributor Author

Verified deadlock without the fix:

create_batch_iterator deadlocked here.

Process 119809: ray::RayTrainWorker
Python v3.9.22 (/home/ray/anaconda3/bin/python3.9)

Thread 119809 (idle): "MainThread"
    main_loop (ray/_private/worker.py:946)
    <module> (ray/_private/workers/default_worker.py:330)
Thread 120110 (idle): "Thread-1"
    push_local_metrics (ray/train/v2/_internal/callbacks/metrics.py:223)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 120142 (idle): "TrainingThread(train_fn_per_worker)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    _try_get_data (torch/utils/data/dataloader.py:1133)
    _get_data (torch/utils/data/dataloader.py:1278)
    _next_data (torch/utils/data/dataloader.py:1329)
    __next__ (torch/utils/data/dataloader.py:631)
    create_batch_iterator (image_classification/factory.py:132)
    get_next_batch (train_benchmark.py:155)
    _train_epoch (train_benchmark.py:132)
    train_epoch (train_benchmark.py:92)
    run (train_benchmark.py:82)
    train_fn_per_worker (train_benchmark.py:340)
    train_fn (ray/train/v2/_internal/util.py:77)
    _run_target (ray/train/v2/_internal/execution/worker_group/thread_runner.py:37)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139277 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139278 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139279 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139280 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139281 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139286 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139293 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139296 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 140082 (idle)
Thread 140083 (idle)
Thread 140084 (idle)
Thread 140085 (idle)
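Reading the dump: the training thread is parked in multiprocessing/connection.py polling the DataLoader worker result queue (_try_get_data), which is consistent with forked worker processes hanging after CUDA has already been initialized in the parent, the failure mode the forkserver start method avoids. As a sketch (an assumption about the shape of the fix, not necessarily the code in this PR), the start method can also be forced process-wide before any CUDA work happens:

    import multiprocessing as mp

    if __name__ == "__main__":
        # force=True keeps this safe if a start method was already configured.
        mp.set_start_method("forkserver", force=True)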

After fix:

2025-04-29 11:39:51,471 INFO test_utils.py:1953 -- Wrote results to /tmp/release_test_output.json
2025-04-29 11:39:51,472 INFO test_utils.py:1954 -- {"train/epoch-avg": 137.243852951, "train/epoch-min": 137.243852951, "train/epoch-max": 137.243852951, "train/epoch-total": 137.243852951, "train/iter_first_batch-avg": 12.170507315000123, "train/iter_first_batch-min": 12.170507315000123, "train/iter_first_batch-max": 12.170507315000123, "train/iter_first_batch-total": 12.170507315000123, "train/step-avg": 3.845457996243433e-06, "train/step-min": 1.8659998204384465e-06, "train/step-max": 6.698699962726096e-05, "train/step-total": 0.0075063340086671815, "train/rows_processed-avg": 32.0, "train/rows_processed-min": 32, "train/rows_processed-max": 32, "train/rows_processed-total": 62464, "train/iter_batch-avg": 0.05544941758657926, "train/iter_batch-min": 0.008546631000172056, "train/iter_batch-max": 3.4219362219996583, "train/iter_batch-total": 108.23726312900271, "validation/step-avg": Infinity, "validation/step-min": Infinity, "validation/step-max": 0, "validation/step-total": 0, "validation/iter_batch-avg": Infinity, "validation/iter_batch-min": Infinity, "validation/iter_batch-max": 0, "validation/iter_batch-total": 0, "checkpoint/download-avg": Infinity, "checkpoint/download-min": Infinity, "checkpoint/download-max": 0, "checkpoint/download-total": 0, "checkpoint/load-avg": Infinity, "checkpoint/load-min": Infinity, "checkpoint/load-max": 0, "checkpoint/load-total": 0, "train/iter_skip_batch-avg": Infinity, "train/iter_skip_batch-min": Infinity, "train/iter_skip_batch-max": 0, "train/iter_skip_batch-total": 0, "train/local_throughput": 577.0625251444112, "train/global_throughput": 9233.00040231058}
(RayTrainWorker pid=97376, ip=10.0.110.194) [ImageClassificationParquetTorchDataLoaderFactory/create_batch_iterator] Worker 4: Processing batch 1951 (shape: torch.Size([32, 3, 224, 224]), time since last: 0.08s) [repeated 93x across cluster]
(RayTrainWorker pid=97376, ip=10.0.110.194) [ImageClassificationParquetTorchDataLoaderFactory/create_batch_iterator] Worker 4: Completed device transfer for batch 1951 in 0.01s [repeated 94x across cluster]
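As a sanity check on the JSON above, the reported throughput numbers are internally consistent. This is an interpretation of how they appear to be derived, not a documented formula:

    # Numbers copied from release_test_output.json above.
    rows_total = 62464
    iter_batch_total = 108.23726312900271
    step_total = 0.0075063340086671815

    local = rows_total / (iter_batch_total + step_total)
    print(local)       # ~577.06, matches train/local_throughput
    print(local * 16)  # ~9233.0, matches train/global_throughput
                       # (the global/local ratio is exactly 16, suggesting
                       #  16 training workers in this run)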

@srinathk10
Contributor Author

https://buildkite.com/ray-project/release/builds/40180

Ray Data: train/global_throughput = 2689.2372718538104 Images/sec
Torch DataLoader: train/global_throughput = 2578.371649999563 Images/sec
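In other words, Ray Data ingest came out about 4% faster than the Torch DataLoader baseline in this run (2689.2 vs. 2578.4 images/sec).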

srinathk10 and others added 2 commits April 30, 2025 19:07
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 changed the title Train Benchmark: Enable multiprocess spawn (CUDA compatability) Train Benchmark: Enable multiprocess forkserver (CUDA compatability) Apr 30, 2025
@srinathk10 srinathk10 changed the title Train Benchmark: Enable multiprocess forkserver (CUDA compatability) Train Benchmark: Enable multiprocess forkserver (CUDA compatibility) Apr 30, 2025
@srinathk10
Contributor Author

https://buildkite.com/ray-project/release/builds/40258#_

training_ingest_benchmark-task=image_classification.skip_training.jpeg.local_fs.torch_dataloader
(RayTrainWorker pid=7458)   'train/global_throughput': 2803.249444614005,
(RayTrainWorker pid=7458)   'train/iter_first_batch-total': 9.588871331000007,

training_ingest_benchmark-task=image_classification.skip_training.jpeg.local_fs.tuned
train/global_throughput = 2525.355817549022
train/iter_first_batch-total = 24.637718398999993
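In this ingest-only (skip_training) comparison, the Torch DataLoader path came out about 11% faster (2803.2 vs. 2525.4 images/sec) and reached its first batch roughly 2.6x sooner (9.6 s vs. 24.6 s) than the tuned Ray Data path in this run.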

srinathk10 and others added 2 commits April 30, 2025 21:04
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@raulchen raulchen merged commit 78661cc into master Apr 30, 2025
5 checks passed
@raulchen raulchen deleted the srinathk10-train-release-test-fixes branch April 30, 2025 21:57
iamjustinhsu pushed a commit that referenced this pull request May 3, 2025
…52598)

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: jhsu <jhsu@anyscale.com>

Labels

community-backlog, go (add ONLY when ready to merge, run all tests)
