
Conversation

@srinathk10
Contributor

@srinathk10 srinathk10 commented Apr 25, 2025

Why are these changes needed?

Train release test: enable the multiprocessing forkserver start method so that DataLoader worker processes are CUDA-compatible. With the default fork start method, workers created after CUDA has been initialized in the parent process can deadlock (see the deadlock verification later in this conversation).
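For illustration, a minimal sketch of the kind of change this implies: giving the torch DataLoader a forkserver multiprocessing context so worker processes are not forked from a parent that may already have initialized CUDA. The build_loader helper below is hypothetical and not necessarily the exact change made in this PR:

    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader

    def build_loader(dataset, batch_size=32, num_workers=4):
        # Start DataLoader workers via forkserver instead of the default fork,
        # which avoids fork-after-CUDA-init hangs in the worker processes.
        ctx = mp.get_context("forkserver")
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,
            pin_memory=True,
            multiprocessing_context=ctx,
        )

DataLoader also accepts multiprocessing_context="forkserver" as a plain string if constructing a context object is inconvenient.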

Related issue number

#52641

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 added the go (add ONLY when ready to merge, run all tests) label Apr 25, 2025
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
self._metrics["validation/step"].get()
+ self._metrics["validation/iter_first_batch"].get()
# Exclude the time it takes to get the first batch.
# + self._metrics["validation/iter_first_batch"].get()
Contributor


can you also export time-to-first-batch as a benchmark result?

Contributor Author


This information is already captured; if it's useful, we can surface it on a dashboard by adding it to the results dataset.
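For illustration, a minimal sketch of how a time-to-first-batch timer can be recorded and rolled up into an exported summary. The Timer class and batch_iterator stand-in below are hypothetical; the benchmark's real metrics live in self._metrics and are written to /tmp/release_test_output.json as shown later in this thread:

    import time

    class Timer:
        def __init__(self):
            self.values = []

        def record(self, seconds):
            self.values.append(seconds)

        def summary(self, prefix):
            n = len(self.values)
            # Empty metrics report avg/min = inf and max = 0, matching the
            # Infinity/0 pattern visible in the results JSON below.
            return {
                f"{prefix}-avg": sum(self.values) / n if n else float("inf"),
                f"{prefix}-min": min(self.values, default=float("inf")),
                f"{prefix}-max": max(self.values, default=0),
                f"{prefix}-total": sum(self.values),
            }

    # Stand-in for the real batch iterator under test.
    batch_iterator = iter(range(3))

    iter_first_batch = Timer()
    start = time.perf_counter()
    first_batch = next(batch_iterator)  # time only the first batch
    iter_first_batch.record(time.perf_counter() - start)

    # Merged into the release-test results, e.g. train/iter_first_batch-total.
    print(iter_first_batch.summary("train/iter_first_batch"))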

srinathk10 and others added 2 commits April 25, 2025 19:10
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 marked this pull request as draft April 25, 2025 19:13
@srinathk10 srinathk10 changed the title Train release test: Enable multiprocess spawn (CUDA compatability) WIP: Train release test: Enable multiprocess spawn (CUDA compatability) Apr 25, 2025
srinathk10 and others added 3 commits April 25, 2025 19:34
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 changed the title WIP: Train release test: Enable multiprocess spawn (CUDA compatability) Train Benchmark: Enable multiprocess spawn (CUDA compatability) Apr 29, 2025
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 marked this pull request as ready for review April 29, 2025 18:37
@srinathk10
Contributor Author

Verified deadlock without the fix:

create_batch_iterator deadlocked here.

Process 119809: ray::RayTrainWorker
Python v3.9.22 (/home/ray/anaconda3/bin/python3.9)

Thread 119809 (idle): "MainThread"
    main_loop (ray/_private/worker.py:946)
    <module> (ray/_private/workers/default_worker.py:330)
Thread 120110 (idle): "Thread-1"
    push_local_metrics (ray/train/v2/_internal/callbacks/metrics.py:223)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 120142 (idle): "TrainingThread(train_fn_per_worker)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _poll (multiprocessing/connection.py:424)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    _try_get_data (torch/utils/data/dataloader.py:1133)
    _get_data (torch/utils/data/dataloader.py:1278)
    _next_data (torch/utils/data/dataloader.py:1329)
    __next__ (torch/utils/data/dataloader.py:631)
    create_batch_iterator (image_classification/factory.py:132)
    get_next_batch (train_benchmark.py:155)
    _train_epoch (train_benchmark.py:132)
    train_epoch (train_benchmark.py:92)
    run (train_benchmark.py:82)
    train_fn_per_worker (train_benchmark.py:340)
    train_fn (ray/train/v2/_internal/util.py:77)
    _run_target (ray/train/v2/_internal/execution/worker_group/thread_runner.py:37)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139277 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139278 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139279 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139280 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139281 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139286 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139293 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 139296 (idle): "QueueFeederThread"
    wait (threading.py:312)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:917)
    _bootstrap_inner (threading.py:980)
    _bootstrap (threading.py:937)
Thread 140082 (idle)
Thread 140083 (idle)
Thread 140084 (idle)
Thread 140085 (idle)
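Reading the dump: the training thread is parked in multiprocessing/connection.py polling the DataLoader worker result queue (_try_get_data), which is consistent with forked worker processes hanging after CUDA has already been initialized in the parent, the failure mode the forkserver start method avoids. As a sketch (an assumption about the shape of the fix, not necessarily the code in this PR), the start method can also be forced process-wide before any CUDA work happens:

    import multiprocessing as mp

    if __name__ == "__main__":
        # force=True keeps this safe if a start method was already configured.
        mp.set_start_method("forkserver", force=True)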

After fix:

2025-04-29 11:39:51,471 INFO test_utils.py:1953 -- Wrote results to /tmp/release_test_output.json
2025-04-29 11:39:51,472 INFO test_utils.py:1954 -- {"train/epoch-avg": 137.243852951, "train/epoch-min": 137.243852951, "train/epoch-max": 137.243852951, "train/epoch-total": 137.243852951, "train/iter_first_batch-avg": 12.170507315000123, "train/iter_first_batch-min": 12.170507315000123, "train/iter_first_batch-max": 12.170507315000123, "train/iter_first_batch-total": 12.170507315000123, "train/step-avg": 3.845457996243433e-06, "train/step-min": 1.8659998204384465e-06, "train/step-max": 6.698699962726096e-05, "train/step-total": 0.0075063340086671815, "train/rows_processed-avg": 32.0, "train/rows_processed-min": 32, "train/rows_processed-max": 32, "train/rows_processed-total": 62464, "train/iter_batch-avg": 0.05544941758657926, "train/iter_batch-min": 0.008546631000172056, "train/iter_batch-max": 3.4219362219996583, "train/iter_batch-total": 108.23726312900271, "validation/step-avg": Infinity, "validation/step-min": Infinity, "validation/step-max": 0, "validation/step-total": 0, "validation/iter_batch-avg": Infinity, "validation/iter_batch-min": Infinity, "validation/iter_batch-max": 0, "validation/iter_batch-total": 0, "checkpoint/download-avg": Infinity, "checkpoint/download-min": Infinity, "checkpoint/download-max": 0, "checkpoint/download-total": 0, "checkpoint/load-avg": Infinity, "checkpoint/load-min": Infinity, "checkpoint/load-max": 0, "checkpoint/load-total": 0, "train/iter_skip_batch-avg": Infinity, "train/iter_skip_batch-min": Infinity, "train/iter_skip_batch-max": 0, "train/iter_skip_batch-total": 0, "train/local_throughput": 577.0625251444112, "train/global_throughput": 9233.00040231058}
(RayTrainWorker pid=97376, ip=10.0.110.194) [ImageClassificationParquetTorchDataLoaderFactory/create_batch_iterator] Worker 4: Processing batch 1951 (shape: torch.Size([32, 3, 224, 224]), time since last: 0.08s) [repeated 93x across cluster]
(RayTrainWorker pid=97376, ip=10.0.110.194) [ImageClassificationParquetTorchDataLoaderFactory/create_batch_iterator] Worker 4: Completed device transfer for batch 1951 in 0.01s [repeated 94x across cluster]
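As a sanity check on the JSON above, the reported throughput numbers are internally consistent. This is an interpretation of how they appear to be derived, not a documented formula:

    # Numbers copied from release_test_output.json above.
    rows_total = 62464
    iter_batch_total = 108.23726312900271
    step_total = 0.0075063340086671815

    local = rows_total / (iter_batch_total + step_total)
    print(local)       # ~577.06, matches train/local_throughput
    print(local * 16)  # ~9233.0, matches train/global_throughput
                       # (the global/local ratio is exactly 16, suggesting
                       #  16 training workers in this run)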

@srinathk10
Contributor Author

https://buildkite.com/ray-project/release/builds/40180

Ray Data: train/global_throughput = 2689.2372718538104 Images/sec
Torch DataLoader: train/global_throughput = 2578.371649999563 Images/sec
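In other words, Ray Data ingest came out about 4% faster than the Torch DataLoader baseline in this run (2689.2 vs. 2578.4 images/sec).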

srinathk10 and others added 2 commits April 30, 2025 19:07
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@srinathk10 srinathk10 changed the title Train Benchmark: Enable multiprocess spawn (CUDA compatability) Train Benchmark: Enable multiprocess forkserver (CUDA compatability) Apr 30, 2025
@srinathk10 srinathk10 changed the title Train Benchmark: Enable multiprocess forkserver (CUDA compatability) Train Benchmark: Enable multiprocess forkserver (CUDA compatibility) Apr 30, 2025
@srinathk10
Contributor Author

https://buildkite.com/ray-project/release/builds/40258#_

training_ingest_benchmark-task=image_classification.skip_training.jpeg.local_fs.torch_dataloader
(RayTrainWorker pid=7458)   'train/global_throughput': 2803.249444614005,
(RayTrainWorker pid=7458)   'train/iter_first_batch-total': 9.588871331000007,

training_ingest_benchmark-task=image_classification.skip_training.jpeg.local_fs.tuned
train/global_throughput = 2525.355817549022
train/iter_first_batch-total = 24.637718398999993
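In this ingest-only (skip_training) comparison, the Torch DataLoader path came out about 11% faster (2803.2 vs. 2525.4 images/sec) and reached its first batch roughly 2.6x sooner (9.6 s vs. 24.6 s) than the tuned Ray Data path in this run.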

srinathk10 and others added 2 commits April 30, 2025 21:04
Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
@raulchen raulchen merged commit 78661cc into master Apr 30, 2025
5 checks passed
@raulchen raulchen deleted the srinathk10-train-release-test-fixes branch April 30, 2025 21:57
iamjustinhsu pushed a commit that referenced this pull request May 3, 2025
…52598)

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: jhsu <jhsu@anyscale.com>

Labels

community-backlog, go (add ONLY when ready to merge, run all tests)
