[Tune] Multi GPU, Multi Node hyperparameter search not functioning #38505
Comments
Hi @f2010126, I think it's a problem with the scaling config. Here you specified:

```python
scaling_config = ScalingConfig(
    num_workers=2, use_gpu=use_gpu, resources_per_worker={"CPU": 2, "GPU": gpus_per_trial}
)
```

This `ScalingConfig` is per trial: it will try to allocate `num_workers * gpus_per_trial` GPUs for one trial. So the correct configuration would be:

```python
scaling_config = ScalingConfig(
    num_workers=gpus_per_trial, use_gpu=use_gpu, resources_per_worker={"CPU": 2, "GPU": 1}
)
```

In this case, you will launch `gpus_per_trial` workers per trial, each occupying 1 GPU.
Hi @woshiyyya, I don't think that's it. Documentation calls the
When I use the scaling config you suggest and check the ray status, it's only using 2 GPUs, leaving out the other 2. When I use the following to try and use all 4, my run hangs:
Ray status output:
Does Ray Tune (and Ray Train in general) not work for the multi-node combined with multi-GPU use case?
Hi @f2010126, I think there may be a misunderstanding of terms here. Let me clarify.

A trial is a Ray Tune trial. If you run 10 trials, you'll run 10 LightningTrainers, each occupying the resources specified in the `ScalingConfig`. If you use something like this sketch:
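```python
from ray.air.config import ScalingConfig

# illustrative sketch: 2 workers per trial, each reserving
# 2 CPUs and 2 GPUs (use_gpu=True assumed)
scaling_config = ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"CPU": 2, "GPU": 2}
)
```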
then each trial will start 2 workers, and each worker will occupy 2 CPUs and 2 GPUs. Thus, if your cluster has 4 GPUs, exactly one trial can run at a time. If you use, e.g., a config like this sketch instead:
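```python
# illustrative sketch: still 2 workers per trial, but each worker
# now only reserves 1 CPU and 1 GPU
scaling_config = ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"CPU": 1, "GPU": 1}
)
```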
this would mean that each trial still starts 2 workers, but each worker only occupies 1 CPU and 1 GPU. In the same cluster, this means that 2 trials can run at the same time. Note that the LightningTrainer itself also occupies 1 CPU. So if your nodes have exactly 2 CPUs and 2 GPUs, you should consider passing e.g. `trainer_resources={"CPU": 0}` (see the sketch below). Multi-node + multi-GPU training is one of the core use cases for Ray Train and Ray Tune :-) It's mostly a matter of configuration.
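Combining the above, a minimal sketch for nodes with exactly 2 CPUs and 2 GPUs (the exact values here are illustrative):

```python
from ray.air.config import ScalingConfig

# sketch: stop the trainer actor from reserving one of the
# node's 2 CPUs for itself, so both CPUs go to the workers
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": 1, "GPU": 1},
    trainer_resources={"CPU": 0},
)
```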
Hi @krfricke, thank you for the detailed explanation. I also used this tutorial and it is as you said. I was mixing up the concepts of actors, workers, and nodes. I have a few doubts:

Does this mean only one LightningTrainer will run, with access to 8 GPUs? Would it still train in a distributed fashion?

Each trial has 2 workers, each using 1 GPU and 1 CPU. Given 4 GPUs on the SLURM cluster, 2 trials would run. What config can I use so that 2 trials run in parallel, each with 2 GPUs per worker? I couldn't find the flag to set the number of trials (do I add a ConcurrencyLimiter?)

Thank you!
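For reference, the number of trials in Ray Tune is set on the Tuner rather than on the trainer; a minimal sketch, assuming the Ray 2.x `Tuner`/`TuneConfig` API (`lightning_trainer` and the `param_space` contents are placeholders, not from this thread):

```python
from ray import tune

# sketch: num_samples controls how many trials are generated;
# max_concurrent_trials caps how many run at the same time
tuner = tune.Tuner(
    lightning_trainer,  # hypothetical trainer built as in the snippets above
    param_space={"lightning_config": searchable_lightning_config},
    tune_config=tune.TuneConfig(num_samples=8, max_concurrent_trials=2),
)
results = tuner.fit()
```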
@krfricke @woshiyyya I tried your suggestions as well as what is suggested in the linked tutorial I added. A further doubt, though (should I raise this as a separate issue?). I currently search over:

```python
config = {
    "layer_1_size": tune.choice([32, 64, 128]),
    "layer_2_size": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
}
```

Is it possible to change the batch size of the DataModule, maybe by adding it to the builder like this?

```python
searchable_lightning_config = (
    LightningConfigBuilder()
    .module(config={
        "layer_1_size": tune.choice([32, 64, 128]),
        "layer_2_size": tune.choice([64, 128, 256]),
        "lr": tune.loguniform(1e-4, 1e-1),
    })
    .fit_params(datamodule=dm, config={"batch_size": tune.choice([32, 64, 128])})
    .build()
)
```

Or should I use the vanilla PyTorch Lightning with Tune example instead if I want to add batch size as a hyperparameter? Thanks again!
I think for tuning batch size, you can define train_dataloader and test_dataloader in your LightningModule. By the way, for Ray 2.7 we are deprecating LightningTrainer in favor of running Lightning code with TorchTrainer (using your own training function) to provide more flexibility. In that case, you can pass any parameters, including batch_size, via the train_loop_config.
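A minimal sketch of that TorchTrainer pattern (the function body and config keys are illustrative, not from this thread):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    # hypothetical per-worker training function: you build the
    # dataloaders and Lightning objects yourself, so any value in
    # `config` (including batch_size) can be used freely
    batch_size = config["batch_size"]
    lr = config["lr"]
    ...  # construct datamodule / model / pl.Trainer here and call fit()

trainer = TorchTrainer(
    train_func,
    # batch_size is just another entry in train_loop_config
    train_loop_config={"batch_size": 64, "lr": 1e-3},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```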
@woshiyyya, yes, I migrated to the TorchTrainer; it's much more customisable. Thanks!
What happened + What you expected to happen
I am unable to run Ray Tune with multiple GPUs on multiple nodes. I am using a SLURM cluster.
Steps:
ray status
Expected behavior
Ray Tune should use the available resources and perform the hyperparameter tuning. Instead, it is stuck at PENDING.
A similar issue is referenced here: #24259
I too wish to use HPBandster, but even with the default ASHA example it isn't working.
Observed behavior
The output remains at:

`ray status` returns:

```
Demands: {'CPU': 1.0} * 1, {'CPU': 2.0, 'GPU': 2.0} * 2 (PACK): 9+ pending placement groups
```
Would using the ray_lightning plugin be a better solution?
Note: when I have 1 GPU each on separate nodes, or 8 GPUs on a single node, things work just fine. It's the "n GPUs on m nodes" case that's the issue. Please advise.
Versions / Dependencies
ray == 2.5.0
pytorch-lightning == 2.0.7
NCCL version 2.14.3+cuda11.7
Python 3.10.6
Reproduction script
Issue Severity
High: It blocks me from completing my task.