Code stuck on "initalizing ddp" when using more than one gpu #4612
Comments
Hi! Thanks for your contribution, great first issue!
Hey! Can you try to reproduce using our simple boring model? Just to verify the bug isn't something in your model.
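For future readers, a minimal self-contained script in the spirit of the BoringModel check being asked for here. This is a sketch, not the exact model from the thread; the layer sizes, dataset, and `gpus` values are illustrative, and the `gpus=`/`accelerator=` arguments follow the Lightning 1.x API in use at the time:

```python
# Minimal "boring model"-style multi-GPU reproduction sketch (Lightning 1.x API).
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=[0, 1],        # hangs for the reporter; gpus=[0] works
        accelerator="ddp",  # also reproduced with ddp_spawn
    )
    trainer.fit(model, DataLoader(RandomDataset(), batch_size=32))
```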
Thanks for the response @edenlightning. This morning (about 12 hrs after my last attempt), I ran the simple boring model and it ran. I also added gpus=[0] and it worked. When I added gpus=[0,1] it worked the first time I ran it. I successfully ctrl+c'd out of it, then tried to run it again and it hangs forever. nvidia-smi shows no processes running, and I can continually run the single-GPU approach after this. However, any time I use more than one GPU from now on it does not run. Thus, it seems like something bad happens when I ctrl+c a ddp process that blocks it from happening again. Any idea what that might be?
Update: 1) I had my system admin restart this GPU node and it didn't run again. Not sure how it got through that one time. 2) None of the accelerators ['ddp', 'ddp_spawn', 'dp'] run for me when using gpus=[0,1] on the boring model. Just curious, could this be a related issue? pytorch/pytorch#1637 (comment) I am working on a 4-K80 node.
@edenlightning any ideas about what the problem might be? ddp still isn't working on the boring model and ddp_spawn gives the pickle error on the boring model... I am very stuck.
@justusschock mind taking a look?
Hi @JosephGatto I am sorry that I cannot reproduce this, since I don't have this kind of GPU. But I can try to guide you through troubleshooting. Yes, pytorch/pytorch#1637 (comment) seems to be related. Have you tried the steps mentioned there to track down the problem?
Hi @justusschock thanks for offering your help! Sadly, this did not work. Any other ideas?
Hi, following the tutorial, I set tasks per node to 4 (the number of GPUs) and it hung on "initializing ddp". I solved this by setting tasks per node to 8.
@JosephGatto When you ctrl+c the ddp processes, do they still appear in nvidia-smi?
@justusschock Correct. I can't even ctrl+c; I usually have to ctrl+z and then manually kill the process when I use ddp.
What happens if you try to ctrl+c?
@justusschock nothing, it just stays frozen. I am forced to ctrl+z.
Even ctrl+c multiple times does not work?
@justusschock correct
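For anyone left with orphaned DDP workers after a frozen run, a rough sketch of how one might find and kill them from Python. psutil and the "train.py" match string are assumptions; htop or `kill` from a shell works just as well:

```python
# Sketch: find leftover DDP worker processes by command line and terminate them.
# Requires psutil (pip install psutil); adjust the match string to your script name.
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "train.py" in cmdline and proc.pid != psutil.Process().pid:
        print(f"killing PID {proc.pid}: {cmdline}")
        proc.kill()
```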
I have a similar problem. I'm using my university's cluster and see the exact same behaviour; hope someone can help out.
I have the exact same problem. Any idea what could fix this?
I faced the same problem; it looks like it's somehow connected to ZeroMQ (?). The log shows an 'Address already in use' error.
@alexionby I am no expert, but in my PyTorch Lightning experience an 'Address already in use' error has been related to an occupied port or IP address.
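If the error really is a stale or occupied DDP port, one common workaround (a sketch, not something prescribed in this thread) is to point MASTER_PORT at a known-free port before constructing the Trainer:

```python
# Sketch: pick a free TCP port and expose it as MASTER_PORT so torch.distributed
# does not try to bind to a port left occupied by a previously killed run.
import os
import socket


def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # let the OS choose an unused port
        return s.getsockname()[1]


if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = str(find_free_port())
print("Using MASTER_PORT =", os.environ["MASTER_PORT"])
```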
@JosephGatto, I figured out that in my case it was connected to jupyter-notebook and ZMQ. As a Python script it works fine. My case is closed.
Hi @edenlightning @justusschock, Does PyTorch Lightning support compute capability 3.7? One of the HPC specialists who manage my compute cluster tried debugging this today and said the issue was isolated to the K80 nodes; he got it to work on other nodes that use compute capability 7.0. Note: the K80s failed even after a driver update. He said that all GPUs passed to PyTorch Lightning were working very hard, but whatever process pushes the workload to the GPU is not returning to the host. Thanks again.
@JosephGatto We support whatever PyTorch supports; we don't do anything related to specific CUDA compute capabilities.
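A quick diagnostic (not from the original discussion) to see whether the installed PyTorch wheel was built for a given card's compute capability:

```python
# Diagnostic sketch: compare each GPU's compute capability against the
# architectures the installed PyTorch build was compiled for.
import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch CUDA version:", torch.version.cuda)
print("Built for architectures:", torch.cuda.get_arch_list())

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (compute capability {major}.{minor})")
```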
@JosephGatto
I have the same problem. Weirdly, it works well with V100 and P100 GPUs. But when I try using Tesla T4 GPUs, the code hangs.
@justusschock I have a main program that looks like this:

```python
def main(args):
    config = AutoConfig.from_pretrained(
        args.model_name
    )

    model = MyModel.from_pretrained(args.model_name, config=config, args=args)

    if args.accelerator == "ddp":
        plugins = DDPPlugin(find_unused_parameters=True)
    else:
        plugins = None

    trainer = Trainer.from_argparse_args(args, callbacks=[
        ModelCheckpoint(
            monitor="mlm_val_loss",
            dirpath=args.output_dir,
            filename=f"{args.model_name}" + "-{epoch:02d}-mlm_loss={mlm_val_loss:.2f}"
        )
    ], plugins=plugins)

    dm = PretrainingDataModule(args)
    trainer.fit(model, datamodule=dm)


if __name__ == "__main__":
    parent_parser = ArgumentParser(add_help=False)
    parent_parser = Trainer.add_argparse_args(parent_parser)
    parser = MyModel.add_model_specific_args(parent_parser)
    args = parse_with_config(parser)
    main(args)
```

Then I run this using python on the command line, specifying the GPU and accelerator arguments.
This solved it for me.
Dear @aleSuglia, Any chance you can provide a fully reproducible script with imports and data? Best,
@tchaton Sorry, unfortunately I cannot. I can definitely say that I'm using Huggingface Transformers as my main library. My datasets are implemented using classic PyTorch Dataset classes.
@tchaton @justusschock I was profiling my code and I noticed that the
I am experiencing the same problem, except that it does not work in the 'dp' setting as well as the 'ddp' setting. I ran the CIFAR-10 example with multiple GPUs.
The output hangs after just one step of training_step (one batch for each GPU). Also, even if I press Ctrl+C multiple times, it does not halt, so I had to kill the process by looking it up in htop.
Hey @saitjinwon, I have tried your script using PyTorch Lightning master with both dp and ddp on 2 GPUs and it seems to work fine. Would you mind trying out master?
pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git
Best,
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
same here
I am using a SLURM cluster and am experiencing the same problem when I try to use 2 GPUs on the same node for trainer.fit(). I tested using the Boring model and a PyTorch torchvision model wrapped in a Lightning module, and the process hangs here:
SLURM flags:
I am able to use DP without issues, but not DDP or DDP2. I set NCCL_DEBUG=INFO in my SLURM batch script, but I don't see any extra information. I saw in this issue that num_gpus * num_nodes in the trainer should be the same as --ntasks in SLURM. I have set my trainer as follows (see also the sketch after this comment):
pytorch version: 1.9.0
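A rough sketch of a configuration that satisfies the rule mentioned above (gpus * num_nodes equal to the total number of SLURM tasks). The specific #SBATCH values, script name, and epoch count are illustrative, not taken from this thread:

```python
# Sketch: Trainer settings consistent with the SLURM rule discussed above.
# A matching batch script would request something like:
#   #SBATCH --nodes=1
#   #SBATCH --ntasks-per-node=2   # one task per GPU
#   #SBATCH --gres=gpu:2
#   srun python train.py
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=2,             # GPUs per node
    num_nodes=1,        # gpus * num_nodes == 2 == total SLURM tasks
    accelerator="ddp",  # Lightning 1.x-style argument, as used in this thread
    max_epochs=1,
)
```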
Is --cpus-per-task compatible? I'm not sure.
It should print:
I'm surprised you don't get anything with NCCL_DEBUG=INFO.
I was able to run on a single GPU with these flags; maybe --cpus-per-task is too high for more than one GPU. I will reduce it and try again with the print statements. I did set NCCL_DEBUG=INFO using export; I'm not sure why nothing from NCCL_DEBUG was in the logs.
Decreasing --cpus-per-task did the trick, and I can also see the NCCL_DEBUG information in the logs now. The print statements came out as expected, and the training was able to start. Thank you @awaelchli!
I have the same issue. I filed it here: #10471
Closing. If you are a future reader and none of the existing discussions helped you, please open a new issue with details and a reproduction of your hang.
Worked for me.
Hi guys! I got 'NVIDIA GeForce RTX 4090 with CUDA capability sm_89 is not compatible with the current PyTorch installation. The current PyTorch installation supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.' This is handled in the newer versions of PyTorch, so you just need to update to a newer version of PyTorch with a compatible CUDA Toolkit version. Hopefully this can help some people out.
Hi guys, I ran into the same problem. I am using SLURM at my university; just changing to --ntasks-per-node=1 solved my problem.
Using srun before python solved my problem: srun python train.py ...
🐛 Bug
I am trying to run a PyTorch Lightning model on a 4-GPU node. In my Trainer, if I specify a single GPU, it runs fine. However, once I add another GPU, I get this output:
```
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
```
And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.
Any idea why this may happen? I have tried with both ddp and ddp_spawn.
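For context, the Trainer configurations being contrasted are along these lines (a sketch based on the gpus values mentioned elsewhere in the thread, not the reporter's exact code):

```python
import pytorch_lightning as pl

# Runs fine on a single GPU:
trainer = pl.Trainer(gpus=[0], accelerator="ddp")

# Hangs at "initializing ddp" as soon as a second GPU is added
# (same behaviour with accelerator="ddp_spawn"):
trainer = pl.Trainer(gpus=[0, 1], accelerator="ddp")
```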