
Code stuck on "initializing ddp" when using more than one GPU #4612

Closed
JosephGatto opened this issue Nov 10, 2020 · 81 comments
Labels
bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task)

Comments

@JosephGatto

🐛 Bug

I am trying to run a PyTorch Lightning model on a 4-GPU node. In my trainer, if I specify

pl.Trainer(gpus=[0])

it runs fine. However, once I add more GPUs

pl.Trainer(gpus=[0,1,2,3])

I get this output:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4

And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.

Any idea why this may happen? I have tried with both ddp and ddp_spawn.

  • PyTorch version: tried both 1.4 and 1.7
  • OS: Linux
  • Installed with: pip
  • Python version: 3.8.5
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: NVIDIA K80s
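
(For anyone hitting the same hang, one low-effort diagnostic, sketched here and not part of the original report, is to turn on NCCL's own logging before the Trainer is built, so the rank that stalls during process-group setup at least leaves a trace. NCCL_DEBUG is the same variable discussed later in this thread.)

import os

# Sketch only: enable verbose NCCL logging before any CUDA/DDP work starts.
os.environ.setdefault("NCCL_DEBUG", "INFO")         # print NCCL init/transport details
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT")  # focus on initialization messages

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=[0, 1, 2, 3], accelerator="ddp")  # same multi-GPU setting that hangs above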
@JosephGatto added the bug and help wanted labels on Nov 10, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@edenlightning added the distributed label on Nov 11, 2020
@edenlightning
Contributor

Hey! Can you try to reproduce using our simple boring model?

Just to verify the bug isn't something in your model.
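
(For reference, a minimal script in the spirit of Lightning's BoringModel, sketched here with random data and a single linear layer rather than copied from the Lightning repo, would look roughly like this:)

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, so the hang cannot be blamed on real data loading."""
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(), batch_size=8)
    # The same multi-GPU setting that reproduces the hang for the reporter.
    trainer = pl.Trainer(gpus=[0, 1], accelerator="ddp", max_epochs=1)
    trainer.fit(model, train_loader)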

@edenlightning added the waiting on author label on Nov 11, 2020
@JosephGatto
Author

Thanks for the response @edenlightning. This morning (about 12 hrs after my last attempt), I ran the simple boring model and it worked. I also added gpus=[0] and it worked. When I added gpus=[0,1] it worked the first time I ran it. I successfully ctrl+c'd out of it, then tried to run it again and it hangs forever. nvidia-smi shows no processes running, and I can still run the single-GPU approach after this. However, anything with more than one GPU no longer runs.

Thus, it seems like something bad happens when I ctrl+c a ddp process that prevents it from running again. Any idea what that might be?

@JosephGatto
Author

JosephGatto commented Nov 11, 2020

Update: 1) I had my system admin restart this GPU node and it still didn't run. Not sure how it got through that one time. 2) None of the accelerators ['ddp', 'ddp_spawn', 'dp'] run for me when using gpus=[0,1] on the boring model.

Just curious, could this be a related issue? pytorch/pytorch#1637 (comment) I am working on a 4-K80 node.

@JosephGatto
Author

@edenlightning any ideas about what the problem might be? ddp still isn't working on the boring model, and ddp_spawn gives the pickle error on the boring model ... I am very stuck.

@edenlightning
Contributor

@justusschock mind taking a look?

@edenlightning added the priority: 1 label and removed the waiting on author label on Nov 17, 2020
@justusschock
Member

Hi @JosephGatto, I am sorry that I cannot reproduce this, since I don't have these kinds of GPUs. But I can try to guide you through troubleshooting.

Yes, pytorch/pytorch#1637 (comment) seems to be related. Have you tried the steps mentioned there to track down the problem?

@JosephGatto
Author

Hi @justusschock thanks for offering your help! Sadly, this did not work. Any other ideas?

@SohamTamba

Hi,
I had this problem too while running PL on my university's SLURM cluster. I was trying to use DDP on 4 GPUs.

Following the tutorial, I set tasks per node to 4 (the number of GPUs) and it hung on initializing DDP.

I solved this by setting tasks per node to 8.

@justusschock
Member

@JosephGatto When you ctrl+c the ddp processes, do they still appear in nvidia-smi as zombie processes?

@JosephGatto
Author

@justusschock Correct. I can't even ctrl+c; I usually have to ctrl+z and then manually kill the process when I use ddp.

@justusschock
Member

What happens if you try to ctrl+c?

@JosephGatto
Author

@justusschock nothing, it just stays frozen. I am forced to ctrl+z.

@justusschock
Member

Even ctrl+c multiple times does not work?

@JosephGatto
Author

@justusschock correct

@yikuanli

yikuanli commented Dec 1, 2020

I have a similar problem. I'm using my university's cluster and see the exact same behavior; hope someone can help out.

@Stellakats

I have the exact same problem. Any idea what could fix this?

@alexionby

I faced the same problem. Looks like it is somehow connected to ZeroMQ (?)

Traceback (most recent call last):
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/traitlets/config/application.py", line 663, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 567, in initialize
    self.init_sockets()
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 271, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 218, in _bind_socket
    return self._try_bind_socket(s, port)
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 194, in _try_bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 550, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 26, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

@JosephGatto
Author

@alexionby I am no expert, but in my PyTorch Lightning experience an 'Address already in use' error has been related to an occupied port or IP address.
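
(If the hang really is a stale port left over from a killed DDP run, one possible workaround, assuming the default environment-variable rendezvous that torch.distributed uses, is to point the new run at a free port before launching. A rough sketch:)

import os
import socket


def find_free_port():
    # Ask the OS for an unused TCP port (there is a small race window,
    # but it is usually good enough for local debugging).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


# Must be set before the DDP processes initialize their process group.
os.environ["MASTER_PORT"] = str(find_free_port())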

@alexionby

@JosephGatto, I figured out that in my case it was connected to jupyter-notebook and ZMQ. As a Python script it works fine. My case is closed.

@JosephGatto
Author

Hi @edenlightning @justusschock,

Does PyTorch Lightning support compute capability 3.7? One of the HPC specialists who manages my compute cluster tried debugging this today and said the issue is isolated to the K80 nodes; he got it to work on other nodes with compute capability 7.0.

Note: the K80s failed even after a driver update. He said that all GPUs passed to PyTorch Lightning were working very hard, but whatever process pushes the workload to the GPU is not returning to the host.

Thanks again.

@justusschock
Member

@JosephGatto We support whatever PyTorch supports; we don't do anything related to specific CUDA compute capabilities.
But even if the prebuilt binaries of PyTorch don't support your GPU, you should be able to compile it yourself.
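
(A quick way to check whether the installed PyTorch binary was even built for the K80's sm_37, using standard torch.cuda calls rather than anything Lightning-specific, is something like:)

import torch

# Compute capabilities the installed PyTorch binary was compiled for,
# e.g. ['sm_37', 'sm_50', ..., 'sm_75'] for the CUDA 10.x wheels.
# Note: get_arch_list() is only available in more recent PyTorch releases.
print("built-in arch list:", torch.cuda.get_arch_list())

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")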

@MarsSu0618

@JosephGatto
Did you solve it? I have the same issue; it gets stuck when I set ddp ...

@shreyaskamathkm

I have the same problem. Weirdly, it works well with V100 and P100 GPUs. But when I try using Tesla T4 GPUs, the code hangs.

@aleSuglia

aleSuglia commented Aug 25, 2021

@justusschock I have a main program that looks like this:

# Imports added for completeness; MyModel, PretrainingDataModule and
# parse_with_config are user-defined in the original project.
from argparse import ArgumentParser

from transformers import AutoConfig
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DDPPlugin


def main(args):
    config = AutoConfig.from_pretrained(
        args.model_name
    )

    model = MyModel.from_pretrained(args.model_name, config=config, args=args)

    if args.accelerator == "ddp":
        plugins = DDPPlugin(find_unused_parameters=True)
    else:
        plugins = None

    trainer = Trainer.from_argparse_args(args, callbacks=[
        ModelCheckpoint(
            monitor="mlm_val_loss",
            dirpath=args.output_dir,
            filename=f"{args.model_name}" + "-{epoch:02d}-mlm_loss={mlm_val_loss:.2f}"
        )
    ], plugins=plugins)

    dm = PretrainingDataModule(args)

    trainer.fit(model, datamodule=dm)


if __name__ == "__main__":
    parent_parser = ArgumentParser(add_help=False)
    parent_parser = Trainer.add_argparse_args(parent_parser)
    parser = MyModel.add_model_specific_args(parent_parser)
    args = parse_with_config(parser)
    main(args)

Then I run this using python on the command line specifying --gpus 4 --accelerator ddp --precision 16.

@Gateway2745

Gateway2745 commented Aug 26, 2021

This solved it for me.

  1. Don't set CUDA_LAUNCH_BLOCKING=1
  2. Use the PyTorch nightly build. My PL version is 1.4.2.

@tchaton
Contributor

tchaton commented Aug 26, 2021

Dear @aleSuglia,

Any chance you can provide a fully reproducible script with imports and data?

Best,
T.C

@aleSuglia

aleSuglia commented Aug 26, 2021

@tchaton Sorry, unfortunately I cannot. I can definitely say that I'm using Huggingface Transformers as my main library. My datasets are implemented using the classic PyTorch Dataset class; no IterableDataset involved.

@aleSuglia

@tchaton @justusschock While profiling my code, I noticed that get_train_batch() takes most of the time in my training loop, so I thought this might be creating the issue. Is this reasonable? Could it be that the data loaders are too slow and somehow the GPU processes are timing out?
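
(One way to separate a slow dataloader from a genuine DDP deadlock is to time the train loader directly, outside the Trainer. A rough sketch, reusing the hypothetical PretrainingDataModule and args from the script above:)

import time

dm = PretrainingDataModule(args)  # the same datamodule as in the script above
dm.setup("fit")
loader = dm.train_dataloader()

n_batches = 20
start = time.perf_counter()
for _ in zip(range(n_batches), loader):
    pass  # just pull batches; no GPU work involved
elapsed = time.perf_counter() - start
print(f"~{elapsed / n_batches:.2f} s per batch over {n_batches} batches")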

@ghost

ghost commented Sep 1, 2021

I am experiencing the same problem, except that it does not work in the 'dp' setting as well as in the 'ddp' setting.

I ran the CIFAR-10 example with multiple GPUs:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, seed_everything, Trainer
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torch.optim.swa_utils import AveragedModel, update_bn
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get('PATH_DATASETS', '.')
AVAIL_GPUS = torch.cuda.device_count()
BATCH_SIZE = 256 if AVAIL_GPUS else 64
NUM_WORKERS = int(os.cpu_count() / 2)

train_transforms = torchvision.transforms.Compose([
    torchvision.transforms.RandomCrop(32, padding=4),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    cifar10_normalization(),
])

test_transforms = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    cifar10_normalization(),
])

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):

    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log('train_loss', loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f'{stage}_loss', loss, prog_bar=True)
            self.log(f'{stage}_acc', acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, 'val')

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, 'test')

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            'scheduler': OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            'interval': 'step',
        }
        return {'optimizer': optimizer, 'lr_scheduler': scheduler_dict}

model = LitResnet(lr=0.05)
model.datamodule = cifar10_dm

trainer = Trainer(
    progress_bar_refresh_rate=10,
    max_epochs=30,
    gpus=AVAIL_GPUS,
    accelerator='dp',
    logger=TensorBoardLogger('lightning_logs/', name='resnet'),
    callbacks=[LearningRateMonitor(logging_interval='step')],
)

trainer.fit(model, cifar10_dm)
trainer.test(model, datamodule=cifar10_dm)

The output hangs after just one step of training_step (one batch for each GPU).

Also, even if I press Ctrl+C multiple times, it does not halt, so I had to kill the process by looking it up in htop.

@tchaton
Contributor

tchaton commented Sep 15, 2021

Hey @saitjinwon,

I have tried your script using PyTorch Lightning master with both dp and ddp on 2 GPUs and it seems to work fine.

Would you mind trying out master?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git

Best,
T.C

@stale

stale bot commented Oct 15, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Oct 15, 2021
@FrancescoSaverioZuppichini

Same here.

stale bot removed the won't fix label on Oct 19, 2021
@kritiyer

kritiyer commented Oct 22, 2021

I am using a SLURM cluster and am experiencing the same problem when I try to use 2 GPUs on the same node for trainer.fit(). I tested using the Boring model and a pytorch torchvision model wrapped in a Lightning module, and the process hangs here:

Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

SLURM flags:

#!/bin/bash
#SBATCH --job-name=lightning_multiGPU_boring
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=12GB
#SBATCH --gres=gpu:2

I am able to use DP without issues but not DDP or DDP2. I set NCCL_DEBUG=INFO in my slurm batch script, but I don't see any extra information. I saw in this issue that num_gpus*num_nodes in the trainer should be the same as --ntasks in SLURM. I have set my trainer as follows:

trainer = pl.Trainer(gpus=[0,1], num_nodes = 1, accelerator="ddp", max_epochs=args.epochs, callbacks=[early_stopping, checkpoint_callback], benchmark=True)

pytorch version: 1.9.0
PL version: 1.4.9
types of GPUs: RTX 2080ti, 8 GPUs per node
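
(A small sanity check that can be dropped at the top of the training script, using standard SLURM environment variables rather than any Lightning API, to confirm the task count matches what the Trainer expects:)

import os

# The relation discussed above: the Trainer's gpus * num_nodes should
# match the total number of SLURM tasks, otherwise DDP may hang waiting
# for ranks that never start.
requested_gpus = 2  # len([0, 1]) from the Trainer call above
num_nodes = 1

ntasks = int(os.environ.get("SLURM_NTASKS", "0"))
ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE", "unset")
print(f"SLURM_NTASKS={ntasks}, SLURM_NTASKS_PER_NODE={ntasks_per_node}")

if ntasks != requested_gpus * num_nodes:
    print("Warning: --ntasks does not equal gpus * num_nodes")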

@awaelchli
Member

awaelchli commented Oct 22, 2021

Is --cpus-per-task compatible? I'm not sure.
Could you print

print(trainer.world_size, trainer.local_rank, trainer.global_rank, trainer.node_rank)
print(trainer.training_type_plugin)

it should print

2, 0, 0, 0
2, 1, 1, 0
DDPPlugin
DDPPlugin

I'm surprised you don't get anything with NCCL_DEBUG=INFO. Did you set it via export?

@kritiyer

I was able to run on a single GPU with these flags, maybe --cpus-per-task is too high for more than one GPU. I will reduce it and try again with the print statements.

I did set NCCL_DEBUG=INFO using export, I'm not sure why nothing from NCCL_DEBUG was in the logs.

@kritiyer

Decreasing --cpus-per-task did the trick; I can also see the NCCL_DEBUG information in the logs now.

The print statements came out as:

2 0 0 0
2 1 1 0
<pytorch_lightning.plugins.training_type.ddp.DDPPlugin object at 0x2aedab681730>

and the training was able to start. Thank you @awaelchli!

@bolandih

bolandih commented Nov 16, 2021

I have the same issue. I filed the issue here #10471

@carmocca
Member

carmocca commented Apr 6, 2022

Closing. If you are a future reader and none of the existing discussions helped you, please open a new issue with details and reproduction for your hang.

@ghost

ghost commented Oct 28, 2022

Hi, I had this problem too while running PL on my university's SLURM cluster. I was trying to use DDP on 4 GPUs.

Following the tutorial, I set tasks per node to 4 (the number of GPUs) and it hung on initializing DDP.

I solved this by setting tasks per node to 8.

Worked for me. Thanks for sharing.

@jshilong

Hi guys!
I've encountered a similar issue recently. My program hangs at 'initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2' when running a repository that requires an older PyTorch. I'm testing it with both 4090 and A100 GPUs, and because of changes in the compute architecture, these two cards have minimum CUDA Toolkit version requirements. However, the older PyTorch installation in the repository doesn't print any relevant error message, such as the one newer PyTorch versions give:

'NVIDIA GeForce RTX 4090 with CUDA capability sm_89 is not compatible with the current PyTorch installation. The current PyTorch installation supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.'

So you just need to update to a newer version of PyTorch with a compatible CUDA Toolkit version. Hopefully this helps some people.

@ImNotPrepared

Hi guys, I ran into the same problem. I am using SLURM at my university; just changing to --ntasks-per-node=1 solved my problem.

@ahmadikalkhorani

Using srun before python solved my problem:

srun python train.py ...
