
Code stuck on "initializing ddp" when using more than one GPU #4612

Closed
JosephGatto opened this issue Nov 10, 2020 · 81 comments
Labels
bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task)

Comments

@JosephGatto

🐛 Bug

I am trying to run a PyTorch Lightning model on a 4-GPU node. In my trainer, if I specify

pl.Trainer(gpus=[0])

it runs fine. However, once I add more GPUs

pl.Trainer(gpus=[0,1,2,3])

I get this output:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4

And the model just hangs there forever. I have tried this with only 2 GPUs and get the same behavior.

Any idea why this may happen? I have tried with both ddp and ddp_spawn.

  • PyTorch version: tried both 1.4 and 1.7
  • OS: Linux
  • Installed with: pip
  • Python version: 3.8.5
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: NVIDIA K80s
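
(For anyone hitting the same hang, one low-effort diagnostic, sketched here and not part of the original report, is to turn on NCCL's own logging before the Trainer is built, so the rank that stalls during process-group setup at least leaves a trace. NCCL_DEBUG is the same variable discussed later in this thread.)

import os

# Sketch only: enable verbose NCCL logging before any CUDA/DDP work starts.
os.environ.setdefault("NCCL_DEBUG", "INFO")         # print NCCL init/transport details
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT")  # focus on initialization messages

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=[0, 1, 2, 3], accelerator="ddp")  # same multi-GPU setting that hangs above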
@JosephGatto added the bug and help wanted labels on Nov 10, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@edenlightning added the distributed label on Nov 11, 2020
@edenlightning
Contributor

Hey! Can you try to reproduce using our simple boring model?

Just to verify the bug isn't something in your model.
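
(For reference, a minimal script in the spirit of Lightning's BoringModel, sketched here with random data and a single linear layer rather than copied from the Lightning repo, would look roughly like this:)

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Random tensors, so the hang cannot be blamed on real data loading."""
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(), batch_size=8)
    # The same multi-GPU setting that reproduces the hang for the reporter.
    trainer = pl.Trainer(gpus=[0, 1], accelerator="ddp", max_epochs=1)
    trainer.fit(model, train_loader)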

@edenlightning added the waiting on author label on Nov 11, 2020
@JosephGatto
Author

Thanks for the response @edenlightning. This morning (about 12 hrs after my last attempt), I ran the simple boring model and it worked. I also added gpus=[0] and it worked. When I added gpus=[0,1] it worked the first time I ran it. I successfully ctrl+c'd out of it, then tried to run it again and it hangs forever. nvidia-smi shows no processes running, and I can still run the single-GPU approach after this. However, anything with more than one GPU no longer runs.

Thus, it seems like something bad happens when I ctrl+c a ddp process that prevents it from running again. Any idea what that might be?

@JosephGatto
Author

JosephGatto commented Nov 11, 2020

Update: 1) I had my system admin restart this GPU node and it still didn't run. Not sure how it got through that one time. 2) None of the accelerators ['ddp', 'ddp_spawn', 'dp'] run for me when using gpus=[0,1] on the boring model.

Just curious, could this be a related issue? pytorch/pytorch#1637 (comment) I am working on a 4-K80 node.

@JosephGatto
Author

@edenlightning any ideas about what the problem might be? ddp still isn't working on the boring model, and ddp_spawn gives the pickle error on the boring model ... I am very stuck.

@edenlightning
Contributor

@justusschock mind taking a look?

@edenlightning added the priority: 1 label and removed the waiting on author label on Nov 17, 2020
@justusschock
Member

Hi @JosephGatto, I am sorry that I cannot reproduce this, since I don't have these kinds of GPUs. But I can try to guide you through troubleshooting.

Yes, pytorch/pytorch#1637 (comment) seems to be related. Have you tried the steps mentioned there to track down the problem?

@JosephGatto
Author

Hi @justusschock thanks for offering your help! Sadly, this did not work. Any other ideas?

@SohamTamba

Hi,
I had this problem too while running PL on my university's SLURM cluster. I was trying to use DDP on 4 GPUs.

Following the tutorial, I set tasks per node to 4 (the number of GPUs) and it hung on initializing DDP.

I solved this by setting tasks per node to 8.

@justusschock
Member

@JosephGatto When you ctrl+c the ddp processes, do they still appear in nvidia-smi as zombie processes?

@JosephGatto
Author

@justusschock Correct. I can't even ctrl+c; I usually have to ctrl+z and then manually kill the process when I use ddp.

@justusschock
Member

What happens if you try to ctrl+c?

@JosephGatto
Author

@justusschock nothing, it just stays frozen. I am forced to ctrl+z.

@justusschock
Member

Even ctrl+c multiple times does not work?

@JosephGatto
Author

@justusschock correct

@yikuanli

yikuanli commented Dec 1, 2020

I have a similar problem. I'm using my university's cluster and see the exact same behavior; hope someone can help out.

@Stellakats

I have the exact same problem. Any idea what could fix this?

@alexionby

I faced the same problem. Looks like it is somehow connected to ZeroMQ (?)

Traceback (most recent call last):
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/traitlets/config/application.py", line 663, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 567, in initialize
    self.init_sockets()
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 271, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 218, in _bind_socket
    return self._try_bind_socket(s, port)
  File "/home/alex/anaconda3/envs/pt/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 194, in _try_bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 550, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 26, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

@JosephGatto
Author

@alexionby I am no expert, but in my PyTorch Lightning experience an 'Address already in use' error has been related to an occupied port or IP address.
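
(If the hang really is a stale port left over from a killed DDP run, one possible workaround, assuming the default environment-variable rendezvous that torch.distributed uses, is to point the new run at a free port before launching. A rough sketch:)

import os
import socket


def find_free_port():
    # Ask the OS for an unused TCP port (there is a small race window,
    # but it is usually good enough for local debugging).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


# Must be set before the DDP processes initialize their process group.
os.environ["MASTER_PORT"] = str(find_free_port())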

@alexionby

@JosephGatto, I figured out that in my case it was connected to jupyter-notebook and ZMQ. As a Python script it works fine. My case is closed.

@JosephGatto
Author

Hi @edenlightning @justusschock,

Does PyTorch Lightning support compute capability 3.7? One of the HPC specialists who manages my compute cluster tried debugging this today and said the issue is isolated to the K80 nodes; he got it to work on other nodes with compute capability 7.0.

Note: the K80s failed even after a driver update. He said that all GPUs passed to PyTorch Lightning were working very hard, but whatever process pushes the workload to the GPU is not returning to the host.

Thanks again.

@justusschock
Member

@JosephGatto We support whatever PyTorch supports; we don't do anything related to specific CUDA compute capabilities.
But even if the prebuilt binaries of PyTorch don't support your GPU, you should be able to compile it yourself.
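
(A quick way to check whether the installed PyTorch binary was even built for the K80's sm_37, using standard torch.cuda calls rather than anything Lightning-specific, is something like:)

import torch

# Compute capabilities the installed PyTorch binary was compiled for,
# e.g. ['sm_37', 'sm_50', ..., 'sm_75'] for the CUDA 10.x wheels.
# Note: get_arch_list() is only available in more recent PyTorch releases.
print("built-in arch list:", torch.cuda.get_arch_list())

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")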

@MarsSu0618

@JosephGatto
Did you solve it? I have the same issue; it gets stuck when I set ddp ...

@shreyaskamathkm

I have the same problem. Weirdly, it works well with V100 and P100 GPUs. But when I try using Tesla T4 GPUs, the code hangs.

@aleSuglia

aleSuglia commented Aug 25, 2021

@justusschock I have a main program that looks like this:

# Imports added for completeness; MyModel, PretrainingDataModule and
# parse_with_config are user-defined in the original project.
from argparse import ArgumentParser

from transformers import AutoConfig
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins import DDPPlugin


def main(args):
    config = AutoConfig.from_pretrained(
        args.model_name
    )

    model = MyModel.from_pretrained(args.model_name, config=config, args=args)

    if args.accelerator == "ddp":
        plugins = DDPPlugin(find_unused_parameters=True)
    else:
        plugins = None

    trainer = Trainer.from_argparse_args(args, callbacks=[
        ModelCheckpoint(
            monitor="mlm_val_loss",
            dirpath=args.output_dir,
            filename=f"{args.model_name}" + "-{epoch:02d}-mlm_loss={mlm_val_loss:.2f}"
        )
    ], plugins=plugins)

    dm = PretrainingDataModule(args)

    trainer.fit(model, datamodule=dm)


if __name__ == "__main__":
    parent_parser = ArgumentParser(add_help=False)
    parent_parser = Trainer.add_argparse_args(parent_parser)
    parser = MyModel.add_model_specific_args(parent_parser)
    args = parse_with_config(parser)
    main(args)

Then I run this using python on the command line specifying --gpus 4 --accelerator ddp --precision 16.

@Gateway2745

Gateway2745 commented Aug 26, 2021

This solved it for me.

  1. Don't set CUDA_LAUNCH_BLOCKING=1
  2. Use the PyTorch nightly build. My PL version is 1.4.2.

@tchaton
Contributor

tchaton commented Aug 26, 2021

Dear @aleSuglia,

Any chance you can provide a fully reproducible script with imports and data?

Best,
T.C

@aleSuglia

aleSuglia commented Aug 26, 2021

@tchaton Sorry, unfortunately I cannot. I can definitely say that I'm using Huggingface Transformers as my main library. My datasets are implemented using the classic PyTorch Dataset class; no IterableDataset involved.

@aleSuglia

@tchaton @justusschock While profiling my code, I noticed that get_train_batch() takes most of the time in my training loop, so I thought this might be creating the issue. Is this reasonable? Could it be that the data loaders are too slow and somehow the GPU processes are timing out?
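
(One way to separate a slow dataloader from a genuine DDP deadlock is to time the train loader directly, outside the Trainer. A rough sketch, reusing the hypothetical PretrainingDataModule and args from the script above:)

import time

dm = PretrainingDataModule(args)  # the same datamodule as in the script above
dm.setup("fit")
loader = dm.train_dataloader()

n_batches = 20
start = time.perf_counter()
for _ in zip(range(n_batches), loader):
    pass  # just pull batches; no GPU work involved
elapsed = time.perf_counter() - start
print(f"~{elapsed / n_batches:.2f} s per batch over {n_batches} batches")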

@ghost

ghost commented Sep 1, 2021

I am experiencing the same problem, except that it does not work in the 'dp' setting as well as in the 'ddp' setting.

I ran the CIFAR-10 example with multiple GPUs:

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule
from pl_bolts.transforms.dataset_normalizations import cifar10_normalization
from pytorch_lightning import LightningModule, seed_everything, Trainer
from pytorch_lightning.callbacks import LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
from torch.optim.lr_scheduler import OneCycleLR
from torch.optim.swa_utils import AveragedModel, update_bn
from torchmetrics.functional import accuracy

seed_everything(7)

PATH_DATASETS = os.environ.get('PATH_DATASETS', '.')
AVAIL_GPUS = torch.cuda.device_count()
BATCH_SIZE = 256 if AVAIL_GPUS else 64
NUM_WORKERS = int(os.cpu_count() / 2)

train_transforms = torchvision.transforms.Compose([
    torchvision.transforms.RandomCrop(32, padding=4),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    cifar10_normalization(),
])

test_transforms = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    cifar10_normalization(),
])

cifar10_dm = CIFAR10DataModule(
    data_dir=PATH_DATASETS,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    train_transforms=train_transforms,
    test_transforms=test_transforms,
    val_transforms=test_transforms,
)

def create_model():
    model = torchvision.models.resnet18(pretrained=False, num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    model.maxpool = nn.Identity()
    return model

class LitResnet(LightningModule):

    def __init__(self, lr=0.05):
        super().__init__()

        self.save_hyperparameters()
        self.model = create_model()

    def forward(self, x):
        out = self.model(x)
        return F.log_softmax(out, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        self.log('train_loss', loss)
        return loss

    def evaluate(self, batch, stage=None):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)

        if stage:
            self.log(f'{stage}_loss', loss, prog_bar=True)
            self.log(f'{stage}_acc', acc, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        self.evaluate(batch, 'val')

    def test_step(self, batch, batch_idx):
        self.evaluate(batch, 'test')

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(
            self.parameters(),
            lr=self.hparams.lr,
            momentum=0.9,
            weight_decay=5e-4,
        )
        steps_per_epoch = 45000 // BATCH_SIZE
        scheduler_dict = {
            'scheduler': OneCycleLR(
                optimizer,
                0.1,
                epochs=self.trainer.max_epochs,
                steps_per_epoch=steps_per_epoch,
            ),
            'interval': 'step',
        }
        return {'optimizer': optimizer, 'lr_scheduler': scheduler_dict}

model = LitResnet(lr=0.05)
model.datamodule = cifar10_dm

trainer = Trainer(
    progress_bar_refresh_rate=10,
    max_epochs=30,
    gpus=AVAIL_GPUS,
    accelerator='dp',
    logger=TensorBoardLogger('lightning_logs/', name='resnet'),
    callbacks=[LearningRateMonitor(logging_interval='step')],
)

trainer.fit(model, cifar10_dm)
trainer.test(model, datamodule=cifar10_dm)

The output hangs after just one step of training_step (one batch for each GPU).

Also, even if I press Ctrl+C multiple times, it does not halt, so I had to kill the process by looking it up in htop.

@tchaton
Contributor

tchaton commented Sep 15, 2021

Hey @saitjinwon,

I have tried your script using PyTorch Lightning master with both dp and ddp on 2 GPUs and it seems to work fine.

Would you mind trying out master?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git

Best,
T.C

@stale

stale bot commented Oct 15, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Oct 15, 2021
@FrancescoSaverioZuppichini

Same here.

stale bot removed the won't fix label on Oct 19, 2021
@kritiyer

kritiyer commented Oct 22, 2021

I am using a SLURM cluster and am experiencing the same problem when I try to use 2 GPUs on the same node for trainer.fit(). I tested using the Boring model and a pytorch torchvision model wrapped in a Lightning module, and the process hangs here:

Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

SLURM flags:

#!/bin/bash
#SBATCH --job-name=lightning_multiGPU_boring
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=12GB
#SBATCH --gres=gpu:2

I am able to use DP without issues but not DDP or DDP2. I set NCCL_DEBUG=INFO in my slurm batch script, but I don't see any extra information. I saw in this issue that num_gpus*num_nodes in the trainer should be the same as --ntasks in SLURM. I have set my trainer as follows:

trainer = pl.Trainer(gpus=[0,1], num_nodes = 1, accelerator="ddp", max_epochs=args.epochs, callbacks=[early_stopping, checkpoint_callback], benchmark=True)

pytorch version: 1.9.0
PL version: 1.4.9
types of GPUs: RTX 2080ti, 8 GPUs per node
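
(A small sanity check that can be dropped at the top of the training script, using standard SLURM environment variables rather than any Lightning API, to confirm the task count matches what the Trainer expects:)

import os

# The relation discussed above: the Trainer's gpus * num_nodes should
# match the total number of SLURM tasks, otherwise DDP may hang waiting
# for ranks that never start.
requested_gpus = 2  # len([0, 1]) from the Trainer call above
num_nodes = 1

ntasks = int(os.environ.get("SLURM_NTASKS", "0"))
ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE", "unset")
print(f"SLURM_NTASKS={ntasks}, SLURM_NTASKS_PER_NODE={ntasks_per_node}")

if ntasks != requested_gpus * num_nodes:
    print("Warning: --ntasks does not equal gpus * num_nodes")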

@awaelchli
Member

awaelchli commented Oct 22, 2021

Is --cpus-per-task compatible? I'm not sure.
Could you print

print(trainer.world_size, trainer.local_rank, trainer.global_rank, trainer.node_rank)
print(trainer.training_type_plugin)

it should print

2, 0, 0, 0
2, 1, 1, 0
DDPPlugin
DDPPlugin

I'm surprised you don't get anything with NCCL_DEBUG=INFO. Did you set it via export?

@kritiyer

I was able to run on a single GPU with these flags, maybe --cpus-per-task is too high for more than one GPU. I will reduce it and try again with the print statements.

I did set NCCL_DEBUG=INFO using export, I'm not sure why nothing from NCCL_DEBUG was in the logs.

@kritiyer

Decreasing --cpus-per-task did the trick; I can also see the NCCL_DEBUG information in the logs now.

The print statements came out as:

2 0 0 0
2 1 1 0
<pytorch_lightning.plugins.training_type.ddp.DDPPlugin object at 0x2aedab681730>

and the training was able to start. Thank you @awaelchli!

@bolandih

bolandih commented Nov 16, 2021

I have the same issue. I filed the issue here #10471

@carmocca
Member

carmocca commented Apr 6, 2022

Closing. If you are a future reader and none of the existing discussions helped you, please open a new issue with details and reproduction for your hang.

@ghost

ghost commented Oct 28, 2022

Hi, I had this problem too while running PL on my university's SLURM cluster. I was trying to use DDP on 4 GPUs.

Following the tutorial, I set tasks per node to 4 (the number of GPUs) and it hung on initializing DDP.

I solved this by setting tasks per node to 8.

Worked for me. Thanks for sharing.

@jshilong

Hi guys!
I've encountered a similar issue recently. My program hangs at 'initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2' when running a repository that requires an older PyTorch. I'm testing it with both 4090 and A100 GPUs, and because of changes in the compute architecture, these two cards have minimum CUDA Toolkit version requirements. However, the older PyTorch installation in the repository doesn't print any relevant error message, such as the one newer PyTorch versions give:

'NVIDIA GeForce RTX 4090 with CUDA capability sm_89 is not compatible with the current PyTorch installation. The current PyTorch installation supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.'

So you just need to update to a newer version of PyTorch with a compatible CUDA Toolkit version. Hopefully this helps some people.

@ImNotPrepared

Hi guys, I ran into the same problem. I am using SLURM at my university; just changing to --ntasks-per-node=1 solved my problem.

@ahmadikalkhorani

Using srun before python solved my problem:

srun python train.py ...
