
DistributedDataParallel: GRU module gets additional processes on GPU 0 (1st GPU) and takes more memory #70404

Closed
chgwan opened this issue Dec 25, 2021 · 6 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: rnn (Issues related to RNN support (LSTM, GRU, etc))

Comments

chgwan commented Dec 25, 2021

🐛 Describe the bug

Hi, thank you for your attention. While testing a simple example with DistributedDataParallel on a single node with 4 GPUs, I found that using the GRU or LSTM module spawns additional processes on GPU 0 and consumes extra memory there, while the Linear module does not have this problem. The test code snippet is as follows:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # The GRU or LSTM gets additional processes on GPU 0.
    toy_model = nn.GRU(10, 10, 1)
    # The Linear module does not have this problem.
    # toy_model = nn.Linear(10, 1)
    model = toy_model.to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    # Large iteration count keeps the workers busy so memory usage can be observed.
    pbar_len = int(1e10 / 2)
    for _ in range(pbar_len):
        input_seq = torch.randn(4, 20, 10)
        input_seq = input_seq.float().to(rank)
        ddp_model(input_seq)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    run_demo(demo_basic, world_size)
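
A quick way to see what is likely going on (a hypothetical diagnostic, not part of the original report): every spawned worker keeps cuda:0 as its current device, so any state a library allocates on the current device lands on GPU 0. A line like this inside demo_basic makes that visible:

    # Hypothetical diagnostic: without torch.cuda.set_device(rank),
    # every rank reports device 0 here, which hints at why extra CUDA
    # contexts appear on GPU 0 for the cuDNN-backed GRU/LSTM.
    print(f"rank {rank}: current CUDA device = {torch.cuda.current_device()}")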

I ran the script with python XX.py. The results for the GRU and Linear modules are as follows:
[Screenshot: GRU module result]

[Screenshot: Linear module result]

Versions

Collecting environment information...
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.7.1908 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: 3.4.2 (tags/RELEASE_34/dot2-final)
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-514.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla P100-PCIE-16GB

Nvidia driver version: 460.27.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] torch==1.10.0
[pip3] torch-tb-profiler==0.1.0
[pip3] torchaudio==0.10.0
[pip3] torchinfo==1.5.4
[pip3] torchvision==0.11.1
[conda] blas 1.0 mkl defaults
[conda] cudatoolkit 10.2.89 hfd86e86_1 defaults
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640 defaults
[conda] mkl-service 2.4.0 py39h7f8727e_0 defaults
[conda] mkl_fft 1.3.1 py39hd3c417c_0 defaults
[conda] mkl_random 1.2.2 py39h51133e4_0 defaults
[conda] mypy_extensions 0.4.3 py39h06a4308_0 defaults
[conda] numpy 1.21.2 py39h20f2e39_0 defaults
[conda] numpy-base 1.21.2 py39h79a1101_0 defaults
[conda] pytorch 1.10.0 py3.9_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch-tb-profiler 0.1.0 pypi_0 pypi
[conda] torchaudio 0.10.0 py39_cu102 pytorch
[conda] torchinfo 1.5.4 pyhd8ed1ab_0 conda-forge
[conda] torchvision 0.11.1 py39_cu102 pytorch

cc @zou3519 @ngimel @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

facebook-github-bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label on Dec 25, 2021
Hannibal046 commented Dec 26, 2021

Hi, I am also experiencing a similar problem when using DDP. I find that a large layer significantly slows down DDP training, and I cannot figure out why.
https://discuss.pytorch.org/t/super-weird-bug-more-gpu-lower-speed/140213

ngimel added the module: cuda and module: rnn labels and removed the oncall: distributed label on Dec 26, 2021
ngimel (Collaborator) commented Dec 26, 2021

Duplicate of #66203.
While the behavior you are seeing is a bug, with DDP you should set the current device explicitly instead of calling .to(rank) everywhere (e.g., add torch.cuda.set_device(rank) at the beginning of the demo_basic function). That will both work around the bug and give you slightly better performance.
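
Sketched minimally (chgwan's full corrected example appears below), the suggested change is one line right after process-group initialization:

    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Workaround: make cuda:rank the current device before any CUDA work,
    # so per-device library state is allocated on this rank's GPU.
    torch.cuda.set_device(rank)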

@Hannibal046

OK, got it. I think my problem is due to hardware: my computer case is too small to hold two RTX 3090s, so I use an extension cable, which causes a lot of I/O blocking. Thanks so much for the kind advice.

ngimel (Collaborator) commented Dec 26, 2021

@Hannibal046 sorry, I was replying to @chgwan. Your problem doesn't seem to be related to this issue, and the workaround also doesn't apply.

@Hannibal046

It's OK. I also learned an elegant and efficient way to put the model and data on the right device.

chgwan (Author) commented Dec 26, 2021

@ngimel Thank you very much for your elegant solution. It works well.
Here is a simple example for others' reference.

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Make cuda:rank the current device so that subsequent .cuda() calls
    # (and any internal allocations) land on this rank's GPU.
    torch.cuda.set_device(rank)
    toy_model = nn.GRU(10, 10, 1)
    model = toy_model.cuda()
    ddp_model = DDP(model, device_ids=[rank])
    pbar_len = int(1e10 / 2)
    for _ in range(pbar_len):
        input_seq = torch.randn(4, 20, 10)
        input_seq = input_seq.float().cuda()
        ddp_model(input_seq)
    dist.destroy_process_group()
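
With torch.cuda.set_device(rank) in place, plain .cuda() calls and the process's default CUDA context both resolve to cuda:rank, so no extra context is created on GPU 0; this is presumably why the additional processes and memory disappear.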

chgwan closed this as completed on Dec 26, 2021
facebook-github-bot pushed a commit that referenced this issue Jan 4, 2022
Summary:
Fixes #70404

Pull Request resolved: #70406

Reviewed By: mruberry

Differential Revision: D33407972

Pulled By: ngimel

fbshipit-source-id: 6bf97602ea13f8eaaff95d9f412a2eeaa0e6ba10
wconstab pushed a commit that referenced this issue Jan 5, 2022 (same summary as above)