
Multi GPU RuntimeError: Model replicas must have an equal number of parameters. #11

Closed
lhwcv opened this issue Jun 3, 2020 · 15 comments
Labels: bug, Stale

Comments

lhwcv commented Jun 3, 2020

🐛 Bug

When training with 4× 2080 Ti GPUs:
"RuntimeError: Model replicas must have an equal number of parameters."
(training on a single GPU works fine)

## To Reproduce

REQUIRED: Code to reproduce your issue below

CUDA_VISIBLE_DEVICES=0,1,2,3 python  train.py --device 0,1,2,3  --data coco.yaml --cfg yolov3-spp.yaml  --weights '' --batch-size 64


## Expected behavior
Training should run correctly on multiple GPUs, as it does on a single GPU.

## Environment
 - OS: Ubuntu 18.04
 - GPU: 4× 2080 Ti
 - Packages: match requirements.txt

lhwcv added the bug label on Jun 3, 2020
lhwcv changed the title from "Muti GPU training error" to "Multi GPU training error" on Jun 3, 2020
github-actions bot (Contributor) commented Jun 3, 2020

Hello @lhwcv, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Google Colab Notebook, Docker Image, and GCP Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI surveillance systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

lhwcv (Author) commented Jun 3, 2020

It may be a pytorch==1.5 version problem; 1.4 works. Closed!

glenn-jocher (Member) commented

@lhwcv I'm not able to reproduce your issue. I tried with our docker container (with pytorch 1.5), and training runs correctly with your command on 4 GPUs:

[Screenshot: training running correctly on 4 GPUs, Jun 3, 2020]

glenn-jocher changed the title from "Multi GPU training error" to "Multi GPU RuntimeError: Model replicas must have an equal number of parameters." on Jun 4, 2020
glenn-jocher (Member) commented

Note: this may have been resolved by the fix applied for #15.

glenn-jocher (Member) commented

> It may be a pytorch==1.5 version problem; 1.4 works. Closed!

Closing as the original issue seems to be resolved.

lucasjinreal commented

Not yet; official pytorch 1.5 still has this issue:

/usr/local/lib/python3.6/dist-packages/torch/serialization.py:657: SourceChangeWarning: source code of class 'models.yolo.Model' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py:303: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. NB: There is a known issue in nn.parallel.replicate that prevents a single DDP instance to operate on multiple model replicas.
  "Single-Process Multi-GPU is not the recommended mode for "
Traceback (most recent call last):
  File "train.py", line 399, in <module>
    train(hyp)
  File "train.py", line 155, in train
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 287, in __init__
    self._ddp_init_helper()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 380, in _ddp_init_helper
    expect_sparse_gradient)
RuntimeError: Model replicas must have an equal number of parameters.
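
The UserWarning in this traceback points at the trigger: constructing `DistributedDataParallel(model)` without `device_ids` puts DDP into single-process multi-GPU mode, which replicates the module via `nn.parallel.replicate`, the code path that raises this check in torch 1.5. Below is a minimal sketch of the one-process-per-GPU mode the warning recommends, not the repo's actual training code; the `Linear` layer is a stand-in for the real model, and the launch command assumes `torch.distributed.launch`:

```python
# Minimal sketch of one-process-per-GPU DDP (the mode the warning above
# recommends), which avoids nn.parallel.replicate entirely.
# Launch (torch 1.5 era):
#   python -m torch.distributed.launch --nproc_per_node=4 train_ddp.py
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 2).cuda(args.local_rank)  # stand-in for the real model
# One replica per process: DDP never calls replicate(), so the
# "equal number of parameters" check is never reached.
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)
```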

mingmmq commented Jun 15, 2020

The same issue with a custom dataset and the pre-trained yolov5x.pt file:

RuntimeError: Model replicas must have an equal number of parameters.

glenn-jocher reopened this on Jun 15, 2020
glenn-jocher (Member) commented

I've reopened this, as the issue appears to still be present.

@mingmmq could you supply code to reproduce your issue? Is it reproducible on the coco128.yaml dataset?

intgogo commented Jun 15, 2020

I have the same problem with my custom dataset (24 classes).

tomjerrygithub commented

I have the same problem with my custom dataset (11 classes).

JierunChen commented

Try downgrading PyTorch from 1.5 to 1.4. It works for me.

Lornatang (Contributor) commented Jun 18, 2020

Run

pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html

to fix "Model replicas must have an equal number of parameters."

Or see https://github.com/pytorch/pytorch/pull/36503. The bug was fixed in that PR, but you must manually build PyTorch==1.5+cu102 from source.
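
After downgrading, a quick sanity check that the intended build is the one Python actually imports (the cu100 suffix above is one example; pick the wheel matching your CUDA toolkit):

```python
# Confirm the downgraded build is active and the GPUs are visible.
import torch

print(torch.__version__)          # expect e.g. '1.4.0+cu100'
print(torch.cuda.is_available())  # True if the CUDA build matches the driver
print(torch.cuda.device_count())  # should report all 4 GPUs
```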

panchengl commented

Downgrading torch 1.5 -> 1.4 works.

glenn-jocher (Member) commented

@panchengl does the recently released 1.5.1 fix this?

github-actions bot (Contributor) commented Aug 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
