
Fails to train with multiple GPUs in DP mode #35

Closed
Alec-Lin opened this issue Nov 22, 2020 · 4 comments

Comments

@Alec-Lin

Here is the error in detail:
Traceback (most recent call last):
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/train.py", line 438, in
train(hyp, opt, device, tb_writer)
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/train.py", line 255, in train
loss, loss_items = compute_loss(pred, targets.to(device), model) # scaled by batch_size
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/utils/general.py", line 446, in compute_loss
tcls, tbox, indices, anchors = build_targets(p, targets, model) # targets
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/utils/general.py", line 526, in build_targets
r = t[None, :, 4:6] / anchors[:, None] # wh ratio
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Process finished with exit code 1
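For reference, a minimal sketch (assuming a CUDA device is available) that reproduces the same class of error outside the repo:

```python
import torch

t = torch.randn(5, 6, device='cuda')    # targets already moved to the GPU
anchors = torch.randn(3, 2)             # anchors left on the CPU
r = t[None, :, 4:6] / anchors[:, None]  # RuntimeError: ... cuda:0 and cpu!
```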

@yuyijie1995

Have you solved it?

@Alec-Lin
Author

> Have you solved it?

Sorry, I haven't. So I ran it in DDP mode instead [laugh and cry], and it runs well.
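For completeness, the DDP launch for YOLOv5-style repos usually looks like the command below (an assumption, not verified against this repo; check the README for the exact flags):

```
python -m torch.distributed.launch --nproc_per_node 2 train.py --device 0,1
```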

@JoonHoonKim

> Have you solved it?

This is a problem that arises because the tensors involved in the computation live on the CPU and the GPU respectively.
At line 531 of the 'general.py' file, t (the targets) is on the GPU while anchors is on the CPU, so the division t / anchors raises an error.
This is solved by sending the anchors to the GPU before the calculation takes place.
The code is anchors = anchors.to(device='cuda').
Please understand that I am unfamiliar with using GitHub.
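A device-agnostic variant of the same fix, as a sketch (t is the scaled-targets tensor already built in build_targets), avoids hard-coding 'cuda' and keeps CPU-only runs working:

```python
# Move the anchors to whatever device the targets are on,
# instead of hard-coding 'cuda'.
anchors = anchors.to(t.device)
r = t[None, :, 4:6] / anchors[:, None]  # wh ratio, both operands on one device
```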

@ShADAMoV commented Sep 6, 2021

> Have you solved it?
>
> This is a problem that arises because the tensors involved in the computation live on the CPU and the GPU respectively.
> At line 531 of the 'general.py' file, t (the targets) is on the GPU while anchors is on the CPU, so the division t / anchors raises an error.
> This is solved by sending the anchors to the GPU before the calculation takes place.
> The code is anchors = anchors.to(device='cuda').
> Please understand that I am unfamiliar with using GitHub.

I added "anchors = anchors.to(device='cuda')" in 141 line in loss.py file and that been work! (06.09.2021)
Now, my code in loss.py (135-149 line) look like
```python
for i, jj in enumerate(model.module.yolo_layers if multi_gpu else model.yolo_layers):
    # get number of grid points and anchor vec for this yolo layer
    anchors = model.module.module_list[jj].anchor_vec if multi_gpu else model.module_list[jj].anchor_vec
    gain[2:] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

    # Match targets to anchors
    anchors = anchors.to(device='cuda')
    a, t, offsets = [], targets * gain, 0
    if nt:
        na = anchors.shape[0]  # number of anchors
        at = torch.arange(na).view(na, 1).repeat(1, nt)  # anchor tensor, same as .repeat_interleave(nt)
        r = t[None, :, 4:6] / anchors[:, None]  # wh ratio
        j = torch.max(r, 1. / r).max(2)[0] < model.hyp['anchor_t']  # compare
        # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n) = wh_iou(anchors(3,2), gwh(n,2))
        a, t = at[j], t.repeat(na, 1, 1)[j]  # filter
```
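If one is patching the model code instead, an alternative sketch (under the assumption that the anchor grid is stored as a plain tensor attribute on the layer, which is why model.to(device) does not move it): register it as a buffer so it follows the module across devices automatically.

```python
import torch
import torch.nn as nn

class YOLOLayer(nn.Module):
    """Hypothetical minimal layer; names are illustrative, not this repo's exact code."""
    def __init__(self, anchors, stride):
        super().__init__()
        # Buffers are moved by model.to(device) / DataParallel together with the
        # parameters, so the loss code needs no per-call .to('cuda').
        self.register_buffer('anchor_vec', torch.tensor(anchors, dtype=torch.float32) / stride)
```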
