
Fails to train with multiple GPUs in DP mode #35

Closed
Alec-Lin opened this issue Nov 22, 2020 · 4 comments

Comments

@Alec-Lin

Here is the error in detail:
Traceback (most recent call last):
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/train.py", line 438, in
train(hyp, opt, device, tb_writer)
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/train.py", line 255, in train
loss, loss_items = compute_loss(pred, targets.to(device), model) # scaled by batch_size
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/utils/general.py", line 446, in compute_loss
tcls, tbox, indices, anchors = build_targets(p, targets, model) # targets
File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/utils/general.py", line 526, in build_targets
r = t[None, :, 4:6] / anchors[:, None] # wh ratio
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Process finished with exit code 1
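For reference, a minimal sketch (assuming a CUDA device is available) that reproduces the same class of error outside the repo:

```python
import torch

t = torch.randn(5, 6, device='cuda')    # targets already moved to the GPU
anchors = torch.randn(3, 2)             # anchors left on the CPU
r = t[None, :, 4:6] / anchors[:, None]  # RuntimeError: ... cuda:0 and cpu!
```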

@yuyijie1995

Have you solved it?

@Alec-Lin
Author

> Have you solved it?

Sorry, I haven't. So I ran it in DDP mode instead [laugh and cry], and it runs well.
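For completeness, the DDP launch for YOLOv5-style repos usually looks like the command below (an assumption, not verified against this repo; check the README for the exact flags):

```
python -m torch.distributed.launch --nproc_per_node 2 train.py --device 0,1
```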

@JoonHoonKim

> Have you solved it?

This is a problem that arises because the tensors involved in the computation live on the CPU and the GPU respectively.
At line 531 of the 'general.py' file, t (the targets) is on the GPU while anchors is on the CPU, so the division t / anchors raises an error.
This is solved by sending the anchors to the GPU before the calculation takes place.
The code is anchors = anchors.to(device='cuda').
Please understand that I am unfamiliar with using GitHub.
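A device-agnostic variant of the same fix, as a sketch (t is the scaled-targets tensor already built in build_targets), avoids hard-coding 'cuda' and keeps CPU-only runs working:

```python
# Move the anchors to whatever device the targets are on,
# instead of hard-coding 'cuda'.
anchors = anchors.to(t.device)
r = t[None, :, 4:6] / anchors[:, None]  # wh ratio, both operands on one device
```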

@ShADAMoV commented Sep 6, 2021

> Have you solved it?
>
> This is a problem that arises because the tensors involved in the computation live on the CPU and the GPU respectively.
> At line 531 of the 'general.py' file, t (the targets) is on the GPU while anchors is on the CPU, so the division t / anchors raises an error.
> This is solved by sending the anchors to the GPU before the calculation takes place.
> The code is anchors = anchors.to(device='cuda').
> Please understand that I am unfamiliar with using GitHub.

I added "anchors = anchors.to(device='cuda')" in 141 line in loss.py file and that been work! (06.09.2021)
Now, my code in loss.py (135-149 line) look like
```python
for i, jj in enumerate(model.module.yolo_layers if multi_gpu else model.yolo_layers):
    # get number of grid points and anchor vec for this yolo layer
    anchors = model.module.module_list[jj].anchor_vec if multi_gpu else model.module_list[jj].anchor_vec
    gain[2:] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

    # Match targets to anchors
    anchors = anchors.to(device='cuda')
    a, t, offsets = [], targets * gain, 0
    if nt:
        na = anchors.shape[0]  # number of anchors
        at = torch.arange(na).view(na, 1).repeat(1, nt)  # anchor tensor, same as .repeat_interleave(nt)
        r = t[None, :, 4:6] / anchors[:, None]  # wh ratio
        j = torch.max(r, 1. / r).max(2)[0] < model.hyp['anchor_t']  # compare
        # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n) = wh_iou(anchors(3,2), gwh(n,2))
        a, t = at[j], t.repeat(na, 1, 1)[j]  # filter
```
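If one is patching the model code instead, an alternative sketch (under the assumption that the anchor grid is stored as a plain tensor attribute on the layer, which is why model.to(device) does not move it): register it as a buffer so it follows the module across devices automatically.

```python
import torch
import torch.nn as nn

class YOLOLayer(nn.Module):
    """Hypothetical minimal layer; names are illustrative, not this repo's exact code."""
    def __init__(self, anchors, stride):
        super().__init__()
        # Buffers are moved by model.to(device) / DataParallel together with the
        # parameters, so the loss code needs no per-call .to('cuda').
        self.register_buffer('anchor_vec', torch.tensor(anchors, dtype=torch.float32) / stride)
```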
