
Multi-fc loss calculation in DistributedDataParallel #56772

Open

Bonsen opened this issue Apr 23, 2021 · 1 comment
Labels

- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


Bonsen commented Apr 23, 2021

The class labels range from 0 to 9. In a batch, some labels are 255 (the ignore index), so we need to exclude those positions from the loss.

I calculate the loss in two ways:

Way 1:

    # Per-fc loss: mask out ignored labels; return the summed loss and the count of valid labels.
    y_pred = y_pred[y_true != ignore_index].view(-1, si)
    y_true = y_true[y_true != ignore_index].view(-1, si)
    loss_sum = ........  # per-element losses (elided in the original)
    if loss_sum.size(0) == 0:
        # No valid labels on this GPU: loss_sum stays an empty tensor, count is zero.
        loss_sum = loss_sum + 0
        bs = torch.Tensor([0])
    else:
        bs = torch.Tensor([y_true.size(0)])
        loss_sum = torch.sum(loss_sum)
    return loss_sum, bs

    loss = None
    ....
    # In each fc:
    bs = reduce_tensor_sum(bs)              # sum the valid-label counts over all GPUs
    loss_sum = reduce_tensor_sum(loss_sum)  # sum the losses over all GPUs
    if len(loss_sum.shape) != 0:
        # Non-scalar result means this was the empty-batch case: reset to zero.
        loss_sum = torch.Tensor([0]).cuda(gpu, non_blocking=True)
    if bs != 0:
        loss_sum = loss_sum / bs  # global mean over all valid labels
        loss = loss + loss_sum
    return loss
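
The snippet above relies on a `reduce_tensor_sum` helper that is not shown in the issue. A minimal sketch of what it presumably does, assuming it wraps `torch.distributed.all_reduce` with a SUM op (the body below is an assumption; only the name comes from the snippet):

    import torch
    import torch.distributed as dist

    def reduce_tensor_sum(t):
        # Hypothetical helper, not from the issue: sum t across all ranks.
        t = t.clone()  # all_reduce modifies in place; keep the caller's tensor intact
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        return t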

Way 2:

    # Per-fc loss: mask out ignored labels; return the per-GPU mean loss.
    y_pred = y_pred[y_true != ignore_index].view(-1, si)
    y_true = y_true[y_true != ignore_index].view(-1, si)
    loss_sum = ........  # per-element losses (elided in the original)
    if loss_sum.size(0) == 0:
        loss_avg = loss_sum + 0
    else:
        loss_avg = torch.mean(loss_sum)
    return loss_avg

    loss = None
    ....
    # In each fc:
    #     ............
    if loss is None:
        loss = loss_m
    else:
        loss = loss + loss_m

    loss = reduce_tensor_mean(loss)  # average the accumulated loss over all GPUs
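
`reduce_tensor_mean` is likewise not defined in the issue; presumably it all-reduces the tensor and divides by the world size, along these lines (hypothetical sketch):

    def reduce_tensor_mean(t):
        # Hypothetical helper, not from the issue: average t across all ranks.
        t = t.clone()
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        return t / dist.get_world_size()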

Way 1 should be the correct one, but its accuracy was consistently lower than Way 2's across four experiments.
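
Note that the two reductions only agree when every GPU ends up with the same number of valid (non-ignored) labels. With made-up numbers:

    # Toy example (hypothetical numbers, not from the issue):
    # GPU 0: 2 valid labels, losses [1.0, 1.0] -> sum  2.0, count 2
    # GPU 1: 8 valid labels, losses [3.0] * 8  -> sum 24.0, count 8
    way1 = (2.0 + 24.0) / (2 + 8)      # global mean over valid labels: 2.6
    way2 = (2.0 / 2 + 24.0 / 8) / 2.0  # mean of the per-GPU means:     2.0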

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

facebook-github-bot added the oncall: distributed label on Apr 23, 2021
pritamdamania87 (Contributor) commented

@Bonsen Can you provide a complete, self-contained repro that we can run on our end?

rohan-varma added the triaged label on May 7, 2021