
Multi-fc loss calculation in DistributedDataParallel #56772

Open

Bonsen opened this issue Apr 23, 2021 · 1 comment
Labels

- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


Bonsen commented Apr 23, 2021

The class labels range from 0 to 9. In a batch, some labels are 255 (the ignore index), so we need to exclude those positions from the loss.

I calculate the loss in two ways:

Way 1:

    # Per-fc loss: mask out ignored labels; return the summed loss and the count of valid labels.
    y_pred = y_pred[y_true != ignore_index].view(-1, si)
    y_true = y_true[y_true != ignore_index].view(-1, si)
    loss_sum = ........  # per-element losses (elided in the original)
    if loss_sum.size(0) == 0:
        # No valid labels on this GPU: loss_sum stays an empty tensor, count is zero.
        loss_sum = loss_sum + 0
        bs = torch.Tensor([0])
    else:
        bs = torch.Tensor([y_true.size(0)])
        loss_sum = torch.sum(loss_sum)
    return loss_sum, bs

    loss = None
    ....
    # In each fc:
    bs = reduce_tensor_sum(bs)              # sum the valid-label counts over all GPUs
    loss_sum = reduce_tensor_sum(loss_sum)  # sum the losses over all GPUs
    if len(loss_sum.shape) != 0:
        # Non-scalar result means this was the empty-batch case: reset to zero.
        loss_sum = torch.Tensor([0]).cuda(gpu, non_blocking=True)
    if bs != 0:
        loss_sum = loss_sum / bs  # global mean over all valid labels
        loss = loss + loss_sum
    return loss
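
The snippet above relies on a `reduce_tensor_sum` helper that is not shown in the issue. A minimal sketch of what it presumably does, assuming it wraps `torch.distributed.all_reduce` with a SUM op (the body below is an assumption; only the name comes from the snippet):

    import torch
    import torch.distributed as dist

    def reduce_tensor_sum(t):
        # Hypothetical helper, not from the issue: sum t across all ranks.
        t = t.clone()  # all_reduce modifies in place; keep the caller's tensor intact
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        return t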

Way 2:

    # Per-fc loss: mask out ignored labels; return the per-GPU mean loss.
    y_pred = y_pred[y_true != ignore_index].view(-1, si)
    y_true = y_true[y_true != ignore_index].view(-1, si)
    loss_sum = ........  # per-element losses (elided in the original)
    if loss_sum.size(0) == 0:
        loss_avg = loss_sum + 0
    else:
        loss_avg = torch.mean(loss_sum)
    return loss_avg

    loss = None
    ....
    # In each fc:
    #     ............
    if loss is None:
        loss = loss_m
    else:
        loss = loss + loss_m

    loss = reduce_tensor_mean(loss)  # average the accumulated loss over all GPUs
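
`reduce_tensor_mean` is likewise not defined in the issue; presumably it all-reduces the tensor and divides by the world size, along these lines (hypothetical sketch):

    def reduce_tensor_mean(t):
        # Hypothetical helper, not from the issue: average t across all ranks.
        t = t.clone()
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        return t / dist.get_world_size()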

Way 1 should be the correct one, but its accuracy was consistently lower than Way 2's across four experiments.
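
Note that the two reductions only agree when every GPU ends up with the same number of valid (non-ignored) labels. With made-up numbers:

    # Toy example (hypothetical numbers, not from the issue):
    # GPU 0: 2 valid labels, losses [1.0, 1.0] -> sum  2.0, count 2
    # GPU 1: 8 valid labels, losses [3.0] * 8  -> sum 24.0, count 8
    way1 = (2.0 + 24.0) / (2 + 8)      # global mean over valid labels: 2.6
    way2 = (2.0 / 2 + 24.0 / 8) / 2.0  # mean of the per-GPU means:     2.0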

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

facebook-github-bot added the oncall: distributed label on Apr 23, 2021
pritamdamania87 (Contributor) commented

@Bonsen Can you provide a complete, self-contained repro that we can run on our end?

rohan-varma added the triaged label on May 7, 2021