Hi devs, first, thanks for this clean & useful code base!
I'm not following this line:

`loss = loss * self.world_size  # counter average weight reduction`
Since an earlier line sets

`self.cross_entropy = nn.CrossEntropyLoss(reduction='mean')`

the reduction is `'mean'`, so I believe it is correct to simply let DDP average the gradients across GPUs, and there is no need to counteract that averaging. In fact, only when DDP averages the gradients is training with 2 GPUs of 8 examples each equivalent to 1 GPU with 16 examples, since both cases are mean-reduced over the effective batch.
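To illustrate the point, here is a minimal single-process sketch (my own example, not the repo's code) that simulates DDP's all-reduce-with-averaging using plain tensors. It shows that averaging per-rank mean-reduced gradients already matches the single-GPU gradient of the mean over the full batch, with no extra `world_size` factor needed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss(reduction='mean')

x = torch.randn(16, 4)
y = torch.randint(0, 3, (16,))

# Single "GPU": mean loss over all 16 examples.
model.zero_grad()
criterion(model(x), y).backward()
grad_single = model.weight.grad.clone()

# Two simulated "GPUs": mean loss over each shard of 8, then average
# the gradients, mimicking what DDP's gradient all-reduce does.
grads = []
for shard in (slice(0, 8), slice(8, 16)):
    model.zero_grad()
    criterion(model(x[shard]), y[shard]).backward()
    grads.append(model.weight.grad.clone())
grad_ddp = (grads[0] + grads[1]) / 2

# True: the averaged per-rank gradients equal the full-batch gradient.
print(torch.allclose(grad_single, grad_ddp, atol=1e-6))
```

This holds because the shards are equal-sized and gradients are linear, so the mean of the per-shard mean gradients equals the gradient of the global mean.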
Thanks in advance.