Correctness of "counter average weight reduction" #143

@youyuandx

Hi devs, first, thanks for this clean & useful code base!

I'm not following this line:

loss = loss * self.world_size # counter average weight reduction

Since in an earlier line

self.cross_entropy = nn.CrossEntropyLoss(reduction='mean')

reduction is "mean", I believe it is correct to let DDP average the gradients across GPUs, so there should be no need to counteract that averaging. Only when DDP averages the gradients is training on 2 GPUs with 8 examples each equivalent to training on 1 GPU with 16 examples, with both losses reduced by "mean".
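
To make the arithmetic concrete, here is a minimal sketch in plain Python (no PyTorch, no real DDP). It assumes that DDP averages gradients across ranks, that every rank holds the same number of examples, and that the gradient of a "mean"-reduced loss is the mean of the per-example gradients. The gradient values and the helper names `mean` and `ddp_average` are made up for illustration and are not taken from the code base.

```python
# Hypothetical per-example gradients of the loss w.r.t. a single weight.
per_example_grads = [float(i) for i in range(1, 17)]  # 16 examples


def mean(xs):
    """Gradient of a reduction='mean' loss: average of per-example gradients."""
    return sum(xs) / len(xs)


def ddp_average(local_grads):
    """Stand-in for DDP's gradient all-reduce: average across ranks."""
    return sum(local_grads) / len(local_grads)


world_size = 2
# Split the 16 examples into 2 equal shards of 8, one per simulated GPU.
shards = [per_example_grads[:8], per_example_grads[8:]]

# 1 GPU, 16 examples, reduction='mean'.
single_gpu = mean(per_example_grads)

# 2 GPUs, 8 examples each, DDP averaging, loss left unscaled.
ddp_plain = ddp_average([mean(s) for s in shards])

# Same setup, but each local loss (and so each local gradient)
# is multiplied by world_size before the averaging step.
ddp_scaled = ddp_average([world_size * mean(s) for s in shards])

print("single GPU:         ", single_gpu)
print("DDP, unscaled loss: ", ddp_plain)
print("DDP, loss * world:  ", ddp_scaled)
```

Under these assumptions the unscaled DDP gradient matches the single-GPU gradient, while multiplying the loss by `world_size` gives a gradient that is `world_size` times larger. If the shards had different sizes, or if the loss were normalized by something other than the local batch size, the comparison would change.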

Thanks in advance.
