fix: Batchnorm does not work properly when training on multiple devices #1167

wcshds · 2024-01-23T14:40:28Z

It seems that running_var and running_mean are shared across devices when training on multiple devices. Ensuring that running_var/var and running_mean/mean are on the same device therefore allows Batchnorm to work properly during multi-device training.

nathanielsimard · 2024-01-23T15:03:54Z

burn-core/src/nn/norm/batch.rs

+        let running_mean = running_mean.clone().mul_scalar(1.0 - self.momentum).add(
            mean.clone()
+                .to_device(&running_mean.device())
                .detach()
                .mul_scalar(self.momentum)
                .reshape([channels]),
        );


I think we should do the opposite: change the device of the running_mean after detaching it so that the gradients are calculated on each device. When we store the newly updated running_mean, we should also check the devices. Do you think it would work?
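
To make the suggestion concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not Burn code: `Device`, `Tensor`, and `update_running` are hypothetical stand-ins invented for illustration, and autodiff (`detach`) is not modelled at all. It only shows the device bookkeeping: compute the update on the device where the batch statistic lives, then move the result back to the device that holds the running statistic before storing it.

```rust
// Toy model only: `Device` and `Tensor` are hypothetical stand-ins, not Burn types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Device {
    Gpu(usize),
}

#[derive(Clone, Debug, PartialEq)]
struct Tensor {
    data: Vec<f32>,
    device: Device,
}

impl Tensor {
    fn new(data: Vec<f32>, device: Device) -> Self {
        Self { data, device }
    }

    /// Returns a copy tagged with `device` (a real backend would transfer memory).
    fn to_device(&self, device: Device) -> Self {
        Self { data: self.data.clone(), device }
    }

    /// Multiplies every element by `s`, staying on the same device.
    fn mul_scalar(&self, s: f32) -> Self {
        Self {
            data: self.data.iter().map(|x| x * s).collect(),
            device: self.device,
        }
    }

    /// Element-wise add; fails when the operands live on different devices.
    fn add(&self, other: &Tensor) -> Result<Tensor, String> {
        if self.device != other.device {
            return Err(format!(
                "device mismatch: {:?} vs {:?}",
                self.device, other.device
            ));
        }
        Ok(Tensor {
            data: self.data.iter().zip(&other.data).map(|(a, b)| a + b).collect(),
            device: self.device,
        })
    }
}

/// Hypothetical helper: the update runs on the batch statistic's device,
/// and the result is moved back to the running statistic's device for storage.
fn update_running(running: &Tensor, batch_stat: &Tensor, momentum: f32) -> Result<Tensor, String> {
    let home = running.device;
    let local = running.to_device(batch_stat.device);
    let updated = local
        .mul_scalar(1.0 - momentum)
        .add(&batch_stat.mul_scalar(momentum))?;
    Ok(updated.to_device(home))
}

fn main() {
    let running = Tensor::new(vec![2.0, 4.0], Device::Gpu(0));
    let batch_mean = Tensor::new(vec![4.0, 8.0], Device::Gpu(1));

    // Mixing devices directly is rejected in this toy model.
    assert!(running.add(&batch_mean).is_err());

    // Aligning devices first makes the update succeed.
    let updated = update_running(&running, &batch_mean, 0.5).unwrap();
    println!("{:?}", updated);
}
```

In the actual module the same two steps would presumably be expressed with the tensor's own `to_device` and `device` calls, as in the diff above; the sketch only illustrates the order of operations.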

nathanielsimard

LGTM

make sure running_mean/mean;running_var/var on the same device

cb9545f

nathanielsimard reviewed Jan 23, 2024

wcshds and others added 2 commits January 26, 2024 04:12

Merge branch 'tracel-ai:main' into batchnorm-multi-device

08eaf8a

make sure tensors are on the same device when updating values

df71fb4

wcshds requested a review from nathanielsimard January 25, 2024 20:54

nathanielsimard approved these changes Jan 25, 2024

nathanielsimard merged commit 8686082 into tracel-ai:main Jan 27, 2024
12 checks passed

wcshds deleted the batchnorm-multi-device branch January 27, 2024 16:00
