
Grad strides do not match bucket view strides. #47163

Open
xingxinggui opened this issue Nov 1, 2020 · 17 comments
Labels
module: ddp (Issues/PRs related to distributed data parallel training)
module: memory format (Memory format/layout related issues/changes: channels_last, nhwc)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@xingxinggui

xingxinggui commented Nov 1, 2020

[W reducer.cpp:313] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1024, 1024]
bucket_view.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1, 1] (function operator())

This problem impairs performance. What can I do?

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @jamesr66a @ppwwyyxx

@xingxinggui
Author

(Same warning as in the issue description.)

My model is a CNN; the layer in question is:

nn.Conv2d(1024, 1024, 1)

@xingxinggui
Author

(Same warning and model as above.)

I am using DDP.
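
A minimal, hedged sketch of the setup described above (assumptions: a single-process gloo group purely so DDP can be constructed, and an illustrative channels_last input; this is not the reporter's actual code). Whether the warning fires depends on the backend and memory format, but this is the kind of configuration where a 1x1 conv's weight grad can end up with strides like [1024, 1, 1024, 1024] while the bucket view is laid out as [1024, 1, 1, 1].

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Single-process group over gloo, only so DDP can be constructed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Conv2d(1024, 1024, 1)
    ddp_model = DDP(model)

    # A channels_last input can yield a weight grad whose strides differ from the
    # contiguous layout DDP's bucket views were built with at construction time.
    x = torch.randn(2, 1024, 8, 8).to(memory_format=torch.channels_last)
    ddp_model(x).sum().backward()

    print(model.weight.grad.stride())  # compare against the contiguous (1024, 1, 1, 1)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()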

@ngimel
Collaborator

ngimel commented Nov 1, 2020

cc @mcarilli. In this case the warning seems to be spurious: the strides are nominally different, but the physical layout is the same.
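
A small illustration of that point (the shapes are taken from the warning above; as_strided is used only to fabricate the two stride patterns): for a [1024, 1024, 1, 1] tensor, the strides of the size-1 trailing dimensions never affect which element is addressed, so [1024, 1, 1024, 1024] and [1024, 1, 1, 1] describe the same physical layout.

import torch

base = torch.arange(1024 * 1024, dtype=torch.float32)
a = base.as_strided((1024, 1024, 1, 1), (1024, 1, 1024, 1024))  # strides reported for grad
b = base.as_strided((1024, 1024, 1, 1), (1024, 1, 1, 1))        # strides reported for bucket_view

print(torch.equal(a, b))                     # True: every index maps to the same element
print(a.is_contiguous(), b.is_contiguous())  # True True: size-1 dims are ignored by the check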

@ngimel added the module: ddp, module: memory format, and triaged labels on Nov 1, 2020
@starhiking

I ran into this problem in distributed training with batch size > 1. With batch size = 1, or on a single GPU, it does not occur. I suspected it was caused by a transpose or permute: when I removed the transpose/permute, or added .contiguous() after it, the warning went away. So I suspect the transpose leaves the gradients with the wrong strides; the tensor needs to be contiguous.

Before:

vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).view([x.shape[0],-1,*x.shape[2:]])

After (fixed):

vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).contiguous().view([x.shape[0],-1,*x.shape[2:]])

@tojimahammatov

Agree with @starhiking: tensors should be made contiguous once their views have been changed. I solved my problem in a similar way. Looking through the https://pytorch.org/docs/stable/tensor_view.html doc may be very helpful.
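
A short illustration of why .contiguous() helps (the tensor shape here is hypothetical): transpose returns a view with permuted strides, so the result is no longer contiguous, while .contiguous() copies it into a dense row-major buffer.

import torch

x = torch.randn(4, 197, 768)           # hypothetical ViT-style token tensor
y = x.transpose(1, 2)                  # shape (4, 768, 197): a view with permuted strides
print(y.is_contiguous())               # False
print(y.contiguous().is_contiguous())  # True: data copied into a new dense buffer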

@MRI000000

@starhiking, it also happens when I use a 1x1 convolution kernel. But why?

@lhyciomp

@MRI000000, I also ran into this issue. Have you resolved it?

@KMY-SEU

KMY-SEU commented Jan 5, 2022

It may be caused by distributed training.

@rohan-varma added the oncall: distributed label and removed the triaged label on Feb 4, 2022
@jxtps

jxtps commented Feb 10, 2023

I get this issue when using channels_last training, and the optimizer was defined before I switched the model over to channels_last.
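
A hedged sketch of an ordering that avoids that situation (assumptions: a CUDA device, an already initialized process group, and a placeholder convolutional model; this is not the commenter's actual code). The idea is to convert the parameters to channels_last before DDP builds its bucket views and before the optimizer is created, so the param strides do not change after DDP construction.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda()

# 1. Switch the memory format first.
model = model.to(memory_format=torch.channels_last)

# 2. Then wrap in DDP: bucket views are laid out from the current param strides.
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

# 3. Only then build the optimizer from the already converted parameters.
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)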

@NayeeC

NayeeC commented Mar 14, 2023

I get this issue when using a U-Net and adding BatchNorm2d in the TransposeConv blocks.

@nicoloesch

Is there an update on this, @rohan-varma (tagging you since you removed the triaged tag)? I am facing the same issue, but only with DDP; otherwise the code runs through without any issues.

@jbmaxwell

I'm seeing this warning too, though the model seems to be running/converging okay.

@dongdongtong

(Quoting @starhiking's fix above: add .contiguous() after the transpose, before the .view().)

Actually, it does work for me, but what causes this?

@ShengYun-Peng

(Quoting @starhiking's fix above.)

If you are using einsum or einops.rearrange, you get the same warning. Append .contiguous() after these ops.
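
A hedged sketch of that advice (assumptions: einops is installed and the tensor shape is illustrative). Like transpose/permute, a pure-permutation rearrange typically returns a non-contiguous view; .contiguous() materializes a dense copy so downstream gradients follow the expected layout.

import torch
from einops import rearrange

x = torch.randn(2, 64, 16, 16)

y = rearrange(x, 'b c h w -> b h w c')               # pure permutation: usually a non-contiguous view
print(y.is_contiguous())                             # typically False

y = rearrange(x, 'b c h w -> b h w c').contiguous()  # force a dense copy
print(y.is_contiguous())                             # True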

@bhack
Contributor

bhack commented Feb 6, 2024

Any news on this? From the warning alone it is hard to locate the source in complex networks.

@OvO1111

OvO1111 commented Apr 25, 2024

(Quoting @starhiking's and @ShengYun-Peng's comments above.)

Thank you, this is helpful.

@bhack
Contributor

bhack commented Apr 25, 2024

@albanD We have this scattered across different tickets. Can we re-triage/unify this issue?
