
Grad strides do not match bucket view strides. #47163

Open
xingxinggui opened this issue Nov 1, 2020 · 17 comments
Labels
module: ddp (Issues/PRs related to distributed data parallel training)
module: memory format (Memory format/layout related issues/changes: channels_last, nhwc)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@xingxinggui

xingxinggui commented Nov 1, 2020

[W reducer.cpp:313] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1024, 1024]
bucket_view.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1, 1] (function operator())

This problem impairs performance. What can I do?

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @jamesr66a @ppwwyyxx

@xingxinggui
Author

(Same warning as in the issue description.)

My model is a CNN; the layer in question is:

nn.Conv2d(1024, 1024, 1)

@xingxinggui
Author

(Same warning and model as above.)

I am using DDP.
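
A minimal, hedged sketch of the setup described above (assumptions: a single-process gloo group purely so DDP can be constructed, and an illustrative channels_last input; this is not the reporter's actual code). Whether the warning fires depends on the backend and memory format, but this is the kind of configuration where a 1x1 conv's weight grad can end up with strides like [1024, 1, 1024, 1024] while the bucket view is laid out as [1024, 1, 1, 1].

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Single-process group over gloo, only so DDP can be constructed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Conv2d(1024, 1024, 1)
    ddp_model = DDP(model)

    # A channels_last input can yield a weight grad whose strides differ from the
    # contiguous layout DDP's bucket views were built with at construction time.
    x = torch.randn(2, 1024, 8, 8).to(memory_format=torch.channels_last)
    ddp_model(x).sum().backward()

    print(model.weight.grad.stride())  # compare against the contiguous (1024, 1, 1, 1)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()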

@ngimel
Collaborator

ngimel commented Nov 1, 2020

cc @mcarilli. In this case the warning seems to be spurious: the strides are nominally different, but the physical layout is the same.
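
A small illustration of that point (the shapes are taken from the warning above; as_strided is used only to fabricate the two stride patterns): for a [1024, 1024, 1, 1] tensor, the strides of the size-1 trailing dimensions never affect which element is addressed, so [1024, 1, 1024, 1024] and [1024, 1, 1, 1] describe the same physical layout.

import torch

base = torch.arange(1024 * 1024, dtype=torch.float32)
a = base.as_strided((1024, 1024, 1, 1), (1024, 1, 1024, 1024))  # strides reported for grad
b = base.as_strided((1024, 1024, 1, 1), (1024, 1, 1, 1))        # strides reported for bucket_view

print(torch.equal(a, b))                     # True: every index maps to the same element
print(a.is_contiguous(), b.is_contiguous())  # True True: size-1 dims are ignored by the check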

@ngimel added the module: ddp, module: memory format, and triaged labels on Nov 1, 2020
@starhiking

I ran into this problem in distributed training with batch size > 1. With batch size = 1, or on a single GPU, it does not occur. I suspected it was caused by a transpose or permute: when I removed the transpose/permute, or added .contiguous() after it, the warning went away. So I suspect the transpose leaves the gradients with the wrong strides; the tensor needs to be contiguous.

Before:

vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).view([x.shape[0],-1,*x.shape[2:]])

After (fixed):

vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).contiguous().view([x.shape[0],-1,*x.shape[2:]])

@tojimahammatov

Agree with @starhiking: tensors should be made contiguous once their views have been changed. I solved my problem in a similar way. Looking through the https://pytorch.org/docs/stable/tensor_view.html doc may be very helpful.
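
A short illustration of why .contiguous() helps (the tensor shape here is hypothetical): transpose returns a view with permuted strides, so the result is no longer contiguous, while .contiguous() copies it into a dense row-major buffer.

import torch

x = torch.randn(4, 197, 768)           # hypothetical ViT-style token tensor
y = x.transpose(1, 2)                  # shape (4, 768, 197): a view with permuted strides
print(y.is_contiguous())               # False
print(y.contiguous().is_contiguous())  # True: data copied into a new dense buffer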

@MRI000000

@starhiking, it also happens when I use a 1x1 convolution kernel. But why?

@lhyciomp

@MRI000000, I also ran into this issue. Have you resolved it?

@KMY-SEU

KMY-SEU commented Jan 5, 2022

It may be caused by distributed training.

@rohan-varma added the oncall: distributed label and removed the triaged label on Feb 4, 2022
@jxtps

jxtps commented Feb 10, 2023

I get this issue when using channels_last training, and the optimizer was defined before I switched the model over to channels_last.
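
A hedged sketch of an ordering that avoids that situation (assumptions: a CUDA device, an already initialized process group, and a placeholder convolutional model; this is not the commenter's actual code). The idea is to convert the parameters to channels_last before DDP builds its bucket views and before the optimizer is created, so the param strides do not change after DDP construction.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).cuda()

# 1. Switch the memory format first.
model = model.to(memory_format=torch.channels_last)

# 2. Then wrap in DDP: bucket views are laid out from the current param strides.
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

# 3. Only then build the optimizer from the already converted parameters.
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)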

@NayeeC

NayeeC commented Mar 14, 2023

I get this issue when using a U-Net and adding BatchNorm2d in the TransposeConv blocks.

@nicoloesch

Is there an update on this, @rohan-varma (tagging you since you removed the triaged tag)? I am facing the same issue, but only with DDP; otherwise the code runs through without any issues.

@jbmaxwell

I'm seeing this warning too, though the model seems to be running/converging okay.

@dongdongtong

(Quoting @starhiking's fix above: add .contiguous() after the transpose, before the .view().)

Actually, it does work for me, but what causes this?

@ShengYun-Peng

(Quoting @starhiking's fix above.)

If you are using einsum or einops.rearrange, you get the same warning. Append .contiguous() after these ops.
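
A hedged sketch of that advice (assumptions: einops is installed and the tensor shape is illustrative). Like transpose/permute, a pure-permutation rearrange typically returns a non-contiguous view; .contiguous() materializes a dense copy so downstream gradients follow the expected layout.

import torch
from einops import rearrange

x = torch.randn(2, 64, 16, 16)

y = rearrange(x, 'b c h w -> b h w c')               # pure permutation: usually a non-contiguous view
print(y.is_contiguous())                             # typically False

y = rearrange(x, 'b c h w -> b h w c').contiguous()  # force a dense copy
print(y.is_contiguous())                             # True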

@bhack
Contributor

bhack commented Feb 6, 2024

Any news on this? From the warning alone it is hard to locate the source in complex networks.

@OvO1111

OvO1111 commented Apr 25, 2024

(Quoting @starhiking's and @ShengYun-Peng's comments above.)

Thank you, this is helpful.

@bhack
Contributor

bhack commented Apr 25, 2024

@albanD We have this scattered across different tickets. Can we re-triage/unify this issue?
