Grad strides do not match bucket view strides. #47163
My code is a CNN.
I use DDP.
cc @mcarilli. In this case the warning seems spurious: the strides are nominally different, but the physical layout is the same.
I met the problem in distributed training with batch size > 1. Before:
after: (fixed)
Agree with @starhiking: tensors should be made contiguous once their views have been changed. I also solved my problem in a similar way. Looking through the https://pytorch.org/docs/stable/tensor_view.html doc might be very helpful.
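A minimal sketch of the fix described above (assuming PyTorch is installed): a view op such as a transpose produces non-contiguous strides, and calling `.contiguous()` materializes a standard-layout copy before the tensor feeds into DDP.

```python
import torch

x = torch.randn(8, 16)
y = x.t()                 # transpose returns a view with swapped strides
assert not y.is_contiguous()

y = y.contiguous()        # materializes a fresh copy in standard layout
assert y.is_contiguous()
```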
@starhiking, it also happened when I used a 1×1 convolution kernel. But why?
@MRI000000, I also hit this issue. Have you resolved it?
It may be caused by distributed training.
I get this issue when using channels_last training, and the optimizer was defined before I switched the model over to channels_last. |
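The ordering issue in the comment above can be avoided by converting the model to channels_last before building the optimizer (and before wrapping in DDP). A hedged sketch, using a toy `nn.Conv2d` and SGD purely for illustration:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Convert the parameters to channels_last *before* constructing the
# optimizer and before wrapping in DDP, so the parameter strides DDP
# records at construction time match the grads produced later.
model = model.to(memory_format=torch.channels_last)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```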
I get this issue when using a U-Net and setting BatchNorm2d after a TransposeConv.
Is there an update on this, @rohan-varma (tagged you since you removed the triaged tag)? I am facing the same issue, but only with DDP; otherwise the code runs without any issues.
I'm seeing this warning too, though the model seems to be running/converging okay. |
Actually it does work for me, but what causes this? |
If you are using einsum or einops.rearrange, this can trigger the same warning. Append ".contiguous()" after these ops.
Any news on this? From the warning alone it is hard to locate the source in complex networks.
Thank you, this is helpful |
@albanD We have this spread across several tickets. Can we re-triage/unify this issue?
[W reducer.cpp:313] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1024, 1024]
bucket_view.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1, 1] (function operator())
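The strides in this log can be reproduced (a sketch, assuming PyTorch): a [1024, 1024, 1, 1] tensor in channels_last has exactly the nominal strides reported for grad, while the bucket view expects the standard (1024, 1, 1, 1). Since the trailing dims have size 1, both stride tuples describe the same physical memory order, which is why the warning can be benign here.

```python
import torch

# channels_last strides for (N, C, H, W) = (1024, 1024, 1, 1) are
# (H*W*C, 1, W*C, C) = (1024, 1, 1024, 1024) -- matching grad.strides()
# in the warning above.
g = torch.empty(1024, 1024, 1, 1, memory_format=torch.channels_last)
print(g.stride())
```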
This problem impairs performance. What can I do?
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @jamesr66a @ppwwyyxx