
Runtime error using nearest neighbour upsampling on tensor with channels-last memory layout #81665

Closed
pcuenca opened this issue Jul 18, 2022 · 13 comments
Labels
high priority · module: cuda · module: interpolation · module: memory format · triaged

Comments

pcuenca commented Jul 18, 2022

🐛 Describe the bug

torch.nn.functional.interpolate fails with a RuntimeError when the following conditions are met:

  • The input tensor uses the channels_last memory format.
  • The input is large enough that the upsampled output reaches a size threshold (INT_MAX elements, per the error message below).

The following code works fine, producing a tensor with the expected shape [31, 64, 1024, 1024]:

x = torch.rand((31, 64, 512, 512)).cuda().to(memory_format=torch.channels_last)
torch.nn.functional.interpolate(x, scale_factor=2, mode='nearest').shape
torch.Size([31, 64, 1024, 1024])

However, when the input batch dimension is 32 or larger, it fails:

x = torch.rand((32, 64, 512, 512)).cuda().to(memory_format=torch.channels_last)
torch.nn.functional.interpolate(x, scale_factor=2, mode='nearest').shape
RuntimeError: upsample_nearest_nhwc only supports output tensors with less than INT_MAX elements

If the memory layout is contiguous rather than channels last, it works fine too:

x = torch.rand((32, 64, 512, 512)).cuda()
torch.nn.functional.interpolate(x, scale_factor=2, mode='nearest').shape
torch.Size([32, 64, 1024, 1024])

The error is raised here. I'm not sure about the details, but I think a potential workaround could be to automatically revert to contiguous format, rather than failing.
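A minimal sketch of what that automatic fallback could look like at the Python level (the wrapper name is hypothetical, and the INT_MAX bound is taken from the error message rather than from PyTorch internals):

import torch
import torch.nn.functional as F

INT_MAX = 2**31 - 1  # bound quoted in the error message

def interpolate_with_fallback(x, scale_factor, mode='nearest'):
    # Hypothetical helper: if the input uses channels_last and the upsampled
    # output would reach INT_MAX elements, revert to the default contiguous
    # (NCHW) layout instead of letting the NHWC kernel raise a RuntimeError.
    n, c, h, w = x.shape
    out_numel = n * c * int(h * scale_factor) * int(w * scale_factor)
    if x.is_contiguous(memory_format=torch.channels_last) and out_numel >= INT_MAX:
        x = x.contiguous()
    return F.interpolate(x, scale_factor=scale_factor, mode=mode)

x = torch.rand((32, 64, 512, 512)).cuda().to(memory_format=torch.channels_last)
interpolate_with_fallback(x, scale_factor=2).shape  # torch.Size([32, 64, 1024, 1024])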

Versions

Collecting environment information...
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.42.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn.so.8.2.2
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.2.2
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.2.2
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.2.2
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.2.2
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.2.2
/usr/local/cuda-11.4/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.10.1+cu113
[pip3] torchaudio==0.10.1+cu113
[pip3] torchvision==0.11.2+cu113
[conda] numpy 1.20.3 pypi_0 pypi
[conda] torch 1.10.1+cu113 pypi_0 pypi
[conda] torchaudio 0.10.1+cu113 pypi_0 pypi
[conda] torchvision 0.11.2+cu113 pypi_0 pypi


I was using the RTX 3090 in this test. I have observed the same behaviour with other cards (an RTX A6000, for instance) on other systems running different Python, PyTorch, and OS versions.

cc @ezyang @gchanan @zou3519 @ngimel @VitalyFedyunin @jamesr66a @csarofeen @ptrblck @xwang233

pcuenca added a commit to pcuenca/SwinIR that referenced this issue Jul 18, 2022
Fall back to contiguous memory layout before upscaling, as a workaround for
pytorch/pytorch#81665.

The condition (batch dimension >= 32) works for my tests, but it might
not be general enough. Further analysis required.
bdhirsh added the triaged and module: memory format labels on Jul 20, 2022
VitalyFedyunin added the module: cudnn and module: cuda labels on Jul 20, 2022
xwang233 added the module: interpolation label and removed the module: cudnn label on Jul 20, 2022
@xwang233 (Collaborator)

Thank you for raising this issue. We can probably add an index_t template parameter with int64_t indexing here for large tensors.

template <typename scalar_t, nn_compute_source_index_fn_t nn_compute_source_index_fn>
C10_LAUNCH_BOUNDS_1(1024)
__global__ void upsample_nearest2d_nhwc_out_frame(
    const scalar_t* idata,
    scalar_t* odata,
    const size_t channels,
    const size_t height1,
    const size_t width1,
    const size_t height2,
    const size_t width2,
    float height_scale,
    float width_scale,
    const size_t out_numel) {


pcuenca commented Jul 24, 2022

Sounds great! Is there anything I can do to help?

@NouamaneTazi

Any updates on this? 👀

rtaori commented Oct 29, 2022

+1 have run into this issue as well!

hadaev8 commented Oct 31, 2022

Same here

malfet (Contributor) commented Oct 31, 2022

Fixed by #87901.
Not a regression; if we tentatively do a 1.13.1 release, we should pick this fix into the branch.

@yyt-2378

same here

@gchhablani

@malfet I am running into the same issue with PyTorch 1.12 but with Bilinear upsampling.

ptrblck (Collaborator) commented Jul 21, 2023

@gchhablani Update to the latest stable or nightly release, as the fix seems to be in 1.13.1+.

@Parskatt

I seem to still have this issue on pytorch 2.0.1, not sure how :D
Also, I seem unable to reproduce it on other machines, so I just added a .contiguous() before the interpolate call as a workaround.

SMSD75 commented Aug 14, 2023

Same here, but how does .contiguous() solve the issue?

@Parskatt

Probably because it converts the format to bchw internally. Also, the related pull request only fixed nearest neighbour; it doesn't fix bilinear etc., though the fix should be the same for them.
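A minimal illustration of that, assuming the bilinear channels_last path hits the same element-count limit described above:

import torch

x = torch.rand((32, 64, 512, 512)).cuda().to(memory_format=torch.channels_last)
print(x.is_contiguous(memory_format=torch.channels_last))  # True: NHWC (channels_last) strides
print(x.contiguous().is_contiguous())                      # True: default NCHW strides, same data and shape
# The contiguous copy no longer dispatches to the channels_last kernel, so the
# interpolation runs without hitting the NHWC element-count check.
y = torch.nn.functional.interpolate(
    x.contiguous(), scale_factor=2, mode='bilinear', align_corners=False
)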

@cjissmart

Still doesn't fix bilinear.
