
UpSample GPU Porting #19630

Closed

Conversation

xmnlab
Contributor

@xmnlab xmnlab commented Apr 23, 2019

resolves #16158

@pytorchbot pytorchbot added module: build Build system issues module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: operators labels Apr 23, 2019
@xmnlab xmnlab force-pushed the issue16158-upsample-gpu-porting branch 2 times, most recently from facd876 to f850c96 Compare May 3, 2019 13:39
@rgommers
Collaborator

rgommers commented May 3, 2019

Fewer errors than before, 4 instead of 8 with the same message:

RuntimeError: CUDA error: too many resources requested for launch

This seems to be an existing problem: gh-8103.

@skrah could you have a look at this and tell us what you think?

@skrah
Contributor

skrah commented May 3, 2019

Looking at it very briefly, the Jetson TX referenced in the issue only has 256 cores.

So while the issue looks the same, I'd probably not expect it on the CI platforms. FWIW, some recent CI tests in other issues are green.

@skrah
Contributor

skrah commented May 3, 2019

@pytorchbot retest this please.

@rgommers
Collaborator

rgommers commented May 3, 2019

same failures.

@skrah
Contributor

skrah commented May 3, 2019

Is this rebased on the latest master (you can also ask the bot to rebase)?

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

I rebased it yesterday ... I also needed to resolve conflicts, because the upsample files were changed 7 days ago.

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

@skrah do you have any idea what this problem could be, or a way to debug it?
Also, some jobs seem to be taking a very long time to build; for one job the estimated time is 19h .. not sure if that is the regular estimate ..

@skrah
Contributor

skrah commented May 3, 2019

@xmnlab If you can't reproduce it at home it seems hard to debug other than reading the diffs again. I can take a look on Monday, it's getting a bit late here.

@skrah
Contributor

skrah commented May 3, 2019

Also, has anyone found a way to show the actual hardware used on the CI in detail?
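
[Editor's note: not something the CI currently exposes, but a minimal sketch of how the relevant limits could be printed with the standard CUDA runtime API, assuming one can run an arbitrary binary on the worker:]

    // Sketch: print the per-device limits relevant to
    // "CUDA error: too many resources requested for launch".
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int count = 0;
      cudaGetDeviceCount(&count);
      for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s (sm_%d%d)\n", i, prop.name, prop.major, prop.minor);
        std::printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        std::printf("  registers per block:   %d\n", prop.regsPerBlock);
        std::printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
      }
      return 0;
    }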

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

@skrah I am working in parallel on a Paperspace environment .. it takes a lot of time; it is running with 8 CPU cores.

@skrah
Contributor

skrah commented May 3, 2019 via email

@rgommers
Collaborator

rgommers commented May 3, 2019

But have you ever been able to reproduce this issue on paperspace?

I can reproduce it locally: Arch Linux, CUDA 10.0, RTX2070 GPU. Given the CI failures, I think it should be reproducible for multiple CUDA and GPU versions. I just haven't worked on any CUDA code before, and am short on time, so I'd rather not dig too deep.

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

not yet .. my last build was with my previous commit ... I will work on this task again in a few minutes :)

@rgommers
Collaborator

rgommers commented May 3, 2019

@xmnlab your previous commit had the same failure though (except for the less clear exception message), and it seems quite reproducible. So I think you'll see it now.

@skrah
Contributor

skrah commented May 3, 2019 via email

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

@skrah thanks! I will try that!

@xmnlab
Contributor Author

xmnlab commented May 4, 2019

it seems only UpSampleBicubic2d is using upsample_get_value_bounded (https://github.com/pytorch/pytorch/pull/19630/files#diff-5092da792c30694ee4adf0d0ae2a37c6R171) and upsample_increment_value_bounded (https://github.com/pytorch/pytorch/pull/19630/files#diff-5092da792c30694ee4adf0d0ae2a37c6R191)

maybe the problem is inside one of these functions ... maybe related to the order of the indexes (x, y) ...

but the problem seems to be related to CUDA blocks/threads ... so I am not sure it is really related to these functions.
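
[Editor's note: for context, a rough sketch of what these two bounded helpers do; the signatures are simplified and illustrative, not copied from the diff (a flat pointer instead of the accessor types the PR uses, and atomicAdd standing in for whatever atomic helper the real code relies on):]

    // Clamp the (x, y) index into the valid range before reading or accumulating.
    template <typename scalar_t>
    __device__ __forceinline__ scalar_t upsample_get_value_bounded(
        const scalar_t* data, int width, int height, int x, int y) {
      const int cx = max(0, min(x, width - 1));
      const int cy = max(0, min(y, height - 1));
      return data[cy * width + cx];
    }

    template <typename scalar_t>
    __device__ __forceinline__ void upsample_increment_value_bounded(
        scalar_t* data, int width, int height, int x, int y, scalar_t value) {
      const int cx = max(0, min(x, width - 1));
      const int cy = max(0, min(y, height - 1));
      // Clamped indexes from different threads can collide, so accumulate atomically.
      atomicAdd(&data[cy * width + cx], value);
    }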

@skrah
Contributor

skrah commented May 4, 2019

The new code uses far more registers than the existing one. I verified that both versions actually call the offending test case with 1024 in blockDim. So it's very likely a register issue.

The existing kernel uses 64 registers, which seems to be optimal for my card:

ptxas info    : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_Z23bicubic_interp2d_kerneliddb15THCDeviceTensorIdLi4Ei16DefaultPtrTraitsES1_' for 'sm_61'
ptxas info    : Function properties for _Z23bicubic_interp2d_kerneliddb15THCDeviceTensorIdLi4Ei16DefaultPtrTraitsES1_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 432 bytes cmem[0], 24 bytes cmem[2]

The new kernel uses 124 registers, which is too much for my card:

ptxas info    : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_ZN2at6native76_GLOBAL__N__52_tmpxft_0000626d_00000000_6_UpSampleBicubic2d_cpp1_ii_b4c1e1f328upsample_bicubic2d_out_frameElddbNS_20PackedTensorAccessorIdLm4ENS_16DefaultPtrTraitsElEES4_' for 'sm_61'
ptxas info    : Function properties for _ZN2at6native76_GLOBAL__N__52_tmpxft_0000626d_00000000_6_UpSampleBicubic2d_cpp1_ii_b4c1e1f328upsample_bicubic2d_out_frameElddbNS_20PackedTensorAccessorIdLm4ENS_16DefaultPtrTraitsElEES4_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 124 registers, 496 bytes cmem[0], 24 bytes cmem[2]

With __launch_bounds__(1024), the code again uses 64 registers.

If you use C10_LAUNCH_BOUNDS_1(1024) for both kernels, the tests pass here.

Now why is the regcount higher in the new code? It could be PackedTensorAccessor, it could be the fact that many instances of int have been changed to int64_t. :)

You could experiment or just use the launch bounds. Other code in native/cuda seems to use lower bounds for blockDim, too. 1024 seems to be an outlier.
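
[Editor's note: a minimal sketch of the launch-bounds annotation being suggested; the kernel below is a placeholder, not the PR's actual upsample kernel:]

    // C10_LAUNCH_BOUNDS_1(1024) (a wrapper around __launch_bounds__) tells ptxas
    // the maximum block size, so it keeps the register count low enough for a
    // 1024-thread launch.
    #include <c10/macros/Macros.h>
    #include <cstdint>

    template <typename scalar_t>
    C10_LAUNCH_BOUNDS_1(1024)
    __global__ void upsample_sketch_kernel(
        const scalar_t* __restrict__ input, scalar_t* __restrict__ output, int64_t n) {
      const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
      if (i < n) {
        output[i] = input[i];
      }
    }

    // The "Used NN registers" lines quoted above come from building with verbose
    // ptxas output, e.g. nvcc -Xptxas -v.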

@xmnlab
Contributor Author

xmnlab commented May 4, 2019

@skrah

It seems it worked locally! Thank you so much, I really appreciate it!

@xmnlab xmnlab changed the title [WIP] UpSample GPU Porting UpSample GPU Porting May 4, 2019
@xmnlab xmnlab marked this pull request as ready for review May 4, 2019 23:07
@xmnlab
Contributor Author

xmnlab commented May 4, 2019

thanks @rgommers and @skrah for all the help!

@ezyang it is done for review!

@xmnlab xmnlab requested a review from ezyang May 6, 2019 13:44
@xmnlab
Contributor Author

xmnlab commented May 7, 2019

thanks so much @ezyang!
I will let you know when it is ready again for a new review! thanks!

xmnlab and others added 4 commits May 13, 2019 16:22
This will now give more informative errors:

  RuntimeError: CUDA error: too many resources requested for launch

instead of

  RuntimeError: Failed with error code 0
Fixing launch bounds
Move back from int64_t to int
Changed at::zero to at::empty_like
Use cuda::ATenCeilDiv, removed unnecessary += op
Decreasing max threads per block
Removing declaration on THNN
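
[Editor's note: a minimal sketch of the launch-configuration pattern the last few commit messages refer to; the kernel, the 512-thread block size, and the helper names are illustrative, not taken from the diff:]

    // Keep the block size below the 1024 hardware limit ("Decreasing max threads
    // per block") and size the grid with a ceiling division, which is what
    // cuda::ATenCeilDiv computes.
    #include <cstdint>
    #include <cuda_runtime.h>

    __global__ void copy_kernel(const float* in, float* out, int64_t n) {
      const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
      if (i < n) out[i] = in[i];
    }

    inline int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

    void launch_copy(const float* in, float* out, int64_t n, cudaStream_t stream) {
      const int threads = 512;
      const int blocks = static_cast<int>(ceil_div(n, threads));
      copy_kernel<<<blocks, threads, 0, stream>>>(in, out, n);
    }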
@xmnlab xmnlab force-pushed the issue16158-upsample-gpu-porting branch from 980a254 to 0b3ad95 Compare May 13, 2019 20:24
@xmnlab
Contributor Author

xmnlab commented May 14, 2019

@skrah @ezyang
not sure, but the errors on CI seem to be related to Jenkins ... it seems all the jobs that failed (during the build) ran for 1h 01min ... not sure if that is a coincidence ...

@rgommers
Collaborator

@xmnlab indeed, it looks like all jobs were aborted at the same time, and there are no obvious issues in the build log related to your code. Comparing, e.g., the cuda9-cudnn7 build with a successful one from another PR: it takes 1hr 14min there, and your build was aborted at about the point the other one reaches after an hour.

I suggest just pushing a new commit to rebuild. It is probably a temporary CI hiccup.

@ezyang
Contributor

ezyang commented May 14, 2019

I accidentally rebooted Jenkins yesterday which is the likely cause, my apologies.

@pytorchbot retest this please

@xmnlab
Contributor Author

xmnlab commented May 14, 2019

@ezyang @skrah @rgommers

all tests passed except pr/caffe2-py2-cuda9.0-cudnn7-windows-build:

14:43:49 Build timed out (after 180 minutes). Marking the build as failed.
14:43:49 Build was aborted
14:43:49 [BFA] Scanning build for known causes...
14:43:49 [BFA] No failure causes found
14:43:49 [BFA] Done. 0s
14:43:49 Finished: FAILURE

not sure if this timeout means that the code is now slower.

What do you think? Do you have any suggestions?

@ezyang
Contributor

ezyang commented May 14, 2019

Sometimes the Windows build flakes out like that. It didn't time out while running a relevant test, so I judge it to be not your problem.

@ezyang
Contributor

ezyang commented May 14, 2019

The launch bounds logic is wrong, but I acknowledge that this is a big patch already; just fix it in a follow up. I am going to go ahead and land this.

@facebook-github-bot
Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xmnlab
Contributor Author

xmnlab commented May 14, 2019

sounds good @ezyang thanks!

xmnlab added a commit to Quansight/pytorch that referenced this pull request May 14, 2019
zdevito pushed a commit to zdevito/ATen that referenced this pull request May 14, 2019
Summary:
resolves #16158
Pull Request resolved: pytorch/pytorch#19630

Differential Revision: D15335765

Pulled By: ezyang

fbshipit-source-id: 03dd590c715a65c20ac99674a5d77179cd4a50fc
@facebook-github-bot
Contributor

@ezyang merged this pull request in 3479777.

@xmnlab xmnlab deleted the issue16158-upsample-gpu-porting branch May 16, 2019 14:06
facebook-github-bot pushed a commit that referenced this pull request May 17, 2019
Summary:
this is a follow up for #19630
Pull Request resolved: #20505

Differential Revision: D15392706

Pulled By: ezyang

fbshipit-source-id: 5a8a7aacdbcf740508baf2b6e0c081c4e5a0390f
zdevito pushed a commit to zdevito/ATen that referenced this pull request May 17, 2019
Summary:
this is a follow up for pytorch/pytorch#19630
Pull Request resolved: pytorch/pytorch#20505

Differential Revision: D15392706

Pulled By: ezyang

fbshipit-source-id: 5a8a7aacdbcf740508baf2b6e0c081c4e5a0390f
Labels
module: build Build system issues module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen open source
Development

Successfully merging this pull request may close these issues.

Port UpsamplingNearest to ATen
8 participants