
UpSample GPU Porting #19630

Closed

Conversation

xmnlab
Contributor

@xmnlab xmnlab commented Apr 23, 2019

resolves #16158

@pytorchbot pytorchbot added module: build Build system issues module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: operators labels Apr 23, 2019
@xmnlab xmnlab force-pushed the issue16158-upsample-gpu-porting branch 2 times, most recently from facd876 to f850c96 Compare May 3, 2019 13:39
@rgommers
Collaborator

rgommers commented May 3, 2019

Fewer errors than before, 4 instead of 8 with the same message:

RuntimeError: CUDA error: too many resources requested for launch

This seems to be an existing problem: gh-8103.

@skrah could you have a look at this and tell us what you think?

@skrah
Contributor

skrah commented May 3, 2019

Looking at it very briefly, the Jetson TX referenced in the issue only has 256 cores.

So while the issue looks the same, I'd probably not expect it on the CI platforms. FWIW, some recent CI tests in other issues are green.

@skrah
Contributor

skrah commented May 3, 2019

@pytorchbot retest this please.

@rgommers
Collaborator

rgommers commented May 3, 2019

same failures.

@skrah
Contributor

skrah commented May 3, 2019

Is this rebased on the latest master (you can also ask the bot to rebase)?

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

I rebased it yesterday ... I also needed to resolve conflicts, because the upsample files were changed 7 days ago.

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

@skrah do you have any idea what this problem could be, or a way to debug it?
Also, some jobs seem to be taking a very long time to build; for one job the estimated time is 19h .. not sure if that is the regular estimate ..

@skrah
Contributor

skrah commented May 3, 2019

@xmnlab If you can't reproduce it at home it seems hard to debug other than reading the diffs again. I can take a look on Monday, it's getting a bit late here.

@skrah
Contributor

skrah commented May 3, 2019

Also, has anyone found a way to show the actual hardware used on the CI in detail?
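
[Editor's note: not something the CI currently exposes, but a minimal sketch of how the relevant limits could be printed with the standard CUDA runtime API, assuming one can run an arbitrary binary on the worker:]

    // Sketch: print the per-device limits relevant to
    // "CUDA error: too many resources requested for launch".
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int count = 0;
      cudaGetDeviceCount(&count);
      for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s (sm_%d%d)\n", i, prop.name, prop.major, prop.minor);
        std::printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        std::printf("  registers per block:   %d\n", prop.regsPerBlock);
        std::printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
      }
      return 0;
    }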

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

@skrah I am working in parallel on a Paperspace environment .. it takes a lot of time; it is running with 8 CPU cores.

@skrah
Contributor

skrah commented May 3, 2019 via email

@rgommers
Collaborator

rgommers commented May 3, 2019

But have you ever been able to reproduce this issue on paperspace?

I can reproduce it locally: Arch Linux, CUDA 10.0, RTX2070 GPU. Given the CI failures, I think it should be reproducible for multiple CUDA and GPU versions. I just haven't worked on any CUDA code before, and am short on time, so I'd rather not dig too deep.

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

not yet .. my last build was with my previous commit ... I will work on this task again in a few minutes :)

@rgommers
Collaborator

rgommers commented May 3, 2019

@xmnlab your previous commit had the same failure though (except for the less clear exception message), and it seems quite reproducible. So I think you'll see it now.

@skrah
Contributor

skrah commented May 3, 2019 via email

@xmnlab
Contributor Author

xmnlab commented May 3, 2019

@skrah thanks! I will try that!

@xmnlab
Contributor Author

xmnlab commented May 4, 2019

it seems only UpSampleBicubic2d is using upsample_get_value_bounded (https://github.com/pytorch/pytorch/pull/19630/files#diff-5092da792c30694ee4adf0d0ae2a37c6R171) and upsample_increment_value_bounded (https://github.com/pytorch/pytorch/pull/19630/files#diff-5092da792c30694ee4adf0d0ae2a37c6R191)

maybe the problem is inside one of these functions ... maybe related to the order of the indexes (x, y) ...

but the problem seems to be related to CUDA blocks/threads ... so I am not sure it is really related to these functions.
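
[Editor's note: for context, a rough sketch of what these two bounded helpers do; the signatures are simplified and illustrative, not copied from the diff (a flat pointer instead of the accessor types the PR uses, and atomicAdd standing in for whatever atomic helper the real code relies on):]

    // Clamp the (x, y) index into the valid range before reading or accumulating.
    template <typename scalar_t>
    __device__ __forceinline__ scalar_t upsample_get_value_bounded(
        const scalar_t* data, int width, int height, int x, int y) {
      const int cx = max(0, min(x, width - 1));
      const int cy = max(0, min(y, height - 1));
      return data[cy * width + cx];
    }

    template <typename scalar_t>
    __device__ __forceinline__ void upsample_increment_value_bounded(
        scalar_t* data, int width, int height, int x, int y, scalar_t value) {
      const int cx = max(0, min(x, width - 1));
      const int cy = max(0, min(y, height - 1));
      // Clamped indexes from different threads can collide, so accumulate atomically.
      atomicAdd(&data[cy * width + cx], value);
    }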

@skrah
Contributor

skrah commented May 4, 2019

The new code uses far more registers than the existing one. I verified that both versions actually call the offending test case with 1024 in blockDim. So it's very likely a register issue.

The existing kernel uses 64 registers, which seems to be optimal for my card:

ptxas info    : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_Z23bicubic_interp2d_kerneliddb15THCDeviceTensorIdLi4Ei16DefaultPtrTraitsES1_' for 'sm_61'
ptxas info    : Function properties for _Z23bicubic_interp2d_kerneliddb15THCDeviceTensorIdLi4Ei16DefaultPtrTraitsES1_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 432 bytes cmem[0], 24 bytes cmem[2]

The new kernel uses 124 registers, which is too much for my card:

ptxas info    : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_ZN2at6native76_GLOBAL__N__52_tmpxft_0000626d_00000000_6_UpSampleBicubic2d_cpp1_ii_b4c1e1f328upsample_bicubic2d_out_frameElddbNS_20PackedTensorAccessorIdLm4ENS_16DefaultPtrTraitsElEES4_' for 'sm_61'
ptxas info    : Function properties for _ZN2at6native76_GLOBAL__N__52_tmpxft_0000626d_00000000_6_UpSampleBicubic2d_cpp1_ii_b4c1e1f328upsample_bicubic2d_out_frameElddbNS_20PackedTensorAccessorIdLm4ENS_16DefaultPtrTraitsElEES4_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 124 registers, 496 bytes cmem[0], 24 bytes cmem[2]

With __launch_bounds__(1024), the code again uses 64 registers.

If you use C10_LAUNCH_BOUNDS_1(1024) for both kernels, the tests pass here.

Now why is the regcount higher in the new code? It could be PackedTensorAccessor, it could be the fact that many instances of int have been changed to int64_t. :)

You could experiment or just use the launch bounds. Other code in native/cuda seems to use lower bounds for blockDim, too. 1024 seems to be an outlier.
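
[Editor's note: a minimal sketch of the launch-bounds annotation being suggested; the kernel below is a placeholder, not the PR's actual upsample kernel:]

    // C10_LAUNCH_BOUNDS_1(1024) (a wrapper around __launch_bounds__) tells ptxas
    // the maximum block size, so it keeps the register count low enough for a
    // 1024-thread launch.
    #include <c10/macros/Macros.h>
    #include <cstdint>

    template <typename scalar_t>
    C10_LAUNCH_BOUNDS_1(1024)
    __global__ void upsample_sketch_kernel(
        const scalar_t* __restrict__ input, scalar_t* __restrict__ output, int64_t n) {
      const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
      if (i < n) {
        output[i] = input[i];
      }
    }

    // The "Used NN registers" lines quoted above come from building with verbose
    // ptxas output, e.g. nvcc -Xptxas -v.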

@xmnlab
Contributor Author

xmnlab commented May 4, 2019

@skrah

It seems it worked locally! Thank you so much, I really appreciate it!

@xmnlab xmnlab changed the title [WIP] UpSample GPU Porting UpSample GPU Porting May 4, 2019
@xmnlab xmnlab marked this pull request as ready for review May 4, 2019 23:07
@xmnlab
Contributor Author

xmnlab commented May 4, 2019

thanks @rgommers and @skrah for all the help!

@ezyang it is done for review!

@xmnlab xmnlab requested a review from ezyang May 6, 2019 13:44
@xmnlab
Contributor Author

xmnlab commented May 7, 2019

thanks so much @ezyang!
I will let you know when it is ready again for a new review! thanks!

xmnlab and others added 4 commits May 13, 2019 16:22
This will now give more informative errors:

  RuntimeError: CUDA error: too many resources requested for launch

instead of

  RuntimeError: Failed with error code 0
Fixing launch bounds
Move back from int64_t to int
Changed at::zero to at::empty_like
Use cuda::ATenCeilDiv, removed unnecessary += op
Decreasing max threads per block
Removing declaration on THNN
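
[Editor's note: a minimal sketch of the launch-configuration pattern the last few commit messages refer to; the kernel, the 512-thread block size, and the helper names are illustrative, not taken from the diff:]

    // Keep the block size below the 1024 hardware limit ("Decreasing max threads
    // per block") and size the grid with a ceiling division, which is what
    // cuda::ATenCeilDiv computes.
    #include <cstdint>
    #include <cuda_runtime.h>

    __global__ void copy_kernel(const float* in, float* out, int64_t n) {
      const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
      if (i < n) out[i] = in[i];
    }

    inline int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

    void launch_copy(const float* in, float* out, int64_t n, cudaStream_t stream) {
      const int threads = 512;
      const int blocks = static_cast<int>(ceil_div(n, threads));
      copy_kernel<<<blocks, threads, 0, stream>>>(in, out, n);
    }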
@xmnlab xmnlab force-pushed the issue16158-upsample-gpu-porting branch from 980a254 to 0b3ad95 Compare May 13, 2019 20:24
@xmnlab
Contributor Author

xmnlab commented May 14, 2019

@skrah @ezyang
not sure, but the errors on CI seem to be related to Jenkins ... it seems all the jobs that failed (during the build) ran for 1h 01min ... not sure if that is a coincidence ...

@rgommers
Collaborator

@xmnlab indeed, it looks like all jobs were aborted at the same time, and there are no obvious issues in the build log related to your code. Comparing, e.g., the cuda9-cudnn7 build with a successful one from another PR: it takes 1hr 14min there, and your build was aborted at about the point the other one reaches after an hour.

I suggest just pushing a new commit to rebuild. It is probably a temporary CI hiccup.

@ezyang
Contributor

ezyang commented May 14, 2019

I accidentally rebooted Jenkins yesterday which is the likely cause, my apologies.

@pytorchbot retest this please

@xmnlab
Contributor Author

xmnlab commented May 14, 2019

@ezyang @skrah @rgommers

all tests passed except pr/caffe2-py2-cuda9.0-cudnn7-windows-build:

14:43:49 Build timed out (after 180 minutes). Marking the build as failed.
14:43:49 Build was aborted
14:43:49 [BFA] Scanning build for known causes...
14:43:49 [BFA] No failure causes found
14:43:49 [BFA] Done. 0s
14:43:49 Finished: FAILURE

not sure if this timeout means that the code is now slower.

What do you think? Do you have any suggestions?

@ezyang
Contributor

ezyang commented May 14, 2019

Sometimes the Windows build flakes out like that. It didn't time out while running a relevant test, so I judge it to be not your problem.

@ezyang
Contributor

ezyang commented May 14, 2019

The launch bounds logic is wrong, but I acknowledge that this is a big patch already; just fix it in a follow up. I am going to go ahead and land this.

@facebook-github-bot
Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xmnlab
Contributor Author

xmnlab commented May 14, 2019

sounds good @ezyang thanks!

xmnlab added a commit to Quansight/pytorch that referenced this pull request May 14, 2019
zdevito pushed a commit to zdevito/ATen that referenced this pull request May 14, 2019
Summary:
resolves #16158
Pull Request resolved: pytorch/pytorch#19630

Differential Revision: D15335765

Pulled By: ezyang

fbshipit-source-id: 03dd590c715a65c20ac99674a5d77179cd4a50fc
@facebook-github-bot
Contributor

@ezyang merged this pull request in 3479777.

@xmnlab xmnlab deleted the issue16158-upsample-gpu-porting branch May 16, 2019 14:06
facebook-github-bot pushed a commit that referenced this pull request May 17, 2019
Summary:
this is a follow up for #19630
Pull Request resolved: #20505

Differential Revision: D15392706

Pulled By: ezyang

fbshipit-source-id: 5a8a7aacdbcf740508baf2b6e0c081c4e5a0390f
zdevito pushed a commit to zdevito/ATen that referenced this pull request May 17, 2019
Summary:
this is a follow up for pytorch/pytorch#19630
Pull Request resolved: pytorch/pytorch#20505

Differential Revision: D15392706

Pulled By: ezyang

fbshipit-source-id: 5a8a7aacdbcf740508baf2b6e0c081c4e5a0390f
Labels
module: build Build system issues module: cpu CPU specific problem (e.g., perf, algorithm) module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen open source
Development

Successfully merging this pull request may close these issues.

Port UpsamplingNearest to ATen
8 participants