UpSample GPU Porting #19630
Conversation
Branch force-pushed from facd876 to f850c96.
Looking at it very briefly, the [...]. So while the issue looks the same, I'd probably not expect it on the CI platforms. FWIW, some recent CI tests in other issues are green.
@pytorchbot retest this please.
same failures.
Is this rebased on the latest master (you can also ask the bot to rebase)?
I rebased it yesterday ... I also had to resolve conflicts because the upsample files were changed 7 days ago.
@skrah do you have any idea what this problem could be, or a way to debug it?
@xmnlab If you can't reproduce it at home it seems hard to debug other than reading the diffs again. I can take a look on Monday, it's getting a bit late here.
Also, has anyone found a way to show the actual hardware used on the CI in detail?
@skrah I am working in parallel on a Paperspace environment ... it takes a lot of time; it is running with 8 CPU cores.
On Fri, May 03, 2019 at 07:12:35PM +0000, Ivan Ogasawara wrote:
> @skrah I am working in parallel on a Paperspace environment ... it takes a lot of time; it is running with 8 CPU cores.
But have you ever been able to reproduce this issue on Paperspace?
I can reproduce it locally: Arch Linux, CUDA 10.0, RTX2070 GPU. Given the CI failures, I think it should be reproducible for multiple CUDA and GPU versions. I just haven't worked on any CUDA code before, and am short on time, so I'd rather not dig too deep.
not yet ... my last build was with my previous commit ... I will work on this task again in a few minutes :)
@xmnlab your previous commit had the same failure though (except with a less clear exception message), and it seems quite reproducible. So I think you'll see it now.
On Fri, May 03, 2019 at 12:19:42PM -0700, Ralf Gommers wrote:
>> But have you ever been able to reproduce this issue on Paperspace?
> I can reproduce it locally: Arch Linux, CUDA 10.0, RTX2070 GPU. Given the CI failures, I think it should be reproducible for multiple CUDA and GPU versions. I just haven't worked on any CUDA code before, and am short on time, so I'd rather not dig too deep.
Ah, thanks. @xmnlab, then I guess just compiling with "DEBUG=1" and stepping through the offending code with gdb may be the fastest way.
@skrah thanks! I will try that!
it seems only UpSampleBicubic2d is using upsample_get_value_bounded (https://github.com/pytorch/pytorch/pull/19630/files#diff-5092da792c30694ee4adf0d0ae2a37c6R171) and upsample_increment_value_bounded (https://github.com/pytorch/pytorch/pull/19630/files#diff-5092da792c30694ee4adf0d0ae2a37c6R191), so maybe the problem is inside one of these functions ... maybe related to the order of the (x, y) indexes ... but the problem seems to be related to CUDA blocks/threads, so I am not sure it is really related to these functions.
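For reference, a minimal sketch of what bounds-clamped read/accumulate helpers of this kind usually look like; the names, signatures, and row-major layout below are illustrative assumptions, not the PR's actual code (see the linked diff for that):

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: clamp the requested (x, y) into range so that
// out-of-bounds bicubic taps near the border touch a valid element.
template <typename scalar_t>
__device__ __forceinline__ scalar_t get_value_bounded(
    const scalar_t* data, int width, int height, int x, int y) {
  int cx = min(max(x, 0), width - 1);
  int cy = min(max(y, 0), height - 1);
  return data[cy * width + cx];
}

// Backward pass: several output gradients may map to the same input element,
// so the accumulation has to be atomic.
template <typename scalar_t>
__device__ __forceinline__ void increment_value_bounded(
    scalar_t* data, int width, int height, int x, int y, scalar_t value) {
  int cx = min(max(x, 0), width - 1);
  int cy = min(max(y, 0), height - 1);
  atomicAdd(data + cy * width + cx, value);
}
```

A swap of x and y (or of width and height) in indexing like this would be the kind of bug the comment speculates about, though it typically only shows up on non-square inputs.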
The new code uses far more registers than the existing one. I verified that both versions actually call the offending test case with [...].
With [...]. If you use [...]. Now why is the regcount higher in the new code? It could be [...]. You could experiment or just use the launch bounds. Other code in [...].
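Since the dropped details above concern register counts and launch bounds, here is a minimal, hypothetical sketch of how CUDA launch bounds cap register usage; the kernel name, body, and thread-count value are placeholders, not this PR's code:

```cuda
#include <cuda_runtime.h>

// __launch_bounds__ tells ptxas the kernel may be launched with up to this
// many threads per block, so it keeps per-thread register usage low enough
// for a full block to actually be schedulable. Without it, a register-hungry
// kernel launched with a large block can fail at run time with
// "too many resources requested for launch".
// (Per-kernel register usage can be inspected with `nvcc -Xptxas -v`.)
constexpr int kMaxThreadsPerBlock = 1024;

template <typename scalar_t>
__global__ void __launch_bounds__(kMaxThreadsPerBlock)
upsample_example_kernel(scalar_t* out, const scalar_t* in, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i];  // placeholder body; the real kernels do the interpolation
  }
}
```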
it seems it worked locally! thank you so much! I really appreciate that!
thanks so much @ezyang!
This will now give more informative errors: "RuntimeError: CUDA error: too many resources requested for launch" instead of "RuntimeError: Failed with error code 0".
See pytorchgh-8103 for other reports of this issue.
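As an aside, the usual way to surface a readable message like this after a kernel launch is to check cudaGetLastError. A stand-alone sketch follows; ATen uses its own error-checking macros, so the macro name here is only illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative macro: turn a failed launch into a readable message such as
// "too many resources requested for launch" instead of an opaque error code.
#define CHECK_CUDA(expr)                                           \
  do {                                                             \
    cudaError_t err__ = (expr);                                    \
    if (err__ != cudaSuccess) {                                    \
      std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",             \
                   cudaGetErrorString(err__), __FILE__, __LINE__); \
    }                                                              \
  } while (0)

// Typical use right after a kernel launch:
//   my_kernel<<<grid, block, 0, stream>>>(...);
//   CHECK_CUDA(cudaGetLastError());
```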
- Fixing launch bounds
- Move back from int64_t to int
- Changed at::zero to at::empty_like
- Use cuda::ATenCeilDiv, removed unnecessary += op
- Decreasing max threads per block
- Removing declaration on THNN
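A small sketch of the launch-configuration pattern behind the "Use cuda::ATenCeilDiv" and "Decreasing max threads per block" items above; the helper name, kernel, and thread count are illustrative, not the PR's exact code:

```cuda
#include <cuda_runtime.h>

// Ceiling division: enough blocks so that blocks * threads covers all n
// elements even when n is not a multiple of the block size.
__host__ __device__ inline int ceil_div(int a, int b) {
  return (a + b - 1) / b;
}

__global__ void fill_kernel(float* out, float value, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = value;
  }
}

void launch_fill(float* out, float value, int n, cudaStream_t stream) {
  const int threads = 512;                  // a reduced threads-per-block value
  const int blocks = ceil_div(n, threads);  // one thread per output element
  fill_kernel<<<blocks, threads, 0, stream>>>(out, value, n);
}
```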
Branch force-pushed from 980a254 to 0b3ad95.
@xmnlab indeed it looks like all jobs were aborted at the same time. No obvious issues in the build log related to your code. Comparing e.g. the [...] I suggest just pushing a new commit to rebuild. Probably a temporary CI hiccup.
I accidentally rebooted Jenkins yesterday, which is the likely cause; my apologies. @pytorchbot retest this please
all tests passed except [...]
not sure if this timeout means that the code is now slower. What do you think? Do you have any suggestions?
Sometimes the Windows build flakes out like that. It didn't time out while running a relevant test, so I judge it to be not your problem.
The launch bounds logic is wrong, but I acknowledge that this is a big patch already; just fix it in a follow-up. I am going to go ahead and land this.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
sounds good @ezyang thanks!
Summary: resolves #16158
Pull Request resolved: pytorch/pytorch#19630
Differential Revision: D15335765
Pulled By: ezyang
fbshipit-source-id: 03dd590c715a65c20ac99674a5d77179cd4a50fc

Summary: this is a follow up for pytorch/pytorch#19630
Pull Request resolved: pytorch/pytorch#20505
Differential Revision: D15392706
Pulled By: ezyang
fbshipit-source-id: 5a8a7aacdbcf740508baf2b6e0c081c4e5a0390f
resolves #16158