
fix asserts in cuda code #39047

Closed · wants to merge 3 commits

Conversation

@ngimel (Collaborator) commented May 27, 2020

- Gets rid of some in-kernel asserts where they can be replaced with `static_assert`s.
- Replaces a bare in-kernel `assert` with `CUDA_KERNEL_ASSERT` in the one case where a runtime check is still necessary.
- Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.

Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (#39044); the problems there are not just the asserts. I've audited the remaining in-kernel asserts: they guard internal invariants, much like `TORCH_INTERNAL_ASSERT`, so they should not be triggered by invalid user data, and I think it's OK to leave them as is. A sketch of the three assert flavors involved follows below.
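For readers unfamiliar with the distinction, here is a minimal sketch of the three assert mechanisms this PR moves between (illustrative only; `check_host` and `check_device` are hypothetical names, not code from this PR):

```cuda
#include <cstdint>
#include <c10/macros/Macros.h>   // CUDA_KERNEL_ASSERT
#include <c10/util/Exception.h>  // TORCH_INTERNAL_ASSERT

// Compile-time: checked during the build at zero runtime cost; only usable
// when the condition is a constant expression.
static_assert(sizeof(int64_t) == 8, "int64_t must be 8 bytes");

// Host code: TORCH_INTERNAL_ASSERT throws a catchable c10::Error carrying
// file/line info, instead of abort()-ing the process like a bare assert().
void check_host(int dim, int dims) {
  TORCH_INTERNAL_ASSERT(dim < dims && dim >= 0);
}

// Device code: exceptions are unavailable inside kernels, so the remaining
// runtime option is CUDA_KERNEL_ASSERT; when it fires, it traps and leaves
// the CUDA context unusable.
__global__ void check_device(const int64_t* sizes, int dims) {
  CUDA_KERNEL_ASSERT(dims > 0 && sizes[0] >= 0);
}
```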

@ngimel requested a review from @ezyang May 27, 2020 03:32
@dr-ci bot commented May 27, 2020

💊 CI failures summary and remediations

As of commit 2692d8f (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



```diff
@@ -73,15 +73,15 @@ TensorInfo<T, IndexType>::TensorInfo(T* p,
 template <typename T, typename IndexType>
 void
 TensorInfo<T, IndexType>::reduceDim(int dim) {
-  assert(dim < dims && dim >= 0);
+  TORCH_INTERNAL_ASSERT(dim < dims && dim >= 0);
```
Contributor

oh, these aren't run in CUDA?! Intriguing.
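The point of surprise: `TensorInfo::reduceDim` executes on the host, where it collapses a dimension before a kernel launch, so a throwing assert is available. A rough stand-in for the THC `TensorInfo` (not the actual definition), enough to show where the check runs:

```cuda
#include <c10/util/Exception.h>

template <typename T, typename IndexType>
struct TensorInfo {
  IndexType sizes[25];  // THC caps dimensions at MAX_TENSORINFO_DIMS (25)
  int dims;

  // Runs on the CPU while a kernel launch is being set up, so throwing a
  // c10::Error via TORCH_INTERNAL_ASSERT is fine here.
  void reduceDim(int dim) {
    TORCH_INTERNAL_ASSERT(dim < dims && dim >= 0);
    sizes[dim] = 1;  // collapse the reduced dimension
  }
};

// Inside __global__/__device__ code the same check would have to be
// CUDA_KERNEL_ASSERT(...); exceptions cannot be thrown from device code.
```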

@ezyang (Contributor) left a comment

I make no claims about the completeness of this PR, but this is certainly an improvement.

@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator, Author) commented May 28, 2020

`static_assert(sizeof(long) == 8)` fails on Windows, so I turned it back into `CUDA_KERNEL_ASSERT`; may it never be triggered (it's in caffe2 code).
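The underlying portability issue: 64-bit Windows uses the LLP64 data model, where `long` is 4 bytes, while 64-bit Linux and macOS use LP64, where it is 8. A minimal illustration (not code from this PR):

```cpp
#include <cstdio>

int main() {
  // LP64  (64-bit Linux/macOS):  sizeof(long) == 8
  // LLP64 (64-bit Windows/MSVC): sizeof(long) == 4
  std::printf("sizeof(long) = %zu\n", sizeof(long));

  // This would be a hard compile error on Windows even if the surrounding
  // code were never executed, which is why the PR reverts it:
  // static_assert(sizeof(long) == 8, "long must be 8 bytes");
  return 0;
}
```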

@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

```diff
@@ -104,6 +104,7 @@ struct TopKTypeConfig<long> {
   typedef unsigned long long int RadixType;

   static inline __device__ RadixType convert(long v) {
+    //static_assert fails on windows, so leave it as CUDA_KERNEL_ASSERT
```
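For context, the full specialization plausibly looks like the sketch below (reconstructed around the quoted diff line, so treat it as illustrative). The radix-select trick is to add 2^63, which flips the sign bit so that unsigned ordering of the keys matches signed ordering of the values:

```cuda
template <>
struct TopKTypeConfig<long> {
  typedef unsigned long long int RadixType;

  static inline __device__ RadixType convert(long v) {
    // Runtime check instead of static_assert, which would break the
    // Windows build outright (sizeof(long) == 4 there under LLP64).
    CUDA_KERNEL_ASSERT(sizeof(long) == 8);
    // Adding 2^63 flips the sign bit, mapping the signed range
    // [-2^63, 2^63) monotonically onto [0, 2^64).
    return 9223372036854775808ull + v;
  }

  static inline __device__ long deconvert(RadixType v) {
    return v - 9223372036854775808ull;
  }
};
```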
Contributor

...how does the CUDA_KERNEL_ASSERT not fail on Windows?

Collaborator (Author)

It will fail if someone runs the caffe2 radix sort on long inputs in a Windows PyTorch build. Hopefully no one will, but I can't make it a static_assert, because with static_assert the Windows build itself fails.

Collaborator

Should we just make this into int64_t?

Collaborator (Author)

I'm not familiar with the caffe2 code and don't know whether that's possible. It is also probably untested?

Collaborator

I don't think it will break anything, because of the `assert(sizeof(long) == 8)` here.
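Sketching out the suggestion: a specialization keyed on `int64_t` would fix the width on every platform, making the compile-time check viable again (a hypothetical rewrite, not what this PR merged; the open question in the thread is whether anything in caffe2 instantiates the `long` version directly):

```cuda
template <>
struct TopKTypeConfig<int64_t> {
  typedef uint64_t RadixType;

  static inline __device__ RadixType convert(int64_t v) {
    // int64_t is exactly 8 bytes on every platform, Windows included, so
    // the check can be compile-time again at zero runtime cost.
    static_assert(sizeof(int64_t) == 8, "int64_t must be 8 bytes");
    return 9223372036854775808ull + v;
  }

  static inline __device__ int64_t deconvert(RadixType v) {
    return static_cast<int64_t>(v - 9223372036854775808ull);
  }
};
```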

@gchanan added this to the 1.5.1 milestone May 28, 2020
@gchanan (Contributor) commented May 28, 2020

Adding milestone 1.5.1 since this seems worth getting into the release; some of these are regressions.

@facebook-github-bot (Contributor)

@ngimel merged this pull request in 9c19a12.

gchanan pushed a commit to gchanan/pytorch that referenced this pull request Jun 2, 2020
Summary:
Gets rid of some in-kernel asserts where they can be replaced with `static_assert`s.
Replaces a bare in-kernel `assert` with `CUDA_KERNEL_ASSERT` in the one case where a runtime check is still necessary.
Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (pytorch#39044); the problems there are not just the asserts.
I've audited the remaining in-kernel asserts: they guard internal invariants, much like `TORCH_INTERNAL_ASSERT`, so they should not be triggered by invalid user data, and I think it's OK to leave them as is.
Pull Request resolved: pytorch#39047

Differential Revision: D21750392

Pulled By: ngimel

fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719
gchanan pushed a commit that referenced this pull request Jun 3, 2020
Summary:
Gets rid of some in-kernel asserts where they can be replaced with `static_assert`s.
Replaces a bare in-kernel `assert` with `CUDA_KERNEL_ASSERT` in the one case where a runtime check is still necessary.
Replaces host-code `assert`s with `TORCH_INTERNAL_ASSERT`.
Another group of asserts is in the fractional max pooling kernels, which should be fixed regardless (#39044); the problems there are not just the asserts.
I've audited the remaining in-kernel asserts: they guard internal invariants, much like `TORCH_INTERNAL_ASSERT`, so they should not be triggered by invalid user data, and I think it's OK to leave them as is.
Pull Request resolved: #39047

Differential Revision: D21750392

Pulled By: ngimel

fbshipit-source-id: e9417523a2c672284de3515933cb7ed166e56719