
Add back "PR #49173: [Crash fix] Fix cudaMallocAsync crashes." #50961

Conversation

nouiz
Contributor

@nouiz nouiz commented Jul 26, 2021

This reverts commit a4553f8 that was reverting #49173.

When I run it now, //tensorflow/core/common_runtime/gpu:gpu_device_test under ASan gives me the same output as before: it was already crashing before this PR with my command line:

bazel test --distinct_host_configuration=false --javabase=@bazel_tools//tools/jdk:remote_jdk11 --config=asan -c opt --config=cuda //tensorflow/core/common_runtime/gpu:gpu_device_test

@penpornk

fixes #50669

@tensorflow-bot tensorflow-bot bot added the size:M CL Change Size: Medium label Jul 26, 2021
@google-cla google-cla bot added the cla: yes label Jul 26, 2021
@gbaned gbaned self-assigned this Jul 27, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 27, 2021
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Jul 27, 2021
@gbaned gbaned requested a review from penpornk July 27, 2021 04:15
@nouiz
Contributor Author

nouiz commented Jul 27, 2021

I just pushed a fix that fixes a test error in NDEBUG mode.

@penpornk penpornk requested review from cheshire and removed request for penpornk July 28, 2021 22:26
@penpornk
Member

@nouiz Thank you so much for taking care of this. And sorry again about the revert. :(
The PR looks good to me. I applied your additional changes and it seems the cuda_asan test is still failing internally and might need further fixing, so I'm reassigning this PR to @cheshire. (I don't usually work on GPU PRs unless it's related to sparse kernels.)

@gbaned gbaned added the awaiting review Pull request awaiting review label Aug 10, 2021
@nouiz
Contributor Author

nouiz commented Aug 16, 2021

Any update on this? It has been 3 months since I first tried to get this fix in.

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 16, 2021
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Aug 16, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 16, 2021
@nouiz
Contributor Author

nouiz commented Aug 16, 2021

@cheshire Note: the problem is that the internal cuda_asan test fails. I do not have any information about it, so I can't fix it. I'm also not able to reproduce it.

@gbaned gbaned removed the awaiting review Pull request awaiting review label Aug 17, 2021
@akuegel
Member

akuegel commented Aug 18, 2021

@cheshire Note: the problem is that the internal cuda_asan test fails. I do not have any information about it, so I can't fix it. I'm also not able to reproduce it.

The error is actually a different one:
F0817 14:52:42.569874 7691 gpu_cudamallocasync_allocator.cc:133] Failed to get device attribute: CUDA error: invalid argument (CUDA_ERROR_INVALID_VALUE)

This happens when running the gpu_device_test without cuda_asan. I guess that in the meantime, the surrounding code has changed, and you will have to adapt your change to that.
I couldn't reproduce the cuda_asan error, so I guess that is fixed now.

@nouiz nouiz force-pushed the upstream-cudaMallocAsync-no_stats_error branch from c7a3b3a to ed6d149 Compare August 18, 2021 19:45
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 18, 2021
@nouiz
Contributor Author

nouiz commented Aug 18, 2021

I tried the commit in this PR and the gpu_device_test passed. I rebased it, pushed the rebased version, and it still passes. The GitHub Linux CI passed too. So I'm not able to reproduce any error here.

Any idea what would be different on your system that would make it fail?

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 19, 2021
@@ -122,13 +125,31 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
}

se::cuda::ScopedActivateExecutorContext scoped_activation{stream_exec_};

// Check that the CUDA runtime is recent enough.
if (auto status2 = cuDriverGetVersion(&driverVersion)) {
Member


I think something like this should be done at runtime startup.

We've seen quite a few issues when folks tried to run TF built with CUDA 11.3 on the 460-series drivers. It does not fail right away, but tends to cause weird issues later on. Some apps work OK, others crash or fail with odd CUDA errors. No idea what exactly triggers the failures.

Checking if the driver is recent enough for the CUDA version we build with, and issuing a warning if it's not, would be very useful. CUDA-11 was supposed to be driver-agnostic, but while it removed the strict checks for the driver version, it did not quite remove the dependency on recent enough driver versions.
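
As an illustration, a minimal sketch of such a startup check (an editorial example, not code from this PR; the helper name and where it would be called are hypothetical, and only cuDriverGetVersion, CUDA_VERSION, and CUDA_SUCCESS come from the CUDA headers):

#include "cuda.h"  // cuDriverGetVersion, CUDA_VERSION, CUDA_SUCCESS
#include "tensorflow/core/platform/logging.h"

// Warn if the installed driver is older than the CUDA toolkit this binary
// was built against. Versions are encoded as 1000 * major + 10 * minor,
// e.g. 11020 for CUDA 11.2.
static void WarnIfDriverOlderThanToolkit() {
  int driver_version = 0;
  if (cuDriverGetVersion(&driver_version) != CUDA_SUCCESS) {
    LOG(WARNING) << "Could not query the CUDA driver version.";
    return;
  }
  if (driver_version < CUDA_VERSION) {
    LOG(WARNING) << "The CUDA driver reports version " << driver_version
                 << ", older than the CUDA toolkit version " << CUDA_VERSION
                 << " this binary was built with; some features may fail "
                 << "with odd CUDA errors.";
  }
}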

Contributor Author


Do you have examples of problems that this causes?
If a new 11.X version introduces new features, some of those features need new support inside the driver, like cudaMallocAsync. In that case, I think it is impossible to backport them to an older 11.X driver.
To my knowledge, compatibility across 11.X drivers only holds if you limit yourself to the features in 11.0. If you use a new feature, you are bumping the minimum driver requirement.

So you only need to worry about new features that need newer drivers.

Personally, I think it is useful for TF users that these new features are enabled only when a recent enough driver is installed. So each feature should detect the driver version and be enabled only when it is available.

I think that crashing as you suggest is too strong.
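
A rough sketch of that per-feature gating (again an editorial example, not the PR's actual code; 11020 is the encoded form of CUDA 11.2, the first toolkit with cudaMallocAsync):

#include "cuda.h"  // CUDA_VERSION, cuDriverGetVersion, CUDA_SUCCESS

// Enable the cudaMallocAsync-based allocator only when both the toolkit we
// compiled against and the installed driver support CUDA 11.2.
static bool CudaMallocAsyncSupported() {
#if CUDA_VERSION < 11020
  return false;  // Toolkit too old: the API is not even declared.
#else
  int driver_version = 0;
  if (cuDriverGetVersion(&driver_version) != CUDA_SUCCESS) return false;
  return driver_version >= 11020;  // The driver must also support 11.2.
#endif
}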

@akuegel
Member

akuegel commented Aug 26, 2021

@akuegel Did the internal test fail? If so, what is the new error message?

During my work hours yesterday, the change hadn't been imported yet; I read that there were issues with the tool that does the import/export, possibly caused by GitHub problems. Today I see that the change was imported, and the test failed again, this time with an insightful error message:

gpu_cudamallocasync_allocator.cc:136] Disable cuda_malloc_async or update your CUDA driver to a version compitible with CUDA 11.2 or higher. We detected a version compatible with: 11000

So it is indeed the problem that we don't have a recent enough driver. Either the cuda_compat package isn't being used, or it doesn't help here.
@Artem-B is there a way to check whether we indeed are running our tests with cuda_compat?
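
(One rough way to see which driver actually got loaded is a tiny diagnostic like the following; an editorial sketch, not an official cuda_compat check. If the compat package is picked up, the reported driver version should be at least as new as the runtime/toolkit version.)

#include <cstdio>
#include "cuda.h"              // cuDriverGetVersion
#include "cuda_runtime_api.h"  // cudaRuntimeGetVersion

// Print the CUDA driver version (from whichever libcuda.so was loaded)
// next to the runtime/toolkit version the binary links against.
int main() {
  int driver = 0, runtime = 0;
  cuDriverGetVersion(&driver);
  cudaRuntimeGetVersion(&runtime);
  std::printf("driver=%d runtime=%d\n", driver, runtime);
  return 0;
}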

@Artem-B
Member

Artem-B commented Aug 26, 2021

@Artem-B is there a way to check whether we indeed are running our tests with cuda_compat?

My answer above applies only to the build/test of the internal version of TF, not to the OSS one.
I'm a bit out of date on the state of OSS test infrastructure and don't know what we use there ATM.

In general, OSS builds would need to have cuda_compat installed within the container they are running. My guess is that it's not. I can check tomorrow.

@akuegel
Member

akuegel commented Aug 26, 2021

@Artem-B is there a way to check whether we indeed are running our tests with cuda_compat?

My answer above applies only to the build/test of the internal version of TF, not to the OSS one.
I'm a bit out of date on the state of OSS test infrastructure and don't know what we use there ATM.

In general, OSS builds would need to have cuda_compat installed within the container they are running. My guess is that it's not. I can check tomorrow.

Sorry, I should have clarified: this test is failing internally with the output I posted above (seemingly indicating that the driver is not new enough). In OSS, the test runs successfully.

@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 26, 2021
@nouiz
Contributor Author

nouiz commented Aug 26, 2021

@akuegel I updated the test to silently pass when the driver or toolkit is under 11.2.
This is an experimental feature, so we shouldn't bump the minimum driver/toolkit for this test.

Member

@akuegel akuegel left a comment


Thanks, I think that should unblock the change :)

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 26, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 26, 2021
@akuegel
Member

akuegel commented Aug 27, 2021

So @Artem-B has figured out why we had the older driver version when running the test. A tensorrt target was loading libcuda.so before we could load the cuda compat package. This is fixed now, so the test should actually also be running fine internally :)

@nouiz
Contributor Author

nouiz commented Aug 27, 2021

Great. I still see one of the GitHub checks showing:
@copybara-service feedback/copybara — Google internal checks FAILED for runs with create time 2021-08-26T16:28:03.4

Should this be approved again to trigger a new run with that fix?

@penpornk
Member

(Checking back on this PR out of curiosity. Congrats on resolving the driver version issue!)

Great. I still see one of the GitHub checks showing:
@copybara-service feedback/copybara — Google internal checks FAILED for runs with create time 2021-08-26T16:28:03.4

Should this be approved again to trigger a new run with that fix?

This is because the ROCm build failed. Maybe add an #ifdef guard?

tensorflow/core/common_runtime/gpu/gpu_device_test.cc: In member function 'virtual void tensorflow::GPUDeviceTest_CudaMallocAsync_Test::TestBody()':
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:120:64: error: 'CUDA_VERSION' was not declared in this scope
   LOG(INFO) << "CUDA toolkit too old, skipping this test: " << CUDA_VERSION;
                                                                ^~~~~~~~~~~~
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:120:64: note: suggested alternative: '_SC_VERSION'
   LOG(INFO) << "CUDA toolkit too old, skipping this test: " << CUDA_VERSION;
                                                                ^~~~~~~~~~~~
                                                                _SC_VERSION
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:124:3: error: 'cuDriverGetVersion' was not declared in this scope
   cuDriverGetVersion(&driverVersion);
   ^~~~~~~~~~~~~~~~~~
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:124:3: note: suggested alternative: 'driverVersion'
   cuDriverGetVersion(&driverVersion);
   ^~~~~~~~~~~~~~~~~~
   driverVersion
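
(A sketch of what such a guard could look like in the test body, purely as an illustration rather than the exact fix that landed: keep the CUDA-only symbols behind the GOOGLE_CUDA define so ROCm builds never see CUDA_VERSION or cuDriverGetVersion.)

#if GOOGLE_CUDA  // Only defined for CUDA builds; ROCm builds skip this block.
#if CUDA_VERSION < 11020
  LOG(INFO) << "CUDA toolkit too old, skipping this test: " << CUDA_VERSION;
  return;
#else
  int driverVersion = 0;
  cuDriverGetVersion(&driverVersion);
  if (driverVersion < 11020) {
    LOG(INFO) << "CUDA driver too old, skipping this test: " << driverVersion;
    return;
  }
#endif
#endif  // GOOGLE_CUDA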

@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 27, 2021
@nouiz
Contributor Author

nouiz commented Aug 27, 2021

I updated the test again.

@nouiz nouiz force-pushed the upstream-cudaMallocAsync-no_stats_error branch from 7f4db4d to 69255bb Compare August 27, 2021 15:37
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 28, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 28, 2021
@gbaned gbaned added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Aug 30, 2021
@copybara-service copybara-service bot merged commit 4e8bba0 into tensorflow:master Aug 30, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Aug 30, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 30, 2021
@nouiz nouiz deleted the upstream-cudaMallocAsync-no_stats_error branch August 30, 2021 14:54
Labels
cla: yes · comp:core (issues related to core part of tensorflow) · size:M (CL Change Size: Medium)
Projects
PR Queue: Merged
Development
Successfully merging this pull request may close these issues:
Internal Error with TF_GPU_ALLOCATOR=cuda_malloc_async
8 participants