
Add back "PR #49173: [Crash fix] Fix cudaMallocAsync crashes." #50961

Conversation

nouiz
Contributor

@nouiz nouiz commented Jul 26, 2021

This reverts commit a4553f8 that was reverting #49173.

When I run it now, //tensorflow/core/common_runtime/gpu:gpu_device_test under ASan gives me the same output as before: it was already crashing before this PR with my command line:

bazel test --distinct_host_configuration=false --javabase=@bazel_tools//tools/jdk:remote_jdk11 --config=asan -c opt --config=cuda //tensorflow/core/common_runtime/gpu:gpu_device_test

@penpornk

fixes #50669

@tensorflow-bot tensorflow-bot bot added the size:M CL Change Size: Medium label Jul 26, 2021
@google-cla google-cla bot added the cla: yes label Jul 26, 2021
@gbaned gbaned self-assigned this Jul 27, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 27, 2021
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Jul 27, 2021
@gbaned gbaned requested a review from penpornk July 27, 2021 04:15
@nouiz
Contributor Author

nouiz commented Jul 27, 2021

I just pushed a fix that fixes a test error in NDEBUG mode.

@penpornk penpornk requested review from cheshire and removed request for penpornk July 28, 2021 22:26
@penpornk
Member

@nouiz Thank you so much for taking care of this. And sorry again about the revert. :(
The PR looks good to me. I applied your additional changes and it seems the cuda_asan test is still failing internally and might need further fixing, so I'm reassigning this PR to @cheshire. (I don't usually work on GPU PRs unless it's related to sparse kernels.)

@gbaned gbaned added the awaiting review Pull request awaiting review label Aug 10, 2021
@nouiz
Contributor Author

nouiz commented Aug 16, 2021

Any update on this? It has been 3 months since I first tried to get this fix in.

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 16, 2021
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Aug 16, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 16, 2021
@nouiz
Contributor Author

nouiz commented Aug 16, 2021

@cheshire Note: the problem is that the internal cuda_asan test fails. I do not have any information about it, so I can't fix it. I'm also not able to reproduce it.

@gbaned gbaned removed the awaiting review Pull request awaiting review label Aug 17, 2021
@akuegel
Member

akuegel commented Aug 18, 2021

@cheshire Note: the problem is that the internal cuda_asan test fails. I do not have any information about it, so I can't fix it. I'm also not able to reproduce it.

The error is actually a different one:
F0817 14:52:42.569874 7691 gpu_cudamallocasync_allocator.cc:133] Failed to get device attribute: CUDA error: invalid argument (CUDA_ERROR_INVALID_VALUE)

This happens when running the gpu_device_test without cuda_asan. I guess that in the meantime, the surrounding code has changed, and you will have to adapt your change to that.
I couldn't reproduce the cuda_asan error, so I guess that is fixed now.

@nouiz nouiz force-pushed the upstream-cudaMallocAsync-no_stats_error branch from c7a3b3a to ed6d149 Compare August 18, 2021 19:45
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 18, 2021
@nouiz
Contributor Author

nouiz commented Aug 18, 2021

I tried the commit in this PR and the gpu_device_test passed. I rebased it, pushed the rebased version, and it still passes. The GitHub Linux CI passed too. So I'm not able to reproduce any error here.

Any idea what would be different on your system that would make it fail?

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 19, 2021
@@ -122,13 +125,31 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
}

se::cuda::ScopedActivateExecutorContext scoped_activation{stream_exec_};

// Check that the CUDA runtime is recent enough.
if (auto status2 = cuDriverGetVersion(&driverVersion)) {
Member


I think something like this should be done at runtime startup.

We've seen quite a few issues when folks tried to run TF built with CUDA 11.3 on the 460-series drivers. It does not fail right away, but tends to cause weird issues later on. Some apps work OK, others crash or fail with odd CUDA errors. No idea what exactly triggers the failures.

Checking if the driver is recent enough for the CUDA version we build with, and issuing a warning if it's not, would be very useful. CUDA-11 was supposed to be driver-agnostic, but while it removed the strict checks for the driver version, it did not quite remove the dependency on recent enough driver versions.
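
As an illustration, a minimal sketch of such a startup check (an editorial example, not code from this PR; the helper name and where it would be called are hypothetical, and only cuDriverGetVersion, CUDA_VERSION, and CUDA_SUCCESS come from the CUDA headers):

#include "cuda.h"  // cuDriverGetVersion, CUDA_VERSION, CUDA_SUCCESS
#include "tensorflow/core/platform/logging.h"

// Warn if the installed driver is older than the CUDA toolkit this binary
// was built against. Versions are encoded as 1000 * major + 10 * minor,
// e.g. 11020 for CUDA 11.2.
static void WarnIfDriverOlderThanToolkit() {
  int driver_version = 0;
  if (cuDriverGetVersion(&driver_version) != CUDA_SUCCESS) {
    LOG(WARNING) << "Could not query the CUDA driver version.";
    return;
  }
  if (driver_version < CUDA_VERSION) {
    LOG(WARNING) << "The CUDA driver reports version " << driver_version
                 << ", older than the CUDA toolkit version " << CUDA_VERSION
                 << " this binary was built with; some features may fail "
                 << "with odd CUDA errors.";
  }
}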

Contributor Author


Do you have examples of problems that this causes?
If a new 11.X version introduces new features, some of those features need new support inside the driver, like cudaMallocAsync. In that case, I think it is impossible to backport them to an older 11.X driver.
To my knowledge, compatibility across 11.X drivers only holds if you limit yourself to the features in 11.0. If you use a new feature, you are bumping the minimum driver requirement.

So you only need to worry about new features that need newer drivers.

Personally, I think it is useful for TF users that these new features are enabled only when a recent enough driver is installed. So each feature should detect the driver version and be enabled only when it is available.

I think that crashing as you suggest is too strong.
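
A rough sketch of that per-feature gating (again an editorial example, not the PR's actual code; 11020 is the encoded form of CUDA 11.2, the first toolkit with cudaMallocAsync):

#include "cuda.h"  // CUDA_VERSION, cuDriverGetVersion, CUDA_SUCCESS

// Enable the cudaMallocAsync-based allocator only when both the toolkit we
// compiled against and the installed driver support CUDA 11.2.
static bool CudaMallocAsyncSupported() {
#if CUDA_VERSION < 11020
  return false;  // Toolkit too old: the API is not even declared.
#else
  int driver_version = 0;
  if (cuDriverGetVersion(&driver_version) != CUDA_SUCCESS) return false;
  return driver_version >= 11020;  // The driver must also support 11.2.
#endif
}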

@akuegel
Member

akuegel commented Aug 26, 2021

@akuegel Did the internal test fail? If so, what is the new error message?

During my work hours yesterday, the change hadn't been imported yet; I read that there were issues with the tool that does the import/export, possibly caused by GitHub problems. Today I see that the change was imported, and the test failed again, this time with an insightful error message:

gpu_cudamallocasync_allocator.cc:136] Disable cuda_malloc_async or update your CUDA driver to a version compitible with CUDA 11.2 or higher. We detected a version compatible with: 11000

So it is indeed the problem that we don't have a recent enough driver. Either the cuda_compat package isn't being used, or it doesn't help here.
@Artem-B is there a way to check whether we indeed are running our tests with cuda_compat?
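
(One rough way to see which driver actually got loaded is a tiny diagnostic like the following; an editorial sketch, not an official cuda_compat check. If the compat package is picked up, the reported driver version should be at least as new as the runtime/toolkit version.)

#include <cstdio>
#include "cuda.h"              // cuDriverGetVersion
#include "cuda_runtime_api.h"  // cudaRuntimeGetVersion

// Print the CUDA driver version (from whichever libcuda.so was loaded)
// next to the runtime/toolkit version the binary links against.
int main() {
  int driver = 0, runtime = 0;
  cuDriverGetVersion(&driver);
  cudaRuntimeGetVersion(&runtime);
  std::printf("driver=%d runtime=%d\n", driver, runtime);
  return 0;
}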

@Artem-B
Member

Artem-B commented Aug 26, 2021

@Artem-B is there a way to check whether we indeed are running our tests with cuda_compat?

My answer above applies only to the build/test of the internal version of TF, not to the OSS one.
I'm a bit out of date on the state of OSS test infrastructure and don't know what we use there ATM.

In general, OSS builds would need to have cuda_compat installed within the container they are running. My guess is that it's not. I can check tomorrow.

@akuegel
Member

akuegel commented Aug 26, 2021

@Artem-B is there a way to check whether we indeed are running our tests with cuda_compat?

My answer above applies only to the build/test of the internal version of TF, not to the OSS one.
I'm a bit out of date on the state of OSS test infrastructure and don't know what we use there ATM.

In general, OSS builds would need to have cuda_compat installed within the container they are running. My guess is that it's not. I can check tomorrow.

Sorry, I should have clarified: this test is failing internally with the output I posted above (seemingly indicating that the driver is not new enough). In OSS, the test runs successfully.

@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 26, 2021
@nouiz
Contributor Author

nouiz commented Aug 26, 2021

@akuegel I updated the test to silently pass when the driver or toolkit is under 11.2.
This is an experimental feature, so we shouldn't bump the minimum driver/toolkit for this test.

Member

@akuegel akuegel left a comment


Thanks, I think that should unblock the change :)

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 26, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 26, 2021
@akuegel
Member

akuegel commented Aug 27, 2021

So @Artem-B has figured out why we had the older driver version when running the test. A tensorrt target was loading libcuda.so before we could load the cuda compat package. This is fixed now, so the test should actually also be running fine internally :)

@nouiz
Contributor Author

nouiz commented Aug 27, 2021

Great. I still see one of the GitHub checks showing:
@copybara-service feedback/copybara — Google internal checks FAILED for runs with create time 2021-08-26T16:28:03.4

Should this be approved again to trigger a new run with that fix?

@penpornk
Member

(Checking back on this PR out of curiosity. Congrats on resolving the driver version issue!)

Great. I still see one of the GitHub checks showing:
@copybara-service feedback/copybara — Google internal checks FAILED for runs with create time 2021-08-26T16:28:03.4

Should this be approved again to trigger a new run with that fix?

This is because the ROCm build failed. Maybe add an #ifdef guard?

tensorflow/core/common_runtime/gpu/gpu_device_test.cc: In member function 'virtual void tensorflow::GPUDeviceTest_CudaMallocAsync_Test::TestBody()':
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:120:64: error: 'CUDA_VERSION' was not declared in this scope
   LOG(INFO) << "CUDA toolkit too old, skipping this test: " << CUDA_VERSION;
                                                                ^~~~~~~~~~~~
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:120:64: note: suggested alternative: '_SC_VERSION'
   LOG(INFO) << "CUDA toolkit too old, skipping this test: " << CUDA_VERSION;
                                                                ^~~~~~~~~~~~
                                                                _SC_VERSION
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:124:3: error: 'cuDriverGetVersion' was not declared in this scope
   cuDriverGetVersion(&driverVersion);
   ^~~~~~~~~~~~~~~~~~
tensorflow/core/common_runtime/gpu/gpu_device_test.cc:124:3: note: suggested alternative: 'driverVersion'
   cuDriverGetVersion(&driverVersion);
   ^~~~~~~~~~~~~~~~~~
   driverVersion
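
(A sketch of what such a guard could look like in the test body, purely as an illustration rather than the exact fix that landed: keep the CUDA-only symbols behind the GOOGLE_CUDA define so ROCm builds never see CUDA_VERSION or cuDriverGetVersion.)

#if GOOGLE_CUDA  // Only defined for CUDA builds; ROCm builds skip this block.
#if CUDA_VERSION < 11020
  LOG(INFO) << "CUDA toolkit too old, skipping this test: " << CUDA_VERSION;
  return;
#else
  int driverVersion = 0;
  cuDriverGetVersion(&driverVersion);
  if (driverVersion < 11020) {
    LOG(INFO) << "CUDA driver too old, skipping this test: " << driverVersion;
    return;
  }
#endif
#endif  // GOOGLE_CUDA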

@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 27, 2021
@nouiz
Contributor Author

nouiz commented Aug 27, 2021

I updated the test again.

@nouiz nouiz force-pushed the upstream-cudaMallocAsync-no_stats_error branch from 7f4db4d to 69255bb Compare August 27, 2021 15:37
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 28, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 28, 2021
@gbaned gbaned added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Aug 30, 2021
@copybara-service copybara-service bot merged commit 4e8bba0 into tensorflow:master Aug 30, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Aug 30, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Aug 30, 2021
@nouiz nouiz deleted the upstream-cudaMallocAsync-no_stats_error branch August 30, 2021 14:54
Labels
cla: yes · comp:core (issues related to core part of tensorflow) · size:M (CL Change Size: Medium)
Projects
PR Queue: Merged
Development
Successfully merging this pull request may close these issues:
Internal Error with TF_GPU_ALLOCATOR=cuda_malloc_async
8 participants