Add back "PR #49173: [Crash fix] Fix cudaMallocAsync crashes." #50961
Conversation
I just pushed a fix that fixes a test error in NDEBUG mode.
@nouiz Thank you so much for taking care of this. And sorry again about the revert. :(
Any update on this? It has been 3 months since I first tried to get this fix in.
@cheshire Note, the problem is the internal
The error is actually another one: this happens when running the gpu_device_test without cuda_asan. I guess that in the meantime, the surrounding code has changed, and you will have to adapt your change to that.
Force-pushed from c7a3b3a to ed6d149
I tried the commit in this PR and the gpu_device_test passed. I rebased it and pushed the rebased version and it still passes. The GitHub Linux CI passed too. So I'm not able to reproduce any error here. Any idea what would be different in your system that would make it fail?
@@ -122,13 +125,31 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
}

se::cuda::ScopedActivateExecutorContext scoped_activation{stream_exec_};

// Check that the CUDA runtime is recent enough.
if (auto status2 = cuDriverGetVersion(&driverVersion)) {
I think something like this should be done on runtime startup.
We've seen quite a few issues when folks tried to run TF built with CUDA-11.3 on the 460 drivers. It does not fail right away, but tends to cause weird issues later on. Some apps work OK, others crash or fail with odd CUDA errors. No idea what exactly triggers the failures.
Checking if the driver is recent enough for the CUDA version we build with, and issuing a warning if it's not, would be very useful. CUDA-11 was supposed to be driver-agnostic, but while it removed the strict checks for the driver version, it did not quite remove the dependency on recent enough driver versions.
Do you have examples of the problems this causes?
If a new 11.X version introduces new features, then some of those features need new support inside the driver, like cudaMallocAsync. In that case, I think it is impossible to backport this to older 11.X drivers.
To my knowledge, the compatibility across 11.X drivers holds only if you limit yourself to features in 11.0. If you use a new feature, then you are bumping the minimum driver requirement.
So you only need to take care with new features that need new drivers.
Personally, I think it is useful for TF users that those new features are enabled only when a recent enough driver is installed. So those features should detect the version and be enabled only when they are available.
I think that crashing as you suggest is too strong.
During my work hours yesterday, the change hadn't been imported yet; I read there were issues with the tool that does the import/export, possibly caused by GitHub problems. Today I see that the change was imported, and the test failed again, this time with an insightful error message:

gpu_cudamallocasync_allocator.cc:136] Disable cuda_malloc_async or update your CUDA driver to a version compitible with CUDA 11.2 or higher. We detected a version compatible with: 11000

So it is indeed the problem that we don't have a recent enough driver. Either the cuda_compat package isn't being used, or it doesn't help here.
My answer above applies only to the build/test of the internal version of TF, not to the OSS one. In general, OSS builds would need to have cuda_compat installed within the container they are running in. My guess is that it's not. I can check tomorrow.
Sorry, I should have clarified: this test is failing internally with the output I posted above (seemingly indicating that the driver is not new enough). In OSS, the test runs successfully.
@akuegel I updated the test to silently pass when the driver or toolkit is under 11.2.
Thanks, I think that should unblock the change :)
So @Artem-B has figured out why we had the older driver version when running the test. A tensorrt target was loading libcuda.so before we could load the cuda compat package. This is fixed now, so the test should actually also be running fine internally :)
Great. I still see one of the GitHub Checks failing. Should this be approved again to trigger a new run with that fix?
(Checking back on this PR out of curiosity. Congrats on resolving the driver version issue!)
This is because the ROCm build failed. Maybe add an
I updated the test again. |
Force-pushed from 7f4db4d to 69255bb
This reverts commit a4553f8 that was reverting #49173.
When I run it now, //tensorflow/core/common_runtime/gpu:gpu_device_test under asan gives me the same output as before. It was already crashing before this PR with my command line:
bazel test --distinct_host_configuration=false --javabase=@bazel_tools//tools/jdk:remote_jdk11 --config=asan -c opt --config=cuda //tensorflow/core/common_runtime/gpu:gpu_device_test
@penpornk
fixes #50669