Kernel dies #121

Closed
malj390 opened this issue Feb 26, 2021 · 26 comments

Comments

@malj390

malj390 commented Feb 26, 2021

Hello,

For some reason, when I just run the example notebooks provided for 3D segmentation, the kernel restarts in the training notebook at cell 13, which contains:

median_size = calculate_extents(Y, np.median)
fov = np.array(model._axes_tile_overlap('ZYX'))
print(f"median object size:      {median_size}")
print(f"network field of view :  {fov}")
if any(median_size > fov):
    print("WARNING: median object size larger than field of view of the neural network.")

I'm running it on Windows 10, python=3.8.5, tensorflow=2.4.1 and gputools=0.2.9

I have done all the routine checks to confirm that the GPU works and is recognized by TensorFlow.

I also tried it in Spyder, which gives this output:

2021-02-26 13:24:21.260091: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-26 13:24:25.273342: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-26 13:24:25.275951: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-02-26 13:24:26.630297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2021-02-26 13:24:26.633000: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-26 13:24:26.662213: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-26 13:24:26.662795: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-26 13:24:26.681321: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-02-26 13:24:26.686489: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-02-26 13:24:26.737777: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-02-26 13:24:26.755146: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-02-26 13:24:26.757006: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-02-26 13:24:26.757886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-26 13:24:26.759337: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-26 13:24:26.761659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 245.91GiB/s
2021-02-26 13:24:26.762835: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-26 13:24:26.763438: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-26 13:24:26.764039: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-26 13:24:26.764743: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-02-26 13:24:26.765399: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-02-26 13:24:26.766015: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-02-26 13:24:26.766659: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-02-26 13:24:26.767273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-02-26 13:24:26.767920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-02-26 13:24:27.730059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-26 13:24:27.730706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0 
2021-02-26 13:24:27.731105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N 
2021-02-26 13:24:27.731638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4592 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-02-26 13:24:27.733223: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-26 13:24:28.282501: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-26 13:24:28.439355: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll

I also tried adding these lines at the beginning of the notebook, but the kernel still dies.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Does someone know what could be the problem?

Thank you

@maweigert
Member

Hi @malj390 ,

Hard to diagnose. Does it always happen at cell 13? (That cell actually doesn't do anything on the GPU, so it should not be a problem.)

@malj390
Author

malj390 commented Feb 26, 2021

Yes. I also tried skipping that cell and going on to the next one (the one with quick_demo = True), and the kernel dies there too.

@uschmidt83
Member

Actually, the first time model._axes_tile_overlap('ZYX') is called, it does use the neural net (and the GPU) to empirically determine its field of view.

If you add this at the (very) beginning of the notebook, it shouldn't use the GPU and the kernel hopefully doesn't die:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

Of course this isn't a solution, but your log messages don't help me to figure out where the problem is.
Can you call nvidia-smi on the command line before you run the notebook?
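
To confirm that the setting above really hides the GPU, TensorFlow can be asked which GPUs it sees. A minimal sketch, assuming TensorFlow 2.x (`tf.config.list_physical_devices`); the helper name `visible_gpus` is made up for this illustration, and the environment variable has to be set before TensorFlow is imported for the first time:

```python
import os

# Hide all GPUs. This must happen before TensorFlow is imported for the first time.
os.environ["CUDA_VISIBLE_DEVICES"] = ""


def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.config.list_physical_devices("GPU")


gpus = visible_gpus()
print("GPUs visible to TensorFlow:", gpus)
```

If the list is empty and the kernel still dies, the GPU is probably not the cause.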

@maweigert
Member

Indeed, as Uwe mentioned it might be that already model._axes_tile_overlap leads to an out-of-memory error (which could be due to your relatively small GPU RAM). You can check whether this is the case by running

model._compute_receptive_field()

directly after you created the model, and see whether this already makes the kernel die.
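
The probing idea behind such a field-of-view measurement can be illustrated without a GPU: feed an input that is zero except for a single 1 in the center, feed an all-zero input, and see how far from the center the two outputs differ. A toy 1D sketch with NumPy; `toy_network` is a made-up stand-in for the real model, so this shows the principle rather than StarDist's actual implementation:

```python
import numpy as np


def toy_network(x):
    """Hypothetical stand-in for a CNN: two stacked 3-tap averaging filters."""
    kernel = np.ones(3) / 3.0
    hidden = np.convolve(x, kernel, mode="same")
    return np.convolve(hidden, kernel, mode="same")


def empirical_receptive_field(predict, size=33):
    """Probe `predict` with a centered impulse and measure how far the response spreads."""
    mid = size // 2
    x = np.zeros(size)
    x[mid] = 1.0          # single impulse in the center
    z = np.zeros(size)    # all-zero reference input
    # Positions where the impulse changes the output compared with the reference.
    changed = np.flatnonzero(np.abs(predict(x) - predict(z)) > 1e-12)
    return int(mid - changed.min()), int(changed.max() - mid)


print(empirical_receptive_field(toy_network))  # (2, 2)
```

Two stacked 3-tap filters reach two samples to each side, which is what the probe reports. The real model does the same kind of forward pass on a 3D volume, which is why this step can already touch the GPU.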

@malj390
Author

malj390 commented Feb 26, 2021

nvidia-smi output:

Fri Feb 26 13:26:30 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.40       Driver Version: 461.40       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8     2W /  N/A |    164MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11104    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     24220    C+G   Insufficient Permissions        N/A      |
+-----------------------------------------------------------------------------+

I tried both of your suggestions, together and individually, and the kernel still keeps dying.
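
The memory figures from nvidia-smi can also be read from Python, which makes it easy to log them right before the failing cell. A rough sketch, assuming `nvidia-smi` is on the PATH and accepts the `--query-gpu` CSV options; `parse_gpu_line` and `query_gpus` are hypothetical helper names:

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as 'GeForce RTX 2060, 164, 6144' (memory in MiB)."""
    name, used, total = [field.strip() for field in line.split(",")]
    return {"name": name, "used_mib": int(used), "total_mib": int(total)}


def query_gpus():
    """Return one dict per GPU, or None if nvidia-smi cannot be found."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]


# Example with the values from the table above.
print(parse_gpu_line("GeForce RTX 2060, 164, 6144"))
```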

@maweigert
Member

Ok. Could you try whether model._compute_receptive_field(img_size=(32,32,32)) works?

@malj390
Author

malj390 commented Feb 26, 2021

It doesn't work either.

@uschmidt83
Member

Is the log you posted above really the entire output on the command line where you started the Jupyter notebook server? I would expect to see some more relevant messages that give a clue where the problem is.

@kapoorlab

Hello,

I do not know if this is related, but I have also encountered such issues on Mac. When I try to install stardist, it first fails unless I provide the proper CC and g++ flags. Once I do, the installation works fine, but the kernel then dies when running prediction on 2D data. However, if I run the same notebook on Colab, with !pip install stardist in an earlier cell, everything works just fine. Could it be an h5py issue on Mac?

I'm not sure if this helps, but maybe the issue is not 3D-related but Mac/OS/h5py-related? If you run your code on Colab and it works there, that would confirm the issue is not really a stardist bug but something else.

@malj390
Author

malj390 commented Mar 1, 2021

@uschmidt83 No, I can only get that output through Spyder. Jupyter Notebook doesn't show any error at all; the kernel just dies.

@malj390
Author

malj390 commented Mar 1, 2021

@kapoorlab I thought about doing so, but I didn't think it would really give any additional answer. The code I'm running comes directly from the examples, with their data, so if something goes wrong I know for sure that it is my installation or my environment. The weird part is that I ran the proper tests at the beginning to check that the installation was done properly (TensorFlow, CUDA, the graphics card being recognized, etc.) and they say it is OK. I will follow your suggestion and run it in Colab. However, at some point I need to run a training on my own computer.

@uschmidt83 @maweigert @kapoorlab Thanks for the prompt reply

@malj390
Author

malj390 commented Mar 1, 2021

So I can confirm the issue is with my environment as Colab works well. I'm not sure what it is and how to obtain more information to address and solve the problem. Do you have any extra advice @uschmidt83 to extract more relevant information?
I think next I can try to run another TensorFlow model and check if the problem persists.

@uschmidt83
Member

You can save/export the Jupyter notebook as a Python script and run it from the command line. Maybe that generates more helpful error messages.
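
The usual way to do the export is `jupyter nbconvert --to script` followed by the notebook file. Since a notebook is a JSON file, the code cells can also be pulled out with the standard library alone. A minimal sketch, assuming the nbformat 4 layout (top-level `cells`, each with `cell_type` and `source`); `notebook_to_script` is a hypothetical helper:

```python
import json


def notebook_to_script(ipynb_path, py_path):
    """Write the code cells of a notebook to a plain Python script.

    Returns the number of code cells that were written.
    """
    with open(ipynb_path, encoding="utf-8") as f:
        nb = json.load(f)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # `source` may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip() + "\n")
    with open(py_path, "w", encoding="utf-8") as f:
        f.write("\n".join(chunks))
    return len(chunks)
```

IPython magics (lines starting with % or !) are copied as-is and have to be removed by hand before running the script with plain Python.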

@maweigert
Member

You can test whether basic inference works as follows:

  1. Update stardist with pip install -U stardist (we released a new version yesterday).
  2. Try the following (in a new notebook or in IPython):
from stardist.models import StarDist2D
from stardist.data import test_image_nuclei_2d
from csbdeep.utils import normalize 

model = StarDist2D.from_pretrained('2D_versatile_fluo')

img = test_image_nuclei_2d()
label, _ = model.predict_instances(normalize(img))

@kapoorlab

Just a related question, does stardist not work well with certain combinations of tensorflow/keras versions? Because I also had such a problem on my Mac and I switched to Colab with tensorflow1.x and keras 2.2.5 and it was all fine.

@uschmidt83
Member

does stardist not work well with certain combinations of tensorflow/keras versions?

Not to my knowledge.

I also had such a problem on my Mac and I switched to Colab with tensorflow1.x and keras 2.2.5 and it was all fine.

I never use tensorflow (csbdeep, stardist, etc.) in Python on my Mac, because it doesn't have GPU support anyway. Maybe @maweigert can comment on this.

@kapoorlab

I did pip install -U stardist and for me the kernel does not die now in the same virtual environment that it died before. So it is good from the Mac world side. Some users may want to have it on their Macs as they may not have access to a GPU server, just good to have it installed on slow computers as an option.

@malj390
Author

malj390 commented Mar 2, 2021

@uschmidt83

You can save/export the Jupyter notebook as a Python script and run it from the command line. Maybe that generates more helpful error messages.

Yes. Actually, the long output in my earlier post (repeated below) came from running the notebook converted to .py in Spyder. I did not find a way to access the error from the Jupyter notebook.

2021-02-26 13:24:21.260091: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-02-26 13:24:25.273342: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-26 13:24:25.275951: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-02-26 13:24:26.630297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2060 computeCapability: 7.5
...

@malj390
Author

malj390 commented Mar 2, 2021

@maweigert It's still dying in Jupyter Notebook and in Spyder (giving the same output mentioned above). I suspect it is a problem with my system itself or with the graphics card.

As I'm not really into complex models, I wrote this simple one just to try TensorFlow. It runs properly without dying, so now I'm not sure where the problem is.

from tensorflow import keras
import numpy as np

y = lambda x: 2*x - 1

values1 = list(range(-1, 15))
values2 = [y(i) for i in values1]
display(values1[:10])
display(values2[:10])

values1 = [float(i) for i in values1]
values2 = [float(i) for i in values2]

model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')

xs = np.array(values1, dtype=float)
ys = np.array(values2, dtype=float)

model.fit(xs, ys, epochs=2000)

model.predict([600.0])

test = 2000
print("Real value: {} VS Prediction: {}".format(y(test), round(float(model.predict([test])))))

@malj390
Author

malj390 commented Mar 2, 2021

I found the error. Sorry guys, I hadn't checked the Anaconda prompt from which I started the notebook, which actually shows the error.

It was this one:

Could not load library cudnn_ops_infer64_8.dll. Error code 126     
Please make sure cudnn_ops_infer64_8.dll is in your library path!
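
A quick way to see whether such a file can be found at all is to search the directories listed in PATH. A small sketch, assuming the library is looked up via PATH (as the message suggests); `find_on_path` is a hypothetical helper, not part of TensorFlow or CUDA:

```python
import os


def find_on_path(filename, path=None):
    """Return every directory on `path` (default: the PATH variable) that contains `filename`."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# An empty list means the DLL is in none of the PATH directories.
print(find_on_path("cudnn_ops_infer64_8.dll"))
```

More than one hit is also worth a look, because it can mean that several cuDNN versions are installed side by side.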

I did the whole installation following this guide (with updated, compatible versions of CUDA, cuDNN and the CUDA Toolkit):
https://towardsdatascience.com/installing-tensorflow-with-cuda-cudnn-and-gpu-support-on-windows-10-60693e46e781

It seems that cuDNN is not mentioned in the tutorial, so I have now copied the full folders from

C:\Users\User\Downloads\cudnn-10.1-windows10-x64-v8.0.5.39\cuda\bin
C:\Users\User\Downloads\cudnn-10.1-windows10-x64-v8.0.5.39\cuda\include
C:\Users\User\Downloads\cudnn-10.1-windows10-x64-v8.0.5.39\cuda\lib\x64

to

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\lib\x64

Now the error in Notebook is:

InternalError:  Blas SGEMM launch failed : m=262144, n=96, k=128
	 [[node model/dist/Conv3D (defined at C:\Users\User\anaconda3\envs\py38tensor\lib\site-packages\stardist\models\base.py:666) ]] [Op:__inference_predict_function_632]

Function call stack:
predict_function

and in the Anaconda prompt:

2021-03-02 21:19:43.415108: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support

Thanks

@uschmidt83
Member

Ok, thanks for letting us know.

Btw, I got TensorFlow with GPU support working (also on Windows) by simply installing this conda environment. It will automatically install the necessary CUDA and cuDNN libraries. After installing this environment, activate it, and then install StarDist via pip.
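
To see which CUDA and cuDNN versions an installed TensorFlow build expects, the build information can be printed. A minimal sketch, assuming TensorFlow 2.3 or newer (where `tf.sysconfig.get_build_info()` is available); `tensorflow_build_info` is a hypothetical helper, and on CPU-only builds the two version entries are simply missing:

```python
def tensorflow_build_info():
    """Return TensorFlow's version and the CUDA/cuDNN versions it was built against.

    Returns None if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        "cuda": info.get("cuda_version"),    # None on CPU-only builds
        "cudnn": info.get("cudnn_version"),  # None on CPU-only builds
    }


print(tensorflow_build_info())
```

Comparing these values with the installed CUDA and cuDNN libraries helps to spot a version mismatch.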

@malj390
Author

malj390 commented Mar 2, 2021

Thanks a lot! I did that, but I still get this error.
Notebook:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node functional_1/conv3d/Conv3D (defined at C:\Users\User\anaconda3\envs\csbdeep\lib\site-packages\stardist\models\base.py:666) ]] [Op:__inference_predict_function_655]

Function call stack:
predict_function

Anaconda prompt:

2021-03-02 23:18:16.839925: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2060, Compute Capability 7.5                                
2021-03-02 23:18:24.730525: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll                                      
2021-03-02 23:18:25.761022: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED                                            
2021-03-02 23:18:25.767004: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED                                                                                           
2021-03-02 23:19:46.895681: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED                                            
2021-03-02 23:19:46.899993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

@uschmidt83
Member

This looks like something else is occupying the GPU memory (another notebook?).

@malj390
Author

malj390 commented Mar 3, 2021

No, I only have this one notebook open.

Here is the complete error, in case it provides more information.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\anaconda3\envs\csbdeep\lib\site-packages\stardist\models\base.py in _axes_tile_overlap(self, query_axes)
    678         try:
--> 679             self._tile_overlap
    680         except AttributeError:

AttributeError: 'StarDist3D' object has no attribute '_tile_overlap'

During handling of the above exception, another exception occurred:

UnknownError                              Traceback (most recent call last)
<ipython-input-15-213f80ac94e0> in <module>
      1 median_size = calculate_extents(Y, np.median)
----> 2 fov = np.array(model._axes_tile_overlap('ZYX'))
      3 print(f"median object size:      {median_size}")
      4 print(f"network field of view :  {fov}")
      5 if any(median_size > fov):

~\anaconda3\envs\csbdeep\lib\site-packages\stardist\models\base.py in _axes_tile_overlap(self, query_axes)
    679             self._tile_overlap
    680         except AttributeError:
--> 681             self._tile_overlap = self._compute_receptive_field()
    682         overlap = dict(zip(
    683             self.config.axes.replace('C',''),

~\anaconda3\envs\csbdeep\lib\site-packages\stardist\models\base.py in _compute_receptive_field(self, img_size)
    664         z = np.zeros_like(x)
    665         x[(0,)+mid+(slice(None),)] = 1
--> 666         y  = self.keras_model.predict(x)[0][0,...,0]
    667         y0 = self.keras_model.predict(z)[0][0,...,0]
    668         grid = tuple((np.array(x.shape[1:-1])/np.array(y.shape)).astype(int))

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\keras\engine\training.py in _method_wrapper(self, *args, **kwargs)
    128       raise ValueError('{} is not supported in multi-worker mode.'.format(
    129           method.__name__))
--> 130     return method(self, *args, **kwargs)
    131 
    132   return tf_decorator.make_decorator(

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\keras\engine\training.py in predict(self, x, batch_size, verbose, steps, callbacks, max_queue_size, workers, use_multiprocessing)
   1597           for step in data_handler.steps():
   1598             callbacks.on_predict_batch_begin(step)
-> 1599             tmp_batch_outputs = predict_function(iterator)
   1600             if data_handler.should_sync:
   1601               context.async_wait()

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
    778       else:
    779         compiler = "nonXla"
--> 780         result = self._call(*args, **kwds)
    781 
    782       new_tracing_count = self._get_tracing_count()

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
    844               *args, **kwds)
    845       # If we did not create any variables the trace we have is good enough.
--> 846       return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
    847 
    848     def fn_with_cond(*inner_args, **inner_kwds):

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\eager\function.py in _filtered_call(self, args, kwargs, cancellation_manager)
   1846                            resource_variable_ops.BaseResourceVariable))],
   1847         captured_inputs=self.captured_inputs,
-> 1848         cancellation_manager=cancellation_manager)
   1849 
   1850   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1922       # No tape is watching; skip to running the function.
   1923       return self._build_call_outputs(self._inference_function.call(
-> 1924           ctx, args, cancellation_manager=cancellation_manager))
   1925     forward_backward = self._select_forward_and_backward_functions(
   1926         args,

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    548               inputs=args,
    549               attrs=attrs,
--> 550               ctx=ctx)
    551         else:
    552           outputs = execute.execute_with_cancellation(

~\anaconda3\envs\csbdeep\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node functional_1/conv3d/Conv3D (defined at C:\Users\User\anaconda3\envs\csbdeep\lib\site-packages\stardist\models\base.py:666) ]] [Op:__inference_predict_function_655]

Function call stack:
predict_function

I don't know whether the warning log message that is supposed to be printed above is this one:

AttributeError: 'StarDist3D' object has no attribute '_tile_overlap'

@uschmidt83
Member

I really don't know what the problem is. One last try, run this at the very beginning of the notebook:

from csbdeep.utils.tf import limit_gpu_memory
limit_gpu_memory(None, allow_growth=True)
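
By default TensorFlow reserves almost all GPU memory when it starts, which can leave too little for cuDNN to create its handle; with memory growth enabled it allocates only what it needs. Plain TensorFlow 2 offers a similar switch. A minimal sketch, assuming TensorFlow 2.x; `enable_memory_growth` is a hypothetical helper, and whether it matches what `limit_gpu_memory` does internally is an assumption:

```python
def enable_memory_growth():
    """Let TensorFlow allocate GPU memory on demand instead of reserving it all up front.

    Must be called before the GPU is first used. Returns the number of GPUs
    that were configured, or None if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


print("GPUs configured for memory growth:", enable_memory_growth())
```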

@malj390
Author

malj390 commented Mar 4, 2021

Yes, that fixed it! Thanks a lot.
