GPU sync failed #1450
It might be because the GTX 970 has memory issues when allocating more than 3.5 GB (see http://wccftech.com/nvidia-geforce-gtx-970-memory-issue-fully-explained/). You can try allocating less than 3.5 GB of memory and check whether that fixes the issue:
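The memory cap described above can be sketched as follows. This is a hedged example using the `tf.compat.v1` configuration API (the thread predates TF 2.x), and the 0.8 fraction (about 3.2 GB of a 4 GB card) is an assumed value, not one taken from the thread:

```python
import tensorflow as tf

# Cap the per-process GPU allocation so it stays inside the GTX 970's
# fast 3.5 GB segment. The 0.8 fraction (~3.2 GB of 4 GB) is an assumption.
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.8)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
sess = tf.compat.v1.Session(config=config)
```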
Yikes. Good to know @aymericdamien, thanks!
It makes no sense...
I hit the same error:
Are you building from source, or did you install the pip package? What's your environment? E.g., all the information the template had but you removed :). If you built from source, what command line did you use?
I built from source, created a whl, and installed it with pip. The GPU test passes, but it still doesn't work with my code.
My commands are:
Since I live in mainland China, I changed some code in the WORKSPACE file:
and I also changed the .gitmodules file:
My build commands are:
@zffchen78: Can you take a look? Is there any relationship between this and #2471?
@vrv: Reassigning to you per @zffchen78's request.
Pretty sure this is going to be hard for us to debug without being able to reproduce this. I would suggest:
and then try again.
Automatically closing because there was no response. Please reopen if it is still an issue.
I am getting the same error when I create a simple custom operator that operates on a list of input tensors of type int32. My input tensor has 5 elements, so this is clearly not a memory-limitation issue.
Specifics: build and run the attached source code.
int32: /job:localhost/replica:0/task:0/gpu:0
Key info:
Notes:
Does it work if you define the input type as int64? |
I can't build a custom operator with type int64:
karenbre@karenZ820:~/workspace/issue1450$ ./build.sh
Note, it does work with int16. |
I just hit the same problem whether I installed TensorFlow from source or from the official binary (the installation procedure itself went fine).
Also saw this last night, after 2 hours running at 80% GPU utilization.
TITAN X (Pascal)
I am getting something similar during back-propagation. Bottleneck generation works fine.
@matpalm: Has it happened consistently since? These kinds of one-off failures can happen if there are GPU hardware issues. @yash0307, same question: does it happen immediately or only after a while? @kbrems, can you include the int64 code? int64 should definitely compile, and I can't figure out the error from the compiler output alone.
I've not seen it again, and I have been running similar jobs (i.e. in terms of GPU utilization and memory load) almost every night since.
Here is the example with int64. I just pulled the latest source from tensorflow master this morning and tried again and it still does not compile. |
Your in_types looks to be int16, not int64, not sure if this is the only problem though. Other than that, this does seem like something we do all the time in other kernels, so I'm not sure why it's not compiling. |
My search/replace failed to catch that. I changed the in_types to int64, but it still does not compile. |
Even though we typedef int64 to int64_t, I think you need to use int64, not int64_t. The following simpler code (which doesn't add one, but for illustration) compiled for me:
It seems that somewhere deep within Eigen, int64 is defined as long long int, but on 64-bit Ubuntu, int64_t is defined as long int in stdint.h, so the two are not compatible. I can work around that in this simple example, but it means that all our custom CUDA kernels would then have to depend on Eigen types instead of the standard Linux types. On the plus side, my original issue with int32_t generating the GPU sync error and core dump seems to have gone away with release 0.11.0rc0 (built from latest source), although I have also upgraded to CUDA 8.0 since the original problem, so perhaps that fixed something.
I have a similar problem:
It occurs intermittently during training (usually after a few epochs).
Same problem.
It works perfectly on CPU (when CUDA_VISIBLE_DEVICES=-1). It is strange: when I use tf.add() it works on GPU, but tf.multiply() and tf.square() (not tested on other math functions) give an error. CUDA and cuDNN 8, Win10, 1050 Ti, TensorFlow 1.4 pip install.
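One workaround consistent with this report is to pin just the failing ops to the CPU while leaving the rest of the graph on GPU. A minimal sketch, written against current TensorFlow's eager API rather than the 1.4 graph API from the comment:

```python
import tensorflow as tf

a = tf.constant([1.0, 2.0])

# Pin the ops that reportedly fail on this GPU setup to the CPU.
with tf.device("/cpu:0"):
    prod = tf.multiply(a, a)   # reported to fail on GPU above
    sq = tf.square(a)

# tf.add reportedly worked on GPU, so it can stay unpinned.
total = tf.add(prod, sq)
```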
I met the same problem, and I finally solved it by decreasing the batch size. It's strange that I could run this program with a bigger batch size before.
`with tf.device("/cpu:0"):` Q: What should the learning_rate be?
I hit the same issue running Keras code on GPU. I solved it after releasing the memory. It is highly probable that you don't have enough memory available, which is also why some people said reducing the batch size works. Good luck.
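The batch-size mitigation the last few comments describe can be sketched generically. Both `train_step` and the `MemoryError` here are hypothetical stand-ins; a real run would surface a TensorFlow allocator error instead:

```python
def run_with_smaller_batches(train_step, batch_size, min_batch=1):
    """Retry train_step with halved batch sizes until it fits in memory.

    train_step(batch_size) is a stand-in that raises MemoryError when
    the batch does not fit on the GPU.
    """
    while batch_size >= min_batch:
        try:
            return train_step(batch_size), batch_size
        except MemoryError:
            batch_size //= 2  # e.g. 64 -> 32, as in the comment above

    raise RuntimeError("even the smallest batch size does not fit")


# Illustration: pretend anything above 32 samples exhausts GPU memory.
def fake_step(batch_size):
    if batch_size > 32:
        raise MemoryError("fake GPU out-of-memory")
    return "loss=0.1"
```

With these stand-ins, `run_with_smaller_batches(fake_step, 64)` falls back from 64 to 32, mirroring the change reported in the comment above.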
I had the same problem. My batch size was 64, and I changed it to 32. Then it ran.
Hi!
What's wrong with this, and how can I solve it? I'm using CUDA 7.5 and cuDNN 7.0, and everything runs fine on CPU, but running on GPU produces the error.
I can locate the operation that can't run on GPU.
When I remove tf.device("/cpu:0"), it triggers the bug reported above.
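For cases like this, where a single op has no working GPU kernel, TensorFlow can be told to fall back automatically instead of requiring a manual tf.device("/cpu:0") block. A sketch using the `tf.compat.v1` configuration API; the session creation is only illustrative:

```python
import tensorflow as tf

# allow_soft_placement lets ops without a usable GPU kernel fall back
# to CPU instead of failing; log_device_placement prints where each op
# actually runs, which helps locate the offending operation.
config = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=True,
)
sess = tf.compat.v1.Session(config=config)
```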