XLA related ptxas version error when changing batch size #66716
Comments
Hi @andremfreitas,
Thank you!
CUDA 12.3. These versions should be compatible with TensorFlow 2.16.
Turns out it was an issue with a specific node of the cluster ... sorry about that.
Hey @andremfreitas, could you go into a bit more detail about what you did to solve the issue? I'm having the exact same problem, but I'm not sure what issue with the cluster node you're talking about, or how to even begin debugging something like this. Any help is appreciated!
Hi, having the same issue here! I also do not understand what issue with the node you're talking about. Any indication of how you solved it would really be appreciated!
@LeonardoPaccianiMori While this isn't really an answer, I just re-implemented the network in PyTorch and it works like a charm!
@LeonardoPaccianiMori @trevorcarrell In my case, one of the nodes of the cluster had an older CUDA version, which led me to misattribute the error to the change in batch size rather than to the allocation of that specific node. After avoiding this node I no longer had the problem with any batch size.
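If you need to track down which node carries the stale toolkit, a small check like the one below can help. It is only a sketch (`ptxas_version` is a hypothetical helper name, not part of any library): run it once per node, e.g. through your cluster scheduler, and compare the reported versions. It returns `None` when `ptxas` is not on the `PATH` instead of failing.

```python
import re
import shutil
import subprocess

def ptxas_version():
    """Return the ptxas release string (e.g. '12.3'), or None if ptxas
    is not on the PATH. Intended to be run on each cluster node to spot
    the one with an older CUDA toolkit."""
    path = shutil.which("ptxas")
    if path is None:
        return None
    out = subprocess.run([path, "--version"],
                         capture_output=True, text=True).stdout
    # ptxas --version prints a line like: "Cuda compilation tools, release 12.3, ..."
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None

print(ptxas_version())
```

Comparing the output across nodes against `nvidia-smi` (driver-side version) should make a mismatched node stand out.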
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
2.16
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04.3 LTS
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
A100 40GB
Current behavior?
I have a custom training loop that calls functions that are jit-compiled. I get the error message below when using a batch size of 512. However, if I change the batch size to 256, for example (or 128), I no longer get this error. This is very strange, because the error about the ptxas version (which, from my understanding, is related to the CUDA toolkit version) should have nothing to do with the batch size. So I think the batch size of 512 may be triggering another error (possibly a memory issue?) and the wrong error is being thrown ... I am not sure, but let me know what you think.
Thanks and sorry for not being able to provide MWE.
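One plausible reason batch size matters here: XLA compiles a separate executable per input shape, so a previously unseen batch size triggers a fresh compile, and that is the point where ptxas is invoked. The toy sketch below is not TensorFlow's actual internals, just an illustration of that shape-keyed caching under the assumption that compilation (and hence any ptxas failure) only happens on a cache miss:

```python
# Toy model of XLA's shape-keyed compilation cache. Each new input
# shape causes a fresh compile; in real XLA that is where HLO is
# lowered to PTX and ptxas runs, so an incompatible ptxas version
# would only surface on the first call with an unseen batch size.

compile_cache = {}

def compiled_step(batch_shape):
    if batch_shape not in compile_cache:
        # Cache miss: this stands in for the lower-to-PTX + ptxas step.
        compile_cache[batch_shape] = f"executable_for_{batch_shape}"
    return compile_cache[batch_shape]

compiled_step((256, 128))   # compiles
compiled_step((256, 128))   # cache hit, no recompile
compiled_step((512, 128))   # new batch size -> fresh compile
print(len(compile_cache))   # two distinct executables
```

If that assumption holds, a batch size of 512 would not itself be invalid; it would simply be the first shape whose compilation reached the broken ptxas on that node.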
Standalone code to reproduce the issue
Relevant log output