XLA related ptxas version error when changing batch size #66716
Comments
Hi @andremfreitas,
Thank you!
CUDA 12.3. These versions should be compatible with TensorFlow 2.16.
Turns out it was an issue with a specific node of the cluster ... sorry about that.
Hey @andremfreitas, could you go into a bit more detail about what you did to solve the issue? I'm having the exact same problem, but I'm not sure what issue with the cluster node you're talking about, or how to even begin debugging something like this. Any help is appreciated!
Hi, having the same issue here! I also do not understand what issue with the node you're talking about. Any indication of how you solved it would really be appreciated!
@LeonardoPaccianiMori While this isn't really an answer, I just re-implemented the network in PyTorch and it works like a charm!
@LeonardoPaccianiMori @trevorcarrell In my case, one of the nodes of the cluster had an older CUDA version, which led me to misattribute the error to the change in batch size rather than to the allocation of that specific node. After avoiding this node I no longer had the problem with any batch size.
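If you need to track down which node carries the stale toolkit, a small check like the one below can help. It is only a sketch (`ptxas_version` is a hypothetical helper name, not part of any library): run it once per node, e.g. through your cluster scheduler, and compare the reported versions. It returns `None` when `ptxas` is not on the `PATH` instead of failing.

```python
import re
import shutil
import subprocess

def ptxas_version():
    """Return the ptxas release string (e.g. '12.3'), or None if ptxas
    is not on the PATH. Intended to be run on each cluster node to spot
    the one with an older CUDA toolkit."""
    path = shutil.which("ptxas")
    if path is None:
        return None
    out = subprocess.run([path, "--version"],
                         capture_output=True, text=True).stdout
    # ptxas --version prints a line like: "Cuda compilation tools, release 12.3, ..."
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None

print(ptxas_version())
```

Comparing the output across nodes against `nvidia-smi` (driver-side version) should make a mismatched node stand out.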
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
2.16
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04.3 LTS
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
A100 40GB
Current behavior?
I have a custom training loop that calls functions that are jit-compiled. I get the error message below when using a batch size of 512. However, if I change the batch size to 256, for example (or 128), I no longer get this error. This is very strange, because the error about the ptxas version (which, from my understanding, is related to the CUDA toolkit version) should have nothing to do with the batch size. So I think the batch size of 512 may be triggering another error (possibly a memory issue?) and the wrong error is being thrown ... I am not sure, but let me know what you think.
Thanks and sorry for not being able to provide MWE.
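One plausible reason batch size matters here: XLA compiles a separate executable per input shape, so a previously unseen batch size triggers a fresh compile, and that is the point where ptxas is invoked. The toy sketch below is not TensorFlow's actual internals, just an illustration of that shape-keyed caching under the assumption that compilation (and hence any ptxas failure) only happens on a cache miss:

```python
# Toy model of XLA's shape-keyed compilation cache. Each new input
# shape causes a fresh compile; in real XLA that is where HLO is
# lowered to PTX and ptxas runs, so an incompatible ptxas version
# would only surface on the first call with an unseen batch size.

compile_cache = {}

def compiled_step(batch_shape):
    if batch_shape not in compile_cache:
        # Cache miss: this stands in for the lower-to-PTX + ptxas step.
        compile_cache[batch_shape] = f"executable_for_{batch_shape}"
    return compile_cache[batch_shape]

compiled_step((256, 128))   # compiles
compiled_step((256, 128))   # cache hit, no recompile
compiled_step((512, 128))   # new batch size -> fresh compile
print(len(compile_cache))   # two distinct executables
```

If that assumption holds, a batch size of 512 would not itself be invalid; it would simply be the first shape whose compilation reached the broken ptxas on that node.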
Standalone code to reproduce the issue
Relevant log output