
Tensorflow_xla model inference crash on Jetson AGX xavier #48104

Open
lcx2017 opened this issue Mar 26, 2021 · 6 comments
Labels: comp:xla (XLA), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:bug (Bug)

Comments


lcx2017 commented Mar 26, 2021

I am hitting a TensorFlow XLA crash during model inference on an NVIDIA Jetson AGX Xavier (aarch64) system.

System information

  • Have I written custom code: Yes; I have several CPU-computed custom ops placed at different points in the middle of the network
  • OS Platform and Distribution: Linux Ubuntu 18.04
  • Device: NVIDIA Jetson AGX Xavier
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: tensorflow-2.4.1
  • Python version: 3.6.8
  • Bazel version (if compiling from source): 3.1.0
  • GCC/Compiler version (if compiling from source): 7.5.0
  • CUDA/cuDNN version: CUDA 10.2, cuDNN 8.0
  • GPU model and memory: Jetson AGX Xavier integrated GPU; 32 GB device memory shared between CPU and GPU; the model contains CPU custom ops

crash.log
gdb.log

Issues:
1. With XLA enabled, model inference crashes almost every run (Xavier aarch64 system). A sketch of a typical XLA-enabled inference setup follows this list.
2. When I turn off some of the custom ops (CPU compute ops), the crash still happens in about 7 out of 10 runs (Xavier aarch64 system).
3. The same code with the same TensorFlow 2.4.1 runs successfully, without crashing, on an x86 system with a V100 GPU.
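For context, a minimal sketch of how XLA is typically enabled for inference in TF 2.4. The real model, custom ops, and flags used here are not shared, so the path, signature name, and input shape below are placeholders.

```python
import tensorflow as tf

# Turn on XLA auto-clustering globally (equivalent to TF_XLA_FLAGS=--tf_xla_auto_jit=2).
tf.config.optimizer.set_jit(True)

# Load a SavedModel and run one inference step; path and input shape are placeholders.
model = tf.saved_model.load("/path/to/saved_model")
infer = model.signatures["serving_default"]
outputs = infer(tf.ones([1, 224, 224, 3], dtype=tf.float32))
```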

@amahendrakar (Contributor)

@lcx2017,
In order to expedite the troubleshooting process, could you please provide a minimal code snippet that reproduces the issue reported here, along with the dataset you are using? Thanks!

@amahendrakar (Contributor)

Also, TensorFlow v2.4.1 is compatible with CUDA 11.0 and cuDNN 8.0. Please take a look at the tested build configurations for more information.

| Version | Python version | Compiler | Build tools | cuDNN | CUDA |
|---|---|---|---|---|---|
| tensorflow-2.4.0 | 3.6-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 8.0 | 11.0 |
| tensorflow-2.3.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 7.6 | 10.1 |
| tensorflow-2.2.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 2.0.0 | 7.6 | 10.1 |

Could you please update CUDA to v11.0 and check whether you still face the same error (a quick way to verify which CUDA/cuDNN versions your build uses is sketched below). Thanks!
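As a quick check, the CUDA and cuDNN versions a TensorFlow binary was built against can be read from Python via tf.sysconfig.get_build_info() (available since TF 2.3); a short sketch:

```python
import tensorflow as tf

# Print the versions this TensorFlow binary was compiled against. On the Jetson
# build described above this should report CUDA 10.2 / cuDNN 8, whereas the
# tested configuration for TF 2.4 is CUDA 11.0 / cuDNN 8.0.
info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
```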

amahendrakar added the stat:awaiting response (Status - Awaiting response from author), TF 2.4 (for issues related to TF 2.4), and type:support (Support issues) labels and removed the type:bug (Bug) label on Mar 26, 2021

lcx2017 commented Mar 28, 2021

> Also, TensorFlow v2.4.1 is compatible with CUDA 11.0 and cuDNN 8.0. Please take a look at the tested build configurations for more information.
>
> | Version | Python version | Compiler | Build tools | cuDNN | CUDA |
> |---|---|---|---|---|---|
> | tensorflow-2.4.0 | 3.6-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 8.0 | 11.0 |
> | tensorflow-2.3.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 7.6 | 10.1 |
> | tensorflow-2.2.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 2.0.0 | 7.6 | 10.1 |
>
> Could you please update CUDA to v11.0 and check if you are still facing the same error. Thanks!

The Jetson AGX Xavier platform only supports updating the system AI tool chain through the NVIDIA JetPack SDK. I am currently on JetPack 4.4, and JetPack does not yet support CUDA 11.0. https://developer.nvidia.com/jetpack-sdk-44-archive


lcx2017 commented Mar 28, 2021

> @lcx2017,
> In order to expedite the troubleshooting process, could you please provide a minimal code snippet that reproduces the issue reported here, along with the dataset you are using? Thanks!

Do you have a Jetson AGX Xavier platform available to reproduce this issue?

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Mar 30, 2021
@amahendrakar (Contributor)

@lcx2017,
Thank you for the update.

Currently, I do not have an NVIDIA Jetson AGX Xavier Developer Kit, but a minimal code snippet would help us debug the issue and determine the source of the error more easily.

amahendrakar added the stat:awaiting response (Status - Awaiting response from author) label on Mar 30, 2021

lcx2017 commented Mar 31, 2021

> @lcx2017,
> Thank you for the update.
>
> Currently, I do not have the NVIDIA Jetson AGX Xavier Developer Kit. But a minimal code snippet would help us debug the issue and determine the source of the error easily.

Thanks a lot!
On JetPack 4.4 (CUDA 10.2, cuDNN 8.0, TensorRT 7.1.3.0, TensorFlow 2.4.1):
1. The code cannot be shared because it is a trade secret, sorry about that.
2. The custom op implements a common pixel-level computation with some if/else logic, so we did not implement it as a CUDA kernel.
3. After turning off XLA compilation for all the custom ops (see the sketch after this list for how XLA can be scoped per function), the crash still occurs: 6 crashes in 20 runs.
4. After enabling XLA compilation for the whole network, it crashed 20 times out of 20 runs.
5. After disabling XLA for the whole network, the pure TensorFlow model runs successfully without crashing.
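For illustration, a sketch of how XLA can be forced on only part of the model in TF 2.4, keeping the CPU custom op outside the compiled region. backbone and my_pixel_op below are hypothetical stand-ins for the real, unshared code.

```python
import tensorflow as tf

# Hypothetical stand-ins for the real (unshared) backbone and pixel-level custom op.
backbone = tf.keras.Sequential([tf.keras.layers.Conv2D(8, 3, padding="same")])

def my_pixel_op(x):
    # Placeholder for the pixel-level custom op with if/else logic described above.
    return tf.where(x > 0.5, x, tf.zeros_like(x))

@tf.function(experimental_compile=True)  # force XLA compilation of this part only (TF 2.4 API)
def run_backbone(x):
    return backbone(x)

@tf.function  # not forced through XLA; the custom op runs as a normal TF op
def run_custom_op(x):
    return my_pixel_op(x)

x = tf.random.uniform([1, 64, 64, 3])
y = run_custom_op(run_backbone(x))
```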

I have also tested JetPack 4.3 on the Jetson AGX Xavier and found there is no crash on JetPack 4.3 (TensorFlow 2.2) (https://developer.nvidia.com/jetpack-43-archive).
On JetPack 4.3 (CUDA 10.0, cuDNN 7.6.3, TensorRT 6.0.1.10, TensorFlow 2.2):
1. After enabling XLA for the whole network, the TensorFlow XLA model runs successfully without crashing, and the latency is good.

Comparing JetPack 4.4 with TensorFlow 2.4 against JetPack 4.3 with TensorFlow 2.2, I also found a new latency regression for the tf.math.unsorted_segment_max op:
JetPack 4.3 with TensorFlow 2.2: latency of the tf.math.unsorted_segment_max op on Jetson AGX Xavier: 2-3 ms
JetPack 4.4 with TensorFlow 2.4: latency of the tf.math.unsorted_segment_max op on Jetson AGX Xavier: 120 ms
A rough way to time this op in isolation is sketched below.

(attached screenshot)
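For reference, a rough micro-benchmark of tf.math.unsorted_segment_max in isolation. The shapes and segment count are made up rather than taken from the real model, so absolute numbers will differ from the 2-3 ms / 120 ms figures above.

```python
import time
import tensorflow as tf

# Synthetic data; shapes are placeholders, not the real model's.
data = tf.random.uniform([100000, 64])
segment_ids = tf.random.uniform([100000], maxval=1000, dtype=tf.int32)

@tf.function
def seg_max(d, ids):
    return tf.math.unsorted_segment_max(d, ids, num_segments=1000)

seg_max(data, segment_ids)  # warm-up / tracing run

start = time.perf_counter()
for _ in range(10):
    out = seg_max(data, segment_ids)
_ = out.numpy()  # force execution to complete before stopping the timer
print("avg latency: %.2f ms" % ((time.perf_counter() - start) / 10 * 1000))
```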

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Apr 2, 2021
amahendrakar added the comp:xla (XLA) label on Apr 6, 2021
amahendrakar assigned ymodak and unassigned amahendrakar on Apr 6, 2021
ymodak added the type:bug (Bug) label and removed the type:support (Support issues) label on Apr 15, 2021
ymodak assigned r4nt and unassigned ymodak on Apr 15, 2021
ymodak added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Apr 15, 2021