
Tensorflow_xla model inference crash on Jetson AGX xavier #48104

Open
lcx2017 opened this issue Mar 26, 2021 · 6 comments
Labels: comp:xla (XLA), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:bug (Bug)

Comments


lcx2017 commented Mar 26, 2021

I am hitting a TensorFlow XLA crash during model inference on an NVIDIA Jetson AGX Xavier (aarch64) system.

System information

  • Have I written custom code: Yes; I have several CPU-computed custom ops placed at different points in the middle of the network
  • OS Platform and Distribution: Linux Ubuntu 18.04
  • Device: NVIDIA Jetson AGX Xavier
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: tensorflow-2.4.1
  • Python version: 3.6.8
  • Bazel version (if compiling from source): 3.1.0
  • GCC/Compiler version (if compiling from source): 7.5.0
  • CUDA/cuDNN version: CUDA 10.2, cuDNN 8.0
  • GPU model and memory: Jetson AGX Xavier integrated GPU; 32 GB device memory shared between CPU and GPU; the model contains CPU custom ops

crash.log
gdb.log

Issues:
1. With XLA enabled, model inference crashes almost every run (Xavier aarch64 system). A sketch of a typical XLA-enabled inference setup follows this list.
2. When I turn off some of the custom ops (CPU compute ops), the crash still happens in about 7 out of 10 runs (Xavier aarch64 system).
3. The same code with the same TensorFlow 2.4.1 runs successfully, without crashing, on an x86 system with a V100 GPU.
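For context, a minimal sketch of how XLA is typically enabled for inference in TF 2.4. The real model, custom ops, and flags used here are not shared, so the path, signature name, and input shape below are placeholders.

```python
import tensorflow as tf

# Turn on XLA auto-clustering globally (equivalent to TF_XLA_FLAGS=--tf_xla_auto_jit=2).
tf.config.optimizer.set_jit(True)

# Load a SavedModel and run one inference step; path and input shape are placeholders.
model = tf.saved_model.load("/path/to/saved_model")
infer = model.signatures["serving_default"]
outputs = infer(tf.ones([1, 224, 224, 3], dtype=tf.float32))
```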

@amahendrakar (Contributor)

@lcx2017,
In order to expedite the troubleshooting process, could you please provide a minimal code snippet that reproduces the issue reported here, along with the dataset you are using? Thanks!

@amahendrakar (Contributor)

Also, TensorFlow v2.4.1 is compatible with CUDA 11.0 and cuDNN 8.0. Please take a look at the tested build configurations for more information.

| Version | Python version | Compiler | Build tools | cuDNN | CUDA |
|---|---|---|---|---|---|
| tensorflow-2.4.0 | 3.6-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 8.0 | 11.0 |
| tensorflow-2.3.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 7.6 | 10.1 |
| tensorflow-2.2.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 2.0.0 | 7.6 | 10.1 |

Could you please update CUDA to v11.0 and check whether you still face the same error (a quick way to verify which CUDA/cuDNN versions your build uses is sketched below). Thanks!
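As a quick check, the CUDA and cuDNN versions a TensorFlow binary was built against can be read from Python via tf.sysconfig.get_build_info() (available since TF 2.3); a short sketch:

```python
import tensorflow as tf

# Print the versions this TensorFlow binary was compiled against. On the Jetson
# build described above this should report CUDA 10.2 / cuDNN 8, whereas the
# tested configuration for TF 2.4 is CUDA 11.0 / cuDNN 8.0.
info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
```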

amahendrakar added the stat:awaiting response (Status - Awaiting response from author), TF 2.4 (for issues related to TF 2.4), and type:support (Support issues) labels and removed the type:bug (Bug) label on Mar 26, 2021

lcx2017 commented Mar 28, 2021

> Also, TensorFlow v2.4.1 is compatible with CUDA 11.0 and cuDNN 8.0. Please take a look at the tested build configurations for more information.
>
> | Version | Python version | Compiler | Build tools | cuDNN | CUDA |
> |---|---|---|---|---|---|
> | tensorflow-2.4.0 | 3.6-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 8.0 | 11.0 |
> | tensorflow-2.3.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 7.6 | 10.1 |
> | tensorflow-2.2.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 2.0.0 | 7.6 | 10.1 |
>
> Could you please update CUDA to v11.0 and check if you are still facing the same error. Thanks!

The Jetson AGX Xavier platform only supports updating the system AI tool chain through the NVIDIA JetPack SDK. I am currently on JetPack 4.4, and JetPack does not yet support CUDA 11.0. https://developer.nvidia.com/jetpack-sdk-44-archive


lcx2017 commented Mar 28, 2021

> @lcx2017,
> In order to expedite the troubleshooting process, could you please provide a minimal code snippet that reproduces the issue reported here, along with the dataset you are using? Thanks!

Do you have a Jetson AGX Xavier platform available to reproduce this issue?

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Mar 30, 2021
@amahendrakar (Contributor)

@lcx2017,
Thank you for the update.

Currently, I do not have an NVIDIA Jetson AGX Xavier Developer Kit, but a minimal code snippet would help us debug the issue and determine the source of the error more easily.

amahendrakar added the stat:awaiting response (Status - Awaiting response from author) label on Mar 30, 2021

lcx2017 commented Mar 31, 2021

> @lcx2017,
> Thank you for the update.
>
> Currently, I do not have the NVIDIA Jetson AGX Xavier Developer Kit. But a minimal code snippet would help us debug the issue and determine the source of the error easily.

Thanks a lot!
On JetPack 4.4 (CUDA 10.2, cuDNN 8.0, TensorRT 7.1.3.0, TensorFlow 2.4.1):
1. The code cannot be shared because it is a trade secret, sorry about that.
2. The custom op implements a common pixel-level computation with some if/else logic, so we did not implement it as a CUDA kernel.
3. After turning off XLA compilation for all the custom ops (see the sketch after this list for how XLA can be scoped per function), the crash still occurs: 6 crashes in 20 runs.
4. After enabling XLA compilation for the whole network, it crashed 20 times out of 20 runs.
5. After disabling XLA for the whole network, the pure TensorFlow model runs successfully without crashing.
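For illustration, a sketch of how XLA can be forced on only part of the model in TF 2.4, keeping the CPU custom op outside the compiled region. backbone and my_pixel_op below are hypothetical stand-ins for the real, unshared code.

```python
import tensorflow as tf

# Hypothetical stand-ins for the real (unshared) backbone and pixel-level custom op.
backbone = tf.keras.Sequential([tf.keras.layers.Conv2D(8, 3, padding="same")])

def my_pixel_op(x):
    # Placeholder for the pixel-level custom op with if/else logic described above.
    return tf.where(x > 0.5, x, tf.zeros_like(x))

@tf.function(experimental_compile=True)  # force XLA compilation of this part only (TF 2.4 API)
def run_backbone(x):
    return backbone(x)

@tf.function  # not forced through XLA; the custom op runs as a normal TF op
def run_custom_op(x):
    return my_pixel_op(x)

x = tf.random.uniform([1, 64, 64, 3])
y = run_custom_op(run_backbone(x))
```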

I have also tested JetPack 4.3 on the Jetson AGX Xavier and found there is no crash on JetPack 4.3 (TensorFlow 2.2) (https://developer.nvidia.com/jetpack-43-archive).
On JetPack 4.3 (CUDA 10.0, cuDNN 7.6.3, TensorRT 6.0.1.10, TensorFlow 2.2):
1. After enabling XLA for the whole network, the TensorFlow XLA model runs successfully without crashing, and the latency is good.

Comparing JetPack 4.4 with TensorFlow 2.4 against JetPack 4.3 with TensorFlow 2.2, I also found a new latency regression for the tf.math.unsorted_segment_max op:
JetPack 4.3 with TensorFlow 2.2: latency of the tf.math.unsorted_segment_max op on Jetson AGX Xavier: 2-3 ms
JetPack 4.4 with TensorFlow 2.4: latency of the tf.math.unsorted_segment_max op on Jetson AGX Xavier: 120 ms
A rough way to time this op in isolation is sketched below.

(attached screenshot)
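For reference, a rough micro-benchmark of tf.math.unsorted_segment_max in isolation. The shapes and segment count are made up rather than taken from the real model, so absolute numbers will differ from the 2-3 ms / 120 ms figures above.

```python
import time
import tensorflow as tf

# Synthetic data; shapes are placeholders, not the real model's.
data = tf.random.uniform([100000, 64])
segment_ids = tf.random.uniform([100000], maxval=1000, dtype=tf.int32)

@tf.function
def seg_max(d, ids):
    return tf.math.unsorted_segment_max(d, ids, num_segments=1000)

seg_max(data, segment_ids)  # warm-up / tracing run

start = time.perf_counter()
for _ in range(10):
    out = seg_max(data, segment_ids)
_ = out.numpy()  # force execution to complete before stopping the timer
print("avg latency: %.2f ms" % ((time.perf_counter() - start) / 10 * 1000))
```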

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Apr 2, 2021
amahendrakar added the comp:xla (XLA) label on Apr 6, 2021
amahendrakar assigned ymodak and unassigned amahendrakar on Apr 6, 2021
ymodak added the type:bug (Bug) label and removed the type:support (Support issues) label on Apr 15, 2021
ymodak assigned r4nt and unassigned ymodak on Apr 15, 2021
ymodak added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Apr 15, 2021