Multi-GPU workstation crashes during tf.Session() #26653

Closed
sbrodehl opened this issue Mar 13, 2019 · 8 comments

sbrodehl (Contributor) commented Mar 13, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux 5.0.0-arch1-1-ARCH #1 SMP PREEMPT Mon Mar 4 14:11:43 UTC 2019 x86_64 GNU/Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
  • TensorFlow installed from (source or binary): community package python-tensorflow-opt-cuda
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.7.2
  • Bazel version (if compiling from source): None
  • GCC/Compiler version (if compiling from source): None
  • CUDA/cuDNN version: V10.0.130 / 7.5.0
  • GPU model and memory: 2 x GeForce GTX 1080 Ti 11 GB; Driver Version: 418.43

Describe the current behavior
The workstation crashes completely when a tf.Session() is created while multiple GPUs are present.

I will roll back the most recent driver update and report back.
I am not sure whether this is something TensorFlow can fix; it may simply be a faulty driver.

Describe the expected behavior
Workstation should not crash.

Code to reproduce the issue

import tensorflow as tf
s = tf.Session()

or, in short:
python -c "import tensorflow as tf; s = tf.Session()"
The following line crashes as well:
CUDA_VISIBLE_DEVICES="0,1" python -c "import tensorflow as tf; s = tf.Session()"

Other info / logs
The problem exists only when multiple GPUs are present, so the following commands work as expected (a single-GPU workaround sketch follows these logs):

  • CUDA_VISIBLE_DEVICES="" python -c "import tensorflow as tf; s = tf.Session()"
2019-03-13 09:49:14.192325: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3193000000 Hz
2019-03-13 09:49:14.193496: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55f504e31540 executing computations on platform Host. Devices:
2019-03-13 09:49:14.193508: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-13 09:49:14.201742: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-03-13 09:49:14.201756: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: ***
2019-03-13 09:49:14.201773: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: ***
2019-03-13 09:49:14.201820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported version is: 418.43.0
2019-03-13 09:49:14.201833: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported version is: 418.43.0
2019-03-13 09:49:14.201837: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version seems to match DSO: 418.43.0
  • CUDA_VISIBLE_DEVICES="0" python -c "import tensorflow as tf; s = tf.Session()"
2019-03-13 09:50:13.918948: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3193000000 Hz
2019-03-13 09:50:13.919562: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55be94d7cdb0 executing computations on platform Host. Devices:
2019-03-13 09:50:13.919598: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-13 09:50:14.009982: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-13 09:50:14.010633: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55be949e86f0 executing computations on platform CUDA. Devices:
2019-03-13 09:50:14.010646: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-03-13 09:50:14.011034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.32GiB
2019-03-13 09:50:14.011044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-13 09:50:14.778440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 09:50:14.778463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-13 09:50:14.778467: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-13 09:50:14.778759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9970 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
  • CUDA_VISIBLE_DEVICES="1" python -c "import tensorflow as tf; s = tf.Session()"
2019-03-13 09:50:19.398946: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3193000000 Hz
2019-03-13 09:50:19.400142: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb2ee074e0 executing computations on platform Host. Devices:
2019-03-13 09:50:19.400177: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-13 09:50:19.480237: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-13 09:50:19.480807: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55eb2e7cc660 executing computations on platform CUDA. Devices:
2019-03-13 09:50:19.480820: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-03-13 09:50:19.481144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 10.92GiB freeMemory: 10.77GiB
2019-03-13 09:50:19.481169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-13 09:50:19.790419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 09:50:19.790445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-13 09:50:19.790464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-13 09:50:19.790752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10411 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
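
Since the crash appears only when both GPUs are initialized, one possible stop-gap (an editor's sketch, not something the reporter tested) is to expose only a single GPU to TensorFlow from within Python via the TF 1.x session config. This has a similar effect to setting CUDA_VISIBLE_DEVICES externally; it may avoid the crash if the fault is triggered by TensorFlow's peer-to-peer setup between the two cards, but the driver still enumerates both GPUs, so it is not a guaranteed fix.

# Workaround sketch (untested against this bug): restrict TensorFlow 1.x to a
# single GPU through the session config instead of CUDA_VISIBLE_DEVICES.
import tensorflow as tf

config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(visible_device_list="0")  # expose only GPU 0 to TF
)
s = tf.Session(config=config)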

I ran the code on a different machine with only one GPU and it worked just fine.

2019-03-13 09:55:10.169250: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2019-03-13 09:55:10.193346: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3801830000 Hz
2019-03-13 09:55:10.194746: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55667abb6310 executing computations on platform Host. Devices:
2019-03-13 09:55:10.194784: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-13 09:55:10.992667: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55667bfd5a40 executing computations on platform CUDA. Devices:
2019-03-13 09:55:10.992720: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2019-03-13 09:55:10.994345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:65:00.0
totalMemory: 23.62GiB freeMemory: 23.45GiB
2019-03-13 09:55:10.994378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-13 09:55:11.302104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-13 09:55:11.302139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-13 09:55:11.302143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-13 09:55:11.302652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22722 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:65:00.0, compute capability: 7.5)
sbrodehl (Contributor, Author) commented:

@davidenunes mentions the same issue in #18652 (comment), where downgrading the nvidia package to 415.27 and linux kernel to 4.20.11 solved the issue.

sbrodehl (Contributor, Author) commented:

I was able to get some detailed error logs:

Mar 13 10:58:49.229631 myhostname kernel: nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 236
Mar 13 10:58:49.947274 myhostname kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
Mar 13 10:58:49.947376 myhostname kernel: #PF error: [normal kernel read fault]
Mar 13 10:58:49.958319 myhostname kernel: PGD 8000000fd71fa067 P4D 8000000fd71fa067 PUD 0 
Mar 13 10:58:49.958392 myhostname kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Mar 13 10:58:49.958413 myhostname kernel: CPU: 9 PID: 1208 Comm: python Tainted: P           OE     5.0.0-arch1-1-ARCH #1
Mar 13 10:58:49.958427 myhostname kernel: Hardware name: Gigabyte Technology Co., Ltd. Z370 AORUS Gaming 7/Z370 AORUS Gaming 7, BIOS F5l 01/10/2018
Mar 13 10:58:49.958441 myhostname kernel: RIP: 0010:nv_dma_map_peer+0xd0/0x160 [nvidia]
Mar 13 10:58:49.958453 myhostname kernel: Code: ce e8 b4 fd ff ff 48 8b 5c 24 10 65 48 33 1c 25 28 00 00 00 0f 85 8e 00 00 00 48 83 c4 18 5b 5d 41 5c c3 48 8b 05 78 dc 7e f6 <48> 83 78 40 00 75 ca 49 c1 e2 06 49 8b 78 10 48 89 e6 4b 8d 94 10
Mar 13 10:58:49.958465 myhostname kernel: RSP: 0018:ffffbb9c52edf9a0 EFLAGS: 00010246
Mar 13 10:58:49.958477 myhostname kernel: RAX: 0000000000000000 RBX: ffff998fe5fd08f0 RCX: 0000000000000010
Mar 13 10:58:49.958489 myhostname kernel: RDX: 0000000000000001 RSI: ffff99900888f000 RDI: ffff99900888c800
Mar 13 10:58:49.958500 myhostname kernel: RBP: 00000000d0000000 R08: ffff9990173f0000 R09: 00000000dfffffff
Mar 13 10:58:49.958512 myhostname kernel: R10: 0000000000000001 R11: 0000000000010000 R12: 00000000d0000000
Mar 13 10:58:49.958525 myhostname kernel: R13: ffff99900888f000 R14: 000000000888c800 R15: ffff998fe5fd08c8
Mar 13 10:58:49.958537 myhostname kernel: FS:  00007f10ab1f8600(0000) GS:ffff99901ea40000(0000) knlGS:0000000000000000
Mar 13 10:58:49.958548 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 10:58:49.958560 myhostname kernel: CR2: 0000000000000040 CR3: 0000000fdf7f8001 CR4: 00000000003606e0
Mar 13 10:58:49.958572 myhostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 13 10:58:49.958582 myhostname kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 13 10:58:49.958594 myhostname kernel: Call Trace:
Mar 13 10:58:49.958604 myhostname kernel:  _nv029594rm+0x265/0x490 [nvidia]
Mar 13 10:58:49.958615 myhostname kernel:  ? _nv025108rm+0x17c/0x3f0 [nvidia]
Mar 13 10:58:49.958625 myhostname kernel:  ? _nv026018rm+0xd7/0x270 [nvidia]
Mar 13 10:58:49.958636 myhostname kernel:  ? _nv010042rm+0x122/0x1b0 [nvidia]
Mar 13 10:58:49.958646 myhostname kernel:  ? _nv010038rm+0xc7/0x300 [nvidia]
Mar 13 10:58:49.958656 myhostname kernel:  ? _nv009772rm+0x2a2/0x770 [nvidia]
Mar 13 10:58:49.958671 myhostname kernel:  ? _nv009773rm+0x2a1/0x500 [nvidia]
Mar 13 10:58:49.958683 myhostname kernel:  ? _nv029782rm+0x367/0x870 [nvidia]
Mar 13 10:58:49.958694 myhostname kernel:  ? _nv003494rm+0x4e/0x70 [nvidia]
Mar 13 10:58:49.958705 myhostname kernel:  ? _nv004154rm+0xd9/0x180 [nvidia]
Mar 13 10:58:49.958715 myhostname kernel:  ? _nv032625rm+0x48/0x90 [nvidia]
Mar 13 10:58:49.958726 myhostname kernel:  ? _nv006068rm+0x1ca/0x350 [nvidia]
Mar 13 10:58:49.958736 myhostname kernel:  ? _nv033934rm+0x3cc/0x5e0 [nvidia]
Mar 13 10:58:49.958747 myhostname kernel:  ? _nv033933rm+0x98/0xf0 [nvidia]
Mar 13 10:58:49.958758 myhostname kernel:  ? _nv032726rm+0xfc/0x270 [nvidia]
Mar 13 10:58:49.958770 myhostname kernel:  ? _nv032727rm+0x53/0x80 [nvidia]
Mar 13 10:58:49.958781 myhostname kernel:  ? _nv007017rm+0x4b/0x80 [nvidia]
Mar 13 10:58:49.958791 myhostname kernel:  ? _nv000935rm+0x46e/0x900 [nvidia]
Mar 13 10:58:49.958822 myhostname kernel:  ? _raw_spin_unlock_irqrestore+0x20/0x40
Mar 13 10:58:49.958834 myhostname kernel:  ? rm_ioctl+0x54/0xb0 [nvidia]
Mar 13 10:58:49.958843 myhostname kernel:  ? filemap_map_pages+0x185/0x380
Mar 13 10:58:49.958855 myhostname kernel:  ? nvidia_ioctl+0xb0/0x7c0 [nvidia]
Mar 13 10:58:49.958867 myhostname kernel:  ? nvidia_ioctl+0x5f0/0x7c0 [nvidia]
Mar 13 10:58:49.958877 myhostname kernel:  ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
Mar 13 10:58:49.958887 myhostname kernel:  ? do_vfs_ioctl+0xa4/0x630
Mar 13 10:58:49.958896 myhostname kernel:  ? handle_mm_fault+0x10a/0x250
Mar 13 10:58:49.958908 myhostname kernel:  ? ksys_ioctl+0x60/0x90
Mar 13 10:58:49.958918 myhostname kernel:  ? __x64_sys_ioctl+0x16/0x20
Mar 13 10:58:49.958929 myhostname kernel:  ? do_syscall_64+0x5b/0x170
Mar 13 10:58:49.958941 myhostname kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 13 10:58:49.958952 myhostname kernel: Modules linked in: nvidia_uvm(OE) ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc mousedev hid_logitech_hidpp joydev input_leds hid_logitech_dj overlay hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi uas usb_storage nvidia(POE) intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass drm_kms_helper snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic drm ledtrig_audio crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec agpgart ipmi_devintf ipmi_msghandler snd_hda_core syscopyarea sysfillrect sysimgblt fb_sys_fops snd_hwdep iTCO_wdt iTCO_vendor_support aesni_intel wmi_bmof mxm_wmi intel_wmi_thunderbolt snd_pcm aes_x86_64 crypto_simd cryptd snd_timer glue_helper intel_cstate snd e1000e mei_me mei soundcore alx intel_uncore mdio i2c_i801
Mar 13 10:58:49.958976 myhostname kernel:  pcspkr intel_rapl_perf evdev mac_hid wmi pcc_cpufreq crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sr_mod cdrom sd_mod ahci libahci libata xhci_pci crc32c_intel xhci_hcd scsi_mod
Mar 13 10:58:49.958989 myhostname kernel: CR2: 0000000000000040
Mar 13 10:58:49.958999 myhostname kernel: ---[ end trace b892d5812d52b5d5 ]---
Mar 13 10:58:49.959010 myhostname kernel: RIP: 0010:nv_dma_map_peer+0xd0/0x160 [nvidia]
Mar 13 10:58:49.959021 myhostname kernel: Code: ce e8 b4 fd ff ff 48 8b 5c 24 10 65 48 33 1c 25 28 00 00 00 0f 85 8e 00 00 00 48 83 c4 18 5b 5d 41 5c c3 48 8b 05 78 dc 7e f6 <48> 83 78 40 00 75 ca 49 c1 e2 06 49 8b 78 10 48 89 e6 4b 8d 94 10
Mar 13 10:58:49.959031 myhostname kernel: RSP: 0018:ffffbb9c52edf9a0 EFLAGS: 00010246
Mar 13 10:58:49.959044 myhostname kernel: RAX: 0000000000000000 RBX: ffff998fe5fd08f0 RCX: 0000000000000010
Mar 13 10:58:49.959054 myhostname kernel: RDX: 0000000000000001 RSI: ffff99900888f000 RDI: ffff99900888c800
Mar 13 10:58:49.959065 myhostname kernel: RBP: 00000000d0000000 R08: ffff9990173f0000 R09: 00000000dfffffff
Mar 13 10:58:49.959075 myhostname kernel: R10: 0000000000000001 R11: 0000000000010000 R12: 00000000d0000000
Mar 13 10:58:49.959085 myhostname kernel: R13: ffff99900888f000 R14: 000000000888c800 R15: ffff998fe5fd08c8
Mar 13 10:58:49.959097 myhostname kernel: FS:  00007f10ab1f8600(0000) GS:ffff99901ea40000(0000) knlGS:0000000000000000
Mar 13 10:58:49.959107 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 10:58:49.959118 myhostname kernel: CR2: 0000000000000040 CR3: 0000000fdf7f8001 CR4: 00000000003606e0
Mar 13 10:58:49.959128 myhostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 13 10:58:49.959138 myhostname kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Here is the kernel log from an nvidia-smi call after the crash:

Mar 13 11:01:55.588061 myhostname kernel: BUG: unable to handle kernel paging request at ffffbb9c52edfdb8
Mar 13 11:01:55.588156 myhostname kernel: #PF error: [normal kernel read fault]
Mar 13 11:01:55.594782 myhostname kernel: PGD 101e536067 P4D 101e536067 PUD 101e537067 PMD ffb0c6067 PTE 0
Mar 13 11:01:55.594839 myhostname kernel: Oops: 0000 [#2] PREEMPT SMP PTI
Mar 13 11:01:55.594871 myhostname kernel: CPU: 4 PID: 1334 Comm: nvidia-smi Tainted: P      D    OE     5.0.0-arch1-1-ARCH #1
Mar 13 11:01:55.594883 myhostname kernel: Hardware name: Gigabyte Technology Co., Ltd. Z370 AORUS Gaming 7/Z370 AORUS Gaming 7, BIOS F5l 01/10/2018
Mar 13 11:01:55.594895 myhostname kernel: RIP: 0010:_nv006968rm+0x2c/0x330 [nvidia]
Mar 13 11:01:55.594907 myhostname kernel: Code: 48 85 d2 74 07 48 63 47 08 48 01 d0 48 8b 17 48 85 d2 75 16 e9 9d 02 00 00 0f 1f 44 00 00 48 8b 4a 10 48 85 c9 74 17 48 89 ca <48> 39 32 77 ef 0f 83 29 02 00 00 48 8b 4a 18 48 85 c9 75 e9 48 89
Mar 13 11:01:55.594920 myhostname kernel: RSP: 0018:ffffbb9c462bfd30 EFLAGS: 00010086
Mar 13 11:01:55.594932 myhostname kernel: RAX: ffffbb9c462bfdb8 RBX: ffffbb9c462bfd60 RCX: 0000000000000000
Mar 13 11:01:55.594942 myhostname kernel: RDX: ffffbb9c52edfdb8 RSI: 0000000000000536 RDI: ffffffffc21bd2d8
Mar 13 11:01:55.594953 myhostname kernel: RBP: ffff998fc11baff0 R08: 0000000000000048 R09: 0000000000000060
Mar 13 11:01:55.594964 myhostname kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Mar 13 11:01:55.594975 myhostname kernel: R13: ffff998fe897dd20 R14: 000000000000001f R15: 0000000000000048
Mar 13 11:01:55.594987 myhostname kernel: FS:  00007fed23e5fb80(0000) GS:ffff99901e900000(0000) knlGS:0000000000000000
Mar 13 11:01:55.594997 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 11:01:55.595011 myhostname kernel: CR2: ffffbb9c52edfdb8 CR3: 0000001017b90003 CR4: 00000000003606e0
Mar 13 11:01:55.595024 myhostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 13 11:01:55.595041 myhostname kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 13 11:01:55.595054 myhostname kernel: Call Trace:
Mar 13 11:01:55.595066 myhostname kernel:  ? _nv035386rm+0xf1/0x1d0 [nvidia]
Mar 13 11:01:55.595078 myhostname kernel:  ? rm_perform_version_check+0x35/0x150 [nvidia]
Mar 13 11:01:55.595089 myhostname kernel:  ? __kmalloc+0x188/0x210
Mar 13 11:01:55.595100 myhostname kernel:  ? nvidia_ioctl+0xd6/0x7c0 [nvidia]
Mar 13 11:01:55.595112 myhostname kernel:  ? nvidia_ioctl+0x630/0x7c0 [nvidia]
Mar 13 11:01:55.595122 myhostname kernel:  ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
Mar 13 11:01:55.595134 myhostname kernel:  ? do_vfs_ioctl+0xa4/0x630
Mar 13 11:01:55.595147 myhostname kernel:  ? syscall_trace_enter+0x1d3/0x2d0
Mar 13 11:01:55.595159 myhostname kernel:  ? ksys_ioctl+0x60/0x90
Mar 13 11:01:55.595170 myhostname kernel:  ? __x64_sys_ioctl+0x16/0x20
Mar 13 11:01:55.595180 myhostname kernel:  ? do_syscall_64+0x5b/0x170
Mar 13 11:01:55.595191 myhostname kernel:  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 13 11:01:55.595202 myhostname kernel: Modules linked in: nvidia_uvm(OE) ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc mousedev hid_logitech_hidpp joydev input_leds hid_logitech_dj overlay hid_generic usbhid hid nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi uas usb_storage nvidia(POE) intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass drm_kms_helper snd_hda_codec_realtek crct10dif_pclmul snd_hda_codec_generic drm ledtrig_audio crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec agpgart ipmi_devintf ipmi_msghandler snd_hda_core syscopyarea sysfillrect sysimgblt fb_sys_fops snd_hwdep iTCO_wdt iTCO_vendor_support aesni_intel wmi_bmof mxm_wmi intel_wmi_thunderbolt snd_pcm aes_x86_64 crypto_simd cryptd snd_timer glue_helper intel_cstate snd e1000e mei_me mei soundcore alx intel_uncore mdio i2c_i801
Mar 13 11:01:55.595215 myhostname kernel:  pcspkr intel_rapl_perf evdev mac_hid wmi pcc_cpufreq crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto sr_mod cdrom sd_mod ahci libahci libata xhci_pci crc32c_intel xhci_hcd scsi_mod
Mar 13 11:01:55.595226 myhostname kernel: CR2: ffffbb9c52edfdb8
Mar 13 11:01:55.595237 myhostname kernel: ---[ end trace b892d5812d52b5d6 ]---
Mar 13 11:01:55.595248 myhostname kernel: RIP: 0010:nv_dma_map_peer+0xd0/0x160 [nvidia]
Mar 13 11:01:55.595258 myhostname kernel: Code: ce e8 b4 fd ff ff 48 8b 5c 24 10 65 48 33 1c 25 28 00 00 00 0f 85 8e 00 00 00 48 83 c4 18 5b 5d 41 5c c3 48 8b 05 78 dc 7e f6 <48> 83 78 40 00 75 ca 49 c1 e2 06 49 8b 78 10 48 89 e6 4b 8d 94 10
Mar 13 11:01:55.595267 myhostname kernel: RSP: 0018:ffffbb9c52edf9a0 EFLAGS: 00010246
Mar 13 11:01:55.595280 myhostname kernel: RAX: 0000000000000000 RBX: ffff998fe5fd08f0 RCX: 0000000000000010
Mar 13 11:01:55.595290 myhostname kernel: RDX: 0000000000000001 RSI: ffff99900888f000 RDI: ffff99900888c800
Mar 13 11:01:55.595300 myhostname kernel: RBP: 00000000d0000000 R08: ffff9990173f0000 R09: 00000000dfffffff
Mar 13 11:01:55.595310 myhostname kernel: R10: 0000000000000001 R11: 0000000000010000 R12: 00000000d0000000
Mar 13 11:01:55.595321 myhostname kernel: R13: ffff99900888f000 R14: 000000000888c800 R15: ffff998fe5fd08c8
Mar 13 11:01:55.595331 myhostname kernel: FS:  00007fed23e5fb80(0000) GS:ffff99901e900000(0000) knlGS:0000000000000000
Mar 13 11:01:55.595341 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 11:01:55.595351 myhostname kernel: CR2: ffffbb9c52edfdb8 CR3: 0000001017b90003 CR4: 00000000003606e0
Mar 13 11:01:55.595361 myhostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 13 11:01:55.595371 myhostname kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 13 11:01:55.595381 myhostname kernel: note: nvidia-smi[1334] exited with preempt_count 1

byronyi (Contributor) commented Mar 13, 2019

Your kernel seems way too new for CUDA. Please make sure your CUDA version is supported on your Linux distro; otherwise, we cannot tell whether this is a TF issue or not.

sbrodehl (Contributor, Author) commented:

TensorFlow only supports CUDA 10.0, so I am not able to upgrade CUDA.
The NVIDIA 418.43 driver works with Linux kernel v5, but for now I will assume there is an issue between CUDA/the driver and Linux kernel v5.

hgaiser (Contributor) commented Mar 26, 2019

Thank you for reporting this issue; it helped me resolve it on my system. I'm running a nearly identical setup to yours: Arch Linux, 2 GPUs, Linux kernel 5.0.3, and it crashed with the same error. After seeing this issue I downgraded to linux-lts and nvidia-lts, which resolved my problem. Apparently, something in the Linux / NVIDIA / TensorFlow combination is causing issues.

lissyx (Contributor) commented Apr 23, 2019

I'm also hitting similar behavior on an up-to-date Debian Sid with two RTX 2080 Ti cards. I'm currently using nvidia-driver version 418.56-2 on kernel 4.19, with CUDA 10.0 and TensorFlow r1.13. Previously I could train properly for several hours, and back then the driver was v415.xx from experimental.

lissyx (Contributor) commented Apr 24, 2019

@sbrodehl I don't know about your case, but in mine, adjusting the batch size in DeepSpeech when training the model helped avoid the crashes. I suspect something is going on with the new driver when the GPU runs out of memory.
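
If GPU memory exhaustion under the new driver really is the trigger, a related mitigation (an editor's sketch, untested against this particular fault) is to keep TensorFlow 1.x from pre-allocating nearly all GPU memory, using the standard ConfigProto GPU options.

# Mitigation sketch (untested against this driver fault): reduce GPU memory
# pressure by allocating on demand and capping the per-process fraction.
import tensorflow as tf

gpu_options = tf.GPUOptions(
    allow_growth=True,                    # allocate GPU memory on demand
    per_process_gpu_memory_fraction=0.8,  # cap each process at ~80% of the GPU
)
s = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))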

lissyx (Contributor) commented May 3, 2019

OK, never mind, this was noise: in my case the recurring shutdown was, as I suspected at first, just a power issue. It looks like when installing the new setup I forgot that each RTX should be wired with two PCIe power cables; I had only wired one, using the Y connector to plug both connectors on each RTX.
