
DELF GLDv2 training error: Fatal Python error: Segmentation fault. #8940

@kxhit

Sorry to bother you, @andrefaraujo.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/delf/delf/python/training

2. Describe the bug

I ran train.py following the instructions, and it crashed with this error:
Fatal Python error: Segmentation fault.

Details are below. Judging by the logging.info() output, the problem may lie between lines 363 and 367 of train.py.

I0722 08:21:29.993298 140327930558272 train.py:120] Running training script with

I0722 08:21:29.993462 140327930558272 train.py:121] logdir= gldv2_training
I0722 08:21:29.993936 140327930558272 train.py:122] initial_lr= 0.010000
I0722 08:21:29.994389 140327930558272 train.py:123] block3_strides= True
2020-07-22 08:21:29.995690: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-22 08:21:30.064180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.066700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:06:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.069208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:07:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.071704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:08:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.074091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 4 with properties:
pciBusID: 0000:0c:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.076454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 5 with properties:
pciBusID: 0000:0d:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.078852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 6 with properties:
pciBusID: 0000:0e:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.081129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 7 with properties:
pciBusID: 0000:0f:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:30.081347: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-22 08:21:30.083051: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-22 08:21:30.084738: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-07-22 08:21:30.085027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-07-22 08:21:30.086848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-07-22 08:21:30.087870: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-07-22 08:21:30.091821: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-22 08:21:30.130943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-07-22 08:21:30.131306: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-22 08:21:30.146241: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2199885000 Hz
2020-07-22 08:21:30.150533: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56131c7b91b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-22 08:21:30.150559: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-22 08:21:31.578771: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56131c0f1340 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-22 08:21:31.578812: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578822: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578830: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578837: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578844: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (4): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578851: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (5): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578859: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (6): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.578866: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (7): Tesla P40, Compute Capability 6.1
2020-07-22 08:21:31.622038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.624129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:06:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.626215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:07:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.628316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:08:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.630404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 4 with properties:
pciBusID: 0000:0c:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.632471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 5 with properties:
pciBusID: 0000:0d:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.634446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 6 with properties:
pciBusID: 0000:0e:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.636402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 7 with properties:
pciBusID: 0000:0f:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 23.88GiB deviceMemoryBandwidth: 323.21GiB/s
2020-07-22 08:21:31.636453: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-22 08:21:31.636473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-22 08:21:31.636489: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-07-22 08:21:31.636506: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-07-22 08:21:31.636522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-07-22 08:21:31.636537: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-07-22 08:21:31.636554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-22 08:21:31.668968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-07-22 08:21:31.669017: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-22 08:21:31.686490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-22 08:21:31.686513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1 2 3 4 5 6 7
2020-07-22 08:21:31.686525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N Y Y Y Y Y Y Y
2020-07-22 08:21:31.686533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: Y N Y Y Y Y Y Y
2020-07-22 08:21:31.686541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2: Y Y N Y Y Y Y Y
2020-07-22 08:21:31.686548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3: Y Y Y N Y Y Y Y
2020-07-22 08:21:31.686563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 4: Y Y Y Y N Y Y Y
2020-07-22 08:21:31.686571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 5: Y Y Y Y Y N Y Y
2020-07-22 08:21:31.686578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 6: Y Y Y Y Y Y N Y
2020-07-22 08:21:31.686586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 7: Y Y Y Y Y Y Y N
2020-07-22 08:21:31.710870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22837 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0, compute capability: 6.1)
2020-07-22 08:21:31.713223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22837 MB memory) -> physical GPU (device: 1, name: Tesla P40, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-07-22 08:21:31.715586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 22837 MB memory) -> physical GPU (device: 2, name: Tesla P40, pci bus id: 0000:07:00.0, compute capability: 6.1)
2020-07-22 08:21:31.717907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 22837 MB memory) -> physical GPU (device: 3, name: Tesla P40, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-07-22 08:21:31.720255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 22837 MB memory) -> physical GPU (device: 4, name: Tesla P40, pci bus id: 0000:0c:00.0, compute capability: 6.1)
2020-07-22 08:21:31.722603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 22837 MB memory) -> physical GPU (device: 5, name: Tesla P40, pci bus id: 0000:0d:00.0, compute capability: 6.1)
2020-07-22 08:21:31.724940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 22837 MB memory) -> physical GPU (device: 6, name: Tesla P40, pci bus id: 0000:0e:00.0, compute capability: 6.1)
2020-07-22 08:21:31.727318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 22837 MB memory) -> physical GPU (device: 7, name: Tesla P40, pci bus id: 0000:0f:00.0, compute capability: 6.1)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I0722 08:21:31.735625 140327930558272 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I0722 08:21:31.735947 140327930558272 train.py:128] Number of devices: 8
WARNING:tensorflow:From /home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/ops/image_ops_impl.py:2827: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W0722 08:21:33.257451 140327930558272 deprecation.py:323] From /home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/ops/image_ops_impl.py:2827: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
seed2 arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
I0722 08:21:41.587657 140327930558272 train.py:210] Model, datasets loaded.
num_classes= 81313
I0722 08:21:41.596926 140327930558272 train.py:363] Attempting to load ImageNet pretrained weights.
INFO:tensorflow:batch_all_reduce: 214 all-reduces with algorithm = nccl, num_packs = 1
I0722 08:22:10.150980 140327930558272 cross_device_ops.py:698] batch_all_reduce: 214 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 8 all-reduces with algorithm = nccl, num_packs = 1
I0722 08:22:20.667220 140327930558272 cross_device_ops.py:698] batch_all_reduce: 8 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0722 08:22:22.086211 140327930558272 cross_device_ops.py:440] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0722 08:22:22.087918 140327930558272 cross_device_ops.py:440] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 214 all-reduces with algorithm = nccl, num_packs = 1
I0722 08:22:47.857083 140327930558272 cross_device_ops.py:698] batch_all_reduce: 214 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 8 all-reduces with algorithm = nccl, num_packs = 1
I0722 08:22:57.193145 140327930558272 cross_device_ops.py:698] batch_all_reduce: 8 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0722 08:22:58.614024 140327930558272 cross_device_ops.py:440] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0722 08:22:58.616227 140327930558272 cross_device_ops.py:440] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-07-22 08:23:27.528045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-22 08:23:27.791325: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Fatal Python error: Segmentation fault

Thread 0x00007fa0a473f740 (most recent call first):
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60 in quick_execute
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598 in call
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746 in _call_flat
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1665 in _filtered_call
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2420 in __call__
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 644 in _call
File "/home/xinkong/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580 in __call__
File "train.py", line 365 in main
File "/home/xinkong/.local/lib/python3.6/site-packages/absl/app.py", line 250 in _run_main
File "/home/xinkong/.local/lib/python3.6/site-packages/absl/app.py", line 299 in run
File "train.py", line 472 in <module>
Segmentation fault (core dumped)
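
The last log lines before the crash are the NCCL all-reduces, and the traceback ends inside the compiled training step called at train.py line 365. Whether the multi-GPU path is actually at fault is not established by this log. One way to check would be to rerun with a single visible GPU. The following is a minimal sketch under that assumption: it relies on the standard CUDA_VISIBLE_DEVICES environment variable, and the helper name `restrict_visible_gpus` is hypothetical, not part of the DELF code.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Limit which GPUs CUDA exposes to this process (hypothetical helper).

    Call this before TensorFlow initializes its GPU devices; doing it
    before `import tensorflow` is the safe choice.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Expose only the first GPU, then import and run the training code.
    restrict_visible_gpus([0])
    print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

If training runs on one GPU but crashes on eight, that would point toward the cross-device communication rather than the model code; if it still crashes, the multi-GPU path can probably be ruled out.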

6. System information

  • OS Platform and Distribution: gcc version 4.8.2 20140120 (Red Hat 4.8.2-16)
  • TensorFlow installed from (source or binary): pip install tensorflow-gpu
  • TensorFlow version (use command below): the latest version
  • Python version: 3.6
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CUDA 10.1 / cuDNN 7 (per the libcudart.so.10.1 and libcudnn.so.7 lines in the log above)
  • GPU model and memory: P40, 24G, 8 in total
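
Some of the fields above are imprecise (the OS entry shows a gcc version, and the TensorFlow version is only given as "the latest"). They could be filled in exactly with a short script. This is a minimal sketch, assuming only the Python standard library plus an optional TensorFlow install; the function name `collect_system_info` is hypothetical.

```python
import platform


def collect_system_info():
    """Gather a few fields requested by the issue template (hypothetical helper)."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        # Imported lazily so the script still runs where TensorFlow is absent.
        import tensorflow as tf
        info["tensorflow"] = tf.__version__
    except ImportError:
        info["tensorflow"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_system_info().items():
        print(f"{key}: {value}")
```

The printed lines can be pasted directly into the list above.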

Could anyone help me? Thanks!

Labels

  • models:research (models that come under research directory)
  • type:bug (Bug in the code)
