
python3 finetuning errors #30

Closed
xurongqiang opened this issue Sep 27, 2019 · 2 comments

@xurongqiang (Author)

When running the command:

python training.py --model_dir ../data_finetuning/seqlen256_v1.ckpt/ --iterations 250

the errors below occur. Could somebody help me?

2019-09-27 11:08:55.815810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 193 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:10.0, compute capability: 6.0)
2019-09-27 11:08:55.815889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.822316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 15190 MB memory) -> physical GPU (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:11.0, compute capability: 6.0)
2019-09-27 11:08:55.822395: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.829322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 15190 MB memory) -> physical GPU (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:12.0, compute capability: 6.0)
2019-09-27 11:08:55.829429: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.835786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 15190 MB memory) -> physical GPU (device: 4, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:13.0, compute capability: 6.0)
2019-09-27 11:08:55.835872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.842883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 15190 MB memory) -> physical GPU (device: 5, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:14.0, compute capability: 6.0)
2019-09-27 11:08:55.843003: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.850326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 15190 MB memory) -> physical GPU (device: 6, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:15.0, compute capability: 6.0)
2019-09-27 11:08:55.850434: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.857728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 15190 MB memory) -> physical GPU (device: 7, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:16.0, compute capability: 6.0)
E0927 11:08:55.868580 140547554760512 error_handling.py:70] Error recorded from training_loop: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 12519265597810562643), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 13700291500443683580), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:1, XLA_GPU, 17179869184, 86262967647931383), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:2, XLA_GPU, 17179869184, 3676913639991227464), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:3, XLA_GPU, 17179869184, 5354296951385035528), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:4, XLA_GPU, 17179869184, 12154468832020101184), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:5, XLA_GPU, 17179869184, 13118045380692252360), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:6, XLA_GPU, 17179869184, 9442972683431350141), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:7, XLA_GPU, 17179869184, 13012334678599159156), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 1063841961695883546), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 15928269210, 2610604702973413960), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:1, GPU, 203292672, 17931462477742070628), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:2, GPU, 15928269210, 5846002352678548358), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:3, GPU, 15928269210, 10456649650628517216), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:4, GPU, 15928269210, 17379282422107701438), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:5, GPU, 15928269210, 8202577610745802132), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:6, GPU, 15928269210, 14481908658310636262), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:7, GPU, 15928269210, 278208692209243281)]
I0927 11:08:55.868802 140547554760512 error_handling.py:96] training_loop marked as finished
W0927 11:08:55.868901 140547554760512 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
File "training.py", line 164, in
estimator_model.train(input_fn=input_fn, steps=args.iterations)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 364, in train
hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2746, in _convert_train_steps_to_hooks
if ctx.is_running_on_cpu():
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 442, in is_running_on_cpu
self._validate_tpu_configuration()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 604, in _validate_tpu_configuration
num_cores = self.num_cores
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 349, in num_cores
metadata = self._get_tpu_system_metadata()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 274, in _get_tpu_system_metadata
query_topology=self.model_parallelism_enabled))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_system_metadata.py", line 128, in _query_tpu_system_metadata
master_address, devices))
RuntimeError: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 12519265597810562643), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 13700291500443683580), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:1, XLA_GPU, 17179869184, 86262967647931383), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:2, XLA_GPU, 17179869184, 3676913639991227464), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:3, XLA_GPU, 17179869184, 5354296951385035528), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:4, XLA_GPU, 17179869184, 12154468832020101184), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:5, XLA_GPU, 17179869184, 13118045380692252360), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:6, XLA_GPU, 17179869184, 9442972683431350141), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:7, XLA_GPU, 17179869184, 13012334678599159156), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 1063841961695883546), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 15928269210, 2610604702973413960), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:1, GPU, 203292672, 17931462477742070628), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:2, GPU, 15928269210, 5846002352678548358), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:3, GPU, 15928269210, 10456649650628517216), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:4, GPU, 15928269210, 17379282422107701438), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:5, GPU, 15928269210, 8202577610745802132), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:6, GPU, 15928269210, 14481908658310636262), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:7, GPU, 15928269210, 278208692209243281)]
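
Aside (not from the original thread): the "Available devices" list in that error can be reproduced directly, which is a quick way to confirm that only CPU/GPU/XLA devices, and no TPU, are visible to TensorFlow. A minimal sketch using the standard TF 1.x device_lib helper:

```python
from tensorflow.python.client import device_lib

# Prints every device TensorFlow can see locally (CPU, GPU, XLA_*).
# On a GPU-only machine like the one in the log above, no TPU entry appears,
# which is exactly why TPUEstimator's metadata query fails.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type, device.memory_limit)
```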

@xurongqiang xurongqiang changed the title finetuning errors python3 finetuning errors Sep 27, 2019
@xurongqiang (Author)

Finetuning with python2 seems fine, but it then fails with Resource exhausted: OOM on the P100 (a possible mitigation is sketched after the log excerpt below):
2019-09-27 14:29:48.886518: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba12398600 next 2020 of size 5120
2019-09-27 14:29:48.886523: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba12399a00 next 2021 of size 5120
2019-09-27 14:29:48.886527: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1239ae00 next 2022 of size 6553600
2019-09-27 14:29:48.886538: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba129dae00 next 2023 of size 6553600
2019-09-27 14:29:48.886543: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1301ae00 next 2024 of size 5120
2019-09-27 14:29:48.886548: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1301c200 next 2025 of size 41943040
2019-09-27 14:29:48.886552: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1581c200 next 2026 of size 6553600
2019-09-27 14:29:48.886557: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba15e5c200 next 2027 of size 5120
2019-09-27 14:29:48.886561: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba15e5d600 next 2028 of size 6553600
2019-09-27 14:29:48.886566: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1649d600 next 2029 of size 5120
2019-09-27 14:29:48.886570: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1649ea00 next 2030 of size 5120
2019-09-27 14:29:48.886575: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1649fe00 next 2031 of size 5120
2019-09-27 14:29:48.886579: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba164a1200 next 2032 of size 5120
2019-09-27 14:29:48.886584: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba164a2600 next 2033 of size 5120
2019-09-27 14:29:48.886588: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba164a3a00 next 2034 of size 6553600
2019-09-27 14:29:48.886593: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba16ae3a00 next 2035 of size 6553600
2019-09-27 14:29:48.886597: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba17123a00 next 2036 of size 41943040
2019-09-27 14:29:48.886602: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba19923a00 next 2037 of size 5120
2019-09-27 14:29:48.886606: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba19924e00 next 2038 of size 6553600
2019-09-27 14:29:48.886611: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba19f64e00 next 2039 of size 41943040
2019-09-27 14:29:48.886615: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1c764e00 next 2040 of size 41943040
2019-09-27 14:29:48.886620: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1ef64e00 next 2041 of size 5120
2019-09-27 14:29:48.886624: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1ef66200 next 2042 of size 5120
2019-09-27 14:29:48.886629: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1ef67600 next 2043 of size 5120
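
Aside (not from the original thread): the device log earlier shows GPU:1 with only ~193 MB free, and the BFC allocator dump above is a genuine out-of-memory on a 16 GB P100. Below is a minimal, generic TF 1.x sketch of two common mitigations; the GPU indices and the plain-estimator RunConfig are illustrative assumptions, not the repository's actual training.py setup.

```python
import os

# Hypothetical example: hide GPUs that are already (mostly) occupied before
# TensorFlow initializes. GPU 1 is excluded here because the log above shows
# it has only ~193 MB free; adjust the indices to your machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3,4,5,6,7"

import tensorflow as tf

# Let the BFC allocator grow memory on demand instead of reserving it all up front.
# This does not create more memory: a genuine OOM still requires a smaller batch
# size or a shorter sequence length.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
run_config = tf.estimator.RunConfig(session_config=session_config)
```

For a model of this size, the reliable fix is usually a smaller per-GPU batch size or sequence length rather than allocator tweaks.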

@keskarnitish (Contributor)

It's a bit difficult to debug since I don't have the whole log file, but from what you've posted it seems like you didn't (re-?)patch keras.py to set use_tpu=False?
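
For reference (not part of the comment above): the "Cannot find any TPU cores" error is raised because TPUEstimator queries TPU system metadata whenever use_tpu=True. The sketch below shows the flag keskarnitish refers to, using the generic TF 1.x TPUEstimator API with a placeholder model_fn; it is an illustration, not the repository's actual patched keras.py or training.py.

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Placeholder graph so the example is self-contained; the real model lives
    # in the CTRL training code.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.square(w - 1.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

run_config = tf.estimator.tpu.RunConfig(model_dir="../data_finetuning/seqlen256_v1.ckpt/")

# With use_tpu=False, TPUEstimator falls back to CPU/GPU and never queries the
# TPU master, so the "Cannot find any TPU cores" RuntimeError is not raised.
estimator = tf.estimator.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=False,
    train_batch_size=4,  # example value
)
```

estimator.train(input_fn=..., steps=...) would then run on the local GPUs without contacting a TPU master.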
