
python3 finetuning errors #30

Closed
xurongqiang opened this issue Sep 27, 2019 · 2 comments

@xurongqiang (Author)

When running the command:

python training.py --model_dir ../data_finetuning/seqlen256_v1.ckpt/ --iterations 250

the errors below occur. Could somebody help me?

2019-09-27 11:08:55.815810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 193 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:10.0, compute capability: 6.0)
2019-09-27 11:08:55.815889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.822316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 15190 MB memory) -> physical GPU (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:11.0, compute capability: 6.0)
2019-09-27 11:08:55.822395: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.829322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 15190 MB memory) -> physical GPU (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:12.0, compute capability: 6.0)
2019-09-27 11:08:55.829429: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.835786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 15190 MB memory) -> physical GPU (device: 4, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:13.0, compute capability: 6.0)
2019-09-27 11:08:55.835872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.842883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 15190 MB memory) -> physical GPU (device: 5, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:14.0, compute capability: 6.0)
2019-09-27 11:08:55.843003: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.850326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:6 with 15190 MB memory) -> physical GPU (device: 6, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:15.0, compute capability: 6.0)
2019-09-27 11:08:55.850434: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-27 11:08:55.857728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 15190 MB memory) -> physical GPU (device: 7, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:16.0, compute capability: 6.0)
E0927 11:08:55.868580 140547554760512 error_handling.py:70] Error recorded from training_loop: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 12519265597810562643), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 13700291500443683580), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:1, XLA_GPU, 17179869184, 86262967647931383), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:2, XLA_GPU, 17179869184, 3676913639991227464), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:3, XLA_GPU, 17179869184, 5354296951385035528), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:4, XLA_GPU, 17179869184, 12154468832020101184), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:5, XLA_GPU, 17179869184, 13118045380692252360), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:6, XLA_GPU, 17179869184, 9442972683431350141), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:7, XLA_GPU, 17179869184, 13012334678599159156), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 1063841961695883546), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 15928269210, 2610604702973413960), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:1, GPU, 203292672, 17931462477742070628), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:2, GPU, 15928269210, 5846002352678548358), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:3, GPU, 15928269210, 10456649650628517216), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:4, GPU, 15928269210, 17379282422107701438), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:5, GPU, 15928269210, 8202577610745802132), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:6, GPU, 15928269210, 14481908658310636262), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:7, GPU, 15928269210, 278208692209243281)]
I0927 11:08:55.868802 140547554760512 error_handling.py:96] training_loop marked as finished
W0927 11:08:55.868901 140547554760512 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
File "training.py", line 164, in
estimator_model.train(input_fn=input_fn, steps=args.iterations)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 364, in train
hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2746, in _convert_train_steps_to_hooks
if ctx.is_running_on_cpu():
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 442, in is_running_on_cpu
self._validate_tpu_configuration()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 604, in _validate_tpu_configuration
num_cores = self.num_cores
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 349, in num_cores
metadata = self._get_tpu_system_metadata()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_context.py", line 274, in _get_tpu_system_metadata
query_topology=self.model_parallelism_enabled))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_system_metadata.py", line 128, in _query_tpu_system_metadata
master_address, devices))
RuntimeError: Cannot find any TPU cores in the system (master address ). This usually means the master address is incorrect or the TPU worker has some problems. Available devices: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 12519265597810562643), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 13700291500443683580), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:1, XLA_GPU, 17179869184, 86262967647931383), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:2, XLA_GPU, 17179869184, 3676913639991227464), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:3, XLA_GPU, 17179869184, 5354296951385035528), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:4, XLA_GPU, 17179869184, 12154468832020101184), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:5, XLA_GPU, 17179869184, 13118045380692252360), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:6, XLA_GPU, 17179869184, 9442972683431350141), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:7, XLA_GPU, 17179869184, 13012334678599159156), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 1063841961695883546), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 15928269210, 2610604702973413960), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:1, GPU, 203292672, 17931462477742070628), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:2, GPU, 15928269210, 5846002352678548358), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:3, GPU, 15928269210, 10456649650628517216), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:4, GPU, 15928269210, 17379282422107701438), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:5, GPU, 15928269210, 8202577610745802132), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:6, GPU, 15928269210, 14481908658310636262), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:7, GPU, 15928269210, 278208692209243281)]
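
Aside (not from the original thread): the "Available devices" list in that error can be reproduced directly, which is a quick way to confirm that only CPU/GPU/XLA devices, and no TPU, are visible to TensorFlow. A minimal sketch using the standard TF 1.x device_lib helper:

```python
from tensorflow.python.client import device_lib

# Prints every device TensorFlow can see locally (CPU, GPU, XLA_*).
# On a GPU-only machine like the one in the log above, no TPU entry appears,
# which is exactly why TPUEstimator's metadata query fails.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type, device.memory_limit)
```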

@xurongqiang xurongqiang changed the title finetuning errors python3 finetuning errors Sep 27, 2019
@xurongqiang (Author)

Finetuning with python2 seems fine, but it then fails with Resource exhausted: OOM on the P100 (a possible mitigation is sketched after the log excerpt below):
2019-09-27 14:29:48.886518: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba12398600 next 2020 of size 5120
2019-09-27 14:29:48.886523: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba12399a00 next 2021 of size 5120
2019-09-27 14:29:48.886527: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1239ae00 next 2022 of size 6553600
2019-09-27 14:29:48.886538: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba129dae00 next 2023 of size 6553600
2019-09-27 14:29:48.886543: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1301ae00 next 2024 of size 5120
2019-09-27 14:29:48.886548: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1301c200 next 2025 of size 41943040
2019-09-27 14:29:48.886552: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1581c200 next 2026 of size 6553600
2019-09-27 14:29:48.886557: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba15e5c200 next 2027 of size 5120
2019-09-27 14:29:48.886561: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba15e5d600 next 2028 of size 6553600
2019-09-27 14:29:48.886566: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1649d600 next 2029 of size 5120
2019-09-27 14:29:48.886570: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1649ea00 next 2030 of size 5120
2019-09-27 14:29:48.886575: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1649fe00 next 2031 of size 5120
2019-09-27 14:29:48.886579: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba164a1200 next 2032 of size 5120
2019-09-27 14:29:48.886584: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba164a2600 next 2033 of size 5120
2019-09-27 14:29:48.886588: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba164a3a00 next 2034 of size 6553600
2019-09-27 14:29:48.886593: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba16ae3a00 next 2035 of size 6553600
2019-09-27 14:29:48.886597: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba17123a00 next 2036 of size 41943040
2019-09-27 14:29:48.886602: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba19923a00 next 2037 of size 5120
2019-09-27 14:29:48.886606: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba19924e00 next 2038 of size 6553600
2019-09-27 14:29:48.886611: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba19f64e00 next 2039 of size 41943040
2019-09-27 14:29:48.886615: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1c764e00 next 2040 of size 41943040
2019-09-27 14:29:48.886620: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1ef64e00 next 2041 of size 5120
2019-09-27 14:29:48.886624: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1ef66200 next 2042 of size 5120
2019-09-27 14:29:48.886629: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fba1ef67600 next 2043 of size 5120
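
Aside (not from the original thread): the device log earlier shows GPU:1 with only ~193 MB free, and the BFC allocator dump above is a genuine out-of-memory on a 16 GB P100. Below is a minimal, generic TF 1.x sketch of two common mitigations; the GPU indices and the plain-estimator RunConfig are illustrative assumptions, not the repository's actual training.py setup.

```python
import os

# Hypothetical example: hide GPUs that are already (mostly) occupied before
# TensorFlow initializes. GPU 1 is excluded here because the log above shows
# it has only ~193 MB free; adjust the indices to your machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3,4,5,6,7"

import tensorflow as tf

# Let the BFC allocator grow memory on demand instead of reserving it all up front.
# This does not create more memory: a genuine OOM still requires a smaller batch
# size or a shorter sequence length.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
run_config = tf.estimator.RunConfig(session_config=session_config)
```

For a model of this size, the reliable fix is usually a smaller per-GPU batch size or sequence length rather than allocator tweaks.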

@keskarnitish (Contributor)

It's a bit difficult to debug since I don't have the whole log file, but from what you've posted it seems like you didn't (re-?)patch keras.py to set use_tpu=False?
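
For reference (not part of the comment above): the "Cannot find any TPU cores" error is raised because TPUEstimator queries TPU system metadata whenever use_tpu=True. The sketch below shows the flag keskarnitish refers to, using the generic TF 1.x TPUEstimator API with a placeholder model_fn; it is an illustration, not the repository's actual patched keras.py or training.py.

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Placeholder graph so the example is self-contained; the real model lives
    # in the CTRL training code.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.square(w - 1.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

run_config = tf.estimator.tpu.RunConfig(model_dir="../data_finetuning/seqlen256_v1.ckpt/")

# With use_tpu=False, TPUEstimator falls back to CPU/GPU and never queries the
# TPU master, so the "Cannot find any TPU cores" RuntimeError is not raised.
estimator = tf.estimator.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=False,
    train_batch_size=4,  # example value
)
```

estimator.train(input_fn=..., steps=...) would then run on the local GPUs without contacting a TPU master.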
