Description
🐛 Bug
I am trying to train a model from this repository: https://github.com/ikergarcia1996/Self-Driving-Car-in-Video-Games on a TPU v3-8 VM.
Even when I train a tiny 9M parameter model with a small batch size, I get the following error:
2022-03-28 11:57:16.078552: E tensorflow/core/tpu/kernels/tpu_compilation_cache_external.cc:113] Computation requires more parameters (3311) than supported (limit 3305).
The error always seems to happen at step 9. The model works as expected when running on a GPU or CPU.
Other people have reported this error with large models (#1963), but in my case it happens with models of any size.
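For anyone comparing logs across runs, here is a minimal sketch that pulls the two numbers out of the error line so the overshoot is easy to see. The function name `parse_param_limit_error` is hypothetical and not part of torch-xla or TensorFlow; the regular expression simply assumes the message format shown in the log line above.

```python
import re
from typing import Optional, Tuple

# Matches the message exactly as it appears in the log line quoted above.
_PARAM_LIMIT_RE = re.compile(
    r"Computation requires more parameters \((\d+)\) than supported \(limit (\d+)\)"
)


def parse_param_limit_error(line: str) -> Optional[Tuple[int, int]]:
    """Return (required, limit) if the line contains the error, else None."""
    match = _PARAM_LIMIT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log_line = (
    "2022-03-28 11:57:16.078552: E tensorflow/core/tpu/kernels/"
    "tpu_compilation_cache_external.cc:113] Computation requires more "
    "parameters (3311) than supported (limit 3305)."
)

required, limit = parse_param_limit_error(log_line)
print(f"required={required} limit={limit} over_by={required - limit}")
# prints: required=3311 limit=3305 over_by=6
```

In this run the computation is only 6 parameters over the limit.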
To Reproduce
I am trying to train this model: https://github.com/ikergarcia1996/Self-Driving-Car-in-Video-Games/blob/master/model.py#L778 from this repository: https://github.com/ikergarcia1996/Self-Driving-Car-in-Video-Games
Here is a small Colab notebook to reproduce the issue: https://colab.research.google.com/drive/1nVbJooUMvMMc8V9F6ioqrYkhyuiveN1i?usp=sharing. No TPUs are available in Colab right now, so I could not check whether the same error occurs there; I am running on a TPU v3-8 VM.
Expected behavior
Training should run on the TPU as it does on a GPU/CPU, where the model works fine.
Environment
tensorflow 2.9.0 (tf-nightly; with the stable release I get a weird error: "DefaultDeviceShapeRepresentation not available in this library")
torch 1.11.0
torch-xla 1.11
pytorch-lightning==1.6.0rc1 (installed from source, wandb crashes with the stable release)
cloud-tpu-client==0.10
TPU
Full environment
absl-py==1.0.0
aiohttp==3.8.1
aiosignal==1.2.0
astunparse==1.6.3
async-timeout==4.0.2
attrs==19.3.0
Automat==0.8.0
blinker==1.4
cachetools==5.0.0
certifi==2021.10.8
chardet==3.0.4
charset-normalizer==2.0.12
Click==7.0
cloud-init==22.1
cloud-tpu-client==0.10
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==2.8
Cython==0.29.14
dbus-python==1.2.16
distlib==0.3.4
distro==1.4.0
distro-info===0.23ubuntu1
docker-pycreds==0.4.0
entrypoints==0.3
filelock==3.6.0
flatbuffers==1.12
frozenlist==1.3.0
fsspec==2022.2.0
future==0.18.2
gast==0.4.0
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.7.1
google-api-python-client==2.42.0
google-auth==2.6.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.55.0
grpcio==1.44.0
h5py==3.6.0
httplib2==0.20.4
hyperlink==19.0.0
idna==3.3
imageio==2.16.1
importlib-metadata==4.11.3
incremental==16.10.1
intel-openmp==2022.0.2
Jinja2==2.10.1
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
keras==2.8.0
Keras-Applications==1.0.8
keras-nightly==2.9.0.dev2022032707
Keras-Preprocessing==1.1.2
keyring==18.0.1
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
libclang==13.0.0
libtpu-nightly==0.1.dev20220303
Markdown==3.3.6
MarkupSafe==1.1.0
mkl==2022.0.2
mkl-include==2022.0.2
mock==4.0.3
more-itertools==4.2.0
multidict==6.0.2
netifaces==0.10.4
networkx==2.7.1
numpy==1.22.3
oauth2client==4.1.3
oauthlib==3.1.0
opencv-python==4.5.5.64
opt-einsum==3.3.0
packaging==21.3
pathtools==0.1.2
pbr==5.8.1
pexpect==4.6.0
Pillow==9.0.1
platformdirs==2.5.1
promise==2.3
protobuf==3.19.4
psutil==5.9.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.1
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==19.0.0
pyparsing==3.0.7
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.7
python-dateutil==2.8.2
python-debian===0.1.36ubuntu1
pytorch-lightning==1.6.0rc1
pytz==2021.3
PyWavelets==1.3.0
PyYAML==5.4.1
requests==2.27.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
rsa==4.8
scikit-image==0.19.2
scipy==1.8.0
SecretStorage==2.3.1
sentry-sdk==1.5.8
service-identity==18.1.0
setproctitle==1.2.2
shortuuid==1.0.8
simplejson==3.16.0
six==1.16.0
smmap==5.0.0
sos==4.3
ssh-import-id==5.10
systemd-python==234
tabulate==0.8.9
tb-nightly==2.9.0a20220326
tbb==2021.5.1
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.0
tensorflow-estimator==2.8.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
testresources==2.0.1
tf-estimator-nightly==2.9.0.dev2022032708
tf-nightly==2.9.0.dev20220327
tifffile==2022.3.25
torch==1.11.0
torch-xla==1.11
torchmetrics==0.7.3
torchvision==0.12.0
tqdm==4.63.1
Twisted==18.9.0
typing-extensions==4.1.1
ubuntu-advantage-tools==27.6
ufw==0.36
unattended-upgrades==0.1
uritemplate==3.0.1
urllib3==1.26.8
virtualenv==20.13.3
wadllib==1.3.3
wandb==0.12.11
Werkzeug==2.0.3
wrapt==1.14.0
yarl==1.7.2
yaspin==2.1.0
zipp==1.0.0
zope.interface==4.7.1
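To compare this environment against another machine without reading the whole list, a short sketch like the following can print only the packages that matter for this issue. It uses the standard-library `importlib.metadata` (Python 3.8+); the helper name `installed_versions` and the choice of packages are assumptions made for illustration.

```python
from importlib import metadata
from typing import Dict, Iterable


def installed_versions(packages: Iterable[str]) -> Dict[str, str]:
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Distribution names taken from the Environment section of this report.
key_packages = [
    "torch",
    "torch-xla",
    "pytorch-lightning",
    "tf-nightly",
    "libtpu-nightly",
    "cloud-tpu-client",
]

for name, version in installed_versions(key_packages).items():
    print(f"{name}=={version}")
```

Packages that are missing are reported as "not installed" instead of raising, so the same snippet can be run on a GPU/CPU machine for comparison.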
Additional context
Full Traceback
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2022-03-28 11:51:19.576529: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:19.576621: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:42.822620: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:42.822684: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:44.296726: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:44.296816: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:45.496880: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:45.496949: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:47.001208: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:47.001272: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:47.413516: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:47.413576: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:47.849663: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:47.849722: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
2022-03-28 11:51:49.076428: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:51:49.076495: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Total training samples: 1270669.
Total training samples: 1270669.
Total validation samples: 4038.
Total validation samples: 4038.
Total training samples: 1270669.
Total training samples: 1270669.
Total training samples: 1270669.
Total training samples: 1270669.
Total validation samples: 4038.
Total validation samples: 4038.
Total validation samples: 4038.
Total validation samples: 4038.
Total training samples: 1270669.
Total validation samples: 4038.
Total training samples: 1270669.
Total validation samples: 4038.
| Name | Type | Params
----------------------------------------------------------------------
0 | model | TEDD1104Transformer | 9.3 M
1 | train_accuracy | Accuracy | 0
2 | test_accuracy_k1_macro | Accuracy | 0
3 | test_accuracy_k3_micro | Accuracy | 0
4 | validation_accuracy_k1_micro | Accuracy | 0
5 | validation_accuracy_k3_micro | Accuracy | 0
6 | validation_accuracy_k1_macro | Accuracy | 0
7 | validation_accuracy_k3_macro | Accuracy | 0
8 | test_accuracy_k1_micro | Accuracy | 0
9 | test_accuracy_k3_macro | Accuracy | 0
10 | validation_distance | MeanSquaredError | 0
11 | criterion | WeightedMseLoss | 0
12 | Controller2Keyboard | Controller2Keyboard | 0
----------------------------------------------------------------------
9.3 M Trainable params
0 Non-trainable params
9.3 M Total params
37.374 Total estimated model params size (MB)
Epoch 0: 0%| | 8/39963 [04:34<380:50:43, 34.31s/it, loss=0.43, v_num=base]
2022-03-28 11:57:16.078552: E tensorflow/core/tpu/kernels/tpu_compilation_cache_external.cc:113] Computation requires more parameters (3311) than supported (limit 3305).
2022-03-28 11:57:16.078644: F tensorflow/core/tpu/kernels/tpu_program_group.cc:86] Check failed: xla_tpu_programs.size() > 0 (0 vs. 0)
https://symbolize.stripped_domain/r/?trace=7f19fc95103b,7f19fc9510bf,7f18fed31bcf,7f18f93a6922,7f18f9364ebd,7f18f93b4db0,7f18f93b48ae,7f18f5216ed3,7f18fa8581b8,7f18fe7e38a0,7f18fe7e5633,7f18fecfacb1,7f18fecfa4e0,7f18fece28cb,7f19fc8f3608&map=b5462df73b9bb298b2bca5d2f02176eed80a2e90:7f18f08d2000-7f1901bc7e30
*** SIGABRT received by PID 714187 (TID 714993) on cpu 24 from PID 714187; stack trace: ***
PC: @ 0x7f19fc95103b (unknown) raise
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 3968 (unknown)
@ 0x7f18fed31bd0 16 tensorflow::internal::LogMessageFatal::~LogMessageFatal()
@ 0x7f18f93a6923 592 tensorflow::tpu::TpuProgramGroup::Initialize()
@ 0x7f18f9364ebe 1488 tensorflow::tpu::TpuCompilationCacheExternal::InitializeEntry()
@ 0x7f18f93b4db1 800 tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsentHelper()
@ 0x7f18f93b48af 496 tensorflow::tpu::TpuCompilationCacheInterface::CompileIfKeyAbsent()
@ 0x7f18f5216ed4 912 tensorflow::XRTCompileOp::Compute()
@ 0x7f18fa8581b9 432 tensorflow::XlaDevice::Compute()
@ 0x7f18fe7e38a1 2128 tensorflow::(anonymous namespace)::ExecutorState<>::Process()
@ 0x7f18fe7e5634 48 std::_Function_handler<>::_M_invoke()
@ 0x7f18fecfacb2 128 Eigen::ThreadPoolTempl<>::WorkerLoop()
@ 0x7f18fecfa4e1 48 tensorflow::thread::EigenEnvironment::CreateThread()::{lambda()#1}::operator()()
@ 0x7f18fece28cc 80 tensorflow::(anonymous namespace)::PThread::ThreadFn()
@ 0x7f19fc8f3609 (unknown) start_thread
https://symbolize.stripped_domain/r/?trace=7f19fc95103b,7f18efd34cd9,7f19fc9510bf,7f18fed31bcf,7f18f93a6922,7f18f9364ebd,7f18f93b4db0,7f18f93b48ae,7f18f5216ed3,7f18fa8581b8,7f18fe7e38a0,7f18fe7e5633,7f18fecfacb1,7f18fecfa4e0,7f18fece28cb,7f19fc8f3608&map=b5462df73b9bb298b2bca5d2f02176eed80a2e90:7f18f08d2000-7f1901bc7e30,50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:16.307495 714993 coredump_hook.cc:365] RAW: Remote crash data gathering hook invoked.
E0328 11:57:16.307515 714993 coredump_hook.cc:411] RAW: Skipping coredump since rlimit was 0 at process start.
E0328 11:57:16.307529 714993 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0328 11:57:16.307539 714993 coredump_hook.cc:473] RAW: Sending fingerprint to remote end.
E0328 11:57:16.307550 714993 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0328 11:57:16.307559 714993 coredump_hook.cc:477] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0328 11:57:16.307565 714993 coredump_hook.cc:550] RAW: Discarding core.
E0328 11:57:16.732287 714993 process_state.cc:771] RAW: Raising signal 6 with default behavior
2022-03-28 11:57:17.327159: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468637.326935100","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.529565: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.529380655","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.529851: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.529622009","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.530091: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.529921516","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574131: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.573979797","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574730: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574629478","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574755: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574701121","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574797: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574746230","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574778: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.574625986","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574723: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1648468642.574472508","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2022-03-28 11:57:22.574892: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Connection reset by peer" and grpc_error_string = "{"created":"@1648468642.574855850","description":"Error received from peer ipv4:127.0.0.1:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Connection reset by peer","grpc_status":14}", maybe retrying the RPC
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714282 (TID 714282) on cpu 63 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.459454 714282 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.487866 714282 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714288 (TID 714288) on cpu 72 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.659237 714288 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.687005 714288 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714292 (TID 714292) on cpu 80 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.788845 714292 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.817159 714292 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714296 (TID 714296) on cpu 63 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:24.933581 714296 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:24.962191 714296 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714300 (TID 714300) on cpu 87 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:25.137587 714300 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:25.166103 714300 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714307 (TID 714307) on cpu 79 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:25.288173 714307 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:25.316177 714307 process_state.cc:771] RAW: Raising signal 15 with default behavior
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f19fc9510bf,0&map=
*** SIGTERM received by PID 714311 (TID 714311) on cpu 26 from PID 711980; stack trace: ***
PC: @ 0x7f19fc8fa376 (unknown) pthread_cond_wait@@GLIBC_2.3.2
@ 0x7f18efd34cda 992 (unknown)
@ 0x7f19fc9510c0 (unknown) (unknown)
@ 0x1 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f19fc8fa376,7f18efd34cd9,7f19fc9510bf,0&map=50c831e765011c7eb7163b7f3cb5c4b6:7f18e158a000-7f18f00a2f00
E0328 11:57:25.438707 714311 coredump_hook.cc:320] RAW: Remote crash gathering disabled for SIGTERM.
E0328 11:57:25.466978 714311 process_state.cc:771] RAW: Raising signal 15 with default behavior
2022-03-28 11:57:25.597429: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TPURoundRobin" device_type: "CPU"') for unknown op: TPURoundRobin
2022-03-28 11:57:25.597487: E tensorflow/core/framework/op_kernel.cc:1676] OpKernel ('op: "TpuHandleToProtoKey" device_type: "CPU"') for unknown op: TpuHandleToProtoKey
Traceback (most recent call last):
File "train.py", line 655, in <module>
train_new_model(
File "train.py", line 238, in train_new_model
train(
File "train.py", line 107, in train
trainer.fit(model, datamodule=data)
File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in fit
self._call_and_handle_interrupt(
File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 724, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ikergarcia/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/xla_spawn.py", line 76, in launch
xmp.spawn(
File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 389, in spawn
return torch.multiprocessing.start_processes(
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT