
TPU dies after 3hrs (e.g. with no 'health' state) #590

Closed
pwais opened this issue Nov 5, 2019 · 9 comments


pwais commented Nov 5, 2019

Not sure what happened and I can't see anything in Stackdriver, but it looks like the TPU RPC can return a malformed response:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/preempted_hook.py", line 89, in run
    self._cluster._tpu, response['state'], response['health'])  # pylint: disable=protected-access
KeyError: 'health'
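For reference, the crash comes from an unconditional dict lookup on the polling response. A minimal sketch of the defensive pattern, assuming the response is a plain dict parsed from the Cloud TPU node description (report_tpu_status is a hypothetical helper, not the actual upstream code):

# Hypothetical helper illustrating the defensive pattern: treat a missing
# 'health' field as UNKNOWN instead of letting a KeyError kill the thread.
def report_tpu_status(tpu_name, response):
    """response: dict parsed from the Cloud TPU node description."""
    state = response.get('state', 'UNKNOWN')
    health = response.get('health', 'UNKNOWN')  # field can be absent in some responses
    print('TPU %s is in state %s with health %s' % (tpu_name, state, health))

report_tpu_status('mytpu', {'state': 'READY'})  # prints health UNKNOWN, no crash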

pwais commented Nov 6, 2019

I'm seeing this error consistently: my TPU program runs for about 2 hrs and then this traceback appears. There are no errors in the Stackdriver logs; in fact, the logs suggest training is proceeding normally. It looks like something is not honoring the gRPC response contract.


pwais commented Nov 6, 2019

The TPU service / hardware seems buggy. I'm consistently seeing it crash TPU-side after about 3 hours of training. The same script ran for >6 hrs just fine a while ago.

[Screenshot attached: 2019-11-06 15:42]


pwais commented Nov 7, 2019

Upgraded to TF 1.15 and I'm still seeing the TPU die silently after a few hours of training. No errors in Stackdriver, no errors on the Python client's stdout; the client just keeps spinning.

I am NOT using a preemptible TPU, and yet that seems to be the service I'm getting.
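One way to double-check what the service thinks the node's state, health, and scheduling config are, independent of the TensorFlow client, is to query the Cloud TPU API directly. A sketch assuming google-api-python-client is installed and application-default credentials are configured; the project, zone, and TPU names below are placeholders:

# Sketch: read the node's state, health, and scheduling config straight from
# the Cloud TPU API (v1). Project/zone/TPU names below are placeholders.
from googleapiclient import discovery

project, zone, tpu_name = 'my-project', 'us-central1-a', 'mytpu'
tpu_api = discovery.build('tpu', 'v1')
node = tpu_api.projects().locations().nodes().get(
    name='projects/%s/locations/%s/nodes/%s' % (project, zone, tpu_name)).execute()

print('state:', node.get('state'))
print('health:', node.get('health'))  # may be missing, as in the KeyError above
print('schedulingConfig:', node.get('schedulingConfig', {}))  # 'preemptible' flag lives here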


pwais commented Nov 7, 2019

Again, like clockwork: 3 hrs of training, then I start seeing health UNKNOWN and the client spins indefinitely. Nothing significant in the Stackdriver logs.
[Screenshot attached: 2019-11-06 23:18]

pwais changed the title from "TPU dies with no 'health' state" to "TPU dies after 3hrs (e.g. with no 'health' state)" on Nov 7, 2019

pwais commented Nov 15, 2019

Just to bump this issue (maybe @allenwang28 can help triage?): in the past few days I've had to manually restart jobs (mostly RetinaNet) over 100 times. I've observed the same behavior across three different TPUv3s.

[Screenshot attached: 2019-11-13 01:09]

Here's a snapshot of what some training runs look like (wall-clock time):
[Screenshot attached: 2019-11-15 00:17]

The job trains for 1-3 hours, then dies, and the client spins indefinitely in this state:

WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health UNKNOWN.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health UNKNOWN.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.

I'm reporting this here because there is clearly a software bug in either tensorflow/tpu or TensorFlow proper: the client should not spin indefinitely when the service is clearly dead. There is probably also a problem on the Google Cloud side, but my experience with their support is that it is consistently unhelpful and resolving problems like this can take months. I also don't see useful error logging in Stackdriver, though I'm no expert at searching it and the TPU Stackdriver logs are extremely noisy.
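In the meantime, the only mitigation I know of is client-side babysitting. A rough sketch of a watchdog that kills and restarts the training process when it stops producing output, assuming the job is launched as a subprocess; TRAIN_CMD and the timeout are placeholders to adapt to your own job:

# Crude client-side watchdog (a workaround, not a fix): restart the training
# process if it produces no stdout for too long. TRAIN_CMD and the timeout
# below are placeholders.
import subprocess
import threading
import time

TRAIN_CMD = ['python3', 'train_retinanet.py']  # hypothetical entry point
STALL_TIMEOUT_S = 30 * 60


def run_with_watchdog():
    while True:
        proc = subprocess.Popen(TRAIN_CMD, stdout=subprocess.PIPE,
                                universal_newlines=True)
        last_output = [time.time()]

        def pump():
            # Echo the job's output and record when we last saw any.
            for line in proc.stdout:
                last_output[0] = time.time()
                print(line, end='')

        threading.Thread(target=pump, daemon=True).start()
        while proc.poll() is None:
            if time.time() - last_output[0] > STALL_TIMEOUT_S:
                proc.kill()  # assume the silent TPU-side death described above
                proc.wait()
                break
            time.sleep(60)
        if proc.poll() == 0:
            return  # training finished normally; otherwise loop and restart


if __name__ == '__main__':
    run_with_watchdog()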

jysohn23 (Contributor) commented Nov 15, 2019

The KeyError and the hang in TPUPollingThread have already been fixed: tensorflow/tensorflow@36e672c and tensorflow/tensorflow@c1ef1c6. However, I believe those haven't made it into the final 1.15 RC.

Either use the nightly build, or you can simply patch those two commits locally into your client installation's .py files.
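If you go the local-patch route, this snippet (assuming the TF 1.15-era module layout shown in the traceback above) prints the installed file you would need to edit:

# Locate the installed module that raised the KeyError so it can be patched by
# hand. The module path assumes the TF 1.15-era layout from the traceback.
import tensorflow.python.tpu.preempted_hook as preempted_hook
print(preempted_hook.__file__)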

jysohn23 self-assigned this on Nov 15, 2019

pwais commented Nov 15, 2019

@jysohn23 Thanks for citing the KeyError fix, but woahhhh that definitely is not a fix for (1) the TPU terminating for no reason after 1-3 hours, nor (2) the client (including the TPUExecutor code in this repo) not being resilient to such failures if they're expected. Could you please reconsider your choice to close this issue? My entire experience with the TPU team has been that they don't care about showstoppers like these. I invite you to show otherwise.


ngoanpv commented Nov 20, 2019

It has not been fixed yet @pwais


pwais commented Nov 20, 2019

Moved to a new issue to remove any confusion with the cited KeyError: #609
