
TPU dies after 3hrs (e.g. with no 'health' state) #590

Closed
pwais opened this issue Nov 5, 2019 · 9 comments


pwais commented Nov 5, 2019

Not sure what happened and I can't see anything in Stackdriver, but it looks like the TPU RPC can return a malformed response:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/preempted_hook.py", line 89, in run
    self._cluster._tpu, response['state'], response['health'])  # pylint: disable=protected-access
KeyError: 'health'
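For reference, the crash comes from an unconditional dict lookup on the polling response. A minimal sketch of the defensive pattern, assuming the response is a plain dict parsed from the Cloud TPU node description (report_tpu_status is a hypothetical helper, not the actual upstream code):

# Hypothetical helper illustrating the defensive pattern: treat a missing
# 'health' field as UNKNOWN instead of letting a KeyError kill the thread.
def report_tpu_status(tpu_name, response):
    """response: dict parsed from the Cloud TPU node description."""
    state = response.get('state', 'UNKNOWN')
    health = response.get('health', 'UNKNOWN')  # field can be absent in some responses
    print('TPU %s is in state %s with health %s' % (tpu_name, state, health))

report_tpu_status('mytpu', {'state': 'READY'})  # prints health UNKNOWN, no crash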

pwais commented Nov 6, 2019

I'm seeing this error consistently: my TPU program runs for about 2 hrs and then this traceback appears. There are no errors in the Stackdriver logs; in fact, the logs suggest training is proceeding normally. It looks like something is not honoring the gRPC response contract.


pwais commented Nov 6, 2019

The TPU service / hardware seems buggy. I'm consistently seeing it crash TPU-side after about 3 hours of training. The same script ran for >6 hrs just fine a while ago.

[Screenshot attached: 2019-11-06 15:42]


pwais commented Nov 7, 2019

Upgraded to TF 1.15 and I'm still seeing the TPU die silently after a few hours of training. No errors in Stackdriver, no errors on the Python client's stdout; the client just keeps spinning.

I am NOT using a preemptible TPU, and yet that seems to be the service I'm getting.
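One way to double-check what the service thinks the node's state, health, and scheduling config are, independent of the TensorFlow client, is to query the Cloud TPU API directly. A sketch assuming google-api-python-client is installed and application-default credentials are configured; the project, zone, and TPU names below are placeholders:

# Sketch: read the node's state, health, and scheduling config straight from
# the Cloud TPU API (v1). Project/zone/TPU names below are placeholders.
from googleapiclient import discovery

project, zone, tpu_name = 'my-project', 'us-central1-a', 'mytpu'
tpu_api = discovery.build('tpu', 'v1')
node = tpu_api.projects().locations().nodes().get(
    name='projects/%s/locations/%s/nodes/%s' % (project, zone, tpu_name)).execute()

print('state:', node.get('state'))
print('health:', node.get('health'))  # may be missing, as in the KeyError above
print('schedulingConfig:', node.get('schedulingConfig', {}))  # 'preemptible' flag lives here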


pwais commented Nov 7, 2019

Again, like clockwork: 3 hrs of training, then I start seeing health UNKNOWN and the client spins indefinitely. Nothing significant in the Stackdriver logs.
[Screenshot attached: 2019-11-06 23:18]

pwais changed the title from "TPU dies with no 'health' state" to "TPU dies after 3hrs (e.g. with no 'health' state)" on Nov 7, 2019

pwais commented Nov 15, 2019

Just to bump this issue (maybe @allenwang28 can help triage?): in the past few days I've had to manually restart jobs (mostly RetinaNet) over 100 times. I've observed the same behavior across three different TPUv3s.

[Screenshot attached: 2019-11-13 01:09]

Here's a snapshot of what some training runs look like (wall-clock time):
[Screenshot attached: 2019-11-15 00:17]

The job trains for 1-3 hours, then dies, and the client spins indefinitely in this state:

WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health UNKNOWN.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health UNKNOWN.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.
WARNING:tensorflow:TPUPollingThread found TPU b'mytpu' in state READY, and health HEALTHY.

I'm reporting this here because there is clearly a software bug in either tensorflow/tpu or TensorFlow proper: the client should not spin indefinitely when the service is clearly dead. There is probably also a problem on the Google Cloud side, but my experience with their support is that it is consistently unhelpful and resolving problems like this can take months. I also don't see useful error logging in Stackdriver, though I'm no expert at searching it and the TPU Stackdriver logs are extremely noisy.
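In the meantime, the only mitigation I know of is client-side babysitting. A rough sketch of a watchdog that kills and restarts the training process when it stops producing output, assuming the job is launched as a subprocess; TRAIN_CMD and the timeout are placeholders to adapt to your own job:

# Crude client-side watchdog (a workaround, not a fix): restart the training
# process if it produces no stdout for too long. TRAIN_CMD and the timeout
# below are placeholders.
import subprocess
import threading
import time

TRAIN_CMD = ['python3', 'train_retinanet.py']  # hypothetical entry point
STALL_TIMEOUT_S = 30 * 60


def run_with_watchdog():
    while True:
        proc = subprocess.Popen(TRAIN_CMD, stdout=subprocess.PIPE,
                                universal_newlines=True)
        last_output = [time.time()]

        def pump():
            # Echo the job's output and record when we last saw any.
            for line in proc.stdout:
                last_output[0] = time.time()
                print(line, end='')

        threading.Thread(target=pump, daemon=True).start()
        while proc.poll() is None:
            if time.time() - last_output[0] > STALL_TIMEOUT_S:
                proc.kill()  # assume the silent TPU-side death described above
                proc.wait()
                break
            time.sleep(60)
        if proc.poll() == 0:
            return  # training finished normally; otherwise loop and restart


if __name__ == '__main__':
    run_with_watchdog()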

jysohn23 (Contributor) commented Nov 15, 2019

The KeyError and the hang in TPUPollingThread have already been fixed: tensorflow/tensorflow@36e672c and tensorflow/tensorflow@c1ef1c6. However, I believe those haven't made it into the final 1.15 RC.

Either use the nightly build, or you can simply patch those two commits locally into your client installation's .py files.
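If you go the local-patch route, this snippet (assuming the TF 1.15-era module layout shown in the traceback above) prints the installed file you would need to edit:

# Locate the installed module that raised the KeyError so it can be patched by
# hand. The module path assumes the TF 1.15-era layout from the traceback.
import tensorflow.python.tpu.preempted_hook as preempted_hook
print(preempted_hook.__file__)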

jysohn23 self-assigned this on Nov 15, 2019

pwais commented Nov 15, 2019

@jysohn23 Thanks for citing the KeyError fix, but woahhhh that definitely is not a fix for (1) the TPU terminating for no reason after 1-3 hours, nor (2) the client (including the TPUExecutor code in this repo) not being resilient to such failures if they're expected. Could you please reconsider your choice to close this issue? My entire experience with the TPU team has been that they don't care about showstoppers like these. I invite you to show otherwise.


ngoanpv commented Nov 20, 2019

It has not been fixed yet @pwais


pwais commented Nov 20, 2019

Moved to a new issue to remove any confusion with the cited KeyError: #609
