TPU dies after 3hrs (e.g. with no 'health' state) #590
Seeing this error consistently. My TPU program runs for about 2 hours and then this traceback appears. No errors in the stackdriver logs; in fact, the logs suggest training is going along normally. Looks like somebody is not honoring the gRPC spec.
Upgraded to TF 1.15 and am still seeing the TPU die silently after a few hours of training. No errors in stackdriver. No errors on the Python client's stdout; it just keeps spinning. I am NOT using a preemptible TPU, and yet that seems to be the service I'm getting.
Just to bump this issue (maybe @allenwang28 can help triage?): in the past few days I've had to manually restart jobs (mostly Retinanet) over 100 times. I've observed this same behavior across three different TPUv3s. Here's a snapshot of what some training runs look like (wall-clock time): a job will train for 1-3 hours, then die, and the client spins indefinitely in this state:
I'm reporting this here because there's clearly a software bug in either tensorflow/tpu or tensorflow proper: the client should not spin indefinitely when the service is clearly dead. There's probably a problem in Google Cloud too, but my experience with them is that they're consistently unhelpful, and resolving problems like this can take months. I also don't see any useful error logging in stackdriver, but I'm no expert in searching it, and the TPU stackdriver logs are extremely noisy.
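For anyone hitting the same hang: a client-side stall watchdog is one possible workaround for the "client spins indefinitely" half of the problem. It does nothing about the TPU dying in the first place. Below is a minimal sketch using only the Python standard library. `StallWatchdog`, `beat`, and `on_stall` are hypothetical names for illustration and are not part of tensorflow/tpu. The assumption is that your training loop can call `beat()` whenever it makes progress (for example, after each step or checkpoint).

```python
import threading
import time


class StallWatchdog:
    """Hypothetical helper: calls on_stall() once if beat() is not
    called for more than timeout_s seconds."""

    def __init__(self, timeout_s, on_stall, poll_s=1.0):
        self.timeout_s = timeout_s
        self.on_stall = on_stall
        self.poll_s = poll_s
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        # Daemon thread so the watchdog never keeps the process alive.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._last_beat = time.monotonic()
        self._thread.start()
        return self

    def beat(self):
        # Record progress; call this from the training loop.
        self._last_beat = time.monotonic()

    def stop(self):
        # Cancel the watchdog, e.g. when training finishes normally.
        self._stop.set()

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.poll_s):
            if time.monotonic() - self._last_beat > self.timeout_s:
                self.on_stall()
                return
```

In practice `on_stall` could log loudly and then exit the process (e.g. via `os._exit(1)`) so that an outer supervisor script restarts the job from the last checkpoint, instead of someone restarting it by hand. The timeout has to be longer than the slowest legitimate gap between `beat()` calls, such as checkpointing or evaluation.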
The Either use nightly, or you can simply patch those 2 commits locally on your client's installation.
@jysohn23 Thanks for citing the KeyError fix, but woahhhh, that definitely is not a fix for (1) the TPU terminating for no reason after 1-3 hours, or (2) the client (including the TPUExecutor code in this repo) not being resilient to such failures if they're expected. Could you please reconsider your choice of closing this issue? My entire experience with the TPU team has been that they don't care about showstoppers like these. I invite you to show otherwise.
It has not been fixed yet, @pwais.
moved to a new issue to remove any confusion with the cited |
Not sure what happened, can't see anything in stackdriver, but looks like the TPU RPC can have a malformed response: