Description
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CMLE
- TensorFlow installed from (source or binary): CMLE
- TensorFlow version (use command below): 1.14
- Python version: 2.7
Please provide the entire URL of the model you are using?
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
Describe the current behavior
The training job failed after ~20 min. Retried a few times, same error each time.
Describe the expected behavior
Training succeeds.
Code to reproduce the issue
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
Other info / logs
{
  insertId: "gng8v7fh35gc1"
  jsonPayload: {
    created: 1587446734.441076
    levelname: "ERROR"
    lineno: 328
    message: "RuntimeError: There was no new checkpoint after the training. Eval status: missing checkpoint"
    pathname: "/runcloudml.py"
  }
  labels: {
    compute.googleapis.com/resource_id: "3602514279639793991"
    compute.googleapis.com/resource_name: "gke-cml-0421-050704--n1-standard-8-30-48740b2f-bhw1"
    compute.googleapis.com/zone: "us-central1-c"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/trial_id: ""
  }
  logName: "projects/yongzhe-test/logs/master-replica-0"
  receiveTimestamp: "2020-04-21T05:25:37.672555504Z"
  resource: {
    labels: {
      job_id: "yongzhe_object_detection_pets_04_20_2020_22_07_01"
      project_id: "yongzhe-test"
      task_name: "master-replica-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2020-04-21T05:25:34.441076039Z"
}
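
The error above says no new checkpoint was found after training, so a useful first check is whether the job wrote any checkpoint files at all. Below is a minimal sketch for that check, not part of the original job. It assumes the job's model directory has been copied to a local path, and that checkpoints use the usual `model.ckpt-<step>.index` file naming. `find_checkpoint_steps` is a hypothetical helper name, not a TensorFlow or Object Detection API function.

```python
import glob
import os
import re


def find_checkpoint_steps(model_dir):
    """Return the sorted global steps of checkpoints in a local model_dir.

    Assumption: checkpoints are named model.ckpt-<step>.index, and
    model_dir is a local copy of the training job's output directory.
    """
    steps = []
    pattern = os.path.join(model_dir, "model.ckpt-*.index")
    for path in glob.glob(pattern):
        # Keep only files whose suffix after "model.ckpt-" is a plain integer step.
        match = re.match(r"model\.ckpt-(\d+)\.index$", os.path.basename(path))
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

An empty list from `find_checkpoint_steps("/path/to/local/model_dir")` would suggest the training loop never saved a checkpoint. A non-empty list would point instead at the eval step looking in a different directory.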