RuntimeError: There was no new checkpoint after the training. Eval status: missing checkpoint #8414

@yongzhe2160

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    CMLE

  • TensorFlow installed from (source or binary):
    CMLE

  • TensorFlow version (use command below):
    1.14

  • Python version:
    2.7

Please provide the entire URL of the model you are using?

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

Describe the current behavior
The training job fails after about 20 minutes. I retried a few times and got the same error each time.

Describe the expected behavior
Training completes successfully.

Code to reproduce the issue

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

Other info / logs

{
  insertId: "gng8v7fh35gc1"
  jsonPayload: {
    created: 1587446734.441076
    levelname: "ERROR"
    lineno: 328
    message: "RuntimeError: There was no new checkpoint after the training. Eval status: missing checkpoint"
    pathname: "/runcloudml.py"
  }
  labels: {
    compute.googleapis.com/resource_id: "3602514279639793991"
    compute.googleapis.com/resource_name: "gke-cml-0421-050704--n1-standard-8-30-48740b2f-bhw1"
    compute.googleapis.com/zone: "us-central1-c"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/trial_id: ""
  }
  logName: "projects/yongzhe-test/logs/master-replica-0"
  receiveTimestamp: "2020-04-21T05:25:37.672555504Z"
  resource: {
    labels: {
      job_id: "yongzhe_object_detection_pets_04_20_2020_22_07_01"
      project_id: "yongzhe-test"
      task_name: "master-replica-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2020-04-21T05:25:34.441076039Z"
}
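
Since the error says no new checkpoint was found, one thing that may help narrow this down is checking whether any checkpoint files were written at all. The snippet below is only a rough sketch, not part of the tutorial: it assumes the job's model directory has been copied to a local folder and that checkpoints use the usual TF 1.x `model.ckpt-<step>.index` naming. The helper name `list_checkpoint_steps` is hypothetical.

```python
import os
import re

# Assumption: TF 1.x-style checkpoint index files named model.ckpt-<global_step>.index
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def list_checkpoint_steps(model_dir):
    """Return the sorted global steps that have a checkpoint index file in model_dir."""
    steps = []
    for name in os.listdir(model_dir):
        match = _CKPT_INDEX.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

An empty result would suggest training never saved a checkpoint. A non-empty one would suggest the checkpoints exist but evaluation did not see a new one.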

Labels

models:research (models that come under research directory)
type:bug (Bug in the code)