
Getting iterations in Checkpoint is wrong for global_step_transform #1148

@amatsukawa

Description


There seem to be two bugs in Checkpoint related to global_step_transform when it's attached to the validation engine rather than the training engine.

First, global_step_transform does a lookup based on the event that fired. This causes issues when the handler is not attached to an {EPOCH/ITERATION}_COMPLETED event, e.g. when it's attached to COMPLETED on the validation engine, as the docs suggest.
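To illustrate the first point, here is a minimal toy model of an event-keyed lookup. It is not Ignite's actual code: `EVENT_TO_ATTR`, `ToyState`, `step_from_event` and `step_from_epoch` are hypothetical names, and the exact failure mode in Ignite may differ from the `KeyError` shown here.

```python
# Toy stand-in, not Ignite's real code: all names here are hypothetical.
EVENT_TO_ATTR = {
    "EPOCH_COMPLETED": "epoch",
    "ITERATION_COMPLETED": "iteration",
}


class ToyState:
    def __init__(self, epoch, iteration):
        self.epoch = epoch
        self.iteration = iteration


def step_from_event(state, event_name):
    """Pick the counter that matches the fired event (the problematic pattern)."""
    attr = EVENT_TO_ATTR[event_name]  # fails for events with no matching counter
    return getattr(state, attr)


def step_from_epoch(state, event_name):
    """Always report the epoch, regardless of which event fired."""
    return state.epoch


# State of the training engine, read while the validation engine fires COMPLETED.
train_state = ToyState(epoch=3, iteration=1200)
```

In this sketch, `step_from_event(train_state, "COMPLETED")` fails because COMPLETED has no associated counter, while `step_from_epoch` works for any event. A transform that does not depend on the fired event would sidestep the problem.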

Second, global_step_transform is intended to give the "true" count (iteration, epoch, whatever it may be). As such, it should be used not only in the filename but also as the priority. Right now, the priority is the iteration count of the engine the handler is attached to, which again does not work for the validation engine.
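The effect on priority can be sketched with a toy retention loop. This is an assumption about the general shape of "keep the n highest-priority checkpoints", not Ignite's Checkpoint implementation; `keep_latest` and the `saves` data are made up for illustration.

```python
# Toy model with hypothetical names; not Ignite's Checkpoint implementation.
def keep_latest(saves, n_saved, use_global_step):
    """Return the global steps of the checkpoints kept after all saves.

    Each save is a pair (attached_engine_iteration, global_step).
    """
    kept = []  # list of (priority, global_step)
    for attached_iteration, global_step in saves:
        priority = global_step if use_global_step else attached_iteration
        if len(kept) < n_saved:
            kept.append((priority, global_step))
        elif priority > min(kept)[0]:
            # Evict the lowest-priority checkpoint to make room.
            kept.remove(min(kept))
            kept.append((priority, global_step))
        # Otherwise the new checkpoint is dropped.
    return sorted(step for _, step in kept)


# The validation engine always finishes at iteration 50,
# while the trainer's epoch (the global step) grows from 1 to 4.
saves = [(50, 1), (50, 2), (50, 3), (50, 4)]
```

With `use_global_step=False` the priority never changes, so `keep_latest(saves, 2, False)` keeps the two oldest checkpoints (steps 1 and 2). With `use_global_step=True` it keeps the two newest (steps 3 and 4), which is what I would expect.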

A third point, which isn't really a bug but more of a usability concern: Checkpoint silently drops a checkpoint if it has already saved one under the same filename. I think such occurrences are likely user error (or in my case, framework error, since the iteration count of my validation engine is always the same at COMPLETED). Perhaps a warning log is warranted. Alternatively, if the checkpoint is truly the same, writing it again is idempotent, so perhaps this check should be removed entirely.
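A rough sketch of the warning option, using only the standard `logging` module. `save_once`, the logger name and the filename are hypothetical and only show the idea; how the check would fit into Checkpoint itself is left open.

```python
import logging

logger = logging.getLogger("toy_checkpoint")  # hypothetical logger name


def save_once(seen, filename):
    """Record filename; warn and return False if it was already written."""
    if filename in seen:
        logger.warning("Checkpoint %r was already saved; skipping it", filename)
        return False
    seen.add(filename)
    return True


# The validation engine produces the same filename on every COMPLETED.
seen = set()
results = [save_once(seen, "model_50.pt") for _ in range(3)]
```

Only the first call saves anything; the other two now log a warning instead of being dropped silently.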
