Description
There seem to be two bugs in `Checkpoint` related to `global_step_transform` when the handler is attached to the valid engine rather than the train engine.
First, `global_step_transform` does a lookup based on the event fired. This causes issues when the handler is not attached to an `{EPOCH/ITERATION}_COMPLETED` event, e.g. when it's attached to `COMPLETED` on the valid engine, as the docs suggest.
Second, `global_step_transform` is intended to give the "true" count (iteration, epoch, whatever it may be). As such, it should be used not only in the filename but also as the `priority`. Right now, the priority is the iteration count of the engine the handler is attached to, which again does not work for the valid engine.
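The second point can be sketched with a small toy function. `pick_priority` is a hypothetical helper, not part of ignite, and the numbers are made up for illustration. It contrasts a priority taken from the attached engine's iteration with one taken from the global step:

```python
def pick_priority(engine_iteration, global_step=None):
    """Return the global step as priority when available,
    otherwise fall back to the attached engine's iteration."""
    return engine_iteration if global_step is None else global_step


# Assumption: the valid engine always finishes a run at the same iteration,
# while the train engine advances through epochs 1, 2, 3.
valid_iteration_at_completed = 50
train_epochs = (1, 2, 3)

# Priority from the valid engine's own iteration: identical on every run.
current = [pick_priority(valid_iteration_at_completed) for _ in train_epochs]

# Priority from the global step: increases with training progress.
proposed = [
    pick_priority(valid_iteration_at_completed, global_step=epoch)
    for epoch in train_epochs
]
```

With identical priorities, successive checkpoints cannot be ranked against each other. Using the global step gives each one a distinct, increasing priority.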
A third point, which is less a bug than a usability issue: `Checkpoint` silently drops checkpoints if it has already checkpointed the same filename. I think such occurrences are likely user error (or in my case, framework error, since the iteration count of my valid engine is always the same at `COMPLETED`). Perhaps a warning log is warranted. Alternatively, if the checkpoint is truly the same, writing it again is idempotent, so perhaps this check should be removed entirely.
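As a rough sketch of the warning suggested above: `ToyCheckpoint` is a hypothetical stand-in that only tracks filenames, and using the `warnings` module rather than a logger is an assumption made here for brevity.

```python
import warnings


class ToyCheckpoint:
    """Hypothetical handler that records filenames instead of writing files."""

    def __init__(self, prefix="model"):
        self.prefix = prefix
        self.saved = []

    def __call__(self, step):
        filename = f"{self.prefix}_{step}.pt"
        if filename in self.saved:
            # Warn instead of dropping the checkpoint silently.
            warnings.warn(
                f"Checkpoint '{filename}' was already saved; skipping. "
                "Check that the step or priority changes between calls."
            )
            return False
        self.saved.append(filename)
        return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ckpt = ToyCheckpoint()
    # The valid engine reports the same step (50) on both calls.
    results = [ckpt(50), ckpt(50)]
```

The second call is still skipped, but the skip is now visible to the user, which would have surfaced the repeated step count described above.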