[tune] Fix durable(str) name for class trainables, preventing trial recovery #19223
Why are these changes needed?
When using `tune.durable(str)`, a durable trainable is created from a registered trainable. The registry is a cluster-global KV store, so other nodes have access to it even if they currently don't execute any code. The durable trainable inherits the name of the original trainable. However, Tune then overwrites the string trainable in the cluster-global KV store (`Experiment.register_if_needed()`). It seems that this change is not correctly propagated to all nodes.
In effect, consider the following setup: `tune.durable("APPO")` or similar.

For some reason, the new node creates an "APPO" trainable (a regular `Trainable`) and not the overwritten `DurableTrainable`. Thus, trial synchronization is not invoked and trial recovery fails because the checkpoint is not found.

Exactly why this happens is puzzling to me, as in Ray Tune we don't schedule trainables by string reference, but by type reference.
However, just changing the overwritten name to `DurableAPPO` fixes the issue reliably.

This is a follow-up to #19184, which introduced this change for function trainables, but not for class trainables.
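To illustrate the naming collision described above, here is a minimal sketch. It assumes a plain dict standing in for the cluster-global KV store, and the classes and the `make_durable` helper are hypothetical stand-ins, not Ray Tune's actual implementation. It only shows why a durable class that inherits the original name replaces the registry entry, while a `Durable`-prefixed name leaves the original entry untouched.

```python
# Hypothetical stand-ins for illustration only; not Ray Tune's real classes.
class Trainable:
    """Stand-in for a regular trainable."""


class DurableTrainable(Trainable):
    """Stand-in for a trainable that syncs checkpoints to remote storage."""


class APPO(Trainable):
    """Stand-in for a registered class trainable."""


def make_durable(base_cls, name):
    """Build a durable subclass of base_cls with the given class name."""
    return type(name, (DurableTrainable, base_cls), {})


# A plain dict standing in for the cluster-global registry (an assumption).
registry = {"APPO": APPO}

# Inherited name: registering the durable class overwrites the "APPO" entry.
inherited = make_durable(APPO, APPO.__name__)
registry_inherited = dict(registry)
registry_inherited[inherited.__name__] = inherited

# Prefixed name: the durable class gets its own entry, "APPO" is untouched.
prefixed = make_durable(APPO, "Durable" + APPO.__name__)
registry_prefixed = dict(registry)
registry_prefixed[prefixed.__name__] = prefixed
```

With the inherited name, the same key refers to two different classes over time, so any lookup by name depends on which version a node has seen. With the prefixed name, each key maps to exactly one class.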
cc @richardliaw @gjoliver
Edit: Maybe this has nothing to do with the KV-store. I'm not sure. I'll investigate further...
Related issue number
Checks
`scripts/format.sh` to lint the changes in this PR.