Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Fix durable(str) name for class trainables, preventing trial recovery #19223

Merged
merged 1 commit into from
Oct 8, 2021

Conversation

krfricke
Copy link
Contributor

@krfricke krfricke commented Oct 8, 2021

Why are these changes needed?

When using tune.durable(str), a durable trainable is created from a registered trainable. The registry is a cluster-global KV=store, thus other nodes have access to it, even if they currently don't execute any code.

This trainable inherits the name of the previous trainable. However, tune then overwrites the string trainable in the cluster-gobal KV-store (Experiment.register_if_needed()).

It seems that this change is not correctly propagated to all nodes.

In effect, consider the following setup:

  • A cluster with 4 nodes
  • A run with 2 parallel trials using tune.durable("APPO") or similar
  • One trial crashes after some time
  • Tune tries to automatically recover the trial from the checkpoint
  • Trial recovery happens on a different node than the one where the trial crashed.

For some reason, the new node creates a "APPO" trainable (a regular Trainable) and not the overwritten DurableTrainable. Thus, trial synchronization is not invoked and trial recovery failed because the checkpoint is not found.

Exactly why this happens is puzzling to me, as we don't schedule trainables by string reference, but by type reference in Ray Tune.

However, just changing the overwritten name to DurableAPPO fixes the issue reliably.

This is a follow-up to #19184, which introduced this change to function trainables, but not for class trainables.

cc @richardliaw @gjoliver

Edit: Maybe this has nothing to do with the KV-store. I'm not sure. I'll investigate further...

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Member

@gjoliver gjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks totally reasonable.
Maybe add a TODO here about the fact that we shouldn't need this override?

@krfricke krfricke merged commit f1606ac into ray-project:master Oct 8, 2021
@krfricke krfricke deleted the tune/sync-before-restore branch October 8, 2021 16:32
@krfricke
Copy link
Contributor Author

krfricke commented Oct 9, 2021

I added #19267 with a description of the source root of this problem to track and resolve it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants