[RayTrain] Checkpoint API to recover from checkpoint from previous runs #45516

Closed
sathyanarays opened this issue May 23, 2024 · 5 comments
Labels: enhancement (Request for new feature and/or capability); train (Ray Train Related Issue); triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

@sathyanarays

Description

Steps to reproduce

There are examples that illustrate checkpointing and recovering from checkpoints in the Ray training frameworks. One such example shows how to configure checkpointing for a PyTorch training job.

1. Trigger the training RayJob

 kubectl apply -f rayjob.yaml

2. Kill the head pod

Let the training job make a couple of checkpoints and then kill the head pod.

kubectl delete pods rayjob-sample-raycluster-lv85g-head-tbfwq

3. The new driver ignores the checkpoint

The current driver pod errors out and a new driver pod is created. The new driver pod reruns the training job from scratch, ignoring the checkpoints produced by the previous run.
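For comparison, the behavior one would expect after a driver restart can be sketched in plain Python. This is a toy illustration only, not Ray Train code: the `train` function, the `state.json` file name, and the `crash_after` parameter are all hypothetical, and the "checkpoint" is just the last completed epoch number.

```python
import json
import os
import tempfile


def train(run_dir, total_epochs, crash_after=None):
    """Toy training loop that resumes from run_dir/state.json if it exists."""
    state_path = os.path.join(run_dir, "state.json")

    # Resume after the last persisted epoch instead of starting from scratch.
    start_epoch = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start_epoch = json.load(f)["epoch"] + 1

    executed = []
    for epoch in range(start_epoch, total_epochs):
        executed.append(epoch)
        # Write to a temporary file and rename it, so that a crash
        # never leaves a partially written checkpoint behind.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"epoch": epoch}, f)
        os.replace(tmp_path, state_path)
        if crash_after is not None and epoch == crash_after:
            raise RuntimeError("simulated head pod failure")
    return executed


def demo():
    """Crash the first run after epoch 1, then rerun against the same directory."""
    run_dir = tempfile.mkdtemp()
    try:
        train(run_dir, total_epochs=5, crash_after=1)
    except RuntimeError:
        pass
    # The second run picks up after the last checkpointed epoch.
    return train(run_dir, total_epochs=5)
```

Here the second call to `train` skips the epochs already checkpointed by the first one, which is the behavior the new driver pod does not show in the scenario above.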

Hacky Fix

To work around this problem, we have to write a function with tightly coupled logic. For example, look at the function findLatestCheckpoint in this job definition.
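To give a sense of what such a workaround tends to look like, here is a minimal sketch. It is not the findLatestCheckpoint function from the linked job definition; the name `find_latest_checkpoint` and the assumption that checkpoints live in directories named `checkpoint_<number>` somewhere under the run directory are illustrative guesses about the storage layout, which is exactly the kind of coupling that makes this approach fragile.

```python
import os
import re
import tempfile

# Assumed naming convention for checkpoint directories (an assumption,
# not something this helper can verify against the training framework).
CHECKPOINT_DIR_PATTERN = re.compile(r"^checkpoint_(\d+)$")


def find_latest_checkpoint(run_dir):
    """Return the highest-numbered checkpoint_<N> directory under run_dir, or None."""
    best_index, best_path = -1, None
    for root, dirs, _files in os.walk(run_dir):
        for name in dirs:
            match = CHECKPOINT_DIR_PATTERN.match(name)
            if match and int(match.group(1)) > best_index:
                best_index = int(match.group(1))
                best_path = os.path.join(root, name)
    # Caveat: this compares indices across all subdirectories and does not
    # check that the checkpoint was written completely.
    return best_path


def demo_layout():
    """Create a throwaway run directory with three out-of-order checkpoints."""
    run_dir = tempfile.mkdtemp()
    for index in (0, 2, 1):
        os.makedirs(os.path.join(run_dir, "trial_a", f"checkpoint_{index:06d}"))
    return run_dir
```

A driver script could call `find_latest_checkpoint` at startup and resume from the returned path when it is not `None`. Any change to the directory naming or nesting would silently break the lookup, which is why an API owned by the framework is preferable.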

Use case

It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run.

@sathyanarays sathyanarays added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 23, 2024
@sathyanarays sathyanarays changed the title from "[RayTrain]" to "[RayTrain] Checkpoint API to recover from checkpoint from previous runs" May 23, 2024
@sathyanarays
Author

This relates to an issue in KubeRay: ray-project/kuberay#2155

cc: @kevin85421

@anyscalesam anyscalesam added the train Ray Train Related Issue label May 24, 2024
@kevin85421
Member

kevin85421 commented May 29, 2024

It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run.

Do you mean a Ray Train API? It makes sense to me. cc @matthewdeng

@sathyanarays
Author

It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run.

Do you mean a Ray Train API? It makes sense to me. cc @matthewdeng

@matthewdeng @kevin85421, yes: a Ray Train API that helps find the last successful checkpoint.

@matthewdeng
Contributor

@sathyanarays is this the API you'd be looking for?

@sathyanarays
Author

@matthewdeng, yes. This works. Thank you for the pointers.
