[RayTrain] Checkpoint API to recover from checkpoint from previous runs #45516

sathyanarays · 2024-05-23T06:49:06Z

Description

Steps to reproduce

There are examples that illustrate checkpointing and recovering from checkpointing in the Ray training frameworks. One such example illustrates how to configure checkpointing to a pytorch training job.

1. Trigger the training RayJob

 kubectl apply rayjob.yaml

2. Kill the head pod

Let the training job make a couple of checkpoints and then kill the head pod.

kubectl delete pods rayjob-sample-raycluster-lv85g-head-tbfwq

3. The new driver ignores the checkpoint

The current driver pod errors out and a new driver pod gets created. The new driver pod runs the training job again from scratch ignoring the checkpoints produced in the last run.

Hacky Fix

To overcome this problem, we have to write a function with a tightly coupled logic. For example, look at the function findLatestCheckpoint in this job definition.

Use case

It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run.

The text was updated successfully, but these errors were encountered:

sathyanarays · 2024-05-23T06:50:08Z

This relates to an issue in Kuberay: ray-project/kuberay#2155

cc: @kevin85421

kevin85421 · 2024-05-29T18:10:57Z

It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run.

Do you mean a Ray Train API? It makes sense to me. cc @matthewdeng

sathyanarays · 2024-05-31T03:59:25Z

It would be great if we have an API that we can call and get the latest checkpoint location for the previous iteration of the given run.

Do you mean a Ray Train API? It makes sense to me. cc @matthewdeng

@matthewdeng @kevin85421 , yes. Ray Train API that helps to find the last successful checkpoint.

matthewdeng · 2024-05-31T17:06:14Z

@sathyanarays is this the API you'd be looking for?

sathyanarays · 2024-06-03T06:06:13Z

@matthewdeng , yes. This works. Thank you for the pointers.

sathyanarays added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 23, 2024

sathyanarays changed the title ~~[RayTrain]~~ [RayTrain] Checkpoint API to recover from checkpoint from previous runs May 23, 2024

anyscalesam added the train Ray Train Related Issue label May 24, 2024

sathyanarays closed this as completed Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RayTrain] Checkpoint API to recover from checkpoint from previous runs #45516

[RayTrain] Checkpoint API to recover from checkpoint from previous runs #45516

sathyanarays commented May 23, 2024

sathyanarays commented May 23, 2024

kevin85421 commented May 29, 2024 •

edited

Loading

sathyanarays commented May 31, 2024

matthewdeng commented May 31, 2024

sathyanarays commented Jun 3, 2024

[RayTrain] Checkpoint API to recover from checkpoint from previous runs #45516

[RayTrain] Checkpoint API to recover from checkpoint from previous runs #45516

Comments

sathyanarays commented May 23, 2024

Description

Steps to reproduce

1. Trigger the training RayJob

2. Kill the head pod

3. The new driver ignores the checkpoint

Hacky Fix

Use case

sathyanarays commented May 23, 2024

kevin85421 commented May 29, 2024 • edited Loading

sathyanarays commented May 31, 2024

matthewdeng commented May 31, 2024

sathyanarays commented Jun 3, 2024

kevin85421 commented May 29, 2024 •

edited

Loading