
[tune] Remove temporary checkpoint directories after restore #37173

Merged: 4 commits merged into ray-project:master from tune/remove-tmp-checkpoint on Jul 7, 2023

Conversation

krfricke (Contributor) commented Jul 6, 2023

Why are these changes needed?

FunctionTrainable.restore_from_object creates a temporary checkpoint directory.

This directory is kept around because we don't control how the user interacts with the checkpoint - they might load it several times, or not at all.

Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary directory around anymore.

In this PR, we add functionality to remove these temporary directories. Additionally, we reduce the number of checkpoints to keep in pytorch_pbt_failure to 10 to lower disk pressure in the release test; that disk pressure looks like the cause of its recent failures. By capping the total number of checkpoints and fixing the issue with temporary directories, we should see much less disk usage.
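
For illustration, here is a minimal sketch of the cleanup idea, not the actual implementation; the attribute and method names (`_temp_checkpoint_dir`, `_maybe_cleanup_temp_checkpoint`) are hypothetical stand-ins for the logic in python/ray/tune/trainable/function_trainable.py:

```python
import shutil
import tempfile


class FunctionTrainableSketch:
    """Hypothetical sketch of removing the temporary restore directory."""

    def __init__(self):
        # Set when restore_from_object materializes a checkpoint on disk.
        self._temp_checkpoint_dir = None

    def restore_from_object(self, checkpoint_obj):
        # Unpack the in-memory checkpoint into a temporary directory so user
        # code can load from it like a regular directory-based checkpoint.
        self._temp_checkpoint_dir = tempfile.mkdtemp(prefix="tune_restore_")
        # ... write `checkpoint_obj` contents into self._temp_checkpoint_dir ...

    def _maybe_cleanup_temp_checkpoint(self):
        # Called once the status reporter tracks a new checkpoint: the
        # temporary restore directory is no longer needed, so delete it.
        if self._temp_checkpoint_dir:
            shutil.rmtree(self._temp_checkpoint_dir, ignore_errors=True)
            self._temp_checkpoint_dir = None
```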

Related issue number

(Hopefully) closes #36561

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 2 commits July 6, 2023 16:20
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke (Contributor, Author) commented Jul 7, 2023

In the running release test it looks like the temp checkpoints are removed correctly. However, the num_to_keep param doesn't seem to be working well:

~/ray_results/TorchTrainer_2023-07-06_17-33-05/TorchTrainer_d6a87_00000_0_lr=0.0010_2023-07-06_17-33-07$ ls -alp | grep checkpoint_0 | wc -l
206

This looks like a separate problem, though. I'll address it in a follow-up.
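
For context, num_to_keep is the checkpoint-retention setting on Ray's checkpoint configuration. A minimal usage sketch follows (Ray AIR-era API around the time of this PR; exact import paths vary across Ray versions, and train_fn is a placeholder trainable, not the release test's TorchTrainer):

```python
from ray import air, tune


def train_fn(config):
    # Placeholder trainable; the release test actually uses a TorchTrainer
    # under population based training (PBT).
    pass


tuner = tune.Tuner(
    train_fn,
    run_config=air.RunConfig(
        # Keep at most 10 checkpoints per trial on disk; older ones should be
        # deleted, which is what the comment above observes not happening.
        checkpoint_config=air.CheckpointConfig(num_to_keep=10),
    ),
)
results = tuner.fit()
```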

justinvyu (Contributor) left a comment

Thanks, I think this makes sense. I have one minor suggestion and some clarifications:

  1. Just to confirm: we don't ever track the tmp directory anywhere (e.g., in a checkpoint manager), so we don't run the risk of trying to restore from this temp checkpoint again after the initial restore_from_object - otherwise, it may already have been deleted. Answer: Yes.
  2. Possible to include some disk usage before/after plots in the PR description? Answer: Grafana is broken :(

python/ray/tune/trainable/function_trainable.py (review comment thread, now outdated and resolved)
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke requested a review from justinvyu July 7, 2023 17:47
justinvyu (Contributor) left a comment

LGTM. Waiting for tests to pass.

krfricke merged commit d7a3180 into ray-project:master Jul 7, 2023
61 of 63 checks passed
krfricke deleted the tune/remove-tmp-checkpoint branch July 7, 2023 20:27
krfricke added a commit to krfricke/ray that referenced this pull request Jul 8, 2023
krfricke added a commit that referenced this pull request Jul 9, 2023:
#37173 changed a test in a previous iteration that is failing after additional changes. This PR reverts the changes to the test to fix broken master.
can-anyscale pushed a commit that referenced this pull request Jul 9, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 11, 2023
krfricke added a commit to krfricke/ray that referenced this pull request Jul 11, 2023
bveeramani pushed a commit that referenced this pull request Jul 11, 2023
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 24, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Development

Successfully merging this pull request may close these issues.

Release test long_running_distributed_pytorch_pbt_failure.aws failed