
[tune] Remove temporary checkpoint directories after restore #37173

Merged: 4 commits merged into ray-project:master from tune/remove-tmp-checkpoint on Jul 7, 2023

Conversation

krfricke (Contributor) commented Jul 6, 2023

Why are these changes needed?

FunctionTrainable.restore_from_object creates a temporary checkpoint directory.

This directory is kept around because we don't control how the user interacts with the checkpoint - they might load it several times, or not at all.

Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary directory around anymore.

In this PR, we add functionality to remove these temporary directories. Additionally, we reduce the number of checkpoints to keep in pytorch_pbt_failure to 10 to lower disk pressure in the release test; that disk pressure looks like the cause of its recent failures. By capping the total number of checkpoints and fixing the issue with temporary directories, we should see much less disk usage.
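
For illustration, here is a minimal sketch of the cleanup idea, not the actual implementation; the attribute and method names (`_temp_checkpoint_dir`, `_maybe_cleanup_temp_checkpoint`) are hypothetical stand-ins for the logic in python/ray/tune/trainable/function_trainable.py:

```python
import shutil
import tempfile


class FunctionTrainableSketch:
    """Hypothetical sketch of removing the temporary restore directory."""

    def __init__(self):
        # Set when restore_from_object materializes a checkpoint on disk.
        self._temp_checkpoint_dir = None

    def restore_from_object(self, checkpoint_obj):
        # Unpack the in-memory checkpoint into a temporary directory so user
        # code can load from it like a regular directory-based checkpoint.
        self._temp_checkpoint_dir = tempfile.mkdtemp(prefix="tune_restore_")
        # ... write `checkpoint_obj` contents into self._temp_checkpoint_dir ...

    def _maybe_cleanup_temp_checkpoint(self):
        # Called once the status reporter tracks a new checkpoint: the
        # temporary restore directory is no longer needed, so delete it.
        if self._temp_checkpoint_dir:
            shutil.rmtree(self._temp_checkpoint_dir, ignore_errors=True)
            self._temp_checkpoint_dir = None
```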

Related issue number

(Hopefully) closes #36561

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 2 commits July 6, 2023 16:20
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke (Contributor, Author) commented Jul 7, 2023

In the running release test it looks like the temp checkpoints are removed correctly. However, the num_to_keep param doesn't seem to be working well:

~/ray_results/TorchTrainer_2023-07-06_17-33-05/TorchTrainer_d6a87_00000_0_lr=0.0010_2023-07-06_17-33-07$ ls -alp | grep checkpoint_0 | wc -l
206

This looks like a separate problem, though. I'll address it in a follow-up.
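
For context, num_to_keep is the checkpoint-retention setting on Ray's checkpoint configuration. A minimal usage sketch follows (Ray AIR-era API around the time of this PR; exact import paths vary across Ray versions, and train_fn is a placeholder trainable, not the release test's TorchTrainer):

```python
from ray import air, tune


def train_fn(config):
    # Placeholder trainable; the release test actually uses a TorchTrainer
    # under population based training (PBT).
    pass


tuner = tune.Tuner(
    train_fn,
    run_config=air.RunConfig(
        # Keep at most 10 checkpoints per trial on disk; older ones should be
        # deleted, which is what the comment above observes not happening.
        checkpoint_config=air.CheckpointConfig(num_to_keep=10),
    ),
)
results = tuner.fit()
```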

justinvyu (Contributor) left a comment

Thanks, I think this makes sense. I have one minor suggestion and some clarifications:

  1. Just to confirm: we don't ever track the tmp directory anywhere (e.g., in a checkpoint manager), so we don't run the risk of trying to restore from this temp checkpoint again after the initial restore_from_object - otherwise, it may already have been deleted. Answer: Yes.
  2. Possible to include some disk usage before/after plots in the PR description? Answer: Grafana is broken :(

python/ray/tune/trainable/function_trainable.py (review comment thread, now outdated and resolved)
Signed-off-by: Kai Fricke <kai@anyscale.com>
krfricke requested a review from justinvyu July 7, 2023 17:47
justinvyu (Contributor) left a comment

LGTM. Waiting for tests to pass.

krfricke merged commit d7a3180 into ray-project:master Jul 7, 2023
61 of 63 checks passed
krfricke deleted the tune/remove-tmp-checkpoint branch July 7, 2023 20:27
krfricke added a commit to krfricke/ray that referenced this pull request Jul 8, 2023
krfricke added a commit that referenced this pull request Jul 9, 2023:
#37173 changed a test in a previous iteration that is failing after additional changes. This PR reverts the changes to the test to fix broken master.
can-anyscale pushed a commit that referenced this pull request Jul 9, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 11, 2023
krfricke added a commit to krfricke/ray that referenced this pull request Jul 11, 2023
bveeramani pushed a commit that referenced this pull request Jul 11, 2023
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 24, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Development

Successfully merging this pull request may close these issues.

Release test long_running_distributed_pytorch_pbt_failure.aws failed