[tune] Remove temporary checkpoint directories after restore #37173
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
In the running release test, it looks like the temporary checkpoints are removed correctly. However, the `num_to_keep` parameter doesn't seem to be respected:
This looks like a separate problem though. I'll address this in a follow-up.
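For context on the retention behavior `num_to_keep` is meant to enforce, here is a minimal standalone sketch in plain Python. `prune_checkpoints` is a hypothetical helper written for illustration, not Tune's actual implementation: it keeps only the N most recent checkpoint directories and deletes older ones.

```python
import os
import shutil
import tempfile


def prune_checkpoints(checkpoint_dirs, num_to_keep):
    """Delete all but the `num_to_keep` most recent checkpoint directories.

    `checkpoint_dirs` is assumed to be ordered oldest-first.
    Returns the directories that were kept.
    """
    if num_to_keep is None or num_to_keep <= 0:
        return list(checkpoint_dirs)  # keep everything
    for path in checkpoint_dirs[:-num_to_keep]:
        shutil.rmtree(path, ignore_errors=True)
    return checkpoint_dirs[-num_to_keep:]


# Example: create 5 fake checkpoint dirs, keep only the 2 newest.
root = tempfile.mkdtemp()
dirs = [os.path.join(root, f"checkpoint_{i:06d}") for i in range(5)]
for d in dirs:
    os.makedirs(d)

kept = prune_checkpoints(dirs, num_to_keep=2)
print([os.path.basename(d) for d in kept])  # ['checkpoint_000003', 'checkpoint_000004']
print(sorted(os.listdir(root)))             # only the two kept dirs remain on disk
```

With this behavior, disk usage stays bounded regardless of how many checkpoints a long-running trial produces; the symptom reported above is that the real parameter did not bound it.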
Thanks, I think this makes sense. I have one minor suggestion and some clarifications:
- Just to confirm: we don't ever track the tmp directory anywhere (e.g. in a checkpoint manager), so we don't run the risk of trying to restore from this temp checkpoint again after the initial `restore_from_object`. Otherwise, it may be deleted already. Answer: Yes.
- Possible to include some disk usage before/after plots in the PR description? Answer: Grafana is broken :(
Signed-off-by: Kai Fricke <kai@anyscale.com>
LGTM. Waiting for tests to pass.
…ject#37173) `FunctionTrainable.restore_from_object` creates a temporary checkpoint directory. This directory is kept around as we don't control how the user interacts with the checkpoint - they might load it several times, or not at all. Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary object around anymore. In this PR, we add functionality to remove these temporary directories. Additionally, we adjust the number of checkpoints to keep in `pytorch_pbt_failure` to 10 to reduce disk pressure in the release test. It looks like this disk pressure led to recent failures of the test. By reducing the total number of checkpoints kept and fixing the issue with temporary directories, we should see much lower disk usage. Signed-off-by: Kai Fricke <kai@anyscale.com>
#37173 changed a test in a previous iteration that is failing after additional changes. This PR reverts the changes to the test to fix broken master. Signed-off-by: Kai Fricke <kai@anyscale.com>
Why are these changes needed?
`FunctionTrainable.restore_from_object` creates a temporary checkpoint directory. This directory is kept around because we don't control how the user interacts with the checkpoint: they might load it several times, or not at all.
Once a new checkpoint is tracked in the status reporter, there is no need to keep the temporary directory around anymore.
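The lifecycle described above can be sketched in a standalone way. This is a simplified illustration, not Ray's actual classes; `TempCheckpointTracker` and its method names are invented here to show the pattern: materialize a restored checkpoint into a temp directory, then delete that directory as soon as the first new checkpoint is reported.

```python
import os
import shutil
import tempfile


class TempCheckpointTracker:
    """Illustrative sketch: remember the temp directory created on restore
    and delete it once the first new checkpoint is reported."""

    def __init__(self):
        self._temp_checkpoint_dir = None

    def restore_from_object(self, checkpoint_data: bytes) -> str:
        # Materialize the in-memory checkpoint into a temporary directory,
        # since we don't know how often the user will load it.
        self._temp_checkpoint_dir = tempfile.mkdtemp(prefix="checkpoint_tmp_")
        with open(os.path.join(self._temp_checkpoint_dir, "data.bin"), "wb") as f:
            f.write(checkpoint_data)
        return self._temp_checkpoint_dir

    def on_new_checkpoint(self, new_checkpoint_dir: str) -> None:
        # A fresh checkpoint is now tracked, so the temporary restore
        # directory is no longer needed and can be removed.
        if self._temp_checkpoint_dir and os.path.exists(self._temp_checkpoint_dir):
            shutil.rmtree(self._temp_checkpoint_dir, ignore_errors=True)
        self._temp_checkpoint_dir = None


tracker = TempCheckpointTracker()
tmp_dir = tracker.restore_from_object(b"fake-checkpoint-bytes")
print(os.path.exists(tmp_dir))                     # True: temp dir exists after restore
tracker.on_new_checkpoint("checkpoint_000001")     # path is illustrative only
print(os.path.exists(tmp_dir))                     # False: temp dir cleaned up
```

The key design point is that cleanup is deferred until a newer checkpoint exists, so the user can load the restored checkpoint any number of times before then without hitting a deleted directory.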
In this PR, we add functionality to remove these temporary directories. Additionally, we adjust the number of checkpoints to keep in `pytorch_pbt_failure` to 10 to reduce disk pressure in the release test. It looks like this disk pressure led to recent failures of the test. By reducing the total number of checkpoints kept and fixing the issue with temporary directories, we should see much lower disk usage.
Related issue number
(Hopefully) closes #36561
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.