[Tune] TensorflowCheckpoint temp folder is hard-coded to the system temp directory #40956

Closed
drukoz opened this issue Nov 5, 2023 · 5 comments · Fixed by #41033

Assignees: matthewdeng
Labels: enhancement (Request for new feature and/or capability) · P3 (Issue moderate in impact or severity) · train (Ray Train Related Issue) · tune (Tune-related issues)

Comments

drukoz commented Nov 5, 2023

Description

We need a way to configure where the temporary files used while trials are paused get saved. Our HPC has a small local drive and large network storage, so we had to change the temp folder through the Windows environment, and I am guessing Linux has the same issue. Ideally this could be configured by a Python environment variable or by passing the directory to the class.

[Screenshot: TensorflowCheckpoint code showing the hard-coded temporary directory]

Use case

Configure the directory where the temporary "model.keras" files, which can be fairly large, are saved when using the Ray Tune TensorflowCheckpoint or any derivative such as ReportCheckpointCallback for TensorFlow.
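As a stopgap while the directory is not configurable, Python's tempfile module (which these temporary files go through) can usually be redirected via standard environment variables before Ray starts. A minimal sketch, assuming the worker processes inherit the environment and using a placeholder path:

```python
import os
import tempfile

# Redirect Python's default temp directory to the large network storage.
# tempfile checks TMPDIR, then TEMP, then TMP; this must be set before the
# first temp file is created (and before ray.init() so workers inherit it).
os.environ["TMPDIR"] = "/mnt/network_storage/ray_tmp"  # placeholder path

print(tempfile.gettempdir())  # now resolves to the redirected location
```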

drukoz added the enhancement (Request for new feature and/or capability) and triage (Needs triage: priority, bug/not-bug, and owning component) labels Nov 5, 2023
anyscalesam added the tune (Tune-related issues) label Nov 6, 2023
matthewdeng (Contributor)

@drukoz are you using the ReportCheckpointCallback? If so, would it suffice if we deleted the tempdir after the checkpoint is reported?
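For reference, a minimal sketch of how ReportCheckpointCallback is typically wired into model.fit, assuming a recent Ray 2.x where the callback lives under ray.train.tensorflow.keras (the import path has moved between versions):

```python
import tensorflow as tf

from ray import tune
from ray.train.tensorflow.keras import ReportCheckpointCallback  # path may differ by Ray version


def train_fn(config):
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="adam", loss="mse")

    # At the end of each epoch the callback saves the model to a temporary
    # directory, wraps it in a checkpoint, and reports it to Ray Train/Tune.
    model.fit(
        tf.random.uniform((32, 8)),
        tf.random.uniform((32, 1)),
        epochs=2,
        verbose=0,
        callbacks=[ReportCheckpointCallback()],
    )


tune.Tuner(train_fn).fit()
```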

drukoz (Author) commented Nov 6, 2023

@drukoz are you using the ReportCheckpointCallback? If so, would it suffice if we deleted the tempdir after the checkpoint is reported?

Hello Matthew,
Yes, and maybe that would work. My only concern is that, depending on the search algorithm, those temp files might be required to restart paused trials, and depending on the number of samples and the concurrency/resources, they might fill up the temp folder regardless.
That is why making the folder configurable would really be the best solution, with it being an optional parameter.

matthewdeng (Contributor)

If you're using Ray Tune, the paused trials should be pointing to checkpoints that are persisted after calling train.report. These would be in a location defined in RunConfig.storage_path, rather than the temporary ones here.
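A minimal sketch of that flow, with a placeholder storage_path and experiment name, assuming the Ray 2.x train.report / Checkpoint API:

```python
import os
import tempfile

from ray import train, tune
from ray.train import Checkpoint, RunConfig


def train_fn(config):
    for step in range(3):
        # Write the checkpoint locally, then hand it to train.report(); Ray
        # persists it under RunConfig.storage_path, and paused trials are
        # restored from that persisted copy, not from the local temp files.
        with tempfile.TemporaryDirectory() as tmpdir:
            with open(os.path.join(tmpdir, "state.txt"), "w") as f:
                f.write(str(step))
            train.report(
                {"loss": 1.0 / (step + 1)},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


tuner = tune.Tuner(
    train_fn,
    run_config=RunConfig(
        storage_path="/mnt/network_storage/ray_results",  # placeholder path
        name="paused_trial_demo",                         # placeholder name
    ),
)
tuner.fit()
```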

drukoz (Author) commented Nov 6, 2023

Ah, ok, I was not aware of that! I had noticed that when I deleted temp files that did not appear to be in use mid-trial, the run failed, but maybe I did it wrong and deleted some that were actually in use.

matthewdeng added the P3 (Issue moderate in impact or severity) and train (Ray Train Related Issue) labels and removed the triage (Needs triage: priority, bug/not-bug, and owning component) label Nov 8, 2023
matthewdeng self-assigned this Nov 8, 2023
drukoz (Author) commented Nov 11, 2023

@matthewdeng do you know if this was included in the nightly build?
