[train/tune] Refactor trial metadata organization #38165
Signed-off-by: Kai Fricke <kai@anyscale.com>
# Conflicts: # python/ray/tune/experiment/trial.py
# Conflicts: # python/ray/tune/experiment/trial.py # python/ray/tune/result_grid.py
from ray.tune.utils.serialization import TuneFunctionEncoder, TuneFunctionDecoder


class _TrainingRunMetadata:
This is just a lump of some of the fields under `Trial`, right? And do we still think everything under `Trial` is the driver's view of things?
I feel like putting this under the `trainable` folder is kind of misleading. Or are we planning to have the trainable take charge of this eventually?
Talked offline: the idea is to eventually have trainables take care of managing this data (low priority for 2.7).
Stamp. Waiting for tests to pass.
# Conflicts: # python/ray/tune/analysis/experiment_analysis.py
The Trial object currently keeps properties with different scopes:
1. Static properties that are set **on init**
2. Static properties that are set on init but can be overwritten **on restore**
3. **Temporary** properties that are not saved, e.g. the trial location,
4. **run metadata** that is updated during training, such as the last result, the available checkpoints, etc.
This PR refactors the Trial class to explicitly capture 3) and 4) in sub classes. Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.
It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata (e.g. trial status) separately from the run metadata. This allows us to split these two sources of information in the future.
The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.
A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Why are these changes needed?
The Trial object currently keeps properties with different scopes:
1. Static properties that are set on init
2. Static properties that are set on init but can be overwritten on restore
3. Temporary properties that are not saved, e.g. the trial location
4. Run metadata that is updated during training, such as the last result and the available checkpoints
This PR refactors the Trial class to explicitly capture 3) and 4) in separate classes. Specifically, it introduces a `_TemporaryTrialState` class that contains temporary properties, and a `_TrainingRunMetadata` class that contains run metadata such as the last result, error files, and available checkpoints.
It also changes the way experiment checkpoints are saved. Specifically, we save the trial state (which contains the static properties as well as select runtime metadata, e.g. the trial status) separately from the run metadata. This allows us to split these two sources of information in the future.
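To make the split described above concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the class names `_TemporaryTrialState` and `_TrainingRunMetadata` and the `temporary_state.ray_actor` attribute come from the description, but `SketchTrial`, the `run_metadata` attribute name, all other field names, and the `save_state`/`restore` helpers are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class _TemporaryTrialState:
    """Toy stand-in: properties that are never written to disk."""
    location: Optional[str] = None  # hypothetical field
    ray_actor: Any = None           # replaces the old Trial.runner


@dataclass
class _TrainingRunMetadata:
    """Toy stand-in: metadata that changes while the trial trains."""
    last_result: Dict[str, Any] = field(default_factory=dict)
    error_filename: Optional[str] = None                  # hypothetical field
    checkpoints: List[str] = field(default_factory=list)  # hypothetical field


class SketchTrial:
    """Hypothetical trial that keeps the scopes apart."""

    def __init__(self, trial_id: str, config: Dict[str, Any]):
        # Static properties set on init (and possibly overwritten on restore).
        self.trial_id = trial_id
        self.config = config
        # Select runtime metadata that is saved with the trial state.
        self.status = "PENDING"
        # Scoped containers for temporary state and run metadata.
        self.temporary_state = _TemporaryTrialState()
        self.run_metadata = _TrainingRunMetadata()

    def save_state(self) -> Tuple[str, str]:
        """Serialize trial state and run metadata as two separate JSON strings."""
        state = {
            "trial_id": self.trial_id,
            "config": self.config,
            "status": self.status,
        }
        # The temporary state is deliberately left out of both payloads.
        return json.dumps(state), json.dumps(asdict(self.run_metadata))

    @classmethod
    def restore(cls, state_json: str, metadata_json: str) -> "SketchTrial":
        """Rebuild a trial from the two payloads; temporary state starts fresh."""
        state = json.loads(state_json)
        trial = cls(state["trial_id"], state["config"])
        trial.status = state["status"]
        trial.run_metadata = _TrainingRunMetadata(**json.loads(metadata_json))
        return trial
```

Because the two payloads are independent, a caller could in principle store them in different places or on different schedules, which is the kind of future split the description alludes to. Code that used to read a runner attribute directly would, in this sketch, go through `trial.temporary_state.ray_actor` instead.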
The changes in this PR mean that loading experiment checkpoints from runs before this change will not be possible. However, this is true for any changes to the `Trial` class. Support for backwards compatibility to resume experiments is only guaranteed on a patch level basis, so these changes should be fine.
A few other improvements have been made (`Trial.runner` is now `Trial.temporary_state.ray_actor`), and a lot of the changes in this PR are changes to tests.
Related issue number
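Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.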